NLP1000_weka_clustering_jcreator_etc
This note has not been edited yet. Content may be subject to change.
20 min late but idt sir noticed


to download
- weka
- JCreator (or any IDE)
- Resources > Nutch
- Canvas: Languages - 5 xx.csv
- Worksheet 4 (Gdrive > Materyales)
- Scrapestorm
Dice coefficient
- a statistical measure used to evaluate the similarity between 2 sets of data
2|A ∩ B| / (|A| + |B|)
PUWEDE / PWEDE (example pair: compare their n-gram sets)
Try different values of n: 1-grams, 3-grams, etc.
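The Dice coefficient can be sketched in Java over character n-gram sets, using the PUWEDE/PWEDE pair above as the example (the class and method names here are my own):

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class Dice {
    // Character n-grams of a word, e.g. ngrams("PWEDE", 3) -> {PWE, WED, EDE}
    static Set<String> ngrams(String word, int n) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Dice coefficient: 2|A ∩ B| / (|A| + |B|)
    static double coefficient(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return 2.0 * intersection.size() / (a.size() + b.size());
    }

    public static void main(String[] args) {
        Set<String> a = ngrams("PUWEDE", 3); // PUW, UWE, WED, EDE
        Set<String> b = ngrams("PWEDE", 3);  // PWE, WED, EDE
        System.out.println(coefficient(a, b)); // shared grams: WED, EDE
    }
}
```

With trigrams, PUWEDE and PWEDE share WED and EDE, giving 2*2/(4+3) ≈ 0.571.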
Language Identification
$
\hat{L} = \underset{L \in \Gamma}{\arg\max} \, S(X, L)
$
- n-grams are used in language identification
- L̂ is the identified language
- X is the text input
- Γ is the set of target languages
- S(X, L) is the similarity score of X with language L
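The argmax above can be sketched in Java, with Dice similarity over trigram sets standing in for S(X, L) (the class names and the set-based scoring are my own simplification; Nutch and LanguageTool score against frequency-ranked profiles):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class LangId {
    // Character trigrams with `_` marking word boundaries
    static Set<String> trigrams(String text) {
        String padded = "_" + text.replace(' ', '_') + "_";
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= padded.length(); i++) grams.add(padded.substring(i, i + 3));
        return grams;
    }

    // Dice coefficient: 2|A ∩ B| / (|A| + |B|)
    static double dice(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return 2.0 * intersection.size() / (a.size() + b.size());
    }

    // L_hat = argmax over L in Gamma of S(X, L)
    static String identify(String x, Map<String, Set<String>> profiles) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Set<String>> entry : profiles.entrySet()) {
            double score = dice(trigrams(x), entry.getValue());
            if (score > bestScore) { bestScore = score; best = entry.getKey(); }
        }
        return best;
    }
}
```

Γ here is the key set of the `profiles` map; each value is that language's trigram set.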
We have tackled
- LanguageTool: LID Activity
- LT uses Apache Nutch to generate Trigrams
- there is also Apache Tika, which performs LID
Applications
- Corpus Building
- automatically crawl documents from the web
- automatically classify: Language
- a corpus: is a large, structured collection of texts (written, spoken, transcribed) used for linguistic research and NLP
- Corpus building: process of creating the datasets including cleaning and annotation (e.g., tagging parts of speech, labeling language).
- The goal is to create a resource that represents how a language is used in real contexts.
- Measuring Language Similarity
Instead of taking a text input and computing its similarity against each language, we can compare the model of one language against the models of the other languages.
- can be done using a similarity matrix

- the lowest match would be bikol/waray
- the highest match is cebuano/hiligaynon
- then it can be used to cluster different languages
Weka Hands-on Activity

- open Weka GUI Chooser
- we're also going to use JCreator; if you don't want to use it, refer to the source code so you can use a different IDE
- select explorer

- open Languages - 5 xx.csv; the values in this file are numerical
- if a cell says N/A, change it to 1
Weka is a data mining tool that can (1) Classify, (2) Cluster, (3) Associate (data association), (4) Select Attributes (an attribute ranker for feature engineering, to find out which attributes contribute to accuracy), and (5) Visualize
- on the top tab, click Cluster
- click Choose (choose your algorithm) > SimpleKMeans
- click on SimpleKMeans after choosing it to edit its setup
- set distanceFunction to EuclideanDistance
- set numClusters to 3 and select Ok
- click Ignore Attributes and select XX

=== Clustering model (full training set) ===
kMeans
======
Number of iterations: 2
Within cluster sum of squared errors: 1.3903798326399064
Initial starting points (random):
Cluster 0: 0.72,0.76,0.75,1,0.66
Cluster 1: 0.73,1,0.81,0.76,0.72
Cluster 2: 1,0.73,0.73,0.72,0.7
Missing values globally replaced with mean/mode
Final cluster centroids:
Cluster#
Attribute Full Data 0 1 2
(5.0) (1.0) (2.0) (2.0)
=======================================================
bik 0.776 0.72 0.73 0.85
ceb 0.804 0.76 0.905 0.725
hil 0.808 0.75 0.905 0.74
tgl 0.778 1 0.755 0.69
war 0.766 0.66 0.735 0.85
Time taken to build model (full training data) : 0 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 1 ( 20%)
1 2 ( 40%)
2 2 ( 40%)
- the higher the value, the closer that language is to the cluster's centroid
- kMeans clustering: plot 3 values, each one is a centroid
- plot a second value, determine if it's related to one of the existing centroids.
- if it's closer to centroid A than B, the new datapoint is related to that cluster with centroid A
- another iteration, recompute the centroid, add another data point, etc.
- if it converges (no change in membership), then it's complete
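The iteration described above can be sketched as a minimal Lloyd-style kMeans in plain Java (class and method names are mine, not Weka's; Euclidean distance matches the setup used in the activity):

```java
import java.util.Arrays;

public class KMeans {
    // Euclidean distance, as in the Weka SimpleKMeans setup
    static double dist(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    // Assign each point to its nearest centroid, recompute centroids as the
    // mean of their members, and repeat until no membership changes.
    static int[] cluster(double[][] points, double[][] centroids) {
        int k = centroids.length, dim = points[0].length;
        int[] assign = new int[points.length];
        Arrays.fill(assign, -1);
        boolean changed = true;
        while (changed) {
            changed = false;
            // assignment step
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist(points[p], centroids[c]) < dist(points[p], centroids[best]))
                        best = c;
                if (assign[p] != best) { assign[p] = best; changed = true; }
            }
            if (!changed) break; // converged: no change in membership
            // update step: each centroid becomes the mean of its members
            for (int c = 0; c < k; c++) {
                double[] sum = new double[dim];
                int count = 0;
                for (int p = 0; p < points.length; p++)
                    if (assign[p] == c) {
                        count++;
                        for (int d = 0; d < dim; d++) sum[d] += points[p][d];
                    }
                if (count > 0)
                    for (int d = 0; d < dim; d++) centroids[c][d] = sum[d] / count;
            }
        }
        return assign;
    }
}
```

In the Weka run above, the "initial starting points (random)" play the role of the starting `centroids`.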
Reading the results:
bikol
- 0.72 at cluster 0's centroid
- 0.73 at cluster 1's centroid
- 0.85 at cluster 2's centroid <-- highest, therefore bikol is part of cluster 2
cebuano is in cluster 1
hiligaynon is in cluster 1
tagalog is in cluster 0
waray is in cluster 2
Cluster 0 has 1 member, Cluster 1 has 2 members, and Cluster 2 has 2 members.
what we just did is use the Similarity Matrix to obtain the different Clusters.
- you can also use Python for clustering and similarity matrix
- we can use Weka to classify, cluster, and perform data association, etc.
- you can export the model and create a java application that does the things here
each time you perform an experiment, it shows up in the Result List on the lower left:

- you can right-click a result and select Visualize cluster assignments

Interpreting the results
we have three clusters: tagalog; cebuano & hiligaynon; waray & bikol

- based on the Ethnologue language family tree, tagalog is in one cluster.
- cebuano and hiligaynon are in the same cluster, but they are in different sub-family trees
- cebuano and hiligaynon share some orthographic characteristics
- you can also look at historical data - ex. a town used cebuano and hiligaynon
- we can check orthographic, societal, transportation context to interpret the clusters
we used a feature set (in this activity, a CSV file)
Creating the Feature Set
in order to create the CSV file...
- map/location data
- phonetic alphabet
- words
- other n-grams or tokenization
- transportation network
- another app: GabMap
how do you create the similarity matrix?
- done by determining the dice coefficient: 2|A ∩ B| / (|A| + |B|)
- you need Set A (e.g., Bikol) and Set B (e.g., Cebuano)
- need to do it with all language pairs
- can be done with vlookup
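Building the matrix programmatically can be sketched in Java, assuming each language is represented by a set of its top trigrams (class and method names are mine):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SimMatrix {
    // Dice coefficient: 2|A ∩ B| / (|A| + |B|)
    static double dice(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return 2.0 * intersection.size() / (a.size() + b.size());
    }

    // Pairwise Dice coefficients over every language pair; the diagonal
    // is always 1.0 (a language compared with itself).
    static double[][] build(List<Set<String>> profiles) {
        int n = profiles.size();
        double[][] matrix = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                matrix[i][j] = dice(profiles.get(i), profiles.get(j));
        return matrix;
    }
}
```

The resulting matrix is symmetric, which is what the worksheet's vlookup approach produces by hand.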
activity 5: create a feature set that you can use in Weka or other tools
- for similarity matrix, you can use the worksheet - Week 04 - Similarity matrix
- a mini project for Project 2
- covers only 5 languages
- get bonus points if you use a feature set that is composed of geographic or transportation networks (mentioned 43:00)

- contains bikol character trigrams
- the frequency was from the Cheat Sheet (LID exercise)
using vlookup
- provides you with a result given a particular primary key
- primary key: language itself
- if it's 1, _an is present in the top 20 trigrams of bikol, cebuano, hil, tgl, war
- if it's 0, you don't see _ni in the top 20 trigrams of hiligaynon
- based on orthographic data
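The vlookup logic (1 if a trigram appears in a language's top-20 list, else 0) can be sketched as (names are mine):

```java
import java.util.List;
import java.util.Set;

public class FeatureRow {
    // One row of the feature set: 1 if the trigram is in that language's
    // top-20 trigram set, else 0 (one column per language).
    static int[] row(String trigram, List<Set<String>> top20PerLanguage) {
        int[] result = new int[top20PerLanguage.size()];
        for (int i = 0; i < result.length; i++)
            result[i] = top20PerLanguage.get(i).contains(trigram) ? 1 : 0;
        return result;
    }
}
```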
question: "non-breaking prefixes" (Mr., Mrs., etc.)
- there are some abbreviations of Mr. and Mrs. in the English bible
question: lexical normalization is not recommended, because the affixes themselves carry information (certain affixes are unique to certain languages), so removing them would affect the results
50:38

- gold standard: use Ethnologue for the actual tree and that's how you will evaluate it
- include the PDF file and source code
- related works are in the Gdrive > Resources > NLP folder
How to make n-grams?
- the character trigrams are _an, _ka, _ma, etc.
- watch the previous recording
Formal Definition: Given a word W = W_1...W_k, another sequence Z = Z_1...Z_n is an n-gram of W if there exists a strictly increasing sequence i_1 < ... < i_n of indices of W such that for all j = 1...n, we have W_{i_j} = Z_j.

- increasing the value of n captures longer and longer stretches of a word, but it won't be able to capture single-letter or two-letter words and so on
- n is the number of characters, so the word-boundary marker _ is also counted
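Boundary-padded character n-grams, which produce trigrams like _an (the _ marks a word boundary), can be sketched as (class and method names are mine):

```java
import java.util.ArrayList;
import java.util.List;

public class NGrams {
    // Character n-grams with `_` padding on both ends, so boundary grams
    // like "_an" (word-initial "an") and "ng_" (word-final "ng") appear.
    static List<String> extract(String word, int n) {
        String padded = "_" + word.toLowerCase() + "_";
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= padded.length(); i++)
            grams.add(padded.substring(i, i + n));
        return grams;
    }
}
```

For example, extract("ang", 3) yields _an, ang, ng_.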
Apache Nutch (& JCreator)
55:13
- in the Nutch folder: Nutch.jsd, .jcp, .jcu, .jcw are used for JCreator
- if you don't want to use it, use the src folder within the Nutch folder.
- run the JDK installer if you don't have the JDK yet
JCreator instructions
- open the workspaces
- configure > set to Java/JDK
- data are in Nutch > classes > Trigram (build and run the file)

// Nutch.java
import java.io.*;
import org.apache.nutch.analysis.lang.NGramProfile; // package path in older Nutch versions

InputStream input = new FileInputStream("Trigram\\Bikol.txt");
// Bikol.txt is the training data
// Apache Nutch is a trigram generator -> put the language files in Nutch and it will generate the ngp file
OutputStream output = new FileOutputStream("Trigram\\bik.ngp");
// the ngp file is the trigram profile
// Trigram\\ is the Trigram folder, which is inside classes
NGramProfile testing = new NGramProfile("tl", 1, 4);
// generate 1-gram to 4-gram
testing = testing.create("tl", input, "UTF-8");
testing.save(output); // save into the output
- inside classes > Trigram > there are different text files
- these are text collected from an online bible
sir stopped recording after giving us like 5 min