NLP1000_weka_clustering_jcreator_etc

RAW FILE

This note has not been edited yet. Content may be subject to change.

20 min late but I don't think sir noticed

_attachments/Pasted image 20250928154928.png
_attachments/Pasted image 20250928154943.png

to download

  • weka
  • JCreator (or any IDE)
  • Resources > Nutch
  • Canvas: Languages - 5 xx.csv
  • Worksheet 4 (Gdrive > Materyales)
  • Scrapestorm


Dice coefficient

  • a statistical measure used to evaluate the similarity between 2 sets of data

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)
_attachments/Pasted image 20250928155756.png

Example: PUWEDE / PWEDE. Try computing the coefficient with different values of n (1-grams, 3-grams, etc.), using the n-gram sets of the two spellings as A and B (see the sketch below).
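
A minimal sketch of this in Java (not from the class materials; the helper names and the lack of boundary padding are my own choices):

// DiceExample.java (illustrative sketch, not from the class materials)
import java.util.HashSet;
import java.util.Set;

public class DiceExample {

    // collect the character n-grams of a word into a set (no boundary padding here)
    static Set<String> ngrams(String word, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Dice(A, B) = 2|A ∩ B| / (|A| + |B|)
    static double dice(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return 2.0 * intersection.size() / (a.size() + b.size());
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 3}) {
            double score = dice(ngrams("PUWEDE", n), ngrams("PWEDE", n));
            System.out.println(n + "-grams: " + score);
        }
    }
}

With 1-grams the two spellings share almost every character, so the score is high; with 3-grams the overlap shrinks and the score drops.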

Language Identification

$$
\hat{L} = \underset{L \in \Gamma}{\arg\max}\; S(X, L)
$$

  • n-grams are used in language identification
  • $\hat{L}$ is the identified language
  • X is the text input
  • Γ is the set of target languages
  • S(X, L) is the similarity score of X with language L
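
Putting the pieces together, the argmax could look like this in Java (a sketch only; the profiles map and the choice of Dice as S(X, L) are placeholder assumptions, not the actual LanguageTool/Nutch code):

// LanguageId.java (illustrative sketch with placeholder profiles)
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class LanguageId {

    // L-hat = argmax over L in Γ of S(X, L)
    static String identify(Set<String> textNgrams, Map<String, Set<String>> profiles) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Set<String>> entry : profiles.entrySet()) {
            double score = similarity(textNgrams, entry.getValue()); // S(X, L)
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();   // keep the language with the highest score
            }
        }
        return best; // the identified language L-hat
    }

    // placeholder for S: here the Dice coefficient over n-gram sets
    static double similarity(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        return 2.0 * inter.size() / (a.size() + b.size());
    }

    public static void main(String[] args) {
        Map<String, Set<String>> profiles = Map.of(
                "tgl", Set.of("ang", "ng_", "_ma"),
                "ceb", Set.of("ang", "_sa", "_ka"));   // placeholder trigram profiles
        System.out.println(identify(Set.of("ang", "_ma", "_na"), profiles));
    }
}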

We have tackled

  • LanguageTool: LID Activity
  • LT uses Apache Nutch to generate Trigrams
  • there is also Apache Tika, which also performs LID

Applications

  • Corpus Building
    • automatically crawl documents from the web
    • automatically classify: Language
    • a corpus: is a large, structured collection of texts (written, spoken, transcribed) used for linguistic research and NLP
    • Corpus building: process of creating the datasets including cleaning and annotation (e.g., tagging parts of speech, labeling language).
    • The goal is to create a resource that represents how a language is used in real contexts.
  • Measuring Language Similarity

instead of taking a text input and computing its similarity against each language, we can compare the model (n-gram profile) of one language against the models of the other languages

  • can be done using a similarity matrix

_attachments/Pasted image 20250928160921.png

  • the lowest match would be bikol/waray
  • the highest match is cebuano/hiligaynon
  • then it can be used to cluster different languages

Weka Hands-on Activity

_attachments/Pasted image 20250928161156.png

  • open Weka GUI Chooser
  • we're also going to use JCreator; if you don't want to use it, refer to the source code so you can use a different IDE
  • select explorer

_attachments/Pasted image 20250928161407.png

  • open Languages - 5 xx.csv
    • the values in this file are numerical
    • if it says N/A, change it to 1

Weka is a data mining tool that can (1) Classify, (2) Cluster, (3) Associate (data association), (4) Select Attributes (an attribute ranker for feature engineering, to find out which attributes contribute to the accuracy), and (5) Visualize.

  • on the top tab, click Cluster
  • Choose (choose your algorithm) > SimpleKMeans
  • after choosing it, click on the SimpleKMeans text field next to Choose to edit its setup
  • set distanceFunction to EuclideanDistance
  • set numClusters to 3 and select Ok
  • click Ignore Attributes and select XX
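
The same steps can also be run through the Weka Java API. This is only a sketch; it assumes the language-name column (XX) is the first attribute in the CSV and that weka.jar is on the classpath:

// ClusterLanguages.java (sketch of the GUI steps via the Weka Java API)
import weka.clusterers.SimpleKMeans;
import weka.core.EuclideanDistance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ClusterLanguages {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("Languages - 5 xx.csv");

        // same effect as "Ignore attributes" in the GUI: drop the language-name column
        Remove remove = new Remove();
        remove.setAttributeIndices("1");          // assumption: first column holds the names
        remove.setInputFormat(data);
        Instances numeric = Filter.useFilter(data, remove);

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(3);
        kmeans.setDistanceFunction(new EuclideanDistance());
        kmeans.setPreserveInstancesOrder(true);   // needed so getAssignments() works
        kmeans.buildClusterer(numeric);

        System.out.println(kmeans);               // centroids, like the Explorer output
        int[] assignments = kmeans.getAssignments();
        for (int i = 0; i < assignments.length; i++) {
            System.out.println(data.instance(i) + " -> cluster " + assignments[i]);
        }
    }
}

Running it should print centroids and cluster assignments comparable to the Explorer output shown below.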

_attachments/Pasted image 20250928162344.png

=== Clustering model (full training set) ===

kMeans
======

Number of iterations: 2
Within cluster sum of squared errors: 1.3903798326399064

Initial starting points (random):

Cluster 0: 0.72,0.76,0.75,1,0.66
Cluster 1: 0.73,1,0.81,0.76,0.72
Cluster 2: 1,0.73,0.73,0.72,0.7

Missing values globally replaced with mean/mode

Final cluster centroids:
                         Cluster#
Attribute    Full Data          0          1          2
                 (5.0)      (1.0)      (2.0)      (2.0)
=======================================================
bik              0.776       0.72       0.73       0.85
ceb              0.804       0.76      0.905      0.725
hil              0.808       0.75      0.905       0.74
tgl              0.778          1      0.755       0.69
war              0.766       0.66      0.735       0.85

Time taken to build model (full training data) : 0 seconds

=== Model and evaluation on training set ===

Clustered Instances

0      1 ( 20%)
1      2 ( 40%)
2      2 ( 40%)

  • the higher the value, the closer that language is to the cluster's centroid
  • kMeans clustering: pick k starting points (here 3); each one becomes a centroid
  • take the next data point and determine which of the existing centroids it is closest to
  • if it is closer to centroid A than to B, that data point joins the cluster with centroid A
  • on each iteration, recompute the centroids, assign more data points, and so on
  • if it converges (no change in membership), clustering is complete (see the toy sketch below)
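
For intuition only, here is a toy from-scratch version of that loop (not Weka's implementation; the 2-D points are made-up numbers):

// KMeansSketch.java (toy illustration of the assign/recompute/converge loop)
import java.util.Arrays;

public class KMeansSketch {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[][] points = { {0.72, 0.76}, {0.73, 1.00}, {1.00, 0.73}, {0.85, 0.74}, {0.69, 0.85} };
        int k = 3;

        // starting centroids: here just the first k points (Weka picks them at random)
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = points[c].clone();

        int[] assignment = new int[points.length];
        boolean changed = true;
        while (changed) {                                  // stop when membership no longer changes
            changed = false;
            // step 1: assign every point to its nearest centroid
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist(points[i], centroids[c]) < dist(points[i], centroids[best])) best = c;
                if (best != assignment[i]) { assignment[i] = best; changed = true; }
            }
            // step 2: recompute each centroid as the mean of its members
            for (int c = 0; c < k; c++) {
                double[] sum = new double[points[0].length];
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] != c) continue;
                    for (int d = 0; d < sum.length; d++) sum[d] += points[i][d];
                    count++;
                }
                if (count > 0)
                    for (int d = 0; d < sum.length; d++) centroids[c][d] = sum[d] / count;
            }
        }
        System.out.println(Arrays.toString(assignment));   // cluster membership per point
    }
}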

Reading the results:
bikol

  • 0.72 with cluster 0's centroid
  • 0.73 with cluster 1's centroid
  • 0.85 with cluster 2's centroid <-- the highest, therefore bikol is part of cluster 2

cebuano is in cluster 1
hiligaynon is in cluster 1
tagalog is in cluster 0
waray is in cluster 2

Cluster 0 has 1 member, Cluster 1 has 2 members, and Cluster 2 has 2 members.

what we just did is use the Similarity Matrix to obtain the different Clusters.

  • you can also use Python for clustering and similarity matrix
  • we can use Weka to classify, cluster, and perform data association, etc.
  • you can export the model and create a Java application that does the same things (see the sketch below)
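
A short sketch of the export/reload step using Weka's SerializationHelper (the file name is made up, and in practice you would write out the clusterer you actually trained):

// ExportModel.java (sketch; "languages-kmeans.model" is a made-up file name)
import weka.clusterers.SimpleKMeans;
import weka.core.SerializationHelper;

public class ExportModel {
    public static void main(String[] args) throws Exception {
        SimpleKMeans kmeans = new SimpleKMeans();   // placeholder: build/train it first
        SerializationHelper.write("languages-kmeans.model", kmeans);

        // a separate Java application can later reload the model and reuse it
        SimpleKMeans loaded = (SimpleKMeans) SerializationHelper.read("languages-kmeans.model");
        System.out.println(loaded);
    }
}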

each time you perform an experiment, it shows up in the Result List on the lower left:
_attachments/Pasted image 20250928163933.png

  • you can right click and visualize the cluster assignments

_attachments/Pasted image 20250928164030.png

Interpreting the results

we have tagalog on its own, cebuano & hiligaynon together, and waray & bikol together

_attachments/Pasted image 20250928164130.png

  • based on the Ethnologue language family tree, tagalog is in a cluster of its own.
  • cebuano and hiligaynon are in the same cluster, but they belong to different sub-family trees
    • cebuano and hiligaynon share some orthographic characteristics
    • you can also look at historical data, e.g. a town where both cebuano and hiligaynon were used
    • we can check orthographic, societal, and transportation context to interpret the clusters

we used a feature set (in this activity we used a CSV file)

Creating the Feature Set

in order to create the CSV file...

  • map/location data
  • phonetic alphabet
  • words
  • other n-grams or tokenization
  • transportation network
  • another app: GabMap

how do you create the similarity matrix?

  • done by determining the Dice coefficient: 2|A ∩ B| / (|A| + |B|) (see the sketch after this list)
  • need Set A (Bikol) and Set B (Cebuano)
  • need to do it with all language pairs
  • can be done with vlookup
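
A rough sketch of assembling the matrix in code, assuming you already have each language's top-20 trigram set (the sets below are placeholders, not the real cheat-sheet values):

// SimilarityMatrix.java (illustrative sketch with placeholder trigram sets)
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class SimilarityMatrix {

    // Dice(A, B) = 2|A ∩ B| / (|A| + |B|)
    static double dice(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        return 2.0 * inter.size() / (a.size() + b.size());
    }

    public static void main(String[] args) {
        Map<String, Set<String>> top20 = new LinkedHashMap<>();   // one trigram set per language
        top20.put("bik", Set.of("_an", "_ka", "_ma", "ang"));      // placeholder values
        top20.put("ceb", Set.of("_an", "_ka", "_sa", "ang"));
        top20.put("hil", Set.of("_an", "_ka", "_ni", "ang"));
        top20.put("tgl", Set.of("_ng", "_ka", "_ma", "ang"));
        top20.put("war", Set.of("_an", "_ha", "_ma", "ang"));

        // one Dice score per language pair = one cell of the similarity matrix
        for (String a : top20.keySet()) {
            StringBuilder row = new StringBuilder(a);
            for (String b : top20.keySet())
                row.append(String.format("\t%.2f", dice(top20.get(a), top20.get(b))));
            System.out.println(row);
        }
    }
}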

activity 5: create a feature set that you can use in Weka or other tools

  • for similarity matrix, you can use the worksheet - Week 04 - Similarity matrix
  • mini project for Project 2
  • covers only 5 languages
  • get bonus points if you use a feature set that is composed of geographic or transportation networks (mentioned 43:00)

_attachments/Pasted image 20250928164925.png

  • contains bikol character trigrams
  • the frequency was from the Cheat Sheet (LID exercise)

using vlookup

  • provides you with a result given a particular primary key
  • primary key: the language itself
  • a 1 means the trigram is present in that language's top 20, e.g. _an appears in the top 20 trigrams of bikol, cebuano, hil, tgl, and war
  • a 0 means it is absent, e.g. in the top 20 trigrams of hiligaynon you don't see _ni, based on the orthographic data (see the sketch below)
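
The same lookup can also be expressed in code: a sketch that prints one row per candidate trigram with a 1/0 per language (the trigram lists here are placeholders, not the real top-20 lists):

// PresenceFeatures.java (sketch of what the VLOOKUP produces)
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PresenceFeatures {
    public static void main(String[] args) {
        List<String> trigrams = List.of("_an", "_ni", "_ka");   // the rows of the feature set
        List<String> languages = List.of("bik", "hil");
        Map<String, Set<String>> top20 = Map.of(
                "bik", Set.of("_an", "_ka"),
                "hil", Set.of("_an", "_ni"));                   // placeholder top-20 sets

        for (String trigram : trigrams) {
            StringBuilder line = new StringBuilder(trigram);
            for (String lang : languages) {
                // 1 if the trigram is in that language's top 20 trigrams, else 0
                line.append(",").append(top20.get(lang).contains(trigram) ? 1 : 0);
            }
            System.out.println(line);
        }
    }
}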

question: "non-breaking prefixes" - mr. mrs. etc

  • there are some abbreviations such as Mr. and Mrs. in the English bible

question: lexical normalization is not recommended because the affixes themselves carry information (certain affixes are unique to certain languages), so removing them would affect the results

50:38
_attachments/Pasted image 20250928165835.png

  • gold standard: use Ethnologue for the actual tree and that's how you will evaluate it
  • include the PDF file and source code
  • related works are in the Gdrive > Resources > NLP folder

How to make n-grams?

  • the character trigrams are the _an, _ka, _ma, etc.
  • watch previous recording

Formal definition: given a word $W = w_1 \dots w_k$, another sequence $Z = z_1 \dots z_n$ is an n-gram of $W$ if there exists a strictly increasing sequence of indices $i_1 < \dots < i_n$ of $W$ such that for all $j = 1 \dots n$ we have $w_{i_j} = z_j$.

_attachments/Pasted image 20250928170335.png

  • increasing the value of n captures more of each word, but larger n-grams can no longer capture single-letter or two-letter words and so on
  • n is the number of characters, and the word-boundary marker _ is also counted as a character (see the sketch below)
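
A small sketch of character-trigram extraction with word-boundary padding, assuming a single _ on each side of the word (the convention is inferred from the _an / _ka examples):

// CharTrigrams.java (sketch; the padding convention is an assumption)
import java.util.ArrayList;
import java.util.List;

public class CharTrigrams {

    // character n-grams over a word padded with "_" on both sides,
    // so boundary grams like "_an" and "an_" are captured too
    static List<String> ngrams(String word, int n) {
        String padded = "_" + word + "_";
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            grams.add(padded.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("ang", 3));   // prints [_an, ang, ng_]
    }
}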

Apache Nutch (& JCreator)

55:13

  • in the Nutch folder: Nutch.jsd, .jcp, .jcu, .jcw are used for JCreator
  • if you don't want to use it, use the src folder within the Nutch folder.
  • run the JDK installer file if you don't have the JDK yet

JCreator instructions

  • open the workspaces
  • configure > set to Java/JDK
  • data are in Nutch > classes > Trigram (build and run the file)

_attachments/Pasted image 20250928170742.png

// Nutch.java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.nutch.analysis.lang.NGramProfile; // from Nutch's language-identifier plugin

// Bikol.txt is the training data
// Apache Nutch is a trigram generator -> put the language files in Nutch and it will generate the ngp file
InputStream input = new FileInputStream("Trigram\\Bikol.txt");

// the ngp file is the trigram profile
// Trigram\\ is the Trigram folder, which is inside classes
OutputStream output = new FileOutputStream("Trigram\\bik.ngp");

// build a profile covering 1-grams up to 4-grams
NGramProfile testing = new NGramProfile("tl", 1, 4);
testing = testing.create("tl", input, "UTF-8"); // read the text and count the n-grams
testing.save(output); // save the profile into the output file
  • inside classes > Trigram > there are different text files
  • these are text collected from an online bible

sir stopped recording after giving us like 5 min