NLP1000_weka_clustering_jcreator_etc

RAW FILE

This note has not been edited yet. Content may be subject to change.

20 min late but I don't think sir noticed

_attachments/Pasted image 20250928154928.png
_attachments/Pasted image 20250928154943.png

to download

  • weka
  • JCreator (or any IDE)
  • Resources > Nutch
  • Canvas: Languages - 5 xx.csv
  • Worksheet 4 (Gdrive > Materyales)
  • Scrapestorm


Dice coefficient

  • a statistical measure used to evaluate the similarity between 2 sets of data

Dice(A, B) = 2|A ∩ B| / (|A| + |B|)
_attachments/Pasted image 20250928155756.png

Example: PUWEDE / PWEDE. Try computing the coefficient with different values of n (1-grams, 3-grams, etc.), using the n-gram sets of the two spellings as A and B (see the sketch below).
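
A minimal sketch of this in Java (not from the class materials; the helper names and the lack of boundary padding are my own choices):

// DiceExample.java (illustrative sketch, not from the class materials)
import java.util.HashSet;
import java.util.Set;

public class DiceExample {

    // collect the character n-grams of a word into a set (no boundary padding here)
    static Set<String> ngrams(String word, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    // Dice(A, B) = 2|A ∩ B| / (|A| + |B|)
    static double dice(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return 2.0 * intersection.size() / (a.size() + b.size());
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 3}) {
            double score = dice(ngrams("PUWEDE", n), ngrams("PWEDE", n));
            System.out.println(n + "-grams: " + score);
        }
    }
}

With 1-grams the two spellings share almost every character, so the score is high; with 3-grams the overlap shrinks and the score drops.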

Language Identification

$$
\hat{L} = \underset{L \in \Gamma}{\arg\max}\; S(X, L)
$$

  • n-grams are used in language identification
  • $\hat{L}$ is the identified language
  • X is the text input
  • Γ is the set of target languages
  • S(X, L) is the similarity score of X with language L
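
Putting the pieces together, the argmax could look like this in Java (a sketch only; the profiles map and the choice of Dice as S(X, L) are placeholder assumptions, not the actual LanguageTool/Nutch code):

// LanguageId.java (illustrative sketch with placeholder profiles)
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class LanguageId {

    // L-hat = argmax over L in Γ of S(X, L)
    static String identify(Set<String> textNgrams, Map<String, Set<String>> profiles) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Set<String>> entry : profiles.entrySet()) {
            double score = similarity(textNgrams, entry.getValue()); // S(X, L)
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();   // keep the language with the highest score
            }
        }
        return best; // the identified language L-hat
    }

    // placeholder for S: here the Dice coefficient over n-gram sets
    static double similarity(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        return 2.0 * inter.size() / (a.size() + b.size());
    }

    public static void main(String[] args) {
        Map<String, Set<String>> profiles = Map.of(
                "tgl", Set.of("ang", "ng_", "_ma"),
                "ceb", Set.of("ang", "_sa", "_ka"));   // placeholder trigram profiles
        System.out.println(identify(Set.of("ang", "_ma", "_na"), profiles));
    }
}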

We have tackled

  • LanguageTool: LID Activity
  • LT uses Apache Nutch to generate Trigrams
  • there is also Apache Tika, which also performs LID

Applications

  • Corpus Building
    • automatically crawl documents from the web
    • automatically classify: Language
    • a corpus: is a large, structured collection of texts (written, spoken, transcribed) used for linguistic research and NLP
    • Corpus building: process of creating the datasets including cleaning and annotation (e.g., tagging parts of speech, labeling language).
    • The goal is to create a resource that represents how a language is used in real contexts.
  • Measuring Language Similarity

instead of taking a text input and computing its similarity against each language, we can compare the model (n-gram profile) of one language against the models of the other languages

  • can be done using a similarity matrix

_attachments/Pasted image 20250928160921.png

  • the lowest match would be bikol/waray
  • the highest match is cebuano/hiligaynon
  • then it can be used to cluster different languages

Weka Hands-on Activity

_attachments/Pasted image 20250928161156.png

  • open Weka GUI Chooser
  • we're also going to use JCreator; if you don't want to use it, refer to the source code so you can use a different IDE
  • select explorer

_attachments/Pasted image 20250928161407.png

  • open Languages - 5 xx.csv
    • the values in this file are numerical
    • if it says N/A, change it to 1

Weka is a data mining tool that can (1) Classify, (2) Cluster, (3) Associate (data association), (4) Select Attributes (an attribute ranker for feature engineering, to find out which attributes contribute to the accuracy), and (5) Visualize.

  • on the top tab, click Cluster
  • Choose (choose your algorithm) > SimpleKMeans
  • after choosing it, click on the SimpleKMeans text field next to Choose to edit its setup
  • set distanceFunction to EuclideanDistance
  • set numClusters to 3 and select Ok
  • click Ignore Attributes and select XX
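
The same steps can also be run through the Weka Java API. This is only a sketch; it assumes the language-name column (XX) is the first attribute in the CSV and that weka.jar is on the classpath:

// ClusterLanguages.java (sketch of the GUI steps via the Weka Java API)
import weka.clusterers.SimpleKMeans;
import weka.core.EuclideanDistance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ClusterLanguages {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("Languages - 5 xx.csv");

        // same effect as "Ignore attributes" in the GUI: drop the language-name column
        Remove remove = new Remove();
        remove.setAttributeIndices("1");          // assumption: first column holds the names
        remove.setInputFormat(data);
        Instances numeric = Filter.useFilter(data, remove);

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(3);
        kmeans.setDistanceFunction(new EuclideanDistance());
        kmeans.setPreserveInstancesOrder(true);   // needed so getAssignments() works
        kmeans.buildClusterer(numeric);

        System.out.println(kmeans);               // centroids, like the Explorer output
        int[] assignments = kmeans.getAssignments();
        for (int i = 0; i < assignments.length; i++) {
            System.out.println(data.instance(i) + " -> cluster " + assignments[i]);
        }
    }
}

Running it should print centroids and cluster assignments comparable to the Explorer output shown below.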

_attachments/Pasted image 20250928162344.png

=== Clustering model (full training set) ===

kMeans
======

Number of iterations: 2
Within cluster sum of squared errors: 1.3903798326399064

Initial starting points (random):

Cluster 0: 0.72,0.76,0.75,1,0.66
Cluster 1: 0.73,1,0.81,0.76,0.72
Cluster 2: 1,0.73,0.73,0.72,0.7

Missing values globally replaced with mean/mode

Final cluster centroids:
                         Cluster#
Attribute    Full Data          0          1          2
                 (5.0)      (1.0)      (2.0)      (2.0)
=======================================================
bik              0.776       0.72       0.73       0.85
ceb              0.804       0.76      0.905      0.725
hil              0.808       0.75      0.905       0.74
tgl              0.778          1      0.755       0.69
war              0.766       0.66      0.735       0.85

Time taken to build model (full training data) : 0 seconds

=== Model and evaluation on training set ===

Clustered Instances

0      1 ( 20%)
1      2 ( 40%)
2      2 ( 40%)

  • the higher the value, the closer that language is to the cluster's centroid
  • kMeans clustering: pick k starting points (here 3); each one becomes a centroid
  • take the next data point and determine which of the existing centroids it is closest to
  • if it is closer to centroid A than to B, that data point joins the cluster with centroid A
  • on each iteration, recompute the centroids, assign more data points, and so on
  • if it converges (no change in membership), clustering is complete (see the toy sketch below)
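
For intuition only, here is a toy from-scratch version of that loop (not Weka's implementation; the 2-D points are made-up numbers):

// KMeansSketch.java (toy illustration of the assign/recompute/converge loop)
import java.util.Arrays;

public class KMeansSketch {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[][] points = { {0.72, 0.76}, {0.73, 1.00}, {1.00, 0.73}, {0.85, 0.74}, {0.69, 0.85} };
        int k = 3;

        // starting centroids: here just the first k points (Weka picks them at random)
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = points[c].clone();

        int[] assignment = new int[points.length];
        boolean changed = true;
        while (changed) {                                  // stop when membership no longer changes
            changed = false;
            // step 1: assign every point to its nearest centroid
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist(points[i], centroids[c]) < dist(points[i], centroids[best])) best = c;
                if (best != assignment[i]) { assignment[i] = best; changed = true; }
            }
            // step 2: recompute each centroid as the mean of its members
            for (int c = 0; c < k; c++) {
                double[] sum = new double[points[0].length];
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] != c) continue;
                    for (int d = 0; d < sum.length; d++) sum[d] += points[i][d];
                    count++;
                }
                if (count > 0)
                    for (int d = 0; d < sum.length; d++) centroids[c][d] = sum[d] / count;
            }
        }
        System.out.println(Arrays.toString(assignment));   // cluster membership per point
    }
}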

Reading the results:
bikol

  • 0.72 with cluster 0's centroid
  • 0.73 with cluster 1's centroid
  • 0.85 with cluster 2's centroid <-- the highest, therefore bikol is part of cluster 2

cebuano is in cluster 1
hiligaynon is in cluster 1
tagalog is in cluster 0
waray is in cluster 2

Cluster 0 has 1 member, Cluster 1 has 2 members, and Cluster 2 has 2 members.

what we just did is use the Similarity Matrix to obtain the different Clusters.

  • you can also use Python for clustering and similarity matrix
  • we can use Weka to classify, cluster, and perform data association, etc.
  • you can export the model and create a Java application that does the same things (see the sketch below)
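
A short sketch of the export/reload step using Weka's SerializationHelper (the file name is made up, and in practice you would write out the clusterer you actually trained):

// ExportModel.java (sketch; "languages-kmeans.model" is a made-up file name)
import weka.clusterers.SimpleKMeans;
import weka.core.SerializationHelper;

public class ExportModel {
    public static void main(String[] args) throws Exception {
        SimpleKMeans kmeans = new SimpleKMeans();   // placeholder: build/train it first
        SerializationHelper.write("languages-kmeans.model", kmeans);

        // a separate Java application can later reload the model and reuse it
        SimpleKMeans loaded = (SimpleKMeans) SerializationHelper.read("languages-kmeans.model");
        System.out.println(loaded);
    }
}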

each time you perform an experiment, it shows up in the Result List on the lower left:
_attachments/Pasted image 20250928163933.png

  • you can right click and visualize the cluster assignments

_attachments/Pasted image 20250928164030.png

Interpreting the results

we have tagalog on its own, cebuano & hiligaynon together, and waray & bikol together

_attachments/Pasted image 20250928164130.png

  • based on the Ethnologue language family tree, tagalog is in a cluster of its own.
  • cebuano and hiligaynon are in the same cluster, but they belong to different sub-family trees
    • cebuano and hiligaynon share some orthographic characteristics
    • you can also look at historical data, e.g. a town where both cebuano and hiligaynon were used
    • we can check orthographic, societal, and transportation context to interpret the clusters

we used a feature set (in this activity we used a CSV file)

Creating the Feature Set

in order to create the CSV file...

  • map/location data
  • phonetic alphabet
  • words
  • other n-grams or tokenization
  • transportation network
  • another app: GabMap

how do you create the similarity matrix?

  • done by determining the Dice coefficient: 2|A ∩ B| / (|A| + |B|) (see the sketch after this list)
  • need Set A (Bikol) and Set B (Cebuano)
  • need to do it with all language pairs
  • can be done with vlookup
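
A rough sketch of assembling the matrix in code, assuming you already have each language's top-20 trigram set (the sets below are placeholders, not the real cheat-sheet values):

// SimilarityMatrix.java (illustrative sketch with placeholder trigram sets)
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class SimilarityMatrix {

    // Dice(A, B) = 2|A ∩ B| / (|A| + |B|)
    static double dice(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        return 2.0 * inter.size() / (a.size() + b.size());
    }

    public static void main(String[] args) {
        Map<String, Set<String>> top20 = new LinkedHashMap<>();   // one trigram set per language
        top20.put("bik", Set.of("_an", "_ka", "_ma", "ang"));      // placeholder values
        top20.put("ceb", Set.of("_an", "_ka", "_sa", "ang"));
        top20.put("hil", Set.of("_an", "_ka", "_ni", "ang"));
        top20.put("tgl", Set.of("_ng", "_ka", "_ma", "ang"));
        top20.put("war", Set.of("_an", "_ha", "_ma", "ang"));

        // one Dice score per language pair = one cell of the similarity matrix
        for (String a : top20.keySet()) {
            StringBuilder row = new StringBuilder(a);
            for (String b : top20.keySet())
                row.append(String.format("\t%.2f", dice(top20.get(a), top20.get(b))));
            System.out.println(row);
        }
    }
}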

activity 5: create a feature set that you can use in Weka or other tools

  • for similarity matrix, you can use the worksheet - Week 04 - Similarity matrix
  • mini project for Project 2
  • covers only 5 languages
  • get bonus points if you use a feature set that is composed of geographic or transportation networks (mentioned 43:00)

_attachments/Pasted image 20250928164925.png

  • contains bikol character trigrams
  • the frequency was from the Cheat Sheet (LID exercise)

using vlookup

  • provides you with a result given a particular primary key
  • primary key: the language itself
  • a 1 means the trigram is present in that language's top 20, e.g. _an appears in the top 20 trigrams of bikol, cebuano, hil, tgl, and war
  • a 0 means it is absent, e.g. in the top 20 trigrams of hiligaynon you don't see _ni, based on the orthographic data (see the sketch below)
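
The same lookup can also be expressed in code: a sketch that prints one row per candidate trigram with a 1/0 per language (the trigram lists here are placeholders, not the real top-20 lists):

// PresenceFeatures.java (sketch of what the VLOOKUP produces)
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PresenceFeatures {
    public static void main(String[] args) {
        List<String> trigrams = List.of("_an", "_ni", "_ka");   // the rows of the feature set
        List<String> languages = List.of("bik", "hil");
        Map<String, Set<String>> top20 = Map.of(
                "bik", Set.of("_an", "_ka"),
                "hil", Set.of("_an", "_ni"));                   // placeholder top-20 sets

        for (String trigram : trigrams) {
            StringBuilder line = new StringBuilder(trigram);
            for (String lang : languages) {
                // 1 if the trigram is in that language's top 20 trigrams, else 0
                line.append(",").append(top20.get(lang).contains(trigram) ? 1 : 0);
            }
            System.out.println(line);
        }
    }
}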

question: "non-breaking prefixes" - mr. mrs. etc

  • there are some abbreviations such as Mr. and Mrs. in the English bible

question: lexical normalization is not recommended because the affixes themselves carry information (certain affixes are unique to certain languages), so removing them would affect the results

50:38
_attachments/Pasted image 20250928165835.png

  • gold standard: use Ethnologue for the actual tree and that's how you will evaluate it
  • include the PDF file and source code
  • related works are in the Gdrive > Resources > NLP folder

How to make n-grams?

  • the character trigrams are the _an, _ka, _ma, etc.
  • watch previous recording

Formal definition: given a word $W = w_1 \dots w_k$, another sequence $Z = z_1 \dots z_n$ is an n-gram of $W$ if there exists a strictly increasing sequence of indices $i_1 < \dots < i_n$ of $W$ such that for all $j = 1 \dots n$ we have $w_{i_j} = z_j$.

_attachments/Pasted image 20250928170335.png

  • increasing the value of n captures more of each word, but larger n-grams can no longer capture single-letter or two-letter words and so on
  • n is the number of characters, and the word-boundary marker _ is also counted as a character (see the sketch below)
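
A small sketch of character-trigram extraction with word-boundary padding, assuming a single _ on each side of the word (the convention is inferred from the _an / _ka examples):

// CharTrigrams.java (sketch; the padding convention is an assumption)
import java.util.ArrayList;
import java.util.List;

public class CharTrigrams {

    // character n-grams over a word padded with "_" on both sides,
    // so boundary grams like "_an" and "an_" are captured too
    static List<String> ngrams(String word, int n) {
        String padded = "_" + word + "_";
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            grams.add(padded.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("ang", 3));   // prints [_an, ang, ng_]
    }
}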

Apache Nutch (& JCreator)

55:13

  • in the Nutch folder: Nutch.jsd, .jcp, .jcu, .jcw are used for JCreator
  • if you don't want to use it, use the src folder within the Nutch folder.
  • run the JDK installer file if you don't have the JDK yet

JCreator instructions

  • open the workspaces
  • configure > set to Java/JDK
  • data are in Nutch > classes > Trigram (build and run the file)

_attachments/Pasted image 20250928170742.png

// Nutch.java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.nutch.analysis.lang.NGramProfile; // from Nutch's language-identifier plugin

// Bikol.txt is the training data
// Apache Nutch is a trigram generator -> put the language files in Nutch and it will generate the ngp file
InputStream input = new FileInputStream("Trigram\\Bikol.txt");

// the ngp file is the trigram profile
// Trigram\\ is the Trigram folder, which is inside classes
OutputStream output = new FileOutputStream("Trigram\\bik.ngp");

// build a profile covering 1-grams up to 4-grams
NGramProfile testing = new NGramProfile("tl", 1, 4);
testing = testing.create("tl", input, "UTF-8"); // read the text and count the n-grams
testing.save(output); // save the profile into the output file
  • inside classes > Trigram > there are different text files
  • these are text collected from an online bible

sir stopped recording after giving us like 5 min