NLP1000_language_exploration_and_identification
This note has not been edited yet. Content may be subject to change.
no MP 2 specs notes here; I stopped listening
Language Exploration
familiar concepts?
- VLOOKUP, clustering
exploration: language identification
- download LanguageTool, which includes language identification functionality
Ngem soda ti ininumna.
Ang lahat ay nakapangyayari.
Pagkamaopay hito nga sumat!
guide questions:
- What languages were detected?
- Is the detection more accurate when the sentences are longer?
- What kind of algorithm do you think is being used? Does it analyze the input at the word level or at the character level?
Language Identification
- process of identifying (using computational means) which language the text input is in
- categories are the different languages
$ \hat{L} = \arg\max_{L \in \Gamma} S(X, L) $
- $\hat{L}$ is the identified language, i.e., the one with the highest similarity score
- X is the text input
- $\Gamma$ (capital gamma, the "upside-down L") is the set of target languages
- S(X, L) is the similarity score of X with language L
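The argmax step can be sketched in code. This is a minimal sketch, assuming a toy table of precomputed scores; the languages and score values are invented for illustration:

```python
# Minimal sketch of L-hat = argmax over L in Gamma of S(X, L).
# The score table below is a toy assumption, not real data.

def identify_language(text, languages, score_fn):
    """Return the language L in `languages` that maximizes score_fn(text, L)."""
    return max(languages, key=lambda lang: score_fn(text, lang))

# hypothetical precomputed similarity scores S(X, L) for one input X
scores = {"ENG": 1, "TGL": 3, "CEB": 2}
best = identify_language("ang lahat ay nakapangyayari", scores,
                         lambda x, lang: scores[lang])
print(best)  # TGL
```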
Manual method: Dictionary-based approach
- check word 1 against dictionary A, B, C, … for each word
Exercise 2
- determine the language of the text input
- complete the matrix (1 if present in the dictionary, 0 if not)
- dictionary = cheat sheet page 1
- you can use VLOOKUP in Google Sheets/Excel (copy-paste the dictionary)
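The same matrix can be sketched in code, assuming toy dictionaries in place of the cheat-sheet pages (the words and dictionary entries here are invented for illustration):

```python
# Exercise 2 sketch: 1 if the word is present in the language's
# dictionary, 0 if not; the per-language totals are the similarity scores.
words = ["ang", "lahat", "ay"]           # hypothetical tokenized text input
dictionaries = {                         # toy dictionaries, not the cheat sheet
    "ENG": {"the", "all", "is"},
    "TGL": {"ang", "ay", "mga"},
}
matrix = {lang: [1 if w in d else 0 for w in words]
          for lang, d in dictionaries.items()}
totals = {lang: sum(row) for lang, row in matrix.items()}
print(totals)  # {'ENG': 0, 'TGL': 2}
```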
Argmax Function
the formula stays the same; what changes is what the variables represent
$ \hat{L} = \arg\max_{L \in \Gamma} S(X, L) $
- $\hat{L}$ is the identified language
- X is the text input
- $\Gamma$ (capital gamma) is the set of target languages
- S(X, L) is the similarity score of X with L (here, the dictionary)
- scoring: presence or absence, then the total
- we are matching the text input X against the different dictionaries
regarding the dictionary-based approach...
- what happens if each dictionary has 200k entries?
- it's time consuming
- so we'll focus on frequently occurring keywords → this is where frequency and language modelling come in
Model-based approach
- language modelling: the process of creating smaller representations of a language
- smaller representations = language models
- requires large data to train
- expressed in terms of n-grams and their frequency count
- word n-gram
- character n-gram
n-grams
- word n-gram: n-word slice of a sentence
- character n-gram: n-character slice of a word

How do you create a language model?
for ex. "to be or not to be"
- count how many times each word appears
Word Bigram
- we add an underscore to indicate that "to" is at the start of the sentence and "be" is at the end
_to be or not to be_
- _to → 1
- to be → 2
- be or → 1
- or not → 1
- not to → 1
- be_ → 1
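These counts can be reproduced with a short sketch (standard library only):

```python
# Word bigrams of "to be or not to be", with "_" marking the
# sentence start and end as in the notes.
from collections import Counter

tokens = ["_"] + "to be or not to be".split() + ["_"]
bigrams = Counter(zip(tokens, tokens[1:]))
print(bigrams[("to", "be")])  # 2 — every other bigram occurs once
```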
Word Level: www.speech.sri.com/projects/srilm (wouldn't load, maybe it's https://www.sri.com/platform/srilm/?)
Character Level models: Apache Nutch
Word unigram of the English Wikipedia

- the most frequently occurring words; these are called stop words
Exercise 3
- determine the language of the text input
- complete the matrix: write 1 if the word is present in the unigram model, else 0
Argmax Function
$ \hat{L} = \arg\max_{L \in \Gamma} S(X, L) $
- $\hat{L}$ is the identified language
- X is the text input
- $\Gamma$ (capital gamma) is the set of target languages
- S(X, L) is the similarity score of X with L (here, the unigram model)
- scoring: presence/absence, then the total
- now we're comparing it with the unigram model
- large training data (TGL) → create unigram model (TGL)
- given the input, compare it to one model for a specific language and another model for another language and so on; based on the similarity score, get the argmax
instead of focusing on a dictionary-based approach, you can easily focus on the top 100 or 1000 using word unigrams
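A minimal sketch of this model-based pipeline, assuming tiny toy training corpora (not real Wikipedia data, and only a handful of words per "model"):

```python
from collections import Counter

def unigram_model(corpus, top_k=100):
    # the "smaller representation": the top_k most frequent words
    return {w for w, _ in Counter(corpus.lower().split()).most_common(top_k)}

# toy training data; real models would be built from large corpora
models = {
    "ENG": unigram_model("the cat sat on the mat and the dog ran"),
    "TGL": unigram_model("ang aso at ang pusa ay nasa banig"),
}

def identify(text):
    words = text.lower().split()
    # S(X, L): how many input words appear in language L's unigram model
    return max(models, key=lambda lang: sum(w in models[lang] for w in words))

print(identify("ang pusa ay tumakbo"))  # TGL
```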
Exercise 4
- we're given a paragraph, instead of manually checking we have to model the input text

- we'll have something like this (a language model of the input)
- now we don't compare the text itself; instead we compare the model of the text
- determine the language of the text input using its language model
Argmax Function
$ \hat{L} = \arg\max_{L \in \Gamma} S(X, L) $
- $\hat{L}$ is the identified language
- X is the n-gram model of the text input
- $\Gamma$ (capital gamma) is the set of target languages
- S(X, L) is the similarity score of X with L (unigram model)
- scoring: presence/absence, then the total
text input → unigram model → compared against the different language models
for the past few exercises we have been using words, but words are still big. Is there a way to make them smaller?
- use character n-grams: n-character slice of a string
- use character sequences instead of word sequences
character n-grams
FORMAL DEF: given a word $W = w_1 \dots w_k$, another sequence $Z = z_1 \dots z_n$ is an n-gram of $W$ if there exists a strictly increasing sequence $i_1 < \dots < i_n$ of indices of $W$ such that for all $j = 1 \dots n$, we have $w_{i_j} = z_j$.
n-character slice of a word
word we want to model: "word"
- 1-gram/unigram: w, o, r, d
- 2-gram/bigram: _w, wo, or, rd, d_
- 3-gram/trigram: _wo, wor, ord, rd_
- 4-gram: _wor, word, ord_
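These slices, assuming "_" padding on both ends of the word, can be generated with:

```python
def char_ngrams(word, n):
    # pad with "_" so word boundaries show up in the n-grams
    padded = "_" + word + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("word", 2))  # ['_w', 'wo', 'or', 'rd', 'd_']
print(char_ngrams("word", 3))  # ['_wo', 'wor', 'ord', 'rd_']
print(char_ngrams("word", 4))  # ['_wor', 'word', 'ord_']
```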
"an apple a day"
_an_
_apple_
_a_ (only three characters with padding, so a 4-gram can't capture it)
_day_
advantages/disadvantages
- limitation: as you increase the value of n, some short words will not be captured
trigram example

Exercise 5
- determine the language of the text input
- complete the matrix (1 if trigram is present in the trigram model in pg 4-5)
Argmax Function
what changed is what X represents
$ \hat{L} = \arg\max_{L \in \Gamma} S(X, L) $
- $\hat{L}$ is the identified language
- X is the n-gram model of the text input
- $\Gamma$ (capital gamma) is the set of target languages
- S(X, L) is the similarity score of X with L (trigram model)
- scoring: presence/absence, then the total
Other Similarity Measures
Trigram Rank
$ \hat{L} = \arg\max_{L \in \Gamma} S(X, L) $
- $\hat{L}$ is the identified language
- X is the n-gram model of the text input
- $\Gamma$ (capital gamma) is the set of target languages
- S(X, L) is the similarity score of X with L (trigram model)
- scoring: trigram rank (instead of presence/absence and total)
- on the cheat sheet, look at the number inside the parentheses (e.g., _an (2))
- for Tagalog, rank 1 is ng_ with ~73k instances
Exercise 6
- determine the language of the text input
- complete the matrix:
- if the trigram is present in the trigram model
- write the rank of the trigram
- write 21 otherwise
- count the total
- with ranks, the lower the total, the better the match
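A sketch of this rank-based scoring; the ranked trigram lists below are toy assumptions, not the cheat-sheet models. Because lower totals are better, this one takes the argmin:

```python
def rank_score(trigrams, ranked_list, penalty=21):
    # rank of each input trigram in the language's ranked list;
    # missing trigrams get the penalty rank (21, as in the exercise)
    ranks = {t: i + 1 for i, t in enumerate(ranked_list)}
    return sum(ranks.get(t, penalty) for t in trigrams)

models = {  # hypothetical top-5 trigrams per language
    "TGL": ["ng_", "ang", "_na", "an_", "_sa"],
    "ENG": ["the", "he_", "_th", "nd_", "ing"],
}
input_trigrams = ["ang", "ng_", "_sa"]
best = min(models, key=lambda lang: rank_score(input_trigrams, models[lang]))
print(best)  # TGL (total 2+1+5 = 8, vs 3*21 = 63 for ENG)
```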
Applications
- Corpus Building
- automatically crawl documents from the web
- automatically classify: language, genre, domain
- measuring language similarity
Tools
- language models: SRILM, Apache Nutch (character ngrams)
- Language identification: Apache Tika in LanguageTool
Exercise 7
- determine the language of the text input
- complete the matrix
- use your own metric (word- or n-gram-based)
Code-switching?
Siya ay eating egg waffles
Similarity (n-grams)
1:12:04
used for identifying cognates, variants, and similar-sounding languages
Dice coefficient

a statistical measure used to quantify the similarity between two sets

Dice coefficient = 2|X ∩ Y| / (|X| + |Y|); in the example, 2·5 / (6 + 5) ≈ 0.91
n-grams are useful because we can compute the Dice coefficient not over single letters but over bigrams and trigrams:
ex. confusable drug names
- _A AM AR ... vs _A AM MI ...: both share _A and AM, so the intersection is 2
- the bigrams of Diovan and Amikin don't overlap at all
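A sketch of the bigram-level Dice coefficient on the Diovan/Amikin pair from the notes (no boundary padding here, so the two names share no bigrams; the word/ward pair is an invented extra example):

```python
def bigrams(word):
    w = word.upper()
    return {w[i:i + 2] for i in range(len(w) - 1)}

def dice(a, b):
    # Dice coefficient: 2|X ∩ Y| / (|X| + |Y|)
    return 2 * len(a & b) / (len(a) + len(b))

print(dice(bigrams("diovan"), bigrams("amikin")))  # 0.0 — no shared bigrams
print(dice(bigrams("word"), bigrams("ward")))      # only RD is shared
```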
we can also use it to determine which languages are closely related

which languages is Ilocano closely related to?
- compare the trigram profile with other trigram profiles
- how many trigrams does it share with other languages?
- compare the Ilocano model with different language models
- if we're able to do that, we can get a similarity matrix
- if we perform clustering, we can create a family tree
MP 2 Specs
1:16:23