NLP1000_language_exploration_and_identification
This note has not been edited yet. Content may be subject to change.
no MP 2 specs notes here; I stopped listening
Language Exploration
familiar concepts?
- VLOOKUP, clustering
exploration: language identification
- download LanguageTool, which includes language identification functionality
Ngem soda ti ininumna.
Ang lahat ay nakapangyayari.
Pagkamaopay hito nga sumat!
guide questions:
- What languages were detected?
- Is the detection more accurate when the sentences are longer?
- What kind of algorithm do you think is being used? Does it analyze the input at the word level or at the character level?
Language Identification
- process of identifying (using computational means) which language the text input is in
- categories are the different languages
$ \hat{L} = \arg\max_{L \in \Gamma} S(X, L) $
- $\hat{L}$ is the identified language, i.e., the one with the highest similarity score
- X is the text input
- $\Gamma$ (capital gamma, the "upside-down L") is the set of target languages
- S(X, L) is the similarity score of X with language L
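The argmax step can be sketched in code. This is a minimal sketch, assuming a toy table of precomputed scores; the languages and score values are invented for illustration:

```python
# Minimal sketch of L-hat = argmax over L in Gamma of S(X, L).
# The score table below is a toy assumption, not real data.

def identify_language(text, languages, score_fn):
    """Return the language L in `languages` that maximizes score_fn(text, L)."""
    return max(languages, key=lambda lang: score_fn(text, lang))

# hypothetical precomputed similarity scores S(X, L) for one input X
scores = {"ENG": 1, "TGL": 3, "CEB": 2}
best = identify_language("ang lahat ay nakapangyayari", scores,
                         lambda x, lang: scores[lang])
print(best)  # TGL
```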
Manual method: Dictionary-based approach
- check word 1 against dictionary A, B, C, … for each word
Exercise 2
- determine the language of the text input
- complete the matrix (1 if present in the dictionary, 0 if not)
- dictionary = cheat sheet page 1
- you can use VLOOKUP in Google Sheets/Excel (copy-paste the dictionary)
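The same matrix can be sketched in code, assuming toy dictionaries in place of the cheat-sheet pages (the words and dictionary entries here are invented for illustration):

```python
# Exercise 2 sketch: 1 if the word is present in the language's
# dictionary, 0 if not; the per-language totals are the similarity scores.
words = ["ang", "lahat", "ay"]           # hypothetical tokenized text input
dictionaries = {                         # toy dictionaries, not the cheat sheet
    "ENG": {"the", "all", "is"},
    "TGL": {"ang", "ay", "mga"},
}
matrix = {lang: [1 if w in d else 0 for w in words]
          for lang, d in dictionaries.items()}
totals = {lang: sum(row) for lang, row in matrix.items()}
print(totals)  # {'ENG': 0, 'TGL': 2}
```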
Argmax Function
the formula stays the same; what changes is what the variables represent
$ \hat{L} = \arg\max_{L \in \Gamma} S(X, L) $
- $\hat{L}$ is the identified language
- X is the text input
- $\Gamma$ (capital gamma) is the set of target languages
- S(X, L) is the similarity score of X with L (here, the dictionary)
- scoring: presence or absence, then the total
- we are matching the text input X against the different dictionaries
regarding the dictionary-based approach...
- what happens if each dictionary has 200k entries?
- it's time consuming
- so we'll focus on frequently occurring keywords → this is where frequency and language modelling come in
Model-based approach
- language modelling: the process of creating smaller representations of a language
- smaller representations = language models
- requires large data to train
- expressed in terms of n-grams and their frequency count
- word n-gram
- character n-gram
n-grams
- word n-gram: n-word slice of a sentence
- character n-gram: n-character slice of a word

How do you create a language model?
for ex. "to be or not to be"
- count how many times each word appears
Word Bigram
- we add an underscore to indicate that "to" is at the start of the sentence and "be" is at the end
_to be or not to be_
- _to → 1
- to be → 2
- be or → 1
- or not → 1
- not to → 1
- be_ → 1
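These counts can be reproduced with a short sketch (standard library only):

```python
# Word bigrams of "to be or not to be", with "_" marking the
# sentence start and end as in the notes.
from collections import Counter

tokens = ["_"] + "to be or not to be".split() + ["_"]
bigrams = Counter(zip(tokens, tokens[1:]))
print(bigrams[("to", "be")])  # 2 — every other bigram occurs once
```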
Word Level: www.speech.sri.com/projects/srilm (wouldn't load, maybe it's https://www.sri.com/platform/srilm/?)
Character Level models: Apache Nutch
Word unigram of the English Wikipedia

- the most frequently occurring words; these are called stop words
Exercise 3
- determine the language of the text input
- complete the matrix: write 1 if the word is present in the unigram model, else 0
Argmax Function
$ \hat{L} = \arg\max_{L \in \Gamma} S(X, L) $
- $\hat{L}$ is the identified language
- X is the text input
- $\Gamma$ (capital gamma) is the set of target languages
- S(X, L) is the similarity score of X with L (here, the unigram model)
- scoring: presence/absence, then the total
- now we're comparing it with the unigram model
- large training data (TGL) → create unigram model (TGL)
- given the input, compare it to one model for a specific language and another model for another language and so on; based on the similarity score, get the argmax
instead of focusing on a dictionary-based approach, you can easily focus on the top 100 or 1000 using word unigrams
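A minimal sketch of this model-based pipeline, assuming tiny toy training corpora (not real Wikipedia data, and only a handful of words per "model"):

```python
from collections import Counter

def unigram_model(corpus, top_k=100):
    # the "smaller representation": the top_k most frequent words
    return {w for w, _ in Counter(corpus.lower().split()).most_common(top_k)}

# toy training data; real models would be built from large corpora
models = {
    "ENG": unigram_model("the cat sat on the mat and the dog ran"),
    "TGL": unigram_model("ang aso at ang pusa ay nasa banig"),
}

def identify(text):
    words = text.lower().split()
    # S(X, L): how many input words appear in language L's unigram model
    return max(models, key=lambda lang: sum(w in models[lang] for w in words))

print(identify("ang pusa ay tumakbo"))  # TGL
```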
Exercise 4
- we're given a paragraph, instead of manually checking we have to model the input text

- we'll have something like this (a language model of the input)
- now we don't compare the text itself; instead we compare the model of the text
- determine the language of the text input using its language model
Argmax Function
$ \hat{L} = \arg\max_{L \in \Gamma} S(X, L) $
- $\hat{L}$ is the identified language
- X is the n-gram model of the text input
- $\Gamma$ (capital gamma) is the set of target languages
- S(X, L) is the similarity score of X with L (unigram model)
- scoring: presence/absence, then the total
text input → unigram model → compared against the different language models
for the past few exercises we have been using words, but words are still big. Is there a way to make them smaller?
- use character n-grams: n-character slice of a string
- use character sequences instead of word sequences
character n-grams
FORMAL DEF: given a word $W = w_1 \dots w_k$, another sequence $Z = z_1 \dots z_n$ is an n-gram of $W$ if there exists a strictly increasing sequence $i_1 < \dots < i_n$ of indices of $W$ such that for all $j = 1 \dots n$, we have $w_{i_j} = z_j$.
n-character slice of a word
word we want to model: "word"
- 1-gram/unigram: w, o, r, d
- 2-gram/bigram: _w, wo, or, rd, d_
- 3-gram/trigram: _wo, wor, ord, rd_
- 4-gram: _wor, word, ord_
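These slices, assuming "_" padding on both ends of the word, can be generated with:

```python
def char_ngrams(word, n):
    # pad with "_" so word boundaries show up in the n-grams
    padded = "_" + word + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("word", 2))  # ['_w', 'wo', 'or', 'rd', 'd_']
print(char_ngrams("word", 3))  # ['_wo', 'wor', 'ord', 'rd_']
print(char_ngrams("word", 4))  # ['_wor', 'word', 'ord_']
```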
"an apple a day"
_an_
_apple_
_a_ (only three characters with padding, so a 4-gram can't capture it)
_day_
advantages/disadvantages
- limitation: as you increase the value of n, some short words will not be captured
trigram example

Exercise 5
- determine the language of the text input
- complete the matrix (1 if trigram is present in the trigram model in pg 4-5)
Argmax Function
what changed is what X represents
$ \hat{L} = \arg\max_{L \in \Gamma} S(X, L) $
- $\hat{L}$ is the identified language
- X is the n-gram model of the text input
- $\Gamma$ (capital gamma) is the set of target languages
- S(X, L) is the similarity score of X with L (trigram model)
- scoring: presence/absence, then the total
Other Similarity Measures
Trigram Rank
$ \hat{L} = \arg\max_{L \in \Gamma} S(X, L) $
- $\hat{L}$ is the identified language
- X is the n-gram model of the text input
- $\Gamma$ (capital gamma) is the set of target languages
- S(X, L) is the similarity score of X with L (trigram model)
- scoring: trigram rank (instead of presence/absence and total)
- on the cheat sheet, look at the number inside the parentheses (e.g., _an (2))
- for Tagalog, rank 1 is ng_ with ~73k instances
Exercise 6
- determine the language of the text input
- complete the matrix:
- if the trigram is present in the trigram model
- write the rank of the trigram
- write 21 otherwise
- count the total
- with ranks, the lower the total, the better the match
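A sketch of this rank-based scoring; the ranked trigram lists below are toy assumptions, not the cheat-sheet models. Because lower totals are better, this one takes the argmin:

```python
def rank_score(trigrams, ranked_list, penalty=21):
    # rank of each input trigram in the language's ranked list;
    # missing trigrams get the penalty rank (21, as in the exercise)
    ranks = {t: i + 1 for i, t in enumerate(ranked_list)}
    return sum(ranks.get(t, penalty) for t in trigrams)

models = {  # hypothetical top-5 trigrams per language
    "TGL": ["ng_", "ang", "_na", "an_", "_sa"],
    "ENG": ["the", "he_", "_th", "nd_", "ing"],
}
input_trigrams = ["ang", "ng_", "_sa"]
best = min(models, key=lambda lang: rank_score(input_trigrams, models[lang]))
print(best)  # TGL (total 2+1+5 = 8, vs 3*21 = 63 for ENG)
```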
Applications
- Corpus Building
- automatically crawl documents from the web
- automatically classify: language, genre, domain
- measuring language similarity
Tools
- language models: SRILM, Apache Nutch (character ngrams)
- Language identification: Apache Tika in LanguageTool
Exercise 7
- determine the language of the text input
- complete the matrix
- use your own metric (word- or n-gram-based)
Code-switching?
Siya ay eating egg waffles
Similarity (n-grams)
1:12:04
used for identifying cognates, variants, and similar-sounding languages
Dice coefficient

a statistical measure used to quantify the similarity between two sets

Dice coefficient = 2|X ∩ Y| / (|X| + |Y|); in the example, 2·5 / (6 + 5) ≈ 0.91
n-grams are useful because we can compute the Dice coefficient not over single letters but over bigrams and trigrams:
ex. confusable drug names
- _A AM AR ... vs _A AM MI ...: both share _A and AM, so the intersection is 2
- the bigrams of Diovan and Amikin don't overlap at all
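A sketch of the bigram-level Dice coefficient on the Diovan/Amikin pair from the notes (no boundary padding here, so the two names share no bigrams; the word/ward pair is an invented extra example):

```python
def bigrams(word):
    w = word.upper()
    return {w[i:i + 2] for i in range(len(w) - 1)}

def dice(a, b):
    # Dice coefficient: 2|X ∩ Y| / (|X| + |Y|)
    return 2 * len(a & b) / (len(a) + len(b))

print(dice(bigrams("diovan"), bigrams("amikin")))  # 0.0 — no shared bigrams
print(dice(bigrams("word"), bigrams("ward")))      # only RD is shared
```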
we can also use it to determine which languages are closely related

which languages is Ilocano closely related to?
- compare the trigram profile with other trigram profiles
- how many trigrams does it share with other languages?
- compare the Ilocano model with different language models
- if we're able to do that, we can get a similarity matrix
- if we perform clustering, we can create a family tree
MP 2 Specs
1:16:23