NLP1000_language_exploration_and_identification

RAW FILE

This note has not been edited yet. Content may be subject to change.

no MP 2 specs notes here; I stopped listening

Language Exploration

familiar?

  • vlookup, clustering

exploration: language identification

  • download LanguageTool (it has language identification functionality) and try it on these sentences:
Ngem soda ti ininumna.
Ang lahat ay nakapangyayari.
Pagkamaopay hito nga sumat!

guide questions:

  1. What languages were detected?
  2. Is the detection more accurate when the sentences are longer?
  3. What kind of algorithm do you think is being used?  Does it analyze the input at the word level or at the character level?

Language Identification

  • process of identifying (using computational means) which language the text input is in
  • categories are the different languages

$ \begin{aligned}
& \hat{L} = \underset{L \in \Gamma}{\arg\max} \; S(X, L)
\end{aligned}
$

  • L̂ is the identified language: the one with the highest similarity score
  • X is the text input
  • Γ (capital Gamma) is the set of target languages
  • S(X, L) is the similarity score of X with language L

Manual method: Dictionary-based approach

  • check each word of the input against dictionary A, dictionary B, dictionary C, …
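
a minimal sketch of this in Python, assuming each dictionary is just a set of words (the tiny word lists below are made-up stand-ins for the cheat-sheet dictionaries):

```python
# Dictionary-based language identification: count how many input words appear
# in each language's dictionary, then take the argmax over the target languages.
# These word lists are tiny illustrative stand-ins, not the real dictionaries.
DICTIONARIES = {
    "TGL": {"ang", "ng", "sa", "lahat", "ay"},
    "ILO": {"ngem", "ti", "dagiti", "iti"},
    "WAR": {"nga", "hito", "han", "ha"},
}

def identify(text: str) -> str:
    words = [w.strip(".,!?").lower() for w in text.split()]
    # S(X, L): 1 if a word is present in the dictionary, 0 if not, summed
    scores = {lang: sum(w in d for w in words) for lang, d in DICTIONARIES.items()}
    return max(scores, key=scores.get)           # argmax

print(identify("Ang lahat ay nakapangyayari."))  # prints TGL with these toy lists
```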

Exercise 2

  • determine the language of the text input
  • complete the matrix (1 if present in the dictionary, 0 if not)
  • dictionary = cheat sheet page 1
  • you can use vlookup on google sheets/excel (copy paste the dictionary)

Argmax Function

the argmax formula stays the same, but what some of the variables denote changes here
$ \begin{aligned}
& \hat{L} = \underset{L \in \Gamma}{\arg\max} \; S(X, L)
\end{aligned}
$

  • L̂ is the identified language
  • X is the text input
  • Γ (capital Gamma) is the set of target languages
  • S(X, L) is the similarity score of X with L (the dictionary)
    • presence or absence of each word; total
  • we are matching the text input X against the different dictionaries

regarding the dictionary-based approach...

  • what happens if each dictionary has 200k entries?
  • it's time consuming
  • so we'll focus on frequently occurring keywords → this is where frequency counts and language modelling come in

Model-based approach

  • language modelling: the process of creating smaller representations of a language
    • smaller representations = language models
    • requires large data to train
  • expressed in terms of n-grams and their frequency count
    • word n-gram
    • character n-gram

n-grams

  • word n-gram: n-word slice of a sentence
  • character n-gram: n-character slice of a word
    _attachments/Pasted image 20250923232125.png

How do you create a language model?

for ex. "to be or not to be"

  • count how many times each word appears
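
a quick sketch of that counting step with Python's collections.Counter:

```python
from collections import Counter

text = "to be or not to be"
unigram_model = Counter(text.split())
print(unigram_model)  # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```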

Word Bigram

  • we add an underscore to mark that the word "to" is at the start of the sentence and "be" is at the end (a sketch of the counting follows below)
    _to be or not to be_
    _to     1
    to be   2
    be or   1
    or not  1
    not to  1
    be_     1
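
the same counting for word bigrams, padding with "_" to mark the sentence start and end (a minimal sketch that reproduces the counts above):

```python
from collections import Counter

def word_bigrams(sentence: str) -> Counter:
    tokens = ["_"] + sentence.split() + ["_"]   # "_" marks sentence start/end
    return Counter(zip(tokens, tokens[1:]))

print(word_bigrams("to be or not to be"))
# ('to', 'be') appears twice; every other bigram, including ('_', 'to')
# and ('be', '_'), appears once -- matching the counts above.
```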

Word Level: www.speech.sri.com/projects/srilm (wouldn't load, maybe it's https://www.sri.com/platform/srilm/?)
Character Level models: Apache Nutch

Word unigram of the English Wikipedia
_attachments/Pasted image 20250923232621.png

  • the most frequently occurring words; these are called stop words

Exercise 3

  • determine the language of the text input
  • complete the matrix: write 1 if the word is present in the unigram model, else 0

Argmax Function

$ \begin{aligned}
& \hat{L} = \underset{L \in \Gamma}{\arg\max} \; S(X, L)
\end{aligned}
$

  • L̂ is the identified language
  • X is the text input
  • Γ (capital Gamma) is the set of target languages
  • S(X, L) is the similarity score of X with L (the unigram model)
    • presence/absence; total
  • now we're comparing the input with each language's unigram model
  • large training data (TGL) → create unigram model (TGL)
  • given the input, compare it against each language's model in turn; then take the argmax of the similarity scores

instead of the full dictionary-based approach, you can simply focus on the top 100 or 1,000 word unigrams (a sketch follows below)
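
a sketch of trimming a full unigram frequency count down to a top-N word list (the toy corpus below stands in for the large training data):

```python
from collections import Counter

# corpus_tokens stands in for the tokenized large training data of one language
corpus_tokens = "to be or not to be that is the question".split()
unigram_counts = Counter(corpus_tokens)

# keep only the most frequent words as that language's compact unigram model
top_words = {word for word, _ in unigram_counts.most_common(100)}
print(top_words)
```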

Exercise 4

  • we're given a paragraph; instead of manually checking each word, we model the input text itself

_attachments/Pasted image 20250923233455.png

  • we'll have something like this (language model)
  • now we don't have to compare the raw text; instead we compare the model of the text against each language model (see the sketch after this list)
  • determine the language of the text input using its language model
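
a minimal sketch of that model-to-model comparison, with made-up top-word models standing in for the real ones:

```python
from collections import Counter

def model(text: str) -> Counter:
    # unigram model of the text: word -> frequency
    return Counter(w.strip(".,!?").lower() for w in text.split())

def similarity(input_model: Counter, language_model: set) -> int:
    # presence/absence of each word type of the input's model in the language model
    return sum(1 for word in input_model if word in language_model)

# illustrative top-word models only, not real data
LANG_MODELS = {
    "ENG": {"the", "and", "of", "is", "to"},
    "TGL": {"ang", "ng", "ay", "sa", "na"},
}

x = model("Ang lahat ay nakapangyayari.")        # model the input text itself
scores = {lang: similarity(x, m) for lang, m in LANG_MODELS.items()}
print(max(scores, key=scores.get))               # argmax -> TGL here
```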

Argmax Function

$ \begin{aligned}
& \hat{L} = \underset{L \in \Gamma}{\arg\max} \; S(X, L)
\end{aligned}
$

  • Identified language
  • X is the n-gram model of the text input
  • capital Gamma is the set of target languages
  • S(X, L) is the similarity score of X with L (unigram model)
    • presence/absence; total

text input → unigram model → compare against the different language models

for the past few exercises we have been using words, but words are still big. Is there a way to make them smaller?

  • use character n-grams: n-character slice of a string
  • use character sequences instead of word sequences

character n-grams

FORMAL DEF: given a word W = w1 … wk, another sequence Z = z1 … zn is an n-gram of W if there exists a strictly increasing sequence of indexes i1 < … < in of W such that for all j = 1 … n, we have w_{i_j} = z_j.

n-character slice of a word
word we want to model: "word"

  • 1-gram/unigram: w, o, r, d
  • 2-gram/bigram: _w, wo, or, rd, d_
  • 3-gram/trigram: _wo, wor, ord, rd_
  • 4-gram: _wor, word, ord_

"an apple a day"
_an_
_apple_
_a_ (a 4-gram can't capture this, since "_a_" is only 3 characters long)
_day_
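
a sketch of character n-gram extraction that reproduces the slices above; padding with a single "_" on each side is an assumption that matches these examples (other conventions pad with n-1 underscores):

```python
def char_ngrams(word: str, n: int) -> list:
    padded = f"_{word}_"                          # pad with one "_" on each side
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("word", 2))  # ['_w', 'wo', 'or', 'rd', 'd_']
print(char_ngrams("word", 3))  # ['_wo', 'wor', 'ord', 'rd_']
print(char_ngrams("word", 4))  # ['_wor', 'word', 'ord_']
print(char_ngrams("a", 4))     # []  -- "_a_" is too short for any 4-gram
```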

advantages / disadvantages

  • limitation: as you increase the value of n, some short words will no longer be captured

trigram example
_attachments/Pasted image 20250923234819.png

Exercise 5

  • determine the language of the text input
  • complete the matrix (1 if the trigram is present in the trigram model on pages 4-5, 0 if not)

Argmax Function

what changed is what X represents
$ \begin{aligned}
& \hat{L} = \underset{L \in \Gamma}{\arg\max} \; S(X, L)
\end{aligned}
$

  • Identified language
  • X is the n-gram model of the text input
  • capital Gamma is the set of target languages
  • S(X, L) is the similarity score of X with L (trigram model)
    • presence/absence; total

Other Similarity Measures

Trigram Rank

$ \begin{aligned}
& \hat{L} = \underset{L \in \Gamma}{\arg\max} \; S(X, L)
\end{aligned}
$

  • identified language
  • X is the n-gram model of the text input
  • capital Gamma is the set of target languages
  • S(X, L) is the similarity score of X with L (trigram model)
    • trigram rank (instead of presence/absence; total)
  • on the cheat sheet, look at the number inside the parentheses (e.g. _an (2))
  • for Tagalog, rank 1 is ng_ with ~73k instances (a sketch of the rank-based score follows below)
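
a sketch of the rank-based score from Exercise 6, with made-up ranked trigram lists (only the "ng_ is rank 1 for Tagalog" detail is echoed from the notes) and the "write 21 otherwise" penalty:

```python
def char_trigrams(text: str) -> list:
    padded = "_" + text.lower().replace(" ", "_") + "_"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

# Illustrative ranked trigram lists (rank = position + 1); these are not the
# real models, they only echo the note that "ng_" is rank 1 for Tagalog.
RANKED_MODELS = {
    "TGL": ["ng_", "ang", "_na", "an_", "_sa"],
    "ENG": ["the", "_th", "he_", "ing", "nd_"],
}
PENALTY = 21  # rank written for trigrams that are absent from the model

def rank_score(text: str, ranked: list) -> int:
    total = 0
    for tri in char_trigrams(text):
        total += ranked.index(tri) + 1 if tri in ranked else PENALTY
    return total  # lower is better

scores = {lang: rank_score("ang saging", model) for lang, model in RANKED_MODELS.items()}
print(min(scores, key=scores.get))  # arg*min* here, since lower rank totals win
```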

Exercise 6

  • determine the language of the text input
  • complete the matrix:
    • if the trigram is present in the trigram model, write its rank
    • write 21 otherwise
    • count the total
    • with rank, the lower the total, the better

Applications

  • Corpus Building
    • automatically crawl documents from the web
    • automatically classify: language, genre, domain
  • measuring language similarity

Tools

  • language models: SRILM, Apache Nutch (character n-grams)
  • Language identification: Apache Tika in LanguageTool

Exercise 7

  • determine the language of the text input
  • complete the matrix
  • use your own metric (word/n-gram)

Code-switching?

Siya ay eating egg waffles (Tagalog-English mix: "He/She is eating egg waffles")

Similarity (n-grams)

1:12:04
used for cognates, spelling variants, and identifying similar-sounding languages

Dice coefficient

_attachments/Pasted image 20250924001202.png
a statistical measure used to quantify the similarity between two data sets

_attachments/Pasted image 20250924001228.png
dice coefficient = 2 × 5 / (6 + 5) ≈ 0.91
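
restating the note's numbers (|A ∩ B| = 5, |A| = 6, |B| = 5) in the usual formula:

$ \begin{aligned}
& \text{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|} \\
& \text{Dice} = \frac{2 \times 5}{6 + 5} = \frac{10}{11} \approx 0.91
\end{aligned}
$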

n-grams are useful here because we can compute the Dice coefficient not over single letters but over bigrams and trigrams:

ex. confusable drug names _attachments/Pasted image 20250924001432.png

  • both have an intersection of 2
  • _A AM AR ...
  • _A AM MI ...; the bigrams of Diovan and Amikin don't share anything meaningful (see the sketch below)
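
a sketch of bigram-level Dice similarity for confusable drug names; "Amaryl" is only a guess at the name behind "_A AM AR ...", and with end-padding Diovan and Amikin still share the single bigram N_, so the score comes out near zero rather than exactly zero:

```python
def bigram_set(name: str) -> set:
    padded = f"_{name.upper()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def dice(a: set, b: set) -> float:
    return 2 * len(a & b) / (len(a) + len(b))   # 2|A ∩ B| / (|A| + |B|)

# drug-name pairs; "Amaryl" is an assumed stand-in for the slide's first name
for pair in [("Amaryl", "Amikin"), ("Diovan", "Amikin")]:
    a, b = (bigram_set(n) for n in pair)
    print(pair, sorted(a & b), round(dice(a, b), 2))
```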

we can also use it to determine which languages are closely related
_attachments/Pasted image 20250924001613.png
to which languages is Ilocano closely related?

  • compare the trigram profile with other trigram profiles
  • how many trigrams are shared with the other languages?
  • compare the Ilocano model with the different language models
  • if we're able to do that, we can get a similarity matrix
  • if we perform clustering, we can create a family tree (see the sketch below)
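
a sketch of the similarity-matrix idea with toy trigram profiles (real profiles would come from the trigram models built earlier); feeding the resulting matrix to a clustering routine would give the family tree:

```python
# Toy trigram profiles (sets of each language's most frequent character
# trigrams); illustrative only, not taken from the real models.
PROFILES = {
    "ILO": {"ng_", "nga", "iti", "ti_", "_ka"},
    "TGL": {"ng_", "nga", "ang", "_sa", "_ka"},
    "ENG": {"the", "_th", "he_", "ing", "nd_"},
}

def dice(a: set, b: set) -> float:
    return 2 * len(a & b) / (len(a) + len(b))

# pairwise similarity matrix; clustering this groups closely related languages
langs = sorted(PROFILES)
print("     " + "   ".join(langs))
for l1 in langs:
    row = "  ".join(f"{dice(PROFILES[l1], PROFILES[l2]):.2f}" for l2 in langs)
    print(l1, row)
```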

MP 2 Specs

1:16:23