NLP1000_nlp_and_languagetool
Announcements
Invited talks:
October 7, 6:00 PM - 9:00 PM (online)
October 17, 6:00 PM - 9:00 PM (in-person)
October 24, 6:00 PM - 9:00 PM (in-person)
Tonight
Download LanguageTool (Complete or demo package)
Download grammar.xml
Activity 2? Praat Activity

- yes, the first oh in halo-halo is different

if you're interested in sociophonetics...

- L1: native language

- during the first few months, professors lectured mostly in English
- after a few months, there was a shift from English to Filipino
Activity 3 Regex Steps
- remove in-text citations: search for \[\d\] and replace with a newline
- this removes the [1], [2], etc. markers

- eliminates the in-text citations and creates new lines in one step
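A minimal Python sketch of the step above, assuming the text is already loaded into a string. The sample sentence is made up, and the default pattern `\[\d+\]` (with `+`) is an assumption added here so that multi-digit citations such as [12] are also caught; the pattern from the notes, `\[\d\]`, only matches single-digit ones.

```python
import re

def strip_citations(text, pattern=r"\[\d+\]"):
    """Replace every in-text citation marker with a newline."""
    return re.sub(pattern, "\n", text)

# Made-up sample text, for illustration only.
sample = "First claim.[1] Second claim.[12] Third claim."

# Default pattern: handles [1] and [12].
cleaned = strip_citations(sample)

# Pattern exactly as written in the notes: [12] is left untouched.
single_digit_only = strip_citations(sample, pattern=r"\[\d\]")

print(cleaned)
print(single_digit_only)
```

Each citation marker becomes a line break, so removing citations and splitting the text happen in the same pass.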
actor focus: focus on the actor
object focus: focus on the object
- shows you where the emphasis in the sentence is
Natural Language Processing
- use computational means to translate language from one form to another
- we focus on languages used by humans → the Philippines is a gold mine for this
Philippine Languages: Ethnologue
- the number of established languages listed for the Philippines is 186
- 184 are living, 2 are extinct
- also listed are 7 unestablished languages and 2 macrolanguages

Free Word Order in Philippine Languages
- we can come up with a sentence and arrange it in multiple ways

Code-switching in Philippine Languages
- intra-sentential:
- siya ay eating doon
- intra-word: the root word is in English but the affixes come from another language
- nagla-library
- borrowing: the borrowed word is respelled using the orthography of the borrowing language
- naglalaybrari
one thesis topic: code switching section for at least 3 languages
Rich Morphology in Philippine Languages
- prefix: start of the word
- infix: inside the word
- suffix: end of the word
- circumfixation: start and end
- reduplication: duplicate the affix or the first syllable of the root word
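- plurality
- verb focus: sinulat (object focus) vs. sumulat (actor focus)
- verb aspect: naganap (completed), nagaganap (ongoing), magaganap (contemplated), etc.
Sample sentences with -um- (actor-focus) verbs: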
Sumulat si Ana ng liham.
Kumain si Pedro ng mangga.
Uminom si Maria ng tubig.
Lumakad si Juan papunta sa paaralan.
Bumili si Liza ng tinapay.
Pumunta sila sa palengke.
Gumising si Mark nang maaga.
Sumayaw ang mga bata sa entablado.
Humiga si Lola sa kama.
Lumangoy si Carlo sa dagat.
Search for: (.*)um(.*)
Replace with: \1in\2
(Note: In some editors, use $1 instead of \1.)
if we change only the verb from actor focus to object focus, the rest of the sentence may become ungrammatical, because the markers (si, ng, ang, ni) have to change too.
Sumulat si Ana ng liham should become: Sinulat ang liham ni Ana.
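The search-and-replace above can be sketched in Python with `re.sub`. This is only an illustration of the same pattern, `(.*)um(.*)` replaced by `\1in\2`, applied to three of the sample sentences; the function name `um_to_in` is made up here. It changes the verb only, so the output still needs the marker changes described above.

```python
import re

def um_to_in(sentence):
    """Apply the notes' search/replace: (.*)um(.*) -> \\1in\\2."""
    return re.sub(r"(.*)um(.*)", r"\1in\2", sentence)

sentences = [
    "Sumulat si Ana ng liham.",
    "Kumain si Pedro ng mangga.",
    "Bumili si Liza ng tinapay.",
]

converted = [um_to_in(s) for s in sentences]

# The pattern is case-sensitive, so a verb that starts with a capital
# "Um" (no lowercase "um" anywhere in the sentence) is left unchanged.
unchanged = um_to_in("Uminom si Maria ng tubig.")

for before, after in zip(sentences, converted):
    print(before, "->", after)
```

Because `(.*)` is greedy, a sentence containing more than one "um" would have only the last occurrence replaced, which is one more reason to review the output by hand.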
Challenges when dealing with Philippine Languages
- AI bias: Filipino-to-English translation assigns a gender based on the model's training data (siya itself is gender-neutral)
- isa siyang doktor → he is a doctor
- isa siyang nars → she is a nurse
- variations of words exist: a particular word can be spelled multiple ways
Filipino Orthography
Abecedario: 32 letters
- a, b, c, d, e, f, g, h, i, j, k, l, m, n, ng, ñ, ñg (enye-g),
- o, p, q, r, s, t, u, v, w, x, y, z
- ch, ll, rr
1940: 20 letters - a, b, k, d, e, g, h, i, l, m, n, ng, o, p, r, s, t, u, w, y
1987: Additional 8 letters - c, f, j, ñ, q, v, x, z
Activity & KWF
KWF - online dictionary https://kwfdiksiyonaryo.ph/
K[ou]mperensiya
- komperensiya - majority
- kumperensiya - on KWF
K[ou]mpanya
- kompanya - majority, in KWF
- kumpanya
K[ou]mpleto
- kompleto - in KWF
- kumpleto - majority
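One way to check which variant is the "majority" spelling is to count matches of the bracket pattern in a corpus. The sketch below is only an illustration: the helper name `count_variants` and the sample text are made up, and a real count would run over a large corpus rather than one sentence.

```python
import re
from collections import Counter

def count_variants(pattern, text):
    """Count each distinct (lowercased) string matched by the pattern."""
    matches = re.finditer(pattern, text, flags=re.IGNORECASE)
    return Counter(m.group(0).lower() for m in matches)

# Made-up sample text, for illustration only.
sample = (
    "Kumpleto na ang listahan. Hindi pa kompleto ang ulat, "
    "pero kumpleto ang datos."
)

counts = count_variants(r"\bk[ou]mpleto\b", sample)
print(counts.most_common())
```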

- diacritic marks - accented words (words written with accent marks)
Why is spelling important
- it affects accuracy
- Kuwentuhan mo ako - tell me a story
- Kwentuhan mo ako - talk to me
- part-of-speech tagging in German (Scheible et al. 2011)
- intent classification, slot-filling

Existing Works

- precision is an issue (for variants such as d vs. r, extraction is not simple)
- not all matching words are spelling variants: simply extracting [ou] words or word pairs will not yield only true spelling variants

check Extracting Filipino Spelling Variants in Google Drive

Example:
rituwal vs. ritwal
CuwV vs. CwV // Consonant and Vowel
[b-df-hj-np-tv-z]uw[aeiou] vs. [b-df-hj-np-tv-z]w[aeiou]
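A rough sketch of how the CuwV vs. CwV pattern could be used to propose candidate pairs from a word list. The function name `candidate_pairs` and the tiny vocabulary are made up for illustration; the idea is to drop the "u" from every CuwV sequence and keep the pair only if the shortened form is also attested.

```python
import re

CONSONANT = "[b-df-hj-np-tv-z]"
VOWEL = "[aeiou]"
# Consonant + "uw" + vowel, with the consonant and vowel captured.
CUWV = re.compile(f"({CONSONANT})uw({VOWEL})")

def candidate_pairs(vocabulary):
    """Return (CuwV form, CwV form) pairs where both forms are in the vocabulary."""
    vocab = set(vocabulary)
    pairs = []
    for word in sorted(vocab):
        shortened = CUWV.sub(r"\1w\2", word)  # rituwal -> ritwal
        if shortened != word and shortened in vocab:
            pairs.append((word, shortened))
    return pairs

# Made-up vocabulary, for illustration only.
words = ["rituwal", "ritwal", "kuwento", "kwento", "buwan", "tuwa"]
pairs = candidate_pairs(words)
print(pairs)
```

Words such as buwan and tuwa match the CuwV pattern but have no attested CwV counterpart in this list, so they are not reported. Pairs found this way are still only candidates and need checking, given the precision issue noted above.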
Tagalog Wikipedia
- Wikipedia is available in different Philippine languages
- download the entire Tagalog Wikipedia
- August 20, 2024 snapshot
- 11M words recorded

Wiki article: https://en.wikipedia.org/wiki/Tagalog_Wikipedia
TL Wiki dump: https://dumps.wikimedia.org/tlwiki/20240820/tlwiki-20240820-pages-articles.xml.bz2
Extractor: https://github.com/apertium/WikiExtractor

- you can change it to cebwiki, tlwiki, etc.
- https://dumps.wikimedia.org/tlwiki/
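The dump link above follows a regular pattern, so swapping the wiki code can be sketched as a small helper. The function name `dump_url` is made up here, and the sketch only builds the string: it does not check that a snapshot for a given wiki and date actually exists on the server.

```python
def dump_url(wiki, date):
    """Build a pages-articles dump URL, following the tlwiki link in the notes."""
    return (
        f"https://dumps.wikimedia.org/{wiki}/{date}/"
        f"{wiki}-{date}-pages-articles.xml.bz2"
    )

tl_url = dump_url("tlwiki", "20240820")
# Assumption: a cebwiki snapshot with the same date; not verified here.
ceb_url = dump_url("cebwiki", "20240820")

print(tl_url)
print(ceb_url)
```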
Patterns that give us spelling variants
- and we're sure that there are


- succeeding V (vowel) and succeeding C (consonant)

the rest are on the slides

- we need tools that adhere to KWF guidelines and balance …

LanguageTool
- next week: Language Identification and Web Crawling
- check the exploration activities

- open LanguageToolGUI.jar
- LT Complete Package > dist > resource and rules folders
- resource > contains language codes (you can search these in Ethnologue and Wikipedia)
- rules > tl > grammar.xml > open in npp (Notepad++) or any text editor
rules can be represented by...
- tokens
- regex: <token regexp="yes">
- tags
Cleaned grammar.xml (on Canvas)
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="../print.xsl" ?>
<?xml-stylesheet type="text/css" href="../rules.css"
title="Easy editing stylesheet" ?>
<!--
Tagalog Grammar and Typo Rules for LanguageTool
See tagset.txt for the different POS, Lexical Categories, and corresponding attributes
Copyright (C) 2011 Nathaniel Oco and Allan Borra (http://www.dlsu.edu.ph/research/centers/adric/nlp/)
$Id: grammar.xml,v 1.129 2010/11/13 23:24:21 dnaber Exp $
-->
<rules lang="tl" xsi:noNamespaceSchemaLocation="../rules.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <category name="Sound and Letter Change">
    <rule id="D" name="d (r)">
      <pattern case_sensitive="no" mark_from="1">
        <token>siya</token>
        <token>din</token>
      </pattern>
      <message>
        <suggestion>rin</suggestion>
      </message>
      <!--
      <pattern case_sensitive="no" mark_from="1">
        <token regexp="yes">.*[aeiou]</token>
        <token regexp="yes">ding?|dito|ditong|daw|diyang?|dyang?|doong?</token>
      </pattern>
      <message>
        <suggestion>
          <match no="2" case_conversion="startlower" regexp_match=".(.*)" regexp_replace="r$1"/>
        </suggestion>
      </message>
      -->
      <short>Sound and Letter Change</short>
      <example correction="rin" type="incorrect">Ka <marker>din</marker></example>
      <example type="correct">Ka <marker>rin</marker></example>
    </rule>
  </category>
  <!--
  <category name="Style Rules">
    <rule id="A AY B" name="a ay b (b a)">
      <pattern case_sensitive="no" mark_from="0">
        <token>ang</token>
        <token>aso</token>
        <token>ay</token>
        <token postag="V.*" postag_regexp="yes"/>
      </pattern>
      <message>
        <suggestion>
          \4 \1 \2
        </suggestion>
      </message>
      <short>AY Removal</short>
      <example correction="kumakain ang aso" type="incorrect"><marker>ang aso ay kumakain</marker></example>
      <example type="correct"><marker>kumakain siya</marker></example>
    </rule>
  </category>
  -->
</rules>
- the entire grammar.xml is enclosed in rules tag
- in rules, there are categories (spelling, grammar, style)
- in categories, there are rules
- in each rule, there is a pattern and a message (with suggestions)
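The nesting described above (rules > category > rule > pattern and message) can be inspected with Python's standard `xml.etree.ElementTree`. This is only a sketch: `SAMPLE` is a trimmed copy of the rule from the file, and the helper name `summarize` is made up here.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the "d (r)" rule, kept short for illustration.
SAMPLE = """<rules lang="tl">
  <category name="Sound and Letter Change">
    <rule id="D" name="d (r)">
      <pattern case_sensitive="no" mark_from="1">
        <token>siya</token>
        <token>din</token>
      </pattern>
      <message>
        <suggestion>rin</suggestion>
      </message>
    </rule>
  </category>
</rules>"""

def summarize(xml_text):
    """List (category name, rule id, pattern tokens, suggestions) per rule."""
    root = ET.fromstring(xml_text)
    summary = []
    for category in root.findall("category"):
        for rule in category.findall("rule"):
            tokens = [t.text for t in rule.findall("pattern/token")]
            suggestions = [s.text for s in rule.findall("message/suggestion")]
            summary.append((category.get("name"), rule.get("id"), tokens, suggestions))
    return summary

summary = summarize(SAMPLE)
print(summary)
```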
rename the old grammar.xml to something else and paste the new grammar.xml there

- one way to check that it's still working: test the sentence "Siya daw raw po."
<rule id="D" name="d (r)">
  <pattern case_sensitive="no" mark_from="1">
    <token>siya</token>
    <token>daw</token> <!-- was: din -->
  </pattern>
  <message>
    <suggestion>raw</suggestion> <!-- was: rin -->
  </message>
- Change line 16 (<token>din</token>) to daw
- Change line 19 (<suggestion>rin</suggestion>) to raw
- Run LanguageToolGUI.jar
- "Siya din raw po" won't trigger anything, because the rule isn't set up for it (the edited pattern only matches siya followed by daw)
- we don't have to declare every token (siya, sila, etc.); we can use regex instead
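To see what the regex version buys us, here is a rough Python imitation of the commented-out regex pattern in grammar.xml: a token ending in a vowel, followed by a d-word, with the suggestion formed by swapping the first letter for "r". It only mimics the logic and is not how LanguageTool itself runs rules; the function name `suggest_r_forms`, the whitespace tokenization, and the whole-token matching are assumptions made for this sketch.

```python
import re

# The two token patterns from the commented-out rule in grammar.xml.
PRECEDING = re.compile(r".*[aeiou]", re.IGNORECASE)
D_WORD = re.compile(r"ding?|dito|ditong|daw|diyang?|dyang?|doong?", re.IGNORECASE)

def suggest_r_forms(sentence):
    """Return (d-word, suggested r-form) pairs for a whitespace-tokenized sentence."""
    tokens = sentence.split()
    suggestions = []
    for previous, word in zip(tokens, tokens[1:]):
        # Both patterns must match the whole token.
        if PRECEDING.fullmatch(previous) and D_WORD.fullmatch(word):
            suggestions.append((word, "r" + word[1:]))
    return suggestions

print(suggest_r_forms("Siya din po"))
print(suggest_r_forms("Sila daw po"))
print(suggest_r_forms("Bukas din po"))
```

One pattern now covers siya, sila, and any other word ending in a vowel, instead of one rule per word.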
- if something breaks, restore grammar.xml to the default
