NLP1000_nlp_and_languagetool

RAW FILE

This note has not been edited yet. Content may be subject to change.

Announcements

Invited talks:
October 7, 6:00 PM - 9:00 PM (online)
October 17, 6:00 PM - 9:00 PM (in-person)
October 24, 6:00 PM - 9:00 PM (in-person)

Tonight:
Download LanguageTool (Complete or demo package)
Download grammar.xml

Activity 2? Praat Activity

_attachments/Pasted image 20250916181400.png

  • yes, the first oh in halo-halo is different
    _attachments/Pasted image 20250916181422.png

if you're interested in sociophonetics...
_attachments/Pasted image 20250916181517.png

  • L1: native language

_attachments/Pasted image 20250916181608.png

  • during the first few months, professors lectured mostly in English
  • after a few months, there was a shift from English to Filipino

Activity 3 Regex Steps

  1. find each in-text citation with \[\d\] and replace it with a newline
  2. remove any remaining markers such as [1][2]

_attachments/Pasted image 20250916181816.png|500

  • eliminates in-text citations and creates new lines in one step
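
The one-step cleanup above can be sketched in Python with the standard re module. The sample text is made up, and widening \d to \d+ (so that multi-digit markers like [13] are also caught) is an assumption; the pattern used in class was \[\d\].

```python
import re

# Hypothetical sample text with numeric in-text citations.
text = "Tagalog is an Austronesian language.[1] It is spoken in the Philippines.[2][13]"

# A run of adjacent markers such as [1] or [2][13], plus one optional trailing space.
CITATION = re.compile(r"(?:\[\d+\])+ ?")

def split_on_citations(raw: str) -> str:
    """Replace each run of citation markers with a newline."""
    return CITATION.sub("\n", raw).strip()

print(split_on_citations(text))
```

Each sentence ends up on its own line, which is the same effect as doing the replace in a text editor with regex mode turned on.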

actor focus: focus on the actor
___ focus: focus on the object

  • shows you where the emphasis in the sentence is

Natural Language Processing

  • use computational means to translate language from one form to another
  • we focus on languages used by humans → the Philippines is a gold mine for this

Philippine Languages: Ethnologue

  • the number of established languages listed for the Philippines is 186
    • 184 are living, 2 are extinct
  • also listed are 7 unestablished languages and 2 macrolanguages

_attachments/Pasted image 20250916182246.png

Free Word Order in Philippine Languages

  • we can come up with a sentence and arrange it in multiple ways
    _attachments/Pasted image 20250916182323.png

Code-switching in Philippine Languages

  • intra-sentential:
    • siya ay eating doon
  • intra-word: the root word is in English but the affixes come from another language
    • nagla-library
  • borrowing: the foreign word is respelled using the orthography of the borrowing language
    • naglalaybrari

one thesis topic: code switching section for at least 3 languages

Rich Morphology in Philippine Languages

  • prefix: start of the word
  • infix: inside the word
  • suffix: end of the word
  • circumfixation: start and end
  • reduplication: duplicate the affix or the first syllable of the root word
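  • plurality
  • verb focus: sinulat (object focus) vs sumulat (actor focus)
  • verb aspect: naganap (completed), nagaganap (ongoing), magaganap (contemplated)

Actor-focus sentences for the search-and-replace exercise: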
Sumulat si Ana ng liham.  
Kumain si Pedro ng mangga.  
Uminom si Maria ng tubig.  
Lumakad si Juan papunta sa paaralan.  
Bumili si Liza ng tinapay.  
Pumunta sila sa palengke.  
Gumising si Mark nang maaga.  
Sumayaw ang mga bata sa entablado.  
Humiga si Lola sa kama.  
Lumangoy si Carlo sa dagat.

Search for: (.*)um(.*)
Replace with: \1in\2
(Note: In some editors, use $1 instead of \1.)

if we only change the verb from actor focus to object focus, the rest of the sentence no longer fits, because the markers have to change too.
"Sumulat si Ana ng liham." should become "Sinulat ang liham ni Ana."
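
A minimal Python sketch of the same search-and-replace, using three of the sentences listed above. The function name um_to_in is made up for illustration. As in the editor, only the verb changes, so the output still has the wrong markers (si/ng).

```python
import re

# Three of the actor-focus sentences from the list above.
sentences = [
    "Sumulat si Ana ng liham.",
    "Kumain si Pedro ng mangga.",
    "Bumili si Liza ng tinapay.",
]

def um_to_in(sentence: str) -> str:
    # Same pattern as the editor search: (.*)um(.*) replaced with \1in\2.
    # The first group is greedy, so the LAST "um" in the line is the one replaced.
    return re.sub(r"(.*)um(.*)", r"\1in\2", sentence, count=1)

for s in sentences:
    print(um_to_in(s))
```

Because the match is case-sensitive here, a sentence starting with "Uminom" would be left unchanged unless re.IGNORECASE is passed.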

Challenges when dealing with Philippine Languages

  • AI bias: English-to-Filipino translation assigns a gender based on the model's training data
    • isa siyang doktor → he is a doctor
    • isa siyang nars → she is a nurse
  • variations of words exist: you can spell a particular word multiple ways

Filipino Orthography

Abecedario: 32 letters

  • a, b, c, d, e, f, g, h, i, j, k, l, m, n, ng, ñ, ng with a tilde (n͠g),
  • o, p, q, r, s, t, u, v, w, x, y, z
  • ch, ll, rr

1940: 20 letters

  • a, b, k, d, e, g, h, i, l, m, n, ng, o, p, r, s, t, u, w, y

1987: 8 additional letters

  • c, f, j, ñ, q, v, x, z

Activity & KWF

KWF - online dictionary https://kwfdiksiyonaryo.ph/
K[ou]mperensiya

  • komperensiya - majority
  • kumperensiya - on KWF

K[ou]mpanya

  • kompanya - majority, in KWF
  • kumpanya

K[ou]mpleto

  • kompleto - in KWF
  • kumpleto - majority
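
One way to check which spelling is the "majority" is to count both forms in a corpus. The sketch below assumes a tiny made-up corpus string; count_variants is a hypothetical helper, not something from the lecture.

```python
import re
from collections import Counter

# Hypothetical mini-corpus; in practice this would be text from a large corpus.
corpus = "Kumpleto na ang kompanya. Kompleto ang kumperensiya at kumpleto ang ulat."

def count_variants(text: str, pattern: str) -> Counter:
    """Count each lowercase form matched by a whole-word regex such as k[ou]mpleto."""
    regex = re.compile(rf"\b{pattern}\b", re.IGNORECASE)
    return Counter(m.group(0).lower() for m in regex.finditer(text))

print(count_variants(corpus, "k[ou]mpleto"))
print(count_variants(corpus, "k[ou]mpanya"))
```

The more frequent form is the majority spelling in that corpus, which may or may not be the form listed in the KWF dictionary.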

_attachments/Pasted image 20250916183652.png

  • diacritic marks: accent marks on words (accented words)

Why is spelling important

  • it affects accuracy
    • Kuwentuhan mo ako - tell me a story
    • Kwentuhan mo ako - talk to me
  • part-of-speech tagging in German (Scheible et al. 2011)
  • intent classification, slot-filling
    _attachments/Pasted image 20250916184014.png

Existing Works

_attachments/Pasted image 20250916184145.png

  • there is an issue with precision (for word variants such as d vs r, extraction won't be simple)
  • not all matches are spelling variants: simply extracting [ou] words or word pairs will not yield only true spelling variants

_attachments/Pasted image 20250916184334.png

check Extracting Filipino Spelling Variants in Google Drive
_attachments/Pasted image 20250916184431.png
Example:
rituwal vs. ritwal
CuwV vs. CwV (C = consonant, V = vowel)
[b-df-hj-np-tv-z]uw[aeiou] vs. [b-df-hj-np-tv-z]w[aeiou]
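
A sketch of how the CuwV vs. CwV pattern could be used to pull candidate pairs from a word list. The word list and the function name find_uw_pairs are made up for illustration. A pair is reported only when both spellings actually occur, which filters out words like buwan that have no CwV counterpart.

```python
import re

CONS = "[b-df-hj-np-tv-z]"                      # consonant class from the notes
CUWV = re.compile(rf"({CONS})uw([aeiou])")      # consonant + "uw" + vowel

def find_uw_pairs(vocabulary):
    """Return (CuwV form, CwV form) pairs where both spellings are in the vocabulary."""
    vocab = {w.lower() for w in vocabulary}
    pairs = []
    for word in sorted(vocab):
        if CUWV.search(word):
            candidate = CUWV.sub(r"\1w\2", word)  # drop the "u": CuwV -> CwV
            if candidate in vocab:
                pairs.append((word, candidate))
    return pairs

# Hypothetical word list.
words = ["rituwal", "ritwal", "kuwento", "kwento", "buwan", "tuwa"]
print(find_uw_pairs(words))
```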

Tagalog Wikipedia

  • Wikipedia is available in different Philippine languages
  • download the entire Tagalog Wikipedia
  • August 20, 2024 snapshot
  • about 11 million words recorded
    _attachments/Pasted image 20250916184544.png

Wiki article: https://en.wikipedia.org/wiki/Tagalog_Wikipedia
TL Wiki dump: https://dumps.wikimedia.org/tlwiki/20240820/tlwiki-20240820-pages-articles.xml.bz2
Extractor: https://github.com/apertium/WikiExtractor
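
Once the dump has been turned into plain text, a word-frequency table is a natural first step before looking for variants. The sketch below assumes the text is already available as lines of plain text; the sample lines, the token pattern, and the name count_words are assumptions for illustration and are not part of the extractor.

```python
import re
from collections import Counter

# Letters (including ñ) with optional internal hyphens, e.g. halo-halo.
TOKEN = re.compile(r"[a-zñ]+(?:-[a-zñ]+)*", re.IGNORECASE)

def count_words(lines):
    """Build a lowercase word-frequency table from an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(tok.lower() for tok in TOKEN.findall(line))
    return counts

# Hypothetical sample standing in for extracted Wikipedia text.
sample = ["Kumain si Pedro ng mangga.", "Kumain din si Ana ng halo-halo."]
freq = count_words(sample)
print(freq.most_common(3))
```

For real data, any iterable of lines works, for example a file object opened with encoding="utf-8".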

_attachments/Pasted image 20250916184829.png

Patterns that give us spelling variants

  • and we're sure that there are
    _attachments/Pasted image 20250916184919.png
    _attachments/Pasted image 20250916184940.png
  • succeeding V (vowel) and succeeding C (consonant)
    _attachments/Pasted image 20250916185002.png

the rest are on the slides

_attachments/Pasted image 20250916185017.png

  • we need tools that adhere to KWF guidelines and balance …

_attachments/Pasted image 20250916185125.png

LanguageTool

  • next week: Language Identification and Web Crawling
  • check the exploration activities

_attachments/Pasted image 20250916185633.png

  • open LanguageToolGUI.jar

  • LT Complete Package > dist > resource and rules folder

  • resource > contains language codes (you can search these in Ethnologue and Wikipedia)

  • rules > tl > grammar.xml > open in Notepad++ (npp) or any text editor

rules can be represented by...

  • tokens
  • regex <token regexp="yes">
  • POS tags (the postag attribute)

Cleaned grammar.xml (on Canvas)

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="../print.xsl" ?>
<?xml-stylesheet type="text/css" href="../rules.css" 
title="Easy editing stylesheet" ?>
<!--
Tagalog Grammar and Typo Rules for LanguageTool
See tagset.txt for the different POS, Lexical Categories, and corresponding attributes
Copyright (C) 2011 Nathaniel Oco and Allan Borra (http://www.dlsu.edu.ph/research/centers/adric/nlp/)
$Id: grammar.xml,v 1.129 2010/11/13 23:24:21 dnaber Exp $
-->
<rules lang="tl" xsi:noNamespaceSchemaLocation="../rules.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema">
	<category name="Sound and Letter Change">
		<rule id="D" name="d (r)">
			<pattern case_sensitive="no" mark_from="1">
				<token>siya</token>
				<token>din</token>
			</pattern>
			<message>
				<suggestion>rin</suggestion>
			</message>
<!--
			<pattern case_sensitive="no" mark_from="1">
				<token regexp="yes">.*[aeiou]</token>
				<token regexp="yes">ding?|dito|ditong|daw|diyang?|dyang?|doong?</token>
			</pattern>

			<message>
				<suggestion>
					<match no="2" case_conversion="startlower" regexp_match=".(.*)" regexp_replace="r$1"/>
				</suggestion>
			</message>
-->
			<short>Sound and Letter Change</short>
			<example correction="rin" type="incorrect">Ka <marker>din</marker></example>
			<example type="correct">Ka <marker>rin</marker></example>
		</rule>
	</category>
<!--
	<category name="Style Rules">
		<rule id="A AY B" name="a ay b (b a)">
			<pattern case_sensitive="no" mark_from="0">
				<token>ang</token>
				<token>aso</token>
				<token>ay</token>
				<token postag="V.*" postag_regexp="yes"/>
			</pattern>
			<message>
				<suggestion>
					\4 \1 \2
				</suggestion>
			</message>
			<short>AY Removal</short>
			<example correction="kumakain ang aso" type="incorrect"><marker>ang aso ay kumakain</marker></example>
			<example type="correct"><marker>kumakain siya</marker></example>
		</rule>
	</category>
-->

</rules>
  • the entire grammar.xml is enclosed in a rules tag
  • inside rules, there are categories (spelling, grammar, style)
  • inside each category, there are rules
  • inside each rule, there is a pattern and a message (with suggestions)
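
The nesting described above (rules > category > rule > pattern/message) can be checked with Python's xml.etree.ElementTree. The XML string below is a trimmed stand-in for grammar.xml with the same nesting, and summarize is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for grammar.xml (same nesting, fewer attributes).
SAMPLE = """<rules lang="tl">
  <category name="Sound and Letter Change">
    <rule id="D" name="d (r)">
      <pattern case_sensitive="no" mark_from="1">
        <token>siya</token>
        <token>din</token>
      </pattern>
      <message><suggestion>rin</suggestion></message>
    </rule>
  </category>
</rules>"""

def summarize(xml_text):
    """Return (category, rule id, tokens, suggestions) for every rule."""
    root = ET.fromstring(xml_text)
    rows = []
    for category in root.findall("category"):
        for rule in category.findall("rule"):
            tokens = [t.text for t in rule.findall("pattern/token")]
            suggestions = [s.text for s in rule.findall("message/suggestion")]
            rows.append((category.get("name"), rule.get("id"), tokens, suggestions))
    return rows

print(summarize(SAMPLE))
```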

rename the old grammar.xml to something else (as a backup) and paste the new grammar.xml in its place

_attachments/Pasted image 20250916191402.png

  • one way to check that it still works: test the sentence "Siya daw raw po."
		<rule id="D" name="d (r)">
			<pattern case_sensitive="no" mark_from="1">
				<token>siya</token>
				<token>daw</token> <!--din-->
			</pattern>
			<message>
				<suggestion>raw</suggestion> <!--rin-->
			</message>
  • Change line 16 of grammar.xml to daw
  • Change line 19 of grammar.xml to raw
  • Run LanguageToolGUI.jar
  • "Siya din raw po" won't be flagged because the edited rule only matches siya followed by daw
  • we don't have to declare every token (siya, sila, etc.); we can use a regex instead
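
To see why a regex covers more cases than listing tokens one by one, here is a Python imitation of the commented-out pattern in grammar.xml (first token ends in a vowel, second token is a d-word). This is not LanguageTool code; it assumes that a token regex has to match the whole token, and suggest_d_to_r is a made-up name.

```python
import re

# Same two regexes as the commented-out pattern in grammar.xml.
FIRST = re.compile(r".*[aeiou]", re.IGNORECASE)
SECOND = re.compile(r"ding?|dito|ditong|daw|diyang?|dyang?|doong?", re.IGNORECASE)

def suggest_d_to_r(sentence):
    """Return (flagged word, suggestion) pairs for d-words that follow a vowel-final word."""
    tokens = re.findall(r"\w+", sentence)
    hits = []
    for prev, word in zip(tokens, tokens[1:]):
        # Assumption: each regex must match the entire token.
        if FIRST.fullmatch(prev) and SECOND.fullmatch(word):
            hits.append((word, "r" + word[1:].lower()))
    return hits

print(suggest_d_to_r("Siya din po."))
print(suggest_d_to_r("Sila daw ang pupunta."))
print(suggest_d_to_r("Bukas din."))
```

With this pattern, siya, sila, and any other vowel-final word are handled by one rule instead of one rule per word.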

if something breaks, restore grammar.xml to the default
_attachments/Pasted image 20250916192329.png