NLP1000_natural_language_processing

#public/classnotes

RAW FILE

This note has not been edited yet. Content may be subject to change.

Announcements

no face-to-face classes on Sept 12, 19, 26
makeup sessions in October

sir's mic sucks in this recording

Assignment Tab

activity 1 - introductions ✅
activity 2 - audio analysis ✅
activity 3 - regex ✅
MCO1: October 4, 11:30 a.m.

Natural Language Processing

the study of language

phonology <-- we are here
morphology and character level
syntax
semantics

What can be done with speech data?

there are conferences on speech data and databases
you can also sell your speech data
- Commercial Speech Corpora
  - Linguistic Data Consortium
  - Data Ocean
  - Defined.ai
  - IEEE Data Port
from the O-COCOSDA Country Report: https://ieeexplore.ieee.org/document/10800439/

If NLP has the ACL Anthology, what does speech have?

ISCA (ISCA Archive)

Why is speech important?

the direction is moving towards multimodal artificial intelligence

the study of language

phonology
morphology and character level ← moving here
syntax
semantics

Regular Expressions

symbols that you can use to find and replace text
in STALGCM: * and union
Ex. ^ . * + - $ () |
will be used heavily in the MCO1 work

download regex.txt and sample text from canvas

regex.txt is the cheatsheet for regex

The
Tagalog
Wikipedia
(Tagalog:
Wikipediang
Tagalog)
is
the
Tagalog
language
edition
of
Wikipedia,
which
was
launched
on
December
1,
2003.
It
has
75,983
articles
and
is
the
74th
largest
Wikipedia
according
to
the
number
of
articles
as
of
November
18,
2019.[1]
This
is
significantly
fewer
articles
than
the
Cebuano
Wikipedia,
the
largest
Philippine-based
language
version
of
Wikipedia,
which
currently
has
more
than
5,379,000
articles,
and
the
Waray
Wikipedia,
which
has
more
than
1,264,000
articles.[2][3]

However,
the
Tagalog
Wikipedia
has
an
article
depth
of
31.99,
compared
to
3.52
for
the
Waray
Wikipedia
and
2
for
the
Cebuano
Wikipedia,
as
of
November
18,
2019.[4]
Mr apple
Ms. apple
Mr. apple
aaaaaaaaab
aaaaaaaab
ab
bbbbbbbb
b

Why do we need to study Regex?

it's the "duct tape of text processing"
used in pattern-matching and data cleaning
there are some people who are paid 6 digits per month just cleaning data

Preparations

text editor with regular expression support (notepad++, notepadpp, sublime, vscode)
regex tester (https://regex101.com, https://rubular.com)
- these are also online text editors
sample text file (from canvas)
paste the sample text in the text editor

RegEx Symbols

is it all instances or all characters?
is it at the start of the word or at the start of the line?

note: as a convention, read the instructions to see whether you need to put ^ and $ (sir will indicate whether you have to include the caret and dollar sign)

caret ^: ^a highlights an "a" at the start of the line
open/close bracket []: ^[aeiou] highlights any one of "a", "e", "i", "o", "u" at the start of the line
- the open/close bracket is used to look up any one of several characters
dollar sign $: [aeiou]$ highlights all characters "a", "e", "i", "o", "u" at the end of the line
- if caret matches the start of the line, dollar sign matches the end of the line
- ion$ for certain affixes (like those that end with ion)
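
The start and end anchors can be tried outside the text editor as well. This is a minimal sketch in Python's `re` module, not the tool used in class; the `sample` string is a made-up stand-in for the one-word-per-line sample text, and `re.MULTILINE` is the assumption that makes ^ and $ work per line instead of per whole string.

```python
import re

# Hypothetical stand-in for the one-word-per-line sample text from canvas.
sample = "apple\nTagalog\nedition\nthe\nof\nunion"

# Without MULTILINE, ^ only matches at the very start of the whole string.
first_only = re.findall(r"^[aeiou]", sample)

# With MULTILINE, ^ and $ match at the start and end of every line.
vowel_starts = re.findall(r"^[aeiou]", sample, flags=re.MULTILINE)
ends_in_vowel = re.findall(r"^.*[aeiou]$", sample, flags=re.MULTILINE)
ends_in_ion = re.findall(r"^.*ion$", sample, flags=re.MULTILINE)

print(first_only)     # only the "a" of the first line
print(vowel_starts)   # one vowel per line that starts with a vowel
print(ends_in_vowel)  # whole lines that end in a vowel
print(ends_in_ion)    # whole lines that end in "ion"
```

This is also one way to answer the "start of the word or start of the line" question: the anchors are about lines, and they only look like word anchors because the sample has one word per line.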

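dash symbol -: ^[a-m] allows you to specify a range (in this case, a range from a to m)
- the range follows ASCII order, so ^[0-a] covers everything from "0" to "a" in the ASCII table (the digits, some punctuation, all capital letters, and "a")
- depending on your case-sensitivity setting, a range like [a-m] may or may not also match capital letters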
dot symbol .: matches any single character, so the number of dots sets how many characters are highlighted
- ^...$ selects all 3 character words
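
A quick way to check the range and dot behavior is a short Python sketch. The `words` list is an illustrative sample, not the canvas file, and `re.IGNORECASE` stands in for the editor's case-sensitivity setting.

```python
import re

# Illustrative words only; each item plays the role of one line.
words = ["The", "apple", "is", "of", "the", "Mr.", "2003.", "zebra"]

# Three dots between ^ and $ means exactly three characters on the line.
three_chars = [w for w in words if re.search(r"^...$", w)]

# Case-sensitive range: only lowercase a to m at the start.
a_to_m = [w for w in words if re.search(r"^[a-m]", w)]

# Same range with case-insensitive matching turned on.
a_to_m_any_case = [w for w in words if re.search(r"^[a-m]", w, flags=re.IGNORECASE)]

# ASCII range from "0" (code 48) to "a" (code 97): digits, capitals, and "a".
zero_to_a = [w for w in words if re.search(r"^[0-a]", w)]

print(three_chars)
print(a_to_m)
print(a_to_m_any_case)
print(zero_to_a)
```

Note that the dot also counts punctuation, which is why "Mr." is treated as a 3 character word here.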

parentheses (): allows you to group
- ^(...)$ selects all three characters (as one group)
- ^(...)(...)$ looks for 6 characters but groups the 1st and 2nd sets of three (different colors); the grouping becomes evident when we do a replace
- try replacing (...)(...) with $2$1, which swaps the places of the first group and the second group
- the enter key is \r\n; replacing that with a space removes all the line breaks
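
The group swap and the line-break replacement can be sketched in Python too. One difference to keep in mind: the text editors write the replacement as $2$1, while Python's `re.sub` writes group references as `\2\1`. The input strings below are made up for illustration.

```python
import re

# Made-up lines; only the 6 character lines can match ^(...)(...)$.
text = "number\nof\nlaunch"

# Swap the first and second group of three characters on each 6 character line.
swapped = re.sub(r"^(...)(...)$", r"\2\1", text, flags=re.MULTILINE)

# Windows-style line breaks (\r\n) replaced with a single space.
windows_text = "number\r\nof\r\narticles"
one_line = re.sub(r"\r\n", " ", windows_text)

print(swapped)
print(one_line)
```

The line "of" is left alone because it does not have six characters, so the pattern never matches it.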

pipe symbol |: "or"
- ^(a|e|i|o|u) matches a single a, e, i, o, or u at the start of the line
- difference from ^[aeiou]: [aeiou] is a character class that matches any one character from the set, while (a|e|i|o|u) is a group of alternatives; for single characters both match the same text, but the pipe also works with options longer than one character
kleene plus symbol +
- ^a+b: match one or more a followed by a b
- kleene star is 0 or more instances, while kleene plus is 1 or more instances
question mark ?: matches the previous token 0 or 1 times, i.e. the previous character is optional
- M...? +.* (note the space in between ? and +)
  - will it match a line or a substring of a line?
  - dot means any character, star means 0 or more instances = anything
  - if you have a plus = 1 or more instances of the previous token (here, the space)
  - M..? matches Mr, Mrs, Mr. but not all of Mrs. because that's 4 characters already
- ^a?b
  - ^ anchors to the start (of the line if multiline, otherwise the whole string).
  - a? makes “a” optional (0 or 1).
  - b must follow.
  - Matches: strings/lines that start with b or ab (but not aab, xb, cab, etc.).
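
The quantifiers above can be compared side by side on the a/b lines from the sample text. This is a small Python sketch; the `titles` list is added for illustration, and `re.fullmatch` is used there so that the whole title has to fit the pattern.

```python
import re

# The a/b test lines from the sample text.
lines = ["aaaaaaaaab", "aaaaaaaab", "ab", "bbbbbbbb", "b"]

plus = [s for s in lines if re.search(r"^a+b", s)]      # 1 or more "a", then "b"
star = [s for s in lines if re.search(r"^a*b", s)]      # 0 or more "a", then "b"
optional = [s for s in lines if re.search(r"^a?b", s)]  # 0 or 1 "a", then "b"

# Illustrative titles; M..? means "M", one character, and an optional second one.
titles = ["Mr", "Mrs", "Mr.", "Ms.", "Mrs."]
short_titles = [t for t in titles if re.fullmatch(r"M..?", t)]

# For single characters, alternation and a character class agree.
words = ["apple", "is", "the", "of"]
same = all(
    bool(re.search(r"^(a|e|i|o|u)", w)) == bool(re.search(r"^[aeiou]", w))
    for w in words
)

print(plus, star, optional, short_titles, same)
```

With `re.search` instead of `re.fullmatch`, M..? would still find "Mrs" inside "Mrs."; it just cannot cover the final period.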

Advanced RegEx Activity with RegEx Look Ahead and Look Behind

For each pattern, what did it match in your test text?
Based on the matches, what do you think the RegEx is doing?

a(?=b) Positive Lookahead: matches an "a" only if a "b" comes directly after it (the "b" is checked but not included in the match)
a(?=p) Positive Lookahead: matches an "a" only if a "p" comes directly after it
a(?!b) Negative Lookahead: matches an "a" only if a "b" does not come directly after it
(?<=a)b Positive Lookbehind: matches a "b" only if an "a" comes directly before it
(?<!a)b Negative Lookbehind: matches a "b" only if an "a" does not come directly before it
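
To see where each lookaround matches, the five patterns can be run over one short string. This is a minimal Python sketch; the `text` value is invented for the demo, and the lists hold the index where each match starts.

```python
import re

# Invented test string; indexes: a=0 b=1, a=3 p=4, a=6 c=7, b=9.
text = "ab ap ac b"

def starts(pattern):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

lookahead_b = starts(r"a(?=b)")       # "a" directly followed by "b"
lookahead_p = starts(r"a(?=p)")       # "a" directly followed by "p"
neg_lookahead = starts(r"a(?!b)")     # "a" not directly followed by "b"
lookbehind = starts(r"(?<=a)b")       # "b" directly preceded by "a"
neg_lookbehind = starts(r"(?<!a)b")   # "b" not directly preceded by "a"

# The lookahead only checks the next character; the match itself is just "a".
matched_text = re.findall(r"a(?=b)", text)

print(lookahead_b, lookahead_p, neg_lookahead, lookbehind, neg_lookbehind)
print(matched_text)
```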

Exploration: LanguageTool

(58:14)
_attachments/Pasted image 20250916215151.png

MCO discussion here

(59:41)
