NLP1000_natural_language_processing

RAW FILE

This note has not been edited yet. Content may be subject to change.

Announcements

  • no f2f sept 12, 19, 26
  • makeup sessions in October

sir's mic sucks in this recording

Assignment Tab

  • activity 1 - introductions ✅
  • activity 2 - audio analysis ✅
  • activity 3 - regex ✅
  • MCO1: October 4, 11:30 a.m.

Natural Language Processing

the study of language

  • phonology <-- we are here
  • morphology and character level
  • syntax
  • semantics

What can be done with speech data?

if NLP has the ACL Anthology, what is the equivalent for speech?

  • ISCA (ISCA Archive)

why is speech important?

  • the direction is moving towards multimodal artificial intelligence

the study of language

  • phonology
  • morphology and character level ← moving here
  • syntax
  • semantics

Regular Expressions

  • symbols that you can use to find and replace text
  • in STALGCM: * and union
  • Ex. ^ . * + - $ () |
  • will be used heavily in the MCO1 work

download regex.txt and sample text from canvas

  • regex.txt is the cheatsheet for regex
The
Tagalog
Wikipedia
(Tagalog:
Wikipediang
Tagalog)
is
the
Tagalog
language
edition
of
Wikipedia,
which
was
launched
on
December
1,
2003.
It
has
75,983
articles
and
is
the
74th
largest
Wikipedia
according
to
the
number
of
articles
as
of
November
18,
2019.[1]
This
is
significantly
fewer
articles
than
the
Cebuano
Wikipedia,
the
largest
Philippine-based
language
version
of
Wikipedia,
which
currently
has
more
than
5,379,000
articles,
and
the
Waray
Wikipedia,
which
has
more
than
1,264,000
articles.[2][3]

However,
the
Tagalog
Wikipedia
has
an
article
depth
of
31.99,
compared
to
3.52
for
the
Waray
Wikipedia
and
2
for
the
Cebuano
Wikipedia,
as
of
November
18,
2019.[4]
Mr apple
Ms. apple
Mr. apple
aaaaaaaaab
aaaaaaaab
ab
bbbbbbbb
b

Why do we need to study Regex?

  • it's the "duct tape of text processing"
  • used in pattern-matching and data cleaning
  • there are some people who are paid 6 digits per month just cleaning data

Preparations

  • a text editor with regular expression support (Notepad++, Sublime Text, VS Code)
  • regex tester (https://regex101.com, https://rubular.com)
    • these are also online text editors
  • sample text file (from canvas)
  • paste the sample text in the text editor

RegEx Symbols

  1. is it all instances or all characters?
  2. is it at the start of the word or at the start of the line?

note: as a convention, read the instructions to see whether you need to include ^ and $ (sir will indicate whether you have to include the caret and dollar sign)

  • caret ^a: highlights an "a" at the start of the line
  • open/close bracket []: ^[aeiou] highlights any single "a", "e", "i", "o", or "u" at the start of the line
    • the brackets are used to look up any one of several characters
  • dollar sign `$`: `[aeiou]$`: highlights all characters "a", "e", "i", "o", "u" at the end of the line
    • if caret matches the start of the line, dollar sign matches the end of the line
    • `ion$` for certain affixes (like those that end with ion)
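
The anchors and brackets above can be tried outside a text editor as well. This is a minimal sketch using Python's `re` module, assuming a small made-up sample with one word per line (not the actual Canvas file); `re.MULTILINE` is what makes `^` and `$` work per line instead of per string.

```python
import re

# Hypothetical sample: one word per line, similar in shape to the class sample text.
sample = "edition\nof\nWikipedia\narticles\nis\nthe"

# ^[aeiou] : a vowel at the start of a line
starts_with_vowel = re.findall(r"^[aeiou]", sample, flags=re.MULTILINE)

# [aeiou]$ : a vowel at the end of a line
ends_with_vowel = re.findall(r"[aeiou]$", sample, flags=re.MULTILINE)

# ion$ : lines that end with the suffix "ion"
ion_lines = [line for line in sample.splitlines() if re.search(r"ion$", line)]

print(starts_with_vowel)  # ['e', 'o', 'a', 'i']
print(ends_with_vowel)    # ['a', 'e']
print(ion_lines)          # ['edition']
```

Note that only the single matched character is returned for the bracket patterns, which is the same thing the editor highlights.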

  • dash symbol -: ^[a-m] allows you to specify a range (in this case, a range from a to m)
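    • the range follows ASCII order, so if you put ^[0-a] it covers everything from "0" to "a": the digits, the uppercase letters, and the punctuation in between
    • depending on your case-sensitivity setting, the range may or may not also match capital letters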
  • dot symbol .: matches any single character, so repeating it sets the number of characters to be highlighted
    • `^...$` selects all 3 character words (lines that are exactly three characters long)
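
A small sketch of the range and dot behavior, again in Python and with a made-up word list (an assumption, not the class file). Each word is tested on its own, so `^` and `$` refer to the start and end of that word.

```python
import re

# Hypothetical word list for illustration.
words = ["is", "the", "of", "and", "Wikipedia", "2003."]

# ^[a-m] : first character is between "a" and "m" (lowercase only by default)
a_to_m = [w for w in words if re.search(r"^[a-m]", w)]

# ^...$ : exactly three characters between the start and the end
three_chars = [w for w in words if re.search(r"^...$", w)]

# [0-a] is a range over character codes from "0" (48) to "a" (97),
# so it also accepts uppercase letters and some punctuation.
in_zero_to_a = [c for c in "0Z_ab" if re.fullmatch(r"[0-a]", c)]

print(a_to_m)        # ['is', 'and']
print(three_chars)   # ['the', 'and']
print(in_zero_to_a)  # ['0', 'Z', '_', 'a']
```

Passing `flags=re.IGNORECASE` would make `^[a-m]` accept capital letters too, which mirrors the case-sensitivity setting in the editor.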

  • parentheses (): allows you to group
    • `^(...)$` selects all three-character lines, captured as one group
    • `^(...)(...)$` looks for 6 characters but groups the 1st and 2nd sets of three (shown in different colors); the grouping becomes evident when we do a replace
    • try replacing (...)(...) with $2$1, which swaps the places of the first group and the second group
    • the enter key is \r\n; replacing that with a space removes all line breaks
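
The group swap and the line-break replacement can be sketched in Python too. One difference to keep in mind: the editors and regex101 write backreferences as `$2$1`, while Python's `re.sub` writes them as `\2\1`. The input strings are made up, and `\r?\n` is an assumption added here so the pattern handles both Windows (`\r\n`) and Unix (`\n`) line endings.

```python
import re

# ^(...)(...)$ : six characters, captured as two groups of three.
# \2\1 in the replacement puts the second group before the first.
swapped = re.sub(r"^(...)(...)$", r"\2\1", "abcdef")

# Replace every line break with a space to join the lines into one.
joined = re.sub(r"\r?\n", " ", "The\r\nTagalog\r\nWikipedia")

print(swapped)  # defabc
print(joined)   # The Tagalog Wikipedia
```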

  • pipe symbol |: "or"
    • ^(a|e|i|o|u) finds a single a/e/i/o/u at the start of the line
    • what's the difference with ^[aeiou]: idk what sir said, but gpt says [aeiou] is a character class that matches any one character from the set; for single characters the two match the same thing, but the pipe can also choose between longer alternatives
  • kleene plus symbol +
    • ^a+b: matches one or more a followed by a b, at the start of the line
    • kleene star is 0 or more instances, while kleene plus is 1 or more instances
  • question mark ?: matches the previous token 0 or 1 times, so the previous character is optional
    • M..? +.* (note the space in between ? and +)
      • will it match a line or a substring of a line?
      • dot means any character, star means 0 or more instances, so .* = anything
      • plus = 1 or more instances of the previous token (here, the space)
      • M..? matches Mr, Mrs, Mr. but not Mrs. because that's 4 characters already
    • ^a?b
      • ^ anchors to the start (of the line if multiline, otherwise the whole string).
      • a? makes “a” optional (0 or 1).
      • b must follow.
      • Matches: strings/lines that start with b or ab (but not aab, xb, cab, etc.).
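
To compare the quantifiers side by side, here is a rough Python sketch run against the last lines of the sample text (the "Mr apple" and "aaaab" lines). It uses the two-dot form `M..? +.*` for the title pattern; each line is tested separately, so `^` means the start of that line.

```python
import re

# The title and a/b lines from the end of the sample text.
lines = ["Mr apple", "Ms. apple", "Mr. apple",
         "aaaaaaaaab", "aaaaaaaab", "ab", "bbbbbbbb", "b"]

def matching(pattern):
    """Return the lines where the pattern matches somewhere in the line."""
    return [line for line in lines if re.search(pattern, line)]

plus = matching(r"^a+b")         # one or more "a", then "b"
star = matching(r"^a*b")         # zero or more "a", then "b"
optional = matching(r"^a?b")     # at most one "a", then "b"
titles = matching(r"M..? +.*")   # "M", 1 or 2 characters, 1+ spaces, then anything

print(plus)      # ['aaaaaaaaab', 'aaaaaaaab', 'ab']
print(star)      # ['aaaaaaaaab', 'aaaaaaaab', 'ab', 'bbbbbbbb', 'b']
print(optional)  # ['ab', 'bbbbbbbb', 'b']
print(titles)    # ['Mr apple', 'Ms. apple', 'Mr. apple']
```

A line such as "Mrs. apple" (not in the sample) would not match `M..? +.*`, because "rs." is three characters between the M and the space.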

Advanced RegEx Activity with RegEx Look Ahead and Look Behind

  1. For each pattern, what did it match in your test text?
  2. Based on the matches, what do you think the RegEx is doing?
  • a(?=b) Positive Lookahead: asserts that the regex inside the parentheses matches; highlights an "a" that is immediately followed by a "b" (the "b" itself is not highlighted)
  • a(?=p) Positive Lookahead: highlights an "a" that is immediately followed by a "p"
  • a(?!b) Negative Lookahead: asserts that the regex inside the parentheses does not match; highlights an "a" that is not immediately followed by a "b"
  • (?<=a)b Positive Lookbehind: highlights a "b" that is immediately preceded by an "a"
  • (?<!a)b Negative Lookbehind: highlights a "b" that is not immediately preceded by an "a"
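
One way to check the answers to the two activity questions is to print where each pattern matches. This is a sketch on a made-up test string (an assumption, not the class test text); the positions are zero-based character offsets.

```python
import re

text = "abc apple cab b"

def positions(pattern):
    """Return the start offset of every match of the pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

ahead_b = positions(r"a(?=b)")       # "a" immediately followed by "b"
ahead_p = positions(r"a(?=p)")       # "a" immediately followed by "p"
not_ahead_b = positions(r"a(?!b)")   # "a" not immediately followed by "b"
behind_a = positions(r"(?<=a)b")     # "b" immediately preceded by "a"
not_behind_a = positions(r"(?<!a)b") # "b" not immediately preceded by "a"

print(ahead_b)       # [0, 11]
print(ahead_p)       # [4]
print(not_ahead_b)   # [4]
print(behind_a)      # [1, 12]
print(not_behind_a)  # [14]

# The lookaround itself consumes nothing: only the "a" is part of the match.
print(re.search(r"a(?=b)", text).group())  # a
```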

Exploration: LanguageTool

(58:14)
_attachments/Pasted image 20250916215151.png

MCO discussion here

(59:41)