NLP1000_trigram_applications_scrapestorm_nutch

RAW FILE

This note has not been edited yet. Content may be subject to change.

Download and install JCreator, and download the Nutch folder:
GDrive → Resources → LanguageTool → Nutch
Download ScrapeStorm and register an account: https://www.scrapestorm.com/

  • a semicolon is not a sentence boundary marker (see the splitter sketch below)
  • you can use a Jupyter notebook to document the entire process, as long as there is a file that shows the steps involved
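
As a quick illustration of the boundary rule above, here is a minimal, hypothetical splitter that treats only ., !, and ? as sentence boundaries (the regex and the example sentence are my own, not from the class materials):

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive splitter: only ., !, ? end a sentence; semicolons do not."""
    # Split after sentence-final punctuation followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = "The corpus was scraped yesterday; cleaning comes next. Trigrams are built after that."
print(split_sentences(text))
# -> ['The corpus was scraped yesterday; cleaning comes next.',
#     'Trigrams are built after that.']
```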

_attachments/Pasted image 20250926181157.png

  • we shifted online for 2 sessions, so we only need to attend one class

Review: Trigram Applications

LanguageTool: LID (language identification)

  • Apache Nutch

  • Apache Tika

  • Exercise #7: Ilocano

    • can be done at the word level, character n-gram level, or with subword tokens
  • Trigrams can be used to build a similarity matrix and for clustering (from the n-grams or from the similarity matrix); see the sketch below
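
A rough sketch of that idea, using character trigram counts and cosine similarity to build the pairwise matrix that clustering would then run on (the toy snippets and helper functions are my own assumptions, not the class exercise code):

```python
from collections import Counter
from math import sqrt

def char_trigrams(text: str) -> Counter:
    """Count character trigrams, with padding spaces at the edges."""
    text = " " + text.lower().strip() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy snippets standing in for documents in different languages.
docs = {
    "doc_a": "magandang umaga sa inyong lahat",
    "doc_b": "magandang hapon po sa lahat",
    "doc_c": "good morning to all of you",
}
profiles = {name: char_trigrams(text) for name, text in docs.items()}

# Pairwise similarity matrix; hierarchical clustering (or any other method)
# can then be run on this matrix or on the raw trigram counts.
names = list(profiles)
for x in names:
    print(x, [round(cosine(profiles[x], profiles[y]), 2) for y in names])
```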

_attachments/Pasted image 20250926181421.png

  • Project 3: use low-resource languages (languages for which the resources are not there)
  • we don't have the resources for Tagalog, so we use the resources of a closely related language, Bikol, so that we'll be able to translate to Waray-Waray
  • Project 1 is related to Project 2, and Project 2 to Project 3; the paper can be about any of these projects

Getting data online with ScrapeStorm

  • use Flowchart Mode

https://www.inquirer.net/search/?q=migration#gsc.tab=0&gsc.q=migration&gsc.page=1

_attachments/Pasted image 20250926182303.png

  • select the first link
  • operation tips > click all elements in turn
    • click each one and open the page (it will go through all 10 pages, but you have to specify how to go from page 1 to page 2 to page 3, and so on)
  • operation tips > extract data from all webpages by automatic paging? > no, extract only the current webpage (for now)
  • hover the mouse over the title → click the title → extract element
    • this will create a database field

_attachments/Pasted image 20250926183013.png

  • click Start at the bottom right
  • we can't schedule the task because scheduling is for Premium and above

_attachments/Pasted image 20250926183146.png

  • the count on the left side will increase as rows are scraped

_attachments/Pasted image 20250926183254.png

if you export later, you can view the data by clicking:
_attachments/Pasted image 20250926183433.png

  • it will list the titles of the articles that you downloaded

_attachments/Pasted image 20250926183534.png

  • some rows are blank → it may have downloaded an advertisement or an article that no longer exists

_attachments/Pasted image 20250926183645.png

  • export and view the file (Excel)

Download: Title, Author, Date, Contents (actual article contents)

You can also download news articles, etc. using Calibre.
You can use Beautiful Soup in Python, but there's a chance you'll be detected as a bot → ScrapeStorm adds some randomness that mimics human behavior (see the sketch below).
The limited plan only allows exporting 10 rows.
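
For reference, a minimal requests + Beautiful Soup sketch of the same kind of scrape; the URL is the search page above, but the CSS selector, headers, and delay are my own assumptions (inspect the real page for the right selector, and note that sites may still detect or block scripts like this):

```python
import random
import time

import requests
from bs4 import BeautifulSoup

url = "https://www.inquirer.net/search/?q=migration"
headers = {"User-Agent": "Mozilla/5.0 (compatible; class-exercise)"}

response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
# "h2 a" is a placeholder selector; the real result titles may live elsewhere.
titles = [a.get_text(strip=True) for a in soup.select("h2 a") if a.get_text(strip=True)]
for title in titles:
    print(title)

# Crude imitation of the human-like randomness mentioned above:
# wait a random delay before requesting the next page.
time.sleep(random.uniform(2, 6))
```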

Data cleaning is needed because the scraped data will include advertisements.
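
A hypothetical cleaning pass over the exported spreadsheet (the file name and column names are assumptions based on the Title/Author/Date/Contents fields downloaded above):

```python
import pandas as pd

df = pd.read_excel("scrapestorm_export.xlsx")

# Drop blank rows (typically advertisements or articles that no longer exist).
df = df.dropna(subset=["Title", "Contents"])
df = df[df["Contents"].str.strip().astype(bool)]

# Trim whitespace and remove exact duplicate articles.
df["Title"] = df["Title"].str.strip()
df = df.drop_duplicates(subset=["Title", "Contents"])

df.to_excel("scrapestorm_export_clean.xlsx", index=False)
print(f"{len(df)} articles kept after cleaning")
```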

It's possible that the field you selected doesn't cover the entire author list, or that it's a different field → this is due to inconsistencies in how certain organizations structure their pages.

Sir usually has a research assistant do the scraping, but when he's on his own he uses third-party tools.


Creating n-grams with Apache Nutch & JCreator/IDE

JCreator is a Java IDE

File Structure

Nutch/
  Nutch.jcw
  Classes/
    Trigram
  Jar/
  Src/
    Nutch.java
Apache Nutch bin and src
JCreator installation file

  • open the workspace in JCreator
  • top menu > Configure > Options > JDK Profiles > set the correct path for the JDK
  • build, then run the file (data: Nutch/classes/Trigram); a trigram-counting sketch follows below
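
The class exercise builds the n-grams with the Java Trigram program above; purely as an illustration of the same idea (and not the Nutch/JCreator code), here is a word-trigram counter sketch:

```python
from collections import Counter

def word_trigrams(text: str) -> Counter:
    """Count word-level trigrams in a text."""
    tokens = text.lower().split()
    return Counter(
        (tokens[i], tokens[i + 1], tokens[i + 2])
        for i in range(len(tokens) - 2)
    )

sample = "migration is a major issue and migration is a recurring topic"
for gram, count in word_trigrams(sample).most_common(3):
    print(" ".join(gram), count)
# ('migration', 'is', 'a') appears twice in this toy sentence.
```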

Ethics

there used to be a faculty machine on the 1st floor of Gokongwei that ran 24/7, just collecting datasets using Calibre