NLP1000_trigram_applications_scrapestorm_nutch

RAW FILE

This note has not been edited yet. Content may be subject to change.

Download and install JCreator, and download the Nutch folder:
GDrive → Resources → LanguageTool → Nutch
Download ScrapeStorm and register an account: https://www.scrapestorm.com/

  • a semicolon is not a sentence boundary marker (see the splitter sketch below)
  • you can use a Jupyter notebook to document the entire process, as long as there is a file that shows the steps involved
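
As a quick illustration of the boundary rule above, here is a minimal, hypothetical splitter that treats only ., !, and ? as sentence boundaries (the regex and the example sentence are my own, not from the class materials):

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive splitter: only ., !, ? end a sentence; semicolons do not."""
    # Split after sentence-final punctuation followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = "The corpus was scraped yesterday; cleaning comes next. Trigrams are built after that."
print(split_sentences(text))
# -> ['The corpus was scraped yesterday; cleaning comes next.',
#     'Trigrams are built after that.']
```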

_attachments/Pasted image 20250926181157.png

  • we shifted online for 2 sessions, so we only need to attend one class

Review: Trigram Applications

LanguageTool: LID (language identification)

  • Apache Nutch

  • Apache Tika

  • Exercise #7: Ilocano

    • can be done at the word level, character n-gram level, or with subword tokens
  • Trigrams can be used to build a similarity matrix and for clustering (from the n-grams or from the similarity matrix); see the sketch below
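
A rough sketch of that idea, using character trigram counts and cosine similarity to build the pairwise matrix that clustering would then run on (the toy snippets and helper functions are my own assumptions, not the class exercise code):

```python
from collections import Counter
from math import sqrt

def char_trigrams(text: str) -> Counter:
    """Count character trigrams, with padding spaces at the edges."""
    text = " " + text.lower().strip() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy snippets standing in for documents in different languages.
docs = {
    "doc_a": "magandang umaga sa inyong lahat",
    "doc_b": "magandang hapon po sa lahat",
    "doc_c": "good morning to all of you",
}
profiles = {name: char_trigrams(text) for name, text in docs.items()}

# Pairwise similarity matrix; hierarchical clustering (or any other method)
# can then be run on this matrix or on the raw trigram counts.
names = list(profiles)
for x in names:
    print(x, [round(cosine(profiles[x], profiles[y]), 2) for y in names])
```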

_attachments/Pasted image 20250926181421.png

  • Project 3: use low-resource languages (languages for which the resources are not there)
  • we don't have the resources for Tagalog, so we use the resources of a closely related language, Bikol, so that we'll be able to translate to Waray-Waray
  • Project 1 is related to Project 2, and Project 2 to Project 3; the paper can be about any of these projects

Getting data online with ScrapeStorm

  • use Flowchart Mode

https://www.inquirer.net/search/?q=migration#gsc.tab=0&gsc.q=migration&gsc.page=1

_attachments/Pasted image 20250926182303.png

  • select the first link
  • operation tips > click all elements in turn
    • click each one and open the page (it will go through all 10 pages, but you have to specify how to go from page 1 to page 2 to page 3, and so on)
  • operation tips > extract data from all webpages by automatic paging? > no, extract only the current webpage (for now)
  • hover the mouse over the title → click the title → extract element
    • this will create a database field

_attachments/Pasted image 20250926183013.png

  • click Start at the bottom right
  • we can't schedule the task because scheduling is for Premium and above

_attachments/Pasted image 20250926183146.png

  • the count on the left side will increase as rows are scraped

_attachments/Pasted image 20250926183254.png

if you export later, you can view the data by clicking:
_attachments/Pasted image 20250926183433.png

  • it will list the titles of the articles that you downloaded

_attachments/Pasted image 20250926183534.png

  • some rows are blank → it may have downloaded an advertisement or an article that no longer exists

_attachments/Pasted image 20250926183645.png

  • export and view the file (Excel)

Download: Title, Author, Date, Contents (actual article contents)

You can also download news articles, etc. using Calibre.
You can use Beautiful Soup in Python, but there's a chance you'll be detected as a bot → ScrapeStorm adds some randomness that mimics human behavior (see the sketch below).
The limited plan only allows exporting 10 rows.
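
For reference, a minimal requests + Beautiful Soup sketch of the same kind of scrape; the URL is the search page above, but the CSS selector, headers, and delay are my own assumptions (inspect the real page for the right selector, and note that sites may still detect or block scripts like this):

```python
import random
import time

import requests
from bs4 import BeautifulSoup

url = "https://www.inquirer.net/search/?q=migration"
headers = {"User-Agent": "Mozilla/5.0 (compatible; class-exercise)"}

response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
# "h2 a" is a placeholder selector; the real result titles may live elsewhere.
titles = [a.get_text(strip=True) for a in soup.select("h2 a") if a.get_text(strip=True)]
for title in titles:
    print(title)

# Crude imitation of the human-like randomness mentioned above:
# wait a random delay before requesting the next page.
time.sleep(random.uniform(2, 6))
```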

Data cleaning is needed because the scraped data will include advertisements.
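
A hypothetical cleaning pass over the exported spreadsheet (the file name and column names are assumptions based on the Title/Author/Date/Contents fields downloaded above):

```python
import pandas as pd

df = pd.read_excel("scrapestorm_export.xlsx")

# Drop blank rows (typically advertisements or articles that no longer exist).
df = df.dropna(subset=["Title", "Contents"])
df = df[df["Contents"].str.strip().astype(bool)]

# Trim whitespace and remove exact duplicate articles.
df["Title"] = df["Title"].str.strip()
df = df.drop_duplicates(subset=["Title", "Contents"])

df.to_excel("scrapestorm_export_clean.xlsx", index=False)
print(f"{len(df)} articles kept after cleaning")
```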

It's possible that the field you selected doesn't cover the entire author list, or that it's a different field → this is due to inconsistencies in how certain organizations structure their pages.

Sir usually has a research assistant do the scraping, but when he's on his own he uses third-party tools.


Creating n-grams with Apache Nutch & JCreator/IDE

JCreator is a Java IDE

File Structure

Nutch/
  Nutch.jcw
  Classes/
    Trigram
  Jar/
  Src/
    Nutch.java
Apache Nutch bin and src
JCreator installation file

  • open the workspace in JCreator
  • top menu > Configure > Options > JDK Profiles > set the correct path for the JDK
  • build, then run the file (data: Nutch/classes/Trigram); a trigram-counting sketch follows below
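
The class exercise builds the n-grams with the Java Trigram program above; purely as an illustration of the same idea (and not the Nutch/JCreator code), here is a word-trigram counter sketch:

```python
from collections import Counter

def word_trigrams(text: str) -> Counter:
    """Count word-level trigrams in a text."""
    tokens = text.lower().split()
    return Counter(
        (tokens[i], tokens[i + 1], tokens[i + 2])
        for i in range(len(tokens) - 2)
    )

sample = "migration is a major issue and migration is a recurring topic"
for gram, count in word_trigrams(sample).most_common(3):
    print(" ".join(gram), count)
# ('migration', 'is', 'a') appears twice in this toy sentence.
```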

Ethics

there used to be a faculty machine on the 1st floor of Gokongwei that ran 24/7, just collecting datasets using Calibre