NLP1000_trigram_applications_scrapestorm_nutch
This note has not been edited yet. Content may be subject to change.
Download and install JCreator, and download the Nutch folder:
GDrive → Resources → LanguageTool → Nutch
Download ScrapeStorm and register an account: https://www.scrapestorm.com/
- semicolon is not a sentence boundary marker
- yes, you can use a Jupyter notebook to document the entire process, as long as there's a file that shows the steps involved

- we shifted online for 2 sessions, so we only need to attend one class
Review: Trigram Applications
LanguageTool: LID
-
Apache Nutch
-
Apache Tika
-
Exercise #7: Ilocano
- n-grams can be built at different levels: word level, character level, or subword tokens
-
Trigrams can be used to build a similarity matrix and to do clustering (either directly from the n-grams or from the similarity matrix)
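To make the similarity-matrix idea concrete, here is a minimal sketch in Python. It assumes character trigrams with `_` as word-boundary padding and cosine similarity as the measure; neither choice comes from the lecture, and the function names and sample sentences are made up for illustration.

```python
from collections import Counter
from math import sqrt

def char_trigrams(text):
    """Count overlapping character trigrams, padding each word with '_'."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def similarity_matrix(texts):
    """Pairwise cosine similarity between the trigram profiles of the texts."""
    profiles = [char_trigrams(t) for t in texts]
    return [[cosine(p, q) for q in profiles] for p in profiles]

# Toy inputs (assumption: made up for this sketch, not course data).
samples = {
    "tl_1": "magandang umaga sa inyong lahat",
    "tl_2": "magandang gabi sa inyong lahat",
    "en_1": "good morning to all of you",
}
matrix = similarity_matrix(list(samples.values()))
for name, row in zip(samples, matrix):
    print(name, [round(value, 2) for value in row])
```

The resulting matrix could then be handed to any clustering routine; which one to use is left open here.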

- Project 3: use low-resource languages (the resources are not there)
- e.g., we don't have the resources for Tagalog, so we use the resources of a closely related language (Bikol) to be able to translate to Waray-Waray
- Project 1 is related to Project 2, and Project 2 is related to Project 3; the paper can be about any of these projects
Getting data online with ScrapeStorm
- use Flowchart Mode
https://www.inquirer.net/search/?q=migration#gsc.tab=0&gsc.q=migration&gsc.page=1

- select the first link
- operation tips > click all elements in turn
- click each one and open the page (it will go through all 10 pages but you have to specify how to go from page 1 to 2 to 3 and so on)
- operation tips > extract data from all webpages by automatic paging? > no, extract only the current webpage (for now)
- hover mouse over title --> click the title --> extract element
- it will create a database field

- click start on bottom right
- we can't schedule the task because scheduling is only for Premium and above

- the count will increase on left side

if you export later, you can view the data by clicking:

- it will list the titles of the articles that you downloaded

- there are some rows that are blank --> it may have downloaded an advertisement or an article that no longer exists

- export and view the file (Excel)
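The blank rows mentioned above can be filtered out after export. This is a rough sketch only: it assumes the export was saved as CSV rather than Excel, and the column names (`Title`, `Contents`) and sample rows are made up, since the real field names depend on what you selected in ScrapeStorm.

```python
import csv
import io

def drop_blank_rows(rows, required=("Title",)):
    """Keep only rows where every required field has non-whitespace text."""
    return [
        row for row in rows
        if all((row.get(field) or "").strip() for field in required)
    ]

# Made-up stand-in for an exported file (assumption: not real scraped data).
exported = io.StringIO(
    "Title,Contents\n"
    "Sample headline one,Body text of the first article\n"
    ",\n"
    "   ,Sponsored banner text\n"
    "Sample headline two,Body text of the second article\n"
)
rows = list(csv.DictReader(exported))
clean = drop_blank_rows(rows)
print(len(rows), "rows exported,", len(clean), "kept")
```

For a real file, `io.StringIO(...)` would be replaced by `open(path, newline="", encoding="utf-8")`, with `path` pointing at your own export.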
Fields to download: Title, Author, Date, Contents (the actual article contents)
You can also download news articles, etc. using Calibre
you can use Beautiful Soup in Python, but there's a chance you'll be detected as a bot → ScrapeStorm has some randomness that mimics human behavior
the limited version only allows exporting 10 rows
data cleaning is needed because the export will include advertisements
it's possible that the field you selected doesn't cover the entire author list, or that the author is in a different field --> due to inconsistencies in how certain organizations lay out their pages
usually sir has a research assistant to scrape but if he's on his own he will use third-party tools
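For the do-it-yourself route in Python, the sketch below shows the general idea of pulling Title, Author, Date, and Contents out of a page. It deliberately uses the standard library's `html.parser` instead of Beautiful Soup so that it runs without extra installs. The HTML snippet and the class names (`entry-title`, `byline`, `published`, `article-body`) are hypothetical; a real news site will use its own markup, which you would have to inspect first.

```python
from html.parser import HTMLParser

class FieldExtractor(HTMLParser):
    """Collect text inside elements whose class attribute maps to a field name."""

    def __init__(self, class_to_field):
        super().__init__()
        self.class_to_field = class_to_field
        self.fields = {name: "" for name in class_to_field.values()}
        self._stack = []  # (tag, field or None) for each open element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        field = next(
            (self.class_to_field[c] for c in classes if c in self.class_to_field),
            None,
        )
        self._stack.append((tag, field))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed tags like <br>.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # Text belongs to the nearest enclosing element that maps to a field.
        current = next((f for _, f in reversed(self._stack) if f), None)
        text = data.strip()
        if current and text:
            self.fields[current] = (self.fields[current] + " " + text).strip()

# Hypothetical page markup (assumption: not taken from any real site).
page = """
<html><body>
  <h1 class="entry-title">Sample headline</h1>
  <span class="byline">Sample Author</span>
  <time class="published">2024-01-01</time>
  <div class="ad">Buy now!</div>
  <div class="article-body"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""

extractor = FieldExtractor({
    "entry-title": "Title",
    "byline": "Author",
    "published": "Date",
    "article-body": "Contents",
})
extractor.feed(page)
print(extractor.fields)
```

Because only mapped classes are collected, the advertisement block is skipped here; on real pages the ads are rarely that easy to separate, which is why the cleaning step is still needed.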
Creating n-grams with Apache Nutch & JCreator/IDE
JCreator is a Java IDE
File Structure
-Nutch
--Nutch.jcw
--Classes
---Trigram
--Jar
--Src
---Nutch.java
-Apache Nutch bin and src
-JCreator installation file
- open the workspace (Nutch.jcw) in JCreator
- top menu > Configure > Options > JDK Profiles > set the correct path for the JDK
- build then run the file (Data: Nutch/classes/Trigram)
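For readers who want to see what a trigram profile looks like without opening the Java project, here is a small Python sketch. It is not Nutch's code and does not reproduce Nutch's file format: the `_` padding, the tab-separated output layout, the function names, and the toy training sentence are all assumptions made for illustration.

```python
from collections import Counter

def trigram_profile(text, top_k=10):
    """Return the top_k most frequent character trigrams as (trigram, count) pairs."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # '_' marks word boundaries
        counts.update(padded[i:i + 3] for i in range(len(padded) - 2))
    # Sort by count (descending), then alphabetically so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

def format_profile(profile):
    """Render one 'trigram<TAB>count' line per entry (hypothetical layout)."""
    return "\n".join(f"{gram}\t{count}" for gram, count in profile)

# Toy training text (assumption: made up, far too small for a real profile).
training_text = "ang bata ay nagbasa ng aklat ang guro ay nagturo ng aralin"
profile = trigram_profile(training_text, top_k=5)
print(format_profile(profile))
```

A real profile would be built from a much larger corpus per language, such as the scraped articles after cleaning.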
Ethics
- you can scrape social media but you may be violating the user agreement
we have to respect the following:
- robots.txt, e.g. https://www.gmanetwork.com/robots.txt
- User Agreement: https://services.inquirer.net/user_agreement/
there used to be a faculty in Gokongwei 1st floor that runs 24/7 just collecting datasets using Calibre
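Checking robots.txt before scraping can be automated with Python's standard `urllib.robotparser`. The sketch below parses a rule set given as a string so that it runs offline; the rules and the example.com URLs are made up and do not reflect what any real site's robots.txt says, and the helper name `allowed` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Made-up rules for illustration (assumption: not copied from a real site).
sample_rules = """\
User-agent: *
Disallow: /search/
"""

article_ok = allowed(sample_rules, "my-class-scraper", "https://example.com/news/story.html")
search_ok = allowed(sample_rules, "my-class-scraper", "https://example.com/search/?q=migration")
print("article page allowed:", article_ok)
print("search page allowed:", search_ok)
```

Against a live site, `RobotFileParser.set_url(...)` followed by `read()` fetches the actual file instead of a string. Passing this check does not replace reading the site's user agreement.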