
I'm working in Python, using Scrapy and NLTK, to try to understand how I can extract data from college websites.

My scraper can navigate through the university websites and find their tuition fee pages perfectly, but I run into trouble when trying to extract specific fees such as:

  1. Resident
  2. Non-resident
  3. Per Credit Hour
  4. Per Semester

The problem is that the data is structured so differently from site to site.

I've tried using NLTK to parse the data based on part-of-speech tags, and regex chunking to extract sentences such as "tuition cost for resident: $12,500", but colleges display this data in many different ways.
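The kind of pattern-based extraction described above might look like the following minimal sketch. The patterns, category names, and sample text are all illustrative assumptions, not from an actual college site, and they share exactly the fragility the question is about:

```python
import re

# Hypothetical keyword-anchored patterns: each one looks for a dollar amount
# within ~40 characters after its fee keyword. Note the fragility: "resident"
# would also match inside "non-resident" if no standalone mention exists.
FEE_PATTERNS = {
    "resident": re.compile(r"resident[^$\d]{0,40}\$?([\d,]+(?:\.\d{2})?)", re.I),
    "non_resident": re.compile(r"non[- ]?resident[^$\d]{0,40}\$?([\d,]+(?:\.\d{2})?)", re.I),
    "per_credit_hour": re.compile(r"per\s+credit\s+hour[^$\d]{0,40}\$?([\d,]+(?:\.\d{2})?)", re.I),
    "per_semester": re.compile(r"per\s+semester[^$\d]{0,40}\$?([\d,]+(?:\.\d{2})?)", re.I),
}

def extract_fees(text):
    """Return {category: amount} for every pattern that matches the text."""
    fees = {}
    for category, pattern in FEE_PATTERNS.items():
        match = pattern.search(text)
        if match:
            fees[category] = float(match.group(1).replace(",", ""))
    return fees

sample = "Tuition cost for resident: $12,500 per year; non-resident tuition: $24,000."
print(extract_fees(sample))
```

This works only when a site happens to phrase things close to the patterns, which is exactly why it breaks across hundreds of differently-worded pages.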

Here is my question:

Are there any better ideas/methodologies that I should be looking into that can help me with extracting this type of data?


1 Answer


You need to build a couple of classifiers. First, you need a classifier that tells you whether a page is worth parsing at all. Call this the "is_relevant" model. Once you've determined that a page is relevant, you should pass it through a separate classifier for each data element you're hoping to capture (or a multiclass classifier capable of recognizing each of those elements and distinguishing them from content you're not interested in).

  • Can you elaborate on what you mean by classifiers? – Commented Mar 16, 2018 at 0:33
