
I'm working in Python, using Scrapy and NLTK, to try to understand how I can extract data from college websites.

My scraper can navigate through the university websites and find their tuition fee pages perfectly, but I run into trouble when trying to extract specific fees such as:

  1. Resident
  2. Non-resident
  3. Per Credit Hour
  4. Per Semester

The problem is that the data is structured so differently from site to site.

I've tried using NLTK to parse the data based on part-of-speech tags, and regex chunking to extract sentences such as "tuition cost for resident: $12,500", but colleges display this data in many different ways.
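The kind of pattern-based extraction described above might look like the following minimal sketch. The patterns, category names, and sample text are all illustrative assumptions, not from an actual college site, and they share exactly the fragility the question is about:

```python
import re

# Hypothetical keyword-anchored patterns: each one looks for a dollar amount
# within ~40 characters after its fee keyword. Note the fragility: "resident"
# would also match inside "non-resident" if no standalone mention exists.
FEE_PATTERNS = {
    "resident": re.compile(r"resident[^$\d]{0,40}\$?([\d,]+(?:\.\d{2})?)", re.I),
    "non_resident": re.compile(r"non[- ]?resident[^$\d]{0,40}\$?([\d,]+(?:\.\d{2})?)", re.I),
    "per_credit_hour": re.compile(r"per\s+credit\s+hour[^$\d]{0,40}\$?([\d,]+(?:\.\d{2})?)", re.I),
    "per_semester": re.compile(r"per\s+semester[^$\d]{0,40}\$?([\d,]+(?:\.\d{2})?)", re.I),
}

def extract_fees(text):
    """Return {category: amount} for every pattern that matches the text."""
    fees = {}
    for category, pattern in FEE_PATTERNS.items():
        match = pattern.search(text)
        if match:
            fees[category] = float(match.group(1).replace(",", ""))
    return fees

sample = "Tuition cost for resident: $12,500 per year; non-resident tuition: $24,000."
print(extract_fees(sample))
```

This works only when a site happens to phrase things close to the patterns, which is exactly why it breaks across hundreds of differently-worded pages.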

Here is my question:

Are there any better ideas/methodologies that I should be looking into that can help me with extracting this type of data?


1 Answer


You need to build a couple of classifiers. First, you need a classifier that tells you whether a page is worth parsing at all. Call this the "is_relevant" model. Once you've determined that a page is relevant, you should pass it through a separate classifier for each data element you're hoping to capture (or a multiclass classifier capable of recognizing each of those elements and distinguishing them from content you're not interested in).

  • Can you elaborate on what you mean by classifiers? – Commented Mar 16, 2018 at 0:33
