This repository was archived by the owner on Aug 26, 2022. It is now read-only.

Bug 1271509: Sample database and production scraper#4076

Merged
escattone merged 15 commits into mdn:master from jwhitlock:sample_database_wip_1271509
Jul 7, 2017

Conversation

@jwhitlock
Contributor

@jwhitlock jwhitlock commented Dec 20, 2016

Now that PR #4245 and PR #4248 are merged, this PR is getting a little more sane:

  • Adds a management command scrape_links to find document links on an MDN page. It is designed for the homepage, but can work on other pages as needed.
  • Adds a management command sample_mdn that loads fixtures and scrapes production data from a JSON spec.
  • Adds a script to create an empty but migrated database, load it with fixtures and scraped data, and export a SQL dump for development and testing. Since the data is scraped from public interfaces, this avoids exposing personally identifiable information.
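A link scraper of this kind can be sketched with the standard library's html.parser. This is an illustrative stand-in, not the PR's implementation; the class name and the path heuristic are hypothetical.

```python
from html.parser import HTMLParser


class DocLinkParser(HTMLParser):
    """Collect in-page links to wiki documents (paths like /<locale>/docs/...)."""

    def __init__(self):
        super().__init__()
        self.doc_paths = set()

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href', '')
        parts = href.split('/')
        # A document link looks like /en-US/docs/Web/HTML;
        # other links (profiles, attachments, external) are skipped.
        if len(parts) >= 4 and parts[0] == '' and parts[2] == 'docs':
            self.doc_paths.add(href)


parser = DocLinkParser()
parser.feed('<a href="/en-US/docs/Web/HTML">HTML</a>'
            ' <a href="/profiles/someone">user</a>')
print(sorted(parser.doc_paths))  # ['/en-US/docs/Web/HTML']
```

The real command also walks the scraped links through the document sources described below; this sketch only shows the link-filtering step.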

A recently generated sample database is at http://mdn-downloads.s3-website-us-west-2.amazonaws.com/mdn_sample_db.sql.gz

To Do:

  • Remove unneeded stuff like ./manage.py sample_templates and debug print statements that are now obvious once I've opened the PR.
  • Split up big files for sanity and better GitHub review
  • Split up long gather methods into sane methods
  • Update documentation to use sample database
  • Improve S3 download site
  • Set valid test account passwords for integration tests
  • Ensure fr/docs/User:anonymous:uitest is in sample DB
  • Include Mozilla Hacks blog scrape
  • Maybe add en-US/docs/Apps
  • Add documentation for sampling data, generating the database
  • Allow scraping a tree of documents, like /en-US/Web/HTML
  • Change management command scrape_homepage to a generic scrape_links.
  • Add "less verbose" mode for cron-based sample DB generation
  • Set Site 1 to localhost:8000 from example.com
  • Add URLs in tests/performance/smoke.py to the sample DB

Not this PR:

  • Extract attachment data from documents
  • Render documents and set the render date of documents to avoid rebuild of sample DB on load
@codecov-io

codecov-io commented Dec 20, 2016

Codecov Report

Merging #4076 into master will increase coverage by 0.4%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #4076     +/-   ##
=========================================
+ Coverage   87.86%   88.27%    +0.4%
=========================================
  Files         162      164       +2
  Lines       10022    10236     +214
  Branches     1367     1419      +52
=========================================
+ Hits         8806     9036     +230
+ Misses        989      971      -18
- Partials      227      229       +2
Impacted Files Coverage Δ
kuma/scrape/storage.py 100% <ø> (ø) ⬆️
kuma/scrape/sources/document_current.py 100% <100%> (ø)
kuma/scrape/sources/revision.py 100% <100%> (ø) ⬆️
kuma/scrape/sources/__init__.py 100% <100%> (ø) ⬆️
kuma/scrape/scraper.py 100% <100%> (ø) ⬆️
kuma/scrape/sources/links.py 100% <100%> (ø)
kuma/scrape/fixture.py 100% <100%> (ø)
kuma/scrape/sources/document.py 100% <100%> (ø) ⬆️
kuma/scrape/sources/document_history.py 100% <100%> (ø) ⬆️
kuma/scrape/sources/document_rendered.py 100% <100%> (ø) ⬆️
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6cb9c09...22b7ab2. Read the comment docs.

@jwhitlock jwhitlock self-assigned this Jan 23, 2017
@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch from 3e24569 to 2679b79 on February 1, 2017 22:07
@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch 3 times, most recently from e045126 to f0ecd92 on February 9, 2017 18:24
@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch 3 times, most recently from ec4037e to 4305ce4 on March 8, 2017 23:01
@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch from 6319539 to e9622c2 on March 23, 2017 18:17
@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch 2 times, most recently from 3f8da5b to 3e9ed80 on April 4, 2017 19:59
@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch from 3e9ed80 to dc36a53 on April 17, 2017 21:21
@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch from dc36a53 to 58e36e9 on May 11, 2017 17:55
@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch 6 times, most recently from 366be36 to fabf2a2 on May 31, 2017 23:10
@jwhitlock
Contributor Author

With #4245 merged, I've moved the work-in-progress branch to https://github.com/jwhitlock/kuma/tree/sample_database_wip_long_1271509, and replaced this PR with the mergeable commits. There are two milestone PRs:

  • Adding ./manage.py scrape_document
  • Adding ./manage.py scrape_homepage and sample_mdn

I'm continuing to work on the code.

@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch 7 times, most recently from aa1578c to cedf918 on June 6, 2017 16:32
@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch from 8803845 to d2bdddc on June 7, 2017 14:21
Scrape the documents linked from a page, such as the homepage.
@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch from d2bdddc to 26a33ac on June 8, 2017 21:41
@jwhitlock
Contributor Author

I ran the scrape again, and hit two problems:

  • A translation was deleted on the site, but the $json metadata for the English page still included it. When fetching the page, a 404 is returned. This caused an immediate termination of scraping, rather than setting the document_rendered source to an error state. The code needs updating.
  • All pages show "This article doesn't have any content yet.". The document's html field is an empty string. This should be set when the current revision is saved, to the current revision's content. More debugging is needed.
@jwhitlock
Contributor Author

I incorrectly thought that revision.save() was skipped when using Revision.objects.create. I was thinking of bulk creation, like when loading a Django test fixture. Oops.

Most documents are now set up correctly. There are exceptions (22 of 1213 documents):

  • Mozilla/Projects/Midas - has a revision, but it is not set as current_revision, and html is an empty string.
  • cs/docs/Tools - This is a redirect to the English doc on MDN, and has no content or revisions in the scrape.
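The save() side effect described above can be sketched with plain classes. These are hypothetical, heavily simplified stand-ins for Kuma's Document and Revision models, not the real ORM code:

```python
class Document:
    def __init__(self):
        self.html = ''
        self.current_revision = None


class Revision:
    def __init__(self, document, content='', is_approved=True):
        self.document = document
        self.content = content
        self.is_approved = is_approved

    def save(self):
        # Mirrors the surprise described in this thread: save() runs even
        # for Revision.objects.create(), and an approved revision becomes
        # the document's current revision, overwriting document.html.
        if self.is_approved:
            self.document.current_revision = self
            self.document.html = self.content


doc = Document()
doc.html = '<p>scraped content</p>'
# A historical revision saved with is_approved=False leaves doc.html alone.
Revision(doc, content='', is_approved=False).save()
print(doc.html)  # '<p>scraped content</p>'
```

This is why the fix saves historical revisions with is_approved=False: an approved empty revision would silently clear the scraped doc.html.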
Add a management command to populate a database with fixtures and with data scraped from MDN.
Add a script and specification file to create the sample database used for MDN development and integration testing.
One common case is that the document metadata refers to a deleted translation. Instead of halting the scrape with an error, set the document_rendered source to state=ERROR.
When a new revision is created, even with Revision.objects.get_or_create, the Revision.save() method is called. This surprised me; I thought it was skipped, but that is only true for bulk creation. If is_approved=True (the default), the save() method makes the revision the current revision, which sets Document.html to the revision.content (default empty string). I've fixed the storage code to save the initial revision with is_approved=False, so that we don't accidentally clear doc.html when adding a historical revision.
Fixtures assumed that Revision.objects.create() did not call .save(), and that document.current_revision had to be set directly. Update fixture setup for my new understanding.
Instead of logging errors in Source.gather, capture the error so that it can be logged by the scraper, and so that a summary of errors can be printed at the end of the run.
If the document history has fewer revisions than requested, then remember that all revisions have been scraped.
At the end of document scraping, check that the document has a current_revision. If it doesn't, scrape more revisions and/or more history, until we find a current revision or run out of revisions. Page moves record the move in the most recent revision, but leave the n-1 revision as the current_revision. The default for ./scrape_document and others is to scrape just one document, leaving these stuck with no content.
@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch from 641291d to 08818cf on June 12, 2017 23:34
@jwhitlock
Contributor Author

The issue with an empty document.current_revision is a quirk of page moves. On Mozilla/Projects/Midas$history, the latest revision is the page move, and the second-to-latest revision is the "current revision". If the scraper just scrapes the latest revision, it never sets current_revision. The solution is a new source type document_current that checks that current_revision is set, and requests more revisions until it is.
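The "request more revisions until current_revision is set" strategy can be sketched as a loop over batches of revision history. This is an illustrative sketch, not the document_current source itself; the dict shape and is_current flag are hypothetical (the real source issues further requests to MDN rather than walking an in-memory list):

```python
def find_current_revision(revision_batches):
    """Walk batches of revisions (newest first) until one is marked current.

    Returns the current revision, or None if we run out of revisions.
    """
    for batch in revision_batches:
        for revision in batch:
            if revision.get('is_current'):
                return revision
    return None  # ran out of revisions without finding a current one


batches = [
    [{'id': 10, 'is_current': False}],  # latest revision: the page move
    [{'id': 9, 'is_current': True}],    # second-to-latest: the current revision
]
print(find_current_revision(batches))  # {'id': 9, 'is_current': True}
```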

Other issues were translations where a root document was renamed but the child documents had the old paths. For example, pl/docs/Narzędzia was the parent topic of pl/docs/Tools/WebIDE, and the scraper had issues following the parent relations, attempting to get the history of a redirect. None of the APIs expose the parent topic's URL, and the scraper assumes you can remove everything after the last slash to get it. In any case, I manually moved these child translations so that they matched their parent topic's slug.
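The "remove everything after the last slash" assumption is a one-liner; the function name here is hypothetical, for illustration only:

```python
def assumed_parent_path(path):
    # The scraper's assumption: the parent topic's URL is the child's URL
    # minus its last path segment. This breaks when the parent was renamed,
    # as with pl/docs/Narzędzia above.
    return path.rsplit('/', 1)[0]


print(assumed_parent_path('/pl/docs/Tools/WebIDE'))  # '/pl/docs/Tools'
```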

I've now successfully scraped MDN with this code, and uploaded the sample DB to the downloads server.

If you want to try out the sample DB script, it takes me about 40 minutes to run, makes a little over 10,000 requests, and takes over your working copy while doing it. Plan your work day accordingly! You can also try a more limited version by reducing the scraped items in etc/sample_db.json, for example removing items, not scraping translations, etc.

When scraping a page for Document URLs, omit the links to profiles, such as the contributor bar.
Contributor

@escattone escattone left a comment

It's sad that more eyes won't review this, so you get the credit you deserve. This is great work with superb tests and test coverage (100% for every file under kuma/scrape). It deserves fanfare. 🎉 🍾

I've included some nitpicks for you to consider as you see fit, and after your response to those I'll merge it!

docs/data.rst Outdated

Create the Sample Database
==========================
These scraping tools are used to crete a sample database of public
Contributor

Typo: replace crete with create

relations = metadata.get('relations', {})
filters = metadata.get('filters', {})
app_label, model_name = model_id.split('.')
assert apps.get_model(app_label, model_name)
Contributor

Nitpick. It appears that app_label and model_name are not used other than for the apps.get_model call, so could remove the app_label, model_name = model_id.split('.') line and do assert apps.get_model(model_id) since apps.get_model supports the dotted notation.
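The suggestion can be illustrated with a toy registry. This get_model is a hypothetical stand-in (with a made-up registry) that, like Django's apps.get_model, accepts either two arguments or a single dotted 'app_label.ModelName' string:

```python
def get_model(app_label, model_name=None):
    """Toy stand-in for Django's apps.get_model (registry is hypothetical)."""
    registry = {('wiki', 'Document'): 'DocumentModel'}
    if model_name is None:
        # Dotted form: 'wiki.Document' -> ('wiki', 'Document')
        app_label, model_name = app_label.split('.')
    return registry[(app_label, model_name)]


# Both call styles resolve the same model, so the manual split is unneeded:
assert get_model('wiki', 'Document') == get_model('wiki.Document')
```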

Contributor Author

Thanks, this worked. I didn't know that apps.get_model was flexible that way.

for model_id, items in self.spec.items():
metadata = self.model_metadata[model_id]
app_label, model_name = model_id.split('.')
Model = apps.get_model(app_label, model_name)
Contributor

Nitpick. As before, could remove the app_label, model_name = model_id.split('.') line and do Model = apps.get_model(model_id).

"""
Attempt to create or update an item.

Returns the instance if attempt suceeded, or None if related items are
Contributor

Update comment to reflect that a NeedsDependency exception is raised if related items are needed.

dest='ssl')

def handle(self, *arg, **options):
self.setup_logging(options.get('verbosity'))
Contributor

For sake of consistency, would prefer self.setup_logging(options['verbosity']).

Contributor Author

Good catch, I thought I got all of these...

params[param] = options[param]
scraper.add_source("links", path, **params)

scraper.scrape()
Contributor

The scraper.scrape() call will handle reporting any error details, but do we also want to raise a CommandError exception, like the other commands, if any of the underlying sources fail? Maybe something like this:

sources = scraper.scrape()
if any(source.state != source.STATE_DONE for source in sources.values()):
    raise CommandError('Incomplete scraping of the links for "%s"' % path)
Contributor Author

Sounds good. I think I should standardize on CommandError in all the management commands.

@@ -113,7 +115,8 @@ def create_source(self, source_key, **options):
'complete, freshness=%(freshness)s, with %(dep_count)s'
' dependant source%(dep_s)s.')
Contributor

Typo: should be dependent rather than dependant

' dependant source%(dep_s)s.')
_report_error = (_report_prefix +
'errored, with %(dep_count)s dependant source%(dep_s)s.')
'errored %(err_msg)s, with %(dep_count)s dependant'
Contributor

Typo: dependent rather than dependant

'errored %(err_msg)s, with %(dep_count)s dependant'
' source%(dep_s)s.')
_report_progress = (_report_prefix +
'in state "%(state)s" with %(dep_count)s dependant'
Contributor

Typo: dependent rather than dependant

path = '/en-US/' # Default to English homepage
path = urlparse(path).path
assert path.startswith('/')
self.locale = urlparse(path).path.split('/')[1]
Contributor

Seems like a redundant call to urlparse(path).path, so could be self.locale = path.split('/')[1]?
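For illustration: urlparse on a string that is already a bare path returns that path unchanged, so the second urlparse call adds nothing.

```python
from urllib.parse import urlparse

path = urlparse('/en-US/docs/Web/HTML').path  # -> '/en-US/docs/Web/HTML'
# urlparse of a bare path returns it unchanged, so a second call is redundant:
assert urlparse(path).path == path
locale = path.split('/')[1]
print(locale)  # 'en-US'
```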

Contributor Author

Yep, totally ~~redundent~~ redundant

Contributor

😆

@jgmize
Contributor

jgmize commented Jul 7, 2017

It's sad that more eyes won't review this, so you get the credit you deserve. This is great work with superb tests and test coverage (100% for every file under kuma/scrape). It deserves fanfare. 🎉 🍾

agreed 💯. I also hate to break the tradition but I'm looking forward to seeing this finally merge this month.

jwhitlock added 5 commits July 7, 2017 11:09
  • Update scrape_links docstring for abilities beyond homepage
  • Use options['verbosity'], since always populated
  • Use CommandError to report source failures
  • Remove unneeded parentheses
  • Close specification file after processing it
  • Fix pluralization of incomplete sources message
Contributor

@escattone escattone left a comment

Thanks again @jwhitlock, this is a great addition to Kuma! 🎉

@escattone escattone merged commit 6b29825 into mdn:master Jul 7, 2017
@jwhitlock jwhitlock deleted the sample_database_wip_1271509 branch July 7, 2017 19:18
@jwhitlock
Contributor Author

Thanks! I'm proud of this code, but I don't want anyone, including me, to follow my example of working on a feature for over a year. Small PRs, reviewed and merged often!
