This repository was archived by the owner on Aug 26, 2022. It is now read-only.

Bug 1271509: Sample database and production scraper#4076

Merged
escattone merged 15 commits into mdn:master from jwhitlock:sample_database_wip_1271509
Jul 7, 2017

Conversation

@jwhitlock
Contributor

@jwhitlock jwhitlock commented Dec 20, 2016

Now that PR #4245 and PR #4248 are merged, this PR is getting a little more sane:

  • Adds a management command scrape_links to find document links on an MDN page. It is designed for the homepage, but can work on other pages as needed.
  • Adds a management command sample_mdn that loads fixtures and scrapes production data from a JSON spec.
  • Adds a script to create an empty but migrated database, load it with fixtures and scraped data, and export a SQL dump for development and testing. Since the data is scraped from public interfaces, this avoids exposing personally identifiable information.
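A link scraper of this kind can be sketched with the standard library's html.parser. This is an illustrative stand-in, not the PR's implementation; the class name and the path heuristic are hypothetical.

```python
from html.parser import HTMLParser


class DocLinkParser(HTMLParser):
    """Collect in-page links to wiki documents (paths like /<locale>/docs/...)."""

    def __init__(self):
        super().__init__()
        self.doc_paths = set()

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href', '')
        parts = href.split('/')
        # A document link looks like /en-US/docs/Web/HTML;
        # other links (profiles, attachments, external) are skipped.
        if len(parts) >= 4 and parts[0] == '' and parts[2] == 'docs':
            self.doc_paths.add(href)


parser = DocLinkParser()
parser.feed('<a href="/en-US/docs/Web/HTML">HTML</a>'
            ' <a href="/profiles/someone">user</a>')
print(sorted(parser.doc_paths))  # ['/en-US/docs/Web/HTML']
```

The real command also walks the scraped links through the document sources described below; this sketch only shows the link-filtering step.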

A recently generated sample database is at http://mdn-downloads.s3-website-us-west-2.amazonaws.com/mdn_sample_db.sql.gz

To Do:

  • Remove unneeded stuff like ./manage.py sample_templates and debug print statements that are now obvious once I've opened the PR.
  • Split up big files for sanity and better GitHub review
  • Split up long gather methods into sane methods
  • Update documentation to use sample database
  • Improve S3 download site
  • Set valid test account passwords for integration tests
  • Ensure fr/docs/User:anonymous:uitest is in sample DB
  • Include Mozilla Hacks blog scrape
  • Maybe add en-US/docs/Apps
  • Add documentation for sampling data, generating the database
  • Allow scraping a tree of documents, like /en-US/Web/HTML
  • Change management command scrape_homepage to a generic scrape_links.
  • Add "less verbose" mode for cron-based sample DB generation
  • Set Site 1 to localhost:8000 from example.com
  • Add URLs in tests/performance/smoke.py to the sample DB

Not this PR:

  • Extract attachment data from documents
  • Render documents and set the render date of documents to avoid rebuild of sample DB on load
@codecov-io

codecov-io commented Dec 20, 2016

Codecov Report

Merging #4076 into master will increase coverage by 0.4%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #4076     +/-   ##
=========================================
+ Coverage   87.86%   88.27%    +0.4%
=========================================
  Files         162      164       +2
  Lines       10022    10236     +214
  Branches     1367     1419      +52
=========================================
+ Hits         8806     9036     +230
+ Misses        989      971      -18
- Partials      227      229       +2
Impacted Files Coverage Δ
kuma/scrape/storage.py 100% <ø> (ø) ⬆️
kuma/scrape/sources/document_current.py 100% <100%> (ø)
kuma/scrape/sources/revision.py 100% <100%> (ø) ⬆️
kuma/scrape/sources/__init__.py 100% <100%> (ø) ⬆️
kuma/scrape/scraper.py 100% <100%> (ø) ⬆️
kuma/scrape/sources/links.py 100% <100%> (ø)
kuma/scrape/fixture.py 100% <100%> (ø)
kuma/scrape/sources/document.py 100% <100%> (ø) ⬆️
kuma/scrape/sources/document_history.py 100% <100%> (ø) ⬆️
kuma/scrape/sources/document_rendered.py 100% <100%> (ø) ⬆️
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6cb9c09...22b7ab2. Read the comment docs.

@jwhitlock jwhitlock self-assigned this Jan 23, 2017
@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch from 3e24569 to 2679b79 on February 1, 2017 22:07
@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch 3 times, most recently from e045126 to f0ecd92 on February 9, 2017 18:24
@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch 3 times, most recently from ec4037e to 4305ce4 on March 8, 2017 23:01
@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch from 6319539 to e9622c2 on March 23, 2017 18:17
@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch 2 times, most recently from 3f8da5b to 3e9ed80 on April 4, 2017 19:59
@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch from 3e9ed80 to dc36a53 on April 17, 2017 21:21
@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch from dc36a53 to 58e36e9 on May 11, 2017 17:55
@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch 6 times, most recently from 366be36 to fabf2a2 on May 31, 2017 23:10
@jwhitlock
Contributor Author

With #4245 merged, I've moved the work-in-progress branch to https://github.com/jwhitlock/kuma/tree/sample_database_wip_long_1271509, and replaced this PR with the mergeable commits. There are two milestone PRs:

  • Adding ./manage.py scrape_document
  • Adding ./manage.py scrape_homepage and sample_mdn

I'm continuing to work on the code.

@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch 7 times, most recently from aa1578c to cedf918 on June 6, 2017 16:32
@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch from 8803845 to d2bdddc on June 7, 2017 14:21
Scrape the documents linked from a page, such as the homepage.
@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch from d2bdddc to 26a33ac on June 8, 2017 21:41
@jwhitlock
Contributor Author

I ran the scrape again, and hit two problems:

  • A translation was deleted on the site, but the $json metadata for the English page still included it. When fetching the page, a 404 is returned. This caused an immediate termination of scraping, rather than setting the document_rendered source to an error state. The code needs updating.
  • All pages show "This article doesn't have any content yet.". The document's html field is an empty string. This should be set when the current revision is saved, to the current revision's content. More debugging is needed.
@jwhitlock
Contributor Author

I incorrectly thought that revision.save() was skipped when using Revision.objects.create. I was thinking of bulk creation, like when loading a Django test fixture. Oops.

Most documents are now set up correctly. There are exceptions (22 of 1213 documents):

  • Mozilla/Projects/Midas - has a revision, but it is not set as current_revision, and html is an empty string.
  • cs/docs/Tools - This is a redirect to the English doc on MDN, and has no content or revisions in the scrape.
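The save() side effect described above can be sketched with plain classes. These are hypothetical, heavily simplified stand-ins for Kuma's Document and Revision models, not the real ORM code:

```python
class Document:
    def __init__(self):
        self.html = ''
        self.current_revision = None


class Revision:
    def __init__(self, document, content='', is_approved=True):
        self.document = document
        self.content = content
        self.is_approved = is_approved

    def save(self):
        # Mirrors the surprise described in this thread: save() runs even
        # for Revision.objects.create(), and an approved revision becomes
        # the document's current revision, overwriting document.html.
        if self.is_approved:
            self.document.current_revision = self
            self.document.html = self.content


doc = Document()
doc.html = '<p>scraped content</p>'
# A historical revision saved with is_approved=False leaves doc.html alone.
Revision(doc, content='', is_approved=False).save()
print(doc.html)  # '<p>scraped content</p>'
```

This is why the fix saves historical revisions with is_approved=False: an approved empty revision would silently clear the scraped doc.html.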
Add a management command to populate a database with fixtures and with data scraped from MDN.
Add a script and specification file to create the sample database used for MDN development and integration testing.
One common case is that the document metadata refers to a deleted translation. Instead of halting the scrape with an error, set the document_rendered source to state=ERROR.
When a new revision is created, even with Revision.objects.get_or_create, the Revision.save() method is called. This surprised me; I thought it was skipped, but that is only true for bulk creation. If is_approved=True (the default), the save() method makes the revision the current revision, which sets Document.html to the revision.content (default empty string). I've fixed the storage code to save the initial revision with is_approved=False, so that we don't accidentally clear doc.html when adding a historical revision.
Fixtures assumed that Revision.objects.create() did not call .save(), and that document.current_revision had to be set directly. Update fixture setup for my new understanding.
Instead of logging errors in Source.gather, capture the error so that it can be logged by the scraper, and so that a summary of errors can be printed at the end of the run.
If the document history has fewer revisions than requested, then remember that all revisions have been scraped.
At the end of document scraping, check that the document has a current_revision. If it doesn't, scrape more revisions and/or more history, until we find a current revision or run out of revisions. Page moves record the move in the most recent revision, but leave the n-1 revision as the current_revision. The default for ./scrape_document and others is to scrape just one document, leaving these stuck with no content.
@jwhitlock jwhitlock force-pushed the sample_database_wip_1271509 branch from 641291d to 08818cf on June 12, 2017 23:34
@jwhitlock
Contributor Author

The issue with an empty document.current_revision is a quirk of page moves. On Mozilla/Projects/Midas$history, the latest revision is the page move, and the second-to-latest revision is the "current revision". If the scraper just scrapes the latest revision, it never sets current_revision. The solution is a new source type document_current that checks that current_revision is set, and requests more revisions until it is.
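The "request more revisions until current_revision is set" strategy can be sketched as a loop over batches of revision history. This is an illustrative sketch, not the document_current source itself; the dict shape and is_current flag are hypothetical (the real source issues further requests to MDN rather than walking an in-memory list):

```python
def find_current_revision(revision_batches):
    """Walk batches of revisions (newest first) until one is marked current.

    Returns the current revision, or None if we run out of revisions.
    """
    for batch in revision_batches:
        for revision in batch:
            if revision.get('is_current'):
                return revision
    return None  # ran out of revisions without finding a current one


batches = [
    [{'id': 10, 'is_current': False}],  # latest revision: the page move
    [{'id': 9, 'is_current': True}],    # second-to-latest: the current revision
]
print(find_current_revision(batches))  # {'id': 9, 'is_current': True}
```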

Other issues were translations where a root document was renamed but the child documents had the old paths. For example, pl/docs/Narzędzia was the parent topic of pl/docs/Tools/WebIDE, and the scraper had issues following the parent relations, attempting to get the history of a redirect. None of the APIs expose the parent topic's URL, and the scraper assumes you can remove everything after the last slash to get it. In any case, I manually moved these child translations so that they matched their parent topic's slug.
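The "remove everything after the last slash" assumption is a one-liner; the function name here is hypothetical, for illustration only:

```python
def assumed_parent_path(path):
    # The scraper's assumption: the parent topic's URL is the child's URL
    # minus its last path segment. This breaks when the parent was renamed,
    # as with pl/docs/Narzędzia above.
    return path.rsplit('/', 1)[0]


print(assumed_parent_path('/pl/docs/Tools/WebIDE'))  # '/pl/docs/Tools'
```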

I've now successfully scraped MDN with this code, and uploaded the sample DB to the downloads server.

If you want to try out the sample DB script, it takes me about 40 minutes to run, makes a little over 10,000 requests, and takes over your working copy while doing it. Plan your work day accordingly! You can also try a more limited version by reducing the scraped items in etc/sample_db.json, for example removing items, not scraping translations, etc.

When scraping a page for Document URLs, omit the links to profiles, such as the contributor bar.
Contributor

@escattone escattone left a comment

It's sad that more eyes won't review this, so you get the credit you deserve. This is great work with superb tests and test coverage (100% for every file under kuma/scrape). It deserves fanfare. 🎉 🍾

I've included some nitpicks for you to consider as you see fit, and after your response to those I'll merge it!

docs/data.rst Outdated

Create the Sample Database
==========================
These scraping tools are used to crete a sample database of public
Contributor

Typo: replace crete with create

relations = metadata.get('relations', {})
filters = metadata.get('filters', {})
app_label, model_name = model_id.split('.')
assert apps.get_model(app_label, model_name)
Contributor

Nitpick. It appears that app_label and model_name are not used other than for the apps.get_model call, so could remove the app_label, model_name = model_id.split('.') line and do assert apps.get_model(model_id) since apps.get_model supports the dotted notation.
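The suggestion can be illustrated with a toy registry. This get_model is a hypothetical stand-in (with a made-up registry) that, like Django's apps.get_model, accepts either two arguments or a single dotted 'app_label.ModelName' string:

```python
def get_model(app_label, model_name=None):
    """Toy stand-in for Django's apps.get_model (registry is hypothetical)."""
    registry = {('wiki', 'Document'): 'DocumentModel'}
    if model_name is None:
        # Dotted form: 'wiki.Document' -> ('wiki', 'Document')
        app_label, model_name = app_label.split('.')
    return registry[(app_label, model_name)]


# Both call styles resolve the same model, so the manual split is unneeded:
assert get_model('wiki', 'Document') == get_model('wiki.Document')
```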

Contributor Author

Thanks, this worked. I didn't know that apps.get_model was flexible that way.

for model_id, items in self.spec.items():
metadata = self.model_metadata[model_id]
app_label, model_name = model_id.split('.')
Model = apps.get_model(app_label, model_name)
Contributor

Nitpick. As before, could remove the app_label, model_name = model_id.split('.') line and do Model = apps.get_model(model_id).

"""
Attempt to create or update an item.

Returns the instance if attempt suceeded, or None if related items are
Contributor

Update comment to reflect that a NeedsDependency exception is raised if related items are needed.

dest='ssl')

def handle(self, *arg, **options):
self.setup_logging(options.get('verbosity'))
Contributor

For sake of consistency, would prefer self.setup_logging(options['verbosity']).

Contributor Author

Good catch, I thought I got all of these...

params[param] = options[param]
scraper.add_source("links", path, **params)

scraper.scrape()
Contributor

The scraper.scrape() call will handle reporting any error details, but do we also want to raise a CommandError exception, like the other commands, if any of the underlying sources fail? Maybe something like this:

sources = scraper.scrape()
if any(source.state != source.STATE_DONE for source in sources.values()):
    raise CommandError('Incomplete scraping of the links for "%s"' % path)
Contributor Author

Sounds good. I think I should standardize on CommandError in all the management commands.

@@ -113,7 +115,8 @@ def create_source(self, source_key, **options):
'complete, freshness=%(freshness)s, with %(dep_count)s'
' dependant source%(dep_s)s.')
Contributor

Typo: should be dependent rather than dependant

' dependant source%(dep_s)s.')
_report_error = (_report_prefix +
'errored, with %(dep_count)s dependant source%(dep_s)s.')
'errored %(err_msg)s, with %(dep_count)s dependant'
Contributor

Typo: dependent rather than dependant

'errored %(err_msg)s, with %(dep_count)s dependant'
' source%(dep_s)s.')
_report_progress = (_report_prefix +
'in state "%(state)s" with %(dep_count)s dependant'
Contributor

Typo: dependent rather than dependant

path = '/en-US/' # Default to English homepage
path = urlparse(path).path
assert path.startswith('/')
self.locale = urlparse(path).path.split('/')[1]
Contributor

Seems like a redundant call to urlparse(path).path, so could be self.locale = path.split('/')[1]?
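For illustration: urlparse on a string that is already a bare path returns that path unchanged, so the second urlparse call adds nothing.

```python
from urllib.parse import urlparse

path = urlparse('/en-US/docs/Web/HTML').path  # -> '/en-US/docs/Web/HTML'
# urlparse of a bare path returns it unchanged, so a second call is redundant:
assert urlparse(path).path == path
locale = path.split('/')[1]
print(locale)  # 'en-US'
```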

Contributor Author

Yep, totally ~~redundent~~ redundant

Contributor

😆

@jgmize
Contributor

jgmize commented Jul 7, 2017

It's sad that more eyes won't review this, so you get the credit you deserve. This is great work with superb tests and test coverage (100% for every file under kuma/scrape). It deserves fanfare. 🎉 🍾

agreed 💯. I also hate to break the tradition but I'm looking forward to seeing this finally merge this month.

jwhitlock added 5 commits July 7, 2017 11:09
  • Update scrape_links docstring for abilities beyond homepage
  • Use options['verbosity'], since always populated
  • Use CommandError to report source failures
  • Remove unneeded parentheses
  • Close specification file after processing it
  • Fix pluralization of incomplete sources message
Contributor

@escattone escattone left a comment

Thanks again @jwhitlock, this is a great addition to Kuma! 🎉

@escattone escattone merged commit 6b29825 into mdn:master Jul 7, 2017
@jwhitlock jwhitlock deleted the sample_database_wip_1271509 branch July 7, 2017 19:18
@jwhitlock
Contributor Author

Thanks! I'm proud of this code, but I don't want anyone, including me, to follow my example of working on a feature for over a year. Small PRs, reviewed and merged often!
