Bug 1271509: Sample database and production scraper #4076

escattone merged 15 commits into mdn:master
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master    #4076     +/-   ##
=========================================
+ Coverage   87.86%   88.27%    +0.4%
=========================================
  Files         162      164       +2
  Lines       10022    10236     +214
  Branches     1367     1419      +52
=========================================
+ Hits         8806     9036     +230
+ Misses        989      971      -18
- Partials      227      229       +2
Continue to review full report at Codecov.
With #4245 merged, I've moved the work-in-progress branch to https://github.com/jwhitlock/kuma/tree/sample_database_wip_long_1271509, and replaced this PR with the mergeable commits. There are two milestone PRs:
I'm continuing to work on the code.
Scrape the documents linked from a page, such as the homepage.
I ran the scrape again and hit two problems:
I incorrectly thought that … Most documents are now set up correctly. There are exceptions (22 of 1213 documents):
Add a management command to populate a database with fixtures and with data scraped from MDN.
Add a script and specification file to create the sample database used for MDN development and integration testing.
One common case is that the document metadata refers to a deleted translation. Instead of halting the scrape with an error, set the document_rendered source to state=ERROR.
When a new revision is created, even with Revision.objects.get_or_create, the Revision.save() method is called. This surprised me; I thought it was skipped, but that is only true for bulk creation. If is_approved=True (the default), the save() method makes the revision the current revision, which sets Document.html to the revision content (an empty string by default). I've fixed the storage code to save the initial revision with is_approved=False, so that we don't accidentally clear doc.html when adding a historical revision.
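The surprising behavior and the fix can be sketched with a toy model (plain Python standing in for the Django models; the `Document`, `Revision`, and `save()` side effect mirror the description above, not kuma's actual code):

```python
class Document:
    def __init__(self, html="existing content"):
        self.html = html
        self.current_revision = None

class Revision:
    def __init__(self, document, content="", is_approved=True):
        self.document = document
        self.content = content
        self.is_approved = is_approved
        self.save()  # mirrors Django: objects.create()/get_or_create() call save()

    def save(self):
        # An approved revision becomes the current revision, overwriting
        # Document.html with its content (empty string by default).
        if self.is_approved:
            self.document.current_revision = self
            self.document.html = self.content

# With the default is_approved=True, adding a historical revision
# clobbers the document's HTML:
doc1 = Document()
Revision(doc1)
assert doc1.html == ""

# The fix: store the initial scraped revision with is_approved=False,
# leaving the existing document content untouched:
doc2 = Document()
Revision(doc2, is_approved=False)
assert doc2.html == "existing content"
```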
Fixtures assumed that Revision.objects.create() did not call .save(), and that document.current_revision had to be set directly. Update the fixture setup for my new understanding.
Instead of logging errors in Source.gather, capture the error so that it can be logged by the scraper, and so that a summary of errors can be printed at the end of the run.
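A minimal sketch of that error-capture pattern (the `Source` class, state names, and `gather` signature here are illustrative stand-ins following the description above, not kuma's actual API):

```python
class Source:
    STATE_DONE = "done"
    STATE_ERROR = "error"

    def __init__(self):
        self.state = None
        self.error = None

    def gather(self, requester):
        """Collect data, capturing rather than logging any failure."""
        try:
            result = self.load(requester)
        except Exception as exc:
            self.error = str(exc)          # captured for the scraper to report
            self.state = self.STATE_ERROR  # scraper can summarize errors at the end
            return []
        self.state = self.STATE_DONE
        return result

    def load(self, requester):
        raise NotImplementedError

class BrokenSource(Source):
    def load(self, requester):
        raise ValueError("deleted translation")

source = BrokenSource()
assert source.gather(None) == []
assert source.state == Source.STATE_ERROR
assert source.error == "deleted translation"
```

The scraper, not the source, then decides whether to log each captured error immediately or only in the end-of-run summary.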
If the document history has fewer revisions than requested, then remember that all revisions have been scraped.
At the end of document scraping, check that the document has a current_revision. If it doesn't, scrape more revisions and/or more history until we find a current revision or run out of revisions. Page moves record the move in the most recent revision but leave the n-1 revision as the current_revision. The default for ./scrape_document and others is to scrape just one revision, leaving these documents stuck with no content.
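The page-move case can be sketched like this (a simplified stand-in for the scraping loop; the revision dicts and `is_current` key are hypothetical):

```python
def find_current_revision(revisions, initial=1):
    """Scrape progressively more of the history (newest first) until a
    current revision appears, or the history is exhausted (returns None)."""
    count = initial
    while count <= len(revisions):
        batch = revisions[:count]  # the revisions scraped so far
        for rev in batch:
            if rev["is_current"]:
                return rev
        count += 1  # scrape more revisions and retry
    return None

# After a page move, the newest revision records the move but the
# previous (n-1) revision is still the current one, so scraping only
# one revision would miss it:
history = [
    {"id": 5, "is_current": False},  # page-move revision
    {"id": 4, "is_current": True},
]
assert find_current_revision(history)["id"] == 4
assert find_current_revision([{"id": 1, "is_current": False}]) is None
```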
The issue with an empty … Other issues were translations where a root document was renamed but the child documents kept the old paths. For example, pl/docs/Narzędzia was the parent topic of pl/docs/Tools/WebIDE, and the scraper had issues following the parent relations, attempting to get the history of a redirect. None of the APIs expose the parent topic's URL, and the scraper assumes you can remove everything after the last slash to get it. In any case, I manually moved these child translations so that they matched their parent topic's slug.

I've now successfully scraped MDN with this code, and uploaded the sample DB to the downloads server. If you want to try out the sample DB script: it takes me about 40 minutes to run, makes a little over 10,000 requests, and takes over your working copy while doing it. Plan your work day accordingly! You can also try a more limited version by reducing the scraped items in the specification file.
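The parent-URL assumption described above amounts to stripping the last path segment (a sketch of the heuristic, not the scraper's exact code):

```python
def assumed_parent_topic_url(doc_path):
    # The APIs don't expose the parent topic's URL, so the scraper
    # assumes everything after the last slash can be dropped.
    return doc_path.rsplit("/", 1)[0]

assert assumed_parent_topic_url("/pl/docs/Tools/WebIDE") == "/pl/docs/Tools"
```

This is exactly where the renamed-root case breaks: the real parent lived under a different slug (pl/docs/Narzędzia), so the derived URL pointed at a redirect instead of the parent document.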
When scraping a page for Document URLs, omit the links to profiles, such as the contributor bar.
escattone left a comment
It's sad that more eyes won't review this, so you get the credit you deserve. This is great work with superb tests and test coverage (100% for every file under kuma/scrape). It deserves fanfare. 🎉 🍾
I've included some nitpicks for you to consider as you see fit, and after your response to those I'll merge it!
docs/data.rst Outdated
Create the Sample Database
==========================
These scraping tools are used to crete a sample database of public
Typo: replace crete with create
kuma/scrape/fixture.py Outdated
relations = metadata.get('relations', {})
filters = metadata.get('filters', {})
app_label, model_name = model_id.split('.')
assert apps.get_model(app_label, model_name)
Nitpick. It appears that app_label and model_name are not used other than for the apps.get_model call, so could remove the app_label, model_name = model_id.split('.') line and do assert apps.get_model(model_id) since apps.get_model supports the dotted notation.
Thanks, this worked. I didn't know that apps.get_model was flexible that way.
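The flexibility being discussed: Django's `apps.get_model` accepts either separate `(app_label, model_name)` arguments or the single dotted `'app_label.ModelName'` string. A toy stand-in (not Django itself, which needs a configured project to run) shows why the manual split becomes unnecessary:

```python
def get_model(app_label, model_name=None):
    """Mimics the two calling conventions of Django's apps.get_model."""
    if model_name is None:
        # Dotted form: 'app_label.ModelName'
        app_label, model_name = app_label.split(".")
    return (app_label, model_name)

# Both forms resolve identically, so callers can skip the split:
assert get_model("wiki", "Document") == get_model("wiki.Document")
```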
kuma/scrape/fixture.py Outdated
for model_id, items in self.spec.items():
    metadata = self.model_metadata[model_id]
    app_label, model_name = model_id.split('.')
    Model = apps.get_model(app_label, model_name)
Nitpick. As before, could remove the app_label, model_name = model_id.split('.') line and do Model = apps.get_model(model_id).
kuma/scrape/fixture.py Outdated
"""
Attempt to create or update an item.

Returns the instance if attempt suceeded, or None if related items are
Update comment to reflect that a NeedsDependency exception is raised if related items are needed.
    dest='ssl')

def handle(self, *arg, **options):
    self.setup_logging(options.get('verbosity'))
For sake of consistency, would prefer self.setup_logging(options['verbosity']).
Good catch, I thought I got all of these...
    params[param] = options[param]
scraper.add_source("links", path, **params)

scraper.scrape()
The scraper.scrape() call will handle reporting any error details, but do we also want to raise a CommandError exception, like the other commands, if any of the underlying sources fail? Maybe something like this:
sources = scraper.scrape()
if any(source.state != source.STATE_DONE for source in sources.values()):
    raise CommandError('Incomplete scraping of the links for "%s"' % path)
Sounds good. I think I should standardize on CommandError in all the management commands.
kuma/scrape/scraper.py Outdated
@@ -113,7 +115,8 @@ def create_source(self, source_key, **options):
    'complete, freshness=%(freshness)s, with %(dep_count)s'
    ' dependant source%(dep_s)s.')
Typo: should be dependent rather than dependant
kuma/scrape/scraper.py Outdated
    ' dependant source%(dep_s)s.')
_report_error = (_report_prefix +
    'errored, with %(dep_count)s dependant source%(dep_s)s.')
    'errored %(err_msg)s, with %(dep_count)s dependant'
Typo: dependent rather than dependant
kuma/scrape/scraper.py Outdated
    'errored %(err_msg)s, with %(dep_count)s dependant'
    ' source%(dep_s)s.')
_report_progress = (_report_prefix +
    'in state "%(state)s" with %(dep_count)s dependant'
Typo: dependent rather than dependant
kuma/scrape/sources/links.py Outdated
path = '/en-US/'  # Default to English homepage
path = urlparse(path).path
assert path.startswith('/')
self.locale = urlparse(path).path.split('/')[1]
Seems like a redundant call to urlparse(path).path, so could be self.locale = path.split('/')[1]?
Yep, totally ~~redundent~~ redundant
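The simplification works because `path` has already been reduced to a bare path by the earlier `urlparse(path).path` call, so re-parsing it is a no-op (a stdlib-only illustration of the two equivalent forms):

```python
from urllib.parse import urlparse

raw = "https://developer.mozilla.org/en-US/docs/Web?x=1"
path = urlparse(raw).path           # '/en-US/docs/Web'
assert urlparse(path).path == path  # re-parsing a bare path changes nothing

# So the locale can be read directly from the already-parsed path:
locale = path.split("/")[1]
assert locale == "en-US"
```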
Agreed 💯. I also hate to break the tradition, but I'm looking forward to seeing this finally merge this month.
- Update scrape_links docstring for abilities beyond the homepage
- Use options['verbosity'], since it is always populated
- Use CommandError to report source failures
- Remove unneeded parentheses
- Close the specification file after processing it
- Fix pluralization of the incomplete sources message
escattone left a comment
Thanks again @jwhitlock, this is a great addition to Kuma! 🎉
Thanks! I'm proud of this code, but I don't want anyone, including me, to follow my example of working on a feature for over a year. Small PRs, reviewed and merged often!
Now that PR #4245 and PR #4248 are merged, this PR is getting a little more sane:
- `scrape_links` to find document links on an MDN page. It is designed for the homepage, but can work on other pages as needed.
- `sample_mdn` that loads fixtures and scrapes production data from a JSON spec.

A recently generated sample database is at http://mdn-downloads.s3-website-us-west-2.amazonaws.com/mdn_sample_db.sql.gz
To Do:
- Remove `./manage.py sample_templates` and debug print statements that are now obvious once I've opened the PR.
- Refactor `gather` methods into sane methods.
- Rename `scrape_homepage` to a generic `scrape_links`.

Not this PR: