Performance limitations of Scrapy (and other non-service scraping/extraction solutions)

Question

I'm currently using a service that provides a simple-to-use API for setting up web scrapers for data extraction. The extraction is rather simple: grab the title (both text and hyperlink URL) and two other text attributes from each item in a list whose length varies from page to page, up to a maximum of 30 items.

The service performs this function well; however, it is somewhat slow, at about 300 pages per hour. I'm currently scraping up to 150,000 pages of time-sensitive data (I must use the data within a few days or it becomes "stale"), and I expect that number to grow several-fold. My workaround is to clone these scrapers dozens of times and run them simultaneously on small sets of URLs, but this makes the process much more complicated.
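
To put those numbers in perspective, here is a rough back-of-the-envelope calculation. The page count and the 300 pages/hour rate come from the figures above; the three-day deadline is an assumption standing in for "a few days".

```python
import math

PAGES = 150_000        # pages to scrape (from the question)
SERVICE_RATE = 300     # pages per hour for one scraper (from the question)
DEADLINE_DAYS = 3      # assumption: "a few days" taken as 3 days

# Time needed if a single scraper handles everything.
hours_single = PAGES / SERVICE_RATE
days_single = hours_single / 24

# Sustained rate needed to finish inside the assumed deadline.
required_rate = PAGES / (DEADLINE_DAYS * 24)

# Number of 300-pages/hour scrapers that would have to run in parallel.
clones_needed = math.ceil(required_rate / SERVICE_RATE)

print(f"single scraper: {hours_single:.0f} h ({days_single:.1f} days)")
print(f"required rate:  {required_rate:.0f} pages/hour")
print(f"clones needed:  {clones_needed}")
```

Under these assumptions a single scraper would need about three weeks, so anything that sustains a couple of thousand pages per hour would cover the current volume.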

My question is whether writing my own scraper with Scrapy (or some other tool) and running it from my own computer would perform better than this, or whether this volume is simply beyond what tools like Scrapy, Selenium, etc. can handle on a single, well-specced home computer (on a connection with 80 Mbit/s down and 8 Mbit/s up).

Thanks!

Most of the time spent is going to be network latency, something you can get around with multithreading / multiple processes. Selenium will always be slower because it loads additional assets.
– pguardiario, Commented Apr 8, 2015 at 0:21
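
A minimal sketch of the latency point, using only the standard library. The `fetch` function is hypothetical: it sleeps for a fixed time instead of making a real HTTP request, and the 0.05 s latency and 10 workers are arbitrary assumptions chosen so the example runs quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor

SIMULATED_LATENCY = 0.05  # seconds; stand-in for a real network round trip


def fetch(url):
    """Hypothetical fetch: sleeps instead of performing an HTTP request."""
    time.sleep(SIMULATED_LATENCY)
    return url, 200


def fetch_sequential(urls):
    # One request at a time: total time is roughly len(urls) * latency.
    return [fetch(u) for u in urls]


def fetch_concurrent(urls, workers=10):
    # Threads wait on "the network" in parallel; map() keeps input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))


def timed(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


urls = [f"https://example.com/page/{i}" for i in range(20)]
seq_result, seq_time = timed(fetch_sequential, urls)
con_result, con_time = timed(fetch_concurrent, urls)
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

Because each worker spends nearly all of its time waiting, the speedup is close to the number of workers until bandwidth, CPU, or the target server becomes the limit.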
Scrapy uses Twisted for asynchronous operation, and it can achieve much greater speeds than 300 requests per hour. I would expect >10k per hour. Make sure to watch out for bans.
– bosnjak, Commented Apr 8, 2015 at 7:30
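
To illustrate the asynchronous model in general terms, here is a sketch using Python's `asyncio` rather than Twisted, so it is not how Scrapy itself is implemented. The `fetch` coroutine is hypothetical and only sleeps; the latency and the cap of 8 requests in flight are assumptions.

```python
import asyncio
import time

SIMULATED_LATENCY = 0.05  # seconds; stand-in for a real network round trip
MAX_IN_FLIGHT = 8         # assumed cap on simultaneous requests


async def fetch(url, limiter):
    """Hypothetical fetch: awaits a sleep instead of doing real I/O."""
    async with limiter:  # at most MAX_IN_FLIGHT coroutines get past here
        await asyncio.sleep(SIMULATED_LATENCY)
        return url


async def crawl(urls):
    limiter = asyncio.Semaphore(MAX_IN_FLIGHT)
    # gather() runs all fetches on one thread and preserves input order.
    return await asyncio.gather(*(fetch(u, limiter) for u in urls))


urls = [f"https://example.com/page/{i}" for i in range(40)]
start = time.perf_counter()
pages = asyncio.run(crawl(urls))
elapsed = time.perf_counter() - start

# Extrapolated rate for this simulated latency only, not a real benchmark.
pages_per_hour = len(pages) / elapsed * 3600
print(f"{len(pages)} pages in {elapsed:.2f}s (~{pages_per_hour:,.0f}/hour simulated)")
```

The extrapolated figure only reflects the simulated latency. Real throughput depends on how fast the site responds and on how many concurrent requests it tolerates.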

Francesco Bovoli · Accepted Answer · 2015-04-11 11:45:17Z

You didn't provide the site you are trying to scrape, so I can only answer according to my general knowledge.

I agree Scrapy should be able to go faster than that.

With Bulk Extract, import.io is definitely faster: I have extracted 300 URLs in a minute. You may want to give it a try.

You do need to respect the website's terms of use.

Thanks! So the definitive answer is "yes, it will be much faster," but not using a service raises the issue of getting banned from the sites, which are large storefronts such as Home Depot, Lowe's, Toys R Us, etc. I'll look more into how to do this while being considerate of their servers :]
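
One possible starting point for being considerate is a conservative `settings.py` fragment for a Scrapy project. The setting names are Scrapy's; every value below is an illustrative assumption, not a recommendation for any particular site.

```python
# settings.py fragment (values are illustrative assumptions)

ROBOTSTXT_OBEY = True                # honor each site's robots.txt

CONCURRENT_REQUESTS_PER_DOMAIN = 2   # few simultaneous requests per site
DOWNLOAD_DELAY = 1.0                 # base delay (seconds) between requests to a site

# AutoThrottle adjusts the delay from observed response times and does not
# go below DOWNLOAD_DELAY.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Rough ceiling on throughput per site implied by the base delay alone.
max_pages_per_hour_per_site = 3600 / DOWNLOAD_DELAY
print(f"~{max_pages_per_hour_per_site:.0f} pages/hour per site at most")
```

Even with a one-second delay, the ceiling per site is roughly twelve times the 300 pages per hour mentioned in the question, and the crawl can be spread across several storefronts in parallel.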
