
Is there any way to speed up a web scraper by having multiple computers contribute to processing a list of urls? For example, computer A takes urls 1-500 and computer B takes urls 501-1000, and so on. I am looking for a way to build the fastest possible web scraper with resources available to everyday people.
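
For illustration, a rough sketch of that split, assuming every machine runs the same script and is given its own index (MACHINE_INDEX, TOTAL_MACHINES, and my_slice are made-up names for this example, not from any library):

    MACHINE_INDEX = 0     # 0 on computer A, 1 on computer B, and so on
    TOTAL_MACHINES = 2    # how many computers share the work

    def my_slice(url_list):
        # Return the contiguous slice of urls this machine is responsible for.
        per_machine = -(-len(url_list) // TOTAL_MACHINES)  # ceiling division
        start = MACHINE_INDEX * per_machine
        return url_list[start:start + per_machine]

Each machine then scrapes only my_slice(url_list); how the per-machine results get merged afterwards (shared folder, database, etc.) is a separate question.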

I am already making concurrent requests with the grequests module, which combines gevent and requests.

This scraping does not need to run constantly; it runs at a specific time each morning (6 A.M.) and should finish as soon as possible after it starts. I am looking for something quick and punctual.
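
For the timing requirement, a cron job (or Windows Task Scheduler entry) is the usual way to start a script at 6 A.M.; as a pure-Python alternative, here is a small sketch that sleeps until the next 6 A.M. local time (sleep_until is a made-up helper for this example, not from any library):

    import datetime
    import time

    def sleep_until(hour=6, minute=0):
        # Block until the next occurrence of hour:minute, local time.
        now = datetime.datetime.now()
        target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
        if target <= now:
            target += datetime.timedelta(days=1)  # 6 A.M. already passed today
        time.sleep((target - now).total_seconds())

    sleep_until(6)  # returns at 6 A.M.; start scraping right after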

Also, the urls I am scraping belong to retail stores (e.g. Target, Best Buy, Newegg), and I use the responses to check which items are in stock for the day.

This is the segment of the script I'm trying to put together that grabs those urls:

    import datetime
    import time

    import grequests

    thread_number = 20

    # product_number_list is a list of product numbers; the real list is too
    # big to include in full, so here are three examples:
    product_number_list = ['N82E16820232476', 'N82E16820233852', 'N82E16820313777']

    nnn = max(1, int(len(product_number_list) / 100))  # progress-report interval (min 1 so the modulo below never divides by zero)
    float_nnn = len(product_number_list) / 100

    base_url = 'https://www.newegg.com/Product/Product.aspx?Item={}'

    # Build the list of product urls, one per product number.
    url_list = []
    for number in product_number_list:
        url_list.append(base_url.format(number))

    results = []
    appended_number = 0
    for x in range(0, len(product_number_list), thread_number):
        attempts = 0
        while attempts < 10:
            try:
                rs = (grequests.get(url, stream=False) for url in url_list[x:x + thread_number])
                reqs = grequests.map(rs, stream=False, size=20)
                append = 'yes'
                for i in reqs:
                    # grequests.map returns None for requests that failed outright.
                    if i is None or i.status_code != 200:
                        append = 'no'
                        print('Bad Status Code. Nothing Appended.')
                        attempts += 1
                        break
                if append == 'yes':
                    appended_number += 1
                    results.extend(reqs)
                    break
            except Exception:
                print('Something went Wrong. Try Section Failed.')
                attempts += 1
                time.sleep(5)
        if appended_number % nnn == 0:
            now = datetime.datetime.today()
            print(str(int(20 * appended_number / float_nnn)) + '% of the way there at: ' + now.strftime('%I:%M:%S %p'))
        if attempts == 10:
            print('Failed ten times to get urls.')
            time.sleep(3600)

    if len(results) != len(url_list):
        print('Results count is off. len(results) == "' + str(len(results)) + '". len(url_list) == "' + str(len(url_list)) + '".')

This is not my code; it is sourced from these two links:

Using grequests to make several thousand get requests to sourceforge, get "Max retries exceeded with url"

Understanding requests versus grequests

Comments:

  • @GKFX I'd say botnets generally imply the other computers are illicitly under your control. Turning your 5 computers on your local network into a "botnet" is not the traditional meaning of the word. Commented Jun 1, 2018 at 21:10
  • grequests is not multiprocessing; it's running everything in a single process, in a single thread, on a single core, using a whole bunch of "threadlets". So, unless you have a non-hyperthreaded single-core processor (which I doubt), you can already speed things up by just using your other cores, without needing to drag in other machines (a sketch of that approach follows these comments). But unless the bottleneck is your CPU, that won't help. If it's your NIC or your OS, multiple computers will help. But if it's your LAN, or your router, or your upstream connection, even that won't do any good. Commented Jun 1, 2018 at 21:10
  • What might be a good idea is interspersing calls to different websites (so you contact e.g. Newegg and Best Buy in two different threads at the same time), as then you avoid scraping any one webservice too intensively. @TemporalWolf I'm exaggerating slightly for comic effect. You are of course right. Commented Jun 1, 2018 at 21:11
  • @RandomProgrammer The important question is: are you actually blocked on CPU power? When your program is running, use your Activity Monitor or Task Manager or whatever to see your CPU usage. If one core is at 100% and the others are doing nothing, going multiprocessor will help. If one core is at 35% and the others are doing nothing, CPU isn't your problem, so going multiprocessor will not help, and you'll need to look for other ways to scale. (It's even better to look at the CPU usage of your particular program, rather than the system as a whole, but for a quick-and-dirty check…) Commented Jun 1, 2018 at 21:19
  • Also, from what it looks like, your script potentially hits a server 200 times as fast as that server can respond... that's a good way to get your IP blacklisted. Commented Jun 1, 2018 at 21:21
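
Building on the multi-core comment above, here is a minimal sketch of the idea: split url_list into one chunk per worker process, and let each process run its own grequests.map over its chunk. fetch_chunk, split_into_chunks, and fetch_all are made-up names for this example, and mixing gevent's monkey-patching with multiprocessing can be platform-sensitive, so treat this as a starting point rather than a drop-in solution:

    import multiprocessing

    import grequests

    def fetch_chunk(urls):
        # Runs inside one worker process: fetch its urls concurrently via gevent.
        rs = (grequests.get(url, stream=False) for url in urls)
        return grequests.map(rs, size=20)

    def split_into_chunks(items, n):
        # Split items into n roughly equal contiguous chunks.
        size = -(-len(items) // n)  # ceiling division
        return [items[i:i + size] for i in range(0, len(items), size)]

    def fetch_all(url_list, processes=4):
        with multiprocessing.Pool(processes) as pool:
            per_chunk = pool.map(fetch_chunk, split_into_chunks(url_list, processes))
        # Flatten the per-chunk response lists back into one list.
        return [resp for chunk in per_chunk for resp in chunk]

    if __name__ == '__main__':
        urls = ['https://www.newegg.com/Product/Product.aspx?Item=N82E16820232476',
                'https://www.newegg.com/Product/Product.aspx?Item=N82E16820233852']
        responses = fetch_all(urls, processes=2)
        print([r.status_code for r in responses if r is not None])

Note that grequests.map returns None for requests that failed, so the flattened list may contain None entries to filter out before parsing.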
