Is there any way to speed up a web-scraper by having multiple computers contribute to processing a list of urls? Like computer A takes urls 1 - 500 and computer B takes urls 501 - 1000, etc. I am looking for a way to build the fastest possible web scraper with resources available to everyday people.
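The split described above (computer A takes one contiguous slice, computer B the next) can be sketched as a small partitioning helper. This is a hypothetical illustration, not part of the script below; the `example.com` urls are placeholders.

```python
def split_urls(urls, machines):
    """Divide urls into `machines` contiguous chunks of near-equal size."""
    chunk = -(-len(urls) // machines)  # ceiling division
    return [urls[i:i + chunk] for i in range(0, len(urls), chunk)]

# 1000 placeholder urls split across two machines:
urls = ['https://example.com/item/{}'.format(n) for n in range(1000)]
parts = split_urls(urls, 2)
# parts[0] holds urls 0-499 for computer A, parts[1] holds urls 500-999 for B
```

Each machine would then be handed its own slice (e.g. via a shared file, a queue, or command-line arguments) and run the same scraping loop over it.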
I am already using multiprocessing from the grequests module, which is gevent + requests combined.
This scraping does not need to run constantly, but at a specific time each morning (6 A.M.), and it should finish as soon as possible after it starts. I am looking for something quick and punctual.
I am looking through urls for retail stores (e.g. Target, Best Buy, Newegg), and using the scraper to check which items are in stock for the day.
This is a code segment for grabbing those urls in the script I'm trying to put together:
import datetime
import time
import grequests

thread_number = 20

# product_number_list is a list of product numbers, too big for me to
# include in full. Here are three:
product_number_list = ['N82E16820232476', 'N82E16820233852', 'N82E16820313777']

nnn = int(len(product_number_list) / 100)
float_nnn = len(product_number_list) / 100

base_url = 'https://www.newegg.com/Product/Product.aspx?Item={}'

# Build one url per product number.
url_list = []
for number in product_number_list:
    url_list.append(base_url.format(number))

results = []
appended_number = 0
for x in range(0, len(product_number_list), thread_number):
    attempts = 0
    while attempts < 10:
        try:
            rs = (grequests.get(url, stream=False) for url in url_list[x:x + thread_number])
            reqs = grequests.map(rs, stream=False, size=20)
            append = 'yes'
            for i in reqs:
                if i.status_code != 200:
                    append = 'no'
                    print('Bad Status Code. Nothing Appended.')
                    attempts += 1
                    break
            if append == 'yes':
                appended_number += 1
                results.extend(reqs)
                break
        except:
            print('Something went Wrong. Try Section Failed.')
            attempts += 1
            time.sleep(5)
    if appended_number % nnn == 0:
        now = datetime.datetime.today()
        print(str(int(20 * appended_number / float_nnn)) + '% of the way there at: ' + str(now.strftime("%I:%M:%S %p")))
    if attempts == 10:
        print('Failed ten times to get urls.')
        time.sleep(3600)

if len(results) != len(url_list):
    print('Results count is off. len(results) == "' + str(len(results)) + '". len(url_list) == "' + str(len(url_list)) + '".')

This is not my code; it is sourced from these two links:
grequests is not multiprocessing; it runs everything in a single process, in a single thread, on a single core, using a whole bunch of "threadlets". So, unless you have a non-hyperthreaded single-core processor (which I doubt), you can already speed things up by just using your other cores, without needing to drag in other machines. But unless the bottleneck is your CPU, that won't help. If the bottleneck is your NIC or your OS, multiple computers will help. But if it's your LAN, or your router, or your upstream connection, even that won't do any good.