
I'm trying to improve the speed of my web scraper; I have thousands of sites I need to get info from. I want to pull the Facebook and Yelp ratings and review counts that appear in the Google search results page for each site. I would normally just use an API, but because I have a huge list of sites to search for and time is of the essence, Facebook's small hourly request limits make it infeasible to use their Graph API (I've tried...). My sites are all in Google search pages. What I have so far (I have provided 8 sample sites for reproducibility):

from multiprocessing.dummy import Pool
import requests
from bs4 import BeautifulSoup

pools = Pool(8)  # My computer has 8 cores
proxies = MY_PROXIES

# How I set up my urls for requests on Google searches.
# Since each item has a "+" in between in a Google search, I have to format
# my urls to copy it.
site_list = ['Golden Gate Bridge', 'Statue of Liberty', 'Empire State Building',
             'Millennium Park', 'Gum Wall', 'The Alamo', 'National Art Gallery',
             'The Bellagio Hotel']
urls = list(map(lambda x: "+".join(x.split(" ")), site_list))

def scrape_google(url_list):
    info = []
    for i in url_list:
        reviews = {'FB Rating': None, 'FB Reviews': None,
                   'Yelp Rating': None, 'Yelp Reviews': None}
        request = requests.get(i, proxies=proxies, verify=False).text
        search = BeautifulSoup(request, 'lxml')
        results = search.find_all('div', {'class': 's'})  # Where the ratings roughly are
        for j in results:
            if 'Rating' in str(j.findChildren()) and 'yelp' in str(j.findChildren()[1]):
                # Had to brute-force get the ratings this way.
                reviews['Yelp Rating'] = str(j.findChildren()).partition('Rating')[2].split()[1]
                reviews['Yelp Reviews'] = str(j.findChildren()).partition('Rating')[2].split()[3]
            elif 'Rating' in str(j.findChildren()) and 'facebook' in str(j.findChildren()[1]):
                reviews['FB Rating'] = str(j.findChildren()).partition('Rating')[2].split()[1]
                reviews['FB Reviews'] = str(j.findChildren()).partition('Rating')[2].split()[3]
        info.append(reviews)
    return info

results = pools.map(scrape_google, urls)

I tried something similar to this, but I think I'm getting way too many duplicated results. Will multithreading make this run more quickly? I did diagnostics on my code to see which parts took up the most time, and by far getting the requests was the rate-limiting factor.

EDIT: I just tried this out, and I get the following error:

Invalid URL 'h': No schema supplied. Perhaps you meant http://h? 

I don't understand what the problem is, because if I try my scrape_google function without multithreading, it works just fine (albeit very, very slowly), so URL validity should not be an issue.

1 Answer


Yes, multithreading will probably make it run more quickly.

As a very rough rule of thumb, you can usually profitably make about 8-64 requests in parallel, as long as no more than 2-12 of them are to the same host. So, one dead-simple way to apply that is to just toss all of your requests into a concurrent.futures.ThreadPoolExecutor with, say, 8 workers.

In fact, that's the main example for ThreadPoolExecutor in the docs.
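As a minimal sketch of what that could look like here (the fetch_page helper, the timeout, and the example search URLs are placeholders I'm assuming, not code from your question):

import concurrent.futures
import requests

# Hypothetical helper: fetch one page and return its HTML text.
def fetch_page(url):
    return requests.get(url, timeout=10).text

urls = [
    'https://www.google.com/search?q=Golden+Gate+Bridge',
    'https://www.google.com/search?q=Statue+of+Liberty',
]

# 8 workers stays well inside the rough 8-64 parallel-requests guideline above.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    pages = list(executor.map(fetch_page, urls))

executor.map returns results in the same order as the input iterable, so pages[0] corresponds to urls[0], and so on.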

(By the way, the fact that your computer has 8 cores is irrelevant here. Your code isn't CPU-bound, it's I/O bound. If you do 12 requests in parallel, or even 500 of them, at any given moment, almost all of your threads are waiting on a socket.recv or similar call somewhere, blocking until the server responds, so they aren't using your CPU.)


However:

I think I'm getting way too many duplicated results

Fixing this may help far more than threading. Although, of course, you can do both.

I have no idea what your issue is here from the limited information you provided, but there's a pretty obvious workaround: Keep a set of everything you've seen so far. Whenever you get a new URL, if it's already in the set, throw it away instead of queuing up a new request.
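A minimal sketch of that idea, assuming urls is the list you were about to queue (the variable names are just placeholders):

urls = ['a', 'b', 'a', 'c']  # example input containing a duplicate

seen = set()
deduped = []
for url in urls:
    if url in seen:
        continue  # already seen: throw it away instead of queuing another request
    seen.add(url)
    deduped.append(url)

# deduped == ['a', 'b', 'c']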


Finally:

I would normally just use an API, but because I have a huge list of sites to search for and time is of the essence, Facebook's small hourly request limits make it infeasible

If you're trying to get around the rate limits for a major site, (a) you're probably violating their T&C, and (b) you're almost surely going to trigger some kind of detection and get yourself blocked.1


In your edited question, you attempted to do this with multiprocessing.dummy.Pool.map, which is fine—but you're getting the arguments wrong.

Your function takes a list of urls and loops over them:

def scrape_google(url_list):
    # ...
    for i in url_list:

But then you call it with a single URL at a time:

results = pools.map(scrape_google, urls) 

This is similar to using the builtin map, or a list comprehension:

results = map(scrape_google, urls)
results = [scrape_google(url) for url in urls]

What happens if you get a single URL instead of a list of them, but try to use it as a list? A string is a sequence of its characters, so you loop over the characters of the URL one by one, trying to download each character as if it were a URL.
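That is exactly where the error from your edit comes from: the first character of a URL starting with http is 'h', and that lone character is what requests ends up trying to fetch. A tiny illustration:

url = 'http://example.com'  # a single URL, not a list of URLs

for i in url:     # iterating over a string yields its characters
    print(i)      # the first iteration prints 'h'
    break

# requests.get('h') then fails with:
# "Invalid URL 'h': No schema supplied. Perhaps you meant http://h?"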

So, you want to change your function, like this:

def scrape_google(url):
    reviews = # …
    request = requests.get(url, proxies=proxies, verify=False).text
    # …
    return reviews

Now it takes a single URL, and returns a dict of reviews for that URL. pools.map will call it once per URL, and give you back an iterable of review dicts, one per URL.
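For concreteness, here is one way the corrected code could look end to end. The brute-force parsing is taken from your question, and the empty proxies dict and example search URL are stand-ins I'm assuming, so treat this as a sketch rather than tested code:

import requests
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool

proxies = {}  # stand-in for MY_PROXIES from the question
pools = Pool(8)

def scrape_google(url):
    # One dict of results per URL, instead of one list per list of URLs.
    reviews = {'FB Rating': None, 'FB Reviews': None,
               'Yelp Rating': None, 'Yelp Reviews': None}
    page = requests.get(url, proxies=proxies, verify=False).text
    search = BeautifulSoup(page, 'lxml')
    for j in search.find_all('div', {'class': 's'}):
        text = str(j.findChildren())
        if 'Rating' in text and 'yelp' in str(j.findChildren()[1]):
            reviews['Yelp Rating'] = text.partition('Rating')[2].split()[1]
            reviews['Yelp Reviews'] = text.partition('Rating')[2].split()[3]
        elif 'Rating' in text and 'facebook' in str(j.findChildren()[1]):
            reviews['FB Rating'] = text.partition('Rating')[2].split()[1]
            reviews['FB Reviews'] = text.partition('Rating')[2].split()[3]
    return reviews

urls = ['https://www.google.com/search?q=Golden+Gate+Bridge']  # assumed URL format

# pools.map now hands scrape_google one URL at a time, as intended.
results = pools.map(scrape_google, urls)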


1. Or maybe something more creative. Someone posted a question on SO a few years ago about a site that apparently sent corrupted responses that seem to have been specifically crafted to waste exponential CPU for a typical scraper regex…


5 Comments

What's the difference between simply using pools.map and using ThreadPoolExecutor? Also, after all threads do their thing, will they be joined in the order of my list of urls or in the order of completion of each thread (i.e. the first thread that completed its respective list of urls will be the first batch in the output list)?
@J.Buck For simple uses, not much difference. If you need to, e.g., compose results, a Future is better than an AsyncResult; if you need to manage the queues manually, multiprocessing gives lower-level functionality; etc. But if all you're doing is mapping a function over an iterable of arguments in parallel and getting the results back in order, they do essentially the same thing.
@J.Buck The problem with your edited code is that the function you're mapping wants a list of URLs, but by calling it with Pool.map, you're calling it with one URL at a time. I can edit the answer to explain in more detail.
Yes please. What do I have to do to my function to make it work properly?
That's what I thought was up. Thank you so much!
