Yes, multithreading will probably make it run more quickly.

As a very rough rule of thumb, you can usually profitably make about 8-64 requests in parallel, as long as no more than 2-12 of them are to the same host. So, one dead-simple way to apply that is to just toss all of your requests into a concurrent.futures.ThreadPoolExecutor with, say, 8 workers.

In fact, that's the main example for ThreadPoolExecutor in the docs.

(By the way, the fact that your computer has 8 cores is irrelevant here. Your code isn't CPU-bound, it's I/O-bound. If you make 12 requests in parallel, or even 500 of them, at any given moment almost all of your threads are waiting on a socket.recv or similar call, blocked until the server responds, so they aren't using your CPU.)

In your edited question, you attempted to do this with multiprocessing.dummy.Pool.map, which is fine, but you're getting the arguments wrong.

Your function takes a list of URLs and loops over them:

    def scrape_google(url_list):
        # ...
        for i in url_list:
            # ...

But then you call it with a single URL at a time:

    results = pools.map(scrape_google, urls)

This is similar to using the builtin map, or a list comprehension:

    results = map(scrape_google, urls)
    results = [scrape_google(url) for url in urls]

What happens when your function gets a single URL instead of a list of them, but tries to use it as a list? A string is a sequence of its characters, so it loops over the URL one character at a time, trying to download each character as if it were a URL.

So, you want to change your function to take a single URL:

    def scrape_google(url):
        reviews = ...
        request = requests.get(url, proxies=proxies, verify=False).text
        # ...
        return reviews

Now it takes a single URL and returns the reviews for that URL. pools.map will call it with each URL and give you back an iterable of reviews, one per URL.
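
To tie the two pieces together, here is a minimal sketch of the whole pattern using concurrent.futures.ThreadPoolExecutor as suggested above; the urls list, the proxies dict, and the parsing step are placeholders rather than the asker's actual code:

    import concurrent.futures
    import requests

    # Placeholder inputs: substitute the real URL list and proxy settings.
    urls = ["https://example.com/a", "https://example.com/b"]
    proxies = {}

    def scrape_google(url):
        # One URL per call, as described above. verify=False mirrors the code in the question.
        text = requests.get(url, proxies=proxies, verify=False).text
        reviews = text  # placeholder: the real code would parse reviews out of `text`
        return reviews

    # 8 workers: each scrape_google call runs on its own thread.
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        results = list(executor.map(scrape_google, urls))

multiprocessing.dummy.Pool(8).map(scrape_google, urls) would work the same way; the important part is that the mapped function handles one URL per call.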


However:

"I think I'm getting way too many duplicated result"

Fixing this may help far more than threading, although, of course, you can do both.

I have no idea what your issue is here from the limited information you provided, but there's a pretty obvious workaround: Keep a set of everything you've seen so far. Whenever you get a new URL, if it's already in the set, throw it away instead of queuing up a new request.
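
As a rough sketch of that idea (candidate_urls is just an illustrative name, not something from the asker's code):

    # Keep a set of everything seen so far; only new URLs get queued for a request.
    seen = set()
    to_fetch = []
    candidate_urls = ["https://example.com/a", "https://example.com/b", "https://example.com/a"]
    for url in candidate_urls:
        if url in seen:
            continue  # duplicate: throw it away instead of queuing another request
        seen.add(url)
        to_fetch.append(url)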


Finally:

"I would just use an API normally, but because I have a huge list of sites to search for and time is of the essence, Facebook's small request limits per hour make this not feasible"

If you're trying to get around the rate limits for a major site, (a) you're probably violating their T&C, and (b) you're almost surely going to trigger some kind of detection and get yourself blocked.¹


¹ Or maybe something more creative. Someone posted a question on SO a few years ago about a site that apparently sent corrupted responses that seem to have been specifically crafted to waste exponential CPU for a typical scraper regex…