Yes, multithreading will probably make it run more quickly.

As a very rough rule of thumb, you can usually profitably make about 8-64 requests in parallel, as long as no more than 2-12 of them are to the same host. So, one dead-simple way to apply that is to just toss all of your requests into a concurrent.futures.ThreadPoolExecutor with, say, 8 workers.

In fact, that's the main example for ThreadPoolExecutor in the docs.

(By the way, the fact that your computer has 8 cores is irrelevant here. Your code isn't CPU-bound, it's I/O-bound. If you make 12 requests in parallel, or even 500 of them, at any given moment almost all of your threads are waiting on a socket.recv or similar call, blocked until the server responds, so they aren't using your CPU.)

In your edited question, you attempted to do this with multiprocessing.dummy.Pool.map, which is fine, but you're getting the arguments wrong.

Your function takes a list of URLs and loops over them:

    def scrape_google(url_list):
        # ...
        for i in url_list:
            # ...

But then you call it with a single URL at a time:

    results = pools.map(scrape_google, urls)

This is similar to using the builtin map, or a list comprehension:

    results = map(scrape_google, urls)
    results = [scrape_google(url) for url in urls]

What happens when your function gets a single URL instead of a list of them, but tries to use it as a list? A string is a sequence of its characters, so it loops over the URL one character at a time, trying to download each character as if it were a URL.

So, you want to change your function to take a single URL:

    def scrape_google(url):
        reviews = ...
        request = requests.get(url, proxies=proxies, verify=False).text
        # ...
        return reviews

Now it takes a single URL and returns the reviews for that URL. pools.map will call it with each URL and give you back an iterable of reviews, one per URL.
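
To tie the two pieces together, here is a minimal sketch of the whole pattern using concurrent.futures.ThreadPoolExecutor as suggested above; the urls list, the proxies dict, and the parsing step are placeholders rather than the asker's actual code:

    import concurrent.futures
    import requests

    # Placeholder inputs: substitute the real URL list and proxy settings.
    urls = ["https://example.com/a", "https://example.com/b"]
    proxies = {}

    def scrape_google(url):
        # One URL per call, as described above. verify=False mirrors the code in the question.
        text = requests.get(url, proxies=proxies, verify=False).text
        reviews = text  # placeholder: the real code would parse reviews out of `text`
        return reviews

    # 8 workers: each scrape_google call runs on its own thread.
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        results = list(executor.map(scrape_google, urls))

multiprocessing.dummy.Pool(8).map(scrape_google, urls) would work the same way; the important part is that the mapped function handles one URL per call.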


However:

"I think I'm getting way too many duplicated result"

Fixing this may help far more than threading, although, of course, you can do both.

I have no idea what your issue is here from the limited information you provided, but there's a pretty obvious workaround: Keep a set of everything you've seen so far. Whenever you get a new URL, if it's already in the set, throw it away instead of queuing up a new request.
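
As a rough sketch of that idea (candidate_urls is just an illustrative name, not something from the asker's code):

    # Keep a set of everything seen so far; only new URLs get queued for a request.
    seen = set()
    to_fetch = []
    candidate_urls = ["https://example.com/a", "https://example.com/b", "https://example.com/a"]
    for url in candidate_urls:
        if url in seen:
            continue  # duplicate: throw it away instead of queuing another request
        seen.add(url)
        to_fetch.append(url)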


Finally:

"I would just use an API normally, but because I have a huge list of sites to search for and time is of the essence, Facebook's small request limits per hour make this not feasible"

If you're trying to get around the rate limits for a major site, (a) you're probably violating their T&C, and (b) you're almost surely going to trigger some kind of detection and get yourself blocked.¹


¹ Or maybe something more creative. Someone posted a question on SO a few years ago about a site that apparently sent corrupted responses that seem to have been specifically crafted to waste exponential CPU for a typical scraper regex…