I'm trying to improve the speed of my web scraper; I have thousands of sites I need to get info from. Specifically, I want the Facebook and Yelp ratings (and the number of ratings) that appear in the Google search results page for each site. I would normally just use an API, but because I have a huge list of sites to search for and time is of the essence, Facebook's small hourly request limits make their Graph API infeasible (I've tried...). My sites are all in Google search pages. Here's what I have so far (I've provided 8 sample sites for reproducibility):
```python
from multiprocessing.dummy import Pool
import requests
from bs4 import BeautifulSoup

pools = Pool(8)  # My computer has 8 cores

proxies = MY_PROXIES

# How I set up my urls for requests on Google searches.
# Since each search term is joined with "+" in a Google search URL,
# I format my urls to match that.
site_list = ['Golden Gate Bridge', 'Statue of Liberty', 'Empire State Building',
             'Millennium Park', 'Gum Wall', 'The Alamo', 'National Art Gallery',
             'The Bellagio Hotel']
urls = list(map(lambda x: 'https://www.google.com/search?q=' + '+'.join(x.split(' ')),
                site_list))

def scrape_google(url_list):
    info = []
    for i in url_list:
        reviews = {'FB Rating': None, 'FB Reviews': None,
                   'Yelp Rating': None, 'Yelp Reviews': None}
        request = requests.get(i, proxies=proxies, verify=False).text
        search = BeautifulSoup(request, 'lxml')
        results = search.find_all('div', {'class': 's'})  # Where the ratings roughly are
        for j in results:
            if 'Rating' in str(j.findChildren()) and 'yelp' in str(j.findChildren()[1]):
                # Had to brute-force get the ratings this way.
                reviews['Yelp Rating'] = str(j.findChildren()).partition('Rating')[2].split()[1]
                reviews['Yelp Reviews'] = str(j.findChildren()).partition('Rating')[2].split()[3]
            elif 'Rating' in str(j.findChildren()) and 'facebook' in str(j.findChildren()[1]):
                reviews['FB Rating'] = str(j.findChildren()).partition('Rating')[2].split()[1]
                reviews['FB Reviews'] = str(j.findChildren()).partition('Rating')[2].split()[3]
        info.append(reviews)
    return info

results = pools.map(scrape_google, urls)
```

I tried an approach similar to this, but I think I'm getting far too many duplicated results. Will multithreading make this run more quickly? I profiled my code to see which parts took the most time, and by far the requests were the rate-limiting factor.
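For context, here's roughly how I checked where the time goes. This is a simplified, illustrative sketch for a single URL (no proxies; the URL and timing variables here are just for the example):

```python
import time
import requests
from bs4 import BeautifulSoup

url = 'https://www.google.com/search?q=Golden+Gate+Bridge'

t0 = time.perf_counter()
html = requests.get(url).text                    # the network request
t1 = time.perf_counter()
soup = BeautifulSoup(html, 'lxml')               # parsing the response
results = soup.find_all('div', {'class': 's'})   # the same rough extraction step
t2 = time.perf_counter()

print(f'request: {t1 - t0:.2f}s, parsing: {t2 - t1:.2f}s')
```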
EDIT: I just tried this out, and I get the following error:
Invalid URL 'h': No schema supplied. Perhaps you meant http://h?

I don't understand what the problem is, because if I call my scrape_google function without multithreading it works just fine (albeit very, very slowly), so URL validity shouldn't be the issue.
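For reference, this is the difference between the call that works and the one that fails (a simplified sketch; urls and pools are the same objects defined in the code above):

```python
# Works, but very slowly: the whole list of URLs goes to one call of scrape_google
results = scrape_google(urls)

# Raises "Invalid URL 'h': No schema supplied" when I switch to the pool
results = pools.map(scrape_google, urls)
```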