
I'm trying to scrape a JavaScript website using Scrapy and Selenium. I open the site with Selenium and a Chrome driver, scrape all the links to the different listings from the current page with Scrapy selectors, and store them in a list (this has been the most reliable approach so far; trying to follow links with SeleniumRequest and calling back to a parse-new-page function caused a lot of errors). Then I loop through the list of URLs, open each one in the Selenium driver, and scrape the info from the page. So far this scrapes about 16 pages/minute, which is not ideal given the number of listings on this site. I would ideally have the Selenium drivers opening links in parallel, like the following implementations:

How can I make Selenium run in parallel with Scrapy?

https://gist.github.com/miraculixx/2f9549b79b451b522dde292c4a44177b

However, I can't figure out how to implement parallel processing in my Selenium-Scrapy code.

import scrapy
import time
from scrapy.selector import Selector
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC


class MarketPagSpider(scrapy.Spider):
    name = 'marketPagination'

    def start_requests(self):
        yield SeleniumRequest(
            url="https://www.cryptoslam.io/nba-top-shot/marketplace",
            wait_time=5,
            wait_until=EC.presence_of_element_located((By.XPATH, '//SELECT[@name="table_length"]')),
            callback=self.parse
        )

    responses = []

    def parse(self, response):
        # initialize driver
        driver = response.meta['driver']
        driver.set_window_size(1920, 1080)
        time.sleep(1)
        WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, "(//th[@class='nowrap sorting'])[1]"))
        )

        # build a Scrapy selector from the rendered page source
        response_obj = Selector(text=driver.page_source)

        # collect the listing links from the current page
        rows = response_obj.xpath("//tbody/tr[@role='row']")
        for row in rows:
            link = row.xpath(".//td[4]/a/@href").get()
            absolute_url = response.urljoin(link)
            self.responses.append(absolute_url)

        # visit each listing sequentially and scrape its details
        for resp in self.responses:
            driver.get(resp)
            html = driver.page_source
            response_obj = Selector(text=html)

            yield {
                'name': response_obj.xpath("//div[@class='ibox-content animated fadeIn fetchable-content js-attributes-wrapper']/h4[4]/span/a/text()").get(),
                'price': response_obj.xpath("//span[@class='js-auction-current-price']/text()").get()
            }

I know that scrapy-splash can handle multiprocessing, but the website I'm trying to scrape doesn't seem to open in Splash (at least I don't think it does).

Also, I've removed the pagination code to keep the example concise.

I'm very new to this and open to any suggestions and solutions to multiprocessing with selenium.

  • Post your multiprocessing code; it works as usual, but each "thread / process" should use its own driver. Commented Feb 5, 2021 at 9:21
  • @Wonka I'm not really sure how to implement that. I'm very unfamiliar with the multiprocessing library in general, I apologize Commented Feb 5, 2021 at 18:44
  • See this question (stackoverflow.com/questions/53475578/…) for the basic technique; look at both the accepted answer and my (Booboo) answer, which ensures that the drivers terminate when you are done. The accepted answer uses one driver per thread instead of one driver per URL. In other words, it reuses the drivers just as you reuse your driver for all the URLs in your non-threading code. Commented Feb 6, 2021 at 12:25
  • @Booboo Hey thanks for your answer! I managed to get selenium to multiprocess like your solution. However, I can't seem to delete the drivers after the script is done even though I put del threadlocal at the end. I actually end up getting this error: NameError: name 'threadLocal' is not defined Commented Feb 7, 2021 at 3:54
  • The accepted answer includes the declaration threadLocal = threading.local(). I didn't copy that required line into my answer on the assumption that it was understood. I have now updated the answer to make the declaration explicit. Commented Feb 7, 2021 at 12:50

1 Answer


The following sample program creates a thread pool with only 2 threads for demo purposes and then scrapes 4 URLs to get their titles:

from multiprocessing.pool import ThreadPool
from bs4 import BeautifulSoup
from selenium import webdriver
import threading
import gc


class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        # suppress logging:
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        self.driver = webdriver.Chrome(options=options)
        print('The driver was just created.')

    def __del__(self):
        self.driver.quit()  # clean up driver when we are cleaned up
        print('The driver has terminated.')


threadLocal = threading.local()


def create_driver():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver


def get_title(url):
    driver = create_driver()
    driver.get(url)
    source = BeautifulSoup(driver.page_source, "lxml")
    title = source.select_one("title").text
    print(f"{url}: '{title}'")


# just 2 threads in our pool for demo purposes:
with ThreadPool(2) as pool:
    urls = [
        'https://www.google.com',
        'https://www.microsoft.com',
        'https://www.ibm.com',
        'https://www.yahoo.com'
    ]
    pool.map(get_title, urls)
    # must be done before terminate is explicitly or implicitly called on the pool:
    del threadLocal
    gc.collect()
# pool.terminate() is called at exit of with block

Prints:

The driver was just created.
The driver was just created.
https://www.google.com: 'Google'
https://www.microsoft.com: 'Microsoft - Official Home Page'
https://www.ibm.com: 'IBM - United States'
https://www.yahoo.com: 'Yahoo'
The driver has terminated.
The driver has terminated.
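
To map this back onto the spider in the question, below is a rough, untested sketch of how the same thread-local-driver pattern could replace the sequential loop over self.responses. It assumes the Driver class and create_driver() helper above are defined in the same module; scrape_listing is a hypothetical worker name, and the XPaths are copied from the question. Note that running a blocking ThreadPool inside a Scrapy callback will tie up the reactor while it works through the URLs.

from multiprocessing.pool import ThreadPool
from scrapy.selector import Selector

def scrape_listing(url):
    # one listing URL per call; each pool thread reuses its own driver
    driver = create_driver()
    driver.get(url)
    sel = Selector(text=driver.page_source)
    return {
        'name': sel.xpath("//div[@class='ibox-content animated fadeIn fetchable-content js-attributes-wrapper']/h4[4]/span/a/text()").get(),
        'price': sel.xpath("//span[@class='js-auction-current-price']/text()").get(),
    }

def parse(self, response):
    # ... collect the listing links into self.responses exactly as in the question ...
    with ThreadPool(4) as pool:  # pool size is a guess; tune it for your machine
        for item in pool.map(scrape_listing, self.responses):
            yield item

The same del threadLocal / gc.collect() cleanup shown in the demo still applies once every URL has been processed, so that the drivers are quit before the pool is terminated.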