I'm trying to scrape a JavaScript-rendered website using Scrapy and Selenium. I open the site with Selenium and a Chrome driver, scrape the links to the individual listings from the current page using Scrapy, and store them in a list. (This has been the most reliable approach so far: trying to follow the links with SeleniumRequest and a callback to a page-parsing function caused a lot of errors.) Then I loop through the list of URLs, open each one in the Selenium driver, and scrape the info from the page. So far this scrapes 16 pages per minute, which is not ideal given the number of listings on the site. Ideally, I would have Selenium drivers opening links in parallel, as in the following implementations:
How can I make Selenium run in parallel with Scrapy?
https://gist.github.com/miraculixx/2f9549b79b451b522dde292c4a44177b
However, I can't figure out how to implement parallel processing in my selenium-scrapy code.

    import scrapy
    import time
    from scrapy.selector import Selector
    from scrapy_selenium import SeleniumRequest
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support.ui import Select
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC


    class MarketPagSpider(scrapy.Spider):
        name = 'marketPagination'

        def start_requests(self):
            yield SeleniumRequest(
                url="https://www.cryptoslam.io/nba-top-shot/marketplace",
                wait_time=5,
                wait_until=EC.presence_of_element_located((By.XPATH, '//SELECT[@name="table_length"]')),
                callback=self.parse
            )

        # absolute URLs of the listings found on the current page
        responses = []

        def parse(self, response):
            # initialize driver
            driver = response.meta['driver']
            driver.set_window_size(1920, 1080)

            time.sleep(1)
            WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.XPATH, "(//th[@class='nowrap sorting'])[1]"))
            )

            # build a selector from the rendered page before querying the table rows
            response_obj = Selector(text=driver.page_source)
            rows = response_obj.xpath("//tbody/tr[@role='row']")
            for row in rows:
                link = row.xpath(".//td[4]/a/@href").get()
                absolute_url = response.urljoin(link)
                self.responses.append(absolute_url)

            # visit each listing one at a time in the same driver (the slow part)
            for resp in self.responses:
                driver.get(resp)
                html = driver.page_source
                response_obj = Selector(text=html)

                yield {
                    'name': response_obj.xpath("//div[@class='ibox-content animated fadeIn fetchable-content js-attributes-wrapper']/h4[4]/span/a/text()").get(),
                    'price': response_obj.xpath("//span[@class='js-auction-current-price']/text()").get()
                }

I know that scrapy-splash can handle multiprocessing, but the website I'm trying to scrape doesn't open in Splash (at least I don't think it does).
I've also removed the pagination code to keep the example concise.
I'm very new to this and open to any suggestions and solutions to multiprocessing with selenium.
threadLocal = threading.local(). I didn't copy to my answer that required line on the assumption that it was understood. I have now updated the answer to make that declaration explicit.