
In my Django app I use Selenium to crawl and parse some HTML pages. I tried to introduce multiprocessing to improve performance. This is my code:

```python
import os
from selenium import webdriver
from multiprocessing import Pool

os.environ["DISPLAY"] = ":56017"

def render_js(url):
    driver = webdriver.Firefox()
    driver.set_page_load_timeout(300)
    driver.get(url)
    text = driver.page_source
    driver.quit()
    return text

def parsing(url):
    text = render_js(url)
    # ... parsing the text ...
    # ... write in db ...

url_list = ['www.google.com', 'www.python.com', 'www.microsoft.com']
pool = Pool(processes=2)
pool.map_async(parsing, url_list)
pool.close()
pool.join()
```

I get this error when two processes run simultaneously and both use Selenium: the first process starts Firefox with 'www.google.it' and returns the correct text, but the second, with the URL 'www.python.com', returns the text of www.google.it and not of www.python.com. Can you tell me where I'm going wrong?
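One way to check whether the pool is really handing each URL to a separate worker is to tag every result with the worker's PID. This is a minimal, Selenium-free sketch (the `fetch` function is a stand-in for `render_js`, not part of the original code):

```python
import os
from multiprocessing import Pool

def fetch(url):
    # Stand-in for render_js(): return the worker's PID along with the
    # URL so we can see which process handled which task.
    return (os.getpid(), url)

if __name__ == '__main__':
    with Pool(processes=2) as pool:
        results = pool.map(fetch, ['www.google.com', 'www.python.com', 'www.microsoft.com'])
    for pid, url in results:
        print(pid, url)
```

If each URL comes back paired with its own input, the pool itself is distributing work correctly, and the mix-up lies in how the browser sessions share state (here, the single X display set via `DISPLAY`).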

  • You definitely don't need Selenium just to scrape a page for its HTML; this is likely where your performance issue is. Selenium is unneeded for your job. Commented Jan 17, 2013 at 13:19
  • @Arran I have many pages with JavaScript, and Selenium is the best solution I know of... If I use Selenium with a single task, everything works perfectly, with performance in line with other tools. Now, however, the number of URLs is increasing a lot and I would like to find a way to get more performance with multiprocessing... How can I do that? Commented Jan 17, 2013 at 14:08
  • It appears you're using Firefox in your tests. I'd suggest giving PhantomJS a try instead ("webdriver.PhantomJS"). Commented May 9, 2013 at 0:18
  • You are trying to share the same instance with multiple processes. You need to create a new instance for each process you create. Commented May 10, 2017 at 22:45
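The last comment points at the usual fix: give each worker process its own instance and reuse it across tasks, rather than sharing one (or paying browser-startup cost per URL). Below is a sketch of that pattern using `Pool`'s `initializer` hook; a PID-tagged dict stands in for the `webdriver.Firefox()` instance, and all names here are illustrative:

```python
import os
from multiprocessing import Pool

_driver = None  # one per worker process, set up by init_worker()

def init_worker():
    # In the real app this would be: _driver = webdriver.Firefox()
    # Here a dict tagged with the worker's PID stands in for the browser.
    global _driver
    _driver = {'pid': os.getpid()}

def task(url):
    # Each worker reuses its own _driver instead of sharing one instance.
    return (_driver['pid'], url)

if __name__ == '__main__':
    with Pool(processes=2, initializer=init_worker) as pool:
        for pid, url in pool.map(task, ['u1', 'u2', 'u3', 'u4']):
            print(pid, url)
```

With a real driver you would also want to quit it when the worker exits (for example via `atexit` inside the initializer), since `Pool` has no per-worker teardown hook.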

1 Answer

```python
from selenium import webdriver
from multiprocessing import Pool

def parsing(url):
    driver = webdriver.Chrome()
    driver.set_page_load_timeout(300)
    driver.get(url)
    text = driver.page_source
    driver.quit()  # quit() shuts the browser down completely
    return text

url_list = ['http://www.google.com', 'http://www.python.com']
pool = Pool(processes=4)
ret = pool.map(parsing, url_list)
for text in ret:
    print(text[:30])
```

I tried running your code, and Selenium complained about bad URLs. Adding http:// to them made it work.
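Note that the original question used `map_async()` without ever calling `.get()` on the returned `AsyncResult`, so any exception raised in a worker (such as Selenium's complaint about the scheme-less URLs) is silently swallowed. A small Selenium-free sketch of how `.get()` surfaces worker errors (`parse` here is a stand-in that rejects scheme-less URLs the way Selenium does):

```python
from multiprocessing import Pool

def parse(url):
    # Stand-in for the real parsing(): reject URLs without a scheme,
    # mimicking Selenium's "bad URL" complaint.
    if not url.startswith('http'):
        raise ValueError('bad url: %r' % url)
    return url

if __name__ == '__main__':
    pool = Pool(processes=2)
    result = pool.map_async(parse, ['www.google.com'])
    pool.close()
    pool.join()
    try:
        result.get()  # worker exceptions only surface here
    except ValueError as exc:
        print('a worker failed:', exc)
```

Switching to the blocking `pool.map()`, as the answer does, has the same effect: errors propagate to the caller instead of disappearing.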
