
I'm scraping a large list of URLs (1.2 million) using Selenium + BeautifulSoup with Python's multiprocessing.Pool. I want to scale it up to scrape faster, ideally without hitting system resource limits.

Right now:

  • I am using pool.map() with 4 processes.

  • It works, but overall throughput is limited, and memory use increases rapidly over time.

Questions:

  • Is multiprocessing the best choice for this type of task?

  • Should I switch to asyncio and use Playwright instead?

  • How do I control scraping rate or implement smarter batching?

I know I should add rotating proxies as well (e.g. BrightData or 2Captcha).

Here is the minimal reproducible code with some actual links in it:

```python
import json
from multiprocessing import Pool

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def scrape(url):
    try:
        options = Options()
        options.add_argument("--headless=new")
        options.add_argument("--disable-gpu")
        driver = webdriver.Chrome(options=options)
        driver.get(url)
        html = driver.page_source
        driver.quit()
        soup = BeautifulSoup(html, "lxml")
        title = soup.find("title").get_text(strip=True)
        return {"url": url, "title": title}
    except Exception as e:
        return {"url": url, "error": str(e)}


if __name__ == "__main__":
    urls = [
        "https://pitchbook.com/profiles/company/168089-41",
        "https://pitchbook.com/profiles/company/111349-63",
        "https://pitchbook.com/profiles/company/121201-84",
        "https://pitchbook.com/profiles/company/168179-14",
        "https://pitchbook.com/profiles/company/539159-59",
        "https://pitchbook.com/profiles/company/226251-19",
        "https://pitchbook.com/profiles/company/266296-42",
        "https://pitchbook.com/profiles/company/120144-61",
        "https://pitchbook.com/profiles/company/467017-30",
        "https://pitchbook.com/profiles/company/347056-03",
        "https://pitchbook.com/profiles/company/259802-47",
    ]
    with Pool(processes=4) as pool:
        results = pool.map(scrape, urls)
    for res in results:
        print(res)
```
  • Selenium is very heavy; if you just need the pages, why not just use requests? Also, have you looked into async pools? And have you checked whether your processes are I/O-bound? Commented Jul 10 at 7:28
  • @nabulator I have tried using only requests, but the site has Cloudflare protection. What do you mean by I/O-bound? Commented Jul 10 at 7:50
  • You've suggested programmatic solutions, but how do you know your network isn't at fault? You should try to quantify the time spent on the networking side. If they serve you through a CDN, they may already be rate-limiting your connections, though it's more likely you're just getting 503 errors or something of that nature. Commented Jul 10 at 7:55
  • If you run into Cloudflare protection, plain requests might not do the job, but at the same time, if you go to scale, Cloudflare can easily detect and block you regardless of which proxies and systems you use. Generally they don't go that hard on everyone, though. I would still suggest checking out proxies and tools like curl-impersonate. I would also suggest avoiding Selenium at scale. Commented Jul 10 at 8:05
  • You should give HTMLUnit a try. It's much more lightweight than a normal browser. Commented Jul 10 at 16:45

1 Answer


You're dealing with an I/O-bound task: most of the time is spent waiting on the network, not doing CPU work. On top of that, launching a fresh Chrome instance for every URL is extremely heavy; each launch costs seconds of startup time and hundreds of MB of memory, which is why your memory use climbs over time.

Switch to asyncio with Playwright so you can keep one browser process alive and open new pages (tabs) inside it; that is far cheaper than spawning a browser per URL. Use an asyncio.Semaphore (or a thread pool) to cap how many pages run concurrently, split your URLs into batches of, say, 10k, and write results to disk after each batch so a crash doesn't lose progress. Also set up rotating proxies early so you don't get blocked.
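A minimal sketch of the semaphore-plus-batching pattern described above. To keep it self-contained, `fetch_title` is a hypothetical stand-in for the real Playwright call (`await page.goto(url)` etc.); the concurrency and batching logic is what carries over:

```python
import asyncio


async def fetch_title(url):
    # Stand-in for the real Playwright fetch, e.g.:
    #   page = await browser.new_page()
    #   await page.goto(url)
    #   title = await page.title()
    await asyncio.sleep(0)  # simulate waiting on the network
    return f"title of {url}"


async def scrape_one(sem, url):
    async with sem:  # at most N pages in flight at once
        try:
            return {"url": url, "title": await fetch_title(url)}
        except Exception as e:
            return {"url": url, "error": str(e)}


def chunked(seq, size):
    # Yield successive fixed-size batches of the URL list.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]


async def scrape_all(urls, concurrency=10, batch_size=10_000):
    sem = asyncio.Semaphore(concurrency)
    results = []
    for batch in chunked(urls, batch_size):
        batch_results = await asyncio.gather(
            *(scrape_one(sem, u) for u in batch)
        )
        # In the real script, append this batch's results to disk here
        # so a crash partway through doesn't lose everything.
        results.extend(batch_results)
    return results
```

With 1.2M URLs you would call `asyncio.run(scrape_all(urls))` and tune `concurrency` against your proxy pool and the target's tolerance.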


3 Comments

  • If you want to load URLs in the same browser, that can also be done with Selenium.
  • I already have a script with Playwright; I'll try to tune it up a little bit. I presumed the issue was constantly opening a new Chrome browser.
  • I think what Salt and S A are proposing is to use tabs instead of new browser instances to solve "the issue" you have identified.
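To make the comments concrete: even if you stay on multiprocessing, you can create one driver per worker process via `Pool(initializer=...)` instead of one per URL. A sketch of that pattern, where `FakeDriver` is a hypothetical stand-in for `webdriver.Chrome` so the shape is visible without launching a browser:

```python
from multiprocessing import Pool


class FakeDriver:
    # Stand-in for webdriver.Chrome; the real version would run
    # driver.get(url) and parse driver.page_source.
    def get_title(self, url):
        return f"title of {url}"


_driver = None  # one driver per worker process, created once


def init_worker():
    # Runs once in each worker process when the pool starts,
    # so Chrome is launched 4 times total instead of 1.2M times.
    global _driver
    _driver = FakeDriver()  # real code: webdriver.Chrome(options=options)


def scrape(url):
    # Reuses the process-local driver instead of launching per URL.
    return {"url": url, "title": _driver.get_title(url)}


def run(urls, processes=4):
    with Pool(processes=processes, initializer=init_worker) as pool:
        return pool.map(scrape, urls)
```

You'd also want a try/except around the driver call and a periodic driver restart (Chrome leaks memory over long sessions), but the initializer is the core fix for per-URL launch cost.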
