I'm scraping a large list of URLs (1.2 million) using Selenium + BeautifulSoup with Python's multiprocessing.Pool. I want to scale it up to scrape faster, ideally without hitting system resource limits.
Right now:
I am using `pool.map()` with 4 processes. It works, but overall throughput is limited, and memory use grows rapidly over time.
Questions:

1. Is `multiprocessing` the best choice for this type of task?
2. Should I switch to `asyncio` and use Playwright instead?
3. How do I control the scraping rate or implement smarter batching?
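To make question 3 concrete, here is a minimal sketch of the kind of rate control I mean, using an `asyncio.Semaphore` to cap in-flight requests. The `fetch` coroutine is only a stand-in (a real version would call Playwright or an HTTP client), and all names and URLs here are illustrative:

```python
import asyncio

async def fetch(url, semaphore, delay=0.01):
    # Stand-in for a real page fetch (Playwright / aiohttp would go here).
    async with semaphore:           # at most max_concurrency fetches at once
        await asyncio.sleep(delay)  # simulate network latency
        return {"url": url, "title": f"title of {url}"}

async def scrape_all(urls, max_concurrency=10):
    # The semaphore throttles concurrency; gather collects all results.
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [fetch(u, semaphore) for u in urls]
    return await asyncio.gather(*tasks)

results = asyncio.run(scrape_all([f"https://example.com/{i}" for i in range(50)]))
```

Adding a small `await asyncio.sleep()` outside the semaphore, or a token-bucket limiter, would additionally cap requests per second rather than just concurrency.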
I know I should add rotating proxies as well (e.g. BrightData), and possibly a captcha-solving service (e.g. 2Captcha).
Here is a minimal reproducible example with some actual links in it:
```python
from multiprocessing import Pool

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def scrape(url):
    driver = None
    try:
        options = Options()
        options.add_argument("--headless=new")
        options.add_argument("--disable-gpu")
        driver = webdriver.Chrome(options=options)
        driver.get(url)
        html = driver.page_source
        soup = BeautifulSoup(html, "lxml")
        title_tag = soup.find("title")
        title = title_tag.get_text(strip=True) if title_tag else None
        return {"url": url, "title": title}
    except Exception as e:
        return {"url": url, "error": str(e)}
    finally:
        if driver is not None:
            driver.quit()  # ensure Chrome is closed even when an error occurs


if __name__ == "__main__":
    urls = [
        "https://pitchbook.com/profiles/company/168089-41",
        "https://pitchbook.com/profiles/company/111349-63",
        "https://pitchbook.com/profiles/company/121201-84",
        "https://pitchbook.com/profiles/company/168179-14",
        "https://pitchbook.com/profiles/company/539159-59",
        "https://pitchbook.com/profiles/company/226251-19",
        "https://pitchbook.com/profiles/company/266296-42",
        "https://pitchbook.com/profiles/company/120144-61",
        "https://pitchbook.com/profiles/company/467017-30",
        "https://pitchbook.com/profiles/company/347056-03",
        "https://pitchbook.com/profiles/company/259802-47",
    ]
    with Pool(processes=4) as pool:
        results = pool.map(scrape, urls)
    for res in results:
        print(res)
```
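On the memory point: `pool.map` materializes the entire result list in the parent process, which hurts at 1.2 million URLs. Streaming results with `imap_unordered` and a `chunksize` is one alternative I am considering. The sketch below substitutes a thread pool (`multiprocessing.dummy`) and a stub scraper so it runs standalone, but the same pattern applies to a real process pool:

```python
from multiprocessing.dummy import Pool  # thread-based stand-in for the sketch

def scrape_stub(url):
    # Placeholder for the real scrape(); returns immediately.
    return {"url": url, "title": f"title of {url}"}

urls = [f"https://example.com/{i}" for i in range(1000)]

results_seen = 0
with Pool(processes=4) as pool:
    # imap_unordered yields results as workers finish, so the parent never
    # holds one giant list; chunksize batches tasks to amortize IPC overhead.
    for result in pool.imap_unordered(scrape_stub, urls, chunksize=50):
        results_seen += 1  # in real code: append each result to a file or DB
```

Writing each result straight to disk inside the loop keeps parent-process memory flat regardless of how many URLs are processed.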