I've written a script in Python, in combination with Selenium, to scrape the links of different posts from a site's landing page and then get the title of each post by following the URL to its inner page. Although the content I parse here is static, I used Selenium to see how it behaves with multiprocessing.

However, my intention is to do the scraping using multiprocessing. Until now I believed that Selenium doesn't support multiprocessing, but it seems I was wrong.

My question: how can I reduce the execution time when Selenium is run with multiprocessing?

This is my try (it's a working one):

    import requests
    from urllib.parse import urljoin
    from multiprocessing.pool import ThreadPool
    from bs4 import BeautifulSoup
    from selenium import webdriver

    def get_links(link):
        # Fetch the landing page with requests and collect the post URLs.
        # Note: urljoin relies on the module-level `url` set under __main__.
        res = requests.get(link)
        soup = BeautifulSoup(res.text, "lxml")
        titles = [urljoin(url, items.get("href")) for items in soup.select(".summary .question-hyperlink")]
        return titles

    def get_title(url):
        # Launch a fresh headless Chrome instance for each URL.
        chromeOptions = webdriver.ChromeOptions()
        chromeOptions.add_argument("--headless")
        driver = webdriver.Chrome(chrome_options=chromeOptions)
        driver.get(url)
        sauce = BeautifulSoup(driver.page_source, "lxml")
        item = sauce.select_one("h1 a").text
        print(item)

    if __name__ == '__main__':
        url = "https://stackoverflow.com/questions/tagged/web-scraping"
        ThreadPool(5).map(get_title, get_links(url))
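Most of the runtime above goes into launching a fresh Chrome instance for every single URL. As a minimal sketch of one way to cut that down (assuming the same page structure; the `get_driver` helper and the use of `threading.local()` are my own additions, not part of the original script), each worker thread could create one headless driver lazily and reuse it across all the URLs it processes:

    import threading
    from urllib.parse import urljoin
    from multiprocessing.pool import ThreadPool

    import requests
    from bs4 import BeautifulSoup
    from selenium import webdriver

    # Each worker thread gets its own driver, created once and reused.
    thread_local = threading.local()

    def get_driver():
        # Lazily create one headless Chrome per thread instead of one per URL.
        if not hasattr(thread_local, "driver"):
            options = webdriver.ChromeOptions()
            options.add_argument("--headless")
            thread_local.driver = webdriver.Chrome(chrome_options=options)
        return thread_local.driver

    def get_links(link):
        # Same landing-page scrape as above, without relying on a global url.
        res = requests.get(link)
        soup = BeautifulSoup(res.text, "lxml")
        return [urljoin(link, item.get("href"))
                for item in soup.select(".summary .question-hyperlink")]

    def get_title(url):
        driver = get_driver()
        driver.get(url)
        sauce = BeautifulSoup(driver.page_source, "lxml")
        print(sauce.select_one("h1 a").text)

    if __name__ == '__main__':
        url = "https://stackoverflow.com/questions/tagged/web-scraping"
        ThreadPool(5).map(get_title, get_links(url))

This caps the cost at five driver launches total rather than one per URL; teardown is left out for brevity, so in a real run each thread's driver would still need a `driver.quit()` once the pool finishes.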
