I've written a script in Python, in combination with Selenium, to scrape the links of different posts from a site's landing page and then get the title of each post by following the URL to its inner page. Although the content I parse here is static, I used Selenium to see how it behaves with multiprocessing.

However, my intention is to do the scraping using multiprocessing. Until now I believed that Selenium doesn't support multiprocessing, but it seems I was wrong.

My question: how can I reduce the execution time when Selenium is run with multiprocessing?

This is my try (it's a working one):

    import requests
    from urllib.parse import urljoin
    from multiprocessing.pool import ThreadPool
    from bs4 import BeautifulSoup
    from selenium import webdriver

    def get_links(link):
        # Fetch the landing page with requests and collect the post URLs.
        # Note: urljoin relies on the module-level `url` set under __main__.
        res = requests.get(link)
        soup = BeautifulSoup(res.text, "lxml")
        titles = [urljoin(url, items.get("href")) for items in soup.select(".summary .question-hyperlink")]
        return titles

    def get_title(url):
        # Launch a fresh headless Chrome instance for each URL.
        chromeOptions = webdriver.ChromeOptions()
        chromeOptions.add_argument("--headless")
        driver = webdriver.Chrome(chrome_options=chromeOptions)
        driver.get(url)
        sauce = BeautifulSoup(driver.page_source, "lxml")
        item = sauce.select_one("h1 a").text
        print(item)

    if __name__ == '__main__':
        url = "https://stackoverflow.com/questions/tagged/web-scraping"
        ThreadPool(5).map(get_title, get_links(url))
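Most of the runtime above goes into launching a fresh Chrome instance for every single URL. As a minimal sketch of one way to cut that down (assuming the same page structure; the `get_driver` helper and the use of `threading.local()` are my own additions, not part of the original script), each worker thread could create one headless driver lazily and reuse it across all the URLs it processes:

    import threading
    from urllib.parse import urljoin
    from multiprocessing.pool import ThreadPool

    import requests
    from bs4 import BeautifulSoup
    from selenium import webdriver

    # Each worker thread gets its own driver, created once and reused.
    thread_local = threading.local()

    def get_driver():
        # Lazily create one headless Chrome per thread instead of one per URL.
        if not hasattr(thread_local, "driver"):
            options = webdriver.ChromeOptions()
            options.add_argument("--headless")
            thread_local.driver = webdriver.Chrome(chrome_options=options)
        return thread_local.driver

    def get_links(link):
        # Same landing-page scrape as above, without relying on a global url.
        res = requests.get(link)
        soup = BeautifulSoup(res.text, "lxml")
        return [urljoin(link, item.get("href"))
                for item in soup.select(".summary .question-hyperlink")]

    def get_title(url):
        driver = get_driver()
        driver.get(url)
        sauce = BeautifulSoup(driver.page_source, "lxml")
        print(sauce.select_one("h1 a").text)

    if __name__ == '__main__':
        url = "https://stackoverflow.com/questions/tagged/web-scraping"
        ThreadPool(5).map(get_title, get_links(url))

This caps the cost at five driver launches total rather than one per URL; teardown is left out for brevity, so in a real run each thread's driver would still need a `driver.quit()` once the pool finishes.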
