
I am a beginner in Python and have just coded a simple web scraper for a web article, writing the output to a text file, using BeautifulSoup and a list.

The code works fine, but I'm wondering if anybody knows a more efficient way to achieve the same result.

import requests

page = requests.get('https://www.msn.com/en-sg/money/topstories/10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp')

# 2. Parsing the page using BeautifulSoup
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')

# 3. Write the context to a text file
all_p_tags = soup.findAll('p')  # Put all <p> and their text into a list
number_of_tags = len(all_p_tags)  # No of <p>?
x = 0
with open('filename.txt', mode='wt', encoding='utf-8') as file:
    title = soup.find('h1').text.strip()  # Write the <header>
    file.write(title)
    file.write('\n')
    for x in range(number_of_tags):
        word = all_p_tags[x].get_text()  # Write the content by referencing each item in the list
        file.write(word)
        file.write('\n')
file.close()
  • Is there a reason you want to make this "efficient"? And that file.close() is unnecessary, just saying. Commented Dec 6, 2017 at 4:33
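As the comment points out, a file opened with a `with` statement is closed automatically when the block exits, so the trailing `file.close()` in the question is redundant. A quick check, using a throwaway `demo.txt` file:

```python
import os

# Open a file with a context manager; it is closed when the block exits.
with open('demo.txt', mode='wt', encoding='utf-8') as f:
    f.write('hello\n')

print(f.closed)  # True: the with statement already closed the file

os.remove('demo.txt')  # clean up the throwaway file
```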

2 Answers

# libraries always at top, at least if they are not conditionally imported
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.msn.com/en-sg/money/topstories/\
10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp'
page = requests.get(base_url)
content = page.content

# 2. Parsing the page using BeautifulSoup
# removed pandas as you are not using it here
soup = BeautifulSoup(page.content, 'html.parser')

# 3. Write the context to a text file
all_p_tags = soup.findAll('p')  # Put all <p> and their text into a list
# you don't need to count them
# no initializer needed, removed x = 0
with open('filename.txt', mode='wt', encoding='utf-8') as file:
    title = soup.find('h1').text.strip()  # Write the <header>
    file.write(title + '\n')
    for p in all_p_tags:
        file.write(p.get_text() + '\n')
# files opened with a 'with' statement don't have to be manually closed

There are at least three things that may help to make the code more efficient:

  • switch to lxml instead of html.parser (requires lxml to be installed)
  • use a SoupStrainer to parse only the relevant part of the document
  • you can switch to http instead of https. While this lowers the security aspect, you avoid the overhead of the SSL handshake, encryption, etc. - I've noticed the execution-time difference locally; try it out
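To illustrate the SoupStrainer point without hitting the network, here is a minimal sketch on a literal HTML string (it uses html.parser so it runs even without lxml installed; the strainer works the same way with either parser):

```python
from bs4 import BeautifulSoup, SoupStrainer

html = ('<html><head><title>page title</title></head>'
        '<body><h1>Title</h1><p>one</p><p>two</p></body></html>')

# Parse only the <body> subtree; everything outside it is skipped entirely.
only_body = SoupStrainer('body')
soup = BeautifulSoup(html, 'html.parser', parse_only=only_body)

print(soup.find('h1').text)               # Title
print([p.get_text() for p in soup('p')])  # ['one', 'two']
print(soup.find('title'))                 # None - <head> was never parsed
```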

Improved code:

import requests
from bs4 import BeautifulSoup, SoupStrainer

page = requests.get('http://www.msn.com/en-sg/money/topstories/10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp')

parse_only = SoupStrainer("body")
soup = BeautifulSoup(page.content, 'lxml', parse_only=parse_only)

with open('filename.txt', mode='wt', encoding='utf-8') as file:
    title = soup.find('h1').text.strip()
    file.write(title + '\n')

    for p_tag in soup.select('p'):
        file.write(p_tag.get_text() + '\n')

Note that I've also removed the unused variables and imports.

Btw, if it weren't for the title, we could've pinpointed SoupStrainer to p elements only - might've improved performance even more.
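A sketch of that narrower strainer, again on a literal string with html.parser for portability: with SoupStrainer('p'), only the p elements survive the parse, so the h1 needed for the title is gone - which is why the answer strains on body instead.

```python
from bs4 import BeautifulSoup, SoupStrainer

html = '<body><h1>Title</h1><p>one</p><p>two</p></body>'

# Keep only <p> elements; nothing else survives the parse.
soup = BeautifulSoup(html, 'html.parser', parse_only=SoupStrainer('p'))

print([p.get_text() for p in soup('p')])  # ['one', 'two']
print(soup.find('h1'))                    # None - the title is gone
```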

