
I am a beginner in Python and have just coded a simple web scraper for a web article, writing the output to a text file, using BeautifulSoup and a list.

The code works fine, but I'm wondering if anybody knows a more efficient way to achieve the same result.

import requests

page = requests.get('https://www.msn.com/en-sg/money/topstories/10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp')

# 2. Parsing the page using BeautifulSoup
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')

# 3. Write the context to a text file
all_p_tags = soup.findAll('p')  # Put all <p> and their text into a list
number_of_tags = len(all_p_tags)  # No of <p>?
x = 0
with open('filename.txt', mode='wt', encoding='utf-8') as file:
    title = soup.find('h1').text.strip()  # Write the <header>
    file.write(title)
    file.write('\n')
    for x in range(number_of_tags):
        word = all_p_tags[x].get_text()  # Write the content by referencing each item in the list
        file.write(word)
        file.write('\n')
file.close()
  • Is there a reason you want to make this "efficient"? And that file.close() is unnecessary, just saying. Commented Dec 6, 2017 at 4:33
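As the comment points out, a file opened with a `with` statement is closed automatically when the block exits, so the trailing `file.close()` in the question is redundant. A quick check, using a throwaway `demo.txt` file:

```python
import os

# Open a file with a context manager; it is closed when the block exits.
with open('demo.txt', mode='wt', encoding='utf-8') as f:
    f.write('hello\n')

print(f.closed)  # True: the with statement already closed the file

os.remove('demo.txt')  # clean up the throwaway file
```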

2 Answers

# libraries always at top, at least if they are not conditionally imported
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.msn.com/en-sg/money/topstories/\
10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp'
page = requests.get(base_url)
content = page.content

# 2. Parsing the page using BeautifulSoup
# removed pandas as you are not using it here
soup = BeautifulSoup(page.content, 'html.parser')

# 3. Write the context to a text file
all_p_tags = soup.findAll('p')  # Put all <p> and their text into a list
# you don't need to count them
# no initializer needed, removed x = 0
with open('filename.txt', mode='wt', encoding='utf-8') as file:
    title = soup.find('h1').text.strip()  # Write the <header>
    file.write(title + '\n')
    for p in all_p_tags:
        file.write(p.get_text() + '\n')
# files opened with a 'with' statement don't have to be manually closed

There are at least three things that may help to make the code more efficient:

  • switch to lxml instead of html.parser (requires lxml to be installed)
  • use a SoupStrainer to parse only the relevant part of the document
  • you can switch to http instead of https. While this lowers the security aspect, you avoid the overhead of the SSL handshake, encryption, etc. - I've noticed the execution-time difference locally; try it out
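To illustrate the SoupStrainer point without hitting the network, here is a minimal sketch on a literal HTML string (it uses html.parser so it runs even without lxml installed; the strainer works the same way with either parser):

```python
from bs4 import BeautifulSoup, SoupStrainer

html = ('<html><head><title>page title</title></head>'
        '<body><h1>Title</h1><p>one</p><p>two</p></body></html>')

# Parse only the <body> subtree; everything outside it is skipped entirely.
only_body = SoupStrainer('body')
soup = BeautifulSoup(html, 'html.parser', parse_only=only_body)

print(soup.find('h1').text)               # Title
print([p.get_text() for p in soup('p')])  # ['one', 'two']
print(soup.find('title'))                 # None - <head> was never parsed
```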

Improved code:

import requests
from bs4 import BeautifulSoup, SoupStrainer

page = requests.get('http://www.msn.com/en-sg/money/topstories/10-top-stocks-of-2017/ar-BBGgEyA?li=AA54rX&ocid=spartandhp')

parse_only = SoupStrainer("body")
soup = BeautifulSoup(page.content, 'lxml', parse_only=parse_only)

with open('filename.txt', mode='wt', encoding='utf-8') as file:
    title = soup.find('h1').text.strip()
    file.write(title + '\n')

    for p_tag in soup.select('p'):
        file.write(p_tag.get_text() + '\n')

Note that I've also removed the unused variables and imports.

Btw, if it weren't for the title, we could've pinpointed SoupStrainer to p elements only - might've improved performance even more.
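A sketch of that narrower strainer, again on a literal string with html.parser for portability: with SoupStrainer('p'), only the p elements survive the parse, so the h1 needed for the title is gone - which is why the answer strains on body instead.

```python
from bs4 import BeautifulSoup, SoupStrainer

html = '<body><h1>Title</h1><p>one</p><p>two</p></body>'

# Keep only <p> elements; nothing else survives the parse.
soup = BeautifulSoup(html, 'html.parser', parse_only=SoupStrainer('p'))

print([p.get_text() for p in soup('p')])  # ['one', 'two']
print(soup.find('h1'))                    # None - the title is gone
```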

