
I have a .txt file that contains the complete URLs to a number of pages that each contain a table I want to scrape data off of. My code works for one URL, but when I try to add a loop and read in the URLs from the .txt file I get the following error

raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: ?

Here's my code

from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

with open('urls.txt', 'r') as f:
    urls = f.read()

for url in urls:
    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")
    containers = page_soup.findAll("tr", {"class":"data"})
    for container in containers:
        unform_name = container.findAll("th", {"width":"30%"})
        name = unform_name[0].text.strip()
        unform_delegate = container.findAll("td", {"id":"y000"})
        delegate = unform_delegate[0].text.strip()
        print(name)
        print(delegate)
f.close()

I've checked my .txt file and all the entries are normal. They start with http: and end with .html, with no apostrophes or quotes around them. Am I coding the for loop incorrectly?

Using

with open('urls.txt', 'r') as f:
    for url in f:
        print(url)

I get the following

??http://www.thegreenpapers.com/PCC/AL-D.html
http://www.thegreenpapers.com/PCC/AL-R.html
http://www.thegreenpapers.com/PCC/AK-D.html

And so forth on 100 lines. Only the first line has question marks. My .txt file contains those URLs with only the state and party abbreviation changing.


2 Answers


You can't read the whole file into a string with f.read() and then iterate over that string: iterating a string yields one character at a time, not one URL per line. To resolve, see the change below. I also removed your last line; when you use the with statement, the file is closed automatically when the code block finishes.
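A minimal sketch of the difference, using hypothetical file contents:

```python
# Hypothetical contents of urls.txt, as f.read() would return them:
content = "http://www.thegreenpapers.com/PCC/AL-D.html\nhttp://www.thegreenpapers.com/PCC/AL-R.html\n"

# Iterating the whole-file string visits one CHARACTER at a time:
print([c for c in content][:4])  # ['h', 't', 't', 'p']

# splitlines() (or iterating the file object itself) yields one URL per line:
print(content.splitlines())
```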

The helper function below, from Greg Hewgill (for Python 2), shows whether the url string is of type 'str' or 'unicode'.

from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

# Code from Greg Hewgill
def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

with open('urls.txt', 'r') as f:
    for url in f:
        print(url)
        whatisthis(url)
        uClient = urlopen(url)
        page_html = uClient.read()
        uClient.close()
        page_soup = soup(page_html, "html.parser")
        containers = page_soup.findAll("tr", {"class":"data"})
        for container in containers:
            unform_name = container.findAll("th", {"width":"30%"})
            name = unform_name[0].text.strip()
            unform_delegate = container.findAll("td", {"id":"y000"})
            delegate = unform_delegate[0].text.strip()
            print(name)
            print(delegate)

Running the code with a text file with the URLs listed above produces this output:

http://www.thegreenpapers.com/PCC/AL-D.html
ordinary string
Gore, Al
54.   84%
Uncommitted
10.   16%
LaRouche, Lyndon

http://www.thegreenpapers.com/PCC/AL-R.html
ordinary string
Bush, George W.
44.  100%
Keyes, Alan
Uncommitted

http://www.thegreenpapers.com/PCC/AK-D.html
ordinary string
Gore, Al
13.   68%
Uncommitted
6.   32%
Bradley, Bill

4 Comments

Those are great suggestions, but unfortunately I'm still getting the same error.
Please add a print for the url and show the output.
No, only when I print it
It sounds like you have a file that was not encoded in utf-8. You may have Unicode characters. See my updated code to find out.
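If the leading ?? really is an encoding artifact, the usual culprit is a UTF-8 byte-order mark at the start of the file. A minimal sketch (the raw bytes below are an assumption about what the file contains): decoding with the 'utf-8-sig' codec strips the BOM, while plain 'utf-8' keeps it as an invisible first character.

```python
# Assumed raw bytes: a UTF-8 BOM followed by the first URL in the file.
raw = b'\xef\xbb\xbfhttp://www.thegreenpapers.com/PCC/AL-D.html'

# Plain 'utf-8' keeps the BOM as u'\ufeff', which prints as junk like "??":
print(repr(raw.decode('utf-8')[:1]))

# 'utf-8-sig' strips the BOM, leaving a clean URL:
url = raw.decode('utf-8-sig').strip()
print(url)  # http://www.thegreenpapers.com/PCC/AL-D.html
```

Opening the file with encoding='utf-8-sig' (via io.open in Python 2) applies the same fix to every line.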

The way you have tried can be fixed by tweaking two different lines in your code.

Try this:

with open('urls.txt', 'r') as f:
    urls = f.readlines()  # make sure this line is properly indented

for url in urls:
    uClient = urlopen(url.strip())
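The .strip() call matters because every line readlines() returns keeps its trailing newline, and urlopen does not accept that whitespace as part of a URL. A quick illustration:

```python
# Lines read from a file keep their trailing newline:
line = "http://www.thegreenpapers.com/PCC/AL-D.html\n"
print(repr(line))          # note the trailing \n
print(repr(line.strip()))  # clean URL, safe to pass to urlopen
```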

