
I have a .txt file that contains the complete URLs to a number of pages that each contain a table I want to scrape data off of. My code works for one URL, but when I try to add a loop and read in the URLs from the .txt file I get the following error

raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: ?

Here's my code

from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

with open('urls.txt', 'r') as f:
    urls = f.read()

for url in urls:
    uClient = urlopen(url)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")
    containers = page_soup.findAll("tr", {"class":"data"})
    for container in containers:
        unform_name = container.findAll("th", {"width":"30%"})
        name = unform_name[0].text.strip()
        unform_delegate = container.findAll("td", {"id":"y000"})
        delegate = unform_delegate[0].text.strip()
        print(name)
        print(delegate)
f.close()

I've checked my .txt file and all the entries are normal. They start with http: and end with .html, with no apostrophes or quotes around them. Am I coding the for loop incorrectly?

Using

with open('urls.txt', 'r') as f:
    for url in f:
        print(url)

I get the following

??http://www.thegreenpapers.com/PCC/AL-D.html
http://www.thegreenpapers.com/PCC/AL-R.html
http://www.thegreenpapers.com/PCC/AK-D.html

And so forth on 100 lines. Only the first line has question marks. My .txt file contains those URLs with only the state and party abbreviation changing.


2 Answers


You can't read the whole file into a string with f.read() and then iterate over that string: iterating a string yields one character at a time, not one URL per line. To resolve, see the change below. I also removed your last line; when you use the with statement, the file is closed automatically when the code block finishes.
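A minimal sketch of the difference, using hypothetical file contents:

```python
# Hypothetical contents of urls.txt, as f.read() would return them:
content = "http://www.thegreenpapers.com/PCC/AL-D.html\nhttp://www.thegreenpapers.com/PCC/AL-R.html\n"

# Iterating the whole-file string visits one CHARACTER at a time:
print([c for c in content][:4])  # ['h', 't', 't', 'p']

# splitlines() (or iterating the file object itself) yields one URL per line:
print(content.splitlines())
```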

The helper function below, from Greg Hewgill (for Python 2), shows whether the url string is of type 'str' or 'unicode'.

from urllib2 import urlopen
from bs4 import BeautifulSoup as soup

# Code from Greg Hewgill
def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

with open('urls.txt', 'r') as f:
    for url in f:
        print(url)
        whatisthis(url)
        uClient = urlopen(url)
        page_html = uClient.read()
        uClient.close()
        page_soup = soup(page_html, "html.parser")
        containers = page_soup.findAll("tr", {"class":"data"})
        for container in containers:
            unform_name = container.findAll("th", {"width":"30%"})
            name = unform_name[0].text.strip()
            unform_delegate = container.findAll("td", {"id":"y000"})
            delegate = unform_delegate[0].text.strip()
            print(name)
            print(delegate)

Running the code with a text file with the URLs listed above produces this output:

http://www.thegreenpapers.com/PCC/AL-D.html
ordinary string
Gore, Al
54.   84%
Uncommitted
10.   16%
LaRouche, Lyndon

http://www.thegreenpapers.com/PCC/AL-R.html
ordinary string
Bush, George W.
44.  100%
Keyes, Alan
Uncommitted

http://www.thegreenpapers.com/PCC/AK-D.html
ordinary string
Gore, Al
13.   68%
Uncommitted
6.   32%
Bradley, Bill

4 Comments

Those are great suggestions, but unfortunately I'm still getting the same error.
Please add a print for the url and show the output.
No, only when I print it
It sounds like you have a file that was not encoded in utf-8. You may have Unicode characters. See my updated code to find out.
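If the leading ?? really is an encoding artifact, the usual culprit is a UTF-8 byte-order mark at the start of the file. A minimal sketch (the raw bytes below are an assumption about what the file contains): decoding with the 'utf-8-sig' codec strips the BOM, while plain 'utf-8' keeps it as an invisible first character.

```python
# Assumed raw bytes: a UTF-8 BOM followed by the first URL in the file.
raw = b'\xef\xbb\xbfhttp://www.thegreenpapers.com/PCC/AL-D.html'

# Plain 'utf-8' keeps the BOM as u'\ufeff', which prints as junk like "??":
print(repr(raw.decode('utf-8')[:1]))

# 'utf-8-sig' strips the BOM, leaving a clean URL:
url = raw.decode('utf-8-sig').strip()
print(url)  # http://www.thegreenpapers.com/PCC/AL-D.html
```

Opening the file with encoding='utf-8-sig' (via io.open in Python 2) applies the same fix to every line.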

The way you have tried can be fixed by tweaking two different lines in your code.

Try this:

with open('urls.txt', 'r') as f:
    urls = f.readlines()  # make sure this line is properly indented

for url in urls:
    uClient = urlopen(url.strip())
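The .strip() call matters because every line readlines() returns keeps its trailing newline, and urlopen does not accept that whitespace as part of a URL. A quick illustration:

```python
# Lines read from a file keep their trailing newline:
line = "http://www.thegreenpapers.com/PCC/AL-D.html\n"
print(repr(line))          # note the trailing \n
print(repr(line.strip()))  # clean URL, safe to pass to urlopen
```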

