I have a .txt file containing the complete URLs of a number of pages, each with a table I want to scrape data from. My code works for one URL, but when I add a loop and read the URLs in from the .txt file, I get the following error:
    raise ValueError, "unknown url type: %s" % self.__original
    ValueError: unknown url type: ?

Here's my code:
    from urllib2 import urlopen
    from bs4 import BeautifulSoup as soup

    with open('urls.txt', 'r') as f:
        urls = f.read()
    for url in urls:

        uClient = urlopen(url)
        page_html = uClient.read()
        uClient.close()

        page_soup = soup(page_html, "html.parser")

        containers = page_soup.findAll("tr", {"class":"data"})

        for container in containers:
            unform_name = container.findAll("th", {"width":"30%"})
            name = unform_name[0].text.strip()

            unform_delegate = container.findAll("td", {"id":"y000"})
            delegate = unform_delegate[0].text.strip()

            print(name)
            print(delegate)

    f.close()

I've checked my .txt file and all the entries look normal. They start with http: and end with .html. There are no apostrophes or quotes around them. Am I coding the for loop incorrectly?
Using

    with open('urls.txt', 'r') as f:
        for url in f:
            print(url)

I get the following:
    ??http://www.thegreenpapers.com/PCC/AL-D.html
    http://www.thegreenpapers.com/PCC/AL-R.html
    http://www.thegreenpapers.com/PCC/AK-D.html

And so forth for 100 lines. Only the first line has question marks. My .txt file contains those URLs with only the state and party abbreviation changing.
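For reference, here is a minimal sketch of reading the file one URL per line. It rests on an assumption that is not confirmed above: that the leading question marks are a byte order mark and that the file is UTF-8 encoded. If the file uses a different encoding, the decoding step would need to change. The helper name `read_urls` is hypothetical.

```python
import io


def read_urls(path):
    """Return the non-empty lines of a text file as a list of URL strings."""
    # Assumption: the file is UTF-8. The 'utf-8-sig' codec decodes UTF-8 and
    # drops a leading byte order mark if one is present.
    with io.open(path, 'r', encoding='utf-8-sig') as f:
        # Iterate over the file object to get one line at a time. Looping over
        # the string returned by f.read() would yield single characters.
        return [line.strip() for line in f if line.strip()]
```

Each returned string has its trailing newline removed, so it could be passed to `urlopen` in place of the `url` variable in the loop above.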