[UPDATE] Here is the complete project code
https://bitbucket.org/deshan/simple-web-crawler
[ANSWER]
soup('a') returns the complete HTML tag, e.g.

    <a href="http://itunes.apple.com/us/store">Buy Music Now</a>

so passing it to urlopen raises the error 'NoneType' object is not callable. You need to extract only the URL from the href attribute:
    links = soup.findAll('a', href=True)
    for l in links:
        print(l['href'])
You need to validate the URL too. Refer to the following answers.
Again, I would suggest using Python sets instead of lists: you can easily add URLs and drop duplicates.
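For example, a set silently discards duplicates and gives cheap membership tests (a minimal standalone sketch; the URLs are just placeholders):

    crawled = set()
    crawled.add('http://example.com/a')
    crawled.add('http://example.com/a')        # duplicate: the set is unchanged
    print len(crawled)                         # 1
    print 'http://example.com/a' in crawled    # True, O(1) membership test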
Try the following code:
    import re
    import urllib2
    import BeautifulSoup

    regex = re.compile(
        r'^(?:http|ftp)s?://'  # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
        r'localhost|'  # localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
        r'(?::\d+)?'  # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)

    def isValidUrl(url):
        return regex.match(url) is not None

    def crawler(SeedUrl):
        tocrawl = [SeedUrl]
        crawled = []
        while tocrawl:
            page = tocrawl.pop()
            if page in crawled:          # skip pages that were already fetched
                continue
            print 'Crawled:' + page
            pagesource = urllib2.urlopen(page)
            s = pagesource.read()
            soup = BeautifulSoup.BeautifulSoup(s)
            links = soup.findAll('a', href=True)
            for l in links:
                if isValidUrl(l['href']):
                    tocrawl.append(l['href'])
            crawled.append(page)
        return crawled

    crawler('http://www.princeton.edu/main/')
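Following the suggestion above, the same loop rewritten with sets might look like this (a sketch reusing the imports and isValidUrl from the code above; setCrawler is just an illustrative name, not part of the project):

    def setCrawler(seedUrl):
        tocrawl = set([seedUrl])    # frontier: duplicate URLs are dropped on insert
        crawled = set()             # visited pages: O(1) membership tests
        while tocrawl:
            page = tocrawl.pop()    # set.pop() removes an arbitrary URL
            print 'Crawled:' + page
            soup = BeautifulSoup.BeautifulSoup(urllib2.urlopen(page).read())
            crawled.add(page)
            for l in soup.findAll('a', href=True):
                url = l['href']
                if isValidUrl(url) and url not in crawled:
                    tocrawl.add(url)    # re-adding a known URL is a no-op
        return crawled

    setCrawler('http://www.princeton.edu/main/')

Note how the bookkeeping collapses: there is no separate "already crawled?" branch, because the set absorbs duplicate inserts on its own.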