
I wrote the program below in Python as a very simple web crawler, but when I run it, it returns 'NoneType' object is not callable. Could you please help me?

import BeautifulSoup
import urllib2

def union(p,q):
    for e in q:
        if e not in p:
            p.append(e)

def crawler(SeedUrl):
    tocrawl=[SeedUrl]
    crawled=[]
    while tocrawl:
        page=tocrawl.pop()
        pagesource=urllib2.urlopen(page)
        s=pagesource.read()
        soup=BeautifulSoup.BeautifulSoup(s)
        links=soup('a')
        if page not in crawled:
            union(tocrawl,links)
            crawled.append(page)
    return crawled

crawler('http://www.princeton.edu/main/')
    Can you post the full traceback? That should at least narrow down what function call is being made on a None value. Commented Dec 1, 2012 at 11:32

1 Answer


[UPDATE] Here is the complete project code

https://bitbucket.org/deshan/simple-web-crawler

[ANSWER]

soup('a') returns a list of complete HTML tags, such as:

<a href="http://itunes.apple.com/us/store">Buy Music Now</a> 

so passing one of those tags to urlopen gives the error 'NoneType' object is not callable. You need to extract only the URL/href:

links=soup.findAll('a',href=True)
for l in links:
    print(l['href'])
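Note that many href values are relative (for example /main/news); one way to turn them into absolute URLs before fetching is urlparse.urljoin. A minimal sketch, assuming you know the URL of the page the links came from:

from urlparse import urljoin

base = 'http://www.princeton.edu/main/'   # the page the links were scraped from
for l in links:
    print(urljoin(base, l['href']))       # resolves relative hrefs against the page URL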

You need to validate the URL too, so the crawler does not choke on fragments, mailto: links, and other non-HTTP hrefs.
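For example, one rough way to filter those out is to check the parsed scheme and host with urlparse. This is a minimal sketch with a hypothetical helper name, not a complete validator; the regex version used in the full code below is another option:

from urlparse import urlparse

def looks_like_http_url(url):
    # Rough check: accept only absolute http/https URLs that have a host part.
    parts = urlparse(url)
    return parts.scheme in ('http', 'https') and bool(parts.netloc)

print(looks_like_http_url('http://www.princeton.edu/main/'))  # True
print(looks_like_http_url('#top'))                            # False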

Again, I would suggest using Python sets instead of lists; a set lets you add URLs and omit duplicates easily (see the sketch after the code below).

Try the following code:

import re
import httplib
import urllib2
from urlparse import urlparse
import BeautifulSoup

regex = re.compile(
    r'^(?:http|ftp)s?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)

def isValidUrl(url):
    if regex.match(url) is not None:
        return True
    return False

def crawler(SeedUrl):
    tocrawl=[SeedUrl]
    crawled=[]
    while tocrawl:
        page=tocrawl.pop()
        print 'Crawled:'+page
        pagesource=urllib2.urlopen(page)
        s=pagesource.read()
        soup=BeautifulSoup.BeautifulSoup(s)
        links=soup.findAll('a',href=True)  # only anchor tags that actually have an href
        if page not in crawled:
            for l in links:
                if isValidUrl(l['href']):
                    tocrawl.append(l['href'])
            crawled.append(page)
    return crawled

crawler('http://www.princeton.edu/main/')
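To illustrate the set suggestion above, here is a minimal sketch of the same crawler built around sets rather than lists. It reuses the isValidUrl() helper from the code above, and the function name set_crawler is just for illustration:

import urllib2
import BeautifulSoup

def set_crawler(seed_url):
    tocrawl = set([seed_url])   # URLs waiting to be fetched
    crawled = set()             # URLs already fetched
    while tocrawl:
        page = tocrawl.pop()
        if page in crawled:
            continue            # set membership test replaces union()
        print 'Crawled:' + page
        soup = BeautifulSoup.BeautifulSoup(urllib2.urlopen(page).read())
        for link in soup.findAll('a', href=True):
            if isValidUrl(link['href']):
                tocrawl.add(link['href'])   # duplicates are dropped automatically
        crawled.add(page)
    return crawled

One caveat: tocrawl.pop() removes an arbitrary element, so the crawl order is unspecified; if order matters, keep a list for ordering and a set for membership tests.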