[UPDATE] Here is the complete project code
https://bitbucket.org/deshan/simple-web-crawler
[ANSWER]
soup('a') returns the complete HTML tag, e.g.

    <a href="http://itunes.apple.com/us/store">Buy Music Now</a>

so passing it to urlopen raises the error 'NoneType' object is not callable. You need to extract only the URL from the href attribute:
    links = soup.findAll('a', href=True)
    for l in links:
        print(l['href'])
You need to validate the URL too. Refer to the following answers.
Again, I would suggest using Python sets instead of lists: you can easily add URLs and drop duplicates.
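For example, a set silently discards duplicates and gives cheap membership tests (a minimal standalone sketch; the URLs are just placeholders):

    crawled = set()
    crawled.add('http://example.com/a')
    crawled.add('http://example.com/a')        # duplicate: the set is unchanged
    print len(crawled)                         # 1
    print 'http://example.com/a' in crawled    # True, O(1) membership test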
Try the following code:
    import re
    import urllib2
    import BeautifulSoup

    regex = re.compile(
        r'^(?:http|ftp)s?://'  # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
        r'localhost|'  # localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
        r'(?::\d+)?'  # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)

    def isValidUrl(url):
        return regex.match(url) is not None

    def crawler(SeedUrl):
        tocrawl = [SeedUrl]
        crawled = []
        while tocrawl:
            page = tocrawl.pop()
            if page in crawled:          # skip pages that were already fetched
                continue
            print 'Crawled:' + page
            pagesource = urllib2.urlopen(page)
            s = pagesource.read()
            soup = BeautifulSoup.BeautifulSoup(s)
            links = soup.findAll('a', href=True)
            for l in links:
                if isValidUrl(l['href']):
                    tocrawl.append(l['href'])
            crawled.append(page)
        return crawled

    crawler('http://www.princeton.edu/main/')
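Following the suggestion above, the same loop rewritten with sets might look like this (a sketch reusing the imports and isValidUrl from the code above; setCrawler is just an illustrative name, not part of the project):

    def setCrawler(seedUrl):
        tocrawl = set([seedUrl])    # frontier: duplicate URLs are dropped on insert
        crawled = set()             # visited pages: O(1) membership tests
        while tocrawl:
            page = tocrawl.pop()    # set.pop() removes an arbitrary URL
            print 'Crawled:' + page
            soup = BeautifulSoup.BeautifulSoup(urllib2.urlopen(page).read())
            crawled.add(page)
            for l in soup.findAll('a', href=True):
                url = l['href']
                if isValidUrl(url) and url not in crawled:
                    tocrawl.add(url)    # re-adding a known URL is a no-op
        return crawled

    setCrawler('http://www.princeton.edu/main/')

Note how the bookkeeping collapses: there is no separate "already crawled?" branch, because the set absorbs duplicate inserts on its own.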