I am not able to retrieve all the URLs from the bing.com web page. I am using this program:

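    import urllib
    import urllister  # module providing URLLister; assumed to be importable

    def main():
        # Fetch the page and feed its HTML to the parser.
        usock = urllib.urlopen("http://www.bing.com/")
        parser = urllister.URLLister()
        parser.feed(usock.read())
        usock.close()
        parser.close()
        # Print every URL the parser collected.
        for url in parser.urls:
            print url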

I get only a few URLs, the ones written directly in the HTML. Is it possible to get all the URLs of a web page from its source? Or are there restrictions on accessing these URLs? Could anybody please check and let me know? Thank you in advance.

3 Answers

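    import httplib2
    from BeautifulSoup import BeautifulSoup, SoupStrainer

    http = httplib2.Http()
    # http.request returns a (response headers, body) pair; the body is the HTML.
    status, response = http.request('http://www.bing.com/')

    # Parse only the <a> tags and print each href.
    for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
        # BeautifulSoup 3 tags use has_key(); has_attr() is the bs4 spelling.
        if link.has_key('href'):
            print link['href']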

Try BeautifulSoup.
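If installing third-party packages is not an option, the same idea (look only at `<a>` tags and keep their `href`) can be sketched with the standard library. This is a minimal Python 3 sketch, not the code from the answer above: `AnchorCollector`, `extract_links`, and `SAMPLE_HTML` are hypothetical names, and the sample markup stands in for a downloaded page.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.urls.append(value)


def extract_links(html):
    """Return the href values of all <a> tags in the given HTML string."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.urls


# Hypothetical sample markup used in place of a real download.
SAMPLE_HTML = """
<html><body>
  <a href="http://www.bing.com/images">Images</a>
  <a name="top">No href here</a>
  <a href="/maps">Maps</a>
</body></html>
"""

links = extract_links(SAMPLE_HTML)
for link in links:
    print(link)
```

Like the BeautifulSoup version, this only sees links present in the HTML source; anything a page adds later through JavaScript will not show up.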

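    import re
    import urllib2

    def urllist():
        # Download the page and pull out every quoted http(s)/ftp(s) URL.
        website = urllib2.urlopen('http://www.google.com')
        html = website.read()
        # (?:...) is non-capturing, so findall returns plain URL strings.
        links = re.findall('"((?:http|ftp)s?://.*?)"', html)
        for link in links:
            print link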

This might help.

4 Comments

"((?:http|ftp)s?://.*?)" won't capture the bare words http and ftp as separate results.
What do you mean by "won't catch just words"?
Without ?: you get tuples such as ('http://schema.org/WebPage', 'http'); with ?: you get only the link, 'http://schema.org/WebPage'. In a regex, ?: means "don't capture this group". Run your code with and without ?: and you will see the difference.
OK, I see your point. It was a requirement in one of my earlier programs, so I didn't bother to modify it much, but you're right about ?:. My bad. I've updated the answer with the new pattern.
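The difference discussed in these comments is easy to see on a small string. The following Python 3 sketch runs both patterns with `re.findall`; the `html` sample and the variable names are made up for illustration.

```python
import re

# Hypothetical sample markup with one http link and one ftp link.
html = '<a href="http://schema.org/WebPage">x</a> <a href="ftp://example.com/file">y</a>'

# Two capturing groups: findall returns one tuple per match,
# (whole URL, scheme word).
capturing = re.findall(r'"((http|ftp)s?://.*?)"', html)

# Inner group made non-capturing with ?: so only the URL is returned.
non_capturing = re.findall(r'"((?:http|ftp)s?://.*?)"', html)

print(capturing)
print(non_capturing)
```

With a single capturing group left in the pattern, `findall` returns a flat list of strings, which is what the loop in the answer expects.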

An example using the lxml library:

    from lxml.html import parse

    page = parse('http://bing.com').getroot()
    # Each item is (element, attribute, link, pos); l[2] is the link itself.
    for l in page.iterlinks():
        if l[2].startswith('http'):
            print(l[2])

From the lxml documentation:

.iterlinks():

This yields (element, attribute, link, pos) for every link in the document. attribute may be None if the link is in the text (as will be the case with a <style> tag with @import).
This finds any link in an action, archive, background, cite, classid, codebase, data, href, longdesc, profile, src, usemap, dynsrc, or lowsrc attribute. It also searches style attributes for url(link), and <style> tags for @import and url().
This function does not pay attention to <base href>.
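Note that the `startswith('http')` filter in the snippet above drops relative links such as `/images`. One way to keep them is to resolve each link against the page URL. Here is a minimal Python 3 sketch using `urllib.parse.urljoin`; `BASE_URL`, `raw_links`, and `absolutize` are hypothetical names, and the link values are invented examples rather than real output from bing.com.

```python
from urllib.parse import urljoin

# Assumed address of the page the links were scraped from.
BASE_URL = "http://www.bing.com/"

# Invented examples of the link strings a scraper might return.
raw_links = ["/images", "maps", "http://www.msn.com/", "#top", "//login.live.com/"]


def absolutize(base, links):
    """Resolve each link against base, skipping same-page fragments."""
    result = []
    for link in links:
        if link.startswith("#"):
            # A bare fragment points back into the same page.
            continue
        result.append(urljoin(base, link))
    return result


absolute_links = absolutize(BASE_URL, raw_links)
for link in absolute_links:
    print(link)
```

lxml also provides `make_links_absolute()` on parsed HTML elements, which may be more convenient when the document is already loaded with lxml.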
