URL access from web page

Question

I am not able to access all the URLs on the bing.com web page. I am using this program:

    import urllib
    import urllister  # module that provides URLLister; must be importable

    def main():
        usock = urllib.urlopen("http://www.bing.com/")
        parser = urllister.URLLister()
        parser.feed(usock.read())
        usock.close()
        parser.close()
        for url in parser.urls:
            print url

I get only the few URLs that are written in the HTML. Is it possible to get all the URLs of a web page from its page source, or are there restrictions on accessing these URLs? Could anybody please check and let me know? Thank you in advance.

sheh · Accepted Answer · 2015-12-08 08:45:05Z

    import httplib2
    from BeautifulSoup import BeautifulSoup, SoupStrainer

    http = httplib2.Http()
    status, response = http.request('http://www.bing.com/')

    # Parse only the <a> tags and print each href.
    for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
        # BeautifulSoup 3 (this import) provides has_key(); has_attr() is the bs4 spelling.
        if link.has_key('href'):
            print link['href']

Try it with BeautifulSoup.

Prashant Shukla · Accepted Answer · 2015-12-08 09:13:05Z

    import re
    import urllib2

    def urllist():
        website = urllib2.urlopen('http://www.google.com')
        html = website.read()
        # (?:...) is non-capturing, so findall returns whole URLs only.
        links = re.findall(r'"((?:http|ftp)s?://.*?)"', html)
        for link in links:
            print link

This might help.

edited Dec 8, 2015 at 9:13

answered Dec 8, 2015 at 8:30


4 Comments

sheh Over a year ago

"((?:http|ftp)s?://.*?)" won't capture just the words http and ftp.

Prashant Shukla Over a year ago

What do you mean by "won't catch just words"?

sheh Over a year ago

Without ?: you get tuples like ('http://schema.org/WebPage', 'http'); with ?: you get only the link, 'http://schema.org/WebPage'. The ?: prefix in a regex means "don't capture this group". Run your code with and without ?: and you will see the difference.

Prashant Shukla Over a year ago

OK, I see your point. The capturing group was a requirement in some earlier code of mine, so I didn't bother to modify it much, but you're right about the ?: key. My bad. I've updated the answer with the new pattern.
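The difference discussed in these comments can be reproduced on a small inline string. This is a minimal Python 3 sketch, not code from the thread: the sample markup and the names `HTML`, `CAPTURING`, and `NON_CAPTURING` are made up for illustration, and the capturing pattern is only an assumption about what the answer looked like before it was edited.

```python
import re

# Hypothetical sample markup, not taken from any real page.
HTML = '<a href="http://schema.org/WebPage">a</a> <img src="https://example.com/x.png">'

# Assumed earlier form: the inner (http|ftp) is a capturing group.
CAPTURING = r'"((http|ftp)s?://.*?)"'
# Form used in the answer: (?:...) groups without capturing.
NON_CAPTURING = r'"((?:http|ftp)s?://.*?)"'

# With two capturing groups, findall returns one tuple per match.
with_group = re.findall(CAPTURING, HTML)
# With a single capturing group, findall returns plain strings.
without_group = re.findall(NON_CAPTURING, HTML)

print(with_group)
print(without_group)
```

With more than one capturing group, `re.findall` returns a tuple per match, which is why the scheme shows up as a second element. Either pattern only sees absolute http(s) or ftp(s) URLs wrapped in double quotes, so relative links and single-quoted attributes are skipped.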

Community · Accepted Answer · 2020-06-20 09:12:55Z

An example using the lxml library:

    from lxml.html import parse

    page = parse('http://bing.com').getroot()
    for l in page.iterlinks():
        # Each item is (element, attribute, link, pos); l[2] is the link.
        if l[2].startswith('http'):
            print(l[2])

From the lxml documentation:

.iterlinks():

This yields (element, attribute, link, pos) for every link in the document. attribute may be None if the link is in the text (as will be the case with a <style> tag with @import).
This finds any link in an action, archive, background, cite, classid, codebase, data, href, longdesc, profile, src, usemap, dynsrc, or lowsrc attribute. It also searches style attributes for url(link), and <style> tags for @import and url().
This function does not pay attention to <base href>.
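For comparison, the same idea can be sketched with only the Python 3 standard library. This is an illustration under stated assumptions, not lxml's implementation: `LinkCollector`, `collect_links`, `LINK_ATTRS`, and the sample page are hypothetical names and data, only a small subset of the attributes listed above is checked, and unlike `iterlinks()` this sketch does honor `<base href>` by resolving every link with `urljoin`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes treated as links here; a small subset of the list quoted above.
LINK_ATTRS = {"href", "src", "action", "background", "cite", "data"}


class LinkCollector(HTMLParser):
    """Collect link attribute values, resolved against a base URL."""

    def __init__(self, page_url):
        super().__init__()
        self.base = page_url  # replaced if the page declares <base href>
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href"):
            # Later relative links are resolved against this URL.
            self.base = urljoin(self.base, attrs["href"])
            return
        for name, value in attrs.items():
            if name in LINK_ATTRS and value:
                self.links.append(urljoin(self.base, value))


def collect_links(html, page_url):
    """Return all collected links in document order."""
    parser = LinkCollector(page_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page content and URL, for illustration only.
SAMPLE = """
<html><head><base href="http://example.com/docs/"></head>
<body>
  <a href="intro.html">Intro</a>
  <img src="/img/logo.png">
  <a href="https://other.example/page">Other</a>
</body></html>
"""

links = collect_links(SAMPLE, "http://example.com/index.html")
for link in links:
    print(link)
```

A parser like this only sees links present in the markup it is given, so URLs that a page adds later through scripts would not appear in the result.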
