
Data retrieval from Dynamic HTML page with time-out (Web scraping w. Python)

The HTML page shows a person's friend network as a list (each name is an anchor `<a>` tag that links to that friend's own friend list). Since the page has a timer, I've written Python code that scrapes the mth position (friend) on each page and repeats this for n counts (pages), traversing the cycle (m->n->m->n...). And it works!

    import urllib.request, urllib.parse, urllib.error
    from bs4 import BeautifulSoup
    import ssl

    # Ignore SSL certificate errors
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    url = input('Enter URL: ')
    position = int(input('Enter position: '))  #Name/link Traverse
    count = int(input('Enter count: '))  #Page Traverse

    print("Retrieving:", url)
    for c in range(count):  #returns range of indices
        html = urllib.request.urlopen(url, context=ctx).read()  #opening URL
        soup = BeautifulSoup(html, 'html.parser')
        a_tags=soup('a')
        link=a_tags[position-1].get('href', None)  #url = href(key) value pair
        content=a_tags[position-1].contents  #name=a_tag.contents
        url=link
        print("Retrieving:", url)

Input:

    Enter URL: http://py4e-data.dr-chuck.net/known_by_Kory.html
    Enter position: 1
    Enter count: 10

Output:

    Retrieving: http://py4e-data.dr-chuck.net/known_by_Kory.html
    Retrieving: http://py4e-data.dr-chuck.net/known_by_Shaurya.html
    Retrieving: http://py4e-data.dr-chuck.net/known_by_Raigen.html
    Retrieving: http://py4e-data.dr-chuck.net/known_by_Dougal.html
    Retrieving: http://py4e-data.dr-chuck.net/known_by_Aonghus.html
    Retrieving: http://py4e-data.dr-chuck.net/known_by_Daryn.html
    Retrieving: http://py4e-data.dr-chuck.net/known_by_Pauline.html
    Retrieving: http://py4e-data.dr-chuck.net/known_by_Laia.html
    Retrieving: http://py4e-data.dr-chuck.net/known_by_Iagan.html
    Retrieving: http://py4e-data.dr-chuck.net/known_by_Leanna.html
    Retrieving: http://py4e-data.dr-chuck.net/known_by_Malakhy.html
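To make the traversal checkable without hitting the live site, the same "follow the mth link, n times" loop can be sketched as a function that takes the page-fetching step as a parameter. This is only a rough sketch under some assumptions: `follow_chain`, `AnchorCollector`, the `fetch` callable, and the `example.test` pages are all made up for illustration, and it uses the standard library's `HTMLParser` instead of BeautifulSoup purely so that it runs without third-party packages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href are recorded as None.
            self.hrefs.append(dict(attrs).get("href"))


def follow_chain(start_url, position, count, fetch):
    """Follow the link at 1-based `position` on each page, up to `count` times.

    `fetch` is a hypothetical callable that maps a URL to its HTML text.
    Returns the list of URLs visited, starting with `start_url`.
    """
    if position < 1:
        raise ValueError("position is 1-based and must be >= 1")
    visited = [start_url]
    url = start_url
    for _ in range(count):
        parser = AnchorCollector()
        parser.feed(fetch(url))
        hrefs = parser.hrefs
        if position > len(hrefs) or hrefs[position - 1] is None:
            break  # no usable link at that position, so stop early
        # urljoin resolves relative links against the current page.
        url = urljoin(url, hrefs[position - 1])
        visited.append(url)
    return visited


# Tiny in-memory "site" standing in for the real pages (made-up data).
PAGES = {
    "http://example.test/a.html": '<a href="b.html">B</a><a href="c.html">C</a>',
    "http://example.test/b.html": '<a href="c.html">C</a>',
    "http://example.test/c.html": '<a href="a.html">A</a>',
}

chain = follow_chain("http://example.test/a.html", 1, 3, PAGES.__getitem__)
print(chain)
```

For the real pages, `fetch` could presumably wrap `urllib.request.urlopen(url, context=ctx).read().decode()`; keeping it as a parameter means the link-selection logic can be exercised on canned HTML first.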

Questions:

  1. Is there a better way to approach this? (libraries, workarounds to delay the timer)

  2. My goal is to build an exhaustive list of the friends of every unique name here; I don't want any code, just suggestions and approaches will do.

Sumax