
I have a table from which I want to pick up all the links, follow each link, and scrape the items within td class="horse" on each of those pages.

The home page containing the table with all the links has the following code:

    <table border="0" cellspacing="0" cellpadding="0" class="full-calendar">
      <tr>
        <th width="160">&nbsp;</th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=NSW">NSW</a></th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=VIC">VIC</a></th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=QLD">QLD</a></th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=WA">WA</a></th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=SA">SA</a></th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=TAS">TAS</a></th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=ACT">ACT</a></th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=NT">NT</a></th>
      </tr>
      <tr class="rows">
        <td>
          <p><span>FRIDAY 13 JAN</span></p>
        </td>
        <td>
          <p>
            <a href="/FreeFields/Form.aspx?Key=2017Jan13,NSW,Ballina">Ballina</a><br>
            <a href="/FreeFields/Form.aspx?Key=2017Jan13,NSW,Gosford">Gosford</a><br>
          </p>
        </td>
        <td>
          <p>
            <a href="/FreeFields/Form.aspx?Key=2017Jan13,VIC,Ararat">Ararat</a><br>
            <a href="/FreeFields/Form.aspx?Key=2017Jan13,VIC,Cranbourne">Cranbourne</a><br>
          </p>
        </td>
        <td>
          <p>
            <a href="/FreeFields/Form.aspx?Key=2017Jan13,QLD,Doomben">Doomben</a><br>
          </p>
        </td>

I currently have code that looks up the table and prints the links:

    from selenium import webdriver
    import requests
    from bs4 import BeautifulSoup

    # path to chromedriver
    path_to_chromedriver = '/Users/Kirsty/Downloads/chromedriver'
    # ensure browser is set to Chrome
    browser = webdriver.Chrome(executable_path=path_to_chromedriver)
    # set browser to Racing Australia Home Page
    url = 'http://www.racingaustralia.horse/'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    # looks up to find the table & prints link for each page
    table = soup.find('table', attrs={"class": "full-calendar"}).find_all('a')
    for link in table:
        print link.get('href')

Wondering if anyone can assist with how I can get the code to follow (click on) all the links within the table and then do the following on each of those pages:

    g_data = soup.find_all("td", {"class": "horse"})
    for item in g_data:
        print item.text
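
To make the goal a bit more concrete, this is roughly the loop I have in mind (an untested sketch; the base-URL concatenation is just my guess at turning the table's relative hrefs into full URLs):

    # rough, untested sketch of what I'm trying to do
    import requests
    from bs4 import BeautifulSoup

    base = 'http://www.racingaustralia.horse'
    r = requests.get(base + '/')
    soup = BeautifulSoup(r.content, "html.parser")

    # every link inside the calendar table
    table_links = soup.find('table', attrs={"class": "full-calendar"}).find_all('a')
    for link in table_links:
        page_url = base + link.get('href')      # hrefs in the table are relative
        page = requests.get(page_url)
        page_soup = BeautifulSoup(page.content, "html.parser")
        for item in page_soup.find_all("td", {"class": "horse"}):
            print(item.text)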

Thanks in advance

  • What do you mean by "Click on the links"? Meaning, going to the page of the link, then scraping all the links on there? Commented Jan 12, 2017 at 23:36
  • Yes, so the table consists of data such as the below, <table border="0" cellspacing="0" cellpadding="0" class="full-calendar"> <tr class="rows"> <td><p><span>FRIDAY 13 JAN</span></p> </td><td><p> <a href="/FreeFields/Form.aspx?Key=2017Jan13,NSW,Ballina">Ballina</a><br> <a href="/FreeFields/Form.aspx?Key=2017Jan13,NSW,Gosford">Gosford</a><br> </p></td> <td><p> <a href="/FreeFields/Form.aspx?Key=2017Jan13,VIC,Ararat">Ararat</a><br> <a href="/FreeFields/Form.aspx?Key=2017Jan13,VIC,Cranbourne">Cranbourne</a><br></p></td> Commented Jan 12, 2017 at 23:43
  • @KirstyDent Please put any relevant data, like the HTML in your comment above, into the question itself so that it's easier for later readers to find. Commented Jan 13, 2017 at 0:25
  • Apologies - I will do so now! Commented Jan 13, 2017 at 1:42

1 Answer

    import requests, bs4, re
    from urllib.parse import urljoin

    start_url = 'http://www.racingaustralia.horse/'

    def make_soup(url):
        # fetch a page and return a parsed BeautifulSoup object
        r = requests.get(url)
        soup = bs4.BeautifulSoup(r.text, 'lxml')
        return soup

    def get_links(url):
        # collect every /FreeFields/ link on the page as an absolute URL
        soup = make_soup(url)
        a_tags = soup.find_all('a', href=re.compile(r"^/FreeFields/"))
        links = [urljoin(start_url, a['href']) for a in a_tags]  # convert relative url to absolute url
        return links

    def get_tds(link):
        # print the text of every td with class "horse" on the linked page
        soup = make_soup(link)
        tds = soup.find_all('td', class_="horse")
        if not tds:
            print(link, 'did not find horse tag')
        else:
            for td in tds:
                print(td.text)

    if __name__ == '__main__':
        links = get_links(start_url)
        for link in links:
            get_tds(link)

out:

    http://www.racingaustralia.horse/FreeFields/GroupAndListedRaces.aspx did not find horse tag
    http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=NSW did not find horse tag
    http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=VIC did not find horse tag
    http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=QLD did not find horse tag
    http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=WA did not find horse tag
    .......
    WEARETHECHAMPIONS
    STORMY HORIZON
    OUR RED JET
    SAPPER TOM
    MY COUSIN BOB
    ALL TOO HOT
    SAGA DEL MAR
    ZIGZOFF
    SASHAY AWAY
    SO SHE IS
    MILADY DUCHESS

bs4 + requests can meet your needs here.


1 Comment

How do you add pagination to this code, where the main page has several pages?
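
A possible direction (only a sketch, not part of the answer above; it assumes the per-state calendar pages from the question's HTML are the extra pages being asked about, and that start_url, get_links and get_tds from the answer are already defined):

    # Hypothetical sketch: crawl the per-state calendar pages as well as the home page.
    # The state list and URL pattern are taken from the calendar links shown in the question.
    states = ['NSW', 'VIC', 'QLD', 'WA', 'SA', 'TAS', 'ACT', 'NT']
    start_pages = [start_url] + [
        'http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=' + s
        for s in states
    ]

    all_links = set()
    for page in start_pages:
        all_links.update(get_links(page))   # set() de-duplicates links that appear on several pages

    for link in sorted(all_links):
        get_tds(link)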
