
I have a table from which I want to pick up all the links, follow each link, and scrape the items within td class="horse" on each of those pages.

The home page containing the table with all the links has the following code:

    <table border="0" cellspacing="0" cellpadding="0" class="full-calendar">
      <tr>
        <th width="160">&nbsp;</th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=NSW">NSW</a></th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=VIC">VIC</a></th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=QLD">QLD</a></th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=WA">WA</a></th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=SA">SA</a></th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=TAS">TAS</a></th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=ACT">ACT</a></th>
        <th width="105"><a href="/FreeFields/Calendar.aspx?State=NT">NT</a></th>
      </tr>
      <tr class="rows">
        <td>
          <p><span>FRIDAY 13 JAN</span></p>
        </td>
        <td>
          <p>
            <a href="/FreeFields/Form.aspx?Key=2017Jan13,NSW,Ballina">Ballina</a><br>
            <a href="/FreeFields/Form.aspx?Key=2017Jan13,NSW,Gosford">Gosford</a><br>
          </p>
        </td>
        <td>
          <p>
            <a href="/FreeFields/Form.aspx?Key=2017Jan13,VIC,Ararat">Ararat</a><br>
            <a href="/FreeFields/Form.aspx?Key=2017Jan13,VIC,Cranbourne">Cranbourne</a><br>
          </p>
        </td>
        <td>
          <p>
            <a href="/FreeFields/Form.aspx?Key=2017Jan13,QLD,Doomben">Doomben</a><br>
          </p>
        </td>

I currently have code that looks up the table and prints the links:

    from selenium import webdriver
    import requests
    from bs4 import BeautifulSoup

    # path to chromedriver
    path_to_chromedriver = '/Users/Kirsty/Downloads/chromedriver'
    # ensure browser is set to Chrome
    browser = webdriver.Chrome(executable_path=path_to_chromedriver)
    # set browser to Racing Australia Home Page
    url = 'http://www.racingaustralia.horse/'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    # looks up to find the table & prints link for each page
    table = soup.find('table', attrs={"class": "full-calendar"}).find_all('a')
    for link in table:
        print link.get('href')

Wondering if anyone can assist with how I can get the code to follow (click on) all the links within the table and then do the following on each of those pages:

    g_data = soup.find_all("td", {"class": "horse"})
    for item in g_data:
        print item.text
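
To make the goal a bit more concrete, this is roughly the loop I have in mind (an untested sketch; the base-URL concatenation is just my guess at turning the table's relative hrefs into full URLs):

    # rough, untested sketch of what I'm trying to do
    import requests
    from bs4 import BeautifulSoup

    base = 'http://www.racingaustralia.horse'
    r = requests.get(base + '/')
    soup = BeautifulSoup(r.content, "html.parser")

    # every link inside the calendar table
    table_links = soup.find('table', attrs={"class": "full-calendar"}).find_all('a')
    for link in table_links:
        page_url = base + link.get('href')      # hrefs in the table are relative
        page = requests.get(page_url)
        page_soup = BeautifulSoup(page.content, "html.parser")
        for item in page_soup.find_all("td", {"class": "horse"}):
            print(item.text)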

Thanks in advance

  • What do you mean by "Click on the links"? Meaning, going to the page of the link, then scraping all the links on there? Commented Jan 12, 2017 at 23:36
  • Yes, so the table consists of data such as the below, <table border="0" cellspacing="0" cellpadding="0" class="full-calendar"> <tr class="rows"> <td><p><span>FRIDAY 13 JAN</span></p> </td><td><p> <a href="/FreeFields/Form.aspx?Key=2017Jan13,NSW,Ballina">Ballina</a><br> <a href="/FreeFields/Form.aspx?Key=2017Jan13,NSW,Gosford">Gosford</a><br> </p></td> <td><p> <a href="/FreeFields/Form.aspx?Key=2017Jan13,VIC,Ararat">Ararat</a><br> <a href="/FreeFields/Form.aspx?Key=2017Jan13,VIC,Cranbourne">Cranbourne</a><br></p></td> Commented Jan 12, 2017 at 23:43
  • @KirstyDent Please put any relevant data, like the HTML in your comment above, into the question itself so that it's easier for later readers to find. Commented Jan 13, 2017 at 0:25
  • Apologies - I will do so now! Commented Jan 13, 2017 at 1:42

1 Answer

    import requests, bs4, re
    from urllib.parse import urljoin

    start_url = 'http://www.racingaustralia.horse/'

    def make_soup(url):
        # fetch a page and return a parsed BeautifulSoup object
        r = requests.get(url)
        soup = bs4.BeautifulSoup(r.text, 'lxml')
        return soup

    def get_links(url):
        # collect every /FreeFields/ link on the page as an absolute URL
        soup = make_soup(url)
        a_tags = soup.find_all('a', href=re.compile(r"^/FreeFields/"))
        links = [urljoin(start_url, a['href']) for a in a_tags]  # convert relative url to absolute url
        return links

    def get_tds(link):
        # print the text of every td with class "horse" on the linked page
        soup = make_soup(link)
        tds = soup.find_all('td', class_="horse")
        if not tds:
            print(link, 'did not find horse tag')
        else:
            for td in tds:
                print(td.text)

    if __name__ == '__main__':
        links = get_links(start_url)
        for link in links:
            get_tds(link)

out:

    http://www.racingaustralia.horse/FreeFields/GroupAndListedRaces.aspx did not find horse tag
    http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=NSW did not find horse tag
    http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=VIC did not find horse tag
    http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=QLD did not find horse tag
    http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=WA did not find horse tag
    .......
    WEARETHECHAMPIONS
    STORMY HORIZON
    OUR RED JET
    SAPPER TOM
    MY COUSIN BOB
    ALL TOO HOT
    SAGA DEL MAR
    ZIGZOFF
    SASHAY AWAY
    SO SHE IS
    MILADY DUCHESS

bs4 + requests can meet your needs here.


1 Comment

How do you add pagination to this code, where the main page has several pages?
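
A possible direction (only a sketch, not part of the answer above; it assumes the per-state calendar pages from the question's HTML are the extra pages being asked about, and that start_url, get_links and get_tds from the answer are already defined):

    # Hypothetical sketch: crawl the per-state calendar pages as well as the home page.
    # The state list and URL pattern are taken from the calendar links shown in the question.
    states = ['NSW', 'VIC', 'QLD', 'WA', 'SA', 'TAS', 'ACT', 'NT']
    start_pages = [start_url] + [
        'http://www.racingaustralia.horse/FreeFields/Calendar.aspx?State=' + s
        for s in states
    ]

    all_links = set()
    for page in start_pages:
        all_links.update(get_links(page))   # set() de-duplicates links that appear on several pages

    for link in sorted(all_links):
        get_tds(link)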
