1

I'm trying to scrape movie information from the info box on Wikipedia using BeautifulSoup. I'm having trouble scraping movie budgets, as below.

For example, I want to scrape the '$25 million' budget value from the info box. How can I get the budget value, given that neither the th nor the td tags are unique? (See the example HTML.)

Say I have tag = soup.find('th') with the value <th scope="row" style="white-space:nowrap;padding-right:0.65em;">Budget</th>. How can I get the value '$25 million' from tag?

I thought I could do something like tag.td or tag.text, but neither of these works for me.

Do I have to loop over all tags and check if their text is equal to 'Budget', and if so get the following cell?

Example HTML Code:

<tr>
  <th scope="row" style="white-space:nowrap;padding-right:0.65em;">Budget</th>
  <td style="line-height:1.3em;">$25 million<sup id="cite_ref-2" class="reference"><a href="#cite_note-2">[2]</a></sup></td>
</tr>
<tr>
  <th scope="row" style="white-space:nowrap;padding-right:0.65em;">Box office</th>
  <td style="line-height:1.3em;">$65.7 million<sup id="cite_ref-BOM_3-0" class="reference"><a href="#cite_note-BOM-3">[3]</a></sup></td>
</tr>

4 Answers

2

You can first find the th node whose text is Budget, then find its next sibling td and get the text from that node:

soup.find("th", text="Budget").find_next_sibling("td").get_text() # u'$25 million[2]' 
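A slightly fuller sketch of the same idea, in case it helps. It runs against a trimmed copy of the question's example HTML (the wrapping table tag and the shortened attributes are my own simplification), and the helper name infobox_value is made up for illustration. It uses string=, which newer Beautiful Soup releases (4.4.0 and later, as far as I know) prefer over text=, and it removes the citation sup so the result is '$25 million' rather than '$25 million[2]'. Note that decompose() modifies the parsed tree.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the example rows from the question (style attributes dropped).
HTML = """
<table>
<tr> <th scope="row">Budget</th>
     <td>$25 million<sup id="cite_ref-2" class="reference"><a href="#cite_note-2">[2]</a></sup></td> </tr>
<tr> <th scope="row">Box office</th>
     <td>$65.7 million<sup id="cite_ref-BOM_3-0" class="reference"><a href="#cite_note-BOM-3">[3]</a></sup></td> </tr>
</table>
"""


def infobox_value(soup, label):
    """Return the text of the <td> next to the <th> labelled `label`, or None.

    Hypothetical helper, not part of Beautiful Soup.
    """
    header = soup.find("th", string=label)
    if header is None:
        return None
    cell = header.find_next_sibling("td")
    if cell is None:
        return None
    # Remove citation markers such as [2]; this mutates the parsed tree.
    for sup in cell.find_all("sup", class_="reference"):
        sup.decompose()
    return cell.get_text(strip=True)


soup = BeautifulSoup(HTML, "html.parser")
budget = infobox_value(soup, "Budget")
print(budget)
```

Returning None when the label or the cell is missing avoids the AttributeError that the one-liner raises on pages without a Budget row.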

0

To get the text of every <td> tag, use

tags = soup.find_all('td')

and then

for tag in tags:
    print(tag.get_text())  # prints each cell's text, e.g. '$25 million[2]'

4 Comments

Will this not just print the values of every <td> tag?
Yes, it will print every value. You can then do whatever you want with it.
But I'm specifically looking for the value after the tag which contains the word 'Budget' as the tag text, not every <td> value.
Yes, for that you can loop over the <tr> rows returned by find_all and take the <td> value only if the text of the <th> is equal to 'Budget'.
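The row-by-row comparison described in the last comment could look roughly like this. It is only a sketch against a trimmed copy of the question's example HTML; the function name find_row_value is made up, and the citation marker is left in the returned text.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the example rows from the question (style attributes dropped).
HTML = """
<table>
<tr> <th scope="row">Budget</th>
     <td>$25 million<sup class="reference"><a href="#cite_note-2">[2]</a></sup></td> </tr>
<tr> <th scope="row">Box office</th>
     <td>$65.7 million<sup class="reference"><a href="#cite_note-BOM-3">[3]</a></sup></td> </tr>
</table>
"""


def find_row_value(soup, label):
    """Loop over every <tr> and return the <td> text of the row whose <th> equals `label`."""
    for row in soup.find_all("tr"):
        th = row.find("th")
        td = row.find("td")
        if th is None or td is None:
            continue  # skip rows that are not simple label/value pairs
        if th.get_text(strip=True) == label:
            return td.get_text(strip=True)
    return None


soup = BeautifulSoup(HTML, "html.parser")
budget_text = find_row_value(soup, "Budget")
print(budget_text)
```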
0

What you need is the find_all() method in BeautifulSoup.

For example:

 tdTags = soup.find_all('td',{'class':'reference'}) 

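This finds all 'td' tags whose class attribute is 'reference'. (In the example HTML from the question, the reference class is on the sup tags rather than on the td cells, so adjust the tag name and attribute to match your page.)

You can select whichever td tags you want, as long as they have an attribute that distinguishes them from the others.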

Then you can do a for loop to find the content, as @Bijoy said.
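As an illustration of filtering by attribute with the question's markup: the th cells there all carry scope="row", so one option is to select those and pair each with its neighbouring td. This is a sketch against a trimmed copy of the example HTML, not a general infobox parser, and the variable name rows is my own.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the example rows from the question (style attributes dropped).
HTML = """
<table>
<tr> <th scope="row">Budget</th>
     <td>$25 million<sup class="reference"><a href="#cite_note-2">[2]</a></sup></td> </tr>
<tr> <th scope="row">Box office</th>
     <td>$65.7 million<sup class="reference"><a href="#cite_note-BOM-3">[3]</a></sup></td> </tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Map each row label to the text of the cell beside it.
rows = {}
for th in soup.find_all("th", {"scope": "row"}):
    td = th.find_next_sibling("td")
    if td is not None:
        rows[th.get_text(strip=True)] = td.get_text(strip=True)

print(rows.get("Budget"))
```

Building the dictionary once makes it easy to read several fields (budget, box office, and so on) from the same page.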


0

Another possible way might be:

split_text = soup.get_text().split('\n')
# The entry after 'Budget' is the cost
split_text[split_text.index('Budget')+1]
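A variation on this text-splitting idea that does not depend on where the newlines happen to fall in the page source: passing a separator and strip=True to get_text() puts each text fragment on its own line. This is a sketch against a trimmed copy of the question's example HTML, and value_after is a made-up helper name. Like any text-based lookup, it is fragile if the label also appears elsewhere on the page.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the example rows from the question (style attributes dropped).
HTML = """
<table>
<tr> <th scope="row">Budget</th>
     <td>$25 million<sup class="reference"><a href="#cite_note-2">[2]</a></sup></td> </tr>
<tr> <th scope="row">Box office</th>
     <td>$65.7 million<sup class="reference"><a href="#cite_note-BOM-3">[3]</a></sup></td> </tr>
</table>
"""


def value_after(soup, label):
    """Return the text fragment that directly follows `label`, or None if absent."""
    # One stripped text fragment per line; whitespace-only fragments are dropped.
    parts = soup.get_text("\n", strip=True).split("\n")
    try:
        return parts[parts.index(label) + 1]
    except (ValueError, IndexError):
        return None


soup = BeautifulSoup(HTML, "html.parser")
budget = value_after(soup, "Budget")
print(budget)
```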
