1

Let's say I have the following website:

https://www.atcc.org/Products/All/CRL-2528.aspx#culturemethod 

When you go to this website, it displays a lot of information. In my case, I just want the temperature from the Culture Conditions section.

When you scroll down the webpage, you will see a section called "Culture Conditions":

Atmosphere: air, 95%; carbon dioxide (CO2), 5%
Temperature: 37°C

Using the requests library, I'm able to get the HTML code of the page. When I save the HTML and search through it for my data, I find it towards the bottom, in this form:

 Culture Conditions </th> <td> <div><strong>Atmosphere: </strong>air, 95%; carbon dioxide (CO<sub>2</sub>), 5%</div><div><strong>Temperature: </strong>37&deg;C</div> 

I'm not sure what to do after this. I looked into using BeautifulSoup to parse the HTML, but I was not successful.

This is all the code that I have so far:

    import requests

    url = 'https://www.atcc.org/Products/All/CRL-2528.aspx#culturemethod'
    page = requests.get(url)
    textPage = str(page.text)
    file = open('test2', 'w')
    file.write(textPage)
    file.close()

3 Answers

2
    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.atcc.org/Products/All/CRL-2528.aspx#culturemethod'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    cc = soup.select('#layoutcontent_2_middlecontent_0_productdetailcontent_0_maincontent_2_rptTabContent_rptFields_2_fieldRow_3 td div')
    for c in cc:
        print(c.text.strip())

Output:

Atmosphere: air, 95%; carbon dioxide (CO2), 5%
Temperature: 37°C

To just get the temperature:

    cc = soup.select('#layoutcontent_2_middlecontent_0_productdetailcontent_0_maincontent_2_rptTabContent_rptFields_2_fieldRow_3 td div')[-1]
    cc = cc.text.split(':')[-1].strip()
    print(cc)

Output:

37°C 
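
The long id in that selector is generated by the page and could change. As a rough sketch of an alternative, the row can be located by its visible header text instead. This assumes the markup looks like the fragment quoted in the question (a th containing "Culture Conditions" followed by a td); the table/tr wrapper, the SNIPPET constant, and the culture_conditions helper are hypothetical names added for illustration, and the code runs on the inline fragment rather than the live page:

```python
from bs4 import BeautifulSoup

# Assumed markup: the fragment quoted in the question, wrapped in a
# table row (the wrapper is an assumption) so it parses as a table.
SNIPPET = """
<table><tr>
<th> Culture Conditions </th>
<td>
<div><strong>Atmosphere: </strong>air, 95%; carbon dioxide (CO<sub>2</sub>), 5%</div>
<div><strong>Temperature: </strong>37&deg;C</div>
</td>
</tr></table>
"""


def culture_conditions(html):
    """Return the label/value pairs of the 'Culture Conditions' row as a dict."""
    soup = BeautifulSoup(html, "html.parser")
    # Locate the header cell by its visible text instead of by id.
    header = next(
        (th for th in soup.find_all("th")
         if th.get_text(strip=True) == "Culture Conditions"),
        None,
    )
    if header is None:
        return {}
    cell = header.find_next_sibling("td")
    if cell is None:
        return {}
    result = {}
    for div in cell.find_all("div"):
        # Split "Label: value" on the first colon only.
        label, _, value = div.get_text().partition(":")
        result[label.strip()] = value.strip()
    return result


conditions = culture_conditions(SNIPPET)
print(conditions.get("Temperature"))
```

With the real page, the same helper could be called on r.text, provided the live markup still matches the fragment above.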

1

I wrote a regular expression that searches for the line starting with <div><strong>Atmosphere: and takes everything up to the end of that line. Then I removed every unwanted string from the result. Et voilà!

    import re

    # textPage is the HTML string downloaded by the code in the question
    textPage = re.search(r"<div><strong>Atmosphere: .*", textPage).group(0)
    # Tags to strip from the matched line
    wrongString = ['<div>', '</div>', '<strong>', '</strong>', '<sub>', '</sub>']
    for ws in wrongString:
        textPage = re.sub(ws, "", textPage)
    file = open('test2', 'w')
    file.write(textPage)
    file.close()
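
The approach above keeps the whole matched line and leaves the &deg; entity in place. A minimal sketch of a variation that captures only the temperature and decodes the entity, using only the standard library: it assumes the HTML matches the fragment quoted in the question, and SNIPPET and extract_temperature are hypothetical names introduced here.

```python
import html
import re

# Assumed input: the fragment quoted in the question.
SNIPPET = (
    "<div><strong>Atmosphere: </strong>air, 95%; carbon dioxide "
    "(CO<sub>2</sub>), 5%</div>"
    "<div><strong>Temperature: </strong>37&deg;C</div>"
)


def extract_temperature(text):
    """Return the temperature string, or None if the pattern is not found."""
    # The capture group grabs everything between </strong> and the closing </div>.
    match = re.search(r"<strong>Temperature:\s*</strong>(.*?)</div>", text)
    if match is None:
        return None
    # html.unescape turns entities such as &deg; into real characters.
    return html.unescape(match.group(1)).strip()


temperature = extract_temperature(SNIPPET)
print(temperature)
```

Regular expressions are fragile against markup changes, so this is only reasonable for a small, stable fragment like this one.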


0

Another approach you may find useful:

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.atcc.org/Products/All/CRL-2528.aspx#culturemethod'
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "lxml")
    for items in soup.find_all("strong"):
        if "Atmosphere:" in items.text:
            atmos = items.find_parent().text
            temp = items.find_parent().find_next_sibling().text
            print(f'{atmos}\n{temp}')

Output:

Atmosphere: air, 95%; carbon dioxide (CO2), 5%
Temperature: 37°C
