I want to extract a bit of data from this snippet:

<div id="information_content"> <b>Name:</b> file.rar <br> <b>Date Modified:</b> 2 days ago <br> <b>Size:</b> 212.19 MB <br> <b>Type:</b> Archive <br> <b>Permissions:</b> Public </div> </div> 

I want to extract only 212.19 MB.

I have extracted the snippet using soup.find('div', attrs={'id': 'information_content'}) but I can't figure out how to drill further down to get what I need.

Can anybody help?

3 Answers

BeautifulSoup doesn't support XPath, so the best way would be to use lxml, which does.
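A minimal sketch of what that could look like, assuming lxml is installed and the markup matches the snippet in the question. The XPath expression and the variable names are my own illustration, not something taken from the question: it finds the b element whose text is "Size:" and takes the first text node that follows it.

```python
from lxml import html

# The snippet from the question, without the stray trailing </div>.
snippet = (
    '<div id="information_content"> <b>Name:</b> file.rar <br> '
    '<b>Date Modified:</b> 2 days ago <br> <b>Size:</b> 212.19 MB <br> '
    '<b>Type:</b> Archive <br> <b>Permissions:</b> Public </div>'
)

root = html.fromstring(snippet)

# Select the first text node that comes right after <b>Size:</b>.
texts = root.xpath('.//b[text()="Size:"]/following-sibling::text()[1]')

# xpath() returns a list; guard against the label being absent.
size = texts[0].strip() if texts else None
print(size)
```

If the label is missing, xpath() returns an empty list, which is why the sketch falls back to None rather than indexing blindly.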

If the div always has the same structure, you can follow these steps using BeautifulSoup. Once you have extracted the div, build a list from its text, split on '\n'. Then just select the right element of the list.
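Here is a rough sketch of that idea, assuming the bs4 package and the exact markup from the question. Since the snippet as posted has no newlines in it, this version uses stripped_strings in place of splitting on '\n'; the pairing of labels with values is my own assumption about the layout (label, value, label, value, ...).

```python
from bs4 import BeautifulSoup

# The snippet from the question, without the stray trailing </div>.
snippet = (
    '<div id="information_content"> <b>Name:</b> file.rar <br> '
    '<b>Date Modified:</b> 2 days ago <br> <b>Size:</b> 212.19 MB <br> '
    '<b>Type:</b> Archive <br> <b>Permissions:</b> Public </div>'
)

soup = BeautifulSoup(snippet, "html.parser")
div = soup.find("div", attrs={"id": "information_content"})

# All non-empty text pieces inside the div, already stripped of whitespace.
strings = list(div.stripped_strings)

# Assumes labels and values alternate: even positions are labels, odd are values.
fields = {
    label.rstrip(":"): value
    for label, value in zip(strings[0::2], strings[1::2])
}

size = fields.get("Size")
print(size)
```

Looking the value up by its label is a bit more forgiving than a fixed list index if the order of the fields ever changes, though it still depends on the label/value alternation holding.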

I've done something similar and explained everything I did here: Python and BeautifulSoup: extracting prizes from Quiniela - http://www.manejandodatos.es/2014/2/python-beautifulsoup-extracting-prizes-quiniela

I hope it helps!

As said previously, if the structure of these divs is always the same, the size will be in the third string if you split on '<br>'.

>>> x = '<div id="information_content"> <b>Name:</b> file.rar <br> <b>Date Modified:</b> 2 days ago <br> <b>Size:</b> 212.19 MB <br> <b>Type:</b> Archive <br> <b>Permissions:</b> Public </div> </div>'
>>> x.split('<br>')[2]
' <b>Size:</b> 212.19 MB '

From there you can use a regular expression to get just the part you need. For example, this pattern matches values in this format (digits, a literal decimal point, two decimals, whitespace, and a two-character unit ending in B):

\d+\.\d\d\s.B

It matches 10.00 kB as well as 1000.34 TB.
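Putting the split and the regular expression together, a minimal sketch using only the standard library could look like this. The names SIZE_PATTERN, fragment and size are just illustrative, and the dot is escaped here so that it only matches a literal decimal point.

```python
import re

x = (
    '<div id="information_content"> <b>Name:</b> file.rar <br> '
    '<b>Date Modified:</b> 2 days ago <br> <b>Size:</b> 212.19 MB <br> '
    '<b>Type:</b> Archive <br> <b>Permissions:</b> Public </div> </div>'
)

# Digits, a literal dot, two decimals, whitespace, then a unit such as MB or kB.
SIZE_PATTERN = re.compile(r'\d+\.\d\d\s.B')

# Third chunk when splitting on <br>, as described above.
fragment = x.split('<br>')[2]

match = SIZE_PATTERN.search(fragment)
size = match.group(0) if match else None
print(size)
```

This still relies on the size always being the third chunk and always having exactly two decimals, so treat it as a starting point rather than something robust.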
