Need help with regex to extract data inside tags

Question

I have been struggling for some time to create a regex that suits my needs for the HTML below. I'm using the java.util.regex.* package, and for various reasons I need to use this package rather than any third-party library.

What I want is to extract the data inside the tags, so the data I want in this particular HTML is 25 / 25, Lindhagen, 0, Spinninghall, 35 and Test Person.

Is it possible to create a regex for this?

<div id="rsv_detail">
  <hr />
  <label>Bokningsstatus</label> <span>&nbsp;</span>
  <label>Bokningar</label> <span>25 / 25 &nbsp;</span>
  <br />
  <label>Plats</label> <span>Lindhagen&nbsp;</span>
  <label>Anlänt</label> <span>0&nbsp;</span>
  <br />
  <label>Sal</label> <span>Spinninghall&nbsp;</span>
  <label>Max antal</label> <span>35&nbsp;</span>
  <br />
  <label>Ledare</label> <span>Test Person&nbsp;</span>
  <br /><br />
  <label>Visa mer</label>
  <span>
    <a href="/index.php?instructors%5B%5D=X129518&amp;func=la&amp;tak=0.36507500+1302460619">Ledare</a>
    <a href="/index.php?locations=LI&amp;func=la&amp;tak=0.36507500+1302460619">Plats</a>
    <a href="/index.php?activities=SP_MEDEL&amp;func=la&amp;tak=0.36507500+1302460619">Aktivitet</a>
  </span>
  <br /><br />
  <br />
  <br />
  <hr />
</div>

Hovercraft Full Of Eels · Accepted Answer · 2011-04-10 20:00:16Z

As far as I know, the best way to extract information from HTML is to use an HTML parser or to convert the HTML to XHTML and extract it via standard XML techniques. Why can't you use 3rd party libraries?

1 Comment

Daniel Over a year ago

The parser is a proxy for an Android app, and I will deploy this proxy to Google App Engine. I haven't been able to find a good HTML parser that doesn't use classes missing from the GAE whitelist. Also, since a lot of the pages that will be parsed are not well formed, any SAX-based parser will throw exceptions... Hope that clarifies things.

Alan Moore · Accepted Answer · 2011-04-11 00:16:05Z

Pattern p = Pattern.compile("<span>([^<&]+)&nbsp;</span>");
Matcher m = p.matcher(text);
while (m.find()) {
    System.out.println(m.group(1));
}

output:

25 / 25
Lindhagen
0
Spinninghall
35
Test Person

This assumes the target <span> always ends with &nbsp;, and never contains any other entities or elements.
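
To see what that assumption means in practice, here is a minimal sketch that wraps the same pattern in a helper and feeds it input that breaks the assumption. The class name `SpanExtractor`, the method `extract`, and the sample strings are made up for illustration; only the pattern comes from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpanExtractor {
    // Same pattern as in the answer: text without '<' or '&', then &nbsp;</span>.
    private static final Pattern SPAN = Pattern.compile("<span>([^<&]+)&nbsp;</span>");

    // Hypothetical helper: collects group 1 of every match, trimmed.
    public static List<String> extract(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = SPAN.matcher(html);
        while (m.find()) {
            values.add(m.group(1).trim());
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<span>&nbsp;</span>"             // empty: skipped, [^<&]+ needs one char
                + "<span>25 / 25 &nbsp;</span>"         // matched
                + "<span>Tom &amp; Jerry&nbsp;</span>"  // extra entity: skipped
                + "<span><b>35</b>&nbsp;</span>"        // nested element: skipped
                + "<span>Test Person</span>";           // no trailing &nbsp;: skipped
        System.out.println(extract(html));
    }
}
```

Only the second span survives here. Spans that violate the assumption are dropped silently rather than reported, so this is worth keeping in mind if the page layout ever changes.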

user unknown · Accepted Answer · 2011-04-10 21:14:10Z

If you filter out each line which doesn't open and close the span-tag in the same line, you can use:

filtered.replaceAll("<span>([^<]*)</span>", "$1").replaceAll("&nbsp;", "")

The parentheses form a capturing group, which you reference later by number; groups are numbered from left to right by their opening parenthesis, and here there is just one, hence $1. After the opening tag, [^<]* reads everything that is not (^) a less-than sign, up to the closing tag, since the next less-than sign is expected to begin it.
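
A minimal sketch of the filter-then-replace approach, assuming the HTML arrives as a string with one element per line. The class `LineFilter`, the method `extractValues`, and the filter condition (the whole line must be a single span) are illustrative assumptions, not part of the answer itself.

```java
import java.util.ArrayList;
import java.util.List;

public class LineFilter {
    // Keeps only lines that consist of one <span>...</span> pair,
    // then strips the tags and the &nbsp; entity from them.
    public static List<String> extractValues(String html) {
        List<String> values = new ArrayList<>();
        for (String line : html.split("\n")) {
            String filtered = line.trim();
            // Assumed filter: the span opens and closes on this line.
            if (!filtered.matches("<span>[^<]*</span>")) {
                continue;
            }
            String value = filtered
                    .replaceAll("<span>([^<]*)</span>", "$1")
                    .replaceAll("&nbsp;", "")
                    .trim();
            if (!value.isEmpty()) {
                values.add(value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<label>Bokningar</label>\n"
                + "<span>25 / 25 &nbsp;</span>\n"
                + "<label>Plats</label>\n"
                + "<span>Lindhagen&nbsp;</span>\n"
                + "<span>\n"
                + "<a href=\"/index.php\">Ledare</a>\n"
                + "</span>";
        System.out.println(extractValues(html));
    }
}
```

The multi-line span at the end is dropped by the filter, which is the behavior the answer relies on.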

However, in most cases I would agree with stema and Hovercraft Full Of Eels. Pitfalls for regex in HTML are:

Opening and closing tags are hard to match with a regex if they span multiple lines, and more so if they are nested.
Tags inside comments are hard to detect.
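
Two of these pitfalls can be reproduced with a small sketch. The class `RegexPitfalls`, the method `spans`, and both sample strings are hypothetical; the pattern is the one from the replaceAll call above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexPitfalls {
    private static final Pattern SPAN = Pattern.compile("<span>([^<]*)</span>");

    // Returns the captured content of every span the pattern can match.
    public static List<String> spans(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = SPAN.matcher(html);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Nested element: [^<]* stops at the <a> tag, so nothing matches.
        String nested = "<span> <a href=\"/index.php\">Ledare</a> </span>";
        // Commented-out markup: the regex cannot tell it apart from live markup.
        String commented = "<!-- <span>old value</span> --><span>35</span>";

        System.out.println(spans(nested));    // prints []
        System.out.println(spans(commented)); // prints [old value, 35]
    }
}
```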

However, there are rare cases where regexes are useful:

One-time jobs, where you can inspect all of the incoming input.
Generated HTML that always looks the same, from routers for example, or Javadoc.
HTML that you build yourself with your program in mind.

eyquem · Accepted Answer · 2011-04-11 00:30:31Z

'<span>(.*?)&nbsp;</span>' as a RE will do, won't it?
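
As a rough check of this suggestion in Java, the sketch below assumes the intended pattern ends in `&nbsp;</span>`, since that is how the sample spans end. The class `LazySpan`, the method `extract`, and the shortened sample input are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazySpan {
    // Reluctant (lazy) quantifier: match as little as possible before &nbsp;</span>.
    private static final Pattern SPAN = Pattern.compile("<span>(.*?)&nbsp;</span>");

    public static List<String> extract(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = SPAN.matcher(html);
        while (m.find()) {
            values.add(m.group(1).trim());
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<span>&nbsp;</span> "
                + "<span>25 / 25 &nbsp;</span> "
                + "<span>Lindhagen&nbsp;</span>";
        // The first span yields an empty string, because .*? may match nothing.
        System.out.println(extract(html));
    }
}
```

Unlike `[^<&]+`, the lazy `.*?` also accepts the empty span, so the result starts with an empty string that the caller has to filter out. It can also run across tag boundaries: for a span without a trailing `&nbsp;`, the match continues until the next `&nbsp;</span>` on the same line (`.` does not match line terminators unless `Pattern.DOTALL` is set).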
