0

I have been struggling for some time to write a regex for the HTML below. I'm using the java.util.regex.* package and, for various reasons, I need to use it rather than any third-party library.

I want to extract the data inside the <span> tags; for this particular HTML that is 25 / 25, Lindhagen, 0, Spinninghall, 35 and Test Person.

Is it possible to create a regex for this?

<div id="rsv_detail">
    <hr />
    <label>Bokningsstatus</label> <span>&nbsp;</span>
    <label>Bokningar</label> <span>25 / 25 &nbsp;</span>
    <br />
    <label>Plats</label> <span>Lindhagen&nbsp;</span>
    <label>Anlänt</label> <span>0&nbsp;</span>
    <br />
    <label>Sal</label> <span>Spinninghall&nbsp;</span>
    <label>Max antal</label> <span>35&nbsp;</span>
    <br />
    <label>Ledare</label> <span>Test Person&nbsp;</span>
    <br /><br />
    <label>Visa mer</label>
    <span>
        <a href="/index.php?instructors%5B%5D=X129518&amp;func=la&amp;tak=0.36507500+1302460619">Ledare</a>
        <a href="/index.php?locations=LI&amp;func=la&amp;tak=0.36507500+1302460619">Plats</a>
        <a href="/index.php?activities=SP_MEDEL&amp;func=la&amp;tak=0.36507500+1302460619">Aktivitet</a>
    </span>
    <br /><br />
    <br />
    <br />
    <hr />
</div>

4 Answers

4

As far as I know, the best way to extract information from HTML is to use an HTML parser or to convert the HTML to XHTML and extract it via standard XML techniques. Why can't you use 3rd party libraries?


1 Comment

The parser is a proxy for an Android app, and I will deploy this proxy to Google App Engine. I haven't been able to find a good HTML parser that does not use classes missing from the GAE whitelist. Also, since a lot of the pages to be parsed are not well formed, any SAX-based parser will throw exceptions. Hope that clarifies things.
1
Pattern p = Pattern.compile("<span>([^<&]+)&nbsp;</span>");
Matcher m = p.matcher(text);
while (m.find()) {
    System.out.println(m.group(1));
}

output:

25 / 25
Lindhagen
0
Spinninghall
35
Test Person

This assumes the target <span> always ends with &nbsp;, and never contains any other entities or elements.
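For anyone who wants to run this, here is a self-contained sketch of the same idea. The class name `SpanExtractor`, the method `extract` and the shortened sample string are made up for illustration, and the `trim()` call is an addition (not part of the snippet above) that drops the space left in front of `&nbsp;` in `25 / 25 &nbsp;`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpanExtractor {
    // Same pattern as above: one or more characters that are neither '<'
    // nor '&', followed by &nbsp;</span>.
    private static final Pattern SPAN =
            Pattern.compile("<span>([^<&]+)&nbsp;</span>");

    // Hypothetical helper: collects the captured text of every matching span.
    public static List<String> extract(String html) {
        List<String> values = new ArrayList<String>();
        Matcher m = SPAN.matcher(html);
        while (m.find()) {
            // trim() is an addition; it removes the space before &nbsp;.
            values.add(m.group(1).trim());
        }
        return values;
    }

    public static void main(String[] args) {
        // Shortened sample taken from the HTML in the question.
        String html = "<label>Bokningsstatus</label> <span>&nbsp;</span> "
                + "<label>Bokningar</label> <span>25 / 25 &nbsp;</span> "
                + "<label>Plats</label> <span>Lindhagen&nbsp;</span>";
        for (String value : extract(html)) {
            System.out.println(value);
        }
    }
}
```

Because the group requires at least one character, the empty `<span>&nbsp;</span>` is skipped, and the final span containing the links is skipped as well because it contains `<`.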


0

If you first filter out every line that does not open and close the span tag on the same line, you can use:

filtered.replaceAll("<span>([^<]*)</span>", "$1")
        .replaceAll("&nbsp;", "")

The parentheses form a capturing group, which you can reference later by number; groups are numbered from left to right by their opening parenthesis. Here there is only one, hence $1. After the opening tag, [^<]* reads everything that is not a less-than sign, so it stops at the first less-than sign, which is expected to start the closing tag.
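To make the group reference concrete, here is a small sketch. The class `SpanStripper`, the method `stripSpans` and the inputs are made up for illustration, and it assumes every span opens and closes on the same line.

```java
public class SpanStripper {
    // Hypothetical helper: replaces each single-line <span>...</span> with its
    // contents ($1 refers to the first capturing group), then removes the
    // &nbsp; entities.
    public static String stripSpans(String line) {
        return line.replaceAll("<span>([^<]*)</span>", "$1")
                   .replaceAll("&nbsp;", "");
    }

    public static void main(String[] args) {
        System.out.println(stripSpans("<span>Lindhagen&nbsp;</span>")); // prints "Lindhagen"
        System.out.println(stripSpans("<span>35&nbsp;</span>"));        // prints "35"
    }
}
```

Note that text outside the span is left untouched: a line such as `<label>Sal</label> <span>Spinninghall&nbsp;</span>` becomes `<label>Sal</label> Spinninghall`, so the labels would still need to be removed separately.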

However, in most cases I would agree with stema and Hovercraft Full Of Eels. The pitfalls of using regexes on HTML are:

  • Opening and closing tags are hard to match with a regex if they span multiple lines, and even harder if they are nested.
  • Tags inside comments are hard to detect.
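Two of these pitfalls can be reproduced with the simple span pattern. This is only a sketch; the class `RegexPitfalls`, the method `find` and the sample inputs are invented for the demonstration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexPitfalls {
    private static final Pattern SPAN = Pattern.compile("<span>([^<]*)</span>");

    // Hypothetical helper: returns the contents of every span the regex finds.
    public static List<String> find(String html) {
        List<String> found = new ArrayList<String>();
        Matcher m = SPAN.matcher(html);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Nested markup: the inner <b> stops [^<]*, so the span is missed.
        System.out.println(find("<span><b>35</b></span>")); // prints []

        // Commented-out markup: the regex cannot tell that the first span
        // sits inside a comment, so it is reported as well.
        System.out.println(find("<!-- <span>old</span> --><span>new</span>")); // prints [old, new]
    }
}
```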

However, there are rare cases where regexes are useful:

  • One-time jobs, where you can inspect all of the input in advance.
  • Generated HTML that always looks the same, for example from routers or Javadoc.
  • HTML that you generate yourself with your program in mind.


0

'<span>(.*?)&nbsp;</span>' as a regex will do, won't it?

