4

I am trying to build a regular expression to extract the text inside the HTML tag shown below. However, I have limited experience with regular expressions, and I'm having trouble building the pattern.

How can I extract the text from this tag:

<a href="javascript:ProcessQuery('report_drilldown',145817)">text</a>

That is just a sample of the HTML source of the page. Basically, I need a regex string to match the "text" inside of the <a> tag. Can anyone assist me with this? Thank you. I hope my question wasn't phrased too horribly.

UPDATE: Just for clarification, report_drilldown is absolute, but I don't really care if it's present in the regex as absolute or not.

145817 is a random 6-digit number that is actually a database id. "text" is just simple plain text, so it shouldn't be invalid HTML. Also, most people are saying that it's best not to use regex in this situation, so what would be best to use instead? Thanks so much!

  • 14
    Using regex to solve the problem of parsing HTML? Now you have two problems. Commented Jun 30, 2009 at 1:44
  • How so? I've used regex before in another project with a quite similar task. Maybe it's better to use something else to extract the text of the tag? Commented Jun 30, 2009 at 1:47
  • 3
    Parsing HTML with a regex is, in general, a Bad Thing: stackoverflow.com/questions/701166 Commented Jun 30, 2009 at 1:53
  • 2
    HTML parsing with regex doesn't work with invalid HTML, and even valid HTML cases can be a pain. It's better to use a DOM document implementation in C# and access the textContent of the particular node(s). Commented Jun 30, 2009 at 1:56

4 Answers

4

The answer is... DON'T!

Use a library, such as this one


2
<a href="javascript:ProcessQuery\('report_drilldown',[0-9]+\)">([^<]*)</a> 

This won't really solve the problem, but it may just barely scrape by. In particular, it's very brittle: the slightest change to the markup and it won't match. If report_drilldown isn't meant to be absolute, replace it with [^']*, and/or capture both it and the number if you need them.
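As a rough illustration of the pattern above and the looser variant it describes, here is a minimal Python sketch. The names `STRICT`, `LOOSE`, `link_texts` and `link_details` are made up for this example, and the original question is about C#, so treat this only as a way to try the expressions out.

```python
import re

# Pattern from the answer; the parentheses of the JavaScript call are escaped.
STRICT = re.compile(
    r"""<a href="javascript:ProcessQuery\('report_drilldown',[0-9]+\)">([^<]*)</a>"""
)

# Looser variant: any query name, with the name and the id captured as well.
LOOSE = re.compile(
    r"""<a href="javascript:ProcessQuery\('([^']*)',([0-9]+)\)">([^<]*)</a>"""
)

sample = """<a href="javascript:ProcessQuery('report_drilldown',145817)">text</a>"""


def link_texts(html):
    """Return the text of every link that matches the strict pattern."""
    return STRICT.findall(html)


def link_details(html):
    """Return (query name, id, text) tuples for every matching link."""
    return LOOSE.findall(html)


print(link_texts(sample))
print(link_details(sample))
```

Because `[^<]*` stops at the first `<`, a link whose text contains nested markup (for example a `<b>` element) will not match at all, which is one concrete way the brittleness shows up.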

If you need something that parses HTML, then it's a bit of a nightmare if you have to deal with tag soup. If you were using Python, I'd suggest BeautifulSoup, but I don't know of anything similar for C#. (Does anyone know of a similar tag soup parsing library for C#?)
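For comparison, a parser-based approach might look roughly like the sketch below. It uses Python's standard-library `html.parser` rather than BeautifulSoup so that it runs without installing anything; `LinkTextParser`, `extract_link_texts` and the default `href_prefix` are names invented for this example, not part of any library.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the text of <a> tags whose href starts with a given prefix."""

    def __init__(self, href_prefix):
        super().__init__()
        self.href_prefix = href_prefix
        self.texts = []
        self._buffer = None  # list of text chunks while inside a wanted <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Attribute values can be None (e.g. a bare "href"), hence "or".
            href = dict(attrs).get("href") or ""
            if href.startswith(self.href_prefix):
                self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.texts.append("".join(self._buffer))
            self._buffer = None


def extract_link_texts(html, href_prefix="javascript:ProcessQuery("):
    """Return the text of every matching link found in the HTML string."""
    parser = LinkTextParser(href_prefix)
    parser.feed(html)
    parser.close()
    return parser.texts
```

Unlike the regex, this keeps working if the attributes are reordered or the link text contains nested tags, because the parser handles the tag syntax and the code only looks at the parsed `href` value.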

6 Comments

Attributes in HTML aren't supposed to contain <. And it's a well-formedness constraint in XML.
Yes, I'm sorry, stupid console fonts are mixing me up. It was supposed to be (). Thanks for your help!
Hah, I update my post, see your answer, and now roll back to the original.
Sorry about that! My bad. Now I'm convinced that I need to find a better font for CMD. Thanks!
Lucida Console and Envy Code R (search Google for it) work well for me.
1

I agree regex might not be the best way to parse this, but using a backreference it's easily done:

<(?<tag>\w*)(?:.*)>(?<text>.*)</\k<tag>> 

Where tag and text are named capture groups.
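The pattern above uses .NET-style named groups. As a hedged sketch for experimenting with it, the same expression can be written with Python's spelling of named groups and backreferences; `TAG_PATTERN` and `first_tag_text` are names made up for this example.

```python
import re

# Python spells named groups (?P<name>...) and backreferences (?P=name).
TAG_PATTERN = re.compile(r"<(?P<tag>\w*)(?:.*)>(?P<text>.*)</(?P=tag)>")


def first_tag_text(html):
    """Return (tag, text) for the first match, or None if nothing matches."""
    match = TAG_PATTERN.search(html)
    if match is None:
        return None
    return match.group("tag"), match.group("text")


sample = """<a href="javascript:ProcessQuery('report_drilldown',145817)">text</a>"""
print(first_tag_text(sample))
```

On the sample link this yields the tag `a` and the text `text`. On the two inputs raised in the comments on this answer, the greedy `.*` makes it return `more text` for the nested case and only `b` for the sibling case, so it matches but with the wrong content.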

hat-tip: expresso library

2 Comments

Even assuming well-formed input (if it's not, this style of parsing may fail or, worse, incorrectly succeed) you have two problems shown by this sample input: 1) <em><em>text</em>more text</em>. 2) <em>a</em><em>b</em>. Of course, your answer is really no better than mine, but I would be hesitant to call it easily done. Regex is simply the wrong tool for this job, even when it works occasionally.
OK. I am going to continue searching for a very "safe" and "good" method to process such "tag soup", but for now, as R. Pate's regex is working, I'm going to continue using it until I find a better solution. Thanks so much, everybody!
-1
<a href\=\"[^\x00]*?\"> 

should get you the opening tag.

<\/a> 

will give you the closing tag. Just extract out what is in between. Untested though.
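A minimal sketch of this "find the opening tag, find the closing tag, take what is in between" idea, in Python, could look like the following. Note that it deliberately simplifies the opening-tag pattern to `[^"]*` and drops the unneeded escapes, so it is not the exact expression above; `OPEN_TAG`, `CLOSE_TAG` and `text_between_tags` are names invented for this example.

```python
import re

# Simplified opening tag: an <a> whose only attribute is a double-quoted href.
OPEN_TAG = re.compile(r'<a href="[^"]*">')
CLOSE_TAG = re.compile(r"</a>")


def text_between_tags(html):
    """Return whatever sits between each opening <a href="..."> and the next </a>."""
    texts = []
    pos = 0
    while True:
        opening = OPEN_TAG.search(html, pos)
        if opening is None:
            break
        # Look for the closing tag only after the end of the opening tag.
        closing = CLOSE_TAG.search(html, opening.end())
        if closing is None:
            break
        texts.append(html[opening.end():closing.start()])
        pos = closing.end()
    return texts
```

This shares the limits of the other regex answers: any extra attribute on the link, or single-quoted attribute values, will cause the opening tag to be missed.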

1 Comment

Do you mean \x instead of /x? Why any character except null? Why are = and " escaped? Since you're not using / delimiters in sed-style, escaping / is a little strange too.
