I'm having trouble with a regex match. Here is the string:

(<a href="HTTP://WWW.TEST.COM/TEST/TEST.JPG">LOREM IPSUM DOLOR SIT AMET, CONSECTETUR ADIPISCING ELIT.</a>) LOREM IPSUM DOLOR <a href="HTTP://WWW.TEST.COM/TEST/TEST.JPG">SIT AMET</a> CONSECTETUR ADIPISCING ELIT. 

The regex pattern I'm using is:

/(<)(.*=")(.*)(">)(.*)(<\/.*>)/g 

The problem is that it only picks up one match, because of the .* in the last matching group of the regex pattern. I want it to find two matches of that pattern (there are two in this string). How do I get it to stop at the first instance of > when searching? I figure that would do the trick.

I've heard this called 'non-greedy'? I've tried a + and a ?, but neither seems to work with what I'm doing.

Thanks!

  • @Mx what would be your expected output? Commented Jun 26, 2014 at 4:00
  • Hey there, following up on this. Did one of the answers solve it for you, or is the question still there? Please give us some feedback. :) Commented Jun 27, 2014 at 0:09
  • Hi @zx81, I'm going to take a look at this after work and will let you know! Thanks for following up. Commented Jun 27, 2014 at 0:10
  • Alright, brilliant. Let us know. :) Commented Jun 27, 2014 at 0:10

2 Answers

  1. FYI and FWIW, the accepted wisdom on SO is that regex is not the best way to parse HTML...
  2. but if you're sticking with regex, the main problem is that your .* quantifiers eat up all the characters to the end of the string. This can be fixed by adding a ? after each one to make the quantifiers "lazy": .*?

The * quantifier means zero or more. It causes the . to match every single character up to the end of the string (or of the line, since . does not match a newline by default). Then, to allow the rest of the regex to match, the engine backtracks, giving characters back one at a time. So .* ends up matching the longest possible stretch, not the shortest one. In contrast, .*? expands only as far as it has to, which will get you on the road to the shortest match (with some caveats explained in the articles below).
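
To make the difference concrete, here is a small sketch that runs the question's pattern in both forms against the question's string. It uses Python's re module rather than the asker's own environment, and the variable names (text, greedy, lazy) are only illustrative; the lazy pattern is the original with a ? added after every *.

```python
import re

# The sample string from the question.
text = (
    '(<a href="HTTP://WWW.TEST.COM/TEST/TEST.JPG">LOREM IPSUM DOLOR SIT AMET, '
    'CONSECTETUR ADIPISCING ELIT.</a>) LOREM IPSUM DOLOR '
    '<a href="HTTP://WWW.TEST.COM/TEST/TEST.JPG">SIT AMET</a> '
    'CONSECTETUR ADIPISCING ELIT.'
)

# Original pattern: every .* is greedy.
greedy = re.compile(r'(<)(.*=")(.*)(">)(.*)(</.*>)')

# Same pattern with a ? after each * to make the quantifiers lazy.
lazy = re.compile(r'(<)(.*?=")(.*?)(">)(.*?)(</.*?>)')

greedy_matches = [m.group(0) for m in greedy.finditer(text)]
lazy_matches = [m.group(0) for m in lazy.finditer(text)]

print(len(greedy_matches))  # 1: one match spanning from the first < to the last </a>
print(len(lazy_matches))    # 2: one match per link

# Group 3 is the URL and group 5 is the link text.
for m in lazy.finditer(text):
    print(m.group(3), "|", m.group(5))
```

The greedy version still matches, but its single match swallows both links, which is why only one result comes back.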

Reference

4 Comments

Ok this is great. I just had to add a question mark after every * and it worked great. I appreciate the help. Just curious, what is, in your opinion, the best way to parse HTML if not RegEx? Even just a little nudge in the right direction would be great, I'm always trying to get more efficient with my code.
Thanks, glad it helps! To parse HTML, many people here recommend a DOM parser. It really depends on what the input is and what language you're using. IMO for small fragments that are guaranteed to be well-formed (as opposed to something you scraped), a well-designed regex will do its job.
Ah ok yeah that makes sense. I am using a DOM parser to do some screen scraping but it's going far beyond that where I have to find specific links within the content after I've already pulled the data. Some strings have one match, some have several, etc. so I'm using preg_match_all in PHP to do the job...
You can go a long way with preg_match, and it's fun. And it sounds like you already know about DOM parsers. :)
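
For comparison, here is a rough sketch of the parser-based route mentioned in these comments. It is an assumption-laden illustration, not what the asker used: it relies on Python's standard-library html.parser, which is an event-based parser rather than a full DOM, and the LinkCollector class name is made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a href="..."> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # Anchors without an href are skipped in this sketch.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


text = (
    '(<a href="HTTP://WWW.TEST.COM/TEST/TEST.JPG">LOREM IPSUM DOLOR SIT AMET, '
    'CONSECTETUR ADIPISCING ELIT.</a>) LOREM IPSUM DOLOR '
    '<a href="HTTP://WWW.TEST.COM/TEST/TEST.JPG">SIT AMET</a> '
    'CONSECTETUR ADIPISCING ELIT.'
)

parser = LinkCollector()
parser.feed(text)
parser.close()

for href, label in parser.links:
    print(href, "|", label)
```

Here the parser deals with tag boundaries and attribute quoting, so there is no greedy or lazy behaviour to tune.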

The pattern below starts matching at the beginning of the string and stops at the first occurrence of >. The [^>]* part consumes every character that is not a >, and the final > in the regex then matches that > symbol itself.

\(<[^>]*> 

If you want to match only the <a href ...> tag, try this regex:

\(?(<a[^>]*>) 
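
As a quick check of the negated character class idea, the sketch below applies [^>]* to the string from the question using Python's re module. The second pattern, which also captures the href value and the link text, is an extension added for illustration and is not part of this answer.

```python
import re

text = (
    '(<a href="HTTP://WWW.TEST.COM/TEST/TEST.JPG">LOREM IPSUM DOLOR SIT AMET, '
    'CONSECTETUR ADIPISCING ELIT.</a>) LOREM IPSUM DOLOR '
    '<a href="HTTP://WWW.TEST.COM/TEST/TEST.JPG">SIT AMET</a> '
    'CONSECTETUR ADIPISCING ELIT.'
)

# [^>]* cannot run past the first >, so each opening tag is matched separately.
opening_tags = re.findall(r'<a[^>]*>', text)
print(opening_tags)

# Illustrative extension: capture the href value and the link text the same way,
# using [^"]* for the attribute value and [^<]* for the text before </a>.
links = re.findall(r'<a[^>]*href="([^"]*)"[^>]*>([^<]*)</a>', text)
for href, label in links:
    print(href, "|", label)
```

Because a negated class can never cross its delimiter, this approach needs no lazy quantifiers; the trade-off is that [^<]* would stop early if the link text contained nested tags.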
