RegEx Multiple Matches

Question

I'm having trouble with a regex match. Here is the string:

(<a href="HTTP://WWW.TEST.COM/TEST/TEST.JPG">LOREM IPSUM DOLOR SIT AMET, CONSECTETUR ADIPISCING ELIT.</a>) LOREM IPSUM DOLOR <a href="HTTP://WWW.TEST.COM/TEST/TEST.JPG">SIT AMET</a> CONSECTETUR ADIPISCING ELIT.

The regex pattern I'm using is:

/(<)(.*=")(.*)(">)(.*)(<\/.*>)/g

The problem is that it only picks up one match, because of the .* in the last matching group of the pattern. I want it to find two matches of that pattern (the string contains two). How do I get it to stop at the first instance of > when searching? I figure that would do the trick.

I've heard this called 'non-greedy'. I've tried a + and a ?, but neither seems to work with what I'm doing.

Thanks!

Hey there, following up on this. Did one of the answers solve it for you, or is the question still there? Please give us some feedback. :)
– zx81, Commented Jun 27, 2014 at 0:09
Hi @zx81, I'm going to take a look at this after work and will let you know! Thanks for following up.
– MillerMedia, Commented Jun 27, 2014 at 0:10

zx81 · Accepted Answer · 2014-06-26 04:01:44Z

FYI and FWIW, the accepted wisdom on SO is that regex is not the best way to parse HTML...
but if you're sticking with regex, the main problem is that your .* quantifiers are greedy: each one eats up all the characters to the end of the string. This can be fixed by adding a ? to make the quantifiers "lazy": .*?

The * quantifier means zero or more, and it is greedy by default. It causes the . dot to first match every single character to the end of the string (or to the end of the line, since by default the dot does not match newlines)... Then, to allow the rest of the regex to match, the engine backtracks one character at a time... so .* ends up with the longest possible match, not the shortest one. In contrast, .*? expands only as far as it has to, which will get you on the road to the shortest match (with some caveats explained in the articles below).
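
To see the difference concretely, here is a minimal sketch in Python (an assumption for illustration only; the asker is using preg_match_all in PHP, where the patterns behave the same way). It runs the original pattern and the same pattern with a ? after every * against the string from the question. The variable names are made up for the example.

```python
import re

# The string from the question, joined onto one line.
text = (
    '(<a href="HTTP://WWW.TEST.COM/TEST/TEST.JPG">LOREM IPSUM DOLOR SIT AMET, '
    'CONSECTETUR ADIPISCING ELIT.</a>) LOREM IPSUM DOLOR '
    '<a href="HTTP://WWW.TEST.COM/TEST/TEST.JPG">SIT AMET</a> '
    'CONSECTETUR ADIPISCING ELIT.'
)

# Original pattern: every .* is greedy, so the groups stretch across both links.
greedy = re.compile(r'(<)(.*=")(.*)(">)(.*)(</.*>)')

# Same pattern with a ? after every *: each quantifier is now lazy.
lazy = re.compile(r'(<)(.*?=")(.*?)(">)(.*?)(</.*?>)')

greedy_matches = [m.group(0) for m in greedy.finditer(text)]
lazy_matches = [m.group(0) for m in lazy.finditer(text)]

print(len(greedy_matches))  # 1: a single match spanning both links
print(len(lazy_matches))    # 2: one match per link

# Group 3 is the URL and group 5 is the link text.
for m in lazy.finditer(text):
    print(m.group(3), "|", m.group(5))
```

The greedy version returns one match that runs from the first < to the last </a>, while the lazy version returns the two links separately.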

Reference

Ok this is great. I just had to add a question mark after every * and it worked great. I appreciate the help. Just curious, what is, in your opinion, the best way to parse HTML if not RegEx? Even just a little nudge in the right direction would be great, I'm always trying to get more efficient with my code.
Thanks, glad it helps! To parse HTML, many people here recommend a DOM parser. It really depends on what it is and what language you're using. IMO for small fragments that are guaranteed to be well-formed (as opposed to something you scraped), a well-designed regex will do its job.
Ah ok yeah that makes sense. I am using a DOM parser to do some screen scraping but it's going far beyond that where I have to find specific links within the content after I've already pulled the data. Some strings have one match, some have several, etc. so I'm using preg_match_all in PHP to do the job...
You can go a long way with preg_match, and it's fun. And it sounds like you already know about DOM parsers. :)
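
For comparison, here is a rough sketch of the parser route, using Python's standard-library html.parser (an assumption for illustration: it is an event-based parser rather than a full DOM, and the thread itself is about PHP). LinkCollector is a hypothetical helper written for this example, not part of any library.

```python
from html.parser import HTMLParser

# The string from the question, joined onto one line.
text = (
    '(<a href="HTTP://WWW.TEST.COM/TEST/TEST.JPG">LOREM IPSUM DOLOR SIT AMET, '
    'CONSECTETUR ADIPISCING ELIT.</a>) LOREM IPSUM DOLOR '
    '<a href="HTTP://WWW.TEST.COM/TEST/TEST.JPG">SIT AMET</a> '
    'CONSECTETUR ADIPISCING ELIT.'
)


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects (href, text) for each <a> that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


collector = LinkCollector()
collector.feed(text)
collector.close()
links = collector.links

for href, label in links:
    print(href, "|", label)
```

This finds both links without any quantifier tuning, at the cost of more code than a one-line regex.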

Avinash Raj · Answer · 2014-06-26 04:13:38Z

The pattern below starts matching at a literal ( followed by <. The negated class [^>]* then consumes every character that is not >, so it stops at the first occurrence of >, and the final > in the regex matches that > symbol itself.

\(<[^>]*>

If you want to match only the <a href ...> opening tag (with or without the leading parenthesis), try this regex:

\(?(<a[^>]*>)
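
As a quick check, here is a small Python sketch (Python is an assumption for illustration; the patterns themselves are unchanged) that runs both patterns from this answer against the string from the question. The variable names are made up for the example.

```python
import re

# The string from the question, joined onto one line.
text = (
    '(<a href="HTTP://WWW.TEST.COM/TEST/TEST.JPG">LOREM IPSUM DOLOR SIT AMET, '
    'CONSECTETUR ADIPISCING ELIT.</a>) LOREM IPSUM DOLOR '
    '<a href="HTTP://WWW.TEST.COM/TEST/TEST.JPG">SIT AMET</a> '
    'CONSECTETUR ADIPISCING ELIT.'
)

# First pattern: a literal "(" followed by one whole tag.
# [^>]* cannot run past the first ">", so there is nothing to backtrack over.
paren_tag = re.compile(r'\(<[^>]*>')

# Second pattern: an optional "(" followed by a captured <a ...> opening tag.
anchor_tag = re.compile(r'\(?(<a[^>]*>)')

paren_hits = paren_tag.findall(text)    # full matches (the pattern has no groups)
anchor_hits = anchor_tag.findall(text)  # contents of capture group 1

print(paren_hits)
print(anchor_hits)
```

The first pattern only finds the link that is preceded by a parenthesis, while the second finds both opening tags. Note that both patterns match only the opening tag, not the link text or the closing </a>, and that <a[^>]*> would also match any other tag whose name starts with "a".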
