Boost regex, regular expression, url and img

Question

I need to find all links and images in the HTML source of a webpage. Currently I have the following expression:

boost::regex findurl("(?s)<\\s*a\\s+.*?href\\s*=\\s*['\"]([^http]{1}[^\\s>]*)['\"]", boost::regex::normal | boost::regbase::icase);

How should it look to also find images (the <img> tag)?

Careful, you might summon Cthulhu :)

– djf, Commented May 30, 2013 at 19:27

djechlin · Accepted Answer · 2012-05-22 21:51:15Z

It will take you less time to learn Perl and use HTML::Parser than it will for you to debug this regex that won't work on pathological HTML. I can already spot three bugs in it for links, even though you're only asking about images.

This tutorial includes sample code that you can probably figure out how to modify even if you don't know Perl: http://perlmeme.org/tutorials/html_parser.html
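To make "pathological HTML" concrete, here is a minimal sketch that runs the question's pattern over a few awkward inputs. It is ported to std::regex so that it compiles without Boost, which is an assumption on my part; the function name question_pattern_matches and the sample inputs are hypothetical too, and these are not necessarily the three bugs meant above.

```cpp
#include <regex>
#include <string>
#include <vector>

// Port of the question's pattern to std::regex (assumption: the original
// uses boost::regex). "(?s)" is not available in ECMAScript syntax, so
// ".*?" is written as "[\s\S]*?" to keep the "dot matches newline" behaviour.
// question_pattern_matches is a hypothetical helper name.
std::vector<std::string> question_pattern_matches(const std::string& html) {
    static const std::regex pattern(
        R"re(<\s*a\s+[\s\S]*?href\s*=\s*['"]([^http][^\s>]*)['"])re",
        std::regex::ECMAScript | std::regex::icase);

    // Collect capture group 1 (the attribute value) of every match.
    std::vector<std::string> found;
    for (std::sregex_iterator it(html.begin(), html.end(), pattern), end;
         it != end; ++it) {
        found.push_back((*it)[1].str());
    }
    return found;
}
```

With this port, a commented-out link such as `<!-- <a href="/old">old</a> -->` is still reported, `<a name="top"></a><link href="/style.css">` yields /style.css because the lazy wildcard runs past the end of the <a> tag, and `<a href="help.html">` is dropped because its value starts with h.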

Happy Green Kid Naps · Accepted Answer · 2012-05-22 22:14:50Z

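Repeating a character inside a character class ([^http]) doesn't look right. The class matches a single character that is not h, t, or p, so the second t is redundant, and it does not mean "not the string http". djechlin has a point: a regex is likely to be insufficient for all but the simplest HTML.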
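If you stay with a regex anyway, a negative lookahead is the usual way to say "does not start with http", and an alternation covers both tags. The following is only a sketch, assuming the intent of [^http] was to skip absolute URLs. It uses std::regex rather than boost::regex so that it compiles without Boost, and the function name find_relative_urls is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects href values of <a> tags and src values of <img> tags, skipping
// values that start with "http://" or "https://" (the assumed intent of
// [^http] in the question). find_relative_urls is a hypothetical name.
std::vector<std::string> find_relative_urls(const std::string& html) {
    // (?:a|img)\b      tag name: <a> or <img>, but not e.g. <abbr>
    // [^>]*?           other attributes, never crossing the end of the tag
    // \s(?:href|src)   the attribute holding the URL
    // (?!https?://)    negative lookahead: reject absolute http(s) URLs
    static const std::regex pattern(
        R"re(<\s*(?:a|img)\b[^>]*?\s(?:href|src)\s*=\s*['"](?!https?://)([^'"\s>]*)['"])re",
        std::regex::ECMAScript | std::regex::icase);

    // Collect capture group 1 (the attribute value) of every match.
    std::vector<std::string> urls;
    for (std::sregex_iterator it(html.begin(), html.end(), pattern), end;
         it != end; ++it) {
        urls.push_back((*it)[1].str());
    }
    return urls;
}
```

Replacing `.*?` with `[^>]*?` keeps a match inside one tag, so `(?s)` is no longer needed. This still ignores comments, scripts, and unquoted attribute values, so the caveat about using a real parser stands.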
