2

I need to find all links and images in the HTML source of a web page. Currently I have the following expression:

boost::regex findurl("(?s)<\\s*a\\s+.*?href\\s*=\\s*['\"]([^http]{1}[^\\s>]*)['\"]", boost::regex::normal | boost::regbase::icase); 

What should it look like to also find images (the <img> tag)?

1
  • Careful, you might summon Cthulhu :) Commented May 30, 2013 at 19:27

2 Answers

4

It will take you less time to learn Perl and use HTML::Parser than it will for you to debug this regex that won't work on pathological HTML. I can already spot three bugs in it for links, even though you're only asking about images.

This tutorial includes sample code that you can probably figure out how to modify even if you don't know Perl: http://perlmeme.org/tutorials/html_parser.html

0

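Repeating a character inside a character class ([^http]) doesn't look right: the class matches any single character other than h, t, or p, so it also rejects URLs such as "page.html" or "/top", not just ones that start with "http". djechlin has a point that a regex is likely to be insufficient for anything but the simplest HTML.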
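
If you still want a regex for simple, well-behaved pages, here is a minimal sketch of one way the pattern might be adjusted to cover both tags. It assumes the intent of [^http] was to skip values that start with "http", and it swaps in std::regex (ECMAScript syntax) for boost::regex so that it compiles without Boost. The function name find_relative_urls is hypothetical. A negative lookahead, (?!http), replaces the character class, and an alternation pairs <a> with href and <img> with src.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the href values of <a> tags and the src values
// of <img> tags, skipping values that start with "http" (assumed intent of
// the original [^http] class).
std::vector<std::string> find_relative_urls(const std::string& html) {
    // [^>] also matches newlines in ECMAScript syntax, so no (?s) is needed.
    // (?!http) is a negative lookahead: the value must not start with "http".
    static const std::regex pattern(
        R"re(<\s*(?:a\s[^>]*?\bhref|img\s[^>]*?\bsrc)\s*=\s*['"](?!http)([^'"\s>]*)['"])re",
        std::regex::ECMAScript | std::regex::icase);

    std::vector<std::string> urls;
    // Iterate over every non-overlapping match and keep capture group 1.
    for (std::sregex_iterator it(html.begin(), html.end(), pattern), end;
         it != end; ++it) {
        urls.push_back((*it)[1].str());
    }
    return urls;
}
```

Calling find_relative_urls on a page's source returns the matching attribute values in document order. This sketch still shares the weaknesses mentioned above: it ignores unquoted attribute values, does not check that the opening and closing quotes match, and will happily match tags inside comments or scripts, so a real HTML parser remains the safer choice.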
