
I'm creating a regex. This is my test dataset:

<a href="test.html">test1</a>
<a href="test.pdf">test2</a>
<a href="test.html">test1</a>
<a href="test.html">test1</a><a href="testtime.pdf">test2</a>

I'm trying to capture from "href=" to "pdf", but the following regex:

href=.*?\.pdf 

captures the right data when the PDF link is the only link on its line, but it also matches the following from the last line:

href="test.html">test1</a><a href="testtime.pdf 

I only want the text from the last "href" to ".pdf". I don't want the first "href" on the line or anything between it and the second "href". Is it possible to modify the regex to match this properly?

Thanks.

3
  • You want the name of the last linked file only if it's a PDF? Commented Apr 18, 2017 at 13:20
  • regex for javascript? Commented Apr 18, 2017 at 13:20
  • Please note that parsing HTML with regexes is fraught with peril. See htmlparsing.com/regexes.html for examples of why. Commented Apr 18, 2017 at 13:28

2 Answers 2

2

Require the attribute value to start with a quote, and don't allow the value itself to contain that quote:

href="[^"]*?\.pdf 

Demo: https://regex101.com/r/UuRin3/1
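A quick way to see the difference, sketched in Python. The choice of Python and the variable names are assumptions, since the question does not say which regex engine is in use. A lazy `.*?` still begins at the leftmost `href=`, whereas `[^"]*?` cannot run past the closing quote of the first attribute:

```python
import re

# The last line of the test dataset from the question.
line = '<a href="test.html">test1</a><a href="testtime.pdf">test2</a>'

# Original pattern: lazy, but the match still starts at the first href.
loose = re.search(r'href=.*?\.pdf', line).group(0)

# Pattern from this answer: the value may not contain a double quote,
# so the attempt at the first href fails and the engine moves on.
strict = re.search(r'href="[^"]*?\.pdf', line).group(0)

print(loose)
print(strict)
```

Laziness only affects how far a match extends to the right. It does not change where the match starts, which is why the character class is needed.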

P.S.

Don't use Regex to parse HTML
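If a parser is an option, a minimal sketch with Python's standard-library `html.parser` might look like the following. The class name `PdfLinkCollector` and the helper `find_pdf_links` are hypothetical, and the sketch assumes a link counts as a PDF when its `href` ends in `.pdf` (so a URL with a query string after `.pdf` would be missed):

```python
from html.parser import HTMLParser


class PdfLinkCollector(HTMLParser):
    """Collects href values of <a> tags that end in .pdf (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before this call.
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value.
            if name == "href" and value and value.lower().endswith(".pdf"):
                self.pdf_links.append(value)


def find_pdf_links(html_text):
    """Return the PDF hrefs found in html_text, in document order."""
    collector = PdfLinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.pdf_links


sample = '<a href="test.html">test1</a><a href="testtime.pdf">test2</a>'
print(find_pdf_links(sample))
```

Because the parser hands over each attribute value separately, two links on the same line cannot bleed into one match.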


1 Comment

This helped me out, thanks. By the way, I am not using Regex to parse HTML. I am trying to find instances of linked PDFs on a site with 9000 HTML pages.
0

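First of all, use capturing groups: they let you match the whole string but extract only part of it. For example, href=\"([^\"]*\.pdf)\" should match the string href="xxxx.pdf" but extract only the xxxx.pdf part. Using [^\"]* rather than .* inside the group keeps the match from starting at an earlier href on the same line.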

How you do this depends on which technology you use to run the regex. Somehow I doubt it is HTML.
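As one illustration, here is how a capturing group could be used from Python. The language is an assumption, and so is the one-link-group-per-line layout of the test data. The group uses `[^"]*` so that a match cannot span two links:

```python
import re

# The question's test data, assumed to be laid out on four lines.
data = '''<a href="test.html">test1</a>
<a href="test.pdf">test2</a>
<a href="test.html">test1</a>
<a href="test.html">test1</a><a href="testtime.pdf">test2</a>'''

# The parentheses capture only the attribute value; [^"]* keeps each
# match inside a single pair of quotes.
pdf_pattern = re.compile(r'href="([^"]*\.pdf)"')

# With exactly one group, findall returns the captured text of each match.
pdf_files = pdf_pattern.findall(data)
print(pdf_files)
```

The full match still includes `href="` and the closing quote, but only the file name inside the parentheses is returned.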
