Narrowing Regex results

Question

I'm creating a regex. This is my test dataset:

<a href="test.html">test1</a>
<a href="test.pdf">test2</a>
<a href="test.html">test1</a>
<a href="test.html">test1</a><a href="testtime.pdf">test2</a>

I'm trying to capture from "href=" to "pdf", but the following regex:

href=.*?\.pdf

captures the right data when a link sits alone on its line, but on the last line it also matches the following:

href="test.html">test1</a><a href="testtime.pdf

I only want the text from the last "href" to the ".pdf". I don't want the first "href" on the line or anything that comes between it and the second "href". Is it possible to modify the regex to match this properly?

Thanks.

You want the name of the last linked file only if it's a PDF?
– Waxi, Commented Apr 18, 2017 at 13:20
Please note that parsing HTML with regexes is fraught with peril. See htmlparsing.com/regexes.html for examples of why.
– Andy Lester, Commented Apr 18, 2017 at 13:28

Community · Accepted Answer · 2017-05-23 11:54:17Z

Require the attribute value to start with a quote, and do not let the value contain that quote:

href="[^"]*?\.pdf

Demo: https://regex101.com/r/UuRin3/1
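To see the difference between the two patterns, here is a minimal Python sketch run against the last line of the question's test data. The variable names are only illustrative, and Python's `re` module is an assumption, since the question does not say which regex engine is in use.

```python
import re

# Last line of the question's test data: two links with nothing between them.
line = '<a href="test.html">test1</a><a href="testtime.pdf">test2</a>'

# Question's pattern: the lazy .*? still starts at the first "href=" it finds
# and grows until it reaches ".pdf", so it spans both links.
loose = re.findall(r'href=.*?\.pdf', line)

# Answer's pattern: [^"] cannot match the closing quote of the first href,
# so the attempt starting at test.html fails and only the PDF link matches.
tight = re.findall(r'href="[^"]*?\.pdf', line)

print(loose)  # ['href="test.html">test1</a><a href="testtime.pdf']
print(tight)  # ['href="testtime.pdf']
```

The lazy quantifier only makes a match as short as possible from a fixed starting point; it does not move the starting point forward, which is why the negated character class is needed.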

P.S.

Don't use Regex to parse HTML
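If a parser is an option, one possible alternative is sketched below with Python's standard `html.parser` module. The class name `PdfLinkCollector` is hypothetical, and the choice of Python is an assumption, not something stated in the question.

```python
from html.parser import HTMLParser


class PdfLinkCollector(HTMLParser):
    """Hypothetical helper: collects href values of <a> tags ending in .pdf."""

    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value.
            if name == "href" and value and value.lower().endswith(".pdf"):
                self.pdf_links.append(value)


html = '<a href="test.html">test1</a><a href="testtime.pdf">test2</a>'
collector = PdfLinkCollector()
collector.feed(html)
print(collector.pdf_links)  # ['testtime.pdf']
```

Unlike the regex, this does not depend on quoting style or attribute order inside the tag.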

edited May 23, 2017 at 11:54

answered Apr 18, 2017 at 13:21

Dmitry Egorov

Katori Over a year ago

This helped me out, thanks. By the way, I am not using Regex to parse HTML. I am trying to find instances of linked PDFs on a site with 9000 HTML pages.

schroedingersKat · Accepted Answer · 2017-04-18 13:20:51Z

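First of all, use capturing groups: they let you match the whole string but extract only part of it. For example, href=\"([^\"]*\.pdf)\" matches the string href="xxxx.pdf" but captures only the xxxx.pdf part. The [^\"]* keeps the capture inside the quotes; a greedy .* in its place would run across several links on the same line.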

How you extract the group depends on the technology you use to run the regex. Somehow I doubt it is HTML.
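As one illustration of extracting only the captured part, here is a small Python sketch. Python is an assumption, since the question does not name the tool. The pattern uses a quote-bounded group, `[^"]*`, so the capture cannot run past the end of the attribute value; the name `PDF_HREF` is illustrative.

```python
import re

# The question's test data, one entry per line.
text = '''<a href="test.html">test1</a>
<a href="test.pdf">test2</a>
<a href="test.html">test1</a>
<a href="test.html">test1</a><a href="testtime.pdf">test2</a>'''

# The parentheses capture only the file name; the surrounding href="..."
# must still match but is not part of the result.
PDF_HREF = re.compile(r'href="([^"]*\.pdf)"')

# With exactly one group, findall returns the group contents, not the full match.
pdf_links = PDF_HREF.findall(text)
print(pdf_links)  # ['test.pdf', 'testtime.pdf']
```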
