
I'm looking to build a robust, reliable GREP rule that catches every web link and URL that appears in text. It should handle gotchas such as HTTPS, URLs in brackets like (http://whatever.com), and URLs followed by punctuation like http://whatever.com?! It's for an InDesign paragraph style GREP rule.

I've posted the best I've come up with so far as an answer below. Is it missing anything, or is there something more robust or straightforward?

1 Answer


This seems to work pretty well:

https?\://.*?(?=(\)|\.|\,|\?|\!|"|')*($|\s)) 
  • Start with either http:// or https://
    • https?\://
  • ...then match the shortest uninterrupted string of any characters
    • .*?
  • ...that is followed by, but doesn't include
    • the (?= ) "positive lookahead"
  • ...zero or more common punctuation characters: ) . , ? ! and any type of single or double quotation mark, curly or straight, left or right (in InDesign GREP a plain " or ' matches every variant of that mark)
    • (\)|\.|\,|\?|\!|"|')*
  • ...and then either the end of the paragraph or any type of whitespace
    • ($|\s)
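
For anyone who wants to try the logic outside InDesign, here is a minimal sketch in Python. It is an approximation, not the InDesign rule itself: Python's `re` module stands in for InDesign's GREP engine, the curly quotes are listed explicitly because a plain " or ' in Python matches only the straight form, and the `find_urls` helper and sample sentence are made up for illustration.

```python
import re

# Port of the pattern above to Python's re module (an assumption: InDesign's
# GREP engine is not Python, so treat this only as a way to test the logic).
# \u201c \u201d \u2018 \u2019 are the curly double and single quotation marks.
URL_PATTERN = re.compile(
    r"""https?://.*?(?=[).,?!"'\u201c\u201d\u2018\u2019]*(?:$|\s))""",
    re.MULTILINE,  # let $ match at the end of every line (paragraph)
)


def find_urls(text):
    """Return every substring of text matched by URL_PATTERN."""
    return URL_PATTERN.findall(text)


if __name__ == "__main__":
    sample = (
        "See (http://whatever.com), or is it https://whatever.com?! "
        "A path with a dot: http://whatever.com/a.b"
    )
    for url in find_urls(sample):
        print(url)
```

The lazy `.*?` grows one character at a time and stops at the first position where only trailing punctuation separates it from whitespace or the end of the line, which is why a dot inside the URL is kept while a closing bracket or "?!" after it is not.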

Some testing:

[screenshot of test results]

  • If I use this in Notepad++ on the source code of this webpage, I see that you need to figure out how to stop at less-than/greater-than brackets, and that it will grab "http://" without an actual URL. This seems to capture most instances with no false positives (just slightly greedy when there are brackets), though I haven't looked for false negatives. Commented Jan 21, 2016 at 19:10
  • stackoverflow.com/questions/27745/getting-parts-of-a-url-regex Commented Jan 21, 2016 at 19:14
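
The two problems raised in the first comment (running past angle brackets, and matching a bare "http://") can be reproduced with a short Python sketch. Python's `re` module is only a stand-in for InDesign or Notepad++ here, and the `STRICTER` pattern is one hypothetical tweak rather than anything proposed in the thread: it requires at least one character after the scheme and treats whitespace, < and > as boundaries.

```python
import re

# Trailing punctuation tolerated after a URL: ) . , ? ! plus straight and
# curly quotation marks (the curly ones written as \u escapes).
PUNCT = r"""[).,?!"'\u201c\u201d\u2018\u2019]*"""

# The answer's pattern, ported to Python's re module (an approximation of
# the InDesign rule, used only to reproduce the behaviour).
ORIGINAL = re.compile(r"https?://.*?(?=" + PUNCT + r"(?:$|\s))", re.MULTILINE)

# Hypothetical stricter variant: the URL body must be one or more characters
# that are not whitespace, < or >, and < or > also count as an end boundary.
STRICTER = re.compile(
    r"https?://[^\s<>]+?(?=" + PUNCT + r"(?:$|[\s<>]))", re.MULTILINE
)

if __name__ == "__main__":
    html = '<a href="http://x.com">link</a> more'
    bare = "see http:// for details"
    for name, pattern in (("original", ORIGINAL), ("stricter", STRICTER)):
        print(name, pattern.findall(html), pattern.findall(bare))
```

In InDesign body text the angle-bracket case is probably rare, so whether the extra character class is worth it depends on where the text comes from.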
