16

I have this:

import java.util.regex.*; String regex = "(?<m1>(hello|universe))|(?<m2>(hello world))"; String s = "hello world"; Pattern pattern = Pattern.compile(regex); Matcher matcher = pattern.matcher(s); while(matcher.find()) { MatchResult matchResult = m.toMatchResult(); String substring = s.substring(matchResult.start(), matchResult.end()); System.out.println(substring); } 

The above only prints hello whereas I want it to print hello world.

One way to fix this is to re-order the groups in String regex = "(?<m2>(hello world))|(?<m1>(hello|universe))" but I don't have control over the regex I get in my case...

So what is the best way to find the longest match? An obvious way would be to check all possible substrings of s as mentioned here (Efficiently finding all overlapping matches for a regular expression) by length and pick the first but that is O(n^2). Can we do better?

13
  • 4
    Use "(?<m1>hello world)|(?<m2>hello|universe)", put the longest alternative branch before the shortest. Commented Feb 28, 2017 at 0:35
  • 2
    @WiktorStribiżew: You did not read my question. I clearly said that is an alternative that I cannot pursue. Commented Feb 28, 2017 at 0:36
  • 3
    @pathikrit If you're inserting the alternation dynamically, you have enough control to re-sort (I prefer reverse-alphabetically for a dynamic scenario like this). I just answered a similar question the other day stackoverflow.com/questions/42432356/… Commented Feb 28, 2017 at 2:25
  • 2
    You are a funny guy, pathikrit. You want to cure a disease by looking at the symptoms instead of the root cause. This is like taking Aspirin against a headache caused by an operable brain tumor. If the regex is outside of your control and does the wrong thing, either do not use it at all but your own regex or talk to the guy maintaining the regex generator in order to make him fix it for you upstream. Commented Mar 5, 2017 at 13:30
  • 3
    @pathikrit, if the regex is always very simple and consistent, just edit it and use the edited version to match. Commented Mar 8, 2017 at 15:11

4 Answers 4

4
+50

Here is a way of doing it using matcher regions, but with a single loop over the string index:

public static String findLongestMatch(String regex, String s) { Pattern pattern = Pattern.compile("(" + regex + ")$"); Matcher matcher = pattern.matcher(s); String longest = null; int longestLength = -1; for (int i = s.length(); i > longestLength; i--) { matcher.region(0, i); if (matcher.find() && longestLength < matcher.end() - matcher.start()) { longest = matcher.group(); longestLength = longest.length(); } } return longest; } 

I'm forcing the pattern to match until the region's end, and then I move the region's end from the rightmost string index towards the left. For each region's end tried, Java will match the leftmost starting substring that finishes at that region's end, i.e. the longest substring that ends at that place. Finally, it's just a matter of keeping track of the longest match found so far.

As a matter of optimization, and since I start from the longer regions towards the shorter ones, I stop the loop as soon as all regions that would come after are already shorter than the length of longest substring already found.


An advantage of this approach is that it can deal with arbitrary regular expressions and no specific pattern structure is required:

findLongestMatch("(?<m1>(hello|universe))|(?<m2>(hello world))", "hello world") ==> "hello world" findLongestMatch("hello( universe)?", "hello world") ==> "hello" findLongestMatch("hello( world)?", "hello world") ==> "hello world" findLongestMatch("\\w+|\\d+", "12345 abc") ==> "12345" 
Sign up to request clarification or add additional context in comments.

3 Comments

Although you did some clever optimizations to break out early, this is still O(n^2) since matched.find() is O(n) and it is in a loop of length n
@pathikrit I agree with you. However, the solutions in the link you posted in the question are O(n^3) and not O(n^2).
Alternatively, you could find all matches of the regular expression and select the longest match(es) from the list, but this might not be efficient.
2

If you are dealing with just this specific pattern:

  1. There is one or more named group on the highest level connected by |.
  2. The regex for the group is put in superfluous braces.
  3. Inside those braces is one or more literal connected by |.
  4. Literals never contain |, ( or ).

Then it is possible to write a solution by extracting the literals, sorting them by their length and then returning the first match:

private static final Pattern g = Pattern.compile("\\(\\?\\<[^>]+\\>\\(([^)]+)\\)\\)"); public static final String findLongestMatch(String s, Pattern p) { Matcher m = g.matcher(p.pattern()); List<String> literals = new ArrayList<>(); while (m.find()) Collections.addAll(literals, m.group(1).split("\\|")); Collections.sort(literals, new Comparator<String>() { public int compare(String a, String b) { return Integer.compare(b.length(), a.length()); } }); for (Iterator<String> itr = literals.iterator(); itr.hasNext();) { String literal = itr.next(); if (s.indexOf(literal) >= 0) return literal; } return null; } 

Test:

System.out.println(findLongestMatch( "hello world", Pattern.compile("(?<m1>(hello|universe))|(?<m2>(hello world))") )); // output: hello world System.out.println(findLongestMatch( "hello universe", Pattern.compile("(?<m1>(hello|universe))|(?<m2>(hello world))") )); // output: universe 

Comments

1
+25

just add the $ (End of string) before the Or separator |.
Then it check whether the string is ended of not. If ended, it will return the string. Otherwise skip that part of regex.

The below code gives what you want

import java.util.regex.*; public class RegTest{ public static void main(String[] arg){ String regex = "(?<m1>(hello|universe))$|(?<m2>(hello world))"; String s = "hello world"; Pattern pattern = Pattern.compile(regex); Matcher matcher = pattern.matcher(s); while(matcher.find()) { MatchResult matchResult = matcher.toMatchResult(); String substring = s.substring(matchResult.start(), matchResult.end()); System.out.println(substring); } } } 

Likewise, the below code will skip hello , hello world and match hello world there
See the usage of $ there

import java.util.regex.*; public class RegTest{ public static void main(String[] arg){ String regex = "(?<m1>(hello|universe))$|(?<m2>(hello world))$|(?<m3>(hello world there))"; String s = "hello world there"; Pattern pattern = Pattern.compile(regex); Matcher matcher = pattern.matcher(s); while(matcher.find()) { MatchResult matchResult = matcher.toMatchResult(); String substring = s.substring(matchResult.start(), matchResult.end()); System.out.println(substring); } } } 

3 Comments

I like your idea :)
I don't think this is correct, e.g. "hello, world!" you won't match anything, but the expression should match "hello" or "hello universe!" should match "universe" but you won't match anything again.
Or "hello world hello" you will find "hello" but you should find "hello world". What you are doing is basically a endsWith() which doesn't ensure you find the longest match.
1

If the structure of the regex is always the same, this should work:

String regex = "(?<m1>(hello|universe))|(?<m2>(hello world))"; String s = "hello world"; //split the regex into the different groups String[] allParts = regex.split("\\|\\(\\?\\<"); for (int i=1; i<allParts.length; i++) { allParts[i] = "(?<" + allParts[i]; } //find the longest string int longestSize = -1; String longestString = null; for (int i=0; i<allParts.length; i++) { Pattern pattern = Pattern.compile(allParts[i]); Matcher matcher = pattern.matcher(s); while(matcher.find()) { MatchResult matchResult = matcher.toMatchResult(); String substring = s.substring(matchResult.start(), matchResult.end()); if (substring.length() > longestSize) { longestSize = substring.length(); longestString = substring; } } } System.out.println("Longest: " + longestString); 

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.