48

Recently I have been had to search a number of string values to see which one matches a certain pattern. Neither the number of string values nor the pattern itself is clear until a search term has been entered by the user.

The problem is I have noticed each time my application runs the following line:

if (stringValue.matches(rexExPattern)) { // do something so simple } 

it takes about 40 micro second. No need to say when the number of string values exceeds a few thousands, it'll be too slow.

The pattern is something like:

"A*B*C*D*E*F*" 

where A~F are just examples here, but the pattern is some thing like the above. Please note* that the pattern actually changes per search. For example "A*B*C*" may change to "W*D*G*A*".

I wonder if there is a better substitution for the above pattern or, more generally, an alternative for Java regular expressions.

4
  • 3
    Is this really not meeting your performance requirements? Is this truly the bottleneck in your code? A few hundred * 4 microseconds = a few milliseconds. Commented Nov 7, 2013 at 7:06
  • I may not calculate correctly. but normally it takes 10 seconds per search. I'll edit my question in a few seconds. Commented Nov 7, 2013 at 8:18
  • 1
    The solution is most likely not a faster regex, and more likely a better search algorithm. If you gave more details about your actual usage, you may get some better advice. Also, your math still doesn't add up. A few hundred * 40 microseconds != 10 seconds. Where did your profiling show you that the bottleneck was? Commented Nov 7, 2013 at 8:21
  • Also have a look on boolean sequence. It currently supports limited RE symbols but it is fast and flexible and give options to become more faster. Commented May 20, 2016 at 20:38

2 Answers 2

134

Regular expressions in Java are compiled into an internal data structure. This compilation is the time-consuming process. Each time you invoke the method String.matches(String regex), the specified regular expression is compiled again.

So you should compile your regular expression only once and reuse it:

Pattern pattern = Pattern.compile(regexPattern); for(String value : values) { Matcher matcher = pattern.matcher(value); if (matcher.matches()) { // your code here } } 
Sign up to request clarification or add additional context in comments.

5 Comments

You can create the Matcher ahead of time as well as the Pattern. Just feed the constructor an empty string to start with, then inside the loop you can use Matcher.reset(value) to apply it. It won't have nearly as much impact on performance as precompiling the Pattern, but if performance is really critical, it will help.
thank you for your help. But the problem is the pattern changes per search. for example "ABC" may changes to "EBCW". I'll edit my question right now.
@Joseph_Marzbani The point was to compile the regex once per search, precisely as the example in this answer clearly illustrates.
@AlanMoore As a caution: Matcher objects are NOT thread safe per Javadocs. Therefore, the reset method you've describe is likely dangerous in high-performance, multi-threaded scenarios. I'm not sure if reusing the Matcher object is beneficial enough to take that risk. Reusing the precompiled Pattern is definitely safe and good practice.
Yes, guys, you should not vote up @AlanMoore comment. You should not use the same instance of matcher! It's not thread-safe!
37

Consider the following (quick and dirty) test:

import java.util.ArrayList; import java.util.List; import java.util.Random; import java.util.regex.Matcher; import java.util.regex.Pattern; public class Test3 { // time that tick() was called static long tickTime; // called at start of operation, for timing static void tick () { tickTime = System.nanoTime(); } // called at end of operation, prints message and time since tick(). static void tock (String action) { long mstime = (System.nanoTime() - tickTime) / 1000000; System.out.println(action + ": " + mstime + "ms"); } // generate random strings of form AAAABBBCCCCC; a random // number of characters each randomly repeated. static List<String> generateData (int itemCount) { Random random = new Random(); List<String> items = new ArrayList<String>(); long mean = 0; for (int n = 0; n < itemCount; ++ n) { StringBuilder s = new StringBuilder(); int characters = random.nextInt(7) + 1; for (int k = 0; k < characters; ++ k) { char c = (char)(random.nextInt('Z' - 'A') + 'A'); int rep = random.nextInt(95) + 5; for (int j = 0; j < rep; ++ j) s.append(c); mean += rep; } items.add(s.toString()); } mean /= itemCount; System.out.println("generated data, average length: " + mean); return items; } // match all strings in items to regexStr, do not precompile. static void regexTestUncompiled (List<String> items, String regexStr) { tick(); int matched = 0, unmatched = 0; for (String item:items) { if (item.matches(regexStr)) ++ matched; else ++ unmatched; } tock("uncompiled: regex=" + regexStr + " matched=" + matched + " unmatched=" + unmatched); } // match all strings in items to regexStr, precompile. static void regexTestCompiled (List<String> items, String regexStr) { tick(); Matcher matcher = Pattern.compile(regexStr).matcher(""); int matched = 0, unmatched = 0; for (String item:items) { if (matcher.reset(item).matches()) ++ matched; else ++ unmatched; } tock("compiled: regex=" + regexStr + " matched=" + matched + " unmatched=" + unmatched); } // test all strings in items against regexStr. static void regexTest (List<String> items, String regexStr) { regexTestUncompiled(items, regexStr); regexTestCompiled(items, regexStr); } // generate data and run some basic tests public static void main (String[] args) { List<String> items = generateData(1000000); regexTest(items, "A*"); regexTest(items, "A*B*C*"); regexTest(items, "E*C*W*F*"); } } 

Strings are random sequences of 1-8 characters with each character occurring 5-100 consecutive times (e.g. "AAAAAAGGGGGDDFFFFFF"). I guessed based on your expressions.

Granted this might not be representative of your data set, but the timing estimates for applying those regular expressions to 1 million randomly generates strings of average length 208 each on my modest 2.3 GHz dual-core i5 was:

Regex Uncompiled Precompiled A* 0.564 sec 0.126 sec A*B*C* 1.768 sec 0.238 sec E*C*W*F* 0.795 sec 0.275 sec 

Actual output:

generated data, average length: 208 uncompiled: regex=A* matched=6004 unmatched=993996: 564ms compiled: regex=A* matched=6004 unmatched=993996: 126ms uncompiled: regex=A*B*C* matched=18677 unmatched=981323: 1768ms compiled: regex=A*B*C* matched=18677 unmatched=981323: 238ms uncompiled: regex=E*C*W*F* matched=25495 unmatched=974505: 795ms compiled: regex=E*C*W*F* matched=25495 unmatched=974505: 275ms 

Even without the speedup of precompiled expressions, and even considering that the results vary wildly depending on the data set and regular expression (and even considering that I broke a basic rule of proper Java performance tests and forgot to prime HotSpot first), this is very fast, and I still wonder if the bottleneck is truly where you think it is.

After switching to precompiled expressions, if you still are not meeting your actual performance requirements, do some profiling. If you find your bottleneck is still in your search, consider implementing a more optimized search algorithm.

For example, assuming your data set is like my test set above: If your data set is known ahead of time, reduce each item in it to a smaller string key by removing repetitive characters, e.g. for "AAAAAAABBBBCCCCCCC", store it in a map of some sort keyed by "ABC". When a user searches for "ABC*" (presuming your regex's are in that particular form), look for "ABC" items. Or whatever. It highly depends on your scenario.

1 Comment

If I "saved your life" because you switched to compiled regex's, you should really mark @Seelenvirtuose's answer, who suggested it very clearly, as the correct one.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.