1

The following program takes 1.5 seconds to run on my Intel Core i7 machine:

 public static void main(String[] args) { String regex = ".*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*b.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*x"; long start = System.nanoTime(); boolean match = "aaabccc".matches(regex); long end = System.nanoTime(); System.out.println((end - start) / 1e9); } 

The regex consists of 30 times ".*" followed by a "b", followed by another 30 times ".*" and finally an "x". The time taken by String.matches to return is exponential on the number of ".*", i.e. it's O(eⁿ). If you replace 30 by 50, it will take more than a minute to return.

This is concerning because I was thinking of having a public HTTP API where the user would be able to enter a regex. If the regex is coming from outside, it is possible that a malicious user could enter a regex that would cause very high CPU on my server for a long time.

Is this a known issue in the Java API and does anyone know of a workaround, i.e. a way to inspect a regex to determine whether it is dangerous to use?

6
  • It seems only Java and JavaScript are using regex engines that timesout when I tested this on regex101.com. All the other quickly concludes that there is no match. Perhaps you could use PCRE2 (if there are java bindings for it)? Commented May 11, 2023 at 21:37
  • @TedLyngmo PCRE2 took ~713k steps before failing (check Regex Debugger). Commented May 11, 2023 at 21:38
  • I don't know Java, but this seems to be related: Is Java ReDos vulnerable? Commented May 11, 2023 at 21:41
  • 1
    DFA-based regex engines can match patterns like this in linear time (at the expense of usually using more memory). If you want to support tagged subexpressions, you can use a TDFA (tagged DFA), but they're more complex still. You can also do some quick hacks by pre-processing the user's input for obvious problems like this one--since .* (once) can match an arbitrary amount of input, you can safely collapse .*.* down to a single .*. In your example case, that reduces the pattern to .*b.*x, which should be fast on essentially any regex engine. Commented May 11, 2023 at 21:58
  • 2
    A clever user can pretty easily outwit that simple-minded hack though. You can add more sophisticated hacks to make a DoS attack more difficult, but in the end, it probably makes more sense to just spawn the regex search into a separate process, and limit its resource usage, or use a DFA-based regex engine. Commented May 11, 2023 at 22:01

2 Answers 2

0

According to Mastering Regular Expressions, 3rd Edition, Java uses a Traditional NFA regex engine that supports lazy quantifiers but has the disadvantage that some patterns can take a very long time to match.

DFA engines, being deterministic, don't suffer from that problem but are less powerful. In particular, lazy quantifiers are not possible with a DFA regex engine such as that found in "awk (most versions), egrep (most versions), flex, lex".

See also DFA Based Regular Expression Engines for Java with Capture.

Sign up to request clarification or add additional context in comments.

Comments

0

SonarQube keeps alarming on regex like .*something.*. Even that must have bad performance, but compared to what?

Coming back to your question: Apparently Sonar has the ability to inspect a regex and determine the risk. While I have not checked where in the code it would decide to alarm, it is open source and can be found at https://github.com/SonarSource/sonarqube

Edit: There seem to be multiple checks for regex patterns in this directory: https://github.com/SonarSource/sonar-java/tree/master/java-checks/src/main/java/org/sonar/java/checks/regex

Edit2: The code uses factored out common classes. Here is another promising specimen: https://github.com/SonarSource/sonar-analyzer-commons/blob/master/regex-parsing/src/main/java/org/sonarsource/analyzer/commons/regex/finders/RedosFinder.java

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.