The following program takes 1.5 seconds to run on my Intel Core i7 machine:
public static void main(String[] args) { String regex = ".*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*b.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*x"; long start = System.nanoTime(); boolean match = "aaabccc".matches(regex); long end = System.nanoTime(); System.out.println((end - start) / 1e9); } The regex consists of 30 times ".*" followed by a "b", followed by another 30 times ".*" and finally an "x". The time taken by String.matches to return is exponential on the number of ".*", i.e. it's O(eⁿ). If you replace 30 by 50, it will take more than a minute to return.
This is concerning because I was thinking of having a public HTTP API where the user would be able to enter a regex. If the regex is coming from outside, it is possible that a malicious user could enter a regex that would cause very high CPU on my server for a long time.
Is this a known issue in the Java API and does anyone know of a workaround, i.e. a way to inspect a regex to determine whether it is dangerous to use?
.*(once) can match an arbitrary amount of input, you can safely collapse.*.*down to a single.*. In your example case, that reduces the pattern to.*b.*x, which should be fast on essentially any regex engine.