Comparing performance of different regexes, clarification needed

Question

Consider three regular expressions designed to remove non-Latin characters from a string.

    String x = "some†¥¥¶¶ˆ˚˚word";

    long now = System.nanoTime();
    System.out.println(x.replaceAll("[^a-zA-Z]", "")); // 5ms
    System.out.println(System.nanoTime() - now);

    now = System.nanoTime();
    System.out.println(x.replaceAll("[^a-zA-Z]+", "")); // 2ms
    System.out.println(System.nanoTime() - now);

    now = System.nanoTime();
    System.out.println(x.replaceAll("[^a-zA-Z]*", "")); // <1ms
    System.out.println(System.nanoTime() - now);

All three produce the same result, with vastly different performance metrics.

Why is that?

Andrew Cooper · Accepted Answer · 2012-02-06 03:39:29Z

The first one is slower because the regex matches each non-Latin character individually, so replaceAll operates on each character individually.

The other patterns match the whole sequence of non-Latin characters, so replaceAll can replace the whole sequence in one go. I can't explain the performance difference between these two, though. It probably has something to do with how the regex engine handles * and + differently.
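The per-character versus whole-sequence behaviour can be checked by counting matches, since replaceAll performs one replacement per match. This is only a minimal sketch: the class name MatchCounter and the helper countMatches are made up for illustration, and the input is the question's string written with Unicode escapes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchCounter {
    // The string from the question, with the non-Latin characters as Unicode escapes.
    static final String INPUT =
            "some\u2020\u00A5\u00A5\u00B6\u00B6\u02C6\u02DA\u02DAword";

    // Hypothetical helper: counts how often the regex matches in the input.
    static int countMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String[] regexes = {"[^a-zA-Z]", "[^a-zA-Z]+", "[^a-zA-Z]*"};
        for (String regex : regexes) {
            System.out.println(regex + " -> " + countMatches(regex, INPUT) + " matches");
        }
    }
}
```

For this input the first pattern matches 8 times and the second matches once. The third matches more than once because it also matches the empty string at positions where no non-Latin character is present.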

What @Andrew Cooper said. The difference between * and + is because the + has to do a little more looking ahead to make sure the pattern doesn't occur more than once, so it'll be a little slower.
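Claims like this are hard to judge from a single timed call, because the first call in a fresh JVM also pays for class loading and runs before the JIT compiler has warmed up. A rough way to compare the patterns is to compile each one once, warm it up, and then average over many calls. The sketch below assumes that approach; the class name RegexTiming, the helper averageNanos, and the iteration counts are invented for illustration, and a harness such as JMH would be the more rigorous choice.

```java
import java.util.regex.Pattern;

final class RegexTiming {
    // The string from the question, with the non-Latin characters as Unicode escapes.
    static final String INPUT =
            "some\u2020\u00A5\u00A5\u00B6\u00B6\u02C6\u02DA\u02DAword";

    // Results are accumulated here so the JIT cannot discard the work as unused.
    static volatile long sink;

    // Hypothetical helper: average nanoseconds per replaceAll call,
    // with the pattern compiled once outside the timed loop.
    static double averageNanos(String regex, String input, int iterations) {
        Pattern pattern = Pattern.compile(regex);
        long total = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            total += pattern.matcher(input).replaceAll("").length();
        }
        long elapsed = System.nanoTime() - start;
        sink = total;
        return (double) elapsed / iterations;
    }

    public static void main(String[] args) {
        String[] regexes = {"[^a-zA-Z]", "[^a-zA-Z]+", "[^a-zA-Z]*"};
        // Warm-up pass: results are ignored, it only gives the JIT time to compile.
        for (String regex : regexes) {
            averageNanos(regex, INPUT, 50_000);
        }
        // Measured pass.
        for (String regex : regexes) {
            System.out.printf("%-12s %.1f ns per call%n",
                    regex, averageNanos(regex, INPUT, 200_000));
        }
    }
}
```

The absolute numbers depend on the machine and JVM, so only the relative ordering after warm-up is worth reading from the output.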

Leo · Accepted Answer · 2012-02-06 03:42:58Z

The last one will replace empty strings with empty strings (unless that is optimized away, I do not know the compiler) which seems a bit unnecessary... ;-)
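The empty matches become visible when the replacement is a marker instead of an empty string. A small sketch, with the class name EmptyMatchDemo and the helper mark chosen only for illustration:

```java
final class EmptyMatchDemo {
    // The string from the question, with the non-Latin characters as Unicode escapes.
    static final String INPUT =
            "some\u2020\u00A5\u00A5\u00B6\u00B6\u02C6\u02DA\u02DAword";

    // Hypothetical helper: replaces every match with "-" so each match shows up in the output.
    static String mark(String regex, String input) {
        return input.replaceAll(regex, "-");
    }

    public static void main(String[] args) {
        // With +, only the run of non-Latin characters matches.
        System.out.println(mark("[^a-zA-Z]+", INPUT)); // some-word
        // With *, the empty string also matches before each letter and at the end.
        System.out.println(mark("[^a-zA-Z]*", INPUT)); // -s-o-m-e--w-o-r-d-
    }
}
```

Each dash in the second line is one replacement, so with an empty replacement string most of that work changes nothing in the result.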

The first one will search many more times than the second if non-Latin characters are adjacent; otherwise it will not. So I guess the times for 1 and 2 might be roughly the same on some texts, and longer for 1 on other texts.
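The effect of adjacency can be seen by counting matches on two inputs, one with the non-Latin characters in a single run and one with them scattered between letters. Both sample strings, the class name AdjacencyDemo, and the helper countMatches are invented for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AdjacencyDemo {
    // Made-up inputs: the same three non-Latin characters, adjacent versus scattered.
    static final String ADJACENT = "ab\u2020\u00A5\u00B6cd";
    static final String SCATTERED = "a\u2020b\u00A5c\u00B6d";

    // Hypothetical helper: counts how often the regex matches in the input.
    static int countMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Adjacent: the first pattern needs 3 matches, the second only 1.
        System.out.println(countMatches("[^a-zA-Z]", ADJACENT));
        System.out.println(countMatches("[^a-zA-Z]+", ADJACENT));
        // Scattered: both patterns need 3 matches.
        System.out.println(countMatches("[^a-zA-Z]", SCATTERED));
        System.out.println(countMatches("[^a-zA-Z]+", SCATTERED));
    }
}
```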
