Consider three regular expressions designed to remove non-Latin characters from a string.

    String x = "some†¥¥¶¶ˆ˚˚word";

    long now = System.nanoTime();
    System.out.println(x.replaceAll("[^a-zA-Z]", "")); // 5ms
    System.out.println(System.nanoTime() - now);

    now = System.nanoTime();
    System.out.println(x.replaceAll("[^a-zA-Z]+", "")); // 2ms
    System.out.println(System.nanoTime() - now);

    now = System.nanoTime();
    System.out.println(x.replaceAll("[^a-zA-Z]*", "")); // <1ms
    System.out.println(System.nanoTime() - now);

All three produce the same result, but with vastly different timings.

Why is that?

2 Answers

The first one is slower because the regex matches each non-Latin character individually, so replaceAll performs a separate replacement for every character.

The other patterns match the whole sequence of non-Latin characters, so replaceAll can replace the whole sequence in one go. I can't explain the performance difference between those two, though. It probably has something to do with how the regex engine handles * and +.
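To make the "one character at a time" versus "whole sequence" point concrete, here is a minimal sketch that counts how many separate matches each pattern produces on the question's string. The class and method names (MatchCounts, countMatches) are made up for illustration, and the string is written with Unicode escapes so it survives any source encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchCounts {
    // Counts how many separate matches the regex finds in the text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The question's string: "some", eight non-Latin characters, "word".
        String x = "some\u2020\u00A5\u00A5\u00B6\u00B6\u02C6\u02DA\u02DAword";
        System.out.println(countMatches("[^a-zA-Z]", x));  // 8: one match per character
        System.out.println(countMatches("[^a-zA-Z]+", x)); // 1: the whole run at once
    }
}
```

Each match costs replaceAll one replacement step, so the first pattern does eight of them here where the second does one.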

1 Comment

What @Andrew Cooper said. As for the difference between * and +: the + has to do a little more looking ahead to make sure the pattern doesn't occur more than once, so it'll be a little slower.

The last one will also replace empty strings with empty strings (unless that is optimized away; I do not know the compiler), which seems a bit unnecessary... ;-)
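The empty matches are easy to see by listing where the * pattern matches with zero length. This is only a sketch: the names EmptyMatches and emptyMatchStarts are hypothetical, and it uses a shorter sample string than the one in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmptyMatches {
    // Returns the start index of every zero-length match of the regex.
    static List<Integer> emptyMatchStarts(String regex, String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            if (m.start() == m.end()) {
                starts.add(m.start());
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "ab\u00A5\u00A5cd"; // "ab", two yen signs, "cd"
        // Empty matches before 'a' and 'b', then before 'c', before 'd' and at the end.
        System.out.println(emptyMatchStarts("[^a-zA-Z]*", text)); // [0, 1, 4, 5, 6]
        System.out.println(text.replaceAll("[^a-zA-Z]*", ""));    // abcd
    }
}
```

So besides the one useful match on the run of yen signs, the * pattern produces five more that replace nothing with nothing.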

The first one will perform many more matches than the second if non-Latin characters are adjacent; otherwise it will not. So I guess the times for 1 and 2 might be roughly the same on some texts, and longer for 1 on others.
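To compare the patterns on different texts, a single System.nanoTime() reading per call is hard to interpret: the first replaceAll in a fresh JVM also pays one-time costs such as class loading, and early calls run before the JIT compiler has warmed up. That may well account for part of the 5ms / 2ms / <1ms ordering in the question. Below is a rough sketch of a fairer comparison, assuming precompiled patterns, an untimed warm-up loop and an average over many runs. The names RegexTiming and averageNanos and the two sample texts are made up for illustration, and a proper benchmark harness would be more reliable than this.

```java
import java.util.regex.Pattern;

final class RegexTiming {
    // Average time of one replaceAll call in nanoseconds, measured after a warm-up phase.
    static long averageNanos(Pattern p, String text, int iterations) {
        long sink = 0; // accumulates result lengths so the work has an observable effect
        for (int i = 0; i < iterations; i++) { // warm-up, not timed
            sink += p.matcher(text).replaceAll("").length();
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) { // timed runs
            sink += p.matcher(text).replaceAll("").length();
        }
        long elapsed = System.nanoTime() - start;
        if (sink == -1) { // never true; only reads sink so the loops are not dead code
            System.out.println(sink);
        }
        return elapsed / iterations;
    }

    public static void main(String[] args) {
        String adjacent = "some\u2020\u00A5\u00B6word";  // non-Latin characters in one run
        String isolated = "s\u2020o\u00A5m\u00B6eword";  // non-Latin characters spread out
        String[] regexes = {"[^a-zA-Z]", "[^a-zA-Z]+", "[^a-zA-Z]*"};
        for (String regex : regexes) {
            Pattern p = Pattern.compile(regex);
            System.out.println(regex
                    + " adjacent: " + averageNanos(p, adjacent, 100_000) + " ns"
                    + ", isolated: " + averageNanos(p, isolated, 100_000) + " ns");
        }
    }
}
```

Precompiling with Pattern.compile also keeps regex compilation out of the timed loop; String.replaceAll compiles the pattern again on every call.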
