
In my software I need to split a string into words. I currently have more than 19,000,000 documents with more than 30 words each.

Which of the following two ways is the best way to do this (in terms of performance)?

    StringTokenizer sTokenize = new StringTokenizer(s, " ");
    while (sTokenize.hasMoreTokens()) {
        String word = sTokenize.nextToken();
        // ... process word
    }

or

    String[] splitS = s.split(" ");
    for (int i = 0; i < splitS.length; i++) {
        String word = splitS[i];
        // ... process word
    }
7 Comments

  • I'd reckon the first, but why don't you just measure it? Commented May 11, 2011 at 14:21
  • I can, but I am also interested in the explanation... Commented May 11, 2011 at 14:23
  • What if someone says that option X is the fastest? Will you go for that option, or, just to be sure, will you test both? If it's the latter, why not do so straight away? :) Commented May 11, 2011 at 14:23
  • @John: Please be clear with your question: do you want to know which is better between StringTokenizer and split, or the best approach regardless of the two? Commented May 11, 2011 at 14:33
  • @Damodar "best way to do this in terms of performance" Commented May 11, 2011 at 14:39

9 Answers


If your data is already in a database and you need to parse the string of words, I would suggest using indexOf repeatedly. It's many times faster than either solution.

However, getting the data from a database is still likely to be much more expensive.

    // Build a sample string of 60 space-separated "words"
    StringBuilder sb = new StringBuilder();
    for (int i = 100000; i < 100000 + 60; i++)
        sb.append(i).append(' ');
    String sample = sb.toString();

    int runs = 100000;
    for (int i = 0; i < 5; i++) {
        {
            // StringTokenizer
            long start = System.nanoTime();
            for (int r = 0; r < runs; r++) {
                StringTokenizer st = new StringTokenizer(sample);
                List<String> list = new ArrayList<String>();
                while (st.hasMoreTokens())
                    list.add(st.nextToken());
            }
            long time = System.nanoTime() - start;
            System.out.printf("StringTokenizer took an average of %.1f us%n", time / runs / 1000.0);
        }
        {
            // Pattern.split (what String.split uses in the general case)
            long start = System.nanoTime();
            Pattern spacePattern = Pattern.compile(" ");
            for (int r = 0; r < runs; r++) {
                List<String> list = Arrays.asList(spacePattern.split(sample, 0));
            }
            long time = System.nanoTime() - start;
            System.out.printf("Pattern.split took an average of %.1f us%n", time / runs / 1000.0);
        }
        {
            // Hand-rolled indexOf loop
            long start = System.nanoTime();
            for (int r = 0; r < runs; r++) {
                List<String> list = new ArrayList<String>();
                int pos = 0, end;
                while ((end = sample.indexOf(' ', pos)) >= 0) {
                    list.add(sample.substring(pos, end));
                    pos = end + 1;
                }
            }
            long time = System.nanoTime() - start;
            System.out.printf("indexOf loop took an average of %.1f us%n", time / runs / 1000.0);
        }
    }

prints

    StringTokenizer took an average of 5.8 us
    Pattern.split took an average of 4.8 us
    indexOf loop took an average of 1.8 us
    StringTokenizer took an average of 4.9 us
    Pattern.split took an average of 3.7 us
    indexOf loop took an average of 1.7 us
    StringTokenizer took an average of 5.2 us
    Pattern.split took an average of 3.9 us
    indexOf loop took an average of 1.8 us
    StringTokenizer took an average of 5.1 us
    Pattern.split took an average of 4.1 us
    indexOf loop took an average of 1.6 us
    StringTokenizer took an average of 5.0 us
    Pattern.split took an average of 3.8 us
    indexOf loop took an average of 1.6 us

The cost of opening a file will be about 8 ms. As the files are so small, your cache may improve performance by a factor of 2-5x. Even so, it's going to spend ~10 hours opening files. The cost of using split vs StringTokenizer is far less than 0.01 ms each. Parsing 19 million documents × 30 words × 8 letters per word should take about 10 seconds (at about 1 GB per 2 seconds).
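As a rough sanity check on that estimate: 19,000,000 documents × 30 words × 8 letters is about 4.6 GB of text (closer to 5 GB once separators are counted), and at 1 GB per 2 seconds that works out to roughly 10 seconds of pure parsing time.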

If you want to improve performance, I suggest you have far fewer files, e.g. use a database. If you don't want to use an SQL database, I suggest using one of these: http://nosql-database.org/


19 Comments

  • Interesting, I ran your code and split consistently takes about twice as long on my machine as StringTokenizer; indexOf takes half as long.
  • Scanner and StringTokenizer use Pattern/regex, which is more flexible but not as efficient as just looking for a specific character.
  • @Peter Lawrey: StringTokenizer does not use regex.
  • @tjjjohnson Java 7's split does an operation similar to a series of indexOf calls, but only for a limited (though very common) set of cases.
  • Just for the record, your indexOf loop is incorrect: it misses the part after the last separator. Not sure this impacts performance very much, though.

Split in Java 7 just calls indexOf for this input, see the source. Split should be very fast, close to repeated calls of indexOf.
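As a rough illustration of what that fastpath amounts to, here is a simplified sketch (not the actual OpenJDK source; it ignores split's limit argument and the trailing-empty-string handling, and the class name is made up):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    public class SplitSketch {
        // Simplified idea behind the JDK fastpath: a single-character delimiter
        // that is not a regex metacharacter is split with indexOf; anything else
        // falls back to the regex engine.
        public static String[] fastSplit(String s, String regex) {
            if (regex.length() == 1 && ".$|()[{^?*+\\".indexOf(regex.charAt(0)) == -1) {
                char ch = regex.charAt(0);
                List<String> parts = new ArrayList<String>();
                int pos = 0, end;
                while ((end = s.indexOf(ch, pos)) >= 0) {
                    parts.add(s.substring(pos, end));
                    pos = end + 1;
                }
                parts.add(s.substring(pos)); // keep the piece after the last delimiter
                return parts.toArray(new String[0]);
            }
            return Pattern.compile(regex).split(s); // general case: regex engine
        }
    }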

3 Comments

  • Are you sure? I can see on line 2361 under the link you provided: return Pattern.compile(regex).split(this, limit);
  • The implementation is at line 1770.
  • The implementation will use the fastpath (i.e. indexOf) if the regex satisfies certain criteria, and will use Pattern.compile(regex).split(this, limit); otherwise. From the source: "fastpath if the regex is a (1) one-char String and this character is not one of the RegEx's meta characters ".$|()[{^?*+\\", or (2) two-char String and the first char is the backslash and the second is not the ascii digit or ascii letter". But as pointed out elsewhere, this is an implementation detail and as such should not be relied upon.

The Java API specification recommends using split. See the documentation of StringTokenizer.

3 Comments

  • @downvoters: Please be clear about the question above: does it ask which is better between StringTokenizer and split, or the best approach regardless of the two?
  • The question is pretty clear that he's looking for the best way to do this in terms of performance. The API recommends split, but doesn't mention that (according to everything else I'm finding through Google) StringTokenizer performs better.
  • @Bill, sorry, my mistake. Then the title of the question might need to change.

Another important thing, undocumented as far as I have noticed, is that asking the StringTokenizer to return the delimiters along with the tokens (by using the constructor StringTokenizer(String str, String delim, boolean returnDelims)) also reduces processing time. So, if you're looking for performance, I would recommend using something like:

    private static final String DELIM = "#";

    public void splitIt(String input) {
        // returnDelims = true: the tokenizer also hands back the delimiters
        StringTokenizer st = new StringTokenizer(input, DELIM, true);
        while (st.hasMoreTokens()) {
            String next = getNext(st);
            System.out.println(next);
        }
    }

    private String getNext(StringTokenizer st) {
        String value = st.nextToken();
        if (DELIM.equals(value))
            value = null;              // the token was itself a delimiter: empty position
        else if (st.hasMoreTokens())
            st.nextToken();            // skip the delimiter that follows the token
        return value;
    }

Despite the overhead introduced by the getNext() method, which discards the delimiters for you, it's still 50% faster according to my benchmarks.
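For instance, if I'm tracing the code above correctly, splitIt("one#two#three") prints one, two and three on separate lines, while splitIt("one##two") prints one, null, two, because an empty position between consecutive delimiters comes back from getNext() as null.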



Use split.

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method instead.

4 Comments

  • Why the -1? This does correctly answer the question of whether to use split or StringTokenizer. The spec does mention that split is recommended over StringTokenizer.
  • See my comments on Damodar's answer. The spec doesn't say anything about performance, which is what this question is asking about.
  • Thanks Bill. - rationalSpring
  • I downvoted because the question does not ask whether to use one or the other, but which is faster.

What do the 19,000,000 documents have to do with it? Do you have to split words in all the documents on a regular basis, or is it a one-shot problem?

If you display or request one document at a time, each with only 30 words, this is such a tiny problem that any method would work.

If you have to process all the documents at once, each with only 30 words, the problem is still so small that you are more likely to be I/O bound anyway.



While running micro (and in this case, even nano) benchmarks, there is a lot that affects your results: JIT optimization and garbage collection, to name just a few.

To get meaningful results out of microbenchmarks, check out the JMH library. It comes bundled with excellent samples showing how to write good benchmarks.
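A minimal sketch of what that could look like here (class name, sample string and method names are made up; assumes the standard JMH annotations and the usual JMH runner/archetype setup):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.StringTokenizer;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Benchmark)
    public class SplitBenchmark {

        // Hypothetical input; replace with something representative of your documents
        String sample = "the quick brown fox jumps over the lazy dog and keeps on running";

        @Benchmark
        public List<String> stringTokenizer() {
            List<String> words = new ArrayList<String>();
            StringTokenizer st = new StringTokenizer(sample, " ");
            while (st.hasMoreTokens())
                words.add(st.nextToken());
            return words; // returning the result keeps the JIT from eliminating the work
        }

        @Benchmark
        public String[] stringSplit() {
            return sample.split(" ");
        }
    }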



Regardless of its legacy status, I would expect StringTokenizer to be significantly quicker than String.split() for this task, because it doesn't use regular expressions: it just scans the input directly, much as you would yourself via indexOf(). In fact String.split() has to compile the regex every time you call it, so it isn't even as efficient as using a regular expression directly yourself.
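If regex recompilation is the concern, one workaround (a small sketch, not taken from any of the answers; names are illustrative) is to compile the pattern once and reuse it across all documents, which removes the per-call Pattern.compile cost while keeping the convenience of split:

    import java.util.regex.Pattern;

    public class Words {
        // Compile the delimiter pattern once and reuse it for every document
        private static final Pattern SPACE = Pattern.compile(" ");

        public static String[] split(String s) {
            return SPACE.split(s);
        }
    }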

1 Comment

  • From reading elsewhere, String.split doesn't recompile the regex every time; it might in some cases, but not in all.

This could be a reasonable benchmark, run on Java 1.6.0:

http://www.javamex.com/tutorials/regular_expressions/splitting_tokenisation_performance.shtml#.V6-CZvnhCM8 

