
In my software I need to split a string into words. I currently have more than 19,000,000 documents with more than 30 words each.

Which of the following two ways is the best way to do this (in terms of performance)?

    StringTokenizer sTokenize = new StringTokenizer(s, " ");
    while (sTokenize.hasMoreTokens()) {
        String word = sTokenize.nextToken();
        // ... process word
    }

or

    String[] splitS = s.split(" ");
    for (int i = 0; i < splitS.length; i++) {
        String word = splitS[i];
        // ... process word
    }
7 Comments

  • I'd reckon the first, but why don't you just measure it? Commented May 11, 2011 at 14:21
  • I can, but I am also interested in the explanation... Commented May 11, 2011 at 14:23
  • What if someone says that option X is the fastest? Will you go for that option, or, just to be sure, will you test both? If it's the latter, why not do so straight away? :) Commented May 11, 2011 at 14:23
  • @John: Please be clear with your question: do you want to know which is better between StringTokenizer and split, or the best approach regardless of the two? Commented May 11, 2011 at 14:33
  • @Damodar "best way to do this in terms of performance" Commented May 11, 2011 at 14:39

9 Answers


If your data is already in a database and you need to parse the string of words, I would suggest using indexOf repeatedly. It's many times faster than either solution.

However, getting the data from a database is still likely to be much more expensive.

    // Build a sample string of 60 space-separated "words"
    StringBuilder sb = new StringBuilder();
    for (int i = 100000; i < 100000 + 60; i++)
        sb.append(i).append(' ');
    String sample = sb.toString();

    int runs = 100000;
    for (int i = 0; i < 5; i++) {
        {
            // StringTokenizer
            long start = System.nanoTime();
            for (int r = 0; r < runs; r++) {
                StringTokenizer st = new StringTokenizer(sample);
                List<String> list = new ArrayList<String>();
                while (st.hasMoreTokens())
                    list.add(st.nextToken());
            }
            long time = System.nanoTime() - start;
            System.out.printf("StringTokenizer took an average of %.1f us%n", time / runs / 1000.0);
        }
        {
            // Pattern.split (what String.split uses in the general case)
            long start = System.nanoTime();
            Pattern spacePattern = Pattern.compile(" ");
            for (int r = 0; r < runs; r++) {
                List<String> list = Arrays.asList(spacePattern.split(sample, 0));
            }
            long time = System.nanoTime() - start;
            System.out.printf("Pattern.split took an average of %.1f us%n", time / runs / 1000.0);
        }
        {
            // Hand-rolled indexOf loop
            long start = System.nanoTime();
            for (int r = 0; r < runs; r++) {
                List<String> list = new ArrayList<String>();
                int pos = 0, end;
                while ((end = sample.indexOf(' ', pos)) >= 0) {
                    list.add(sample.substring(pos, end));
                    pos = end + 1;
                }
            }
            long time = System.nanoTime() - start;
            System.out.printf("indexOf loop took an average of %.1f us%n", time / runs / 1000.0);
        }
    }

prints

    StringTokenizer took an average of 5.8 us
    Pattern.split took an average of 4.8 us
    indexOf loop took an average of 1.8 us
    StringTokenizer took an average of 4.9 us
    Pattern.split took an average of 3.7 us
    indexOf loop took an average of 1.7 us
    StringTokenizer took an average of 5.2 us
    Pattern.split took an average of 3.9 us
    indexOf loop took an average of 1.8 us
    StringTokenizer took an average of 5.1 us
    Pattern.split took an average of 4.1 us
    indexOf loop took an average of 1.6 us
    StringTokenizer took an average of 5.0 us
    Pattern.split took an average of 3.8 us
    indexOf loop took an average of 1.6 us

The cost of opening a file will be about 8 ms. As the files are so small, your cache may improve performance by a factor of 2-5x. Even so, it's going to spend ~10 hours opening files. The cost of using split vs StringTokenizer is far less than 0.01 ms each. Parsing 19 million documents × 30 words × 8 letters per word should take about 10 seconds (at about 1 GB per 2 seconds).
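As a rough sanity check on that estimate: 19,000,000 documents × 30 words × 8 letters is about 4.6 GB of text (closer to 5 GB once separators are counted), and at 1 GB per 2 seconds that works out to roughly 10 seconds of pure parsing time.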

If you want to improve performance, I suggest you have far fewer files, e.g. use a database. If you don't want to use an SQL database, I suggest using one of these: http://nosql-database.org/


19 Comments

  • Interesting, I ran your code and split consistently takes about twice as long on my machine as StringTokenizer; indexOf takes half as long.
  • Scanner and StringTokenizer use Pattern/regex, which is more flexible but not as efficient as just looking for a specific character.
  • @Peter Lawrey: StringTokenizer does not use regex.
  • @tjjjohnson Java 7's split does an operation similar to a series of indexOf calls, but only for a limited (though very common) set of cases.
  • Just for the record, your indexOf loop is incorrect: it misses the part after the last separator. Not sure this impacts performance very much, though.

Split in Java 7 just calls indexOf for this input, see the source. Split should be very fast, close to repeated calls of indexOf.
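As a rough illustration of what that fastpath amounts to, here is a simplified sketch (not the actual OpenJDK source; it ignores split's limit argument and the trailing-empty-string handling, and the class name is made up):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    public class SplitSketch {
        // Simplified idea behind the JDK fastpath: a single-character delimiter
        // that is not a regex metacharacter is split with indexOf; anything else
        // falls back to the regex engine.
        public static String[] fastSplit(String s, String regex) {
            if (regex.length() == 1 && ".$|()[{^?*+\\".indexOf(regex.charAt(0)) == -1) {
                char ch = regex.charAt(0);
                List<String> parts = new ArrayList<String>();
                int pos = 0, end;
                while ((end = s.indexOf(ch, pos)) >= 0) {
                    parts.add(s.substring(pos, end));
                    pos = end + 1;
                }
                parts.add(s.substring(pos)); // keep the piece after the last delimiter
                return parts.toArray(new String[0]);
            }
            return Pattern.compile(regex).split(s); // general case: regex engine
        }
    }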

3 Comments

  • Are you sure? I can see on line 2361 under the link you provided: return Pattern.compile(regex).split(this, limit);
  • The implementation is at line 1770.
  • The implementation will use the fastpath (i.e. indexOf) if the regex satisfies certain criteria, and will use Pattern.compile(regex).split(this, limit); otherwise. From the source: "fastpath if the regex is a (1) one-char String and this character is not one of the RegEx's meta characters ".$|()[{^?*+\\", or (2) two-char String and the first char is the backslash and the second is not the ascii digit or ascii letter". But as pointed out elsewhere, this is an implementation detail and as such should not be relied upon.

The Java API specification recommends using split. See the documentation of StringTokenizer.

3 Comments

  • @downvoters: Please be clear about the question above: does it ask which is better between StringTokenizer and split, or the best approach regardless of the two?
  • The question is pretty clear that he's looking for the best way to do this in terms of performance. The API recommends split, but doesn't mention that (according to everything else I'm finding through Google) StringTokenizer performs better.
  • @Bill, sorry, my mistake. Then the title of the question might need to change.

Another important thing, undocumented as far as I have noticed, is that asking the StringTokenizer to return the delimiters along with the tokens (by using the constructor StringTokenizer(String str, String delim, boolean returnDelims)) also reduces processing time. So, if you're looking for performance, I would recommend using something like:

    private static final String DELIM = "#";

    public void splitIt(String input) {
        // returnDelims = true: the tokenizer also hands back the delimiters
        StringTokenizer st = new StringTokenizer(input, DELIM, true);
        while (st.hasMoreTokens()) {
            String next = getNext(st);
            System.out.println(next);
        }
    }

    private String getNext(StringTokenizer st) {
        String value = st.nextToken();
        if (DELIM.equals(value))
            value = null;              // the token was itself a delimiter: empty position
        else if (st.hasMoreTokens())
            st.nextToken();            // skip the delimiter that follows the token
        return value;
    }

Despite the overhead introduced by the getNext() method, which discards the delimiters for you, it's still 50% faster according to my benchmarks.
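For instance, if I'm tracing the code above correctly, splitIt("one#two#three") prints one, two and three on separate lines, while splitIt("one##two") prints one, null, two, because an empty position between consecutive delimiters comes back from getNext() as null.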



Use split.

StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method instead.

4 Comments

  • Why the -1? This does correctly answer the question of whether to use split or StringTokenizer. The spec does mention that split is recommended over StringTokenizer.
  • See my comments on Damodar's answer. The spec doesn't say anything about performance, which is what this question is asking about.
  • Thanks Bill. - rationalSpring
  • I downvoted because the question does not ask whether to use one or the other, but which is faster.

What do the 19,000,000 documents have to do with it? Do you have to split words in all the documents on a regular basis, or is it a one-shot problem?

If you display or request one document at a time, each with only 30 words, this is such a tiny problem that any method would work.

If you have to process all the documents at once, each with only 30 words, the problem is still so small that you are more likely to be I/O bound anyway.



While running micro (and in this case, even nano) benchmarks, there is a lot that affects your results: JIT optimization and garbage collection, to name just a few.

To get meaningful results out of microbenchmarks, check out the JMH library. It comes bundled with excellent samples showing how to write good benchmarks.
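A minimal sketch of what that could look like here (class name, sample string and method names are made up; assumes the standard JMH annotations and the usual JMH runner/archetype setup):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.StringTokenizer;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Benchmark)
    public class SplitBenchmark {

        // Hypothetical input; replace with something representative of your documents
        String sample = "the quick brown fox jumps over the lazy dog and keeps on running";

        @Benchmark
        public List<String> stringTokenizer() {
            List<String> words = new ArrayList<String>();
            StringTokenizer st = new StringTokenizer(sample, " ");
            while (st.hasMoreTokens())
                words.add(st.nextToken());
            return words; // returning the result keeps the JIT from eliminating the work
        }

        @Benchmark
        public String[] stringSplit() {
            return sample.split(" ");
        }
    }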



Regardless of its legacy status, I would expect StringTokenizer to be significantly quicker than String.split() for this task, because it doesn't use regular expressions: it just scans the input directly, much as you would yourself via indexOf(). In fact String.split() has to compile the regex every time you call it, so it isn't even as efficient as using a regular expression directly yourself.
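If regex recompilation is the concern, one workaround (a small sketch, not taken from any of the answers; names are illustrative) is to compile the pattern once and reuse it across all documents, which removes the per-call Pattern.compile cost while keeping the convenience of split:

    import java.util.regex.Pattern;

    public class Words {
        // Compile the delimiter pattern once and reuse it for every document
        private static final Pattern SPACE = Pattern.compile(" ");

        public static String[] split(String s) {
            return SPACE.split(s);
        }
    }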

1 Comment

  • From reading elsewhere, String.split doesn't recompile the regex every time; it might in some cases, but not in all.

This could be a reasonable benchmark, run on Java 1.6.0:

http://www.javamex.com/tutorials/regular_expressions/splitting_tokenisation_performance.shtml#.V6-CZvnhCM8 

