Issue in tokenizing the String

Question

I need to read data from a PSV (pipe-separated values) file containing at least 100K records and map every line to a DTO object.

For example, I have a pipe-separated string SampleData|1111|9130|23||1257|2014-04-01 18:00:00|2014-04-12 09:00:00||Software Developer|20|Vikas||PATIL that must be parsed so that each token is assigned to the corresponding DTO field.

I started with StringTokenizer, and it gave me correct results until I received the string above as input.

What is special about this string is that it has no characters between some of the pipes, e.g. |23||1257| and Vikas||PATIL.

When I split it with the tokenizer, I got fewer tokens than I expected. It simply skipped the empty fields, and as a result I assigned the value 1257 to phoneNumber and the InsertDate value to regionCode.

I should have assigned SampleData to the DTO field dataType, 1111 to recordID, and so on, with '' for phoneNumber because the input has no data for the phone number. Instead, after 23 the tokenizer returned 1257 as the next token, so I assigned the wrong value 1257 to the phoneNumber field.

Thankfully, I caught this mistake in the testing environment.

I tried a few options and finally solved the issue with the String.split() method.

    import java.util.StringTokenizer;

    public class TestSpitingOfString {
        public static void main(String args[]) throws Exception {
            // DTO dataType|recordID|employeeid|deptID|phoneNumber|regionCode|InsertDate|StartDate|hobby|designation|age|firstName|middleName|lastName
            String str = "SampleData|1111|9130|23||1257|2014-04-01 18:00:00|2014-04-12 09:00:00||Software Developer|20|Vikas||PATIL";
            System.out.println("Original String -> " + str);

            StringTokenizer tokenizer = new StringTokenizer(str, "|"); // skips empty values between tokens
            System.out.println("Words With StringTokenizer ");
            while (tokenizer.hasMoreElements()) {
                System.out.print(tokenizer.nextToken() + ",");
            }
            System.out.println();

            String distributedWithPipe[] = str.split("|"); // disaster :( it split on every character
            System.out.println("Words With String.split() distributedWithPipe character ->");
            for (String split : distributedWithPipe) {
                System.out.print(split + ",");
            }
            System.out.println();

            String distributedWithEscapedPipe[] = str.split("\\|"); // This worked for me
            System.out.println("Words With String.split() distributedWithEscapedPipe ->");
            for (String split : distributedWithEscapedPipe) {
                System.out.print(split + ",");
            }
        }
    }

When I run this, I get the following output (I put a , after each token just to make the tokens easier to see):

    Original String -> SampleData|1111|9130|23||1257|2014-04-01 18:00:00|2014-04-12 09:00:00||Software Developer|20|Vikas||PATIL
    Words With StringTokenizer 
    SampleData,1111,9130,23,1257,2014-04-01 18:00:00,2014-04-12 09:00:00,Software Developer,20,Vikas,PATIL,
    Words With String.split() distributedWithPipe character ->
    ,S,a,m,p,l,e,D,a,t,a,|,1,1,1,1,|,9,1,3,0,|,2,3,|,|,1,2,5,7,|,2,0,1,4,-,0,4,-,0,1, ,1,8,:,0,0,:,0,0,|,2,0,1,4,-,0,4,-,1,2, ,0,9,:,0,0,:,0,0,|,|,S,o,f,t,w,a,r,e, ,D,e,v,e,l,o,p,e,r,|,2,0,|,V,i,k,a,s,|,|,P,A,T,I,L,
    Words With String.split() distributedWithEscapedPipe ->
    SampleData,1111,9130,23,,1257,2014-04-01 18:00:00,2014-04-12 09:00:00,,Software Developer,20,Vikas,,PATIL,

Why I asked the question:

1. If someone knows how to solve this using StringTokenizer, I would be happy to learn it. Otherwise, we can say that it is a limitation of StringTokenizer.
2. If someone else has the same issue, an alternative solution is now available, so there is no need to waste time figuring one out.
3. I also want to highlight that, being used to StringTokenizer, we may tend to pass "|" (a pipe without an escape character) as the delimiter, and String.split() will then not produce the expected output.

Perhaps you should take a look at Google Guava's Splitter class. It seems to specifically address some similar issues with the StringTokenizer class: code.google.com/p/guava-libraries/wiki/StringsExplained
– Henrik Aasted Sørensen, Commented Feb 20, 2015 at 14:14
Split expects a regular expression. It's in the documentation of String. The regular expression | splits on 'empty string' OR 'empty string', i.e., on every possible position.
– Jongware, Commented Feb 20, 2015 at 14:15
You can use StringTokenizer if you replace all instances of "||" with "| |" (pipe space pipe)
– headlikearock, Commented Feb 20, 2015 at 14:16
StringTokenizer didn't give me any error. It just missed a few tokens. Say I was expecting 15 tokens; I got only 13, because there were two occurrences of || with nothing between the pipes. I was assigning SampleData to the DTO field dataType, 1111 to recordID, and so on, and '' to phoneNumber because the input data doesn't have it. But after 23 it read 1257 as the next token, so I assigned the wrong value 1257 to the phone number field.
– Shirish Bari, Commented Feb 20, 2015 at 14:23

Tunaki · Accepted Answer · 2015-02-20 14:31:37Z

StringTokenizer documents this behaviour in its javadoc (although I admit it could be clearer, depending on how you interpret "consecutive characters"):

An instance of StringTokenizer behaves in one of two ways, depending on whether it was created with the returnDelims flag having the value true or false:

If the flag is false, delimiter characters serve to separate tokens. A token is a maximal sequence of consecutive characters that are not delimiters.

If the flag is true, delimiter characters are themselves considered to be tokens. A token is thus either one delimiter character, or a maximal sequence of consecutive characters that are not delimiters.

The comments on this bug in the JDK Bug Database (or this one) explain:

StringTokenizer defines a token to be a maximal sequence of consecutive characters that are not delimiters. Thus there are no tokens in substring ",,".

You could then use the constructor StringTokenizer(String str, String delim, boolean returnDelims) with returnDelims set to true, but beware that this returns the delimiters as tokens of their own, so you need to filter them out and detect the empty fields yourself, which is quite a burden.
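To illustrate what that burden looks like, here is a minimal sketch of rebuilding empty fields from the delimiter tokens. The class name ReturnDelimsSketch and the helper splitKeepingEmpty are hypothetical names made up for this example; only the StringTokenizer constructor and its hasMoreTokens/nextToken methods come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class ReturnDelimsSketch {

    // Hypothetical helper (not part of the JDK): uses returnDelims = true and
    // treats two delimiters in a row as an empty field between them.
    static List<String> splitKeepingEmpty(String line, String delim) {
        StringTokenizer tokenizer = new StringTokenizer(line, delim, true);
        List<String> fields = new ArrayList<>();
        // The start of the line counts as a field boundary.
        boolean previousWasDelimiter = true;
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            boolean isDelimiter = token.length() == 1 && delim.contains(token);
            if (isDelimiter) {
                if (previousWasDelimiter) {
                    fields.add(""); // nothing between two boundaries
                }
                previousWasDelimiter = true;
            } else {
                fields.add(token);
                previousWasDelimiter = false;
            }
        }
        if (previousWasDelimiter) {
            fields.add(""); // line ended with a delimiter, or was empty
        }
        return fields;
    }

    public static void main(String[] args) {
        String str = "SampleData|1111|9130|23||1257|2014-04-01 18:00:00|2014-04-12 09:00:00||Software Developer|20|Vikas||PATIL";
        List<String> fields = splitKeepingEmpty(str, "|");
        System.out.println(fields.size() + " fields: " + fields);
    }
}
```

With the string from the question this yields 14 fields, with empty strings in the phoneNumber, hobby and middleName positions, but it is noticeably more code than a single split call.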

For all those reasons, it is better to just use String.split.
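One detail to keep in mind when mapping to a DTO: String.split(regex) drops trailing empty strings, so a line whose last fields are empty would still come back short. A minimal sketch, assuming the 14-column layout from the question (the class name, the constant EXPECTED_FIELDS and the helper splitLine are hypothetical), that passes a negative limit to keep every field and uses Pattern.quote so the pipe is treated literally:

```java
import java.util.regex.Pattern;

public class SplitKeepAllFields {

    // Assumption: the DTO layout in the question has 14 columns.
    static final int EXPECTED_FIELDS = 14;

    // Pattern.quote("|") makes the pipe a literal instead of the regex OR.
    static final String DELIMITER_REGEX = Pattern.quote("|");

    // Hypothetical helper: splits one record and rejects lines with the wrong field count.
    static String[] splitLine(String line) {
        // A negative limit keeps trailing empty fields instead of discarding them.
        String[] fields = line.split(DELIMITER_REGEX, -1);
        if (fields.length != EXPECTED_FIELDS) {
            throw new IllegalArgumentException(
                    "Expected " + EXPECTED_FIELDS + " fields but got " + fields.length);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Same layout as the question's sample, but with an empty lastName at the end.
        String line = "SampleData|1111|9130|23||1257|2014-04-01 18:00:00|2014-04-12 09:00:00||Software Developer|20|Vikas||";
        System.out.println(line.split(DELIMITER_REGEX).length); // trailing empty fields dropped
        System.out.println(splitLine(line).length);             // all fields kept
    }
}
```

Checking the field count before populating the DTO turns a silent column shift, like the one described in the question, into an immediate error.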

I tried the constructor StringTokenizer tokenizer = new StringTokenizer(str, "|", true); but it also returns the | characters.
Yes, read the javadoc: "If the flag is true, delimiter characters are themselves considered to be tokens". I edited my answer to show that.
OK, so zero characters is not considered a sequence. I would consider that highly debatable. But Joshua Bloch himself (!) notes in the second link that "StringTokenizer is a very simple String scanner." (the first link didn't work for me).

Maarten Bodewes · Accepted Answer · 2015-02-20 14:43:51Z

It's probably better to use String.split() and a regular expression for this (you need to indicate that | is a character, not the logical OR!):

    String str = "SampleData|1111|9130|23||1257|2014-04-01 18:00:00|2014-04-12 09:00:00||Software Developer|20|Vikas||PATIL";
    String[] tokens = str.split("[|]");
    for (String token : tokens) {
        // or do something else...
        System.out.println(token);
    }

or, much more complex but more efficient for strings with lots and lots of delimiters:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String str = "SampleData|1111|9130|23||1257|2014-04-01 18:00:00|2014-04-12 09:00:00||Software Developer|20|Vikas||PATIL";
    // start or '|', then anything (reluctant), then '|' or end
    Matcher m = Pattern.compile("(?<=^|[|]).*?(?=[|]|$)").matcher(str);
    while (m.find()) {
        // or do something else...
        String token = m.group();
        System.out.println(token);
    }

As for your questions:

1. StringTokenizer is a relatively simple class that probably should not be used for this.
2. I didn't have this problem, but sometimes it pays off to test my regexp skills, and this solution should work. See the Pattern class documentation about ^ and $, reluctant quantifiers and, of course, positive lookbehind and positive lookahead.
3. Consider it highlighted :)
