I had a requirement to read data from a PSV (pipe-separated values) file containing at least 100K records and map every line to a DTO object.
For example, I have the pipe-separated string SampleData|1111|9130|23||1257|2014-04-01 18:00:00|2014-04-12 09:00:00||Software Developer|20|Vikas||PATIL to parse, extracting each token into the corresponding DTO field.
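For context, this is roughly the shape of the surrounding read loop; the class and names used here (EmployeeRecord, parseLine, employees.psv) are just illustrative placeholders, not my actual code. The interesting part is how parseLine tokenizes each line, which is what the rest of this post is about.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class PsvReaderSketch {

    public static void main(String[] args) throws IOException {
        // Read the pipe-separated file line by line; the file name is a placeholder.
        try (BufferedReader reader = new BufferedReader(new FileReader("employees.psv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                EmployeeRecord dto = parseLine(line);
                // ... persist or process the DTO here
            }
        }
    }

    private static EmployeeRecord parseLine(String line) {
        // Placeholder: the rest of this post is about getting this tokenizing step right.
        return new EmployeeRecord();
    }

    static class EmployeeRecord {
        String dataType;
        String recordID;
        String phoneNumber;
        // ... remaining fields (employeeid, deptID, regionCode, dates, name parts, etc.)
    }
}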
I started with StringTokenizer, and it gave me correct results until I received the above string as input.
What is special about this string is that some of the fields between the pipes are empty, e.g. |23||1257| and Vikas||PATIL.
When I split it with the tokenizer, it returned fewer tokens than I expected: it simply skipped the empty fields, and as a result I assigned the value 1257 to phoneNumber and the InsertDate value to regionCode.
I should have been assigning SampleData to the DTO field dataType, 1111 to recordID, and so on, with '' going to phoneNumber because the input has no phone number data; but after 23 the tokenizer read the next token as 1257, so I assigned the wrong value, 1257, to the phoneNumber field.
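A quick way to see the problem is to count the tokens; this is a minimal check (the sample record has 14 fields):

import java.util.StringTokenizer;

public class TokenCountCheck {
    public static void main(String[] args) {
        // The sample record has 14 fields, but StringTokenizer collapses the empty ones,
        // so every field after the first empty one shifts left by a position.
        String line = "SampleData|1111|9130|23||1257|2014-04-01 18:00:00|2014-04-12 09:00:00||Software Developer|20|Vikas||PATIL";
        StringTokenizer tokenizer = new StringTokenizer(line, "|");
        System.out.println(tokenizer.countTokens()); // prints 11, not 14
    }
}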
Thankfully, I caught this mistake in the test environment.
I tried a few options and finally solved the issue with the String.split() method.
import java.util.StringTokenizer;

public class TestSpitingOfString {

    public static void main(String args[]) throws Exception {
        // DTO fields: dataType|recordID|employeeid|deptID|phoneNumber|regionCode|InsertDate|StartDate|hobby|designation|age|firstName|middleName|lastName
        String str = "SampleData|1111|9130|23||1257|2014-04-01 18:00:00|2014-04-12 09:00:00||Software Developer|20|Vikas||PATIL";
        System.out.println("Original String -> " + str);

        StringTokenizer tokenizer = new StringTokenizer(str, "|"); // skips empty values between tokens
        System.out.println("Words With StringTokenizer ");
        while (tokenizer.hasMoreElements()) {
            System.out.print(tokenizer.nextToken() + ",");
        }
        System.out.println();

        String distributedWithPipe[] = str.split("|"); // disaster :( it split on every character
        System.out.println("Words With String.split() distributedWithPipe character ->");
        for (String split : distributedWithPipe) {
            System.out.print(split + ",");
        }
        System.out.println();

        String distributedWithEscapedPipe[] = str.split("\\|"); // This worked for me
        System.out.println("Words With String.split() distributedWithEscapedPipe ->");
        for (String split : distributedWithEscapedPipe) {
            System.out.print(split + ",");
        }
    }
}

When I run this I get the following output (I put a "," between tokens just for readability):
Original String -> SampleData|1111|9130|23||1257|2014-04-01 18:00:00|2014-04-12 09:00:00||Software Developer|20|Vikas||PATIL
Words With StringTokenizer
SampleData,1111,9130,23,1257,2014-04-01 18:00:00,2014-04-12 09:00:00,Software Developer,20,Vikas,PATIL,
Words With String.split() distributedWithPipe character ->
,S,a,m,p,l,e,D,a,t,a,|,1,1,1,1,|,9,1,3,0,|,2,3,|,|,1,2,5,7,|,2,0,1,4,-,0,4,-,0,1, ,1,8,:,0,0,:,0,0,|,2,0,1,4,-,0,4,-,1,2, ,0,9,:,0,0,:,0,0,|,|,S,o,f,t,w,a,r,e, ,D,e,v,e,l,o,p,e,r,|,2,0,|,V,i,k,a,s,|,|,P,A,T,I,L,
Words With String.split() distributedWithEscapedPipe ->
SampleData,1111,9130,23,,1257,2014-04-01 18:00:00,2014-04-12 09:00:00,,Software Developer,20,Vikas,,PATIL,
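For completeness, here is a rough sketch of how the corrected split maps onto the DTO by position; EmployeeRecord and its fields are illustrative, not my real DTO. Pattern.quote("|") is an alternative to hand-escaping the pipe, and the -1 limit (also mentioned in the comments below) keeps trailing empty fields if a record happens to end with a pipe; it is not needed for this sample line, but it is safer for real data.

import java.util.regex.Pattern;

public class PsvLineMapper {

    public static void main(String[] args) {
        String line = "SampleData|1111|9130|23||1257|2014-04-01 18:00:00|2014-04-12 09:00:00||Software Developer|20|Vikas||PATIL";
        // Pattern.quote("|") is equivalent to "\\|" here; the -1 limit also keeps trailing empty fields.
        String[] fields = line.split(Pattern.quote("|"), -1);

        EmployeeRecord dto = new EmployeeRecord();
        dto.dataType    = fields[0];
        dto.recordID    = fields[1];
        dto.employeeid  = fields[2];
        dto.deptID      = fields[3];
        dto.phoneNumber = fields[4]; // "" because the input has no phone number
        dto.regionCode  = fields[5]; // 1257 ends up where it belongs
        // ... remaining fields by index

        System.out.println("phoneNumber='" + dto.phoneNumber + "', regionCode='" + dto.regionCode + "'");
    }

    static class EmployeeRecord {
        String dataType, recordID, employeeid, deptID, phoneNumber, regionCode;
    }
}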
Why I asked the question:
- If someone knows how to solve this issue using StringTokenizer, I would be happy to learn it (a sketch of one possible workaround appears after this list); otherwise we can say it is a limitation of StringTokenizer.
- If someone else runs into the same issue, an alternative solution is available, and there is no need to waste time figuring it out.
- Also, to highlight that, out of habit from StringTokenizer, we may tend to pass "|" (without escaping it) as the delimiter to String.split(), which will not produce the expected output because split() treats its argument as a regular expression.
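Regarding the first point above: one possible workaround, shown here only as a sketch I have not used in production, is to construct the StringTokenizer with returnDelims set to true, so the pipes themselves come back as tokens and consecutive pipes can be turned into empty fields.

import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerWithEmptyFields {

    public static void main(String[] args) {
        String line = "SampleData|1111|9130|23||1257|2014-04-01 18:00:00|2014-04-12 09:00:00||Software Developer|20|Vikas||PATIL";
        List<String> fields = new ArrayList<>();
        // Third argument true means the "|" delimiters themselves are returned as tokens.
        StringTokenizer tokenizer = new StringTokenizer(line, "|", true);
        boolean lastWasDelimiter = true; // so a leading "|" would yield an empty first field
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if ("|".equals(token)) {
                if (lastWasDelimiter) {
                    fields.add(""); // two pipes in a row -> empty field
                }
                lastWasDelimiter = true;
            } else {
                fields.add(token);
                lastWasDelimiter = false;
            }
        }
        if (lastWasDelimiter) {
            fields.add(""); // a trailing pipe -> trailing empty field
        }
        System.out.println(fields.size() + " fields: " + fields); // 14 fields, empties preserved
    }
}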
Notes from the comments:
- Guava's Splitter class seems to specifically address some of the same issues as the StringTokenizer class: code.google.com/p/guava-libraries/wiki/StringsExplained
- split() expects a regular expression; this is stated in the documentation of String. The regular expression | splits on 'empty string' OR 'empty string', i.e., at every possible position.
- Use split("\\|", -1) if the issue is missing trailing "" fields.
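Based on the Guava pointer in the comments, something along these lines should also keep the empty fields; this is a sketch assuming the Guava library is on the classpath. Splitter splits on a literal character (no regex escaping needed) and preserves empty strings by default, unlike StringTokenizer.

import com.google.common.base.Splitter;

import java.util.List;

public class GuavaSplitterExample {

    public static void main(String[] args) {
        String line = "SampleData|1111|9130|23||1257|2014-04-01 18:00:00|2014-04-12 09:00:00||Software Developer|20|Vikas||PATIL";
        // Splits on the literal '|' character and keeps empty strings between consecutive pipes.
        List<String> fields = Splitter.on('|').splitToList(line);
        System.out.println(fields.size() + " fields: " + fields); // 14 fields, empties preserved
    }
}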