2

Is it possible to build a regexp for use with Java's Pattern.split(..) method that reproduces the StringTokenizer("...", "...", true) behaviour?

That is, the input should be split into an alternating sequence of the predefined token characters and the arbitrary strings running between them.

The JRE reference says that StringTokenizer should be considered deprecated and that String.split(..) can be used instead, so it seems to be considered possible there.

The reason I want to use split is that regular expressions are often highly optimized. StringTokenizer, for example, is quite slow on the Android platform's VM, while regex patterns seem to be executed by optimized native code there.
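To make the target behaviour concrete, here is a minimal sketch of what StringTokenizer returns when its third argument (returnDelims) is true. The class name TokenizerDemo and the helper tokens are made up for illustration; the input and delimiter set are just an example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerDemo {
    // Hypothetical helper: collects every token, delimiters included,
    // in the order StringTokenizer hands them out.
    static List<String> tokens(String input, String delimiterChars) {
        StringTokenizer st = new StringTokenizer(input, delimiterChars, true);
        List<String> out = new ArrayList<>();
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Each delimiter character comes back as its own one-character token.
        for (String t : tokens("a b\tc", " \t")) {
            System.out.println("[" + t + "]");
        }
    }
}
```

For the input "a b\tc" with space and tab as delimiters, the tokens are "a", " ", "b", "\t" and "c", in that order.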

3
  • possible duplicate of Is there a way to split strings with String.split() and include the delimiters? Commented May 8, 2011 at 18:56
  • There is an uncommented "Code Challenge" with the same idea, but it seems to have no answer. I do not want to include the delimiters, but fetch them as distinct tokens. Commented May 8, 2011 at 19:04
  • Maybe there should be an "I am pedantic, answer question exactly as asked" flag :-) Commented May 8, 2011 at 20:15

3 Answers

1

Considering that the documentation for split doesn't specify this behavior and that split has only one optional parameter, which limits how large the resulting array can be: no, you can't.

Also, the only other class I can think of that could have this feature, Scanner, doesn't have it either. So I think the easiest option is to continue using the tokenizer, even if it's deprecated. That is better than writing your own class: while that shouldn't be too hard (quite trivial, really), I can think of better ways to spend one's time.


3 Comments

But String.split() takes an arbitrary regular expression, and it is not obvious to me why it should not be possible with a smart expression.
+1 for recommending to use the proper tool for the job. StringTokenizer is not deprecated and does exactly what you want. Don't force String.split(...) to attempt to do something it wasn't designed for. Even if you can get it to work, nobody will actually understand the regex used. Keep it simple. Did you look at the link provided by CoolBeans above? The code is horrendous, just to do something that is easily done by the StringTokenizer.
Currently I would like to use Pattern.split(..) on the Android platform, as the VM is rather slow there and the implementation of StringTokenizer is not very efficient. Regexes, on the other hand, are implemented natively on the platform and are quite fast, and so is Pattern.split(..).
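On the point raised in these comments, whether a suitably chosen expression can make split return the delimiters as separate tokens: one candidate is a zero-width pattern built from lookbehind and lookahead, which splits at the position right after or right before each delimiter character. The sketch below assumes space and tab as delimiters; the class name LookaroundSplit is made up. Behaviour for input that starts with a delimiter depends on the runtime (Java 8 and later drop the empty leading string for a zero-width match at the start; older implementations may not), and nothing here says whether this is actually faster than StringTokenizer on Android.

```java
import java.util.regex.Pattern;

class LookaroundSplit {
    // Zero-width pattern: matches the empty position just after a delimiter
    // (lookbehind) or just before one (lookahead). Delimiters: space and tab.
    static final Pattern BOUNDARY = Pattern.compile("(?<=[ \t])|(?=[ \t])");

    static String[] split(String input) {
        // Nothing is consumed by the match, so delimiters stay in the result
        // as their own array elements.
        return BOUNDARY.split(input);
    }

    public static void main(String[] args) {
        for (String token : split("a b\tc")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

For "a b\tc" this yields the five elements "a", " ", "b", "\t" and "c". Because the pattern is compiled once and reused, the per-call work is a single split.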
1

A regex Pattern can help you:

    // Put the boundary regex in the second pair of brackets (where \\s+ is now).
    // It must not be able to match the empty string, or the loop never advances.
    Pattern p = Pattern.compile("(.*?)(\\s+)");
    Matcher m = p.matcher(string);
    int endIndex = 0;
    while (m.find()) {
        // m.group(1) is the part before the boundary (may be empty)
        // m.group(2) is the boundary that was matched
        endIndex = m.end();
    }
    // The remainder of the string is string.substring(endIndex).
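A self-contained variant of the same idea, as a sketch only: it searches for the delimiter pattern alone and takes the text between matches as the tokens, collecting everything into one list. The class name MatcherTokens and the helper tokenize are hypothetical, and the delimiter pattern is assumed to match one delimiter character at a time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherTokens {
    // Hypothetical helper: returns tokens and delimiter matches in input order.
    static List<String> tokenize(String input, Pattern delimiter) {
        List<String> out = new ArrayList<>();
        Matcher m = delimiter.matcher(input);
        int end = 0; // index just past the previous delimiter
        while (m.find()) {
            if (m.start() > end) {
                out.add(input.substring(end, m.start())); // text before the delimiter
            }
            if (m.end() > m.start()) {
                out.add(m.group()); // the delimiter itself
            }
            end = m.end();
        }
        if (end < input.length()) {
            out.add(input.substring(end)); // remainder after the last delimiter
        }
        return out;
    }

    public static void main(String[] args) {
        for (String t : tokenize("a b\tc", Pattern.compile("[ \t]"))) {
            System.out.println("[" + t + "]");
        }
    }
}
```

Empty strings between adjacent delimiters are skipped, so the output alternates the way StringTokenizer's does when delimiters are returned.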


1
    import java.util.List;
    import java.util.LinkedList;
    import java.util.regex.Pattern;
    import java.util.regex.Matcher;

    public class Splitter {

        public Splitter(String s, String delimiters) {
            this.string = s;
            this.delimiters = delimiters;
            Pattern pattern = Pattern.compile(delimiters);
            this.matcher = pattern.matcher(string);
        }

        // Returns the pieces and the delimiters between them, interleaved.
        public String[] split() {
            // Limit -1 keeps trailing empty strings, so there is always
            // exactly one more piece than there are delimiters.
            String[] strs = string.split(delimiters, -1);
            String[] delims = delimiters();
            if (strs.length == 0) {
                return new String[0];
            }
            assert (strs.length == delims.length + 1);
            List<String> output = new LinkedList<String>();
            int i;
            for (i = 0; i < delims.length; i++) {
                output.add(strs[i]);
                output.add(delims[i]);
            }
            output.add(strs[i]);
            return output.toArray(new String[0]);
        }

        // Collects every delimiter match in the order it occurs.
        private String[] delimiters() {
            matcher.reset(); // allow split() to be called more than once
            List<String> delims = new LinkedList<String>();
            while (matcher.find()) {
                delims.add(string.subSequence(matcher.start(), matcher.end()).toString());
            }
            return delims.toArray(new String[0]);
        }

        public static void main(String[] args) {
            Splitter s = new Splitter("a b\tc", "[ \t]");
            String[] tokensanddelims = s.split();
            assert (tokensanddelims.length == 5);
            System.out.print(tokensanddelims[0].equals("a"));
            System.out.print(tokensanddelims[1].equals(" "));
            System.out.print(tokensanddelims[2].equals("b"));
            System.out.print(tokensanddelims[3].equals("\t"));
            System.out.print(tokensanddelims[4].equals("c"));
        }

        private Matcher matcher;
        private String string;
        private String delimiters;
    }

2 Comments

Well, seems cool. However, it separates tokens from delimiters, which is not needed in my case. I would like to replace StringTokenizer's behaviour with its alternating delimiter/token sequence as output.
I added the missing import statement. Works fine. It doesn't replace StringTokenizer with something more performant, however. I was hoping that a single regexp for use with split could do the job, as a single regexp is handled natively and fast on the Android platform.
