I am looking for an elegant way of splitting a string into words and non-words, where a "word" is defined by some regular expression (for instance, [a-zA-Z]+).
Input is a string, output should be a list of word and non-word substrings in order. For instance:
"A! B C, d." -> Arrays.asList("A", "! ", "B", " ", "C", ", ", "d", ".")

Here's my take:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public static final String WORD_PATTERN = "[a-zA-Z]+";

    public static List<String> splitString(String str) {
        if (str == null) {
            return null;
        }

        Pattern wordPattern = Pattern.compile(WORD_PATTERN);
        Matcher wordMatcher = wordPattern.matcher(str);

        List<String> splitString = new ArrayList<>();
        int endOfLastWord = 0;

        while (wordMatcher.find()) {
            int startOfNextWord = wordMatcher.start();
            int endOfNextWord = wordMatcher.end();

            // Anything between the previous word and this one is a non-word.
            if (startOfNextWord > endOfLastWord) {
                String nextNonWord = str.substring(endOfLastWord, startOfNextWord);
                splitString.add(nextNonWord);
            }

            String nextWord = str.substring(startOfNextWord, endOfNextWord);
            splitString.add(nextWord);
            endOfLastWord = endOfNextWord;
        }

        // Trailing non-word after the last match, if any.
        if (endOfLastWord < str.length()) {
            String lastNonWord = str.substring(endOfLastWord);
            splitString.add(lastNonWord);
        }

        return splitString;
    }

This does not feel elegant; I think there should be a better way that I'm just not aware of.
I am not looking to improve the code above, so please don't refer me to Code Review. I've only posted it to avoid "what have you tried so far" comments.
I am looking for a more concise and elegant way, ideally only using standard Java packages.
Stream version on ideone: it does the same (but is probably more performant this way).
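The ideone snippet itself is not reproduced here, so the following is only a rough sketch of what a stream-based variant could look like, not the linked code. It assumes Java 9 or later (for `Matcher.results()`) and assumes the word definition is a character class whose complement can be written by hand, as with `[a-zA-Z]+` and `[^a-zA-Z]+`. The class name `WordSplitter` and the constant `TOKEN` are made up for illustration.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class WordSplitter {

    // Assumption: a token is either a run of letters (a word)
    // or a run of anything else (a non-word).
    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z]+|[^a-zA-Z]+");

    static List<String> splitString(String str) {
        if (str == null) {
            return null;
        }
        // Matcher.results() (Java 9+) streams the successive matches in order.
        return TOKEN.matcher(str)
                .results()
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Prints each token in brackets so that whitespace stays visible.
        splitString("A! B C, d.").forEach(t -> System.out.println("[" + t + "]"));
    }
}
```

Because the two alternatives together cover every character, each part of the input ends up in exactly one token and an empty input yields an empty list. For a word pattern that is not a simple character class, the complement cannot be written this way, and the explicit `find()` loop is the more general tool.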