
I am looking for an elegant way of splitting a string into words and non-words, where a "word" is defined by some regular expression (for instance, [a-zA-Z]+).

Input is a string, output should be a list of word and non-word substrings in order. For instance:

    "A! B C, d." -> Arrays.asList("A", "! ", "B", " ", "C", ", ", "d", ".")

Here's my take:

    public static String WORD_PATTERN = "[a-zA-Z]+";

    public static List<String> splitString(String str) {
        if (str == null) {
            return null;
        }

        Pattern wordPattern = Pattern.compile(WORD_PATTERN);
        Matcher wordMatcher = wordPattern.matcher(str);

        List<String> splitString = new ArrayList<>();

        int endOfLastWord = 0;
        while (wordMatcher.find()) {
            int startOfNextWord = wordMatcher.start();
            int endOfNextWord = wordMatcher.end();

            if (startOfNextWord > endOfLastWord) {
                String nextNonWord = str.substring(endOfLastWord, startOfNextWord);
                splitString.add(nextNonWord);
            }

            String nextWord = str.substring(startOfNextWord, endOfNextWord);
            splitString.add(nextWord);
            endOfLastWord = endOfNextWord;
        }

        if (endOfLastWord < str.length()) {
            String lastNonWord = str.substring(endOfLastWord);
            splitString.add(lastNonWord);
        }

        return splitString;
    }

This does not feel elegant; I think there should be a better way that I'm just not aware of.

I am not looking to improve the code above, so please don't refer to Codereview. I've only posted it to avoid "what have you tried so far" comments.

I am looking for a more concise and elegant way, ideally only using standard Java packages.

  • Does the order of elements in the resulting list have to be consistent with the original input text? Commented May 16, 2018 at 5:59
  • @ErnestKiwele Yes, the order must be consistent. Concatenation in order must produce the original string. Commented May 16, 2018 at 6:07
  • @Downvoter Please provide reason for downvote. Commented May 16, 2018 at 7:05
  • @ErnestKiwele, since you removed your answer, here is the same logic I had provided for your merge method, but without the Stream, on ideone. It does the same thing (and is probably more performant this way). Commented May 16, 2018 at 8:18

1 Answer


You can use a regex with two capture groups, one for a word and one for a non-word, where each group is allowed to be empty:

(\w*)(\W*) 
  • \w : [a-zA-Z0-9_]
  • \W : [^a-zA-Z0-9_]

Example with regex101

For each match, take both capture groups, check if there is a value captured (length > 0) and add the value to the list.

This gives a nice and simple solution like:

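    import java.util.ArrayList;
    import java.util.List;
    import java.util.Optional;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public List<String> splitWord(String s) {
        List<String> result = new ArrayList<>();
        Pattern p = Pattern.compile("(\\w*)(\\W*)");
        Matcher m = p.matcher(s);
        while (m.find()) {
            // Add each group only if it actually captured something.
            Optional.of(m.group(1)).filter(str -> !str.isEmpty()).ifPresent(result::add);
            Optional.of(m.group(2)).filter(str -> !str.isEmpty()).ifPresent(result::add);
        }
        return result;
    }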

Note: the Optional is ... optional, but I am trying to get better at using it. It simply checks whether the group has a non-empty value and, if so, adds it to the list.
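For comparison, here is a minimal sketch of the same loop with plain `if` checks instead of Optional. The class name `PlainSplit` and the constant name are only illustrative; the regex and the logic are the same as above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlainSplit {
    // Same pattern as above: a (possibly empty) word run followed by
    // a (possibly empty) non-word run.
    private static final Pattern WORD_THEN_NON_WORD = Pattern.compile("(\\w*)(\\W*)");

    static List<String> splitWord(String s) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD_THEN_NON_WORD.matcher(s);
        while (m.find()) {
            String word = m.group(1);
            if (!word.isEmpty()) {
                result.add(word);
            }
            String nonWord = m.group(2);
            if (!nonWord.isEmpty()) {
                result.add(nonWord);
            }
        }
        return result;
    }
}
```

Both groups always take part in a match, so `group(1)` and `group(2)` return an empty string rather than null when nothing was captured. The last match, at the end of the input, is empty in both groups and therefore adds nothing.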

And the result, formatted to match your example:

    "abc def" -> Arrays.asList("abc", " ", "def")
    "a.b. c" -> Arrays.asList("a", ".", "b", ". ", "c")
    "a.b." -> Arrays.asList("a", ".", "b", ".")
    ".aa" -> Arrays.asList(".", "aa")
    "." -> Arrays.asList(".")
    "a" -> Arrays.asList("a")
    ".." -> Arrays.asList("..")

Here is the example, including the formatting method, on ideone
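One caveat: \w also matches digits and the underscore, so it is not the same as the [a-zA-Z]+ definition from the question. If a word should contain letters only, the two classes can be swapped for [a-zA-Z] and its negation. The following is only a sketch under that assumption; `WordSplitter` and `split` are made-up names, and it only works when the word definition is a single character class that can be negated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordSplitter {
    // Letters-only word run, then a run of everything that is not a letter.
    private static final Pattern LETTERS_THEN_REST =
            Pattern.compile("([a-zA-Z]*)([^a-zA-Z]*)");

    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = LETTERS_THEN_REST.matcher(input);
        while (m.find()) {
            // Skip empty captures so the list holds only real substrings.
            if (!m.group(1).isEmpty()) {
                parts.add(m.group(1));
            }
            if (!m.group(2).isEmpty()) {
                parts.add(m.group(2));
            }
        }
        return parts;
    }
}
```

Since every character is either a letter or not, consecutive matches cover the whole input, so `String.join("", WordSplitter.split(input))` should give back the original string, which is the requirement stated in the comments.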
