Since your requirement doesn't say "max 20" anywhere, change [a-z]{3,20} to [a-z]{3,} for unlimited length.
Regex cannot lowercase the tokens, so you need to call toLowerCase() separately. Your regex only works correctly if you lowercase the input before invoking it. If you intend to call toLowerCase() on each token after matching, you need to change [a-z] to [a-zA-Z]. The easiest approach is to lowercase first.
The above means that your code should be modified as follows:
Pattern pattern = Pattern.compile("[a-z]{3,}");
Matcher matcher = pattern.matcher(EXAMPLE_TEST.toLowerCase());
Output
Start index: 0 End index: 4 this
Start index: 5 End index: 12 mariana
Start index: 13 End index: 17 john
Start index: 18 End index: 21 bar
Start index: 22 End index: 26 barr
Start index: 33 End index: 38 fffff
Start index: 39 End index: 43 aaaa
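The other option mentioned above, matching both cases with [a-zA-Z] and lowercasing each token after the match, could look roughly like the sketch below. The class name TokenDemo, the helper tokens(), and the sample input are made up for illustration; they are not from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenDemo {
    // Matches runs of 3 or more ASCII letters in either case.
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]{3,}");

    // Hypothetical helper: collect every match, lowercasing each token
    // after it has been matched rather than lowercasing the whole input first.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = WORD.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group().toLowerCase());
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample input invented for this sketch, not the EXAMPLE_TEST from the question.
        System.out.println(tokens("This Mariana ab 123 FFFFF")); // [this, mariana, fffff]
    }
}
```

Both orderings should produce the same tokens for ASCII input; lowercasing the input once up front is simply less code.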
To do the same thing using split, you need to split on any sequence of characters that consists of nonalphabetic characters or at most 2 consecutive alphabetic characters.
String[] split = EXAMPLE_TEST.toLowerCase().split("(?:[^a-z]+|(?<![a-z])[a-z]{1,2}(?![a-z]))+");
System.out.println(Arrays.toString(split));
Output
[this, mariana, john, bar, barr, fffff, aaaa]
Explanation:
(?:                Start non-capturing repeating group:
   [^a-z]+           Match one or more nonalphabetic characters
   |                 Or
   (?<![a-z])        Not preceded by an alphabetic character
   [a-z]{1,2}        Match 1-2 alphabetic characters
   (?![a-z])         Not followed by an alphabetic character
)+                 Match one or more of the above
Note: The + after [^a-z] can be removed, since the + at the end will do the repetition anyway, but the regex should perform better with the + there.
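To check the claim that the inner + is optional, a small sketch can split the same input with both variants and compare the results. The class name SplitPlusDemo, the helper splitWith(), and the sample string are assumptions made for this example; it only checks that the results are equal and does not measure performance.

```java
import java.util.Arrays;

public class SplitPlusDemo {
    // The separator regex from the answer, with the inner + kept.
    static final String WITH_PLUS = "(?:[^a-z]+|(?<![a-z])[a-z]{1,2}(?![a-z]))+";
    // The same regex with the inner + removed; the outer + still repeats the group.
    static final String WITHOUT_PLUS = "(?:[^a-z]|(?<![a-z])[a-z]{1,2}(?![a-z]))+";

    // Hypothetical helper: lowercase the input, then split on the given separator regex.
    static String[] splitWith(String regex, String input) {
        return input.toLowerCase().split(regex);
    }

    public static void main(String[] args) {
        // Sample input invented for this sketch.
        String sample = "This is a 12364 test, ok?";
        String[] a = splitWith(WITH_PLUS, sample);
        String[] b = splitWith(WITHOUT_PLUS, sample);
        System.out.println(Arrays.toString(a)); // [this, test]
        System.out.println(Arrays.equals(a, b)); // true
    }
}
```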
The difference between the Matcher code and the split code is that split will return an empty string as the first result if the input starts with a separator, i.e. with a nonalphabetic character or with a word of only one or two letters.
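The leading empty string is easy to reproduce and to filter out. The following is a minimal sketch; the class name LeadingEmptyDemo, the helpers rawSplit() and tokens(), and the sample input are hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LeadingEmptyDemo {
    // The separator regex from the answer.
    static final String SEPARATOR = "(?:[^a-z]+|(?<![a-z])[a-z]{1,2}(?![a-z]))+";

    // Plain split: keeps a leading empty string when the input starts with a separator.
    static String[] rawSplit(String input) {
        return input.toLowerCase().split(SEPARATOR);
    }

    // Same split, with empty strings filtered out afterwards.
    static List<String> tokens(String input) {
        return Arrays.stream(rawSplit(input))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample input invented for this sketch; it starts with nonalphabetic characters.
        String sample = "  42 John bar";
        System.out.println(Arrays.toString(rawSplit(sample))); // [, john, bar]
        System.out.println(tokens(sample));                    // [john, bar]
    }
}
```

Trailing empty strings are not a concern here, because split(String) drops them by default.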
split cannot lowercase the result.
---
2) You can't use split to "remove all nonalphabetic characters" without also splitting on them, but you said to only "split on whitespace", e.g. what should happen with input abc1@3xyz? Should that return abcxyz (nonalphabetics removed), or should it return abc and xyz?
---
Your requirements, as stated, are impossible. split() when your regex already meets your requirements.