Creating a tokenizer using the method Split

Question

I'm trying to create a simple tokenizer that splits on whitespace, lowercases tokens, removes all nonalphabetic characters, and keeps only terms with 3 or more characters. I wrote the code below, and it already handles lowercase letters, skips nonalphabetic characters, and keeps only terms of 3 or more characters. However, I want to use the split method, and I don't know how. Please suggest something.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class main {

        public static final String EXAMPLE_TEST = "This Mariana John bar Barr "
                + "12364 FFFFF aaaa a s d f g.";

        public static void main(String[] args) {
            Pattern pattern = Pattern.compile("(\\s[a-z]{3,20})");
            Matcher matcher = pattern.matcher(EXAMPLE_TEST);
            while (matcher.find()) {
                System.out.print("Start index: " + matcher.start());
                System.out.print(" End index: " + matcher.end() + " ");
                System.out.println(matcher.group());
            }
        }
    }

1) split cannot lowercase the result.
2) You can't use split to "remove all nonalphabetic characters" without also splitting on them, but you said to only "split on whitespace". For example, what should happen with the input abc1@3xyz? Should that return abcxyz (nonalphabetics removed), or should it return abc and xyz?
Your requirements, as stated, are impossible.
– Andreas, Commented Oct 26, 2018 at 21:15

I'm also curious as to why you want to use split() when your regex already meets your requirements.
– Grant Foster, Commented Oct 26, 2018 at 21:21

pantuptus · Accepted Answer · 2018-10-26 22:50:32Z

If you do not have to track the index:

    List<String> processed = Arrays.stream(EXAMPLE_TEST.split(" "))
            .map(String::toLowerCase)
            .map(s -> s.replaceAll("[^a-z]", ""))
            .filter(s -> s.length() >= 3)
            .collect(Collectors.toList());

    for (String s : processed) {
        System.out.println(s);
    }

But your example output shows the index as well. In that case you have to store it in an additional container (such as a Map):

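    // Key: index of the token's first occurrence in the input; value: the cleaned token.
    // Caveat: indexOf returns the first occurrence, so a repeated token produces a
    // duplicate key and Collectors.toMap throws IllegalStateException.
    Map<Integer, String> processed = Arrays.stream(EXAMPLE_TEST.split(" "))
            .collect(Collectors.toMap(
                    s -> EXAMPLE_TEST.indexOf(s),
                    s -> s.toLowerCase().replaceAll("[^a-z]", "")));

    // Keep only tokens with 3 or more characters.
    Map<Integer, String> filtered = processed.entrySet().stream()
            .filter(entry -> entry.getValue().length() >= 3)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));

    // The returned Map makes no guarantee about iteration order.
    for (Map.Entry<Integer, String> entry : filtered.entrySet()) {
        System.out.println("Start index: " + entry.getKey() + " " + entry.getValue());
    }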
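
If the input may contain the same token more than once, looking each token up with a plain indexOf is fragile, because indexOf always reports the first occurrence. A minimal sketch of one way around that, still based on split, is to continue each search from the end of the previous token. The class name IndexedSplitTokenizer and the helper tokenize are hypothetical names chosen for this illustration, and a LinkedHashMap is assumed so that tokens print in input order:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IndexedSplitTokenizer {

    // Hypothetical helper: maps the start offset of each kept token
    // (the offset of the raw, uncleaned token) to its cleaned form.
    static Map<Integer, String> tokenize(String input) {
        Map<Integer, String> result = new LinkedHashMap<>(); // preserves input order
        int searchFrom = 0;
        for (String raw : input.split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // leading whitespace yields one empty first element
            }
            // Search from the end of the previous token so that a repeated
            // token is found at its own position, not at its first occurrence.
            int start = input.indexOf(raw, searchFrom);
            searchFrom = start + raw.length();

            String cleaned = raw.toLowerCase().replaceAll("[^a-z]", "");
            if (cleaned.length() >= 3) {
                result.put(start, cleaned);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "This Mariana John bar Barr 12364 FFFFF aaaa a s d f g.";
        tokenize(text).forEach((start, token) ->
                System.out.println("Start index: " + start + " " + token));
    }
}
```

Because only whitespace separates consecutive raw tokens, the first match at or after searchFrom is always the token's real position, so keys never collide.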

Andreas · Accepted Answer · 2018-10-26 21:41:21Z

Since your requirement doesn't say "max 20" anywhere, change [a-z]{3,20} to [a-z]{3,} for unlimited length.

Regex cannot lowercase the tokens, so you need to call toLowerCase() separately. Your regex only works correctly if you lowercase the input before matching. If you intend to call toLowerCase() on each token after matching instead, you need to change [a-z] to [a-zA-Z]. Lowercasing first is the easiest option.

The above means that your code should be modified as follows:

    Pattern pattern = Pattern.compile("[a-z]{3,}");
    Matcher matcher = pattern.matcher(EXAMPLE_TEST.toLowerCase());

Output

    Start index: 0 End index: 4 this
    Start index: 5 End index: 12 mariana
    Start index: 13 End index: 17 john
    Start index: 18 End index: 21 bar
    Start index: 22 End index: 26 barr
    Start index: 33 End index: 38 fffff
    Start index: 39 End index: 43 aaaa
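
For comparison, the other option mentioned above (matching first, then lowercasing each token) could look roughly like the following. This is only a sketch: the class name MatchThenLowercase and the helper tokenize are made up for the example, and the pattern is widened to [a-zA-Z] as described.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchThenLowercase {

    // Accept both cases here, because lowercasing happens after matching.
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]{3,}");

    // Hypothetical helper: returns every run of 3+ letters, lowercased.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = WORD.matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group().toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [this, mariana, john, bar, barr, fffff, aaaa]
        System.out.println(tokenize("This Mariana John bar Barr 12364 FFFFF aaaa a s d f g."));
    }
}
```

The start and end offsets are the same either way, since toLowerCase() does not change the length of this input.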

To do the same thing using split, you need to split on any sequence of characters that consists of nonalphabetic characters or at most 2 consecutive alphabetic characters.

    String[] split = EXAMPLE_TEST.toLowerCase().split("(?:[^a-z]+|(?<![a-z])[a-z]{1,2}(?![a-z]))+");
    System.out.println(Arrays.toString(split));

Output

[this, mariana, john, bar, barr, fffff, aaaa]

Explanation:

    (?:               Start non-capturing repeating group:
      [^a-z]+           Match one or more nonalphabetic characters
      |                 Or
      (?<![a-z])        Not preceded by an alphabetic character
      [a-z]{1,2}        Match 1-2 alphabetic characters
      (?![a-z])         Not followed by an alphabetic character
    )+                Match one or more of the above

Note: The + after [^a-z] can be removed, since the + at the end will do the repetition anyway, but the regex should perform better with the + there.

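The difference between the original code and the split code is that split will return an empty string as the first result if the input starts with a nonalphabetic character (or, more generally, with anything the delimiter pattern matches, such as a one- or two-letter word).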
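
If that leading empty string is unwanted, one possible approach is to filter empty results out after splitting. The sketch below reuses the delimiter regex from above; the class name SplitWithoutEmpties and the helper tokenize are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SplitWithoutEmpties {

    // Same delimiter as above: nonalphabetic runs or isolated 1-2 letter words.
    private static final String DELIMITER =
            "(?:[^a-z]+|(?<![a-z])[a-z]{1,2}(?![a-z]))+";

    // Hypothetical helper: split, then drop the empty leading element, if any.
    static List<String> tokenize(String input) {
        return Arrays.stream(input.toLowerCase().split(DELIMITER))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The input starts with digits, so split alone would return ["", "fffff", "aaaa"].
        System.out.println(tokenize("12364 FFFFF aaaa a s d f g.")); // prints [fffff, aaaa]
    }
}
```

Trailing empty strings need no special handling, because split with no limit argument already discards them.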
