How to split string written in any language to words (with Flutter/Dart)?

Question

I have this string:

I like grumpy cats. Do you? ффффф ыыыыы ইউটিউব থেকে

If I use \w in a regexp, I get only the words written in Latin letters:

    final text = "I like grumpy cats. Do you? ффффф ыыыыы ইউটিউব থেকে";
    RegExp re = RegExp(r"\w+");
    List<String> words = [];
    for (Match match in re.allMatches(text)) {
      words.add(match.group(0)!);
    }
    print(words);

Output:

[I, like, grumpy, cats, Do, you]

But I need this result:

[I, like, grumpy, cats, Do, you, ффффф, ыыыыы, ইউটিউব, থেকে]

In this answer I found that \p{L} means "any kind of letter from any kind of language", but I could not make it work in Flutter/Dart.

check WordBoundary

– pskink, Commented Sep 14, 2024 at 18:44

Frank van Puffelen · Accepted Answer · 2024-09-14 19:11:14Z

You can decompose the \w shorthand character class into its constituent Unicode property classes and pass the unicode: true argument to the RegExp constructor:

    String text = "I like grumpy cats. Do you? ффффф ыыыыы ইউটিউব থেকে";
    // unicode: true is required for the \p{...} classes to be recognized.
    RegExp re = RegExp(
      r'[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]+',
      unicode: true,
    );
    // group(0) is the whole match; its static type is String?, hence List<String?>.
    List<String?> words = re.allMatches(text).map((z) => z.group(0)).toList();
    print(words);

Output:

[I, like, grumpy, cats, Do, you, ффффф, ыыыыы, ইউটিউব, থেকে]

Details:

unicode: true enables Unicode property classes (\p{...}) in the regex pattern
[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]+ - matches one or more Unicode word characters, where
- \p{Alphabetic} - matches letters
- \p{Mark} - matches combining marks, such as diacritics
- \p{Decimal_Number} - matches decimal digits
- \p{Connector_Punctuation} - matches connector punctuation, like underscore (equivalent of [_\uFF3F\uFE4D-\uFE4F\uFE33\uFE34\u203F\u2054\u2040])
- \p{Join_Control} - basically an equivalent of [\u200C\u200D], the zero-width non-joiner and zero-width joiner.
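
The pattern described above could be wrapped in a small reusable function. This is only a sketch: the names `splitWords` and `_wordPattern` are hypothetical and do not come from the answer, and the second input string is an extra illustration of the digit and underscore classes.

```dart
// Hypothetical helper (not part of the original answer): compiles the
// answer's character class once and reuses it for every call.
final RegExp _wordPattern = RegExp(
  r'[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]+',
  unicode: true, // needed so that \p{...} is treated as a property class
);

/// Returns every run of Unicode word characters found in [text].
List<String> splitWords(String text) =>
    _wordPattern.allMatches(text).map((m) => m.group(0)!).toList();

void main() {
  // The string from the question.
  print(splitWords('I like grumpy cats. Do you? ффффф ыыыыы ইউটিউব থেকে'));

  // Digits and underscores are kept inside a word, spaces split words.
  print(splitWords('file_name v2')); // [file_name, v2]
}
```

Using `group(0)!` gives a `List<String>` rather than `List<String?>`; the whole match is always present for a successful match, so the non-null assertion should be safe here.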

Wow, nice! (upvoted) I'd normally do an "anything but whitespace" character match for this use-case, mostly because I've never seen this approach. How is the approach you have different/better?
@FrankvanPuffelen This regex does not match commas, dots, and other punctuation like this. It can be further customized.
@WiktorStribiżew Why not just use this: RegExp re = RegExp(r"\p{Alphabetic}+", unicode: true); It seems to return the same result.
@Max If you need to extract sequences of alphabetic characters only, fine, use it. I used mine since there was \w in the question, and I assumed the task is to match any Unicode word chars. As I said, the pattern is customizable.
@WiktorStribiżew thanks, it works. You saved a lot of time.
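
To make the trade-offs discussed in these comments concrete, here is a rough comparison of the three approaches mentioned: "anything but whitespace", \p{Alphabetic}+ alone, and the full word-character class. The helper name `matchAll` and the sample string are assumptions chosen for illustration; they are not from the thread.

```dart
// Hypothetical helper: collects all matches of [re] in [text].
List<String> matchAll(RegExp re, String text) =>
    re.allMatches(text).map((m) => m.group(0)!).toList();

void main() {
  // Sample input (an assumption) containing an underscore, a digit,
  // trailing punctuation and non-Latin letters.
  const text = 'snake_case x2 cats. ффффф';

  // "Anything but whitespace": punctuation stays attached to the word.
  final nonWhitespace = RegExp(r'\S+');

  // Alphabetic characters only: underscores and digits split or drop out.
  final alphabeticOnly = RegExp(r'\p{Alphabetic}+', unicode: true);

  // The class from the answer: a Unicode-aware counterpart of \w+.
  final wordChars = RegExp(
    r'[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]+',
    unicode: true,
  );

  print(matchAll(nonWhitespace, text)); // [snake_case, x2, cats., ффффф]
  print(matchAll(alphabeticOnly, text)); // [snake, case, x, cats, ффффф]
  print(matchAll(wordChars, text)); // [snake_case, x2, cats, ффффф]
}
```

For the string in the question the last two patterns give the same list, because it contains no digits or underscores; they only diverge on input like the sample above.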
