
I have this string:

I like grumpy cats. Do you? ффффф ыыыыы ইউটিউব থেকে 

If I use \w in a regexp, I only get the words written in Latin letters:

    final text = "I like grumpy cats. Do you? ффффф ыыыыы ইউটিউব থেকে";
    RegExp re = RegExp(r"\w+");
    List<String> words = [];
    for (Match match in re.allMatches(text)) {
      words.add(match.group(0)!);
    }
    print(words);

Output:

[I, like, grumpy, cats, Do, you] 

But I need this result:

[I, like, grumpy, cats, Do, you, ффффф, ыыыыы, ইউটিউব, থেকে] 

In this answer I found that \p{L} means "any kind of letter from any kind of language", but I could not get it to work in Flutter/Dart.

    check WordBoundary Commented Sep 14, 2024 at 18:44

1 Answer


You can decompose the \w shorthand character class into its constituent Unicode property classes and pass the unicode: true argument to the RegExp constructor:

    String text = "I like grumpy cats. Do you? ффффф ыыыыы ইউটিউব থেকে";
    // unicode: true is required for the \p{...} property escapes to work.
    RegExp re = RegExp(
      r'[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]+',
      unicode: true,
    );
    List<String?> words = re.allMatches(text).map((z) => z.group(0)).toList();
    print(words);

Output:

[I, like, grumpy, cats, Do, you, ффффф, ыыыыы, ইউটিউব, থেকে] 

Details:

  • unicode: true - enables Unicode mode, which is required for \p{...} property escapes in the regex pattern
  • [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]+ - matches one or more Unicode word characters, where
    • \p{Alphabetic} - matches letters (any character with the Unicode Alphabetic property)
    • \p{Mark} - matches combining marks, such as diacritics
    • \p{Decimal_Number} - matches decimal digits
    • \p{Connector_Punctuation} - matches connector punctuation, like underscore (equivalent of [_\u203F\u2040\u2054\uFE33\uFE34\uFE4D-\uFE4F\uFF3F])
    • \p{Join_Control} - basically an equivalent of [\u200C\u200D], the zero-width non-joiner and zero-width joiner.
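
To see what the extra classes contribute beyond letters, here is a minimal sketch that runs the full character class and a letters-only pattern on the same input. The sample string and the tokens helper are made up for illustration; they are not part of the question.

```dart
// Word characters as decomposed above: letters, marks, decimal digits,
// connector punctuation, and join controls.
final wordChars = RegExp(
  r'[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]+',
  unicode: true,
);

// Letters only, for comparison.
final lettersOnly = RegExp(r'\p{Alphabetic}+', unicode: true);

// Hypothetical helper: returns every match of [re] in [text].
List<String> tokens(RegExp re, String text) =>
    re.allMatches(text).map((m) => m.group(0)!).toList();

void main() {
  // Made-up sample. 'cafe\u0301' is "café" written as 'e' followed by
  // U+0301 COMBINING ACUTE ACCENT, which is a Mark but not Alphabetic.
  const sample = 'snake_case x42 cafe\u0301';

  print(tokens(wordChars, sample));   // three tokens, accent kept
  print(tokens(lettersOnly, sample)); // four tokens, no _, digits, or accent
}
```

With the full class, the tokens should be snake_case, x42, and cafe plus its combining accent. The letters-only pattern should instead give snake, case, x, and cafe, because the underscore, the digits, and the combining accent are not alphabetic.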

5 Comments

Wow, nice! (upvoted) I'd normally do an "anything but whitespace" character match for this use-case, mostly because I've never seen this approach. How is the approach you have different/better?
@FrankvanPuffelen This regex does not match commas, dots, and other punctuation like that. It can be further customized.
@WiktorStribiżew Why not just use this: RegExp re = RegExp(r"\p{Alphabetic}+", unicode: true); It seems to return the same result.
@Max If you need to extract sequences of alphabetic characters only, fine, use it. I used mine since there was \w in the question, and I assumed the task is to match any Unicode word chars. As I said, the pattern is customizable.
@WiktorStribiżew thanks, it works. You saved a lot of time.
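
For reference, the difference between the "anything but whitespace" approach and the Unicode word-character class discussed in these comments can be sketched roughly like this. The tokens helper is a hypothetical name used only for this illustration.

```dart
// "Anything but whitespace": keeps punctuation attached to the words.
final nonWhitespace = RegExp(r'\S+');

// Unicode word characters, as in the answer.
final wordChars = RegExp(
  r'[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]+',
  unicode: true,
);

// Hypothetical helper: returns every match of [re] in [text].
List<String> tokens(RegExp re, String text) =>
    re.allMatches(text).map((m) => m.group(0)!).toList();

void main() {
  const text = 'I like grumpy cats. Do you? ффффф ыыыыы ইউটিউব থেকে';

  print(tokens(nonWhitespace, text));
  // [I, like, grumpy, cats., Do, you?, ффффф, ыыыыы, ইউটিউব, থেকে]

  print(tokens(wordChars, text));
  // [I, like, grumpy, cats, Do, you, ффффф, ыыыыы, ইউটিউব, থেকে]
}
```

The only difference on this input is the trailing "." and "?", which \S+ keeps and the word-character class drops.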
