How to count words in a Greek text

Question

I want to perform a simple analysis of some Greek text: collect the different words used and count their frequency. It seems that some of the built-in commands do not work well with Greek letters. For instance, with

words = StringCases["α β1 rpr other", WordCharacter ..]

the output is

{"1", "rpr", "other"}

How do I get Mathematica to recognize other kinds of words?

FWIW, you might consider something like StringCases["α β1 rpr0 other", RegularExpression["(\\w|[\[CapitalAlpha]-ω])+"]]. A bit cumbersome, but there ya go... with the caveat that this can't handle characters with tonos; you'll have to modify the regex as needed.
– J. M.'s missing motivation, Commented Apr 26, 2013 at 11:26
It'd be good to be aware of this: forums.wolfram.com/mathgroup/archive/2009/Jul/msg00398.html
– Szabolcs, Commented Apr 26, 2013 at 14:43

cormullion · Accepted Answer · 2013-04-26 11:24:25Z

I have partial success in splitting Greek text:

greek = ExampleData[{"Text", "HomerOdysseyGreek"}];
Style[StringTake[greek, 100], FontFamily -> "Times"]
Style[StringSplit[StringTake[greek, 100], " "], FontFamily -> "Times"]
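Once the text is split, the counting step the question asks for can be done with Tally. This is only a minimal sketch: the sample string is made up for illustration, it assumes that splitting on whitespace is good enough, and it does not strip punctuation attached to words.

```mathematica
(* Hypothetical sample text; substitute the string you want to analyse,
   e.g. the greek variable defined above *)
text = "ο λόγος και ο κόσμος και ο χρόνος";

(* With no second argument, StringSplit splits on runs of whitespace.
   Punctuation is NOT removed, so "λόγος," and "λόγος" would count separately. *)
words = StringSplit[text];

(* Tally returns {word, count} pairs; sort them by descending count *)
frequencies = SortBy[Tally[words], -Last[#] &]
```

If punctuation matters for your text, the delimiter passed to StringSplit would need to be extended accordingly.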


StringSplit is essentially all I need, and it works great. It correctly identifies words in the Greek.
– GregH, Commented Apr 26, 2013 at 12:18

Sjoerd C. de Vries · Accepted Answer · 2013-04-26 11:40:15Z

I'd say this is either a programming or a documentation bug.

The documentation for WordCharacter says:

WordCharacter matches any character for which either LetterQ or DigitQ yields True.

Well, WordCharacter clearly doesn't consider alpha a letter:

StringMatchQ["α", WordCharacter]

False

but LetterQ does:

LetterQ["α"]

True

A workaround:

theGreeks = Alternatives @@ Select[CharacterRange["\[CapitalAlpha]", "\[Omega]"], LetterQ];
StringCases["α β1 rpr other", (WordCharacter | theGreeks) ..]

{"α", "β1", "rpr", "other"}

You may want to enlarge the set of characters included in "theGreeks". As written, it contains only the basic Greek characters.
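One possible way to enlarge the set is sketched below. It is an assumption on my part, not something tested across versions: it takes every character in the Unicode "Greek and Coptic" (U+0370 to U+03FF) and "Greek Extended" (U+1F00 to U+1FFF) blocks and keeps those for which LetterQ yields True. The name moreGreeks is made up for this example, and whether LetterQ accepts all accented and polytonic letters may depend on your version.

```mathematica
(* Assumed code-point ranges: Greek and Coptic, then Greek Extended *)
greekCodes = Join[Range[16^^0370, 16^^03ff], Range[16^^1f00, 16^^1fff]];

(* Turn each code point into a one-character string and keep only
   those that LetterQ recognises as letters *)
moreGreeks = Alternatives @@ Select[FromCharacterCode /@ greekCodes, LetterQ];

(* Same pattern as the workaround above, with the larger character set *)
StringCases["ά β1 rpr other", (WordCharacter | moreGreeks) ..]
```

Check the result on a sample of your own text before relying on it, since any letter that LetterQ rejects will still split words.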

FWIW, both StringMatchQ["α", RegularExpression["[[:word:]]"]] and StringMatchQ["α", RegularExpression["\\w"]] also return False.
– J. M.'s missing motivation, Commented Apr 26, 2013 at 11:19
