6
$\begingroup$

I want to perform a simple analysis of some Greek text: collect the different words used and count their frequency. It seems that some of the built-in commands do not work well with Greek letters. For instance, with

words = StringCases["α β1 rpr other", WordCharacter ..] 

the output is

{"1", "rpr", "other"} 

How do I get Mathematica to recognize other kinds of words?

$\endgroup$
2
  • 1
    $\begingroup$ FWIW, you might consider something like StringCases["α β1 rpr0 other", RegularExpression["(\\w|[\[CapitalAlpha]-ω])+"]]. A bit cumbersome, but there ya go... with the caveat that this can't handle characters with tonos; you'll have to modify the regex as needed. $\endgroup$ Commented Apr 26, 2013 at 11:26
  • $\begingroup$ It'd be good to be aware of this: forums.wolfram.com/mathgroup/archive/2009/Jul/msg00398.html $\endgroup$ Commented Apr 26, 2013 at 14:43

2 Answers

4
$\begingroup$

I have partial success in splitting Greek text:

greek = ExampleData[{"Text", "HomerOdysseyGreek"}]
Style[StringTake[greek, 100], FontFamily -> "Times"]
Style[StringSplit[StringTake[greek, 100], " "], FontFamily -> "Times"]

greek
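Since the goal in the question is a word-frequency count, here is a minimal sketch of how the split could feed into one. It is not part of the original answer. The names words and counts are made up for illustration. Splitting on PunctuationCharacter as well as whitespace is an assumption on my part: I have not checked that it recognizes every Greek punctuation mark, and it will also strip elision apostrophes.

```mathematica
greek = ExampleData[{"Text", "HomerOdysseyGreek"}];

(* Split on runs of whitespace or punctuation, so that "word," and "word" are counted together. *)
(* Assumption: PunctuationCharacter covers the punctuation that occurs in this text. *)
words = StringSplit[greek, (WhitespaceCharacter | PunctuationCharacter) ..];

(* Tally returns {word, count} pairs; sort them by descending count. *)
counts = SortBy[Tally[words], -Last[#] &];

(* Show the ten most frequent words, or fewer if the text has fewer distinct words. *)
Take[counts, Min[10, Length[counts]]]
```

Case is left untouched here, so capitalized and lowercase forms of the same word are counted separately.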

$\endgroup$
1
  • $\begingroup$ StringSplit is essentially all I need, and it works great. It correctly identifies words in the Greek. $\endgroup$ Commented Apr 26, 2013 at 12:18
7
$\begingroup$

I'd say this is either a programming or a documentation bug.

The documentation for WordCharacter says:

WordCharacter matches any character for which either LetterQ or DigitQ yields True. »

Well, WordCharacter clearly doesn't consider alpha a letter:

StringMatchQ["α", WordCharacter] 

False

but LetterQ does:

LetterQ["α"] 

True
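To see how far the mismatch goes, one could list every character in the basic Greek range that LetterQ accepts but WordCharacter does not match. This is only a diagnostic sketch; the names chars and mismatches are hypothetical, and the result may differ between versions.

```mathematica
(* Basic Greek range, the same one used in the workaround. *)
chars = CharacterRange["\[CapitalAlpha]", "\[Omega]"];

(* Keep the characters that LetterQ calls letters but WordCharacter fails to match. *)
mismatches = Select[chars, LetterQ[#] && ! StringMatchQ[#, WordCharacter] &];

(* How many characters are affected. *)
Length[mismatches]
```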

A workaround:

theGreeks = Alternatives @@ Select[CharacterRange["\[CapitalAlpha]", "\[Omega]"], LetterQ];
StringCases["α β1 rpr other", (WordCharacter | theGreeks) ..]

{"α", "β1", "rpr", "other"}

You may want to enlarge the set of characters included in "theGreeks". As written, it contains only the basic Greek letters.
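One possible way to enlarge the set, as a sketch only: add the accented letters (U+0386 to U+03CE) and the Greek Extended block (U+1F00 to U+1FFF), again filtered through LetterQ. The names basic, accented, polytonic and moreGreeks are made up for illustration. The choice of ranges is my assumption about what a Greek text needs, and I have not verified that LetterQ accepts every letter in them in every version.

```mathematica
(* Basic Greek letters, as in the workaround. *)
basic = CharacterRange["\[CapitalAlpha]", "\[Omega]"];

(* Assumption: U+0386 to U+03CE covers the letters with tonos and dialytika. *)
(* This range overlaps the basic one, hence the Union below. *)
accented = Characters[FromCharacterCode[Range[16^^386, 16^^3CE]]];

(* Assumption: the Greek Extended block, U+1F00 to U+1FFF, covers polytonic letters. *)
polytonic = Characters[FromCharacterCode[Range[16^^1F00, 16^^1FFF]]];

(* Keep only what LetterQ regards as a letter; this drops unassigned code points and punctuation. *)
moreGreeks = Alternatives @@ Select[Union[basic, accented, polytonic], LetterQ];

StringCases["άλφα β1 rpr other", (WordCharacter | moreGreeks) ..]
```

The LetterQ filter matters here because both added ranges contain code points that are not letters.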

$\endgroup$
1
  • 1
    $\begingroup$ FWIW, both StringMatchQ["α", RegularExpression["[[:word:]]"]] and StringMatchQ["α", RegularExpression["\\w"]] also return False. $\endgroup$ Commented Apr 26, 2013 at 11:19
