Word splitting with regular expressions in Haskell

Question

There are several packages for using regular expressions in Haskell (e.g. Text.Regex.Base, Text.Regex.Posix etc.). Most packages I've seen so far support only a subset of the regex syntax I know. By that I mean: I am used to splitting a sentence into words with the following regex:

\\w+

Nearly all the Haskell packages I have tried so far don't support this (neither the ones mentioned above nor Text.Regex.TDFA). I know that with Posix, [[:word:]]+ would have the same effect, but I would like to use the variant mentioned above.

This leads to the following questions:

Is there any package to achieve that?
If there really is, why is there a different common usage?
What advantages or disadvantages are there?

Do you require regular expressions to split the words? There's a function words that does exactly what you want.
– Adam Wagner, Commented Dec 7, 2011 at 14:27
Thanks, I didn't know that function, but it doesn't do what I want. If there are any dots, commas etc. in a string, the regex would ignore them but words would keep them attached. E.g. Prelude> words "Just a simple test." would result in ["Just","a","simple","test."]. I want it without the dot.
– beyeran, Commented Dec 7, 2011 at 14:37

Matvey Aksenov · Accepted Answer · 2011-12-07 15:56:23Z

13

I'd use Adam's suggestion or (perhaps more readable)

    > :m +Data.Char
    > :m +Data.List.Split
    > wordsBy (not . isLetter) "Just a simple test."
    ["Just","a","simple","test"]

No need for regexps here.

edited Dec 7, 2011 at 15:56

answered Dec 7, 2011 at 14:56


2 Comments

ДМИТРИЙ МАЛИКОВ Over a year ago

Just a note: splitting into words is not equivalent to that. For example, wordsBy (not . isLetter) "I wanna have 14 balls." returns ["I","wanna","have","balls"], but 14 can actually be a word.

Matvey Aksenov Over a year ago

@ДМИТРИЙ This is not supposed to be a complete answer. Actually \w is letters ++ digits ++ "_", so not . isLetter is just a placeholder. I wanted to show an easy and understandable splitting pattern.
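
The comment above describes \w as letters, digits and "_". A minimal sketch of that idea using only base might look as follows. It assumes that Data.Char's Unicode-aware isAlphaNum is an acceptable stand-in for "letters and digits" (it accepts more than ASCII), and the names isWordChar and wordRuns are made up for illustration:

```haskell
import Data.Char (isAlphaNum)

-- Approximation of the regex class \w: letters, digits and underscore.
-- Assumption: isAlphaNum is Unicode-aware, so this is wider than [A-Za-z0-9_].
isWordChar :: Char -> Bool
isWordChar c = isAlphaNum c || c == '_'

-- Collect maximal runs of word characters, similar to finding all
-- non-overlapping matches of \w+ from left to right.
wordRuns :: String -> [String]
wordRuns s =
  case dropWhile (not . isWordChar) s of
    ""   -> []                                   -- only separators were left
    rest -> let (w, rest') = span isWordChar rest
            in w : wordRuns rest'                -- keep the run, continue after it
```

With this sketch, wordRuns "I wanna have 14 balls." keeps the number as its own token, and the trailing dot is dropped.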

Chris Kuklewicz · Accepted Answer · 2011-12-07 14:34:22Z

'\w' is a Perl pattern and is supported by PCRE, which you can access in Haskell with my regex-pcre package or the pcre-light library. If your input is a list of Char then the 'words' function in the standard Prelude may be enough; if your input is an ASCII bytestring then Data.ByteString.Char8 may work. There may be a utf8 library with word splitting, but I cannot quickly find it.

Adam Wagner · Accepted Answer · 2011-12-07 15:22:25Z

If you want to break into words and filter out things other than letters, you could use filter and isAlpha or isAlphaNum (or any of the other is functions in Data.Char that suit your needs).

    import Data.Char

    wordsButOnlyLetters = map (filter isAlpha) . words
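
One thing to be aware of with the composition above: a token that contains no letters at all (such as 14) is filtered down to an empty string rather than dropped, and punctuation inside a token is deleted rather than treated as a separator. A hedged sketch of a variant that drops the emptied tokens, using only base; lettersOnlyWords is a hypothetical name, not part of the original answer:

```haskell
import Data.Char (isAlpha)

-- Split on whitespace, keep only the letters of each token,
-- then discard tokens that ended up empty (e.g. "14" or "...").
lettersOnlyWords :: String -> [String]
lettersOnlyWords = filter (not . null) . map (filter isAlpha) . words
```

For example, lettersOnlyWords "I wanna have 14 balls." gives ["I","wanna","have","balls"], while "don't stop" becomes ["dont","stop"] because the apostrophe is removed inside the token instead of splitting it.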

Marko Tunjic · Accepted Answer · 2014-03-23 16:14:01Z

The words function works well, but it is really just 'split on whitespace'. Use splitRegex instead.

    -- Text.Regex is provided by the regex-compat package
    import Text.Regex (splitRegex, mkRegex)

    splitByWord :: String -> [String]
    splitByWord = splitRegex (mkRegex "[^a-zA-Z]+")

    > splitByWord "Word splitting with regular expressions in Haskell"
    ["Word","splitting","with","regular","expressions","in","Haskell"]

Could not find module ‘Text.Regex’ Perhaps you meant Text.Read
