12

There are several packages for using regular expressions in Haskell (e.g. Text.Regex.Base, Text.Regex.Posix, etc.). Most packages I've seen so far support only a subset of the regex syntax I know. By that I mean: I am used to splitting a sentence into words with the following regex:

\\w+ 

Nearly all the Haskell packages I have tried so far don't support this (neither the ones mentioned above nor Text.Regex.TDFA). I know that with Posix, [[:word:]]+ would have the same effect, but I would like to use the variant mentioned above.

This leads to three questions:

  1. Is there any package to achieve that?
  2. If there is, why is a different syntax in common use?
  3. What advantages or disadvantages are there?
2
  • 4
    Do you require regular expressions to split the words? There's a function words that does exactly what you want. Commented Dec 7, 2011 at 14:27
  • Thanks, I didn't know that function, but it doesn't do what I want. If a string contains dots, commas, etc., the regex would ignore them, but words keeps them attached. E.g. Prelude> words "Just a simple test." would result in ["Just","a","simple","test."]. I want it without the dot. Commented Dec 7, 2011 at 14:37

4 Answers

13

I'd use Adam's suggestion or (perhaps more readable)

> :m +Data.Char
> :m +Data.List.Split
> wordsBy (not . isLetter) "Just a simple test."
["Just","a","simple","test"]

No need for regexps here.


2 Comments

Just a note: this is not the same as splitting into words. For example, wordsBy (not . isLetter) "I wanna have 14 balls." returns ["I","wanna","have","balls"], but 14 can actually be a word.
@ДМИТРИЙ This is not supposed to be a complete answer. Actually \w is letters ++ digits ++ "_", so not . isLetter is just a placeholder. I wanted to show an easy and understandable splitting pattern.
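
To make the point about \w concrete, here is a minimal sketch that uses only base and treats letters, digits and "_" as word characters. The names isWordChar and wordRuns are made up for illustration and do not come from any library. Note also that isAlphaNum is Unicode-aware, so this accepts more than the ASCII range a typical \w covers.

```haskell
import Data.Char (isAlphaNum)

-- Hypothetical helper: approximates the regex class \w
-- (letters, digits, underscore). isAlphaNum is Unicode-aware,
-- so this is broader than ASCII [A-Za-z0-9_].
isWordChar :: Char -> Bool
isWordChar c = isAlphaNum c || c == '_'

-- Collect every maximal run of word characters,
-- similar to matching \w+ repeatedly over the input.
wordRuns :: String -> [String]
wordRuns s =
  case dropWhile (not . isWordChar) s of
    ""   -> []
    rest ->
      let (w, rest') = span isWordChar rest
      in  w : wordRuns rest'
```

With this definition, wordRuns "I wanna have 14 balls." keeps the 14 and drops the trailing dot.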
11

The '\w' is a Perl pattern, and it is supported by PCRE, which you can access in Haskell with my regex-pcre package or the pcre-light library. If your input is a list of Char then the 'words' function in the standard Prelude may be enough; if your input is an ASCII bytestring then Data.ByteString.Char8 may work. There may be a utf8 library with word splitting, but I cannot quickly find it.


6

If you want to break into words and filter out things other than letters, you could use filter with isAlpha or isAlphaNum (or any of the other is functions in Data.Char that suit your needs).

import Data.Char

wordsButOnlyLetters = map (filter isAlpha) . words
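
One thing to be aware of with this approach: a token that contains no letters at all (such as "14") is not removed but turned into an empty string. The sketch below repeats the definition with a type signature and adds a variant that drops those empty tokens. The name nonEmptyLetterWords is only illustrative.

```haskell
import Data.Char (isAlpha)

-- The definition from the answer, with an explicit type signature.
wordsButOnlyLetters :: String -> [String]
wordsButOnlyLetters = map (filter isAlpha) . words

-- Hypothetical variant: also drop tokens that become empty
-- once all non-letters have been filtered out.
nonEmptyLetterWords :: String -> [String]
nonEmptyLetterWords = filter (not . null) . wordsButOnlyLetters
```

For "I have 14 balls." the first function yields ["I","have","","balls"], while the variant yields ["I","have","balls"]. Both still split on whitespace only, so "well-known" becomes the single token "wellknown".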


3

The words function works well, but it only splits on whitespace. To split on anything that is not a letter, use splitRegex.

import Text.Regex (splitRegex, mkRegex)

splitByWord :: String -> [String]
splitByWord = splitRegex (mkRegex "[^a-zA-Z]+")

> splitByWord "Word splitting with regular expressions in Haskell"
["Word","splitting","with","regular","expressions","in","Haskell"]

1 Comment

Could not find module ‘Text.Regex’ Perhaps you meant Text.Read
