A Haskell regular expression tutorial

Someone showed up on #haskell yesterday, asking how to use regular expressions. This isn’t a completely straightforward question to answer. While Haskell’s regexp libraries provide the same functionality you’ll find in Perl, Python, and Java, they expose it through a rich and fairly abstract interface that can be daunting to newcomers.

So let’s fix that, and strip away the abstractions to show how we might actually use regexps in practice. I’m assuming that you’re already familiar with regexps from using them in some other language, and simply want to find your feet in Haskell. The standard library that implements regexps is Text.Regex.Posix. As the name suggests, this wraps the system’s native POSIX extended regexp library.

If you’re coming from a language like Perl, Python, or Java, you may not have encountered POSIX regular expressions before. Perl-style regexps are more expressive, and hence far more widely used. POSIX regexps look superficially similar, but they support fewer constructs and they match differently: a POSIX matcher returns the leftmost, longest match, while a Perl-style matcher returns the first alternative that succeeds. There aren’t any concise online descriptions of the syntactic and semantic differences between the two regexp families, but the Wikipedia entry on regexps will help you to understand some of the differences. The only advantage of the POSIX regexp library is that it’s bundled with GHC; you don’t need to fetch any extra bits to get it to work.

Before we continue, let’s start up GHC’s interpreter, ghci.

 $ ghci
 Loading package base ... linking ... done.

To use the Text.Regex.Posix module, we use the :mod (or simply :m) command to load it and bring it into scope.

 Prelude> :mod +Text.Regex.Posix
 Prelude Text.Regex.Posix>

The simplest way to use regexps is via the “=~” function, which we normally use as an infix operator. And, of course, it’s exactly the same operator that Perl uses for regexp matching.

What’s interesting is that this function is polymorphic in both its arguments and its return type; this is why its documentation can be hard to read if you’re new to the language. Here’s a simplified type signature, to give you the idea.

 (=~) :: string -> pattern -> result

(I’ve left out the constraints on these type variables. The real type signature of =~ is huge, and only obscures early understanding of what on earth is going on, so I’m not going to reproduce it here.)

You can use either a normal Haskell string or a much more efficient ByteString as the pattern or string. You can use them in whatever combination you like; a String for one, a ByteString for the other, or whatever suits your needs.
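To make the mixing concrete, here is a minimal sketch that matches a String pattern against a ByteString subject. It assumes the regex-posix and bytestring packages are installed; firstWord is a name I made up for illustration. The function's type signature is what tells =~ which types to use.

```haskell
import qualified Data.ByteString.Char8 as B
import Text.Regex.Posix ((=~))

-- Hypothetical helper: the first run of lowercase letters in the input,
-- or an empty ByteString if there is none.
-- The pattern is an ordinary String; the subject and result are ByteStrings.
firstWord :: B.ByteString -> B.ByteString
firstWord s = s =~ "[a-z]+"
```

Loading this in ghci and evaluating firstWord (B.pack "123 hello world") should give back the ByteString "hello".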

If you try to use =~ interactively at the ghci prompt, ghci gives a fearsome error message.

 > "bar" =~ "(foo|bar)"
 <interactive>:1:0:
     No instance for (Text.Regex.Base.RegexLike.RegexContext Regex [Char] result)
       arising from use of `=~' at <interactive>:1:0-19
     Possible fix:
       add an instance declaration for
       (Text.Regex.Base.RegexLike.RegexContext Regex [Char] result)
     In the expression: "bar" =~ "(foo|bar)"
     In the definition of `it': it = "bar" =~ "(foo|bar)"

Yikes! What’s happened here is that we haven’t given a type for the result of the =~ operator. Since the result type is polymorphic, ghci has no way to infer what type we might actually want. We can easily fix this, by suffixing the expression with a type signature.

 > "bar" =~ "(foo|bar)" :: Bool
 True
 > "quux" =~ "(foo|bar)" :: Bool
 False

By constraining the type of the result to Bool, we get a simple “yes” or “no” answer when we ask if the match has succeeded.

But we can also use String as the result type, which gives us the first string that matches, or an empty string if the match fails.

 > let pat = "(foo[a-z]*bar|quux)"
 > "foodiebar, fooquuxbar" =~ pat :: String
 "foodiebar"
 > "nomatch" =~ pat :: String
 ""

If we use [String], we get a list of every string that matches.

 > "foodiebar, fooquuxbar" =~ pat :: [String]
 ["foodiebar","fooquuxbar"]
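Depending on which version of regex-base you have installed, the plain list-of-strings result type may not be available. As far as I know, newer releases offer the AllTextMatches wrapper and its getAllTextMatches accessor for this job instead. The following is a sketch under that assumption; allMatches is a hypothetical helper name.

```haskell
import Text.Regex.Posix ((=~), getAllTextMatches)

-- Hypothetical helper: every non-overlapping match of the pattern in the input.
-- The [String] result type fixes the wrapper to AllTextMatches [] String.
allMatches :: String -> String -> [String]
allMatches pattern input = getAllTextMatches (input =~ pattern)
```

With the pattern from the session above, allMatches "(foo[a-z]*bar|quux)" "foodiebar, fooquuxbar" should produce the same two-element list.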

It’s also possible to get more detail about the context in which a match occurred. The 3-tuple in this result consists of the text before a match; the matched text; and the text after the match.

 > "before foodiebar after" =~ pat :: (String,String,String)
 ("before ","foodiebar"," after")

The 4-tuple below adds the text of all subgroups of the match.

 > "before foodiebar after" =~ pat :: (String,String,String,[String])
 ("before ","foodiebar"," after",["foodiebar"])
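Outside of ghci you will rarely need these inline annotations, because the type signature of the enclosing function already tells =~ which result you want. Here is a minimal sketch of that idea, assuming the regex-posix package is installed. The names pat, hasMatch, firstMatch and matchContext are ones I made up for illustration.

```haskell
import Text.Regex.Posix ((=~))

-- The same pattern used in the ghci session.
pat :: String
pat = "(foo[a-z]*bar|quux)"

-- True when the pattern matches anywhere in the input.
hasMatch :: String -> Bool
hasMatch s = s =~ pat

-- The first matching substring, or "" when nothing matches.
firstMatch :: String -> String
firstMatch s = s =~ pat

-- Text before the match, the match itself, the text after it,
-- and the text of each parenthesised subgroup.
matchContext :: String -> (String, String, String, [String])
matchContext s = s =~ pat
```

Each function body is the same expression; only the declared result type changes what comes back.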

But wait, there’s more! You can get plain, simple offset information, either singly or in a list.

 > :mod +Text.Regex.Base
 > "the foodiebar" =~ pat :: (MatchOffset,MatchLength)
 (4,9)
 > "no match" =~ pat :: [(MatchOffset,MatchLength)]
 []

All of these different result types are instances of the RegexContext type class. By this point, you should have enough examples that you can read the complete documentation for the other RegexContext instances at Text.Regex.Base.Context without it seeming completely overwhelming. There are many more instances, some of which give you a lot of detailed information about a match.

There’s a monadic variant of the =~ operator, called =~~. You can use this to perform matches inside a monad, for example as follows. This binds a to the number of matches.

 > a <- "foo" =~~ "foo" :: IO Int
 1

You aren’t limited to the IO monad when you use the =~~ operator. For example, you can try a match in the Maybe monad to tidily handle the possibility of a failure.

 > "foo" =~~ "bar" :: Maybe String
 Nothing
 > "foo" =~~ "foo" :: Maybe String
 Just "foo"
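In a program, the Maybe result lets you pattern match on success or failure instead of checking for an empty string. A minimal sketch, assuming the regex-posix package; describeMatch is a hypothetical name.

```haskell
import Text.Regex.Posix ((=~~))

-- Hypothetical helper: report the first match of the pattern, if there is one.
describeMatch :: String -> String -> String
describeMatch pattern input =
  case input =~~ pattern :: Maybe String of
    Nothing -> "no match"
    Just m  -> "matched " ++ show m
```

Here the annotation on the scrutinee plays the same role as the annotations typed at the ghci prompt.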

Here’s an interesting pitfall to watch out for. There’s a small chance you could shoot yourself in the foot if you use a list as a RegexContext. Let me show you what I mean. This expression returns a list of all matches, which is what I’d normally expect.

 > "foo foo foo" =~ "foo" :: [String]
 ["foo","foo","foo"]

But this expression, which differs in only one character, runs the match in the list monad! It’s only ever going to return an empty or single-entry list. Eeek!

 > "foo foo foo" =~~ "foo" :: [String]
 ["foo"]

There’s normally no need to compile regular expression patterns. A pattern will be compiled the first time it’s used, and your Haskell runtime should memoise the compiled representation for you.
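If you would rather not rely on that sharing, you can compile a pattern explicitly and reuse the result. This is a minimal sketch, assuming the makeRegex and match functions that Text.Regex.Posix re-exports from regex-base; fooBar and matchingLines are hypothetical names.

```haskell
import Text.Regex.Posix (Regex, makeRegex, match)

-- A top-level constant, so the pattern is compiled at most once,
-- the first time the value is demanded.
fooBar :: Regex
fooBar = makeRegex "(foo[a-z]*bar|quux)"

-- Keep only the lines that contain a match, reusing the compiled pattern.
-- filter needs a String -> Bool, which selects the Bool result type.
matchingLines :: [String] -> [String]
matchingLines = filter (match fooBar)
```

The match function is the compiled-pattern counterpart of =~, and it is polymorphic in its result type in the same way.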

This is necessarily only a basic introduction to the Haskell regexp API. In practice, you will want to avoid the Text.Regex.Posix implementation; it’s terribly slow, and too strict besides. When I have time, I’ll talk about these problems in more detail, and what alternatives you can use. Until then, happy regexp matching!