How to shorten this terrible HTTP header parser?

Question

I am trying to read a line ending in \r\n from a Handle. This is an HTTP header.

I’m fairly new to functional programming, so I tend to use a lot of case expressions and stuff. This makes the code very long and ugly.

handleRequest :: HostName -> Handle -> IO () handleRequest host handle = do requestLine <- readHeaderLine putStrLn $ requestLine ++ "\n-------------------" -- FIXME: This code is bad, and its author should feel bad. where readHeaderLine = do readHeaderLine' "" where readHeaderLine' s = do chr <- hGetChar handle case chr of '\r' -> do nextChr <- hGetChar handle case nextChr of '\n' -> return s _ -> readHeaderLine' $ s ++ [chr, nextChr] _ -> readHeaderLine' $ s ++ [chr]

How can I reduce the amount of case expressions in this code? I thought of using Parsec, but that seemed overkill to me for something this trivial, and I don’t know how well it works with Handles.

The answer depends on your goal. If you want high performance or an elegant HTTP parser, you should choose different algorithm and libraries. If you just want to see how your current code can be cleaned up with only structural changes without changing the algoritm, it's a different story. Please specify. — nponeccop
– nponeccop, Commented Sep 19, 2012 at 9:20
@nponeccop performance isn’t really an issue. The point is making the code more concise. — daknøk
– daknøk, Commented Sep 19, 2012 at 10:17
Can't you just use hGetLine? There are many alternatives to your code. For example, you can just use hGetContent and parse the resulting list. — nponeccop
– nponeccop, Commented Sep 19, 2012 at 10:54
@nponeccop hGetContent won’t work, since the length of the data isn’t known before the headers are parsed. hGetLine won’t work if the headers contain a newline. — daknøk
– daknøk, Commented Sep 19, 2012 at 11:40
hGetContent doesn't require length to be known in advance. — nponeccop
– nponeccop, Commented Sep 19, 2012 at 12:38

Dan Burton · Accepted Answer · 2012-09-20 01:27:02Z

You can use hGetContents to read lazily from a handle, and the split package (which will hopefully be bundled into the next Haskell Platform release) provides some nice ways of dealing with lists.

import Data.List.Split (splitOn) import System.IO (hGetContents) ... do ... contentLines <- splitOn "\r\n" <$> hGetContents handle

contentLines will contain a lazy list of the "entire contents" of that handle, split into chunks that were originally separated by \r\n.

Another approach is to use something like the conduit package. See, for example, Data.Conduit.Binary.lines. Keep an eye on conduit & friends; I get the feeling that within the next year or so the Haskell community will start to agree on the "best" implementation of this sort of abstraction, and then some good tutorials will inevitably follow.

nponeccop · Accepted Answer · 2012-09-19 15:13:03Z

Here I assume you just want to see how your current code can be improved without changing the algoritm.

First of all, give HLint tool a chance to suggest you obvious improvements. In your case the only improvement was that do in do readHeaderLine' "" was redundant, so not much.

Second, in my opinion many small top-level definitions are better than few large ones. You can still control namespace pollution by not exporting definitions local to the module:

import System.IO type HostName = String handleRequest :: HostName -> Handle -> IO () handleRequest host handle = do requestLine <- readHeaderLine handle putStrLn $ requestLine ++ "\n-------------------" -- FIXME: This code is bad, and its author should feel bad. readHeaderLine handle = readHeaderLine' "" where readHeaderLine' s = do chr <- hGetChar handle case chr of '\r' -> do nextChr <- hGetChar handle case nextChr of '\n' -> return s _ -> readHeaderLine' $ s ++ [chr, nextChr] _ -> readHeaderLine' $ s ++ [chr]

Next, your nextChr <- hGetChar handle ; case nextChr of limes appear twice, so you can extract a function and reduce code duplication.

Thanks to purity, you can do pretty mechanically:

foo handle quux bar baz = do chr <- hGetChar handle if chr == quux then bar else baz

Now replace the two fragments with calls to foo. Let's start with inner one:

readHeaderLine handle = readHeaderLine' "" where readHeaderLine' s = do chr <- hGetChar handle case chr of '\r' -> do foo handle '\n' (return s) (readHeaderLine' $ s ++ [chr, nextChr]) _ -> readHeaderLine' $ s ++ [chr]

Heh, it didn't work because nextChr is only known inside foo. No problem, pass it as parameter to baz branch and use lambda to catch it:

readHeaderLine handle = readHeaderLine' "" where readHeaderLine' s = do chr <- hGetChar handle case chr of '\r' -> do foo handle '\n' (return s) (\nextChr -> readHeaderLine' $ s ++ [chr, nextChr]) _ -> readHeaderLine' $ s ++ [chr] foo handle quux bar baz = do chr <- hGetChar handle if chr == quux then bar else baz chr

So now you can do the outer one.

readHeaderLine handle = readHeaderLine' "" where readHeaderLine' s = do foo handle '\r' (foo handle '\n' (return s) (\nextChr -> readHeaderLine' $ s ++ [chr, nextChr])) (\chr -> readHeaderLine' $ s ++ [chr])

No luck again as first chr is nowhere to get from. Fortunately we know it's always '\r', so

readHeaderLine handle = readHeaderLine' "" where readHeaderLine' s = do foo handle '\r' (foo handle '\n' (return s) (\nextChr -> readHeaderLine' $ s ++ ['\r', nextChr])) (\chr -> readHeaderLine' $ s ++ [chr])

As line got too long, we can split it, removing another redundant do:

readHeaderLine handle = readHeaderLine' "" where readHeaderLine' s = foo handle '\r' haveCR noCR where haveCR = foo handle '\n' (return s) haveCRnoLF noCR chr = readHeaderLine' $ s ++ [chr] haveCRnoLF nextChr = readHeaderLine' $ s ++ ['\r', nextChr]

Now there are more repeated patterns to eliminate: foo handle and readHeaderLine' $ s ++. To remove foo handle we move foo back in and remove its handle parameter both from applications and definition as it's now accessible from closure:

readHeaderLine handle = readHeaderLine' "" where readHeaderLine' s = foo '\r' haveCR noCR where haveCR = foo '\n' (return s) haveCRnoLF noCR chr = readHeaderLine' $ s ++ [chr] haveCRnoLF nextChr = readHeaderLine' $ s ++ ['\r', nextChr] foo quux bar baz = do chr <- hGetChar handle if chr == quux then bar else baz chr

To eliminate readHeaderline' repeated patterns we extract them into recurse local function:

readHeaderLine handle = readHeaderLine' "" where readHeaderLine' s = foo '\r' haveCR noCR where haveCR = foo '\n' (return s) haveCRnoLF noCR chr = recurse [chr] haveCRnoLF nextChr = recurse ['\r', nextChr] recurse x = readHeaderLine' $ s ++ x foo quux bar baz = do chr <- hGetChar handle if chr == quux then bar else baz chr

This is how far you can get with mechanical code deduplication. Now it's time for heavier weapons. You can still:

separate recursive code from non-recursive code
separate monadic code from non-monadic code

The initial redHeaderLine' call can be implemented using recurse with an extra parameter:

readHeaderLine handle = recurse [] [] where recurse s x = readHeaderLine' $ s ++ x readHeaderLine' s = foo '\r' haveCR noCR where haveCR = foo '\n' (return s) haveCRnoLF noCR chr = recurse s [chr] haveCRnoLF nextChr = recurse s ['\r', nextChr] foo quux bar baz = do chr <- hGetChar handle if chr == quux then bar else baz chr

Now we can inline readHeaderLine' as it is only applied once:

readHeaderLine handle = recurse [] [] where recurse s1 x = foo '\r' haveCR noCR where s = s1 ++ x haveCR = foo '\n' (return s) haveCRnoLF noCR chr = recurse s [chr] haveCRnoLF nextChr = recurse s ['\r', nextChr] foo quux bar baz = do chr <- hGetChar handle if chr == quux then bar else baz chr

And we can remove duplication of recurse s:

readHeaderLine handle = recurse [] [] where recurse s1 x = foo '\r' haveCR noCR where s = s1 ++ x rf = recurse s haveCR = foo '\n' (return s) haveCRnoLF noCR chr = rf [chr] haveCRnoLF nextChr = rf ['\r', nextChr] foo quux bar baz = do chr <- hGetChar handle if chr == quux then bar else baz chr

Now let's put return value of recurse into a local declaration g:

readHeaderLine handle = recurse [] [] where recurse s1 x = g where s = s1 ++ x rf = recurse s g = foo '\r' haveCR noCR haveCR = foo '\n' (return s) haveCRnoLF noCR chr = rf [chr] haveCRnoLF nextChr = rf ['\r', nextChr] foo quux bar baz = do chr <- hGetChar handle if chr == quux then bar else baz chr

Our goal is to divorse g from recurse. You can do it by adding parameters to both recurse and g and localizing identifiers used only in recurse and used only in g:

readHeaderLine handle = recurse g [] [] where recurse g s1 x = g rf s where s = s1 ++ x rf = recurse g s

g rf s = foo '\r' haveCR noCR where haveCR = foo '\n' (return s) haveCRnoLF noCR chr = rf [chr] haveCRnoLF nextChr = rf ['\r', nextChr] foo quux bar baz = do chr <- hGetChar handle if chr == quux then bar else baz chr

Now recurse is completely self-contained:

readHeaderLine handle = recurse g [] [] where g rf s = foo '\r' haveCR noCR where haveCR = foo '\n' (return s) haveCRnoLF noCR chr = rf [chr] haveCRnoLF nextChr = rf ['\r', nextChr] foo quux bar baz = do chr <- hGetChar handle if chr == quux then bar else baz chr

recurse g s1 x = g rf s where s = s1 ++ x rf = recurse g s

But g is still recursive: it has a nasty rf parameter which is an indirect recursive application. We need to move rf into recurse too. So here comes a trick: convert a function call into a constructor.

g can have only 3 return values: return s, rf [chr] and rf ['\r', nextChr]. We can represent them with a data type and return it instead of calling return or rf:

data Outcomes a b c = RF1 a | RF2 b | Return c readHeaderLine handle = recurse g [] [] where g rf s = foo '\r' haveCR noCR where haveCR = foo '\n' (return $ Return s) haveCRnoLF noCR chr = return $ RF1 [chr] haveCRnoLF nextChr = return $ RF2 ['\r', nextChr] foo quux bar baz = do chr <- hGetChar handle if chr == quux then bar else baz chr recurse g s1 x = analyzeOutcomes $ g rf s where s = s1 ++ x rf = recurse g s analyzeOutcomes outcomeM = do outcome <- outcomeM case outcome of RF1 a -> rf a RF2 a -> rf a Return a -> return a

Now rf parameter is unused, so we can clean the definitions of g and recurse:

data Outcomes a b c = RF1 a | RF2 b | Return c

readHeaderLine handle = recurse g [] [] where g s = foo '\r' haveCR noCR where haveCR = foo '\n' (return $ Return s) haveCRnoLF noCR chr = return $ RF1 [chr] haveCRnoLF nextChr = return $ RF2 ['\r', nextChr] foo quux bar baz = do chr <- hGetChar handle if chr == quux then bar else baz chr recurse g s1 x = analyzeOutcomes $ g s where s = s1 ++ x rf = recurse g s analyzeOutcomes outcomeM = do outcome <- outcomeM case outcome of RF1 a -> rf a RF2 a -> rf a Return a -> return a

Now two more improvements: a) RF1 and RF2 outcomes can be joined into one outcome as they are handled uniformly and have the same types; b) the only reason we pass s is to return it in Return outcome, so we can eliminate s argument of g too.

data Outcomes a = RF a | Return readHeaderLine handle = recurse g [] [] where g = foo '\r' haveCR noCR where haveCR = foo '\n' (return Return) haveCRnoLF noCR chr = return $ RF [chr] haveCRnoLF nextChr = return $ RF ['\r', nextChr] foo quux bar baz = do chr <- hGetChar handle if chr == quux then bar else baz chr recurse g s1 x = analyzeOutcomes g where s = s1 ++ x rf = recurse g s analyzeOutcomes outcomeM = do outcome <- outcomeM case outcome of RF a -> rf a Return -> return s

Now rf and analyzeOutcomes are used only once and Outcomes type became the same as Maybe. So:

readHeaderLine handle = recurse g [] [] where g = foo '\r' haveCR noCR where haveCR = foo '\n' (return Nothing) haveCRnoLF noCR chr = return $ Just [chr] haveCRnoLF nextChr = return $ Just ['\r', nextChr] foo quux bar baz = do chr <- hGetChar handle if chr == quux then bar else baz chr recurse outcomeM s1 x = do outcome <- outcomeM let s = s1 ++ x in case outcome of Just a -> recurse outcomeM s a Nothing -> return s

Now s1 and x are only used in recurse to construct s. We can then construct s outside of recurse and pass it. Also, g now can be inlined.

readHeaderLine handle = recurse (foo '\r' haveCR noCR) [] where haveCR = foo '\n' (return Nothing) haveCRnoLF noCR chr = return $ Just [chr] haveCRnoLF nextChr = return $ Just ['\r', nextChr] foo quux bar baz = do chr <- hGetChar handle if chr == quux then bar else baz chr recurse outcomeM s = do outcome <- outcomeM case outcome of Just a -> recurse outcomeM (s ++ a) Nothing -> return s

Now note that outcomeM is just a constant and it is not changed across recursive calls. So we can proceed further with our splitting of recursive and non-recursive code:

recurse outcomeM s = f s where f s = do outcome <- outcomeM case outcome of Just a -> f (s ++ a) Nothing -> return s

And duplicate return $ Just can be moved inside foo:

readHeaderLine handle = recurse (foo '\r' haveCR noCR) [] where haveCR = foo '\n' (return Nothing) haveCRnoLF noCR chr = [chr] haveCRnoLF nextChr = ['\r', nextChr] foo quux bar baz = do chr <- hGetChar handle if chr == quux then bar else return (Just $ baz chr)

After renaming of nonsense identifiers in definition of foo we get:

readHeaderLine handle = recurse (match '\r' haveCR noCR) [] where haveCR = match '\n' (return Nothing) haveCRnoLF noCR chr = [chr] haveCRnoLF nextChr = ['\r', nextChr] match expectedChar onSuccess onFailure = do actualChar <- hGetChar handle if actualChar == expectedChar then onSuccess else return (Just $ onFailure actualChar) recurse outcomeM s = f s where f s = do outcome <- outcomeM case outcome of Just a -> f (s ++ a) Nothing -> return s

Perhaps replace the expression recurse (match '\r' haveCR noCR) [] with unfoldM (match '\r' haveCR noCR), where unfoldM is in the monad-loops package. — dave4420
– dave4420, Commented Sep 19, 2012 at 17:28

Frerich Raabe · Accepted Answer · 2012-09-19 16:13:23Z

You're trying to read from stdin until "\r\n" is encountered? You could just recursively call hGetChar until you find the right sequence (the buffer is maintained in reverse order and reversed at the end for efficiency reasons and to allow nice pattern-matching):

readHeaderLine :: Handle -> IO String readHeaderLine h = reverse <$> go "" where go :: String -> IO String go ('\n':'\r':s) = return s go xs = do ch <- hGetChar h go (ch:xs)

I changed go s@('\n':'\r':_) = return s into go s@('\n':'\r':s') = return s' so that it wouldn’t include the \r\n in the result. — daknøk
– daknøk, Commented Sep 19, 2012 at 16:11
@RadekSlupik: Ah, good point - I overlooked that the EOL marker is not included in the output. I'll adjust my answer accordingly (you don't need the s@ part anymore in that case). — Frerich Raabe
– Frerich Raabe, Commented Sep 19, 2012 at 16:12

solrize · Accepted Answer · 2012-09-19 20:54:02Z

Note as Frerich mentioned that one serious antipattern in that code is the use of ++ to build up a list element by element. That's bad because each time you add an element, ++ traverses the whole list so far, which means building an n-element list takes O(n**2) operations. It's better to use the ByteString snoc function, or use a difference list, or build up the return string in reverse order and then reverse it.

As for recognizing the \r\n, it's pretty easy to code a straightforward state machine, using a datatype to remember what's been seen so far. In the code below I build up the reversed output and then reverse it, and introduce a newtype for reversed lists to keep track of when a list hasn't been reversed yet. "snoc" inserts a new element at the front of a reversed list, which means it will be at the end of the unreversed list. "snoc" is "cons" spelled backwards and "cons" is the traditional name (from Lisp) for putting a new element at the front of a linked list.

import System.IO -- Seen is the datatype for the current state -- we don't call it State because that's a commonly used -- library module data Seen = SeenNothing | SeenCR type HostName = String handleRequest :: HostName -> Handle -> IO () handleRequest host handle = do requestLine <- readHeaderLine handle putStrLn $ requestLine ++ "\n-------------------" -- below is a datatype we'll use for reversed lists newtype Reversed a = Reversed a snoc :: Reversed [a] -> a -> Reversed [a] snoc (Reversed xs) x = Reversed (x:xs) -- turn a reversed list back to a normal one unreverse :: Reversed [a] -> [a] unreverse (Reversed xs) = reverse xs readHeaderLine h = do r <- f h SeenNothing (Reversed "") return $ unreverse r -- this helper function does the work of building the reversed string f :: Handle -> Seen -> Reversed String -> IO (Reversed String) f h state result = do c <- hGetChar h let r = snoc result c go :: Seen -> IO (Reversed String) go state = f h state r case (state,c) of (SeenCR,'\n') -> return r (_, '\r') -> go SeenCR otherwise -> go SeenNothing

Stack Exchange Network

How to shorten this terrible HTTP header parser?

4 Answers 4

You must log in to answer this question.

Hot Network Questions

How to shorten this terrible HTTP header parser?

4 Answers 4

You must log in to answer this question.

Related

Hot Network Questions