I have this simple code in Python:
    input = open("baseforms.txt","r",encoding='utf8')
    S = {}
    for i in input:
        words = i.split()
        S.update( {j:words[0] for j in words} )
    print(S.get("sometext","not found"))
    print(len(S))
It needs about 300 MB of memory to run. The size of "baseforms.txt" is 123 MB.
I've written the same code in Haskell:
    {-# LANGUAGE OverloadedStrings #-}
    import qualified Data.Map as M
    import qualified Data.ByteString.Lazy.Char8 as B
    import Data.Text.Lazy.Encoding (decodeUtf8)
    import qualified Data.Text.Lazy as T
    import qualified Data.Text.Lazy.IO as I
    import Control.Monad (liftM)

    main = do
        text <- B.readFile "baseforms.txt"
        let m = (M.fromList . concatMap (parseLine . decodeUtf8)) (B.lines text)
        print (M.lookup "sometext" m)
        print (M.size m)
      where
        parseLine line = let base:forms = T.words line
                         in [(f, base) | f <- forms]

It requires 544 MB and is slower than the Python version. Why? Is it possible to optimise the Haskell version?
The data file can be downloaded here.
I've tried the non-lazy (strict) versions of Data.Text and Data.Text.IO; memory usage is around 650 MB.
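A minimal sketch of that strict variant (assuming the parsing mirrors the lazy version above; the exact code may have differed slightly):

    {-# LANGUAGE OverloadedStrings #-}
    -- Sketch: strict Data.Text / Data.Text.IO instead of the lazy modules.
    import qualified Data.Map as M
    import qualified Data.Text as T
    import qualified Data.Text.IO as TIO

    main :: IO ()
    main = do
        -- readFile from Data.Text.IO reads and decodes the whole file strictly
        text <- TIO.readFile "baseforms.txt"
        let m = M.fromList (concatMap parseLine (T.lines text))
        print (M.lookup "sometext" m)
        print (M.size m)
      where
        -- map every form on a line to the first word of that line (the base form)
        parseLine line = case T.words line of
            (base:forms) -> [(f, base) | f <- forms]
            []           -> []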
I've also tried hashtables, but memory usage grows to about 870 MB.
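And a minimal sketch of the hashtable attempt, assuming the hashtables package (Data.HashTable.IO with a BasicHashTable keyed by strict Text); again, the actual code may have differed:

    {-# LANGUAGE OverloadedStrings #-}
    -- Sketch: mutable hash table from the hashtables package.
    import qualified Data.HashTable.IO as H
    import qualified Data.Text as T
    import qualified Data.Text.IO as TIO

    type Table = H.BasicHashTable T.Text T.Text

    main :: IO ()
    main = do
        text  <- TIO.readFile "baseforms.txt"
        table <- H.new :: IO Table
        -- insert every form, pointing back at the base form (first word of the line)
        mapM_ (insertLine table) (T.lines text)
        result <- H.lookup table "sometext"
        print result
      where
        insertLine table line = case T.words line of
            (base:forms) -> mapM_ (\f -> H.insert table f base) forms
            []           -> return ()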