I have a piece of code that repeatedly samples from a probability distribution using sequence. Morally, it does something like this:
    sampleMean :: MonadRandom m => Int -> m Float -> m Float
    sampleMean n dist = do
        xs <- sequence (replicate n dist)
        return (sum xs / fromIntegral (length xs))

Except that it's a bit more complicated. The actual code I'm interested in is the function likelihoodWeighting in this GitHub repo.
I noticed that the running time scales nonlinearly with n. In particular, once n exceeds a certain value it hits the memory limit, and the running time explodes. I'm not certain, but I think this is because sequence is building up a long list of thunks which aren't getting evaluated until the call to sum.
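If that's right, my guess is that accumulating the sum strictly as each sample is drawn, rather than materializing the whole list first, would avoid it. Here is a minimal, untested sketch of what I mean (sampleMean' is just an illustrative name; I'm assuming the MonadRandom class from the MonadRandom package):

    {-# LANGUAGE BangPatterns #-}

    import Control.Monad (foldM)
    import Control.Monad.Random (MonadRandom)

    -- Accumulate the running total strictly instead of building a list.
    sampleMean' :: MonadRandom m => Int -> m Float -> m Float
    sampleMean' n dist = do
        total <- foldM step 0 [1 .. n]
        return (total / fromIntegral n)
      where
        -- The bang pattern forces the accumulator at every step, so no
        -- long chain of (+) thunks can build up.
        step !acc _ = do
            x <- dist
            return (acc + x)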
Once I get past about 100,000 samples, the program slows to a crawl. I'd like to optimize this (my feeling is that 10 million samples shouldn't be a problem), so I decided to profile it - but I'm having a little trouble understanding the output of the profiler.
Profiling
I created a short executable in a file main.hs that runs my function with 100,000 samples. Here's the output from doing
    $ ghc -O2 -rtsopts main.hs
    $ ./main +RTS -s

The first things I notice: it allocates nearly 1.5 GB on the heap and spends over 60% of its time on garbage collection. Is this generally indicative of too much laziness?
      1,377,538,232 bytes allocated in the heap
      1,195,050,032 bytes copied during GC
        169,411,368 bytes maximum residency (12 sample(s))
          7,360,232 bytes maximum slop
                423 MB total memory in use (0 MB lost due to fragmentation)

      Generation 0:  2574 collections,     0 parallel,  2.40s,  2.43s elapsed
      Generation 1:    12 collections,     0 parallel,  1.07s,  1.28s elapsed

      INIT  time    0.00s  (  0.00s elapsed)
      MUT   time    1.92s  (  1.94s elapsed)
      GC    time    3.47s  (  3.70s elapsed)
      RP    time    0.00s  (  0.00s elapsed)
      PROF  time    0.23s  (  0.23s elapsed)
      EXIT  time    0.00s  (  0.00s elapsed)
      Total time    5.63s  (  5.87s elapsed)

      %GC time      61.8%  (63.1% elapsed)

      Alloc rate    716,368,278 bytes per MUT second

      Productivity  34.2% of total user, 32.7% of total elapsed

Here are the results from
    $ ./main +RTS -p

The first time I ran this, it turned out that one function was being called repeatedly, and I realized I could memoize it, which sped things up by a factor of 2. It didn't solve the space leak, however.
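As an aside, the memoization was nothing exotic. A generic sketch of the pattern, with hypothetical names rather than the actual function from the repo, looks like this:

    import Data.Array (listArray, (!))

    -- Hypothetical illustration of the memoization pattern: cache a pure
    -- function of a small Int argument in a lazy array so repeated calls
    -- share a single computed result.
    memoized :: Int -> Double
    memoized = (cache !)
      where
        cache = listArray (0, 255) [ slowFunction i | i <- [0 .. 255] ]
        -- Stand-in for the real, repeatedly called function.
        slowFunction i = fromIntegral i ** 2.5

Anyway, here is the cost-centre breakdown: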
    COST CENTRE              MODULE                 no.  entries  %time %alloc   %time %alloc

    MAIN                     MAIN                     1        0    0.0    0.0   100.0  100.0
     main                    Main                   434        4    0.0    0.0   100.0  100.0
      likelihoodWeighting    AI.Probability.Bayes   445        1    0.0    0.3   100.0  100.0
       distributionLW        AI.Probability.Bayes   448        1    0.0    2.6     0.0    2.6
       getSampleLW           AI.Probability.Bayes   446   100000   20.0   50.4   100.0   97.1
        bnProb               AI.Probability.Bayes   458   400000    0.0    0.0     0.0    0.0
        bnCond               AI.Probability.Bayes   457   400000    6.7    0.8     6.7    0.8
        bnVals               AI.Probability.Bayes   455   400000   20.0    6.3    26.7    7.1
         bnParents           AI.Probability.Bayes   456   400000    6.7    0.8     6.7    0.8
        bnSubRef             AI.Probability.Bayes   454   800000   13.3   13.5    13.3   13.5
        weightedSample       AI.Probability.Bayes   447   100000   26.7   23.9    33.3   25.3
         bnProb              AI.Probability.Bayes   453   100000    0.0    0.0     0.0    0.0
         bnCond              AI.Probability.Bayes   452   100000    0.0    0.2     0.0    0.2
         bnVals              AI.Probability.Bayes   450   100000    0.0    0.3     6.7    0.5
          bnParents          AI.Probability.Bayes   451   100000    6.7    0.2     6.7    0.2
         bnSubRef            AI.Probability.Bayes   449   200000    0.0    0.7     0.0    0.7

Here's a heap profile. I don't know why it claims the runtime is 1.8 seconds - this run took about 6 seconds.

Can anyone help me interpret the profiler's output - that is, pinpoint where the bottleneck is - and suggest ways to speed things up?
Comments

- Use replicateM n instead of sequence . replicate n.
- replicateM n is defined as sequence . replicate n.
- length xs is always n, isn't it? So replace (length xs) with n, and then xs doesn't have to stay entirely resident in memory at any time. sum is a list consumer and replicateM is a list generator. This would eliminate the need for extra strictness as well.
- [...] randomR, so it generates the same 'random' number each time. You could try something like (I've not tested this): main = do { a <- sampleMean 10000 (randomRIO (0.0,1.0)); print a }
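Putting those suggestions together, the rewrite the comments describe would look roughly like the following. This is an untested sketch of my own, assuming the MonadRandom class from the MonadRandom package and randomRIO from System.Random:

    import Control.Monad (replicateM)
    import Control.Monad.Random (MonadRandom)
    import System.Random (randomRIO)

    -- As suggested in the comments: replicateM n instead of
    -- sequence (replicate n dist), and n instead of length xs.
    sampleMean :: MonadRandom m => Int -> m Float -> m Float
    sampleMean n dist = do
        xs <- replicateM n dist
        return (sum xs / fromIntegral n)

    main :: IO ()
    main = do
        a <- sampleMean 10000 (randomRIO (0.0, 1.0))
        print a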