
I have some high-performance Haskell code whose inner loop compiles down to 6 assembly instructions. Modifying the inner loop to be less efficient has no noticeable impact on performance, suggesting the inner loop is not the bottleneck. However, when I turn on profiling, the assembly generated for the inner loop becomes dramatically less efficient, and the profiler reports that the inner loop takes 85% of the time.

I suspect something is unnecessarily slow, but when I turn on profiling to find out what, the profiling overhead makes the inner loop slow enough that it dominates the profile. What techniques can I use to see where the time is actually going? A sampling profiler would be great, if one existed for Haskell.
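For reference, the generated assembly for the two builds can be compared by dumping it at compile time (a sketch; Loop.hs is a placeholder for whichever module contains the loop):

    ghc -O2 -fforce-recomp -ddump-asm Loop.hs                      # normal build
    ghc -O2 -prof -fprof-auto -fforce-recomp -ddump-asm Loop.hs    # profiled build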

  • Maybe if you add cost centres manually, you could avoid putting a cost centre on that loop so it doesn't get slower? Not sure whether that works, though (see the SCC sketch after these comments). Commented Jan 2, 2014 at 11:46
  • Good idea, but it doesn't seem to work. Even with no cost centres at all, just adding -prof puts a call to enterFunCCS in the inner loop, plus about 5 other additional instructions (but the call will be the one that kills it). Commented Jan 2, 2014 at 11:57
  • I have some hacks for doing sampling. Basically you just link with the profiling startup code. The problem is that the generated .prof file makes gprof die, probably because gprof makes some assumptions about the symbols. A hacked-up gprof that knows about GHC symbols should be able to work, though. Commented Jan 2, 2014 at 13:29
  • Would it be possible to comment out the loop? That way you could profile the rest of the code without the loop messing up the profile. Commented Jan 2, 2014 at 14:57
  • Maybe you could try using "ticky-ticky" profiling: ghc.haskell.org/trac/ghc/wiki/Debugging/TickyTicky (I have no experience doing this; I just stumbled over it on the GHC wiki.) Commented Jan 4, 2014 at 15:00
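Regarding the first comment's suggestion: manual cost centres are added with SCC pragmas, and without -fprof-auto only the annotated expressions get them. A minimal sketch, with made-up module and function names:

    {-# LANGUAGE BangPatterns #-}
    module Loop where

    -- Annotate only the wrapper; leave the hot loop itself without a cost centre.
    driver :: Int -> Int
    driver n = {-# SCC "driver" #-} go 0 n
      where
        -- The tight inner loop: no SCC here, so only the explicit cost centre
        -- above shows up in the profile (though, as the follow-up comment notes,
        -- -prof alone still changes the code generated for the loop).
        go :: Int -> Int -> Int
        go !acc 0 = acc
        go !acc k = go (acc + k) (k - 1)

Compiling this with ghc -O2 -prof Loop.hs (and no -fprof-auto) leaves go without a cost centre of its own.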

1 Answer


You can use Linux perf events: https://ghc.haskell.org/trac/ghc/wiki/Debugging/LowLevelProfiling/Perf
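Roughly, the workflow from that page is to sample the ordinary (non-profiled) binary with perf and then inspect the report (a sketch; queens matches the example output below, but any GHC-compiled executable works):

    perf record ./queens    # sample the running program
    perf report             # break the samples down by symbol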

This will give you an output that looks like:

    # Samples: 9161149923
    #
    # Overhead  Command  Shared Object      Symbol
    # ........  .......  .................  ......
    #
      30.65%    queens   queens             [.] s1ql_info
      18.67%    queens   queens             [.] s1qj_info
      12.17%    queens   queens             [.] s1qi_info
       9.94%    queens   queens             [.] s1o9_info
       5.85%    queens   queens             [.] r1nI_info
       5.33%    queens   queens             [.] s1sF_info
       5.18%    queens   queens             [.] s1sG_info
       3.69%    queens   queens             [.] s1oP_info
       1.68%    queens   queens             [.] stg_upd_frame_info
       0.88%    queens   queens             [.] stg_ap_2_upd_info
       0.62%    queens   queens             [.] s1sE_info
       0.56%    queens   [kernel]           [k] read_hpet
       0.39%    queens   queens             [.] stg_ap_p_info
       0.35%    :2030    f76beb             [.] 0x00000000f76beb
       0.31%    queens   queens             [.] s1oD_info
       0.28%    swapper  [kernel]           [k] mwait_idle_with_hints
       0.25%    queens   queens             [.] __stg_gc_enter_1
       0.23%    queens   queens             [.] evacuate
       0.18%    swapper  [kernel]           [k] read_hpet
       0.12%    queens   queens             [.] scavenge_block

If you save the Core as you compile, you can map these symbols back to the functions in Core.
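One way to save it alongside the build (a sketch; flag spellings assume a moderately recent GHC, and Main.hs is a placeholder):

    ghc -O2 -ddump-simpl -ddump-to-file -dsuppress-all Main.hs
    # writes the simplified Core to Main.dump-simpl; the local binder names
    # there (or in the STG dump from -ddump-stg) line up with the s1ql_info-style
    # symbols in the perf report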

A bit painful, but gives you more trustworthy results.

There is some work afoot to do this automatically.
