Profiling trivial loop

Question

I am trying to understand how to use performance profiling. Here is a solution for problem "Line of sight" from USACO 2013.

import Data.Array.Unboxed import Data.List import Data.Int angle !a | a > 2 * pi = a - 2 * pi angle !a | a < 0 = a + 2 * pi angle !a = a tans :: Int64 -> [[Int64]] -> UArray (Int,Int) Double tans r cs = listArray ((0,0), (length cs - 1, 1)) $ concatMap f cs where f :: [Int64] -> [Double] f [x,y] = [angle a2, angle a1] where phi | y == 0 = if x < 0 then pi else 0.0 | otherwise = (fromIntegral $ signum y) * (acos $ (fromIntegral x) / d) d = sqrt $ fromIntegral $ x*x + y*y z = sqrt $ fromIntegral $ x*x + y*y - r*r a1 = phi + (acos $ (fromIntegral r)/d) a2 = phi - (acos $ (fromIntegral r)/d) overlap !a1 !a2 !a1' !a2' | a1 < a2 && a1' < a2' = a1 <= a2' && a1' <= a2 | a1 > a2 && a1' > a2' = overlap (a1 - 2*pi) a2 (a1' - 2*pi) a2' | a1 > a2 && a1' <= pi = overlap (a1 - 2*pi) a2 a1' a2' | a1 > a2 = overlap a1 (a2 + 2*pi) a1' a2' | a1 <= pi = overlap a1 a2 (a1' - 2*pi) a2' | otherwise = overlap a1 a2 a1' (a2' + 2 * pi) solve cows = length $ [ 1 | i <- [0..n] , j <- [i+1..n] , let a1 = cows ! (i,0) , let a2 = cows ! (i,1) , let a1' = cows ! (j,0) , let a2' = cows ! (j,1) , overlap a1 a2 a1' a2' ] where ((0,0),(n,1)) = bounds cows main = do ls <- getContents let ([n, r]: cows ) = map (map read . words) $ lines ls print $ solve $ tans r cows

I am using example dataset 5.in from http://www.usaco.org/current/data/sight.zip and get the following profile:

$ ghc -O2 -XBangPatterns -ddump-simpl sight3.hs $ ./sight3 < 5.in ... Sun Dec 01 23:35 2013 Time and Allocation Profiling Report (Final) sight3.EXE +RTS -p -hd -RTS total time = 10.46 secs (10459 ticks @ 1000 us, 1 processor) total alloc = 1,847,301,536 bytes (excludes profiling overheads) COST CENTRE MODULE %time %alloc solve Main 65.2 30.7 overlap Main 14.4 0.0 solve.a2' Main 8.9 32.5 solve.a1' Main 8.6 32.5 main.(...) Main 2.8 4.0 individual inherited COST CENTRE MODULE no. entries %time %alloc %time %alloc MAIN MAIN 49 0 0.0 0.0 100.0 100.0 main Main 99 0 0.0 0.1 99.9 100.0 main.r Main 110 1 0.0 0.0 0.0 0.0 tans Main 105 1 0.0 0.0 0.0 0.1 tans.f Main 106 10000 0.0 0.1 0.0 0.1 tans.f.a1 Main 112 10000 0.0 0.0 0.0 0.0 angle Main 111 20000 0.0 0.0 0.0 0.0 tans.f.d Main 109 10000 0.0 0.0 0.0 0.0 tans.f.phi Main 108 10000 0.0 0.0 0.0 0.0 tans.f.a2 Main 107 10000 0.0 0.0 0.0 0.0 solve Main 104 1 65.2 30.7 97.1 95.7 overlap Main 117 64368980 14.4 0.0 14.4 0.0 solve.a2' Main 116 49995000 8.9 32.5 8.9 32.5 solve.a1' Main 115 49995000 8.6 32.5 8.6 32.5 solve.a2 Main 114 9999 0.0 0.0 0.0 0.0 solve.a1 Main 113 9999 0.0 0.0 0.0 0.0 solve.(...) Main 103 1 0.0 0.0 0.0 0.0 solve.n Main 102 1 0.0 0.0 0.0 0.0 main.cows Main 101 1 0.0 0.0 0.0 0.0 main.(...) Main 100 1 2.8 4.0 2.8 4.0 CAF GHC.IO.Encoding.CodePage 83 0 0.0 0.0 0.0 0.0 CAF GHC.IO.Handle.Internals 82 0 0.0 0.0 0.0 0.0 CAF Text.Read.Lex 79 0 0.1 0.0 0.1 0.0 CAF GHC.IO.Encoding 75 0 0.0 0.0 0.0 0.0 CAF GHC.Int 71 0 0.0 0.0 0.0 0.0 CAF GHC.IO.Handle.FD 67 0 0.0 0.0 0.0 0.0 CAF:main1 Main 63 0 0.0 0.0 0.0 0.0 main Main 98 1 0.0 0.0 0.0 0.0 CAF:lvl3_r3iU Main 59 0 0.0 0.0 0.0 0.0

what is being allocated in solve.a1' and a2'? I thought that being strict, it won't allocate anything (and the computation is no different from solve.a1)
how to find out where the CPU is spent in solve? I'd expect the most cost being spent in overlap, and the enclosing loop to be very cheap in comparison.

(for the sake of stray reader, I add that this is purely a profiling exercise - I do have a solution that is hundreds of times faster, but it is boring from a profiling perspective even with plain lists)

Chris · Accepted Answer · 2013-12-02 13:06:31Z

ghc does not manage to fold the length computation with building the list - i.e. it allocates list cells.

If you rewrite solve to an explicit loop, the allocations vanish:

solve cows = n `seq` go 0 0 1 n where (_,(n,_)) = bounds cows go count i j n | i > n = count | j > n = go count (i+1) (i+2) n | overlap (cows ! (i,0)) (cows ! (i,1)) (cows ! (j,0)) (cows ! (j,1)) = go (count + 1) i (j + 1) n | otherwise = go count i (j + 1) n

As to why the allocations are attributed to a1' and a2', I have no clue.

Cpu use is dominated by the go function, which probably means array access. overlap is only about 15% of total run time.

Edit: here's (less readable) version with two of the array accesses moved out of the inner loop:

solve !cows = n `seq` go 0 0 where (_,(n,_)) = bounds cows go !count !i | i >= n = count | otherwise = go2 count i (i+1) (cows ! (i,0)) (cows ! (i,1)) go2 !count !i !j !a1 !a2 | j > n = go count (i+1) | overlap a1 a2 (cows ! (j,0)) (cows ! (j,1)) = go2 (count+1) i (j+1) a1 a2 | otherwise = go2 count i (j+1) a1 a2

I have a problem with this. 1. array access costing 4x more than a series of tests? 2. if that is all, why it runs for 3 seconds? (or over 15 seconds for 6.in - that's just a few million array accesses and comparisons) 3. allocations still are attributed to a1, a2, a1' and a2' (I kept them instead of supplying arguments directly) 4. rewriting with "go" didn't speed up the program
You didn't ask about speeding up the program, but only where allocations and time go. And while the data in 5.in should fit into L2 cache, accessing that can still take a lot of time compared to some branches (assuming a high branch prediction rate).
Note that my version is actually slower than the original because the array accesses with index i are not moved out of the inner j loop. Adding a faster version. Also, compile FP code with -fllvm
A C version is only 3 times faster - to get the Haskell version faster, you'll need to do low-level code tuning. Profiling the C version shows 80% time in overlap, btw. Gist: gist.github.com/bokesan/7750850
Yes, I didn't ask to speed up the program; but I'd expect the proof "where the problem was" to show that by changing the execution speed :) ok, I'll try the -fllvm

Collectives™ on Stack Overflow

Profiling trivial loop

1 Answer 1

7 Comments

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

7 Comments

Related