the code takes a data of size 4k...it adds every 3 consecutive bytes and stores the result in a temporary buffer of size 4k. The temporary buffer is used for the generation of the histogram.
Vectorization is possible for the addition of 3 consecutive bytes which I did using SIMD instructions.
As per what Dietrich suggested, if instead of doing generating the histogram, I simply added the values in the temporary buffer, it executes very fast. But generation of histogram is the part which takes time. I did the profiling of the code using cache grind...the output is:
==11845== ==11845== I refs: 212,171 ==11845== I1 misses: 842 ==11845== LLi misses: 827 ==11845== I1 miss rate: 0.39% ==11845== LLi miss rate: 0.38% ==11845== ==11845== D refs: 69,179 (56,158 rd + 13,021 wr) ==11845== D1 misses: 2,905 ( 2,289 rd + 616 wr) ==11845== LLd misses: 2,470 ( 1,895 rd + 575 wr) ==11845== D1 miss rate: 4.1% ( 4.0% + 4.7% ) ==11845== LLd miss rate: 3.5% ( 3.3% + 4.4% ) ==11845== ==11845== LL refs: 3,747 ( 3,131 rd + 616 wr) ==11845== LL misses: 3,297 ( 2,722 rd + 575 wr) ==11845== LL miss rate: 1.1% ( 1.0% + 4.4% )
and the complete output is:
I1 cache: 65536 B, 64 B, 2-way associative D1 cache: 65536 B, 64 B, 2-way associative LL cache: 1048576 B, 64 B, 16-way associative Command: ./a.out Data file: cachegrind.out.11845 Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Thresholds: 0.1 100 100 100 100 100 100 100 100 Include dirs: User annotated: Auto-annotation: off -------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw -------------------------------------------------------------------------------- 212,171 842 827 56,158 2,289 1,895 13,021 616 575 PROGRAM TOTALS -------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function -------------------------------------------------------------------------------- 97,335 651 642 26,648 1,295 1,030 10,883 517 479 ???:??? 59,413 13 13 13,348 886 829 17 1 0 ???:_dl_addr 40,023 7 7 12,405 10 8 223 18 17 ???:core_get_signature 5,123 2 2 1,277 64 19 256 64 64 ???:core_get_signature_parallel 3,039 46 44 862 9 4 665 8 8 ???:vfprintf 2,344 11 11 407 0 0 254 1 1 ???:_IO_file_xsputn 887 7 7 234 0 0 134 1 0 ???:_IO_file_overflow 720 9 7 250 5 2 150 0 0 ???:__printf_chk 538 4 4 104 0 0 102 2 2 ???:__libc_memalign 507 6 6 145 0 0 114 0 0 ???:_IO_do_write 478 2 2 42 1 1 0 0 0 ???:strchrnul 350 3 3 80 0 0 50 0 0 ???:_IO_file_write 297 4 4 98 0 0 23 0 0 ???:_IO_default_xsputn
registerkeyword is practically ignored by modern compilers, andcounter[arr[i]]++is more readable (there is no difference in the resulting code).forloop, but then you unedited it. Does it make sense.?counterbecause it is only 512 bytes and so it fits easily into L1 cache.