
I have written a piece of code that counts the frequency of numbers between 0 and 255.

unsigned char arr[4096]; // aligned 64 bytes, filled with random characters
short counter[256];      // aligned 32 bytes
register int i;
for (i = 0; i < 4096; i++)
    ++counter[arr[i]];

It is taking a lot of time to execute; the random access to the counter array is very expensive.

Does anyone have any ideas for making the accesses sequential, or any other approach I could use?

  • 4
    FYI: the register keyword is practically ignored by modern compilers, and counter[arr[i]]++ is more readable (there is no difference in the resulting code). Commented Jul 1, 2011 at 1:54
  • 1
    @randy7 - I edited your for loop, but then you unedited it. Does it make sense? Commented Jul 1, 2011 at 1:55
  • 3
    @randy7: It would have been a register anyway. Check the assembly. Commented Jul 1, 2011 at 2:03
  • 2
    @Klee1: You do have spatial locality for counter because it is only 512 bytes and so it fits easily into L1 cache. Commented Jul 1, 2011 at 2:22
  • 1
    A lot of time to execute? This code runs in essentially zero time. How are you measuring "a lot of time?" Commented Jul 1, 2011 at 6:09

6 Answers

11

What makes you think that random access to the counter array is expensive? Have you profiled? Try Valgrind, which has a cache profiling tool called "cachegrind". Profiling also lets you know if the code is actually slow or if you just think it is slow because it ought to be.

This is a very simple piece of code, and before optimizing it is important to know whether it is memory bound (with respect to the data, not the histogram table). I can't answer that off the top of my head. Try comparing against a simple loop that just sums the entire input: if both run at about the same speed, then your algorithm is memory bound and you are done.

My best guess is that the main issue which could slow you down is this:

    Registers                          RAM
 1. <-- read data[i] ----------------
 2. <-- read histogram[data[i]] -----
 3. increment
 4. --- write histogram[data[i]] --->
 5. <-- read data[i] ----------------
 6. <-- read histogram[data[i]] -----

The compiler and processor are not allowed to reorder most of the instructions here (except #1 and #5, which can be done ahead of time) so you are basically going to be limited by whichever is smaller: the bandwidth of your L1 cache (which is where the histogram is) and the bandwidth of your main RAM, each multiplied by some unknown constant factor. (Note: the compiler can only move #1/5 around if it unrolls the loop, but the processor might be able to move it around anyway.)

Which is why you profile before you try to get clever -- because if your L1 cache has enough bandwidth, then you will always be starving for data and there is nothing you can do about it.

Footnote:

This code:

register int i;
for (i = 0; i < 4096; i++)
    ++counter[arr[i]];

Generates the same assembly as this code:

int i;
for (i = 0; i < 4096; i++)
    counter[arr[i]]++;

But this code is easier to read.


2 Comments

Well, the interesting question is whether the compiler can make sure there are no aliasing problems (for the posted code that should be trivial), in which case it could do pretty much the complete loop in parallel. So I'd check whether gcc vectorizes the code reasonably (there was some option to show which loops are vectorized) -- if not, that's probably a performance loss.
@Voo: I doubt this code is manually vectorizable, never mind automatically vectorizable, and it's not really relevant if the scalar code is memory bound.
0

More idiomatic:

// make sure you actually fill this with random chars
// if this is declared in a function, it _might_ have stack garbage
// if it's declared globally, it will be zeroed (which makes for a boring result)
unsigned char arr[4096];

// since you're counting bytes in a 4096-byte array, no count can ever
// exceed 4096, so size_t will never overflow for this usage
size_t counter[256];

for (size_t i = 0; i < sizeof(arr) / sizeof(*arr); ++i)
    ++counter[arr[i]];

Now the key is to compile with C99, and some serious optimization flags:

cc mycode.c -O3 -std=c99 

Any optimization at all on a simple loop like this will make it extremely quick. Don't waste more time on making something this simple faster.

2 Comments

short may fit better into cache than size_t. Even better, int_least16_t.
It's unlikely to make much difference in this example. It'll be 1-2K on most systems.
0

First, I completely agree with Dietrich: please prove (to yourself and to us) where the real bottleneck lies, first.

The only possible improvement I can see is in your short. The size of the table will not be a problem here, I guess, but promotions and overflow will. Use a type that handles this by default, namely unsigned.

Anyhow, counters should always be unsigned (even better, size_t); this is the semantics of cardinalities. As an extra advantage, unsigned types don't overflow, but wrap around in a controlled way. The compiler doesn't have to use an additional instruction for that.

Also, arithmetic in C is done at a width of at least that of int, so the result of the increment then has to be converted back to short.

Comments

0

The code takes data of size 4K, adds every 3 consecutive bytes, and stores the result in a temporary buffer of size 4K. The temporary buffer is used for the generation of the histogram.

Vectorization is possible for the addition of 3 consecutive bytes, which I did using SIMD instructions.

As Dietrich suggested, if instead of generating the histogram I simply add up the values in the temporary buffer, it executes very fast. But generating the histogram is the part that takes time. I profiled the code using cachegrind; the output is:


==11845==
==11845== I   refs:      212,171
==11845== I1  misses:        842
==11845== LLi misses:        827
==11845== I1  miss rate:    0.39%
==11845== LLi miss rate:    0.38%
==11845==
==11845== D   refs:       69,179  (56,158 rd + 13,021 wr)
==11845== D1  misses:      2,905  ( 2,289 rd +    616 wr)
==11845== LLd misses:      2,470  ( 1,895 rd +    575 wr)
==11845== D1  miss rate:     4.1% (   4.0%   +    4.7%  )
==11845== LLd miss rate:     3.5% (   3.3%   +    4.4%  )
==11845==
==11845== LL refs:         3,747  ( 3,131 rd +    616 wr)
==11845== LL misses:       3,297  ( 2,722 rd +    575 wr)
==11845== LL miss rate:      1.1% (   1.0%   +    4.4%  )

and the complete output is:

I1 cache:         65536 B, 64 B, 2-way associative
D1 cache:         65536 B, 64 B, 2-way associative
LL cache:         1048576 B, 64 B, 16-way associative
Command:          ./a.out
Data file:        cachegrind.out.11845
Events recorded:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown:     Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds:       0.1 100 100 100 100 100 100 100 100
Include dirs:
User annotated:
Auto-annotation:  off

--------------------------------------------------------------------------------
     Ir I1mr ILmr     Dr  D1mr  DLmr     Dw D1mw DLmw
--------------------------------------------------------------------------------
212,171  842  827 56,158 2,289 1,895 13,021  616  575  PROGRAM TOTALS

--------------------------------------------------------------------------------
     Ir I1mr ILmr     Dr  D1mr  DLmr     Dw D1mw DLmw  file:function
--------------------------------------------------------------------------------
 97,335  651  642 26,648 1,295 1,030 10,883  517  479  ???:???
 59,413   13   13 13,348   886   829     17    1    0  ???:_dl_addr
 40,023    7    7 12,405    10     8    223   18   17  ???:core_get_signature
  5,123    2    2  1,277    64    19    256   64   64  ???:core_get_signature_parallel
  3,039   46   44    862     9     4    665    8    8  ???:vfprintf
  2,344   11   11    407     0     0    254    1    1  ???:_IO_file_xsputn
    887    7    7    234     0     0    134    1    0  ???:_IO_file_overflow
    720    9    7    250     5     2    150    0    0  ???:__printf_chk
    538    4    4    104     0     0    102    2    2  ???:__libc_memalign
    507    6    6    145     0     0    114    0    0  ???:_IO_do_write
    478    2    2     42     1     1      0    0    0  ???:strchrnul
    350    3    3     80     0     0     50    0    0  ???:_IO_file_write
    297    4    4     98     0     0     23    0    0  ???:_IO_default_xsputn

1 Comment

The counter is then used for sorting and getting the top 8 most recurring values. An interesting observation: if I disable the sorting function, the generation of the histogram executes very fast (0-1 us, measured with gettimeofday), but if I enable the sorting, the time for generating the histogram jumps to 22 us. I believe the CPU is writing the data from the cache back to memory, and that is what is taking time.
0

Well, Richard is certainly right. This is because the compiler has to convert the array to pointer, but that takes some time, thus increasing execution time. For instance, try this:

for (i = 0; i < 4096; i++)
    ++*(counter + *(arr + i));

1 Comment

C does not specify any execution times.
-3

Consider using a pointer to arr, instead of indexing.

unsigned char p = &arr;
for (i = 4096-1; 0 <= i; --i)
    ++counter[*p++];

1 Comment

Even worse, char* is allowed to alias anything, so you're potentially making it harder for the compiler to prove that *p doesn't alias other data, even of other types. And there's a typo, you declared p as char instead of char *. All those factors add up to a downvote I otherwise wouldn't have bothered with.
