
I've been banging my head against this one for hours, and I always end up with thread contention eating up any performance gains from parallelizing my loop.

I'm trying to calculate a histogram of an 8-bit grayscale gigapixel image. People who have read the book "CUDA by Example" will probably know where this is coming from (Chapter 9).

The method is very very simple (resulting in a very tight loop). It's basically just

    private static void CalculateHistogram(uint[] histo, byte[] buffer)
    {
        foreach (byte thisByte in buffer)
        {
            // increment the histogram at the position
            // of the current array value
            histo[thisByte]++;
        }
    }

where buffer is an array of 1024^3 elements.

On a somewhat recent Sandy Bridge-EX CPU, building a histogram of 1 billion elements takes about 1 second running on one core.
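
In case you want to reproduce this: a minimal timing harness could look like the following sketch (the Stopwatch setup and the randomly filled buffer are just for illustration, not my exact measurement code):

    // minimal benchmark sketch (illustrative only)
    var buffer = new byte[1024 * 1024 * 1024];
    new Random(42).NextBytes(buffer); // fill with pseudo-random gray values

    var histo = new uint[256];
    var sw = System.Diagnostics.Stopwatch.StartNew();
    CalculateHistogram(histo, buffer);
    sw.Stop();
    Console.WriteLine("single core: " + sw.Elapsed);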

Anyways, I tried speeding up the calculation by distributing the loop among all my cores and ended up with a solution that's 50 times slower.

    private static void CalculateHistogramParallel(byte[] buffer, ref int[] histo)
    {
        // create a variable holding a reference to the histogram array
        // (ref parameters can't be captured by lambdas)
        int[] histocopy = histo;

        var parallelOptions = new ParallelOptions
        {
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        // loop through the buffer array in parallel
        Parallel.ForEach(
            buffer,
            parallelOptions,
            thisByte => Interlocked.Increment(ref histocopy[thisByte]));
    }

Quite obviously, that's because of the performance impact of the atomic increment.

No matter what I tried (like range partitioners [http://msdn.microsoft.com/en-us/library/ff963547.aspx], concurrent collections [http://msdn.microsoft.com/en-us/library/dd997305(v=vs.110).aspx], and so on), it boils down to the fact that I'm reducing one billion elements to 256 elements, and I always end up with contention while trying to access my histogram array.

My last try was to use a range partitioner like

    var rangePartitioner = Partitioner.Create(0, buffer.Length);

    Parallel.ForEach(rangePartitioner, parallelOptions, range =>
    {
        var temp = new int[256];
        for (long i = range.Item1; i < range.Item2; i++)
        {
            temp[buffer[i]]++;
        }
    });

to calculate sub-histograms. But in the end, I'm still having the problem that I have to merge all those sub-histograms, and bang, thread contention again.
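
To be concrete, the merge I keep running into would look roughly like this (a sketch; histocopy is the local reference to the histogram array from above, and merging via Interlocked.Add - 256 contended operations per partition instead of one per element - is just one way to do it):

    var rangePartitioner = Partitioner.Create(0, buffer.Length);

    Parallel.ForEach(rangePartitioner, parallelOptions, range =>
    {
        // private sub-histogram: no synchronization needed while counting
        var temp = new int[256];
        for (long i = range.Item1; i < range.Item2; i++)
        {
            temp[buffer[i]]++;
        }

        // merge step: 256 contended operations per partition,
        // not one per element
        for (int bin = 0; bin < 256; bin++)
        {
            Interlocked.Add(ref histocopy[bin], temp[bin]);
        }
    });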

I refuse to believe that there is no way to speed things up by parallelizing, even if it's such a tight loop. If it's possible on the GPU, it must be - to some extent - possible on the CPU as well.

What else, except giving up, is there to try?

I've searched Stack Overflow and the interwebs quite a bit, but this seems to be an edge case for parallelism.

  • Have you tried using a separate histo for each parallel thingy and adding them all up at the end? Commented Jul 18, 2014 at 9:22
  • I've done something similar with Hough transforms. I used separate accumulators and merged them at the end, which gave me a massive boost. Merging 4/8 small arrays at the end shouldn't be a bottleneck. I've never personally used Parallel, so I don't know much about it, but if you don't get a boost from this, it seems like it might be doing something weird. Commented Jul 18, 2014 at 9:27
  • @lightxx Consider the start-up cost of each parallel loop: creating a task, assigning some L1/L2 cache and whatever else it thinks it will need, referencing the memory, etc. That can get pretty heavy and cause slowdowns with such a tight loop. You could look into using dynamic partitions (msdn.microsoft.com/en-us/library/dd997416.aspx), and often PLINQ (albahari.com/threading/part5.aspx#_PLINQ) will speed things up. Commented Jul 18, 2014 at 9:43
  • Consider speeding up the bad code in the first place. histo[thisByte]++; is slow - use a pointer and unsafe code here. Should give a significant boost (see the sketch after this comment thread). Commented Jul 18, 2014 at 9:55
  • I wouldn't actually call avoiding unsafe code "bad code". Anyways, even if we managed to speed up the single-core version, that's not the question here. Commented Jul 18, 2014 at 10:02
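
For reference, the pointer/unsafe variant suggested in the comments could look like the sketch below. Whether it actually beats the plain array version is not a given - the JIT can already drop bounds checks in a simple foreach over an array - so treat it as something to measure, not a guaranteed win:

    // unsafe variant of the single-core loop, as suggested in the comments
    // (compile with /unsafe; a sketch, not a measured improvement)
    private static unsafe void CalculateHistogramUnsafe(uint[] histo, byte[] buffer)
    {
        fixed (byte* pBuf = buffer)
        fixed (uint* pHisto = histo)
        {
            byte* p = pBuf;
            byte* pEnd = pBuf + buffer.Length;
            while (p < pEnd)
            {
                pHisto[*p]++; // no array bounds checks here
                p++;
            }
        }
    }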

3 Answers


You should use one of the Parallel.ForEach overloads that has thread-local state.

Each separate partition of a parallelized loop has its own local state, which means it doesn't need synchronization. As a final action, you have to aggregate every local state into the final value. This step requires synchronization, but it is only performed once per partition instead of once per iteration.

Instead of

    Parallel.ForEach(
        buffer,
        parallelOptions,
        thisByte => Interlocked.Increment(ref histocopy[thisByte]));

you can use

    Parallel.ForEach(
        buffer,
        parallelOptions,
        () => new int[histocopy.Length],   // initialize local histogram
        (thisByte, state, local) =>
        {
            local[thisByte]++;             // increment local histogram
            return local;
        },
        local =>
        {
            lock (histocopy)               // add local histogram to global
            {
                for (int idx = 0; idx < histocopy.Length; idx++)
                {
                    histocopy[idx] += local[idx];
                }
            }
        });

It might also be a good idea to start with the defaults for partition size and parallel options and optimize from there.
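
If the per-element delegate overhead turns out to dominate, the same local-state idea also works over index ranges (a sketch, assuming the Partitioner.Create(0, buffer.Length) overload, with buffer and histocopy as in the question), which pays the delegate cost once per range instead of once per byte:

    Parallel.ForEach(
        Partitioner.Create(0, buffer.Length),
        () => new int[histocopy.Length],   // local histogram per partition
        (range, state, local) =>
        {
            // tight inner loop, no delegate invocation per element
            for (int i = range.Item1; i < range.Item2; i++)
            {
                local[buffer[i]]++;
            }
            return local;
        },
        local =>
        {
            lock (histocopy)               // merge once per partition
            {
                for (int idx = 0; idx < histocopy.Length; idx++)
                {
                    histocopy[idx] += local[idx];
                }
            }
        });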


5 Comments

I ran some tests; this was actually slower than the single-core variant. I got 3 s on the "naive" implementation and 15 s on this, using an i5-2540M (i.e. a laptop processor with 2 physical cores and 4 logical units).
@flindeberg What happens if you force it to use two threads only, seeing how (iirc) hyper-threading cores share cache?
These are the results I get: 00:00:02.7638552 (naive) vs 00:00:04.9138028 (Dirk 2 threads) vs 00:00:02.7535994 (2 threads, hardcoded)
So yeah, limiting to two threads made a huge difference, but at least for me not faster than single core.
@flindeberg I ran a similar test case and concur with your observations. So while this approach is better than not using a local state, it is still bad. My guess is that the problem is memory-bound and not CPU-bound.

I don't have any experience with Parallel, but I whipped up a test with manual threading, and it works perfectly.

    private class Worker
    {
        public Thread Thread;
        public int[] Accumulator = new int[256];
        public int Start, End;
        public byte[] Data;

        public Worker(int start, int end, byte[] buf)
        {
            this.Start = start;
            this.End = end;
            this.Data = buf;
            this.Thread = new Thread(Func);
            this.Thread.Start();
        }

        public void Func()
        {
            // count into this worker's own accumulator; no shared state
            for (int i = Start; i < End; i++)
                this.Accumulator[this.Data[i]]++;
        }
    }

    int NumThreads = 8;
    int len = buf.Length / NumThreads;

    var workers = new Worker[NumThreads];
    for (int i = 0; i < NumThreads; i++)
        workers[i] = new Worker(i * len, i * len + len, buf);

    foreach (var w in workers)
        w.Thread.Join();

    // merge the per-thread accumulators
    int[] accumulator = new int[256];
    for (int i = 0; i < workers.Length; i++)
        for (int j = 0; j < accumulator.Length; j++)
            accumulator[j] += workers[i].Accumulator[j];

Results on my Q720 mobile i7:

    Single threaded time = 5.50s
    4 threads            = 1.90s
    8 threads            = 1.24s

Looks like it's working to me. And interestingly, even though the hyper-threading cores share a cache, 8 threads was actually a bit faster than 4.

6 Comments

I can confirm your findings. On a Xeon E5-2680 it takes roughly 420 ms with 8 threads, 200 ms with 16 threads and less than 100 ms with 32 threads. Just out of curiosity, I tried 64 threads (~120 ms) and 128 threads (~150 ms).
That's some sweet gear you've got there!
I'll leave the question open for a bit just in case someone comes up with an idea on how to speed things up using the parallel framework.
Note that I changed the worker constructor a bit, how I did it the first time was a bit dangerous.
The algorithm is pretty much all memory reads, and the data is nowhere near small enough to fit in cache, so I'd think that most of the time the processor is just sitting around waiting for another line of memory to be read in.
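As a back-of-the-envelope check (assuming, purely for illustration, a sustained read bandwidth of roughly 20 GB/s): streaming 1 GB of pixel data takes at least 1 GB / 20 GB/s = 50 ms, so the sub-100 ms numbers reported above are already within a small factor of the memory bandwidth limit.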

I have no idea if this will be faster, but a little observation:

What if you sort all the elements in buffer[]? It would mean that there is no crossing between different cores anymore. If the performance scales, you can then increase the core count; it should go up linearly. Note that you really need to handle the firstRange/secondRange splitting a little better, since you don't want two elements with the same value landing in different ranges.

    private static void CalculateHistogram(uint[] histo, byte[] buffer)
    {
        Array.Sort(buffer); // so the indexes into histo play well with cache.

        // todo: rewrite to handle edge-cases.
        var firstRange = new[] { 0, buffer.Length / 2 };              // [inclusive, exclusive]
        var secondRange = new[] { buffer.Length / 2, buffer.Length };

        // create two tasks for now ;o
        var tasks = new Task[2];
        var taskIdentifier = 0;

        foreach (var range in new[] { firstRange, secondRange })
        {
            var rangeFix = range; // lambda capture ;s
            tasks[taskIdentifier++] = Task.Factory.StartNew(() =>
            {
                for (var i = rangeFix[0]; i < rangeFix[1]; i++)
                    ++histo[buffer[i]]; // index with the value, not the position
            });
        }

        Task.WaitAll(tasks);
    }

Quick googling shows me that you can use C# and the GPU to sort numbers even faster, which would lead to around 3x better performance; worth a try: http://adnanboz.wordpress.com/2011/07/27/faster-sorting-in-c-by-utilizing-gpu-with-nvidia-cuda/

PS: there are a few more tricks that can bring very substantial performance gains:

1) Remember the concept of false cache sharing - http://msdn.microsoft.com/en-us/magazine/cc872851.aspx (see the sketch after this list)

2) Try using the stackalloc keyword and make sure ANY memory allocation is done through the stack. Trust me on this - any memory allocation is mad slow, unless it comes directly from the stack. We are talking about 5x differences.

3) You can use C# Mono SIMD to try and SUM the different arrays (the linked example is in C, but the concept applies to C#/C++: "Adding 2 arrays together quickly")
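
To make points 1) and 2) concrete, here's a rough sketch (my assumptions: an unsafe context and stackalloc for the 256-bin local histogram, which keeps each worker's counters on its own stack - so no false sharing - and avoids heap allocation):

    // sketch combining stackalloc (point 2) with per-thread counters (point 1):
    // each worker counts into a 256-entry histogram on its own stack, so no two
    // threads ever write to the same cache line; merge once at the end.
    private static unsafe void CountRange(byte[] buffer, int start, int end, uint[] histo)
    {
        uint* local = stackalloc uint[256];
        for (int bin = 0; bin < 256; bin++)
            local[bin] = 0; // stackalloc contents are undefined per the spec

        for (int i = start; i < end; i++)
            local[buffer[i]]++;

        lock (histo) // one short merge per worker
        {
            for (int bin = 0; bin < 256; bin++)
                histo[bin] += local[bin];
        }
    }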

1 Comment

Thanks for your input. However, using the GPU is exactly what I'm trying to avoid. I'm kinda fed up with people telling me their GPU implementation of an algorithm is "orders of magnitude" faster than a CPU implementation, just because the CPU implementation sucks and they've optimized their GPU version. There's an interesting paper by Lee et al. covering that topic, called "Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU".
