The biggest overhead is the time spent starting and stopping the threads and handing work between them. If I reduce the size of the array from 10,000 to 10, it takes about the same amount of time.
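You can see this hand-off cost in isolation by timing tasks that do no work at all. This is a minimal sketch of that idea (the class name and round count are my own, not from your code); if the empty rounds cost roughly as much as the real ones, the work itself is not the bottleneck:

    import java.util.concurrent.*;

    public class TaskHandoffOverhead {
        public static void main(String[] args) throws Exception {
            ExecutorService exec = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            long start = System.currentTimeMillis();
            for (int i = 0; i < 10_000; i++) {
                // Submit a task that does nothing and wait for it,
                // so we measure only the hand-off, not any work.
                Future<?> f = exec.submit(() -> { });
                f.get();
            }
            long stop = System.currentTimeMillis();
            System.out.println("10,000 empty rounds took " + (stop - start) + " ms");
            exec.shutdown();
        }
    }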
If you keep the thread pool but divide the work so that each thread writes to its own local data set, it is about 4x faster on my machine with 6 cores.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    public class ParallelImplementationOptimised {
        static final int numberOfThreads = Runtime.getRuntime().availableProcessors();
        final ExecutorService exec = Executors.newFixedThreadPool(numberOfThreads);
        private int numberOfCells;

        public ParallelImplementationOptimised(int numberOfCells) {
            this.numberOfCells = numberOfCells;
        }

        public void update() throws ExecutionException, InterruptedException {
            List<Future<?>> futures = new ArrayList<>();
            for (int thread = 0; thread < numberOfThreads; thread++) {
                final int threadId = thread; // captured for the task (unused in this version)
                futures.add(exec.submit(new Runnable() {
                    @Override
                    public void run() {
                        // Each task works on arrays it allocated itself,
                        // so no two threads ever write to the same cache line.
                        int num = numberOfCells / numberOfThreads;
                        double[] h0 = new double[num], h1 = new double[num],
                                 h2 = new double[num], h3 = new double[num],
                                 h4 = new double[num], h5 = new double[num],
                                 h6 = new double[num], h7 = new double[num],
                                 h8 = new double[num], h9 = new double[num];
                        for (int i = 0; i < num; i++) {
                            h0[i] = h0[i] + 1;
                            h1[i] = h1[i] + 1;
                            h2[i] = h2[i] + 1;
                            h3[i] = h3[i] + 1;
                            h4[i] = h4[i] + 1;
                            h5[i] = h5[i] + 1;
                            h6[i] = h6[i] + 1;
                            h7[i] = h7[i] + 1;
                            h8[i] = h8[i] + 1;
                            h9[i] = h9[i] + 1;
                        }
                    }
                }));
            }
            // Wait for all tasks to finish before returning.
            for (Future<?> future : futures) {
                future.get();
            }
        }

        public static void main(String[] args) throws ExecutionException, InterruptedException {
            ParallelImplementationOptimised si = new ParallelImplementationOptimised(10);
            long start = System.currentTimeMillis();
            for (int i = 0; i < 10000; i++) {
                if (i % 1000 == 0) {
                    System.out.println(i);
                }
                si.update();
            }
            long stop = System.currentTimeMillis();
            System.out.println("Time: " + (stop - start));
            si.exec.shutdown();
        }
    }
SequentialImplementation: 3.3 sec.
ParallelImplementationOptimised: 0.8 sec.
You appear to be writing to the same data on the same cache line from multiple threads. This is false sharing: the modified cache line has to bounce between cores via the L3 cache, which takes roughly 20x longer than an access to the L1 cache. I suggest you try completely separate data structures which are at least 128 bytes apart, to be sure you are not touching the same cache line.
Note: even if you intend to completely overwrite a whole cache line, x64 CPUs will pull in the previous values of the cache line first.
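As an illustration of keeping data 128 bytes apart, here is a minimal sketch (the class name, PADDING constant, and loop count are my own invention, not your code): each thread increments only its own slot in a shared long[], with 16 longs (128 bytes) between slots, so no two threads write to the same cache line.

    import java.util.concurrent.*;

    public class PaddedCounterSketch {
        // 16 longs * 8 bytes = 128 bytes between the slots the threads touch,
        // so each thread's counter sits on its own cache line.
        static final int PADDING = 16;

        public static void main(String[] args) throws Exception {
            int threads = Runtime.getRuntime().availableProcessors();
            long[] counters = new long[threads * PADDING];
            ExecutorService exec = Executors.newFixedThreadPool(threads);
            Future<?>[] futures = new Future<?>[threads];
            long start = System.currentTimeMillis();
            for (int t = 0; t < threads; t++) {
                final int slot = t * PADDING; // this thread's private slot
                futures[t] = exec.submit(() -> {
                    for (int i = 0; i < 100_000_000; i++) {
                        counters[slot]++;
                    }
                });
            }
            for (Future<?> f : futures) {
                f.get();
            }
            System.out.println("Took " + (System.currentTimeMillis() - start) + " ms");
            exec.shutdown();
        }
    }

Setting PADDING to 1 puts the slots next to each other again, re-introducing the false sharing, and should make the same run measurably slower.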
Another question might be: "Why isn't this 20x slower?"
The CPU core which has grabbed the cache line might be running two threads with hyper-threading (so those two threads can access the data locally), and that core might go around the loop a few times before it loses the cache line to another core that is demanding it. This means the 20x penalty is not paid on every access or on every loop iteration, but often enough that you get a much slower result.