The biggest overhead is the time spent starting and stopping the threads and handing work between them. If I reduce the size of the array from 10,000 to 10, it takes about the same amount of time.
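You can see this hand-off cost in isolation by timing tasks that do no work at all. This is a minimal sketch of that idea (the class name and round count are my own, not from your code); if the empty rounds cost roughly as much as the real ones, the work itself is not the bottleneck:

    import java.util.concurrent.*;

    public class TaskHandoffOverhead {
        public static void main(String[] args) throws Exception {
            ExecutorService exec = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            long start = System.currentTimeMillis();
            for (int i = 0; i < 10_000; i++) {
                // Submit a task that does nothing and wait for it,
                // so we measure only the hand-off, not any work.
                Future<?> f = exec.submit(() -> { });
                f.get();
            }
            long stop = System.currentTimeMillis();
            System.out.println("10,000 empty rounds took " + (stop - start) + " ms");
            exec.shutdown();
        }
    }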
If you keep the thread pool but divide the work so that each thread writes to its own local data set, it is about 4x faster on my machine with 6 cores.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    public class ParallelImplementationOptimised {
        static final int numberOfThreads = Runtime.getRuntime().availableProcessors();
        final ExecutorService exec = Executors.newFixedThreadPool(numberOfThreads);
        private int numberOfCells;

        public ParallelImplementationOptimised(int numberOfCells) {
            this.numberOfCells = numberOfCells;
        }

        public void update() throws ExecutionException, InterruptedException {
            List<Future<?>> futures = new ArrayList<>();
            for (int thread = 0; thread < numberOfThreads; thread++) {
                final int threadId = thread; // captured for the task (unused in this version)
                futures.add(exec.submit(new Runnable() {
                    @Override
                    public void run() {
                        // Each task works on arrays it allocated itself,
                        // so no two threads ever write to the same cache line.
                        int num = numberOfCells / numberOfThreads;
                        double[] h0 = new double[num], h1 = new double[num],
                                 h2 = new double[num], h3 = new double[num],
                                 h4 = new double[num], h5 = new double[num],
                                 h6 = new double[num], h7 = new double[num],
                                 h8 = new double[num], h9 = new double[num];
                        for (int i = 0; i < num; i++) {
                            h0[i] = h0[i] + 1;
                            h1[i] = h1[i] + 1;
                            h2[i] = h2[i] + 1;
                            h3[i] = h3[i] + 1;
                            h4[i] = h4[i] + 1;
                            h5[i] = h5[i] + 1;
                            h6[i] = h6[i] + 1;
                            h7[i] = h7[i] + 1;
                            h8[i] = h8[i] + 1;
                            h9[i] = h9[i] + 1;
                        }
                    }
                }));
            }
            // Wait for all tasks to finish before returning.
            for (Future<?> future : futures) {
                future.get();
            }
        }

        public static void main(String[] args) throws ExecutionException, InterruptedException {
            ParallelImplementationOptimised si = new ParallelImplementationOptimised(10);
            long start = System.currentTimeMillis();
            for (int i = 0; i < 10000; i++) {
                if (i % 1000 == 0) {
                    System.out.println(i);
                }
                si.update();
            }
            long stop = System.currentTimeMillis();
            System.out.println("Time: " + (stop - start));
            si.exec.shutdown();
        }
    }
SequentialImplementation: 3.3 sec.
ParallelImplementationOptimised: 0.8 sec.
You appear to be writing to the same data on the same cache line from multiple threads. This is false sharing: the modified cache line has to bounce between cores via the L3 cache, which takes roughly 20x longer than an access to the L1 cache. I suggest you try completely separate data structures which are at least 128 bytes apart, to be sure you are not touching the same cache line.
Note: even if you intend to completely overwrite a whole cache line, x64 CPUs will pull in the previous values of the cache line first.
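As an illustration of keeping data 128 bytes apart, here is a minimal sketch (the class name, PADDING constant, and loop count are my own invention, not your code): each thread increments only its own slot in a shared long[], with 16 longs (128 bytes) between slots, so no two threads write to the same cache line.

    import java.util.concurrent.*;

    public class PaddedCounterSketch {
        // 16 longs * 8 bytes = 128 bytes between the slots the threads touch,
        // so each thread's counter sits on its own cache line.
        static final int PADDING = 16;

        public static void main(String[] args) throws Exception {
            int threads = Runtime.getRuntime().availableProcessors();
            long[] counters = new long[threads * PADDING];
            ExecutorService exec = Executors.newFixedThreadPool(threads);
            Future<?>[] futures = new Future<?>[threads];
            long start = System.currentTimeMillis();
            for (int t = 0; t < threads; t++) {
                final int slot = t * PADDING; // this thread's private slot
                futures[t] = exec.submit(() -> {
                    for (int i = 0; i < 100_000_000; i++) {
                        counters[slot]++;
                    }
                });
            }
            for (Future<?> f : futures) {
                f.get();
            }
            System.out.println("Took " + (System.currentTimeMillis() - start) + " ms");
            exec.shutdown();
        }
    }

Setting PADDING to 1 puts the slots next to each other again, re-introducing the false sharing, and should make the same run measurably slower.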
Another question might be: "Why isn't this 20x slower?"
The CPU core which has grabbed the cache line might be running two threads with hyper-threading (so those two threads can access the data locally), and that core might go around the loop a few times before it loses the cache line to another core that is demanding it. This means the 20x penalty is not paid on every access or on every loop iteration, but often enough that you get a much slower result.