
This code benchmarks three different ways to compute the sum of the reciprocals of the elements of a double[]:

  1. a for-loop
  2. Java 8 streams
  3. the colt math library

Why is the computation using a simple for-loop ~400 times faster than the one using streams? (Or is there anything that needs to be improved in the benchmarking code? Or a faster way of computing this using streams?)

Code:

import java.util.Arrays;
import java.util.concurrent.TimeUnit;

import cern.colt.list.DoubleArrayList;
import cern.jet.stat.Descriptive;

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class MyBenchmark {

    public static double[] array;

    static {
        int num_of_elements = 100;
        array = new double[num_of_elements];
        for (int i = 0; i < num_of_elements; i++) {
            array[i] = i + 1;
        }
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public void testInversionSumForLoop() {
        double result = 0;
        for (int i = 0; i < array.length; i++) {
            result += 1.0 / array[i];
        }
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public void testInversionSumUsingStreams() {
        double result = 0;
        result = Arrays.stream(array).map(d -> 1 / d).sum();
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public void testInversionSumUsingCernColt() {
        double result = Descriptive.sumOfInversions(new DoubleArrayList(array), 0, array.length - 1);
    }
}

Results:

/**
 * Results
 * Benchmark                                   Mode  Cnt    Score    Error  Units
 * MyBenchmark.testInversionSumForLoop         avgt  200    1.647 ±  0.155  ns/op
 * MyBenchmark.testInversionSumUsingCernColt   avgt  200  603.254 ± 22.199  ns/op
 * MyBenchmark.testInversionSumUsingStreams    avgt  200  645.895 ± 20.833  ns/op
 */

Update: these results show that Blackhole.consume (or returning the result) is necessary to avoid JVM dead-code elimination; a sketch of the return-based variant follows after the results.

/**
 * Updated results after adding Blackhole.consume
 * Benchmark                                   Mode  Cnt    Score    Error  Units
 * MyBenchmark.testInversionSumForLoop         avgt  200  525.498 ± 10.458  ns/op
 * MyBenchmark.testInversionSumUsingCernColt   avgt  200  517.930 ±  2.080  ns/op
 * MyBenchmark.testInversionSumUsingStreams    avgt  200  582.103 ±  3.261  ns/op
 */

Oracle JDK version "1.8.0_181", Darwin Kernel Version 17.7.0
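For reference, a minimal sketch of the return-based alternative mentioned in the update (the method name here is invented for illustration). JMH implicitly consumes a benchmark method's return value, which has the same dead-code-prevention effect as Blackhole.consume:

@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public double testInversionSumForLoopReturning() {
    double result = 0;
    for (int i = 0; i < array.length; i++) {
        result += 1.0 / array[i];
    }
    // Returning the value lets JMH consume it, so the JIT cannot
    // eliminate the loop as dead code.
    return result;
}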

  • JMH does not yet make the benchmark automatically correct. The result of the computation is not used here, and the JIT removes the computation altogether in the loop case. You need to "consume" the result either by calling Blackhole.consume or by simply adding return result; at the end of the benchmark method. Commented Jan 12, 2019 at 21:27
  • 1.647 ns is too little time to sum 100 numbers; that's over 6e+10 operations per second. I think testInversionSumForLoop is being optimized away by the JIT compiler. Commented Jan 12, 2019 at 21:27
  • @apangin you are right, I missed Blackhole.consume; now the results are similar. Commented Jan 12, 2019 at 21:54

1 Answer


In your example the JVM most likely optimizes the loop away completely because the result value is never read after the computation. You should use Blackhole to consume the result, like below:

import static java.util.concurrent.TimeUnit.MILLISECONDS;

import java.util.Arrays;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
@Warmup(iterations = 10, time = 200, timeUnit = MILLISECONDS)
@Measurement(iterations = 20, time = 500, timeUnit = MILLISECONDS)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class MyBenchmark {

    static double[] array;

    static {
        int num_of_elements = 100;
        array = new double[num_of_elements];
        for (int i = 0; i < num_of_elements; i++) {
            array[i] = i + 1;
        }
    }

    double result = 0;

    @Benchmark
    public void baseline(Blackhole blackhole) {
        result = 1;
        result = result / 1.0;
        blackhole.consume(result);
    }

    @Benchmark
    public void testInversionSumForLoop(Blackhole blackhole) {
        for (int i = 0; i < array.length; i++) {
            result += 1.0 / array[i];
        }
        blackhole.consume(result);
    }

    @Benchmark
    public void testInversionSumUsingStreams(Blackhole blackhole) {
        result = Arrays.stream(array).map(d -> 1 / d).sum();
        blackhole.consume(result);
    }
}

This new benchmark shows a difference of about 4x, which is expected. Loops benefit from a number of JVM optimizations and don't involve creating new objects the way streams do.

Benchmark                                  Mode  Cnt    Score    Error  Units
MyBenchmark.baseline                       avgt  100    2.437 ±  0.139  ns/op
MyBenchmark.testInversionSumForLoop        avgt  100  135.512 ± 13.080  ns/op
MyBenchmark.testInversionSumUsingStreams   avgt  100  506.479 ±  4.209  ns/op

I attempted to add a baseline to show the cost of a single operation on my machine. The baseline ns/op is similar to your loop's ns/op, which IMO confirms your loop was optimized out.

I'd love someone to tell me what would be a good baseline for this benchmark scenario.

My environment:

openjdk version "11.0.1" 2018-10-16
OpenJDK Runtime Environment 18.9 (build 11.0.1+13)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.1+13, mixed mode)

Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Linux 4.15.0-43-generic #46-Ubuntu SMP Thu Dec 6 14:45:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Comments

My new results are very close (see the update in the question). I see you have a 4x difference here, probably because of JVM versions. But as others said in the comments, you are also right about the root cause of this problem.
@Vipin I reran with a different baseline and I still get 4x. Added my Java version and CPU.
@KarolDowbecki it does not have to be Blackhole; you could simply return the result, which looks a bit cleaner to me. Also, when you ask what the baseline result for a single operation is, do you mean the time taken for that single division?
This question is a follow-up to an earlier question and, as explained in the comments there, DoubleStream.sum() uses the Kahan summation algorithm, which provides higher precision but also takes a bit more time than a plain for loop. So it's not just loop vs. stream. You may compare .sum() with reduce(0, Double::sum) (a sketch follows after these comments)...
@KarolDowbecki you can be sure that the division has been elided in your baseline operation. For your original version, result += 1.0 / 1.0;, the term 1.0 / 1.0 was a compile-time constant. For your changed variant, result = 1; result = result / 1.0;, the runtime optimizer has to kick in, but then, it does not get confused by two subsequent (redundant) assignments. So you changed the increment operation to a plain assignment of 1, which is the reason why it became even faster. A true division would be significantly slower. You should try something like result = (result + 1) / 1.0; (see the sketch below).
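Following up on the Kahan summation point above, a minimal sketch of the suggested comparison (benchmark method names invented for illustration, assumed to live in the benchmark class from the answer). Returning the results lets JMH consume them:

@Benchmark
public double sumWithKahanSummation() {
    // DoubleStream.sum() uses compensated (Kahan) summation internally,
    // trading a little speed for higher precision.
    return Arrays.stream(array).map(d -> 1 / d).sum();
}

@Benchmark
public double sumWithPlainReduce() {
    // reduce(0, Double::sum) performs plain, uncompensated addition,
    // matching what the simple for-loop does.
    return Arrays.stream(array).map(d -> 1 / d).reduce(0, Double::sum);
}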
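And a sketch of the stronger baseline suggested in the last comment, reusing the result field from the answer's benchmark class; per that comment, feeding the previous result back into the expression keeps it from being a compile-time constant:

@Benchmark
public void baseline(Blackhole blackhole) {
    // (result + 1) / 1.0 is not a compile-time constant, so the
    // optimizer has to perform the arithmetic on every invocation.
    result = (result + 1) / 1.0;
    blackhole.consume(result);
}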
