
Since the dawn of CPUs it has been common knowledge that the integer division instruction is expensive. I went to see how bad it is today, on CPUs which have the luxury of billions of transistors. I found that the hardware idiv instruction still performs significantly worse for constant divisors than the code the JIT compiler is able to emit, which contains no idiv instruction at all.

To bring this out in a dedicated microbenchmark, I've written the following:

    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.*;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @OperationsPerInvocation(MeasureDiv.ARRAY_SIZE)
    @Warmup(iterations = 8, time = 500, timeUnit = TimeUnit.MILLISECONDS)
    @Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
    @State(Scope.Thread)
    @Fork(1)
    public class MeasureDiv {
        public static final int ARRAY_SIZE = 128;
        public static final long DIVIDEND_BASE = 239520948509234807L;
        static final int DIVISOR = 10;

        final long[] input = new long[ARRAY_SIZE];

        @Setup(Level.Iteration)
        public void setup() {
            // Every element holds the same value as the compile-time constant,
            // so both methods divide by 10; only the compiler's knowledge differs.
            for (int i = 0; i < input.length; i++) {
                input[i] = DIVISOR;
            }
        }

        @Benchmark
        public long divVar() {
            long sum = 0;
            for (int i = 0; i < ARRAY_SIZE; i++) {
                final long in = input[i];
                final long dividend = DIVIDEND_BASE + in;
                final long divisor = in; // divisor loaded from the array
                final long quotient = dividend / divisor;
                sum += quotient;
            }
            return sum;
        }

        @Benchmark
        public long divConst() {
            long sum = 0;
            for (int i = 0; i < ARRAY_SIZE; i++) {
                final long in = input[i];
                final long dividend = DIVIDEND_BASE + in;
                final int divisor = DIVISOR; // divisor known at JIT-compile time
                final long quotient = dividend / divisor;
                sum += quotient;
            }
            return sum;
        }
    }

In a nutshell, I have two methods identical in every respect, except that one (divVar) divides by a number read from an array while the other (divConst) divides by a compile-time constant. These are the results:

    Benchmark            Mode  Cnt  Score   Error  Units
    MeasureDiv.divConst  avgt    5  1.228 ± 0.032  ns/op
    MeasureDiv.divVar    avgt    5  8.913 ± 0.192  ns/op

The performance ratio is quite extraordinary. My expectation would be that a modern Intel processor has enough real estate, and its engineers enough interest, to implement a complex but performant division algorithm in hardware. Yet the JIT compiler beats Intel by sending it a stream of other instructions which perform the same job, just seven times faster. If anything, dedicated microcode should be able to utilize the CPU even better than what the JIT can do through the public API of assembly instructions.

Why is idiv still so much slower? What is the fundamental limitation?

One explanation that comes to mind is the hypothetical existence of a division algorithm which involves the dividend for the first time very late in the process. The JIT compiler would then have a head start: it can evaluate the first part, which involves only the divisor, at compile time, and emit just the second part of the algorithm as runtime code. Is that hypothesis true?
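For concreteness, here is a minimal sketch of the kind of strength reduction at play, written as plain Java rather than as the JIT's actual output (the helper divideBy10 is my own illustration; the constant is the well-known magic number ceil(2^66 / 10), and Math.multiplyHigh requires Java 9+):

    // Illustration only: division by the constant 10 with no division instruction.
    // M = ceil(2^66 / 10) = 0x6666666666666667.
    static long divideBy10(long n) {
        // High 64 bits of the 128-bit signed product n * M, shifted right by 2:
        // floor(n * M / 2^66), which equals n / 10 for non-negative n.
        long q = Math.multiplyHigh(n, 0x6666666666666667L) >> 2;
        // (n >> 63) is -1 for negative n; subtracting it truncates toward zero.
        return q - (n >> 63);
    }

Everything that depends only on the divisor (the choice of M and of the shift) was done ahead of time here; what remains at runtime is just a multiply, a shift, and a subtract.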

  • Have you taken a look at the code that the JIT compiler is emitting? Can you show it to us? Commented Jul 12, 2015 at 21:49
  • I have; it's quite a long stretch of cheap instructions like add, sub, sar, and shl. Commented Jul 12, 2015 at 21:51
  • This is a fairly standard optimization - most (all?) modern compilers will try to optimize constant divisions into a series of shifts, subs, adds, etc. This answer about C compilers has quite a bit more detail: stackoverflow.com/a/2580985/5087125 Commented Jul 12, 2015 at 22:17
  • @MarkoTopolnik yes, it's fundamental - if the constant is known in advance, it's possible to come up with a bunch of magic coefficients and operations (and leverage the much faster mul, in particular) to do the division faster. Actually finding those coefficients and then performing the fast division takes longer than a general-purpose divide, since it typically involves finding a reciprocal. It's just that the compiler has (comparatively) all the time in the world. Commented Jul 13, 2015 at 5:36
  • @MarkoTopolnik in order to find these, you have to do a fair bit of computation - several divisions, a search by approximation, etc. - view the source of this page to get an idea: hackersdelight.org/magic.htm. I'm no silicon expert, but that looks like a lot more work than tossing another block on the ALU. Commented Jul 13, 2015 at 5:58

1 Answer


As user pvg explained in the comments, the hypothesized algorithm does indeed exist, and it is the best one currently known. The algorithm's preparatory step involves a division by the same divisor, so the work is fundamentally irreducible as a whole. It is covered in Chapter 10 of the classic book Hacker's Delight.
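To make the irreducibility concrete, here is a toy sketch of that preparatory step for the d = 10 case above. It is my own simplification, not the book's magic() routine: the shift amount is supplied by hand instead of being searched for. Note the divide() call - computing the magic constant itself requires dividing by d, so a division must happen somewhere, just not at runtime:

    import java.math.BigInteger;

    // Toy illustration: M = ceil(2^(64 + shift) / d). For d = 10 and shift = 2
    // this yields 0x6666666666666667, the constant the fast path multiplies by.
    static long signedMagic(long d, int shift) {
        BigInteger divisor = BigInteger.valueOf(d);
        return BigInteger.ONE.shiftLeft(64 + shift)
                .add(divisor).subtract(BigInteger.ONE)  // round the quotient up
                .divide(divisor)                        // the unavoidable division
                .longValue();
    }

The compiler pays this divisor-dependent cost once, ahead of time; a general-purpose hardware divider receives a fresh divisor on every execution and so has nothing to amortize.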
