
I am currently taking a Data Structures class and, as you may expect, one of the things we have to do is write some of the common sorts. In writing my insertion sort algorithm, I noticed it ran significantly faster than my instructor's (for 400000 data points, mine took about 30 seconds and his about 90). I emailed him my code and the same results occurred when both were run on the same machine. We managed to waste more than 40 minutes slowly changing his sorting method into mine until it was exactly the same, word for word, except for one seemingly arbitrary thing. First, here is my insertion sort code:

public static int[] insertionSort(int[] A) {
    // Check for illegal cases
    if (A == null || A.length == 0) {
        throw new IllegalArgumentException("A is not populated");
    }
    for (int i = 0; i < A.length; i++) {
        int j = i;
        while (j > 0 && A[j - 1] > A[j]) {
            int temp = A[j];
            A[j] = A[j - 1];
            A[j - 1] = temp;
            j--;
        }
    }
    return A;
}

Now at this point his code was exactly the same as mine except for the lines where we swap A[j] and A[j - 1]. His code did the following:

int temp = A[j - 1];
A[j - 1] = A[j];
A[j] = temp;

We found that those 3 lines were the culprits; my code was running significantly faster because of them. Perplexed, we ran javap -c to get the bytecode of a simple program that just had a main containing an array declaration, a variable declaration for int j, and the 3 lines of swapping code, once as I wrote them and once as he wrote them. Here is the bytecode for my swapping method:

Compiled from "me.java"
public class me {
  public me();
    Code:
       0: aload_0
       1: invokespecial #1            // Method java/lang/Object."<init>":()V
       4: return

  public static void main(java.lang.String[]);
    Code:
       0: sipush        10000
       3: newarray      int
       5: astore_1
       6: bipush        10
       8: istore_2
       9: aload_1
      10: iload_2
      11: iaload
      12: istore_3
      13: aload_1
      14: iload_2
      15: aload_1
      16: iload_2
      17: iconst_1
      18: isub
      19: iaload
      20: iastore
      21: aload_1
      22: iload_2
      23: iconst_1
      24: isub
      25: iload_3
      26: iastore
      27: return
}

And my instructor's method's byte code:

Compiled from "instructor.java"
public class instructor {
  public instructor();
    Code:
       0: aload_0
       1: invokespecial #1            // Method java/lang/Object."<init>":()V
       4: return

  public static void main(java.lang.String[]);
    Code:
       0: sipush        10000
       3: newarray      int
       5: astore_1
       6: bipush        10
       8: istore_2
       9: aload_1
      10: iload_2
      11: iconst_1
      12: isub
      13: iaload
      14: istore_3
      15: aload_1
      16: iload_2
      17: iconst_1
      18: isub
      19: aload_1
      20: iload_2
      21: iaload
      22: iastore
      23: aload_1
      24: iload_2
      25: iload_3
      26: iastore
      27: return
}

I don't see any real difference between these two bytecode listings. What might cause this strange behavior (my code still ran ~3 times faster than his, and, as expected, this difference became more drastic as we fed the algorithms larger arrays)? Is this simply a strange quirk of Java? Furthermore, does this happen on your computer? For reference, this was run on a mid-2014 MacBook Pro; my code is exactly as it appears here, and his code was reduced to exactly the code shown here except for those 3 lines.

[EDIT] Here are my test classes:

public class Tester1 {

    public static void main(String[] args) {
        int[] A = new int[400000];
        for (int i = 0; i < A.length; i++) {
            A[i] = (int) (Math.random() * Integer.MAX_VALUE);
        }
        double start = System.currentTimeMillis();
        insertionSort(A);
        System.out.println("My insertion sort took " + (System.currentTimeMillis() - start) + " milliseconds.");
    }

    public static int[] insertionSort(int[] A) {
        // Check for illegal cases
        if (A == null || A.length == 0) {
            throw new IllegalArgumentException("A is not populated");
        }
        for (int i = 0; i < A.length; i++) {
            int j = i;
            while (j > 0 && A[j - 1] > A[j]) {
                int temp = A[j];
                A[j] = A[j - 1];
                A[j - 1] = temp;
                j--;
            }
        }
        return A;
    }
}

And the second file:

public class Tester2 {

    public static void main(String[] args) {
        int[] A = new int[400000];
        for (int i = 0; i < A.length; i++) {
            A[i] = (int) (Math.random() * Integer.MAX_VALUE);
        }
        double start = System.currentTimeMillis();
        otherInsertion(A);
        System.out.println("Other insertion sort took " + (System.currentTimeMillis() - start) + " milliseconds.");
    }

    public static int[] otherInsertion(int[] A) {
        // Check for illegal cases
        if (A == null || A.length == 0) {
            throw new IllegalArgumentException("A is not populated");
        }
        for (int i = 0; i < A.length; i++) {
            int j = i;
            while (j > 0 && A[j - 1] > A[j]) {
                int temp = A[j - 1];
                A[j - 1] = A[j];
                A[j] = temp;
                j--;
            }
        }
        return A;
    }
}

And the outputs (with no arguments, just java Tester1 and java Tester2):

My insertion sort took 37680.0 milliseconds.
Other insertion sort took 86358.0 milliseconds.

These were run as 2 separate files in 2 different JVMs.

  • Interesting. Can you post one of your test classes? Commented Oct 7, 2016 at 23:56
  • @JonKiparsky I did. Also, I had swapped around which one is the efficient one. Assigning temp to A[j - 1] seemed to be more efficient. Also, a swap without a temp variable makes it even worse. Commented Oct 8, 2016 at 0:10
  • Odd, is there anything obvious in the corresponding native code? Commented Oct 8, 2016 at 0:13
  • (Good thing you didn't start to invest time in earnest to investigate results of sub-second measurement runs.) With multi-second runs on a JIT-enabled VM, inspecting the static byte code is close to pointless - compiled out of context or not. Commented Oct 8, 2016 at 0:13
  • Your code is about 15% faster in my environment (Windows, JDK 1.8.0_102). Commented Oct 8, 2016 at 1:45

3 Answers


This is the effect of the loop unrolling optimization combined with common subexpression elimination. Depending on the order of the array access instructions, the JIT can eliminate redundant loads in one case but not in the other.

Let me explain in detail. In both cases the JIT unrolls 4 iterations of the inner loop.

E.g. for your case:

while (j > 3) {
    if (A[j - 1] > A[j]) {
        int temp = A[j];
        A[j] = A[j - 1];
        A[j - 1] = temp;           \
    }                               A[j - 1] loaded immediately after store
    if (A[j - 2] > A[j - 1]) {     /
        int temp = A[j - 1];
        A[j - 1] = A[j - 2];
        A[j - 2] = temp;           \
    }                               A[j - 2] loaded immediately after store
    if (A[j - 3] > A[j - 2]) {     /
        int temp = A[j - 2];
        A[j - 2] = A[j - 3];
        A[j - 3] = temp;           \
    }                               A[j - 3] loaded immediately after store
    if (A[j - 4] > A[j - 3]) {     /
        int temp = A[j - 3];
        A[j - 3] = A[j - 4];
        A[j - 4] = temp;
    }
    j -= 4;
}

Then the JIT eliminates the redundant array loads, and the resulting assembly looks like this:

0x0000000002d53a70: movslq %r11d,%r10
0x0000000002d53a73: lea    0x0(%rbp,%r10,4),%r10
0x0000000002d53a78: mov    0x10(%r10),%ebx      ; ebx = A[j]
0x0000000002d53a7c: mov    0xc(%r10),%r9d       ; r9d = A[j - 1]
0x0000000002d53a80: cmp    %ebx,%r9d            ; if (r9d > ebx) {
0x0000000002d53a83: jle    0x0000000002d539f3
0x0000000002d53a89: mov    %r9d,0x10(%r10)      ;   A[j] = r9d
0x0000000002d53a8d: mov    %ebx,0xc(%r10)       ;   A[j - 1] = ebx
                                                ; }
0x0000000002d53a91: mov    0x8(%r10),%r9d       ; r9d = A[j - 2]
0x0000000002d53a95: cmp    %ebx,%r9d            ; if (r9d > ebx) {
0x0000000002d53a98: jle    0x0000000002d539f3
0x0000000002d53a9e: mov    %r9d,0xc(%r10)       ;   A[j - 1] = r9d
0x0000000002d53aa2: mov    %ebx,0x8(%r10)       ;   A[j - 2] = ebx
                                                ; }
0x0000000002d53aa6: mov    0x4(%r10),%r9d       ; r9d = A[j - 3]
0x0000000002d53aaa: cmp    %ebx,%r9d            ; if (r9d > ebx) {
0x0000000002d53aad: jle    0x0000000002d539f3
0x0000000002d53ab3: mov    %r9d,0x8(%r10)       ;   A[j - 2] = r9d
0x0000000002d53ab7: mov    %ebx,0x4(%r10)       ;   A[j - 3] = ebx
                                                ; }
0x0000000002d53abb: mov    (%r10),%r8d          ; r8d = A[j - 4]
0x0000000002d53abe: cmp    %ebx,%r8d            ; if (r8d > ebx) {
0x0000000002d53ac1: jle    0x0000000002d539f3
0x0000000002d53ac7: mov    %r8d,0x4(%r10)       ;   A[j - 3] = r8
0x0000000002d53acb: mov    %ebx,(%r10)          ;   A[j - 4] = ebx
                                                ; }
0x0000000002d53ace: add    $0xfffffffc,%r11d    ; j -= 4
0x0000000002d53ad2: cmp    $0x3,%r11d           ; while (j > 3)
0x0000000002d53ad6: jg     0x0000000002d53a70

Your instructor's code will look different after loop unrolling:

while (j > 3) {
    if (A[j - 1] > A[j]) {
        int temp = A[j - 1];
        A[j - 1] = A[j];
        A[j] = temp;               <-- another store instruction between A[j - 1] accesses
    }
    if (A[j - 2] > A[j - 1]) {
        int temp = A[j - 2];
        A[j - 2] = A[j - 1];
        A[j - 1] = temp;
    }
    ...

The JVM will not eliminate the subsequent load of A[j - 1], because there is another store instruction between the previous load of A[j - 1] and this one (though in this particular case such an optimization is theoretically possible).

So, the assembly code will have more load instructions, and the performance will be worse:

0x0000000002b53a00: cmp    %r8d,%r10d           ; if (r10d > r8d) {
0x0000000002b53a03: jle    0x0000000002b53973
0x0000000002b53a09: mov    %r8d,0xc(%rbx)       ;   A[j - 1] = r8d
0x0000000002b53a0d: mov    %r10d,0x10(%rbx)     ;   A[j] = r10d
                                                ; }
0x0000000002b53a11: mov    0xc(%rbx),%r10d      ; r10d = A[j - 1]
0x0000000002b53a15: mov    0x8(%rbx),%r9d       ; r9d = A[j - 2]
0x0000000002b53a19: cmp    %r10d,%r9d           ; if (r9d > r10d) {
0x0000000002b53a1c: jle    0x0000000002b53973
0x0000000002b53a22: mov    %r10d,0x8(%rbx)      ;   A[j - 2] = r10d
0x0000000002b53a26: mov    %r9d,0xc(%rbx)       ;   A[j - 1] = r9d
                                                ; }
0x0000000002b53a2a: mov    0x8(%rbx),%r8d       ; r8d = A[j - 2]
0x0000000002b53a2e: mov    0x4(%rbx),%r10d      ; r10d = A[j - 3]

Note that if you run the JVM with the loop unrolling optimization disabled (-XX:LoopUnrollLimit=0), the performance of the two cases will be the same.

P.S. Full disassembly of both methods is available here, obtained using
-XX:CompileOnly=Test -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly


8 Comments

Well. Something even stranger has happened. I ran the same testing code today first with no parameters and my results were swapped: My insertion sort took 56634.0 milliseconds. Other insertion sort took 146979.0 milliseconds. As well, when I ran with the -XX:LoopUnrollLimit=0, this only made mine slower: My insertion sort took 75795.0 milliseconds. Other insertion sort took 141911.0 milliseconds. The code didn't change. My code is still the int temp = A[j]. I used the javac compiler for those tests.
@DylanSiegler Never benchmark two algorithms in one method or even in one JVM. This may lead to unpredictable results. Here is a good article describing why this goes wrong.
When I separated the tests into 2 files and ran them the same results persisted. My code ran a bit faster (49.7 seconds) and the other code also ran a bit faster (111.6 seconds) but that is still a drastic difference between the 2 methods.
@DylanSiegler Yes, that's exactly what I observe, too. The above answer explains the difference.
I want to believe your explanation, but I ran it with the -XX:LoopUnrollLimit=0 option in separate files (testing code is in the question and updated), and still I found my code takes ~30 seconds (6 seconds faster than with loop unrolling) and the other takes ~90 seconds (4 seconds slower than with loop unrolling).

You will not see anything in the Java bytecode that explains this. Differences like this can be the effect of the machine instructions (as created by the just-in-time (JIT) compiler) or of the microprocessor itself (particularly of one of the processor caches).

One thing to note here is that in this sort algorithm, for each iteration of the inner loop, only A[j] needs to be loaded, since A[j - 1] was loaded in the previous iteration. So the most recently loaded value was A[j] (assuming the fact I just stated has been taken advantage of at some level). Therefore, within the loop, the algorithm that stores to A[j] first may behave differently, because that value was most recently loaded by the loop check, so it is more likely to remain in a register or a processor cache, or to be subject to just a single memory load thanks to optimizations in the JIT-generated machine code.

It is true (as the answer above demonstrates) that it is very difficult to know exactly what series of machine instructions is being executed (the JIT has several different levels of compilation, each with different optimizations, based on how often a method is executed, and its many optimizations may interact to produce different machine code). It is also difficult to know which memory accesses must go to RAM and which are satisfied by cache hits in the processor. No doubt some combination of these effects is affecting the performance here. The order of growth of an algorithm has the most effect on performance, but beyond that, there are many optimizations in the JIT and the microprocessor that can affect performance, and not all of them are easy to predict.

My instinct is that this is the result of the processor's memory cache working better with your sequence of operations, in part due to the fact that only A[j] needs to be loaded with each loop iteration.
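
To make that reuse concrete, here is a sketch of a common insertion-sort variant (not the code from the question) that keeps the element being inserted in a local variable and shifts elements instead of swapping, so the inner loop performs only one array load per iteration:

public static int[] insertionSortHoisted(int[] A) {
    if (A == null || A.length == 0) {
        throw new IllegalArgumentException("A is not populated");
    }
    for (int i = 1; i < A.length; i++) {
        int key = A[i];                  // the value being inserted stays in a local (register)
        int j = i - 1;
        while (j >= 0 && A[j] > key) {   // only A[j] is loaded on each inner iteration
            A[j + 1] = A[j];             // shift right instead of swapping
            j--;
        }
        A[j + 1] = key;
    }
    return A;
}

Whether this variant actually beats either swap-based version on a given JVM is something you would still have to measure.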



TL;DR

Your experiment isn't valid; there are too many uncontrolled variables that may affect the result. It's best to use a microbenchmarking tool such as Caliper or JMH (a JMH sketch of the same comparison appears after the Caliper code below). I have used such a tool before, for checking which method of creating indentation is faster.

The difference between yours and your professor's is negligible.

The experiment

For my experiment I had 745,038 data points. I created 3 tests: yours, your instructor's version, and the JDK's Arrays.sort().

https://microbenchmarks.appspot.com/runs/8b8c0554-d3f1-4339-af5a-fdffd18dd053

Based on the results, your runtime was 1,419,867.808 ns and your instructor's was 1,429,798.824 ns.

So we're talking about 0.01 ms.

The instructor's just had less variance between the runs.

The JDK's Arrays.sort() was slower by a larger margin, at 1,779,042.513 ns, which is roughly 0.36 ms slower than yours.

Here is the code I used to do the microbenchmark in Caliper.

package net.trajano.caliper.test;

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import com.google.caliper.BeforeExperiment;
import com.google.caliper.Benchmark;
import com.google.caliper.api.VmOptions;
import com.google.caliper.runner.CaliperMain;

@VmOptions("-XX:-TieredCompilation")
public class SortBenchmark {

    public static int[] insertionSort(final int[] A) {
        // Check for illegal cases
        if (A == null || A.length == 0) {
            throw new IllegalArgumentException("A is not populated");
        }
        for (int i = 0; i < A.length; i++) {
            int j = i;
            while (j > 0 && A[j - 1] > A[j]) {
                final int temp = A[j - 1];
                A[j - 1] = A[j];
                A[j] = temp;
                j--;
            }
        }
        return A;
    }

    public static int[] insertionSortInstructor(final int[] A) {
        // Check for illegal cases
        if (A == null || A.length == 0) {
            throw new IllegalArgumentException("A is not populated");
        }
        for (int i = 0; i < A.length; i++) {
            int j = i;
            while (j > 0 && A[j - 1] > A[j]) {
                final int temp = A[j];
                A[j] = A[j - 1];
                A[j - 1] = temp;
                j--;
            }
        }
        return A;
    }

    @BeforeExperiment
    void setUp() throws IOException {
        try (final DataInputStream dis = new DataInputStream(
                Files.newInputStream(Paths.get("C:/Program Files/iTunes/iTunes.exe")))) {
            final List<Integer> list = new ArrayList<Integer>();
            while (true) {
                try {
                    list.add(dis.readInt());
                } catch (final EOFException e) {
                    break;
                }
            }
            data = list.stream().mapToInt(i -> i).toArray();
            System.out.println("Data size = " + data.length);
        }
    }

    // data to sort
    private static int[] data;

    @Benchmark
    public void insertionSort(final int reps) {
        for (int i = 0; i < reps; i++) {
            insertionSort(data);
        }
    }

    @Benchmark
    public void insertionSortInstructor(final int reps) {
        for (int i = 0; i < reps; i++) {
            insertionSortInstructor(data);
        }
    }

    @Benchmark
    public void jdkSort(final int reps) {
        for (int i = 0; i < reps; i++) {
            Arrays.sort(data);
        }
    }

    public static void main(final String[] args) {
        CaliperMain.main(SortBenchmark.class, args);
    }
}
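
For completeness, here is a minimal JMH sketch of the same comparison, using the other tool mentioned in the TL;DR. It is only a sketch: it assumes JMH is on the classpath, the package, class, and method names are made up, and the two benchmark bodies simply inline the two swap orders from the question.

package bench; // package and class names here are illustrative

import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class SwapOrderBenchmark {

    private int[] original;
    private int[] working;

    @Setup(Level.Trial)
    public void generate() {
        // Fixed seed so both benchmarks sort the same data.
        original = new Random(42).ints(400_000, 0, Integer.MAX_VALUE).toArray();
    }

    @Setup(Level.Invocation)
    public void resetWorkingCopy() {
        // Re-copy the unsorted data before every invocation so the benchmark
        // never measures an already-sorted array.
        working = original.clone();
    }

    @Benchmark
    public int[] tempFromRight() { // the question's version: temp = A[j]
        int[] A = working;
        for (int i = 0; i < A.length; i++) {
            int j = i;
            while (j > 0 && A[j - 1] > A[j]) {
                int temp = A[j];
                A[j] = A[j - 1];
                A[j - 1] = temp;
                j--;
            }
        }
        return A; // returning the array keeps the JIT from discarding the work
    }

    @Benchmark
    public int[] tempFromLeft() { // the instructor's version: temp = A[j - 1]
        int[] A = working;
        for (int i = 0; i < A.length; i++) {
            int j = i;
            while (j > 0 && A[j - 1] > A[j]) {
                int temp = A[j - 1];
                A[j - 1] = A[j];
                A[j] = temp;
                j--;
            }
        }
        return A;
    }
}

Because each variant is its own @Benchmark method and JMH forks a fresh JVM per benchmark by default, this also avoids measuring two algorithms in a single JVM run, the problem pointed out in the comments under the first answer.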

Aside

To be honest I was surprised at the result, that the JDK was slower, so I took a look at the source code. It appears there are three sorting algorithms used by the JDK depending on thresholds: merge sort, quicksort for fewer than 286 elements, and insertion sort for fewer than 47 elements.

Since the data set I had was pretty large to begin with, merge sort is used, which has O(n) space complexity in the form of a second copy of the array. As such, it is likely the extra heap allocation that is causing the extra time.

1 Comment

I just ran my testing code again on MacOS 10.11 Java 8u77b03 using the javac compiler and these were my results (what's even stranger is now my code is running significantly faster than my instructor's): My insertion sort took 56634.0 milliseconds. Other insertion sort took 146979.0 milliseconds.
