27

First of all, we create a random binary file of 100,000,000 bytes. I used Python for this:

import random
import os

def Main():
    length = 100000000
    randomArray = random.randbytes(length)
    desktopPath = os.path.normpath(os.path.expanduser("~/Desktop"))
    fullPath = os.path.join(desktopPath, "RandomFile.dat")
    if os.path.isfile(fullPath):
        print("File already exists.")
        return
    with open(fullPath, "wb") as file:
        file.write(randomArray)
    print("The file has been created.")

Main()

After this, we count how many of these bytes are greater than 100, using both C# and Java.

C#:

using System.Diagnostics;

namespace TestYouTube
{
    public class Program
    {
        //finished
        public static void Main(string[] args)
        {
            string desktopPath = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
            string fullPath = Path.Combine(desktopPath, "RandomFile.dat");
            if (!File.Exists(fullPath))
            {
                Console.WriteLine("The file does not exist.");
                return;
            }
            byte[] byteArray = File.ReadAllBytes(fullPath);
            int[] intArray = ByteArrayToIntArray(byteArray);
            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Start();
            int count = TestSpeed(intArray);
            stopwatch.Stop();
            Console.WriteLine($"Count = {count}");
            Console.WriteLine($"Elapsed: {stopwatch.Elapsed.TotalSeconds} seconds");
            Console.ReadLine();
        }

        //finished
        private static int[] ByteArrayToIntArray(byte[] byteArray)
        {
            if (byteArray == null) return null;
            if (byteArray.Length == 0) return Array.Empty<int>();
            int[] intArray = new int[byteArray.Length];
            for (int i = 0; i < intArray.Length; i++)
            {
                intArray[i] = byteArray[i];
            }
            return intArray;
        }

        //finished
        private static int TestSpeed(int[] array)
        {
            if (array == null) throw new ArgumentNullException();
            if (array.Length == 0) throw new ArgumentException();
            int count = 0;
            foreach (int element in array)
            {
                if (element > 100)
                {
                    count++;
                }
            }
            return count;
        }
    }
}

Java:

import javax.swing.filechooser.*;
import java.io.*;
import java.nio.file.*;

public class Main {
    public static void main(String[] args) throws IOException {
        FileSystemView view = FileSystemView.getFileSystemView();
        File file = view.getHomeDirectory();
        String desktopPath = file.getPath();
        Path fullPath = Paths.get(desktopPath, "RandomFile.dat");
        File randomFile = new File(fullPath.toString());
        if (!randomFile.exists() || randomFile.isDirectory()) {
            System.out.println("The file does not exist.");
            return;
        }
        byte[] array = Files.readAllBytes(randomFile.toPath());
        int[] intArray = ByteArrayToUnsignedByteArray(array);
        long start = System.currentTimeMillis();
        int count = TestSpeed(intArray);
        long end = System.currentTimeMillis();
        long elapsed = end - start;
        double elapsedSeconds = ((double) elapsed) / 1000;
        System.out.printf("Count = %d\n", count);
        System.out.printf("Elapsed: %f seconds\n", elapsedSeconds);
    }

    private static int ByteToUnsignedByte(byte b) {
        return b & 0b11111111;
    }

    private static int[] ByteArrayToUnsignedByteArray(byte[] byteArray) {
        if (byteArray == null) return null;
        if (byteArray.length == 0) return new int[0];
        int[] newArray = new int[byteArray.length];
        for (int i = 0; i < byteArray.length; i++) {
            int element = ByteToUnsignedByte(byteArray[i]);
            newArray[i] = element;
        }
        return newArray;
    }

    private static int TestSpeed(int[] byteArray) {
        if (byteArray == null) throw new IllegalArgumentException("The array must not be null.");
        if (byteArray.length == 0) throw new IllegalArgumentException("The array length must be non zero.");
        int count = 0;
        for (int element : byteArray) {
            if (element > 100) {
                count++;
            }
        }
        return count;
    }
}

Please note that we only measure the time of the function TestSpeed in both languages.

Results:

C#:
Count = 60542647
Elapsed: 0,4286155 seconds

Java:
Count = 60542647
Elapsed: 0,077000 seconds

I ran the code multiple times and the results are about the same: C# is about 5 times slower than Java. Please note that I ran the C# code in the "Release" configuration and used "Start Without Debugging".

Does anybody know why this happens? What am I doing wrong?


According to the comments, it's better to benchmark the C# code with a benchmarking library, so I used BenchmarkDotNet. The results:
TestSpeed Mean: 411.2 ms
which is not really different from the Stopwatch measurement.
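
Roughly, the benchmark looked like this (a minimal sketch, not the exact benchmark class):

using System;
using System.IO;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class TestSpeedBenchmark
{
    private int[] _intArray = null!;

    [GlobalSetup]
    public void Setup()
    {
        string desktopPath = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
        byte[] bytes = File.ReadAllBytes(Path.Combine(desktopPath, "RandomFile.dat"));
        _intArray = Array.ConvertAll(bytes, b => (int)b);   // same byte-to-int widening as above
    }

    [Benchmark]
    public int TestSpeed()
    {
        int count = 0;
        foreach (int element in _intArray)
            if (element > 100) count++;
        return count;
    }
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<TestSpeedBenchmark>();
}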

  • .NET version: .NET 9.0
  • Java: openjdk version "24.0.1" 2025-04-15
  • OS: Microsoft Windows [Version 10.0.19045.5965]
  • CPU: AMD Athlon Gold 3150U with Radeon Graphics (dual-core Zen 1)
  • 26
    That's quite a short amount of time, and you haven't done anything to warm up the JIT first, etc. For example, the .NET runtime first JITs a method with very few optimisations (which is very quick to do, which helps with a fast start-up): if the method ends up being called a lot, it gets rejitted with more optimisations. It could well be that you're not hitting the threshold for a rejit. I don't know what the JVM does here or whether it's comparable. Use BenchmarkDotNet for reliable micro-benchmarking. Commented Jun 23 at 8:18
  • 4
    AFAIK the JVM can interpret byte code without JIT. The .NET runtime always has to JIT the IL first and thus has a slower startup time. Usually this is compensated for by a "warm up" before benchmarking. Try running TestSpeed() once or twice before measuring it. But you should really be using a benchmarking library, especially when measuring programs that have not been compiled to native machine code. Commented Jun 23 at 8:35
  • 7
    With C# it's important to use a benchmark library so that you are timing what you think you're timing. Commented Jun 23 at 9:53
  • 5
    Also note that System.currentTimeMillis() is not a great way of measuring very short intervals. You should be using System.nanoTime(). Commented Jun 23 at 10:23
  • 4
    Based on the comments, the lack of a micro-benchmarking framework in both your tests is serving as a (legitimate) distraction. You may want to rewrite both your tests using such a framework and then edit your question to replace your current tests. Use Java Microbenchmark Harness (JMH) for Java and, according to other comments, BenchmarkDotNet for C#. You may also want to include the native code generated by each language's JIT. This Q&A should help with that in Java. Not sure how to do the same in C#. Commented Jun 23 at 13:28

5 Answers

32

Matthew Watson's and Charlieface's answers explain the actual reason behind this question: compared to the CLR, the JVM has implemented more loop optimizations, including vectorization and loop unrolling, which lets the JVM use fewer jump and inc instructions and save time when the loop is very long.

There is an issue on GitHub that tracks the progress of .NET loop optimizations; unfortunately, work on the optimizations above is still ongoing.


8 Comments

As I commented on other answers, I expect the main difference isn't unrolling, it's if-conversion: making branchless asm from an if(){} in the source. (Ideally count += (int)(x >= 100); with cmp/setge/add, or what Java does which has a longer critical path latency of at least 2 cycles (increment and cmov), count = (x>=100) ? count+1 : count;. Java is up to 2x or potentially 3x on Ice Lake slower than it could be with simple scalar code, but at least it avoids branch mispredictions.)
It seems worth checking whether actually writing the code that way helps in C#, then.
@KarlKnechtel: Yeah, if someone with C# installed can test that, please do. count += (element > 100) ? 1 : 0; instead of the if compiles to an 8-uop loop using cmp/setg/movzx/add - sharplab.io/…
I forgot to check the OP's original source! Yes, as I expected, it compiles to cmp/jle over an inc, so this is nearly a duplicate of Why is processing a sorted array faster than processing an unsorted array? (except Java goes branchless for this, it's C# that would speed up with a sorted array.) With the if version, C# uses a memory operand for cmp instead of loading separately, cmp dword [rcx+r9*4+0x10], 0x64. Instead of a separate load instruction for cmp/setg. Silly compiler.
@KarlKnechtel: Turned my various comments into an answer of my own, with the C# asm output from Sharplab for the original and branchless version. I haven't actually benchmarked it, but I don't need to; it's branchless and only does sequential memory accesses, so we can just count the uops to get a pretty good prediction of performance.
I downvoted this answer. Java doesn't SIMD vectorize this loop, and its unrolling isn't helping in this case. The important optimization here is if-conversion of the control dependency in the source to a data dependency in the asm, i.e. making branchless asm to process this array of random data. (SIMD vectorization would also be branchless and even faster, but again, HotSpot misses that optimization, if the answer showing its asm output matches what happened on the OP's system.)
26

I've disassembled your TestSpeed Java method (by the way, you ought to call it testSpeed to keep to Java naming conventions). This is on Linux (x64), but the results should be similar on Windows. Evidently, the JVM compiler has a number of compilation stages, each with more optimization than the previous one. I'm not an assembly language expert, but I think this shows that in C1 compilation it generated a simple loop, similar to that generated by the C# runtime, but in the C2-generated code it has unrolled the loop, thus speeding it up.

So, it seems, the JVM compiler optimizes better than the .NET one in this particular case.

The TestSpeed method is shown below with line numbers for cross-reference with the assembly listing:

28 private static int TestSpeed(int[] byteArray)
29 {
30     int count=0;
31     for (int element : byteArray)
32         if (element>100)
33             count++;
34     return count;
35 }

And the assembly listing generated with
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly
with hsdis-amd64.so in the LD_LIBRARY_PATH:

============================= C1-compiled nmethod ============================== ----------------------------------- Assembly ----------------------------------- Compiled method (c1) 498 20 % 3 org.example.App::TestSpeed @ 10 (40 bytes) total in heap [0x00007d32b5403510,0x00007d32b5403ae0] = 1488 relocation [0x00007d32b5403670,0x00007d32b54036b0] = 64 main code [0x00007d32b54036c0,0x00007d32b54038e0] = 544 stub code [0x00007d32b54038e0,0x00007d32b5403910] = 48 oops [0x00007d32b5403910,0x00007d32b5403918] = 8 metadata [0x00007d32b5403918,0x00007d32b5403920] = 8 scopes data [0x00007d32b5403920,0x00007d32b5403998] = 120 scopes pcs [0x00007d32b5403998,0x00007d32b5403ab8] = 288 dependencies [0x00007d32b5403ab8,0x00007d32b5403ac0] = 8 nul chk table [0x00007d32b5403ac0,0x00007d32b5403ae0] = 32 [Disassembly] -------------------------------------------------------------------------------- [Constant Pool (empty)] -------------------------------------------------------------------------------- [Verified Entry Point] # {method} {0x00007d323f4006c0} 'TestSpeed' '([I)I' in 'org/example/App' 0x00007d32b54036c0: mov %eax,-0x14000(%rsp) 0x00007d32b54036c7: push %rbp 0x00007d32b54036c8: sub $0x40,%rsp 0x00007d32b54036cc: movabs $0x7d323f400ac8,%rdi ; {metadata(method data for {method} {0x00007d323f4006c0} 'TestSpeed' '([I)I' in 'org/example/App')} 0x00007d32b54036d6: mov 0xf4(%rdi),%ebx 0x00007d32b54036dc: add $0x2,%ebx 0x00007d32b54036df: mov %ebx,0xf4(%rdi) 0x00007d32b54036e5: and $0x7fe,%ebx 0x00007d32b54036eb: cmp $0x0,%ebx 0x00007d32b54036ee: je 0x00007d32b5403827 ;*iconst_0 {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@0 (line 30) 0x00007d32b54036f4: mov 0xc(%rsi),%edi ; implicit exception: dispatches to 0x00007d32b5403848 ;*arraylength {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@5 (line 31) 0x00007d32b54036f7: mov $0x0,%ebx 0x00007d32b54036fc: mov $0x0,%eax 0x00007d32b5403701: jmp 0x00007d32b5403791 ;*iload {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@10 (line 31) 0x00007d32b5403706: xchg %ax,%ax 0x00007d32b5403708: cmp 0xc(%rsi),%ebx ; implicit exception: dispatches to 0x00007d32b540384d 0x00007d32b540370b: jae 0x00007d32b5403857 0x00007d32b5403711: movslq %ebx,%rdx 0x00007d32b5403714: mov 0x10(%rsi,%rdx,4),%edx ;*iaload {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@19 (line 31) 0x00007d32b5403718: cmp $0x64,%edx 0x00007d32b540371b: movabs $0x7d323f400ac8,%rdx ; {metadata(method data for {method} {0x00007d323f4006c0} 'TestSpeed' '([I)I' in 'org/example/App')} 0x00007d32b5403725: mov $0x158,%rcx 0x00007d32b540372c: jle 0x00007d32b5403739 0x00007d32b5403732: mov $0x168,%rcx 0x00007d32b5403739: mov (%rdx,%rcx,1),%r8 0x00007d32b540373d: lea 0x1(%r8),%r8 0x00007d32b5403741: mov %r8,(%rdx,%rcx,1) 0x00007d32b5403745: jle 0x00007d32b540374d ;*if_icmple {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@26 (line 32) 0x00007d32b540374b: inc %eax 0x00007d32b540374d: inc %ebx 0x00007d32b540374f: movabs $0x7d323f400ac8,%rdx ; {metadata(method data for {method} {0x00007d323f4006c0} 'TestSpeed' '([I)I' in 'org/example/App')} 0x00007d32b5403759: mov 0xf8(%rdx),%ecx 0x00007d32b540375f: add $0x2,%ecx 0x00007d32b5403762: mov %ecx,0xf8(%rdx) 0x00007d32b5403768: and $0x3ffe,%ecx 0x00007d32b540376e: cmp $0x0,%ecx 0x00007d32b5403771: je 0x00007d32b5403865 ;*goto {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@35 (line 31) 0x00007d32b5403777: mov 0x348(%r15),%r10 ; ImmutableOopMap {rsi=Oop } 
;*goto {reexecute=1 rethrow=0 return_oop=0} ; - (reexecute) org.example.App::TestSpeed@35 (line 31) 0x00007d32b540377e: test %eax,(%r10) ; {poll} 0x00007d32b5403781: movabs $0x7d323f400ac8,%rdx ; {metadata(method data for {method} {0x00007d323f4006c0} 'TestSpeed' '([I)I' in 'org/example/App')} 0x00007d32b540378b: incl 0x178(%rdx) ;*goto {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@35 (line 31) 0x00007d32b5403791: cmp %edi,%ebx 0x00007d32b5403793: movabs $0x7d323f400ac8,%rdx ; {metadata(method data for {method} {0x00007d323f4006c0} 'TestSpeed' '([I)I' in 'org/example/App')} 0x00007d32b540379d: mov $0x148,%rcx 0x00007d32b54037a4: jl 0x00007d32b54037b1 0x00007d32b54037aa: mov $0x138,%rcx 0x00007d32b54037b1: mov (%rdx,%rcx,1),%r8 0x00007d32b54037b5: lea 0x1(%r8),%r8 0x00007d32b54037b9: mov %r8,(%rdx,%rcx,1) 0x00007d32b54037bd: jl 0x00007d32b5403708 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@13 (line 31) 0x00007d32b54037c3: add $0x40,%rsp 0x00007d32b54037c7: pop %rbp 0x00007d32b54037c8: cmp 0x340(%r15),%rsp ; {poll_return} 0x00007d32b54037cf: ja 0x00007d32b5403886 0x00007d32b54037d5: ret ;*ireturn {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@39 (line 34) 0x00007d32b54037d6: mov %eax,-0x14000(%rsp) 0x00007d32b54037dd: push %rbp 0x00007d32b54037de: sub $0x40,%rsp 0x00007d32b54037e2: mov 0x20(%rsi),%ebx 0x00007d32b54037e5: mov 0x18(%rsi),%rax 0x00007d32b54037e9: mov 0x10(%rsi),%edx 0x00007d32b54037ec: mov 0x8(%rsi),%ecx 0x00007d32b54037ef: mov %rsi,%rdi 0x00007d32b54037f2: mov %ebx,0x30(%rsp) 0x00007d32b54037f6: mov %rax,0x28(%rsp) 0x00007d32b54037fb: mov %edx,0x24(%rsp) 0x00007d32b54037ff: mov %ecx,0x20(%rsp) 0x00007d32b5403803: call 0x00007d32d4700830 ; {runtime_call SharedRuntime::OSR_migration_end(long*)} 0x00007d32b5403808: mov 0x20(%rsp),%ecx 0x00007d32b540380c: mov %rcx,%rbx 0x00007d32b540380f: mov 0x24(%rsp),%edx 0x00007d32b5403813: mov %rdx,%rdi 0x00007d32b5403816: mov 0x28(%rsp),%rax 0x00007d32b540381b: mov %rax,%rsi 0x00007d32b540381e: mov 0x30(%rsp),%eax 0x00007d32b5403822: jmp 0x00007d32b5403791 0x00007d32b5403827: movabs $0x7d323f4006c0,%r10 ; {metadata({method} {0x00007d323f4006c0} 'TestSpeed' '([I)I' in 'org/example/App')} 0x00007d32b5403831: mov %r10,0x8(%rsp) 0x00007d32b5403836: movq $0xffffffffffffffff,(%rsp) 0x00007d32b540383e: call 0x00007d32bca2f280 ; ImmutableOopMap {rsi=Oop } ;*synchronization entry ; - org.example.App::TestSpeed@-1 (line 30) ; {runtime_call counter_overflow Runtime1 stub} 0x00007d32b5403843: jmp 0x00007d32b54036f4 0x00007d32b5403848: call 0x00007d32bca295a0 ; ImmutableOopMap {rsi=Oop } ;*arraylength {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@5 (line 31) ; {runtime_call throw_null_pointer_exception Runtime1 stub} 0x00007d32b540384d: call 0x00007d32bca295a0 ; ImmutableOopMap {rsi=Oop } ;*iaload {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@19 (line 31) ; {runtime_call throw_null_pointer_exception Runtime1 stub} 0x00007d32b5403852: call 0x00007d32bca295a0 ; ImmutableOopMap {rsi=Oop } ;*iaload {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@19 (line 31) ; {runtime_call throw_null_pointer_exception Runtime1 stub} 0x00007d32b5403857: mov %rbx,(%rsp) 0x00007d32b540385b: mov %rsi,0x8(%rsp) 0x00007d32b5403860: call 0x00007d32bca28ca0 ; ImmutableOopMap {rsi=Oop } ;*iaload {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@19 (line 31) ; {runtime_call throw_range_check_failed Runtime1 stub} 
0x00007d32b5403865: movabs $0x7d323f4006c0,%r10 ; {metadata({method} {0x00007d323f4006c0} 'TestSpeed' '([I)I' in 'org/example/App')} 0x00007d32b540386f: mov %r10,0x8(%rsp) 0x00007d32b5403874: movq $0x23,(%rsp) 0x00007d32b540387c: call 0x00007d32bca2f280 ; ImmutableOopMap {rsi=Oop } ;*goto {reexecute=1 rethrow=0 return_oop=0} ; - (reexecute) org.example.App::TestSpeed@35 (line 31) ; {runtime_call counter_overflow Runtime1 stub} 0x00007d32b5403881: jmp 0x00007d32b5403777 0x00007d32b5403886: movabs $0x7d32b54037c8,%r10 ; {internal_word} 0x00007d32b5403890: mov %r10,0x358(%r15) 0x00007d32b5403897: jmp 0x00007d32bc98e700 ; {runtime_call SafepointBlob} 0x00007d32b540389c: nop 0x00007d32b540389d: nop 0x00007d32b540389e: mov 0x3d0(%r15),%rax 0x00007d32b54038a5: movq $0x0,0x3d0(%r15) 0x00007d32b54038b0: movq $0x0,0x3d8(%r15) 0x00007d32b54038bb: add $0x40,%rsp 0x00007d32b54038bf: pop %rbp 0x00007d32b54038c0: jmp 0x00007d32bc99b600 ; {runtime_call unwind_exception Runtime1 stub} 0x00007d32b54038c5: hlt 0x00007d32b54038c6: hlt ... many hlt instructions as padding ... 0x00007d32b54038df: hlt [Exception Handler] 0x00007d32b54038e0: call 0x00007d32bca2b980 ; {no_reloc} 0x00007d32b54038e5: movabs $0x7d32d49cbfc8,%rdi ; {external_word} 0x00007d32b54038ef: and $0xfffffffffffffff0,%rsp 0x00007d32b54038f3: call 0x00007d32d451c6f0 ; {runtime_call MacroAssembler::debug64(char*, long, long*)} 0x00007d32b54038f8: hlt [Deopt Handler Code] 0x00007d32b54038f9: movabs $0x7d32b54038f9,%r10 ; {section_word} 0x00007d32b5403903: push %r10 0x00007d32b5403905: jmp 0x00007d32bc98d9a0 ; {runtime_call DeoptimizationBlob} 0x00007d32b540390a: hlt 0x00007d32b540390b: hlt 0x00007d32b540390c: hlt 0x00007d32b540390d: hlt 0x00007d32b540390e: hlt 0x00007d32b540390f: hlt -------------------------------------------------------------------------------- [/Disassembly] 

(This section added by @PeterCordes.)

The C1 code doesn't use any cmovcc (conditional-move) or setcc (condition into 0/1 integer) instructions. It looks like it's just branching over inc instructions.

The C2 code does if-conversion (from branchy source to branchless assembly), avoiding the slowdown of branch mispredictions on your random data. (Why is processing a sorted array faster than processing an unsorted array? is a very similar problem, also looping over an array and conditionally adding.) C# is probably not doing this, which would explain it being 5x slower.

Java unrolls by 8, and does all 8 loads before any compares. This is excessive; out-of-order exec and hardware prefetch will already do a good job here; it could have avoided saving/restoring so many registers outside the loop by using cmp with memory. (It also costs some I-cache footprint, but the JIT compiler knows this is a long-running very hot loop.)

Once it has values loaded, the if(element > 100) { count++; } is implemented as count = (element>100) ? count+1 : count;, with blocks of asm like this:

# count in EDX, element in EDI
mov    %edx,%r11d
inc    %r11d          # r11d = edx+1. Could have been an lea 1(%rdx), %r11d
cmp    $0x64,%edi     # compare element against 100
cmovle %edx,%r11d     # keep the un-incremented count if 100 >= element
# next block uses count = r11d

This has critical path latency of at least 2 cycles from old count to new count (through the mov + inc + cmov). 3 cycles on CPUs that don't do mov-elimination, like Ice Lake (disabled for errata) and AMD Bulldozer-family. (Zen 1 does do mov-elimination on integer and XMM registers). So the bottleneck is one compare+increment per 2 clock cycles, and Zen 1 should achieve that even with ints coming all the way from DRAM, even without loop unrolling.

It would be faster to generate a 0 / 1 integer from the compare and add that. Like cmp $100, %edi ; setg %al / add %eax, %edx. (Assuming EAX is xor-zeroed once per unrolled loop.) 3 single-uop instructions vs. 4, but more importantly the critical path latency through count is only 1 cycle, an add. Out-of-order exec can handle the cmp/setg work as soon as element is ready, not needing count.

If Java had the value-range information to realize this could be an unsigned compare, it could have used cmp $101, %edi ; sbb $-1, %edx (to add 1 if no carry, or add 0 if carry, when x is unsigned-below 101). Converting from bytes to int on the fly would give the compiler more info, and avoid a separate unpack loop + cache footprint. But HotSpot isn't even looking for LEA as a copy-and-add peephole optimization, so probably won't try to use CF instead of materializing a 0/1 integer.

See also gcc optimization flag -O3 makes code slower than -O2 for a very similar case of branchless code-gen with a longer critical path than necessary.

TL:DR: Java could be going about twice as fast if it did count += (int)(element > 100) instead of count = (element>100) ? count+1 : count;.
Auto-vectorizing with SIMD could go much faster, like GCC and Clang do.

C2 disassembly:

============================= C2-compiled nmethod ============================== ----------------------------------- Assembly ----------------------------------- Compiled method (c2) 503 21 % 4 org.example.App::TestSpeed @ 10 (40 bytes) total in heap [0x00007d32bcecc090,0x00007d32bcecc6c0] = 1584 relocation [0x00007d32bcecc1f0,0x00007d32bcecc208] = 24 main code [0x00007d32bcecc220,0x00007d32bcecc4a0] = 640 stub code [0x00007d32bcecc4a0,0x00007d32bcecc4b8] = 24 oops [0x00007d32bcecc4b8,0x00007d32bcecc4c0] = 8 metadata [0x00007d32bcecc4c0,0x00007d32bcecc4d0] = 16 scopes data [0x00007d32bcecc4d0,0x00007d32bcecc568] = 152 scopes pcs [0x00007d32bcecc568,0x00007d32bcecc6a8] = 320 dependencies [0x00007d32bcecc6a8,0x00007d32bcecc6b0] = 8 nul chk table [0x00007d32bcecc6b0,0x00007d32bcecc6c0] = 16 [Disassembly] -------------------------------------------------------------------------------- [Constant Pool (empty)] -------------------------------------------------------------------------------- [Verified Entry Point] # {method} {0x00007d323f4006c0} 'TestSpeed' '([I)I' in 'org/example/App' 0x00007d32bcecc220: call 0x00007d32d463d440 ; {runtime_call os::breakpoint()} 0x00007d32bcecc225: data16 data16 nopw 0x0(%rax,%rax,1) 0x00007d32bcecc230: mov %eax,-0x14000(%rsp) 0x00007d32bcecc237: push %rbp 0x00007d32bcecc238: sub $0x30,%rsp 0x00007d32bcecc23c: mov 0x18(%rsi),%r14 0x00007d32bcecc240: mov 0x20(%rsi),%ebp 0x00007d32bcecc243: mov 0x10(%rsi),%r13d 0x00007d32bcecc247: mov 0x8(%rsi),%ebx 0x00007d32bcecc24a: mov %rsi,%rdi 0x00007d32bcecc24d: movabs $0x7d32d4700830,%r10 0x00007d32bcecc257: call *%r10 0x00007d32bcecc25a: nopw 0x0(%rax,%rax,1) 0x00007d32bcecc260: mov 0x8(%r14),%r11d ; implicit exception: dispatches to 0x00007d32bcecc47c 0x00007d32bcecc264: cmp $0x6c38,%r11d ; {metadata({type array int})} 0x00007d32bcecc26b: jne 0x00007d32bcecc464 ;*iload {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@10 (line 31) 0x00007d32bcecc271: cmp %r13d,%ebx 0x00007d32bcecc274: jge 0x00007d32bcecc42a ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@13 (line 31) 0x00007d32bcecc27a: mov 0xc(%r14),%r10d 0x00007d32bcecc27e: mov %ebx,%r11d 0x00007d32bcecc281: inc %r11d 0x00007d32bcecc284: movslq %r11d,%r8 0x00007d32bcecc287: xor %r9d,%r9d 0x00007d32bcecc28a: test %r11d,%r11d 0x00007d32bcecc28d: cmovl %r9,%r8 0x00007d32bcecc291: mov %r8d,%r11d 0x00007d32bcecc294: cmp %r13d,%r11d 0x00007d32bcecc297: cmovg %r13d,%r11d ;*iaload {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@19 (line 31) 0x00007d32bcecc29b: nopl 0x0(%rax,%rax,1) 0x00007d32bcecc2a0: cmp %r10d,%ebx 0x00007d32bcecc2a3: jae 0x00007d32bcecc444 0x00007d32bcecc2a9: mov 0x10(%r14,%rbx,4),%r9d 0x00007d32bcecc2ae: mov %ebp,%edx 0x00007d32bcecc2b0: inc %edx 0x00007d32bcecc2b2: cmp $0x64,%r9d 0x00007d32bcecc2b6: cmovle %ebp,%edx 0x00007d32bcecc2b9: inc %ebx ;*iinc {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@32 (line 31) 0x00007d32bcecc2bb: nopl 0x0(%rax,%rax,1) 0x00007d32bcecc2c0: cmp %r11d,%ebx 0x00007d32bcecc2c3: jge 0x00007d32bcecc2c9 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@13 (line 31) 0x00007d32bcecc2c5: mov %edx,%ebp 0x00007d32bcecc2c7: jmp 0x00007d32bcecc29b 0x00007d32bcecc2c9: movslq %r10d,%r11 0x00007d32bcecc2cc: movslq %r13d,%r8 0x00007d32bcecc2cf: cmp %r11,%r8 0x00007d32bcecc2d2: cmovl %r8,%r11 0x00007d32bcecc2d6: add $0xfffffffffffffff9,%r11 0x00007d32bcecc2da: mov $0xffffffff80000000,%r8 0x00007d32bcecc2e1: cmp 
$0xffffffff80000000,%r11 0x00007d32bcecc2e8: cmovl %r8,%r11 0x00007d32bcecc2ec: mov %r11d,%r11d 0x00007d32bcecc2ef: cmp %r11d,%ebx 0x00007d32bcecc2f2: jge 0x00007d32bcecc3fa ;*goto {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@35 (line 31) 0x00007d32bcecc2f8: jmp 0x00007d32bcecc31b 0x00007d32bcecc2fa: mov 0x348(%r15),%r10 ; ImmutableOopMap {r14=Oop } ;*goto {reexecute=1 rethrow=0 return_oop=0} ; - (reexecute) org.example.App::TestSpeed@35 (line 31) 0x00007d32bcecc301: test %eax,(%r10) ;*goto {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@35 (line 31) ; {poll} 0x00007d32bcecc304: cmp (%rsp),%ebx 0x00007d32bcecc307: jge 0x00007d32bcecc3f0 0x00007d32bcecc30d: vmovd %xmm0,%r13d 0x00007d32bcecc312: vmovd %xmm1,%r10d 0x00007d32bcecc317: mov (%rsp),%r11d 0x00007d32bcecc31b: mov %r11d,%ebp 0x00007d32bcecc31e: sub %ebx,%ebp 0x00007d32bcecc320: xor %r9d,%r9d 0x00007d32bcecc323: cmp %ebx,%r11d 0x00007d32bcecc326: cmovl %r9d,%ebp 0x00007d32bcecc32a: cmp $0x1f40,%ebp 0x00007d32bcecc330: mov $0x1f40,%r9d 0x00007d32bcecc336: cmova %r9d,%ebp 0x00007d32bcecc33a: add %ebx,%ebp 0x00007d32bcecc33c: vmovd %r13d,%xmm0 0x00007d32bcecc341: vmovd %r10d,%xmm1 0x00007d32bcecc346: mov %r11d,(%rsp) 0x00007d32bcecc34a: nopw 0x0(%rax,%rax,1) ;*iaload {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@19 (line 31) 0x00007d32bcecc350: mov 0x10(%r14,%rbx,4),%r10d 0x00007d32bcecc355: movslq %ebx,%rsi 0x00007d32bcecc358: mov 0x2c(%r14,%rsi,4),%r13d 0x00007d32bcecc35d: mov 0x14(%r14,%rsi,4),%r11d 0x00007d32bcecc362: mov 0x28(%r14,%rsi,4),%r9d 0x00007d32bcecc367: mov 0x24(%r14,%rsi,4),%r8d 0x00007d32bcecc36c: mov 0x20(%r14,%rsi,4),%edi 0x00007d32bcecc371: mov 0x1c(%r14,%rsi,4),%ecx 0x00007d32bcecc376: mov 0x18(%r14,%rsi,4),%eax 0x00007d32bcecc37b: mov %edx,%esi 0x00007d32bcecc37d: inc %esi 0x00007d32bcecc37f: cmp $0x64,%r10d 0x00007d32bcecc383: cmovle %edx,%esi 0x00007d32bcecc386: mov %esi,%edx 0x00007d32bcecc388: inc %edx 0x00007d32bcecc38a: cmp $0x64,%r11d 0x00007d32bcecc38e: cmovle %esi,%edx 0x00007d32bcecc391: mov %edx,%r11d 0x00007d32bcecc394: inc %r11d 0x00007d32bcecc397: cmp $0x64,%eax 0x00007d32bcecc39a: cmovle %edx,%r11d 0x00007d32bcecc39e: mov %r11d,%edx 0x00007d32bcecc3a1: inc %edx 0x00007d32bcecc3a3: cmp $0x64,%ecx 0x00007d32bcecc3a6: cmovle %r11d,%edx 0x00007d32bcecc3aa: mov %edx,%r11d 0x00007d32bcecc3ad: inc %r11d 0x00007d32bcecc3b0: cmp $0x64,%edi 0x00007d32bcecc3b3: cmovle %edx,%r11d 0x00007d32bcecc3b7: mov %r11d,%r10d 0x00007d32bcecc3ba: inc %r10d 0x00007d32bcecc3bd: cmp $0x64,%r8d 0x00007d32bcecc3c1: cmovle %r11d,%r10d 0x00007d32bcecc3c5: mov %r10d,%r8d 0x00007d32bcecc3c8: inc %r8d 0x00007d32bcecc3cb: cmp $0x64,%r9d 0x00007d32bcecc3cf: cmovle %r10d,%r8d 0x00007d32bcecc3d3: mov %r8d,%edx 0x00007d32bcecc3d6: inc %edx 0x00007d32bcecc3d8: cmp $0x64,%r13d 0x00007d32bcecc3dc: cmovle %r8d,%edx 0x00007d32bcecc3e0: add $0x8,%ebx ;*iinc {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@32 (line 31) 0x00007d32bcecc3e3: cmp %ebp,%ebx 0x00007d32bcecc3e5: jl 0x00007d32bcecc350 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@13 (line 31) 0x00007d32bcecc3eb: jmp 0x00007d32bcecc2fa 0x00007d32bcecc3f0: vmovd %xmm0,%r13d 0x00007d32bcecc3f5: vmovd %xmm1,%r10d 0x00007d32bcecc3fa: nopw 0x0(%rax,%rax,1) 0x00007d32bcecc400: cmp %r13d,%ebx 0x00007d32bcecc403: jge 0x00007d32bcecc460 0x00007d32bcecc409: jmp 0x00007d32bcecc40e 0x00007d32bcecc40b: nop 0x00007d32bcecc40c: mov %ebp,%edx ;*iaload {reexecute=0 rethrow=0 return_oop=0} 
; - org.example.App::TestSpeed@19 (line 31) 0x00007d32bcecc40e: cmp %r10d,%ebx 0x00007d32bcecc411: jae 0x00007d32bcecc446 0x00007d32bcecc413: mov 0x10(%r14,%rbx,4),%r11d 0x00007d32bcecc418: mov %edx,%ebp 0x00007d32bcecc41a: inc %ebp 0x00007d32bcecc41c: cmp $0x64,%r11d 0x00007d32bcecc420: cmovle %edx,%ebp 0x00007d32bcecc423: inc %ebx ;*iinc {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@32 (line 31) 0x00007d32bcecc425: cmp %r13d,%ebx 0x00007d32bcecc428: jl 0x00007d32bcecc40c ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@13 (line 31) 0x00007d32bcecc42a: mov $0xffffff45,%esi 0x00007d32bcecc42f: mov %r14,(%rsp) 0x00007d32bcecc433: mov %ebx,0x10(%rsp) 0x00007d32bcecc437: mov %r13d,0x14(%rsp) 0x00007d32bcecc43c: data16 xchg %ax,%ax 0x00007d32bcecc43f: call 0x00007d32bc98d600 ; ImmutableOopMap {[0]=Oop } ;*if_icmpge {reexecute=1 rethrow=0 return_oop=0} ; - (reexecute) org.example.App::TestSpeed@13 (line 31) ; {runtime_call UncommonTrapBlob} 0x00007d32bcecc444: mov %ebp,%edx 0x00007d32bcecc446: mov $0xffffffe4,%esi 0x00007d32bcecc44b: mov %edx,%ebp 0x00007d32bcecc44d: mov %r13d,0x8(%rsp) 0x00007d32bcecc452: mov %r14,0x10(%rsp) 0x00007d32bcecc457: mov %ebx,0x18(%rsp) 0x00007d32bcecc45b: call 0x00007d32bc98d600 ; ImmutableOopMap {[16]=Oop } ;*iaload {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@19 (line 31) ; {runtime_call UncommonTrapBlob} 0x00007d32bcecc460: mov %edx,%ebp 0x00007d32bcecc462: jmp 0x00007d32bcecc42a 0x00007d32bcecc464: mov $0xffffff8d,%esi 0x00007d32bcecc469: mov %r14,(%rsp) 0x00007d32bcecc46d: mov %r13d,0x8(%rsp) 0x00007d32bcecc472: mov %ebx,0xc(%rsp) 0x00007d32bcecc476: nop 0x00007d32bcecc477: call 0x00007d32bc98d600 ; ImmutableOopMap {[0]=Oop } ;*iload {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@10 (line 31) ; {runtime_call UncommonTrapBlob} 0x00007d32bcecc47c: mov $0xffffff76,%esi 0x00007d32bcecc481: mov %r13d,(%rsp) 0x00007d32bcecc485: mov %ebx,0x4(%rsp) 0x00007d32bcecc489: xchg %ax,%ax 0x00007d32bcecc48b: call 0x00007d32bc98d600 ; ImmutableOopMap {} ;*iload {reexecute=0 rethrow=0 return_oop=0} ; - org.example.App::TestSpeed@10 (line 31) ; {runtime_call UncommonTrapBlob} 0x00007d32bcecc490: hlt ... 14 more hlt instructions ... 0x00007d32bcecc49f: hlt [Exception Handler] 0x00007d32bcecc4a0: jmp 0x00007d32bc99bc00 ; {no_reloc} [Deopt Handler Code] 0x00007d32bcecc4a5: call 0x00007d32bcecc4aa 0x00007d32bcecc4aa: subq $0x5,(%rsp) 0x00007d32bcecc4af: jmp 0x00007d32bc98d9a0 ; {runtime_call DeoptimizationBlob} 0x00007d32bcecc4b4: hlt 0x00007d32bcecc4b5: hlt 0x00007d32bcecc4b6: hlt 0x00007d32bcecc4b7: hlt -------------------------------------------------------------------------------- [/Disassembly] 

17 Comments

@VasilisKontopoulos: I think your perspective on this is backward: it's not that C#/.NET is slower than you expected, but that Java is faster than you expected. Java has a nice optimization here, that C# doesn't yet have. I don't know why you expected Java to be slow/unoptimized (why would it be?), but even given that expectation, there's no reason to be disappointed that Java actually turns out to be fast.
@VasilisKontopoulos: The point is that Java optimizes it to be even faster than a for loop with just an if inside. A for-loop has overhead for every single iteration; Java performs loop unrolling to reduce the number of iterations. (Loop unrolling isn't exactly a bleeding-edge technique -- I know it was already well-known by 1983, and probably long before that -- but no compiler will catch every possible optimization, or always make the right tradeoffs. Optimization is hard!)
@ruakh: It's not unreasonable to hope that a smart compiler would have vectorized with vpcmpgtd / vpsubd which would go much faster than this clunky asm from Java's JIT. The JVM unrolled way too much, and is using XMM vector registers, but only as a way to spill integer regs as an alternative to stack memory, it appears. It made branchless asm, but not like count += (int)(val > 100);, instead it's doing count = (val>100) ? count+1 : count; so the critical path latency includes mov + inc (failed to even use lea) plus cmov. At least 2 cycle latency, vs. 1 for cmp/setcc/add
Doing worse than Java probably requires failing to do if-conversion, so having branch mispredicts. It's not unrolling that's the key here, it's if-conversion from branchy source to branchless asm. (@VasilisKontopoulos). About 5x sounds about right for that, like in Why is processing a sorted array faster than processing an unsorted array? which is basically the same problem. ++ instead of adding the element is the only real difference. Note that modern C++ compilers do vectorize this.
@VasilisKontopoulos: GCC doesn't unroll loops either by default even at -O3 (except for fully unrolling+peeling small constant-count loops); -funroll-loops is only enabled manually or as part of profile-guided optimization (so it knows which loops are important, like a JVM). IDK how much profiling info .NET gathers before making final optimized asm, if none then not unrolling makes sense. For an ahead-of-time compiler, bloating every loop with code like Java is making here would often be overall slower, with more I-cache misses and larger binaries.
|
11

TL:DR/summary: C# is slow because of branch mispredicts, but writing the source differently can fix that. Java makes branchless scalar asm which is ok but about 2x (or more) slower than it could be. Everyone's been focusing on loop unrolling but that's irrelevant here due to the loop-carried dependency-chain latency bottleneck for Java's branchless strategy.

C/C++ compilers (GCC and Clang) auto-vectorize in a fairly clunky way, but still much faster, or make pretty good branchless scalar asm if you tell them not to vectorize. Manually vectorizing with intrinsics for SSE2, AVX2, or NEON can go even faster (in C# or C++ or C or any language with usable intrinsics for packed-compares and arch-specific tricks like psadbw to horizontal sum bytes). About 1 vector (16 or 32 bytes) per clock cycle if memory bandwidth weren't a bottleneck. (32 elements, or only 8 if you unpacked your bytes to ints.)


C# makes branchy assembly for your if, so your random pattern of elements to be counted causes branch mispredicts

To make your C# code faster than Java (or at least equal), use this:

 count += (element > 100) ? 1 : 0; // C# or Java 

(Sharplab.io with x64 asm for this and the original).
C# .NET Framework for x64 compiles this to cmp ; setg(0/1 byte) ; movzx(zero-extend to 32-bit) ; add. Not ideal asm (movzx could be avoided by zeroing a reg outside the loop, or even using cmp/sbb $-1, %reg like GCC and Clang do), but still better than Java's worse branchless strategy which has longer critical-path latency for the loop-carried dependency chain through count. The above code might hand-hold Java into making better asm, which could make its loop unrolling useful.
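
For completeness, here is the OP-style function with just that one line changed (a sketch of the rewrite described above, not separately benchmarked here; the method name is made up):

private static int TestSpeedBranchless(int[] array)   // hypothetical name
{
    if (array == null) throw new ArgumentNullException(nameof(array));
    int count = 0;
    foreach (int element in array)
    {
        // compiles to cmp/setg/add-style branchless code instead of a data-dependent branch
        count += (element > 100) ? 1 : 0;
    }
    return count;
}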

// Java's if-conversion strategy. C# compiles it to worse branching than the if!
// and it would still be worse than the above for this use-case if it compiled as written, to cmp/lea/cmov
count = (element > 100) ? count+1 : count;   // This is worse, don't do this

This also works on byte[] array data without any special tricks, since C# byte is unsigned like uint. Widening u8 on the fly to i32 or u32 is very cheap on modern CPUs, same cost as 32-bit load. It's a total disaster for overall performance to allocate another 4x as much memory and expand into it before doing any useful work on your data, but you left that loop out of your timed region. Usually you want to aim for more computational intensity: more work per time your data gets loaded into L1d cache or into registers. And keeping your data in a compact representation and not copying it (let alone expanding it) means a smaller cache footprint and less memory traffic for the same amount of work.
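
As a sketch of what that looks like, counting straight off the byte[] with no conversion loop (same 100 threshold assumed; the method name is made up):

private static int TestSpeedBytes(byte[] array)   // hypothetical name
{
    int count = 0;
    foreach (byte element in array)
    {
        // byte is unsigned in C#, and widening to int for the compare is essentially free
        count += (element > 100) ? 1 : 0;
    }
    return count;
}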

In Java, hopefully the JIT does something good with int elem = 0xFF & (int)elem_byte;. But even if it's dumb and uses movsx (sign-extending load) + and, instead of movzx (zero-extending load) or simply cmp byte [mem], 100, it's still cheaper to do this as part of the compare loop. It does make things maybe less comparable between C# and Java if you want to know how they perform for a loop where they more easily could have made the same asm. Timing a loop over the byte array would include the Java "tax" in what you're microbenchmarking for the language which doesn't provide unsigned 8-bit integers, if it fails to optimize source which could still compile to unsigned compares on bytes.


Branch mispredictions

See Why is processing a sorted array faster than processing an unsorted array? for details about that. (When that question was asked, C++ and Java compilers were like C# is now, making branchy machine code from source that uses a similar if(). It's a very similar problem, doing if (elem >= 128) sum += elem; instead of if (elem > 100) ++count;. C++ compilers now auto-vectorize, and Java makes branchless asm.)

Branch mispredicts account for most if not all of the 5x speed difference. Java's unrolling is pretty much irrelevant, especially on your Zen 1 CPU (pipeline is 5 uops wide), especially the slow way Java made branchless code with a 2 cycle critical path latency instead of 1.
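
A quick way to see the misprediction effect for yourself (an experiment suggestion, not something from the question): sort a copy of the data so the branch becomes predictable, then time the branchy TestSpeed again.

int[] sorted = (int[])intArray.Clone();   // intArray as in the question's Main
Array.Sort(sorted);
int countSorted = TestSpeed(sorted);      // same count; the branchy version runs much faster on sorted data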


C#'s asm

https://Sharplab.io/ AFAIK gets the .NET framework JIT to optimize as much as possible. It doesn't run your code, since it still works when you give it just a function with no caller that would pass it the right data, so unlike the HotSpot JIT or gcc -fprofile-generate/-fprofile-use or similar PGO (Profile-Guided Optimization) options, it doesn't have profiling information to let it know which branches are unpredictable or which loops are hot (and thus worth unrolling at the cost of more binary size and I-cache / I-TLB footprint, and worth spending more compile time on to run more optimization passes.) GCC -O3 doesn't enable -funroll-loops either; -funroll-loops is only on manually or as part of -fprofile-use PGO. Clang does unroll by default, but only for small loops; the fewer instructions in the loop body, the more it unrolls.

Anyway, C#'s inner loop from the if() version of the function looks like this. Even though the source uses foreach (int element in array), the compiler invents a 32-bit array index and redoes sign-extension to 64-bit every iteration. With crap like this, no wonder there's such benefit to wide pipelines in modern CPUs.

Update: .NET x64 (not framework) on Sharplab (; Core CLR 9.0.525.21509 on x64) makes less clunky asm, avoiding the movsxd sign-extension of an array index, and uses cmp [mem], 100 instead of a separate load. It also avoids an indexed addressing-mode for the branchy version, but unfortunately not for branchless. And it still uses cmp/setcc/movzx instead of xor-zero/cmp/setcc.

The asm below is from .NET Framework - ; Desktop CLR 4.8.9305.0 on x64 on Sharplab.

# before this, zero idx and count, and branch over the loop if length==0
        # do {
L0010:  movsxd r9, ecx                      # sign-extend idx to pointer width for use in an addressing-mode
L0013:  cmp dword [rdx+r9*4+0x10], 0x64     # array[idx]; I guess C# arrays have 16 bytes of metadata before the actual array.
L0019:  jle L001d                           # if (elem <= 100) goto skip_increment
L001b:  inc eax                             # ++count
L001d:  inc ecx                             # ++idx
L001f:  cmp r8d, ecx
L0022:  jg L0010                            # }while(length > idx)

With the count += (element>100); version, on a byte array:

        # do{
L0010:  movsxd r9, ecx
L0013:  movzx r9d, byte [rdx+r9+0x10]   # separately load, with zero-extension. Unscaled index this time because byte elements
L0019:  cmp r9d, 0x64                   # int compare
L001d:  setg r9b                        # r9b (low byte of R9) = 0 or 1, upper bits unmodified, g = signed Greater
L0021:  movzx r9d, r9b                  # zero-extend into the full R9 (explicitly to 32, implicitly to 64)
L0025:  add eax, r9d                    # count += cmp_result
L0028:  inc ecx                         # ++idx
L002a:  cmp r8d, ecx
L002d:  jg L0010                        # }while(length > idx)

With macro-fusion of cmp/jg into a single uop, this is 8 uops, 7 of which need an ALU execution unit. (The load runs on a separate port). Probably only Intel Skymont E-cores can run it at 1 compare per cycle; other cores will bottleneck on the front-end and/or integer ALU ports thanks to C# doing everything it could to waste instructions inside the loop, including redoing sign-extension (movsxd) of a 32-bit index despite using foreach (int element in array) instead of manual indexing. (Lion Cove P-cores and Zen 5 are 8-wide but "only" have 6 integer ALU ports, so close to 1/clock, unlike Skymont which is also 8-wide but has 8 integer ALU ports and 3 branch execution units.)

(I don't know if Skymont can keep the cmp micro-fused into a single uop, with the indexed addressing mode and an immediate. Skylake can't. So it might even be 9 uops on most Intel CPUs, still 8 on AMD.)

This is like an exercise in how many instructions we can waste inside the loop. Redoing sign-extension of an index invented by the runtime is the most egregious sin, but that's present in the branchy version, too. Loop unrolling would help with the branchless version; instruction throughput is the bottleneck for it on most CPUs.

Even with an int array, we still get a separate mov instead of cmp [mem], 100. Same amount of work for the back-end execution units, but takes more slots in other parts of the pipeline (decode, the reorder buffer (ROB), and issue/rename. Or dispatch, to use the non-Intel terminology for sending a uop from the front-end to the back-end, as in the diagram of your CPU core in https://en.wikichip.org/wiki/amd/microarchitectures/zen#Individual_Core)

The choice of movzx r9d, r9b instead of into a different register like r10d defeats mov-elimination, so this instruction needs an execution unit. (Intel CPUs since Ivy Bridge could otherwise have eliminated it. But IIRC, AMD CPUs don't eliminate movzx, only 32 and 64-bit mov reg,reg, so that wouldn't have helped when tuning for your Zen 1. What would have helped is xor-zeroing a register outside the loop and doing setg into the low byte of it. That would create another loop-carried dependency chain since the low byte isn't renamed separately from the full register, except on old Intel P6-family, and setcc is inconveniently designed to only write a byte instead of a whole register width. Intel APX will eventually fix that; until then see the setcc section of What is the best way to set a register to zero in x86 assembly: xor, mov or and?. Unrolling and xor-zeroing a register once per 4 or more setcc instructions would be good.)

Sharplab doesn't have an AArch64 target to check if it's using cinc (conditional-increment according to a flag condition). AArch64 is perfect for this problem.

Smarter compilers (like GCC or Clang, if told not to auto-vectorize: Godbolt) do much better, even avoiding materializing a 0/1 integer in a register at all, but rather adding the carry flag directly from the compare result. Or rather, subtracting it and -1, since on x86 CF is set (by sub/cmp) for "below", cleared for "above"; it's a "borrow" output from subtraction. GCC's loop is 4 uops, and can run at 1/cycle on CPUs as old as Broadwell (for single-uop sbb with a non-zero immediate) or your Zen 1.

# GCC 15 -O3 -fno-tree-vectorize for a loop doing:  if(data[i] > 100) count++;
.L4:                            # do{
    cmp  BYTE PTR [rdi], 101    # set flags from elem - 101. Carry Flag = 1 if elem is below 101
    sbb  eax, -1                # count -= (-1 + CF)  // += 1 for elem >= 101, += 0 otherwise
    add  rdi, 1                 # ptr++
    cmp  rdi, rsi
    jne  .L4                    # }while(ptr != endp)

Using a non-indexed addressing-mode prevents un-lamination at issue/rename after micro-fusion in the decoder.

Clang uses the same cmp/sbb peephole optimization, but unrolls the loop by 8. This doesn't help for long-running loops except on really old CPUs with less than 4 execution units, like Sandy/Ivy Bridge. If it had unrolled with multiple registers holding separate counts (aka multiple accumulators) to add at the end, it could do 2 elements / cycle, limited by load ports. (2 loads per clock on older CPUs like Zen 1, 3 scalar loads per clock on many newer CPUs).

With signed int elements, the compare result isn't the carry flag, so there's nothing like the sbb trick. GCC uses xor-zeroing (inside the loop) / cmp / setcc / add. Same front-end uop count as C#, but xor-zeroing is literally as cheap as a NOP on Sandybridge-family and I think on Zen; no back-end execution unit needed. And of course it doesn't redo sign-extension of an index inside the loop, so only 7 uops, 5 of which need an ALU execution port.


Java makes branchless scalar asm. Its unrolling isn't useful here.

Making branchless machine code / assembly from source that uses an if is called "if conversion" optimization. In technical terminology, if-conversion is turning a control dependency into a data dependency.

See @k314159's answer for the Java asm, and my comments on it.

Java makes branchless asm using a worse strategy that can be represented in high-level source as count = (element > 100) ? count+1 : count;, which has 2 operations on the critical path from old count to new count: increment and cmov (conditional select using an ALU operation whose inputs are the compare result in FLAGS, count, count+1). AArch64 has a similar csel instruction, but also a csinc / cinc instruction that can conditionally increment as well as select, based on a flag condition.

See gcc optimization flag -O3 makes code slower than -O2 for a very similar case of branchless code-gen with a longer critical path than necessary.

Unrolling helps with throughput bottlenecks (front-end decode and alloc/rename into the back end, and execution ports), not with loop-carried dependency-chain latency bottlenecks.

Even with this cmov strategy instead of materializing a 0/1 integer to add, it could have unrolled better with multiple accumulators for count (like count1, count2 etc), summing at the end; that's a common technique for hiding floating-point ALU latency. A compiler is allowed to do it for you with integer math because it's associative, unlike FP math with rounding error. But that's not what the HotSpot JIT does, so it could be going about 2x faster (1 element per cycle) without even vectorizing. Or even more on wider CPUs like Arrow Lake or Zen 5, with multiple accumulators.
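
In source form, the multiple-accumulator idea looks like this (a C# sketch of the technique only; the point above is about what a JIT could emit, and this code and its name are not from the post):

static int CountAboveThresholdUnrolled(int[] array)   // hypothetical helper
{
    int c0 = 0, c1 = 0, c2 = 0, c3 = 0;                // four independent dependency chains
    int i = 0;
    for (; i <= array.Length - 4; i += 4)
    {
        c0 += (array[i]     > 100) ? 1 : 0;
        c1 += (array[i + 1] > 100) ? 1 : 0;
        c2 += (array[i + 2] > 100) ? 1 : 0;
        c3 += (array[i + 3] > 100) ? 1 : 0;
    }
    for (; i < array.Length; i++)                      // scalar tail
        if (array[i] > 100) c0++;
    return c0 + c1 + c2 + c3;
}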

SIMD vectorization

Neither JIT (C# or Java) is doing this for you. Modern ahead-of-time compilers do (GCC and Clang), making asm that's much faster than scalar but still far from optimal.

With SSE2 or AVX2 used efficiently, you'd easily bottleneck on memory bandwidth, or close to L2 cache bandwidth (32 bytes per cycle) if data was hot in L2 or L1d. Same for NEON/ASIMD or SVE on AArch64.
With auto-vectorized machine code from GCC or Clang, you wouldn't be going that fast. Maybe fast enough to keep up with DRAM bandwidth of like 8 to 16 bytes per cycle on a desktop/laptop, even with byte elements. With 32-bit int elements, there's less unpacking/widening work per vector (using GCC/Clang's bad strategy) and the data is less dense, so even auto-vectorized C++ should just bottleneck on DRAM bandwidth for your large problem size. Manually vectorizing would be useful to let you go faster with hot caches, and waste less execution resources so the other logical core sharing a physical core can get more done.

See my comments under Matthew Watson's answer for more about SIMD. And his answer for C# with intrinsics doing an ok job, but horizontally summing the compare result inside the inner loop. Charlieface's answer shows Vector.Subtract to increment a vector of counts with the 0 / -1 compare-result vector.

The problem of accumulating byte compare results into a total count is the same as in How to count character occurrences using SIMD, where chtz's answer shows C++ with AVX2 intrinsics which accumulate counts in separate bytes for an inner loop of up to 255 iterations (the max before any element could overflow), then hsum with vpsadbw against a zeroed register and accumulate those 64-bit counts into another vector which you hsum outside the outer loop.

x86 before AVX-512 only has two integer compare predicates: equal and signed-greater. So to do unsigned above or below, one trick is x > y as min(x,y) != y; Clang does it that way, actually spending an XOR to invert the compare results from vpcmpeqb. Or using unsigned-saturating subtraction + cmpeq like GCC does. That's part of how GCC/Clang auto-vectorize, but they widen to 32-bit for every compare-result, inefficiently not even using psadbw. Or worse to 64-bit elements if your sum variable is 64-bit. Especially with AVX2 where the top 128 bits is less convenient to get at.

You can see GCC/Clang in the above Godbolt link; take out -fno-tree-vectorize from the command line options.

Saturating subtraction isn't available for arch-agnostic System.Numerics vectors, so you'd have to use Avx2.SubtractSaturate on Vector256 from System.Runtime.Intrinsics.X86 to do it that way, which would make it x86-only. But packed-min, packed-max, and equality are available. So is XOR and compare for greater-than, so you could XOR with 128 to range-shift unsigned to signed, and do (x^128) > (100^128). With the threshold reused across many vectors, XORing it as part of the setup is negligible. Or Max(v, thresh+1) == v should only be true when v>thresh.
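
The XOR range-shift in System.Numerics terms might look like this (my sketch; the Vector.Xor / Vector.AsVectorSByte / Vector.GreaterThan APIs exist, but the helper and how it plugs into the loop are illustrative):

using System.Numerics;

static Vector<sbyte> UnsignedGreaterThan100(Vector<byte> v)   // hypothetical helper
{
    var signBit = new Vector<byte>((byte)0x80);
    var shiftedV = Vector.AsVectorSByte(Vector.Xor(v, signBit));                                 // x ^ 128
    var shiftedThresh = Vector.AsVectorSByte(Vector.Xor(new Vector<byte>((byte)100), signBit));  // 100 ^ 128
    return Vector.GreaterThan(shiftedV, shiftedThresh);   // all-ones lanes where v > 100 (unsigned)
}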

Charlieface's version compiles nicely for int input.

For byte input, this is my rough attempt at doing something non-terrible with intrinsics in C#, despite not having access to efficient horizontal-sum of unsigned bytes (vpsadbw). And working around the compiler which wanted to use AVX-512 unsigned byte compares on Sharplab for .GreaterThan, but instead of doing a merge-masking add or sub, it turned the mask back into a vector with vpmovm2b, so it's no better than just using AVX2 to emulate unsigned compares. In fact it's worse; AVX-512 compares only run on one port on Intel CPUs, vs. vpcmpgtb with a YMM destination (instead of a k mask register) running on more ports.

... // trimmed to fit in 30k char limit
Vector<ushort> outercount = new();
for (; i < unrolled_size_limit; ) {
    Vector<byte> vcount0 = new();   // Unroll with two u8 accumulator vectors to hide latency
    Vector<byte> vcount1 = new();
    for(int j=0; j<255; ++j) {
        // Maybe use LoadUnsafe or something to avoid bounds checks?
        var v = new Vector<byte>(array.Slice(i, vectorSize));
        var max = Vector.Max(v, vthresh_plus_1);
        var mask = Vector.Equals(max, v);   // true when v >(unsigned) thresh
        vcount0 -= mask;

        v = new Vector<byte>(array.Slice(i+vectorSize, vectorSize));
        max = Vector.Max(v, vthresh_plus_1);
        vcount1 -= Vector.Equals(max, v);

        i += 2 * vectorSize;
    }
    ...

Sharplab with this and Charlieface's Vector<int> version, which I don't think I ended up modifying significantly since it was already good.

This has bounds-checks on each vector load inside the inner loop, so that sucks a lot for throughput, like an extra mov+lea+cmp/ja. Maybe using MemoryMarshal.Cast can do the check once, outside the loop? As in Writing a vector sum function with SIMD (System.Numerics) and making it faster than a for loop
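
Something along these lines might hoist the bounds check out of the hot loop (untested sketch; MemoryMarshal.Cast is a real API, the loop body is elided):

using System.Numerics;
using System.Runtime.InteropServices;

// Reinterpret the byte span as whole vectors, so each load is a plain indexed read
// with the length check handled by the foreach, not a Slice per iteration.
ReadOnlySpan<Vector<byte>> vectors = MemoryMarshal.Cast<byte, Vector<byte>>(array);
foreach (var v in vectors)
{
    // ... compare v against the threshold vector and accumulate as above ...
}
// Tail: array.Length % Vector<byte>.Count leftover bytes handled with scalar code.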

v == max(v, thresh+1) forces the compiler to load into a register instead of using a memory source operand for vpmaxub. I really should have used XOR / signed compare for greater.

The inner loop from L0039: to dec/jne short L0039 is 19 uops for CPUs that handle 256-bit vectors as a single uop. (Unlike Zen 1, where they're 2 uops each). So maybe 2x 32-byte vectors per 4 clock cycles, or 16 byte elements per clock, on 5-wide CPUs with 256-bit vector ALUs, like Ice Lake or Zen 2. Intel might bottleneck on back-end ALU ports; AMD has separate ports for integer vs. SIMD and there's a good mix of both, plus 2 loads.

By comparison, the not-unrolled inner loop from Charlieface's version actually does still have the bounds checking, too. 10 uops for 1 vector, again about 2 cycles per 32 bytes or 16 bytes per cycle.

The outer loop code that widens to Vector<UInt16> is much worse than vpsadbw, but only runs every 2x255 vectors.

1 Comment

This got kinda long, and could probably use proof-reading. It was taking too long to write so I didn't do a final read-through of the whole thing. I'm sure I've repeated myself some, some of it intentionally so separate paragraphs / sections are more self-contained.
7

This is not an answer to the actual question, but just a demonstration of how you could optimise the C# code to use SIMD vectors.

I believe that the most likely answer is "because Java is using SIMD vector optimisations and C# is not" - but to prove this we would need to see the actual ASM generated for the Java.

If the Java is using SIMD, it might be doing something like this:

static int testSpeedVectorised(ReadOnlySpan<int> array)
{
    if (array.Length == 0) throw new ArgumentException();

    const int THRESHOLD = 100;
    int count = 0;
    int vectorSize = Vector<int>.Count;
    var threshold = new Vector<int>(THRESHOLD);

    // Vectorised loop
    int i = 0;
    for (; i <= array.Length - vectorSize; i += vectorSize)
    {
        var v = new Vector<int>(array.Slice(i, vectorSize));
        var mask = Vector.GreaterThan(v, threshold);

        // Each 'true' in mask is -1, so Vector.Sum(mask) is negative of the match count.
        // Therefore, we subtract it from count rather than adding it.
        count -= Vector.Sum(mask);
    }

    // Handle the leftover values that didn't fit in a Vector<int>
    for (; i < array.Length; i++)
    {
        if (array[i] > THRESHOLD) count++;
    }

    return count;
}

If you really REALLY want to, you can write a multithreaded version of the SIMD version to make it even faster:

[MethodImpl(MethodImplOptions.AggressiveOptimization)]
static int testSpeedMultithreadedVectorised(int[] array)
{
    if (array.Length == 0) throw new ArgumentException();

    const int THRESHOLD = 100;
    int vectorSize = Vector<int>.Count;
    int length = array.Length;
    const int BLOCK_SIZE = 1_000_000; // Might not be optimal, adjust as needed.
    int total = 0;

    Parallel.For(
        0,
        (length + BLOCK_SIZE - 1) / BLOCK_SIZE,
        () => 0, // Initial local count for each thread
        (blockIndex, _ /* state - not used */, localCount) =>
        {
            int start = blockIndex * BLOCK_SIZE;
            int end = Math.Min(start + BLOCK_SIZE, length);
            int i = start;
            var threshold = new Vector<int>(THRESHOLD);

            for (; i <= end - vectorSize; i += vectorSize)
            {
                var v = new Vector<int>(array, i);
                var mask = Vector.GreaterThan(v, threshold);

                // Each 'true' in mask is -1, so Vector.Sum(mask) is negative of the match count.
                // Therefore, we subtract it from count rather than adding it.
                localCount -= Vector.Sum(mask);
            }

            // Handle the leftover values that didn't fit in a Vector<int>
            for (; i < end; i++)
            {
                if (array[i] > THRESHOLD) localCount++;
            }

            return localCount;
        },
        localCount => Interlocked.Add(ref total, localCount)
    );

    return total;
}

The benchmark results are as follows:

| Method                           | Mean       | Error     | StdDev    | Allocated |
|--------------------------------- |-----------:|----------:|----------:|----------:|
| TestSpeed                        | 288.823 ms | 2.8061 ms | 2.6248 ms |     200 B |
| TestSpeedVectorised              |  20.204 ms | 0.3636 ms | 0.3401 ms |      12 B |
| TestSpeedMultithreadedVectorised |   6.554 ms | 0.0168 ms | 0.0157 ms |    7844 B |

As you can see, the optimised versions are significantly faster.


Update

I have incorporated the suggestions from other answers into my benchmarks.

However, I have not attempted to optimise for using a byte array, since I think that is a distraction from the requirement of checking which values of an int array exceed a particular threshold.

To drive this point home, the test data is an array of 100 million int values between 0 and 10_000 inclusive, and the threshold is 5_000.

Benchmark Code

  • Original() is the original C# code from the OP.

  • Optimised() rewrites the inner loop to use the suggestion from u/PeterCordes.

  • Vectorised() is my original vector approach.

  • BetterVectorised() incorporates a suggestion from /u/CharlieFace

  • MtVectorised() is the multithreaded version of Vectorised().

  • MtBetterVectorised() is the multithreaded version of BetterVectorised().

using System.Numerics;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

namespace ConsoleApp1;

[MemoryDiagnoser]
public class Benchmark
{
    private int[] _array = null!;

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(321312);
        _array = new int[100_000_000];

        for (int i = 0; i < _array.Length; i++)
        {
            _array[i] = rng.Next(0, 10_001); // Random values between 0 and 10_000.
        }
    }

    [Benchmark(Baseline = true)]
    public int Original()
    {
        return testSpeed(_array);
    }

    [Benchmark]
    public int Optimised()
    {
        return testSpeedOptimised(_array);
    }

    [Benchmark]
    public int Vectorised()
    {
        return testSpeedVectorised(_array);
    }

    [Benchmark]
    public int BetterVectorised()
    {
        return testSpeedVectorisedOptimised(_array);
    }

    [Benchmark]
    public int MtVectorised()
    {
        return testSpeedMultithreadedVectorised(_array);
    }

    [Benchmark]
    public int MtBetterVectorised()
    {
        return testSpeedMultithreadedVectorisedOptimised(_array);
    }

    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    static int testSpeedVectorised(ReadOnlySpan<int> array)
    {
        if (array.Length == 0)
            throw new ArgumentException();

        const int THRESHOLD = 5_000;

        int count = 0;
        int vectorSize = Vector<int>.Count;
        int i = 0;
        var threshold = new Vector<int>(THRESHOLD);

        // Vectorized loop
        for (; i <= array.Length - vectorSize; i += vectorSize)
        {
            var v = new Vector<int>(array.Slice(i, vectorSize));
            var mask = Vector.GreaterThan(v, threshold);
            // Each 'true' in mask is -1, so Vector.Sum(mask) is negative of the match count.
            // Therefore, we subtract it from count rather than adding it.
            count -= Vector.Sum(mask);
        }

        // Remainder loop
        for (; i < array.Length; i++)
        {
            if (array[i] > THRESHOLD)
                count++;
        }

        return count;
    }

    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    static int testSpeedVectorisedOptimised(ReadOnlySpan<int> array)
    {
        if (array.Length == 0)
            throw new ArgumentException();

        const int THRESHOLD = 5_000;

        Vector<int> count = new();
        int vectorSize = Vector<int>.Count;
        var threshold = new Vector<int>(THRESHOLD);

        // Vectorised loop
        int i = 0;
        for (; i <= array.Length - vectorSize; i += vectorSize)
        {
            var v = new Vector<int>(array.Slice(i, vectorSize));
            var mask = Vector.GreaterThan(v, threshold);
            count = Vector.Subtract(count, mask);
        }

        var totalCount = Vector.Sum(count);

        // Handle the leftover values that didn't fit in a Vector<int>
        for (; i < array.Length; i++)
        {
            if (array[i] > THRESHOLD)
                totalCount++;
        }

        return totalCount;
    }

    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    static int testSpeedMultithreadedVectorised(int[] array)
    {
        if (array.Length == 0)
            throw new ArgumentException();

        const int THRESHOLD = 5_000;

        int vectorSize = Vector<int>.Count;
        int length = array.Length;
        const int BLOCK_SIZE = 1_000_000; // Might not be optimal, adjust as needed.
        int total = 0;

        Parallel.For(
            0,
            (length + BLOCK_SIZE - 1) / BLOCK_SIZE,
            () => 0, // Initial local count for each thread
            (chunkIdx, _ /* state - not used */, localCount) =>
            {
                int start = chunkIdx * BLOCK_SIZE;
                int end = Math.Min(start + BLOCK_SIZE, length);
                int i = start;
                var threshold = new Vector<int>(THRESHOLD);

                for (; i <= end - vectorSize; i += vectorSize)
                {
                    var v = new Vector<int>(array, i);
                    var mask = Vector.GreaterThan(v, threshold);
                    // Each 'true' in mask is -1, so Vector.Sum(mask) is negative of the match count.
                    // Therefore, we subtract it from count rather than adding it.
                    localCount -= Vector.Sum(mask);
                }

                // Handle the leftover values that didn't fit in a Vector<int>
                for (; i < end; i++)
                {
                    if (array[i] > THRESHOLD)
                        localCount++;
                }

                return localCount;
            },
            localCount => Interlocked.Add(ref total, localCount)
        );

        return total;
    }

    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    static int testSpeedMultithreadedVectorisedOptimised(int[] array)
    {
        if (array.Length == 0)
            throw new ArgumentException();

        const int THRESHOLD = 5_000;

        int vectorSize = Vector<int>.Count;
        int length = array.Length;
        const int BLOCK_SIZE = 1_000_000; // Might not be optimal, adjust as needed.
        int total = 0;

        Parallel.For(
            0,
            (length + BLOCK_SIZE - 1) / BLOCK_SIZE,
            () => 0, // Initial local count for each thread
            (chunkIdx, _ /* state - not used */, localCount) =>
            {
                int start = chunkIdx * BLOCK_SIZE;
                int end = Math.Min(start + BLOCK_SIZE, length);

                Vector<int> count = new();
                int i = start;
                var threshold = new Vector<int>(THRESHOLD);

                for (; i <= end - vectorSize; i += vectorSize)
                {
                    var v = new Vector<int>(array, i);
                    var mask = Vector.GreaterThan(v, threshold);
                    count = Vector.Subtract(count, mask);
                }

                localCount += Vector.Sum(count);

                // Handle the leftover values that didn't fit in a Vector<int>
                for (; i < end; i++)
                {
                    if (array[i] > THRESHOLD)
                        localCount++;
                }

                return localCount;
            },
            localCount => Interlocked.Add(ref total, localCount)
        );

        return total;
    }

    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    static int testSpeed(ReadOnlySpan<int> array)
    {
        const int THRESHOLD = 5_000;

        int count = 0;
        foreach (int element in array)
        {
            if (element > THRESHOLD)
            {
                count++;
            }
        }

        return count;
    }

    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    static int testSpeedOptimised(ReadOnlySpan<int> array)
    {
        const int THRESHOLD = 5_000;

        int count = 0;
        foreach (int element in array)
        {
            count += element > THRESHOLD ? 1 : 0;
        }

        return count;
    }
}

Results

| Method             | Mean       | Error     | StdDev    | Ratio | Allocated | Alloc Ratio |
|--------------------|-----------:|----------:|----------:|------:|----------:|------------:|
| Original           | 317.927 ms | 2.0464 ms | 1.9142 ms |  1.00 |     200 B |        1.00 |
| Optimised          |  44.469 ms | 0.4871 ms | 0.4557 ms |  0.14 |      33 B |        0.17 |
| Vectorised         |  20.041 ms | 0.3954 ms | 0.3698 ms |  0.06 |      12 B |        0.06 |
| BetterVectorised   |  16.191 ms | 0.2009 ms | 0.1879 ms |  0.05 |      12 B |        0.06 |
| MtVectorised       |   6.755 ms | 0.0590 ms | 0.0523 ms |  0.02 |    7806 B |       39.03 |
| MtBetterVectorised |   6.590 ms | 0.1058 ms | 0.0990 ms |  0.02 |    7824 B |       39.12 |

Observations

  • The simple inner-loop change to count += element > THRESHOLD ? 1 : 0; suggested by u/PeterCordes makes a huge difference, making the scalar loop around seven times faster.
  • The BetterVectorised() version makes only a minor further difference.
  • The multithreaded vectorised versions are around fifty times faster than the original, at the expense of around 8 KB of additional memory usage.

Comments

Faster version: sum into a vector in parallel, then horizontally sum only the final result: dotnetfiddle.net/WQcXl0
@VasilisKontopoulos and Matthew: If you're going to manually vectorize with SIMD, start with the byte array so you can skip the insanely inefficient conversion from u8 to i32. I assume C# has some generic SIMD stuff that can work with unsigned char or unsigned byte or whatever it calls it, and will compile to something like x86 pxor (to flip the sign bit) + pcmpgtb (packed compare of signed bytes) or something. Or to AVX-512 vpcmpub if available. Or to AArch64 CMHI (compare unsigned higher). 1/4 the cache footprint, 4x as much work per SIMD vector.
The one advantage to unpacking to i32 is that you can do vertical adds without worrying about overflow and only do a slow horizontal sum once at the end... but you're unfortunately not doing that; you're reducing to scalar to accumulate into int count inside the inner loop. If you're going to unpack, you should do it on the fly, since it just costs a pmovsxbd if SSE4.1 is available. Anyway, see How to count character occurrences using SIMD for how to handle an 8-bit compare. In that case it's for ==, but accumulating is the same for any 0/-1 vector masks.
Actually the trick for unsigned compare x >= 100u without AVX-512 is pminub / pcmpeqb, to check if min(x, 100u) == 100u. godbolt.org/z/1Tx3j4vTY shows how GCC and Clang auto-vectorize a loop over bytes. (Not great, but much better than scalar.)
@PeterCordes Changing from int[] and IntVector to byte[] and ByteVector changed the JMH timings to 62.998 ± 1.056 ms/op, 7.430 ± 0.148 ms/op, and 3.843 ± 0.112 ms/op (timings with respect to my previous comment), so at least on Java making the switch definitely improves execution time for vector implementations in this case. But just wondering, I had to compare GT 100 and LT 0 in two steps to account for the fact Java bytes are signed; is there a way to do a single comparison without actually widening the values to an int?
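
To illustrate the byte-level approach discussed in these comments on the C# side, here is a rough sketch of my own (CountBytesAbove is a hypothetical helper, not taken from any answer). It compares Vector<byte> lanes directly, leaving the unsigned-compare lowering (sign-flip or min/cmpeq) to the JIT, and flushes the per-lane byte counters every 255 vectors so they cannot overflow, following the blocked-accumulation idea from the linked character-counting question:

using System;
using System.Numerics;

static class ByteCountSketch
{
    // Counts bytes strictly greater than `limit` without widening the data to int first.
    public static int CountBytesAbove(ReadOnlySpan<byte> bytes, byte limit)
    {
        var threshold = new Vector<byte>(limit);
        int vectorSize = Vector<byte>.Count;
        int total = 0;
        int i = 0;

        // Outer loop: process at most 255 vectors per block so the per-lane
        // byte counters cannot overflow (each match adds 1 to its lane).
        while (i + vectorSize <= bytes.Length)
        {
            var blockCounts = Vector<byte>.Zero;
            int vectorsInBlock = Math.Min(255, (bytes.Length - i) / vectorSize);

            for (int v = 0; v < vectorsInBlock; v++, i += vectorSize)
            {
                var data = new Vector<byte>(bytes.Slice(i, vectorSize));
                var mask = Vector.GreaterThan(data, threshold); // 0xFF where data > limit
                blockCounts -= mask;                            // 0xFF is -1, so this adds 1 per match
            }

            // Flush: widen the byte counters to uint lanes and fold them into the total.
            Vector.Widen(blockCounts, out Vector<ushort> lo16, out Vector<ushort> hi16);
            Vector.Widen(lo16, out Vector<uint> a, out Vector<uint> b);
            Vector.Widen(hi16, out Vector<uint> c, out Vector<uint> d);
            total += (int)(Vector.Sum(a) + Vector.Sum(b) + Vector.Sum(c) + Vector.Sum(d));
        }

        // Scalar tail for the bytes that didn't fill a whole vector.
        for (; i < bytes.Length; i++)
        {
            if (bytes[i] > limit)
                total++;
        }

        return total;
    }
}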

You could try with a for loop rather than a foreach, but it seems the ASM is the same for those.

Either way, you need to use a tool such as BenchmarkDotNet for accurate results; do not rely on Stopwatch.

int[] _array;

[GlobalSetup]
public void Setup()
{
    string desktopPath = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
    string fullPath = Path.Combine(desktopPath, "RandomFile.dat");
    if (!File.Exists(fullPath))
    {
        Console.WriteLine("The file does not exist.");
        return;
    }
    byte[] byteArray = File.ReadAllBytes(fullPath);
    _array = ByteArrayToIntArray(byteArray);
}

[Benchmark]
[MethodImplAttribute(MethodImplOptions.AggressiveOptimization)]
public int TestSpeed()
{
    var array = _array;
    if (array == null)
        throw new ArgumentNullException(nameof(array));
    if (array.Length == 0)
        throw new ArgumentException("Empty array", nameof(array));

    int count = 0;
    for (var i = 0; i < array.Length; i++)
    {
        if (array[i] > 100)
            count++;
    }
    return count;
}
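
As an aside, a minimal sketch of wiring this up with BenchmarkDotNet might look like the following (the Benchmarks class name is a placeholder for whatever class holds the methods above, not something from this answer); remember to run a Release build:

using BenchmarkDotNet.Running;

public class Program
{
    public static void Main()
    {
        // Discovers and runs every [Benchmark] method on the benchmark class
        // (assumed here to be called Benchmarks).
        BenchmarkRunner.Run<Benchmarks>();
    }
}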

You may also find that unrolling the loop speeds it up significantly. Try either a 128-bit or 256-bit unroll (i.e. a 4-step or 8-step unroll of 32-bit ints), depending on your processor. Removing branching is also likely to give a good speedup.

[Benchmark]
[MethodImplAttribute(MethodImplOptions.AggressiveOptimization)]
public int TestSpeed2()
{
    var array = _array;
    if (array == null)
        throw new ArgumentNullException(nameof(array));
    if (array.Length == 0)
        throw new ArgumentException("Empty array", nameof(array));

    int count = 0;
    int i = 0;
    for (; i < array.Length - 8; i += 8)
    {
        var temp0 = array[i] > 100;
        var temp1 = array[i + 1] > 100;
        var temp2 = array[i + 2] > 100;
        var temp3 = array[i + 3] > 100;
        var temp4 = array[i + 4] > 100;
        var temp5 = array[i + 5] > 100;
        var temp6 = array[i + 6] > 100;
        var temp7 = array[i + 7] > 100;
        count += Unsafe.As<bool, int>(ref temp0)
               + Unsafe.As<bool, int>(ref temp1)
               + Unsafe.As<bool, int>(ref temp2)
               + Unsafe.As<bool, int>(ref temp3)
               + Unsafe.As<bool, int>(ref temp4)
               + Unsafe.As<bool, int>(ref temp5)
               + Unsafe.As<bool, int>(ref temp6)
               + Unsafe.As<bool, int>(ref temp7);
    }

    for (; i < array.Length; i++)
    {
        if (array[i] > 100)
        {
            count++;
        }
    }
    return count;
}

You could even try using SIMD with Vector<int>. The version below is a faster take on @MatthewWatson's: it accumulates the per-lane counts in parallel and horizontally sums only the final result, rather than horizontally summing on every iteration. You can also use Unsafe.Add to avoid bounds checks.

[MethodImplAttribute(MethodImplOptions.AggressiveOptimization)]
public static int testSpeedVectorised(ReadOnlySpan<int> array)
{
    if (array.Length == 0)
        throw new ArgumentException();

    const int THRESHOLD = 100;

    Vector<int> count = new();
    int vectorSize = Vector<int>.Count;
    // Initialize a vector with all lanes set to the same value.
    var threshold = new Vector<int>(THRESHOLD);

    // Vectorised loop
    int i = 0;
    ref int start = ref MemoryMarshal.GetReference(array);
    for (; i <= array.Length - vectorSize; i += vectorSize)
    {
        var v = new Vector<int>(MemoryMarshal.CreateReadOnlySpan(ref Unsafe.Add(ref start, i), vectorSize));
        var mask = Vector.GreaterThan(v, threshold);
        // Each 'true' in mask is -1, so we use Vector.Subtract rather than Vector.Add:
        // subtracting the mask from count adds 1 for every match.
        count = Vector.Subtract(count, mask);
    }

    var totalCount = Vector.Sum(count);

    // Handle the leftover values that didn't fit in a Vector<int>
    for (; i < array.Length; i++)
    {
        if (Unsafe.Add(ref start, i) > THRESHOLD)
            totalCount++;
    }

    return totalCount;
}

Comments

The Java code actually uses foreach (the syntax is called an "enhanced for loop" which is the equivalent of C#'s foreach) but the Java language spec guarantees that in circumstances like this one, the foreach loop is implemented as a for loop in the generated Java byte code. Are you sure C# doesn't do the same?
C# does do the same. sharplab.io/…
Yes, you're right; I thought it was worth a try. Unrolling saves a lot of time as well.
For the record, the vectorization strategy here, with Vector.Subtract(count, mask); inside the loop and Vector.Sum(count); outside, is much better than what Matthew Watson's answer is doing with a scalar count -= Vector.Sum inside the loop. On x86-64, Vector.Sum costs 2 or 3 shuffles + adds (for 128-bit or 256-bit vectors respectively), plus a vmovq from vector to scalar. But the JVM isn't vectorizing, just making branchless scalar asm according to k314159's answer.
I updated my answer with a Vector<byte> version that avoids unpacking first. I noticed that it and your answer both compile with bounds-checks on every SIMD load. Is there a way to avoid that, like maybe MemoryMarshal.Cast<byte, Vector<byte>> outside the loop, and index whole vectors? Or would one need .LoadUnsafe()?
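
For what it's worth, one way to sidestep the per-load bounds check is to reinterpret the span as whole vectors up front with MemoryMarshal.Cast and index Vector<int> elements directly. This is a sketch of my own (CountAboveThreshold is a hypothetical helper, not from either answer), and whether the bounds checks actually disappear would need to be confirmed in the generated asm:

using System;
using System.Numerics;
using System.Runtime.InteropServices;

static class CastSketch
{
    // Reinterprets the int span as a span of whole Vector<int> elements, so the
    // hot loop reads one vector per iteration instead of building a vector from
    // a bounds-checked slice each time.
    public static int CountAboveThreshold(ReadOnlySpan<int> array, int limit)
    {
        var threshold = new Vector<int>(limit);
        ReadOnlySpan<Vector<int>> vectors = MemoryMarshal.Cast<int, Vector<int>>(array);

        Vector<int> counts = Vector<int>.Zero;
        foreach (var v in vectors)
        {
            counts -= Vector.GreaterThan(v, threshold); // each match adds 1 to its lane
        }
        int total = Vector.Sum(counts);

        // Scalar tail for the elements that didn't fill a whole vector.
        for (int i = vectors.Length * Vector<int>.Count; i < array.Length; i++)
        {
            if (array[i] > limit)
                total++;
        }

        return total;
    }
}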
