SSE instruction to check if byte array is zeroes C#

Question

Suppose I have a byte[] and want to check if all bytes are zeros. For loop is an obvious way to do it, and LINQ All() is a fancy way to do it, but highest performance is critical.

How can I use Mono.Simd to speed up checking if byte array is full of zeroes? I am looking for cutting edge approach, not merely correct solution.

When doing performance tests on .NET apps, you should make sure to run a few times and skip the first one because the JIT can get involved there. If you want to talk about the absolute fastest performance, then you should probably specify the hardware, too... Using BenchmarkDotNet to run your different candidates and report the results would be ideal, because it makes sure to run the candidates in a way that's as accurate as we can get, and its output includes the run parameters like hardware, GC mode, etc. — Joe Amenta
– Joe Amenta, Commented Jul 17, 2017 at 12:12

ArekBulski · Accepted Answer · 2015-10-24 15:47:00Z

Best code is presented below. Other methods and time measuring are available in full source.

static unsafe bool BySimdUnrolled (byte[] data) { fixed (byte* bytes = data) { int len = data.Length; int rem = len % (16 * 16); Vector16b* b = (Vector16b*)bytes; Vector16b* e = (Vector16b*)(bytes + len - rem); Vector16b zero = Vector16b.Zero; while (b < e) { if ((*(b) | *(b + 1) | *(b + 2) | *(b + 3) | *(b + 4) | *(b + 5) | *(b + 6) | *(b + 7) | *(b + 8) | *(b + 9) | *(b + 10) | *(b + 11) | *(b + 12) | *(b + 13) | *(b + 14) | *(b + 15)) != zero) return false; b += 16; } for (int i = 0; i < rem; i++) if (data [len - 1 - i] != 0) return false; return true; } }

Eventually it was beaten by this code:

static unsafe bool ByFixedLongUnrolled (byte[] data) { fixed (byte* bytes = data) { int len = data.Length; int rem = len % (sizeof(long) * 16); long* b = (long*)bytes; long* e = (long*)(bytes + len - rem); while (b < e) { if ((*(b) | *(b + 1) | *(b + 2) | *(b + 3) | *(b + 4) | *(b + 5) | *(b + 6) | *(b + 7) | *(b + 8) | *(b + 9) | *(b + 10) | *(b + 11) | *(b + 12) | *(b + 13) | *(b + 14) | *(b + 15)) != 0) return false; b += 16; } for (int i = 0; i < rem; i++) if (data [len - 1 - i] != 0) return false; return true; } }

Time measurements (on 256MB array):

LINQ All(b => b == 0) : 6350,4185 ms Foreach over byte[] : 580,4394 ms For with byte[].Length property : 809,7283 ms For with Length in local variable : 407,2158 ms For unrolled 16 times : 334,8038 ms For fixed byte* : 272,386 ms For fixed byte* unrolled 16 times : 141,2775 ms For fixed long* : 52,0284 ms For fixed long* unrolled 16 times : 25,9794 ms SIMD Vector16b equals Vector16b.Zero : 56,9328 ms SIMD Vector16b also unrolled 16 times : 32,6358 ms

Conclusions:

Mono.Simd has only a limited set of instructions. I found no instructions for computing scalar sum(vector) or max(vector). There is however vector equality operator returning bool.
Loop unrolling is a powerful technique. Even fastest code benefits a lot from using it.
LINQ is ungodly slow because it uses delegates from lambda expressions. If you need cutting edge performance then clearly that is not the way to go.
All methods presented use short circuit evaluation, meaning they end as soon as they encounter non-zero.
SIMD code was eventually beaten. There are other questions on SO disputing whether SIMD actually makes things faster.

Posted this code on Peer Review, so far 2 bugs found and fixed.

This assumes your array is 16*N in length, a big assumption, but could be valid in controlled environments , also from your times on BySimdEquals I would highly assume that you are not running this with O=simd and thus are getting non-simd O=-simd times(?), which really do not improve that code execution times by that much. Coding this in C and p/invoking a GC-pinned array would be the faster.
Unrolled version is indeed faster, but unrolling the loop 2 times (thus comparing only 2 x 8 bytes each loop) give similar (if not better) performance as 16 times on my machine. It make sense when you know most x64 machines have only two 64-bit data channels (if you have 2 memory sticks and they are installed in the proper slots). Memory read is probably the biggest bottleneck here.

DragonSpit · Accepted Answer · 2019-06-27 11:28:55Z

The scalar implementation processes long's, which are 64-bit (8-bytes) at a time, and gets most of it's speedup from this parallelism, which is powerful.

The SIMD/SSE above code uses 128-bit SIMD/SSE (16-byte) instructions. When using the newer 256-bit (32-byte) SSE instructions, the SIMD implementation is about 10% faster. With AVX/AVX2 instructions at 512-bits (64-byte) in the latest processors, SIMD implementation using these should be even faster.

 private static bool ZeroDetectSseInner(this byte[] arrayToOr, int l, int r) { var zeroVector = new Vector<byte>(0); int concurrentAmount = 4; int sseIndexEnd = l + ((r - l + 1) / (Vector<byte>.Count * concurrentAmount)) * (Vector<byte>.Count * concurrentAmount); int i; int offset1 = Vector<byte>.Count; int offset2 = Vector<byte>.Count * 2; int offset3 = Vector<byte>.Count * 3; int increment = Vector<byte>.Count * concurrentAmount; for (i = l; i < sseIndexEnd; i += increment) { var inVector = new Vector<byte>(arrayToOr, i ); inVector |= new Vector<byte>(arrayToOr, i + offset1); inVector |= new Vector<byte>(arrayToOr, i + offset2); inVector |= new Vector<byte>(arrayToOr, i + offset3); if (!Vector.EqualsAll(inVector, zeroVector)) return false; } byte overallOr = 0; for (; i <= r; i++) overallOr |= arrayToOr[i]; return overallOr == 0; } public static bool ZeroValueDetectSse(this byte[] arrayToDetect) { return arrayToDetect.ZeroDetectSseInner(0, arrayToDetect.Length - 1); }

An improved version (thanks to Peter's suggestion) is shown in the above code, is safe and has been integrated into the HPCsharp nuget package, for a 20% speedup using 256-bit SSE instructions.

Why would you |= into an accumulator but still check that accumulator every iteration? It makes sense to |= a cache line or two of vectors together for one pcmpeqb / pmovmskb / test/ jnz loop break condition. But you want to start a fresh orVector after finding the previous was all zero, breaking the dependency chain. If this compiles as written, it's limited to 1 vector per cycle at best (data dependency through orVector, not 2x 16-byte loads per clock like modern x86 can do (AMD since K10, Intel since Sandybridge). Or 2x 32-byte loads per clock on Haswell and later.
Yeah, this was the initial implementation to show that the scalar unrolled loop can be beaten. The follow on implementation uses several independent orVectors within the loop to break dependencies. Short-circuit testing them for non-zero without introducing dependencies is the struggle. Would love to see suggestions.
At the top of a manually-unrolled loop in the source, do orVector = new Vector<byte>(arrayToOr, i); then for later vectors do orVector |= new Vector<byte>(arrayToOr, i+1); ; ... i + 2 etc. At the bottom of this loop, test it for non-zero.
Thank you for the explanation - helps a lot and made it concrete. This method brought another 10% speedup for a total gain of 20% with 256-bit (32-byte) SSE safe C# implementation over the unrolled scalar implementation above. Could you elaborate further how this method does not have dependencies, as it OR's all of the read Vector's into the single orVector? I implemented separate/multiple orVectors, but only for a minimal performance gain over your suggestion.
It has a dependency within the body of a single loop iteration, but breaks dependencies across loop iterations. So there's no loop-carried dependency (except the pointer-increment). This lets out-of-order execution overlap the work in separate iterations because they're separate dependency chains. (Branch prediction / speculative execution avoids waiting for the control dependency of the conditional branch out of the loop based on load+ALU results. i.e. branch speculation breaks data dependencies.)

Collectives™ on Stack Overflow

SSE instruction to check if byte array is zeroes C#

2 Answers 2

2 Comments

7 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

7 Comments

Linked

Related