Best code is presented below. Other methods and time measuring are available in full source.
static unsafe bool BySimdUnrolled (byte[] data) { fixed (byte* bytes = data) { int len = data.Length; int rem = len % (16 * 16); Vector16b* b = (Vector16b*)bytes; Vector16b* e = (Vector16b*)(bytes + len - rem); Vector16b zero = Vector16b.Zero; while (b < e) { if ((*(b) | *(b + 1) | *(b + 2) | *(b + 3) | *(b + 4) | *(b + 5) | *(b + 6) | *(b + 7) | *(b + 8) | *(b + 9) | *(b + 10) | *(b + 11) | *(b + 12) | *(b + 13) | *(b + 14) | *(b + 15)) != zero) return false; b += 16; } for (int i = 0; i < rem; i++) if (data [len - 1 - i] != 0) return false; return true; } }
Eventually it was beaten by this code:
static unsafe bool ByFixedLongUnrolled (byte[] data) { fixed (byte* bytes = data) { int len = data.Length; int rem = len % (sizeof(long) * 16); long* b = (long*)bytes; long* e = (long*)(bytes + len - rem); while (b < e) { if ((*(b) | *(b + 1) | *(b + 2) | *(b + 3) | *(b + 4) | *(b + 5) | *(b + 6) | *(b + 7) | *(b + 8) | *(b + 9) | *(b + 10) | *(b + 11) | *(b + 12) | *(b + 13) | *(b + 14) | *(b + 15)) != 0) return false; b += 16; } for (int i = 0; i < rem; i++) if (data [len - 1 - i] != 0) return false; return true; } }
Time measurements (on 256MB array):
LINQ All(b => b == 0) : 6350,4185 ms Foreach over byte[] : 580,4394 ms For with byte[].Length property : 809,7283 ms For with Length in local variable : 407,2158 ms For unrolled 16 times : 334,8038 ms For fixed byte* : 272,386 ms For fixed byte* unrolled 16 times : 141,2775 ms For fixed long* : 52,0284 ms For fixed long* unrolled 16 times : 25,9794 ms SIMD Vector16b equals Vector16b.Zero : 56,9328 ms SIMD Vector16b also unrolled 16 times : 32,6358 ms
Conclusions:
- Mono.Simd has only a limited set of instructions. I found no instructions for computing scalar sum(vector) or max(vector). There is however vector equality operator returning bool.
- Loop unrolling is a powerful technique. Even fastest code benefits a lot from using it.
- LINQ is ungodly slow because it uses delegates from lambda expressions. If you need cutting edge performance then clearly that is not the way to go.
- All methods presented use short circuit evaluation, meaning they end as soon as they encounter non-zero.
- SIMD code was eventually beaten. There are other questions on SO disputing whether SIMD actually makes things faster.
Posted this code on Peer Review, so far 2 bugs found and fixed.