3

I have a function, checkLeadingZeroBits, that checks whether the leading challengeSize bits of a SHA-256 hash are zero:

    #include <array>
    #include <cstdint>

    typedef std::array<uint8_t, 32> SHA256Hash;

    bool checkLeadingZeroBits(SHA256Hash& hash, unsigned int challengeSize) {
        int bytes = challengeSize / 8;
        const uint8_t* a = hash.data();
        if (bytes == 0) {
            return a[0] >> (8 - challengeSize) == 0;
        } else {
            for (int i = 0; i < bytes; i++) {
                if (a[i] != 0) return false;
            }
            int remainingBits = challengeSize - 8 * bytes;
            if (remainingBits > 0)
                return a[bytes] >> (8 - remainingBits) == 0;  // partial byte right after the full zero bytes
            else
                return true;
        }
    }

I've also tried the following, which has approximately the same run time:

    // NULL_SHA256_HASH is an all-zero SHA256Hash defined elsewhere
    bool checkLeadingZeroBits(SHA256Hash& hash, unsigned int challengeSize) {
        int bytes = challengeSize / 8;
        const uint8_t* a = hash.data();
        if (bytes == 0) {
            return a[0] >> (8 - challengeSize) == 0;
        } else {
            if (memcmp(NULL_SHA256_HASH.data(), a, bytes) != 0) return false;
            int remainingBits = challengeSize - 8 * bytes;
            if (remainingBits > 0)
                return a[bytes] >> (8 - remainingBits) == 0;
            else
                return true;
        }
    }

I'd like to optimize this function to run as quickly as possible. Is there a simpler way to check the leading order bits than using the for loops shown in my implementation?

8
  • 3
    Are you targeting a particular compiler? Certain compilers provide certain intrinsics for bit counting like this. Commented Dec 15, 2021 at 23:19
  • 1
    Does this answer your question? What is the fastest way to check the leading characters in a char array? Commented Dec 15, 2021 at 23:22
  • @Offtkp: Sure, if you interpret char as eight bits. Commented Dec 15, 2021 at 23:23
  • 1
    The only relevant optimization I can think of is to reinterpret the memory at the machine word size rather than iterating a byte at a time. I think modern compilers might have already done this for you on this kind of code, although I'm not sure if it's simple enough to be "recognized" like that. Commented Dec 15, 2021 at 23:24
  • @Offtkp it kind of gets there, but it doesn't handle the case where nBits%8 != 0 Commented Dec 15, 2021 at 23:26

4 Answers

4

As written, no. A raw byte array can't be scanned at the bit level faster than a byte by byte comparison operation.

However, if you're targeting a platform with a count-leading-zeros instruction or equivalent, you can speed up the process by checking blocks the size of a native processor word (32-bit for x86/ARM, 64-bit for AMD64/AArch64). This would give you a slight improvement, but it is likely not significant enough to matter in this instance.

To do this, you would get a raw pointer to the data, then cast it to a pointer the same size as a machine word (uintptr_t*). From there, you could read out blocks of memory and pass them to the compiler intrinsic that counts the leading zeros of a word.

There are a few caveats to this. For one, you need to be very careful that your last read from the new pointer doesn't overstep into memory you don't own. Additionally, for x86 architectures prior to the introduction of lzcnt (before AMD ABM or Intel Haswell), you need to handle the case where the bsr (bit scan reverse, MSB to LSB) instruction finds no bits, which sets the zero flag instead of returning a real result (this will probably be invisible to you, or automatically transformed by the intrinsic, but I know MSVC's _BitScanReverse function returns true or false based on whether the instruction found a bit or not).
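Taking those caveats into account, a minimal sketch of the word-at-a-time idea might look like this. It assumes GCC/Clang builtins (__builtin_bswap64, __builtin_clzll) and a little-endian host, and uses memcpy instead of the raw pointer cast to stay clear of alignment and aliasing problems:

    #include <array>
    #include <cstdint>
    #include <cstring>

    // Count leading zero bits of the big-endian 32-byte hash, one 64-bit word
    // at a time. clzll is undefined for zero input, so zero words are skipped.
    static unsigned leadingZeroBits(const std::array<uint8_t, 32>& hash) {
        unsigned total = 0;
        for (std::size_t off = 0; off < hash.size(); off += 8) {
            uint64_t word;
            std::memcpy(&word, hash.data() + off, sizeof word);
            word = __builtin_bswap64(word);   // hash bytes are stored big-endian
            if (word != 0)
                return total + unsigned(__builtin_clzll(word));
            total += 64;
        }
        return total;                         // all 256 bits are zero
    }

    bool checkLeadingZeroBits(const std::array<uint8_t, 32>& hash, unsigned challengeSize) {
        return leadingZeroBits(hash) >= challengeSize;
    }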

Ultimately, you're likely going to gain a marginal improvement in performance, which will likely be counterbalanced by the memory accesses you'll need to do to read data out of the buffer. Overall, it's not really worth it - the compiler will do a much better job of optimizing the loops than you will with all the instructions in the world. Better to write code that's clear and easily maintained than code that's fragile, obtuse, and reliant on processor specific instructions.

Which is a shame because I really like the lzcnt, bsr/bsf, and clz instructions. They can make scanning bitmaps significantly faster, but they aren't really useful here.


8 Comments

Isn't this, strictly speaking, Undefined Behavior? You're reading a uintptr_t where there isn't one. I think you'd need to memcpy and hope the compiler optimizes it away.
Got it, seems like it's worth a try, but not guaranteed to give results. The zero-bits check is used to check hashes in a crypto miner, so it will be called billions of times -- every little piece counts, even a 10% difference is money in the bank.
@FrançoisAndrieux Only by virtue of the pointer potentially lacking alignment and being oddly sized, which is why you need to be extremely careful. It's worth noting that at the assembly level, there's only integers and pointers, and if you're looking to use processor specific instructions you're basically one step up from that world. In fact, that's about the only way to make this code faster - hand code it in assembly.
"A raw byte array can't be scanned at the bit level faster than a byte by byte comparison operation." This is not true. In SSE 4.1 you can compare 16 bytes at a time with SIMD instructions.
@Alex just handle the unaligned bytes at the beginning and end manually. That's what glibc and other C libraries do for operations like memcpy, strlen... Auto-vectorization in compilers also does that: godbolt.org/z/vdKhn6Thz
2

One of the issues here is that the char array is effectively in big-endian format. On some architectures this can be mitigated with a byte swap; on others, by reading 64 bits anyway but concentrating on the LSB bits instead.

    // x = 00 00 00 00 08 FE BB AB   <-- as stored byte-wise in memory (MSB first)
    // y = AB BB FE 08 00 00 00 00   <-- as read from memory on x86 (little-endian)
    bool check(uint8_t const *x, int challenge_count) {
        uint64_t y;
        std::memcpy(&y, x, 8);
        // The leading (big-endian) hash bytes land in the low bits of y, so the
        // low (challenge_count & ~7) bits must all be zero.
        if (y & ~(~0ull << (challenge_count & ~7)))
            return false;
        // ... check the remaining byte (or nibble)
    }

Here, if you need to perform the same comparison many times, it makes sense to cache the mask ~(~0ull << (challenge_count & ~7)). The handling of challenge_count >= 64 is also omitted here; it can be implemented with a memcpy per 64-bit word followed by a comparison to zero.
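For illustration, here is a sketch of how the full 32-byte hash might be handled along these lines (the helper name and structure are mine, not from the answer; it assumes challenge_count <= 256 and a little-endian host reading a big-endian hash):

    #include <cstdint>
    #include <cstring>

    // Whole 64-bit words first, then the remaining whole bytes via the low-bit
    // mask, then the final partial byte.
    bool checkLeadingZeros(const uint8_t* x, unsigned challenge_count) {
        unsigned full_bytes = challenge_count / 8;
        unsigned zero_words = full_bytes / 8;        // 64-bit words that must be 0
        for (unsigned i = 0; i < zero_words; ++i) {
            uint64_t y;
            std::memcpy(&y, x + 8 * i, 8);
            if (y != 0) return false;
        }
        unsigned rem_bytes = full_bytes % 8;         // whole bytes in the next word
        if (rem_bytes) {
            uint64_t y;
            std::memcpy(&y, x + 8 * zero_words, 8);
            if (y & ~(~0ull << (8 * rem_bytes)))     // leading bytes = low bits of y
                return false;
        }
        unsigned rem_bits = challenge_count % 8;     // bits in the final partial byte
        if (rem_bits && (x[full_bytes] >> (8 - rem_bits)) != 0) return false;
        return true;
    }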

1 Comment

In any case, if you need to compare 2^68 hashes on a general-purpose computer, I think you have already lost the game.
0

There are SSE4.2 instructions that can be used to mass-compare bytes and speed up this algorithm. You can compare 16 bytes at a time with one instruction.

https://github.com/WojciechMula/sse4-strstr/blob/master/sse4.2-strstr.cpp
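For the all-zero check needed here, a sketch of the 16-bytes-at-a-time idea could look like this (plain SSE2 is enough for this part; the linked code targets substring search with the SSE4.2 string instructions):

    #include <immintrin.h>
    #include <cstdint>

    // Test whether 16 bytes are all zero with a couple of SIMD instructions.
    bool first16BytesZero(const uint8_t* p) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
        __m128i z = _mm_cmpeq_epi8(v, _mm_setzero_si128());  // 0xFF where byte == 0
        return _mm_movemask_epi8(z) == 0xFFFF;               // all 16 bytes were zero
    }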

1 Comment

For checking against zero, SSE2 pcmpeqb (_mm_cmpeq_epi8) is fine. With _mm_movemask_epi8 / tzcnt, you can find the position of the first non-zero byte, and bit-scan it. SSE4.2 instructions aren't particularly helpful (slower), but yes, SIMD in general is. Especially AVX2 to check all 32 bytes at once. Or even better, SSE4.1 ptest with the right mask to check that the low n bits are all zero.
0

For a loop-invariant n, x86 SSE4.1 ptest can check this in one fairly cheap instruction. Create a mask with bits set in all the positions you require to be zero, and zeros elsewhere. Do this once when n changes. Assuming a use-case like bitcoin, that's rare.

    #include <immintrin.h>  // SSE4.1 intrinsics

    bool no_bits_set(const std::array<uint8_t, 32> &arr, __m128i mask) {
        __m128i v = _mm_loadu_si128((const __m128i*)arr.data());
        return _mm_testz_si128(mask, v);   // true if (mask & v) is all zero
    }

With AVX1, this can test for any n up to 256 with the right mask, or with just SSE4.1 any n up to 128. (Interestingly, _mm256_testz_si256 only requires AVX1, not AVX2)

This is equivalent to Aki's answer, but scalar uint64_t tops out at n = 64, where you'd be testing if (y & -1ULL), i.e. checking all 64 low bits.

If there's anything odd about bit-order or byte-order, that's something you can sort out when making the mask; you don't need to shuffle the data.
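As an illustration, here is one way the mask might be built for a big-endian hash and then tested over the full 32 bytes with the AVX1 variant (helper names are made up):

    #include <immintrin.h>
    #include <array>
    #include <cstdint>

    // Set the first n bits (in the hash's stored byte order, MSB-first) of a
    // 32-byte mask; those are exactly the bits required to be zero.
    // Rebuild only when n changes.
    __m256i make_leading_mask(unsigned n) {
        std::array<uint8_t, 32> m{};                       // zero-initialized
        unsigned full = n / 8, rem = n % 8;
        for (unsigned i = 0; i < full; ++i) m[i] = 0xFF;   // whole leading bytes
        if (rem) m[full] = uint8_t(0xFF << (8 - rem));     // top bits of the next byte
        return _mm256_loadu_si256(reinterpret_cast<const __m256i*>(m.data()));
    }

    bool no_bits_set256(const std::array<uint8_t, 32>& arr, __m256i mask) {
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(arr.data()));
        return _mm256_testz_si256(mask, v);                // true if (v & mask) == 0
    }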


If you want to know the position of the first set bit, _mm_cmpeq_epi8 against _mm_setzero_si128() / _mm_movemask_epi8 gives you a bitmask of the zero bytes; invert it and scan with __builtin_ctz to find the position of the lowest non-zero byte. (If your bytes aren't in little-endian order, shuffle them before movemask.) Then load that byte and bit-scan it.
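A sketch of that last idea, counting the leading zero bits directly (SSE2 plus GCC/Clang __builtin_ctz / __builtin_clz assumed):

    #include <immintrin.h>
    #include <array>
    #include <cstdint>

    // Count leading zero bits of a big-endian 32-byte hash: find the first
    // non-zero byte with SIMD, then bit-scan within that byte.
    unsigned countLeadingZeroBits(const std::array<uint8_t, 32>& arr) {
        for (unsigned off = 0; off < 32; off += 16) {
            __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(arr.data() + off));
            __m128i z = _mm_cmpeq_epi8(v, _mm_setzero_si128());
            unsigned nonzero = ~_mm_movemask_epi8(z) & 0xFFFF;   // bit i set => byte i non-zero
            if (nonzero) {
                unsigned i = __builtin_ctz(nonzero);             // first non-zero byte in memory order
                // clz of an 8-bit value promoted to 32-bit int has 24 extra leading zeros.
                return 8 * (off + i) + (__builtin_clz(arr[off + i]) - 24);
            }
        }
        return 256;   // entire hash is zero
    }

The original check then becomes countLeadingZeroBits(hash) >= challengeSize.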

