counting bits across multiple integers... is there a faster way?

Question

I have an array of 4 longs that I am wanting to count the number of set bits within a given range. This is the function that I am currently using (where bitcount(uint64_t) is an inline asm function that gives the number of set bits in the argument):

unsigned count_bits(const uint64_t *long_ptr, size_t begin, size_t end) { uint64_t start_mask = ~((1L << (begin & 63)) - 1); uint64_t end_mask = ((1L << (end & 63)) - 1); if (begin >= 0 && begin < 64) { if (end < 64) { return bitcount(long_ptr[0] & start_mask & end_mask); } else if (end < 128) { return bitcount(long_ptr[0] & start_mask) + bitcount(long_ptr[1] & end_mask); } else if (end < 192) { return bitcount(long_ptr[0] & start_mask) + bitcount(long_ptr[1]) + bitcount(long_ptr[2] & end_mask); } else if (end<256) { return bitcount(long_ptr[0] & start_mask) + bitcount(long_ptr[1]) + bitcount(long_ptr[2]) + bitcount(long_ptr[3] & end_mask); } else { return bitcount(long_ptr[0] & start_mask) + bitcount(long_ptr[1]) + bitcount(long_ptr[2]) + bitcount(long_ptr[3]); } } else if (begin >= 64 && begin < 128) { if (end < 128) { return bitcount(long_ptr[1] & start_mask & end_mask); } else if (end < 192) { return bitcount(long_ptr[1] & start_mask) + bitcount(long_ptr[2] & end_mask); } else if (end < 256) { return bitcount(long_ptr[1] & start_mask) + bitcount(long_ptr[2]) + bitcount(long_ptr[3] & end_mask); } else { return bitcount(long_ptr[1] & start_mask) + bitcount(long_ptr[2]) + bitcount(long_ptr[3]); } } else if (begin >= 128 && begin < 192) { if (end < 192) { return bitcount(long_ptr[2] & start_mask & end_mask); } else if (end < 256) { return bitcount(long_ptr[2] & start_mask) + bitcount(long_ptr[3] & end_mask); } else { return bitcount(long_ptr[2] & start_mask) + bitcount(long_ptr[3]); } } else if (begin<256) { if (end < 256) { return bitcount(long_ptr[3] & start_mask & end_mask); } else { return bitcount(long_ptr[3] & start_mask); } } else { return 0; } }

I am finding that performance of this code is quite good, but I am wondering if there is anything I can do to make it faster, or if a redesign of the algorithm could result in a performance boost.

I think your problem is how to count the bits set within a certain range of bits across multiple integers, rather than within simply 1 integer. Please make that very clear or people will skim your question and make assumptions about it. — Neil Kirk
– Neil Kirk, Commented Dec 7, 2015 at 2:16
Since you seem to be asking for review of existing, working code; codereview.stackexchange.com would be a more appropriate site to post on — M.M
– M.M, Commented Dec 7, 2015 at 2:27
I haven't run this, but if you have a test harness, what about something like: assert(begin <= end && end < 256); uint32_t tb = begin / 64; uint32_t te = end / 64; uint32_t ret = 0; uint64_t r; for (r = long_ptr[tb] & start_mask; tb < te ; tb++, r = long_ptr[tb]) { ret += bitcount(r); } return ret + bitcount(r & end_mask);? While OP code avoids the loop, I believe this has no more compares/jumps. And being smaller, may not pollute the code cache as much. Might be worth benchmarking, anyway. — David Wohlferd
– David Wohlferd, Commented Dec 7, 2015 at 4:48

Vladimir Sakharuk · Accepted Answer · 2016-03-05 20:46:52Z

I have created 2 different versions with zero branching and I believe David Wohlferd comment should be selected due compactness. I do not believe that none branching version would be any faster. Processor branch prediction may eliminate effectively jumps on consistent data. While none branching one will count bits 4 times all the time(unless SSE?). I am publishing here my second(very short) none branch version. First was complicated with bit computations.

unsigned bitcount2(const uint64_t *long_ptr, size_t begin, size_t end) { uint64_t mask[] = { 0, 0, 0, ~((1ULL << (begin & 63)) - 1), -1LL, -1LL, -1LL, ((1ULL << (end & 63)) - 1), 0, 0, 0 }; uint64_t* b_start = mask+(3 - begin / 64); uint64_t* b_end = mask + (7 - end / 64); return bitcount(long_ptr[0] & b_start[0] & b_end[0]) + bitcount(long_ptr[1] & b_start[1] & b_end[1]) + bitcount(long_ptr[2] & b_start[2] & b_end[2]) + bitcount(long_ptr[3] & b_start[3] & b_end[3]); }

The ~((1ULL << (begin & 63)) - 1) can't cross boundaries into a different array element. I don't think this works right when begin >= 64 or when end is outside a limited range. Perhaps you're trying to do address math to get an unaligned window into mask? That's not what your code does, because you're never casting mask to a uint8_t.
@Peter Cordes I did not cover boundary conditions when begin or end above 256. Obviously replacing size_t with uchar would be sufficient. Within 0-256 should work. I did not finish the boundary conditions because performance testing on my old windows box shows that David Wohlferd version is faster for random cases I build.
Oh, I think I see how it works now. I think I missed the fact that you're indexing b_end after setting it based on end. If the same start and end are used often, or with a pattern, then a branchy version will probably do better than this. Otherwise it looks fairly good.
SSSE3 pshufb or AVX2 vpshufb might do well, but getting the masking right is tricky.
I have tested on approximately 10 cases various regions and validated that there are no branching. I used graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable so my performance measurement could be off.

Collectives™ on Stack Overflow

counting bits across multiple integers... is there a faster way?

1 Answer 1

5 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

5 Comments

Linked

Related