5

I have an array of 4 longs that I am wanting to count the number of set bits within a given range. This is the function that I am currently using (where bitcount(uint64_t) is an inline asm function that gives the number of set bits in the argument):

unsigned count_bits(const uint64_t *long_ptr, size_t begin, size_t end) { uint64_t start_mask = ~((1L << (begin & 63)) - 1); uint64_t end_mask = ((1L << (end & 63)) - 1); if (begin >= 0 && begin < 64) { if (end < 64) { return bitcount(long_ptr[0] & start_mask & end_mask); } else if (end < 128) { return bitcount(long_ptr[0] & start_mask) + bitcount(long_ptr[1] & end_mask); } else if (end < 192) { return bitcount(long_ptr[0] & start_mask) + bitcount(long_ptr[1]) + bitcount(long_ptr[2] & end_mask); } else if (end<256) { return bitcount(long_ptr[0] & start_mask) + bitcount(long_ptr[1]) + bitcount(long_ptr[2]) + bitcount(long_ptr[3] & end_mask); } else { return bitcount(long_ptr[0] & start_mask) + bitcount(long_ptr[1]) + bitcount(long_ptr[2]) + bitcount(long_ptr[3]); } } else if (begin >= 64 && begin < 128) { if (end < 128) { return bitcount(long_ptr[1] & start_mask & end_mask); } else if (end < 192) { return bitcount(long_ptr[1] & start_mask) + bitcount(long_ptr[2] & end_mask); } else if (end < 256) { return bitcount(long_ptr[1] & start_mask) + bitcount(long_ptr[2]) + bitcount(long_ptr[3] & end_mask); } else { return bitcount(long_ptr[1] & start_mask) + bitcount(long_ptr[2]) + bitcount(long_ptr[3]); } } else if (begin >= 128 && begin < 192) { if (end < 192) { return bitcount(long_ptr[2] & start_mask & end_mask); } else if (end < 256) { return bitcount(long_ptr[2] & start_mask) + bitcount(long_ptr[3] & end_mask); } else { return bitcount(long_ptr[2] & start_mask) + bitcount(long_ptr[3]); } } else if (begin<256) { if (end < 256) { return bitcount(long_ptr[3] & start_mask & end_mask); } else { return bitcount(long_ptr[3] & start_mask); } } else { return 0; } } 

I am finding that performance of this code is quite good, but I am wondering if there is anything I can do to make it faster, or if a redesign of the algorithm could result in a performance boost.

11
  • Fastest way to count bits Commented Dec 7, 2015 at 2:15
  • 1
    I think your problem is how to count the bits set within a certain range of bits across multiple integers, rather than within simply 1 integer. Please make that very clear or people will skim your question and make assumptions about it. Commented Dec 7, 2015 at 2:16
  • 1
    The mask expressions should start with 1ULL, not 1L Commented Dec 7, 2015 at 2:25
  • 2
    Since you seem to be asking for review of existing, working code; codereview.stackexchange.com would be a more appropriate site to post on Commented Dec 7, 2015 at 2:27
  • 2
    I haven't run this, but if you have a test harness, what about something like: assert(begin <= end && end < 256); uint32_t tb = begin / 64; uint32_t te = end / 64; uint32_t ret = 0; uint64_t r; for (r = long_ptr[tb] & start_mask; tb < te ; tb++, r = long_ptr[tb]) { ret += bitcount(r); } return ret + bitcount(r & end_mask);? While OP code avoids the loop, I believe this has no more compares/jumps. And being smaller, may not pollute the code cache as much. Might be worth benchmarking, anyway. Commented Dec 7, 2015 at 4:48

1 Answer 1

2

I have created 2 different versions with zero branching and I believe David Wohlferd comment should be selected due compactness. I do not believe that none branching version would be any faster. Processor branch prediction may eliminate effectively jumps on consistent data. While none branching one will count bits 4 times all the time(unless SSE?). I am publishing here my second(very short) none branch version. First was complicated with bit computations.

unsigned bitcount2(const uint64_t *long_ptr, size_t begin, size_t end) { uint64_t mask[] = { 0, 0, 0, ~((1ULL << (begin & 63)) - 1), -1LL, -1LL, -1LL, ((1ULL << (end & 63)) - 1), 0, 0, 0 }; uint64_t* b_start = mask+(3 - begin / 64); uint64_t* b_end = mask + (7 - end / 64); return bitcount(long_ptr[0] & b_start[0] & b_end[0]) + bitcount(long_ptr[1] & b_start[1] & b_end[1]) + bitcount(long_ptr[2] & b_start[2] & b_end[2]) + bitcount(long_ptr[3] & b_start[3] & b_end[3]); } 
Sign up to request clarification or add additional context in comments.

5 Comments

The ~((1ULL << (begin & 63)) - 1) can't cross boundaries into a different array element. I don't think this works right when begin >= 64 or when end is outside a limited range. Perhaps you're trying to do address math to get an unaligned window into mask? That's not what your code does, because you're never casting mask to a uint8_t.
@Peter Cordes I did not cover boundary conditions when begin or end above 256. Obviously replacing size_t with uchar would be sufficient. Within 0-256 should work. I did not finish the boundary conditions because performance testing on my old windows box shows that David Wohlferd version is faster for random cases I build.
Oh, I think I see how it works now. I think I missed the fact that you're indexing b_end after setting it based on end. If the same start and end are used often, or with a pattern, then a branchy version will probably do better than this. Otherwise it looks fairly good.
SSSE3 pshufb or AVX2 vpshufb might do well, but getting the masking right is tricky.
I have tested on approximately 10 cases various regions and validated that there are no branching. I used graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable so my performance measurement could be off.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.