
I'm trying to efficiently execute the following task:

    INPUT VALUE: 01101011
    MASK:        00110010
    MASK RESULT: --10--1-
    AGGREGATED:  00000101

I hope this example explains clearly what I'm trying to achieve. What's the best way to do this in a non-naive way?

1 Answer

This operation is called compress_right, or just compress, and it is moderately terrible to implement without hardware support. The non-naive code from Hacker's Delight, section 7-4 "Compress, or Generalized Extract", to implement this function is:

    unsigned compress(unsigned x, unsigned m) {
        unsigned mk, mp, mv, t;
        int i;

        x = x & m;           // Clear irrelevant bits.
        mk = ~m << 1;        // We will count 0's to right.
        for (i = 0; i < 5; i++) {
            mp = mk ^ (mk << 1);            // Parallel suffix.
            mp = mp ^ (mp << 2);
            mp = mp ^ (mp << 4);
            mp = mp ^ (mp << 8);
            mp = mp ^ (mp << 16);
            mv = mp & m;                    // Bits to move.
            m = m ^ mv | (mv >> (1 << i));  // Compress m.
            t = x & mv;
            x = x ^ t | (t >> (1 << i));    // Compress x.
            mk = mk & ~mp;
        }
        return x;
    }
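As a quick sanity check against the example in the question (reading the input and mask as the hex values 0x6B and 0x32, which is my interpretation):

    compress(0x6B, 0x32);   // 0x6B = 01101011, 0x32 = 00110010 -> returns 5 = 00000101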

BMI2 (implemented in Haswell and later) has the instruction pext for exactly this operation.
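If BMI2 is available, a minimal sketch is just a wrapper around the intrinsic (this assumes a compiler that exposes _pext_u32 and a build with BMI2 enabled, e.g. -mbmi2; add a runtime CPUID check if the target CPU is not guaranteed to have it):

    #include <immintrin.h>

    // Sketch: compress via BMI2 pext; requires a BMI2-capable CPU (Haswell or later).
    unsigned compress_pext(unsigned x, unsigned m) {
        return _pext_u32(x, m);   // bits of x selected by m, packed to the right
    }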


If the mask is a constant (or not a constant, but reused multiple times), a relatively obvious optimization is pre-calculating the 5 values that mv takes during the loop. The calculation of mv does not depend on x, so it can be done independently, like this (really the same algorithm as above):

    mk = ~m << 1;
    for (i = 0; i < 5; i++) {
        mp = mk ^ (mk << 1);
        mp = mp ^ (mp << 2);
        mp = mp ^ (mp << 4);
        mp = mp ^ (mp << 8);
        mp = mp ^ (mp << 16);
        mv = mp & m;
        mask[i] = mv;
        m = m ^ mv | (mv >> (1 << i));
        mk = mk & ~mp;
    }

Still looks complicated, but everything here is a constant, so it can be pre-computed (if the compiler can't do it, you can, simply by running it once and pasting the result into the code). The "real part" of the code, the part that actually has to run at runtime, is this:

    x = x & m;
    t = x & mask[0];  x = x ^ t | (t >> 1);
    t = x & mask[1];  x = x ^ t | (t >> 2);
    t = x & mask[2];  x = x ^ t | (t >> 4);
    t = x & mask[3];  x = x ^ t | (t >> 8);
    t = x & mask[4];  x = x ^ t | (t >> 16);

(this is also in Hacker's Delight, formatted a little differently)
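To make that concrete with the mask from your question (m = 00110010): working the precomputation through by hand, I get mask[0] = 00110010, mask[1] = 00011000 and mask[2] = mask[3] = mask[4] = 0, so only the first two steps actually move anything:

    x = x & 0x32;                         // m = 00110010
    t = x & 0x32;  x = x ^ t | (t >> 1);  // mask[0]
    t = x & 0x18;  x = x ^ t | (t >> 2);  // mask[1]
    // the remaining three steps are no-ops; for x = 01101011 this ends at 00000101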

Many cases can be simpler again, for example:

  • if m = 0, the result is 0.
  • if m = -1, the result is x.
  • if m = 1, the result is x & 1.
  • if m = ((1 << n) - 1) << k, the result is (x >> k) & m.
  • if m = 0x80000000, the result is x >> 31.
  • if m is another power of two, the result is (x >> numberOfTrailingZeros(m)) & 1.
  • if m is alternating, the "perfect unshuffle algorithm" can be used.
  • if m consists of a few "groups", the "bit group moving" algorithm can be used (i.e. mask out a group, shift it into place (or shift first and mask second), and OR all the shifted groups together; more sophisticated approaches exist). This is probably the most important case in practice.
  • ...

For example, the mask from your question would fall in the "bit group moving" case, with code like this:

    return ((x >> 1) & 1) | ((x >> 3) & 6);
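For the input value from your question (01101011) this evaluates to 1 | 4 = 00000101, matching the expected aggregated value.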

3 Comments

@FilippoBistaffa it can be substantially optimized if the mask is a constant (or a loop constant), by the way.
Yes, in my scenario it would be a constant, but I think this kind of optimisation is automatically done by the compiler. Or is it better to explicitly do it?
@FilippoBistaffa it might, but I wouldn't count on it, and the code for a constant mask is simpler anyway. I'll edit in the general optimization, but for many masks the code can be yet simpler again. Reliably finding the best code for any given mask is an open problem as far as I know (though many special cases are known).
