
I'm trying to efficiently execute the following task:

    INPUT VALUE: 01101011
    MASK:        00110010
    MASK RESULT: --10--1-
    AGGREGATED:  00000101

I hope this example explains clearly what I'm trying to achieve. What's the best way to do this in a non-naive way?

1 Answer

This operation is called compress_right, or just compress, and it is moderately terrible to implement without hardware support. The non-naive code from Hacker's Delight, section 7-4 "Compress, or Generalized Extract", to implement this function is:

    unsigned compress(unsigned x, unsigned m) {
        unsigned mk, mp, mv, t;
        int i;

        x = x & m;           // Clear irrelevant bits.
        mk = ~m << 1;        // We will count 0's to right.
        for (i = 0; i < 5; i++) {
            mp = mk ^ (mk << 1);            // Parallel suffix.
            mp = mp ^ (mp << 2);
            mp = mp ^ (mp << 4);
            mp = mp ^ (mp << 8);
            mp = mp ^ (mp << 16);
            mv = mp & m;                    // Bits to move.
            m = m ^ mv | (mv >> (1 << i));  // Compress m.
            t = x & mv;
            x = x ^ t | (t >> (1 << i));    // Compress x.
            mk = mk & ~mp;
        }
        return x;
    }
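As a quick sanity check against the example in the question (reading the input and mask as the hex values 0x6B and 0x32, which is my interpretation):

    compress(0x6B, 0x32);   // 0x6B = 01101011, 0x32 = 00110010 -> returns 5 = 00000101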

BMI2 (implemented in Haswell and later) has the instruction pext for exactly this operation.
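If BMI2 is available, a minimal sketch is just a wrapper around the intrinsic (this assumes a compiler that exposes _pext_u32 and a build with BMI2 enabled, e.g. -mbmi2; add a runtime CPUID check if the target CPU is not guaranteed to have it):

    #include <immintrin.h>

    // Sketch: compress via BMI2 pext; requires a BMI2-capable CPU (Haswell or later).
    unsigned compress_pext(unsigned x, unsigned m) {
        return _pext_u32(x, m);   // bits of x selected by m, packed to the right
    }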


If the mask is a constant (or not a constant, but reused multiple times), a relatively obvious optimization is pre-calculating the 5 values that mv takes during the loop. The calculation of mv does not depend on x, so it can be done independently, like this (really the same algorithm as above):

    mk = ~m << 1;
    for (i = 0; i < 5; i++) {
        mp = mk ^ (mk << 1);
        mp = mp ^ (mp << 2);
        mp = mp ^ (mp << 4);
        mp = mp ^ (mp << 8);
        mp = mp ^ (mp << 16);
        mv = mp & m;
        mask[i] = mv;
        m = m ^ mv | (mv >> (1 << i));
        mk = mk & ~mp;
    }

Still looks complicated, but everything here is a constant, so it can be pre-computed (if the compiler can't do it, you can, simply by running it once and pasting the result into the code). The "real part" of the code, the part that actually has to run at runtime, is this:

    x = x & m;
    t = x & mask[0];  x = x ^ t | (t >> 1);
    t = x & mask[1];  x = x ^ t | (t >> 2);
    t = x & mask[2];  x = x ^ t | (t >> 4);
    t = x & mask[3];  x = x ^ t | (t >> 8);
    t = x & mask[4];  x = x ^ t | (t >> 16);

(this is also in Hacker's Delight, formatted a little differently)
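To make that concrete with the mask from your question (m = 00110010): working the precomputation through by hand, I get mask[0] = 00110010, mask[1] = 00011000 and mask[2] = mask[3] = mask[4] = 0, so only the first two steps actually move anything:

    x = x & 0x32;                         // m = 00110010
    t = x & 0x32;  x = x ^ t | (t >> 1);  // mask[0]
    t = x & 0x18;  x = x ^ t | (t >> 2);  // mask[1]
    // the remaining three steps are no-ops; for x = 01101011 this ends at 00000101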

Many cases can be simpler again, for example:

  • if m = 0, the result is 0.
  • if m = -1, the result is x.
  • if m = 1, the result is x & 1.
  • if m = ((1 << n) - 1) << k, the result is (x >> k) & m.
  • if m = 0x80000000, the result is x >> 31.
  • if m is another power of two, the result is (x >> numberOfTrailingZeros(m)) & 1.
  • if m is alternating, the "perfect unshuffle algorithm" can be used.
  • if m consists of a few "groups", the "bit group moving" algorithm can be used (i.e. mask out a group, shift it into place (or shift first and mask second), and OR all the shifted groups together; more sophisticated approaches exist). This is probably the most important case in practice.
  • ...

For example, the mask from your question would fall in the "bit group moving" case, with code like this:

    return ((x >> 1) & 1) | ((x >> 3) & 6);
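For the input value from your question (01101011) this evaluates to 1 | 4 = 00000101, matching the expected aggregated value.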

3 Comments

@FilippoBistaffa it can be substantially optimized if the mask is a constant (or a loop constant), by the way.
Yes, in my scenario it would be a constant, but I think this kind of optimisation is automatically done by the compiler. Or is it better to explicitly do it?
@FilippoBistaffa it might, but I wouldn't count on it, and the code for a constant mask is simpler anyway. I'll edit in the general optimization, but for many masks the code can be yet simpler again. Reliably finding the best code for any given mask is an open problem as far as I know (though many special cases are known).
