
So I have an array of 8 bytes that I have no control over and can't change the format of directly. This code is a bottleneck for communications to a piece of hardware, so it's important that it be optimal.

My task is to extract 1 byte of useful data, using 1 bit from each of the 8 source bytes. The bit I need to pull out is always at the same offset in every source byte. I build the result byte from most significant to least significant bit.

My solution right now is the following

const uint8_t MASK = 0x04;

void extract(uint8_t* data, uint8_t* result)
{
    // I assume result starts equal to 0
    uint8_t j = 0x80; // Most significant bit first
    for (uint8_t i = 0; i < 8; ++i) {
        // Check if the bit I am interested in is high
        if (data[i] & MASK) {
            // Set the bit in result high
            *result |= j;
        }
        // Move on to the next bit
        j >>= 1;
    }
}

I feel like this is close to optimal but I am not good with bit magic, so I was curious if anyone knew a faster way.

The code is running on the TI-PRU that exists on the AM335X

  • Did you mean *result |= j;? One way to do it without conditionals is to build a 16-bit value with res16 |= (data[i] & mask); res16 <<= 1; and align res16 at the end. Commented Dec 13, 2019 at 17:10
  • Optimise by measuring, which means the only reliable way to determine the fastest way is to implement all ways you can come up with and measure which is the fastest. Without measuring, attempting to speed optimise is futile. Commented Dec 13, 2019 at 17:11
  • It depends on the micro too, some have a multi-bit shift, such a x86, others don't. Commented Dec 13, 2019 at 17:11
  • Your code does not compile, I suppose it's a typo. Commented Dec 13, 2019 at 17:18
  • 2
    AM335X is a 300MHz ARM Cortex-A8, and you think it is "very limited"? I dream of such horsepower! Be more quantitative - how fast does it need to be (in bits-per-second for example)? Because it seems unlikely that this would be a bottleneck for most communication channels. Commented Dec 13, 2019 at 21:30
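The 16-bit accumulator idea from the first comment can be sketched like this (function name hypothetical; assumes MASK = 0x04 as in the question). There are no conditionals: the masked bit is OR-ed in at bit 2 each iteration, shifted up as the loop runs, and aligned once at the end.

```c
#include <assert.h>
#include <stdint.h>

static const uint8_t MASK = 0x04;

/* Branchless variant of the question's loop, per the comment's suggestion:
   accumulate the masked flags in a 16-bit value and align once at the end. */
void extract_shift16(const uint8_t* data, uint8_t* result)
{
    uint16_t res16 = 0;
    for (uint8_t i = 0; i < 8; ++i) {
        res16 <<= 1;               /* make room for the next flag */
        res16 |= (data[i] & MASK); /* flag lands on bit 2 */
    }
    /* data[0]'s flag has been shifted up 7 times, so the eight flags now
       occupy bits 9..2; move them down to bits 7..0. */
    *result = (uint8_t)(res16 >> 2);
}
```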

3 Answers


Let's assume that your processor is a 32-bit one.

void extract_shift(uint8_t* data, uint8_t* result)
{
    uint32_t x1 = (data[0] << 24) | (data[1] << 16) | (data[2] << 8) | data[3];
    uint32_t x2 = (data[4] << 24) | (data[5] << 16) | (data[6] << 8) | data[7];
    x1 &= (MASK << 24) | (MASK << 16) | (MASK << 8) | (MASK);
    x2 &= (MASK << 24) | (MASK << 16) | (MASK << 8) | (MASK);
    x1 = (x1 >> 19) | (x1 >> 12) | (x1 >> 5) | (x1 << 2);
    x2 = (x2 >> 23) | (x2 >> 16) | (x2 >> 9) | (x2 >> 2);
    *result = (x1 | x2);
}

This tries to load the data using 32-bit loads (assuming your processor allows unaligned loads and has the right endianness, or the compiler is able to do the byte swap in a better way; gcc on x86 does that correctly).
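A portable way to express that 32-bit load, without relying on unaligned access being legal or on the pointer cast surviving strict aliasing, is memcpy plus a byte swap. This is a sketch, not from the original answer; it assumes a GCC/Clang-style compiler for `__builtin_bswap32` and the byte-order macros, and compilers typically fold the memcpy into a single load instruction (plus a REV on little-endian ARM).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Portable big-endian 32-bit load: data[0] ends up in the top byte,
   matching the (data[0] << 24) | ... expression in the answer. */
static uint32_t load_be32(const uint8_t* p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v); /* native-endian load, alignment-safe */
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
    v = __builtin_bswap32(v); /* swap on little-endian hosts */
#endif
    return v;
}
```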

Then the masking is done one 32-bit word at a time.

Then the bits are gathered into the low byte (the upper nibble from x1, the lower nibble from x2), and the two are combined. The shifts on x1 and x2 are interleaved to try to limit dependency chains.
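As a sanity check (not part of the original answer), the shift version can be verified against the question's loop over all 256 flag patterns. The masks are written in hex here, and casts are added so the byte shifts are well defined; otherwise the code is the answer's.

```c
#include <assert.h>
#include <stdint.h>

static const uint8_t MASK = 0x04;

/* The question's reference implementation. */
static void extract_ref(const uint8_t* data, uint8_t* result)
{
    uint8_t j = 0x80;
    for (uint8_t i = 0; i < 8; ++i) {
        if (data[i] & MASK) *result |= j;
        j >>= 1;
    }
}

/* The answer's shift version; 0x04040404 equals (MASK << 24) | ... */
static void extract_shift(const uint8_t* data, uint8_t* result)
{
    uint32_t x1 = ((uint32_t)data[0] << 24) | ((uint32_t)data[1] << 16)
                | ((uint32_t)data[2] << 8) | data[3];
    uint32_t x2 = ((uint32_t)data[4] << 24) | ((uint32_t)data[5] << 16)
                | ((uint32_t)data[6] << 8) | data[7];
    x1 &= 0x04040404u;
    x2 &= 0x04040404u;
    x1 = (x1 >> 19) | (x1 >> 12) | (x1 >> 5) | (x1 << 2);
    x2 = (x2 >> 23) | (x2 >> 16) | (x2 >> 9) | (x2 >> 2);
    *result = (uint8_t)(x1 | x2);
}

/* Exhaustive check: every possible pattern of the eight flag bits. */
static int check_extract_shift(void)
{
    for (int p = 0; p < 256; ++p) {
        uint8_t data[8], a = 0, b = 0;
        for (int i = 0; i < 8; ++i)
            data[i] = (p & (0x80 >> i)) ? MASK : 0;
        extract_ref(data, &a);
        extract_shift(data, &b);
        if (a != b) return 0;
    }
    return 1;
}
```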

Assuming your machine has a hardware multiplier, we can try to use it. How? A multiplication is a combination of left shifts. But here we have both left and right shifts. So let's build the result in the most significant byte, and then shift it back into place:

void extract_premul(uint8_t* data, uint8_t* result)
{
    uint32_t x1 = (data[0] << 24) | (data[1] << 16) | (data[2] << 8) | data[3];
    uint32_t x2 = (data[4] << 24) | (data[5] << 16) | (data[6] << 8) | data[7];
    x1 &= (MASK << 24) | (MASK << 16) | (MASK << 8) | (MASK);
    x2 &= (MASK << 24) | (MASK << 16) | (MASK << 8) | (MASK);
    x1 = (x1 << 5) | (x1 << 12) | (x1 << 19) | (x1 << 26);
    x2 = (x2 << 1) | (x2 << 8) | (x2 << 15) | (x2 << 22);
    *result = (x1 | x2) >> 24;
}

Now we can use multiplications; expressing them in binary helps to understand the relationship with the version above.

void extract_mul(uint8_t* data, uint8_t* result)
{
    uint32_t x1 = (data[0] << 24) | (data[1] << 16) | (data[2] << 8) | data[3];
    uint32_t x2 = (data[4] << 24) | (data[5] << 16) | (data[6] << 8) | data[7];
    x1 &= (MASK << 24) | (MASK << 16) | (MASK << 8) | (MASK);
    x2 &= (MASK << 24) | (MASK << 16) | (MASK << 8) | (MASK);
    //        3         2         1
    //       10987654321098765432109876543210
    x1 *= 0b100000010000001000000100000;
    x2 *= 0b10000001000000100000010;
    *result = (x1 | x2) >> 24;
}

The relative performance of the two (pipelinable) multiplications compared to the sets of shifts depends on your hardware.
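A similar exhaustive check (again not from the original answer) for the multiply version. The hex constants are the same values as the binary literals above, with multiplier bits at positions 5, 12, 19, 26 and 1, 8, 15, 22; the test data carries noise in the non-flag bits to confirm the masking removes it.

```c
#include <assert.h>
#include <stdint.h>

static const uint8_t MASK = 0x04;

/* Reference loop from the question. */
static void extract_ref(const uint8_t* data, uint8_t* result)
{
    uint8_t j = 0x80;
    for (uint8_t i = 0; i < 8; ++i) {
        if (data[i] & MASK) *result |= j;
        j >>= 1;
    }
}

/* The multiply version, with hex constants equal to the binary literals. */
static void extract_mul(const uint8_t* data, uint8_t* result)
{
    uint32_t x1 = ((uint32_t)data[0] << 24) | ((uint32_t)data[1] << 16)
                | ((uint32_t)data[2] << 8) | data[3];
    uint32_t x2 = ((uint32_t)data[4] << 24) | ((uint32_t)data[5] << 16)
                | ((uint32_t)data[6] << 8) | data[7];
    x1 &= 0x04040404u;
    x2 &= 0x04040404u;
    x1 *= 0x04081020u; /* bits 26, 19, 12, 5 */
    x2 *= 0x00408102u; /* bits 22, 15, 8, 1 */
    *result = (uint8_t)((x1 | x2) >> 24);
}

/* Check all 256 flag patterns, with the other bits of each byte noisy. */
static int check_extract_mul(void)
{
    for (int p = 0; p < 256; ++p) {
        uint8_t data[8], a = 0, b = 0;
        for (int i = 0; i < 8; ++i)
            data[i] = (p & (0x80 >> i)) ? 0xFF : 0xFB; /* bit 2 set / clear */
        extract_ref(data, &a);
        extract_mul(data, &b);
        if (a != b) return 0;
    }
    return 1;
}
```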


3 Comments

This is exactly the kind of out there solution I was looking for. Obviously I'll have to play around with different versions of your answer but I'm confident some combination is going to yield better performance. Really nice work! Thanks
@MatthewNichols You can also investigate doing the same thing directly in 64 bits if your processor is a 64-bit one.
The PRU is a 32-bit processor, it has a MAC (Multiply-Accumulate) accelerator (kind of coprocessor)

The code presented is efficient enough, but if you are interested in alternatives: first, you can get rid of the loop by manually unrolling it; second, you can replace the if logic with some bit-twiddling:

j = (!!(data[0] & MASK)) << 7;
j |= (!!(data[1] & MASK)) << 6;
...
j |= (!!(data[6] & MASK)) << 1;
j |= (!!(data[7] & MASK));

Again, I do not think the produced code will be any better than the original with optimizations enabled.
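For reference, the elided middle lines written out in full, wrapped in a function (the name is hypothetical; the pattern of the skipped lines is unambiguous from the first and last ones):

```c
#include <assert.h>
#include <stdint.h>

static const uint8_t MASK = 0x04;

/* Fully unrolled, branch-free extraction: !! turns each masked byte
   into 0 or 1, which is then shifted to its position in the result. */
static uint8_t extract_unrolled(const uint8_t* data)
{
    uint8_t j;
    j  = (uint8_t)(!!(data[0] & MASK) << 7);
    j |= (uint8_t)(!!(data[1] & MASK) << 6);
    j |= (uint8_t)(!!(data[2] & MASK) << 5);
    j |= (uint8_t)(!!(data[3] & MASK) << 4);
    j |= (uint8_t)(!!(data[4] & MASK) << 3);
    j |= (uint8_t)(!!(data[5] & MASK) << 2);
    j |= (uint8_t)(!!(data[6] & MASK) << 1);
    j |= (uint8_t)(!!(data[7] & MASK));
    return j;
}
```

Since MASK is a single bit, `((data[i] >> 2) & 1)` would work equally well in place of the `!!`.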

14 Comments

I'm not sure if this code is better than the original one. Have a look here it's really interesting. But after all maybe it's faster, because there are no conditional jumps
@Jabberwocky doesn't even a single ! make a conditional? ! is not ~.
@Jabberwocky Yes, I agree and noted this in the answer. Right, looks like my code is branchless in this specific case, but might be different for other architectures
@WeatherVane You can replace !! with cast to bool with the same effect.
@WeatherVane !! ensures the result is either 0 or 1.
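A small illustration of the point in this thread: `!!x` normalizes any nonzero value to exactly 1 (unlike `~`, which is bitwise complement), so the subsequent shift places exactly one bit. The helper name is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative helper: 0 or (1 << pos), depending on whether any
   bit of mask is set in byte. On most ISAs this compiles to a
   compare/set sequence rather than a branch. */
static uint8_t flag_bit(uint8_t byte, uint8_t mask, uint8_t pos)
{
    return (uint8_t)(!!(byte & mask) << pos);
}
```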

Assuming the MASK is a constant 0x04:

Then read in two masked 32-bit values, like @AProgrammer:

uint32_t* dwptr = (uint32_t*)data;
uint32_t x1 = dwptr[0] & 0x04040404;
uint32_t x2 = dwptr[1] & 0x04040404;

The relevant bits are set on bit 2 of each byte of the 2 variables.

Shift the bits to a convenient position (the first 4 bytes supply the upper 4 bits of the result, so shift x1's flags from bit 2 up to bit 4; x2 already targets the lower 4 bits, and the fact that its flags sit on bit 2 instead of bit 0 is compensated for two blocks below):

x1 <<= 2; 

Quadruple those bits by shifting to the left and ORing:

x1 |= x1 << 1;
x1 |= x1 << 2;
x2 |= x2 << 1;
x2 |= x2 >> 2; // we started on bit 2 and not bit 0 for x2 - saved us the shift of x2 in the block above

Now remove the unwanted ones:

x1 &= 0x10204080;
x2 &= 0x01020408;

Build the result (combine all 8 bytes):

uint8_t* x1_8 = (uint8_t*)&x1;
uint16_t* x1_16 = (uint16_t*)&x1;
x1 |= x2;
x1_16[0] |= x1_16[1];
result = x1_8[0] | x1_8[1];

I wrote this code with many lines to make it understandable, but it should run quite fast - we have only 5 shifts and 9 logic operations altogether for all 8 bits.
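The steps above collected into one compilable function (a sketch, not the answer's exact code). The 32-bit loads use memcpy instead of the pointer cast to stay legal under strict aliasing, and the final byte/halfword trickery is replaced with two folding shifts; it assumes a little-endian host like the PRU, where data[0] lands in the low byte of x1.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Quadruple-and-mask extraction, little-endian layout assumed. */
static uint8_t extract_quad(const uint8_t* data)
{
    uint32_t x1, x2;
    memcpy(&x1, data, 4);     /* data[0..3], data[0] in the low byte */
    memcpy(&x2, data + 4, 4); /* data[4..7] */
    x1 &= 0x04040404u;        /* keep bit 2 of each byte */
    x2 &= 0x04040404u;
    x1 <<= 2;                 /* move x1's flags from bit 2 to bit 4 */
    x1 |= x1 << 1;            /* quadruple each flag bit... */
    x1 |= x1 << 2;
    x2 |= x2 << 1;            /* ...x2 spreads both ways, since it */
    x2 |= x2 >> 2;            /* stayed on bit 2 instead of bit 0 */
    x1 &= 0x10204080u;        /* one surviving bit per byte */
    x2 &= 0x01020408u;
    uint32_t x = x1 | x2;     /* fold 32 -> 16 -> 8 bits */
    x |= x >> 16;
    x |= x >> 8;
    return (uint8_t)x;
}
```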


You can also try an assembler routine, something like

  • The result being e.g. R0.b0 (register 0, byte 0)
  • the MASKBIT being 2, for MASK being a constant 0x04
  • byte0/byte1/byte2 being loaded in registers, e.g. R1.b0, R1.b1, R1.b2, R1.b3, R2.b0, R2.b1, R2.b2, R2.b3, if loaded in registers R1 and R2
        ldi  result, 0
        qbbc flag0zero, byte0, MASKBIT
        set  result, result, 7
flag0zero:
        qbbc flag1zero, byte1, MASKBIT
        set  result, result, 6
flag1zero:
        qbbc flag2zero, byte2, MASKBIT
        set  result, result, 5
flag2zero:

and so on.

The PRU can do all these internal operations in 1 cycle, even the combined bit-tests/jumps. We have 8 bit-tests and 8 bit-sets. Doing the previous algorithm in assembler could be even a little faster, as the assignments are 'hidden': on a RISC architecture you can specify the target register separately from the source registers.


Probably accessing memory or the mentioned hardware peripheral is slower than the calculation of the flags. We are talking about ~20 operations at 200 MHz, i.e. 100 ns for the complete flags calculation.

You can enable the cycle counter (https://nerdhut.de/2016/06/18/beaglebone-clock-cycle-counter/) to measure, what takes long and what solution is fastest.

Comments
