Assuming the MASK is a constant 0x04, first read in two masked 32-bit values, as @AProgrammer does:
    uint32_t* dwptr = (uint32_t*)data;
    uint32_t x1 = dwptr[0] & 0x04040404;
    uint32_t x2 = dwptr[1] & 0x04040404;
The relevant bits now sit at bit 2 of each byte of the two variables.
Shift the bits to a convenient position. The first 4 bytes supply the upper 4 bits of the result, so shift x1 from bit 2 to bit 4. x2 supplies the lower 4 bits and needs no shift here; the fact that its flags sit at bit 2 instead of bit 0 is compensated in the next step:
    x1 <<= 2;
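Quadruple those bits by shifting and ORing, so that each flag fills a whole nibble (the upper nibble of each byte in x1, the lower nibble in x2):
    x1 |= x1 << 1;
    x1 |= x1 << 2;
    x2 |= x2 << 1;
    x2 |= x2 >> 2; // x2 started on bit 2 and not bit 0, so it shifts right here; this saved us the shift of x2 in the step above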
Now mask off the unwanted bits, so that each byte keeps only the one bit position it contributes to the result:
    x1 &= 0x10204080;
    x2 &= 0x01020408;
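Build the result by combining all 8 bytes (the pointers must take the address of x1, not its value):
    uint8_t* x1_8 = (uint8_t*)&x1;
    uint16_t* x1_16 = (uint16_t*)&x1;
    x1 |= x2;              // merge upper and lower nibbles
    x1_16[0] |= x1_16[1];  // fold the upper half-word into the lower one
    result = x1_8[0] | x1_8[1];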
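Put together, the steps above might look like the following sketch. The function name `pack_flags` is made up for illustration. It assumes a little-endian target (the mask constants depend on that byte order), and it replaces the pointer casts with `memcpy` and plain shifts to stay clear of alignment and aliasing problems, at the cost of two extra shifts in the final fold.

```c
#include <stdint.h>
#include <string.h>

/* Packs bit 2 (MASK 0x04) of eight bytes into one flag byte:
 * data[0] -> bit 7, data[1] -> bit 6, ..., data[7] -> bit 0.
 * Assumption: little-endian byte order. */
uint8_t pack_flags(const uint8_t data[8])
{
    uint32_t x1, x2;
    memcpy(&x1, data, 4);      /* data[0..3] */
    memcpy(&x2, data + 4, 4);  /* data[4..7] */

    x1 &= 0x04040404u;         /* keep bit 2 of every byte */
    x2 &= 0x04040404u;

    x1 <<= 2;                  /* bit 2 -> bit 4 of every byte */
    x1 |= x1 << 1;             /* bits 4..5 */
    x1 |= x1 << 2;             /* bits 4..7 (upper nibble) */
    x2 |= x2 << 1;             /* bits 2..3 */
    x2 |= x2 >> 2;             /* bits 0..3 (lower nibble) */

    x1 &= 0x10204080u;         /* one distinct bit per byte */
    x2 &= 0x01020408u;

    x1 |= x2;                  /* merge both halves */
    x1 |= x1 >> 16;            /* fold the upper half-word down */
    return (uint8_t)(x1 | (x1 >> 8));  /* fold the remaining two bytes */
}
```

For example, an input where only `data[0]` has bit 2 set gives 0x80, and one where only `data[7]` has it set gives 0x01.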
I wrote this code with many lines to make it understandable, but it should run quite fast - we have only 5 shifts and 9 logic operations altogether for all 8 bits.
You can also try an assembler routine. Assume that
- the result is in e.g. R0.b0 (register 0, byte 0)
- MASKBIT is 2, for MASK being the constant 0x04
- byte0 to byte7 are loaded in registers, e.g. R1.b0, R1.b1, R1.b2, R1.b3, R2.b0, R2.b1, R2.b2, R2.b3, if registers R1 and R2 are used
Then something like:
        ldi result, 0
        qbbc flag0zero, byte0, MASKBIT
        set result, result, 7
    flag0zero:
        qbbc flag1zero, byte1, MASKBIT
        set result, result, 6
    flag1zero:
        qbbc flag2zero, byte2, MASKBIT
        set result, result, 5
    flag2zero:
and so on.
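To check such a routine against a known-good result on a host machine, the same test-and-set sequence can be written in C. This is only a reference sketch; the name `pack_flags_bittest` is hypothetical, and the comments show which assembler instruction each line stands for.

```c
#include <stdint.h>

#define MASKBIT 2  /* MASK is the constant 0x04 */

/* data[0] -> bit 7 of the result, ..., data[7] -> bit 0 */
uint8_t pack_flags_bittest(const uint8_t data[8])
{
    uint8_t result = 0;                          /* ldi result, 0 */
    for (int i = 0; i < 8; i++) {
        /* qbbc jumps over the set when the tested bit is clear */
        if (data[i] & (1u << MASKBIT))
            result |= (uint8_t)(1u << (7 - i));  /* set result, result, 7 - i */
    }
    return result;
}
```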
The PRU can do each of these internal operations in 1 cycle, even the combined bit test and jump. We have 8 bit tests and 8 bit sets. Doing the previous algorithm in assembler could be even a little faster, as the assignments are 'hidden' on a RISC architecture: you can specify the target register separately from the source registers.
Accessing memory or the mentioned hardware peripheral is probably slower than calculating the flags. We are talking about ~20 operations at 200 MHz (5 ns per cycle), that is about 100 ns for the complete flags calculation.
You can enable the cycle counter (https://nerdhut.de/2016/06/18/beaglebone-clock-cycle-counter/) to measure what takes long and which solution is fastest.
*result |= j;? One way to do it without conditionals is to build a 16-bit value with res16 |= (data[i] & mask); res16 <<= 1; and align res16 at the end.
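One possible reading of that suggestion, as a sketch: OR in the masked byte, shift left once per byte, and shift the accumulated value back down at the end. The function name `pack_flags_branchless` and the `maskbit` parameter are assumptions made for this example. It only works for a single-bit mask, which is the case for MASK 0x04.

```c
#include <stdint.h>

/* maskbit is the bit index of the single-bit mask (2 for MASK 0x04).
 * data[0] -> bit 7 of the result, ..., data[7] -> bit 0. */
uint8_t pack_flags_branchless(const uint8_t data[8], unsigned maskbit)
{
    const uint16_t mask = (uint16_t)(1u << maskbit);
    uint16_t res16 = 0;

    for (int i = 0; i < 8; i++) {
        res16 |= (uint16_t)(data[i] & mask);  /* no conditional needed */
        res16 <<= 1;                          /* make room for the next byte */
    }
    /* data[7]'s flag now sits at bit maskbit + 1: align it to bit 0 */
    return (uint8_t)(res16 >> (maskbit + 1));
}
```

With `maskbit` at most 7, the flag of `data[0]` ends up no higher than bit 15, so the 16-bit accumulator never overflows.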