I've been doing ARM asm for a while and am now trying to optimize simple loops with x86 SSSE3 asm. I cannot find a way to convert big endian to little endian.
ARM NEON has a single vector instruction to do exactly this, but SSSE3 does not. I tried using two shifts and an OR, but that seemed to require widening to 32 bits per lane instead of 16 when shifting left by 8 (the data gets saturated).
I looked into PSHUFB, but when I use it, the first half of the result is always 0.
I am using inline asm on x86 for Android. Sorry for any incorrect syntax or other errors; please bear with me, it is hard to rip this out of my code.
```cpp
// Data
uint16_t dataSrc[] = {0x7000, 0x4401, 0x3801, 0xf002, 0x4800,
                      0xb802, 0x1800, 0x3c00, 0xd800, /* ... */};
uint16_t *src = dataSrc;
uint8_t *dst = new uint8_t[16]();
uint8_t *map = new uint8_t[16]{9, 8, 11, 10, 13, 12, 15, 14,
                               1, 0, 3, 2, 5, 4, 7, 6};

// I need to convert 0x7000 to 0x0077 by byte-swapping each 16-bit element, vectorized.
asm volatile (
    "movdqu (%0), %%xmm1\n"
    "pshufb %2, %%xmm1\n"
    "movdqu %%xmm1, (%1)\n"
    : "+r"(src), "+r"(dst), "+r"(map)
    :
    : "memory", "cc", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4"
);
```

If I loop through `dataSrc`, my output for the first 8 elements is:
```
0: 0   1: 0   2: 0   3: 0   4: 72   5: 696   6: 24   7: 60
```

Only the last four are swapped, even if they come out in the wrong order. Why are the first four all zeros? No matter how I change the map, the first is sometimes 0 and the next three are always zero. Am I doing something wrong?
Edit
I figured out why it didn't work: the map was not passed into the inline asm correctly, and I had to free up an input operand for it.
To address the other questions about intrinsics vs. hand-written asm: the code below converts 16-bit video frame data from YUV420P10BE to YUV420P (8-bit). The problem is with the shuffle; if the input were little-endian, I would not need that section.
```cpp
static const char map[16] = {9, 8, 11, 10, 13, 12, 15, 14,
                             1, 0, 3, 2, 5, 4, 7, 6};
int dstStrideOffset = (dstStride - srcStride / 2);
asm volatile (
    "push %%ebp\n"
    // All 0s for packing
    "xorps %%xmm0, %%xmm0\n"
    "movdqu (%5), %%xmm4\n"
    "yloop:\n"
    // Set the counter for the stride
    "mov %2, %%ebp\n"
    "xloop:\n"
    // Load source data
    "movdqu (%0), %%xmm1\n"
    "movdqu 16(%0), %%xmm2\n"
    "add $32, %0\n"
    // The first four 16-bit words come out as 0,0,0,0: this is the issue.
    "pshufb %%xmm4, %%xmm1\n"
    "pshufb %%xmm4, %%xmm2\n"
    // Shift each 16-bit element right by 2 to convert 10-bit to 8-bit
    "psrlw $0x2, %%xmm1\n"
    "psrlw $0x2, %%xmm2\n"
    // Pack both 16-bit vectors into one 8-bit vector
    "packuswb %%xmm0, %%xmm1\n"
    "packuswb %%xmm0, %%xmm2\n"
    "unpcklpd %%xmm2, %%xmm1\n"
    // Write the data
    "movdqu %%xmm1, (%1)\n"
    "add $16, %1\n"
    // End inner loop: x = srcStride; x > 0; x -= 32
    "sub $32, %%ebp\n"
    "jg xloop\n"
    // End outer loop: y = height; y > 0; --y
    "add %4, %1\n"
    "sub $1, %3\n"
    "jg yloop\n"
    "pop %%ebp\n"
    : "+r"(src), "+r"(dst), "+r"(srcStride), "+r"(height), "+r"(dstStrideOffset)
    : "x"(map)
    : "memory", "cc", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4"
);
```

I didn't get around to implementing the shuffle in the intrinsics version yet; it assumes little-endian input:
```cpp
const int dstStrideOffset = (dstStride - srcStride / 2);
__m128i mdata, mdata2;
const __m128i zeros = _mm_setzero_si128();
for (int y = height; y > 0; --y) {
    for (int x = srcStride; x > 0; x -= 32) {
        mdata = _mm_loadu_si128((const __m128i *)src);
        mdata2 = _mm_loadu_si128((const __m128i *)(src + 8));
        mdata = _mm_packus_epi16(_mm_srli_epi16(mdata, 2), zeros);
        mdata2 = _mm_packus_epi16(_mm_srli_epi16(mdata2, 2), zeros);
        // _mm_unpacklo_epi64 merges the two low qwords without the
        // float-domain casts that _mm_unpacklo_pd would require.
        _mm_storeu_si128((__m128i *)dst, _mm_unpacklo_epi64(mdata, mdata2));
        src += 16;
        dst += 16;
    }
    dst += dstStrideOffset;
}
```

Probably not written correctly, but benchmarking on an Android emulator (API 27), x86 (SSSE3 is the highest, i686), with default compiler settings plus added optimizations such as `-Ofast -O3 -funroll-loops -mssse3 -mfpmath=sse` (although they made no difference in performance), on average:
Intrinsics: 1.9-2.1 ms
Hand-written: 0.7-1 ms
Is there a way to speed this up? Maybe I wrote the intrinsics wrong; is it possible to get speeds closer to the hand-written asm with intrinsics?
psrlw/psllw/por. (And seriously, you should translate your x86 asm into intrinsics. If your actual inline asm looks like this, with constraints like that, the compiler will do at least as good a job. Especially if you're literally using `new` to dynamically allocate 16-byte arrays!) The `+` operator doesn't have to compile to an `add` instruction; intrinsics are inputs to LLVM's shuffle optimizer. Anyway, maybe you're not letting the compiler optimize, maybe by using a loop counter where alias analysis fails so you get reloads, IDK? A `size_t` counter or a pointer increment or whatever will help the compiler. Your best-case result is portable, future-proof C++ with intrinsics that compiles to nice asm with current compilers, not hand-written asm that might be good now but might suck on future CPUs.