
I've been writing ARM assembly for a while and am now trying to optimize simple loops with x86 SSSE3 assembly. I cannot find a way to convert big-endian data to little-endian.

ARM NEON has a single vector instruction (vrev16) that does exactly this, but SSSE3 does not. I tried using two shifts and an OR, but that seemed to require going to 32-bit lanes instead of 16-bit ones when shifting left by 8 (the data got saturated).
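(For reference, a two-shift-and-OR byte swap can stay in 16-bit lanes, since psllw/psrlw are logical shifts that discard shifted-out bits rather than saturating; a minimal intrinsics sketch, not from the original post:)

#include <emmintrin.h>

// Sketch: byte-swap each 16-bit lane with two shifts and an OR.
// _mm_slli_epi16/_mm_srli_epi16 shift within each 16-bit lane and
// discard the shifted-out bits, so no widening to 32-bit is needed.
static inline __m128i bswap16_sse2(__m128i v)
{
    return _mm_or_si128(_mm_slli_epi16(v, 8), _mm_srli_epi16(v, 8));
}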

I looked into PSHUFB, but when I use it, the first half of each 16-bit word is always 0.

I am using inline asm for x86 on Android. Sorry for any incorrect syntax or other errors; please bear with me, it is hard to rip this out of my code.

// Data
uint16_t dataSrc[] = { 0x7000, 0x4401, 0x3801, 0xf002, 0x4800, 0xb802,
                       0x1800, 0x3c00, 0xd800 /* ... */ };
uint16_t *src = dataSrc;
uint8_t *dst = new uint8_t[16]{};
uint8_t *map = new uint8_t[16]{ 9, 8, 11, 10, 13, 12, 15, 14,
                                1, 0, 3, 2, 5, 4, 7, 6 };

// I need to convert 0x7000 to 0x0070 by byte-swapping each 16-bit word, vectorized.
asm volatile (
    "movdqu (%0), %%xmm1\n"
    "pshufb %2, %%xmm1\n"
    "movdqu %%xmm1, (%1)\n"
    : "+r"(src), "+r"(dst), "+r"(map)
    :
    : "memory", "cc", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4"
);

If I loop through dataSrc after the asm block, the output for the first 8 values is:

0: 0  1: 0  2: 0  3: 0  4: 72  5: 696  6: 24  7: 60

Only the last four are swapped, even if they are in the wrong order. Why are the first four all zeros? No matter how I change the map, the first is sometimes 0 and the next three are always zero. Why? Am I doing something wrong?

Edit

I figured out why it didn't work: the map was not being passed into the inline asm correctly, and I had to free up an input operand for it.

To address the questions about intrinsics vs. hand-written asm: the code below converts 10-bit big-endian video frame data (YUV420P10BE) to 8-bit YUV420P. The problem is only with the shuffle; if the input were little-endian, I would not need that section.

static const char map[16] = { 9, 8, 11, 10, 13, 12, 15, 14,
                              1, 0, 3, 2, 5, 4, 7, 6 };
int dstStrideOffset = (dstStride - srcStride / 2);
asm volatile (
    "push %%ebp\n"
    // All 0s for packing
    "xorps %%xmm0, %%xmm0\n"
    "movdqu (%5), %%xmm4\n"
    "yloop:\n"
    // Set the counter for the stride
    "mov %2, %%ebp\n"
    "xloop:\n"
    // Load source data
    "movdqu (%0), %%xmm1\n"
    "movdqu 16(%0), %%xmm2\n"
    "add $32, %0\n"
    // The first four 16-bit words come out as 0,0,0,0 -- this was the issue.
    "pshufb %%xmm4, %%xmm1\n"
    "pshufb %%xmm4, %%xmm2\n"
    // Shift each 16-bit word right by 2 to convert 10 bit to 8 bit
    "psrlw $0x2, %%xmm1\n"
    "psrlw $0x2, %%xmm2\n"
    // Pack both 16-bit vectors into one 8-bit vector
    "packuswb %%xmm0, %%xmm1\n"
    "packuswb %%xmm0, %%xmm2\n"
    "unpcklpd %%xmm2, %%xmm1\n"
    // Write the data
    "movdqu %%xmm1, (%1)\n"
    "add $16, %1\n"
    // End inner loop: x = srcStride; x > 0; x -= 32
    "sub $32, %%ebp\n"
    "jg xloop\n"
    // End outer loop: y = height; y > 0; --y
    "add %4, %1\n"
    "sub $1, %3\n"
    "jg yloop\n"
    "pop %%ebp\n"
    : "+r"(src), "+r"(dst), "+r"(srcStride), "+r"(height), "+r"(dstStrideOffset)
    : "r"(map)
    : "memory", "cc", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4"
);
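(Per 16-bit sample, the whole loop boils down to a byte swap plus a shift; a scalar sketch of the assumed semantics, for reference:)

#include <stdint.h>

// Scalar reference for one sample: byte-swap the big-endian
// 10-bit value, then drop the low 2 bits to get 8 bits.
static inline uint8_t sample10be_to_8(uint16_t be)
{
    uint16_t le = (uint16_t)((be << 8) | (be >> 8)); // byte swap
    return (uint8_t)(le >> 2);                       // 10-bit -> 8-bit
}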

I haven't gotten around to implementing the shuffle with intrinsics yet; this version assumes little-endian input:

const int dstStrideOffset = (dstStride - srcStride / 2);
__m128i mdata, mdata2;
const __m128i zeros = _mm_setzero_si128();
for (int y = height; y > 0; --y) {
    for (int x = srcStride; x > 0; x -= 32) {
        mdata = _mm_loadu_si128((const __m128i *)src);
        mdata2 = _mm_loadu_si128((const __m128i *)(src + 8));
        mdata = _mm_packus_epi16(_mm_srli_epi16(mdata, 2), zeros);
        mdata2 = _mm_packus_epi16(_mm_srli_epi16(mdata2, 2), zeros);
        _mm_storeu_si128((__m128i *)dst,
                         _mm_castpd_si128(_mm_unpacklo_pd(
                             _mm_castsi128_pd(mdata), _mm_castsi128_pd(mdata2))));
        src += 16;
        dst += 16;
    }
    dst += dstStrideOffset;
}

Probably not written correctly, but benchmarking on an Android emulator (API 27), x86 (SSSE3 is the highest available on i686), with default compiler settings plus added optimizations such as -Ofast -O3 -funroll-loops -mssse3 -mfpmath=sse (although they made no difference in performance), I get on average:

Intrinsics: 1.9-2.1 ms
Hand-written: 0.7-1 ms

Is there a way to speed this up? Maybe I wrote the intrinsics wrong; is it possible to get speeds closer to the hand-written version with intrinsics?

  • PSHUFB is in SSSE3, not SSE3. They are different extensions. If you really need to support CPUs without SSSE3, you'll need a workaround for those CPUs, with psrlw / psllw / por. (And seriously, you should translate your x86 asm into intrinsics. If your actual inline asm looks like this with constraints like that, the compiler will do at least as good a job. Especially if you're literally using new to dynamically allocate 16-byte arrays!) Commented Aug 28, 2018 at 21:09
  • Maybe I should upload my entire code set; the code I have above was extracted from the larger asm code I have just to illustrate the issue. That is not the entire call, otherwise yes, I would use intrinsics. Intrinsics, like I said in a comment below, are 3x slower (maybe they're written wrong). I will upload the code later tonight when I get home: 3-4 ms vs 1 ms for hand-written (this does not include any pshufb). So far I have been using psrlw / psllw / por. Commented Aug 28, 2018 at 22:00
  • Oh you meant you got 3x slower code for x86 intrinsics? I assumed you were avoiding intrinsics because of bad experiences with them on ARM, where compilers do a horrible job, vs. mostly very good on x86. On x86, clang especially often does a good job of using better shuffles / blends in the asm than you specified with intrinsics; just like a + operator doesn't have to compile to an add instruction, intrinsics are inputs to LLVM's shuffle optimizer. Anyway, maybe you're not letting the compiler optimize, maybe by using a loop counter where alias analysis fails so you get reloads, IDK? Commented Aug 29, 2018 at 3:51
  • Anyway, look at the compiler-generated asm for your intrinsics, and see what the compiler is missing. Then look at your source and see why it thinks it has to sign-extend or reload something inside the loop or whatever, and use a local variable or a size_t counter or a pointer-increment or whatever will help the compiler. Your best-case result is portable, future-proof C++ with intrinsics that compiles to nice asm with current compilers. Not hand-written asm that might be good now but might suck with future CPUs. Commented Aug 29, 2018 at 3:53

1 Answer


Your code doesn't work because you pass the address of map to pshufb. I'm not sure what code gcc generates for this; I can't imagine that it compiles at all.

It is usually not a good idea to use inline assembly for this sort of thing. Instead, use intrinsic functions:

#include <immintrin.h>

// swap pairs of bytes, and swap 8-byte halves of the vector
void byte_swap(char dst[16], const char src[16])
{
    __m128i msrc, map, mdst;

    msrc = _mm_loadu_si128((const __m128i *)src);
    map = _mm_setr_epi8(9, 8, 11, 10, 13, 12, 15, 14,
                        1, 0, 3, 2, 5, 4, 7, 6);
    mdst = _mm_shuffle_epi8(msrc, map);
    _mm_storeu_si128((__m128i *)dst, mdst);
}
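For example, a quick (hypothetical) test driver using the byte_swap above; note that with this map, 0x7000 ends up at index 4 because the 8-byte halves are swapped too:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint16_t in[8] = { 0x7000, 0x4401, 0x3801, 0xf002,
                       0x4800, 0xb802, 0x1800, 0x3c00 };
    uint16_t out[8];

    byte_swap((char *)out, (const char *)in);
    for (int i = 0; i < 8; i++)
        printf("%d: 0x%04x\n", i, (unsigned)out[i]); // out[4] == 0x0070 (from in[0] == 0x7000)
    return 0;
}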

Apart from being easier to maintain, this optimizes better because, unlike inline assembly, the compiler can introspect intrinsic functions and make informed decisions about which instructions to emit. For example, on an AVX target, it might emit the VEX-encoded vpshufb instead of pshufb to avoid a stall due to an AVX/SSE transition.

If for any reason you cannot use intrinsic functions, use inline assembly like this:

void byte_swap(char dst[16], const char src[16])
{
    typedef long long __m128i_u
        __attribute__ ((__vector_size__ (16), __may_alias__, __aligned__ (1)));
    static const char map[16] = { 9, 8, 11, 10, 13, 12, 15, 14,
                                  1, 0, 3, 2, 5, 4, 7, 6 };
    __m128i_u data = *(const __m128i_u *)src;

    asm ("pshufb %1, %0" : "+x"(data) : "xm"(*(const __m128i_u *)map));
    *(__m128i_u *)dst = data;
}
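Applied to the full conversion loop from your edit, an intrinsics version might look like this. This is an untested sketch: the function name is mine, I've assumed srcStride is in bytes (matching your x -= 32 with 16 samples per iteration), and I've used a plain per-word byte-swap map so the output stays in linear sample order, instead of your half-swapped map plus unpcklpd:

#include <immintrin.h>
#include <stdint.h>

void convert10be_to_8(const uint16_t *src, uint8_t *dst,
                      int srcStride, int dstStride, int height)
{
    const int dstStrideOffset = dstStride - srcStride / 2;
    // Plain per-word byte-swap map (no half swap).
    const __m128i map = _mm_setr_epi8(1, 0, 3, 2, 5, 4, 7, 6,
                                      9, 8, 11, 10, 13, 12, 15, 14);
    for (int y = height; y > 0; --y) {
        for (int x = srcStride; x > 0; x -= 32) {
            __m128i lo = _mm_loadu_si128((const __m128i *)src);
            __m128i hi = _mm_loadu_si128((const __m128i *)(src + 8));
            lo = _mm_srli_epi16(_mm_shuffle_epi8(lo, map), 2); // BE->LE, then 10->8 bit
            hi = _mm_srli_epi16(_mm_shuffle_epi8(hi, map), 2);
            _mm_storeu_si128((__m128i *)dst, _mm_packus_epi16(lo, hi));
            src += 16; // 16 samples = 32 bytes
            dst += 16;
        }
        dst += dstStrideOffset;
    }
}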

9 Comments

I was debating intrinsics, but on ARM the compiler's plain loops were about 20x slower than full NEON assembly (I benchmarked). Testing the same algorithm with intrinsics and loops is about 3x-4x slower: each iteration of my simple loop is 1 ms with asm and 3-4 ms with intrinsics; it might be the LLVM compiler in the NDK. I have used intrinsics in the past on desktop, but that was with Microsoft's Visual Studio compiler. The pshufb instruction worked in your code; I must have made a mistake somewhere. And yes, I miswrote: I meant to have "m"(map) in the inline asm, not "+r".
@user654628 Intrinsics can be slow if you compile without optimizations, so always compile with optimizations turned on. It is also helpful to give the compiler a chance to inline the relevant functions.
@user654628 Perhaps you could post your ARM code with intrinsics in a separate question and we can try to find out why it is slow.
Weird, because Android always compiles with optimizations... I don't have my ARM code with intrinsics anymore; most of it is hand-written assembly now. I could probably rewrite it later.
I am still trying to figure out how to use pshufb properly. I don't know much about x86 asm, but what does typedef long long __m128i_u __attribute__ ((__vector_size__ (16), __may_alias__, __aligned__ (1))); do, and what does using "+x" instead of "+r" do?
