_mm256_slli_si256: error "last argument must be an 8-bit immediate"

Question

I have the following problem (g++ (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4):

When I use _mm256_slli_si256() directly, such as:

__m256i x = _mm256_set1_epi8(0xff); x = _mm256_slli_si256(x, 3);

the code compiles without problem (g++ -Wall -march=native -O3 -o shifttest shifttest.C).

However, if I wrap it into a function

__m256i doit(__m256i x, const int imm) { return _mm256_slli_si256(x, imm); }

the compiler complains that

/usr/lib/gcc/x86_64-linux-gnu/4.8/include/avx2intrin.h: In function '__m256i doit(__m256i, int)':
/usr/lib/gcc/x86_64-linux-gnu/4.8/include/avx2intrin.h:651:58: error: the last argument must be an 8-bit immediate
return (__m256i)__builtin_ia32_pslldqi256 (__A, __N * 8);

regardless of whether the function is used or not.

This can't be a problem with the immediate operand, since the function doit() compiles if I use e.g. _mm256_slli_epi32(x, imm) instead, and _mm256_slli_epi32() also requires an immediate operand.

There is a related bug report on

https://gcc.gnu.org/bugzilla/show_bug.cgi?format=multiple&id=54825

but it is quite old (2012) and relates to gcc 4.8.0, so I thought the patch would have been incorporated into g++ 4.8.4 already.

Is there a workaround for this problem?

Same for _mm256_alignr_epi8(), by the way. So no workaround using that one...
– Ralf, Commented Jul 9, 2015 at 12:46
And _mm256_setr_m128i(), which would help with a workaround using 128-bit shifts, is missing completely. Oh, and the same problem as described above occurs with _mm_slli_si128(), so that doesn't work either. Something about this __N * 8 seems to confuse the compiler.
– Ralf, Commented Jul 9, 2015 at 13:18
There are two versions of the shift intrinsics for each instruction, one with an immediate arg, and the other with the shift count for all elements in the low bits of an xmm register. The two versions share an asm mnemonic, but are different. (AVX2 also introduced variable-shift instructions that take the shift count for each element separately, from the corresponding element in the shift-count register. Those instructions have a different asm mnemonic, as well as a different intrinsic function name.) Oops, there's no variable-count shift-whole-reg-by-bytes, nvm.
– Peter Cordes, Commented Jul 9, 2015 at 20:52

Jason R · Accepted Answer · 2015-07-09 13:14:54Z

The argument giving the number of bytes to shift must be a compile-time constant, because it is encoded as an immediate value in the instruction (that is, the shift count is part of the instruction encoding rather than being loaded from a register). As long as you use it directly, like this:

__m256i x = _mm256_set1_epi8(0xff); x = _mm256_slli_si256(x, 3);

then the compiler sees the shift value as a compile-time constant, 3. However, when in the context of your wrapping function:

__m256i doit(__m256i x, const int imm) { return _mm256_slli_si256(x, imm); }

there is no way for the compiler to infer the value of imm at compile time, which is required in order for it to synthesize the shift instruction. The fact that imm is a const int doesn't mean that its value is known at compile time, only that the semantics of the language don't allow it to be modified within the doit() function scope.

If the compiler inlines doit() into a call site where imm is a constant, it may be able to determine the value of imm statically and therefore compile successfully, but relying on that may be going too far out on a limb.
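If the shift count really is only known at run time, one workaround that does not depend on inlining is to branch on the value and call the intrinsic with a literal in each branch. Below is a minimal sketch, written with the 128-bit _mm_slli_si128 (which the question's comments report as behaving the same way) so that it builds on any x86-64 target without extra flags. The name shift_left_bytes_runtime is hypothetical, and treating out-of-range counts as "everything shifted out" is an assumption of this sketch:

```cpp
#include <emmintrin.h>  // SSE2: _mm_slli_si128, _mm_setzero_si128

// Hypothetical helper: byte shift with a run-time count.
// Every call to the intrinsic uses a literal, so the compiler always
// sees a compile-time constant for the immediate operand.
static inline __m128i shift_left_bytes_runtime(__m128i x, int n) {
    switch (n) {
        case 0:  return x;
        case 1:  return _mm_slli_si128(x, 1);
        case 2:  return _mm_slli_si128(x, 2);
        case 3:  return _mm_slli_si128(x, 3);
        case 4:  return _mm_slli_si128(x, 4);
        case 5:  return _mm_slli_si128(x, 5);
        case 6:  return _mm_slli_si128(x, 6);
        case 7:  return _mm_slli_si128(x, 7);
        case 8:  return _mm_slli_si128(x, 8);
        case 9:  return _mm_slli_si128(x, 9);
        case 10: return _mm_slli_si128(x, 10);
        case 11: return _mm_slli_si128(x, 11);
        case 12: return _mm_slli_si128(x, 12);
        case 13: return _mm_slli_si128(x, 13);
        case 14: return _mm_slli_si128(x, 14);
        case 15: return _mm_slli_si128(x, 15);
        // Assumption: any other count (16 or more, or negative) yields zero.
        default: return _mm_setzero_si128();
    }
}
```

The cost is a branch per call, so this only makes sense when the count cannot be made a compile-time constant.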

If you're using C++, another option would be to make doit() a function template whose template parameter gives the shift count, like this:

template <int Shift> __m256i doit(__m256i x) { return _mm256_slli_si256(x, Shift); }
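A usage sketch of the template approach follows. To keep it buildable without -mavx2, it assumes the 128-bit _mm_slli_si128 rather than _mm256_slli_si256; the same pattern should carry over to the 256-bit intrinsic when AVX2 is enabled (there the byte shift is applied within each 128-bit lane). The name shift_left_bytes and the static_assert are illustrative additions, not part of the original answer:

```cpp
#include <emmintrin.h>  // SSE2: _mm_slli_si128

// Illustrative wrapper: the shift count is a template parameter, so it is
// a compile-time constant in every instantiation, inlined or not.
template <int Shift>
static inline __m128i shift_left_bytes(__m128i x) {
    static_assert(Shift >= 0 && Shift < 256,
                  "byte shift count must fit in an 8-bit immediate");
    return _mm_slli_si128(x, Shift);
}

// Example caller: the count is spelled at the call site.
static inline __m128i shift_left_by_three(__m128i x) {
    return shift_left_bytes<3>(x);
}
```

Each distinct count produces its own instantiation, so the count has to be known where the call is written, for example shift_left_bytes<3>(x).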

Oh hell, yes, I forgot the inline, you're absolutely right, thank you! It works if I add the inline. But why does it work without inline for other intrinsics? And you seem hesitant to rely on the inline solution: do you think the fact that this works is more of a lucky coincidence than a feature?
The inline keyword isn't a guarantee that a function will be inlined; it's only a hint to the compiler, so I do hesitate to rely upon it working for your application. If you're not concerned about total portability, many compilers have special syntax that allows you to force inlining. I'm not sure what other intrinsics you're referring to, but the bit/byte shifting instructions are somewhat unique in that they require immediate arguments. Most SSE/AVX instructions do not.
Thanks. Are you sure about the immediate arguments? I thought ALL integer parameters passed to SSE/AVX functions were immediate arguments (and thus encoded in the opcode itself)? E.g. __m256i _mm256_slli_epi32 (__m256i a, int imm8) from the Intel Intrinsics Reference.
@Ralf: Yes, I'll walk that statement back a bit. In many cases the intrinsics that can take a variable will take it via a __m256 type. If the reference indicates that it's an immediate, I would trust that.
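The distinction raised in the comments, between a shift count given as an immediate and one passed in a vector register, can be sketched for the per-element shifts as follows. This is a minimal illustration using the SSE2 forms so that it builds without extra flags; the function names shl32_by_5 and shl32_runtime are made up for the example:

```cpp
#include <emmintrin.h>  // SSE2: _mm_slli_epi32, _mm_sll_epi32, _mm_cvtsi32_si128

// Immediate form: the count is written as a literal at the call.
static inline __m128i shl32_by_5(__m128i x) {
    return _mm_slli_epi32(x, 5);
}

// Register form: the count for all elements is taken from the low
// 64 bits of a vector register, so it may be a run-time value.
static inline __m128i shl32_runtime(__m128i x, int count) {
    return _mm_sll_epi32(x, _mm_cvtsi32_si128(count));
}
```

The AVX2 counterpart _mm256_sll_epi32 likewise takes its count in an __m128i. As noted in the comments, there is no register-count form of the whole-register byte shift, which is why _mm256_slli_si256 always needs a compile-time constant.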

Marat Dukhan · Accepted Answer · 2015-07-10 03:34:45Z

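The problem is that your function has external linkage (i.e. it is callable from functions in other C/C++ translation units), so the compiler must generate a standalone version of it, in which imm is not a constant. If you declare it as static (or inline), the compiler does not need to emit that standalone version, and you won't get the error as long as every call is inlined with a constant argument.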
