
I was playing with Compiler Explorer and stumbled upon an interesting behavior of the ternary operator when using something like this:

std::string get_string(bool b) {
    return b ? "Hello" : "Stack-overflow";
}

The compiler-generated code for this (clang trunk, -O3) is:

get_string[abi:cxx11](bool):             # @get_string[abi:cxx11](bool)
        push    r15
        push    r14
        push    rbx
        mov     rbx, rdi
        mov     ecx, offset .L.str
        mov     eax, offset .L.str.1
        test    esi, esi
        cmovne  rax, rcx
        add     rdi, 16                  #< Why is the compiler storing the length of the string
        mov     qword ptr [rbx], rdi
        xor     sil, 1
        movzx   ecx, sil
        lea     r15, [rcx + 8*rcx]
        lea     r14, [rcx + 8*rcx]
        add     r14, 5                   #< I also think this is the length of "Hello" (but not sure)
        mov     rsi, rax
        mov     rdx, r14
        call    memcpy                   #< Why is there a call to memcpy
        mov     qword ptr [rbx + 8], r14
        mov     byte ptr [rbx + r15 + 21], 0
        mov     rax, rbx
        pop     rbx
        pop     r14
        pop     r15
        ret
.L.str:
        .asciz  "Hello"
.L.str.1:
        .asciz  "Stack-Overflow"

However, the code generated for the following snippet is considerably smaller, contains no call to memcpy, and does not need to know the lengths of both strings at the same time; there are two different labels that it jumps to:

std::string better_string(bool b) {
    if (b) {
        return "Hello";
    } else {
        return "Stack-Overflow";
    }
}

The compiler-generated code for the above snippet (clang trunk, -O3) is:

better_string[abi:cxx11](bool):          # @better_string[abi:cxx11](bool)
        mov     rax, rdi
        lea     rcx, [rdi + 16]
        mov     qword ptr [rdi], rcx
        test    sil, sil
        je      .LBB0_2
        mov     dword ptr [rcx], 1819043144
        mov     word ptr [rcx + 4], 111
        mov     ecx, 5
        mov     qword ptr [rax + 8], rcx
        ret
.LBB0_2:
        movabs  rdx, 8606216600190023247
        mov     qword ptr [rcx + 6], rdx
        movabs  rdx, 8525082558887720019
        mov     qword ptr [rcx], rdx
        mov     byte ptr [rax + 30], 0
        mov     ecx, 14
        mov     qword ptr [rax + 8], rcx
        ret

I get the same result when I use the ternary operator with:

std::string get_string(bool b) {
    return b ? std::string("Hello") : std::string("Stack-Overflow");
}

I would like to know why the ternary operator in the first example generates that code. I believe the culprit lies with the const char[].
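(For reference, a minimal check of what the conditional expression itself yields; this only confirms the type, and assumes a C++17 compiler for std::is_same_v:)

    #include <type_traits>

    // Both operands are lvalues of different array types (const char[6] and
    // const char[15]), so they decay to pointers and the result of ?: is a
    // const char* prvalue rather than a reference to either array.
    static_assert(std::is_same_v<decltype(true ? "Hello" : "Stack-Overflow"),
                                 const char*>);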

P.S.: GCC emits calls to strlen in the first example, but Clang doesn't.

Link to the Compiler Explorer example: https://godbolt.org/z/Exqs6G

Thank you for your time!

Sorry for the wall of code.

3 Comments
  • The result type of the ternary is const char*, while the strings individually are const char[N]s; presumably the compiler could optimize the latter much more. Commented Aug 9, 2020 at 14:19
  • @kmdreko: the compiler still knows that it's a const char* pointing to one of two possible known-constant string literals. That's why clang is able to avoid the strlen in the branchless version. (GCC misses that optimization.) Even clang's branchless version is not well optimized; significantly better would have been possible, e.g. 2x cmov to select between constants, and maybe a cmov to select an offset to store at. (So both versions can do 2 partially-overlapping 8-byte stores, writing either 8 or 14 bytes of data, including trailing zeros.) That's better than calling memcpy. Commented Aug 10, 2020 at 19:52
  • Or since it's loading constants from memory anyway, use SSE2 movdqa loads and turn the boolean into a vector mask to select between them. (This optimization relies on the compiler knowing it's safe to always store 16 bytes into the retval object, even though the C++ source probably leaves some trailing bytes unwritten. Inventing writes is generally a big no-no for compilers because of thread safety.) Commented Aug 10, 2020 at 19:55
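(A source-level sketch of the 2x cmov / partially-overlapping-store idea from the comments above. It is purely illustrative of the strategy a compiler could pick, not actual compiler output; the function name is made up, and the constants are just the little-endian byte encodings of the two literals:)

    #include <cstdint>
    #include <cstring>
    #include <string>

    std::string get_string_branchless_sketch(bool b) {
        const std::uint64_t hello = 0x0000006F6C6C6548ULL; // "Hello\0\0\0"
        const std::uint64_t stack = 0x764F2D6B63617453ULL; // "Stack-Ov"
        const std::uint64_t overf = 0x776F6C667265764FULL; // "Overflow"

        char buf[16] = {};
        std::uint64_t lo  = b ? hello : stack;             // cmov material, no branch
        std::uint64_t hi  = b ? 0     : overf;
        std::size_t   len = b ? 5     : 14;

        std::memcpy(buf, &lo, 8);                          // bytes 0..7
        std::memcpy(buf + 6, &hi, 8);                      // bytes 6..13, overlapping the first store
        return std::string(buf, len);                      // the real win would be building the SSO buffer in place
    }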

2 Answers


The overarching difference here is that the first version is branchless.

16 isn’t the length of any string here (the longer one, with its NUL, is only 15 bytes); it’s an offset into the return object (whose address is passed in RDI to support RVO), used to indicate that the small-string optimization is in use (note the lack of any allocation). The length is 5 or 5+1+8, computed branchlessly into R14; that value is stored into the std::string and also passed to memcpy (along with a pointer chosen by CMOVNE) to copy the actual string bytes.
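(For orientation, a rough model of the libstdc++ std::string layout that this code is filling in; the field names are illustrative assumptions, not the library’s real identifiers, and the union with the capacity field is omitted:)

    #include <cstddef>

    struct sso_string_model {
        char*       data;     // offset 0:  "mov qword ptr [rbx], rdi" stores a pointer to buf below
        std::size_t size;     // offset 8:  "mov qword ptr [rbx + 8], r14" stores the length
        char        buf[16];  // offset 16: "add rdi, 16" computes this buffer's address (SSO, no heap)
    };

The length in R14 is computed as 5 + 9*!b (the xor sil, 1 / lea r14, [rcx + 8*rcx] / add r14, 5 sequence), i.e. 5 for "Hello" and 14 for "Stack-Overflow".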

The other version has an obvious branch (although part of the std::string construction has been hoisted above it) and actually does have 5 and 14 explicitly, but is obfuscated by the fact that the string bytes have been included as immediate values (expressed as integers) of various sizes.
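(To confirm that those immediates are just the string bytes, here is a small self-contained check that reassembles them exactly as the stores above do; it assumes a little-endian host, matching x86:)

    #include <cstdint>
    #include <cstring>
    #include <iostream>

    int main() {
        // Values copied from the better_string assembly above.
        std::uint32_t d  = 1819043144;             // dword stored at [rcx]      -> "Hell"
        std::uint16_t w  = 111;                    // word  stored at [rcx + 4]  -> "o\0"
        std::uint64_t q0 = 8525082558887720019ULL; // qword stored at [rcx]      -> "Stack-Ov"
        std::uint64_t q1 = 8606216600190023247ULL; // qword stored at [rcx + 6]  -> "Overflow"

        char hello[8] = {};
        std::memcpy(hello, &d, 4);
        std::memcpy(hello + 4, &w, 2);

        char so[16] = {};
        std::memcpy(so, &q0, 8);
        std::memcpy(so + 6, &q1, 8);               // overlapping store, same as the asm

        std::cout << hello << '\n' << so << '\n';  // prints "Hello" and "Stack-Overflow"
    }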

As for why these three equivalent functions produce two different versions of the generated code, all I can offer is that optimizers are iterative and heuristic algorithms; they don’t reliably find the same “best” assembly independently of their starting point.


6 Comments

Notably, memory writes are much trickier to optimize -- even when memcpy is an intrinsic, the optimizer internally still needs to reason about the potential side effects of a write occurring sooner or later. In the first snippet, the ternary expression is evaluated and then a write happens; in the second, the write happens as part of evaluating the ternary expression.
I agree, it shouldn't, but as you mention since optimizers are iterative and heuristic, ... it's not too surprising that it does :)
Being branchless is a red herring here. Juergen's answer is the correct one. The difference is the type on which the selection is performed (std::string vs. char*), and whether a constructor needs to be called with the result of the selection or not.
@cmaster-reinstatemonica: Being branchless is simply a description of the resulting assembly in one case (which is helpful for understanding the other differences). The constructor is entirely inlined (“evaluated at compile time”) in all cases here; the type of the return statement’s operand is in no way a constraint on the generated code (since no address of any string literal escapes).
If the compiler were able to perform the constant value propagation analysis to its bitter end, it should generate the exact same output in all three cases. But it doesn't. So, apparently, it didn't finish that analysis. And obviously it's thrown off by the fact that it has to construct a single object with one of two possible arguments in the first case, while it needs to select the construction of two different objects in the other cases. The comparison between the behavior of the first and last code examples is instructive.

The first version returns a string object that is initialized from a non-constant expression yielding one of the string literals, so the constructor runs as it would for any other runtime-valued char pointer; hence the memcpy to do the initialization.

The other variants return either one string object initialized from one string literal, or another string object initialized from the other string literal; each of those can be optimized into a string object constructed from a constant expression, where no memcpy is needed.

So the real answer is: the first version applies the ?: operator to char[] expressions before the object is initialized, while the other versions apply it to string objects that are already being initialized.
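(Spelled out at the source level, the difference the optimizer sees is roughly the following; this is an illustrative rewrite, not anything the compiler literally produces:)

    #include <string>

    // First version: ?: is applied to the char arrays, which decay to const char*,
    // so one constructor call receives a pointer selected at run time and the
    // length has to be recovered from it (clang tracks it; GCC emits strlen).
    std::string ternary_on_char_arrays(bool b) {
        const char* p = b ? "Hello" : "Stack-Overflow";
        return std::string(p);
    }

    // Other versions: the selection is between two std::string objects, each
    // constructed directly from a literal whose length is a compile-time constant.
    std::string ternary_on_strings(bool b) {
        return b ? std::string("Hello") : std::string("Stack-Overflow");
    }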

It does not matter whether one of the versions is branchless.

2 Comments

No memcpy was truly needed in the branchless asm either; this is a missed optimization vs. using more cmov instructions on immediate operands, or an SSE2 select. Your answer does explain why the source led the compiler in the direction it went, though; compilers are far from perfect.
Note that in the OP's Godbolt link with all 3 versions uncommented, godbolt.org/z/597Kzd, return b ? std::string("Hello") : std::string("Stack-Overflow"); compiles to a branch with GCC and clang (same as the if version), despite the chance for constant-propagation to make const string objects.
