25

I need a program to get the smaller of two numbers, and I'm wondering if using a standard "if x is less than y"

int a, b, low;
if (a < b)
    low = a;
else
    low = b;

is more or less efficient than this:

int a, b, low;
low = b + ((a - b) & ((a - b) >> 31));

(or the variation of putting int delta = a - b at the top and replacing instances of a - b with that).

I'm just wondering which one of these would be more efficient (or if the difference is too minuscule to be relevant), and the efficiency of if-else statements versus alternatives in general.

13
  • 11
    This will depend greatly on your compiler and target CPU. I doubt that there's a generally true answer. Did you try benchmarking? Commented Jun 9, 2010 at 20:33
  • 8
    Any speed difference is negligible in this case. The efficiency in maintenance seems obvious. Commented Jun 9, 2010 at 20:35
  • 31
    FFS people, he didn't ask your opinion on when to optimize, just some technical details about two separate approaches. Commented Jun 9, 2010 at 20:46
  • 10
    With a decent compiler, min(a,b) should give you the optimal code - possibly faster than either, if it can use machine instructions that aren't directly available from C. Also, the second version isn't as portable, since right-shifting a negative value gives an implementation-defined result. Commented Jun 9, 2010 at 22:57
  • 4
    Or, you need to optimize a lot of things by a little bit apiece. That's the reality of achieving performance on fixed hardware. Commented Jun 10, 2010 at 0:30

16 Answers 16

34

(Disclaimer: the following deals with very low-level optimizations that are most often not necessary. If you keep reading, you waive your right to complain that computers are fast and there is never any reason to worry about this sort of thing.)

One advantage of eliminating an if statement is that you avoid branch prediction penalties.

Branch prediction penalties are generally only a problem when the branch is not easily predicted. A branch is easily predicted when it is almost always taken/not taken, or it follows a simple pattern. For example, the branch in a loop statement is taken every time except the last one, so it is easily predicted. However, if you have code like

a = random() % 10
if (a < 5)
    print "Less"
else
    print "Greater"

then this branch is not easily predicted, and will frequently incur the prediction penalty associated with flushing the pipeline and rolling back instructions that were speculatively executed down the wrong side of the branch.

One way to avoid these kinds of penalties is to use the ternary (?:) operator. In simple cases, the compiler will generate conditional move instructions rather than branches.

So

int a, b, low;
if (a < b)
    low = a;
else
    low = b;

becomes

int a, b, low;
low = (a < b) ? a : b;

and in the second case a branching instruction is not necessary. Additionally, it is much clearer and more readable than your bit-twiddling implementation.

Of course, this is a micro-optimization which is unlikely to have significant impact on your code.


7 Comments

Finally, an answer that isn't bleating about premature optimization. Thank you.
@Justicle - the problem with not bleating about premature optimization is that you end up with an implied suggestion (particularly to people who are just learning) that one should write code like low = b + ((a - b) & ((a - b) >> 31)) everywhere for no good reason because someone said "it's faster". When, in fact, it's the wrong thing to do the vast majority of times.
At -O1 and higher, gcc produces identical code for the if statement and the ternary operator for the min() function, using a cmovg instruction in both cases. At -O0, it uses branches and labels for the if statement and cmovle for the ternary operator.
I agree that this is more readable, but it will certainly not be faster. See my answer.
"However after running experiments on a wide range of compilers I have concluded that with the optimizer turned on, you are better off with a simple if-else statement." Efficient C Tips #6 – Don’t use the ternary operator
10

Compiling this on gcc 4.3.4, amd64 (core 2 duo), Linux:

int foo1(int a, int b)
{
    int low;
    if (a < b)
        low = a;
    else
        low = b;
    return low;
}

int foo2(int a, int b)
{
    int low;
    low = b + ((a - b) & ((a - b) >> 31));
    return low;
}

I get:

foo1:
    cmpl    %edi, %esi
    cmovle  %esi, %edi
    movl    %edi, %eax
    ret
foo2:
    subl    %esi, %edi
    movl    %edi, %eax
    sarl    $31, %eax
    andl    %edi, %eax
    addl    %esi, %eax
    ret

...which I'm pretty sure won't incur branch prediction penalties, since the code doesn't jump. Also, the non-if-statement version is 2 instructions longer. I think I will continue coding, and let the compiler do its job.

1 Comment

You're correct, cmovcc is a data dependency, not a branch-predicted control dependency. This can be good, but can also be bad if a branch would have predicted well and broken a loop-carried dependency chain. Use profile-guided optimization to help compilers choose between branchy and branchless.
10

Simple answer: One conditional jump is going to be more efficient than two subtractions, one addition, a bitwise and, and a shift operation combined. I've been sufficiently schooled on this point (see the comments) that I'm no longer even confident enough to say that it's usually more efficient.

Pragmatic answer: Either way, you're not paying nearly as much for the extra CPU cycles as you are for the time it takes a programmer to figure out what that second example is doing. Program for readability first, efficiency second.

8 Comments

@nategoose: Which processors?
@Bill: many processors have a long instruction pipeline which must be flushed whenever there's a mispredicted branch, taking perhaps 10 or 20 cycles. In this case, the branch is likely to be mispredicted half the time, so the conditional version might take an average of 5 or 10 cycles, while the squiggly version takes 4 or 5. (Of course, other processors have conditional instructions, short pipelines and other ways to avoid misprediction, and then the conditional version will be faster).
And on the processor I mostly use, the first version takes 2 cycles, and the second takes 3.
On the in-order PowerPC processor used in many game consoles, an unpredicted branch is a 20 cycle bubble, and a correctly predicted branch is a 5 cycle bubble. x + ((y - x) & (a >> 31)) is 3 cycles due to dual-dispatch. The situation is even more extreme for floating point numbers, where the conditional-move has a throughput of 1/1 cycle, whereas branch on float comparison can be a 40 cycle bubble.
@nategoose, @Mike, @Crashworks: Well, that will teach me to make sweeping generalizations based on benchmarks from one machine. I stand corrected.
7

The biggest problem is that your second example won't work on 64-bit machines.

However, even neglecting that, modern compilers are smart enough to consider branchless code wherever possible and compare the estimated speeds. So, your second example will most likely actually be slower.

There will be no difference between the if statement and using a ternary operator, as even most dumb compilers are smart enough to recognize this special case.


[Edit] Because I think this is such an interesting topic, I've written a blog post on it.

8 Comments

I've looked at the assembly output of MSVC and GCC, and neither of them seem smart enough to emit branchless conditional moves half the time I want them to.
@Crashworks: That means the compiler decided the branchless conditional is actually slower (branchless conditionals require more clocks, but don't have the possibility of clearing the instruction pipeline)
Yes, but the compiler was wrong when it decided that. I've timed both pathways. My job consists of cramming more work into 16.6 milliseconds than the competing product can. In general, I have seen compilers emit many suboptimal code sequences. They are not perfect.
I sometimes do, but it's often easier to meet the compiler halfway and write code in such a way that it results in the code sequence I want; intrinsics in particular are an example of this. That's much easier to intermingle with other C++ code than inline assembly. It's a common practice in the embedded world; part of the job is learning what the compiler will emit for particular inputs.
In practice I wrote an isel(a,b,c) function which has the same effect as return a >= 0 ? b : c . We just use that. (It was named by analogue to the fsel intrinsic, which is the hardware's native floating point conditional-move.) It would be better if the compiler were just smart enough to emit the right code for ?:, but we haven't got a smart compiler, just GCC.
6

Like with any low-level optimization, test it on the target CPU/board setup.

On my compiler (gcc 4.5.1 on x86_64), the first example becomes

cmpl    %ebx, %eax
cmovle  %eax, %esi

The second example becomes

subl    %eax, %ebx
movl    %ebx, %edx
sarl    $31, %edx
andl    %ebx, %edx
leal    (%rdx,%rax), %esi

Not sure if the first one is faster in all cases, but I would bet it is.

Comments

4

Either way, the assembly will only be a few instructions and either way it'll take picoseconds for those instructions to execute.

I would profile the application and concentrate your optimization efforts on something more worthwhile.

Also, the time saved by this type of optimization will not be worth the time wasted by anyone trying to maintain it.

For simple statements like this, I find the ternary operator very intuitive:

low = (a < b) ? a : b;

Clear and concise.

5 Comments

x86 can map a comparison result to 0/1 without a jump.
Where is the conditional jump in low = b + ((a - b) & ((a - b) >> 31));
I must be missing something, why will there be a conditional jump in his second example?
I read it as a logical and for some reason, disregard my conditional comment, editing...
Nanoseconds, not picoseconds. Most processors operate at only the GHz clock range.
4

For something as simple as this, why not just experiment and try it out?

Generally, you'd profile first, identify this as a hotspot, experiment with a change, and view the result.

I wrote a simple program that compares both techniques passing in random numbers (so that we don't see perfect branch prediction) with Visual C++ 2010. The difference between the approaches on my machine for 100,000,000 iterations? Less than 50ms total, and the if version tended to be faster. Looking at the codegen, the compiler successfully converted the simple if to a cmovl instruction, avoiding a branch altogether.

Comments

2

One thing to be wary of when you get into really bit-fiddly kinds of hacks is how they may interact with compiler optimizations that take place after inlining. For example, the readable procedure

int foo (int a, int b)
{
    return ((a < b) ? a : b);
}

is likely to be compiled into something very efficient in any case, but in some cases it may be even better. Suppose, for example, that someone writes

int bar = foo (x, x+3); 

After inlining, the compiler will recognize that 3 is positive, and may then make use of the fact that signed overflow is undefined to eliminate the test altogether, to get

int bar = x; 

It's much less clear how the compiler should optimize your second implementation in this context. This is a rather contrived example, of course, but similar optimizations actually are important in practice. Of course you shouldn't accept bad compiler output when performance is critical, but it's likely wise to see if you can find clear code that produces good output before you resort to code that the next, amazingly improved, version of the compiler won't be able to optimize to death.

2 Comments

Kinda obvious that (x+3 > x) so ofc it should optimize it away.
@andersfylling: Hardly. With unsigned x, where overflow is defined to wrap around, x+3 > x isn't true for all possible inputs, so the optimization isn't safe and you get lea / cmp / cmov from gcc and clang for x86-64. Hmm, compilers could shorten the critical path by comparing x against constant (UINT_MAX - 3) so it could run in parallel with the lea.
1

One thing I will point out that I haven't seen mentioned is that an optimization like this can easily be overwhelmed by other issues. For example, if you are running this routine on two large arrays of numbers (or worse yet, pairs of numbers scattered in memory), the cost of fetching the values on today's CPUs can easily stall the CPU's execution pipelines.

1 Comment

This is a comment at best, not an answer. A branch mispredict can reduce the throughput of other slow stuff; OOO execution can't hide the latency of a branch miss if the cache-miss load doesn't even start until after the branch is correctly resolved.
1

I'm just wondering which one of these would be more efficient (or if the difference is too minuscule to be relevant), and the efficiency of if-else statements versus alternatives in general.

Desktop/server CPUs are optimized for pipelining. The second is theoretically faster because the CPU doesn't have to branch and can utilize multiple ALUs to evaluate parts of the expression in parallel. More non-branching code with intermixed independent operations is best for such CPUs. (But even that is negated now by modern "conditional" CPU instructions which allow the first code to be made branch-less too.)

On embedded CPUs, branching is often less expensive (relative to everything else), and they don't have many spare ALUs to evaluate operations out of order (that's if they support out-of-order execution at all). Less code/data is better: caches are small too. (I have even seen bubble sort used in embedded applications: the algorithm uses the least memory/code and is fast enough for small amounts of information.)

Important: do not forget about the compiler optimizations. Using many tricks, the compilers sometimes can remove the branching themselves: inlining, constant propagation, refactoring, etc.

But in the end I would say that yes, the difference is too minuscule to be relevant. In the long term, readable code wins.

The way things go on the CPU front, it is more rewarding to invest time now in making the code multi-threaded and OpenCL capable.

Comments

0

Why low = a; in the if and low = b; in the else? And, why 31? If 31 has anything to do with CPU word size, what if the code is to be run on a CPU of a different size?

The if..else way looks more readable. I like programs to be as readable to humans as they are to the compilers.

1 Comment

If the non-portable implementation was actually useful, you'd obviously wrap it in a branchless_min() function instead of manually inlining it everywhere. And yes it assumes 32-bit 2's complement signed integer + arithmetic right shifts. Of course it's not actually useful because compilers generate better branchless code using cmov, but this still doesn't answer the question.
0

profile results with gcc -o foo -g -p -O0, Solaris 9 v240

 %Time  Seconds  Cumsecs    #Calls  msec/call  Name
  36.8     0.21     0.21   8424829     0.0000  foo2
  28.1     0.16     0.37         1   160.      main
  17.5     0.10     0.47  16850667     0.0000  _mcount
  17.5     0.10     0.57   8424829     0.0000  foo1
   0.0     0.00     0.57         4     0.      atexit
   0.0     0.00     0.57         1     0.      _fpsetsticky
   0.0     0.00     0.57         1     0.      _exithandle
   0.0     0.00     0.57         1     0.      _profil
   0.0     0.00     0.57      1000     0.000   rand
   0.0     0.00     0.57         1     0.      exit

code:

int foo1 (int a, int b, int low)
{
    if (a < b)
        low = a;
    else
        low = b;
    return low;
}

int foo2 (int a, int b, int low)
{
    low = (a < b) ? a : b;
    return low;
}

int main()
{
    int low = 0;
    int a = 0;
    int b = 0;
    int i = 500;
    while (i--) {
        for (a = rand(), b = rand(); a; a--) {
            low = foo1(a, b, low);
            low = foo2(a, b, low);
        }
    }
    return 0;
}

Based on the data, in the above environment, the exact opposite of several beliefs stated here was observed. Note the 'in this environment': the if construct was faster than the ternary ?: construct.

3 Comments

However, compiling here, gcc -O2 -S -o output.S input.c, foo1 and foo2 compile to exactly the same 4 instructions. (Linux, gcc 4.3.4, amd64 (core 2 duo))
That was the whole point and why "bleating" about profiling is meaningful. Thanks.
Timing with -O0 is total nonsense, unless you're a compiler writer trying to improve the performance of debug builds. -O0 isn't just a linear slow-down that slows down everything by some constant factor; see stackoverflow.com/questions/32000917/…
0

I wrote a ternary logic simulator not so long ago, and this question was directly relevant to me, as it affects my interpreter's execution speed; I was required to simulate tons and tons of ternary logic gates as fast as possible.

In a binary-coded-ternary system one trit is packed in two bits. The most significant bit means negative and the least significant means positive. The case "11" should not occur, but it must be handled properly and treated as 0.

Consider an inline int bct_decoder( unsigned bctData ) function, which should return our formatted trit as a regular integer -1, 0 or 1. As I observed, there are 4 approaches: I called them "cond", "mod", "math" and "lut". Let's investigate them.

The first is based on jz|jnz and jl|jb conditional jumps, thus "cond". Its performance is not good at all, because it relies on the branch predictor. And even worse, it varies, because it is unknown a priori whether there will be one branch or two. Here is an example:

inline int bct_decoder_cond( unsigned bctData ) {
    unsigned lsB = bctData & 1;
    unsigned msB = bctData >> 1;
    return ( lsB == msB ) ? 0 :  // most likely case -> make zero the fastest branch
           ( lsB > msB ) ? 1 : -1;
}

This is the slowest version; it can involve 2 branches in the worst case, and this is something where binary logic fails. On my 3770k it produces around 200 MIPS on average on random data. (Here and below, each test is an average of 1000 tries on a randomly filled 2 MB dataset.)

The next one relies on the modulo operator, and its speed is somewhere between the first and the third, but it is definitely faster - 600 MIPS:

inline int bct_decoder_mod( unsigned bctData ) {
    return ( int )( ( bctData + 1 ) % 3 ) - 1;
}

The next one is the branchless approach, which involves only math, thus "math"; it involves no jump instructions at all:

inline int bct_decoder_math( unsigned bctData ) {
    return ( int )( bctData & 1 ) - ( int )( bctData >> 1 );
}

This does what it should, and behaves really well. To compare, its performance estimate is 1000 MIPS, 5x faster than the branched version. The branched version is probably slowed down by the lack of native 2-bit signed int support. But in my application it is quite a good version in itself.

If this is not enough then we can go further, with something special. Next is the lookup table approach:

inline int bct_decoder_lut( unsigned bctData ) {
    static const int decoderLUT[] = { 0, 1, -1, 0 };
    return decoderLUT[ bctData & 0x3 ];
}

In my case one trit occupied only 2 bits, so the LUT was only 4 ints = 16 bytes, and was worth trying. It fits in cache and works blazing fast at 1400-1600 MIPS; this is where my measurement accuracy goes down. That is a 1.5x speedup over the fast math approach. That's because you just have a precalculated result and a single AND instruction. Sadly caches are small, and (if your index length is greater than several bits) you simply cannot use it.

So I think I answered your question about what branched/branchless code can look like, with detailed samples, a real-world application, and real performance measurement results.

Comments

0

Updated answer taking the current (2018) state of compiler vectorization. Please see danben's answer for the general case where vectorization is not a concern.

TLDR summary: avoiding ifs can help with vectorization.

Because SIMD would be too complex to allow branching on some elements but not others, any code containing an if statement will fail to be vectorized unless the compiler knows a "superoptimization" technique that can rewrite it into a branchless set of operations. I don't know of any compilers that do this as an integrated part of the vectorization pass (Clang does some of this independently, but not specifically toward helping vectorization, AFAIK).

Using the OP's provided example:

int a, b, low;
low = b + ((a - b) & ((a - b) >> 31));

Many compilers can vectorize this to be something approximately equivalent to:

__m128i low128i(__m128i a, __m128i b){
    __m128i diff, tmp;
    diff = _mm_sub_epi32(a, b);
    tmp  = _mm_srai_epi32(diff, 31);
    tmp  = _mm_and_si128(tmp, diff);
    return _mm_add_epi32(tmp, b);
}

This optimization would require the data to be laid out in a fashion that would allow for it, but it could be extended to __m256i with avx2 or __m512i with avx512 (and even unroll loops further to take advantage of additional registers) or other simd instructions on other architectures. Another plus is that these instructions are all low latency, high-throughput instructions (latencies of ~1 and reciprocal throughputs in the range of 0.33 to 0.5 - so really fast relative to non-vectorized code)

I see no reason why compilers couldn't optimize an if statement to a vectorized conditional move (except that the corresponding x86 operations only work on memory locations and have low throughput and other architectures like arm may lack it entirely) but it could be done by doing something like:

void lowhi128i(__m128i *a, __m128i *b){ // does both low and high
    __m128i _a = *a, _b = *b;
    __m128i lomask = _mm_cmpgt_epi32(_a, _b);
    __m128i himask = _mm_cmpgt_epi32(_b, _a);
    _mm_maskmoveu_si128(_b, lomask, (char*)a);
    _mm_maskmoveu_si128(_a, himask, (char*)b);
}

However this would have a much higher latency due to memory reads and writes and lower throughput (higher/worse reciprocal throughput) than the example above.

2 Comments

gcc and clang can do some simpler conversions of if into branchless. One major obstacle is that if the abstract machine doesn't write a memory location, it's not ok for the compiler-generated asm to read/rewrite it with the same value. So _mm_maskmoveu_si128 can be correct where the other version isn't, but it's slow (NT store, so it evicts from cache, as well as being just plain slow). See Is it possible to use SIMD instruction for replace?: the AVX version is fast.
And BTW, SIMD CMOV between registers is called a blend, and is somewhat fast. Like blendvps. Or with AVX512, conditional move is built-in to everything with mask registers.
-1

Unless you're really trying to buckle down on efficiency, I don't think this is something you need to worry about.

My simple thought though is that the if would be quicker because it's comparing one thing, while the other code is doing several operations. But again, I imagine that the difference is minuscule.

Comments

-1

If it is for Gnu C++, try this

int min = i <? j; 

I have not profiled it but I think it is definitely the one to beat.

2 Comments

I don't know what Gnu C++ is, but I don't like its syntax.
Gnu C++ is of course the C++ compiler from GCC (the Gnu Compiler Collection). IIRC they've deprecated this form. Just use std::min(i,j). It's unlikely GCC's std::min<int> is slower than this.