CMOV (Linus Torvalds)

Index Home About Blog

 From: Linus Torvalds <torvalds@osdl.org> Newsgroups: fa.linux.kernel Subject: Re: kernel + gcc 4.1 = several problems Date: Wed, 03 Jan 2007 16:11:19 UTC Message-ID: <fa.iHnBwDV8c5pMY8pAiaPKRUcP4QQ@ifi.uio.no> On Wed, 3 Jan 2007, Grzegorz Kulewski wrote: > Could you explain why CMOV is pointless now? Are there any benchmarks proving > that? CMOV (and, more generically, any "predicated instruction") tends to generally a bad idea on an aggressively out-of-order CPU. It doesn't always have to be horrible, but in practice it is seldom very nice, and (as usual) on the P4 it can be really quite bad. On a P4, I think a cmov basically takes 10 cycles. But even ignoring the usual P4 "I suck at things that aren't totally normal", cmov is actually not a great idea. You can always replace it by	j<negated condition> forward	mov ..., %reg	forward: and assuming the branch is AT ALL predictable (and 95+% of all branches are), the branch-over will actually be a LOT better for a CPU. Why? Because branches can be predicted, and when they are predicted they basically go away. They go away on many levels, too. Not just the branch itself, but the _conditional_ for the branch goes away as far as the critical path of code is concerned: the CPU still has to calculate it and check it, but from a performance angle it "doesn't exist any more", because it's not holding anything else up (well, you want to do it in _some_ reasonable time, but the point stands..) Similarly, whichever side of the branch wasn't taken goes away. Again, in an out-of-order machine with register renaming, this means that even if the branch isn't taken above, and you end up executing all the non-branch instructions, because you now UNCONDITIONALLY over-write the register, the old data in the register is now DEAD, so now all the OTHER writes to that register are off the critical path too! So the end result is that with a conditional branch, on a good CPU, the _only_ part of the code that is actually performance-sensitive is the actual calculation of the value that gets used! In contrast, if you use a predicated instruction, ALL of it is on the critical path. Calculating the conditional is on the critical path. Calculating the value that gets used is obviously ALSO on the critical path, but so is the calculation for the value that DOESN'T get used too. So the cmov - rather than speeding things up - actually slows things down, because it makes more code be dependent on each other. So here's the basic rule: - cmov is sometimes nice for code density. It's not a big win, but it certainly can be a win. - if you KNOW the branch is totally unpredictable, cmov is often good for performance. But a compiler almost never knows that, and even if you train it with input data and profiling, remember that not very many branches _are_ totally unpredictable, so even if you were to know that something is unpredictable, it's going to be very rare. - on a P4, branch mispredictions are expensive, but so is cmov, so all the above is to some degree exaggerated. On nicer microarchitectures (the Intel Core 2 in particular is something I have to say is very nice indeed), the difference will be a lot less noticeable. The loss from cmov isn't very big (it's not as sucky as P4), but neither is the win (branch misprediction isn't that expensive either). Here's an example program that you can test and time yourself. On my Core 2, I get	[torvalds@woody ~]$ gcc -DCMOV -Wall -O2 t.c	[torvalds@woody ~]$ time ./a.out	600000000	real 0m0.194s	user 0m0.192s	sys 0m0.000s	[torvalds@woody ~]$ gcc -Wall -O2 t.c	[torvalds@woody ~]$ time ./a.out	600000000	real 0m0.167s	user 0m0.168s	sys 0m0.000s ie the cmov is quite a bit slower. Maybe I did something wrong. But note how cmov not only is slower, it's fundamnetally more limited too (ie the branch-over can actually do a lot of things cmov simply cannot do). So don't use cmov. Except for non-performance-critical code, or if you really care about code-size, and it helps (which is actually fairly rare: quite often cmov isn't even smaller than a conditional jump and a regular move, partly because a regular move can take arguments that a cmov cannot: move to memory, move from an immediate etc etc, so depending on what you're moving, cmov simply isn't good even if it's _just_ a move). (For me, the "cmov" version of the function ends up being three bytes shorter. So it's actually a good example of everything above)	Linus (*) x86 only has "move to register" as a predicated instruction, but some other architectures have lots of them, potentially all instructions. I don't count conditional branches as "predicated", although some crazy people do. ARM has predicated instructions (but they are gone in Thumb, I think), and ia64 obviously has predicated instructions (but it will be gone in a few years ;) ----------------- #include <stdio.h> /* How many iterations? */ #define ITERATIONS (100000000) /* Which bit of the counter to test? */ #define BIT 1 #ifdef CMOV #define choose(i, a, b) ({	\	unsigned long result;	\	asm("testl %1,%2 ; cmovne %3,%0"	\	:"=r" (result)	\	:"i" (BIT),	\ "g" (i),	\ "rm" (a),	\ "0" (b));	\	result; }) #else #define choose(i, a, b) ({	\	unsigned long result;	\	asm("testl %1,%2 ; je 1f ; mov %3,%0\n1:"	\	:"=r" (result)	\	:"i" (BIT),	\ "g" (i),	\ "g" (a),	\ "0" (b));	\	result; }) #endif int main(int argc, char **argv) {	int i;	unsigned long sum = 0;	for (i = 0; i < ITERATIONS; i++) {	unsigned long a = 5, b = 7;	sum += choose(i, a, b);	}	printf("%lu\n", sum);	return 0; }

 From: Linus Torvalds <torvalds@osdl.org> Newsgroups: fa.linux.kernel Subject: Re: kernel + gcc 4.1 = several problems Date: Wed, 03 Jan 2007 20:31:41 UTC Message-ID: <fa.wnZOfBUpE6Q0mfwaI4vu5DInh+c@ifi.uio.no> On Wed, 3 Jan 2007, Tim Schmielau wrote: > Well, on a P4 (which is supposed to be soo bad) I get: Interesting. My P4 gets basically exactly the same timings for the cmov and branch cases. And my Core 2 is consistently faster (something like 15%) for the branch version. Btw, the test-case should be the best possible one for cmov, since there are no data-dependencies except for ALU operations, and everything is totally independent (the actual values have no data dependencies at all, since they are constants). So the critical path issue never show up.	Linus

From: Linus Torvalds <torvalds@osdl.org> Newsgroups: fa.linux.kernel Subject: Re: kernel + gcc 4.1 = several problems Date: Wed, 03 Jan 2007 20:43:34 UTC Message-ID: <fa.UR+Ao/GZvI0cYZhFujAgZNRLLSs@ifi.uio.no> On Wed, 3 Jan 2007, Denis Vlasenko wrote: > > Why CPU people do not internally convert cmov into jmp,mov pair? Probably because - it's not worth it. cmov's certainly _can_ be faster for unpredictable input. So expecially if you teach your compiler (by using profiling) to use cmov's mainly for unpredictable cases, turning it into a conditional jump internally would likely be a bad idea. - the biggest reason to do it would likely be microarchitectural: if you have an ALU or a bypass network that just isn't suitable for bypassing the flags that way (because you designed your pipeline for a conditional branch), you might decide that it just simplifies things to turn the cmov internally into a branch+mov uop pair. - cmov's simply aren't common enough to be worth worrying about, especially as it's not likely that the difference is all that big in the end. The limitations on cmov's means that the compiler can only use them under certain fairly limited circumstances anyway, so it's not like you'll make a huge difference by doing anything clever. So see above - it's simply a wash, and likely ends up just depending on other issues. And don't get me wrong. cmov's can make a difference. You can use them to avoid polluting your branch prediction tables, you can use them to make code smaller, and you can use them when they simply just fit the problem really well. It's just _not_ the case that they are "obviously better". They simply aren't. Conditional branches aren't "evil". There are many MUCH worse things you can do, and other things you should avoid. It really all boils down to: there's simply no real reason to use cmov. It's not horrible either, so go ahead and use it if you want to, but don't expect your code to really magically run any faster. Linus

 From: Linus Torvalds <torvalds@osdl.org> Newsgroups: fa.linux.kernel Subject: Re: kernel + gcc 4.1 = several problems Date: Wed, 03 Jan 2007 22:17:35 UTC Message-ID: <fa.7+6Bbs6IVAQuLP04rP/e0sLj7iM@ifi.uio.no> On Wed, 3 Jan 2007, Denis Vlasenko wrote: > > IOW: yet another slot in instruction opcode matrix and thousands of > transistors in instruction decoders are wasted because of this > "clever invention", eh? Well, in all fairness, it can probably help more on certain microarchitectures. Intel is fairly aggressively OoO, especially in Core 2, and predicted branches are not only free, they allow OoO to do a great job around them. But an in-order architecture doesn't have that, and cmov might show more of an advantage there.	Linus

 From: Linus Torvalds <torvalds@osdl.org> Newsgroups: fa.linux.kernel Subject: Re: kernel + gcc 4.1 = several problems Date: Wed, 03 Jan 2007 22:12:37 UTC Message-ID: <fa./kcA//b15BAP6eVUkdIB/kcv0aU@ifi.uio.no> On Wed, 3 Jan 2007, Thomas Sailer wrote: > > IF... Counterexample: Add-Compare-Select in a Viterbi Decoder. Yes. [De]compression stuff tends to be (a) totally unpredictable and (b) a situation where people care about performance. It's fairly rare in many other situations. That said, any real performance these days is about avoiding cache misses. There cmov really can help more, if it results in denser code (fairly big if, though).	Linus

 From: Linus Torvalds <torvalds@osdl.org> Newsgroups: fa.linux.kernel Subject: RE: kernel + gcc 4.1 = several problems Date: Thu, 04 Jan 2007 15:39:37 UTC Message-ID: <fa.4uXnzxtRZUP0N24sb6V1HnchsLY@ifi.uio.no> On Thu, 4 Jan 2007, Zou, Nanhai wrote: > > cmov will stall on eflags in your test program. And that is EXACTLY my point. CMOV is a piece of CRAP for most things, exactly because it serializes three streams of data: the two inputs, and the conditional. My test-case was actually _good_ for cmov, because there was just the one conditional (which was 100% ALU) thing that was serialized. In real life, the two data sources also come from memory, and _any_ of them being delayed ends up delaying the cmov, and screwing up your out-of-order pipeline because you now introduced a serialization point that was very possibly not necessary at all. In contrast, a conditional branch-around serializes absolutely NOTHING, because branches get predicted. > I think you will see benefit of cmov if you can manage to put some > instructions which does NOT modify eflags between testl and cmov. A lot of the time, the conditional _is_ the critical path. The whole point of this discussion was that cmov isn't really all that great. It has fundamental problems that a conditional branch that gets predicted simply does not have. That's qiute apart from the fact that cmov has rather limited semantics, and that in 99% of all cases you have to use a conditional branch anyway.	Linus

Index Home About Blog