132

I am doing some performance-critical work in C++, and we are currently using integer calculations for problems that are inherently floating point because "it's faster". This causes a whole lot of annoying problems and adds a lot of annoying code.

Now, I remember reading that floating point calculations were slow back around the 386 days, when I believe (IIRC) the FPU was an optional co-processor. But surely nowadays, with exponentially more complex and powerful CPUs, it makes no difference in "speed" whether you do floating point or integer calculations? Especially since the actual calculation time is tiny compared to something like causing a pipeline stall or fetching something from main memory?

I know the correct answer is to benchmark on the target hardware; what would be a good way to test this? I wrote two tiny C++ programs and compared their run times with "time" on Linux, but the actual run time is too variable (it doesn't help that I am running on a virtual server). Short of spending my entire day running hundreds of benchmarks, making graphs, etc., is there something I can do to get a reasonable test of the relative speed? Any ideas or thoughts? Am I completely wrong?

The programs I used are as follows; they differ only in the accumulator type:

Program 1:

#include <iostream>
#include <cmath>
#include <cstdlib>
#include <time.h>

int main( int argc, char** argv )
{
    int accum = 0;

    srand( time( NULL ) );

    for( unsigned int i = 0; i < 100000000; ++i )
    {
        accum += rand( ) % 365;
    }
    std::cout << accum << std::endl;

    return 0;
}

Program 2:

#include <iostream>
#include <cmath>
#include <cstdlib>
#include <time.h>

int main( int argc, char** argv )
{
    float accum = 0;

    srand( time( NULL ) );

    for( unsigned int i = 0; i < 100000000; ++i )
    {
        accum += (float)( rand( ) % 365 );
    }
    std::cout << accum << std::endl;

    return 0;
}

Edit: The platform I care about is regular x86 or x86-64 running on desktop Linux and Windows machines.

Edit 2 (pasted from a comment below): We have an extensive code base currently. Really I have come up against the generalization that we "must not use float since integer calculation is faster" - and I am looking for a way (if this is even true) to disprove this generalized assumption. I realize that it would be impossible to predict the exact outcome for us short of doing all the work and profiling it afterwards.

Anyway, thanks for all your excellent answers and help. Feel free to add anything else :).

9
  • 8
    What you have as your test now is trivial. There's also probably very little difference in the assembly (addl replaced with fadd, for example). The only way to really get a good measurement is to take a core part of your real program and profile different versions of that. Unfortunately that can be pretty hard without using tons of effort. Perhaps telling us the target hardware and your compiler would help people at least give you pre-existing experience, etc. About your integer use: I suspect you could make a sort of fixed_point template class that would ease such work tremendously (see the sketch after these comments). Commented Mar 31, 2010 at 3:22
  • 1
    There are still a lot of architectures out there that don't have dedicated floating point hardware - some tags explaining the systems you care about will help you get better answers. Commented Mar 31, 2010 at 3:24
  • 3
    I believe the hardware in my HTC Hero (Android) doesn't have an FPU, but the hardware in the Google Nexus One (Android) does. What is your target? Desktop/server PCs? Netbooks (possibly ARM + Linux)? Phones? Commented Mar 31, 2010 at 3:33
  • 6
    If you want fast FP on x86, try to compile with optimization and SSE code generation. SSE (whatever version) can do at least float add, subtract, and multiply in a single cycle. Divide, mod, and higher functions will always be slow. Also note that float gets the speed boost, but usually double doesn't. Commented Mar 31, 2010 at 4:18
  • 2
    Fixed-point integer approximates FP by using multiple integer operations to keep the results from overflowing. That's almost always slower than just using the extremely capable FPUs found in modern desktop CPUs. e.g. MAD, the fixed-point mp3 decoder, is slower than libmpg123, and even though it's good quality for a fixed point decoder, libmpg123 still has less rounding error. wezm.net/technical/2008/04/mp3-decoder-libraries-compared for benchmarks on a PPC G5. Commented Jul 13, 2015 at 1:51
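To make the fixed_point suggestion from the first comment concrete, here is a minimal sketch of what such a template class could look like. The name, the Q16.16 default format, and the whole interface are illustrative assumptions, not anything from the question:

#include <cstdint>
#include <iostream>

// Minimal fixed-point sketch along the lines the comment suggests.
// Rep is the storage type, Frac the number of fractional bits.
template <typename Rep = std::int32_t, int Frac = 16>
class fixed_point {
    Rep raw_; // value scaled by 2^Frac
    struct raw_tag {};
    constexpr fixed_point(Rep raw, raw_tag) : raw_(raw) {}
public:
    constexpr fixed_point() : raw_(0) {}
    constexpr fixed_point(double d)
        : raw_(static_cast<Rep>(d * (Rep(1) << Frac))) {}

    constexpr double to_double() const {
        return static_cast<double>(raw_) / (Rep(1) << Frac);
    }
    friend constexpr fixed_point operator+(fixed_point a, fixed_point b) {
        return fixed_point(a.raw_ + b.raw_, raw_tag{});
    }
    friend constexpr fixed_point operator-(fixed_point a, fixed_point b) {
        return fixed_point(a.raw_ - b.raw_, raw_tag{});
    }
    friend constexpr fixed_point operator*(fixed_point a, fixed_point b) {
        // Widen to 64 bits so the intermediate product doesn't overflow,
        // then shift back down to the fixed-point scale.
        return fixed_point(
            static_cast<Rep>((static_cast<std::int64_t>(a.raw_) * b.raw_) >> Frac),
            raw_tag{});
    }
};

int main() {
    fixed_point<> a(1.5), b(2.25);
    std::cout << (a * b).to_double() << '\n'; // prints 3.375
}

Note how multiplication has to widen and shift back on every operation; that extra bookkeeping is exactly why fixed point tends to lose to the hardware FPU on modern desktop CPUs, as the last comment above points out.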

11 Answers

64

For example (lower numbers are faster),

64-bit Intel Xeon X5550 @ 2.67GHz, gcc 4.1.2 -O3

short add/sub: 1.005460 [0]
short mul/div: 3.926543 [0]
long add/sub: 0.000000 [0]
long mul/div: 7.378581 [0]
long long add/sub: 0.000000 [0]
long long mul/div: 7.378593 [0]
float add/sub: 0.993583 [0]
float mul/div: 1.821565 [0]
double add/sub: 0.993884 [0]
double mul/div: 1.988664 [0]

32-bit Dual Core AMD Opteron(tm) Processor 265 @ 1.81GHz, gcc 3.4.6 -O3

short add/sub: 0.553863 [0]
short mul/div: 12.509163 [0]
long add/sub: 0.556912 [0]
long mul/div: 12.748019 [0]
long long add/sub: 5.298999 [0]
long long mul/div: 20.461186 [0]
float add/sub: 2.688253 [0]
float mul/div: 4.683886 [0]
double add/sub: 2.700834 [0]
double mul/div: 4.646755 [0]

As Dan pointed out, even once you normalize for clock frequency (which can itself be misleading in pipelined designs), results will vary wildly based on CPU architecture: individual ALU/FPU performance matters, as does the actual number of ALUs/FPUs available per core in superscalar designs, which influences how many independent operations can execute in parallel. The latter factor is not exercised by the code below, since all of its operations are sequentially dependent.

Poor man's FPU/ALU operation benchmark:

#include <stdio.h>
#ifdef _WIN32
#include <sys/timeb.h>
#else
#include <sys/time.h>
#endif
#include <time.h>
#include <cstdlib>

double mygettime(void) {
#ifdef _WIN32
    struct _timeb tb;
    _ftime(&tb);
    return (double)tb.time + (0.001 * (double)tb.millitm);
#else
    struct timeval tv;
    if (gettimeofday(&tv, 0) < 0) {
        perror("oops");
    }
    return (double)tv.tv_sec + (0.000001 * (double)tv.tv_usec);
#endif
}

template< typename Type >
void my_test(const char* name) {
    Type v = 0;
    // Do not use constants or repeating values
    // to avoid loop unroll optimizations.
    // All values > 0 to avoid division by 0.
    // Perform ten ops/iteration to reduce
    // impact of ++i below on measurements.
    Type v0 = (Type)(rand() % 256)/16 + 1;
    Type v1 = (Type)(rand() % 256)/16 + 1;
    Type v2 = (Type)(rand() % 256)/16 + 1;
    Type v3 = (Type)(rand() % 256)/16 + 1;
    Type v4 = (Type)(rand() % 256)/16 + 1;
    Type v5 = (Type)(rand() % 256)/16 + 1;
    Type v6 = (Type)(rand() % 256)/16 + 1;
    Type v7 = (Type)(rand() % 256)/16 + 1;
    Type v8 = (Type)(rand() % 256)/16 + 1;
    Type v9 = (Type)(rand() % 256)/16 + 1;

    double t1 = mygettime();
    for (size_t i = 0; i < 100000000; ++i) {
        v += v0; v -= v1; v += v2; v -= v3; v += v4;
        v -= v5; v += v6; v -= v7; v += v8; v -= v9;
    }
    // Pretend we make use of v so the compiler doesn't
    // optimize the loop away completely.
    printf("%s add/sub: %f [%d]\n", name, mygettime() - t1, (int)v & 1);

    t1 = mygettime();
    for (size_t i = 0; i < 100000000; ++i) {
        v /= v0; v *= v1; v /= v2; v *= v3; v /= v4;
        v *= v5; v /= v6; v *= v7; v /= v8; v *= v9;
    }
    // Pretend we make use of v so the compiler doesn't
    // optimize the loop away completely.
    printf("%s mul/div: %f [%d]\n", name, mygettime() - t1, (int)v & 1);
}

int main() {
    my_test< short >("short");
    my_test< long >("long");
    my_test< long long >("long long");
    my_test< float >("float");
    my_test< double >("double");
    return 0;
}

12 Comments

Why did you mix mult and div? Shouldn't it be interesting if mult is maybe (or expectedly?) much faster than div?
Multiplication is much faster than division in both the integer and floating-point cases. Division performance also depends on the size of the numbers. I usually assume that division is ~15 times slower.
pastebin.com/Kx8WGUfg I took your benchmark and separated out each operation to its own loop and added volatile to make sure. On Win64, the FPU is unused and MSVC will not generate code for it, so it compiles using mulss and divss XMM instructions there, which are 25x faster than the FPU in Win32. Test machine is Core i5 M 520 @ 2.40GHz
@JamesDunne just be careful: for FP ops, v will reach either 0 or +/-inf very quickly, which may or may not be (theoretically) treated as a special case / fast-pathed by certain FPU implementations.
This "benchmark" has no data parallelism for out-of-order execution, because every operation is done with the same accumulator (v). On recent Intel designs, divide isn't pipelined at all (divss/divps has 10-14 cycle latency, and the same reciprocal throughput). mulss however is 5 cycle latency, but can issue one every cycle. (Or two per cycle on Haswell, since port 0 and port 1 both have an multiplier for FMA).
50

TIL this varies (a lot). Here are some results using the GNU compiler (btw, I also checked by compiling on both machines: GNU g++ 5.4 from xenial is a hell of a lot faster than 4.6.3 from linaro on precise).

Intel i7 4700MQ xenial

short add: 0.822491
short sub: 0.832757
short mul: 1.007533
short div: 3.459642
long add: 0.824088
long sub: 0.867495
long mul: 1.017164
long div: 5.662498
long long add: 0.873705
long long sub: 0.873177
long long mul: 1.019648
long long div: 5.657374
float add: 1.137084
float sub: 1.140690
float mul: 1.410767
float div: 2.093982
double add: 1.139156
double sub: 1.146221
double mul: 1.405541
double div: 2.093173

Intel i3 2370M has similar results

short add: 1.369983
short sub: 1.235122
short mul: 1.345993
short div: 4.198790
long add: 1.224552
long sub: 1.223314
long mul: 1.346309
long div: 7.275912
long long add: 1.235526
long long sub: 1.223865
long long mul: 1.346409
long long div: 7.271491
float add: 1.507352
float sub: 1.506573
float mul: 2.006751
float div: 2.762262
double add: 1.507561
double sub: 1.506817
double mul: 1.843164
double div: 2.877484

Intel(R) Celeron(R) 2955U (Acer C720 Chromebook running xenial)

short add: 1.999639
short sub: 1.919501
short mul: 2.292759
short div: 7.801453
long add: 1.987842
long sub: 1.933746
long mul: 2.292715
long div: 12.797286
long long add: 1.920429
long long sub: 1.987339
long long mul: 2.292952
long long div: 12.795385
float add: 2.580141
float sub: 2.579344
float mul: 3.152459
float div: 4.716983
double add: 2.579279
double sub: 2.579290
double mul: 3.152649
double div: 4.691226

DigitalOcean 1GB Droplet Intel(R) Xeon(R) CPU E5-2630L v2 (running trusty)

short add: 1.094323
short sub: 1.095886
short mul: 1.356369
short div: 4.256722
long add: 1.111328
long sub: 1.079420
long mul: 1.356105
long div: 7.422517
long long add: 1.057854
long long sub: 1.099414
long long mul: 1.368913
long long div: 7.424180
float add: 1.516550
float sub: 1.544005
float mul: 1.879592
float div: 2.798318
double add: 1.534624
double sub: 1.533405
double mul: 1.866442
double div: 2.777649

AMD Opteron(tm) Processor 4122 (precise)

short add: 3.396932
short sub: 3.530665
short mul: 3.524118
short div: 15.226630
long add: 3.522978
long sub: 3.439746
long mul: 5.051004
long div: 15.125845
long long add: 4.008773
long long sub: 4.138124
long long mul: 5.090263
long long div: 14.769520
float add: 6.357209
float sub: 6.393084
float mul: 6.303037
float div: 17.541792
double add: 6.415921
double sub: 6.342832
double mul: 6.321899
double div: 15.362536

This uses code from http://pastebin.com/Kx8WGUfg as benchmark-pc.c

g++ -fpermissive -O3 -o benchmark-pc benchmark-pc.c 

I've run multiple passes, and the general numbers stay consistent across runs.

One notable exception seems to be ALU mul vs FPU mul; addition and subtraction differ only trivially.

Here is the above in chart form (lower is faster and preferable):

Chart of above data

Update to accommodate @Peter Cordes

https://gist.github.com/Lewiscowles1986/90191c59c9aedf3d08bf0b129065cccc

i7 4700MQ Linux Ubuntu Xenial 64-bit (all patches to 2018-03-13 applied)

short add: 0.773049
short sub: 0.789793
short mul: 0.960152
short div: 3.273668
int add: 0.837695
int sub: 0.804066
int mul: 0.960840
int div: 3.281113
long add: 0.829946
long sub: 0.829168
long mul: 0.960717
long div: 5.363420
long long add: 0.828654
long long sub: 0.805897
long long mul: 0.964164
long long div: 5.359342
float add: 1.081649
float sub: 1.080351
float mul: 1.323401
float div: 1.984582
double add: 1.081079
double sub: 1.082572
double mul: 1.323857
double div: 1.968488

AMD Opteron(tm) Processor 4122 (precise, DreamHost shared-hosting)

short add: 1.235603
short sub: 1.235017
short mul: 1.280661
short div: 5.535520
int add: 1.233110
int sub: 1.232561
int mul: 1.280593
int div: 5.350998
long add: 1.281022
long sub: 1.251045
long mul: 1.834241
long div: 5.350325
long long add: 1.279738
long long sub: 1.249189
long long mul: 1.841852
long long div: 5.351960
float add: 2.307852
float sub: 2.305122
float mul: 2.298346
float div: 4.833562
double add: 2.305454
double sub: 2.307195
double mul: 2.302797
double div: 5.485736

Intel Xeon E5-2630L v2 @ 2.4GHz (Trusty 64-bit, DigitalOcean VPS)

short add: 1.040745
short sub: 0.998255
short mul: 1.240751
short div: 3.900671
int add: 1.054430
int sub: 1.000328
int mul: 1.250496
int div: 3.904415
long add: 0.995786
long sub: 1.021743
long mul: 1.335557
long div: 7.693886
long long add: 1.139643
long long sub: 1.103039
long long mul: 1.409939
long long div: 7.652080
float add: 1.572640
float sub: 1.532714
float mul: 1.864489
float div: 2.825330
double add: 1.535827
double sub: 1.535055
double mul: 1.881584
double div: 2.777245

Apple Mac Mini M1

short add: 0.794701
short sub: 0.752165
short mul: 1.002816
short div: 1.510412
long add: 0.704235
long sub: 0.704065
long mul: 0.891701
long div: 1.391481
long long add: 0.703971
long long sub: 0.704361
long long mul: 0.890722
long long div: 1.392378
float add: 1.376483
float sub: 1.377145
float mul: 1.377523
float div: 1.754344
double add: 1.378830
double sub: 1.380009
double mul: 1.378437
double div: 2.005511

Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz

short add: 0.625791
short sub: 0.612076
short mul: 0.808043
short div: 3.223206
long add: 0.598402
long sub: 0.594910
long mul: 0.783385
long div: 4.568725
long long add: 0.594657
long long sub: 0.597185
long long mul: 0.778999
long long div: 4.467567
float add: 0.972729
float sub: 0.963480
float mul: 0.968124
float div: 1.767378
double add: 0.973561
double sub: 0.968600
double mul: 0.976119
double div: 1.967776

Apple MacBook Air M2

short add: 0.761225
short sub: 0.738152
short mul: 0.832800
short div: 1.407643
long add: 0.278027
long sub: 0.278680
long mul: 0.469060
long div: 0.971469
long long add: 0.278614
long long sub: 0.277795
long long mul: 0.469232
long long div: 0.972268
float add: 1.378481
float sub: 1.389127
float mul: 1.392117
float div: 1.722389
double add: 1.386530
double sub: 1.395797
double mul: 1.389992
double div: 1.969165

25 Comments

gcc5 maybe auto-vectorizes something that gcc4.6 didn't? Is benchmark-pc measuring some combination of throughput and latency? On your Haswell (i7 4700MQ), integer multiply is 1 per clock throughput, 3 cycle latency, but integer add/sub is 4 per clock throughput, 1 cycle latency (agner.org/optimize). So presumably there's a lot of loop overhead diluting those numbers for add and mul to come out so close (long add: 0.824088 vs. long mul: 1.017164). (gcc defaults to not unrolling loops, except for fully unrolling very low iteration counts).
And BTW, why does it not test int, only short and long? On Linux x86-64, short is 16 bits (and thus has partial-register slowdowns in some cases), while long and long long are both 64-bit types. (Maybe it's designed for Windows, where x86-64 still uses 32-bit long? Or maybe it's designed for 32-bit mode.) On Linux, the x32 ABI has 32-bit long in 64-bit mode, so if you have the libraries installed, use gcc -mx32 to compile for ILP32. Or just use -m32 and look at the long numbers.
And you should really check if your compiler auto-vectorized anything. e.g. using addps on xmm registers instead of addss, to do 4 FP adds in parallel in one instruction that's as fast as scalar addss. (Use -march=native to allow using whatever instruction sets your CPU supports, not just the SSE2 baseline for x86-64).
Floating point division vs floating point multiplication collects latency and throughput data from a variety of x86 microarchitectures for FP multiply vs. FP divide (for both single and double precision), for scalar and 128-bit vector vs. 256-bit vectors. (See the tables near the bottom of my answer. And read the rest of my answer and BeeOnRope's answer I linked earlier to understand more about what that means for an out-of-order exec CPU.)
For raw data for integer instructions with different operand-sizes in x86 assembly, see uops.info which microbenchmarks every instruction for latency and throughput. e.g. add r32, r32, imul r32, r32, or idiv r32 for 32-bit int math; without the store/reload your benchmark would measure just instruction latency, so the results it would generate are already available online. In its current state, it's measuring the equivalent of add m32, r32 latency for that case. (mul and div don't have forms with a memory destination.)
45

Alas, I can only give you an "it depends" answer...

From my experience, there are many, many variables that affect performance... especially between integer and floating-point math. It varies strongly from processor to processor (even within the same family, such as x86) because different processors have different pipeline lengths. Also, some operations are generally very simple (such as addition) and have an accelerated route through the processor, while others (such as division) take much, much longer.

The other big variable is where the data reside. If you only have a few values to add, then all of the data can reside in cache, where they can be quickly sent to the CPU. A very, very slow floating point operation that already has the data in cache will be many times faster than an integer operation where an integer needs to be copied from system memory.

I assume that you are asking this question because you are working on a performance-critical application. If you are developing for the x86 architecture and you need extra performance, you might want to look into using the SSE extensions. These can greatly speed up single-precision floating-point arithmetic, as the same operation can be performed on multiple data at once, plus there is a separate* bank of registers for the SSE operations. (I noticed in your second example you used "float" instead of "double", making me think you are using single-precision math.)

*Note: Using the old MMX instructions would actually slow down programs, because those old instructions actually used the same registers as the FPU does, making it impossible to use both the FPU and MMX at the same time.

4 Comments

And on some processors FP math can be faster than integer math. The Alpha processor had an FP divide instruction but not an integer one, so integer division had to be done in software.
Will SSEx also speed up double precision arithmetic? I'm sorry, I'm not too familiar with SSE
@JohannesSchaub-litb: SSE2 (baseline for x86-64) has packed double-precision FP. With only two 64-bit doubles per register, the potential speedup is smaller than float for code that vectorizes well. Scalar float and double use XMM registers on x86-64, with legacy x87 only used for long double. (So @ Dan: no, MMX registers don't conflict with normal FPU registers, because normal FPU on x86-64 is the SSE unit. MMX would be pointless because if you can do integer SIMD, you want 16-byte xmm0..15 instead of 8-byte mm0..7, and modern CPUs have worse MMX than SSE throughput.)
But MMX and SSE*/AVX2 integer instructions do compete for the same execution units, so using both at once is almost never useful. Just use the wider XMM / YMM versions to get more work done. Using SIMD integer and FP at the same time competes for the same registers, but x86-64 has 16 of them. But total throughput limits mean you can't get twice as much work done by using integer and FP execution units in parallel.
32

There is likely to be a significant difference in real-world speed between fixed-point and floating-point math, but the theoretical best-case throughput of the ALU vs FPU is completely irrelevant. Instead, the number of integer and floating-point registers (real registers, not register names) on your architecture which are not otherwise used by your computation (e.g. for loop control), the number of elements of each type which fit in a cache line, optimizations possible considering the different semantics for integer vs. floating point math -- these effects will dominate. The data dependencies of your algorithm play a significant role here, so that no general comparison will predict the performance gap on your problem.

For example, integer addition is commutative, so if the compiler sees a loop like you used for a benchmark (assuming the random data was prepared in advance so it wouldn't obscure the results), it can unroll the loop and calculate partial sums with no dependencies, then add them when the loop terminates. But with floating point, the compiler has to do the operations in the same order you requested (you've got sequence points in there so the compiler has to guarantee the same result, which disallows reordering) so there's a strong dependency of each addition on the result of the previous one.
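To make that concrete, here is a sketch of the transformation a compiler may legally apply to the integer reduction but not, by default, to the float one (the function names are mine):

#include <cstddef>

// What a compiler may do with an integer reduction: integer addition is
// associative, so the single dependency chain can be split into four
// independent partial sums that execute in parallel.
int sum_int(const int* data, std::size_t n) {
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    for (; i < n; ++i) s0 += data[i];
    return s0 + s1 + s2 + s3;
}

// The float version must keep the additions in source order, because FP
// addition is not associative; each add waits for the previous result.
// (Compilers only apply the transformation above to FP under flags such
// as -ffast-math / -fassociative-math.)
float sum_float(const float* data, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += data[i];
    return s;
}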

You're likely to fit more integer operands in cache at a time as well. So the fixed-point version might outperform the float version by an order of magnitude even on a machine where the FPU has theoretically higher throughput.

5 Comments

+1 for pointing out how naive benchmarks can yield 0-time loops because of unrolled constant integer operations. Moreover, the compiler can completely discard the loop (integer or FP) if the result is not actually used.
The conclusion from that is: one must call a function that takes the looping variable as an argument, since I think no compiler would be able to see that the function does nothing and that the call can be ignored. Since there's a call overhead, only the difference in times (float time - integer time) will be significant.
@GameAlchemist: Many compilers do eliminate calls to empty functions, as a side effect of inlining. You have to make an effort to prevent that.
The OP sounded like he was talking about using integer for things where FP would be a more natural fit, so it would take more integer code to achieve the same result as the FP code. In this case, just use FP. For example, on hardware with an FPU (e.g. a desktop CPU), fixed-point integer MP3 decoders are slower (and slightly more rounding errors) than floating-point decoders. Fixed-point implementations of codecs mainly exist to run on stripped-down ARM CPUs with no FP hardware, only slow emulated FP.
one example for the first point: on x86-64 with AVX-512 there are only 16 GP registers but 32 zmm registers so scalar floating-point math may be faster
20

Addition is much faster than rand, so your program is useless as a benchmark.

You need to identify performance hotspots and incrementally modify your program. It sounds like you have problems with your development environment that will need to be solved first. Is it impossible to run your program on your PC for a small problem set?

Generally, attempting FP jobs with integer arithmetic is a recipe for slow.

4 Comments

Yeah, as well as the conversion from a rand integer to a float in the floating point version. Any ideas on a better way to test this?
If you're trying to profile speed, look at POSIX's timespec_t or something similar. Record the time at the start and end of the loop and take the difference. Then move the rand data generation out of the loop. Make sure that your algorithm gets all its data from arrays and puts all its data in arrays. That gets your actual algorithm by itself, and gets setup, malloc, result printing, everything but task switching and interrupts out of your profiling loop (see the sketch after these comments).
@maxpenguin: the question is what you are testing. Artem has assumed you are doing graphics, Carl considered whether you're on an embedded platform sans FP, I supposed you're coding science for a server. You can't generalize or "write" benchmarks. Benchmarks are sampled from the actual work your program does. One thing I can tell you is that it won't remain "essentially the same speed" if you touch the performance-critical element in your program, whatever that is.
good point and good answer. We have an extensive code base currently. Really I have come up against the generalization that we "must not use float since integer calculation is faster" - and I am looking for a way (if this is even true) to disprove this generalized assumption. I realize that it would be impossible to predict the exact outcome for us short of doing all the work and profiling it afterwards. Anyway, thanks for your help.
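A minimal sketch of the timing approach described in the second comment above, assuming a POSIX system (clock_gettime with CLOCK_MONOTONIC; the helper name elapsed_seconds is my own):

#include <cstdio>
#include <ctime>

// Read a monotonic clock before and after the measured loop, subtract.
static double elapsed_seconds(const timespec& a, const timespec& b) {
    return double(b.tv_sec - a.tv_sec) + 1e-9 * double(b.tv_nsec - a.tv_nsec);
}

int main() {
    const long N = 100000000;
    volatile float accum = 0.0f; // volatile: the adds cannot be optimized out
    const float step = 1.0f;     // in real use, pre-generate inputs into an array

    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; ++i)
        accum = accum + step;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    std::printf("%f s, accum = %f\n", elapsed_seconds(t0, t1), float(accum));
    return 0;
}

Pre-generating the inputs into an array, as the comment says, would keep the cost of rand() out of the measured region entirely.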
9

Two points to consider -

Modern hardware can overlap instructions, execute them in parallel, and reorder them to make best use of the hardware. Also, any significant floating-point program is likely to have significant integer work too, even if it's only calculating indices into arrays, loop counters, etc., so even if you have a slow floating-point instruction, it may well be running on a separate bit of hardware, overlapped with some of the integer work. My point being that even if the floating-point instructions are slower than integer ones, your overall program may run faster because it can make use of more of the hardware.

As always, the only way to be sure is to profile your actual program.

The second point is that most CPUs these days have SIMD instructions for floating point that can operate on multiple floating-point values at the same time. For example, you can load 4 floats into a single SSE register and perform 4 multiplications on them all in parallel. If you can rewrite parts of your code to use SSE instructions, then it seems likely it will be faster than an integer version. Visual C++ provides compiler intrinsic functions to do this; see http://msdn.microsoft.com/en-us/library/x5c07e2a(v=VS.80).aspx for some information.
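For illustration, a hedged sketch of that 4-wide multiply using the standard SSE intrinsics, which work in MSVC as well as GCC/Clang (the function name mul4 is mine; it assumes n is a multiple of 4 and 16-byte-aligned data):

#include <xmmintrin.h> // SSE intrinsics
#include <cstdio>

// Multiply two float arrays 4 elements at a time, as the answer describes.
// Sketch only: assumes n is a multiple of 4 and the pointers are 16-byte
// aligned (use _mm_loadu_ps / _mm_storeu_ps otherwise).
void mul4(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i); // load 4 floats
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_mul_ps(va, vb)); // 4 multiplies at once
    }
}

int main() {
    alignas(16) float a[4] = {1, 2, 3, 4};
    alignas(16) float b[4] = {10, 20, 30, 40};
    alignas(16) float out[4];
    mul4(a, b, out, 4);
    std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
}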

2 Comments

One should note that on Win64, the FPU instructions are not generated by the MSVC compiler any more. Floating point is always using SIMD instructions there. This makes for a large speed discrepancy between Win32 and Win64 regarding flops.
"My point being that even if the floating point instructions are slow that integer ones, your overall program may run faster..." Relative to what?
5

Unless you're writing code that will be called millions of times per second (such as, e.g., drawing a line to the screen in a graphics application), integer vs. floating-point arithmetic is rarely the bottleneck.

The usual first step for efficiency questions is to profile your code to see where the run time is really spent. The Linux command for this is gprof.

Edit:

Though I suppose you can always implement the line drawing algorithm using integers and floating-point numbers, call it a large number of times and see if it makes a difference:

http://en.wikipedia.org/wiki/Bresenham's_algorithm
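For reference, the integer-only version of that algorithm looks roughly like this (a sketch of the classic error-accumulation form; plot() here is just a stand-in for whatever actually sets a pixel):

#include <cstdio>
#include <cstdlib>

// Trivial stand-in for whatever actually sets a pixel.
static void plot(int x, int y) { std::printf("(%d,%d)\n", x, y); }

// Integer-only Bresenham line, in the classic error-accumulation form,
// handling all octants. Useful as the integer half of the comparison.
void bresenham(int x0, int y0, int x1, int y1) {
    int dx = std::abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -std::abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;
    for (;;) {
        plot(x0, y0);
        if (x0 == x1 && y0 == y1) break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; } // step in x
        if (e2 <= dx) { err += dx; y0 += sy; } // step in y
    }
}

int main() {
    bresenham(0, 0, 6, 4);
}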

1 Comment

Scientific applications use FP. The only advantage of FP is that precision is scale-invariant. It's like scientific notation. If you know the scale of the numbers already (e.g., that the line length is a number of pixels), FP is obviated. But before you get to drawing the line, that's not true.
4

The floating point version will be much slower if there is no remainder operation. Since all the adds are sequential, the CPU will not be able to parallelise the summation; the latency will be critical. FPU add latency is typically 3 cycles, while integer add is 1 cycle. However, the divider for the remainder operator will probably be the critical part, as it is not fully pipelined on modern CPUs. So, assuming the divide/remainder instruction consumes the bulk of the time, the difference due to add latency will be small.

Comments

4

Today, integer operations are usually a little bit faster than floating-point operations. So if you can do a calculation with the same number of operations in integer and floating point, use integer. HOWEVER, you are saying "This causes a whole lot of annoying problems and adds a lot of annoying code". That sounds like you need more operations because you use integer arithmetic instead of floating point. In that case, floating point will run faster, because:

  • as soon as you need more integer operations, you probably need a lot more, so the slight speed advantage is more than eaten up by the additional operations

  • the floating-point code is simpler, which means it is faster to write the code, which means that if it is speed critical, you can spend more time optimising the code.

1 Comment

There is a lot of wild speculation here, not accounting for any of the secondary effects present in hardware, which often dominate computation time. Not a bad starting point, but it needs to be checked on each particular application via profiling, and not taught as gospel.
2

I ran a test that just added 1 to the number instead of rand(). Results (on an x86-64) were:

  • short: 4.260s
  • int: 4.020s
  • long long: 3.350s
  • float: 7.330s
  • double: 7.210s

5 Comments

Source, compile options, and timing method? I'm a bit surprised by the results.
Same loop as OP with "rand( ) % 365" replaced by "1". No optimization. User time from "time" command.
"No optimization" is the key. You never profile with optimization turned off, always profile in "release" mode.
In this case, though, the optimization off forces the op to occur, and is done deliberately -- the loop is there to dilate time to a reasonable scale of measurement. Using the constant 1 removes the cost of rand(). A sufficiently smart optimizing compiler would see 1 added 100,000,000 times with no way out of the loop and simply add 100000000 in a single op. That sort of gets around the whole purpose, doesn't it?
@Stan, make the variable volatile. Even a smart optimizing compiler should honour the multiple ops then.
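For example, a minimal sketch of that volatile fix applied to the constant-1 loop from the answer above; the compiler must now perform every add even with optimization on:

#include <iostream>

int main() {
    // volatile forces every read-modify-write to happen, so even at -O2
    // the compiler cannot collapse the loop into a single accum += N.
    volatile int accum = 0;
    for (unsigned int i = 0; i < 100000000; ++i) {
        accum = accum + 1;
    }
    std::cout << accum << std::endl;
    return 0;
}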
0

Based on that oh-so-reliable "something I've heard", back in the old days integer calculation was about 20 to 50 times faster than floating point, and these days it's less than twice as fast.

2 Comments

Please consider looking at this again, offering more than opinion (especially given that the opinion seems to fly in the face of the facts gathered).
@MrMesees While this answer is not terribly useful, I would say it is consistent with the tests you made. And the historical trivia is probably fine too.
