
I wrote this code:

program exponent
  implicit none
  real(8) :: sum
  integer(8) :: i
  integer :: limit
  real :: start, end
  sum = 0d0
  limit = 10000000
  call CPU_TIME(start)
  do i = 1, limit
    sum = sum + exp(i*1.d0/limit)
  end do
  call CPU_TIME(end)
  print *, sum
  print '("Time = ",f6.3," seconds.")', end-start
end program exponent

And I compiled it with gfortran 10.1.0 and ifort 19.1.3.304 on CentOS Linux 7 using:

ifort *.f90 -O3 -o intel.out

gfortran *.f90 -O3 -o gnu.out

and the outputs are:

gnu:

17182819.143730670 Time = 0.248 seconds. 

intel:

17182819.1437313 Time = 0.051 seconds. 

When I run each of them a few times, the run times are pretty consistent.

Why is ifort faster than gfortran and how can I make gfortran run as fast as ifort?

  • This looks like a difference in the exp implementation. Intel might already be cutting some corners at O3 while GCC uses GLIBC. However, try also running the tests in the opposite order. The test is too small to really let the CPU fully spin up. Also, time just the loop. Commented May 18, 2021 at 8:50
  • I can confirm the difference. On a computer without CPU throttling it is quite enough to do one try. Multiple tries in a loop did not make a difference. -ffast-math makes gfortran marginally faster, but not by much. Commented May 18, 2021 at 8:58
  • As @VladimirF already said, only time the loop. When using time you also record the time the OS needs to start/shut down the application, load libraries, etc. time is not usable for benchmarking. Commented May 18, 2021 at 9:15
  • I should have added that my times are from just timing the loop. On second thought, Intel might be vectorising or otherwise reordering the loop differently, rather than approximating exp. Commented May 18, 2021 at 9:17
  • Added compiler versions and OS to the question. Commented May 18, 2021 at 12:25

1 Answer


ifort is faster mainly because it uses Intel's own optimized math library, SVML (shipped with the Intel compiler). SVML provides optimized vectorized primitives and is often faster even without -ffast-math. Moreover, the Intel compiler tends to vectorize loops better, especially reductions like this one.

You can see the difference on GodBolt: the ifort version vectorizes the loop, processing 2 numbers at a time, while the gfortran version calls a slower scalar exponential.

Note that using -mavx2 helps ifort generate faster code thanks to the AVX2 instruction set. Using AVX-512 instructions (if available on the target machine) could be even faster. gfortran can vectorize the loop with -march=native on GodBolt (but, strangely, not with -march=skylake and -ffast-math).
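As a sketch, the flag combinations discussed above might look like this on the command line (assuming the source file is named exponent.f90; the exact -march/-x choice depends on your CPU):

```shell
# gfortran: -O3 enables -ftree-vectorize, -ffast-math allows reassociating
# the reduction, and -march=native targets the host CPU (AVX2/AVX-512 where
# available), which lets glibc's vectorized math routines be used for exp.
gfortran exponent.f90 -O3 -ffast-math -march=native -o gnu.out

# ifort: vectorizes with SVML by default at -O3; -xHost targets the host ISA.
ifort exponent.f90 -O3 -xHost -o intel.out
```
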


6 Comments

Thank you for this answer! I added -no-vec to the ifort compilation and -fno-tree-vectorize to gfortran. The new timings are: ifort 0.142 sec, gnu 0.248 sec. So vectorization indeed accounts for part of the run time difference, but ifort is still faster. Is there a way to make gfortran as fast as ifort?
Probably yes, by linking SVML into the gfortran version. I am not sure that is well supported; at worst, it should be possible to call SVML manually. Alternatively, you can use the standard libm in the ifort version, although both versions will be slow in that case. Another solution is to write your own vectorized code, but this is cumbersome to do (although there are good research papers on that).
@steve There are significantly faster implementations nowadays. The GNU implementation already uses an optimized table-based algorithm, but state-of-the-art algorithms (like the one used in SVML) outperform it (especially, but not only, due to vectorization). Here is a fairly recent reference: "Fast Exponential Computation on SIMD Architectures" (2015).
even without -ffast-math - note that ICC's (and I assume ifort's) default is -fp-model fast=1. Not quite as aggressive as -fp-model fast=2, but still more like -ffast-math in the ways that matter here: e.g. it allows ifort to pretend that FP math is associative and vectorize sums of arrays, or in this case use addpd on a pair of vectorized exp results instead of separately adding each element to a scalar sum inside the loop.
Strange that gfortran will only use a vectorized exp with -march=skylake-avx512 (native on those cloud servers), not -march=skylake. Even -mveclibabi=svml or acml doesn't help. sourceware.org/glibc/wiki/libmvec says gcc 4.9 should be able to vectorize C functions like cos() with -O1 -fopenmp -ffast-math -lm -mavx2, or -O1 -ftree-loop-vectorize -ffast-math -lm -mavx. (-O3 includes -ftree-vectorize.)
