Recently, I read a post on Stack Overflow about finding integers that are perfect squares. As I wanted to play with this, I wrote the following small program:

    PROGRAM PERFECT_SQUARE
    IMPLICIT NONE
    INTEGER*8 :: N, M, NTOT
    LOGICAL :: IS_SQUARE
    N=Z'D0B03602181'
    WRITE(*,*) IS_SQUARE(N)
    NTOT=0
    DO N=1,1000000000
      IF (IS_SQUARE(N)) THEN
        NTOT=NTOT+1
      END IF
    END DO
    WRITE(*,*) NTOT ! should find 31622 squares
    END PROGRAM

    LOGICAL FUNCTION IS_SQUARE(N)
    IMPLICIT NONE
    INTEGER*8 :: N, M
    ! check if negative
    IF (N.LT.0) THEN
      IS_SQUARE=.FALSE.
      RETURN
    END IF
    ! check if ending 4 bits belong to (0,1,4,9)
    M=IAND(N,15)
    IF (.NOT.(M.EQ.0 .OR. M.EQ.1 .OR. M.EQ.4 .OR. M.EQ.9)) THEN
      IS_SQUARE=.FALSE.
      RETURN
    END IF
    ! try to find the nearest integer to sqrt(n)
    M=DINT(SQRT(DBLE(N)))
    IF (M**2.NE.N) THEN
      IS_SQUARE=.FALSE.
      RETURN
    END IF
    IS_SQUARE=.TRUE.
    RETURN
    END FUNCTION

Compiled with gfortran -O2, the running time is 4.437 seconds; with -O3 it is 2.657 seconds. I then thought that ifort -O2 could be faster, since it might have a faster SQRT function, but the running time turned out to be 9.026 seconds, and the same with ifort -O3. I tried to analyze it using Valgrind, and the Intel-compiled program does indeed execute many more instructions.

My question is why? Is there a way to find out where exactly the difference comes from?

EDITS:

  • gfortran version 4.6.2 and ifort version 12.0.2
  • times are obtained from running time ./a.out and are the real/user times (sys was always almost 0)
  • this is on Linux x86_64, both gfortran and ifort are 64-bit builds
  • ifort inlines everything, gfortran only at -O3, but the latter assembly code is simpler than that of ifort, which uses xmm registers a lot
  • fixed line of code, added NTOT=0 before loop, should fix issue with other gfortran versions

When the complex IF statement is removed, gfortran takes about 4 times as long (10-11 seconds). This is to be expected, since the statement throws out about 75% of the numbers, so the SQRT is never computed for them. ifort, on the other hand, takes only slightly more time. My guess is that something goes wrong when ifort tries to optimize the IF statement.
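The 75% figure follows from the residues of squares modulo 16. As a minimal sketch (the program name SQUARE_RESIDUES and the variables in it are made up for illustration), this enumerates the last four bits of K**2 for K = 0..15, which covers every possible case because the low four bits of a square depend only on the low four bits of the number being squared:

```fortran
PROGRAM SQUARE_RESIDUES
  IMPLICIT NONE
  INTEGER :: K, R, NPASS
  LOGICAL :: SEEN(0:15)

  SEEN = .FALSE.
  ! mark every value that the low 4 bits of a square can take
  DO K = 0, 15
    SEEN(IAND(K**2, 15)) = .TRUE.
  END DO

  ! list the marked residues and count them
  NPASS = 0
  DO R = 0, 15
    IF (SEEN(R)) THEN
      WRITE(*,*) R
      NPASS = NPASS + 1
    END IF
  END DO
  WRITE(*,*) 'residues that pass:', NPASS, 'of 16'
END PROGRAM SQUARE_RESIDUES
```

Only 0, 1, 4 and 9 are marked, so 12 of the 16 possible endings (75%) are rejected before the SQRT is reached.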

EDIT2:

I tried with ifort version 12.1.2.273 and it's much faster, so it looks like they fixed that.

  • Are those wall times or CPU times? Can you paste the output of time <program> for each one? And were these 32-bit builds or 64-bit builds? Commented Jan 17, 2012 at 10:49
  • Have you tried disassembling the object files emitted by each compiler and comparing them? Commented Jan 17, 2012 at 11:02
  • @talonmies: no I didn't, since I don't really understand assembly. Although running through valgrind --tool=callgrind --dump-instr=yes also gives the assembly code, but that's really complex (many differences) and depends on the level of optimization. Commented Jan 17, 2012 at 11:08
  • Did you try more aggressive optimization levels? They might be worth it. Commented Jan 17, 2012 at 13:03
  • Are you sure your program is correct? With more recent versions of gfortran than 4.5 I get different answers. Commented Jan 17, 2012 at 13:10

1 Answer

What compiler versions are you using? Interestingly, it looks like a case where there is a performance regression from 11.1 to 12.0 -- e.g. for me, 11.1 (ifort -fast square.f90) takes 3.96s, and 12.0 (same options) took 13.3s. gfortran (4.6.1) (-O3) is still faster (3.35s). I have seen this kind of a regression before, although not quite as dramatic. BTW, replacing the if statement with

    is_square = any(m == [0, 1, 4, 9])
    if(.not. is_square) return

makes it run twice as fast with ifort 12.0, but slower in gfortran and ifort 11.1.

It looks like part of the problem is that 12.0 is overly aggressive in trying to vectorize things: adding

!DEC$ NOVECTOR 

right before the DO loop (without changing anything else in the code) cuts the run time down to 4.0 sec.
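For concreteness, here is a sketch of where the directive would sit. The program name PERFECT_SQUARE_NOVEC is made up, and IS_SQUARE is condensed from the question's function using INTEGER(8) and explicit kind suffixes; that restyling is an assumption of this sketch, not part of the original test. Compilers other than ifort should treat the directive line as an ordinary comment.

```fortran
PROGRAM PERFECT_SQUARE_NOVEC
  IMPLICIT NONE
  INTEGER(8) :: N, NTOT
  LOGICAL :: IS_SQUARE

  NTOT = 0
  ! Intel-specific directive: ask ifort not to vectorize the next loop
!DEC$ NOVECTOR
  DO N = 1, 1000000000
    IF (IS_SQUARE(N)) NTOT = NTOT + 1
  END DO
  WRITE(*,*) NTOT ! should find 31622 squares
END PROGRAM

LOGICAL FUNCTION IS_SQUARE(N)
  IMPLICIT NONE
  INTEGER(8) :: N, M
  IS_SQUARE = .FALSE.
  ! negative numbers are never squares
  IF (N .LT. 0) RETURN
  ! the last 4 bits of a square must be 0, 1, 4 or 9
  M = IAND(N, 15_8)
  IF (.NOT. (M.EQ.0 .OR. M.EQ.1 .OR. M.EQ.4 .OR. M.EQ.9)) RETURN
  ! truncated double-precision square root, then check by squaring
  M = INT(SQRT(DBLE(N)), 8)
  IS_SQUARE = (M**2 .EQ. N)
END FUNCTION
```

The directive applies only to the loop that immediately follows it, so the rest of the program is compiled as before.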

Also, as a side benefit: if you have a multi-core CPU, try adding -parallel to the ifort command line :)


6 Comments

See edit: with ifort version 12.1.2.273 it worked. Also, strangely, there is a difference between compiling and linking separately and doing both on one line. Now I have: gfortran 3.1 s with or without your any statement, and ifort 3.3 s original and 5.0 s with the any statement.
Yes, I think the reason any made things faster in 12.0 was that it prevented vectorization. The difference between compiling separately and on one line suggests, I think, that something like an instruction cache issue might be going on. Also, the whole thing would speed up a lot if you used single instead of double precision, and I think also if you modified the function to return 1 or 0 and added the results up instead of having an if statement (conditional statements in loops are often bad for performance).
thnx for the comments especially the addition hint, I'll try that. As for double precision: I need it to get the correct square root for very large integers. This was just an exploration of integer square root, to see if it can be done faster with other methods, but it turned out that the sqrt instruction can't be beaten...
well, turns out that with the integer is_square function and simple addition in the loop, the code runs slower, i.e. 3.9s instead of 3.1s :)
I see. Just tried it: for me this also happened, but only with gfortran (not ifort), and only when the function was declared as integer(8); with the machine default integer the speed is the same. No improvement though. I guess there are quite a few jumps there already, so one more does not really matter :)
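The integer-returning variant discussed in these comments might look roughly like the sketch below. The names PERFECT_SQUARE_SUM and SQUARE_FLAG are hypothetical, and the exact code the commenters timed is not shown in the thread. According to the timings reported above, this form was not faster in practice (3.9 s instead of 3.1 s with gfortran).

```fortran
PROGRAM PERFECT_SQUARE_SUM
  IMPLICIT NONE
  INTEGER(8) :: N, NTOT
  INTEGER(8) :: SQUARE_FLAG ! hypothetical integer version of IS_SQUARE

  NTOT = 0
  ! no IF in the loop body: the 0/1 result is simply accumulated
  DO N = 1, 1000000000
    NTOT = NTOT + SQUARE_FLAG(N)
  END DO
  WRITE(*,*) NTOT ! should find 31622 squares
END PROGRAM

INTEGER(8) FUNCTION SQUARE_FLAG(N)
  IMPLICIT NONE
  INTEGER(8) :: N, M
  SQUARE_FLAG = 0
  ! negative numbers are never squares
  IF (N .LT. 0) RETURN
  ! the last 4 bits of a square must be 0, 1, 4 or 9
  M = IAND(N, 15_8)
  IF (.NOT. (M.EQ.0 .OR. M.EQ.1 .OR. M.EQ.4 .OR. M.EQ.9)) RETURN
  ! truncated double-precision square root, then check by squaring
  M = INT(SQRT(DBLE(N)), 8)
  IF (M**2 .EQ. N) SQUARE_FLAG = 1
END FUNCTION
```

The function itself still contains the early-exit branches, which fits the observation in the last comment that one more jump in the loop makes little difference.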