
I am learning assembly, so I wrote a routine that returns the square root of its input if the input is non-negative, and returns 0 otherwise.

I have implemented the routine in both assembly and C, and I would like to understand why the C routines compiled with -O2 are much faster than my assembly routine. The disassembled code for the C routines looks slightly more complex than my assembly routine, so I don't understand where I am going wrong.

The assembly routine (srt.asm):

global srt
section .text

srt:
        pxor    xmm1,xmm1
        comisd  xmm0,xmm1
        jbe     P
        sqrtsd  xmm0,xmm0
        retq
P:
        pxor    xmm0,xmm0
        retq

I am compiling the above as

nasm -g -felf64 srt.asm 

The C routines (srtc.c):

#include <stdio.h>
#include <math.h>
#include <time.h>

extern double srt(double);

double srt1(double x)
{
    return sqrt( (x > 0) * x );
}

double srt2(double x)
{
    if( x > 0)
        return sqrt(x);
    return 0;
}

int main(void)
{
    double v = 0;
    clock_t start;
    clock_t end;
    double niter = 2e8;

    start = clock();
    v = 0;
    for( double i = 0; i < niter; i++ ) {
        v += srt(i);
    }
    end = clock();
    printf("time taken srt = %f v=%g\n", (double) (end - start)/CLOCKS_PER_SEC, v);

    start = clock();
    v = 0;
    for( double i = 0; i < niter; i++ ) {
        v += srt1(i);
    }
    end = clock();
    printf("time taken srt1 = %f v=%g\n", (double) (end - start)/CLOCKS_PER_SEC, v);

    start = clock();
    v = 0;
    for( double i = 0; i < niter; i++ ) {
        v += srt2(i);
    }
    end = clock();
    printf("time taken srt2 = %f v=%g\n", (double) (end - start)/CLOCKS_PER_SEC, v);

    return 0;
}

The above is compiled as

gcc -g -O2 srt.o -o srtc srtc.c -lm 

The output of the program is

time taken srt = 0.484375 v=1.88562e+12
time taken srt1 = 0.312500 v=1.88562e+12
time taken srt2 = 0.312500 v=1.88562e+12

So my assembly routine is significantly slower.

The disassembled C code is

Disassembly of section .text:

0000000000000000 <srt1>:
   0:   f3 0f 1e fa             endbr64
   4:   66 0f ef c9             pxor    xmm1,xmm1
   8:   66 0f 2f c1             comisd  xmm0,xmm1
   c:   77 04                   ja      12 <srt1+0x12>
   e:   f2 0f 59 c1             mulsd   xmm0,xmm1
  12:   66 0f 2e c8             ucomisd xmm1,xmm0
  16:   66 0f 28 d0             movapd  xmm2,xmm0
  1a:   f2 0f 51 d2             sqrtsd  xmm2,xmm2
  1e:   77 05                   ja      25 <srt1+0x25>
  20:   66 0f 28 c2             movapd  xmm0,xmm2
  24:   c3                      ret
  25:   48 83 ec 18             sub     rsp,0x18
  29:   f2 0f 11 54 24 08       movsd   QWORD PTR [rsp+0x8],xmm2
  2f:   e8 00 00 00 00          call    34 <srt1+0x34>
  34:   f2 0f 10 54 24 08       movsd   xmm2,QWORD PTR [rsp+0x8]
  3a:   48 83 c4 18             add     rsp,0x18
  3e:   66 0f 28 c2             movapd  xmm0,xmm2
  42:   c3                      ret
  43:   66 66 2e 0f 1f 84 00    data16 nop WORD PTR cs:[rax+rax*1+0x0]
  4a:   00 00 00 00
  4e:   66 90                   xchg    ax,ax

0000000000000050 <srt2>:
  50:   f3 0f 1e fa             endbr64
  54:   66 0f ef c9             pxor    xmm1,xmm1
  58:   66 0f 2f c1             comisd  xmm0,xmm1
  5c:   66 0f 28 d1             movapd  xmm2,xmm1
  60:   77 0e                   ja      70 <srt2+0x20>
  62:   66 0f 28 c2             movapd  xmm0,xmm2
  66:   c3                      ret
  67:   66 0f 1f 84 00 00 00    nop     WORD PTR [rax+rax*1+0x0]
  6e:   00 00
  70:   66 0f 2e c8             ucomisd xmm1,xmm0
  74:   66 0f 28 d0             movapd  xmm2,xmm0
  78:   f2 0f 51 d2             sqrtsd  xmm2,xmm2
  7c:   76 e4                   jbe     62 <srt2+0x12>
  7e:   48 83 ec 18             sub     rsp,0x18
  82:   f2 0f 11 54 24 08       movsd   QWORD PTR [rsp+0x8],xmm2
  88:   e8 00 00 00 00          call    8d <srt2+0x3d>
  8d:   f2 0f 10 54 24 08       movsd   xmm2,QWORD PTR [rsp+0x8]
  93:   48 83 c4 18             add     rsp,0x18
  97:   66 0f 28 c2             movapd  xmm0,xmm2
  9b:   c3                      ret
  • And BTW, the reason for GCC's extra code when inlining sqrt(), which conditionally calls into the libm function, is that you didn't use -fno-math-errno. See How to force GCC to assume that a floating-point expression is non-negative? - you should generally always use that flag. [See the adjusted compile line after these comments.] Commented Jun 25, 2021 at 22:42
  • I am running my code on WSL. /proc/cpuinfo shows the model name as i5-8365u. I did run my code multiple times and got similar results each time. I also ran the assembly version last, instead of running it first. The numbers were similar in all cases. Commented Jun 25, 2021 at 22:44
  • I tested on my i7-6700k Skylake (same microarchitecture as your Kaby Lake); I can repro the performance effect even with warm-up runs, so that doesn't seem to be it. Probably after inlining into the loop, GCC can optimize away some work; have a look at the actual asm for main, since you didn't use __attribute__((noinline,noclone)) on your C functions. Commented Jun 25, 2021 at 22:45
  • Oh right, simply being a non-inline function is the problem. x86-64 System V doesn't have any call-preserved XMM registers, so the add dependency chain through v includes a store/reload for srt(), but not when srt1 or srt2 inline. Commented Jun 25, 2021 at 22:49
  • After disabling inlining, I get similar performance. Thanks. If you make your comment an answer I will accept it. Commented Jun 25, 2021 at 23:00
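The flag mentioned in the first comment would simply be added to the existing compile line; with -fno-math-errno, GCC does not have to preserve errno semantics for sqrt() and can inline it as a bare sqrtsd without the fallback call into libm for negative inputs. This is a sketch of the suggested change, not a command the asker reports running:

gcc -g -O2 -fno-math-errno srt.o -o srtc srtc.c -lm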

1 Answer


Peter Cordes's comment explains what is happening here: srt1 and srt2 are inlined, while srt is not. Quoting Peter Cordes:

Oh right, simply being a non-inline function is the problem. x86-64 System V doesn't have any call-preserved XMM registers, so the add dependency chain through v includes a store/reload for srt(), but not when srt1 or srt2 inline.
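The asker confirmed in the comments that the timings converge once inlining is disabled, because all three routines then pay the same call overhead and the running sum v has to round-trip through memory across every call. A minimal sketch of how that can be done, using the __attribute__((noinline,noclone)) annotation mentioned in the comments (the asker's exact change is not shown in the question):

/* Keeping the C routines out of line forces them through the same
   calling convention as the assembly routine srt, so the store/reload
   on v's dependency chain appears in all three timing loops. */
__attribute__((noinline, noclone))
double srt1(double x)
{
    return sqrt( (x > 0) * x );
}

__attribute__((noinline, noclone))
double srt2(double x)
{
    if( x > 0)
        return sqrt(x);
    return 0;
}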
