
I am learning assembly, so I wrote a routine that returns the square root of its input if the input is non-negative, and returns 0 otherwise.

I have implemented the routine in both assembly and C, and I would like to understand why the C routines compiled with -O2 are much faster than my assembly routine. The disassembled code for the C routines looks slightly more complex than my assembly routine, so I don't understand where I am going wrong.

The assembly routine (srt.asm):

global srt
section .text

srt:
        pxor    xmm1,xmm1
        comisd  xmm0,xmm1
        jbe     P
        sqrtsd  xmm0,xmm0
        retq
P:
        pxor    xmm0,xmm0
        retq

I am compiling the above as

nasm -g -felf64 srt.asm 

The C routines (srtc.c):

#include <stdio.h>
#include <math.h>
#include <time.h>

extern double srt(double);

double srt1(double x)
{
    return sqrt( (x > 0) * x );
}

double srt2(double x)
{
    if( x > 0)
        return sqrt(x);
    return 0;
}

int main(void)
{
    double v = 0;
    clock_t start;
    clock_t end;
    double niter = 2e8;

    start = clock();
    v = 0;
    for( double i = 0; i < niter; i++ ) {
        v += srt(i);
    }
    end = clock();
    printf("time taken srt = %f v=%g\n", (double) (end - start)/CLOCKS_PER_SEC, v);

    start = clock();
    v = 0;
    for( double i = 0; i < niter; i++ ) {
        v += srt1(i);
    }
    end = clock();
    printf("time taken srt1 = %f v=%g\n", (double) (end - start)/CLOCKS_PER_SEC, v);

    start = clock();
    v = 0;
    for( double i = 0; i < niter; i++ ) {
        v += srt2(i);
    }
    end = clock();
    printf("time taken srt2 = %f v=%g\n", (double) (end - start)/CLOCKS_PER_SEC, v);

    return 0;
}

The above is compiled as

gcc -g -O2 srt.o -o srtc srtc.c -lm 

The output of the program is

time taken srt = 0.484375 v=1.88562e+12
time taken srt1 = 0.312500 v=1.88562e+12
time taken srt2 = 0.312500 v=1.88562e+12

So my assembly routine is significantly slower.

The disassembled C code is

Disassembly of section .text:

0000000000000000 <srt1>:
   0:   f3 0f 1e fa             endbr64
   4:   66 0f ef c9             pxor    xmm1,xmm1
   8:   66 0f 2f c1             comisd  xmm0,xmm1
   c:   77 04                   ja      12 <srt1+0x12>
   e:   f2 0f 59 c1             mulsd   xmm0,xmm1
  12:   66 0f 2e c8             ucomisd xmm1,xmm0
  16:   66 0f 28 d0             movapd  xmm2,xmm0
  1a:   f2 0f 51 d2             sqrtsd  xmm2,xmm2
  1e:   77 05                   ja      25 <srt1+0x25>
  20:   66 0f 28 c2             movapd  xmm0,xmm2
  24:   c3                      ret
  25:   48 83 ec 18             sub     rsp,0x18
  29:   f2 0f 11 54 24 08       movsd   QWORD PTR [rsp+0x8],xmm2
  2f:   e8 00 00 00 00          call    34 <srt1+0x34>
  34:   f2 0f 10 54 24 08       movsd   xmm2,QWORD PTR [rsp+0x8]
  3a:   48 83 c4 18             add     rsp,0x18
  3e:   66 0f 28 c2             movapd  xmm0,xmm2
  42:   c3                      ret
  43:   66 66 2e 0f 1f 84 00    data16 nop WORD PTR cs:[rax+rax*1+0x0]
  4a:   00 00 00 00
  4e:   66 90                   xchg    ax,ax

0000000000000050 <srt2>:
  50:   f3 0f 1e fa             endbr64
  54:   66 0f ef c9             pxor    xmm1,xmm1
  58:   66 0f 2f c1             comisd  xmm0,xmm1
  5c:   66 0f 28 d1             movapd  xmm2,xmm1
  60:   77 0e                   ja      70 <srt2+0x20>
  62:   66 0f 28 c2             movapd  xmm0,xmm2
  66:   c3                      ret
  67:   66 0f 1f 84 00 00 00    nop     WORD PTR [rax+rax*1+0x0]
  6e:   00 00
  70:   66 0f 2e c8             ucomisd xmm1,xmm0
  74:   66 0f 28 d0             movapd  xmm2,xmm0
  78:   f2 0f 51 d2             sqrtsd  xmm2,xmm2
  7c:   76 e4                   jbe     62 <srt2+0x12>
  7e:   48 83 ec 18             sub     rsp,0x18
  82:   f2 0f 11 54 24 08       movsd   QWORD PTR [rsp+0x8],xmm2
  88:   e8 00 00 00 00          call    8d <srt2+0x3d>
  8d:   f2 0f 10 54 24 08       movsd   xmm2,QWORD PTR [rsp+0x8]
  93:   48 83 c4 18             add     rsp,0x18
  97:   66 0f 28 c2             movapd  xmm0,xmm2
  9b:   c3                      ret
  • And BTW, the reason for GCC's extra code when inlining sqrt(), which conditionally calls into the libm function, is that you didn't use -fno-math-errno. See How to force GCC to assume that a floating-point expression is non-negative? - you should generally always use that flag. [See the adjusted compile line after these comments.] Commented Jun 25, 2021 at 22:42
  • I am running my code on WSL. /proc/cpuinfo shows the model name as i5-8365u. I did run my code multiple times and got similar results each time. I also ran the assembly version last, instead of running it first. The numbers were similar in all cases. Commented Jun 25, 2021 at 22:44
  • I tested on my i7-6700k Skylake (same microarchitecture as your Kaby Lake); I can repro the performance effect even with warm-up runs, so that doesn't seem to be it. Probably after inlining into the loop, GCC can optimize away some work; have a look at the actual asm for main, since you didn't use __attribute__((noinline,noclone)) on your C functions. Commented Jun 25, 2021 at 22:45
  • Oh right, simply being a non-inline function is the problem. x86-64 System V doesn't have any call-preserved XMM registers, so the add dependency chain through v includes a store/reload for srt(), but not when srt1 or srt2 inline. Commented Jun 25, 2021 at 22:49
  • After disabling inlining, I get similar performance. Thanks. If you make your comment an answer I will accept it. Commented Jun 25, 2021 at 23:00
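The flag mentioned in the first comment would simply be added to the existing compile line; with -fno-math-errno, GCC does not have to preserve errno semantics for sqrt() and can inline it as a bare sqrtsd without the fallback call into libm for negative inputs. This is a sketch of the suggested change, not a command the asker reports running:

gcc -g -O2 -fno-math-errno srt.o -o srtc srtc.c -lm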

1 Answer


Peter Cordes's comment explains what is happening here: srt1 and srt2 are inlined, while srt is not. Quoting Peter Cordes:

Oh right, simply being a non-inline function is the problem. x86-64 System V doesn't have any call-preserved XMM registers, so the add dependency chain through v includes a store/reload for srt(), but not when srt1 or srt2 inline.
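The asker confirmed in the comments that the timings converge once inlining is disabled, because all three routines then pay the same call overhead and the running sum v has to round-trip through memory across every call. A minimal sketch of how that can be done, using the __attribute__((noinline,noclone)) annotation mentioned in the comments (the asker's exact change is not shown in the question):

/* Keeping the C routines out of line forces them through the same
   calling convention as the assembly routine srt, so the store/reload
   on v's dependency chain appears in all three timing loops. */
__attribute__((noinline, noclone))
double srt1(double x)
{
    return sqrt( (x > 0) * x );
}

__attribute__((noinline, noclone))
double srt2(double x)
{
    if( x > 0)
        return sqrt(x);
    return 0;
}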
