Why does the x86-64 GCC function prologue allocate less stack than the local variables?

Question

Consider the following simple program:

    int main(int argc, char **argv)
    {
        char buffer[256];

        buffer[0] = 0x41;
        buffer[128] = 0x41;
        buffer[255] = 0x41;

        return 0;
    }

Compiled with GCC 4.7.0 on an x86-64 machine, the disassembly of main() in GDB gives:

    0x00000000004004cc <+0>:     push   rbp
    0x00000000004004cd <+1>:     mov    rbp,rsp
    0x00000000004004d0 <+4>:     sub    rsp,0x98
    0x00000000004004d7 <+11>:    mov    DWORD PTR [rbp-0x104],edi
    0x00000000004004dd <+17>:    mov    QWORD PTR [rbp-0x110],rsi
    0x00000000004004e4 <+24>:    mov    BYTE PTR [rbp-0x100],0x41
    0x00000000004004eb <+31>:    mov    BYTE PTR [rbp-0x80],0x41
    0x00000000004004ef <+35>:    mov    BYTE PTR [rbp-0x1],0x41
    0x00000000004004f3 <+39>:    mov    eax,0x0
    0x00000000004004f8 <+44>:    leave
    0x00000000004004f9 <+45>:    ret

Why does it subtract only 0x98 (152 decimal) from rsp when the buffer is 256 bytes? When I mov data into buffer[0], it seems to write outside the allocated stack frame, addressing it through rbp. So what is even the point of the sub rsp,0x98?

Another question: what do these lines do?

    0x00000000004004d7 <+11>:    mov    DWORD PTR [rbp-0x104],edi
    0x00000000004004dd <+17>:    mov    QWORD PTR [rbp-0x110],rsi

Why does EDI and not RDI need to be saved? I also see that both values are stored below the lowest address of the buffer declared in the C code. Also of interest is why the delta between the two variables is so big. Since EDI is just 4 bytes, why does it need a 12-byte separation between the two variables?

The 12-byte separation is due to alignment. rsi is 8 bytes, so padding is needed to keep it aligned to 8 bytes. But I can't speak for the under-allocation of the stack.
– Mysticial, Commented Nov 2, 2012 at 19:21
It probably saves EDI and RSI simply because it is not required to save these by the caller? But still the manner in which they are saved seems weird.
– csstudent2233, Commented Nov 2, 2012 at 19:22
What happens when you compile it with gcc -S (to get assembly output)? If you don't have debugging turned on in the compilation in the first place, your gdb results can be odd.
– KevinDTimm, Commented Nov 2, 2012 at 19:22
When I compile with gcc -S to get assembly output, I simply see the same results.
– csstudent2233, Commented Nov 2, 2012 at 19:30
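
The alignment explanation in the first comment can be checked with a little arithmetic. The sketch below is a simplified model, not GCC's actual layout algorithm: it assumes each object is placed at the highest suitably aligned rbp-relative offset below the previous one, and that rbp itself is at least 8-byte aligned. The helper names align_down and slot_below are made up for illustration.

```c
/* Round an rbp-relative offset down (toward more negative values)
 * to a multiple of `align`. */
long align_down(long offset, long align)
{
    long rem = offset % align;   /* negative or zero for negative offsets */
    if (rem < 0)
        rem += align;
    return offset - rem;
}

/* Offset of an object of `size` bytes placed directly below `upper`,
 * honouring its alignment requirement (hypothetical layout model). */
long slot_below(long upper, long size, long align)
{
    return align_down(upper - size, align);
}
```

With buffer starting at rbp-0x100, slot_below(-0x100, 4, 4) gives -0x104 for the 4-byte argc, and slot_below(-0x104, 8, 8) gives -0x110 for the 8-byte argv, which reproduces the 12-byte gap seen in the disassembly.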

Matthew Slattery · Accepted Answer · 2012-11-05 17:34:39Z

The x86-64 ABI used by Linux (and some other OSes, although notably not Windows, which has its own different ABI) defines a "red zone" of 128 bytes below the stack pointer, which is guaranteed not to be touched by signal or interrupt handlers. (See figure 3.3 and §3.2.2 of the ABI specification.)

A leaf function (i.e. one which does not call anything else) may therefore use this area for whatever it wants - it isn't doing anything like a call which would place data at the stack pointer; and any signal or interrupt handler will follow the ABI and drop the stack pointer by at least an additional 128 bytes before storing anything.

(Shorter instruction encodings are available for signed 8-bit displacements, so the point of the red zone is that it increases the amount of local data that a leaf function can access using these shorter instructions.)

That's what's happening here.
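
To make the arithmetic concrete for the disassembly in the question: after sub rsp,0x98 the stack pointer sits at rbp-0x98, and the red zone extends a further 128 (0x80) bytes, down to rbp-0x118. The lowest address the function touches is rbp-0x110, which is inside that range. A minimal sketch of the check follows; the function names are invented for illustration, and the model deliberately ignores everything except the frame size and the red zone.

```c
#include <stdbool.h>

/* Size of the red zone defined by the x86-64 ABI used on Linux. */
#define RED_ZONE_BYTES 128

/* After `push rbp; mov rbp,rsp; sub rsp,N`, rsp == rbp - N.
 * A leaf function may additionally use the red zone below rsp, so the
 * deepest byte it may touch is at rbp - (N + RED_ZONE_BYTES). */
long deepest_safe_offset(long sub_rsp_bytes, bool is_leaf)
{
    return sub_rsp_bytes + (is_leaf ? RED_ZONE_BYTES : 0);
}

/* True if an access at [rbp - offset_below_rbp] stays within the
 * allocated frame plus (for leaf functions) the red zone. */
bool access_is_safe(long offset_below_rbp, long sub_rsp_bytes, bool is_leaf)
{
    return offset_below_rbp <= deepest_safe_offset(sub_rsp_bytes, is_leaf);
}
```

Here access_is_safe(0x110, 0x98, true) holds, whereas the same frame in a non-leaf function (is_leaf false) would not cover the access at rbp-0x110.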

But... this code isn't making use of those shorter encodings (it's using offsets from rbp rather than rsp). Why not? It's also saving edi and rsi completely unnecessarily - you ask why it's saving edi instead of rdi, but why is it saving it at all?

The answer is that the compiler is generating really crummy code, because no optimisations are enabled. If you enable any optimisation, your entire function is likely to collapse down to:

    mov eax, 0
    ret

because that's really all it needs to do: buffer[] is local, so the changes made to it will never be visible to anything else, so can be optimised away; beyond that, all the function needs to do is return 0.
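
As an aside, one way to keep the stores from the question alive under optimisation is to make the array volatile, which obliges the compiler to perform each access. This is only a sketch of that alternative, not the approach used in the rest of this answer; the function name touch_buffer is hypothetical, and the code generated for it will vary between compiler versions and flags.

```c
/* Same stores as the question's main(), but through a volatile array,
 * so the compiler must actually perform them even when optimising. */
int touch_buffer(void)
{
    volatile char buffer[256];

    buffer[0] = 0x41;
    buffer[128] = 0x41;
    buffer[255] = 0x41;

    /* Read the values back so the result depends on the stores. */
    return buffer[0] + buffer[128] + buffer[255];
}
```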

So, here's a better example. This function is complete nonsense, but makes use of a similar array, whilst doing enough to ensure that things don't all get optimised away:

    $ cat test.c
    int foo(char *bar)
    {
        char tmp[256];
        int i;

        for (i = 0; bar[i] != 0; i++)
            tmp[i] = bar[i] + i;

        return tmp[1] + tmp[200];
    }

Compiled with some optimisation, you can see similar use of the red zone, except this time it really does use offsets from rsp:

    $ gcc -m64 -O1 -c test.c
    $ objdump -Mintel -d test.o

    test.o:     file format elf64-x86-64

    Disassembly of section .text:

    0000000000000000 <foo>:
       0:   53                      push   rbx
       1:   48 81 ec 88 00 00 00    sub    rsp,0x88
       8:   0f b6 17                movzx  edx,BYTE PTR [rdi]
       b:   84 d2                   test   dl,dl
       d:   74 26                   je     35 <foo+0x35>
       f:   4c 8d 44 24 88          lea    r8,[rsp-0x78]
      14:   48 8d 4f 01             lea    rcx,[rdi+0x1]
      18:   4c 89 c0                mov    rax,r8
      1b:   89 c3                   mov    ebx,eax
      1d:   44 28 c3                sub    bl,r8b
      20:   89 de                   mov    esi,ebx
      22:   01 f2                   add    edx,esi
      24:   88 10                   mov    BYTE PTR [rax],dl
      26:   0f b6 11                movzx  edx,BYTE PTR [rcx]
      29:   48 83 c0 01             add    rax,0x1
      2d:   48 83 c1 01             add    rcx,0x1
      31:   84 d2                   test   dl,dl
      33:   75 e6                   jne    1b <foo+0x1b>
      35:   0f be 54 24 50          movsx  edx,BYTE PTR [rsp+0x50]
      3a:   0f be 44 24 89          movsx  eax,BYTE PTR [rsp-0x77]
      3f:   8d 04 02                lea    eax,[rdx+rax*1]
      42:   48 81 c4 88 00 00 00    add    rsp,0x88
      49:   5b                      pop    rbx
      4a:   c3                      ret

Now let's tweak it very slightly, by inserting a call to another function, so that foo() is no longer a leaf function:

    $ cat test.c
    extern void dummy(void);  /* ADDED */

    int foo(char *bar)
    {
        char tmp[256];
        int i;

        for (i = 0; bar[i] != 0; i++)
            tmp[i] = bar[i] + i;

        dummy();  /* ADDED */

        return tmp[1] + tmp[200];
    }

Now the red zone cannot be used, so you see something more like you originally expected:

    $ gcc -m64 -O1 -c test.c
    $ objdump -Mintel -d test.o

    test.o:     file format elf64-x86-64

    Disassembly of section .text:

    0000000000000000 <foo>:
       0:   53                      push   rbx
       1:   48 81 ec 00 01 00 00    sub    rsp,0x100
       8:   0f b6 17                movzx  edx,BYTE PTR [rdi]
       b:   84 d2                   test   dl,dl
       d:   74 24                   je     33 <foo+0x33>
       f:   49 89 e0                mov    r8,rsp
      12:   48 8d 4f 01             lea    rcx,[rdi+0x1]
      16:   48 89 e0                mov    rax,rsp
      19:   89 c3                   mov    ebx,eax
      1b:   44 28 c3                sub    bl,r8b
      1e:   89 de                   mov    esi,ebx
      20:   01 f2                   add    edx,esi
      22:   88 10                   mov    BYTE PTR [rax],dl
      24:   0f b6 11                movzx  edx,BYTE PTR [rcx]
      27:   48 83 c0 01             add    rax,0x1
      2b:   48 83 c1 01             add    rcx,0x1
      2f:   84 d2                   test   dl,dl
      31:   75 e6                   jne    19 <foo+0x19>
      33:   e8 00 00 00 00          call   38 <foo+0x38>
      38:   0f be 94 24 c8 00 00    movsx  edx,BYTE PTR [rsp+0xc8]
      3f:   00
      40:   0f be 44 24 01          movsx  eax,BYTE PTR [rsp+0x1]
      45:   8d 04 02                lea    eax,[rdx+rax*1]
      48:   48 81 c4 00 01 00 00    add    rsp,0x100
      4f:   5b                      pop    rbx
      50:   c3                      ret

(Note that tmp[200] was in range of a signed 8-bit displacement in the first case, but is not in this one.)
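
The displacements in the two listings can be reproduced by hand. In the leaf version tmp starts at rsp-0x78 (the lea r8,[rsp-0x78] at offset f), so 120 bytes of the array live in the red zone and the remaining 136 (0x88) bytes sit above rsp; in the non-leaf version tmp starts at rsp itself. A small sketch, with hypothetical helper names, that computes where an element lands and whether it fits a signed 8-bit displacement:

```c
#include <stdbool.h>

/* A signed 8-bit displacement covers -128..127 inclusive. */
bool fits_disp8(long disp)
{
    return disp >= -128 && disp <= 127;
}

/* rsp-relative displacement of tmp[index], given the rsp-relative
 * displacement of tmp[0] (char elements, so one byte each). */
long element_disp(long tmp_base_disp, long index)
{
    return tmp_base_disp + index;
}
```

For the leaf layout, element_disp(-0x78, 200) is 0x50 and element_disp(-0x78, 1) is -0x77, both within range. For the non-leaf layout, element_disp(0, 200) is 0xc8, which is out of range, and that is why the movsx at offset 38 in the second listing is three bytes longer than its counterpart in the first.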

Excellent illustrative answer. Particularly the observation that with optimization, it should simply set eax=0 and return.
Maybe it stores edi instead of rdi because argc is an int, not a long. Neither is necessary, however, as you say.
