Difficulty understanding how compilers and assembly language piece together

Question

This is more of a conceptual question, but I am learning about embedded systems for an upcoming project. I have been looking through the tutorial on Tutorialspoint:

https://www.tutorialspoint.com/embedded_systems/es_tools.htm

This webpage talks about compilers, assemblers, and coupling.

BASICALLY: How does the assembly process work with compilers, if at all? Where and how can I piece this information together? What am I not getting?

Here is one piece of the puzzle: en.wikipedia.org/wiki/Three-address_code
– samgak, Commented Jul 17, 2019 at 5:32
Compilers implement the logic of the high-level source code using assembly language for whatever the target machine is. See, e.g., How to remove "noise" from GCC/clang assembly output? for more about looking at the asm output of compilers; Matt Godbolt's CppCon talk in particular is a good intro (YouTube link in my answer there).
– Peter Cordes, Commented Jul 17, 2019 at 5:54
As your link says: Compilers translate the source code from a high-level programming language to a low-level language (e.g., assembly language or machine code). And that's kinda the key. If the compiler translates from (say) C to assembly language, you're still going to need an assembler to translate that to machine code (which is what the hw actually needs). However (as the docs say), some compilers translate C directly to machine code, no assembly required!
– David Wohlferd, Commented Jul 17, 2019 at 6:21
@DavidWohlferd if compilers can translate to either assembly language or machine code, why don't they translate strictly to machine code? Is it because it's faster to translate from, for example, C to assembly language to machine code?
– Arnab Das, Commented Jul 17, 2019 at 17:32
Just because a compiler can output machine code doesn't mean that the people who wrote the compiler will choose to do it that way. Why not? I doubt that there's "one" good reason why some compiler writers do it one way and others don't. Being able to inspect the 'intermediate' stages might be seen as a benefit. Or being able to swap assemblers. De-coupling might ease debugging. As you mentioned, speed can be a consideration. Not having to write an assembler might be (a little) easier. It might even just be "we've always done it that way, no one remembers why."
– David Wohlferd, Commented Jul 17, 2019 at 21:47

halfer · Accepted Answer · 2021-08-22 07:15:38Z

Try it yourself using the GNU tools:

    #define FIVE 5
    extern unsigned int more_fun ( unsigned int );
    unsigned int fun ( void )
    {
        return(more_fun(FIVE)+1);
    }

If you save the temporary files (gcc's -save-temps option), you can see each stage. gcc first needs to preprocess the source to pull in includes and replace defines/macros:

    # 1 "so.c"
    # 1 "<built-in>"
    # 1 "<command-line>"
    # 1 "/usr/include/stdc-predef.h" 1 3 4
    # 1 "<command-line>" 2
    # 1 "so.c"

    extern unsigned int more_fun ( unsigned int );
    unsigned int fun ( void )
    {
        return(more_fun(5)+1);
    }
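The substitution shown above is purely textual and happens before the compiler proper sees the code. Here is a minimal sketch in C that makes this visible from inside a program. STR, XSTR, and the two function names are hypothetical helpers for illustration; they are not part of the original example.

```c
#define FIVE 5

/* Hypothetical helper macros, not in the original example. */
#define STR(x)  #x       /* stringify the argument exactly as written */
#define XSTR(x) STR(x)   /* expand macros in the argument first, then stringify */

/* The # operator is applied before FIVE is expanded, so this is "FIVE". */
const char *name_as_written(void)
{
    return STR(FIVE);
}

/* The extra level of indirection lets the preprocessor replace FIVE
   with 5 first, so this is "5". */
const char *name_after_expansion(void)
{
    return XSTR(FIVE);
}
```

Either way, the compiler proper only ever receives string literals and the number 5; the name FIVE exists only while the preprocessor is running.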

That gets fed to the actual compiler. The gcc program itself is not the compiler; it is a program that calls other programs. The compiler's output is assembly language:

        .arch armv5t
        .fpu softvfp
        .eabi_attribute 20, 1
        .eabi_attribute 21, 1
        .eabi_attribute 23, 3
        .eabi_attribute 24, 1
        .eabi_attribute 25, 1
        .eabi_attribute 26, 2
        .eabi_attribute 30, 2
        .eabi_attribute 34, 0
        .eabi_attribute 18, 4
        .file "so.c"
        .text
        .align 2
        .global fun
        .syntax unified
        .arm
        .type fun, %function
    fun:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        push {r4, lr}
        mov r0, #5
        bl more_fun
        add r0, r0, #1
        pop {r4, pc}
        .size fun, .-fun
        .ident "GCC: (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609"
        .section .note.GNU-stack,"",%progbits

gcc then calls the assembler to assemble that into an object file, which contains as much of the machine code as the assembler can resolve, plus ideally other information for debugging and linking. Using a disassembler we can see the code produced by the assembler:

    Disassembly of section .text:

    00000000 <fun>:
       0: e92d4010  push {r4, lr}
       4: e3a00005  mov r0, #5
       8: ebfffffe  bl 0 <more_fun>
       c: e2800001  add r0, r0, #1
      10: e8bd8010  pop {r4, pc}

The bl 0 in the middle, the call to the more_fun function, was not resolved because that code was not part of the original C source file. A placeholder is put there, and the linker will come along later and link the objects together. If you don't specify -c, then gcc will also call the linker for you.
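To make the placeholder idea concrete, here is a rough sketch in C of what "resolving" that bl amounts to. It is a simplified illustration of the arithmetic, not how GNU ld is actually implemented. The function names and any addresses you pass in are hypothetical. It relies on two facts about the ARM (A32) encoding: the low 24 bits of a bl hold a signed offset counted in 4-byte words, and the offset is relative to the instruction's address plus 8.

```c
#include <assert.h>
#include <stdint.h>

/* Decode the signed 24-bit word offset of an A32 B/BL instruction
   and return it as a byte offset. */
int32_t bl_offset_bytes(uint32_t insn)
{
    int32_t imm24 = (int32_t)(insn & 0x00ffffffu);
    if (imm24 & 0x00800000) {
        imm24 -= 0x01000000;            /* sign-extend from 24 bits */
    }
    return imm24 * 4;                   /* words to bytes */
}

/* Rewrite the offset field of a BL located at insn_addr so that it
   calls target. In A32 state the PC reads as insn_addr + 8. */
uint32_t patch_bl(uint32_t insn, uint32_t insn_addr, uint32_t target)
{
    int32_t offset = (int32_t)(target - (insn_addr + 8u));

    assert((offset & 3) == 0);                           /* word aligned */
    assert(offset >= -(1 << 25) && offset < (1 << 25));  /* fits in imm24 */

    uint32_t imm24 = ((uint32_t)offset >> 2) & 0x00ffffffu;
    return (insn & 0xff000000u) | imm24;  /* keep condition and opcode bits */
}
```

For the placeholder ebfffffe from the disassembly, bl_offset_bytes returns -8, which cancels the "+8" and is why the unlinked instruction carries no real destination yet. If we assume, purely for illustration, that the linker places the bl at 0x8008 and more_fun at 0x8020, then patch_bl(0xebfffffe, 0x8008, 0x8020) yields 0xeb000004.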

Most "toolchains" work this way; it's the sane way to do it. For just-in-time compilation and "why do you climb mountains? Because they are there" reasons, there are some compilers that go more directly to machine code, but even LLVM doesn't do that, and it claims to be a JIT, although its primary use is otherwise. A toolchain doesn't have to use separate executables; there are various ways to solve the problem.

I don't remember if the site you linked is on the list of sites you should avoid at all costs; there are one or more like it that have some very bad information that is confusing and wrong. That page was neither bad nor confusing, but I only skimmed it.

Decompilers don't really exist in the form folks would like. As you can see in this simple example, compiling loses information from the original code (the name FIVE, for instance, never reaches the assembly), so you can't completely recreate the source from the binary. It is pretty easy to make similar simple examples that demonstrate this.
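One such example, as a sketch in C: three differently written versions of fun that all ask for the same thing. The names fun_macro, fun_literal, fun_enum, and CINCO are hypothetical, and the body of more_fun is a stand-in invented only so the sketch links and runs. I would expect an optimizing compiler to emit essentially the same instructions for all three, but that is an expectation, not something verified for every compiler and option set.

```c
/* Stand-in for the external function from the example. Its body is
   hypothetical; the original only declares it. */
unsigned int more_fun(unsigned int x)
{
    return x * 2u;
}

#define FIVE 5
enum { CINCO = 5 };

/* Version 1: the original, using the macro. */
unsigned int fun_macro(void)
{
    return more_fun(FIVE) + 1;
}

/* Version 2: a plain literal. */
unsigned int fun_literal(void)
{
    return more_fun(5) + 1;
}

/* Version 3: an enum constant and a named local variable. */
unsigned int fun_enum(void)
{
    unsigned int arg = CINCO;
    return more_fun(arg) + 1u;
}
```

Given only the machine code, a decompiler has no way to tell which of these the programmer wrote: the macro name, the enum, and the local variable name are all gone.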

IIRC Tutorialspoint's NASM topic is kinda toxic, so this is probably the site you remember as "avoid". But the particular tutorials are done by different authors, so maybe the one used by the OP is OK. ... It's the Internet, everybody should be paranoid....
