From source to execution

When you write code, what turns the source into an executable binary? The first step is the compiler, a piece of software that takes source code written in one programming language and translates it into another, typically assembly. Assembly is specific to a processor's instruction set; the most common instruction sets are x86/x64 and ARM.

Once we have assembly, an assembler converts it into binary object code.

The final step involves the linker. The linker takes these object files and resolves the symbols they reference, such as function and variable names, so that the result can be executed.

Inside the compiler

Within a compiler, the responsibilities are split into two (or more) sections:

frontend:

  • lexing: the process of scanning the input text and converting it into tokens. Tokens are the smallest units of meaning, such as keywords (think "for", "class", "return", etc.), identifiers and operators.
  • parsing: builds the tokens into a representation that can be analysed, typically an abstract syntax tree (AST). This is a hierarchical structure of the source that makes it easy to interpret.
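The two frontend stages can be sketched for a tiny arithmetic language. This is a minimal illustration, not how any particular compiler is written: the token categories, the nested-tuple AST shape and the names `lex` and `parse` are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical token categories for a tiny expression language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/()]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


@dataclass
class Token:
    kind: str
    text: str


def lex(source):
    """Scan the input text and return a list of tokens (whitespace dropped)."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if match.lastgroup != "SKIP":
            tokens.append(Token(match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parse(tokens):
    """Build an AST (nested tuples) from tokens using recursive descent."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():  # expr := term (('+' | '-') term)*
        node = term()
        while peek() and peek().text in ("+", "-"):
            op = advance().text
            node = (op, node, term())
        return node

    def term():  # term := atom (('*' | '/') atom)*
        node = atom()
        while peek() and peek().text in ("*", "/"):
            op = advance().text
            node = (op, node, atom())
        return node

    def atom():  # atom := NUMBER | IDENT | '(' expr ')'
        tok = peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        advance()
        if tok.kind == "NUMBER":
            return ("num", int(tok.text))
        if tok.kind == "IDENT":
            return ("var", tok.text)
        if tok.text == "(":
            node = expr()
            if peek() is None or advance().text != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {tok.text!r}")

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek().text!r}")
    return tree
```

Calling `parse(lex("1 + 2 * x"))` yields a tree in which the multiplication sits below the addition, which is how the hierarchy of the AST encodes operator precedence.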

backend:

  • intermediate representation: before generating machine code, many compilers transform the AST into an intermediate representation (IR). This is a lower-level representation of the program that is still high-level enough to be independent of the underlying hardware.
  • code generation: takes the IR and outputs machine code or assembly language specific to the target CPU.
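The two backend stages can be sketched in the same spirit. The example assumes an AST given as nested tuples, such as `("+", ("num", 1), ("var", "x"))`, lowers it to a simple three-address IR, and then emits assembly-like text. The instruction names (`LOAD`, `ADD`, `STORE` and so on) belong to an imaginary target invented for illustration, not a real instruction set, and `lower_to_ir` and `generate_code` are hypothetical names.

```python
import itertools


def lower_to_ir(ast):
    """Flatten a nested-tuple AST into three-address instructions.

    Returns (instructions, result_name). Each instruction is a tuple
    (dest, op, arg1, arg2) and says nothing about the target hardware.
    """
    instructions = []
    counter = itertools.count()

    def visit(node):
        kind = node[0]
        if kind == "num":
            return str(node[1])
        if kind == "var":
            return node[1]
        op, left, right = node
        lhs = visit(left)
        rhs = visit(right)
        dest = f"t{next(counter)}"  # fresh temporary for this result
        instructions.append((dest, op, lhs, rhs))
        return dest

    result = visit(ast)
    return instructions, result


# Hypothetical mnemonics for an imaginary accumulator-style target.
MNEMONICS = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}


def generate_code(instructions):
    """Translate the IR into assembly-like text for the imaginary target."""
    lines = []
    for dest, op, lhs, rhs in instructions:
        lines.append(f"LOAD {lhs}")
        lines.append(f"{MNEMONICS[op]} {rhs}")
        lines.append(f"STORE {dest}")
    return lines
```

Only `generate_code` knows about the target here, so supporting a second CPU would mean writing another code generator over the same IR. That separation is the main reason for having an IR in the first place.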

There can be more stages, for example to handle a preprocessor or to apply optimisations.

Linker

The linker combines multiple object files produced by the compiler into a single executable. It resolves references (to variables and functions) between different object files and binds them to memory addresses.
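Symbol resolution can be sketched with a toy model. Real object formats such as ELF are far richer; the dictionary fields (`name`, `size`, `defines`, `refs`) and the function `link` below are assumptions made up for this example.

```python
def link(object_files):
    """Combine toy object files into one image and resolve symbol references.

    Each object file is a dict with hypothetical fields:
      "name":    file name, used in error messages
      "size":    number of bytes of code/data it contributes
      "defines": {symbol: offset within this file}
      "refs":    [symbols this file uses]
    Returns {symbol: final address} for every defined symbol.
    """
    symbol_table = {}
    base = 0
    # Pass 1: lay the files out one after another and record where each
    # defined symbol ends up in the combined image.
    for obj in object_files:
        for symbol, offset in obj["defines"].items():
            if symbol in symbol_table:
                raise ValueError(f"duplicate symbol {symbol!r} in {obj['name']}")
            symbol_table[symbol] = base + offset
        base += obj["size"]
    # Pass 2: every reference must now match a known address.
    for obj in object_files:
        for symbol in obj["refs"]:
            if symbol not in symbol_table:
                raise ValueError(f"undefined reference to {symbol!r} in {obj['name']}")
    return symbol_table


main_o = {"name": "main.o", "size": 32, "defines": {"main": 0}, "refs": ["helper"]}
util_o = {"name": "util.o", "size": 16, "defines": {"helper": 4}, "refs": []}
addresses = link([main_o, util_o])
```

Here `helper` is defined 4 bytes into `util.o`, which is placed after the 32 bytes of `main.o`, so its final address is 36. Leaving `util.o` out of the list raises an "undefined reference" error, the same kind of failure a real linker reports when a symbol is missing.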

It also handles static and dynamic libraries. Static libraries are baked into the executable at link time. This can result in faster startup and execution, but also larger binaries, since common functionality cannot be shared across multiple executables (each program carries its own copy of the library code).

Dynamic (shared) libraries are not included directly in the binary; they are loaded at runtime instead. This allows more reuse (e.g. libc is shared by many programs) and the ability to change behaviour at runtime, among other things.
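Loading a shared library at runtime can be seen from Python using the standard `ctypes` module. This is a small sketch that assumes a POSIX-like system (Linux or macOS) where the C library can be found; the wrapper name `c_abs` is made up for the example.

```python
import ctypes
import ctypes.util

# Locate the C standard library. find_library may return None on some
# systems; on POSIX, CDLL(None) then falls back to the symbols already
# loaded into the current process, which include libc.
libc_path = ctypes.util.find_library("c")
libc = ctypes.CDLL(libc_path)

# Declare the C signature: int abs(int)
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int


def c_abs(value):
    """Call libc's abs() through the dynamically loaded library."""
    return libc.abs(value)
```

None of libc's code is copied into the Python program. The library is opened while the program runs, and the function is looked up by name.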