Programs written in high-level languages like C must be translated into machine-readable instructions before they can execute on a computer.
The GCC compiler driver performs this translation through a four-phase compilation system that progressively transforms source code (C code) into an executable binary file (machine code).

Preprocessing Phase:
C file → preprocessed C file
The preprocessor (cpp) modifies the original C source code by processing directives that begin with '#', such as #include and #define. It:
- removes all comments
- expands macros to their defined values
- includes header files directly into the code (#include <stdio.h> is replaced with the entire contents of stdio.h)
output: a modified C program with a .i extension
Compiler Phase:
preprocessed C file → assembly language file
The compiler (cc1) translates the preprocessed .i file into assembly language.
assembly language: a common output language that compilers for different high-level languages (C, Fortran, etc.) can target. Most lines in the assembly file describe one low-level machine instruction in textual form, such as movl or call; the rest are labels and assembler directives.
output: a .s file in assembly language
Assembly Phase:
assembly language file → binary file
The assembler (as) converts the assembly language instructions in the .s file into machine code, the binary instructions the processor executes.
output: a relocatable object file (.o extension) containing binary-encoded instructions
Linker Phase:
binary file + other object files (printf.o) → executable
If our program calls printf, that function is not compiled from our own source. It lives in a separate precompiled object file called printf.o, part of the standard C library, which must be merged with our hello.o. The linker (ld) handles this merging. The result is an executable object file (or simply executable) called hello, ready to be loaded into memory and executed by the system.
Compiler (in detail):
How is the C source text converted into assembly code? The compiler works through these steps:
- Lexical analysis (Lexical = vocabulary)
- Syntax analysis (Syntax = format/way)
- Semantic analysis (Semantic = meaning)
- Intermediate code generation
- Optimization
- Code generation

- Lexical analysis
The compiler reads your source code and converts it into tokens. A token is a single meaningful element that can't be subdivided further, like a word in a sentence.

The lexical analyser groups characters into tokens based on the language's lexical rules. It discards whitespace, newlines, and comments, since later phases don't need them (in the GCC pipeline above, the preprocessor has already stripped the comments).
- Syntax analysis
Syntax analysis (also called parsing) checks whether the code follows the grammar rules of the programming language, using the tokens generated by lexical analysis.

**Abstract Syntax Tree (AST)**: the tokens are used to build a tree structure called an Abstract Syntax Tree (AST), a condensed form of the parse tree that shows the logical structure of the code.
- Semantic analysis
In the semantic phase the compiler uses the abstract syntax tree to perform three checks:
type checking - ensures operations use compatible data types. Example error: int x = "hello";
flow checking - checks that control flow is valid. Example: a break outside a loop or switch.
label checking - checks that labels are used correctly. Example: a goto that jumps to a label that was never defined. (Reserved words such as if or else used as variable names are already rejected earlier, by the lexer and parser.)
Why is this step needed if we already checked grammar (syntax analysis)? Some errors can't be caught by checking grammar alone; you need to understand what the code means. 😊
- Intermediate code generation
Converts the source code into a simplified, platform-independent format called Intermediate Representation (IR). It acts like a universal translator.
It is generated to give a common ground between frontends (language code) and backends (machine code):
- Frontends: “language → IR” (C → IR, Rust → IR, etc.)
- Backends: “IR → machine code” (IR → x86, IR → ARM, IR → RISC‑V)
Fun fact: a similar idea is how Java programs run on many systems. The Java compiler outputs intermediate code (bytecode), and users run it on their machine's JVM.
- Optimization
Takes the IR and tries to make it faster/smaller, without changing behavior:
- Remove dead code (code that can never run or whose result is never used).
- Fold constants (2 + 2 → 4).
- Eliminate repeated calculations.
- Improve loops, reuse registers, etc.
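- Code generation
Takes the optimised IR and converts it into target code: either assembly code (human-readable but CPU-specific), with instructions like MOV, ADD, JMP, CALL, or directly into binary machine code. In the GCC pipeline above, cc1 emits assembly and leaves the binary encoding to the assembler.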
PRACTICAL EXAMPLE
