Skip to content

0x332 Compiler

1. Foundation

  • compiler
  • JIT compiler
  • Interpreter

2. Lexical Analysis

3. Syntax Analysis

Syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar.

3.1. Top-down parsing

LL parser

3.2. Bottom-up parsing


Operator-precedence parser (Shunting-yard)

4. Semantic Analysis

5. Intermediate Representation


The following code snippet is from this blog, illustrating how LLVM IR looks like

unsigned add1(unsigned a, unsigned b) {
    return a+b;

    // Perhaps not the most efficient way to add two numbers.
unsigned add2(unsigned a, unsigned b) {
    if (a == 0) return b;
    return add2(a-1, b+1);

Those C codes will be translated to the following IRs

``` define i32 @add1(i32 %a, i32 %b) { entry: %tmp1 = add i32 %a, %b ret i32 %tmp1 }

define i32 @add2(i32 %a, i32 %b) { entry: %tmp1 = icmp eq i32 %a, 0 br i1 %tmp1, label %done, label %recurse

%tmp2 = sub i32 %a, 1
%tmp3 = add i32 %b, 1
%tmp4 = call i32 @add2(i32 %tmp2, i32 %tmp3)
ret i32 %tmp4

ret i32 %b



Compiler optimization is generally implemented using a sequence of optimizing transformations, algorithms which take a program and transform it to produce a semantically equivalent output program that uses fewer resources or executes faster.

Loop optimization

Dataflow optimization

Dataflow is represetned with Control Flow Graph (also abbreviated as CFG), it consists of the basic block, which is a block of code that can can only branch to the beginning of other blocks at the end of this block. This help analysis of data flow.

Some simple optimizations are:

  • constant folding and propagation: replacing expressions consisting with constants with their final value (e.g. 3+5 -> 8)

5. Security

There are several compilers features to mitigate security risks

Stack smashing protector To mitigate risk of stack buffer overflow. (only mitigate, can be exploited by overwriting canary by the correct value, maybe overwriting the canary's global reference ?) insert canary variable into higher stack address before running risk function. Check whether the canary (expected value is on heap) has been modified after running risk.

/* Note how buffer overruns are undefined behavior and the compilers tend to
   optimize these checks away if you wrote them yourself, this only works
   robustly because the compiler did it itself. */
extern uintptr_t __stack_chk_guard;
noreturn void __stack_chk_fail(void);
void foo(const char* str)
    uintptr_t canary = __stack_chk_guard;  // automatically inserted by compiler
    char buffer[16];
    strcpy(buffer, str);
    if ( (canary = canary ^ __stack_chk_guard) != 0 ) // automatically inserted by compiler

6. Debugger

6.1. gdb

i r: print variables x/(number) (memory or register): print data around memory

6.2. lldb

Main Reference is here

6.2.1. frame instructions

  • bt: print stack frames
  • fr v (-L): print variables in a frame (with its location)
  • fr s frame_number: switch to different frames

6.2.2. memory instructions

  • mem read (memory or register)+(offset): print data around memory

6.2.3. evaluation

  • eval (expression)

7. IDE

There is a concept called Language Server Protocol, which splits the language-related parsing features into a service provided from a remote server, the client is usually an editor which interacts with server using this json-based protocol.

8. Reference

[1] lldb cheat sheet [2] lldb documentation [3] 9cc compiler and its explanation (Japanese)