0x332 Compiler

1. Frontend
2. LLVM
- 2.1. MLIR
- 2.2. LLVM IR
3. Optimization (Passes)
- 3.1. Loop optimization
- 3.2. Dataflow optimization
4. Backend
- 4.1. Code Generation
- JIT
5. Security
6. JIT
7. IDE
8. Reference

1. Frontend

1.1. Lexical Analysis

Most of lexical grammars can be handled by regular languages. There are exceptions though, for example, Python's lexical requires indents counts of previous line, which requires non-regular context

1.2. Syntax Analysis

Syntactic analysis is the process of analyzing a string of symbols (or tokens), either in natural language, computer languages or data structures, conforming to the rules of a formal grammar.

1.2.1. Top-down parsing

recursive descent parsing is a top-down parsing which is constructed from top and from left to right

1.2.2. Bottom-up parsing

Shift-Reduce

Operator-precedence parser (Shunting-yard)

1.3. Semantic Analysis

types, scopes, variable binding...

2. LLVM

Command

# compile into llvm IR text format
clang -S -emit-llvm hello.c -o hello.ll

2.1. MLIR

2.2. LLVM IR

Check the Language Reference and this Youtube

LLVM tutorial

The following code snippet is from this blog, illustrating how LLVM IR looks like

unsigned add1(unsigned a, unsigned b) {
    return a+b;
}

    // Perhaps not the most efficient way to add two numbers.
unsigned add2(unsigned a, unsigned b) {
    if (a == 0) return b;
    return add2(a-1, b+1);
}

Those C codes will be translated to the following IRs

define i32 @add1(i32 %a, i32 %b) {
    entry:
    %tmp1 = add i32 %a, %b
    ret i32 %tmp1
}

define i32 @add2(i32 %a, i32 %b) {
    entry:
    %tmp1 = icmp eq i32 %a, 0
    br i1 %tmp1, label %done, label %recurse

    recurse:
    %tmp2 = sub i32 %a, 1
    %tmp3 = add i32 %b, 1
    %tmp4 = call i32 @add2(i32 %tmp2, i32 %tmp3)
    ret i32 %tmp4

    done:
    ret i32 %b
}

2.2.1. Structures

Global variables are pointers whose names get prefixed with "@" initialized with global or constant (constant pointers)

Basic Block is a list of non-terminator instructions ending with a terminator instruction

2.2.2. Type Systems

2.2.3. Instructions

LLVM Instruction

Terminator instructions: every basic block in a program should end with a terminator instruction.

ret: return
br: Branching transfer to another basic blocks
switch

Binary Operations

add
fadd
mul
fmul

Memory Operations

alloca: stack allocation of a type and get automatically released
load
store

3. Optimization (Passes)

Compiler optimization is generally implemented using a sequence of optimizing transformations, algorithms which take a program and transform it to produce a semantically equivalent output program that uses fewer resources or executes faster.

3.1. Loop optimization

3.2. Dataflow optimization

Dataflow is represetned with Control Flow Graph (also abbreviated as CFG), it consists of the basic block, which is a block of code that can can only branch to the beginning of other blocks at the end of this block. This help analysis of data flow.

Some simple optimizations are:

constant folding and propagation: replacing expressions consisting with constants with their final value (e.g. 3+5 -> 8)
dead code elimination: remove code that does not affect results

4. Backend

See this turorial for developing a llvm backend

4.1. Code Generation

JIT

Watch this video for LLVM JIT API (MCJIT vs ORC JIT)

Both are using the same underlying architecture:

static compiler
JIT linker (RuntimeDyld)

MCJIT implements ExecutionEngine API which is restricted in many scenarios, ORC instead is better by

offer a superset of feature
has more flexible API
supports remoteness and laziness

5. Security

There are several compilers features to mitigate security risks

Stack smashing protector To mitigate risk of stack buffer overflow. (only mitigate, can be exploited by overwriting canary by the correct value, maybe overwriting the canary's global reference ?) insert canary variable into higher stack address before running risk function. Check whether the canary (expected value is on heap) has been modified after running risk.

/* Note how buffer overruns are undefined behavior and the compilers tend to
   optimize these checks away if you wrote them yourself, this only works
   robustly because the compiler did it itself. */
extern uintptr_t __stack_chk_guard;
noreturn void __stack_chk_fail(void);
void foo(const char* str)
{
    uintptr_t canary = __stack_chk_guard;  // automatically inserted by compiler
    char buffer[16];
    strcpy(buffer, str);
    if ( (canary = canary ^ __stack_chk_guard) != 0 ) // automatically inserted by compiler
        __stack_chk_fail();
}

6. JIT

check this tutorial to get the high-level concept of how it works.

The following snippe is to implement an identity function from the previous tutorial, it writes asm corresponding to the identity function into some memory with some flag and then execute this memory.

// jitproto.c
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

typedef long(*fn)(long);

fn compile_identity(void) {
  // Allocate some memory and set its permissions correctly. In particular, we
  // need PROT_EXEC (which isn't normally enabled for data memory, e.g. from
  // malloc()), which tells the processor it's ok to execute it as machine
  // code.
  char *memory = mmap(NULL,             // address
                      4096,             // size
                      PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS,
                      -1,               // fd (not used here)
                      0);               // offset (not used here)
  if (memory == MAP_FAILED) {
    perror("failed to allocate memory");
    exit(1);
  }

  int i = 0;

  // mov %rdi, %rax
  memory[i++] = 0x48;           // REX.W prefix
  memory[i++] = 0x8b;           // MOV opcode, register/register
  memory[i++] = 0xc7;           // MOD/RM byte for %rdi -> %rax

  // ret
  memory[i++] = 0xc3;           // RET opcode

  return (fn) memory;
}

int main() {
  fn f = compile_identity();
  int i;
  for (i = 0; i < 10; ++i)
    printf("f(%d) = %ld\n", i, (*f)(i));
  munmap(f, 4096);
  return 0;
}

7. IDE

There is a concept called Language Server Protocol, which splits the language-related parsing features into a service provided from a remote server, the client is usually an editor which interacts with server using this json-based protocol.

8. Reference

[1] lldb cheat sheet
[2] lldb documentation
[3] 9cc compiler and its explanation (Japanese)