Skip to content

0x332 Compiler

1. Foundation

  • compiler
  • JIT compiler
  • Interpreter

2. Lexical Analysis

3. Syntax Analysis

Syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar.

3.1. Top-down parsing

LL parser

3.2. Bottom-up parsing

Shift-Reduce

Operator-precedence parser (Shunting-yard)

4. Semantic Analysis

5. Intermediate Representation

LLVM

The following code snippet is from this blog, illustrating how LLVM IR looks like

unsigned add1(unsigned a, unsigned b) {
    return a+b;
}

    // Perhaps not the most efficient way to add two numbers.
unsigned add2(unsigned a, unsigned b) {
    if (a == 0) return b;
    return add2(a-1, b+1);
}

Those C codes will be translated to the following IRs

``` define i32 @add1(i32 %a, i32 %b) { entry: %tmp1 = add i32 %a, %b ret i32 %tmp1 }

define i32 @add2(i32 %a, i32 %b) { entry: %tmp1 = icmp eq i32 %a, 0 br i1 %tmp1, label %done, label %recurse

recurse:
%tmp2 = sub i32 %a, 1
%tmp3 = add i32 %b, 1
%tmp4 = call i32 @add2(i32 %tmp2, i32 %tmp3)
ret i32 %tmp4

done:
ret i32 %b

}

6. Optimization

Compiler optimization is generally implemented using a sequence of optimizing transformations, algorithms which take a program and transform it to produce a semantically equivalent output program that uses fewer resources or executes faster.

6.1. Loop optimization

6.2. Dataflow optimization

Dataflow is represetned with Control Flow Graph (also abbreviated as CFG), it consists of the basic block, which is a block of code that can can only branch to the beginning of other blocks at the end of this block. This help analysis of data flow.

Some simple optimizations are:

  • constant folding and propagation: replacing expressions consisting with constants with their final value (e.g. 3+5 -> 8)

7. Security

There are several compilers features to mitigate security risks

Stack smashing protector To mitigate risk of stack buffer overflow. (only mitigate, can be exploited by overwriting canary by the correct value, maybe overwriting the canary's global reference ?) insert canary variable into higher stack address before running risk function. Check whether the canary (expected value is on heap) has been modified after running risk.

/* Note how buffer overruns are undefined behavior and the compilers tend to
   optimize these checks away if you wrote them yourself, this only works
   robustly because the compiler did it itself. */
extern uintptr_t __stack_chk_guard;
noreturn void __stack_chk_fail(void);
void foo(const char* str)
{
    uintptr_t canary = __stack_chk_guard;  // automatically inserted by compiler
    char buffer[16];
    strcpy(buffer, str);
    if ( (canary = canary ^ __stack_chk_guard) != 0 ) // automatically inserted by compiler
        __stack_chk_fail();
}

8. Debugger

8.1. gdb

i r: print variables x/(number) (memory or register): print data around memory

8.2. lldb

Main Reference is here

8.2.1. frame instructions

  • bt: print stack frames
  • fr v (-L): print variables in a frame (with its location)
  • fr s frame_number: switch to different frames

8.2.2. memory instructions

  • mem read (memory or register)+(offset): print data around memory

8.2.3. evaluation

  • eval (expression)

9. JIT

check this tutorial to get the high-level concept of how it works.

The following snippe is to implement an identity function from the previous tutorial, it writes asm corresponding to the identity function into some memory with some flag and then execute this memory.

// jitproto.c
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

typedef long(*fn)(long);

fn compile_identity(void) {
  // Allocate some memory and set its permissions correctly. In particular, we
  // need PROT_EXEC (which isn't normally enabled for data memory, e.g. from
  // malloc()), which tells the processor it's ok to execute it as machine
  // code.
  char *memory = mmap(NULL,             // address
                      4096,             // size
                      PROT_READ | PROT_WRITE | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS,
                      -1,               // fd (not used here)
                      0);               // offset (not used here)
  if (memory == MAP_FAILED) {
    perror("failed to allocate memory");
    exit(1);
  }

  int i = 0;

  // mov %rdi, %rax
  memory[i++] = 0x48;           // REX.W prefix
  memory[i++] = 0x8b;           // MOV opcode, register/register
  memory[i++] = 0xc7;           // MOD/RM byte for %rdi -> %rax

  // ret
  memory[i++] = 0xc3;           // RET opcode

  return (fn) memory;
}

int main() {
  fn f = compile_identity();
  int i;
  for (i = 0; i < 10; ++i)
    printf("f(%d) = %ld\n", i, (*f)(i));
  munmap(f, 4096);
  return 0;
}

10. IDE

There is a concept called Language Server Protocol, which splits the language-related parsing features into a service provided from a remote server, the client is usually an editor which interacts with server using this json-based protocol.

11. Reference

[1] lldb cheat sheet [2] lldb documentation [3] 9cc compiler and its explanation (Japanese)