0x332 Compiler
- 1. Foundation
- 2. Lexical Analysis
- 3. Syntax Analysis
- 4. Semantic Analysis
- 5. Intermediate Representation
- Optimization
- 5. Security
- 6. Debugger
- 7. IDE
- 8. Reference
1. Foundation
- compiler
- JIT compiler
- Interpreter
2. Lexical Analysis
3. Syntax Analysis
Syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar.
3.1. Top-down parsing
LL parser
3.2. Bottom-up parsing
Shift-Reduce
Operator-precedence parser (Shunting-yard)
4. Semantic Analysis
5. Intermediate Representation
LLVM
The following code snippet is from this blog, illustrating how LLVM IR looks like
unsigned add1(unsigned a, unsigned b) {
return a+b;
}
// Perhaps not the most efficient way to add two numbers.
unsigned add2(unsigned a, unsigned b) {
if (a == 0) return b;
return add2(a-1, b+1);
}
Those C codes will be translated to the following IRs
``` define i32 @add1(i32 %a, i32 %b) { entry: %tmp1 = add i32 %a, %b ret i32 %tmp1 }
define i32 @add2(i32 %a, i32 %b) { entry: %tmp1 = icmp eq i32 %a, 0 br i1 %tmp1, label %done, label %recurse
recurse:
%tmp2 = sub i32 %a, 1
%tmp3 = add i32 %b, 1
%tmp4 = call i32 @add2(i32 %tmp2, i32 %tmp3)
ret i32 %tmp4
done:
ret i32 %b
}
Optimization
Compiler optimization is generally implemented using a sequence of optimizing transformations, algorithms which take a program and transform it to produce a semantically equivalent output program that uses fewer resources or executes faster.
Loop optimization
Dataflow optimization
Dataflow is represetned with Control Flow Graph (also abbreviated as CFG), it consists of the basic block, which is a block of code that can can only branch to the beginning of other blocks at the end of this block. This help analysis of data flow.
Some simple optimizations are:
- constant folding and propagation: replacing expressions consisting with constants with their final value (e.g. 3+5 -> 8)
5. Security
There are several compilers features to mitigate security risks
Stack smashing protector To mitigate risk of stack buffer overflow. (only mitigate, can be exploited by overwriting canary by the correct value, maybe overwriting the canary's global reference ?) insert canary variable into higher stack address before running risk function. Check whether the canary (expected value is on heap) has been modified after running risk.
/* Note how buffer overruns are undefined behavior and the compilers tend to
optimize these checks away if you wrote them yourself, this only works
robustly because the compiler did it itself. */
extern uintptr_t __stack_chk_guard;
noreturn void __stack_chk_fail(void);
void foo(const char* str)
{
uintptr_t canary = __stack_chk_guard; // automatically inserted by compiler
char buffer[16];
strcpy(buffer, str);
if ( (canary = canary ^ __stack_chk_guard) != 0 ) // automatically inserted by compiler
__stack_chk_fail();
}
6. Debugger
6.1. gdb
i r: print variables x/(number) (memory or register): print data around memory
6.2. lldb
Main Reference is here
6.2.1. frame instructions
- bt: print stack frames
- fr v (-L): print variables in a frame (with its location)
- fr s frame_number: switch to different frames
6.2.2. memory instructions
- mem read (memory or register)+(offset): print data around memory
6.2.3. evaluation
- eval (expression)
7. IDE
There is a concept called Language Server Protocol, which splits the language-related parsing features into a service provided from a remote server, the client is usually an editor which interacts with server using this json-based protocol.
8. Reference
[1] lldb cheat sheet [2] lldb documentation [3] 9cc compiler and its explanation (Japanese)