Skip to content

0x221 Microarchitecture

1. Arithmetic

1.1. Integer

1.2. Real Numbers

Note that historically, floating point is not the only representation for real numbers, there were fixed point representations where the gaps are all of the same size

1.2.1. Fixed-point Representation

Proposed by William Kahan (Turing 1989), as an effort to design intel 8087.

1.2.2. Float Representation

The IEEE 754 standard defines the representation of floating point as follows

\[(-1)^S (1+Fraction) \times 2^{(Exponent - Bias)}\]

The part of \(1+Fraction\) is also called significand, the fraction is also known as mantissa

Representations are different depending on the values of exponential

1. normalized case when exponent are not all zero or all one

\[(-1)^S (1.f_{n-1}f_{n-2}...f_{0}) \times 2^{(e_{k-1}e_{k-2}...e_{0} - Bias)}\]

where \(bias=2^{k-1}-1\)

2. denormalized case when exponent are all zero, then

\[(-1)^S (0.f_{n_1}f_{n-2}...f_{0}) \times 2^{1-Bias}\]

Notice both significant and exponent part have changed. This representation has a smooth transition from the denormalized case into normalized case. Additionally it provides a way to represent 0 (actually two way +0.0, -0.0 depending on the sign)

3. special case when exponent are all 1

  • if fraction is 0, it is infinity
  • otherwise fraction are nonzero, it is NaN

floatpoint

8 bit float number

exponent 4 bit, fraction 3 bit example from CSAPP

8bitfloat

single-precision

float

  • S is 1 bit
  • Exponent is 8 bit and Bias is \(127_{Ten}\)
  • Fraction is 24 bit (6 decimal digits of precision)
  • range is around \([2.0 \times 10^{-38}, 2.0 \times 10^{38}]\)
var f float32 = 16777216  // 1<<24
fmt.Println(f == f+1)  // true

double-precision

In double-precision

  • S is 1 bit
  • Fraction is 11 bit and Bias is \(1023_{Ten}\)
  • Fraction is 52 bit (15 decimal digits of precision)
  • range is around \([2.0 \times 10^{-308}, 2.0 \times 10^{308}]\)

To find the detailed numbers on each machine, you can consult from standard C header.

1.2.2.1. Rounding

IEE754 use the Round-to-Even as the default mode.

  • It in general rounds to the nearest number
  • when the target is at the half of two numbers (e.g: \(XXX.YYY1000\)), then it rounds so that the least significant bit is even (0).

Other possible roundings are

  • round toward zero
  • round up
  • round down

2. Processor

2.1. Notable CPU

  • 4 bit: intel 4004 (first intel chip, 1970, 2k transistor)
  • 8 bit: intel 8008 (1972, 3k transistors)
  • 16 bit: intel 8086 (1976), PDP-11 (minicomputer DEC 1970)
  • 32 bit: intel 80386 (1985), VAX-11 (DEC 1977)
  • 64 bit

2.2. Flynn's taxonomy

2.2.1. SISD

pipelines

superscalar processor The superscalar architecture implements parallelism within one core executing independent part of instruction from the same instruction stream. This was one of the main strategies in the pre-multi core era, but requires a lot transistors for cache, branch predictor, out-of-order logics. P5 Pentium was the first x86 superscalar processor.

The following example shows a single-core architecture which can execute two independent instructions simultaneously from a single instruction stream.

superscalar

2.2.2. SIMD

vector processor

instruction stream coherence: same instruction sequence applied to all elements, which is necessary for efficient SIMD execution, but not necessary for multicore parallelization

SSE instructions: 128 bit (4 wide float)

AVX instructions: 256 bits (8 wide float)

2.2.3. MISD

2.2.4. MIMD

hyper-threading: super-scalar with multiple execution contexts in a single core

multi-core: thread-level parallelism. simultaneously execute a completely different instruction stream on each core

multicore

2.3. Pipeline

2.4. Execution

3. Frontend

Single-cycle implementation: an instruction is executed in one clock cycle, the slowest instruction decide cycle time

Multi-cycle implementation: instruction processing broken into multiple cycles/stages.

3.1. Branch Prediction

3.2. Decode

4. Backend

4.1. Microcode

5. Memory

5.1. Cache

Cache is usually implemented with SRAM

5.1.1. Hierarchy

  • L1: reference 1ns, usually in core
  • L2: reference 4ns, usually out core
  • L3: usually shared by multiple cores

5.1.2. Placement Policy

  • full associative cache: each memory can be placed anywhere
  • directed mapped cache: each memory can be placed at one place
  • LRU: Least Recently Used

5.1.3. Management

  • Write-through: write data to cache and RAM at the same time
  • Write-back: delay writing data to RAM

5.2. Memory Controller

5.2.1. MMU

the unit to translate virtual address into physical address

5.2.2. TLB

  • the cache that saves the recent address mapping
  • it is a cache of page tables
  • only store the final translation even it is a multiple-level memory
  • change cr3 in x86 can clear TLB automatically

6. Storage

7. Reference

[1] Patterson, David A., and John L. Hennessy. Computer Organization and Design ARM Edition: The Hardware Software Interface. Morgan kaufmann, 2016.

[2] Hennessy, John L., and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.

[3] CMU 15-418/15-618: Parallel Computer Architecture and Programming

[4] CMU 18-447 Introduction to Computer Architecture

[5] CSAPP