0x220 Computer Microarchitecture

Computer microarchitecture is the implementation of the ISA under specific design constraints and goals. It is an abstraction layer between the logic level and the architecture level.

Arithmetic

ALU

FPU

Processor

Notable CPUs

  • 4-bit: Intel 4004 (first Intel microprocessor, 1971, about 2,300 transistors)
  • 8-bit: Intel 8008 (1972, about 3,500 transistors)
  • 16-bit: Intel 8086 (1978), PDP-11 (DEC minicomputer, 1970)
  • 32-bit: Intel 80386 (1985), VAX-11 (DEC, 1977)
  • 64-bit

Flynn’s taxonomy

SISD

pipelines

superscalar processor: a superscalar architecture implements parallelism within one core by executing independent instructions from the same instruction stream at the same time. This was one of the main strategies in the pre-multicore era, but it requires many transistors for caches, branch predictors, and out-of-order logic. The P5 Pentium was the first x86 superscalar processor.

For example, a single-core superscalar design can execute two independent instructions simultaneously from a single instruction stream.
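The dual-issue idea can be sketched in Python under simplifying assumptions: a toy in-order machine where every instruction takes one cycle and is written as (destination, source1, source2). The register names and the helpers depends and dual_issue are hypothetical and do not model any real pipeline.

```python
from typing import List, Tuple

# An instruction is (destination, source1, source2); names are illustrative.
Instr = Tuple[str, str, str]

def depends(later: Instr, earlier: Instr) -> bool:
    """True if `later` cannot issue in the same cycle as `earlier`."""
    dst_earlier = earlier[0]
    dst_later, src_a, src_b = later
    # Read-after-write: `later` reads the result of `earlier`.
    # Write-after-write: both write the same register.
    return dst_earlier in (src_a, src_b) or dst_earlier == dst_later

def dual_issue(program: List[Instr]) -> List[List[Instr]]:
    """Group an in-order stream into cycles of at most two instructions."""
    cycles = []
    i = 0
    while i < len(program):
        group = [program[i]]
        # Pair the next instruction only if it is independent of this one.
        if i + 1 < len(program) and not depends(program[i + 1], program[i]):
            group.append(program[i + 1])
        cycles.append(group)
        i += len(group)
    return cycles

program = [
    ("r1", "r2", "r3"),  # r1 = r2 op r3
    ("r4", "r5", "r6"),  # independent of the first, so it can pair with it
    ("r7", "r1", "r4"),  # reads r1 and r4
    ("r8", "r7", "r2"),  # reads r7, so it cannot pair with the previous one
]
schedule = dual_issue(program)
print([len(group) for group in schedule])
```

Four instructions finish in three cycles instead of four here; the gain depends entirely on how much independent work the stream contains.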

Reference: CMU 15-418 Parallel Computer Architecture and Programming

SIMD

vector processor

instruction stream coherence: the same instruction sequence is applied to all elements. This is necessary for efficient SIMD execution, but not for multi-core parallelization, where each core follows its own instruction stream.
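A rough Python sketch of why coherence matters, assuming a toy SIMD unit that handles a branch by running each taken path once under a per-lane mask. The function simd_branch and the two branch bodies are hypothetical; real hardware masking differs in detail.

```python
from typing import List, Tuple

def simd_branch(values: List[int]) -> Tuple[List[int], int]:
    """Apply `x * 2 if x >= 0 else -x` to every lane.

    Returns the per-lane results and the number of passes over the lanes.
    """
    mask = [v >= 0 for v in values]  # every lane evaluates the condition
    result = list(values)
    passes = 0
    if any(mask):
        # "then" path: runs once, only lanes with mask == True keep the result
        passes += 1
        for i, active in enumerate(mask):
            if active:
                result[i] = values[i] * 2
    if not all(mask):
        # "else" path: runs once, only lanes with mask == False keep the result
        passes += 1
        for i, active in enumerate(mask):
            if not active:
                result[i] = -values[i]
    return result, passes

print(simd_branch([1, 2, 3, 4]))    # coherent lanes: one pass
print(simd_branch([1, -2, 3, -4]))  # divergent lanes: two passes
```

When all lanes agree on the branch, only one path runs. When they diverge, both paths run and part of the lanes sit idle on each pass.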

SSE instructions: 128 bits (4-wide float)

AVX instructions: 256 bits (8-wide float)
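The lane counts follow from dividing the register width by the 32-bit width of a single-precision float. The Python sketch below only counts how many vector instructions an element-wise add would need; lanes and simd_add are hypothetical helpers, not real intrinsics.

```python
from typing import List, Tuple

FLOAT_BITS = 32  # width of a single-precision float

def lanes(register_bits: int) -> int:
    """Number of 32-bit floats that fit in one vector register."""
    return register_bits // FLOAT_BITS

def simd_add(a: List[float], b: List[float], register_bits: int) -> Tuple[List[float], int]:
    """Element-wise add, counting one vector instruction per group of lanes."""
    width = lanes(register_bits)
    out: List[float] = []
    instructions = 0
    for start in range(0, len(a), width):
        # One vector add handles up to `width` elements at once.
        chunk_a = a[start:start + width]
        chunk_b = b[start:start + width]
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
        instructions += 1
    return out, instructions

a = [float(i) for i in range(16)]
b = [1.0] * 16
print(lanes(128), simd_add(a, b, 128)[1])  # SSE-style width
print(lanes(256), simd_add(a, b, 256)[1])  # AVX-style width
```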

MISD

MIMD

hyper-threading: a superscalar core with multiple execution contexts, so several threads share the execution units of a single core

multi-core: thread-level parallelism. Each core simultaneously executes a completely different instruction stream.

Reference: CMU 15-418 Parallel Computer Architecture and Programming

Pipeline

Execution

Branch Prediction

security: Meltdown and Spectre

Processor

Single-cycle implementation: every instruction executes in one clock cycle, so the slowest instruction determines the cycle time.

Multi-cycle implementation: instruction processing is broken into multiple cycles/stages, so the slowest stage determines the cycle time.
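A small worked comparison in Python. The stage latencies and the stages used by each instruction class are made-up illustrative values, not measurements of any real processor.

```python
# Hypothetical stage latencies in picoseconds (illustrative only).
STAGE_PS = {
    "fetch": 200,
    "decode": 100,
    "execute": 200,
    "memory": 200,
    "writeback": 100,
}

# Assumed stages used by each instruction class.
INSTR_STAGES = {
    "load": ["fetch", "decode", "execute", "memory", "writeback"],
    "store": ["fetch", "decode", "execute", "memory"],
    "add": ["fetch", "decode", "execute", "writeback"],
    "branch": ["fetch", "decode", "execute"],
}

def single_cycle_period() -> int:
    """Clock period must cover the slowest whole instruction."""
    return max(sum(STAGE_PS[s] for s in stages) for stages in INSTR_STAGES.values())

def multi_cycle_period() -> int:
    """Clock period must cover only the slowest single stage."""
    return max(STAGE_PS.values())

def multi_cycle_time(instr: str) -> int:
    """An instruction takes one cycle per stage it uses."""
    return len(INSTR_STAGES[instr]) * multi_cycle_period()

print("single-cycle period:", single_cycle_period())
for name in INSTR_STAGES:
    print(name, "multi-cycle time:", multi_cycle_time(name))
```

With these numbers every instruction costs 800 ps in the single-cycle design, while the multi-cycle design runs a branch in 600 ps but a load in 1000 ps. The outcome depends on the instruction mix.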

Microcode

Memory

Cache

Cache is usually implemented with SRAM

Hierarchy

  • L1: about 1 ns per reference, usually inside the core
  • L2: about 4 ns per reference, historically outside the core, though most modern designs give each core a private L2
  • L3: usually shared by multiple cores

Placement Policy

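  • fully associative cache: a memory block can be placed in any cache line
  • direct-mapped cache: each memory block can be placed in exactly one cache line
  • LRU (Least Recently Used): a replacement policy that evicts the line that has gone unused the longest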
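The policies above can be sketched in Python, assuming addresses are already expressed as block numbers. The function direct_mapped_line and the class FullyAssociativeLRU are hypothetical names for illustration.

```python
from collections import OrderedDict

def direct_mapped_line(block_addr: int, num_lines: int) -> int:
    """Direct-mapped placement: the block number alone picks the line."""
    return block_addr % num_lines

class FullyAssociativeLRU:
    """Fully associative cache: any block may use any line; evict by LRU."""

    def __init__(self, num_lines: int) -> None:
        self.num_lines = num_lines
        self.lines: "OrderedDict[int, None]" = OrderedDict()  # oldest first

    def access(self, block_addr: int) -> bool:
        """Return True on a hit, False on a miss."""
        if block_addr in self.lines:
            self.lines.move_to_end(block_addr)  # mark as most recently used
            return True
        if len(self.lines) == self.num_lines:
            self.lines.popitem(last=False)  # evict the least recently used
        self.lines[block_addr] = None
        return False

# Blocks 0 and 4 collide in a 4-line direct-mapped cache.
print(direct_mapped_line(0, 4), direct_mapped_line(4, 4))

cache = FullyAssociativeLRU(num_lines=2)
hits = [cache.access(block) for block in [0, 4, 0, 8, 0, 4]]
print(hits)
```

In the access trace, block 8 evicts block 4 because block 0 was touched more recently.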

Management

  • Write-through: every write updates the cache and RAM at the same time
  • Write-back: a write updates only the cache; the data is written to RAM later, typically when the line is evicted
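A toy Python model of the two policies that counts writes reaching RAM. The classes WriteThroughCache and WriteBackCache are hypothetical; both ignore capacity, line size, and reads.

```python
class WriteThroughCache:
    """Every write goes to the cache and to RAM immediately."""

    def __init__(self) -> None:
        self.cache = {}
        self.ram = {}
        self.ram_writes = 0

    def write(self, addr: int, value: int) -> None:
        self.cache[addr] = value
        self.ram[addr] = value
        self.ram_writes += 1

class WriteBackCache:
    """Writes stay in the cache; RAM is updated when a dirty line is evicted."""

    def __init__(self) -> None:
        self.cache = {}
        self.dirty = set()
        self.ram = {}
        self.ram_writes = 0

    def write(self, addr: int, value: int) -> None:
        self.cache[addr] = value
        self.dirty.add(addr)  # remember that RAM is now stale

    def evict(self, addr: int) -> None:
        if addr in self.dirty:
            self.ram[addr] = self.cache[addr]
            self.ram_writes += 1
            self.dirty.discard(addr)
        self.cache.pop(addr, None)

wt, wb = WriteThroughCache(), WriteBackCache()
for value in (1, 2, 3):
    wt.write(0x10, value)
    wb.write(0x10, value)
wb.evict(0x10)
print(wt.ram_writes, wb.ram_writes)
```

Three writes to the same address cost three RAM writes with write-through, but only one with write-back.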

Memory Controller

MMU

The MMU (memory management unit) is the unit that translates virtual addresses into physical addresses.
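A minimal Python sketch of the translation step, assuming 4 KiB pages and a flat page table stored as a dictionary from virtual page number to physical frame number. PAGE_SIZE, PageFault, and translate are hypothetical names.

```python
PAGE_SIZE = 4096  # assumed page size: 4 KiB

class PageFault(Exception):
    """Raised when a virtual page has no mapping."""

def translate(vaddr: int, page_table: dict) -> int:
    """Split the virtual address, look up the frame, rebuild the address."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)  # virtual page number, page offset
    if vpn not in page_table:
        raise PageFault(hex(vaddr))
    return page_table[vpn] * PAGE_SIZE + offset

page_table = {1: 7}  # virtual page 1 -> physical frame 7
print(hex(translate(0x1234, page_table)))
```

Only the page number is translated; the offset within the page is copied unchanged.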

TLB

  • the TLB (translation lookaside buffer) is a cache of recent address mappings
  • it is a cache of page table entries
  • it stores only the final translation, even when the page table has multiple levels
  • on x86, writing to CR3 automatically flushes the TLB (entries for global pages can be kept)
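The TLB behaviour can be sketched in Python, assuming 4 KiB pages, a two-level page table built from nested dictionaries, and a root attribute standing in for CR3. The class ToyMmu and its fields are hypothetical, and global pages are ignored.

```python
class ToyMmu:
    PAGE_BITS = 12   # assumed 4 KiB pages
    INDEX_BITS = 10  # assumed 10 bits of page number per table level

    def __init__(self, root: dict) -> None:
        self.root = root  # top-level page table, stands in for CR3
        self.tlb = {}     # virtual page number -> physical frame number
        self.walks = 0    # number of page table walks performed

    def set_root(self, root: dict) -> None:
        """Switch address space; like a CR3 write, this flushes the TLB."""
        self.root = root
        self.tlb.clear()

    def translate(self, vaddr: int) -> int:
        vpn = vaddr >> self.PAGE_BITS
        offset = vaddr & ((1 << self.PAGE_BITS) - 1)
        if vpn not in self.tlb:
            # TLB miss: walk both levels (KeyError if unmapped),
            # then cache only the final frame number.
            self.walks += 1
            top = vpn >> self.INDEX_BITS
            low = vpn & ((1 << self.INDEX_BITS) - 1)
            self.tlb[vpn] = self.root[top][low]
        return (self.tlb[vpn] << self.PAGE_BITS) | offset

mmu = ToyMmu({0: {1: 7}})
first = mmu.translate(0x1234)   # miss: walks the table
second = mmu.translate(0x1238)  # hit: same page, no walk
mmu.set_root({0: {1: 9}})       # new address space, TLB flushed
third = mmu.translate(0x1234)   # miss again, new mapping
print(hex(first), hex(second), hex(third), mmu.walks)
```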

Storage

Reference

[1] Patterson, David A., and John L. Hennessy. Computer Organization and Design ARM Edition: The Hardware Software Interface. Morgan Kaufmann, 2016.

[2] Hennessy, John L., and David A. Patterson. Computer Architecture: A Quantitative Approach. Elsevier, 2011.

[3] CMU 15-418/15-618: Parallel Computer Architecture and Programming

[4] CMU 18-447 Introduction to Computer Architecture