0x500 Accelerator

1. FPGA

FPGAs enable fast hardware prototyping and customized hardware acceleration. Modern FPGAs are often integrated with processors on the same chip. They are expensive, though: an Intel Stratix 10 appears to cost over 20k dollars.

1.1. Microarchitectures

System-on-Chip Products

  • Altera (Intel): Arria 10 (used for NLU on Bing) -> Stratix 10
  • Xilinx: Zynq Ultrascale+

A Look-Up Table (LUT) implements arbitrary combinational logic.

A LUT is built from a MUX and SRAM. The SRAM holds the configuration bits (the truth table), and the inputs drive the MUX select lines to pick one of the stored bits. A LUT can therefore implement any truth table over its inputs (e.g. AND, NOR, ...). 6-input LUTs (6-LUTs) are typical.
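The MUX-plus-SRAM description can be made concrete with a small software model. This is only a toy sketch, not how any vendor tool represents a LUT: the class name `LUT`, the helper `from_function`, and the convention that the first input is the most significant address bit are all assumptions made for illustration.

```python
from itertools import product


class LUT:
    """Toy k-input look-up table: 2**k configuration bits read through a mux."""

    def __init__(self, k, config_bits):
        if len(config_bits) != 2 ** k:
            raise ValueError(f"a {k}-LUT needs {2 ** k} configuration bits")
        self.k = k
        self.sram = list(config_bits)  # the configuration memory

    @classmethod
    def from_function(cls, k, fn):
        # "Programming" the LUT: tabulate fn over every input combination.
        # Assumed convention: the first input is the most significant address bit.
        bits = [int(bool(fn(*inputs))) for inputs in product((0, 1), repeat=k)]
        return cls(k, bits)

    def __call__(self, *inputs):
        if len(inputs) != self.k:
            raise ValueError(f"expected {self.k} inputs")
        # The inputs act as mux select lines: pack them into an SRAM address.
        address = 0
        for bit in inputs:
            address = (address << 1) | (bit & 1)
        return self.sram[address]


and_lut = LUT.from_function(2, lambda a, b: a & b)
nor_lut = LUT.from_function(2, lambda a, b: not (a | b))
parity6 = LUT.from_function(6, lambda *xs: sum(xs) % 2)  # a 6-LUT holds 64 bits
```

The same structure yields AND, NOR, or 6-input parity purely by changing the stored bits, which is the sense in which a LUT can simulate any truth table.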

Switches: configurable switches connect LUTs to one another. They play a role similar to a bus.
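To illustrate what configuring the switches means, here is a toy model in which the routing is just a table saying which signal feeds each LUT input. The names `evaluate`, `luts`, `parity_routing`, and `other_routing` are hypothetical, and real FPGA interconnect (switch boxes, routing channels) is far more constrained than a free-form dictionary.

```python
# Each 2-input LUT is represented by its 4 configuration bits,
# indexed by (in0 << 1) | in1.
XOR2 = [0, 1, 1, 0]


def evaluate(luts, routing, primary_inputs):
    """Evaluate a tiny fabric of 2-input LUTs.

    luts:    name -> truth table (list of 4 bits)
    routing: name -> (source0, source1); each source is a primary input or
             the output of a LUT listed earlier (the "switch" settings)
    """
    signals = dict(primary_inputs)
    for name, table in luts.items():  # dicts preserve insertion order
        in0, in1 = (signals[src] for src in routing[name])
        signals[name] = table[(in0 << 1) | in1]
    return signals


luts = {"lut0": XOR2, "lut1": XOR2}

# lut1 reads lut0's output: the pair computes a XOR b XOR c.
parity_routing = {"lut0": ("a", "b"), "lut1": ("lut0", "c")}

# Same LUT contents, different switch settings: lut1 now computes a XOR c.
other_routing = {"lut0": ("a", "b"), "lut1": ("a", "c")}

inputs = {"a": 1, "b": 1, "c": 0}
parity_out = evaluate(luts, parity_routing, inputs)["lut1"]
other_out = evaluate(luts, other_routing, inputs)["lut1"]
```

The LUT contents are identical in both cases; only the routing table differs, and so does the resulting circuit.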

1.2. Synthesize

Language: Verilog or VHDL

FPGA routing is an NP-hard problem (a disadvantage of FPGAs)

2. GPU

2.1. Microarchitecture

Compared with a CPU, a GPU has more cores but devotes less of the chip to cache and flow control:

[Figure: GPU vs. CPU]

2.1.1. Generation

Pascal

  • P100

Volta (followed by Turing, which is a separate architecture)

  • V100

Ampere

  • A100

Hopper

  • H100

2.2. Architecture

2.2.1. PTX

GPU primitives

2.3. Performance

See this doc

3. CUDA API

CUDA reference

3.1. Stream Management

3.2. Memory Management

Unified Virtual Addressing (UVA): CUDA devices can share a unified address space with the host.

3.3. GPUDirect

Official doc

3.3.1. CUDA-Aware MPI

See this blog

With CUDA-Aware MPI, device pointers can be passed directly to MPI_Send and MPI_Recv, so data is sent from one GPU to another without the application staging it through host memory.

4. TPU

4.1. Microarchitecture

For design details, check this paper

From high-level to low-level hardware concepts:

  • each pod is a contiguous set of TPUs grouped together over a specialized network
  • each slice is a collection of chips, all located inside the same TPU Pod, connected by high-speed inter-chip interconnects (ICI)
  • each TPU chip has one or two TensorCores
  • each TensorCore (i.e. TPU's core) has one or more matrix-multiply units (MXUs)
  • each MXU is composed of 128 x 128 multiply-accumulators in a systolic array
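The systolic array at the bottom of this hierarchy can be sketched as a cycle-by-cycle simulation. The sketch below assumes a weight-stationary dataflow (weights preloaded into the cells, activations flowing left to right, partial sums flowing top to bottom); the function name `systolic_matmul` and the tiny 2 x 2 example are illustrative and do not reflect the actual hardware design.

```python
MXU_DIM = 128
MACS_PER_MXU = MXU_DIM * MXU_DIM  # multiply-accumulators in one 128 x 128 MXU


def systolic_matmul(X, W):
    """Multiply X (M x K) by W (K x N) on a simulated K x N systolic array.

    Cell (k, n) permanently holds W[k][n]. Returns (Y, cycles).
    """
    M, K, N = len(X), len(W), len(W[0])
    a_reg = [[0] * N for _ in range(K)]  # activation latched in each cell
    p_reg = [[0] * N for _ in range(K)]  # partial sum latched in each cell
    Y = [[0] * N for _ in range(M)]
    cycles = M + K + N - 2
    for t in range(cycles):
        new_a = [[0] * N for _ in range(K)]
        new_p = [[0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                if n == 0:
                    m = t - k  # skewed feed: array row k starts k cycles late
                    a = X[m][k] if 0 <= m < M else 0
                else:
                    a = a_reg[k][n - 1]  # activation from the left neighbour
                p_in = p_reg[k - 1][n] if k > 0 else 0  # partial sum from above
                new_a[k][n] = a
                new_p[k][n] = p_in + a * W[k][n]  # one multiply-accumulate
        a_reg, p_reg = new_a, new_p
        for n in range(N):
            m = t - (K - 1) - n  # output row leaving the bottom of column n
            if 0 <= m < M:
                Y[m][n] = p_reg[K - 1][n]
    return Y, cycles


Y, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Each cell only ever talks to its left and upper neighbours, which is what lets the array scale to 128 x 128 cells without every cell needing its own path to memory.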

4.2. Architecture

5. Reference

  • Altera FPGA white paper
  • What is a LUT
  • A HN thread comparing FPGA with GPU
  • Lecture on FPGA
  • Operating Systems Three Easy Pieces