0x500 Accelerator
1. FPGA
FPGAs provide fast hardware prototyping and optimized hardware acceleration. Modern FPGAs are often integrated with processors on the same chip. They are expensive, though: an Intel Stratix 10 appears to cost over 20k dollars.
1.1. Microarchitectures
System-on-Chip Products
- Altera (Intel): Arria 10 (used for NLU on Bing) -> Stratix 10
- Xilinx: Zynq Ultrascale+
Look-Up Tables (LUTs) implement arbitrary combinational logic.
A LUT is built from a MUX and SRAM. The SRAM stores the configuration bits, i.e. the truth table, and the inputs drive the MUX select lines to pick one stored bit as the output. A LUT can therefore realize any truth table (e.g. AND, NOR, ...). LUTs typically have 6 inputs (6-LUTs).
Switches can be configured to connect LUTs to one another, somewhat like a bus.
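To make the MUX-plus-SRAM picture concrete, here is a minimal software sketch of a k-input LUT. The names `make_lut` and `program_lut` are hypothetical helpers invented for this illustration, and the bit ordering (first input is the most significant select bit) is an assumption, not something a particular FPGA vendor specifies.

```python
from itertools import product


def make_lut(truth_table):
    """Build a k-input LUT from 2**k configuration bits.

    truth_table[i] is the output when the inputs, read as a binary
    number with the first input as the most significant bit, equal i.
    """
    n_entries = len(truth_table)
    if n_entries == 0 or n_entries & (n_entries - 1):
        raise ValueError("truth table length must be a power of two")
    k = n_entries.bit_length() - 1
    # The "SRAM": one stored configuration bit per input combination.
    config = tuple(int(bool(b)) for b in truth_table)

    def lut(*inputs):
        if len(inputs) != k:
            raise ValueError(f"expected {k} inputs, got {len(inputs)}")
        # The "MUX": the inputs form the select index into the SRAM.
        index = 0
        for bit in inputs:
            index = (index << 1) | int(bool(bit))
        return config[index]

    return lut


def program_lut(fn, k):
    """Fill a k-input LUT by evaluating fn on every input combination."""
    return make_lut([fn(*bits) for bits in product((0, 1), repeat=k)])


and2 = make_lut([0, 0, 0, 1])   # AND: only inputs (1, 1) give 1
nor2 = make_lut([1, 0, 0, 0])   # NOR: only inputs (0, 0) give 1
parity6 = program_lut(lambda *bits: sum(bits) % 2, 6)  # a 6-LUT
```

Reprogramming the LUT only changes the stored bits, never the lookup logic, which is why the same structure can stand in for any gate.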
1.2. Synthesis
Language: Verilog or VHDL
FPGA routing is an NP-hard problem (a disadvantage of FPGAs)
2. GPU
2.1. Microarchitecture
Compared with a CPU, a GPU has more cores but less cache and less control-flow logic.
2.1.1. Generation
Pascal
- P100
Volta (closely related to the later Turing)
- V100
Ampere
- A100
Hopper
- H100
2.2. Architecture
2.2.1. PTX
PTX (Parallel Thread Execution) is NVIDIA's low-level virtual instruction set; it exposes the GPU primitives.
2.3. Performance
See this doc
3. CUDA API
3.1. Stream Management
3.2. Memory Management
Unified Virtual Addressing (UVA): CUDA devices can share a single virtual address space with the host.
3.3. GPUDirect
3.3.1. CUDA-Aware MPI
See this blog
With CUDA-Aware MPI, we can call MPI_Send and MPI_Recv on device buffers, sending data from one GPU directly to another without going through host memory.
4. TPU
4.1. Microarchitecture
For design details, check this paper
From high-level to low-level hardware concepts:
- each pod is a contiguous set of TPUs grouped together over a specialized network
- each slice is a collection of chips, all located inside the same TPU Pod, connected by high-speed inter-chip interconnects (ICI)
- each TPU chip has one or two TensorCores
- each TensorCore (i.e. TPU's core) has one or more matrix-multiply units (MXUs)
- each MXU is composed of 128 x 128 multiply-accumulators in a systolic array
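As an illustration of the lowest level of this hierarchy, the sketch below simulates a tiny systolic array cycle by cycle. It assumes a weight-stationary dataflow (weights stay in the cells, activations move left to right, partial sums move top to bottom). That dataflow and the function name `systolic_matmul` are assumptions made for this example; the grid here is a few cells wide, whereas a real MXU is 128 x 128.

```python
def systolic_matmul(X, W):
    """Compute X @ W on a simulated K x N weight-stationary systolic array.

    X is M x K (activations), W is K x N (weights held in the cells).
    Cell (k, n) stores W[k][n]; each cycle it multiplies the activation
    arriving from its left neighbor, adds the partial sum arriving from
    the cell above, and passes both values on.
    """
    M, K, N = len(X), len(W), len(W[0])
    a_reg = [[0] * N for _ in range(K)]  # activation latched in each cell
    p_reg = [[0] * N for _ in range(K)]  # partial sum latched in each cell
    Y = [[0] * N for _ in range(M)]

    # The last result (row M-1, column N-1) leaves the array at cycle
    # (M-1) + (K-1) + (N-1), so M + K + N - 2 cycles are needed in total.
    for t in range(M + K + N - 2):
        new_a = [[0] * N for _ in range(K)]
        new_p = [[0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                if n == 0:
                    # Skewed feed: array row k receives X[m][k] at cycle m + k.
                    m = t - k
                    a_in = X[m][k] if 0 <= m < M else 0
                else:
                    a_in = a_reg[k][n - 1]   # value from the left neighbor
                p_in = p_reg[k - 1][n] if k > 0 else 0  # value from above
                new_a[k][n] = a_in
                new_p[k][n] = p_in + a_in * W[k][n]     # multiply-accumulate
        a_reg, p_reg = new_a, new_p

        # Finished sums drop out of the bottom row, one diagonal per cycle.
        for n in range(N):
            m = t - (K - 1) - n
            if 0 <= m < M:
                Y[m][n] = p_reg[K - 1][n]
    return Y


X = [[1, 2], [3, 4], [5, 6]]
W = [[1, 0, 2], [0, 1, 3]]
result = systolic_matmul(X, W)
```

Each cell only talks to its immediate neighbors, so no cell reads operands from a shared memory during the computation; this locality is the main appeal of the systolic layout.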
4.2. Architecture
Some concepts:
- A slice is a collection of chips, all located inside the same TPU Pod, connected by high-speed inter-chip interconnects (ICI)
5. Reference
- Altera FPGA white paper
- What is a LUT
- A HN thread comparing FPGA with GPU
- Lecture on FPGA
- Operating Systems Three Easy Pieces