0x500 Accelerator
- 1. GPU
- 2. TPU
- 3. CPU
- 4. Reference
1. GPU
1.1. Computing
The latest H100's computing elements have the following hierarchy:
- each card consists of Streaming Multiprocessors (SMs)
- each SM has multiple Processing Blocks (e.g. 4 in the following diagram)
- each Processing Block executes a warp every clock cycle; it consists of CUDA cores and a Tensor Core
A100 has 108 SMs with 64 cores each, 6912 cores in total.
Conceptually, GPUs have more cores but less cache and flow control than CPUs
1.1.1. CUDA Core
A Fermi SM is illustrated as follows, where
- each CUDA core can perform one single-precision fused multiply-add operation in each clock period
- a large register file is shared by the cores
- memory operations are handled by a set of 16 LD/ST (load/store) units in each SM, so addresses for 16 threads can be calculated per clock
- Special Function Units (SFUs) are available to handle special operations such as sin, cos, exp
Each SM is typically split into multiple processing blocks, where all cores in a processing block share an instruction fetch/dispatch unit (for cost efficiency). Threads in the same warp are assigned to the same processing block.
For example, there are 4 processing blocks in an A100 SM (i.e. each processing block has 16 cores).
1.1.2. Tensor Core
A Tensor Core is a specialized unit that performs a 4x4 matrix multiply-and-accumulate (D = A×B + C) as a single operation
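In practice Tensor Cores are used through libraries (cuBLAS, cuDNN) or the CUDA WMMA API. A minimal sketch using nvcuda::wmma with 16x16x16 half-precision fragments (the tile shape, layouts and single-warp launch are assumptions for illustration; requires sm_70 or newer):
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile: C = A * B (A, B in half, C in float).
__global__ void wmmaTile(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // zero the accumulator
    wmma::load_matrix_sync(a_frag, A, 16);           // all 32 threads of the warp cooperate
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // executed on the Tensor Cores
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}

// launched with a single warp, e.g. wmmaTile<<<1, 32>>>(dA, dB, dC);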
1.2. Memory
Memory Hierarchy is as follows:
Register
- private to threads
- the register file is large, which keeps context switching cheap (each thread's registers stay resident)
- when registers run out, the compiler spills to local memory, which is thread-private memory that resides in global memory
Shared Memory
- on-chip memory SRAM
- shared by all threads within a block
L2 Cache
- caches instructions and data from global memory
- managed by hardware
Global Memory (Device memory)
- off-chip memory (DRAM) accessible from all threads
- persists across kernel launches
Host Memory
- far far far away
1.2.1. Shared Memory
Striding through global memory is problematic (the accesses are not coalesced). Shared memory latency is roughly 100x lower than uncached global memory latency.
Bank Conflicts: shared memory is divided into equally sized memory modules (banks). If multiple threads request addresses in the same memory bank, the accesses are serialized and efficiency drops.
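A classic example is a tiled matrix transpose: the tile is staged in shared memory so that both the global load and the global store are coalesced, and the tile is padded by one column so that column-wise accesses map to different banks (a sketch; it assumes a square n x n matrix with n a multiple of TILE_DIM, and a 32x32 tile matching the 32 shared-memory banks):
#define TILE_DIM 32

// Transpose of a square n x n matrix, n assumed to be a multiple of TILE_DIM.
__global__ void transposeTiled(const float *in, float *out, int n) {
    // +1 column of padding so column accesses hit different banks
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced global load

    __syncthreads();

    // swap block indices so the global store is coalesced as well
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // conflict-free thanks to padding
}

// e.g. transposeTiled<<<dim3(n/TILE_DIM, n/TILE_DIM), dim3(TILE_DIM, TILE_DIM)>>>(d_in, d_out, n);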
1.2.2. Global Memory (HBM)
Global memory can be allocated statically using __device__ or dynamically using cudaMalloc
__device__ int globalArray[256];

void foo()
{
    ...
    int *myDeviceMemory = 0;
    cudaError_t result = cudaMalloc(&myDeviceMemory, 256 * sizeof(int));
    ...
}
load/store instructions to global memory issued by a warp are usually coalesced into a few transactions to make efficient use of DRAM bandwidth. See this post for more details
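To make this concrete, here is a sketch contrasting a coalesced access pattern with a strided one (the kernels and the stride parameter are purely illustrative):
// Coalesced: thread k of a warp reads element base+k, so the warp's 32 loads
// fall into one or two 128-byte transactions.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread k reads element k*stride, scattering the warp's loads across
// many memory segments; effective bandwidth drops as the stride grows.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}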
1.2.3. Host Memory
Unified Virtual Addressing: CUDA devices can share a unified address space with the host
1.3. Communication
To inspect communication links, use nvidia-smi nvlink -s
1.3.1. PCI
Typically, PCIe is the protocol used to communicate between CPU and GPU. A PCIe link is divided into lanes, where the bandwidth of a single lane (i.e. PCIe x1) depends on the generation:
- PCIe 4.0: ~2 GB/s
- PCIe 5.0: ~4 GB/s
- PCIe 6.0: ~8 GB/s
At best, PCIe x16 bandwidth is on the order of ~100 GB/s
To transfer between Host/Device efficiently, we can
- use pinned host memory (i.e. swapping not allowed) before memcpy (this prevents the CUDA driver from creating a temporary pinned staging buffer), which can be done using cudaMallocHost (see the sketch after this list)
- batching small transfers into a single transfer.
- overlapping (see the following)
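A minimal sketch of the first two points (pinned allocation plus one batched copy instead of many small ones); the sizes are arbitrary and error checking is omitted:
#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 20;
    float *h_pinned = nullptr, *d_buf = nullptr;

    // Pinned (page-locked) host memory: the driver can DMA from it directly,
    // without staging through an internal pinned buffer.
    cudaMallocHost(&h_pinned, N * sizeof(float));
    cudaMalloc(&d_buf, N * sizeof(float));

    // One large transfer instead of many small ones amortizes per-copy overhead.
    cudaMemcpy(d_buf, h_pinned, N * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}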
1.3.2. SXM
SXM is a high-bandwidth socket to connect GPUs, offering higher performance than their PCIe equivalents
1.3.3. NVLink
NVLink provides > 1 TB/s of bidirectional, direct GPU-to-GPU interconnect bandwidth
NVLink Switch connects NVLinks to provide all-to-all GPU communication
1.3.4. InfiniBand
InfiniBand handles communication between interconnected nodes (i.e. across machines)
1.3.5. GPUDirect
With CUDA-Aware MPI, we can MPI_Send and MPI_Recv from one GPU directly to another GPU without going through host memory.
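A hedged sketch, assuming an MPI build with CUDA-awareness enabled (e.g. Open MPI with UCX); the ranks, sizes and tag are arbitrary. The device pointer is handed straight to MPI, and the library takes the GPUDirect path when available.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1 << 20;
    float *d_buf;
    cudaMalloc(&d_buf, N * sizeof(float));

    // With CUDA-aware MPI, d_buf (a device pointer) can be passed directly.
    if (rank == 0)
        MPI_Send(d_buf, N, MPI_FLOAT, 1, /*tag=*/0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, N, MPI_FLOAT, 0, /*tag=*/0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

// run with e.g. mpirun -np 2 ./a.out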
1.4. Programming Model
The computational elements are known as kernels (typically written in C); a kernel is executed by many threads running the same program in parallel.
1.4.1. ISA
PTX (Parallel Thread Execution) is a low-level parallel-thread execution virtual machine and instruction set architecture (ISA). PTX exposes the GPU as a parallel computing device. The CUDA compiler generates PTX instructions
__global__ void vecAddKernel(float* A, float* B, float* C, int n){
int i = blockIdx.x*blockDim.x + threadIdx.x;
if(i<n){
C[i] = A[i] + B[i];
}
}
the code example above gets compiled into the following PTX (via nvcc -ptx add.cu)
.visible .entry _Z12vecAddKernelPfS_S_i(
.param .u64 _Z12vecAddKernelPfS_S_i_param_0,
.param .u64 _Z12vecAddKernelPfS_S_i_param_1,
.param .u64 _Z12vecAddKernelPfS_S_i_param_2,
.param .u32 _Z12vecAddKernelPfS_S_i_param_3
)
{
.reg .pred %p<2>;
.reg .f32 %f<4>;
.reg .b32 %r<6>;
.reg .b64 %rd<11>;
ld.param.u64 %rd1, [_Z12vecAddKernelPfS_S_i_param_0];
ld.param.u64 %rd2, [_Z12vecAddKernelPfS_S_i_param_1];
ld.param.u64 %rd3, [_Z12vecAddKernelPfS_S_i_param_2];
ld.param.u32 %r2, [_Z12vecAddKernelPfS_S_i_param_3];
mov.u32 %r3, %ctaid.x;
mov.u32 %r4, %ntid.x;
mov.u32 %r5, %tid.x;
mad.lo.s32 %r1, %r3, %r4, %r5;
setp.ge.s32 %p1, %r1, %r2;
@%p1 bra $L__BB0_2; // branching based on if (i<n)
cvta.to.global.u64 %rd4, %rd1; // cvta converts a generic address to a global-space address
mul.wide.s32 %rd5, %r1, 4;
add.s64 %rd6, %rd4, %rd5;
cvta.to.global.u64 %rd7, %rd2;
add.s64 %rd8, %rd7, %rd5;
ld.global.f32 %f1, [%rd8];
ld.global.f32 %f2, [%rd6];
add.f32 %f3, %f2, %f1;
cvta.to.global.u64 %rd9, %rd3;
add.s64 %rd10, %rd9, %rd5;
st.global.f32 [%rd10], %f3;
$L__BB0_2:
ret;
}
SASS is the low-level assembly language that compiles to binary microcode, which executes natively on NVIDIA GPU hardware
1.4.2. Scheduling
Grid
- Thread blocks are grouped into a grid
- Each block in the grid can be dispatched to a single SM and executed in any order relative to other blocks
- gridDim.x ranges from 1 to 2^31-1; gridDim.y and gridDim.z can be up to 2^16-1 (65535)
Block scheduling
- A block will run on a single SM, so threads within the same block can cooperate and share memory (e.g. using __syncthreads) within its SM
- Multiple blocks may be assigned to the same SM simultaneously
- each block can contain up to 1024 threads (the product of the 3 dimensions of blockDim); see the launch sketch below
Note that Block is also called CTA (cooperative thread array) in some documents.
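As a concrete launch for the vecAddKernel above (a block size of 256 is an arbitrary but common choice): the grid is sized so that every element gets a thread, and the i<n guard in the kernel handles the final partial block.
// A, B, C are assumed to already be device pointers of length n.
void vecAdd(float *A, float *B, float *C, int n) {
    int threadsPerBlock = 256;                                     // at most 1024
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAddKernel<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, n);  // each block runs on a single SM
}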
Warp scheduling
- a warp (a set of 32 threads) is the fundamental unit of dispatch within a single SM
- Inside the assigned SM, a warp is further assigned to a processing block, which executes all threads in this warp. It might take multiple clock cycles depending on the number of cores in the processing block (see this QA)
Control Divergence
A warp uses an active mask to handle control divergence
Volta introduces more fine-grained concurrency by maintaining more state (e.g. a program counter) per thread. See the section on independent thread scheduling in this nvidia blog
There is also discussion of dynamic warp formation (Fung et al., 2007)1 to maximize utilization. Watch from 1:34:00 of Onur's lecture about this. Unfortunately nvidia does not seem to have released any public info about it.
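A minimal example of intra-warp divergence: odd and even lanes take different paths, so (pre-Volta) the SM executes one side with half the lanes masked off and then the other.
// Odd and even lanes of a warp take different branches, so the two sides
// are serialized, with the active mask selecting which lanes commit results.
__global__ void divergent(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        out[i] = 2.0f * i;   // executed with even lanes active
    else
        out[i] = 0.5f * i;   // executed with odd lanes active
}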
1.4.3. Stream Management
A CUDA stream is a sequence (queue) of device operations (e.g. kernel launches, data transfers).
Every device operation in CUDA runs in a stream (the default stream if none is specified). Streams can be used to overlap data transfer, host computation and device computation
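A sketch of such overlap: the work is split into per-stream chunks so the H2D copy, kernel, and D2H copy of different chunks can overlap. The elementwise kernel and the two-stream chunking are placeholders, h_in/h_out must be pinned host memory, and n is assumed divisible by the number of streams.
#include <cuda_runtime.h>

// Hypothetical elementwise kernel used only for illustration.
__global__ void scaleKernel(const float *in, float *out, int chunk) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < chunk) out[i] = 2.0f * in[i];
}

// h_in/h_out are pinned (cudaMallocHost); d_in/d_out are device buffers of size n.
void overlapped(const float *h_in, float *h_out, float *d_in, float *d_out, int n) {
    const int nStreams = 2, threads = 256;
    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    int chunk = n / nStreams;  // assume n divisible by nStreams for simplicity
    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scaleKernel<<<(chunk + threads - 1) / threads, threads, 0, streams[s]>>>(
            d_in + off, d_out + off, chunk);
        cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}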
1.5. Performance
See this doc
Arithmetic Intensity is formulated as the ratio of arithmetic operations to bytes of memory traffic.
Different DNN layers typically have different arithmetic intensities. Some layers have low arithmetic intensities (e.g. ReLU activation, layer normalization)
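As a rough worked example (a sketch assuming fp32 inputs and outputs and counting only DRAM traffic), an elementwise ReLU performs 1 op per element while reading 4 bytes and writing 4 bytes:

$$\text{AI} = \frac{\#\text{ops}}{\#\text{bytes}} = \frac{1}{4 + 4} \approx 0.125 \ \text{ops/byte}$$

which is far below the FLOPs-to-bandwidth ratio of a modern GPU, so such layers end up memory-bound.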
1.6. Tools
deviceQuery from cuda-samples is useful for printing basic info about devices
2. TPU
2.1. Computing
TPU v4 peak flops: 275 TFLOPS
A few published papers describe the TPU architecture: TPU v1, TPU v2, TPU v4i (Jouppi et al., 2021)2 and TPU v4 (Jouppi et al., 2023)3
In general, from high-level to low-level hardware concepts:
- each pod is a contiguous set of TPUs grouped together over a specialized network
- each slice is a collection of chips all located inside the same TPU Pod, connected by high-speed inter-chip interconnects (ICI)
- each TPU chip has one or two TensorCores
- each TensorCore (i.e. TPU's core) has one or more matrix-multiply units (MXUs)
- each MXU is composed of 128 x 128 multiply-accumulators in a systolic array
2.1.1. TensorCore
Each TPU chip has 1 or 2 TensorCores depending on the generation
A single TensorCore has the following components:
- scalar processor (SPU) used for control flow
- vector processor (VPU) used for vector processing
- transpose-permute unit: specialized for matrix transpose/permute
The matrix-multiply unit (MXU) is based on a systolic array. See this illustration to understand how matrix multiplication works in a systolic array:
- a single PE (processing element) is replaced with a regular array of PEs, where the flow of data is carefully orchestrated
- the array structure can be nonlinear and multidimensional
Advantages:
- efficiently makes use of limited memory bandwidth; balances computation with I/O bandwidth
- specialized (the computation needs to fit the PE organization/functions)
Downside:
- not generally applicable
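As a minimal sketch of the dataflow (one common systolic formulation; not necessarily the exact scheme the MXU implements): to compute $C = AB$, PE $(i, j)$ holds a running partial sum, on each cycle receives $a_{ik}$ from its left neighbor and $b_{kj}$ from the neighbor above, updates

$$c_{ij} \leftarrow c_{ij} + a_{ik} \, b_{kj}$$

and then forwards $a_{ik}$ to the right and $b_{kj}$ downward; the inputs are skewed by one cycle per row/column so that matching operands meet at the right PE.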
2.1.2. SparseCore
Originally called BarnaCore, the SparseCore is specialized for embedding lookups. Check Section 3 of the TPU v4 paper
2.2. Memory
TPU has no hardware-managed caches such as L1 and L2; all memory transfers are explicitly controlled by the XLA compiler
2.2.1. SMEM, VMEM
scalar memory and vector memory (organized as a 2D 128x8 array); fast, on-chip
2.2.2. CMEM
common memory, on-chip, small (128MB in gen4)
2.2.3. HBM
TPU v4: 32 GiB at ~1200 GBps
slow relative to on-chip memory
2.2.4. Infeed/Outfeed Queue
The TPU host streams data into an infeed queue. The TPU loads data from the infeed queue and stores them in HBM memory.
When the computation is completed, the TPU loads the results into the outfeed queue. The TPU host then reads the results from the outfeed queue and stores them in the host's memory.
2.3. Communication
Note that Gang-scheduling is essential in the case of TPUs, since they are single-threaded and only run non-preemptible kernels, so the system will deadlock if communicating computations are not enqueued in a consistent order
2.3.1. Host
The TPU supports DMA for host-device transfers.
The TPU has two hardware queues: it loads data from the infeed queue and stores it in HBM. When the computation is completed, the TPU loads the results into the outfeed queue
2.3.2. ICI
TPU uses optical circuit switching, instead of the electrical packet switching adopted by NVSwitch.
2.3.3. DCN
DCN (data-center network) connects different slices
3. CPU
3.1. Computing
3.1.1. AMX
Advanced Matrix Extensions (AMX) are extensions to the x86 ISA, introduced with the Sapphire Rapids microarchitecture for Xeon.
AMX uses 2-dimensional registers (tiles) and an accelerator called the tile matrix multiply unit (TMUL) to perform tiled matrix multiplication.
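A hedged sketch of using AMX via compiler intrinsics (assumes Linux >= 5.16 and a compiler with -mamx-tile -mamx-int8; the 16x64 int8 tile shapes are one possible configuration, and the VNNI re-packing that the B operand needs in real code is glossed over):
#include <immintrin.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_REQ_XCOMP_PERM 0x1023   // Linux: request permission to use tile state
#define XFEATURE_XTILEDATA  18

// Tile configuration: palette 1, three 16-row x 64-byte tiles (64 bytes total).
struct __attribute__((packed)) tilecfg {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];
    uint8_t  rows[16];
};

void amx_int8_matmul(const int8_t *A, const int8_t *B, int32_t *C) {
    // Ask the kernel for permission to use the AMX tile data state.
    syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA);

    tilecfg cfg = {};
    cfg.palette_id = 1;
    for (int t = 0; t < 3; ++t) { cfg.rows[t] = 16; cfg.colsb[t] = 64; }
    _tile_loadconfig(&cfg);

    _tile_zero(0);          // tile 0: 16x16 int32 accumulator
    _tile_loadd(1, A, 64);  // tile 1: 16x64 int8 operand (stride 64 bytes)
    _tile_loadd(2, B, 64);  // tile 2: 16x64 int8 operand, expected in VNNI layout
    _tile_dpbssd(0, 1, 2);  // TMUL: tile0 += tile1 * tile2 (int8 dot products)
    _tile_stored(0, C, 64); // write back the 16x16 int32 result
    _tile_release();
}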
4. Reference
- altera FPGA white paper
- What is a LUT
- A HN thread comparing FPGA with GPU
- Lecture on FPGA
- Operating Systems Three Easy Pieces
- Wilson WL Fung, Ivan Sham, George Yuan, and Tor M Aamodt. 2007. Dynamic warp formation and scheduling for efficient GPU control flow. In 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007), pages 407–420. IEEE.
- Norman P Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B Jablin, George Kurian, James Laudon, Sheng Li, Peter Ma, Xiaoyu Ma, et al. 2021. Ten lessons from three generations shaped Google's TPUv4i: Industrial product. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pages 1–14. IEEE.
- Norm Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, et al. 2023. TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. In Proceedings of the 50th Annual International Symposium on Computer Architecture, pages 1–14.