
0x500 Accelerator

1. GPU

1.1. Computing

The latest H100's computing elements have the following hierarchy:

  • each card consists of multiple Streaming Multiprocessors (SMs)
  • each SM has multiple Processing Blocks (e.g. 4 in the following diagram)
  • each Processing Block executes a warp every clock cycle; it consists of CUDA cores and a Tensor Core

The A100 has 108 SMs with 64 cores each, 6912 cores in total.

h100

Conceptually, a GPU has more cores but less cache and flow-control logic than a CPU

gpu cpu

1.1.1. CUDA Core

A Fermi SM is illustrated as follows, where

  • each CUDA core can perform one single-precision fused multiply-add operation in each clock period
  • a large register file is shared by the threads resident on the SM
  • memory operations are handled by a set of 16 LD/ST (load/store) units in each SM using two-dimensional (x, y) addresses
  • Special Function Units (SFUs) handle special operations such as sin, cos, exp

fermi

Each SM is typically split into multiple processing blocks, where all cores in a processing block share an instruction fetch/dispatch unit (for cost efficiency). Threads in the same warp are assigned to the same processing block

For example, there are 4 processing blocks in an A100 SM (i.e. each processing block has 16 cores).

1.1.2. Tensor Core

A Tensor Core is a specialized unit that performs 4x4 matrix multiplication and addition (D = A x B + C)

tensor core
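
One way to program Tensor Cores directly from CUDA is the WMMA (warp matrix multiply-accumulate) API. Below is a minimal sketch, assuming fp16 inputs with fp32 accumulation; note that WMMA exposes Tensor Cores at a 16x16x16 tile granularity (the hardware internally decomposes this into smaller matrix operations), and the kernel name and 16x16 tiles here are illustrative.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp cooperatively computes D = A * B + C for a single 16x16 tile.
// A, B, D are assumed to point to 16x16 matrices in device memory (requires sm_70+).
__global__ void wmmaTile(const half* A, const half* B, float* D)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);                  // C = 0
    wmma::load_matrix_sync(a_frag, A, 16);                // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);   // executed on Tensor Cores
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}

// launched with a single warp: wmmaTile<<<1, 32>>>(dA, dB, dD);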

1.2. Memory

Memory Hierarchy is as follows:

Register

  • private to each thread
  • the register file is large, so switching between warps does not require saving and restoring registers (cheap context switching)
  • when registers are not enough, data spills to local memory, which is thread-private memory that resides in global memory

Shared Memory

  • on-chip memory SRAM
  • shared by all threads within a block

L2 Cache

  • caches instructions and data from global memory
  • managed by hardware

Global Memory (Device memory)

  • off-chip memory (DRAM) accessible from all threads
  • persists across kernel launches

Host Memory

  • far far far away

1.2.1. Shared Memory

Striding through global memory is problematic. Shared memory latency is roughly 100x lower than uncached global memory latency

Bank Conflicts: shared memory is divided into equally sized memory modules (banks). If multiple threads in a warp access different addresses in the same memory bank, the accesses are serialized and efficiency drops.
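
A common illustration is the tiled matrix transpose: without padding, the column-wise reads from the shared-memory tile hit the same bank 32 ways. The sketch below (assuming a square matrix whose width is a multiple of 32 and 32x32 thread blocks; names are illustrative) pads the tile by one column so consecutive rows map to different banks.

#define TILE 32

// Transpose a width x width matrix using a padded shared-memory tile.
__global__ void transposeTiled(const float* in, float* out, int width)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column of padding avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced global load

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                   // block indices swapped
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];   // conflict-free shared read
}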

1.2.2. Global Memory (HBM)

Global memory can be allocated using __device__ or cudaMalloc

__device__ int globalArray[256];

void foo()
{
    ...
    int *myDeviceMemory = 0;
    cudaError_t result = cudaMalloc(&myDeviceMemory, 256 * sizeof(int));
    ...
}

Load/store instructions to global memory issued by a warp are usually coalesced into a few transactions to make efficient use of DRAM bandwidth. See this post for more details
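
A rough sketch of what coalescing means in practice (kernel names are illustrative): in the first kernel adjacent threads in a warp touch adjacent addresses, so the warp's 32 loads collapse into a few transactions; in the second, a stride spreads the warp's accesses across many memory segments.

// Adjacent threads read adjacent elements: fully coalesced.
__global__ void copyCoalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Adjacent threads read elements `stride` apart: many more memory transactions.
__global__ void copyStrided(const float* in, float* out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}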

1.2.3. Host Memory

With Unified Virtual Addressing (UVA), CUDA devices can share a unified address space with the host
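
One practical consequence, shown in a minimal sketch below (the function name is illustrative): because host and device allocations live in one address space, cudaMemcpy can be given cudaMemcpyDefault and the driver infers the copy direction from the pointers themselves.

#include <cuda_runtime.h>

void uvaCopyExample(size_t n)
{
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, n * sizeof(float));   // pinned host memory
    cudaMalloc(&d_buf, n * sizeof(float));       // device memory

    // With UVA the driver can tell host pointers from device pointers,
    // so the direction can be left as cudaMemcpyDefault.
    cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyDefault);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
}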

1.3. Communication

To inspect communication links, use nvidia-smi nvlink -s

1.3.1. PCI

Typically, PCIe is the protocol used to communicate between CPU and GPU. A PCIe link is divided into lanes, and the bandwidth of a single lane (i.e. PCIe x1) depends on the generation:

  • PCIe 4.0: ~2 GB/s
  • PCIe 5.0: ~4 GB/s
  • PCIe 6.0: ~8 GB/s

At best, PCIe x16 bandwidth is ~100 GB/s

To transfer between Host/Device efficiently, we can

  • use pinned host memory (i.e. memory that cannot be swapped out) before the memcpy (this prevents the CUDA driver from creating a temporary pinned staging buffer), which can be done using cudaMallocHost
  • batching small transfers into a single transfer.
  • overlapping (see the following)
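
A minimal sketch of the first two points (buffer names and sizes are illustrative): the host buffer is allocated pinned with cudaMallocHost, scattered small transfers are packed into it, and a single asynchronous copy is issued.

#include <cstring>
#include <cuda_runtime.h>

void uploadBatch(const float* src, size_t n)
{
    float *h_pinned, *d_buf;
    cudaMallocHost(&h_pinned, n * sizeof(float));  // pinned: cannot be swapped out
    cudaMalloc(&d_buf, n * sizeof(float));

    // Pack the (possibly scattered) small transfers into one pinned staging buffer...
    memcpy(h_pinned, src, n * sizeof(float));

    // ...then issue a single asynchronous copy; async copies only overlap with
    // other work when the host buffer is pinned.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(d_buf, h_pinned, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
}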

1.3.2. SXM

SXM is a high-bandwidth socket for connecting GPUs, with higher performance than its PCIe equivalents

NVLink provides > 1 TB/s of bidirectional, direct GPU-to-GPU interconnect bandwidth

NVLink Switch connects NVLinks to provide all-to-all GPU communication

1.3.4. InfiniBand

Communication between interconnected nodes

CUDA reference

1.3.5. GPUDirect

Official doc

With CUDA-aware MPI, we can call MPI_Send and MPI_Recv on GPU buffers directly, so data moves from one GPU to another without going through host memory.
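
A minimal sketch, assuming an MPI library built with CUDA support (e.g. Open MPI configured with CUDA): rank 0 passes a device pointer straight to MPI_Send and rank 1 receives directly into device memory.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float* d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));   // device buffer, no host staging buffer

    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);                     // send from GPU memory
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);  // receive into GPU memory

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}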

1.4. Programming Model

The computational elements are known as kernels (typically written in CUDA C), which are compiled and then executed by many threads running the same program in parallel.

1.4.1. ISA

PTX (Parallel Thread Execution) is a low-level parallel-thread-execution virtual machine and instruction set architecture (ISA). PTX exposes the GPU as a parallel computing device. The CUDA compiler generates PTX instructions

__global__ void vecAddKernel(float* A, float* B, float* C, int n){
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if(i<n){
        C[i] = A[i] + B[i];
    }
}

The code example above gets compiled into the following PTX (via nvcc -ptx add.cu)

.visible .entry _Z12vecAddKernelPfS_S_i(
    .param .u64 _Z12vecAddKernelPfS_S_i_param_0,
    .param .u64 _Z12vecAddKernelPfS_S_i_param_1,
    .param .u64 _Z12vecAddKernelPfS_S_i_param_2,
    .param .u32 _Z12vecAddKernelPfS_S_i_param_3
)
{
    .reg .pred  %p<2>;
    .reg .f32   %f<4>;
    .reg .b32   %r<6>;
    .reg .b64   %rd<11>;


    ld.param.u64    %rd1, [_Z12vecAddKernelPfS_S_i_param_0];
    ld.param.u64    %rd2, [_Z12vecAddKernelPfS_S_i_param_1];
    ld.param.u64    %rd3, [_Z12vecAddKernelPfS_S_i_param_2];
    ld.param.u32    %r2, [_Z12vecAddKernelPfS_S_i_param_3];
    mov.u32     %r3, %ctaid.x;
    mov.u32     %r4, %ntid.x;
    mov.u32     %r5, %tid.x;
    mad.lo.s32  %r1, %r3, %r4, %r5;

    setp.ge.s32     %p1, %r1, %r2;
    @%p1 bra    $L__BB0_2; // branching based on if (i<n)

    cvta.to.global.u64  %rd4, %rd1; // cvta converts a generic address to the global address space
    mul.wide.s32    %rd5, %r1, 4;
    add.s64     %rd6, %rd4, %rd5;
    cvta.to.global.u64  %rd7, %rd2;
    add.s64     %rd8, %rd7, %rd5;
    ld.global.f32   %f1, [%rd8];
    ld.global.f32   %f2, [%rd6];
    add.f32     %f3, %f2, %f1;
    cvta.to.global.u64  %rd9, %rd3;
    add.s64     %rd10, %rd9, %rd5;
    st.global.f32   [%rd10], %f3;

$L__BB0_2:
    ret;

}

SASS is the low-level assembly language that compiles to binary microcode, which executes natively on NVIDIA GPU hardware

1.4.2. Scheduling

Grid

  • Thread blocks are grouped into a grid
  • Each block in the grid can be executed and dispatched to a single SM in any order relative to other blocks
  • gridDim.x ranges from 1 to 2^31-1; gridDim.y and gridDim.z can be up to 2^16-1 (65535)

Block scheduling

  • a block runs on a single SM, so threads within the same block can cooperate and share memory (e.g. using __syncthreads) within that SM
  • multiple blocks may be assigned to the same SM simultaneously
  • each block can contain up to 1024 threads (the product of the 3 dimensions of blockDim); see the launch sketch below

Note that Block is also called CTA (cooperative thread array) in some documents.
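
As a small sketch of how these limits show up in practice, the vecAddKernel from the ISA section can be launched with a 1D grid sized to cover n elements (d_A, d_B, d_C are assumed to be device pointers allocated elsewhere):

int n = 1 << 24;
int threadsPerBlock = 256;                                        // <= 1024 threads per block
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceil(n / 256), <= 2^31-1 in x
vecAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, n);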

Warp scheduling

  • a warp (a set of 32 threads) is the fundamental unit of dispatch within a single SM
  • inside the assigned SM, a warp is further assigned to a processing block, which executes all threads in the warp; this might take multiple clock cycles depending on the number of cores in the processing block (see this QA)

Control Divergence

Warps use an active mask to handle control divergence

old divergence

Volta introduces more fine-grained concurrency by maintaining more per-thread state (e.g. a program counter per thread). See the section on independent thread scheduling in this nvidia blog

pascal

There is also discussion of dynamic warp formation (Fung et al., 2007)1 to maximize utilization. Watch from 1:34:00 of Onur's lecture for details. Unfortunately NVIDIA does not seem to have released any public info about it.

1.4.3. Stream Management

A CUDA stream is a sequence (queue) of device operations (e.g. kernel launches, data transfers).

Each device operation in CUDA runs in a stream (the default stream if not specified). Streams can be used to overlap data transfer, host computation and device computation
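
A rough sketch of overlapping transfers and kernels with two streams (h_in/h_out are assumed to be pinned host buffers, process is an illustrative kernel defined elsewhere, and N is assumed to be divisible by 2*256):

__global__ void process(const float* in, float* out, size_t n);  // illustrative, defined elsewhere

void pipelined(float* d_in, float* d_out, const float* h_in, float* h_out, size_t N)
{
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    size_t half = N / 2;
    for (int i = 0; i < 2; ++i) {
        size_t off = i * half;
        // each half gets its own stream, so stream 0's kernel can overlap
        // with stream 1's host-to-device copy, and so on
        cudaMemcpyAsync(d_in + off, h_in + off, half * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        process<<<half / 256, 256, 0, s[i]>>>(d_in + off, d_out + off, half);
        cudaMemcpyAsync(h_out + off, d_out + off, half * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
}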

1.5. Performance

See this doc

Arithmetic Intensity is the ratio between arithmetic operations and bytes of memory traffic (ops/byte).

Different DNN layers typically have different arithmetic intensities. Some layers have low arithmetic intensity (e.g. ReLU activation, layer normalization)
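
As a rough worked example: an element-wise ReLU over fp16 data performs about 1 op per element while moving 4 bytes (2 read + 2 written), i.e. roughly 0.25 ops/byte, so it is memory-bound on any modern accelerator. An M x K by K x N fp16 matrix multiply performs 2*M*N*K ops over roughly 2*(M*K + K*N + M*N) bytes, so its arithmetic intensity grows with the matrix dimensions and large matmuls can become compute-bound.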

1.6. Tools

deviceQuery from the cuda-samples repository is useful for printing basic info about devices

2. TPU

2.1. Computing

TPU v4 peak flops: 275 TFLOPS

A few published papers mentioned TPU architecture: TPU v1, TPU v2, TPU v4i (Jouppi et al., 2021)2 and TPU v4 (Jouppi et al., 2023)3

In general, from high-level to low-level hardware concepts:

  • each pod is a contiguous set of TPUs grouped together over a specialized network
  • each slice is a collection of chips located inside the same TPU Pod, connected by high-speed inter-chip interconnects (ICI)
  • each TPU chip has one or two TensorCores
  • each TensorCore (i.e. a TPU core) has one or more matrix-multiply units (MXUs)
  • each MXU is composed of 128 x 128 multiply-accumulators in a systolic array

2.1.1. TensorCore

Each TPU chip has 1 or 2 TensorCores depending on the generation

TPU_tensorcore

A single TensorCore has the following components:

  • scalar processor (SPU): used for control flow
  • vector processor (VPU): used for vector processing
  • transpose-permute unit: specialized for matrix transpose/permute
  • matrix-multiply unit (MXU): based on a systolic array

See this illustration to understand how matrix multiplication works in a systolic array:

  • a single PE (processing element) is replaced with a regular array of PEs, where the flow of data is carefully orchestrated
  • the array structure can be nonlinear and multi-dimensional (see the timing sketch after the lists below)

Advantages:

  • makes efficient use of limited memory bandwidth, balancing computation against I/O bandwidth
  • specialized (the computation needs to fit the PE organization/functions)

Downside:

  • not generally applicable
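
As a rough host-side sketch (plain C++, illustrative only) of the timing in an output-stationary systolic array computing C = A x B: operand a[i][k] is skewed so that it meets b[k][j] at PE(i, j) on cycle t = i + j + k, and each PE performs one local multiply-accumulate per cycle.

#include <cstdio>

int main()
{
    const int M = 2, K = 3, N = 2;
    float A[M][K] = {{1, 2, 3}, {4, 5, 6}};
    float B[K][N] = {{1, 0}, {0, 1}, {1, 1}};
    float C[M][N] = {};                      // one accumulator per PE(i, j)

    // Cycle-by-cycle wavefront: on cycle t, PE(i, j) sees a[i][k] and b[k][j]
    // with k = t - i - j (if that k is in range), does one MAC, and passes the
    // operands on to its right and lower neighbours.
    for (int t = 0; t <= (M - 1) + (N - 1) + (K - 1); ++t)
        for (int i = 0; i < M; ++i)
            for (int j = 0; j < N; ++j) {
                int k = t - i - j;
                if (k >= 0 && k < K)
                    C[i][j] += A[i][k] * B[k][j];
            }

    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            printf("C[%d][%d] = %g\n", i, j, C[i][j]);
    return 0;
}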

2.1.2. SparseCore

Originally called BarnaCore.

Specialized for embedding lookups. Check Section 3 of the TPU v4 paper

2.2. Memory

TPUs have no hardware-managed caches such as L1 and L2; all memory transfers are explicitly controlled by XLA

2.2.1. SMEM, VMEM

Scalar memory and vector memory (organized as a 2D 128x8 array); fast on-chip memory

2.2.2. CMEM

Common memory: on-chip and small (128 MB in gen 4)

2.2.3. HBM

TPU v4: 32 GiB, 1200 GBps

slow memory

2.2.4. Infeed/Outfeed Queue

The TPU host streams data into an infeed queue. The TPU loads data from the infeed queue and stores them in HBM memory.

When the computation is completed, the TPU loads the results into the outfeed queue. The TPU host then reads the results from the outfeed queue and stores them in the host's memory.

2.3. Communication

Note that Gang-scheduling is essential in the case of TPUs, since they are single-threaded and only run non-preemptible kernels, so the system will deadlock if communicating computations are not enqueued in a consistent order

tpu_v4

2.3.1. Host

TPU supports DMA for host transfers.

As described in the infeed/outfeed section above, the TPU loads input data from the infeed queue into HBM and, when the computation is completed, writes the results to the outfeed queue

2.3.2. ICI

TPU uses optical circuit switching instead of electrical packet switching as adopted by NVSwitch.

2.3.3. DCN

DCN (data-center network) connects slices to each other

3. CPU

3.1. Computing

3.1.1. AMX

Advanced Matrix Extensions (AMX) are extensions to the x86 ISA, introduced with the Sapphire Rapids microarchitecture for Xeon.

It uses two-dimensional registers (tiles) and an accelerator called the tile matrix multiply unit (TMUL) that performs tiled matrix multiplication.

4. Reference

  • altera FPGA white paper
  • What is a LUT
  • A HN thread comparing FPGA with GPU
  • Lecture on FPGA
  • Operating Systems Three Easy Pieces

  1. Wilson WL Fung, Ivan Sham, George Yuan, and Tor M Aamodt. 2007. Dynamic warp formation and scheduling for efficient GPU control flow. In 40th annual IEEE/ACM international symposium on microarchitecture (MICRO 2007), pages 407–420. IEEE. 

  2. Norman P Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B Jablin, George Kurian, James Laudon, Sheng Li, Peter Ma, Xiaoyu Ma, et al. 2021. Ten lessons from three generations shaped google’s tpuv4i: Industrial product. In 2021 ACM/IEEE 48th annual international symposium on computer architecture (ISCA), pages 1–14. IEEE. 

  3. Norm Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, et al. 2023. Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. In Proceedings of the 50th annual international symposium on computer architecture, pages 1–14.