0x511 Computing

This note is about kernel implementation on a single device.

See this lecture series on heterogeneous computing (mainly GPU).

1. Foundation

How to transpose efficiently


See this post
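The core idea of an efficient transpose is tiling: stage the matrix in small blocks so both reads and writes stay within a region that fits in fast memory (shared memory on a GPU, cache on a CPU). A minimal pure-Python sketch of the access pattern; `TILE` is an assumed block size, and the real speedup comes from the memory hierarchy, not from Python itself:

```python
# Cache-friendly transpose via tiling: walk the matrix in TILE x TILE
# blocks so the strided writes are confined to one small block at a time.
TILE = 32

def transpose_tiled(a):
    """Transpose a list-of-lists matrix block by block."""
    rows, cols = len(a), len(a[0])
    out = [[0] * rows for _ in range(cols)]
    for i0 in range(0, rows, TILE):
        for j0 in range(0, cols, TILE):
            # Copy one block, swapping indices on the write side.
            for i in range(i0, min(i0 + TILE, rows)):
                for j in range(j0, min(j0 + TILE, cols)):
                    out[j][i] = a[i][j]
    return out
```

On a GPU the block would be staged through shared memory (with padding to avoid bank conflicts) so that both the global-memory load and store are coalesced.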

3. Convolution

3.1. Winograd

Fast Algorithms for Convolutional Neural Networks
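The smallest Winograd instance from that paper, F(2, 3), computes 2 outputs of a 3-tap 1D convolution with 4 multiplications instead of 6. A pure-Python sketch of that single tile (the 2D case nests the same transforms over rows and columns):

```python
# Winograd F(2, 3): 2 outputs of a 3-tap convolution using 4 multiplies.
def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 outputs."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # The four element-wise products; the (g0+g1+g2)/2-style filter
    # transforms can be precomputed once per filter.
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]
```

Expanding the products shows the result equals the direct convolution `[d0*g0 + d1*g1 + d2*g2, d1*g0 + d2*g1 + d3*g2]`; the saving in multiplies is what makes the transform worthwhile for small filters.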

4. Attention

4.1. FlashAttention

Reduces communication cost between SRAM and HBM by tiling + rematerialization (recomputing attention blocks in the backward pass instead of storing them).
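The trick that makes tiling possible is the online softmax: process keys/values in tiles while carrying a running max and running sum, so the full score row never has to exist in memory. A pure-Python sketch for a single query vector; the tile size is an assumption:

```python
import math

def attention_tiled(q, K, V, tile=2):
    """Softmax attention for one query, computed tile by tile."""
    m = float("-inf")        # running max of scores seen so far
    l = 0.0                  # running sum of exp(score - m)
    acc = [0.0] * len(V[0])  # running unnormalized output
    for t0 in range(0, len(K), tile):
        ks, vs = K[t0:t0 + tile], V[t0:t0 + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in ks]
        m_new = max(m, max(scores))
        # Rescale previous partial sums to the new max.
        scale = math.exp(m - m_new)
        l *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, vs):
            w = math.exp(s - m_new)
            l += w
            acc = [a + w * vi for a, vi in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]
```

In the kernel, each tile of K/V is loaded from HBM into SRAM once, and only the small running statistics (m, l, acc) persist across tiles.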

4.2. Flash Decoding

Flash Decoding splits the keys/values over the sequence dimension and applies FlashAttention-style reduction at two levels: partial attention per split, then a final merge of the partials.
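The two levels can be sketched in pure Python: level one computes an independent partial result (max, sum of exps, unnormalized output) per sequence chunk, which can run in parallel across the chunks; level two merges the partials with a numerically stable log-sum-exp combine. The chunk size is an assumption:

```python
import math

def partial_attention(q, ks, vs):
    """Level 1: attention statistics for one chunk of keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in ks]
    m = max(scores)
    l = 0.0
    acc = [0.0] * len(vs[0])
    for s, v in zip(scores, vs):
        w = math.exp(s - m)
        l += w
        acc = [a + w * vi for a, vi in zip(acc, v)]
    return m, l, acc

def flash_decode(q, K, V, chunk=2):
    # Level 1: one partial per sequence chunk (parallel in the kernel).
    parts = [partial_attention(q, K[i:i + chunk], V[i:i + chunk])
             for i in range(0, len(K), chunk)]
    # Level 2: merge partials, rescaling each to the global max.
    m = max(p[0] for p in parts)
    l = 0.0
    acc = [0.0] * len(V[0])
    for mi, li, ai in parts:
        scale = math.exp(mi - m)
        l += li * scale
        acc = [a + x * scale for a, x in zip(acc, ai)]
    return [a / l for a in acc]
```

This is what restores GPU occupancy during decoding: with batch size 1 and a single query token, splitting over the sequence dimension is the only axis of parallelism left.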

5. Communications

This section is about communication primitives and their implementations.

5.1. ReduceScatter

5.2. AllReduce

There are a few ways to implement AllReduce. For example, it can be implemented as ReduceScatter + AllGather.
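The decomposition can be simulated over in-memory "ranks": after ReduceScatter, rank r owns the fully reduced shard r; AllGather then circulates the shards so every rank holds the whole reduced vector. A pure-Python sketch (vector length assumed divisible by the rank count):

```python
def allreduce_rs_ag(bufs):
    """AllReduce as ReduceScatter + AllGather over simulated ranks."""
    p = len(bufs)       # number of ranks
    n = len(bufs[0])    # vector length, assumed divisible by p
    shard = n // p
    # ReduceScatter: rank r reduces shard r across all ranks.
    shards = [
        [sum(bufs[src][r * shard + i] for src in range(p))
         for i in range(shard)]
        for r in range(p)
    ]
    # AllGather: every rank collects all reduced shards.
    full = [x for s in shards for x in s]
    return [list(full) for _ in range(p)]
```

Each phase moves (p-1)/p of the data per rank, so the combined cost matches the bandwidth-optimal ring allreduce.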

Ring-based algorithms are implemented in Horovod and Baidu Allreduce. See Baidu's simple allreduce implementation using MPI_Irecv and MPI_Send. A more advanced ring-based approach is the 2D ring algorithm.
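The ring algorithm can be sketched as a simulation: p ranks in a ring, vector split into p chunks; phase 1 (reduce-scatter) runs p-1 steps in which each rank sends one chunk to its right neighbor and accumulates the chunk it receives, and phase 2 (allgather) runs p-1 more steps circulating the fully reduced chunks. This is a pure-Python sketch of the algorithm the MPI code implements, not the MPI code itself:

```python
def ring_allreduce(bufs):
    """Simulated ring allreduce: 2*(p-1) neighbor exchanges."""
    p = len(bufs)
    n = len(bufs[0])
    chunk = n // p  # assume n divisible by p
    data = [list(b) for b in bufs]

    def sl(c):  # slice bounds of chunk c
        return c * chunk, (c + 1) * chunk

    # Phase 1: reduce-scatter. After p-1 steps, rank r fully
    # owns the reduced chunk (r + 1) mod p.
    for step in range(p - 1):
        sends = []
        for r in range(p):
            c = (r - step) % p
            lo, hi = sl(c)
            sends.append((c, data[r][lo:hi]))
        for r in range(p):
            c, payload = sends[(r - 1) % p]  # receive from left neighbor
            lo, hi = sl(c)
            for i, v in enumerate(payload):
                data[r][lo + i] += v

    # Phase 2: allgather the reduced chunks around the ring.
    for step in range(p - 1):
        sends = []
        for r in range(p):
            c = (r + 1 - step) % p
            lo, hi = sl(c)
            sends.append((c, data[r][lo:hi]))
        for r in range(p):
            c, payload = sends[(r - 1) % p]
            lo, hi = sl(c)
            data[r][lo:hi] = payload
    return data
```

Each rank sends and receives 2*(n/p)*(p-1) elements in total, independent of p for large p, which is why the ring is bandwidth-optimal.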

Double binary trees are used in NCCL's implementation. See this blog.

6. Libs

6.1. cuDNN

There are two APIs right now: the legacy API and the graph API.

6.1.1. Graph API

7. Reference