0x421 Module
 1. MLP
 2. Sequence Model
 3. Convolution Model
 4. Graph Model
 5. Attention Model
 6. Transformer
 7. Reference
This note focuses on developing deep learning modules which would be used in other notes.
The abbreviation notations for common tensors are
 B: batch size
 T: time step
 H: hidden vector
 A: attention size
 C: channel
1. MLP
1.1. FFN
1.2. Gated Models
Deep models are difficult to train (gradient vanishing). To train efficiently, we want to keep the singular value of Jacobian near 1, explained by this paper
One approach to achieve this is to use the models with gating mechanisms.
Model (residual network)
Model (Highway network)
where the \(T(X)\) is the transform gatew, which controls the behavior of highway layer from the plain layer \(H(X)\) or identity layer \(X\)
 when \(T(X)=1\), \(y=H(X)\)
 when \(T(X)=0\), \(y=X\)
The transform gate can be implemented as follows:
where \(\sigma\) is sigmoid and \(b\) should be a negative value to initially bias towards the identity layer
Model (sparselygated mixture of experts) experts can be more than thousands
Model (gMLP, MLP with gating): selfattention is not always necessary depending on the task. MLP might be enough
First we have a channel projection (linear projection) and activation.
Then we model the spacial interaction \(s\)
If \(s\) is identical, the entire operation is reduced to FFN. This work uses Spatial Gating Unit which first splits \(Z\) into \((Z_1, Z_2)\)
2. Sequence Model
Using a standard network model (e.g: MLP model to predict output concats with input concats) to represent sequence is not good, we have several problems here
 inputs, outputs length can be variable in different samples
 features learned at each sequence position are not shared
 too many parameters
2.1. RNN
The Vanilla RNN has the following architecture (from deeplearning ai courses)
The formula are
2.1.1. Bidirectional RNN
This unidirectional vanilla RNN has some problem, let's think about an example of NER from the deeplearning.ai course.
 He said Teddy Roosevelt was a great president
 He said Teddy bears are on sale
It is difficult to make decision at the word Teddy whether it is a person name or not without looking at the future words.
2.1.2. Gradient Vanishing/Explosion
RNN has the issue of Gradient Vanishing and Gradient Explosion when it is badly conditioned. This paper shows the conditions when these issues happen:
Suppose we have the recurrent structure
And the loss \(E_t = \mathcal{L}(h_t)\) and \(E = \sum_t E_t\), then
the hidden partial can be further decomposed into
Suppose the nonlinear \(\sigma'(h_t)\) is bounded by \(\gamma\), then \( diag(\sigma'(h_t))  \leq \gamma\) (e.g: \(\gamma = 1/4\) for sigmoid function)
It is sufficient for the largest eigenvector \(\lambda_1\) of \(W_{rec}\) to be less than \(1/\gamma\), for the gradient vanishing problem to occur because
Similarly, by inverting the statement, we can see the necessary condition of gradient explosion is \(\lambda_1 > \gamma\)
2.2. LSTM
Model (LSTM) Avoid some problems of RNNs
It computes the forget gates, input gate, output gates
The following one is the key formula, where the saturating \(f_t \in [0,1]\) will pass grad through \(c_{t}\) to \(c_{t1}\)
2.3. EncoderDecoder
3. Convolution Model
3.1. Vanilla Convolution
Concept (receptive field) the receptive field has a closed form
where \(r_0\) is the receptive field wrt the input layer, \(L\) is the total layer size, \(k_l, s_l\) is the kernel and stride with in layer \(l\)
Model (conv 1d) applied convolution to 1d signal
 Input: \((C_{in}, L_{in})\)
 Output: \((C_{out}, L_{out})\)
 Param: \(C_{in}C_{out}kernel\)
where \(C_{in}, C_{out}\) are input, output channels, \(L_{in}, L_{out}\) are input, output lengths
Recall the length is determined by
When \(stride==1\), it is convenient to set \(kernel = 2padding + 1\) to maintain the same length
Kim's convolution network is a good illustrution of Conv 1d, where the number of words in length, and embedding size is the channel
3.2. Transposed Convolution
Transposed convolution is used to upsample dimension (as opposed to the downsampling by normal convolution). It can be used in semantic segmentation, which classifies every pixel into classes.
It is call "transposed" because it can be implemented using matrix transposition, see this chapter
3.3. Pointwise Convolution
Pointwise Convolution or Conv 1x1 or Network in Network
Model (Conv 1x1) It is a fully connected layer across channel dim for every pixel.
 It is to exchange information across channels for each pixel. Explained in the Network in Network
 It can be used to reduce channel size (28x28x192 > 28x28x32)
1x1 convolution. Watch this video
3.4. Depthwise Convolution
Model (depthwise separable conv, Xception) Split the standard convolution into two steps
 depthwise conv: conv per each channel
 pointwise conv: 1x1 conv
Model (MobileNet)
Figure from the mobilenet paper
3.5. Residual Convolution
Model (ResNeXt, Aggregated Transformation) Repeat the building blocks by adding a new dimension called cardinality \(C\) (number of the set of transformation)
Model (DenseNet) DenseNet connects each layer to every other layer using a feedforward connection.
Figure from densenet paper
Model (UNet)
Model (EfficientNet)
4. Graph Model
Graph Neural Network. Check the survey here
4.1. Recurrent Graph
4.2. Convolutional Graph
4.3. Graph Autoencoder
4.4. SpatialTemporal Graph
5. Attention Model
5.1. Attention
The problem of a standard sequencebased encoder/decoder is we cannot cram all information into a single hidden vector. Instead of a single vector, we can store more information in variablesize vectors.
Model (attention)
 Use query vector (from decoder state) and key vectors (from encoder state).
 For each query, key pair, we calculate a weight and weights are normalized with softmax.
 Weights are multiplied with a value vector to obtain the target hidden vector.
Weight \(a(q,k)\) can be obtained using different attention functions, for example:
multilayer perception (Bahdanau 2015)
bilinear (Luong 2015)
dot product (Luong 2015)
scaled dot product (Vaswani 2017, attention paper)
As mentioned in the attention paper, this scaling is to make sure variance is 1 under the assumption \(q=(q_1, ..., q_d), k=(k_1, ..., k_d)\) has independent mean 0 var 1 distribution. Recall the variance of indepedent multiplication from the probability note
5.2. selfattention
Selfattention allows each element in the sequence to attend other elements, to make a more context sensitive encoding. For example, in the translation task, the in English might be translated to le, la, les in French depending on the target noun. An attention from the to the target noun will make it more context sensitive.
Model (selfattention) The selfattention transforms an embedding \((T,H)\) into another embedding \((T,H)\). Shape is invariant under the selfattention transformation
 Input: \((T,H)\)
 Output: \((T,H)\)
 Params: \(W_q, W_k, W_v \in R^{(H,H)}\)
where \(T\) is the sequence length, \(H\) is the hidden size, and \(W_q, W_k, W_v\) are weight matrix.
We first created three matrix \(Q, K, V\), meaning query, key and value.
Next, we query each token to all keys in the sentence and compute their similarity matrix \(S\) normalized by hidden size (scaled dot product)
The \((i,j)\) element is the similarity between token \(i\) and \(j\), each row is normalized by the softmax.
Finally, the similarity is multiplied by the value matrix to get the new embedding \(SV \in R^{(T,H)}\)
6. Transformer
Transformer combines two important concepts:
 recurrent free architecture
 multihead attention aggregate spacial information across tokens
Model (positionalembedding) encode the position into an embedding. There are multiple variants of the positional embedding
The original transformer is using the sinusoidal positional encoding
This resembles the binary representation of position integer
 where the LSB bit is alternating fast (sinusoidal position embedding has the fastest frequency when \(k=0\))
 But higher bits is alternating slowly (higher position has slower frequency)
Another characterstics is the relative positioning is easier because there exists a linear transformation (rotation matrix) to connect \(e(t)\) and \(e(t+k)\) for any \(k\).
Sinusoidal position encoding has symmetric distance which decays with time.
Model (multihead selfattention) Multihead attention use multiple head for selfattention
6.1. Positional Encoding
The original PE is deterministic, however, there are several learnable choices for the positional encoding.
Absolute Positional Encoding learns \(p_i \in R^d\) for each position and uses \(w_i + p_i\) as input. In the self attention, energy is computed as
Relative Positional Encoding In this work, relative encoding \(a_{ji}\) is learned for every selfattention layer
6.2. Efficient Transformer
Model (Primer) has smaller training cost than the original for autoregressive LM.
6.3. LongSequence Models
Check this survey
Standard Transformer cannot process long sequences due to the quandratic complexity \(O(T^2)\) of selfattention, both time and memory complexity. For example, BERT can only handle 512 tokens. Instead of the standard global attention, the following models try to circumvent this issue by using restricted attention
Model (sparse transformer) it introduces two sparse selfattention patterns:
 stride pattern: capture periodic structure
 fixed pattern: specific cells summarize previous locations and propagate them into all future cells
Model (Longformer) Longformerâ€™s attention mechanism is a combination of the following two
 windowed localcontext selfattention
 dilated sliding window
 an end task motivated global attention that encodes inductive bias about the task.
Model (BigBird) use global token (e.g: CLS) and sparse attention
7. Reference

[1] Coursera Deeplearning.ai courses

[2] CMU 10714 deep learning system

[3] This tutorial for transformer is really good

[4] For a pytorch official tutorial, see this