Skip to content

0x421 Module

This note focuses on developing deep learning modules which would be used in other notes.

The abbreviation notations for common tensors are

  • B: batch size
  • T: time step
  • H: hidden vector
  • A: attention size
  • C: channel

1. MLP

1.1. FFN

1.2. Gated Models

Deep models are difficult to train (gradient vanishing). To train efficiently, we want to keep the singular value of Jacobian near 1, explained by this paper

One approach to achieve this is to use the models with gating mechanisms.

Model (residual network)

Model (Highway network)

\[y = H(X)T(X) + X(1-T(X))\]

where the \(T(X)\) is the transform gatew, which controls the behavior of highway layer from the plain layer \(H(X)\) or identity layer \(X\)

  • when \(T(X)=1\), \(y=H(X)\)
  • when \(T(X)=0\), \(y=X\)

The transform gate can be implemented as follows:

\[T(X) = \sigma(WX+b)\]

where \(\sigma\) is sigmoid and \(b\) should be a negative value to initially bias towards the identity layer

Model (sparsely-gated mixture of experts) experts can be more than thousands

sparse moe

Model (gMLP, MLP with gating): self-attention is not always necessary depending on the task. MLP might be enough

First we have a channel projection (linear projection) and activation.

\[Z = \sigma(XU)\]

Then we model the spacial interaction \(s\)

\[\tilde{Z} = s(Z)\]

If \(s\) is identical, the entire operation is reduced to FFN. This work uses Spatial Gating Unit which first splits \(Z\) into \((Z_1, Z_2)\)

\[f(Z_2) = WZ_2+b\]
\[\tilde{Z} = Z_1 \circ f(Z_2)\]


2. Sequence Model

Using a standard network model (e.g: MLP model to predict output concats with input concats) to represent sequence is not good, we have several problems here

  • inputs, outputs length can be variable in different samples
  • features learned at each sequence position are not shared
  • too many parameters

2.1. RNN

The Vanilla RNN has the following architecture (from deeplearning ai courses)

rnn from

The formula are

\[h_{t+1} = \sigma(W_{rec} h_t + W_{in} x_t) = \sigma (W [h_t, x_t])\]
\[y_{t} = \sigma(W_{out} h_t)\]

2.1.1. Bidirectional RNN

This unidirectional vanilla RNN has some problem, let's think about an example of NER from the course.

  • He said Teddy Roosevelt was a great president
  • He said Teddy bears are on sale

It is difficult to make decision at the word Teddy whether it is a person name or not without looking at the future words.

2.1.2. Gradient Vanishing/Explosion

RNN has the issue of Gradient Vanishing and Gradient Explosion when it is badly conditioned. This paper shows the conditions when these issues happen:

Suppose we have the recurrent structure

\[h_t = W_{rec} \sigma(h_{t-1}) + W_{in} x_t\]

And the loss \(E_t = \mathcal{L}(h_t)\) and \(E = \sum_t E_t\), then

\[\frac{\partial E_t}{\partial \theta} = \sum_{1 \leq k \leq t} \frac{\partial E_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial \theta}\]

the hidden partial can be further decomposed into

\[\frac{\partial h_t}{\partial h_k} = \prod_{t \geq i \geq k} \frac{\partial x_i}{\partial x_{i-1}} = \prod_i W^T_{rec} diag(\sigma'(h_{i-1}))\]

Suppose the nonlinear \(\sigma'(h_t)\) is bounded by \(\gamma\), then \(|| diag(\sigma'(h_t)) || \leq \gamma\) (e.g: \(\gamma = 1/4\) for sigmoid function)

It is sufficient for the largest eigenvector \(\lambda_1\) of \(W_{rec}\) to be less than \(1/\gamma\), for the gradient vanishing problem to occur because

\[||\frac{\partial h_{t+1}}{\partial h_{t}}|| \leq ||W_{rec}|| ||diag(\sigma'(h_t))|| < 1\]

Similarly, by inverting the statement, we can see the necessary condition of gradient explosion is \(\lambda_1 > \gamma\)

2.2. LSTM

Model (LSTM) Avoid some problems of RNNs

It computes the forget gates, input gate, output gates

\[\begin{bmatrix} i_t \\ f_t \\ g_t \\ o_t \end{bmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ tanh \\ \sigma \end{pmatrix} (W_{hh} h_{t-1} + W_{hx} x_t + b_h)\]

The following one is the key formula, where the saturating \(f_t \in [0,1]\) will pass grad through \(c_{t}\) to \(c_{t-1}\)

\[c_t = c_{t-1} \circ f_t + i_t \circ g_t\]


2.3. Encoder-Decoder

3. Convolution Model

3.1. Vanilla Convolution

Concept (receptive field) the receptive field has a closed form

\[r_0 = \sum_{l=1}^L \Bigl( (k_l-1)\prod_{i=1}^{l-1}s_i \Bigr) + 1\]

where \(r_0\) is the receptive field wrt the input layer, \(L\) is the total layer size, \(k_l, s_l\) is the kernel and stride with in layer \(l\)

Model (conv 1d) applied convolution to 1d signal

  • Input: \((C_{in}, L_{in})\)
  • Output: \((C_{out}, L_{out})\)
  • Param: \(C_{in}C_{out}kernel\)

where \(C_{in}, C_{out}\) are input, output channels, \(L_{in}, L_{out}\) are input, output lengths

Recall the length is determined by

\[L_{out} = floor(\frac{L_{in} + 2padding - kernel}{stride} + 1)\]

When \(stride==1\), it is convenient to set \(kernel = 2padding + 1\) to maintain the same length

Kim's convolution network is a good illustrution of Conv 1d, where the number of words in length, and embedding size is the channel


3.2. Transposed Convolution

Transposed convolution is used to upsample dimension (as opposed to the downsampling by normal convolution). It can be used in semantic segmentation, which classifies every pixel into classes.

It is call "transposed" because it can be implemented using matrix transposition, see this chapter

3.3. Pointwise Convolution

Pointwise Convolution or Conv 1x1 or Network in Network

Model (Conv 1x1) It is a fully connected layer across channel dim for every pixel.

  • It is to exchange information across channels for each pixel. Explained in the Network in Network
  • It can be used to reduce channel size (28x28x192 -> 28x28x32)

1x1 convolution. Watch this video

3.4. Depthwise Convolution

Model (depthwise separable conv, Xception) Split the standard convolution into two steps

  • depthwise conv: conv per each channel
  • pointwise conv: 1x1 conv

Model (MobileNet)

Figure from the mobilenet paper

depthwise conv

3.5. Residual Convolution

Model (ResNeXt, Aggregated Transformation) Repeat the building blocks by adding a new dimension called cardinality \(C\) (number of the set of transformation)

\[\mathcal{F}(x) = \sum^C_i \mathcal{T}_i (x)\]

Model (DenseNet) DenseNet connects each layer to every other layer using a feed-forward connection.

Figure from densenet paper


Model (U-Net)

Model (EfficientNet)

4. Graph Model

Graph Neural Network. Check the survey here

4.1. Recurrent Graph

4.2. Convolutional Graph

4.3. Graph Autoencoder

4.4. Spatial-Temporal Graph

5. Attention Model

5.1. Attention

The problem of a standard sequence-based encoder/decoder is we cannot cram all information into a single hidden vector. Instead of a single vector, we can store more information in variable-size vectors.

Model (attention)

  • Use query vector (from decoder state) and key vectors (from encoder state).
  • For each query, key pair, we calculate a weight and weights are normalized with softmax.
  • Weights are multiplied with a value vector to obtain the target hidden vector.

Weight \(a(q,k)\) can be obtained using different attention functions, for example:

multilayer perception (Bahdanau 2015)

\[a(q, k) = w_2^T \tanh(W[q;k])\]

bilinear (Luong 2015)

\[a(q, k) = q^TWk\]

dot product (Luong 2015)

\[a(q,k) = q^Tk\]

scaled dot product (Vaswani 2017, attention paper)

\[a(q, k) = \frac{q^Tk}{\sqrt{d}}\]

As mentioned in the attention paper, this scaling is to make sure variance is 1 under the assumption \(q=(q_1, ..., q_d), k=(k_1, ..., k_d)\) has independent mean 0 var 1 distribution. Recall the variance of indepedent multiplication from the probability note

\[Var(XY) = Var(X)Var(Y) + (EX)^2Var(X) + (EY)^2Var(Y)\]

5.2. self-attention

Self-attention allows each element in the sequence to attend other elements, to make a more context sensitive encoding. For example, in the translation task, the in English might be translated to le, la, les in French depending on the target noun. An attention from the to the target noun will make it more context sensitive.

Model (self-attention) The self-attention transforms an embedding \((T,H)\) into another embedding \((T,H)\). Shape is invariant under the self-attention transformation

  • Input: \((T,H)\)
  • Output: \((T,H)\)
  • Params: \(W_q, W_k, W_v \in R^{(H,H)}\)

where \(T\) is the sequence length, \(H\) is the hidden size, and \(W_q, W_k, W_v\) are weight matrix.

We first created three matrix \(Q, K, V\), meaning query, key and value.

\[Q = XW_q, K=XW_k, V = XW_v\]

Next, we query each token to all keys in the sentence and compute their similarity matrix \(S\) normalized by hidden size (scaled dot product)

\[S = \text{Softmax}(\frac{QK^T}{\sqrt{H}}) \in R^{(T,T)}\]

The \((i,j)\) element is the similarity between token \(i\) and \(j\), each row is normalized by the softmax.

Finally, the similarity is multiplied by the value matrix to get the new embedding \(SV \in R^{(T,H)}\)

6. Transformer

Transformer combines two important concepts:

  • recurrent free architecture
  • multi-head attention aggregate spacial information across tokens


Model (positional-embedding) encode the position into an embedding. There are multiple variants of the positional embedding

The original transformer is using the sinusoidal positional encoding

\[e(t)_i = \begin{cases} sin(t / 10000^{2k/d}) \text{ if } t=2k \\ cos(t / 10000^{2k/d}) \text{ if } t=2k+1 \end{cases}\]

This resembles the binary representation of position integer

  • where the LSB bit is alternating fast (sinusoidal position embedding has the fastest frequency when \(k=0\))
  • But higher bits is alternating slowly (higher position has slower frequency)

Another characterstics is the relative positioning is easier because there exists a linear transformation (rotation matrix) to connect \(e(t)\) and \(e(t+k)\) for any \(k\).

Sinusoidal position encoding has symmetric distance which decays with time.

Model (multi-head self-attention) Multihead attention use multiple head for self-attention

\[Q_i = XW^i_q, K_i=XW^i_k, V_i = XW^i_v\]
\[\text{Head}_i = \text{Attention}(Q_i, K_i, V_i)\]
\[\text{MultiheadAttention} = \text{Concat}(\text{Head}_1, ..., \text{Head}_n)W_o\]

6.1. Positional Encoding

The original PE is deterministic, however, there are several learnable choices for the positional encoding.

Absolute Positional Encoding learns \(p_i \in R^d\) for each position and uses \(w_i + p_i\) as input. In the self attention, energy is computed as

\[\alpha_{i,j} \propto ((w_i+p_i)W_q)((w_j+p_j)W_k)^T\]

Relative Positional Encoding In this work, relative encoding \(a_{j-i}\) is learned for every self-attention layer

\[\alpha_{i,j} \propto (x_i W_q)( x_j W_k + a_{j-i})^T\]

6.2. Efficient Transformer

Model (Primer) has smaller training cost than the original for autoregressive LM.

6.3. Long-Sequence Models

Check this survey

Standard Transformer cannot process long sequences due to the quandratic complexity \(O(T^2)\) of self-attention, both time and memory complexity. For example, BERT can only handle 512 tokens. Instead of the standard global attention, the following models try to circumvent this issue by using restricted attention

Model (sparse transformer) it introduces two sparse self-attention patterns:

  • stride pattern: capture periodic structure
  • fixed pattern: specific cells summarize previous locations and propagate them into all future cells

sparse self-attention

Model (Longformer) Longformer’s attention mechanism is a combination of the following two

  • windowed local-context self-attention
  • dilated sliding window
  • an end task motivated global attention that encodes inductive bias about the task.

longformer self-attention

Model (BigBird) use global token (e.g: CLS) and sparse attention

7. Reference