Visualization of Gradient Vanishing for RNN/LSTM

In previous posts, I introduced the pytensor framework and implemented several basic operations and models with it. Before moving on to implement seq2seq and attention operations, I am interested in doing some simple experiments with it. In this post, I am going to visualize the flow of gradients in both RNN and LSTM. I will show that the gradient of the RNN vanishes very quickly, while the LSTM tends to let its gradients flow much more easily when a proper forget gate bias is used.

RNN Gradients

Let me show how the gradients of the RNN are computed during the forward and backward operations. First, the forward hidden states are updated as follows:

$$ Z_t = H_{t-1} \cdot W + I_t \cdot U $$

$$ H_t = \tanh(Z_t) $$
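
For reference, here is a minimal standalone numpy sketch of this forward recurrence. The sizes and the random initialization are made up for illustration and are not the ones used by pytensor.

import numpy as np

hidden_size, input_size, seq_len = 16, 8, 30   # hypothetical sizes
rng = np.random.default_rng(0)

W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # recurrent weights
U = rng.normal(scale=0.1, size=(input_size, hidden_size))   # input weights
I = rng.normal(size=(seq_len, input_size))                  # input sequence I_{1:n}

H = np.zeros(hidden_size)                                   # H_0
for t in range(seq_len):
    Z = H @ W + I[t] @ U                                    # Z_t = H_{t-1} W + I_t U
    H = np.tanh(Z)                                          # H_t = tanh(Z_t)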

The backward gradients are computed as follows:

$$ \frac{\partial \mathcal{L}}{\partial Z_t} = \frac{\partial \mathcal{L}}{\partial H_t} \cdot (1-H_t^2) $$

$$ \frac{\partial \mathcal{L}}{\partial H_{t-1}} = \frac{\partial \mathcal{L}}{\partial Z_t} \cdot W^\intercal $$

$$ \frac{\partial \mathcal{L}}{\partial H_{t-1}} = \frac{\partial \mathcal{L}}{\partial H_t} \cdot (1-H_t^2) \cdot W^\intercal $$

Suppose that the input sequence is $I_{1:n}$. Then recursively applying the previous formula leads to the following result:

$$ \frac{\partial \mathcal{L}}{\partial H_{1}} = \frac{\partial \mathcal{L}}{\partial H_n} \cdot \prod_{i=2}^{n} { \left( (1-H_i^2) \cdot W^\intercal \right)} $$
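
To see the shrinkage numerically, we can run this backward recursion on some random hidden states and print the gradient norm at each step back in time. This is a standalone sketch with made-up values, not pytensor code; the elementwise product reflects the diagonal Jacobian of tanh.

import numpy as np

hidden_size, seq_len = 16, 30
rng = np.random.default_rng(0)

W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
H = np.tanh(rng.normal(size=(seq_len, hidden_size)))   # stand-ins for H_1 .. H_n

grad = rng.normal(size=hidden_size)                    # dL/dH_n
for t in reversed(range(1, seq_len)):
    grad = (grad * (1 - H[t] ** 2)) @ W.T              # dL/dH_{t-1}
    print(t, np.linalg.norm(grad))                     # the norm shrinks quickly toward t = 1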

Since $\tanh(Z_t)$ ranges from -1 to 1, each element of $1-H_i^2$ is also bounded and ranges from 0 to 1. Let $\gamma = \mathrm{max} \{ \mathrm{eigen}( \mathrm{diag}(1 - H_i^2) ) \mid 1 \leqslant i \leqslant n \}$, i.e. the largest diagonal entry among these Jacobian factors. Then a sufficient condition for the gradient to vanish is that the largest singular value $\lambda$ of $W^\intercal$ is strictly less than $1/\gamma$. This can be shown as follows:

$$ \eta = \gamma \lambda < 1 $$

$$ \left\| \frac{\partial \mathcal{L}}{\partial H_{1}} \right\| \leqslant \left\| \frac{\partial \mathcal{L}}{\partial H_{n}} \right\| \cdot \eta^{n-1} $$

The right side converges to 0 as $n \rightarrow \infty$. A necessary condition for gradient explosion can be derived with similar ideas. Since this is only a sufficient condition, the gradient does not vanish in every case, but in practice RNN gradients usually do vanish quickly. Much more formal details are available in the paper On the difficulty of training Recurrent Neural Networks [link]
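
For a purely illustrative sense of scale, take $\gamma = 1$ (the upper bound of the tanh derivative) and $\lambda = 0.9$: a 30-step sequence already shrinks the gradient norm by a factor of roughly $0.9^{29} \approx 0.05$, and a 100-step sequence by roughly $0.9^{99} \approx 3 \times 10^{-5}$.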

To visualize the gradient vanishing problem, I use the RNN operation implemented in the previous post to show the gradient $ \frac{\partial \mathcal{L}}{\partial I_{i}} $ flowing into each input at each timestamp $i$.

from pytensor.model.rnn import *

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

seq_len = 30

# the RNN classifier implemented in the previous post
model = RNNClassifier(1000, 10, 100)

# dummy input sequence: word ids 0 .. seq_len-1
lst = list(range(seq_len))
model.forward(lst)

# compute a classification loss for an arbitrary target and backpropagate
model.softmaxLoss.loss(LongVariable([seq_len]))
model.backward()

# gather the gradient flowing into each input embedding dL/dI_i and plot it as a heatmap
gs = np.concatenate([model.embedding_variables[i].grad for i in range(seq_len)])
sns.heatmap(gs[-seq_len:].T)
plt.show()

Running this code produces the following figure. The x axis is the timestamp $i$, and the y axis shows each element of $\frac{\partial \mathcal{L}}{\partial I_{i}} $. In this model, the RNN starts to backpropagate its gradients from timestamp 29 and ends at timestamp 0. As the figure indicates, the gradients vanish quickly during backpropagation.

RNN gradient of first epoch

LSTM Gradients

The gradient of the LSTM is a bit more complicated. The most critical forward part related to the cell-state update is as follows:

$$ Cell_t = f_t*Cell_{t-1} + i_t*c_t $$

where $f_t, i_t, o_t$ are the forget gate, input gate and output gate, and $c_t$ is the candidate cell state. Applying the chain rule recursively gives the following results:

$$ \frac{\partial \mathcal{L}}{\partial Cell_{t-1}} = f_t * \frac{\partial \mathcal{L}}{\partial Cell_{t}} $$

$$ \frac{\partial \mathcal{L}}{\partial Cell_{1}} = \left( \prod_{i=2}^{n} f_i \right) * \frac{\partial \mathcal{L}}{\partial Cell_{n}} $$

There is still a multiplicative factor accumulated over time in this case, but $f_i$ is much easier to control by choosing appropriate parameters for the forget gate.
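
To make the role of the forget gate concrete, here is a minimal standalone sketch of this cell-state backward recursion (made-up sizes and placeholder gate values, not pytensor code):

import numpy as np

hidden_size, seq_len = 16, 30
rng = np.random.default_rng(0)

# placeholder forget gate activations f_2 .. f_n, each element in (0, 1)
f = 1.0 / (1.0 + np.exp(-rng.normal(size=(seq_len, hidden_size))))

grad = rng.normal(size=hidden_size)        # dL/dCell_n
for t in reversed(range(1, seq_len)):
    grad = f[t] * grad                     # dL/dCell_{t-1} = f_t * dL/dCell_t
print(np.linalg.norm(grad))                # how much survives depends only on the f_t values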

The forget gate is computed as follows:

$$ f_t =\sigma(H_{t-1} \cdot W_{fh} + I_t \cdot W_{fi} + b_{fi} ) $$

The most important factor here is the bias term $b_{fi}$. If we initialize all parameters around 0, then in the first few iterations the gradient will decrease by a factor of 0.5 at every timestamp (because $\sigma(0) = 0.5$). This is visualized in the following figure.

LSTM gradient of first epoch when forget bias is 0
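
As a rough back-of-the-envelope check (assuming this 0.5 factor at every step of the 30-timestamp sequence above), the gradient that survives along the cell-state path to the first timestamp is attenuated by roughly $0.5^{29} \approx 2 \times 10^{-9}$, i.e. around nine orders of magnitude.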

However, if we use a larger forget bias, then the nonlinear curve of $\sigma$ helps gradients flow much more easily. The following figure is the gradient visualization when the forget bias is 1. It shows that the gradients are backpropagated through all timestamps much more efficiently compared with the previous figures.

LSTM gradient of first epoch when forget bias is 1
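
For a rough numeric comparison of the two settings, under the simplifying assumption that $H_{t-1} \cdot W_{fh} + I_t \cdot W_{fi} \approx 0$ near initialization, so the forget gate is approximately $\sigma(b_{fi})$ at every timestamp:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

seq_len = 30
for forget_bias in (0.0, 1.0):
    factor = sigmoid(forget_bias)               # per-timestamp attenuation near initialization
    print(forget_bias, factor, factor ** (seq_len - 1))
# bias 0.0 -> factor 0.50, cumulative ~ 2e-09
# bias 1.0 -> factor 0.73, cumulative ~ 1e-04  (about five orders of magnitude larger)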

In TensorFlow, the forget bias is 1.0 by default for LSTMBlockCell, but 0.0 by default in the cuDNN LSTM.

In my current implementation, I set the default value of the forget bias to 1.0, but it is easy to play with by changing its value in the RNNCell.

2 Comments

  1. Hey, please answer: will there be a continuation of the articles on writing a framework? Where can I get the knowledge to develop a deep learning framework (computer vision)?

    1. Hi Denis, I will probably continue to add operations to support attention and write one article about that, but I have not done it yet.

      After that, I actually do not have any plans right now, because the current single-threaded numpy implementation is very slow for large models. I might either adapt this framework to be my acoustic model engine for CPU speech recognition (to reduce the tf or pytorch dependency), or largely change the implementation to use CUDA kernels to support much more expensive models (e.g. the transformer).

      If you are interested in the actual implementations of current sota deep learning frameworks, maybe you can check dynet. Its C++ code is very clean, short and easy to understand.
