In previous posts, I introduced the pytensor framework and the RNN operation implemented with it. In this post, I will look at how gradients flow backward through RNNs and LSTMs, and why RNN gradients tend to vanish.
RNN Gradients
Let me show how the gradients of an RNN are updated during backpropagation. The forward computation of a vanilla RNN cell is:
$$ Z_t = H_{t-1} \cdot W + I_t \cdot U $$
$$ H_t = \tanh(Z_t) $$
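To make the shapes concrete, here is a minimal numpy sketch of this forward step (the toy dimensions and the row-vector convention are assumptions for illustration, not the actual pytensor code):

```python
import numpy as np

hidden_size, input_size = 4, 3        # assumed toy dimensions
rng = np.random.default_rng(0)

W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # recurrent weights
U = rng.normal(scale=0.1, size=(input_size, hidden_size))   # input weights

H_prev = np.zeros((1, hidden_size))       # H_{t-1}
I_t = rng.normal(size=(1, input_size))    # current input I_t

Z_t = H_prev @ W + I_t @ U                # Z_t = H_{t-1} . W + I_t . U
H_t = np.tanh(Z_t)                        # H_t = tanh(Z_t)
```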
The backward gradients are computed as follows:
$$ \frac{\partial \mathcal{L}}{\partial Z_t} = \frac{\partial \mathcal{L}}{\partial H_t} * (1-H_t^2) $$
$$ \frac{\partial \mathcal{L}}{\partial H_{t-1}} = \frac{\partial \mathcal{L}}{\partial Z_t} \cdot W^\intercal $$
$$ \frac{\partial \mathcal{L}}{\partial H_{t-1}} = \frac{\partial \mathcal{L}}{\partial H_t} * (1-H_t^2) \cdot W^\intercal $$
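The corresponding backward step can be sketched in numpy as well (the stand-in values for `H_t` and the incoming gradient `dH_t` are assumptions for illustration only):

```python
import numpy as np

hidden_size = 4
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # recurrent weights

H_t = np.tanh(rng.normal(size=(1, hidden_size)))  # stand-in for the forward output
dH_t = np.ones((1, hidden_size))                  # gradient arriving at H_t

dZ_t = dH_t * (1.0 - H_t ** 2)   # dL/dZ_t = dL/dH_t * (1 - H_t^2), element-wise
dH_prev = dZ_t @ W.T             # dL/dH_{t-1} = dL/dZ_t . W^T
```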
Suppose the input sequence is $I_{1:n}$. Recursively applying the previous formula leads to the following result:
$$ \frac{\partial \mathcal{L}}{\partial H_{1}} = \frac{\partial \mathcal{L}}{\partial H_n} \cdot \prod_{i=2}^{n} { \left( \mathrm{diag}(1-H_i^2) \cdot W^\intercal \right)} $$
As $\tanh(Z_t)$ ranges from $-1$ to $1$, each element of $1-H_i^2$ is bounded between 0 and 1, so the largest eigenvalue of $\mathrm{diag}(1-H_i^2)$ is at most 1. Let $\gamma = \mathrm{max} \{ \mathrm{eigen}( \mathrm{diag}(1-H_i^2) ) \mid 1 \leqslant i \leqslant n \}$. Then it is sufficient for the gradient to vanish that the largest singular value $\lambda$ of $W^\intercal$ is strictly less than $1/\gamma$, because in that case:
$$ \eta = \gamma \lambda < 1 $$
$$ \left\| \frac{\partial \mathcal{L}}{\partial H_{1}} \right\| \leqslant \left\| \frac{\partial \mathcal{L}}{\partial H_n} \right\| \cdot \eta^{\,n-1} $$
The right-hand side converges to 0 as $n \rightarrow \infty$, so the gradient reaching the first timestep vanishes.
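Before touching the framework, the bound can be checked with a few lines of raw numpy. The weights, hidden states, and incoming gradient below are random stand-ins chosen so that $\gamma \lambda < 1$, purely for illustration:

```python
import numpy as np

hidden_size, n = 16, 30
rng = np.random.default_rng(0)
# small weights => largest singular value of W well below 1/gamma
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))

H = np.tanh(rng.normal(size=(n, hidden_size)))  # stand-in hidden states H_1..H_n
dH = np.ones((1, hidden_size))                  # gradient at the last state H_n

for t in reversed(range(1, n)):
    dH = (dH * (1.0 - H[t] ** 2)) @ W.T         # one backward step through time
    if t % 5 == 0:
        print(t, np.linalg.norm(dH))            # norms shrink toward t = 1
```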
To visualize the gradient vanishing problem, I use the RNN operation implemented in the previous post and plot the gradient flowing into each input, $ \frac{\partial \mathcal{L}}{\partial I_{i}} $, at each timestep $i$.
```python
from pytensor.model.rnn import *
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

seq_len = 30
model = RNNClassifier(1000, 10, 100)

# run one forward/backward pass over a dummy sequence with a dummy target
lst = list(range(seq_len))
model.forward(lst)
model.softmaxLoss.loss(LongVariable([seq_len]))
model.backward()

# collect the gradient flowing into each input embedding and plot it as a heatmap
gs = np.concatenate([model.embedding_variables[i].grad for i in range(seq_len)])
sns.heatmap(gs[-seq_len:].T)
plt.show()
```
Running this code produces the following heatmap of input gradients:
(Figure: heatmap of the input gradients over the 30 timesteps, illustrating how the gradient vanishes for earlier inputs.)
LSTM Gradients
The gradients of an LSTM are a bit more complicated. The most critical part of the forward computation is the cell state update:
$$ Cell_t = f_t * Cell_{t-1} + i_t * c_t $$
where $f_t$, $i_t$, and $o_t$ are the forget gate, the input gate, and the output gate, and $c_t$ is the candidate cell state. The gradient with respect to the cell state is:
$$ \frac{\partial \mathcal{L}}{\partial Cell_{t-1}} = f_t * \frac{\partial \mathcal{L}}{\partial Cell_{t}} $$
$$ \frac{\partial \mathcal{L}}{\partial Cell_{1}} = \left( \prod_{i=2}^{n} f_i \right) * \frac{\partial \mathcal{L}}{\partial Cell_{n}} $$
There is still a multiplicative factor accumulated over time, but $f_i$ is much easier to control by choosing appropriate parameters for the forget gate.
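The decay can be seen with a tiny numpy sketch of the cell-state backward pass (the constant forget gate value 0.5 is an assumption corresponding to zero-initialized parameters; this is not the pytensor implementation):

```python
import numpy as np

hidden_size, n = 16, 30

# forget gate stuck at sigma(0) = 0.5, i.e. all parameters initialized near 0
f = np.full((n, hidden_size), 0.5)

dCell = np.ones((1, hidden_size))     # gradient at Cell_n
for t in reversed(range(1, n)):
    dCell = f[t] * dCell              # dL/dCell_{t-1} = f_t * dL/dCell_t
print(dCell[0, 0])                    # 0.5 ** 29 ~ 1.9e-9: effectively vanished
```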
The forget gate is computed as follows:
$$ f_t =\sigma(H_{t-1} \cdot W_{fh} + I_t \cdot W_{fi} + b_{fi} ) $$
The most important factor here is the bias term $b_{fi}$. If we initialize all parameters around 0, then in the first few iterations the gradient will decrease by a factor of 0.5 at every timestep (because $\sigma(0) = 0.5$). This is visualized in the following figure.

However, if we initialize the forget bias to a larger value such as 1.0, $\sigma(b_{fi})$ starts out close to 0.73 and the gradient decays much more slowly, as shown in the following figure.

In TensorFlow, the forget bias is 1.0 by default for LSTMBlockCell, but 0.0 by default in the cuDNN LSTM.
In my current implementation, I also set the default value of the forget bias to 1.0.
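To put rough numbers on the effect of the forget bias, here is a quick back-of-the-envelope comparison for a 30-step sequence (assuming the weight terms are near 0, so the gate value is just $\sigma(b_{fi})$):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n = 30
for bias in (0.0, 1.0):
    f = sigmoid(bias)                       # forget gate value when weights ~ 0
    print(f"bias={bias}: f={f:.3f}, f**{n}={f ** n:.2e}")
# bias=0.0: f=0.500, f**30=9.31e-10   (gradient essentially gone)
# bias=1.0: f=0.731, f**30=8.29e-05   (much better preserved)
```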
Hey, please answer: will there be a continuation of the articles on writing a framework? And where can I get the knowledge needed to develop a deep learning framework (for computer vision)?
Hi Denis, I will probably continue adding operations to support attention and write one more article about that, but I have not done it yet.
After that, I actually do not have any plans right now, because the current single-threaded numpy implementation is very slow for large models. I might either adapt this framework to be my acoustic model engine for CPU speech recognition (to reduce the tf or pytorch dependency), or largely change the implementation to use CUDA kernels so that it can support much more expensive models (e.g. transformers).
If you are interested in the actual implementations of current state-of-the-art deep learning frameworks, maybe you can check DyNet. Its C++ code is very clean, short, and easy to understand.