# Export Tensorflow CudnnLSTM to numpy

Recently I became interested in distributing an acoustic model implemented with TensorFlow's cuDNN LSTM. However, I found that the model might be difficult to distribute if it depends on TensorFlow, because its API has changed so quickly (especially 1.3 -> 2.0) that the model might stop running at some point in the future.

While it is possible to distribute the model using heavyweight tools such as Docker or a VM, I prefer a cleaner way, for instance a simple pip install. Therefore, I decided to reproduce the inference part of the TensorFlow cuDNN stacked bidirectional LSTM in numpy. Then everything can run happily within numpy.

My model is a standard TensorFlow cuDNN BLSTM model, initialized simply as follows

    cudnn_model = tf.contrib.cudnn_rnn.CudnnLSTM(layer_size, hidden_size, direction='bidirectional')

In my acoustic model, the input size is 120, the hidden size is 320 and the layer size is 5. To begin with, I checked the dumped parameters of this model. The variables (name, shape) related to LSTM inference are as follows

    {'layer/stack_bidirectional_rnn/cell_5/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/kernel': [960, 1280],
     'layer/stack_bidirectional_rnn/cell_4/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/kernel': [960, 1280],
     'layer/stack_bidirectional_rnn/cell_5/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/kernel': [960, 1280],
     'layer/stack_bidirectional_rnn/cell_4/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/bias': [1280],
     'layer/stack_bidirectional_rnn/cell_5/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/bias': [1280],
     'layer/stack_bidirectional_rnn/cell_4/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/kernel': [960, 1280],
     'layer/stack_bidirectional_rnn/cell_4/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/bias': [1280],
     'layer/stack_bidirectional_rnn/cell_3/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/kernel': [960, 1280],
     'layer/stack_bidirectional_rnn/cell_3/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/kernel': [960, 1280],
     'layer/stack_bidirectional_rnn/cell_3/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/bias': [1280],
     'layer/stack_bidirectional_rnn/cell_0/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/bias': [1280],
     'layer/stack_bidirectional_rnn/cell_0/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/kernel': [440, 1280],
     'layer/stack_bidirectional_rnn/cell_1/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/kernel': [960, 1280],
     'layer/stack_bidirectional_rnn/cell_2/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/bias': [1280],
     'layer/stack_bidirectional_rnn/cell_5/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/bias': [1280],
     'layer/stack_bidirectional_rnn/cell_0/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/kernel': [440, 1280],
     'layer/stack_bidirectional_rnn/cell_2/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/kernel': [960, 1280],
     'layer/stack_bidirectional_rnn/cell_1/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/bias': [1280],
     'layer/stack_bidirectional_rnn/cell_1/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/bias': [1280],
     'layer/stack_bidirectional_rnn/cell_2/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/bias': [1280],
     'layer/stack_bidirectional_rnn/cell_1/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/kernel': [960, 1280],
     'layer/stack_bidirectional_rnn/cell_0/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/bias': [1280],
     'layer/stack_bidirectional_rnn/cell_3/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/bias': [1280],
     'layer/stack_bidirectional_rnn/cell_2/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/kernel': [960, 1280]}

I was surprised at first that I could not find either 120 or 320 in those shapes. Additionally, there was only one kernel variable and one bias variable for every unidirectional LSTM. I originally expected that TensorFlow would store the variables of the different gates (e.g. forget gate, input gate, …) separately, as I did in pytensor. But it seems that all gate variables are glued together in some order.

To investigate how to run inference with those glued parameters, I checked the code of CudnnCompatibleLSTMCell, which is the CPU (eigen) implementation of cudnn LSTM.

    class CudnnCompatibleLSTMCell(lstm_ops.LSTMBlockCell):
        """Cudnn Compatible LSTMCell.
        A simple wrapper around tf.contrib.rnn.LSTMBlockCell to use along with
        tf.contrib.cudnn_rnn.CudnnLSTM. The latter's params can be used by
        this cell seamlessly.
        """

        def __init__(self, num_units, reuse=None):
            super(CudnnCompatibleLSTMCell, self).__init__(
                num_units, forget_bias=0, cell_clip=None, use_peephole=False,
                reuse=reuse, name="cudnn_compatible_lstm_cell")

Apparently, CudnnCompatibleLSTMCell is just a wrapper over LSTMBlockCell with a reasonable set of parameters. The Python LSTMBlockCell wrapper in turn leads to the following code

    def _lstm_block_cell(x,
                         cs_prev,
                         h_prev,
                         w,
                         b,
                         wci=None,
                         wcf=None,
                         wco=None,
                         forget_bias=None,
                         cell_clip=None,
                         use_peephole=None,
                         name=None):
        r"""Computes the LSTM cell forward propagation for 1 time step.
        This implementation uses 1 weight matrix and 1 bias vector, and there's an
        optional peephole connection.
        This kernel op implements the following mathematical equations:

            xh = [x, h_prev]
            [i, ci, f, o] = xh * w + b
            f = f + forget_bias
            if not use_peephole:
              wci = wcf = wco = 0
            i = sigmoid(cs_prev * wci + i)
            f = sigmoid(cs_prev * wcf + f)
            ci = tanh(ci)
            cs = ci .* i + cs_prev .* f
            cs = clip(cs, cell_clip)
            o = sigmoid(cs * wco + o)
            co = tanh(cs)
            h = co .* o
        ...
        """

The actual Eigen implementation is here, but the comments above already describe the forward inference. Comparing the interface above with the parameters set by CudnnCompatibleLSTMCell (forget_bias=0, cell_clip=None, use_peephole=False), the actual inference reduces to the following Python-like pseudocode.

    xh = [x, h_prev]
    [i, ci, f, o] = xh * w + b
    i = sigmoid(i)
    f = sigmoid(f)
    ci = tanh(ci)
    cs = ci .* i + cs_prev .* f
    o = sigmoid(o)
    co = tanh(cs)
    h = co .* o

The code here is a very standard LSTM cell, and its only parameters are $w$ and $b$, which obviously correspond to the kernel and the bias appearing in the dumped variables.
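To make the "glued" layout concrete, here is a small sketch that splits a fused kernel into four per-gate blocks along the column axis, in the order i, ci, f, o given by the docstring above, and checks that computing each gate separately agrees with the single fused matrix product. The sizes and random values are toy assumptions, and the names `gate_kernels`, `gate_biases`, `fused` and `separate` are mine, not TensorFlow's.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 2  # toy sizes, not the real model

kernel = rng.standard_normal((input_size + hidden_size, 4 * hidden_size))
bias = rng.standard_normal(4 * hidden_size)
xh = rng.standard_normal(input_size + hidden_size)  # stands in for [x, h_prev]

# Fused computation: one matrix product yields all four gate pre-activations.
fused = xh @ kernel + bias

# Per-gate computation: slice the kernel columns and the bias into four
# equal blocks, assumed to be ordered i, ci, f, o.
gate_kernels = np.split(kernel, 4, axis=1)
gate_biases = np.split(bias, 4)
separate = np.concatenate(
    [xh @ w + b for w, b in zip(gate_kernels, gate_biases)]
)

assert np.allclose(fused, separate)
```

So storing one kernel per direction loses nothing: each gate's weights are just a contiguous block of columns.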

As the code shows, inference first concatenates the input and the previous hidden state, then computes the values of all 4 gates simultaneously. This explains the shapes of the variables above. For my first layer, the kernel shape is [440, 1280]: 440 is 120 (input size) + 320 (previous hidden size), and 1280 is 4 (gates) times 320 (hidden size). For the remaining layers, the kernel shape is [960, 1280]. As I am using a bidirectional LSTM, both the forward and the backward hidden states are fed into the next layer, so 960 decomposes into 320 (forward hidden state of the previous layer) + 320 (backward hidden state of the previous layer) + 320 (previous hidden state in this layer).
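The shape arithmetic above can be written down as a tiny helper. This is only a sketch of the reasoning in this post; `expected_kernel_shape` is a hypothetical name of mine, not part of TensorFlow.

```python
def expected_kernel_shape(layer_index, input_size, hidden_size):
    """Kernel shape for one direction of one stacked bidirectional LSTM layer."""
    # The first layer sees the raw features; later layers see the forward and
    # backward hidden states of the layer below, concatenated.
    layer_input = input_size if layer_index == 0 else 2 * hidden_size
    # Rows: layer input plus this direction's own previous hidden state.
    # Columns: four gates (i, ci, f, o), each of width hidden_size.
    return (layer_input + hidden_size, 4 * hidden_size)


print(expected_kernel_shape(0, 120, 320))  # (440, 1280)
print(expected_kernel_shape(1, 120, 320))  # (960, 1280)
```

The bias of every cell then has shape [4 * hidden_size] = [1280], matching the dump.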

After a bit of testing with numpy, I found that my investigation was correct and the LSTM cell inference can be implemented as in the pseudocode above

    import numpy as np


    def lstm_cell(input_tensor, prev_hidden_tensor, prev_cell_state, kernel, bias):
        """
        forward inference logic of a lstm cell
        reference: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/rnn/python/ops/lstm_ops.py

        :param input_tensor: input tensor
        :param prev_hidden_tensor: tensor of previous hidden state
        :param prev_cell_state: tensor of previous cell state
        :param kernel: weight
        :param bias: bias
        :return: hidden tensor, cell state tensor
        """

        # xh = [x, h_prev]
        xh = np.concatenate([input_tensor, prev_hidden_tensor])

        # pre-activations of all four gates, glued in the order i, ci, f, o
        gates = np.dot(xh, kernel) + bias
        i, ci, f, o = np.split(gates, 4)

        # embed sigmoid to reduce function call
        i = 1. / (1. + np.exp(-i))
        f = 1. / (1. + np.exp(-f))
        o = 1. / (1. + np.exp(-o))
        ci = np.tanh(ci)

        # new cell state and new hidden state
        cs = np.multiply(ci, i) + np.multiply(prev_cell_state, f)
        co = np.tanh(cs)
        h = np.multiply(co, o)

        return h, cs
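One optional refinement, which is not something the original code does: the inlined `1. / (1. + np.exp(-x))` can make `np.exp` emit overflow warnings for very negative pre-activations, even though the result still comes out as 0. If that matters, a sketch of an equivalent form uses the identity sigmoid(x) = 0.5 * (1 + tanh(x / 2)), which saturates instead of overflowing.

```python
import numpy as np


def sigmoid(x):
    # Mathematically identical to 1 / (1 + exp(-x)), but tanh saturates
    # at -1 and 1 instead of overflowing for large |x|.
    return 0.5 * (1.0 + np.tanh(0.5 * x))


x = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
y = sigmoid(x)

# Agrees with the naive formula wherever the naive formula is well behaved.
moderate = np.array([-1.0, 0.0, 1.0])
assert np.allclose(sigmoid(moderate), 1.0 / (1.0 + np.exp(-moderate)))
```

Whether this is worth the extra function call depends on how large the pre-activations of the model actually get.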

Then we can recursively implement unidirectional lstm, bidirectional lstm and the stacked bidirectional lstm.
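As a rough illustration of that stacking (written with loops rather than recursion), here is a minimal self-contained sketch. It assumes zero initial states, a `[time, features]` input array, and that each layer's output is the forward and backward hidden states concatenated in that order. The function names, the `(kernel, bias)` tuple layout and the toy sizes are my own assumptions for illustration; this is not the full model linked below.

```python
import numpy as np


def lstm_cell(x, h_prev, c_prev, kernel, bias):
    """One LSTM step with the fused [i, ci, f, o] kernel layout."""
    i, ci, f, o = np.split(np.concatenate([x, h_prev]) @ kernel + bias, 4)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    c = np.tanh(ci) * sigmoid(i) + c_prev * sigmoid(f)
    return np.tanh(c) * sigmoid(o), c


def lstm(inputs, kernel, bias, reverse=False):
    """Run one direction over a [time, features] array (zero initial state)."""
    hidden_size = bias.shape[0] // 4
    h, c = np.zeros(hidden_size), np.zeros(hidden_size)
    outputs = []
    for x in (inputs[::-1] if reverse else inputs):
        h, c = lstm_cell(x, h, c, kernel, bias)
        outputs.append(h)
    outputs = np.stack(outputs)
    # Put the backward outputs back into chronological order.
    return outputs[::-1] if reverse else outputs


def bidirectional_lstm(inputs, fw_params, bw_params):
    """Concatenate forward and backward hidden states per frame."""
    fw = lstm(inputs, *fw_params)
    bw = lstm(inputs, *bw_params, reverse=True)
    return np.concatenate([fw, bw], axis=1)


def stacked_bidirectional_lstm(inputs, layers):
    """Feed each bidirectional layer's output into the next layer."""
    out = inputs
    for fw_params, bw_params in layers:
        out = bidirectional_lstm(out, fw_params, bw_params)
    return out


# Toy usage with random weights (not the real acoustic model).
rng = np.random.default_rng(0)
input_size, hidden_size, num_layers, num_frames = 4, 3, 2, 5


def random_params(layer_input):
    kernel = 0.1 * rng.standard_normal((layer_input + hidden_size, 4 * hidden_size))
    return kernel, np.zeros(4 * hidden_size)


layers = []
for n in range(num_layers):
    layer_input = input_size if n == 0 else 2 * hidden_size
    layers.append((random_params(layer_input), random_params(layer_input)))

features = rng.standard_normal((num_frames, input_size))
output = stacked_bidirectional_lstm(features, layers)
print(output.shape)  # (5, 6)
```

With real weights, `layers` would be filled from the dumped kernel and bias variables of each `cell_N/.../fw` and `cell_N/.../bw` pair instead of `random_params`.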

The full model is here, and the corresponding test is here.

xinjianl

1. ERRON says:

Hello! do you have mail or twitter?

1. xinjianl says:

Hi ERRON, please feel free to send email to me: xinjianl (at) cs.cmu.edu 🙂

1. ERRON says:

Hi.
Where would you advise getting the same knowledge as yours for implementing an NLP framework?

1. xinjianl says:

Hi ERRON,

I think a good starting point is to follow the assignments in some open courses (e.g: stanford’s cs231n, cs224n or cmu’s deep learning course http://deeplearning.cs.cmu.edu/), then try organizing that code into your own framework. This should give you insight into how to create a toy framework like mine.

Then if you are interested in implementing more decent frameworks, probably you should refer to the actual C++ implementation of modern frameworks (e.g: dynet)

1. Erron says:

after these courses, will I have enough knowledge to implement such projects as Face Recognition, video subtitles and speech recognition from scratch (using my own framework)?
THANKS

2. xinjianl says:

Hi ERRON,

Yeah, I guess you could acquire enough knowledge to implement those if you finish them all 🙂

1. xinjianl says:

Exactly!

1. Erron says:

Thank you so much for the answers.

2. dima says:

Hello!
Do you know computer vision?
If yes.
1) what skills do you need to please this milking and where to get it?
2) can you say something about these resources?
http://dlsys.cs.washington.edu/
https://course.fast.ai/videos/?lesson=8
https://github.com/pjreddie/uwimg
https://sergioskar.github.io/Neural_Network_from_scratch/
https://course.fast.ai/part2
Can you tell me the resources for studying (tensors, autograd, NN abstractions, optimizers, data pipeline, training loop abstraction, cnn, distributed training, graph compilation)?
Thanks!

1. xinjianl says:

Hi Dima,

Actually I am not very familiar with CV, but I think all of the resources you listed can be a good start for studying.
Those urls should cover the basic stuff such as how to implement autograd or optimization.
I would recommend you to first start with some python implementation (e.g: in your first url or the library in this blog).
Then move to the c++ (e.g: in your 4th url) to speed up.
The 3rd one looks a bit different because it is mostly about the vision preprocessing and traditional approaches, which should be very helpful eventually if you work in CV.

For more advanced topics (e.g: distributed training implementation, graph compile), I do not know whether there are any good tutorials or not.
Maybe at this level, you need to read actual codes in the modern libraries (e.g: pytorch, nccl) and research papers.

Thanks!

1. Dima says:

can you recommend resources where development of deep learning projects from scratch is shown?
and where to study more deeply? (tensors, autograd, NN abstractions, optimizers, CNN)

1. xinjianl says:

The resources you listed already contain some skeletons to start from, but details might be omitted.
Unfortunately, I do not know whether there are complete tutorials to help you go through all the implementation details.
But I think there are enough materials online covering every essential point you need to know.
So just start with the skeleton project and try to implement it yourself 🙂

To study implementation more deeply, I would recommend reading some mid-size neural network libraries.
For example, dynet is a very good resource to learn how an actual framework is implemented in c++/cuda.
Chainer is also a good option as most of it is implemented in python, which should be easy to understand. Unfortunately, they just announced this week that chainer will not be supported anymore, but at the current point, I think it is still a good choice to learn implementation.
On the other hand, both pytorch and tensorflow are too large to read, and their implementations are actually not very well organized to follow (especially tensorflow)

1. Dima says:

Thanks!
and what resources did you study to study implementations from scratch?

2. xinjianl says:

Actually, I did not read any specific resources for implementing from scratch, but I got lots of hints from this repo

3. Dima says:

Thanks!!!

4. Denis says:

Hi
happy New Year!
will there be a continuation of the content in your framework?

1. xinjianl says:

Hi Denis, thanks! I am now implementing the continuation and hope to write some new blogs about it when I have time.