Recently I became interested in distributing an acoustic model implemented with TensorFlow's cuDNN LSTM. However, a model that depends on TensorFlow is hard to distribute: the API has changed so quickly (especially 1.3 -> 2.0) that the model might stop running at some point in the future.

It is possible to distribute the model with heavier tooling such as Docker or a VM, but I prefer a cleaner way, for instance a simple pip install. Therefore, I decided to reproduce the inference part of TensorFlow's cuDNN stacked bidirectional LSTM in numpy, so that everything can run happily with numpy alone.

My model is a standard TensorFlow cuDNN BLSTM, initialized simply as follows:

`cudnn_model = tf.contrib.cudnn_rnn.CudnnLSTM(layer_size, hidden_size, direction='bidirectional')`

In my acoustic model, the input size is 120, the hidden size is 320 and the layer size is 5. To begin with, I inspected the dumped parameters of this model. The variables (name, shape) related to LSTM inference are as follows:

```
{'layer/stack_bidirectional_rnn/cell_5/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/kernel': [960, 1280],
 'layer/stack_bidirectional_rnn/cell_4/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/kernel': [960, 1280],
 'layer/stack_bidirectional_rnn/cell_5/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/kernel': [960, 1280],
 'layer/stack_bidirectional_rnn/cell_4/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/bias': [1280],
 'layer/stack_bidirectional_rnn/cell_5/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/bias': [1280],
 'layer/stack_bidirectional_rnn/cell_4/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/kernel': [960, 1280],
 'layer/stack_bidirectional_rnn/cell_4/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/bias': [1280],
 'layer/stack_bidirectional_rnn/cell_3/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/kernel': [960, 1280],
 'layer/stack_bidirectional_rnn/cell_3/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/kernel': [960, 1280],
 'layer/stack_bidirectional_rnn/cell_3/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/bias': [1280],
 'layer/stack_bidirectional_rnn/cell_0/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/bias': [1280],
 'layer/stack_bidirectional_rnn/cell_0/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/kernel': [440, 1280],
 'layer/stack_bidirectional_rnn/cell_1/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/kernel': [960, 1280],
 'layer/stack_bidirectional_rnn/cell_2/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/bias': [1280],
 'layer/stack_bidirectional_rnn/cell_5/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/bias': [1280],
 'layer/stack_bidirectional_rnn/cell_0/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/kernel': [440, 1280],
 'layer/stack_bidirectional_rnn/cell_2/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/kernel': [960, 1280],
 'layer/stack_bidirectional_rnn/cell_1/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/bias': [1280],
 'layer/stack_bidirectional_rnn/cell_1/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/bias': [1280],
 'layer/stack_bidirectional_rnn/cell_2/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/bias': [1280],
 'layer/stack_bidirectional_rnn/cell_1/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/kernel': [960, 1280],
 'layer/stack_bidirectional_rnn/cell_0/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/bias': [1280],
 'layer/stack_bidirectional_rnn/cell_3/bidirectional_rnn/bw/cudnn_compatible_lstm_cell/bias': [1280],
 'layer/stack_bidirectional_rnn/cell_2/bidirectional_rnn/fw/cudnn_compatible_lstm_cell/kernel': [960, 1280]}
```

I was surprised at first that neither 120 nor 320 appears in those shapes. Additionally, there is only one kernel variable and one bias variable for each unidirectional LSTM. I had expected TensorFlow to store the variables of each gate separately (e.g. forget gate, input gate, ...), as I did in my pytensor implementation. Instead, it seems that all gate variables are glued together in some order.

To investigate how to run inference with those glued parameters, I checked the code of CudnnCompatibleLSTMCell, which is the CPU (eigen) implementation of cudnn LSTM.

```
class CudnnCompatibleLSTMCell(lstm_ops.LSTMBlockCell):
  """Cudnn Compatible LSTMCell.
  A simple wrapper around `tf.contrib.rnn.LSTMBlockCell` to use along with
  `tf.contrib.cudnn_rnn.CudnnLSTM`. The latter's params can be used by
  this cell seamlessly.
  """

  def __init__(self, num_units, reuse=None):
    super(CudnnCompatibleLSTMCell, self).__init__(
        num_units, forget_bias=0, cell_clip=None, use_peephole=False,
        reuse=reuse, name="cudnn_compatible_lstm_cell")
```

Evidently, CudnnCompatibleLSTMCell is just a wrapper around LSTMBlockCell with a fixed set of arguments. The Python LSTMBlockCell wrapper in turn leads to the following code:

```
def _lstm_block_cell(x,
                     cs_prev,
                     h_prev,
                     w,
                     b,
                     wci=None,
                     wcf=None,
                     wco=None,
                     forget_bias=None,
                     cell_clip=None,
                     use_peephole=None,
                     name=None):
  r"""Computes the LSTM cell forward propagation for 1 time step.
  This implementation uses 1 weight matrix and 1 bias vector, and there's an
  optional peephole connection.
  This kernel op implements the following mathematical equations:
```python
  xh = [x, h_prev]
  [i, ci, f, o] = xh * w + b
  f = f + forget_bias
  if not use_peephole:
    wci = wcf = wco = 0
  i = sigmoid(cs_prev * wci + i)
  f = sigmoid(cs_prev * wcf + f)
  ci = tanh(ci)
  cs = ci .* i + cs_prev .* f
  cs = clip(cs, cell_clip)
  o = sigmoid(cs * wco + o)
  co = tanh(cs)
  h = co .* o
```

The actual computation is done by the Eigen kernel, but the docstring above already spells out the forward pass. Plugging in the arguments that CudnnCompatibleLSTMCell passes (`forget_bias=0`, `cell_clip=None`, `use_peephole=False`), the inference reduces to the following Python-style pseudocode.

```
xh = [x, h_prev]
[i, ci, f, o] = xh * w + b
i = sigmoid(i)
f = sigmoid(f)
ci = tanh(ci)
cs = ci .* i + cs_prev .* f
o = sigmoid(o)
co = tanh(cs)
h = co .* o
```

This is a very standard LSTM cell whose only parameters are $w$ and $b$, which clearly correspond to the kernel and the bias in the dumped variables.
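To make the "glued" layout concrete, here is a minimal sketch of how a fused kernel could be cut back into per-gate pieces. It assumes the column order `[i, ci, f, o]` from the docstring above and the row order `[x, h_prev]` from the concatenation; the weights are random stand-ins rather than values from my model, and all variable names are only illustrative.

```python
import numpy as np

input_size, hidden_size = 120, 320  # first-layer sizes from this post

rng = np.random.default_rng(0)
# Random stand-ins for one dumped kernel/bias pair (NOT real model weights).
kernel = rng.standard_normal((input_size + hidden_size, 4 * hidden_size))
bias = rng.standard_normal(4 * hidden_size)

# Columns: one block per gate, assumed to follow the [i, ci, f, o] order.
w_i, w_ci, w_f, w_o = np.split(kernel, 4, axis=1)
b_i, b_ci, b_f, b_o = np.split(bias, 4)

# Rows: input weights first, recurrent weights second (order of [x, h_prev]).
w_x, w_h = kernel[:input_size], kernel[input_size:]

x = rng.standard_normal(input_size)
h_prev = rng.standard_normal(hidden_size)
xh = np.concatenate([x, h_prev])

# One fused product ...
fused = xh @ kernel + bias
# ... equals four separate per-gate products, concatenated ...
per_gate = np.concatenate(
    [xh @ w_i + b_i, xh @ w_ci + b_ci, xh @ w_f + b_f, xh @ w_o + b_o])
# ... and also equals separate input and recurrent products.
input_plus_recurrent = x @ w_x + h_prev @ w_h + bias

gates_match = np.allclose(fused, per_gate)
rows_match = np.allclose(fused, input_plus_recurrent)
```

Both checks hold for any weights, since they only rearrange the same matrix product. They do not by themselves prove that the gate order is right; that still has to be verified against TensorFlow's output.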

As the pseudocode shows, the cell first concatenates the input with the previous hidden state and then computes the values of all 4 gates simultaneously. This explains the shapes above. For my first layer, the kernel shape is [440, 1280]: 440 is 120 (input size) + 320 (previous hidden size), and 1280 is 4 (gates) times 320 (hidden size). For the remaining layers, the kernel shape is [960, 1280]. Because I am using a bidirectional LSTM, both the forward and the backward hidden states are fed into the next layer, so 960 decomposes into 320 (forward hidden state of the previous layer) + 320 (backward hidden state of the previous layer) + 320 (previous hidden state of this layer).
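The shape arithmetic above can be written down as a tiny helper. This is only a sketch: `expected_kernel_shape` is a hypothetical name, not a TensorFlow function, and the default sizes are the ones from my model.

```python
def expected_kernel_shape(layer_index, input_size=120, hidden_size=320):
    """Kernel shape of one direction's cell in layer `layer_index` (0-based)."""
    # Layer 0 sees the raw features; deeper layers see the forward and
    # backward outputs of the previous layer concatenated together.
    layer_input = input_size if layer_index == 0 else 2 * hidden_size
    # Rows: layer input + previous hidden state. Columns: 4 gates.
    return (layer_input + hidden_size, 4 * hidden_size)


first_layer_shape = expected_kernel_shape(0)    # (440, 1280)
deeper_layer_shape = expected_kernel_shape(1)   # (960, 1280)
```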

After a bit of testing with numpy, I confirmed that this analysis was correct and that the LSTM cell inference can be implemented as in the pseudocode above:

```
import numpy as np


def lstm_cell(input_tensor, prev_hidden_tensor, prev_cell_state, kernel, bias):
    """
    forward inference logic of a lstm cell
    reference: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/rnn/python/ops/lstm_ops.py
    :param input_tensor: input tensor, shape [input_size]
    :param prev_hidden_tensor: tensor of previous hidden state, shape [hidden_size]
    :param prev_cell_state: tensor of previous cell state, shape [hidden_size]
    :param kernel: weight, shape [input_size + hidden_size, 4 * hidden_size]
    :param bias: bias, shape [4 * hidden_size]
    :return: hidden tensor, cell state tensor
    """
    xh = np.concatenate([input_tensor, prev_hidden_tensor])
    # pre-activations of all four gates in a single matrix product
    gates = np.dot(xh, kernel) + bias
    i, ci, f, o = np.split(gates, 4)
    # inline the sigmoid to avoid extra function calls
    i = 1. / (1. + np.exp(-i))
    f = 1. / (1. + np.exp(-f))
    o = 1. / (1. + np.exp(-o))
    ci = np.tanh(ci)
    # new cell state: gated candidate plus gated previous cell state
    cs = np.multiply(ci, i) + np.multiply(prev_cell_state, f)
    co = np.tanh(cs)
    h = np.multiply(co, o)
    return h, cs
```

With this cell in place, we can build the unidirectional LSTM, then the bidirectional LSTM, and finally the stacked bidirectional LSTM on top of it.
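One possible way to stack these pieces is sketched below. It is not necessarily the exact code I ended up with, and it rests on a few assumptions: initial hidden and cell states are zero, the backward direction runs over the time-reversed sequence and its outputs are flipped back, and each layer's output is the forward output followed by the backward output. The helper names (`lstm`, `bidirectional_lstm`, `stacked_blstm`, `NAME`) are illustrative, and the demo weights are random with toy sizes.

```python
import numpy as np

# Variable name pattern taken from the dumped parameters above.
NAME = ("layer/stack_bidirectional_rnn/cell_{}/bidirectional_rnn/{}/"
        "cudnn_compatible_lstm_cell/{}")


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_cell(x, h_prev, c_prev, kernel, bias):
    """One LSTM step with a fused kernel, gate order [i, ci, f, o]."""
    gates = np.concatenate([x, h_prev]) @ kernel + bias
    i, ci, f, o = np.split(gates, 4)
    c = np.tanh(ci) * sigmoid(i) + c_prev * sigmoid(f)
    h = np.tanh(c) * sigmoid(o)
    return h, c


def lstm(inputs, kernel, bias, reverse=False):
    """Run one direction over inputs [time, features] -> [time, hidden]."""
    hidden_size = bias.shape[0] // 4
    h = np.zeros(hidden_size)  # assumption: zero initial states
    c = np.zeros(hidden_size)
    steps = inputs[::-1] if reverse else inputs
    outputs = []
    for x in steps:
        h, c = lstm_cell(x, h, c, kernel, bias)
        outputs.append(h)
    outputs = np.stack(outputs)
    # Flip backward outputs so that row t again corresponds to time step t.
    return outputs[::-1] if reverse else outputs


def bidirectional_lstm(inputs, fw, bw):
    """fw and bw are (kernel, bias) pairs; output is [time, 2 * hidden]."""
    return np.concatenate(
        [lstm(inputs, *fw), lstm(inputs, *bw, reverse=True)], axis=1)


def stacked_blstm(inputs, params, num_layers):
    """Feed each bidirectional layer's output into the next layer."""
    out = inputs
    for k in range(num_layers):
        fw = (params[NAME.format(k, "fw", "kernel")],
              params[NAME.format(k, "fw", "bias")])
        bw = (params[NAME.format(k, "bw", "kernel")],
              params[NAME.format(k, "bw", "bias")])
        out = bidirectional_lstm(out, fw, bw)
    return out


# Toy demo with random weights (NOT the real model): 3 features, 2 hidden
# units, 2 layers, 5 time steps.
rng = np.random.default_rng(0)
demo_in, demo_hidden, demo_layers = 3, 2, 2
demo_params = {}
for k in range(demo_layers):
    rows = (demo_in if k == 0 else 2 * demo_hidden) + demo_hidden
    for direction in ("fw", "bw"):
        demo_params[NAME.format(k, direction, "kernel")] = (
            rng.standard_normal((rows, 4 * demo_hidden)))
        demo_params[NAME.format(k, direction, "bias")] = (
            rng.standard_normal(4 * demo_hidden))

demo_inputs = rng.standard_normal((5, demo_in))
demo_out = stacked_blstm(demo_inputs, demo_params, demo_layers)
```

For a real model, `params` would be the dictionary of dumped arrays (for example saved once to an `.npz` file), and the output should be compared against TensorFlow's on a few utterances before trusting the port.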