Implement a deep learning framework: Part 4 – Implement RNN, LSTM and Language Models

In this part, we continue to add basic components to the framework by implementing RNN/LSTM-related operations. RNNs and LSTMs are neural network models that are widely used in NLP and ML tasks such as POS tagging and speech recognition.


Embedding Operation

First, we need to support the embedding operation for the model. In the previous posts, we implemented the standard Variable for the linear and MLP models. However, a standard variable cannot be used efficiently as an embedding variable.

The reason is that an embedding lookup is a sparse operation: we only need a small part of the entire variable. For example, looking up the indices 1 and 3 touches only two rows of the whole variable.
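As a small sketch of this idea (using a plain NumPy array rather than the framework's Variable), a lookup only reads the selected rows:

```python
import numpy as np

# a hypothetical vocabulary of 5 words with 3-dimensional embeddings
embedding_matrix = np.random.uniform(-0.1, 0.1, (5, 3))

# looking up the words with ids 1 and 3 touches only those two rows;
# the other rows are neither read nor updated
word_ids = [1, 3]
looked_up = embedding_matrix[word_ids]

assert looked_up.shape == (2, 3)
```

A dense representation would force every update to touch all `vocab_size` rows, which is wasteful when a sentence only contains a handful of distinct words.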


To support embedding lookup, we implemented the embedding operation in pytensor.ops.embedding_ops. The embedding variable is implemented as a list of standard variables in the Parameter class. To look up the embedding for a specific word, we can simply pick the corresponding variable from the list.

The related code is as follows.

class Parameter:

    def get_embedding(self, vocab_size, word_dim):

        # return the current embedding if it was created already
        if self.embeddings is not None:
            return self.embeddings

        # the embedding is implemented as a list of variables,
        # so that only the rows that were looked up need to be updated
        self.embeddings = []

        for i in range(vocab_size):
            embedding = Variable([np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), word_dim)])
            self.embeddings.append(embedding)

        return self.embeddings

class Embedding(Operation):

    def forward(self, input_variables):
        """
        get the embedding

        :param input_variables: a LongVariable containing a word id
        :return: the embedding variable of that word
        """
        super(Embedding, self).forward(input_variables)

        # embedding takes exactly 1 input variable
        assert(len(input_variables) == 1)

        word_id = input_variables[0].value[0]
        assert(word_id < self.vocab_size)

        output_variable = self.embedding_variables[word_id]

        # only the looked-up variable participates in the backward pass
        if self.trainable:
            output_variable.trainable = True

        return output_variable

RNN Operation

Equipped with the embedding operation, we can continue to add the RNN operation. In our implementation, an RNN operation consists of multiple RNNCell operations. When we run the RNN over a long sentence, each RNNCell receives one word and updates its internal state.

The RNNCell operation computes its forward state with the following equations.

$$ Z_t = H_{t-1} \cdot W + I_t \cdot U $$

$$ H_t = tanh(Z_t) $$

where $H_t$ is the hidden state at timestep $t$ and $I_t$ is the input at timestep $t$.
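The two forward equations can be sketched in NumPy as follows (the sizes and weight initialization here are illustrative, not the framework's actual RNNCell):

```python
import numpy as np

def rnn_cell_forward(I_t, H_prev, W, U):
    # Z_t = H_{t-1} . W + I_t . U, then H_t = tanh(Z_t)
    Z_t = H_prev @ W + I_t @ U
    return np.tanh(Z_t)

# hypothetical sizes: input_size = 4, hidden_size = 3
rng = np.random.default_rng(0)
W = rng.uniform(-0.5, 0.5, (3, 3))  # hidden-to-hidden weights
U = rng.uniform(-0.5, 0.5, (4, 3))  # input-to-hidden weights

H = np.zeros((1, 3))                # initial hidden state
for I_t in rng.uniform(-1.0, 1.0, (5, 1, 4)):  # 5 input timesteps
    H = rnn_cell_forward(I_t, H, W, U)

assert H.shape == (1, 3)
```

Note that the same $W$ and $U$ are reused at every timestep; only the hidden state changes.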

The backward pass propagates the gradient from $H_t$ into $Z_t$ with the chain rule as follows.

$$ \frac{\partial \mathcal{L}}{\partial Z_t} = \frac{\partial \mathcal{L}}{\partial H_t} \cdot \frac{\partial tanh(Z_t)}{\partial Z_t} $$

$$ \frac{\partial \mathcal{L}}{\partial Z_t} = \frac{\partial \mathcal{L}}{\partial H_t} \cdot (1-H_t^2) $$

Then the gradient is backpropagated into each of the remaining variables.

$$ \frac{\partial \mathcal{L}}{\partial H_{t-1}} = \frac{\partial \mathcal{L}}{\partial Z_t} \cdot W^\intercal $$

$$ \frac{\partial \mathcal{L}}{\partial W} =H_{t-1}^\intercal \cdot \frac{\partial \mathcal{L}}{\partial Z_t}$$

$$ \frac{\partial \mathcal{L}}{\partial I_{t}} = \frac{\partial \mathcal{L}}{\partial Z_t} \cdot U^\intercal $$

$$ \frac{\partial \mathcal{L}}{\partial U} =I_{t}^\intercal \cdot \frac{\partial \mathcal{L}}{\partial Z_t}$$

Finally, $ \frac{\partial \mathcal{L}}{\partial I_{t}}$ is propagated into each embedding variable.

One common pitfall here is that $ \frac{\partial \mathcal{L}}{\partial W} $ and $ \frac{\partial \mathcal{L}}{\partial U} $ must accumulate their gradients across timesteps, because $W$ and $U$ are shared by every RNNCell; overwriting them with the newest gradient is wrong. It took me a lot of time to debug this…
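A minimal NumPy sketch of one backward step, showing both the tanh derivative and the `+=` accumulation (the shapes and the `grads` dictionary are illustrative, not the framework's actual code):

```python
import numpy as np

def rnn_cell_backward(dH_t, H_t, H_prev, I_t, W, U, grads):
    # dL/dZ_t = dL/dH_t * (1 - H_t^2), the derivative of tanh
    dZ_t = dH_t * (1.0 - H_t ** 2)
    # W and U are shared across timesteps, so accumulate with +=
    grads['W'] += H_prev.T @ dZ_t
    grads['U'] += I_t.T @ dZ_t
    # gradients flowing to the previous hidden state and the input
    dH_prev = dZ_t @ W.T
    dI_t = dZ_t @ U.T
    return dH_prev, dI_t

# hypothetical shapes: input_size = 4, hidden_size = 3
grads = {'W': np.zeros((3, 3)), 'U': np.zeros((4, 3))}
dH = np.ones((1, 3))
H_t, H_prev, I_t = np.full((1, 3), 0.5), np.zeros((1, 3)), np.ones((1, 4))
W, U = np.zeros((3, 3)), np.zeros((4, 3))

# two backward calls: the parameter gradients accumulate
for _ in range(2):
    rnn_cell_backward(dH, H_t, H_prev, I_t, W, U, grads)

# each call added 1.0 * (1 - 0.5**2) = 0.75 to every entry of grads['U']
assert np.allclose(grads['U'], 1.5)
```

If the `+=` above were replaced with `=`, only the last timestep's contribution would survive, which is exactly the bug described here.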

LSTM Operation

An RNN can also be extended into an LSTM model, which alleviates the vanishing and exploding gradient problems. Like the RNN, the LSTM is composed of cells (LSTMCell). The difference between an LSTMCell and an RNNCell is that an LSTMCell needs to remember two variables: the hidden state and the cell state.

The LSTMCell operation will update its forward states with the following equations.

$$ f_t =\sigma(H_{t-1} \cdot W_{fh} + I_t \cdot W_{fi}) $$

$$ i_t =\sigma(H_{t-1} \cdot W_{ih} + I_t \cdot W_{ii}) $$

$$ o_t = \sigma(H_{t-1} \cdot W_{oh} + I_t \cdot W_{oi}) $$

$$ c_t = tanh(H_{t-1} \cdot W_{ch} + I_t \cdot W_{ci}) $$

$$ Cell_t = f_t*Cell_{t-1} + i_t*c_t $$

$$ H_t = o_t*tanh(Cell_t) $$

where $f_t$, $i_t$, and $o_t$ are the forget gate, input gate, and output gate respectively, and $c_t$ is the candidate cell state.
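The forward equations above can be sketched in NumPy (the dict-of-matrices layout and the sizes are illustrative, not the framework's actual LSTMCell):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_forward(I_t, H_prev, Cell_prev, Wh, Wi):
    # one weight matrix per gate: f (forget), i (input), o (output), c (candidate)
    f_t = sigmoid(H_prev @ Wh['f'] + I_t @ Wi['f'])
    i_t = sigmoid(H_prev @ Wh['i'] + I_t @ Wi['i'])
    o_t = sigmoid(H_prev @ Wh['o'] + I_t @ Wi['o'])
    c_t = np.tanh(H_prev @ Wh['c'] + I_t @ Wi['c'])
    Cell_t = f_t * Cell_prev + i_t * c_t   # new cell state
    H_t = o_t * np.tanh(Cell_t)            # new hidden state
    return H_t, Cell_t

# hypothetical sizes: input_size = 4, hidden_size = 3
rng = np.random.default_rng(0)
Wh = {g: rng.uniform(-0.5, 0.5, (3, 3)) for g in 'fioc'}
Wi = {g: rng.uniform(-0.5, 0.5, (4, 3)) for g in 'fioc'}

H, Cell = np.zeros((1, 3)), np.zeros((1, 3))
for I_t in rng.uniform(-1.0, 1.0, (5, 1, 4)):  # 5 input timesteps
    H, Cell = lstm_cell_forward(I_t, H, Cell, Wh, Wi)

assert H.shape == Cell.shape == (1, 3)
```

The forget gate $f_t$ scales the previous cell state, which is what lets gradients flow over long spans without vanishing as quickly as in a plain RNN.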

The backward pass of the LSTMCell is too long to show in this post. The corresponding code can be seen here.

Penn Treebank Language Model

Finally, we can use these components to create a language model. We will train the model on sentences from the Penn Treebank. The dataset is obtained from the word2vec script on Mikolov's website.

We show an RNN language model implementation below. The LSTM version can be implemented in the same style by replacing RNN with LSTM.

class RNNLM:

    def __init__(self, vocab_size, input_size, hidden_size):

        # embedding size
        self.vocab_size = vocab_size
        self.word_dim = input_size

        # network size
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = vocab_size

        # num steps
        self.max_num_steps = 100
        self.num_steps = 0

        # graph
        self.graph = Graph('RNN')

        # word embedding
        embed_argument = {'vocab_size': self.vocab_size, 'embed_size': self.input_size}
        self.word_embedding = self.graph.get_operation('Embedding', embed_argument)

        # rnn
        rnn_argument = {'input_size': self.input_size, 'hidden_size': self.hidden_size, 'max_num_steps': self.max_num_steps}
        self.rnn = self.graph.get_operation('RNN', rnn_argument)

        # affines
        affine_argument = {'input_size': self.hidden_size, 'hidden_size': self.output_size}
        self.affines = [self.graph.get_operation('Affine', affine_argument, "Affine") for i in range(self.max_num_steps)]

        # softmax
        self.softmaxLosses = [self.graph.get_operation('SoftmaxLoss') for i in range(self.max_num_steps)]

    def forward(self, word_lst):

        # get num steps
        self.num_steps = min(len(word_lst), self.max_num_steps)

        # create embeddings
        embedding_variables = []
        for word_id in word_lst[:self.num_steps]:
            embedding_variable = self.word_embedding.forward([LongVariable([word_id])])
            embedding_variables.append(embedding_variable)

        # run RNN
        rnn_variables = self.rnn.forward(embedding_variables)

        # softmax variables
        softmax_variables = []

        for i in range(self.num_steps):
            output_variable = self.affines[i].forward(rnn_variables[i])
            softmax_variable = self.softmaxLosses[i].forward(output_variable)
            softmax_variables.append(softmax_variable)

        return softmax_variables

    def loss(self, target_ids):

        ce_loss = 0.0

        for i in range(self.num_steps):
            cur_ce_loss = self.softmaxLosses[i].loss(LongVariable([target_ids[i]]))
            ce_loss += cur_ce_loss

        return ce_loss
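When evaluating the trained model, the summed cross-entropy loss returned by loss() is usually reported as perplexity. A small helper, independent of the framework and assuming the loss uses the natural logarithm:

```python
import math

def perplexity(total_ce_loss, num_words):
    # perplexity is the exponentiated average per-word cross-entropy
    return math.exp(total_ce_loss / num_words)

# a model that assigns uniform probability over a 10,000-word vocabulary
# has an average loss of ln(10000), i.e. a perplexity of 10000
assert abs(perplexity(math.log(10000) * 50, 50) - 10000.0) < 1e-6
```

Lower perplexity means the model assigns higher probability to the held-out text, which makes it a convenient single number for comparing the RNN and LSTM language models.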

In the next post of this series, I hope to implement a seq2seq model on top of the RNN operations introduced in this post.


  1. Hello! Only thanks to a friend did I find your site with such interesting tutorials. Can you write an article about what you need to know to build your own deep learning framework?

    1. Hi Ban,

      Thanks for your comments!
      It is a good suggestion, as some other people also seem interested in the same question.
      I am not sure when I will have time to write about it, but I will try writing one when I have time 🙂


      1. Thanks for the answer!
        I hope you will soon have free time. In the meantime, can you give general advice on developing a framework?

    1. I think finishing any of those large courses can give you enough hints about what you need to learn to create your own framework. However, I guess none of them can teach you everything required.
      Reading code after finishing one course is probably a good way to start. For example, the tinyflow you mentioned is probably a good starting point.

  2. Hello!
    Excuse me for troubling you.
    Reading the comments under the post, I had a question: how can I implement a framework for recognizing objects using R-CNN (or another technique)? What would you recommend? And is it possible to add this feature to pytensor?

    1. Hi Ban,
      I guess you can modify this framework to implement your R-CNN.
      The basic structure does not need to change much: you still need the backward/forward interface and tensor objects.
      However, there are lots of things that are not included yet. For example, you would need to implement a CNN layer (ideally with GPU support) and region proposals.
      I would recommend finishing one of those large CV courses first; then you will probably understand what you need to implement.

      1. Hello!
        Thank you so much for the answer. Little by little, I have started working on this. I'm waiting for your article on writing a framework.
        Thanks again!
