Implement a deep learning framework: Part 1 – Implement Variable and Operation

Deep learning frameworks have recently attracted a lot of interest because they offer an easy way to define both static graphs (e.g. tensorflow, CNTK) and dynamic graphs (e.g. pytorch, dynet). They also save users a lot of time by performing automatic differentiation for them.

However, these sophisticated frameworks wrap their logic so deeply that it is hard to grasp what is happening inside them. Prototyping a new native operation for them also seems to be an arduous task.

As a result, I decided to create a new deep learning framework, for two purposes: (1) to build a lightweight framework that deepens my understanding, and (2) to make it easier to prototype a new operation. In this series of blog posts, I will describe my ongoing project pytensor, a deep learning framework written in pure numpy. The following figure shows a typical architecture in the framework.

In this first article, I will describe the basic modules of the framework such as variable, operation and optimizer.

Modules of the framework

Before implementing the pieces of the framework, we should define its modules and main concepts.


Tensor is the lowest-level concept in this framework: it is just a numpy array. We will draw a tensor as a green square in this series.


Variable is the basic class in the computation graph. It is used to pass values through the graph, feed inputs into the graph and update gradients during training.

The difference between a Variable and a numpy array is that a Variable holds two numpy arrays: one for the forward value and one for the backward gradient, as in the following code.

import numpy as np

class Variable:
    """
    Variable is the basic structure in the computation graph.
    It holds value for forward computation
    and grad for backward computation.
    """

    def __init__(self, value, name='Variable', trainable=True):
        """
        :param value: numpy val
        :param name: name for the variable
        :param trainable: whether the variable can be trained or not
        """

        # value for forward computation
        self.value = np.array(value)

        # value for backward computation
        self.grad = np.zeros(self.value.shape)

        self.name = name

        self.trainable = trainable

The value of each variable will be set during the forward computation, and the grad will be updated during the backward computation.
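As a minimal usage sketch, the snippet below creates two variables and inspects their fields. The class is repeated in compact form so the snippet runs on its own, and the names `x` and `w` are illustrative only, not part of pytensor.

```python
import numpy as np


class Variable:
    """Compact copy of the Variable class, repeated so this snippet runs standalone."""

    def __init__(self, value, name='Variable', trainable=True):
        self.value = np.array(value)              # forward value
        self.grad = np.zeros(self.value.shape)    # backward gradient, same shape
        self.name = name
        self.trainable = trainable


# an input variable: it only carries data, so it is marked as not trainable
x = Variable([[1.0, 2.0], [3.0, 4.0]], name='x', trainable=False)

# a trainable variable (illustrative name)
w = Variable([0.5, -0.5], name='w')

print(x.value.shape, x.grad.shape)   # both are (2, 2)
print(w.grad)                        # [0. 0.]

# a backward pass accumulates into grad; value is left untouched
w.grad += np.array([0.1, 0.2])
print(w.value, w.grad)
```

The gradient always starts as zeros with the same shape as the value, so a backward pass can accumulate into it without any shape checks.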

There are three ways to create variables:

  • Variables can be instantiated directly by users as inputs or targets.
  • Variables can be retrieved using Parameter (it will be defined later). This is for managing trainable variables easily.
  • Variables can be created as a result of an operation (as the output variable)


To pass variables through the graph, we define an interface that every operation should implement.

class Operation:
    """
    An interface that every operation should implement
    """

    def forward(self, input_variables):
        """
        forward computation

        :param input_variables: input variables
        :return: output variable
        """
        raise NotImplementedError

    def backward(self):
        """
        backprop loss and update
        """
        raise NotImplementedError

The Operation should define two methods: forward and backward. The forward method computes the operation itself and generates a new output variable. In the backward method, we assume that the gradient has already been back-propagated into the output variable, and the operation continues to back-propagate it into its input variables.
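To make this contract concrete, here is a sketch of an element-wise product. `Multiply` is a hypothetical operation written for illustration only and is not taken from pytensor. It skips any graph bookkeeping and repeats a compact `Variable` so the snippet runs on its own.

```python
import numpy as np


class Variable:
    """Compact copy of the Variable class, repeated so this snippet runs standalone."""

    def __init__(self, value, name='Variable', trainable=True):
        self.value = np.array(value)
        self.grad = np.zeros(self.value.shape)
        self.name = name
        self.trainable = trainable


class Multiply:
    """Hypothetical element-wise product of two variables (illustration only)."""

    def forward(self, input_variables):
        # remember the inputs: backward needs their forward values
        self.input_variables = input_variables
        a, b = input_variables
        self.output_variable = Variable(a.value * b.value)
        return self.output_variable

    def backward(self):
        # output_variable.grad is assumed to be filled in already
        a, b = self.input_variables
        grad_out = self.output_variable.grad
        a.grad += grad_out * b.value   # d(a*b)/da = b
        b.grad += grad_out * a.value   # d(a*b)/db = a


a = Variable([1.0, 2.0])
b = Variable([3.0, 4.0])
op = Multiply()
out = op.forward([a, b])

# pretend a later operation (or a loss) wrote this gradient into the output
out.grad = np.ones(2)
op.backward()

print(out.value)  # [3. 8.]
print(a.grad)     # [3. 4.]
print(b.grad)     # [1. 2.]
```

The operation only reads `output_variable.grad` and writes into the gradients of its inputs, which is what lets operations be chained in reverse order during the backward pass.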

In the following diagram, forward updates the green tensor (value) on the left side of each variable, and backward then updates the tensor (grad) on the right side.

Here is an implementation of a typical add operation.

class Add(Operation):

    def __init__(self, name='add', argument=None, graph=None):
        # the Operation interface above defines no constructor,
        # so the settings are stored directly on the instance
        self.name = name
        self.argument = argument
        self.graph = graph

    def forward(self, input_variables):
        """
        Add all variables in input_variables

        :param input_variables: list of variables to sum
        :return: output variable
        """
        # remember the inputs, backward needs them
        self.input_variables = input_variables

        # value for the output variable
        value = np.zeros_like(self.input_variables[0].value)

        for input_variable in self.input_variables:
            value = value + input_variable.value

        self.output_variable = Variable(value)

        return self.output_variable

    def backward(self):
        """
        backward grad into each input variable
        """
        for input_variable in self.input_variables:
            input_variable.grad += self.output_variable.grad

The forward method computes the following sum.
$$ V_{out} = \sum_{i}{V_i} $$

According to the chain rule, the gradient with respect to each input variable equals the gradient with respect to the output, because $\partial V_{out} / \partial V_i = 1$. So we just add the output variable’s gradient to each input variable’s gradient.

$$ \frac{\partial \mathcal{L}}{\partial V_i} = \frac{\partial \mathcal{L}}{\partial V_{out}} \cdot \frac{\partial V_{out}}{\partial V_i} = \frac{\partial \mathcal{L}}{\partial V_{out}} $$
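The reason the gradient is accumulated with `+=` rather than assigned shows up when the same variable feeds an operation more than once. The sketch below is a standalone version of the add operation without graph bookkeeping. The variable names and the all-ones output gradient are assumptions made for the example.

```python
import numpy as np


class Variable:
    """Compact copy of the Variable class, repeated so this snippet runs standalone."""

    def __init__(self, value, name='Variable', trainable=True):
        self.value = np.array(value)
        self.grad = np.zeros(self.value.shape)
        self.name = name
        self.trainable = trainable


class Add:
    """Standalone sketch of the add operation (no graph bookkeeping)."""

    def forward(self, input_variables):
        self.input_variables = input_variables
        value = np.zeros_like(input_variables[0].value)
        for input_variable in input_variables:
            value = value + input_variable.value
        self.output_variable = Variable(value)
        return self.output_variable

    def backward(self):
        # each use of an input receives the full output gradient
        for input_variable in self.input_variables:
            input_variable.grad += self.output_variable.grad


x = Variable([1.0, 2.0], name='x')
y = Variable([10.0, 20.0], name='y')

add = Add()
out = add.forward([x, x, y])   # x is used twice
out.grad = np.ones(2)          # stand-in for dL/dV_out
add.backward()

print(out.value)  # [12. 24.]
print(x.grad)     # [2. 2.]  (one contribution per use)
print(y.grad)     # [1. 1.]
```

With plain assignment, the second use of `x` would overwrite the first contribution and its gradient would come out as 1 instead of 2.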


The Loss class is a special type of operation. In addition to the forward and backward functions, it should implement a loss function, which takes the target variable as input and returns a scalar loss value. When backward is called, it computes the gradient of the loss with respect to its input, using this target, and the gradient can then be back-propagated through the rest of the graph.

class Loss(Operation):

    def forward(self, input_variables):
        raise NotImplementedError

    def backward(self):
        raise NotImplementedError

    def loss(self, target):
        raise NotImplementedError

Using this interface, we can define a simple square error loss function.
$$ \mathcal{L}_{square} = \frac{1}{2}\sum_{i}{(y_i - t_i)^2} $$

where $y_i$ denotes the i-th output and $t_i$ denotes the i-th target.

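from sklearn.metrics import mean_squared_error  # assumed source of mean_squared_error

class SquareErrorLoss(Loss):

    def __init__(self, name="SquareErrorLoss"):
        self.name = name

    def forward(self, input_variable):
        self.input_variable = input_variable

    def loss(self, target):
        self.target = target
        # reports the mean of the squared errors, which differs from
        # the formula above only by a constant scale factor
        loss_val = mean_squared_error(self.input_variable.value,
                                      self.target.value)
        return loss_val

    def backward(self):
        # update grad
        self.input_variable.grad = self.input_variable.value - self.target.value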

        # back prop
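To check that the difference between output and target really is the gradient of the square error formula, a numpy-only sketch can compare it with finite differences. The function names and sample values below are illustrative assumptions and are not part of pytensor.

```python
import numpy as np


def square_error(y, t):
    """0.5 * sum((y - t)^2), matching the square error formula."""
    return 0.5 * np.sum((y - t) ** 2)


def square_error_grad(y, t):
    """Analytic gradient of the loss with respect to y."""
    return y - t


def numeric_grad(f, y, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(y)
    for i in range(y.size):
        step = np.zeros_like(y)
        step.flat[i] = eps
        grad.flat[i] = (f(y + step) - f(y - step)) / (2 * eps)
    return grad


y = np.array([0.5, 2.0, -1.0])   # outputs
t = np.array([1.0, 2.0, 1.0])    # targets

loss_val = square_error(y, t)                        # 0.5 * (0.25 + 0 + 4) = 2.125
analytic = square_error_grad(y, t)                   # [-0.5, 0.0, -2.0]
numeric = numeric_grad(lambda v: square_error(v, t), y)

print(loss_val)
print(analytic)
print(np.allclose(analytic, numeric, atol=1e-5))     # True
```

The factor of one half in the formula is what makes the gradient exactly `y - t`, with no stray factor of 2.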


      1. What do I need to know to create a deep learning framework for working with computer vision, and where is the best place to study it? Please recommend learning resources (tensors, autograd, NN abstractions, optimizers, data pipeline, training loop abstraction, CNN, distributed training, graph compilation, …), as well as anything else I need to know in order to create my own deep learning framework. This is all out of interest and for educational purposes. Is this the end of the series, or will it be continued?

        thank you
        with respect

        1. Hi Denis,

          Thanks for your comments! I guess the answer depends on which level you are interested in implementing.

          If you are only interested in learning the basic workflow of a CV framework and want to implement a toy framework like this one, a good starting point might be the assignments of Stanford CS231n. They implement lots of essential primitives for CV that I do not support yet (e.g. cython-based im2col for faster CNN operations). The drawback is that the structure of their implementation is not very object-oriented, which makes it harder to build on top of it (though it is probably fine for assignment purposes). Maybe you can reuse some components of this pytensor framework and incorporate primitives from those assignments. This will not allow you to implement realtime pipelines (e.g. R-CNN object detection), but it should be enough to understand how they work.

          If you intend to implement something actually useful, then numpy is not enough (too slow). You probably need to use a GPU somehow and look at the implementations of more practical frameworks. For example, a good starting point might be replacing numpy with cupy from chainer.

          If you have enough time, you should consider migrating your implementation entirely from python to C++ for better memory management, thread management and GPU integration. In particular, you would end up implementing two kernels for each operation in C++: CPU kernel with SIMD asm (e.g: SSE, AVX) and an equivalent GPU kernel by calling cuda/cudnn interface. After that, you need to wrap your C++ kernel into python by using cython or pybind11. A good reference is dynet’s C++ implementation. Its code is relatively simple and easy to understand. At this level, you should be able to try many interesting things like how to distribute CPU kernel into multicore with pthread, how to apply garbage collection to your tensors.

          If you are interested in much more complicated things, you should probably check the low-level libraries used by tensorflow and pytorch. For example, if you want to learn how multi-GPU training works, you should dig deeper into frameworks such as horovod (implemented with nccl primitives) or baidu ringallreduce (implemented with cuda). If you have multiple GPU-equipped machines, maybe you can think about how to serialize your tensors and how to communicate efficiently with network protocols (tcp, or you can implement your own faster protocol).

          These were the actual steps I planned to follow to expand this framework, but I have not had enough time to implement them yet, so I have no idea when I will continue with the next article.

  1. thanks for the answer.
    Will CS231n be enough to implement object detection (with its own framework)? C++ and Python are not a problem for me.
    What would you recommend except CS231n?
    What do you think of the fastai course?
    Regards Denis

    1. Hi Denis,

      I think the lecture contents of CS231n should be enough to understand how to implement object detection, but its own “framework” is far from doing any decent object detection. As you can see in their assignments, they switch to pytorch or tensorflow after assignment 2. If you want to implement something practical, you cannot use their framework as it is. The fastai course looks good too. Another related course I know is the deep learning course of CMU, which also has similar assignments to implement NN from scratch.

      1. Thanks!
        which specific repository would you recommend from CS231n? and how would you advise improving this framework for object recognition?
        if you know more courses where everyone writes from scratch then please share.
        Regards Denis

          1. Hello
            I apologize for getting into the conversation but I have a question.
            Is their framework just too slow to be used for recognizing objects? Or do I need to use C++ and a GPU, or something else?

          2. Hi Deran,

            I do not have much experience in training object detection, but from my experience of working with speech processing, training with only a CPU is too slow to implement anything interesting. You can probably finish training a tiny MNIST model with it, but it’s hard to implement a large model.

            To give you a short comparison, a modern GPU such as the 2080Ti gives you computing power of around 10TFlops. On the other hand, a numpy-based framework will only allow you to use a single CPU core. If you are using a decent Intel CPU, you probably get at least one AVX-512 unit per core, which can execute 16 single-precision FMAs, i.e. 32 floating-point operations, per cycle. Combined with a frequency of around 3GHz, that is roughly 100GFlops per unit in the most ideal scenario, or about 200GFlops for a core with two such units. Compared with that, the GPU is still 50x to 100x faster.

            If you can spend days or even weeks training your object detection model, then the current framework would probably be enough. Otherwise, I would suggest finding a way to use a GPU.

  2. Thank you very much for your help. and forgive me for the time spent. I hope you will have time to continue articles about pytensor.

  3. Hey.
    thanks for the answer.
    since you’re talking about speech recognition then after Cs231n (there will be enough skills)
    will it be possible to implement speech recognition and machine translation and subtitles for video from scratch (building your own framework)?

    1. Yes, I guess you should gain basic ideas on how to do that. But there are still lots of details you need to learn by reading other people’s code on Github.

        1. I am not very familiar with machine translation, but for speech recognition, you can read espnet, kaldi, eesen, fairseq, wav2letter, deepspeech…
