Implement a deep learning framework Part 3 – automatic differentiation

In this part, we will continue to polish the pytensor framework by implementing two differentiation methods: numerical differentiation and automatic differentiation. Both are important for the framework. Numerical differentiation is required to validate gradient propagation when implementing a backward method, and automatic differentiation enables users to create a new graph without explicitly specifying any back-propagation logic.

At the end of this article, we will combine the two differentiation methods to create a three-layer feed-forward neural network. Instead of specifying the entire backward logic, we will use the automatic differentiation mechanism to compute all gradients automatically, and then apply numerical differentiation to validate those gradients.


Methods of computing gradients can be largely classified into three groups: numerical differentiation, symbolic differentiation, and automatic differentiation. We will implement numerical differentiation and automatic differentiation with numpy in this article. Symbolic differentiation, on the other hand, is less important than the other two in deep learning because of its memory inefficiency, so we will skip it here (sympy can be used for symbolic differentiation if necessary).

Automatic Differentiation

Automatic differentiation is used to back-propagate gradients automatically without any manual specification. In the previous part, we explained that a graph is composed of multiple variables and operations. To make automatic differentiation work, we must make sure that each variable and operation can back-propagate its gradient into its connected nodes.

The desired back-propagation works as follows: first, we finish the forward computation and use the loss function to retrieve the loss value. Then we call the backward function of the loss operation to set up the gradient of the last variable or operation. That node then continues to propagate the error into previous operations and variables until every node in the graph has received its gradient.

The backward behaviors of variables and operations are slightly different. Operations usually back-propagate their gradients immediately, but variables might not. The situation is illustrated in the previous figure: there is a single variable $V$ in the graph, produced by the input OP and consumed by three dependency OPs. The important point is that $V$ should not back-propagate its error to the input OP until it has received errors from all three dependency OPs.
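To make this concrete, here is a minimal sketch of the accumulation idea (a hypothetical `SharedVariable` class for illustration, not pytensor's actual implementation): the variable sums gradients from its consumers and only signals that it is ready to back-propagate once every consumer has reported.

```python
import numpy as np

class SharedVariable:
    """Hypothetical sketch: a variable consumed by several operations
    that must wait for all of them before propagating its gradient."""

    def __init__(self, value, num_consumers):
        self.value = value
        self.num_consumers = num_consumers
        self.grad = np.zeros_like(value)
        self.received = 0

    def receive_grad(self, grad):
        # accumulate the gradient from one consumer
        self.grad += grad
        self.received += 1
        # ready to back-propagate only when every consumer has reported
        return self.received == self.num_consumers

v = SharedVariable(np.ones(2), num_consumers=3)
print(v.receive_grad(np.array([1.0, 1.0])))  # False
print(v.receive_grad(np.array([2.0, 2.0])))  # False
print(v.receive_grad(np.array([3.0, 3.0])))  # True
print(v.grad)                                # [6. 6.]
```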

One typical solution to this dependency issue is to topologically sort the entire graph and back-propagate each variable and operation in that order. In our implementation, we use an operation stack to maintain this topological order.

During the training phase, whenever we compute an operation in the forward direction, we push it onto the operation stack. When all forward computations are finished, we pop and back-propagate the operations in the stack one by one. The relevant source code is shown as follows.

class Graph:

    def __init__(self, name):
        """graph is a data structure to manage parameters and execution order"""

        # graph name
        self.name = name

        # state
        self.train_state = True

        # forward operation stack
        self.forward_ops = []

    def forward(self, ops):
        """register a forward ops to the stack (for backward computation)

        :param ops: the operation that has just finished its forward pass
        """
        if self.train_state:
            self.forward_ops.append(ops)

    def backward(self):
        """back-propagate the error in reverse topological order"""

        for ops in reversed(self.forward_ops):
            ops.backward()

        # clear all operations
        self.forward_ops.clear()

In this case, we can make sure that every variable will have received all of its gradients before it back-propagates to earlier operations. In the previous figure's example, the stack after all forward computations will contain the input OP first, followed by the three dependency OPs. This guarantees that the input OP will not be back-propagated until the three dependency OPs have finished their back-propagations.
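The stack discipline can be demonstrated with toy operations that simply record the order in which their backward methods are called (the names here are illustrative, not part of pytensor):

```python
class ToyOp:
    """A stand-in operation that logs when its backward is invoked."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def backward(self):
        self.log.append(self.name)

# push operations in forward order: the producer first, then its consumers
forward_ops = []
log = []
for name in ["input", "dep1", "dep2", "dep3"]:
    forward_ops.append(ToyOp(name, log))

# pop in reverse: every consumer runs backward before the producer
for op in reversed(forward_ops):
    op.backward()

print(log)  # ['dep3', 'dep2', 'dep1', 'input']
```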

Numerical Differentiation

Numerical differentiation is used to validate whether a back-propagation implementation is correct. This functionality proves extremely useful for complex operations, such as LSTM and CTC, which we will cover in later articles.

Suppose that the function of an operation is $f_{op}$, its gradient is defined theoretically as follows

$$ \frac{\partial f_{op}(x)}{\partial x} = \lim_{h\to 0} \frac{f_{op}(x+h)-f_{op}(x)}{h} $$.

Instead of using the theoretical definition, we can approximate the gradient with the central finite difference method.

$$ \frac{\partial f_{op}(x)}{\partial x} \approx \frac{f_{op}(x+h)-f_{op}(x-h)}{2h} $$.

where $h$ is a small number such as 0.0001.
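As a quick sanity check of the central difference formula, we can compare it against a known analytic derivative for a simple function (the helper name `central_diff` is ours, not part of pytensor):

```python
def f(x):
    return x ** 3

def central_diff(f, x, h=1e-4):
    # central finite difference approximation of df/dx
    return (f(x + h) - f(x - h)) / (2 * h)

x = 2.0
approx = central_diff(f, x)
exact = 3 * x ** 2  # analytic derivative of x^3

# for x^3 the approximation error is exactly h^2, far below 1e-6
print(abs(approx - exact) < 1e-6)  # True
```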

In our implementation, we can validate all trainable variables in the Parameter dictionary we introduced in the last article. First, we compute the automatic gradient using the pre-defined back-propagation functions. Then the numerical gradient is computed for each variable and compared with the automatic one. We can say the automatic gradient is correct if their absolute difference is small enough.

def numerical_gradient_check(model, input_variables, target_variable):
    """validate numerical gradient with respect to model

    :param model: the model we want to validate
    :param input_variables: input variables for gradient check
    :param target_variable: target variable for gradient check
    """

    # clear gradients before validation
    for var in model.graph.parameter.variable_dict.values():
        var.grad = np.zeros_like(var.value)

    # run normal procedures to compute the automatic gradient
    model.forward(input_variables)
    model.loss(target_variable)
    model.graph.backward()

    # variables we need to check
    variable_dict = model.graph.parameter.variable_dict

    for var_name, var in variable_dict.items():
        print("Now checking ", var_name)

        h = 1e-4
        v = var.value

        numerical_grad = np.zeros_like(v)

        # compute the numerical gradient of this variable element by element
        it = np.nditer(v, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            idx = it.multi_index
            tmp_val = v[idx]

            # f(x+h)
            v[idx] = float(tmp_val) + h
            model.forward(input_variables)
            loss_1 = model.loss(target_variable)

            # f(x-h)
            v[idx] = tmp_val - h
            model.forward(input_variables)
            loss_2 = model.loss(target_variable)

            numerical_grad[idx] = (loss_1 - loss_2) / (2 * h)

            # restore the original value
            v[idx] = tmp_val

            # clear the operations pushed by the extra forward passes
            model.graph.forward_ops.clear()

            it.iternext()

        # compare the numerical grad with the auto grad
        diff = np.sum(np.abs(var.grad - numerical_grad))

        if diff < 0.001:
            print(var_name, " match")
        else:
            print(var_name, " NOT MATCH !")
            print("Auto grad\n", var.grad)
            print("Numerical grad\n", numerical_grad)
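The same check pattern also works outside the framework. Here is a self-contained example (plain numpy, not using pytensor) that validates a hand-derived gradient of a small least-squares loss against the central difference approximation:

```python
import numpy as np

# loss(w) = sum((w * x - y)**2), a tiny element-wise least-squares problem
x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, 1.0, 1.5])
w = np.array([0.2, -0.3, 0.1])

def loss(w):
    return np.sum((w * x - y) ** 2)

# hand-derived gradient: d/dw_i = 2 * (w_i * x_i - y_i) * x_i
analytic_grad = 2 * (w * x - y) * x

# numerical gradient via central differences, one element at a time
h = 1e-4
numerical_grad = np.zeros_like(w)
for i in range(w.size):
    tmp = w[i]
    w[i] = tmp + h
    loss_1 = loss(w)
    w[i] = tmp - h
    loss_2 = loss(w)
    numerical_grad[i] = (loss_1 - loss_2) / (2 * h)
    w[i] = tmp  # restore

diff = np.sum(np.abs(analytic_grad - numerical_grad))
print(diff < 0.001)  # True
```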


Multilayer Perceptron

Next, we will demonstrate how to use all pieces defined so far to implement a multilayer perceptron (MLP) model.

The MLP model we implemented is a common three-layer neural network. Its implementation is defined as follows.

class MLP:

    def __init__(self, input_size, hidden_size, output_size):
        self.graph = Graph("MLP")

        # make graph
        self.affine1 = self.graph.get_operation('Affine', [input_size, hidden_size])
        self.sigmoid = self.graph.get_operation('Sigmoid')
        self.affine2 = self.graph.get_operation('Affine', [hidden_size, output_size])
        self.softmaxloss = self.graph.get_operation('SoftmaxLoss')

    def forward(self, input_variable):
        affine1_variable = self.affine1.forward(input_variable)
        sigmoid_variable = self.sigmoid.forward(affine1_variable)
        affine2_variable = self.affine2.forward(sigmoid_variable)

        return self.softmaxloss.forward(affine2_variable)

    def loss(self, target_variable):
        return self.softmaxloss.loss(target_variable)

We define the model using a graph structure that contains all information about the model. The only parts we need to implement are the forward logic and the loss function. Note that a backward method is no longer necessary, because the gradients can be computed automatically thanks to the graph class.

We can validate the MLP model by checking its gradients, then use it to recognize the digits we mentioned in the previous article.

It performs slightly better than the linear model, reaching 97 percent accuracy on the test dataset.

=== Epoch  0  Summary ===
train loss:  3.14572418426
test accuracy  0.47333333333333333
=== Epoch  1  Summary ===
train loss:  2.3808683056
test accuracy  0.78
...
=== Epoch  38  Summary ===
train loss:  0.0995458441227
test accuracy  0.9711111111111111
=== Epoch  39  Summary ===
train loss:  0.0969445379643
test accuracy  0.9711111111111111
