In this post, we continue implementing the basic components from the previous post: we will define the Graph, Parameter, and Optimizer classes.
Almost all deep learning frameworks are built around the concept of a computational graph, and we use the same terminology to define the architecture of the computation. A typical graph is depicted in the following figure.
Another important class is Parameter. The Parameter class manages a set of trainable variables. When we want to define a new trainable variable in the graph, we retrieve it from the Parameter class instead of instantiating the variable directly. The relationship between the graph and the parameter is shown in the following figure.
This separation offers two advantages:
- It gives the optimizer a convenient way to update variables, since the Parameter class holds all the trainable variables.
- It allows a variable to be shared across different parts of the graph. This is crucial for RNNs and LSTMs, where we want to apply the same weight variables at different time steps.
```python
class Parameter:
    """
    Parameter is a structure to manage all trainable variables in the graph.
    Each trainable variable should be initialized using Parameter.
    """
    def __init__(self):
        # a dictionary mapping names to variables
        self.variable_dict = dict()

    def get_variable(self, name, shape):
        """
        retrieve a variable by name, creating it on first use
        :param name: name of the variable
        :param shape: desired shape
        :return: the (possibly shared) Variable
        """
        if name in self.variable_dict:
            # if the variable exists in the dictionary,
            # retrieve it directly
            return self.variable_dict[name]
        else:
            # if not created yet, initialize a new variable,
            # scaling the random init by the fan-in
            value = np.random.standard_normal(shape) / np.sqrt(shape[0])
            variable = Variable(value, name=name)
            # register the variable
            self.variable_dict[name] = variable
            return variable

    def clear_grads(self):
        """
        clear gradients of all variables
        :return:
        """
        for k, v in self.variable_dict.items():
            v.clear_grad()
```
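To see the sharing behaviour in action, here is a small, self-contained sketch. The `Variable` and `Parameter` definitions below are minimal stand-ins mirroring the classes above, so the example runs on its own; the key point is that calling `get_variable` twice with the same name returns the same object.

```python
import numpy as np

# minimal stand-ins mirroring the Variable/Parameter classes above
class Variable:
    def __init__(self, value, name=None):
        self.value = value
        self.name = name
        self.grad = np.zeros_like(value)
        self.trainable = True

    def clear_grad(self):
        self.grad = np.zeros_like(self.value)

class Parameter:
    def __init__(self):
        self.variable_dict = dict()

    def get_variable(self, name, shape):
        if name not in self.variable_dict:
            value = np.random.standard_normal(shape) / np.sqrt(shape[0])
            self.variable_dict[name] = Variable(value, name=name)
        return self.variable_dict[name]

parameter = Parameter()
# e.g. an RNN asking for its weight at two different time steps
w1 = parameter.get_variable('rnn_weight', [4, 4])
w2 = parameter.get_variable('rnn_weight', [4, 4])
print(w1 is w2)  # True: both time steps share one variable
```

Because both calls return the same `Variable`, gradients from every time step accumulate into a single `grad` buffer, which is exactly what weight sharing requires.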
Finally, we define the optimizer class, which updates the variables held in the Parameter. The optimizer iterates over all trainable variables and updates their values based on their gradients. In this article, we implement the stochastic gradient descent (SGD) optimizer.
```python
class SGD:
    def __init__(self, parameter, lr=0.001):
        self.parameter = parameter
        self.lr = lr

    def update(self):
        for param_name in self.parameter.variable_dict.keys():
            param = self.parameter.variable_dict[param_name]
            # take a gradient descent step on each trainable variable
            if param.trainable:
                param.value -= self.lr * param.grad
        # clear all gradients for the next iteration
        self.parameter.clear_grads()
```
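A quick sanity check of the update rule, again using minimal stand-ins for the classes above: after one step with learning rate 0.1 and gradient 10, the value should decrease by 1, and the gradient should be cleared.

```python
import numpy as np

# minimal stand-ins mirroring the classes above
class Variable:
    def __init__(self, value):
        self.value = value
        self.grad = np.zeros_like(value)
        self.trainable = True

    def clear_grad(self):
        self.grad = np.zeros_like(self.value)

class Parameter:
    def __init__(self):
        self.variable_dict = dict()

    def clear_grads(self):
        for v in self.variable_dict.values():
            v.clear_grad()

class SGD:
    def __init__(self, parameter, lr=0.001):
        self.parameter = parameter
        self.lr = lr

    def update(self):
        for param in self.parameter.variable_dict.values():
            if param.trainable:
                param.value -= self.lr * param.grad
        self.parameter.clear_grads()

parameter = Parameter()
w = Variable(np.array([1.0, 2.0]))
parameter.variable_dict['w'] = w

w.grad = np.array([10.0, 10.0])
SGD(parameter, lr=0.1).update()
print(w.value)  # [0. 1.]  (each entry moved by -0.1 * 10)
print(w.grad)   # [0. 0.]  (cleared after the step)
```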
Linear Regression Model
Equipped with all the components defined previously, we can now implement a linear regression model as an example. A typical model defines the forward pass, the backward pass, and a loss function.
A linear regression model can be implemented as the following code.
```python
class LinearModel:
    def __init__(self, input_size, output_size):
        """
        a simple linear model: y = w*x
        :param input_size: dimension of the input features
        :param output_size: dimension of the output
        """
        # initialize sizes
        self.input_size = input_size
        self.output_size = output_size
        # initialize parameters
        self.parameter = Parameter()
        self.W = self.parameter.get_variable(
            'weight', [self.input_size, self.output_size])
        # ops and loss
        self.matmul = Matmul()
        self.loss_ops = SoftmaxLoss()

    def forward(self, input_variable):
        output_variable = self.matmul.forward([input_variable, self.W])
        self.loss_ops.forward(output_variable)
        return output_variable

    def loss(self, target_variable):
        loss_val = self.loss_ops.loss(target_variable)
        return loss_val

    def backward(self):
        self.loss_ops.backward()
        self.matmul.backward()
```
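The `Matmul` op itself is not shown in this post. Below is a minimal sketch of how such an op could work, with its interface assumed from the `LinearModel` code above: it caches its input variables on `forward` and accumulates gradients into them on `backward`. The `Variable` stand-in is likewise minimal so the example runs on its own.

```python
import numpy as np

# minimal Variable stand-in mirroring the class used elsewhere in the post
class Variable:
    def __init__(self, value, name=None, trainable=True):
        self.value = np.asarray(value, dtype=float)
        self.name = name
        self.trainable = trainable
        self.grad = np.zeros_like(self.value)

# a sketch of a matmul op: y = x @ w
class Matmul:
    def forward(self, inputs):
        # cache the inputs so backward can use them
        self.x, self.w = inputs
        self.output = Variable(self.x.value @ self.w.value)
        return self.output

    def backward(self):
        # chain rule for y = x @ w:
        #   dL/dx = dL/dy @ w^T,  dL/dw = x^T @ dL/dy
        self.x.grad += self.output.grad @ self.w.value.T
        self.w.grad += self.x.value.T @ self.output.grad

x = Variable(np.ones((2, 3)))   # batch of 2 inputs
w = Variable(np.ones((3, 4)))   # weight matrix
matmul = Matmul()
y = matmul.forward([x, w])
y.grad = np.ones_like(y.value)  # pretend the loss gradient is all ones
matmul.backward()
print(w.grad)  # every entry is 2.0: the gradients sum over the batch
```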
The model basically computes $y$ using the following equation.
$$ y = w \cdot x $$
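For the backward pass, the gradients of this matmul with respect to its two inputs follow from the chain rule (with $x$ holding a batch of row vectors):

$$ \frac{\partial L}{\partial w} = x^\top \frac{\partial L}{\partial y}, \qquad \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \, w^\top $$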
Here $w$ is the only trainable variable in the model, and it is retrieved from the Parameter class. When the model receives an input variable, it runs the matmul operation forward to get the output value. It then computes the loss from the output and the target variable using the softmax cross-entropy loss function. Finally, the loss is back-propagated through the entire graph.
To test the model, we use the handwritten digits dataset provided with the scikit-learn package. The dataset contains 1,797 gray-scale 8×8 images, each corresponding to a single digit.
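The dataset can be loaded directly from scikit-learn:

```python
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.images.shape)  # (1797, 8, 8): 8x8 gray-scale images
print(digits.data.shape)    # (1797, 64): the same images flattened
print(digits.target[:5])    # integer digit labels
```

The flattened `data` array is what we feed into the linear model, so `input_size` is 64 and `output_size` is 10 (one score per digit class).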
Here is an example image from the dataset, which the model should recognize as 0.
We use cross-entropy as the loss function for this model, and it converges quickly with the SGD optimizer.
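The `SoftmaxLoss` op is not shown in this post either. As a standalone sketch, here is one common way to compute a numerically stable softmax cross-entropy together with its gradient with respect to the logits; the function name and interface here are illustrative, not the post's actual op.

```python
import numpy as np

def softmax_cross_entropy(logits, targets):
    """Mean cross-entropy loss and its gradient w.r.t. the logits.

    logits: (batch, classes) raw scores
    targets: (batch,) integer class labels
    """
    # subtract the row-wise max for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    batch = logits.shape[0]
    # negative log-likelihood of the correct classes
    loss = -np.log(probs[np.arange(batch), targets]).mean()
    # gradient: softmax probabilities minus the one-hot targets
    grad = probs.copy()
    grad[np.arange(batch), targets] -= 1.0
    return loss, grad / batch

logits = np.array([[2.0, 1.0, 0.1]])
loss, grad = softmax_cross_entropy(logits, np.array([0]))
```

A useful property of this pairing: the gradient of the combined softmax-plus-cross-entropy is simply `probs - one_hot(targets)`, which is why frameworks fuse the two into a single op.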
```
=== Epoch 0 Summary ===
test accuracy 0.8933333333333333
=== Epoch 1 Summary ===
test accuracy 0.9088888888888889
...
=== Epoch 39 Summary ===
test accuracy 0.9688888888888889
```
The full code for this linear model is available here.