
sadiahzahoor

What is Autograd?

Updated: Jun 1, 2023


As AI keeps coming up in our daily conversations, it's drawing more and more attention and curiosity about both the power and the potential of machine learning. I have also been trying to get my hands dirty with PyTorch by building some basic deep learning models. This article is suitable for people who are already somewhat familiar with the basic algorithms at the heart of machine learning and their associated terminology.


When I started using PyTorch, I naturally wondered why people would prefer to train models using the torch library rather than just NumPy. Well, one of the main reasons is a feature called autograd.

So, what is autograd?

Autograd is PyTorch's built-in mechanism for computing and keeping track of multiple partial derivatives. This is exactly what we need for back-propagation, the algorithm that is absolutely crucial for training machine learning models. It simplifies our work considerably, improving both efficiency and accuracy.

Example to understand autograd
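
Here is a minimal sketch of such an example (the specific values of a and b are illustrative choices of mine, and f is chosen so that its gradients match the ones discussed below):

```python
import torch

# Two tensors with gradient tracking enabled
a = torch.tensor([2.0, 3.0], requires_grad=True)
b = torch.tensor([4.0, 5.0], requires_grad=True)

# A function of both tensors: f = 2*a**2 + b**2,
# so df/da = 4a and df/db = 2b
f = 2 * a**2 + b**2

# Gradient of f with respect to itself, same shape as f
grad = torch.ones_like(f)

# Back-propagate: autograd computes df/da and df/db for us
f.backward(grad)

print(a.grad)  # tensor([ 8., 12.])  -> 4a
print(b.grad)  # tensor([ 8., 10.])  -> 2b
```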



In this example, we define two torch tensors, a and b, with requires_grad=True to indicate that we want their gradients to be tracked.


Next, we define a function f that depends on both a and b. It can easily be seen that the gradient of f with respect to a is 4a and with respect to b is 2b. But even if it were complicated, we wouldn't need to work it out ourselves: autograd will compute it for us once back-propagation is initiated.


By default, PyTorch accumulates gradients across backward passes, which is why gradients are usually zeroed between training iterations. Here, because f is not a scalar, we also need to pass an explicit gradient to backward(). For this reason, we define the grad tensor, which has the same shape as f. Its entries represent the gradient of f with respect to itself. When back-propagating through f, the gradients of f with respect to a and b are multiplied by the values in the grad tensor.


After calling f.backward(grad), the gradients of f with respect to a and b are stored in the .grad attribute of each tensor.

What makes autograd special?

The most powerful aspect of autograd is that it works on the fly, during execution (at runtime). It keeps track of all operations on tensors whose gradients are set to be tracked by building a computational graph of every step. Through this graph, it analyses the dependencies between tensors and traces the exact operations that created them. This dynamic behaviour allows autograd to perform automatic differentiation and compute gradients accurately, even if the model changes from one pass to the next.
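
As a small illustration (the numbers here are my own toy example), the graph follows ordinary Python control flow, so the recorded operations can differ from one forward pass to the next:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# The graph is built as the operations run, so ordinary Python
# control flow decides which operations get recorded.
if x > 2:
    y = x ** 2   # this branch is taken here: dy/dx = 2x
else:
    y = x ** 3

y.backward()
print(x.grad)  # tensor(6.)
```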

A simple example to demonstrate the use of autograd while training a model.

Step 1: We import the torch library and generate sample data for our learning model. Our input tensor X contains seven inputs and our output tensor Y contains the seven respective outputs.
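
A sketch of this step (the exact input values are my own assumption; any seven inputs paired with outputs following the relation given below would do):

```python
import torch

# Seven sample inputs and their corresponding outputs
X = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
Y = torch.tensor([11.0, 21.0, 31.0, 41.0, 51.0, 61.0, 71.0])
```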


This sample data will be used to train our model. It's easy to see that, in our case, we have a linear function relating input values to output values:


Output value = 10 x Input value + 1

(Note that our model is unaware of this relation at this step.)


Step 2: In this step, we define the method that the model will use to predict the output, also known as the forward pass.


We check the output value predicted by the model for an arbitrary input before training. We get zero, as expected, since the weight and bias are initialised to zero.
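
Continuing the sketch, and assuming the weight and bias start at zero (which is what makes the pre-training prediction zero), this step might look like:

```python
# Trainable parameters, assumed to be initialised to zero
w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

def forward(x):
    # Linear model: predicted output = w * x + b
    return w * x + b

# Prediction for an arbitrary input before training
print(forward(torch.tensor(8.0)))  # tensor([0.], grad_fn=<AddBackward0>)
```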



Step 3: We now start training our model. This involves a few steps:

  • First, define the loss function, which gives the error between the predicted output and the expected output. I have used the MSE (mean squared error) loss function here.

  • Second, set the hyperparameters: the learning rate, which denotes our step size during optimisation, and the number of epochs, which denotes how many iterations the model will run over the same data.

  • Finally, specify what the model should do in each epoch. The model generates the output, computes the loss, initiates back-propagation (this is where autograd steps in), updates the weight and bias, and then zeroes their corresponding gradients before the next epoch begins.

  • Then we print the results after every 50 epochs and observe, as shown in the sketch after this list.
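
A sketch of such a training loop, continuing from the previous steps (the learning rate and number of epochs here are my own illustrative choices):

```python
# Loss function: mean squared error between predictions and targets
def loss_fn(y_pred, y_true):
    return ((y_pred - y_true) ** 2).mean()

# Hyperparameters (illustrative values)
learning_rate = 0.02
epochs = 1000

for epoch in range(epochs):
    # Forward pass: predict outputs for all inputs
    y_pred = forward(X)

    # Compute the loss against the expected outputs
    loss = loss_fn(y_pred, Y)

    # Back-propagation: autograd computes and stores w.grad and b.grad
    loss.backward()

    # Update the parameters; no_grad() keeps the update out of the graph
    with torch.no_grad():
        w -= learning_rate * w.grad
        b -= learning_rate * b.grad

    # Zero the gradients before the next epoch (PyTorch accumulates them)
    w.grad.zero_()
    b.grad.zero_()

    if (epoch + 1) % 50 == 0:
        print(f"epoch {epoch + 1}: w = {w.item():.3f}, b = {b.item():.3f}, loss = {loss.item():.5f}")
```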


Step 4: It's time to see the training results. It's clear from the results that, as the loss reduces in each epoch, the weight (w) gets closer to 10 and the bias (b) gets closer to 1. This means that the model has approximately discovered the relation underlying the sample data and can now be tested on some new data.


We check the output value predicted by the model for a new input after training. We get 81 for the input value 8, which is precisely the expected value (10 x 8 + 1).
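
In the sketch above, this check would be:

```python
# Prediction for a new input after training
print(forward(torch.tensor(8.0)))  # expected to be close to 81, i.e. 10 * 8 + 1
```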




I hope this article was useful. If you are a beginner and have any questions regarding the code, feel free to connect or leave comments.

 
