Loss Function

Under arbitrary parameter values, the model will likely provide a poor estimate of the output. But what parameter values make a good model? The quality of a model is its ability to explain the data: input and output pairs $(\mathbf{x}_i,\mathbf{y}_i)$ drawn from some underlying joint distribution. We are interested in finding the model that best explains the data, and we measure this quality with a loss function.

Loss function: This function takes the parameter $\theta$ as input and outputs a value that indicates how well the parameter performs on a given dataset. We first define the loss for a single data point $(\mathbf{x}_i, \mathbf{y}_i)$ as $l(\theta|\mathbf{x}_i,\mathbf{y}_i)$. The loss function over the full dataset is then

$$ L(\theta) = E_{(\mathbf{x}_i, \mathbf{y}_i)} \, l(\theta | \mathbf{x}_i, \mathbf{y}_i), $$

where $(\mathbf{x}_i, \mathbf{y}_i)$ are paired input and output data from the dataset.

Let's look at some examples of loss functions:

Log likelihood: A common loss function is the negative log likelihood of the data under the model:

$$ l(\theta|\mathbf{x}_i,\mathbf{y}_i) = -\log P(\mathbf{y}_i | \mathbf{x}_i), $$

where $P$ is the probability distribution of the data under the model with parameter $\theta$. This loss measures how likely the model is to generate the ground-truth output $\mathbf{y}_i$ given the input $\mathbf{x}_i$: the more likely the model is to generate the ground-truth output, the lower the loss value.

Binary classification: For binary classification, we use

$$ l(\theta|\mathbf{x}_i,\mathbf{y}_i) = -\log P(\mathbf{y}_i | \mathbf{x}_i) = -\mathbf{y}_i\log \hat{y}_i - (1-\mathbf{y}_i)\log(1-\hat{y}_i). $$

When $\mathbf{y}_i=1$ the loss is $-\log P(\mathbf{y}_i=1|\mathbf{x}_i)=-\log \hat{y}_i$, and when $\mathbf{y}_i=0$ the loss is $-\log P(\mathbf{y}_i=0|\mathbf{x}_i)=-\log (1 - \hat{y}_i)$. This extends to multi-class classification as

$$ l(\theta|\mathbf{x}_i,\mathbf{y}_i) = -\log P(\hat{y} = \mathbf{y}_i). $$

Regression: For regression, the output of the model is the mean of the distribution. Under the Gaussian assumption (see EC), the regression loss is

$$ l(\theta|\mathbf{x}_i,\mathbf{y}_i) = \frac{1}{2}\|\hat{y} - y\|^2, \quad \text{where } \hat{y} = \mathbf{w}^\top \mathbf{x}+b, \; \theta=(\mathbf{w}, b). $$

We call this the L2 loss, as it measures the L2 distance between the prediction $\hat{y}$ and the ground truth $y$. The more accurate the model's outputs are, the lower the loss value.

Having seen these examples, let's look at some basic properties of the loss function:

Basic properties of the loss function:

- The loss function is a function of the model parameter $\theta$; it measures the quality of the parameter on the dataset. A high loss value means the model performs poorly, while a low loss value means it performs well.
- We can always calibrate what counts as a high or low loss value: a high loss value is the loss of a model making random predictions, and a low loss value is the loss of a model making ground-truth predictions. Through this class you will develop an intuition for this.
- Most loss functions factorize over the dataset: the loss over the full dataset is the average of the losses over the $N$ data points: $$L(\theta) = \frac{1}{N} \sum_i l(\theta|\mathbf{x}_i,\mathbf{y}_i)$$ In some special cases this factorization does not hold (e.g., the CLIP loss). Don't worry: in the first part of the class we will only use loss functions that factorize over the dataset; we will see non-factorized loss functions in the second part of the class.
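To make the factorized loss concrete, here is a minimal numpy sketch (the function names and the tiny dataset are our own, not from the text) that averages a per-example L2 loss over a dataset:

```python
import numpy as np

def l2_loss(w, b, x, y):
    """Per-example loss l(theta | x_i, y_i) for a linear model."""
    y_hat = w.T @ x + b
    return 0.5 * np.sum((y_hat - y) ** 2)

def dataset_loss(w, b, xs, ys):
    """Factorized loss: the average of the per-example losses."""
    return np.mean([l2_loss(w, b, x, y) for x, y in zip(xs, ys)])

# Tiny 1-D example
xs = [np.array([0.0]), np.array([1.0])]
ys = [np.array([1.0]), np.array([0.0])]
w, b = np.array([[1.0]]), np.array([0.0])
print(dataset_loss(w, b, xs, ys))  # mean of 0.5*(0-1)^2 and 0.5*(1-0)^2 = 0.5
```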
For a given model (e.g., a linear model), our goal is to find the parameters that minimize the loss function. We call this process model optimization.

Gradient Descent

To reduce the loss value, we update $\theta$ in the direction that reduces the loss; we call the procedure that does this an optimization algorithm. In this section, we introduce one of the simplest and most commonly used optimization algorithms: gradient descent. In later sections, we will introduce more efficient optimization algorithms such as stochastic gradient descent.

Gradient descent: Gradient descent finds a local optimum of $\theta$ by iteratively updating $\theta$ with small steps along the negative gradient. Each step is expressed as:

$$ \theta^\prime = \theta - \epsilon \bigg[ \frac{\partial{L(\theta)}}{\partial{\theta}} \bigg]^\top. $$

We call $\epsilon$ the learning rate. Under some conditions, each step of gradient descent lowers the loss value $L(\theta)$. Specifically, when 1) $\epsilon$ is small enough, 2) the gradient is not zero, and 3) the loss function is smooth (differentiable) in the neighborhood of $\theta$, each gradient descent step is guaranteed to lower the loss:

$$ L(\theta^\prime) < L(\theta). $$

Recall that the gradient $\frac{\partial{L(\theta)}}{\partial{\theta}}$ is a row vector while $\theta$ is a column vector. We use $[\frac{\partial{L(\theta)}}{\partial{\theta}}]^\top$ to make the shapes of $\theta^\prime$ and $\theta$ consistent.

In pseudocode, the gradient descent algorithm is:

```
θ = random_init()
for iteration in range(n):
    g = ∇L(θ)          # gradient of the loss at the current θ
    θ = θ - ϵ * g.mT   # step along the negative gradient
```

Let's define some notation that we will use in the following sections:

| Symbol | Description |
| --- | --- |
| $L$ | loss function |
| $\theta$ | model parameter |
| $\epsilon$ | learning rate |
| $\nabla_{\theta} L$ | gradient of $L$ with respect to $\theta$ (same as $\frac{\partial{L(\theta)}}{\partial{\theta}}$) |

Gradient descent for deep learning in practice: Mathematicians established the above conditions on the loss function $L$ for gradient descent to work. In practice, however, we often use gradient descent without worrying about these conditions: in deep learning, we routinely optimize non-smooth, non-convex loss functions without a problem. We will discuss how to practically utilize gradient descent for deep learning in the next chapter.
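As a minimal runnable sketch (the quadratic toy objective is our own, not from the text), here is gradient descent minimizing $L(\theta) = (\theta - 3)^2$, whose gradient is $2(\theta - 3)$:

```python
def grad_L(theta):
    # gradient of L(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta = 0.0   # a fixed initialization in place of random_init()
lr = 0.1      # the learning rate epsilon
for iteration in range(100):
    theta = theta - lr * grad_L(theta)
print(theta)  # ≈ 3.0, the minimizer
```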
Next, we introduce properties and issues of gradient descent. To begin with, let's first understand what local and global optima are.

Local optima and global optima: For a function $f(x)$, a local optimum is a point $x$ where the function value is smaller than at its nearby points, though it might be greater than at a distant point. In contrast, a global optimum is a point where the function value is smaller than at all other points.

Let's look at an example of a local optimum and a global optimum:

[Figure: a 1-D curve with a local optimum marked in red and the global optimum marked in green.]

In the figure, the red point is a local optimum, while the green point is the global optimum. The red point has a lower value than all its neighbors, but a higher value than some faraway points. In contrast, the green point has a lower value than all other points.

When optimizing the loss, we aim to find the global optimum of the loss function. In practice, however, we often find only a local optimum. Let's see how this happens with gradient descent.

Gradient descent often finds local optima: Gradient descent often finds a local optimum of $\theta$, not necessarily the global optimum. On one hand, this is because gradient descent is a first-order optimization algorithm: it only considers the gradient of the function at the current point. On the other hand, gradient descent starts from a random initialization of $\theta$. It is therefore likely to get stuck in a local optimum when the loss function is non-convex.

Let's look at an example where different initializations of $\theta$ lead gradient descent to different local optima. Here we consider a 1-dimensional parameter $\theta$; the x-axis shows the value of $\theta$ and the y-axis shows the loss value $L(\theta)$:

[Figure: a non-convex 1-D loss curve with two gradient descent trajectories from different initializations; the green trajectory reaches the global minimum, while the red one ends in a local minimum.]

As the figure shows, when we initialize $\theta$ at different points, the optimization trajectories on $L(\theta)$ can be very different: the green one reaches the global minimum, while the red one only reaches a local minimum. This is problematic because gradient descent initializes $\theta$ randomly. In other words, gradient descent can be very sensitive to the initialization of $\theta$, and choosing a good initialization is important for finding a good local optimum.

If the function is convex, gradient descent will always find the global optimum. In deep learning, however, the loss function is usually non-convex. In this case, gradient descent finds a local optimum whose quality depends on the initialization of $\theta$ and the learning rate $\epsilon$.

As the above examples show, gradient descent has some real issues, and it was long shunned by serious optimization researchers for the following reasons (a small demo of the initialization issue follows this list):

Issues of gradient descent

- Slow convergence: gradient descent can take a long time to converge.
- Local optima on non-convex functions: for non-convex functions, gradient descent can get stuck in local minima, failing to reach the global minimum.
- Hard-to-tune learning rate: if the learning rate is too large, gradient descent can overshoot the minimum and diverge; if it is too small, gradient descent can take a long time to converge.
- Sensitive to initialization: different initializations can lead to very different local minima, which makes it hard to find a good initialization.
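Here is a small runnable sketch (the double-well function is our own toy example) in which the same gradient descent loop, started from two different points, ends in two different minima:

```python
def grad_f(x):
    # f(x) = x**4 - 2*x**2 + 0.5*x has two basins; this is its derivative
    return 4 * x**3 - 4 * x + 0.5

def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x = x - lr * grad_f(x)
    return x

print(descend(-1.5))  # ≈ -1.06: the global minimum
print(descend(+1.5))  # ≈ +0.93: a worse local minimum
```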
Nevertheless, gradient descent is still the most commonly used optimization algorithm in deep learning. It is simple and easy to implement, and it is often good enough for deep learning models. In the next section, we will introduce different variants of gradient descent and practical techniques to make gradient descent work better.

Now, having introduced gradient descent in general, let's use it to optimize different models through some exercises.

Gradient descent on linear regression

Assume we are doing a linear regression task with 1-dimensional input $x$ and output $y$. The model is defined as:

$$\hat{y} = w^\top x + b$$

The loss function is defined as:

$$L(w, b) = \sum_i (\mathbf{y}_i - \hat{y}_i)^2$$

where $\mathbf{y}_i$ is the ground-truth label of the $i$-th data point. We initialize the parameters as $w=1$, $b=0$, and use the dataset shown below:

- $x_0 = 0$, $y_0 = 1$
- $x_1 = 1$, $y_1 = 0$

We use gradient descent to optimize $w$ and $b$ for 1 step with learning rate $\epsilon=1$. What are the updated $w^\prime$ and $b^\prime$ after 1 step of gradient descent?

Solution: To solve this problem, we first compute the gradient of $L(w, b)$ with respect to $w$. Note that

$$ \frac{\partial L(w, b)}{\partial w} = \sum_i - 2(\mathbf{y}_i - \hat{y}_i) \frac{\partial \hat{y}_i}{\partial w} $$

and

$$ \frac{\partial \hat{y}_i}{\partial w} = \frac{\partial (w^\top \mathbf{x}_i + b)}{\partial w} = \mathbf{x}_i $$

Therefore, we have:

$$ \frac{\partial L(w, b)}{\partial w} = \sum_i - 2(\mathbf{y}_i - \hat{y}_i) \mathbf{x}_i $$

At the first step, we have $w=1$ and $b=0$, so $\hat{y}_0 = 0$ and $\hat{y}_1=1$. Plugging these values in:

$$ \frac{\partial L(w, b)}{\partial w} = -2 (y_0 - \hat{y}_0) x_0 - 2 (y_1 - \hat{y}_1) x_1 = -2 (1 - 0) \times 0 - 2 (0 - 1) \times 1 = 2 $$

Therefore, according to the gradient descent update rule,

$$ w^\prime = w - \epsilon \frac{\partial L(w, b)}{\partial w} = 1 - 1 \times 2 = -1 $$

Similarly, we compute the gradient of $L(w, b)$ with respect to $b$:

$$ \frac{\partial L(w, b)}{\partial b} = \sum_i - 2(\mathbf{y}_i - \hat{y}_i) \frac{\partial \hat{y}_i}{\partial b} $$

Since $\frac{\partial \hat{y}_i}{\partial b} = 1$ for any $w$, we have:

$$ \frac{\partial L(w, b)}{\partial b} = \sum_i - 2(\mathbf{y}_i - \hat{y}_i) $$

Recall that at the first step $\hat{y}_0 = 0$ and $\hat{y}_1=1$. Plugging these values in:

$$ \frac{\partial L(w, b)}{\partial b} = -2 (y_0 - \hat{y}_0) - 2 (y_1 - \hat{y}_1) = -2 (1 - 0) - 2 (0 - 1) = 0 $$

Therefore, according to the gradient descent update rule,

$$ b^\prime = b - \epsilon \frac{\partial L(w, b)}{\partial b} = 0 - 1 \times 0 = 0 $$

In summary, after the first step of gradient descent, we have $w^\prime = -1$ and $b^\prime = 0$.
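As a sanity check, here is a short numpy sketch (our own, not part of the exercise) that performs this single gradient step numerically:

```python
import numpy as np

x = np.array([0.0, 1.0])
y = np.array([1.0, 0.0])
w, b, lr = 1.0, 0.0, 1.0

y_hat = w * x + b                      # predictions: [0, 1]
grad_w = np.sum(-2 * (y - y_hat) * x)  # = 2
grad_b = np.sum(-2 * (y - y_hat))      # = 0
w, b = w - lr * grad_w, b - lr * grad_b
print(w, b)  # -1.0 0.0
```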
Gradient descent on a binary linear classifier

Assume we are training a binary linear classifier with 1-dimensional input $x$ and output $y$. The model is defined as:

$$\hat{y} = \sigma (w^\top x + b)$$

where $\sigma$ is the sigmoid function. The loss function is defined as:

$$L(w, b) = -\sum_i \mathbf{y}_i \log \hat{y}_i + (1-\mathbf{y}_i) \log (1-\hat{y}_i)$$

where $\mathbf{y}_i$ is the ground-truth label of the $i$-th data point. We initialize the parameters as $w=1$, $b=0$, and use the dataset shown below:

- $x_0 = 0$, $y_0 = 1$
- $x_1 = 1$, $y_1 = 0$

We use gradient descent to optimize $w$ and $b$ for 1 step with learning rate $\epsilon=1$. What are the updated $w^\prime$ and $b^\prime$ after 1 step of gradient descent?

Solution: To solve this problem, we first compute the gradient of $L(w, b)$ with respect to $w$. Note that

$$ \frac{\partial L(w, b)}{\partial w} = \sum_i \bigg[ - \frac{\mathbf{y}_i}{\hat{y}_i} + \frac{1-\mathbf{y}_i}{1-\hat{y}_i} \bigg] \frac{\partial \hat{y}_i}{\partial w} $$

Here, we have $\hat{y} = \sigma (w^\top x + b)$ and $\frac{\partial \sigma(z)}{\partial z} = \sigma(z) (1 - \sigma(z))$. According to the chain rule,

$$ \frac{\partial \hat{y}_i}{\partial w} = \frac{\partial \hat{y}_i}{\partial (w^\top \mathbf{x}_i + b)} \frac{\partial (w^\top \mathbf{x}_i + b)}{\partial w} = \hat{y}_i (1 - \hat{y}_i) \mathbf{x}_i $$

Therefore, we have:

$$ \begin{aligned} \frac{\partial L(w, b)}{\partial w} = & -\sum_i \big[ \mathbf{y}_i (1-\hat{y}_i) - (1-\mathbf{y}_i) \hat{y}_i \big] \mathbf{x}_i \cr = & \sum_i (\hat{y}_i - \mathbf{y}_i) \mathbf{x}_i \cr \end{aligned} $$

At the first step, we have $w=1$ and $b=0$, so $\hat{y}_0 = \sigma(0) = 0.5$ and $\hat{y}_1 = \sigma(1) \approx 0.731$. Plugging these values in:

$$ \frac{\partial L(w, b)}{\partial w} = (\hat{y}_0 - y_0) x_0 + (\hat{y}_1 - y_1) x_1 = (0.5 - 1) \times 0 + (0.731 - 0) \times 1 \approx 0.731 $$

Therefore, according to the gradient descent update rule,

$$ w^\prime = w - \epsilon \frac{\partial L(w, b)}{\partial w} \approx 1 - 1 \times 0.731 \approx 0.269 $$

Similarly, we compute the gradient of $L(w, b)$ with respect to $b$. We have:

$$ \frac{\partial \hat{y}_i}{\partial b} = \frac{\partial \hat{y}_i}{\partial (w^\top \mathbf{x}_i + b)} \frac{\partial (w^\top \mathbf{x}_i + b)}{\partial b} = \hat{y}_i (1 - \hat{y}_i) $$

Therefore, we have:

$$ \frac{\partial L(w, b)}{\partial b} = -\sum_i \big[ \mathbf{y}_i (1-\hat{y}_i) - (1-\mathbf{y}_i) \hat{y}_i \big] = \sum_i (\hat{y}_i - \mathbf{y}_i) $$

Recall that at the first step $\hat{y}_0 = 0.5$ and $\hat{y}_1 \approx 0.731$. Plugging these values in:

$$ \frac{\partial L(w, b)}{\partial b} = (\hat{y}_0 - y_0) + (\hat{y}_1 - y_1) = (0.5 - 1) + (0.731 - 0) \approx 0.231 $$

Therefore, according to the gradient descent update rule,

$$ b^\prime = b - \epsilon \frac{\partial L(w, b)}{\partial b} \approx 0 - 1 \times 0.231 \approx -0.231 $$

In summary, after the first step of gradient descent, we have $w^\prime \approx 0.27$ and $b^\prime \approx -0.23$. Unlike in the linear regression exercise, the sigmoid keeps the predictions strictly between 0 and 1, so the gradients are no longer round numbers.
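To double-check the arithmetic, here is a small numpy sketch (our own) of this single update step:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.0, 1.0])
y = np.array([1.0, 0.0])
w, b, lr = 1.0, 0.0, 1.0

y_hat = sigmoid(w * x + b)        # [0.5, 0.731...]
grad_w = np.sum((y_hat - y) * x)  # ≈ 0.731
grad_b = np.sum(y_hat - y)        # ≈ 0.231
w, b = w - lr * grad_w, b - lr * grad_b
print(w, b)  # ≈ 0.269, ≈ -0.231
```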
A key step of gradient descent is computing the gradient of the loss with respect to the parameters, $\frac{\partial{L(\theta)}}{\partial{\theta}}$. Computing it by hand, as we did in the exercises above, is feasible for simple models like linear regression. However, when the model becomes slightly more complex (e.g., linear classification), this process becomes tedious and error-prone. Furthermore, in deep learning, models can become very complex and contain a large number of parameters, and we need to compute the gradient of the loss with respect to each of the many parameters; computation by hand becomes infeasible. To make this process easier, we introduce computation graphs in the next section. They allow us to compute the gradient of the loss with respect to each parameter in a systematic and convenient way.

Computation Graphs

Let's first define what a computation graph is:

Computation graph: A computation graph represents the computation of an arbitrary function in graph form:

- Nodes in the graph represent functions or inputs.
- Edges represent data/tensors that are passed between functions.
- Every computation graph is directed and acyclic.

Let's look at an abstract example of a computation graph:

[Figure: an abstract computation graph with green input nodes and gray function nodes.]

In the above graph:

- Green nodes represent inputs to the graph.
- Gray nodes represent functions. They take inputs from other nodes and output a value to other nodes.
- Edges represent data that are passed to functions. For example, the lower func node takes inputs from an input node and an upper func node.

Having seen this abstract example, let us see an example computation graph for linear regression.

Computation graph for linear regression: Here we use the model $\mathbf{w}^\top\mathbf{x} + \mathbf{b}$:

- The model inputs (e.g., $\mathbf{x}$) are represented as green nodes.
- The computations we perform (e.g., the matrix multiplication of $\mathbf{x}$ and $\mathbf{w}$) are represented as gray nodes.
- The output of each operation is represented as an edge into the next operation.

Now that we have defined what a computation graph is, how do we use it to compute gradients? We need two important steps:

Forward pass: We first compute the output of each node in the graph, given the input values, from the top level to the bottom level.

Backward pass: We then compute the gradient of each node with respect to its immediate inputs, from the bottom level to the top level. We multiply the gradients from the lowest level to the top level to obtain the gradient of the parameter.
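To make the forward pass concrete, here is a minimal sketch of a graph node in Python (our own design, not a real autodiff library): a node is either an input or a function of its parent nodes, and evaluation proceeds recursively:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    """An input node (fn is None) or a function of its parents."""
    fn: Optional[Callable] = None
    parents: List["Node"] = field(default_factory=list)
    value: Optional[float] = None  # filled in by the forward pass

def forward(node):
    """Evaluate a node by recursively evaluating its parents first."""
    if node.fn is not None:
        node.value = node.fn(*[forward(p) for p in node.parents])
    return node.value

# Example: the graph for (x + y) * y
x = Node(value=2.0)
y = Node(value=3.0)
s = Node(fn=lambda a, b: a + b, parents=[x, y])
out = Node(fn=lambda a, b: a * b, parents=[s, y])
print(forward(out))  # (2 + 3) * 3 = 15.0
```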
Let's look at an example of how to use a computation graph to compute the gradient of the parameter $\mathbf{w}$ in linear regression.

Computation graph for linear regression - graph construction

We walk through the loss computation from the linear regression exercise in the previous section. Before diving into the computation details, let's restate the model and loss function for linear regression:

$$\hat{\mathbf{y}} = \mathbf{w}^\top \mathbf{x} + \mathbf{b}$$

The loss value for the $i$-th data point is defined as

$$l(\mathbf{w}, \mathbf{b}|\mathbf{x}_i, \mathbf{y}_i) = (\mathbf{y}_i - \hat{\mathbf{y}}_i)^2 = (\mathbf{y}_i - (\mathbf{w}^\top \mathbf{x}_i + \mathbf{b}))^2$$

where $\mathbf{y}_i$ is the ground-truth label of the $i$-th data point. For later use, we also define the shape of each variable:

- $\mathbf{x} \in \mathbb{R}^{n}$
- $\mathbf{w} \in \mathbb{R}^{n \times d}$
- $\mathbf{b} \in \mathbb{R}^{d}$
- $\mathbf{y} \in \mathbb{R}^{d}$

In our previous exercise, we had $n=d=1$; in the real world, $n$ and $d$ can be much larger.

Now, let's construct the computation graph for the computation of the loss value $l(\mathbf{w}, \mathbf{b}|\mathbf{x}_i, \mathbf{y}_i)$. To do this, we only need to abstract each operation in the loss computation into a node of the graph:

- $\text{matmul} \in \mathbb{R}^{d}$: matrix multiplication of $\mathbf{w}$ and $\mathbf{x}_i$
- $\text{add} \in \mathbb{R}^{d}$: addition of $\text{matmul}$ and $\mathbf{b}$
- $\text{sub} \in \mathbb{R}^{d}$: subtraction of $\text{add}$ from $\mathbf{y}_i$
- $\text{sqr} \in \mathbb{R}$: square of the L2 norm of $\text{sub}$

Note that we use the operation name (e.g., $\text{matmul}$) to denote both the node in the graph and the output of that node, since the output of each node is the result of its operation. Each operation therefore also has a shape; for example, $\text{matmul} \in \mathbb{R}^{d}$. We can easily convert the previous computation into computation graph form:

$$ l(\mathbf{w}, \mathbf{b}|\mathbf{x}_i, \mathbf{y}_i) = (\mathbf{y}_i - (\mathbf{w}^\top \mathbf{x}_i + \mathbf{b}))^2 = \text{sqr}(\text{sub}(\mathbf{y}_i, \text{add}(\text{matmul}(\mathbf{w}, \mathbf{x}_i), \mathbf{b}))) $$

Then, let's draw the computation graph with the nodes for this computation; the edges in the graph show the data flow between nodes for the forward pass.

We've constructed the computation graph for the linear regression loss. Now, we show how to use it to efficiently compute the loss value and its gradient with respect to the model parameters $\mathbf{w}$ and $\mathbf{b}$.

Computation graph for linear regression - forward pass

Let's use the computation graph we constructed to compute the loss value $l(\mathbf{w}, \mathbf{b}|\mathbf{x}_i, \mathbf{y}_i)$. Note that the output of the $\text{sqr}$ node is exactly the loss value. This means we can compute the loss by traversing the graph from the top level to the bottom level. The cool part is that at each node, we only need to care about its local inputs and operation.

Let's do this step by step. We initialize the parameters as $\mathbf{w}=[1]$, $\mathbf{b}=[0]$, and compute on the data point $\mathbf{x}_i = [1]$, $\mathbf{y}_i = [0]$. We write this computation process line by line:

$$ \begin{aligned} \text{matmul} & = \mathbf{w}^\top \mathbf{x}_i = [1] \cr \text{add} & = \text{matmul} + \mathbf{b} = [1] + [0] = [1] \cr \text{sub} & = \mathbf{y}_i - \text{add} = [0] - [1] = [-1] \cr \text{sqr} & = \sum_{k=1}^d \text{sub}_k^2 = (-1)^2 = 1 \cr \end{aligned} $$

Going through these four nodes one by one, the output of the last node is naturally the loss value: $\text{sqr} = l(\mathbf{w}, \mathbf{b}|\mathbf{x}_i, \mathbf{y}_i) = 1$.

Now, we have computed the loss value with the computation graph. How do we use this graph to compute the gradient of the parameter $\mathbf{w}$? To do this, we need the gradient of each node with respect to its immediate inputs. This is called gradient computation, and we will show how to do it with the above example. Before diving into the backward pass of the computation graph, let's first analyze the computational cost of gradient computation.
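The forward pass above, written as plain numpy (a sketch with our own variable names):

```python
import numpy as np

w = np.array([[1.0]])   # shape (n, d), here n = d = 1
x = np.array([1.0])     # shape (n,)
b = np.array([0.0])     # shape (d,)
y = np.array([0.0])     # shape (d,)

matmul = w.T @ x        # [1.0]
add = matmul + b        # [1.0]
sub = y - add           # [-1.0]
sqr = np.sum(sub ** 2)  # 1.0: the loss value
print(sqr)
```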
Computation graph for linear regression - cost of gradient computation

We use the chain rule to compute the gradient of the parameters. Here, let's use the parameter $\mathbf{w}$ as an example. We have:

$$ \begin{aligned} & \frac{\partial}{\partial\mathbf{w}} l(\mathbf{w}, \mathbf{b}|\mathbf{x}, \mathbf{y}) \cr = & \frac{\partial}{\partial\mathbf{w}} \text{sqr}(\text{sub}(\mathbf{y}, \text{add}(\text{matmul}(\mathbf{w}, \mathbf{x}), \mathbf{b}))) \cr = & \nabla_{\text{sub}} \text{sqr}\cdot \nabla_{\text{add}} \text{sub} \cdot \nabla_{\text{matmul}} \text{add} \cdot \nabla_{\mathbf{w}} \text{matmul} \end{aligned} $$

To compute the full gradient, we compute each of the local gradients, i.e., the gradient of each node with respect to its immediate input (e.g., $\nabla_{\text{sub}} \text{sqr}$), and multiply them together. There are two orders in which we can carry out this multiplication: right to left, or left to right. Let's count how many operations each order needs.

Recall that $\mathbf{w} \in \mathbb{R}^{n\times d}$, $\text{matmul} \in \mathbb{R}^{d}$, $\text{add} \in \mathbb{R}^{d}$, $\text{sub} \in \mathbb{R}^{d}$, and $\text{sqr} \in \mathbb{R}$. The local gradients therefore have the following shapes:

- $\nabla_{\mathbf{w}} \text{matmul} \in \mathbb{R}^{d \times (n\times d)}$
- $\nabla_{\text{matmul}} \text{add} \in \mathbb{R}^{d \times d}$
- $\nabla_{\text{add}} \text{sub} \in \mathbb{R}^{d \times d}$
- $\nabla_{\text{sub}} \text{sqr} \in \mathbb{R}^{1 \times d}$

Right-to-left (top-down): In this method, we compute the gradient $\nabla_{\mathbf{w}} \text{matmul}$ first, and then multiply the gradients from the top level to the bottom level. Let's keep track of the shape of each intermediate gradient and the number of operations needed to compute it:

$$ \begin{aligned} \frac{\partial}{\partial\mathbf{w}} l(\mathbf{w}, \mathbf{b}|\mathbf{x}, \mathbf{y}) = & \overbrace{\nabla_{\text{sub}} \text{sqr}\cdot \overbrace{\nabla_{\text{add}} \text{sub} \cdot \underbrace{\nabla_{\text{matmul}} \text{add} \cdot \nabla_{\mathbf{w}} \text{matmul}}_{\text{shape}: \mathbb{R}^{d \times (n\times d)} \ \ \text{cost}: d \cdot d \cdot (n \cdot d)}}^{\text{shape}: \mathbb{R}^{d \times (n\times d)} \ \ \text{cost}: d \cdot d \cdot (n \cdot d)}}^{\text{shape}: \mathbb{R}^{1 \times (n\times d)} \ \ \text{cost}: 1 \cdot d \cdot (n \cdot d)} \cr \end{aligned} $$

Breaking this computation into steps:

- $\nabla_{{\mathbf{w}}} \text{add} = \nabla_{\text{matmul}} \text{add} \cdot \nabla_{\mathbf{w}} \text{matmul}$, shape $\mathbb{R}^{d \times (n\times d)}$, cost $d \cdot d \cdot (n \cdot d) = d^2 \cdot (n \cdot d)$
- $\nabla_{{\mathbf{w}}} \text{sub} = \nabla_{\text{add}} \text{sub} \cdot \nabla_{{\mathbf{w}}} \text{add}$, shape $\mathbb{R}^{d \times (n\times d)}$, cost $d \cdot d \cdot (n \cdot d) = d^2 \cdot (n \cdot d)$
- $\nabla_{{\mathbf{w}}} \text{sqr} = \nabla_{\text{sub}} \text{sqr} \cdot \nabla_{{\mathbf{w}}} \text{sub}$, shape $\mathbb{R}^{1 \times (n\times d)}$, cost $1 \cdot d \cdot (n \cdot d) = d \cdot (n \cdot d)$

The total cost is $d^2 \cdot (n \cdot d) + d^2 \cdot (n \cdot d) + d \cdot (n \cdot d) = 2d^2 \cdot (n \cdot d) + d \cdot (n \cdot d)$. Because this process computes the gradients in the same direction as the forward computation, we call it forward gradient propagation.
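To make these Jacobian shapes concrete, here is a small numpy sketch (the sizes and names are our own) that builds each local Jacobian explicitly, with $\mathbf{w}$ flattened to length $n \cdot d$, and checks the chain product against the closed-form gradient:

```python
import numpy as np

n, d = 3, 2
x = np.random.randn(n)
w = np.random.randn(n, d)
b = np.random.randn(d)
y = np.random.randn(d)

yhat = w.T @ x + b
sub = y - yhat

# matmul_k = sum_i w[i,k] * x[i], so d matmul_k / d w[i,j] = x[i] if j == k
J_matmul_w = np.zeros((d, n * d))
for k in range(d):
    for i in range(n):
        J_matmul_w[k, i * d + k] = x[i]
J_add_matmul = np.eye(d)       # add = matmul + b
J_sub_add = -np.eye(d)         # sub = y - add
J_sqr_sub = 2 * sub[None, :]   # sqr = sum_k sub_k^2, shape (1, d)

grad = J_sqr_sub @ J_sub_add @ J_add_matmul @ J_matmul_w  # shape (1, n*d)
# Closed form: d sqr / d w[i,j] = -2 * x[i] * sub[j]
closed = (-2 * np.outer(x, sub)).reshape(1, n * d)
assert np.allclose(grad, closed)
print(grad.shape)  # (1, 6)
```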
Left-to-right (bottom-up): Now let's go in the opposite direction: compute the gradient $\nabla_{\text{sub}} \text{sqr}$ first, and then multiply the gradients from the bottom level to the top level. Again, we keep track of the shape of each intermediate gradient and the number of operations needed to compute it:

$$ \frac{\partial}{\partial\mathbf{w}} l(\mathbf{w}, \mathbf{b}|\mathbf{x}, \mathbf{y}) = \underbrace{\overbrace{\overbrace{\nabla_{\text{sub}} \text{sqr} \cdot \nabla_{\text{add}} \text{sub}}^{\text{shape}: \mathbb{R}^{1 \times d} \ \ \text{cost}: d \cdot d} \cdot \nabla_{\text{matmul}} \text{add}}^{\text{shape}: \mathbb{R}^{1 \times d} \ \ \text{cost}: d \cdot d } \cdot \nabla_{\mathbf{w}} \text{matmul}}_{\text{shape}: \mathbb{R}^{1 \times (n\times d)} \ \ \text{cost}: d \cdot (n \cdot d) } $$

Breaking this into steps:

- $\nabla_{\text{add}} \text{sqr} = \nabla_{\text{sub}} \text{sqr} \cdot \nabla_{\text{add}} \text{sub}$, shape $\mathbb{R}^{1 \times d}$, cost $d \cdot d = d^2$
- $\nabla_{\text{matmul}} \text{sqr} = \nabla_{\text{add}} \text{sqr} \cdot \nabla_{\text{matmul}} \text{add}$, shape $\mathbb{R}^{1 \times d}$, cost $d \cdot d = d^2$
- $\nabla_{\mathbf{w}} \text{sqr} = \nabla_{\text{matmul}} \text{sqr} \cdot \nabla_{\mathbf{w}} \text{matmul}$, shape $\mathbb{R}^{1 \times (n\times d)}$, cost $d \cdot (n \cdot d)$

The total cost is $d^2 + d^2 + d \cdot (n \cdot d) = 2d^2 + d \cdot (n \cdot d)$. We call this process backward gradient propagation.

Comparison: From this example, we can see that backward propagation costs less whenever $n \cdot d > 1$. The reason is that backward propagation starts from the gradient of the scalar value $\text{sqr}$, so every intermediate product stays a row vector as we multiply from the bottom level to the top level; forward propagation instead starts from the Jacobian of the matrix-valued parameter $\mathbf{w}$ and carries its large shape through every multiplication. As we will see later, in deep learning $n \cdot d$ is often very large, so backward propagation is much more efficient. We will see how to carry out backward propagation in the next example.
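Plugging some concrete sizes (our own choices) into the two cost formulas shows how quickly the gap grows:

```python
def forward_cost(n, d):
    return 2 * d**2 * (n * d) + d * (n * d)

def backward_cost(n, d):
    return 2 * d**2 + d * (n * d)

for n, d in [(1, 1), (100, 10), (1000, 1000)]:
    print((n, d), forward_cost(n, d), backward_cost(n, d))
# (1, 1):       3 vs 3  (identical when n*d = 1)
# (100, 10):    210,000 vs 10,200
# (1000, 1000): ~2e12 vs ~1e9
```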
Computation graph for linear regression - back propagation

Now, let's use the computation graph we constructed to compute the gradient of the parameter $\mathbf{w}$ with backward propagation. Continuing our previous example, we need to compute the gradients for the nodes $\text{sqr}$, $\text{sub}$, $\text{add}$, and $\text{matmul}$ that lie on the path towards the parameter $\mathbf{w}$; we denote them by $\nabla_{\text{sub}} \text{sqr}$, $\nabla_{\text{add}} \text{sub}$, $\nabla_{\text{matmul}} \text{add}$, and $\nabla_{\mathbf{w}} \text{matmul}$. Note that as we compute the gradients from the bottom level to the top level, we are constructing another computation graph that is the reverse of the original one; we call it the gradient graph. We show this computation graph in the figure below.

Let's compute the gradient of each node with respect to its immediate inputs, one by one. The left column follows the path to $\mathbf{w}$ and the right column the path to $\mathbf{b}$:

$$ \begin{align*} \nabla_{\text{sub}} \text{sqr} & = \frac{\partial \, \text{sub}^2 }{\partial \text{sub}} = 2 \, \text{sub} & \nabla_{\text{sub}} \text{sqr} & = \frac{\partial \, \text{sub}^2 }{\partial \text{sub}} = 2 \, \text{sub} \cr \nabla_{\text{add}} \text{sub} & = \frac{\partial (\mathbf{y} - \text{add})}{\partial \text{add}} = [-1] & \nabla_{\text{add}} \text{sub} & = \frac{\partial (\mathbf{y} - \text{add})}{\partial \text{add}} = [-1] \cr \nabla_{\text{matmul}} \text{add} & = \frac{\partial (\text{matmul} + \mathbf{b})}{\partial \text{matmul}} = [1] & \nabla_{\mathbf{b}} \text{add} & = \frac{\partial (\text{matmul} + \mathbf{b})}{\partial \mathbf{b}} = [1] \cr \nabla_{\mathbf{w}} \text{matmul} & = \frac{\partial \mathbf{w}^\top\mathbf{x}}{\partial \mathbf{w}} = \mathbf{x} & & \end{align*} $$

Note that the first two lines of the gradient computation are shared between the paths to $\mathbf{w}$ and $\mathbf{b}$. This is an important benefit of backward propagation: different parameters share the gradient values of their common descendant nodes, which makes gradient computation very efficient for models with a large number of parameters.

Recall that we already computed all the node values in the forward pass ($\text{matmul} = [1]$, $\text{add} = [1]$, $\text{sub} = [-1]$, $\text{sqr} = 1$). We use these values to evaluate each local gradient:

$$\nabla_{\text{sub}} \text{sqr}= [-2], \quad \nabla_{\text{add}} \text{sub} = [-1], \quad \nabla_{\text{matmul}} \text{add} = [1], \quad \nabla_{\mathbf{w}} \text{matmul} = [1].$$

Cool: we now have the immediate gradient values for all the nodes on the path from the parameter $\mathbf{w}$ to the loss value $l(\mathbf{w}, \mathbf{b}|\mathbf{x}, \mathbf{y})$. Multiplying them together gives the gradient of the parameter $\mathbf{w}$:

$$ \frac{\partial l(\mathbf{w}, \mathbf{b}|\mathbf{x}, \mathbf{y})}{\partial \mathbf{w}} = \nabla_{\text{sub}} \text{sqr}\cdot \nabla_{\text{add}} \text{sub} \cdot \nabla_{\text{matmul}} \text{add} \cdot \nabla_{\mathbf{w}} \text{matmul} = [-2] \times [-1] \times [1] \times [1] = [2] $$
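Finally, here is the whole example end to end in numpy (a sketch with our own names): the forward pass caches every node value, and the backward pass evaluates the local gradients and chains them from the loss back to the parameters, reusing the shared prefix for both $\mathbf{w}$ and $\mathbf{b}$:

```python
import numpy as np

w = np.array([[1.0]]); x = np.array([1.0])
b = np.array([0.0]);  y = np.array([0.0])

# Forward pass: cache every node value.
matmul = w.T @ x          # [1.0]
add = matmul + b          # [1.0]
sub = y - add             # [-1.0]
sqr = np.sum(sub ** 2)    # 1.0 (the loss)

# Backward pass: local gradients, chained bottom-up.
g_sqr_sub = 2 * sub       # d sqr / d sub    = [-2]
g_sub_add = -1.0          # d sub / d add    = [-1]
g_add_matmul = 1.0        # d add / d matmul = [1]
g_matmul_w = x            # d matmul / d w   = [1]
g_add_b = 1.0             # d add / d b      = [1]

prefix = g_sqr_sub * g_sub_add            # shared by both parameter paths: [2]
g_w = prefix * g_add_matmul * g_matmul_w  # [2.0]
g_b = prefix * g_add_b                    # [2.0]
print(g_w, g_b)
```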