Optimization for Deep Networks

As mentioned in the last section, in order to obtain a better-performing deep network, we need to minimize $\mathcal{L}(\theta)$. There, we introduced the Gradient Descent algorithm.

Gradient Descent (GD)

For each step of GD, we do:

$$
\theta' = \theta - \epsilon \frac{\partial \mathcal{L}(\theta)}{\partial \theta},
$$

which moves $\theta$ in the negative gradient direction by a small step size $\epsilon$. When $\epsilon$ is small enough and $\frac{\partial \mathcal{L}(\theta)}{\partial \theta} \neq 0$, the updated $\theta'$ will have a lower loss $\mathcal{L}(\theta')$. We repeat this step until convergence.

What is the issue with Gradient Descent?

In GD, each update step requires computing the gradient over the whole dataset to obtain $\frac{\partial \mathcal{L}(\theta)}{\partial \theta}$. This can be very slow and expensive, as we need to evaluate the loss $\ell(f(x, \theta), y)$ for every data point $(x,y)$ in the full dataset $D$. In other words, we need to compute

$$
\frac{\partial \mathcal{L}(\theta)}{\partial \theta} = \mathbb{E}_{(x,y)\in D}\left[\frac{\partial \ell(f(x, \theta), y)}{\partial \theta}\right]
$$

in every optimization step. This makes GD slow and inefficient for parameter optimization. To overcome this issue, we can use another algorithm called Stochastic Gradient Descent (SGD).

Stochastic Gradient Descent (SGD)

In SGD, for each step we do:

$$
\theta' = \theta - \epsilon \frac{\partial \ell(f(x,\theta), y)}{\partial \theta},
$$

where $(x,y)$ is a single data point randomly sampled from the dataset $D$. We regard $\frac{\partial \ell(f(x,\theta), y)}{\partial \theta}$ as an approximation of $\mathbb{E}_{(x,y)\in D}\left[\frac{\partial \ell(f(x, \theta), y)}{\partial \theta}\right]$, and use this approximate gradient to update $\theta$. In practice, SGD lets us perform gradient descent at a much smaller computational cost per step, although each update has higher variance than a GD update.

Comparison of GD and SGD

Let's look at an example training curve comparison of GD and SGD below. We can see that SGD has higher variance than GD, but each step of SGD is much faster than a step of GD. In practice, we can use SGD to achieve faster optimization.

What is the issue with Stochastic Gradient Descent?

The issue with SGD is its high variance, which makes it unstable and inefficient for parameter optimization. The core reason is that each step computes the gradient over only a single data point $(x,y)$, which yields a very noisy estimate. To deal with this issue, instead of computing the gradient over a single data point, in each step we can compute the gradient over a small set of data points sampled from $D$ to obtain a more accurate gradient estimate. We call this set a mini-batch, and we use it in the following algorithm.

Mini-batch Stochastic Gradient Descent

Concretely, for each epoch, we split the dataset $D$ into $m$ mini-batches $B_0, \ldots, B_{m-1}$ of size $BS$. Then we iterate over each batch $B_i$:

$$
\theta' = \theta - \epsilon\, \mathbb{E}_{(x,y)\in B_i}\left[\frac{\partial \ell(f(x,\theta), y)}{\partial \theta}\right].
$$

We call this algorithm SGD with mini-batches. In practice, we encourage you to always use mini-batches to achieve efficient optimization.
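To make the three update rules concrete, here is a minimal NumPy sketch of mini-batch SGD. The least-squares model $f(x, \theta) = x^\top \theta$, the synthetic data, and all hyperparameters are illustrative assumptions, not part of the text above; note that setting batch_size to the full dataset size recovers full-batch GD, and batch_size = 1 recovers plain SGD.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic dataset D = {(x, y)}: y = x @ theta_true + noise.
N, d = 1000, 5
theta_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
Y = X @ theta_true + 0.1 * rng.normal(size=N)

def grad(theta, x_batch, y_batch):
    """Average gradient over a batch of the squared loss
    l(f(x, theta), y) = (x @ theta - y)^2, i.e. E_{(x,y) in B_i}[dl/dtheta]."""
    residual = x_batch @ theta - y_batch
    return 2.0 * x_batch.T @ residual / len(y_batch)

def minibatch_sgd(theta, epsilon=0.05, batch_size=32, epochs=20):
    # batch_size = N recovers full-batch GD; batch_size = 1 recovers plain SGD.
    theta = theta.copy()
    for _ in range(epochs):
        perm = rng.permutation(N)               # reshuffle D each epoch
        for start in range(0, N, batch_size):   # split into batches B_0..B_{m-1}
            idx = perm[start:start + batch_size]
            # theta' = theta - epsilon * E_{(x,y) in B_i}[dl/dtheta]
            theta = theta - epsilon * grad(theta, X[idx], Y[idx])
    return theta

theta_hat = minibatch_sgd(np.zeros(d))
print("parameter error:", np.linalg.norm(theta_hat - theta_true))
```

Varying batch_size in this sketch reproduces the trade-off discussed above: larger batches give lower-variance gradient estimates but cost more per step, while smaller batches give cheap but noisy updates.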