Deep networks are very large statistical models $f_\theta(\mathbf{x})$. The most common training paradigm for deep networks is gradient descent, covered in TODO: Section REF. This section covers the basic math behind gradient computation: vector-valued notation for derivatives and the chain rule.

Partial derivative

For a scalar function $f: \mathbb{R} \to \mathbb{R}$ we write the partial derivative as
$$
\frac{\partial}{\partial x} f(x).
$$
You may have seen the notation $f^\prime(x)$ in other classes. We avoid this notation, as it does not indicate which input the derivative is taken with respect to once there is more than one input.

Gradient

Given a scalar function $f: \mathbb{R}^n \to \mathbb{R}$ with vector-valued input $\mathbf{x} \in \mathbb{R}^n$, the gradient is the partial derivative of the function with respect to each input:
$$
\nabla_\mathbf{x}f(\mathbf{x}) = \frac{\partial f(\mathbf{x})}{\partial\mathbf{x}} = \begin{bmatrix}\frac{\partial f(\mathbf{x})}{\partial x_1} & \frac{\partial f(\mathbf{x})}{\partial x_2} & \cdots & \frac{\partial f(\mathbf{x})}{\partial x_n}\end{bmatrix}.
$$

Jacobian

For a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$ the Jacobian is the partial derivative of each output with respect to each input:
$$
J_f = \nabla_\mathbf{x}f(\mathbf{x}) = \begin{bmatrix}\nabla_\mathbf{x}f_1(\mathbf{x})\\ \nabla_\mathbf{x}f_2(\mathbf{x})\\ \vdots\\ \nabla_\mathbf{x}f_m(\mathbf{x})\end{bmatrix} = \small\begin{bmatrix} \frac{\partial f_1(\mathbf{x})}{\partial x_1} & \frac{\partial f_1(\mathbf{x})}{\partial x_2} & \ldots & \frac{\partial f_1(\mathbf{x})}{\partial x_n} \\ \frac{\partial f_2(\mathbf{x})}{\partial x_1} & \frac{\partial f_2(\mathbf{x})}{\partial x_2} & \ldots & \frac{\partial f_2(\mathbf{x})}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m(\mathbf{x})}{\partial x_1} & \frac{\partial f_m(\mathbf{x})}{\partial x_2} & \ldots & \frac{\partial f_m(\mathbf{x})}{\partial x_n} \end{bmatrix}.
$$

To summarize the shapes of these objects: gradients of functions $f: \mathbb{R}^n \to \mathbb{R}$ are size-$n$ row vectors
$$
\nabla_\mathbf{x}f(\mathbf{x}) = \begin{bmatrix} \cdot & \cdot & \cdot & \cdot \end{bmatrix}.
$$
Partial derivatives of vector-valued functions $f: \mathbb{R} \to \mathbb{R}^m$ are size-$m$ column vectors:
$$
\frac{\partial}{\partial x} f(x) = \begin{bmatrix} \cdot \\ \cdot \\ \cdot \end{bmatrix}.
$$
Jacobians of vector-valued functions $f: \mathbb{R}^n \to \mathbb{R}^m$ are $m \times n$ matrices:
$$
J_f = \begin{bmatrix} \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot \end{bmatrix}.
$$

All this notation builds up to one of the most fundamental mathematical rules in deep learning: the chain rule for vector-valued functions.

Chain rule

Consider two functions $f: \mathbb{R}^n \to \mathbb{R}^m$ and $g: \mathbb{R}^m \to \mathbb{R}^k$. The Jacobian of the composition $g(f(\mathbf{x}))$ is
$$
\nabla_\mathbf{x} g(f(\mathbf{x})) = \nabla_\mathbf{y} g(\mathbf{y})\, \nabla_\mathbf{x} f(\mathbf{x}),
$$
where $\mathbf{y}=f(\mathbf{x})$.

The order of the factors in the chain rule for vector-valued functions matters. In our notation, the Jacobians are multiplied from the outermost to the innermost function, left to right; the dimensions then line up, since $\nabla_\mathbf{y} g(\mathbf{y})$ is $k \times m$ and $\nabla_\mathbf{x} f(\mathbf{x})$ is $m \times n$, giving a $k \times n$ Jacobian for the composition. The chain rule forms the basis of almost all training algorithms for deep networks and shows up in several places throughout this class.
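
To make the shapes and the ordering concrete, here is a minimal numerical sketch of the chain rule in Python, assuming NumPy is available. The example functions `f` and `g` and the finite-difference helper `jacobian` are illustrative choices, not part of the class material; the check confirms that the Jacobian of the composition matches the left-to-right product of the two Jacobians.

```python
import numpy as np

# Illustrative example functions (not from the class material):
# f: R^3 -> R^2 and g: R^2 -> R^4.

def f(x):
    return np.array([x[0] * x[1], np.sin(x[2])])

def g(y):
    return np.array([y[0] ** 2, y[0] * y[1], np.exp(y[1]), y[0] + y[1]])

def jacobian(fn, x, eps=1e-6):
    """Central finite-difference Jacobian of fn at x; shape (m, n) for fn: R^n -> R^m."""
    x = np.asarray(x, dtype=float)
    m = fn(x).size
    J = np.zeros((m, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (fn(x + dx) - fn(x - dx)) / (2 * eps)
    return J

x = np.array([0.5, -1.2, 0.3])
J_f = jacobian(f, x)                    # shape (2, 3)
J_g = jacobian(g, f(x))                 # shape (4, 2), evaluated at y = f(x)
J_gf = jacobian(lambda z: g(f(z)), x)   # shape (4, 3)

# Chain rule: the Jacobian of g(f(x)) equals the matrix product J_g @ J_f,
# ordered from the outermost to the innermost function.
print(np.allclose(J_gf, J_g @ J_f, atol=1e-5))  # True, up to finite-difference error
```

Note that $\nabla_\mathbf{y} g(\mathbf{y})$ is evaluated at $\mathbf{y} = f(\mathbf{x})$ and has shape $4 \times 2$, while $\nabla_\mathbf{x} f(\mathbf{x})$ has shape $2 \times 3$, so their product is the $4 \times 3$ Jacobian of the composition; reversing the order would not even be a valid matrix product.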