We have described linear models for both classification and regression. How do we make linear models predict what we want? Under arbitrary parameter values, the model will likely produce a poor estimate of the output. What values of the parameters make a good model?

Loss Function

The quality of a model is its ability to explain data. Recall that data is a collection of samples from a data-generating distribution. Our data comes in the form of labeled datasets.

Dataset

A dataset $\mathcal{D}$ with $N$ samples is

$$\mathcal{D}=\{(\textbf{x}_i,\textbf{y}_i)\}_{i=1}^N \quad\text{where } (\textbf{x}_i,\textbf{y}_i) \sim P(X,Y)$$

where $\textbf{x}_i \in \mathbb{R}^n$ is the input, $\textbf{y}_i \in \mathbb{R}^d$ is the output, and $P(X,Y)$ is the joint distribution of the input and output.

We want to find the model that best fits the samples in our dataset. Fitting the data is mathematically defined as finding the model that maximizes the likelihood of the dataset.

Loss function

The loss function for a single data sample $(\textbf{x}_i,\textbf{y}_i)$ is

$$ l(\theta|\textbf{x}_i,\textbf{y}_i)=-\log P(\textbf{y}_i | \textbf{x}_i, \theta) $$

It is defined as the negative log-likelihood of the observed data given parameters $\theta$.

Let's see how the likelihood instantiates for classification and regression. Let $\hat{y}$ be the predicted output of the model.

Binary Classification

For some ground truth $y\in \{0,1\}$ and predicted $\hat{y}\in[0,1]$,

$$ l(\theta,\hat{y}_i,y_i) = -\log P(y_i | \hat{y}_i)= -y_i\log \hat{y}_i - (1-y_i)\log(1-\hat{y}_i) $$

Notice that when $y_i=1$, the loss is $-\log P(y_i=1|\textbf{x}_i)=-\log \hat{y}_i$, and when $y_i=0$, the loss is $-\log P(y_i=0|\textbf{x}_i)=-\log (1 - \hat{y}_i)$. These are exactly the negative log-likelihoods defined above.

Multi-Class Classification

For some ground truth $y\in \{1,\dots,d\}$ and predicted distribution $\hat{\textbf{y}}\in [0,1]^d$,

$$l(\theta,\hat{\textbf{y}}_i,y_i) = -\log \hat{y}_{i,y_i}$$

that is, the negative log of the probability the model assigns to the correct class $y_i$.

Avoid implementing this yourself

These loss functions should never be implemented explicitly in this form, as they are numerically unstable. Use the torch module torch.nn.BCEWithLogitsLoss or the function torch.nn.functional.binary_cross_entropy_with_logits instead, which operate directly on unnormalized logits.
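Below is a minimal sketch of how the numerically stable losses are used (the toy tensors, layer sizes, and random labels here are made up purely for illustration). Note that a linear layer produces unnormalized logits that can take any real value; the `_with_logits` losses apply the sigmoid (or softmax) internally.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy data: 8 samples with 4 input features, binary labels in {0, 1}.
x = torch.randn(8, 4)
y = torch.randint(0, 2, (8,)).float()

# A linear model produces unnormalized logits, not probabilities.
linear = torch.nn.Linear(4, 1)
logits = linear(x).squeeze(1)
print("logit range:", logits.min().item(), logits.max().item())  # any real number

# Stable: operates on logits and fuses the sigmoid with the log internally.
stable = F.binary_cross_entropy_with_logits(logits, y)

# Naive: applying the sigmoid and then taking logs is numerically unstable,
# since the sigmoid saturates to exactly 0 or 1 for large-magnitude logits.
p = torch.sigmoid(logits)
naive = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()
print(stable.item(), naive.item())  # close here, but the naive form can produce inf/nan

# The module form wraps the same computation.
criterion = torch.nn.BCEWithLogitsLoss()
print(criterion(logits, y).item())

# Multi-class analogue: torch.nn.functional.cross_entropy also expects raw logits.
logits_mc = torch.nn.Linear(4, 3)(x)   # 3 classes
labels = torch.randint(0, 3, (8,))
print(F.cross_entropy(logits_mc, labels).item())
```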
Regression

For some ground truth $\textbf{y}\in \mathbb{R}^d$ and predicted $\hat{\textbf{y}}\in\mathbb{R}^d$,

$$ l(\theta|\mathbf{x}_i,\textbf{y}_i) = \frac{1}{2}||\hat{\textbf{y}} - \textbf{y}||^2_2 $$

We call this loss function the L2 loss, as it measures the L2 distance between the prediction $\hat{\textbf{y}}$ and the ground truth $\textbf{y}$. There is also the popular L1 loss, which is the L1 distance between the prediction $\hat{\textbf{y}}$ and the ground truth $\textbf{y}$:

$$ l(\theta|\mathbf{x}_i,\textbf{y}_i) = ||\hat{\textbf{y}} - \textbf{y}||_1 $$

Mathematically, the optimal model under the L2 loss tends towards predicting the mean of the targets, while the optimal model under the L1 loss tends towards the median. In the context of deep networks, the difference between the two is largely negligible.

The L1 and L2 losses for linear regression may seem like a departure from the log-likelihood. Surprisingly, these losses are well motivated and are indeed variants of the log-likelihood objective under certain modeling assumptions.

Log Likelihood with Regression

Recall that the model's output is always a distribution. We can interpret the prediction $\hat{\textbf{y}}$ as the mean of that distribution. Under a Gaussian noise assumption, the negative log-likelihood of the observed data is precisely the L2 loss. If we model $P(\textbf{y}|\textbf{x},\theta)=\mathcal{N}(\hat{\textbf{y}}, \textbf{I})$, then

$$ -\log P(\textbf{y}|\textbf{x},\theta) = \frac{1}{2}||\hat{\textbf{y}} - \textbf{y}||^2_2 + \frac{d}{2}\log(2\pi) $$

The second term is a constant that does not depend on $\theta$, so minimizing the negative log-likelihood is equivalent to minimizing the L2 loss. A Laplace noise assumption similarly recovers the L1 loss.

Now that we have concretely defined the loss function, we can discuss how to find the parameters that minimize the negative log-likelihood (i.e., the loss).

Training

We are now ready to discuss what it means to train our model. Remember that the loss function gives us a way to measure how well a model fits our data.

Training

Training is the process of updating the weights of the model $\theta$ to reduce the loss and improve the fit of our model. Formally, it is the process of finding optimal weights $\theta^* = \arg\min_\theta \sum_i l(\theta |\mathbf{x}_i,\textbf{y}_i)$.

The path from the inputs to the loss is one large differentiable function of the weights.

Regression loss

To see this, let's write out the regression loss.

$$ \begin{aligned} \sum_i l(\theta |\mathbf{x}_i,\textbf{y}_i) &= \sum_i \frac{1}{2}||\hat{\textbf{y}}_i - \textbf{y}_i||^2_2 \\ &= \sum_i \frac{1}{2}||(\textbf{W}\textbf{x}_i + \textbf{b}) - \textbf{y}_i||^2_2 \end{aligned} $$

The regression loss is just a function of the weights $\textbf{W}$ and $\textbf{b}$. This ensures that its gradients with respect to the weights can be computed and are well defined. Training is then performed by gradient descent.
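To make the training step concrete, here is a minimal sketch of gradient descent on the regression loss for a linear model, using PyTorch autograd to compute the gradients with respect to $\textbf{W}$ and $\textbf{b}$ (the synthetic data, learning rate, and number of steps are arbitrary illustrative choices).

```python
import torch

torch.manual_seed(0)

# Synthetic regression data: targets come from a hidden linear map plus noise.
N, n, d = 100, 5, 3
x = torch.randn(N, n)
true_W, true_b = torch.randn(d, n), torch.randn(d)
y = x @ true_W.T + true_b + 0.1 * torch.randn(N, d)

# Linear model y_hat = W x + b with learnable W and b.
model = torch.nn.Linear(n, d)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(100):
    y_hat = model(x)
    # L2 loss: 0.5 * ||y_hat - y||_2^2 per sample, averaged over the dataset.
    loss = 0.5 * ((y_hat - y) ** 2).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()   # gradients of the loss with respect to W and b
    optimizer.step()  # one gradient descent update
    if step % 20 == 0:
        print(step, loss.item())
```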