We now have the mathematical background to introduce our first machine learning models. Let's look at the three simplest tasks in machine learning: linear regression, linear binary classification, and linear multi-class classification.

Linear Regression

Recall that a regression model is a statistical model $f_\theta: \mathbb{R}^n \to \mathbb{R}^d$.

Linear Regressor

For some input $\textbf{x}\in \mathbb{R}^n$, a linear regressor is a function $$ f_\theta(\textbf{x}) = \textbf{W}\textbf{x}+\textbf{b} $$ with parameters $\theta=(\textbf{W},\textbf{b})$ describing an affine transformation with weights $\textbf{W}\in \mathbb{R}^{d \times n}$ and bias $\textbf{b} \in \mathbb{R}^d$.

Let's take a look at what a linear regressor can model.

Temperature forecast

Let $x$ be the day of the week and $f(x)$ be the average temperature on that day. Suppose we are given the temperature from Monday to Thursday ($x=1$ to $x=4$) and would like to predict the temperature on Friday ($x=5$). The red line shows the model's predicted temperature for each day of the week and the blue dots show the real measurements. Given that it has been getting hotter this week, our regressor does a decent job of predicting that the temperature will continue to get hotter.

[Figure: predicted temperatures (red line) and measurements (blue dots) for each day of the week.]

Note that linear regressors will have trouble with data that is not linearly correlated.

Limitations in Temperature forecasting

The linear regression model performs well because there is a linear upward trend in the temperature during the week. Suppose we would like to take advantage of the temperatures of the last 100 days, rather than the last 4 days. The red line shows the model's predicted temperature for each day and the blue dots show the real measurements. The linear regressor is unable to model the cyclic nature of the temperature during the year.

[Figure: linear fit (red line) and measurements (blue dots) over the last 100 days.]

Let's see how a linear regressor is implemented in PyTorch.

Linear Layer

A linear model is referred to as a linear layer in the PyTorch framework. Here, we dig into the torch.nn.Linear module that implements a linear layer. We show how to initialize a linear model, how to access its weights and biases, and how to apply it.

import torch

linear = torch.nn.Linear(4, 2)
print(f"{linear.weight=}")
print(f"{linear.bias=}")

x = torch.as_tensor([1, 2, 3, 4], dtype=torch.float32)
print(f"{linear(x)=}")

linear.weight=Parameter containing:
tensor([[-0.1396, -0.4321, -0.2756, -0.0652],
        [-0.0856,  0.4810,  0.2468, -0.1347]], requires_grad=True)
linear.bias=Parameter containing:
tensor([-0.0762,  0.0061], requires_grad=True)
linear(x)=tensor([-2.1674,  1.0839], grad_fn=<...>)
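The numbers above depend on the random initialization, but we can verify that the layer computes exactly the affine map from the definition. Below is a minimal sketch continuing the snippet above; note that torch.nn.Linear stores its weight with shape (out_features, in_features), i.e. $d \times n$, matching $\textbf{W} \in \mathbb{R}^{d \times n}$.

# Continuing from the snippet above: recompute Wx + b by hand.
# linear.weight has shape (2, 4) = (out_features, in_features).
manual = linear.weight @ x + linear.bias
print(torch.allclose(manual, linear(x)))  # True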
Linear Binary Classification

In binary classification, our goal is to predict which of two categories an input belongs to. However, models produce continuous-valued outputs and are not able to output discrete categories directly. One could output a real number $\hat{y}$ (as in linear regression) and predict class 1 if $\hat{y} > 0$ and class 2 if $\hat{y} \leq 0$. Unfortunately, thresholding at 0 is not differentiable. Instead, the binary classification model is a statistical model that outputs probabilities of categories: $f_\theta: \mathbb{R}^n \to P(X)$ where $P(X)$ is a probability distribution over 2 categories. In this setting, the binary classifier only outputs a continuous-valued probability of the first category. The probability of the second category is immediately known since the two probabilities must sum to 1.

Let's look at the sigmoid function, which converts a real number to a binary probability.

Sigmoid

For some input $x \in \mathbb{R}$, the sigmoid function $\sigma: \mathbb{R} \to [0,1]$ is the function $$\sigma(x)=\frac{1}{1+e^{-x}}$$

[Figure: plot of the sigmoid function.]

Observe that as $x$ approaches infinity, $\sigma(x)$ approaches 1, and as $x$ approaches negative infinity, $\sigma(x)$ approaches 0.

We build a linear binary classifier from a linear regressor by regressing the probability of the first category.

Binary Logistic Regression

For some input $\textbf{x} \in \mathbb{R}^n$, a linear binary logistic regressor is a function $$ f_\theta(\textbf{x}) = \sigma(\textbf{W}\textbf{x}+\textbf{b}) $$ with parameters $\theta=(\textbf{W},\textbf{b})$ describing an affine transformation where $\textbf{W}\in \mathbb{R}^{1 \times n}$ and $\textbf{b} \in \mathbb{R}^1$. The output $f_\theta(\textbf{x})$ describes the probability of label 1. In other words, $P(y=1)=f_\theta(\textbf{x})$. The quantity passed to the sigmoid, $\textbf{W}\textbf{x}+\textbf{b}$, is called the logit.

Let's see how we can use a binary logistic regressor to classify inputs.

Linear Binary Classification

From this probability value, we can now classify our input with $$ \text{Classify}(\textbf{x}) = \begin{cases} 1 & \text{if } P(y=1) = f_\theta(\textbf{x}) = \sigma(\textbf{W}\textbf{x}+\textbf{b}) > 0.5 \\ 2 & \text{if } P(y=1) = f_\theta(\textbf{x}) = \sigma(\textbf{W}\textbf{x}+\textbf{b}) \leq 0.5 \end{cases} $$ Observe that this is equivalent to $$ \text{Classify}(\textbf{x}) = \begin{cases} 1 & \text{if } \textbf{W}\textbf{x}+\textbf{b} > 0 \\ 2 & \text{if } \textbf{W}\textbf{x}+\textbf{b} \leq 0 \end{cases} $$ since $\sigma(0)=0.5$ and $\sigma^{-1}(0.5)=0$. Notice that for binary classification $\textbf{W} = \begin{bmatrix} \textbf{w}^T \end{bmatrix}$ where $\textbf{w} \in \mathbb{R}^n$, so $$ \text{Classify}(\textbf{x}) = \begin{cases} 1 & \text{if } \textbf{w}^T\textbf{x}+b > 0 \\ 2 & \text{if } \textbf{w}^T\textbf{x}+b \leq 0 \end{cases} $$
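Here is a minimal sketch of this rule in PyTorch. The two-dimensional input and the randomly initialized parameters are hypothetical; the point is only that thresholding the probability at 0.5 and thresholding the logit at 0 produce the same label.

import torch

# Hypothetical binary logistic regressor with n = 2 input features.
linear = torch.nn.Linear(2, 1)        # computes the logit Wx + b
x = torch.as_tensor([0.5, -1.0])

logit = linear(x)
p_class1 = torch.sigmoid(logit)       # P(y = 1)

# Both rules give the same label (1 or 2, as in the text).
label_from_prob = 1 if p_class1.item() > 0.5 else 2
label_from_logit = 1 if logit.item() > 0 else 2
assert label_from_prob == label_from_logit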
This sets up the half-space interpretation of binary classification.

Binary Classification as Half-Space

From our classification rule, the set of inputs where $\textbf{w}^T\textbf{x}+b = 0$ forms the decision boundary, a hyperplane in $\mathbb{R}^n$ (a line in $\mathbb{R}^2$) that separates the space into two half-spaces. The weight vector $\textbf{w}$ is normal (perpendicular) to this boundary. Here is a pictorial in $\mathbb{R}^2$.

[Figures: half-space interpretation of binary classification in $\mathbb{R}^2$.]

The illustration above shows a hard decision boundary. But our logistic regression model actually outputs a probability of the first category. We show the smooth transition from one category to the other in the logits (and subsequently probabilities) below. Notice that the decision boundary is the white line where the logits are 0. Geometrically, the logits are proportional to the signed distance from the decision boundary.

[Figure: logits over the input space, with the decision boundary shown as the white line where the logits are 0.]

Let's look at an example of binary classification.

Rain forecast

Let $\textbf{x}$ be the average daily temperatures on Monday and Tuesday for a given week and $f(\textbf{x})$ be whether it will rain on Wednesday of that week. The linear binary classifier computes $P(\text{rain}) = f_\theta(\textbf{x}) = \sigma(\textbf{W}\textbf{x}+\textbf{b})$. If $P(\text{rain}) > 0.5$, we predict that it will rain on Wednesday, and if $P(\text{rain}) \leq 0.5$, we predict that it will not. We show the ground truth data with the linear classifier's decision boundary below.

[Figure: ground truth data with the linear classifier's decision boundary.]

We show the ground truth data with the linear classifier's probabilities below.

[Figure: ground truth data with the linear classifier's probabilities.]

Oftentimes, there are more than 2 possible categories to which an input can belong. Let's look at how we extend binary classification to multiple categories.

Linear Multi-Class Classification

Recall that a classification model is a statistical model that outputs probabilities of categories: $f_\theta: \mathbb{R}^n \to P(X)$. $P(X)$ is a probability distribution over $d$ categories, of which only 1 category is correct. In order to assign a probability to each category, we will use the softmax function.

Softmax

For some input $\textbf{x}=\begin{bmatrix} x_1 \\ \vdots \\ x_d\end{bmatrix} \in \mathbb{R}^d$, the function $\text{softmax}: \mathbb{R}^d \to P(X)$ computes $$\text{softmax}(\textbf{x}) = \frac{1}{\sum_i e^{x_i}} \begin{bmatrix} e^{x_1} \\ \vdots \\ e^{x_d}\end{bmatrix} $$ The softmax function is equivalently written $\text{softmax}(\textbf{x})_j = \frac{e^{x_j}}{\sum_i e^{x_i}}$ by indexing the $j$-th element of the vector.
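As a quick sanity check, here is a small sketch using torch.softmax on some arbitrary logits for $d=3$ categories; it only illustrates that the outputs are non-negative and sum to 1.

import torch

logits = torch.as_tensor([2.0, 1.0, 0.1])   # arbitrary values for 3 categories
probs = torch.softmax(logits, dim=0)

print(probs)        # approximately [0.6590, 0.2424, 0.0986]
print(probs.sum())  # 1 (up to floating point)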
Softmax

Let's visualize this for three classes. We fix $x_1=-1$ and $x_2=0$ and compute $\text{softmax}(\begin{bmatrix} -1 \\ 0 \\ x_3\end{bmatrix})$ while varying $x_3$. Observe that when $x_3$ is small, the value $\text{softmax}(\textbf{x})_3$ is near 0, and when $x_3$ is large, its value $\text{softmax}(\textbf{x})_3$ is near 1. Also notice that when $x_3=x_2=0$, they share the same value $\frac{e^{0}}{e^0+e^0+e^{-1}}=\frac{1}{2+e^{-1}} \approx 0.42$.

Intuitively, the softmax function outputs $d$ non-negative numbers that sum to 1. We interpret this output as a probability distribution.

Linear Multi-Class Classifier

For some input $\textbf{x}\in \mathbb{R}^n$, a linear multi-class classifier is a function $$ f_\theta(\textbf{x}) = \text{softmax}(\textbf{W}\textbf{x}+\textbf{b}) $$ with parameters $\theta=(\textbf{W},\textbf{b})$ describing an affine transformation where $\textbf{W}\in \mathbb{R}^{d \times n}$ and $\textbf{b} \in \mathbb{R}^d$. The output describes the probability distribution over the $d$ categories.

Let's see how we can use a multi-class classifier to classify inputs.

Linear Multi-Class Classification

Let's write $\textbf{W} = \begin{bmatrix} \textbf{w}_1^T \\ \textbf{w}_2^T \\ \vdots \\ \textbf{w}_d^T \end{bmatrix}$ where $\textbf{w}_j \in \mathbb{R}^n$. $$ \begin{align} \text{Classify}(\textbf{x}) &= \argmax_{j \in \{1,\dots,d\}} \text{softmax}(\textbf{W}\textbf{x}+\textbf{b})_j \\ &= \argmax_{j \in \{1,\dots,d\}} (\textbf{W}\textbf{x}+\textbf{b})_j \\ &= \argmax_{j \in \{1,\dots,d\}} \textbf{w}_j^T \textbf{x} + b_j \end{align} $$ The second equality holds because softmax preserves the ordering of its inputs: exponentiation is increasing and the normalizing sum is shared across categories, so the largest logit also has the largest probability.

Let's interpret this result geometrically.

Multi-Class Classification as d-Space

Linear multi-class classification splits $\mathbb{R}^n$ into $d$ regions, one per category. Here, we see the data points with the classifier's decision boundaries.

[Figure: data points with the classifier's decision boundaries.]

Let's visualize the decision boundaries together with the weight vectors $\textbf{w}_j$.

[Figure: decision boundaries with the weight vectors.]

The boundaries above are hard: they change sharply at the decision boundary. Softmax can be interpreted as a soft decision boundary that models the transition between categories smoothly. If we weight the colors by the probability of each class, the decision boundaries look like the following.

[Figure: decision boundaries weighted by class probabilities.]

Let's look at an example of multi-class classification.

Precipitation forecast

Let $x$ be the day of the week and $f(x)$ be the precipitation on that day (rain, snow, sleet, hail). The linear multi-class classifier computes $P(\text{rain}) = f_\theta(\textbf{x})_1 = \text{softmax}(\textbf{W}\textbf{x}+\textbf{b})_1$, $P(\text{snow}) = f_\theta(\textbf{x})_2 = \text{softmax}(\textbf{W}\textbf{x}+\textbf{b})_2$, and so forth. We predict the precipitation type with the highest probability.
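Here is a minimal sketch of a linear multi-class classifier in PyTorch, assuming a hypothetical problem with $n=4$ input features and $d=3$ categories; the parameters are randomly initialized, so the predicted category is arbitrary.

import torch

n, d = 4, 3
linear = torch.nn.Linear(n, d)            # computes the logits Wx + b
x = torch.randn(n)                        # a random input

logits = linear(x)
probs = torch.softmax(logits, dim=0)      # probability distribution over d categories

predicted = torch.argmax(probs)           # same category as argmax over the logits
assert predicted == torch.argmax(logits)
print(probs, predicted)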
Relationship Between Different Classifications

Let's look at the relationship between binary and multi-class classification more deeply. Why do we use a multi-class classifier as opposed to multiple binary classifiers? After all, we could perform multi-class classification by extending our binary classifier to $d$ binary classifiers, one for each category. The key difference is that in multi-class classification, we assume that each input has exactly one corresponding category. This is not necessarily the case when we use multiple binary classifiers. However, there is another task for which multiple binary classifiers are better suited.

Multi-Label Classification

In the multi-label classification setting, each input can have multiple corresponding categories. We provide some examples of when to use multi-class versus multiple binary classifiers below.

When should we perform multi-class vs multiple binary classification?

Multi-class classification is better suited when the input has exactly one corresponding category. For example: predicting the weather (rain, cloudy, sunny) in Austin, predicting the scientific name of an animal, or predicting the next word in a sentence. Multi-label classification is better suited when the input can have multiple corresponding categories. For example: predicting in which cities in Texas it will rain, predicting all the attributes of an animal, or predicting which books a given sentence can be found in.
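The following is a minimal sketch of the difference in PyTorch, assuming a hypothetical problem with 8 input features and $d=4$ categories: a multi-class head normalizes the logits with softmax so the probabilities sum to 1, while a multi-label head applies an independent sigmoid to each logit, acting like $d$ separate binary classifiers.

import torch

linear = torch.nn.Linear(8, 4)            # shared linear layer producing 4 logits
logits = linear(torch.randn(8))           # random hypothetical input

multi_class_probs = torch.softmax(logits, dim=0)   # exactly one category: sums to 1
multi_label_probs = torch.sigmoid(logits)          # independent categories: each in [0, 1]

print(multi_class_probs.sum())   # 1 (up to floating point)
print(multi_label_probs.sum())   # no constraint on the sum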