# Probability

Take a coin with a head and a tail and flip it into the air. What side will it land on? Short answer: we don’t know. Long answer: if the coin is perfectly balanced, it has a 50% chance of landing on heads and a 50% chance of landing on tails. Probability gives us a mathematical language to describe things that we do not fully know.

Why don’t we know the outcome of a coin toss? After all, the physics is well understood, and there is very little randomness in physics at the level of a coin toss. However, we do not know the state of the world precisely. We cannot observe the exact force applied to the coin when we flick it into the air. We do not know the exact mass of the coin. We do not know the wind speed or the density of the air. Because of that, the outcome of the coin toss looks random to us.

Probability and statistics is far too large a field to cover in a few pages. After all, entire PhDs are awarded for advancing the field bit by bit. Much of the literature, online or in textbooks, is geared towards that audience: experts in the field of probability theory. Much of it is not needed to understand deep learning. The remainder of this section should give you all the statistics basics you need for deep learning. We start with a broad overview of the language used in probability and some of the mathematical operations. Some concepts may seem a bit foreign if you have not taken a machine learning class before. Don’t worry; they will become clear throughout the course of this class.

## Events and Probabilities

**Event.** An event is a single outcome or set of outcomes of an experiment. Probabilities are assigned to events.

**Probability.** A probability is a non-negative number associated with an event. The higher the probability, the more frequently we expect that outcome. If an event has a probability of 0, it will never happen. If an event has a probability of 1, it will always happen.
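To build some intuition for the coin toss, we can simulate it. The sketch below is illustrative only; the function name `fraction_of_heads` and the fixed seed are our own choices, not part of the notation above:

```python
import random

def fraction_of_heads(n, p_head=0.5, seed=0):
    """Simulate n coin tosses and return the fraction that land on heads."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    heads = sum(rng.random() < p_head for _ in range(n))
    return heads / n

# With many tosses, the fraction of heads approaches the probability 0.5.
print(fraction_of_heads(100_000))
```

Note that the fraction is rarely exactly 0.5 for any finite number of tosses; it merely gets close, which is exactly the gap between observed frequencies and probabilities discussed above.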
**Event.**

$$ \begin{array}{c} P(\overbrace{\colorbox{#3465a4}{X}=\colorbox{#f57900}{a}}^{\color{#cc0000}\text{an event}})\\ \color{#3465a4}\text{a random variable}\ ^{\lrcorner} \qquad\color{#f57900}\ ^{\llcorner}\text{a specific value} \end{array} $$

Let’s consider a random variable $X$. An event is an assignment to that random variable $X = a$. The probability of that event is $P(X=a)$.

| Symbol | Description |
|---|---|
| $X$ | A random variable |
| $a$ | A specific value |
| $X=a$ | An event |
| $P(X=a)$ | The probability of that event |
| $p(X=a)$ | The probability density of an event |

Throughout this class, we will use upper-case letters for random variables and lower-case letters for values.

**Coin toss.** Let’s look at our coin toss example in more detail.

| Symbol | Description |
|---|---|
| $X$ | A random variable describing the outcome of the toss |
| $\mathrm{head}$ | The value for a heads outcome |
| ${X=\mathrm{head}}$ | The heads event |
| ${P(X=\mathrm{head})}$ | The probability of heads |

The above examples rely on categorical random variables (i.e. discrete events: head or tail). Generally, the probability of a continuous event $P(Y=y)$ cannot be measured; it is essentially always zero. In the context of this class, we use probability densities for all continuous variables.

**Probability density.** The probability density of a continuous event $Y=y$ is the derivative

$$ p(Y=y) = \lim_{\epsilon \to 0}\frac{P(y \le Y < y + \epsilon)}{\epsilon}. $$

Like probabilities, densities are always non-negative, ${p(Y=y) \ge 0}$, but unlike probabilities they may be larger than one. Let’s look at an example.

**Coin angle.** Let’s again look at our coin toss example, but instead of measuring the outcome as heads or tails, we measure the angle at which the coin first makes contact with the ground. Let’s call this angle $\alpha$, and the corresponding random variable $Y$. The probability of the event $P(Y=\alpha)$ cannot be measured: the chance that we get our measurement of the angle perfectly right, up to 100+ digits, is essentially zero.
However, we can measure the probability over a range of values ${P(\alpha_1 \le Y < \alpha_2)}$. Likewise, we can look at the probability density of a single event $Y=\alpha$.

| Symbol | Description |
|---|---|
| $Y$ | The random variable corresponding to the angle at which the coin hits the ground |
| $\alpha_1, \alpha_2$ | Two angles with $\alpha_1 < \alpha_2$ |
| ${\alpha_1 \le Y < \alpha_2}$ | The event that the coin lands at an angle between $\alpha_1$ and $\alpha_2$ |
| ${P(\alpha_1 \le Y < \alpha_2)}$ | The probability that the coin lands at an angle between $\alpha_1$ and $\alpha_2$ |
| ${p(Y = \alpha_1)}$ | The probability density that the coin lands at angle $\alpha_1$ |

In the context of deep learning, we do not draw a big distinction between continuous and discrete distributions. In fact, we often use the same unified notation for both. We commonly rely on a short form $P(x)$ to describe the probability (density) of an event $x$. In the context of this class, we do not draw a distinction between probabilities and probability densities: for discrete events we use probabilities, for continuous events probability densities.

**Short forms.**

| Symbol | Description |
|---|---|
| ${P(x) = P(X=x)}$ | The probability of a discrete event |
| ${P(x) = p(X=x)}$ | The probability density of a continuous event |

## Probability vs Likelihood

You might have heard the term likelihood in place of probability.

**Likelihood vs probability.** Likelihood refers to the chance that something happened in the past. Probability refers to an event that has not yet happened.

**Coin toss.** In our coin example, we can estimate the probability of the coin landing on heads or tails before we flip it. Once the coin has landed, we talk about the likelihood of that outcome: how likely it was to happen.

## Probability Distributions

Above, we looked at probabilities of individual events $P(x)$. However, if you look closely, $P$ itself is a function of $x$. The function $P(x)$ is called a probability distribution.
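The coin-angle quantities above can be estimated by simulation. The sketch below assumes, purely for illustration, that the landing angle is uniform over $[0, 360)$ degrees: the fraction of samples falling in a range estimates ${P(\alpha_1 \le Y < \alpha_2)}$, and the probability of a narrow window divided by its width estimates the density $p(Y=\alpha_1)$:

```python
import random

rng = random.Random(0)
# Hypothetical data generating distribution: landing angle uniform on [0, 360)
samples = [rng.uniform(0.0, 360.0) for _ in range(200_000)]

a1, a2 = 10.0, 20.0
# Probability of a range: fraction of samples with a1 <= Y < a2
p_range = sum(a1 <= y < a2 for y in samples) / len(samples)

# Density at a1: probability of a narrow window [a1, a1 + eps), divided by eps
eps = 1.0
p_density = sum(a1 <= y < a1 + eps for y in samples) / len(samples) / eps

print(p_range, p_density)  # roughly 10/360 and 1/360
```

Shrinking `eps` further brings the estimate closer to the limit in the density definition, at the cost of needing more samples for a stable count.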
**Probability distribution.** A function $P(x)$ that captures the probabilities of any value $x$.

There are several useful properties of distributions.

| Property | Definition |
|---|---|
| Non-negativity | $0 \le P(x)$ |
| Boundedness (discrete only) | $0 \le P(x) \le 1$ |
| Summation (discrete) | $E_P[1] = \sum_x P(x) = 1$ |
| Summation (continuous) | $E_P[1] = \int P(x) dx = 1$ |

**Coin toss.** In our coin toss example, $P$ is a discrete probability distribution over two outcomes $x \in \{ \text{head}, \text{tail} \}$. The probability distribution $P$ maps the outcome $x$ to a value from 0 to 1: $P: x \rightarrow [0,1]$. Furthermore, $P(\text{head}) + P(\text{tail}) = 1$, since the coin toss will result in either $\text{head}$ or $\text{tail}$.

## Types of Distributions

There are two kinds of distributions: the data generating distribution and model distributions.

**Data generating distribution.** The data generating distribution, or data distribution for short, governs the process by which the underlying data is produced.

For most real-world applications it is impossible to measure the true data generating distribution. However, it is possible to collect observations through sampling; more on this shortly. Each deep network outputs a probability distribution, either implicitly or explicitly. We call this distribution the model distribution.

**Model distribution.** A model distribution is a probability distribution produced by a deep network, or a statistical model in general.

Why do deep networks output distributions? It is easier to train deep networks that output distributions rather than exact values. As you will see later in class, it greatly simplifies the mathematical setup of the training objective, and it makes large-scale optimization tractable. The model distribution, among other things, captures the uncertainty of the deep network and its outputs, although not always accurately.

**Empirical distribution.** The empirical distribution is a special model distribution for categorical variables.
It measures the frequency at which each category appears in the data. Let’s look at our coin toss example again.

**Coin toss.** Let the data generating distribution be a Bernoulli distribution with a chance of $0.5$ for heads and a chance of $0.5$ for tails. The empirical distribution for different numbers of samples is shown below.

*(Figure: empirical distributions of the coin toss for several sample sizes.)*

As you can see, the empirical distribution rarely fits the data generating distribution exactly, even for large sample sizes. In fact, the only guarantee we have is that the empirical and data generating distributions match in the limit of infinitely many samples.

## Unified Notation: Expectation and Variance

When reasoning about a single outcome, there is little to no difference between the continuous and discrete definitions of $P(x)$. The major difference lies in the way we reason about events over multiple outcomes: for discrete variables, we sum over the corresponding events; for continuous variables, we integrate. This is nicely captured in the definition of expectation.

**Expectation.** Consider a function $f(x)$.
The expectation measures the value of $f$ weighted by the probability $P$:

$$ E_{x \sim P}\left[f(x)\right] = \begin{cases}\sum_x P(x) f(x) & \text{discrete $P$}\\\int_x P(x) f(x) dx & \text{continuous $P$} \end{cases} $$

There are a few short forms for the expectation $E_{x \sim P}\left[f(x)\right]$:

| Short form | Meaning |
|---|---|
| $E_{P}\left[f(x)\right]$ | assumes the expectation is over $x$, indicated by $f(x)$ |
| $E\left[f(x)\right]$ | assumes the expectation is clear from context, or the distribution $P$ does not matter |
| $E_P\left[f\right]$ | assumes the expectation is over all variables of $f$ |
| $E\left[f\right]$ | a combination of all of the above |

This definition of expectation allows us to unify both discrete and continuous variables, and it is commonly used in deep learning. We will use this unified notation for the remainder of the class. There are a few other operations and properties that we commonly use throughout this class.

**Linearity of expectation.** The expectation is a linear function. We can split terms and pull out scalars:

$$ \begin{aligned} E\left[f(x) + g(x)\right] &= E\left[f(x)\right] + E\left[g(x)\right]\\ E\left[\alpha f(x)\right] &= \alpha E\left[f(x)\right] \end{aligned} $$

**Mean.** The mean, also called the expected value, is the expectation of the function $f(x)=x$:

$$ \mu_x = E_{x \sim P}\left[x\right] $$

**Variance.** The variance measures the deviation from the mean:

$$ \sigma_x^2 = Var_{x \sim P}\left[x\right] = E_{x \sim P}\left[(x - \mu_x)^2\right] $$

Unifying continuous and discrete distributions is convenient in the context of this class. However, it has its limits, especially for mathematical proofs.

**A word of warning.** Mixing continuous and discrete definitions of probabilities is convenient in the context of deep learning. However, you should be careful using this for more complex proofs and derivations. Continuous distributions may behave in odd and unexpected ways.
For example, they may have a variance $Var_{x \sim P}\left[x\right]$ that is not finite, thus breaking many of the implicit assumptions we make here.

**Mean and variance of an empirical distribution.** TODO

## Sampling

**Sampling.** Sampling is the process of producing an outcome $x$ following a distribution $P(x)$. It is denoted as $x \sim P$.

The higher the probability $P(x)$, the larger the chance that we draw $x$. Any data we produce is sampled from a data-generating distribution. Below are a few examples.

**Sampling in the real world.** You can see instances of sampling in everyday life all around you:

| Sampling | Data generating distribution |
|---|---|
| Rolling a die | The physical process of the die leaving your hand and bouncing around a surface |
| Typing on your keyboard or phone | The physical process of moving your finger and touching a key or phone screen. More likely than not, you will miss a key every so often |
| Taking a picture | The set of all natural images in your environment. Even if you try hard, you don’t have full control over your environment, hence you’re producing a sample from a distribution of different possible outcomes |

The set of samples you draw is always biased, meaning it will never perfectly represent the underlying data-generating distribution. The exercise below will give you an idea why.

**Bias in samples.** Let’s assume we take pictures of squirrels. Most of them are either red or brown; however, one in one hundred thousand is an albino (white). If you create a dataset of ten thousand squirrel pictures (i.i.d., no individual squirrel is pictured twice), what is the chance that you’ll create a dataset with at least one albino squirrel? Round your answer to three decimal places (e.g. 0.025).

- 0.025
- 0.095
- 0.1
- 0.125
- 0.2

Let’s say you’re lucky enough and captured one. Does this dataset reflect the true distribution of albino squirrels in the wild?

- yes
- no

Even though the samples are biased (i.e.
do not reflect the true data generating distribution), it does not mean that the sampling procedure is wrong. If we draw sufficiently many samples, i.e. infinitely many, the empirical distribution of the samples will reflect the data generating distribution.

## Statistical Models

Recall from the last section that a statistical model is a function $f_\theta: x \to y$. For general models, the input $x$ can vary widely in complexity. The output $y$ is typically a real-valued number (regression task) or a category (classification task).

**Regression model.** A regression model is a statistical model that outputs one or more real values:

$$ f_\theta: \mathbb{R}^n \to \mathbb{R}^d, $$

where $d$ is the number of values we regress to.

Directly producing discrete categories is hard for statistical models. They receive continuous inputs, most of their internal computation is continuous, and they would like their outputs to be continuous. A common trick is to parametrize categorical models such that they output a distribution over all categories.

**Classification model.** A classification model is a statistical model that outputs probabilities of categories:

$$ f_\theta: \mathbb{R}^n \to P(X), $$

where $P(X) \subset \mathbb{R}^d$ is a probability distribution over $d$ categories.

Statistical models need data to determine their parameters $\theta$.

**Data.** Data, usually denoted as $D$, is a collection of samples from a data generating distribution $P_D$:

$$ D = \left\{d_1, d_2, \ldots \right\} \qquad \text{where }d_i \sim P_D. $$

**Labeled data.** Labeled data is a collection of samples from a data generating distribution $P_D$ with corresponding annotations:

$$ D = \left\{(d_1, l_1), (d_2, l_2), \ldots \right\} \qquad \text{where }d_i \sim P_D, $$

where a human or machine provides an additional label $l_i$.

**Data.** Let’s consider some data sources you may have had contact with:

- **The internet**: Unlabeled data of text, images, audio, video, etc.
- **Personal photo collection**: Labeled data of raw pixels with associated tags, comments, geo-location, etc.
- **Social media**: Labeled data of visual or text content with viewing time and possible rating annotations.
- **Weather data**: Unlabeled data of temperature, precipitation, atmospheric measurements, etc. from weather stations and weather satellites.

We will make all the concepts above concrete with code examples in TODO: ref.
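As a quick sanity check for the albino-squirrel exercise from the sampling section: with $n$ i.i.d. samples and per-sample albino probability $p$, the chance of at least one albino is $1 - (1-p)^n$, which we can compute directly:

```python
p_albino = 1 / 100_000   # one in one hundred thousand squirrels is albino
n = 10_000               # dataset size

# P(at least one albino) = 1 - P(no albino in any of the n i.i.d. samples)
p_at_least_one = 1 - (1 - p_albino) ** n
print(round(p_at_least_one, 3))  # prints 0.095
```

So even a ten-thousand-image dataset most likely contains no albino squirrel at all, which is exactly the sampling bias discussed above.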