## Permutation Invariance

The formulation of attention is permutation-invariant, meaning the output of attention remains unchanged if we permute the keys and values with the same permutation.

**Permutation invariance of attention (list form).** Given an input query $\mathbf{q}\in\mathbb{R}^{C}$, two ordered lists of keys $\mathbf{K}_a=\begin{bmatrix}\mathbf{k}_1, \mathbf{k}_2, \mathbf{k}_3 \end{bmatrix}$ before permutation and $\mathbf{K}_b=\begin{bmatrix}\mathbf{k}_2, \mathbf{k}_3, \mathbf{k}_1 \end{bmatrix}$ after permutation, and two ordered lists of values $\mathbf{V}_a=\begin{bmatrix}\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3 \end{bmatrix}$ before permutation and $\mathbf{V}_b=\begin{bmatrix}\mathbf{v}_2, \mathbf{v}_3, \mathbf{v}_1 \end{bmatrix}$ after the same permutation, the output vectors $\mathbf{o}_a\in\mathbb{R}^{C}$ and $\mathbf{o}_b\in\mathbb{R}^{C}$ are computed as follows:

$$
\mathbf{o}_a = \frac{e^{h(\mathbf{q}, \mathbf{k}_1)}}{\sum_j e^{h(\mathbf{q}, \mathbf{k}_j)}} \mathbf{v}_1 +\frac{e^{h(\mathbf{q}, \mathbf{k}_2)}}{\sum_j e^{h(\mathbf{q}, \mathbf{k}_j)}} \mathbf{v}_2 + \frac{e^{h(\mathbf{q}, \mathbf{k}_3)}}{\sum_j e^{h(\mathbf{q}, \mathbf{k}_j)}} \mathbf{v}_3,
$$

and

$$
\mathbf{o}_b = \frac{e^{h(\mathbf{q}, \mathbf{k}_2)}}{\sum_j e^{h(\mathbf{q}, \mathbf{k}_j)}} \mathbf{v}_2 +\frac{e^{h(\mathbf{q}, \mathbf{k}_3)}}{\sum_j e^{h(\mathbf{q}, \mathbf{k}_j)}} \mathbf{v}_3 + \frac{e^{h(\mathbf{q}, \mathbf{k}_1)}}{\sum_j e^{h(\mathbf{q}, \mathbf{k}_j)}} \mathbf{v}_1.
$$

It is easy to see that $\mathbf{o}_a = \mathbf{o}_b$ by the commutativity of addition. We can write this property more generally in matrix form.

**Permutation invariance of attention (matrix form).** Let $\mathrm{Perm}(\mathbf{M})$ denote an arbitrary permutation of the rows of a matrix $\mathbf{M}$, i.e., $\mathrm{Perm}(\mathbf{M})=\mathbf{P}\mathbf{M}$ for a permutation matrix $\mathbf{P}$. Since the softmax is applied row-wise, permuting the columns of its argument simply permutes the columns of its output, so

$$
\begin{aligned}
\mathrm{Attention}(\mathbf{Q}, \mathrm{Perm}(\mathbf{K}), \mathrm{Perm}(\mathbf{V}))
&= \mathrm{softmax}\left(\frac{\mathbf{Q}\,\mathrm{Perm}(\mathbf{K})^\top}{\sqrt{C}}\right)\mathrm{Perm}(\mathbf{V}) \\
&= \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top\mathbf{P}^\top}{\sqrt{C}}\right)\mathbf{P}\mathbf{V} \\
&= \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{C}}\right)\mathbf{P}^\top\mathbf{P}\,\mathbf{V} \\
&= \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{C}}\right)\mathbf{V} \\
&= \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}).
\end{aligned}
$$

This property of permutation invariance, however, is often not desired. Consider our previous example, but now with two different sentences: "my kid likes this movie" and "this movie likes my kid". If we treat each word's embedding as an element of the keys/values, the output of the attention operation will be exactly the same for a given query. This is not what we want: shuffling the word order completely changes the meaning of the sentence, because the subject and object are swapped.

## How to Model Order?

To break the permutation invariance, we need to introduce an additional embedding that indicates the location of each input element. This is achieved by the so-called positional embedding.

## Position Embedding

Position embeddings are a set of embeddings (or vectors) that describe the location, or position, of elements in a sequence, so that each position is assigned a unique representation. If the input is denoted as $\mathbf{X}\in\mathbb{R}^{N\times C}$, then the position embeddings take the form $\mathrm{PE}\in\mathbb{R}^{N\times C}$ so that they can be seamlessly added to the input embedding. The new input becomes $\mathbf{X}'=\mathbf{X} + \mathrm{PE}$, as illustrated by the short numerical sketch below.
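The following minimal PyTorch sketch (the helper name `attention` and the tensor shapes are chosen purely for illustration) checks both points above: the attention output is unchanged when the keys and values are permuted together, and adding a positional embedding to the inputs breaks this invariance.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, C = 5, 8                                   # sequence length, embedding dimension

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(C)) V, with the softmax taken over the key dimension."""
    return F.softmax(Q @ K.T / C ** 0.5, dim=-1) @ V

q = torch.randn(1, C)                         # a single query
X = torch.randn(N, C)                         # token embeddings used as keys/values
perm = torch.randperm(N)                      # a random re-ordering of the tokens

# 1) Without positional embeddings: permuting keys and values together
#    leaves the attention output unchanged.
out_a = attention(q, X, X)
out_b = attention(q, X[perm], X[perm])
print(torch.allclose(out_a, out_b, atol=1e-6))    # True

# 2) With positional embeddings: PE is tied to the position, not the token,
#    so shuffling the tokens before adding PE changes the output.
PE = torch.randn(N, C)                        # stand-in for a positional embedding
out_c = attention(q, X + PE, X + PE)
out_d = attention(q, X[perm] + PE, X[perm] + PE)
print(torch.allclose(out_c, out_d, atol=1e-6))    # False (almost surely)
```

Because $\mathrm{PE}$ is tied to the position rather than the token, the same token contributes differently depending on where it appears, which is exactly what breaks the invariance.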
## Absolute vs Relative Positional Embeddings

Positional embeddings can be loosely categorized into two main families: (1) absolute positional embeddings, which capture information about the absolute position of each token in the sequence; and (2) relative positional embeddings, which capture information about the relative pairwise distance between any two tokens in the sequence.

Absolute positional embeddings are often easier to implement. However, their performance degrades when the length of the input sequence at inference time differs from the length seen during training, and such variable-length inputs are quite common in real NLP applications. In contrast, relative positional embeddings generalize better to sequences of unseen lengths.

Let's first look at a few examples of positional embeddings.

### Learnable absolute positional embedding

The simplest absolute positional embeddings take the form of a learnable matrix with the same dimensions as the input. For example, let $\mathbf{X}\in\mathbb{R}^{N\times C}$ denote the input; we define another learnable matrix $\mathrm{PE}\in\mathbb{R}^{N \times C}$, and the new input fed into the attention module becomes $\mathbf{X}'=\mathbf{X} + \mathrm{PE}$.

### Sinusoidal positional embedding

Another popular choice is the sinusoidal positional embedding, which uses sine and cosine functions of different frequencies. Let $\mathbf{X}\in\mathbb{R}^{N\times C}$ denote the input; the sinusoidal positional embeddings $\mathrm{PE}\in\mathbb{R}^{N\times C}$ are defined by:

$$
\mathrm{PE}(n, 2i) = \sin\left(\frac{n}{10000^{2i/C}}\right),\quad \mathrm{PE}(n, 2i+1) = \cos\left(\frac{n}{10000^{2i/C}}\right).
$$

Unlike learnable absolute positional embeddings, sinusoidal positional embeddings are fixed. Although they are generally thought of as a case of absolute positional embeddings with pre-defined values, sinusoidal positional embeddings encode relative positions in an indirect way: for any fixed offset $k$, $\mathrm{PE}(n+k, :)$ can be represented as a linear function of $\mathrm{PE}(n, :)$. (A short implementation sketch of both the learnable and sinusoidal variants is given at the end of this section.)

### Relative position embedding

It is worth noting that absolute and relative position embeddings are not mutually exclusive. One example is Rotary Positional Embedding (RoPE), which combines the best of both approaches and has been used in state-of-the-art large language models such as Google's PaLM and Meta's LLaMA. We will not go into its details here; if you are interested, please refer to the original paper [1].
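To make the two absolute variants above concrete, here is a minimal PyTorch sketch of both (the names `LearnablePE` and `sinusoidal_pe` and the `max_len` argument are placeholders chosen for illustration, not part of any particular library; an even embedding dimension $C$ is assumed).

```python
import math
import torch
import torch.nn as nn

class LearnablePE(nn.Module):
    """Learnable absolute positional embedding: one trainable C-dim vector per position.
    (Class name and max_len are illustrative, not from a specific library.)"""
    def __init__(self, max_len: int, dim: int):
        super().__init__()
        self.pe = nn.Parameter(torch.zeros(max_len, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C) with N <= max_len; returns X' = X + PE
        return x + self.pe[: x.size(0)]

def sinusoidal_pe(n_positions: int, dim: int) -> torch.Tensor:
    """Fixed sinusoidal positional embedding (dim assumed even):
    PE(n, 2i) = sin(n / 10000^(2i/C)),  PE(n, 2i+1) = cos(n / 10000^(2i/C))."""
    pos = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)   # (N, 1)
    two_i = torch.arange(0, dim, 2, dtype=torch.float32)                # 0, 2, 4, ...
    inv_freq = torch.exp(-math.log(10000.0) * two_i / dim)              # 1 / 10000^(2i/C)
    pe = torch.zeros(n_positions, dim)
    pe[:, 0::2] = torch.sin(pos * inv_freq)                             # even channels
    pe[:, 1::2] = torch.cos(pos * inv_freq)                             # odd channels
    return pe

# Usage: add the positional embedding to the input before the attention module.
N, C = 6, 16
X = torch.randn(N, C)
X_learnable = LearnablePE(max_len=512, dim=C)(X)   # learnable absolute PE
X_sinusoid = X + sinusoidal_pe(N, C)               # fixed sinusoidal PE
```

Note that the learnable table is sized by a fixed `max_len` chosen at training time, which is one reason this variant struggles with longer sequences at inference, whereas the sinusoidal table can simply be recomputed for any sequence length.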