Connection to Convolution [1]

Multi-Head Attention

If the queries, keys, and values in the attention are simply the input embeddings, the dot-product attention has very limited expressive capability because there are no learnable parameters. A straightforward way to mitigate this is to linearly project the input embeddings to queries, keys, and values. Recall that the convolution operation maps the input embeddings into multiple independent channels to capture information from different aspects. Similarly, we linearly project the queries, keys, and values with $h$ different, learned linear projections to $d_k$, $d_k$, and $d_v$ dimensions, respectively. On each of the $h$ projected (query, key, value) triplets, we perform the scaled dot-product attention in parallel, obtaining a $d_v$-dimensional output per head. Finally, we concatenate the $h$ outputs and project the concatenated result once more to obtain the final output.

Multi-head Dot-product Attention. For an input triple of queries $\mathbf{Q}\in\mathbb{R}^{M\times C}$, keys $\mathbf{K}\in\mathbb{R}^{N\times C}$, and values $\mathbf{V}\in\mathbb{R}^{N\times C}$, let $\mathbf{W}_{Q, i}\in\mathbb{R}^{C\times d_k}$, $\mathbf{W}_{K, i}\in\mathbb{R}^{C\times d_k}$, and $\mathbf{W}_{V, i}\in\mathbb{R}^{C\times d_v}$, $1\leq i \leq h$, be $h$ groups of linear projection matrices, and let $\mathbf{W}_O\in\mathbb{R}^{hd_v\times C}$ be the output projection matrix. Then

$$
\mathrm{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V})=\begin{bmatrix}\mathrm{head}_1 & \cdots & \mathrm{head}_h\end{bmatrix}\mathbf{W}_O,
\qquad
\mathrm{head}_i=\mathrm{Attention}(\mathbf{Q}\mathbf{W}_{Q, i}, \mathbf{K}\mathbf{W}_{K, i}, \mathbf{V}\mathbf{W}_{V, i}),
$$

where the head outputs $\mathrm{head}_i\in\mathbb{R}^{M\times d_v}$ are concatenated along the feature dimension so that the product with $\mathbf{W}_O\in\mathbb{R}^{hd_v\times C}$ is well defined.
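
To make the definition concrete, here is a minimal NumPy sketch of the formula above. The function names, the random test shapes, and the choice of NumPy are illustrative assumptions, not taken from the text: each head applies scaled dot-product attention to its own projected triplet, the head outputs are concatenated along the feature dimension, and the result is projected by $\mathbf{W}_O$.

```python
# Minimal NumPy sketch of multi-head dot-product attention (illustrative, not a
# reference implementation). Shapes follow the text: Q is (M, C), K and V are
# (N, C); W_Q[i], W_K[i] project to d_k columns, W_V[i] to d_v columns, and W_O
# maps the concatenated (M, h*d_v) output back to C columns.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (M, N) similarity scores
    return softmax(scores, axis=-1) @ V       # (M, d_v) weighted sum of values

def multi_head(Q, K, V, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V are lists of h per-head projection matrices; W_O is (h*d_v, C)."""
    heads = [attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i]) for i in range(len(W_Q))]
    return np.concatenate(heads, axis=-1) @ W_O   # concat along features, then project

# Toy example (hypothetical sizes): M=4 queries, N=6 keys/values, C=8, h=2, d_k=d_v=4.
rng = np.random.default_rng(0)
M, N, C, h, d_k, d_v = 4, 6, 8, 2, 4, 4
Q, K, V = rng.normal(size=(M, C)), rng.normal(size=(N, C)), rng.normal(size=(N, C))
W_Q = [rng.normal(size=(C, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(C, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(C, d_v)) for _ in range(h)]
W_O = rng.normal(size=(h * d_v, C))
print(multi_head(Q, K, V, W_Q, W_K, W_V, W_O).shape)   # (4, 8), i.e. (M, C)
```

Note that the output has the same feature dimension $C$ as the input, which is what allows multi-head attention layers to be stacked.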