Attention is all you need
Summary
- Introduces the Transformer, an encoder-decoder architecture for sequence
  translation built solely around attention modules.
- In particular, it is neither convolutional nor recurrent.
- Test performance is state-of-the-art, at a reduced training cost.
- Two types of attention modules are used:
  - Encoder-Decoder Attention,
  - Self-attention.
- The attention mechanism is a scaled variant of multiplicative (dot-product) attention.
Attention modules
Scaled Dot-Product Attention
- With $Q, K, V$ respectively the queries, keys and values in the attention
module:
- $\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V$
- The scaling factor $\frac{1}{\sqrt{d_k}}$ counteracts the effect of increasing
  the dimension $d_k$ of the keys:
  - With increasing $d_k$, the dot products in $QK^T$ have higher variance.
    - e.g. the dot product of two independent random vectors whose $d_k$ components have mean $0$ and variance $1$ has mean $0$ and variance $d_k$.
  - This leads to likelier saturation of the softmax and less effective gradient steps (see the sketch below).
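
A minimal NumPy sketch of the formula above; the shapes, names, and the absence of masking and batching are simplifications, not the paper's implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) scaled dot products
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax over the keys
    return weights @ V                              # (n_q, d_v) weighted sum of values

# Toy usage with arbitrary sizes: 4 queries, 6 key/value pairs, d_k = d_v = 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
out = scaled_dot_product_attention(Q, K, V)         # shape (4, 8)
```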
Multi-Head Attention
- The authors observe that multi-head attention (with $h = 8$ heads) gives better results than single-head attention.
- Each head uses keys (and queries, values) of reduced dimension $d_k / h$, keeping the total computational cost on par with the single-head case at full dimensionality (see the sketch below).
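
A rough sketch of the head split, under assumed shapes (all weight matrices are $(d_{\text{model}}, d_{\text{model}})$ placeholders for the learned projections; names are illustrative): each head attends in a reduced dimension, and the per-head outputs are concatenated and projected back.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)          # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x_q, x_kv, W_q, W_k, W_v, W_o, h=8):
    """Project into h heads of reduced dimension d_model // h, run scaled
    dot-product attention per head, concatenate, and project back."""
    n_q, d_model = x_q.shape
    d_head = d_model // h
    def split_heads(X, W):                           # (n, d_model) -> (h, n, d_head)
        return (X @ W).reshape(X.shape[0], h, d_head).transpose(1, 0, 2)
    Q = split_heads(x_q, W_q)
    K = split_heads(x_kv, W_k)
    V = split_heads(x_kv, W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # per-head scaled dot products
    out = softmax(scores) @ V                              # (h, n_q, d_head)
    concat = out.transpose(1, 0, 2).reshape(n_q, d_model)  # concatenate the heads
    return concat @ W_o

# Toy usage: d_model = 64, h = 8, so each head works in dimension 8
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 64))
W = [rng.normal(size=(64, 64)) * 0.1 for _ in range(4)]
y = multi_head_attention(x, x, *W)                   # same sequence as queries and keys/values
```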
Self-Attention
- Both the encoder and the decoder use self-attention modules, which let each position attend (in a single step, rather than sequentially) to all other positions of the input/output sequence (in the decoder, masked so that a position only attends to earlier output positions); see the toy example below.
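
To make the two uses of attention concrete, here is a toy continuation of the sketch above, reusing the hypothetical `multi_head_attention` and the weights `W` defined there: self-attention feeds the same sequence as queries, keys, and values, while encoder-decoder attention takes its queries from the decoder and its keys/values from the encoder output.

```python
import numpy as np

rng = np.random.default_rng(1)
enc_states = rng.normal(size=(10, 64))   # hypothetical encoder sequence, 10 tokens
dec_states = rng.normal(size=(7, 64))    # hypothetical decoder prefix, 7 tokens

# Self-attention: queries, keys, and values all come from the same sequence,
# so every position attends to every other position in one step (masking omitted).
enc_self = multi_head_attention(enc_states, enc_states, *W)

# Encoder-decoder attention: queries come from the decoder,
# keys and values from the encoder output.
cross = multi_head_attention(dec_states, enc_states, *W)
```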