Attention is all you need

Summary

  • Introduces the Transformer, an encoder-decoder architecture for sequence transduction (e.g. machine translation) built solely around attention modules.
    • In particular, it uses neither convolutions nor recurrence.
  • Test performance is state-of-the-art, at a reduced training cost.
  • Two types of attention modules are used:
    • Encoder-Decoder Attention,
    • Self-attention.
  • The attention mechanism is a scaled variant of multiplicative (dot-product) attention.

Attention modules

[Figure: The Transformer architecture]

Scaled Dot-Product Attention

  • With $Q, K, V$ respectively the queries, keys and values in the attention module:
    • $\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V$
  • Scaling by $\frac{1}{\sqrt{d_k}}$ counteracts the effect of increasing the dimension $d_k$ of the keys:
    • As $d_k$ increases, the entries of the dot-product $QK^T$ have higher variance
      • e.g. The dot product of two independent random vectors of dimension $d$ with mean $0$ and variance $1$ has mean $0$ and variance $d$
    • Large dot products push the softmax into saturated regions where gradients are very small, making gradient steps less effective (a minimal sketch of the scaled attention follows this list).
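
To make the formula and the role of the scaling concrete, here is a minimal NumPy sketch (the function names are mine, not from the paper); the last lines check the variance claim above empirically.

```python
# Minimal sketch of scaled dot-product attention (illustrative, not the paper's code).
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # scaling counters the variance growth with d_k
    weights = softmax(scores, axis=-1)   # one distribution over the keys per query
    return weights @ V                   # (n_q, d_v)

# Quick check of the variance claim: the dot product of two independent
# standard-normal vectors of dimension d has variance roughly d.
rng = np.random.default_rng(0)
d = 512
dots = np.einsum("nd,nd->n",
                 rng.standard_normal((10000, d)),
                 rng.standard_normal((10000, d)))
print(dots.var())  # ~ d = 512; dividing by sqrt(d_k) brings scores back to unit scale
```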

Multi-Head Attention

  • The authors observe that multi-head attention (with $h = 8$ heads) gives better results than single-head attention.
  • Each head projects the queries, keys and values down to a reduced dimension $d_\text{model} / h$, keeping the total computational cost on par with single-head attention over the full dimension (see the sketch below).
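
A rough sketch of the multi-head variant, reusing `scaled_dot_product_attention` and `softmax` from the block above. Holding the projections as full-size matrices and slicing them into $h$ column blocks is an illustrative choice, equivalent to the paper's separate per-head projection matrices.

```python
# Illustrative multi-head attention; reuses scaled_dot_product_attention defined above.
import numpy as np

def multi_head_attention(Q, K, V, weights, h=8):
    # weights holds full-size projections W_q, W_k, W_v, W_o of shape (d_model, d_model);
    # slicing them into h column blocks yields the per-head projections.
    d_model = Q.shape[-1]
    d_head = d_model // h          # reduced per-head dimension d_model / h
    heads = []
    for i in range(h):
        cols = slice(i * d_head, (i + 1) * d_head)
        q = Q @ weights["W_q"][:, cols]    # (n_q, d_head)
        k = K @ weights["W_k"][:, cols]    # (n_k, d_head)
        v = V @ weights["W_v"][:, cols]    # (n_k, d_head)
        heads.append(scaled_dot_product_attention(q, k, v))
    # Concatenating h heads of size d_head restores d_model before the output projection.
    return np.concatenate(heads, axis=-1) @ weights["W_o"]

# Toy usage: self-attention over 5 token vectors of dimension d_model = 64.
rng = np.random.default_rng(0)
d_model, n = 64, 5
x = rng.standard_normal((n, d_model))
w = {name: rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
     for name in ("W_q", "W_k", "W_v", "W_o")}
print(multi_head_attention(x, x, x, w, h=8).shape)   # (5, 64)
```

Because each head operates on a $d_\text{model} / h$ slice, the total amount of matrix multiplication is roughly the same as a single head over the full dimension.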

Self-Attention

  • Both the encoder and the decoder use self-attention modules, which allow them to attend (in a single sequential step) to all other positions of the input/output sequence (see the usage sketch below).
    • In the decoder, attention to future positions is masked out to preserve the auto-regressive property.
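
As a usage sketch (reusing `multi_head_attention`, `softmax` and the toy `x`, `w` from above): self-attention simply feeds the same sequence as queries, keys and values, and the causal mask shown here on the raw scores illustrates, in a simplified form, how decoder self-attention hides future positions.

```python
# Self-attention: queries, keys and values all come from the same sequence x.
encoder_self_attention = multi_head_attention(x, x, x, w, h=8)

# Decoder-side illustration: a causal mask sets scores for future positions to -inf,
# so the softmax assigns them zero weight and position i only attends to positions <= i.
scores = x @ x.T / np.sqrt(x.shape[-1])
causal_mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
masked_scores = np.where(causal_mask, -np.inf, scores)
causal_weights = softmax(masked_scores, axis=-1)   # rows sum to 1, upper triangle is 0
```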