self-attention

Self-attention is a mechanism where each token in a sequence attends to every token in the same sequence, including itself, producing similarity-based weights that mix their representations so the model can focus on the most relevant parts of the input.

In practice, the model projects its inputs into queries (Q), keys (K), and values (V), computes scaled dot-product attention over them, and runs several attention heads in parallel (multi-head attention) to capture different kinds of relationships.
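
To make the Q/K/V projections and the scaled dot-product step concrete, here's a minimal single-head sketch in NumPy. The shapes, random weights, and helper name are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Mix the value vectors using query-key similarity weights."""
    d_k = Q.shape[-1]
    # Similarity between every query and every key, scaled by sqrt(d_k)
    # to keep the softmax from saturating.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key dimension turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted mix of the value vectors.
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, embedding size 8 (toy numbers)
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # (4, 8): one mixed vector per token
```

Multi-head attention repeats this computation with separate, smaller projections per head and concatenates the results, which lets each head specialize in a different relationship.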

In autoregressive settings, a causal mask prevents tokens from attending to future positions. Self-attention underpins transformer architectures and enables long-range dependencies, though its O(n^2) cost in sequence length motivates more efficient variants for very long contexts.
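
A common way to apply a causal mask is to set the scores for future positions to negative infinity before the softmax, so their weights become zero. The sketch below shows this convention on a toy score matrix; the sizes and random scores are assumptions for illustration:

```python
import numpy as np

seq_len = 4
# Boolean upper-triangular mask: position i may not attend to j > i.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

scores = np.random.default_rng(1).normal(size=(seq_len, seq_len))
scores = np.where(mask, -np.inf, scores)  # future positions get -inf
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # rows sum to 1; the upper triangle is 0
```

Because every token scores against every other token, the score matrix has seq_len × seq_len entries, which is where the quadratic cost in sequence length comes from.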


By Leodanis Pozo Ramos • Updated Nov. 3, 2025