
Why don't people use nonlinear activation functions after projecting the query, key, and value in attention?

It seems like doing this would add much-needed nonlinearity; otherwise, we're just doing linear transformations.

This observation applies to the transformer, additive attention, etc.

  • I'm not sure I got your question right: for the attention model, where exactly would you place the non-linearity? Looking at Graph Attention Networks by Petar Velickovic, they do apply an activation function in eq. 5. Commented May 3, 2019 at 7:21
  • Can you provide an example of someone not using nonlinear activations in their attention? Commented May 4, 2019 at 21:53
  • I think what he means is that the queries, keys, and values are computed as linear projections, i.e. the input is simply multiplied by a matrix: q = x * W_q, k = x * W_k, and v = x * W_v respectively. We could use a non-linear function on each of them, q = σ(x * W_q) etc., but it is redundant because later on we use the softmax function and at the end an MLP which also has non-linearities in it. Commented Jul 23, 2022 at 8:13
  • @AndreasK. Related. Why do we apply these projections after the embedding layer, which is itself linear? Aren't these two composed linearities redundant? Commented Jun 25, 2024 at 20:23

2 Answers


It seems like doing this would add much-needed nonlinearity; otherwise, we're just doing linear transformations.

Attention is broadly defined as the following operation (the $\text{softmax}$ is sometimes replaced by $\tanh$):

$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

where $Q$, $K$, and $V$ are matrices that are themselves functions of the inputs.
There are three nonlinear operations here:

  1. The inner product $QK^T$ is nonlinear: it multiplies two functions of the inputs. For example, in the case of self-attention, $Q = XW_Q$ and $K = XW_K$ are two linear transforms of the same $X$, so $QK^T = X \left(W_Q W_K^T\right) X^T$ is a quadratic function of the inputs.
  2. The $\text{softmax}(x_i) = e^{x_i} / \sum_n e^{x_n}$ function is obviously nonlinear ($\tanh$ as well).
  3. The final $\text{softmax}(\dots)\, V$ product is also nonlinear, for the same reasons as (1).

I would say it is clearly not just a linear transformation: there are quite a few nonlinearities in the attention block.
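A minimal sketch in NumPy may make this concrete (the function names and shapes are mine, chosen for illustration, not taken from any particular implementation). The input $X$ enters the computation three times, and the toy check at the end shows the layer is not additive in $X$, i.e. not a linear map:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention for a single sequence."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # linear projections of the same X
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # quadratic in X: X (W_Q W_K^T) X^T
    weights = softmax(scores, axis=-1)         # softmax nonlinearity
    return weights @ V                         # weights depend on X, and so does V

# Toy check that the layer is not linear in its input:
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
lhs = self_attention(X1 + X2, W_Q, W_K, W_V)
rhs = self_attention(X1, W_Q, W_K, W_V) + self_attention(X2, W_Q, W_K, W_V)
print(np.allclose(lhs, rhs))  # False: attention is not additive in X
```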


This observation applies to the transformer, additive attention, etc.

Let's see what happens next with the outputs of the attention layers:

In the transformer model, the outputs of the multi-head self-attention are fed into a feed-forward network inside each block:

[Figure: cutout of the Transformer architecture diagram (Figure 1) showing the attention sub-layer followed by the feed-forward sub-layer]

"Feed-forward" means that the inputs are multiplied by a weight matrix and then a nonlinear activation function is applied.

The additive attention approach directly applies another $\text{softmax}$ to the outputs of what one would call the attention block:

$$e_{ij} = v_a^T \tanh\left(W_as_{i-1} + U_a h_j\right)$$

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}$$
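As a rough illustration, here is what those two equations look like in NumPy (the dimension names and the helper function are mine, chosen for clarity):

```python
import numpy as np

def additive_attention_weights(s_prev, H, W_a, U_a, v_a):
    """Additive scores e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), softmax-normalised over j."""
    # s_prev: previous decoder state, shape (d_s,); H: encoder states, shape (T, d_h)
    e = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a   # tanh nonlinearity, then projection
    e = e - e.max()                                 # numerical stability
    alpha = np.exp(e) / np.exp(e).sum()             # softmax over the T source positions
    return alpha

# Illustrative dimensions
rng = np.random.default_rng(0)
d_s, d_h, d_a, T = 6, 5, 4, 7
s_prev, H = rng.normal(size=d_s), rng.normal(size=(T, d_h))
W_a, U_a, v_a = rng.normal(size=(d_a, d_s)), rng.normal(size=(d_a, d_h)), rng.normal(size=d_a)
print(additive_attention_weights(s_prev, H, W_a, U_a, v_a).sum())  # ~1.0
```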


To summarize: I don't think the premise of the question is correct. Various nonlinearities are present inside the attention block itself, and more are typically applied after the attention output is computed.

  • Self-attention does not mean that $Q = K = X$. It only means that $Q=XW_Q$ and $K=XW_K$, or in other words that K and Q are obtained from the same X, as opposed to cross-attention, where keys and queries come from different sequences. Commented Mar 11, 2023 at 23:51
  • @hans I stand corrected, thank you. Edited the answer to reflect that. Commented Mar 12, 2023 at 13:33

Here are a couple of reasons why nonlinear activation functions aren't typically used after projecting the query, key, and value vectors in attention mechanisms like those found in transformers:

Redundancy: The attention mechanism itself already introduces non-linearity through the softmax function. Softmax takes the attention scores (dot products of the projected queries and keys) and squashes them into probabilities, creating a non-linear relationship between input and output.
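A quick sketch of that point (in NumPy, just to illustrate; the helper function is mine): softmax is not a linear map, so even the score-to-weight step already breaks linearity.

```python
import numpy as np

def softmax(z):
    # stable softmax over a 1-D vector of scores
    e = np.exp(z - z.max())
    return e / e.sum()

a = np.array([1.0, 2.0, 3.0])
b = np.array([0.5, -1.0, 2.0])
# a linear map f would satisfy f(a + b) == f(a) + f(b); softmax does not
print(np.allclose(softmax(a + b), softmax(a) + softmax(b)))  # False
```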

Later Layers Handle Non-Linearity: The transformer architecture addresses the need for non-linearity in later stages. Following the multi-head attention layer, there's a fully-connected feed-forward network (MLP) with one or more hidden layers. These hidden layers typically use ReLU or similar non-linear activation functions, allowing the network to learn complex relationships between the attention outputs and the final prediction.

