Attention¶


Attention is all you need!

Introduction¶

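In a Seq2Seq model, the encoder compresses the input sequence into a single fixed-length context vector, and the decoder generates the output sequence from that vector. This breaks down on long sequences: squeezing the whole input into one vector inevitably loses information.

Attention addresses this by letting the decoder focus on different parts of the input sequence each time it generates an output.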

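The attention mechanism of the human eye

When the human eye looks at an object, it has a focus: the part in focus is seen sharply, while the rest appears blurry.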

Attention¶

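For example, at the first step, \(s_0\) is scored against every \(h_i\) to obtain \(e_{1i}\); a softmax over these scores gives \(\alpha_{1i}\); and the expectation of the \(h_i\) under the weights \(\alpha_{1i}\) (a weighted sum) gives the context vector \(c_1\).

That is,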

\[ e_{t+1,i} = f_{att}(s_t, h_i) \]

\[ \alpha_{t+1,i} = \frac{\exp(e_{t+1,i})}{\sum_{j=1}^T \exp(e_{t+1,j})} \]

\[ c_{t+1} = \sum_{i=1}^T \alpha_{t+1,i} h_i \]

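Once \(c_{t+1}\) is obtained, it is combined with \(y_t\) to produce the next state \(s_{t+1}\). The process then repeats until the end-of-sequence token is generated.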
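The three equations above can be sketched in NumPy as a single attention step. This is a minimal illustration, not a definitive implementation: the function names, the toy data, and the use of a plain dot product for \(f_{att}\) are assumptions made for the example.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def attention_step(s_t, H, f_att):
    """One attention step for decoder state s_t.

    H has shape (T, D): one encoder hidden state h_i per row.
    Returns the attention weights (T,) and the context vector (D,).
    """
    e = np.array([f_att(s_t, h_i) for h_i in H])  # alignment scores e_{t+1,i}
    alpha = softmax(e)                             # attention weights, sum to 1
    c = alpha @ H                                  # context vector c_{t+1}
    return alpha, c

# Hypothetical toy data: T = 4 encoder states of dimension D = 3.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))
s0 = rng.normal(size=3)

# Assumption: a plain dot product stands in for f_att.
alpha, c = attention_step(s0, H, lambda s, h: s @ h)
print(alpha.sum(), c.shape)
```

Because the weights are non-negative and sum to 1, the context vector is a convex combination of the encoder states, which is why the text describes it as an expectation.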


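For example, when translating English into French, each French word attends to the English words with different weights (in the attention map, brighter means more attention).

Likewise, when generating an image caption, each word attends to the regions of the image with different weights (again, brighter means more attention).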

Image Captioning with RNN and Attention¶

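The image is first passed through a CNN, which produces a grid of features (shown as the blue matrix in the figure). Each feature vector \( h_{i,j} \) carries information about a different region of the image.

The attention function \( f_{att}(s_{t-1}, h_{i,j}) \) computes the alignment scores \( e_{t,i,j} \), which indicate how much the decoder attends to image feature \( h_{i,j} \) at time step \( t \).

A softmax over the alignment scores gives the attention weights \( \alpha_{t,i,j} \), which measure how important each feature is for generating the current output.

The context vector is the weighted sum \( c_t = \sum_{i,j} \alpha_{t,i,j} h_{i,j} \), which aggregates information from the different regions of the image.

Starting from the initial state \( s_0 \), the decoder combines the context vector \( c_t \) with the previous output \( y_{t-1} \) to produce the next state \( s_t \) and the output \( y_t \).

This continues until the end-of-sequence token is generated.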
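The attention computation over a feature grid can be sketched as follows. This is only an illustration under stated assumptions: a dot product stands in for \( f_{att} \), the decoder state is assumed to have the same dimension as the feature vectors, and the function name and toy shapes are hypothetical.

```python
import numpy as np

def grid_attention(s_prev, feats):
    """Attend over a CNN feature grid.

    feats:  (H, W, D) grid of feature vectors h_{i,j}
    s_prev: (D,) previous decoder state s_{t-1}
    Returns attention weights of shape (H, W) and a context vector of shape (D,).
    """
    e = feats @ s_prev                 # (H, W) alignment scores e_{t,i,j}
    w = np.exp(e - e.max())            # stabilised exponentials
    alpha = w / w.sum()                # softmax over all H*W grid positions
    c = np.einsum("ij,ijd->d", alpha, feats)  # c_t = sum_{i,j} alpha_{t,i,j} h_{i,j}
    return alpha, c

# Hypothetical toy input: a 3x3 grid of 8-dimensional features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 3, 8))
s_prev = rng.normal(size=8)

alpha, c = grid_attention(s_prev, feats)
print(alpha.shape, c.shape)
```

The softmax is taken jointly over both grid indices, so the weights form a single distribution over image regions rather than one distribution per row.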

Attention layer¶

Single Query Vector¶

Inputs:

Query vector: \(q\) (Shape: \(D_q\)), e.g. \(s_t\)
Input vectors: \(X\) (Shape: \(N_X \times D_q\)), e.g. \(h\)
Similarity function: scaled dot-product

Computation:

Similarities: \( e \) (Shape: \( N_X \)) where \( e_i = q \cdot x_i / \sqrt{D_q} \)
Attention weights: \( a = \text{softmax}(e) \) (Shape: \( N_X \))
Output vector: \( y = \sum_i a_i x_i \) (Shape: \( D_q \))
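The single-query computation above maps directly onto a few lines of NumPy. The sketch below is illustrative only; the function name and the toy vectors are assumptions chosen so the result is easy to check by hand.

```python
import numpy as np

def single_query_attention(q, X):
    """Scaled dot-product attention for one query.

    q: (D_q,) query vector
    X: (N_X, D_q) input vectors, one per row
    """
    D_q = q.shape[0]
    e = X @ q / np.sqrt(D_q)       # similarities, shape (N_X,)
    w = np.exp(e - e.max())        # stabilised exponentials
    a = w / w.sum()                # attention weights, shape (N_X,)
    y = a @ X                      # output vector, shape (D_q,)
    return a, y

# Hypothetical toy inputs: the query points along the first axis.
q = np.array([1.0, 0.0])
X = np.array([[1.0, 0.0],    # aligned with q
              [0.0, 1.0],    # orthogonal to q
              [-1.0, 0.0]])  # opposite to q

a, y = single_query_attention(q, X)
print(a, y)
```

The input aligned with the query receives the largest weight and the opposite one the smallest, so the output is pulled toward the inputs most similar to \(q\).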

Info

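Attention uses the dot product to measure similarity. When the vector dimension \( D \) is large, dot products tend to be large in magnitude, which pushes the softmax output toward extreme values (close to one-hot) and makes the gradients very small. To mitigate this, the dot product is scaled by dividing by \( \sqrt{D} \), which keeps the similarities in a reasonable range. This is where the name "scaled dot-product attention" comes from.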
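A quick numerical check makes the scaling argument concrete. Assuming the components of the two vectors are independent with zero mean and unit variance (an assumption of this sketch, not something attention guarantees in practice), the dot product has variance \( D \), so its standard deviation grows like \( \sqrt{D} \) and dividing by \( \sqrt{D} \) brings it back to roughly 1. The helper name and sample size are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_std(D, n=2000):
    """Empirical std of raw and scaled dot products for random D-dim vectors."""
    q = rng.normal(size=(n, D))
    x = rng.normal(size=(n, D))
    raw = np.sum(q * x, axis=1)          # n independent dot products
    scaled = raw / np.sqrt(D)            # scaled dot products
    return raw.std(), scaled.std()

for D in (4, 64, 1024):
    raw_std, scaled_std = score_std(D)
    print(f"D={D:5d}  raw std={raw_std:7.2f}  scaled std={scaled_std:5.2f}")
```

The raw standard deviation should grow with \( D \) while the scaled one stays near 1, which is what keeps the softmax inputs in a range where gradients remain useful.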

Multiple Query Vectors¶

Inputs:

Query vectors: \(Q\) (Shape: \(N_q \times D_q\))
Input vectors: \(X\) (Shape: \(N_X \times D_q\))
Similarity function: scaled dot-product

Computation:

Similarities: \( E = Q X^T / \sqrt{D_q} \) (Shape: \( N_q \times N_X \)), \( e_{i,j} = q_i \cdot x_j / \sqrt{D_q} \)
Attention weights: \( A = \text{softmax}(E, \text{dim}=1) \) (Shape: \( N_q \times N_X \))
Output vectors: \( Y = A X \) (Shape: \( N_q \times D_q \)), \( y_i = \sum_j a_{i,j} x_j \)
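With several queries, the whole computation becomes two matrix products and a row-wise softmax. The following is a minimal sketch; the function names and the random toy data are assumptions for illustration.

```python
import numpy as np

def softmax_rows(E):
    # Softmax along dim 1: each row is normalised independently.
    E = E - E.max(axis=1, keepdims=True)
    P = np.exp(E)
    return P / P.sum(axis=1, keepdims=True)

def attention(Q, X):
    """Scaled dot-product attention with the inputs used as both keys and values.

    Q: (N_q, D_q) query vectors
    X: (N_X, D_q) input vectors
    """
    D_q = Q.shape[1]
    E = Q @ X.T / np.sqrt(D_q)   # similarities, shape (N_q, N_X)
    A = softmax_rows(E)          # attention weights, each row sums to 1
    Y = A @ X                    # outputs, shape (N_q, D_q)
    return A, Y

# Hypothetical toy data: 2 queries, 5 inputs, dimension 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
X = rng.normal(size=(5, 4))

A, Y = attention(Q, X)
print(A.shape, Y.shape)
```

Each row of `A` is exactly the weight vector the single-query version would produce for that query, so the matrix form is a batched version of the same computation.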

Key & Value matrix¶

Inputs:

Query vectors: \( \mathbf{Q} \) (Shape: \( N_q \times D_q \))
Input vectors: \( \mathbf{X} \) (Shape: \( N_x \times D_x \))
Key matrix: \( \mathbf{W}_K \) (Shape: \( D_x \times D_q \))
Value matrix: \( \mathbf{W}_V \) (Shape: \( D_x \times D_v \))

Computation:

Key vectors: \( \mathbf{K} = \mathbf{X} \mathbf{W}_K \) (Shape: \( N_x \times D_q \))
Value Vectors: \( \mathbf{V} = \mathbf{X} \mathbf{W}_V \) (Shape: \( N_x \times D_v \))
Similarities: \( \mathbf{E} = \mathbf{Q} \mathbf{K}^T / \sqrt{D_q} \) (Shape: \( N_q \times N_x \)), \( E_{i,j} = (\mathbf{Q}_i \cdot \mathbf{K}_j) / \sqrt{D_q} \)
Attention weights: \( \mathbf{A} = \text{softmax}(\mathbf{E}, \text{dim}=1) \) (Shape: \( N_q \times N_x \))
Output vectors: \( \mathbf{Y} = \mathbf{A} \mathbf{V} \) (Shape: \( N_q \times D_v \)), \( Y_i = \sum_j A_{i,j} \mathbf{V}_j \)

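Separating the keys from the values, and producing each through its own matrix, lets the model use the inputs differently for matching against the queries and for building the output, which can work better than using \( \mathbf{X} \) for both.

Once \( \mathbf{E} \) is computed, the softmax normalizes over the \( N_x \) inputs for each query (dim 1), so every row of \( \mathbf{A} \) sums to 1.

Finally, each row of \( \mathbf{A} \) is used to take a weighted sum of the rows of \( \mathbf{V} \), which gives the corresponding row of \( \mathbf{Y} \).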
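The key/value formulation can be sketched as below, mainly to make the shapes explicit. The function names, the random weight matrices, and the toy dimensions are assumptions for the example; in a real model \( \mathbf{W}_K \) and \( \mathbf{W}_V \) would be learned parameters.

```python
import numpy as np

def softmax_rows(E):
    # Softmax along dim 1: each row is normalised independently.
    E = E - E.max(axis=1, keepdims=True)
    P = np.exp(E)
    return P / P.sum(axis=1, keepdims=True)

def kv_attention(Q, X, W_K, W_V):
    """Attention with separate key and value projections of the inputs."""
    K = X @ W_K                           # keys,   shape (N_x, D_q)
    V = X @ W_V                           # values, shape (N_x, D_v)
    E = Q @ K.T / np.sqrt(Q.shape[1])     # similarities, shape (N_q, N_x)
    A = softmax_rows(E)                   # attention weights, rows sum to 1
    Y = A @ V                             # outputs, shape (N_q, D_v)
    return A, Y

# Hypothetical toy sizes: N_q=2, N_x=5, D_q=4, D_x=6, D_v=3.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
X = rng.normal(size=(5, 6))
W_K = rng.normal(size=(6, 4))   # random stand-in for the key matrix
W_V = rng.normal(size=(6, 3))   # random stand-in for the value matrix

A, Y = kv_attention(Q, X, W_K, W_V)
print(A.shape, Y.shape)
```

Note that the output dimension is now \( D_v \), set by the value matrix, and no longer has to match the query dimension \( D_q \).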

Self-attention¶

Inputs:

Input vectors: \( \mathbf{X} \) (Shape: \( N_x \times D_x \))
Key matrix: \( \mathbf{W}_K \) (Shape: \( D_x \times D_q \))
Value matrix: \( \mathbf{W}_V \) (Shape: \( D_x \times D_v \))
Query matrix: \( \mathbf{W}_Q \) (Shape: \( D_x \times D_q \))

Computation:

Query vectors: \( \mathbf{Q} = \mathbf{X} \mathbf{W}_Q \) (Shape: \( N_x \times D_q \))
Key vectors: \( \mathbf{K} = \mathbf{X} \mathbf{W}_K \) (Shape: \( N_x \times D_q \))
Value Vectors: \( \mathbf{V} = \mathbf{X} \mathbf{W}_V \) (Shape: \( N_x \times D_v \))
Similarities: \( \mathbf{E} = \mathbf{Q} \mathbf{K}^T / \sqrt{D_q} \) (Shape: \( N_x \times N_x \)), \( E_{i,j} = (\mathbf{Q}_i \cdot \mathbf{K}_j) / \sqrt{D_q} \)
Attention weights: \( \mathbf{A} = \text{softmax}(\mathbf{E}, \text{dim}=1) \) (Shape: \( N_x \times N_x \))
Output vectors: \( \mathbf{Y} = \mathbf{A} \mathbf{V} \) (Shape: \( N_x \times D_v \)), \( Y_i = \sum_j A_{i,j} \mathbf{V}_j \)
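Self-attention differs from the previous layer only in that the queries are also computed from \( \mathbf{X} \). A minimal sketch under the same caveats as before (hypothetical function names, random matrices standing in for learned weights):

```python
import numpy as np

def softmax_rows(E):
    # Softmax along dim 1: each row is normalised independently.
    E = E - E.max(axis=1, keepdims=True)
    P = np.exp(E)
    return P / P.sum(axis=1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Self-attention: queries, keys, and values all come from X."""
    Q = X @ W_Q                           # queries, shape (N_x, D_q)
    K = X @ W_K                           # keys,    shape (N_x, D_q)
    V = X @ W_V                           # values,  shape (N_x, D_v)
    E = Q @ K.T / np.sqrt(Q.shape[1])     # similarities, shape (N_x, N_x)
    A = softmax_rows(E)                   # attention weights, rows sum to 1
    Y = A @ V                             # outputs, shape (N_x, D_v)
    return A, Y

# Hypothetical toy sizes: N_x=5, D_x=6, D_q=4, D_v=3.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 6))
W_Q = rng.normal(size=(6, 4))
W_K = rng.normal(size=(6, 4))
W_V = rng.normal(size=(6, 3))

A, Y = self_attention(X, W_Q, W_K, W_V)
print(A.shape, Y.shape)
```

The layer produces one output vector per input vector, and the attention matrix is square because every input attends to every input, including itself.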