I. Introduction

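I have already written a series of articles on time series forecasting with LSTMs:

Understanding LSTM inputs and outputs in PyTorch in depth (from the input to the Linear output)
Building an LSTM in PyTorch for time series forecasting (load forecasting)
Building a multi-layer LSTM with LSTMCell in PyTorch for time series forecasting
Building an LSTM in PyTorch for multivariate time series forecasting (load forecasting)
Building a bidirectional LSTM in PyTorch for time series forecasting (load forecasting)
Building an LSTM in PyTorch for multivariate multi-step time series forecasting (1): direct multi-output
Building an LSTM in PyTorch for multivariate multi-step time series forecasting (2): single-step rolling forecasting
Building an LSTM in PyTorch for multivariate multi-step time series forecasting (3): multi-model single-step forecasting
Building an LSTM in PyTorch for multivariate multi-step time series forecasting (4): multi-model rolling forecasting
Building an LSTM in PyTorch for multivariate multi-step time series forecasting (5): seq2seq
A summary of several methods for LSTM multi-step time series forecasting in PyTorch (load forecasting)
How to predict true future values in PyTorch-LSTM time series forecasting
Building an LSTM in PyTorch for multivariate-input multivariate-output time series forecasting (multi-task learning)
Building an ANN in PyTorch for time series forecasting (wind speed forecasting)
Building a CNN in PyTorch for time series forecasting (wind speed forecasting)
Building a CNN-LSTM hybrid model in PyTorch for multivariate multi-step time series forecasting (load forecasting)
Building a Transformer in PyTorch for multivariate multi-step time series forecasting (load forecasting)
Summary of the PyTorch time series forecasting series (how to use the code)
Building an LSTM in TensorFlow for time series forecasting (load forecasting)
Building an LSTM in TensorFlow for multivariate time series forecasting (load forecasting)
Building a bidirectional LSTM in TensorFlow for time series forecasting (load forecasting)
Building an LSTM in TensorFlow for multivariate multi-step time series forecasting (1): direct multi-output
Building an LSTM in TensorFlow for multivariate multi-step time series forecasting (2): single-step rolling forecasting
Building an LSTM in TensorFlow for multivariate multi-step time series forecasting (3): multi-model single-step forecasting
Building an LSTM in TensorFlow for multivariate multi-step time series forecasting (4): multi-model rolling forecasting
Building an LSTM in TensorFlow for multivariate multi-step time series forecasting (5): seq2seq
Building an LSTM in TensorFlow for multivariate-input multivariate-output time series forecasting (multi-task learning)
Building an ANN in TensorFlow for time series forecasting (wind speed forecasting)
Building a CNN in TensorFlow for time series forecasting (wind speed forecasting)
Building a CNN-LSTM hybrid model in TensorFlow for multivariate multi-step time series forecasting (load forecasting)
Building a graph neural network with PyG for multivariate-input multivariate-output time series forecasting
Building GNN-LSTM and LSTM-GNN models in PyTorch for multivariate-input multivariate-output time series forecasting
Building an STGCN with PyG Temporal for multivariate-input multivariate-output time series forecasting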

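Although the Attention mechanism was proposed in the CV field as early as the 1990s, it only really took off after Google introduced the Transformer in 2017. By now Attention has become a standard trick for padding out papers, to the point that when a reviewer at a top conference or journal sees the word Attention in a title, the paper has already been labeled as not novel enough.

Padding aside, the Attention mechanism does work well in most fields. This article briefly discusses several Attention mechanisms commonly used in time series forecasting and provides plug-and-play code. In this article, Attention is classified by where it is applied into input Attention and output Attention, by dimension into time-step Attention and variable Attention, and by how the attention scores are computed into six types: dot product, scaled dot product, cosine similarity, general (matrix multiplication), additive, and concatenation, for a total of $2 \times 2 \times 6 = 24$ variants.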

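II. Principles of Attention in Time Series Forecasting

For the details of how LSTM/RNN works, see "Understanding LSTM inputs and outputs in PyTorch in depth (from the input to the Linear output)". In that article, the data x fed into the LSTM has shape (batch_size, seq_len, input_size), and the output produced by the LSTM has shape (batch_size, seq_len, num_directions * hidden_size), where num_directions is 1 for an LSTM and 2 for a BiLSTM. For ease of notation, we write the shape of output as (batch_size, seq_len, hidden_size).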
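A quick shape check makes this notation concrete. The sketch below assumes `batch_first=True` (so that the batch dimension comes first, matching the shapes above) and uses illustrative sizes that are not taken from the original experiments.

```python
import torch
import torch.nn as nn

batch_size, seq_len, input_size, hidden_size = 4, 24, 7, 64  # illustrative sizes
x = torch.randn(batch_size, seq_len, input_size)

# batch_first=True makes the LSTM accept and return (batch, seq, feature) tensors.
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
bilstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)

out_uni, _ = lstm(x)    # num_directions = 1
out_bi, _ = bilstm(x)   # num_directions = 2

print(out_uni.shape)  # torch.Size([4, 24, 64])
print(out_bi.shape)   # torch.Size([4, 24, 128])
```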

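The Attention mechanism, for a given target, generates weight coefficients and computes a weighted sum of the inputs, in order to identify which input features are important for the target and which are not.

Attention in an LSTM can be applied in two places: to x or to output. In both cases, Attention can be performed along either the time-step dimension or the variable dimension.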

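2.1 Input Attention

Input Attention means applying Attention to x before it is fed into the LSTM. Since x has shape (batch_size, seq_len, input_size), Attention on x comes in two forms: along the time-step dimension, i.e. seq_len, and along the variable dimension, i.e. input_size. For the discussion below, let x have shape (batch_size=256, seq_len=24, input_size=7) and output have shape (batch_size=256, seq_len=24, hidden_size=64).

Along the seq_len dimension, the goal of Attention is to distinguish the importance of different time steps. When we use the data from the previous 24 time points to predict future values, without Attention the 24 length-7 vectors fed into the LSTM are not explicitly related to one another. To relate all time steps, we can apply attention over the 24 vectors so that each vector becomes a weighted combination of all 24. Each time step's vector then contains information from the other time steps, and the attention weights distinguish how important that information is: the more important a time step is to the current one, the larger its weight.

Along the input_size dimension, the goal of Attention is to distinguish the importance of different variables. Here each variable has a vector of length 24, giving 7 vectors of length 24 in total. Likewise, we can make each variable's length-24 vector a weighted linear combination of all 7 vectors, so that every variable contains information from the other variables.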

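2.2 Output Attention

Output Attention means applying Attention to output, which has shape (batch_size, seq_len, hidden_size). As with input Attention, there are two forms: Attention along seq_len and Attention along hidden_size. The details are the same and are not repeated here.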
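As a rough sketch of where the two kinds of Attention sit in a model, the module below wraps an LSTM with an optional attention step before and after it. The class name `AttnLSTM`, the helper `dot_seq_attention`, and all sizes are hypothetical. The attention used is a parameter-free dot-product time-step variant, chosen here only because it leaves the tensor shapes unchanged.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def dot_seq_attention(x):
    """Parameter-free dot-product attention over time steps: (b, s, d) -> (b, s, d)."""
    e = torch.bmm(x, x.permute(0, 2, 1))      # (b, s, s) pairwise dot products
    attention = F.softmax(e, dim=-1)          # each row sums to 1
    return F.relu(torch.bmm(attention, x))    # weighted combination of time steps


class AttnLSTM(nn.Module):
    """Hypothetical wrapper: optional input/output attention around an LSTM."""

    def __init__(self, input_size, hidden_size, output_size,
                 input_attention=False, output_attention=True):
        super().__init__()
        self.input_attention = input_attention
        self.output_attention = output_attention
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        if self.input_attention:        # input Attention: applied to x
            x = dot_seq_attention(x)
        output, _ = self.lstm(x)        # (b, s, hidden_size)
        if self.output_attention:       # output Attention: applied to output
            output = dot_seq_attention(output)
        return self.fc(output[:, -1, :])  # map the last time step to the prediction


model = AttnLSTM(input_size=7, hidden_size=64, output_size=1,
                 input_attention=True, output_attention=True)
pred = model(torch.randn(256, 24, 7))
print(pred.shape)  # torch.Size([256, 1])
```

Any of the attention functions in Section III could be substituted for `dot_seq_attention`, as long as the output shape matches what the next layer expects.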

III. Implementation

The central part of an attention mechanism is obtaining the importance coefficient $\alpha_{i,j}$ between vectors $\mathbf{h}_i$ and $\mathbf{h}_j$. $\alpha_{i,j}$ is a real number that represents the importance of vector $j$ to vector $i$. Once the importance of every vector to $\mathbf{h}_i$ has been obtained, $\mathbf{h}_i$ can be updated as:
$$\mathbf{h}_i \leftarrow \sigma\Big(\sum_{j} \alpha_{i,j} \cdot \mathbf{h}_j\Big)$$
where $\sigma$ denotes a nonlinear activation function.
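To make the update rule concrete, here is a tiny worked example for a single vector. The values of the vectors and coefficients are made up for illustration, and ReLU is assumed as the activation $\sigma$.

```python
import torch
import torch.nn.functional as F

# Three vectors h_0, h_1, h_2 (rows) of dimension 2; values are illustrative.
h = torch.tensor([[1.0, 0.0],
                  [0.0, 1.0],
                  [-1.0, -1.0]])

# Importance of each vector j to vector i = 0, already normalized to sum to 1.
alpha_0 = torch.tensor([0.5, 0.3, 0.2])

# Weighted sum over j: 0.5 * h_0 + 0.3 * h_1 + 0.2 * h_2 = [0.3, 0.1]
weighted = (alpha_0.unsqueeze(1) * h).sum(dim=0)

# Apply the activation (ReLU assumed); both entries are positive, so nothing is clipped.
h_0_new = F.relu(weighted)
print(h_0_new)  # approximately [0.3, 0.1]
```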

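3.1 Dot Product

As the name suggests, the dot-product method uses the dot product of two vectors to measure their importance to each other. The larger the dot product, the stronger the association between the two vectors, and the higher the corresponding $\alpha_{i,j}$:
$$\alpha_{i,j} = \mathbf{h}_i^{\top}\mathbf{h}_j$$

3.1.1 Time-Step Dimension

Dot-product attention along seq_len can be implemented as follows: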

def att_dot_seq_len(self, x):
    # b, s, input_size / b, s, hidden_size
    x = self.attention(x)  # bsh--->bst
    e = torch.bmm(x, x.permute(0, 2, 1))  # bst*bts=bss
    attention = F.softmax(e, dim=-1)  # b s s
    out = torch.bmm(attention, x)  # bss * bst ---> bst
    out = F.relu(out)

    return out

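where

self.attention = nn.Linear(hidden_size, t)

transforms the input to obtain a higher-level representation. When Attention is applied to the raw input, I usually skip this step so that the number of variables does not change, although applying it is also fine. x.permute(0, 2, 1) has shape (batch_size, input_size/hidden_size, seq_len), and multiplying the two gives e with shape (batch_size, seq_len, seq_len). In the last two dimensions (seq_len, seq_len), entry (i, j) is the dot product of $\mathbf{h}_i$ and $\mathbf{h}_j$. Note that e is a symmetric matrix, because the dot product is symmetric: $\mathbf{h}_i^{\top}\mathbf{h}_j=\mathbf{h}_j^{\top}\mathbf{h}_i$.

Next, we use the softmax function so that the attention coefficients of each vector sum to 1:

attention = F.softmax(e, dim=-1)

For example, for the i-th time step of the first sample, we have $\sum_{k=1}^{s} \mathrm{attention}[0, i, k] = 1$. Finally, matrix multiplication performs the weighted combination:

out = torch.bmm(attention, x)

At this point, before the ReLU is applied, the vector at each time step of out is a weighted linear combination of the vectors at all time steps.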
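These properties are easy to check numerically. The sketch below skips the optional linear layer and uses random data with illustrative sizes. It confirms that e is symmetric and that each row of attention sums to 1, and it prints whether attention itself is symmetric (in general it is not, because softmax normalizes row by row).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 24, 7)  # (batch_size, seq_len, input_size), illustrative sizes

e = torch.bmm(x, x.permute(0, 2, 1))   # (2, 24, 24) pairwise dot products
attention = F.softmax(e, dim=-1)       # normalize each row
out = torch.bmm(attention, x)          # (2, 24, 7), before the ReLU

# e is symmetric in its last two dimensions (up to floating-point error).
e_symmetric = torch.allclose(e, e.transpose(1, 2), atol=1e-4)
# Every row of attention sums to 1.
rows_sum_to_one = torch.allclose(attention.sum(dim=-1), torch.ones(2, 24), atol=1e-4)
# Row-wise normalization generally breaks the symmetry.
attention_symmetric = torch.allclose(attention, attention.transpose(1, 2))

print(e_symmetric, rows_sum_to_one, attention_symmetric)
```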

3.1.2 Variable Dimension (input + hidden)

Dot-product attention along input_size/hidden_size can be implemented as follows:

def att_dot_var(self, x):
    # b, s, input_size / b, s, hidden_size
    e = torch.bmm(x.permute(0, 2, 1), x)  # bis*bsi=bii
    attention = F.softmax(e, dim=-1)  # b i i
    out = torch.bmm(x, attention)  # bsi * bii ---> bsi
    out = F.relu(out)

    return out

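The process is analogous and is not described again.

3.2 Scaled Dot Product

The scaled dot product divides the dot product by $\sqrt{d}$, where $d$ is the dimension of the vectors:
$$\alpha_{i,j} = \frac{\mathbf{h}_i^{\top}\mathbf{h}_j}{\sqrt{d}}$$
Dividing by $\sqrt{d}$ reduces the sensitivity to the vector dimension. If the components of the two vectors are independent with zero mean and unit variance, their dot product has variance $d$, so after dividing by $\sqrt{d}$ the variance is 1 regardless of $d$. This makes the model easier to optimize and improves training stability.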
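The variance argument can be checked empirically. The sketch below draws random vectors with independent standard normal components (an idealized assumption; real hidden states will not satisfy it exactly) and compares the variance of the raw and scaled dot products. The function name `dot_variances` and the sample count are illustrative.

```python
import torch

torch.manual_seed(0)


def dot_variances(d, n=20000):
    """Return (variance of dot products, variance of scaled dot products)."""
    a = torch.randn(n, d)            # components: independent, zero mean, unit variance
    b = torch.randn(n, d)
    dot = (a * b).sum(dim=1)         # n dot products of d-dimensional vectors
    scaled = dot / d ** 0.5          # divide by sqrt(d)
    return dot.var().item(), scaled.var().item()


for d in (16, 64, 256):
    raw_var, scaled_var = dot_variances(d)
    # raw_var grows roughly like d, while scaled_var stays close to 1
    print(f"d={d}: var(dot)={raw_var:.1f}, var(scaled)={scaled_var:.2f}")
```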

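With the dot product in place, the scaled dot product is straightforward to implement. Taking the time-step dimension as an example, scaled dot-product attention is implemented as follows: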

def att_scaled_dot_seq_len(self, x):
    # b, s, input_size / b, s, hidden_size
    x = self.attention(x)  # bsh--->bst
    e = torch.bmm(x, x.permute(0, 2, 1))  # bst*bts=bss
    e = e / np.sqrt(x.shape[2])
    attention = F.softmax(e, dim=-1)  # b s s
    out = torch.bmm(attention, x)  # bss * bst ---> bst
    out = F.relu(out)

    return out

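where

e = e / np.sqrt(x.shape[2])

is the scaling operation.

3.3 Cosine Similarity

As the name suggests, the cosine-similarity method uses the cosine of the angle between two vectors to measure their importance to each other. The larger the cosine, the stronger the association between the two vectors. Note that the cosine lies in [-1, 1]; for the computation we rescale it to [0, 1].

3.3.1 Time-Step Dimension

Cosine-similarity attention along seq_len can be implemented as follows: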

def att_cos_seq_len(self, x):
    # b, s, input_size / b, s, hidden_size
    x = self.attention(x)  # bsh--->bst
    e = torch.cosine_similarity(x.unsqueeze(2), x.unsqueeze(1), dim=-1)  # bss
    e = 0.5 * e + 0.5
    attention = F.softmax(e, dim=-1)  # b s s
    out = torch.bmm(attention, x)  # bss * bst ---> bst
    out = F.relu(out)

    return out

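The code that computes the cosine similarity is:

e = torch.cosine_similarity(x.unsqueeze(2), x.unsqueeze(1), dim=-1)

e[0, 1, 2] is the cosine similarity between the time steps with indices 1 and 2 of sample 0. It is then rescaled to [0, 1]:

e = 0.5 * e + 0.5

This step is in fact optional, because softmax already maps negative values to weights between 0 and 1.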
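One caveat is worth checking: optional does not mean numerically identical. Softmax is unchanged by adding a constant to every score, so the `+ 0.5` shift has no effect, but the factor `0.5` halves the scores and therefore flattens the resulting weights. The sketch below illustrates this on random data with illustrative sizes.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 5, 7)  # (batch_size, seq_len, size), illustrative sizes
e = torch.cosine_similarity(x.unsqueeze(2), x.unsqueeze(1), dim=-1)  # (2, 5, 5), in [-1, 1]

raw = F.softmax(e, dim=-1)                    # no rescaling
rescaled = F.softmax(0.5 * e + 0.5, dim=-1)   # rescaled to [0, 1] first
halved = F.softmax(0.5 * e, dim=-1)           # only the factor 0.5, no shift

# Without rescaling, the weights already lie strictly between 0 and 1.
weights_valid = bool(((raw > 0) & (raw < 1)).all())
# The +0.5 shift cancels inside softmax.
shift_cancels = torch.allclose(rescaled, halved, atol=1e-6)
# The factor 0.5 does change the weights.
same_as_raw = torch.allclose(rescaled, raw, atol=1e-6)

print(weights_valid, shift_cancels, same_as_raw)
```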

3.3.2 Variable Dimension (input + hidden)

Cosine-similarity attention along input_size/hidden_size can be implemented as follows:

def att_cos_var(self, x):
    # b, s, input_size / b, s, hidden_size
    e = torch.cosine_similarity(x.permute(0, 2, 1).unsqueeze(2),
                                x.permute(0, 2, 1).unsqueeze(1),
                                dim=-1)   # bii
    e = 0.5 * e + 0.5  # rescale from [-1, 1] to [0, 1]
    attention = F.softmax(e, dim=-1)  # b i i
    out = torch.bmm(x, attention)  # bsi * bii ---> bsi
    out = F.relu(out)

    return out

The process is analogous and is not described again.

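3.4 General Attention

General Attention obtains the similarity through a simple matrix multiplication with a weight matrix $\mathbf{W}$:
$$\alpha_{i,j} = \mathbf{h}_i^{\top}\mathbf{W}\mathbf{h}_j$$

3.4.1 Time-Step Dimension

Matrix-multiplication attention along seq_len can be implemented as follows: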

# x = (batch_size, seq_len, input_size/hidden_size)
seq_len, size = x.shape[1], x.shape[2]
w = nn.Linear(size, size)
e = torch.matmul(w(x), x.permute(0, 2, 1))   # bss

attention = F.softmax(e, dim=-1)  # b s s
out = torch.bmm(attention, x)  # bss * bst ---> bst
out = F.relu(out)

The principle is simple and is not elaborated further.
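The snippet above creates `nn.Linear` inline for brevity. In a real model the layer would normally be created once in `__init__`, so that its weights are registered with the model and updated during training. A minimal sketch of that packaging follows; the class name `GeneralSeqAttention` is hypothetical, and dropping the bias is my own assumption so that the score has the pure bilinear form.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeneralSeqAttention(nn.Module):
    """Hypothetical module: general (bilinear) attention over time steps."""

    def __init__(self, size):
        super().__init__()
        # bias=False is an assumption made here so that the score is a pure
        # bilinear form in h_i and h_j; the inline snippet keeps the default bias.
        self.w = nn.Linear(size, size, bias=False)

    def forward(self, x):
        # x: (batch_size, seq_len, size)
        e = torch.matmul(self.w(x), x.permute(0, 2, 1))  # (b, s, s)
        attention = F.softmax(e, dim=-1)                 # each row sums to 1
        return F.relu(torch.bmm(attention, x))           # (b, s, size)


layer = GeneralSeqAttention(size=64)
out = layer(torch.randn(8, 24, 64))
n_params = sum(p.numel() for p in layer.parameters())
print(out.shape)  # torch.Size([8, 24, 64])
print(n_params)   # 4096 (a 64 x 64 weight matrix)
```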

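3.4.2 Variable Dimension (input + hidden)

Matrix-multiplication attention along the variable dimension can be implemented as follows:

# x = (batch_size, seq_len, input_size/hidden_size)
seq_len, size = x.shape[1], x.shape[2]
x_t = x.permute(0, 2, 1)   # bis
w = nn.Linear(seq_len, seq_len)
e = torch.matmul(w(x_t), x)   # bis * bsi = bii

attention = F.softmax(e, dim=-1)  # b i i
out = torch.bmm(x, attention)  # bsi * bii ---> bsi
out = F.relu(out)

Put simply, x is transposed to compute the scores, exactly as in seq_len attention, and the resulting weights are then applied to the original x.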

3.5 Additive Attention

Additive attention is computed as follows:
$$\alpha_{i,j}=\mathbf{v}^{\top} \tanh(\mathbf{W} \mathbf{h}_i + \mathbf{U} \mathbf{h}_j)$$
where $\mathbf{W}$ and $\mathbf{U}$ are learnable parameter matrices and $\mathbf{v}$ is a learnable vector.

3.5.1 Time-Step Dimension

Additive attention along seq_len can be implemented as follows:

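# x = (batch_size, seq_len, input_size/hidden_size)
seq_len, size = x.shape[1], x.shape[2]
w = nn.Linear(size, 128)
u = nn.Linear(size, 128)
v = nn.Parameter(torch.empty(size=(128, 1)))
nn.init.xavier_uniform_(v.data, gain=1.414)

# x_1 repeats each vector seq_len times consecutively: h_0, h_0, ..., h_1, h_1, ...
x_1 = w(x).repeat(1, 1, seq_len).view(x.shape[0], seq_len * seq_len, -1)
# x_2 tiles the whole sequence seq_len times: h_0, h_1, ..., h_0, h_1, ...
x_2 = u(x).repeat(1, seq_len, 1)
e = torch.matmul(torch.tanh(x_1 + x_2), v).view(x.shape[0], seq_len, -1)  # bss

attention = F.softmax(e, dim=-1)  # b s s
out = torch.bmm(attention, x)  # bss * bst ---> bst
out = F.relu(out)

Here x_1 and x_2 are repeated so that every vector can be added to every other vector. The two must be repeated in different orders (x_1 element by element, x_2 as a whole sequence); if both are tiled the same way, only the pairs (i, i) are formed.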
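An alternative to repeating tensors is to let broadcasting form all pairs. The sketch below is one way to do this under illustrative sizes (the hidden width 16 and the random `v` are placeholders, not the article's settings). It also checks the result against the formula evaluated pair by pair with explicit loops.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, size, hidden = 2, 5, 7, 16  # illustrative sizes
x = torch.randn(batch_size, seq_len, size)

w = nn.Linear(size, hidden)
u = nn.Linear(size, hidden)
v = torch.randn(hidden, 1)  # stands in for the learnable vector v

with torch.no_grad():  # gradients are not needed for this shape/value check
    # (b, s, 1, hidden) + (b, 1, s, hidden) broadcasts to (b, s, s, hidden):
    # entry [b, i, j] holds W h_i + U h_j.
    pair_sum = w(x).unsqueeze(2) + u(x).unsqueeze(1)
    e = torch.matmul(torch.tanh(pair_sum), v).squeeze(-1)  # (b, s, s)

    # Reference: v^T tanh(W h_i + U h_j) evaluated one pair at a time.
    e_ref = torch.empty(batch_size, seq_len, seq_len)
    for b in range(batch_size):
        for i in range(seq_len):
            for j in range(seq_len):
                e_ref[b, i, j] = (torch.tanh(w(x[b, i]) + u(x[b, j])) @ v).squeeze()

    attention = F.softmax(e, dim=-1)        # (b, s, s)
    out = F.relu(torch.bmm(attention, x))   # (b, s, size)

scores_match = torch.allclose(e, e_ref, atol=1e-5)
print(scores_match, out.shape)
```

Broadcasting avoids materializing two repeated copies of the projected tensor before the addition, although the (b, s, s, hidden) sum itself is still created.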

3.5.2 Variable Dimension (input + hidden)

Additive attention along the variable dimension can be implemented as follows:

# x = (batch_size, seq_len, input_size/hidden_size)
seq_len, size = x.shape[1], x.shape[2]
x_t = x.permute(0, 2, 1)   # bis
w = nn.Linear(seq_len, 128)
u = nn.Linear(seq_len, 128)
v = nn.Parameter(torch.empty(size=(128, 1)))
nn.init.xavier_uniform_(v.data, gain=1.414)

# repeat each variable consecutively in x_1 and tile all variables in x_2
x_1 = w(x_t).repeat(1, 1, size).view(x.shape[0], size * size, -1)
x_2 = u(x_t).repeat(1, size, 1)
e = torch.matmul(torch.tanh(x_1 + x_2), v).view(x.shape[0], size, -1)  # bii

attention = F.softmax(e, dim=-1)  # b i i
out = torch.bmm(x, attention)  # bsi * bii ---> bsi
out = F.relu(out)

The principle is simple and is not elaborated further.

3.6 Concatenation Attention

The inspiration here comes from the graph attention network (GAT), which uses a learnable parameter vector $\beta$ to learn the attention coefficient between two vectors. Specifically, for two vectors $\mathbf{h}_i$ and $\mathbf{h}_j$, the attention coefficient $\alpha_{i,j}$ between them is computed as:
$$\alpha_{i,j}=\frac{\exp(\mathrm{LeakyReLU}(\beta \cdot [\mathbf{W}\mathbf{h}_{i} \,\|\, \mathbf{W}\mathbf{h}_{j}]))}{\sum_{k} \exp(\mathrm{LeakyReLU}(\beta \cdot [\mathbf{W}\mathbf{h}_{i} \,\|\, \mathbf{W}\mathbf{h}_{k}]))}$$
where $\|$ denotes concatenation. Put simply, we first transform $\mathbf{h}_i$ and $\mathbf{h}_j$ with a weight matrix $\mathbf{W}$; this is the earlier step x = self.attention(x). The two vectors are then concatenated and multiplied by the learnable parameter $\beta$ to obtain a scalar, which is then normalized with softmax.

Many articles implement concatenation Attention as:
$$\alpha_{i,j}=\mathrm{softmax}(\mathbf{v}^{\top} \tanh(\mathbf{W} \cdot [\mathbf{h}_i \,\|\, \mathbf{h}_j]))$$
Compared with the formulation above, this only swaps the positions of the weight matrix and the activation function, so the difference is small. The first formulation is used here.

3.6.1 Time-Step Dimension

Concatenation attention along seq_len can be implemented as follows:

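# x (batch_size, seq_len, input_size/hidden_size)
seq_len, size = x.shape[1], x.shape[2]
beta = nn.Parameter(torch.empty(size=(2*size, 1)))
nn.init.xavier_uniform_(beta.data, gain=1.414)

# x1 repeats each vector seq_len times consecutively: h_0, h_0, ..., h_1, h_1, ...
x1 = x.repeat(1, 1, seq_len).view(x.shape[0], seq_len * seq_len, -1)
# x2 tiles the whole sequence seq_len times: h_0, h_1, ..., h_0, h_1, ...
x2 = x.repeat(1, seq_len, 1)
cat_x = torch.cat([x1, x2], dim=-1).view(x.shape[0], seq_len, -1, 2 * size)   # b s s 2*size

e = F.leaky_relu(torch.matmul(cat_x, beta).squeeze(-1))   # bss
attention = F.softmax(e, dim=-1)  # b s s
out = torch.bmm(attention, x)  # bss * bst ---> bst
out = F.relu(out)

The repeat operations are used to build the concatenation of every pair of vectors. The first dimension of the final view is the batch size, so it is taken from x.shape[0] rather than hard-coded.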
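Because the score is linear in the concatenated vector before the LeakyReLU, it can be split as $\beta \cdot [\mathbf{h}_i \,\|\, \mathbf{h}_j] = \beta_1 \cdot \mathbf{h}_i + \beta_2 \cdot \mathbf{h}_j$, where $\beta_1$ and $\beta_2$ are the two halves of $\beta$. The sketch below (illustrative sizes, random `beta`, and without the optional linear transform) computes the scores both ways and checks that they agree. The split form never builds the (b, s, s, 2*size) tensor.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, size = 2, 5, 7  # illustrative sizes
x = torch.randn(batch_size, seq_len, size)
beta = torch.randn(2 * size, 1)      # stands in for the learnable parameter beta

# Explicit version: build every concatenated pair [h_i || h_j].
h_i = x.unsqueeze(2).expand(-1, -1, seq_len, -1)   # (b, s, s, size), varies along dim 1
h_j = x.unsqueeze(1).expand(-1, seq_len, -1, -1)   # (b, s, s, size), varies along dim 2
cat_x = torch.cat([h_i, h_j], dim=-1)              # (b, s, s, 2*size)
e_cat = F.leaky_relu(torch.matmul(cat_x, beta).squeeze(-1))  # (b, s, s)

# Split version: one matrix-vector product per half of beta, then broadcast.
s_i = torch.matmul(x, beta[:size])                  # (b, s, 1), contribution of h_i
s_j = torch.matmul(x, beta[size:])                  # (b, s, 1), contribution of h_j
e_split = F.leaky_relu(s_i + s_j.permute(0, 2, 1))  # (b, s, 1) + (b, 1, s) -> (b, s, s)

scores_match = torch.allclose(e_cat, e_split, atol=1e-4)
print(scores_match)
```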

3.6.2 Variable Dimension (input + hidden)

Concatenation attention along the variable dimension can be implemented as follows:

# x (batch_size, seq_len, input_size/hidden_size)
seq_len, size = x.shape[1], x.shape[2]
x_t = x.permute(0, 2, 1)   # bis
beta = nn.Parameter(torch.empty(size=(2*seq_len, 1)))
nn.init.xavier_uniform_(beta.data, gain=1.414)

# repeat each variable consecutively in x1 and tile all variables in x2
x1 = x_t.repeat(1, 1, size).view(x.shape[0], size * size, -1)
x2 = x_t.repeat(1, size, 1)
cat_x = torch.cat([x1, x2], dim=-1).view(x.shape[0], size, -1, 2 * seq_len)   # b i i 2*seq_len

e = F.leaky_relu(torch.matmul(cat_x, beta).squeeze(-1))   # bii
attention = F.softmax(e, dim=-1)  # b i i
out = torch.bmm(x, attention)  # bsi * bii ---> bsi
out = F.relu(out)

Here x is only transposed to compute the scores, which follows the same steps as time-step attention; the weights are then applied to the original x.

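3.7 Flatten

As mentioned in "Understanding LSTM inputs and outputs in PyTorch in depth (from the input to the Linear output)", after x passes through the LSTM and becomes (batch_size, seq_len, hidden_size), we only need to take the last time step, of shape (batch_size, hidden_size), and map it to the output. The time-step attention described earlier makes the vector at the last time step a weighted combination of all time steps, so the information from all time steps is used at once.

To use the information from all time steps, the simplest approach is to flatten all time steps into a matrix of shape (batch_size, seq_len * hidden_size) and then map that to the final output.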
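A minimal sketch of the Flatten approach is shown below. The class name `FlattenLSTM` and all sizes are hypothetical. Note that the Linear layer's input width depends on seq_len, so the sequence length must be fixed when the model is built.

```python
import torch
import torch.nn as nn


class FlattenLSTM(nn.Module):
    """Hypothetical model: LSTM followed by a Linear layer over all time steps."""

    def __init__(self, input_size, hidden_size, seq_len, output_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(seq_len * hidden_size, output_size)

    def forward(self, x):
        output, _ = self.lstm(x)                     # (b, s, hidden_size)
        flat = output.reshape(output.shape[0], -1)   # (b, s * hidden_size)
        return self.fc(flat)                         # (b, output_size)


model = FlattenLSTM(input_size=7, hidden_size=64, seq_len=24, output_size=1)
pred = model(torch.randn(256, 24, 7))
print(pred.shape)  # torch.Size([256, 1])
```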

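IV. Performance Comparison

To examine the performance of the 24 variants plus Flatten, 25 methods in total, the experiments follow the setup in the earlier article "Building an LSTM in PyTorch for multivariate time series forecasting (load forecasting)". For convenience, dot-product input time-step, dot-product output time-step, dot-product input variable, dot-product output variable, scaled dot-product input time-step, ..., concatenation input time-step, concatenation output time-step, concatenation input variable, and concatenation output variable are numbered 1 to 24. The experimental results are shown in the table below (to be continued):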
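The numbering can be generated mechanically. The sketch below assumes that the elided middle of the list follows the same pattern as its two ends, with the six mechanisms in the order of Section III; the English labels are my own shorthand.

```python
from itertools import product

# Assumed order: mechanism first, then dimension (time step, variable),
# then position (input, output), matching the endpoints of the list above.
mechanisms = ["dot product", "scaled dot product", "cosine similarity",
              "general", "additive", "concatenation"]
dimensions = ["time step", "variable"]
positions = ["input", "output"]

variants = {
    index: f"{mech} / {pos} / {dim}"
    for index, (mech, dim, pos) in enumerate(
        product(mechanisms, dimensions, positions), start=1)
}

print(len(variants))   # 24
print(variants[1])     # dot product / input / time step
print(variants[4])     # dot product / output / variable
print(variants[24])    # concatenation / output / variable
```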

Cyril_KI

Is the Attention Mechanism Really Effective in Time Series Forecasting? A Survey of 24 Attention Mechanisms in LSTM/RNN + Performance Comparison
