Deep Learning: Recurrent Neural Networks (RNN)

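1. Introduction
A recurrent neural network (RNN) is a class of neural networks for processing sequential data. Sequential data is ordered, and later elements depend on earlier ones. For example, when predicting the next character in a sentence, knowing the preceding characters helps a great deal. An RNN typically takes a sequence as input, captures the relationships between sequence elements through its internal structure, and usually produces a sequence as output as well.
RNNs are very effective on sequential data because they can extract both temporal and semantic information from it. They are widely applied in NLP and related fields, for example speech recognition, language modeling, machine translation, and time-series analysis.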
2. Network Structure
[Figure: RNN network structure]
The symbols have the following meanings:
$x^{(t)}$ is the input of the training sample at sequence index $t$; likewise, $x^{(t-1)}$ and $x^{(t+1)}$ are the inputs at sequence indices $t-1$ and $t+1$;
$h^{(t)}$ is the hidden state of the model at sequence index $t$; it is determined jointly by $x^{(t)}$ and $h^{(t-1)}$;
$o^{(t)}$ is the output of the model at sequence index $t$; it depends only on the current hidden state $h^{(t)}$;
$L^{(t)}$ is the loss of the model at sequence index $t$;
$y^{(t)}$ is the true output (label) of the training sequence at sequence index $t$;
$U$, $W$, $V$ are the weight matrices of the model's linear transformations, and they are shared across all time steps of the RNN.
Based on this model, we can derive the RNN forward propagation algorithm.
For any sequence index $t$, the hidden state $h^{(t)}$ is computed from $x^{(t)}$ and $h^{(t-1)}$:
$$h^{(t)}=\sigma(Ux^{(t)}+Wh^{(t-1)}+b)$$
where $\sigma$ is the RNN's activation function, usually tanh, and $b$ is the bias of the linear transformation.
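The hidden-state update above can be sketched in a few lines of NumPy. This is only a minimal illustration: the function name `rnn_step`, the dimensions (input size 3, hidden size 4), and the random initialization are assumptions made for the example and do not come from the original text.

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, b):
    """One hidden-state update: h_t = tanh(U x_t + W h_prev + b)."""
    return np.tanh(U @ x_t + W @ h_prev + b)

# Hypothetical sizes, chosen only for illustration.
input_size, hidden_size = 3, 4
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(hidden_size, input_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

h0 = np.zeros(hidden_size)         # initial hidden state (assumed to be zero)
x1 = rng.normal(size=input_size)   # input at the first time step
h1 = rnn_step(x1, h0, U, W, b)
print(h1.shape)  # (4,)
```

Because tanh is used, every component of the hidden state lies strictly between -1 and 1.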
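The output $o^{(t)}$ of the model at sequence index $t$ is:
$$o^{(t)}=Vh^{(t)}+c$$
where $c$ is the bias of the output layer. The final prediction at sequence index $t$ is:
$$\overline{y^{(t)}}=\sigma(o^{(t)})$$
RNNs are commonly used as classification models, so the activation function $\sigma$ in this last formula is usually softmax (unlike the tanh used for the hidden state).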
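Putting the three forward-propagation formulas together, a full pass over a sequence might look like the following sketch. It assumes tanh for the hidden state, softmax for the output, and a zero initial hidden state; the names `rnn_forward` and `softmax` and all dimensions are hypothetical choices for this example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift by the max for numerical stability
    return e / e.sum()

def rnn_forward(xs, U, W, V, b, c, h0):
    """Forward pass over a sequence xs of shape (T, input_size).

    Returns hidden states (T, hidden_size) and predictions (T, n_classes).
    """
    h = h0
    hs, y_hats = [], []
    for x_t in xs:
        h = np.tanh(U @ x_t + W @ h + b)   # hidden state h^(t)
        o = V @ h + c                      # output o^(t)
        hs.append(h)
        y_hats.append(softmax(o))          # prediction at step t
    return np.array(hs), np.array(y_hats)

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
T, input_size, hidden_size, n_classes = 5, 3, 4, 2
U = rng.normal(scale=0.1, size=(hidden_size, input_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.1, size=(n_classes, hidden_size))
b, c = np.zeros(hidden_size), np.zeros(n_classes)

xs = rng.normal(size=(T, input_size))
hs, y_hats = rnn_forward(xs, U, W, V, b, c, np.zeros(hidden_size))
print(hs.shape, y_hats.shape)  # (5, 4) (5, 2)
```

Note that the same `U`, `W`, `V`, `b`, `c` are reused at every time step, which is the parameter sharing described earlier.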
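3. Backpropagation
Backpropagation in an RNN is also called BPTT (backpropagation through time). The idea is to iterate with gradient descent, round after round, until suitable values of the model parameters $U$, $W$, $V$, $b$, $c$ are found.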
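For reference, one gradient-descent iteration can be written as the update below. The learning rate $\eta$ is a symbol introduced here for illustration and does not appear in the original text; $L$ is the loss of the model.

$$\theta \leftarrow \theta - \eta\,\frac{\partial L}{\partial \theta}, \qquad \theta \in \{U, W, V, b, c\}$$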
To keep the description simple, we use the cross-entropy loss. At sequence index $t$ it is:
$$L^{(t)}=-\sum_{i=1}^{n} y_i^{(t)}\log\overline{y_i^{(t)}}$$
where $i$ indexes the $n$ output classes. Because an RNN has a loss at every position in the sequence, the total loss $L$ is:
$$L=\sum_{t=1}^{T}L^{(t)}$$
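As a small worked example of the two loss formulas, the sketch below computes the per-step cross-entropy and sums it over the sequence. It assumes the true outputs are one-hot, so each step's loss reduces to minus the log of the probability assigned to the correct class; the function name `sequence_cross_entropy` and the numbers are made up for illustration.

```python
import numpy as np

def sequence_cross_entropy(y_hats, targets):
    """Total and per-step cross-entropy loss for one sequence.

    y_hats:  (T, n_classes) predicted probabilities, one row per time step.
    targets: (T,) integer class indices (equivalent to one-hot labels).
    """
    T = len(targets)
    per_step = -np.log(y_hats[np.arange(T), targets])  # L^(t) for each t
    return per_step.sum(), per_step                    # L = sum over t

# Made-up predictions for a sequence of three time steps.
y_hats = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.3, 0.3, 0.4]])
targets = np.array([0, 1, 2])

total, per_step = sequence_cross_entropy(y_hats, targets)
print(per_step, total)
```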
Training an RNN with stochastic gradient descent amounts to computing the partial derivatives of $L$ with respect to $U$, $W$, $V$ and repeatedly adjusting them to make $L$ as small as possible. Suppose the sequence has only three time steps $t_1$, $t_2$, $t_3$. The partial derivatives of the loss at $t_3$ with respect to $V$, $U$, $W$ are:
$$\cfrac{\partial L^{(3)}}{\partial V}=\cfrac{\partial L^{(3)}}{\partial o^{(3)}}\cfrac{\partial o^{(3)}}{\partial V}$$
$$\cfrac{\partial L^{(3)}}{\partial U}=\cfrac{\partial L^{(3)}}{\partial o^{(3)}}\cfrac{\partial o^{(3)}}{\partial h^{(3)}}\cfrac{\partial h^{(3)}}{\partial U}+\cfrac{\partial L^{(3)}}{\partial o^{(3)}}\cfrac{\partial o^{(3)}}{\partial h^{(3)}}\cfrac{\partial h^{(3)}}{\partial h^{(2)}}\cfrac{\partial h^{(2)}}{\partial U}+\cfrac{\partial L^{(3)}}{\partial o^{(3)}}\cfrac{\partial o^{(3)}}{\partial h^{(3)}}\cfrac{\partial h^{(3)}}{\partial h^{(2)}}\cfrac{\partial h^{(2)}}{\partial h^{(1)}}\cfrac{\partial h^{(1)}}{\partial U}$$
$$\cfrac{\partial L^{(3)}}{\partial W}=\cfrac{\partial L^{(3)}}{\partial o^{(3)}}\cfrac{\partial o^{(3)}}{\partial h^{(3)}}\cfrac{\partial h^{(3)}}{\partial W}+\cfrac{\partial L^{(3)}}{\partial o^{(3)}}\cfrac{\partial o^{(3)}}{\partial h^{(3)}}\cfrac{\partial h^{(3)}}{\partial h^{(2)}}\cfrac{\partial h^{(2)}}{\partial W}+\cfrac{\partial L^{(3)}}{\partial o^{(3)}}\cfrac{\partial o^{(3)}}{\partial h^{(3)}}\cfrac{\partial h^{(3)}}{\partial h^{(2)}}\cfrac{\partial h^{(2)}}{\partial h^{(1)}}\cfrac{\partial h^{(1)}}{\partial W}$$
Here each $\cfrac{\partial h^{(k)}}{\partial U}$ and $\cfrac{\partial h^{(k)}}{\partial W}$ denotes the direct dependence at step $k$, with $h^{(k-1)}$ held fixed.
We can see that the derivative with respect to $V$ involves no dependence on earlier time steps, whereas the derivatives with respect to $U$ and $W$ depend on all earlier time steps (a long-term dependency). This is because $h^{(t)}$ is propagated forward along the sequence, and $h^{(t)}$ is itself a function of $U$ and $W$.
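The three-step derivation can be checked numerically. The sketch below is one possible implementation, not the original author's code: it assumes tanh hidden units (so that $\partial h^{(j)}/\partial h^{(j-1)}=\mathrm{diag}(1-(h^{(j)})^2)\,W$), a softmax output with cross-entropy loss at the last step only, and $h^{(0)}=0$. All function names and sizes are hypothetical. The loop in `bptt_last_step` adds one term per time step, mirroring the sums in the formulas for $U$ and $W$, and the result is compared against central finite differences.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(xs, U, W, V, b, c):
    """Return hidden states h^(0..T), with h^(0) = 0, and the last-step prediction."""
    hs = [np.zeros(W.shape[0])]
    for x_t in xs:
        hs.append(np.tanh(U @ x_t + W @ hs[-1] + b))
    return hs, softmax(V @ hs[-1] + c)

def last_step_loss(xs, target, U, W, V, b, c):
    _, y_hat = forward(xs, U, W, V, b, c)
    return -np.log(y_hat[target])          # cross-entropy at the last step

def bptt_last_step(xs, target, U, W, V, b, c):
    """Gradients of the last-step loss w.r.t. V, U, W, walking back in time."""
    hs, y_hat = forward(xs, U, W, V, b, c)
    T = len(xs)
    d_o = y_hat.copy()
    d_o[target] -= 1.0                     # dL/do for softmax + cross-entropy
    dV = np.outer(d_o, hs[T])              # involves only the last hidden state
    dU, dW = np.zeros_like(U), np.zeros_like(W)
    d_h = V.T @ d_o                        # dL/dh^(T)
    for t in range(T, 0, -1):
        d_a = d_h * (1.0 - hs[t] ** 2)     # back through tanh at step t
        dU += np.outer(d_a, xs[t - 1])     # direct term for U at step t
        dW += np.outer(d_a, hs[t - 1])     # direct term for W at step t
        d_h = W.T @ d_a                    # pass the gradient on to h^(t-1)
    return dV, dU, dW

def numerical_grad(f, P, eps=1e-6):
    """Central finite differences of scalar f() w.r.t. array P (edited in place)."""
    G = np.zeros_like(P)
    for idx in np.ndindex(P.shape):
        old = P[idx]
        P[idx] = old + eps
        plus = f()
        P[idx] = old - eps
        minus = f()
        P[idx] = old                       # restore the original value
        G[idx] = (plus - minus) / (2 * eps)
    return G

rng = np.random.default_rng(0)
xs = rng.normal(size=(3, 2))               # three time steps t1, t2, t3
U = rng.normal(scale=0.5, size=(4, 2))
W = rng.normal(scale=0.5, size=(4, 4))
V = rng.normal(scale=0.5, size=(3, 4))
b, c = np.zeros(4), np.zeros(3)
target = 1

dV, dU, dW = bptt_last_step(xs, target, U, W, V, b, c)
loss = lambda: last_step_loss(xs, target, U, W, V, b, c)
for name, analytic, P in [("V", dV, V), ("U", dU, U), ("W", dW, W)]:
    print(name, np.allclose(analytic, numerical_grad(loss, P), atol=1e-6))
```

With $h^{(0)}=0$, the step-1 term for $W$ is zero, because $W$ multiplies $h^{(0)}$ there; the step-1 term for $U$ is generally nonzero.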


weixin_40826634
