Introduction

[Figure from: Paper reading notes (22) [CVPR 2017]: See the Forest for the Trees: Joint Spatial and Temporal Recurrent Neural Networks for Video-based Person Re-identification]

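In a video sequence, some frames are heavily occluded and should be "ignored" as far as possible. The paper therefore proposes a temporal attention model (TAM) that focuses on the more relevant frames.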

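Conventional metric learning usually works on distances between features and ignores the differences between frames. As the figure above shows, the proposed method takes the spatial differences of adjacent frames into account through a spatial recurrent model (SRM).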

The proposed method

(1) Overall framework:

[Figure: overall framework]

The input is a triplet of video sequences: [formula image missing]. A CNN first extracts features from each frame. The CNN is CaffeNet, which has 5 convolutional layers (conv1 to conv5) and 2 fully connected layers (fc6 and fc7). Its output is: [formula image missing]

The temporal attention model has two parts: a subnetwork that learns the relevance of each frame, and a temporal RNN that extracts features. The final output feature is [formula image missing], defined as: [formula image missing]

In parallel, for a pair of videos, the feature maps [formula image missing] (taken from the pooling layer after the fifth convolutional layer) are computed and fed into the spatial recurrent model. This part contains 6 RNNs, each extracting features along one specific direction. Its output is the likelihood that the two videos show the same person: [formula image missing]

At test time, the final similarity of two videos is computed as follows. (Why is it computed this way? How is M computed?)

[formula image missing]

Here F is the Euclidean distance, and λ is a parameter that balances feature learning against metric learning. It defaults to 1.
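The formula itself was an image and is not available here, so the sketch below only illustrates how a Euclidean feature distance F and a learned match score M could be balanced by λ. The combination rule, the sign convention (lower means more similar), and the names `combined_distance` and `match_prob` are all assumptions, not the paper's definition.

```python
import numpy as np

def euclidean_distance(f_a, f_b):
    """F: Euclidean distance between two sequence-level feature vectors."""
    f_a = np.asarray(f_a, dtype=float)
    f_b = np.asarray(f_b, dtype=float)
    return float(np.linalg.norm(f_a - f_b))

def combined_distance(f_a, f_b, match_prob, lam=1.0):
    """Hypothetical fusion of the two branches (ASSUMPTION, not from the paper).

    match_prob stands in for M, the same-person probability from the metric
    branch. A high probability should make the pair look closer, so it enters
    as (1 - match_prob). lam plays the role of the balancing parameter λ.
    """
    return euclidean_distance(f_a, f_b) + lam * (1.0 - match_prob)

# Two toy feature vectors at distance 5, with a confident and an unsure match score.
confident = combined_distance([0.0, 0.0], [3.0, 4.0], match_prob=1.0)
unsure = combined_distance([0.0, 0.0], [3.0, 4.0], match_prob=0.5)
```

With λ = 1 the two terms contribute on the same scale; a larger λ would let the metric branch dominate the ranking.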

(2) Temporal attention model (TAM) for feature learning:

[Figure: temporal attention model]

The input is the features extracted by the CNN. At each time step t, a weighted average over the frames is computed:

[formula image missing]

where [formula image missing], and the weights w are obtained by training the following subnetwork:

[Figure: attention subnetwork]

The resulting [formula image missing] is fed into the RNN, which is a Long Short-Term Memory (LSTM) network. Finally, the outputs of the T time steps are combined by temporal average pooling.
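The three steps just described (attention-weighted averaging, an LSTM, temporal average pooling) can be sketched in NumPy as below. This is a minimal illustration, not the paper's implementation: the softmax normalisation of the weights, the random `scores` standing in for the attention subnetwork's output, and all sizes and parameter names are assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_pool(frames, scores):
    """frames: (T, D) per-frame CNN features.
    scores: (T, T) unnormalised relevance; row t holds the scores for time step t.
    Returns one weighted average of the frames per time step, plus the weights."""
    weights = softmax(scores, axis=1)          # ASSUMPTION: each row sums to 1
    return weights @ frames, weights

def lstm_forward(inputs, W, U, b):
    """Minimal LSTM. inputs: (T, D), W: (4H, D), U: (4H, H), b: (4H,)."""
    H = U.shape[1]
    h = np.zeros(H)
    c = np.zeros(H)
    outputs = []
    for x in inputs:
        gates = W @ x + U @ h + b
        i = sigmoid(gates[:H])                 # input gate
        f = sigmoid(gates[H:2 * H])            # forget gate
        o = sigmoid(gates[2 * H:3 * H])        # output gate
        g = np.tanh(gates[3 * H:])             # candidate cell state
        c = f * c + i * g
        h = o * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)                   # (T, H)

rng = np.random.default_rng(0)
T, D, H = 6, 8, 4                              # toy sizes, chosen for illustration
frames = rng.normal(size=(T, D))
scores = rng.normal(size=(T, T))               # stand-in for the subnetwork output

attended, weights = attention_pool(frames, scores)
hidden = lstm_forward(attended,
                      rng.normal(size=(4 * H, D)) * 0.1,
                      rng.normal(size=(4 * H, H)) * 0.1,
                      np.zeros(4 * H))
sequence_feature = hidden.mean(axis=0)         # temporal average pooling over T steps
```

An occluded frame would ideally receive a small weight in every row of `weights`, so it contributes little to any of the T inputs the LSTM sees.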

(3) Spatial recurrent model (SRM) for metric learning:

[Figure: spatial recurrent model]

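The input is the pooling-layer features of a pair of video sequences. They are subtracted element-wise to give an initial difference map, which then passes through a 1×1 convolution. Spatial RNN modules in 6 directions follow. Their outputs are combined and passed through a 1×1 convolutional layer and a fully connected layer to give the final feature.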

The RNN works as follows:

[Figure: how the spatial RNN works]
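As a rough illustration of directional sweeps, the sketch below runs a recurrent unit over a difference volume along each axis in both directions and concatenates the six results. Several things here are assumptions: that the six directions are forward and backward along time, height, and width; the plain tanh recurrence (swapped in for brevity); and every name and size.

```python
import numpy as np

def sweep(volume, axis, reverse, W_in, W_rec):
    """Run a simple tanh RNN along one axis of `volume`, in one direction.

    volume: (T, H, W, C); W_in: (C, K); W_rec: (K, K). Returns (T, H, W, K).
    Each output position only sees inputs that come earlier in the sweep.
    """
    x = np.moveaxis(volume, axis, 0)           # bring the sweep axis to the front
    if reverse:
        x = x[::-1]
    h = np.zeros(x.shape[1:-1] + (W_rec.shape[0],))
    out = []
    for step in x:                             # one slice of the volume per step
        h = np.tanh(step @ W_in + h @ W_rec)
        out.append(h)
    out = np.stack(out)
    if reverse:
        out = out[::-1]                        # restore the original ordering
    return np.moveaxis(out, 0, axis)

def six_direction_features(volume, params):
    """Concatenate the sweeps over 3 axes x 2 directions along the channel axis."""
    feats = []
    for axis in range(3):
        for reverse in (False, True):
            W_in, W_rec = params[(axis, reverse)]
            feats.append(sweep(volume, axis, reverse, W_in, W_rec))
    return np.concatenate(feats, axis=-1)

rng = np.random.default_rng(0)
T, H, W, C, K = 6, 3, 3, 4, 2                  # toy sizes, chosen for illustration
# Element-wise difference of two sequences' pooled feature maps (random stand-ins).
diff = rng.normal(size=(T, H, W, C)) - rng.normal(size=(T, H, W, C))
params = {(a, r): (rng.normal(size=(C, K)) * 0.1, rng.normal(size=(K, K)) * 0.1)
          for a in range(3) for r in (False, True)}
combined = six_direction_features(diff, params)   # shape (T, H, W, 6 * K)
```

Because each sweep accumulates context from one side only, combining opposite directions lets every position draw on the whole difference volume.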

The 1×1 convolution works as follows:

[Figure: how the 1×1 convolution works]
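A 1×1 convolution applies the same linear map to the channel vector at every spatial position, so it mixes channels without looking at neighbouring positions. The sketch below checks this against an explicit per-position loop. The shapes and the channels-last layout are assumptions made for the example.

```python
import numpy as np

def conv1x1(feature_map, weight, bias):
    """feature_map: (H, W, C_in); weight: (C_in, C_out); bias: (C_out,).

    Matrix multiplication over the last axis applies the same linear map
    at every (row, column) position.
    """
    return feature_map @ weight + bias

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(4, 5, 8))
weight = rng.normal(size=(8, 3))
bias = rng.normal(size=3)

out = conv1x1(feature_map, weight, bias)

# The same computation written position by position.
loop = np.empty((4, 5, 3))
for i in range(4):
    for j in range(5):
        loop[i, j] = weight.T @ feature_map[i, j] + bias
```

Here the channel count drops from 8 to 3 while the 4×5 spatial layout is unchanged, which is why 1×1 layers are handy for merging or compressing stacked feature maps.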

Experiments

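(1) Experimental setup:

① Datasets: iLIDS-VID, PRID2011, MARS;

② Implementation details: the CNN is CaffeNet and the RNN is an LSTM. The video sequence length is set to 6, with frames picked at random from the tracklet. The dimensions of fc6 and fc7 are set to 1024.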
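Picking a length-6 sequence from a tracklet could look like the sketch below. The notes only say the frames are chosen at random, so sampling without replacement and keeping temporal order are assumptions, and the frame file names are hypothetical.

```python
import random

def sample_frames(tracklet, length=6, seed=None):
    """Pick `length` distinct frames at random, keeping their temporal order."""
    if len(tracklet) < length:
        raise ValueError("tracklet is shorter than the requested sequence length")
    rng = random.Random(seed)
    indices = sorted(rng.sample(range(len(tracklet)), length))
    return [tracklet[i] for i in indices]

# Hypothetical tracklet of 40 frames.
tracklet = [f"frame_{i:03d}.jpg" for i in range(40)]
clip = sample_frames(tracklet, seed=0)
```

Passing a fixed `seed` makes the draw reproducible; leaving it as `None` gives a fresh sample on every call, which is the usual choice during training.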

(2) Experimental results:

[Image: experimental results]

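CNN: CNN only;

CNN+RNN: CNN and RNN only (no temporal pooling);

CNN+TAM: temporal pooling on top of the CNN and RNN;

CNN+DIFF: CNN, with fully connected layers in place of the spatial RNNs;

CNN+SRM: CNN with the spatial RNNs;

ALL: CNN, temporal RNN, and spatial RNN.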


05-11 11:23