GaitSet: Cross-view Gait Recognition through Utilizing Gait as a Deep Set (Reading Notes)

A paper published in PAMI, with the work done by a team from Fudan University.

Contents

Abstract
1 INTRODUCTION
2 RELATED WORKS
2.1 Gait Recognition
2.2 Deep Learning on an Unordered Set
3 GAITSET
3.1 Problem Formulation
3.2 Set Pooling
3.3 Horizontal Pyramid Mapping
3.4 Multilayer Global Pipeline
3.5 Loss Functions and Training Strategy
3.6 Training and Test
3.7 Post Feature Dimension Reduction
4 EXPERIMENTS
4.1 Datasets
4.2 Parameter Setting
4.3 Brief Introduction of Compared Methods
4.4 Main Results
4.4.1 CASIA-B
4.4.2 OU-MVLP
4.5 Ablation Experiments and Model Studies
4.5.1 Ablation experiments
4.5.2 Training strategies
4.6 Feature Dimension Reduction
4.7 Practicality
5 CONCLUSION


Abstract:

To portray a gait, existing gait recognition methods utilize either a gait template which makes it difficult to preserve temporal information, or a gait sequence that maintains unnecessary sequential constraints and thus loses the flexibility of gait recognition.
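Common gait-template methods lose temporal information, while gait-sequence methods keep unnecessary sequential constraints (too much redundancy) and are not flexible enough for gait recognition.
We present a novel perspective that utilizes gait as a deep set, which means that a set of gait frames are integrated by a global-local fused deep network inspired by the way our left- and right-hemisphere processes information to learn information that can be used in identification.
The proposed method treats gait as a deep set: a global-local fused deep network, inspired by how the left and right hemispheres of the brain process information, integrates the gait frames.
Our method is immune to frame permutations, and can naturally integrate frames from different videos that have been acquired under different scenarios, such as diverse viewing angles, different clothes, or different item-carrying conditions.
The advantage is that the method is robust to frame order and can naturally integrate gait frames captured in different scenarios, from different viewing angles, and with different clothing or carried items.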

1 INTRODUCTION

easily affected by exterior factors such as the subject’s walking speed, clothing, and item-carrying condition as well as the camera’s viewpoint and frame rate.
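Gait recognition is easily affected by many factors, such as the subject's walking speed, clothing, and carried items, as well as the camera's viewpoint and frame rate.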
Although various existing gait templates encode information as abundantly as possible, the compression process omits significant features such as temporal information and fine-grained spatial information.
The drawback of gait templates is that the compression discards temporal information and fine-grained spatial information.
These methods preserve more temporal information but would suffer a significant degradation when an input contains discontinuous frames or has a frame rate different from the training dataset.
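Gait-sequence methods keep as much temporal information as possible, but their performance drops significantly when the input contains discontinuous frames or has a frame rate different from the training data.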
To solve these problems, we present a novel perspective that regards gait as a set of gait silhouettes.
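The rationale for treating gait as a set is twofold. First, gait is periodic, so one gait cycle is enough (the set has finitely many elements). Second, the order of the silhouette frames is not important: the frames can easily be put back in order from their appearance alone, and since everyone walks with the same basic pattern, the order carries little information for telling people apart.
From this perspective, we propose an end-to-end deep learning model called GaitSet that extracts features from a gait frame set to identify gaits.
The contribution of this paper is an end-to-end deep learning model, GaitSet, that extracts features from a gait set to identify a person.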
First, a CNN is used to extract frame-level features from each silhouette independently (local information).
Step one: a CNN processes each silhouette independently and extracts frame-level features (local features).
Second, an operation called Set Pooling is used to aggregate frame-level features into a single set-level feature (global information).
Step two: an operation called Set Pooling aggregates all frame-level features into a single set-level feature.
it preserves spatial and temporal information better than a gait template;
The set-level feature preserves temporal and spatial information better than a gait template.
Third, a structure called Horizontal pyramid mapping (HPM) is applied to project the set-level feature into a more discriminative space to obtain a final deep set representation.
Step three: a structure called Horizontal Pyramid Mapping (HPM) projects the set-level feature into a more discriminative feature space, yielding the final deep set representation.

2 RELATED WORKS

2.1 Gait Recognition

Gait recognition can be broadly categorized into template-based and sequence-based approaches.
Gait recognition methods fall broadly into two categories: template-based and sequence-based.
The goal of template generation is to compress gait information into a single image,
The purpose of generating a gait template is to compress the gait information into a single image.
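A well-known example of such a template is the gait energy image (GEI), the pixel-wise average of aligned silhouettes. Below is a minimal sketch in plain Python, assuming tiny binary silhouettes stored as nested lists; the function name `gait_energy_image` and the toy frames are illustrative and not taken from the paper.

```python
def gait_energy_image(silhouettes):
    """Average a list of equally sized binary silhouettes pixel by pixel."""
    n = len(silhouettes)
    height, width = len(silhouettes[0]), len(silhouettes[0][0])
    return [
        [sum(frame[r][c] for frame in silhouettes) / n for c in range(width)]
        for r in range(height)
    ]

# Three toy 2x3 silhouettes (1 = body pixel, 0 = background).
frames = [
    [[0, 1, 0], [1, 1, 0]],
    [[0, 1, 0], [0, 1, 1]],
    [[0, 1, 0], [1, 1, 1]],
]
gei = gait_energy_image(frames)
```

The average records where the body tends to be, but not when each pose occurred, which is one way to see why a template struggles to keep temporal information.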
In the template matching procedure, they first extract the gait representation from a template image using machine learning approaches such as canonical correlation analysis (CCA) [16], linear discriminant analysis (LDA) [1],[17] and deep learning [18].
CCA, LDA, and deep learning are commonly used to extract the gait representation from a gait template.
Then, they measure the similarity between pairs of representations using Euclidean distance or other metric learning approaches.
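The similarity between gait representations is then measured with Euclidean distance or another metric.
Finally, they assign a label to the template based on the measured distance using a classifier, e.g., SVM or nearest neighbor classifier.
Finally, an SVM or a nearest neighbor classifier determines whose gait the template belongs to.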
In the second category, the video-based approaches directly take a sequence of silhouettes as an input.
Sequence-based methods take the series of silhouettes extracted from video directly as their input.

2.2 Deep Learning on an Unordered Set

The initial goal for using unordered sets was to address point cloud tasks in the computer vision domain [28] based on PointNet.
In computer vision, unordered sets were first used by PointNet to tackle point cloud tasks.
Using an unordered set, PointNet can avoid the noise introduced by quantization and the extension of data, leading to a high prediction performance.
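By working on an unordered set, PointNet avoids the noise that quantization and data extension would introduce (in effect, it sidesteps a source of noise).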

3 GAITSET

3.1 Problem Formulation

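all silhouettes in one or more sequences of a given person can be regarded as a set of n silhouettes $\mathcal{X}_i = \{x_i^j \mid j = 1, 2, \dots, n\}$
All gait sequence data of one person are turned into a set of n elements (n is presumably a hyperparameter).
$f_i = H(G(F(x_i^1), F(x_i^2), \dots, F(x_i^n)))$
In this formula, each $x_i^j$ is one silhouette of the gait sequence, F is the convolutional network that extracts local (frame-level) features, G extracts the global feature (that is, the feature of the whole set), and H is the horizontal pyramid mapping.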
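The composition H(G(F(...))) can be sketched with toy stand-ins. This is only an illustration of how the three stages chain together: the bodies of `F`, `G`, and `H` below are placeholders I made up (a flatten, an element-wise max, and a fixed scaling), not the learned networks of the paper.

```python
def F(silhouette):
    """Stand-in for the frame-level CNN: here it just flattens the image."""
    return [float(p) for row in silhouette for p in row]

def G(frame_features):
    """Stand-in for set pooling: element-wise max over the set."""
    return [max(values) for values in zip(*frame_features)]

def H(set_feature):
    """Stand-in for HPM: a fixed scaling, not a learned mapping."""
    return [0.5 * v for v in set_feature]

def gaitset_forward(silhouette_set):
    """f_i = H(G(F(x_i^1), ..., F(x_i^n)))."""
    return H(G([F(x) for x in silhouette_set]))

# A toy set X_i of three 2x2 silhouettes.
X_i = [
    [[0, 1], [1, 0]],
    [[1, 1], [0, 0]],
    [[0, 0], [0, 1]],
]
f_i = gaitset_forward(X_i)
```

Because F acts on each silhouette separately and G ignores order, feeding the same silhouettes in a different order gives the same `f_i`.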

3.2 Set Pooling

The goal of Set Pooling (SP) is to condense a set of gait information,
The role of set pooling is to condense the gait information contained in a set.
Note that there are two constraints when performing an SP operation.
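Set pooling has two constraints: first, the result must not depend on the order of the elements in the set; second, the operation must place no limit on the number of elements in the set.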
To meet the invariant constraint requirement in Eq. 2, one rational strategy of SP is to use statistical functions on the set dimension.
To be robust to the order of the elements, set pooling is implemented with statistical functions.
max(·), mean(·) and median(·).
Three basic statistics are chosen: max, mean, and median.
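A small sketch of the three statistics applied along the set dimension, with a check of both constraints. The frame-level features are made-up 3-dimensional vectors and `set_pool` is an illustrative helper name.

```python
import random
from statistics import mean, median

def set_pool(frame_features, stat):
    """Apply `stat` element-wise across the set dimension."""
    return [stat(values) for values in zip(*frame_features)]

# Four toy frame-level feature vectors (one per silhouette).
features = [
    [0.2, 0.9, 0.1],
    [0.7, 0.3, 0.4],
    [0.5, 0.6, 0.8],
    [0.1, 0.2, 0.3],
]
shuffled = features[:]
random.shuffle(shuffled)

pooled = {name: set_pool(features, fn)
          for name, fn in [("max", max), ("mean", mean), ("median", median)]}
```

Pooling `shuffled` gives the same vectors as pooling `features` (order does not matter), and `set_pool(features[:2], max)` still returns a 3-dimensional vector (the set size is free).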
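$G(\cdot) = \max(\cdot) + \mathrm{mean}(\cdot) + \mathrm{median}(\cdot)$ (3)
$G(\cdot) = \mathrm{1\_1C}(\mathrm{cat}(\max(\cdot), \mathrm{mean}(\cdot), \mathrm{median}(\cdot)))$ (4)
Two joint strategies are built on top of the three basic operators. The first simply adds the three results. The second stacks (concatenates) the three results and then reduces the dimension with a 1×1 convolution (written 1_1C). The second one has learnable weights, so its representational power is somewhat stronger.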
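A rough sketch of Eq. 3 and Eq. 4 at a single pixel, where a 1×1 convolution reduces to a matrix-vector product. The features and the weight matrix `W` are hypothetical; `W` is picked by hand so that Eq. 4 reproduces Eq. 3, whereas in a real model these weights would be learned.

```python
from statistics import mean, median

def pool(frame_features, stat):
    """Element-wise statistic over the set dimension."""
    return [stat(v) for v in zip(*frame_features)]

def joint_sum(frame_features):
    """Eq. 3: add the three pooled vectors element by element."""
    pooled = [pool(frame_features, s) for s in (max, mean, median)]
    return [sum(v) for v in zip(*pooled)]

def joint_conv1x1(frame_features, weights):
    """Eq. 4: concatenate along channels, then mix with a 1x1 convolution.

    `weights` has one row per output channel and 3*C columns.
    """
    cat = [x for s in (max, mean, median) for x in pool(frame_features, s)]
    return [sum(w * x for w, x in zip(row, cat)) for row in weights]

# Features of 3 frames at one pixel, C = 2 channels.
feats = [[1.0, 4.0], [2.0, 0.0], [6.0, 2.0]]
# Hypothetical weights: each output channel adds its own max, mean and median.
W = [[1, 0, 1, 0, 1, 0],
     [0, 1, 0, 1, 0, 1]]

summed = joint_sum(feats)
mixed = joint_conv1x1(feats, W)
```

With other weights the 1×1 convolution can weigh the three statistics differently per channel, which is the extra flexibility the notes attribute to the second strategy.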
We included two attention strategies in our work.
Two attention strategies are proposed: pixel-wise attention and frame-wise attention.

3.3 Horizontal Pyramid Mapping

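horizontal pyramid pooling (HPP) through cropping and resizing the images into a uniform size based on pedestrian size while varying the discriminative parts from image to image.
Horizontal pyramid pooling works on images that have been cropped and resized to a uniform size based on pedestrian size, while the discriminative parts still vary from image to image.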
we improve HPP to adapt it to the gait recognition task; instead of applying a 1 × 1 convolutional layer after the pooling, we use independent fully connected layers (FC) for each pooled feature to map it into the discriminative space,
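HPM builds on HPP by replacing the original 1×1 convolution with independent fully connected layers, so that each pooled feature is mapped into a more discriminative feature space.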
The structure of horizontal pyramid mapping
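The feature map is first split pyramid-style at several scales, each scale cutting it into horizontal strips along the height dimension. Each 3D strip is reduced to a 1D feature by global pooling, applying global average pooling (GAP) and global max pooling (GMP) together. The combined result then goes through a fully connected layer, which maps it into a more discriminative feature space.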
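A plain-Python sketch of the strip splitting and pooling. Several things here are my assumptions rather than statements from these notes: the five scales 1, 2, 4, 8, 16, the toy feature-map size, and combining GMP and GAP by adding them. The per-strip fully connected layers are left out.

```python
def horizontal_strips(feature_map, scales=(1, 2, 4, 8, 16)):
    """Split a (channels x height x width) map into height-wise strips.

    At scale s the height is cut into s equal strips, so the assumed
    scales give 1 + 2 + 4 + 8 + 16 = 31 strips in total.
    """
    height = len(feature_map[0])
    strips = []
    for s in scales:
        step = height // s
        for t in range(s):
            strips.append([ch[t * step:(t + 1) * step] for ch in feature_map])
    return strips

def pool_strip(strip):
    """Global max pooling plus global average pooling, per channel."""
    out = []
    for channel in strip:
        values = [v for row in channel for v in row]
        out.append(max(values) + sum(values) / len(values))
    return out

# Toy feature map: 2 channels, height 16, width 4.
fmap = [[[float(c + r) for _ in range(4)] for r in range(16)] for c in range(2)]
strips = horizontal_strips(fmap)
features = [pool_strip(s) for s in strips]
```

Each of the 31 pooled vectors would then be passed through its own fully connected layer, so that strips at different heights and scales get separate weights.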

3.4 Multilayer Global Pipeline

Thus, pixels in the feature maps of a shallow layer pay more attention to local and fine-grained information while those in deeper layers focus more on global and coarse-grained information.
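Shallow layers of a neural network are more sensitive to local (pixel-level), fine-grained information, while deeper layers focus more on global, coarse-grained information.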
To collect different-level set information, we propose a multilayer global pipeline (MGP),
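The proposed MGP collects set-level information at different levels of the network.
The main pipeline is similar to that of human cognition, which focuses intuitively on a person’s profile, whereas the MGP can preserve more details of a person’s walking movements.
The main pipeline focuses on a person's profile, while MGP preserves more details of the walking movement.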

3.5 Loss Functions and Training Strategy

In the field of identification, two loss functions are widely used, i.e., cross entropy loss and triplet loss
The two most widely used loss functions in identification are cross-entropy loss and triplet loss.
It measures the gap between a predictive distribution and the corresponding true distribution.
Cross-entropy loss measures the gap between the predicted distribution and the true distribution.
It aims to pull semantically-similar points close to each other while pushing semantically-different points away from each other
Triplet loss aims to pull semantically similar points closer together and push semantically different points apart.
In this study, to improve the learning ability, we combined the cross entropy loss with the triplet loss.
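This work combines cross-entropy loss with triplet loss: the model is first trained with cross-entropy loss until convergence, and then trained further with triplet loss to make it more discriminative.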
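For reference, the textbook forms of the two losses can be sketched as follows. This is a generic single-sample version; the margin of 0.2 and the toy inputs are illustrative values, and the exact formulation used in the paper (batching, mining of triplets) is not reproduced here.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the true class under a softmax."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    d_ap = math.dist(anchor, positive)
    d_an = math.dist(anchor, negative)
    return max(0.0, d_ap - d_an + margin)

ce = cross_entropy([0.0, 0.0], target=0)                      # uniform guess over 2 classes
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])       # negative already far away
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])       # negative closer than positive
```

The triplet loss is zero once the negative is farther from the anchor than the positive by at least the margin, so it only pushes on triplets that are still confusable.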

3.6 Training and Test

calculate the rank-1 recognition accuracy, which means the percentage of the correct subjects ranked first, based on nearest Euclidean distance.
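Rank-1 accuracy is computed with Euclidean distance: a probe counts as correct when its nearest gallery sample belongs to the same subject.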
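A minimal sketch of this evaluation, assuming probe and gallery features are already extracted. The subject IDs and 2-dimensional features are made up for illustration.

```python
import math

def rank1_accuracy(probe, gallery):
    """probe/gallery: lists of (subject_id, feature_vector) pairs."""
    hits = 0
    for subject, feature in probe:
        # The gallery entry with the smallest Euclidean distance is ranked first.
        nearest_id, _ = min(gallery, key=lambda g: math.dist(feature, g[1]))
        hits += nearest_id == subject
    return hits / len(probe)

gallery = [("A", [0.0, 0.0]), ("B", [1.0, 1.0]), ("C", [3.0, 0.0])]
probe = [("A", [0.1, 0.2]), ("B", [0.9, 1.2]),
         ("C", [1.4, 0.8]), ("C", [2.8, 0.1])]
accuracy = rank1_accuracy(probe, gallery)
```

In this toy data the third probe lies closer to subject B than to its true subject C, so three of the four probes are matched correctly.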

3.7 Post Feature Dimension Reduction

we proposed a post feature dimension reduction module which is a post trained linear projection to reduce the dimension of the output feature while maintaining a competitive recognition accuracy.
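A linear projection, trained after the main model, reduces the dimension of the output feature and so makes recognition more efficient while keeping accuracy competitive.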
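As a toy illustration of what such a projection does, the sketch below maps a 4-dimensional feature to 2 dimensions. The weights are hypothetical and hand-picked; in the paper they are learned after the main model has been trained.

```python
def project(feature, weights):
    """Linear projection: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]

feature = [1.0, 2.0, 3.0, 4.0]       # stands in for the high-dimensional output feature
weights = [[0.5, 0.5, 0.0, 0.0],     # hypothetical 2 x 4 projection matrix
           [0.0, 0.0, 0.5, 0.5]]
reduced = project(feature, weights)
```

Shorter features make the distance computations at test time cheaper, which is where the efficiency gain comes from.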

4 EXPERIMENTS

we report the results of comprehensive experiments conducted to evaluate the performance of the proposed GaitSet.
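The experiments cover four aspects: 1. comparison with the current state of the art; 2. ablation studies (analysing how each part of the proposed method affects the result); 3. the effect of feature dimension reduction; 4. performance under several special conditions.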

4.1 Datasets

Based on the sizes of the training sets, we name these three kinds of division small-sample training (ST), medium-sample training (MT) and large-sample training (LT). 
These three splits are used to check how the amount of training data affects the model.

4.2 Parameter Setting

Parameter settings and choice of optimizer.

We adopted the Adam optimizer [45] for training our GaitSet network.
The optimizer used for training is Adam.

4.3 Brief Introduction of Compared Methods

View-invariant Discriminative Projection (ViDP) [2] uses a unitary linear projection to project the templates into a latent space to learn a view-invariant representation.
ViDP uses a single linear projection to map the templates into a latent space in order to learn a view-invariant representation.
Correlated Motion Co-Clustering (CMCC) [46] first uses motion co-clustering to partition the most related parts of gaits from different views into the same group, and then applies canonical correlation analysis (CCA) on each group to maximize the correlation between gait information across views.
CMCC first uses motion co-clustering to group the most related parts of gaits from different views, and then applies CCA to each group to maximize the correlation of the gait information across views.

4.4 Main Results

4.4.1 CASIA-B

Tests on the CASIA-B dataset (from the Chinese Academy of Sciences).
Therefore, both parallel and vertical perspectives lose some portion of the gait information while views such as 36° and 144° achieve a better balance between these two extremes.
Both the parallel and the vertical views lose part of the gait information; angles such as 36° and 144° actually work better.
1) Because our model regards the input as a set, the number of samples (frames) available for training the convolutional network in the main pipeline is dozens of times higher than the number of samples used to train the template- or video-based models.
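The first reason small-sample training works well is that the model takes a gait set as input, so the number of training samples (frames) is far larger than for template-based or video-based methods.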
2) Because the sample sets used in the training phase are composed of frames selected randomly from the sequence in the training set, each of which can generate multiple different sample sets; thus, any units related to set feature learning (such as MGP and HPM) can also be trained well.
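The second reason is that each sample set is drawn at random from the frames of a training sequence, so one sequence yields many different sample sets and the set-level units (such as MGP and HPM) are also trained well.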
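The random construction of sample sets could look roughly like this. The set size of 30, the sequence length of 80, the fallback to sampling with repeats for short sequences, and the frame names are all illustrative assumptions.

```python
import random

def sample_set(sequence, set_size, rng):
    """Draw `set_size` frames at random; their order is irrelevant for a set."""
    if len(sequence) >= set_size:
        return rng.sample(sequence, set_size)
    return rng.choices(sequence, k=set_size)  # short sequence: allow repeats

rng = random.Random(0)
sequence = [f"frame_{k:03d}" for k in range(80)]   # one toy training sequence
batch = [sample_set(sequence, 30, rng) for _ in range(4)]
```

Every call returns a different subset of the same sequence, so a single walking sequence acts as many distinct training samples.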
Our model achieves satisfactory performance on the BG subset. On the CL dataset, the recognition performances are somewhat less satisfactory, although our model still exceeds the best performance reported so far [7] by over 15%.
Results on the BG (bag-carrying) subset are quite good; results on the CL (coat-wearing) subset are somewhat worse.

4.4.2 OU-MVLP

The results show that our method generalizes well to a dataset with a large scale and wide view variation.
The experiment shows that the proposed method works well on a large-scale dataset with wide view variation.

4.5 Ablation Experiments and Model Studies

In this section, we report ablation experiments and model studies on CASIA-B, to examine the effectiveness of regarding gait as a set with set pooling, MGP, HPM, and different training strategies with different loss combinations.
These experiments examine how each module of the method, and how different training strategies and loss functions, affect the result.

4.5.1 Ablation experiments

There might be two main reasons for this improvement: 1) our SP extracts the set-level feature from a high-level feature map where the temporal information is well preserved and the spatial information has been sufficiently processed; and 2) as mentioned in Sec. 4.4, regarding gait as a set enlarges the volume of training data.
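Gait sets outperform the gait energy image for two reasons: 1. the set-level feature extracted by SP preserves temporal information well, and the spatial information has been sufficiently processed; 2. forming sets greatly enlarges the training data, which benefits network training.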
SP with pixel-wised attention achieves the highest accuracy on the NM and BG subsets and when max(·) is used, it obtains the highest accuracy on the CL subsets.
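SP with pixel-wise attention works best on the NM and BG subsets (max(·) is best on CL); as I understand it, this is also the choice used later on.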
that set-level features extracted from different layers of the main pipeline
MGP is used to extract set-level features from different layers.
HPM obtains better performance with more scales.
HPM performs better when more scales are used.
It can be seen that using independent weights increases the accuracy by more than 7% on each subset.
The fully connected layers in HPM work better when their weights are independent of each other.

4.5.2 Training strategies

However, only the pretraining model that combines the two losses reaches the highest 96.1% rank-1 accuracy.
Combining cross-entropy loss and triplet loss gives the best result.
the dropout layers are essential for a robust training performance of cross-entropy loss in this case.
Combining dropout with cross-entropy loss makes training more robust.
batch normalization improves all the training strategies.
Batch normalization helps under every training strategy.

4.6 Feature Dimension Reduction

the testing feature after concatenating all the HPM outputs has 256 × 31 × 2 = 15,872 dimensions in a standard framework
The feature dimension after HPM is 15,872.
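However, there is still a negative impact on the performance if the HPM output dimensions are too low (down to 32) or too high (up to 1024).
Too large an output dimension (more parameters) tends to overfit, while too small a dimension limits the model's expressive power.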
By decreasing the HPM output dimensions, we can compress the final feature dimension from 15,872 to one quarter of that.
Lowering the HPM output dimension compresses the final feature from its original 15,872 dimensions to a quarter of that.
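The arithmetic behind these numbers can be checked in a few lines. Reading 31 as 1 + 2 + 4 + 8 + 16 strips, reading the factor 2 as two branches whose HPM outputs are concatenated, and using 64 as the reduced per-strip dimension are my assumptions; the notes only give the product 256 × 31 × 2.

```python
SCALES = (1, 2, 4, 8, 16)   # assumed strips per HPM scale, 31 in total
BRANCHES = 2                # the factor 2 in 256 x 31 x 2

def final_feature_dim(hpm_out_dim):
    """Dimension of the concatenated test feature for a given per-strip size."""
    return hpm_out_dim * sum(SCALES) * BRANCHES

full = final_feature_dim(256)     # 15872
quarter = final_feature_dim(64)   # 3968
```

Since the final dimension is linear in the per-strip output size, cutting that size to a quarter cuts the final feature to a quarter as well.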
After the model has been well trained, a new fully connected layer is applied to the 15,872 dimension feature to reduce it into a lower dimension.
The new-fully-connected-layer approach means appending one more fully connected layer to the trained model to reduce the dimension.

4.7 Practicality

In real forensic identification scenarios, cases occur in which no continuous sequence of a subject’s gait is available, only some fitful and sporadic silhouettes.
In real applications, the available gait information may be intermittent rather than continuous.
It can also be observed that 1) the accuracy rises monotonically as the number of silhouettes increases, and 2) the accuracy is close to the best performance when the samples contain more than 25 silhouettes.
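The results show that accuracy rises monotonically with the number of silhouettes, and that about 25 frames are enough to get close to the best performance (25 frames correspond to one gait cycle).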
We simulate these scenarios by constructing each silhouette sample selected from two sequences that have the same walking condition but different views.
This tests how an input set that mixes different views affects model performance.
Including multiple views in the input set allows the model to gather both parallel and vertical information, resulting in performance improvements.
The experiment shows that a mixed-view set lets the model gather both parallel and vertical information, which improves performance.
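One way to picture this, assuming max pooling as the set pooling and made-up frame-level features: merging two sequences into one set needs no special handling, and the pooled feature keeps the strongest response from either view.

```python
def set_max_pool(frame_features):
    """Element-wise max over all frames in the set."""
    return [max(v) for v in zip(*frame_features)]

# Toy frame-level features from two sequences of the same subject.
view_a = [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1]]                    # e.g. one view
view_b = [[0.1, 0.7, 0.3], [0.2, 0.9, 0.2], [0.1, 0.8, 0.4]]   # e.g. another view

pooled_a = set_max_pool(view_a)
pooled_b = set_max_pool(view_b)
pooled_mixed = set_max_pool(view_a + view_b)   # just one larger set
```

Here the first feature responds mainly in `view_a` and the second mainly in `view_b`; the mixed set retains both, which is a simplified picture of the complementary information mentioned above.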
We simulate such different conditions by forming an input set using silhouettes from two sequences with the same view but different walking conditions and conduct experiments under the constraint of different numbers of silhouettes.

 

This checks how the model performs when the input set contains silhouettes from different walking conditions.
Containing large yet complementary noises and information, the combination of silhouettes from BG and CL helps the model improve the accuracy.
In the mixed case, accuracy on the bag-carrying (BG) and coat-wearing (CL) data still improves to some extent.

5 CONCLUSION

The proposed GaitSet approach extracts both spatial and temporal information more effectively and efficiently than do the existing methods, which regard gait as either a template or a sequence.
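The proposed GaitSet extracts both temporal and spatial information, and does so more effectively than existing template-based and sequence-based methods.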
In addition, since the set assumption could fit various other biometric identification tasks including person re-identification and video-based face recognition, the structure of GaitSet can be applied to these tasks with few minor changes in the future.
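Future work would apply the structure to person re-identification (Re-ID) and video-based face recognition, with only minor changes to the model.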
