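This post covers a Fudan University paper published at AAAI 2019; at the time of writing, its network had the highest accuracy on CASIA-B.
The original English text is pasted in. Google Translate assisted, but the translation is mostly manual, so please point out any errors in the comments. Some passages are my own interpretation and are not part of the original paper.
Translated by 周悦媛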
Contents
Abstract
1. Introduction
Flexible
Fast
Effective
2. Related Work
2.1 Gait Recognition
2.2 Deep Learning on Unordered Sets
3. GaitSet
3.1 Problem Formulation
3.2 Set Pooling
Statistical Functions
Joint Function
Attention
3.3 Horizontal Pyramid Mapping
3.4 Multilayer Global Pipeline
3.5 Training and Testing
Training Loss
Testing
4. Experiments
4.1 Datasets and Training Details
CASIA-B
OU-MVLP
Training Details
4.2 Main Results
CASIA-B
Small-Sample Training (ST)
Medium-Sample Training (MT) & Large-Sample Training (LT)
OU-MVLP
4.3 Ablation Experiments
Set vs. GEI
Impact of SP
Impact of HPM and MGP
4.4 Practicality
Limited Silhouettes
Multiple Views
Multiple Walking Conditions
5. Conclusion
「Abstract」
As a unique biometric feature that can be recognized at a distance, gait has broad applications in crime prevention, forensic identification and social security.
To portray a gait, existing gait recognition methods utilize either a gait template, where temporal information is hard to preserve, or a gait sequence, which must keep unnecessary sequential constraints and thus loses the flexibility of gait recognition.
In this paper we present a novel perspective, where a gait is regarded as a set consisting of independent frames. We propose a new network named GaitSet to learn identity information from the set.
Based on the set perspective, our method is immune to permutation of frames, and can naturally integrate frames from different videos which have been filmed under different scenarios, such as diverse viewing angles and different clothes/carrying conditions.
Experiments show that under normal walking conditions, our single-model method achieves an average rank-1 accuracy of 95.0% on the CASIA-B gait dataset and an 87.1% accuracy on the OU-MVLP gait dataset.
These results represent new state-of-the-art recognition accuracy.
On various complex scenarios, our model exhibits a significant level of robustness. It achieves accuracies of 87.2% and 70.4% on CASIA-B under bag-carrying and coat-wearing walking conditions, respectively.
These outperform the existing best methods by a large margin.
The method presented can also achieve a satisfactory accuracy with a small number of frames in a test sample, e.g., 82.5% on CASIA-B with only 7 frames.
「1」Introduction
Unlike other biometrics such as face, fingerprint and iris, gait is a unique biometric feature that can be recognized at a distance without the cooperation of subjects and intrusion to them. Therefore, it has broad applications in crime prevention, forensic identification and social security.
However, gait recognition suffers from exterior factors such as the subject's walking speed, dressing and carrying condition, and the camera's viewpoint and frame rate.
There are two main ways to identify gait in literature, i.e., regarding gait as an image and regarding gait as a video sequence. The first category compresses all gait silhouettes into one image, or gait template, for gait recognition.
(Translator's note: the typical representative of the first category is the GEI, or Gait Energy Image. In the figure below, the last column is the GEI of the images in the preceding columns.)
Simple and easy to implement, gait template easily loses temporal and fine-grained spatial information. Differently, the second category, which has become more common in recent years, extracts features directly from the original gait silhouette sequences.
However, these methods are vulnerable to exterior factors. Further, deep neural networks like 3D-CNN for extracting sequential information are harder to train than those using a single template like Gait Energy Image.
To solve these problems, we present a novel perspective which regards gait as a set of gait silhouettes. As a periodic motion, gait can be represented by a single period.
In a silhouette sequence containing one gait period, it was observed that the silhouette in each position has unique appearance, as shown in Fig. 1.
Fig. 1: From top-left to bottom-right, the silhouettes of one complete period of a subject in the CASIA-B gait dataset.
Even if these silhouettes are shuffled, it is not difficult to rearrange them into correct order only by observing the appearance of them. Thus, we assume the appearance of a silhouette has contained its position information. With this assumption, order information of gait sequence is not necessary and we can directly regard gait as a set to extract temporal information.
(Translator's note: "not necessary" here means the order does not have to be supplied as an input feature; the gait is treated as a set of images.)
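We propose an end-to-end deep learning model called GaitSet whose scheme is shown in Fig. 2.
Fig. 2: The framework of GaitSet. 'SP' stands for Set Pooling. Trapezoids represent convolution and pooling blocks, and trapezoids in the same column have the same configuration, which is shown by the rectangles with capital letters. Note that although the blocks in MGP have the same configuration as those in the main pipeline, parameters are shared only among blocks in the main pipeline and not with those in MGP. HPP stands for Horizontal Pyramid Pooling.
The input of our model is a set of gait silhouettes. (Translator's note: like the ones in Fig. 1.)
First, a CNN is used to extract frame-level features from each silhouette independently. Second, an operation called Set Pooling is used to aggregate frame-level features into a single set-level feature.
Since this operation is applied on high-level feature maps instead of the original silhouettes, it can preserve spatial and temporal information better than gait template. This will be justified by the experiment in Sec. 4.3.
(Translator's note: the raw silhouettes become high-level features once they have passed through the convolutions. I find this sentence a little hard to follow, though that may be my own limitation. What the authors probably mean is that the whole process extracts the spatial features of every frame and also the temporal features of the whole sequence, so the result is more comprehensive than what a gait template captures. The emphasis is on keeping temporal features while still extracting per-frame features.)
Third, a structure called Horizontal Pyramid Mapping (HPM) is used to map the set-level feature into a more discriminative space to obtain the final representation.
(Translator's note: the second half of this sentence sounds rather mysterious, mostly because of the impressive word "discriminative". My understanding is that the set-level feature, which contains both temporal and spatial information, is compressed into one-dimensional features so that the final fully connected layers can classify it.)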
The superiorities of the proposed method are summarized as follows:
Flexible
Our model is pretty flexible since there are no constraints on the input of our model except the size of the silhouette. It means that the input set can contain any number of non-consecutive silhouettes filmed under different viewpoints with different walking conditions. Related experiments are shown in Sec. 4.4. (Translator's note: the original paper left out the final period here, so I added it.)
Fast
Our model directly learns the representation of gait instead of measuring the similarity between a pair of gait templates or sequences. Thus, the representation of each sample needs to be calculated only once, then the recognition can be completed by calculating the Euclidean distance between representations of different samples.
Effective
Our model greatly improves the performance on the CASIA-B and the OU-MVLP datasets, showing its strong robustness to view and walking condition variations and high generalization ability to large datasets.
「2」Related Work
In this section, we will give a brief survey on gait recognition and set-based deep learning methods.
2.1 Gait Recognition
Gait recognition can be grouped into template-based and sequence-based categories.
Approaches in the former category first obtain human silhouettes of each frame by background subtraction.
Second, they generate a gait template by rendering pixel level operators on the aligned silhouettes.
Third, they extract the representation of the gait by machine learning approaches such as Canonical Correlation Analysis (CCA), Linear Discriminant Analysis (LDA) and deep learning. Fourth, they measure the similarity between pairs of representations by Euclidean distance or some metric learning approaches.
Finally, they assign a label to the template by some classifier, e.g., nearest neighbor classifier.
(Translator's note: a "pair of representations" here means the input image sequence and a group of image sequences already stored during training. The template that receives a label is the input under test.)
Previous works generally divide this pipeline into two parts, template generation and matching.
The goal of generation is to compress gait information into a single image, e.g., Gait Energy Image (GEI) and Chrono-Gait Image (CGI).
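To make the idea of compressing a gait into one image concrete, here is a minimal sketch of a GEI computed as the pixel-wise mean of aligned binary silhouettes, which is the usual definition. The random frames and the 64×64 size are toy assumptions, not data from the paper. The sketch also shows why a template cannot keep temporal information: shuffling the frames leaves the GEI unchanged.

```python
import numpy as np

def gait_energy_image(silhouettes):
    """Average a stack of aligned binary silhouettes, shape (n, H, W), into one template."""
    return np.asarray(silhouettes, dtype=np.float64).mean(axis=0)

rng = np.random.default_rng(0)
# Toy stand-in for 30 aligned binary silhouettes (assumption: 64x64 pixels).
frames = (rng.random((30, 64, 64)) > 0.5).astype(np.uint8)

gei = gait_energy_image(frames)

# Shuffle the frame order: the template is identical, so order is not encoded in it.
shuffled = frames[rng.permutation(len(frames))]
gei_shuffled = gait_energy_image(shuffled)

print(gei.shape, np.allclose(gei, gei_shuffled))
```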
In template matching approaches, View Transformation Model (VTM) learns a projection between different views. (Hu et al. 2013) proposed View-invariant Discriminative Projection (ViDP) to project the templates into a latent space to learn a view-invariant representation.
(Translator's note: on latent space, see https://www.quora.com/What-is.... It is simply a space of some unspecified dimensionality in which objects of the same class lie closer together, which makes classification easier. The link may not open; its content is shown in the figure below.)
Recently, as deep learning performs well on various generation tasks, it has been employed on gait recognition task (Yu et al. 2017a; He et al. 2019; Takemura et al. 2018a; Shiraga et al. 2016; Yu et al. 2017b; Wu et al. 2017).
As the second category, video-based approaches directly take a sequence of silhouettes as input. Based on the way of extracting temporal information, they can be classified into LSTM-based approaches (Liao et al. 2017) and 3D CNN-based approaches (Wolf, Babaee, and Rigoll 2016; Wu et al. 2017).
The advantages of these approaches are that 1) focusing on each silhouette, they can obtain more comprehensive spatial information, and 2) they can gather more temporal information because specialized structures are utilized to extract sequential information. However, the price to pay for these advantages is high computational cost.
2.2 Deep Learning on Unordered Sets
Most works in deep learning focus on regular input representations like sequences and images. The concept of unordered set was first introduced into computer vision by (Charles et al. 2017) (PointNet) to tackle point cloud tasks. Using unordered sets, PointNet can avoid the noise and the extension of data caused by quantization, and obtain a high performance. Since then, set-based methods have been widely used in the point cloud field (Wang et al. 2018c; Zhou and Tuzel 2018; Qi et al. 2017).
Recently, such methods have been introduced into computer vision domains like content recommendation (Hamilton, Ying, and Leskovec 2017) and image captioning (Krause et al. 2017) to aggregate features in the form of a set. (Zaheer et al. 2017) further formalized the deep learning tasks defined on sets and characterized the permutation invariant functions. To the best of our knowledge, it has not been employed in the gait recognition domain up to now.
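「3」GaitSet
In this section, we describe our method for learning discriminative information from a set of gait silhouettes. The overall pipeline is illustrated in Fig. 2.
3.1 Problem Formulation
We begin with formulating our concept of regarding gait as a set.
Given a dataset of N people with identities $y_i$, $i \in \{1, 2, \dots, N\}$, we assume the gait silhouettes of a certain person are subject to a distribution $P_i$ which is only related to its identity.
(Translator's note: in other words, a person's silhouettes correspond one-to-one with that person and cannot be confused with anyone else's. This is what makes gait recognition feasible in the first place: everyone's gait is distinctive.)
Therefore, all silhouettes in one or more sequences of a person can be regarded as a set of n silhouettes $X_i = \{x_i^j \mid j = 1, 2, \dots, n\}$, where $x_i^j \sim P_i$. (Translator's note: for ease of typing, the notes in this post write x(ij) for $x_i^j$.)
Translator's explanation and summary, using the CASIA-B dataset as an example:
The dataset has N = 124 people, each denoted by $y_i$. For example, if I remember correctly, many videos of the person with ID = 109 end before the person even appears, so in this paper's notation we would say that the videos of y109 are incomplete.
How do we denote a silhouette picked blindly from the whole dataset? The first index is the person and the second is the position of the silhouette within that person's set. If the chosen silhouette belongs to person 20 and is the 3rd silhouette in that person's set, it is written x(20 3), and the set it belongs to is X20.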
Under this assumption, we tackle the gait recognition task through 3 steps, formulated as:
$$f_i = H(G(F(x_i^1), F(x_i^2), \dots, F(x_i^n))) \tag{1}$$
where F is a convolutional network that aims to extract frame-level features from each gait silhouette.
The function G is a permutation invariant function used to map a set of frame-level features to a set-level feature (Zaheer et al. 2017). It is implemented by an operation called Set Pooling (SP) which will be introduced in Sec. 3.2.
The function H is used to learn the discriminative representation of $P_i$ from the set-level feature. This function is implemented by a structure called Horizontal Pyramid Mapping (HPM) which will be discussed in Sec. 3.3.
(Translator's note: that is, H classifies the set-level feature so that it maps to an individual. The original paper writes "HMP" here, which is presumably a typo.)
The input $X_i$ is a tensor with four dimensions, i.e. set dimension, image channel dimension, image height dimension, and image width dimension.
(Translator's note: tensor.shape = (n frames, 2 channels, 64, 64).)
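The three-step composition can be sketched as follows. This is only a shape-level illustration under stated assumptions: `F`, `G` and `H` here are toy stand-ins (block averaging, max over the set dimension, flattening), not the paper's CNN, Set Pooling or HPM, and the single-channel 64×64 input is chosen for simplicity. The point is that the representation has the same size however many frames the set contains.

```python
import numpy as np

def F(x):
    """Stand-in frame-level extractor: 4x4 block averaging of one (C, H, W) silhouette."""
    c, h, w = x.shape
    return x.reshape(c, h // 4, 4, w // 4, 4).mean(axis=(2, 4))

def G(v):
    """Stand-in permutation-invariant aggregation: max over the set dimension."""
    return v.max(axis=0)

def H(z):
    """Stand-in mapping to the final representation: flatten the set-level feature."""
    return z.reshape(-1)

def represent(X):
    """X has shape (n, C, H, W); returns f = H(G(F(x^1), ..., F(x^n)))."""
    V = np.stack([F(x) for x in X])  # frame-level features, shape (n, C, H/4, W/4)
    return H(G(V))

rng = np.random.default_rng(0)
X_short = rng.random((7, 1, 64, 64))   # a set of 7 toy silhouettes
X_long = rng.random((30, 1, 64, 64))   # a set of 30 toy silhouettes

f_short, f_long = represent(X_short), represent(X_long)
print(f_short.shape, f_long.shape)
```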
3.2 Set Pooling
The goal of Set Pooling (SP) is to aggregate gait information of elements in a set, formulated as $z = G(V)$, where z denotes the set-level feature and $V = \{v^j \mid j = 1, 2, \dots, n\}$ denotes the frame-level features. (Translator's note: the notes in this post write vj for $v^j$.)
There are two constraints in this operation.
First, to take a set as an input, it should be a permutation invariant function which is formulated as:
$$G(\{v^j \mid j = 1, 2, \dots, n\}) = G(\{v^{\pi(j)} \mid j = 1, 2, \dots, n\}) \tag{2}$$
where $\pi$ is any permutation.
Second, since in real-life scenarios the number of a person's gait silhouettes can be arbitrary, the function G should be able to take a set with arbitrary cardinality.
(Translator's note: that is, the set can be long or short, with any number of frames. This is one of GaitSet's main selling points.)
Next, we describe several instantiations of G. It will be shown in the experiments that although different instantiations of SP do have some influence on the performance, they do not differ greatly and all of them exceed GEI-based methods by a large margin.
Statistical Functions
To meet the requirement of the invariance constraint in Equ. 2, a natural choice of SP is to apply statistical functions on the set dimension. Considering the representativeness and the computational cost, we studied three statistical functions: max(·), mean(·) and median(·). The comparison will be shown in Sec. 4.3.
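Both constraints are easy to check numerically for the three statistical functions. The sketch below uses random toy features with an assumed shape (n, C, H, W); it verifies that each function gives the same output for a shuffled set and that the output shape does not depend on the number of frames.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((12, 8, 16, 16))      # toy frame-level features, shape (n, C, H, W)
perm = rng.permutation(V.shape[0])   # an arbitrary reordering of the set

# The three statistical functions, each applied on the set dimension (axis 0).
candidates = {
    "max": lambda v: v.max(axis=0),
    "mean": lambda v: v.mean(axis=0),
    "median": lambda v: np.median(v, axis=0),
}

results = {}
for name, G in candidates.items():
    z = G(V)
    results[name] = {
        "permutation_invariant": bool(np.allclose(z, G(V[perm]))),
        "shape_full_set": z.shape,
        "shape_three_frames": G(V[:3]).shape,  # a set with only 3 elements
    }
    print(name, results[name])
```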
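Joint Function
We also studied two ways to join the 3 statistical functions mentioned above:
$$G(\cdot) = \max(\cdot) + \mathrm{mean}(\cdot) + \mathrm{median}(\cdot) \tag{3}$$
$$G(\cdot) = \mathrm{1\_1C}(\mathrm{cat}(\max(\cdot), \mathrm{mean}(\cdot), \mathrm{median}(\cdot))) \tag{4}$$
where cat means concatenation on the channel dimension, 1_1C means a 1×1 convolutional layer, and max, mean and median are all applied on the set dimension. Equ. 4 is an enhanced version of Equ. 3: the additional 1×1 convolutional layer can learn proper weights to combine the information extracted by the different statistical functions.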
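A rough NumPy sketch of the two joint functions described above. The 1×1 convolution is modelled as a channel-mixing matrix without bias, which is a simplifying assumption, and the names `joint_sum` and `joint_conv` are hypothetical. With the weight set to three stacked identity blocks, the convolutional form reduces to the plain sum, which illustrates in what sense it is the more general version.

```python
import numpy as np

def stats(v):
    """max, mean and median over the set dimension of v, shape (n, C, H, W)."""
    return v.max(axis=0), v.mean(axis=0), np.median(v, axis=0)

def joint_sum(v):
    """Sum of the three statistics (the form without learnable weights)."""
    mx, mean, med = stats(v)
    return mx + mean + med

def joint_conv(v, weight):
    """Concatenate the statistics on the channel dimension, then apply a 1x1 convolution.

    weight has shape (C_out, 3 * C); a bias-free 1x1 convolution is a per-pixel matrix product.
    """
    cat = np.concatenate(stats(v), axis=0)           # (3C, H, W)
    return np.einsum("oc,chw->ohw", weight, cat)     # (C_out, H, W)

rng = np.random.default_rng(0)
C = 8
V = rng.random((12, C, 16, 16))

# Special case: identity blocks [I I I] make the 1x1 convolution add the three statistics.
identity_weight = np.hstack([np.eye(C)] * 3)
z_sum = joint_sum(V)
z_conv = joint_conv(V, identity_weight)
print(z_sum.shape, np.allclose(z_sum, z_conv))
```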
Attention
(Translator's note: this part of the original uses the word "refine" a lot. I have a rough understanding of it but have not settled on a good translation.)
Since visual attention was successfully applied in lots of tasks, we use it to improve the performance of SP.
Its structure is shown in Fig. 3. The main idea is to utilize the global information to learn an element-wise attention map for each frame-level feature map to refine it.
Fig. 3: The structure of Set Pooling (SP) using attention. 1_1C and cat stand for a 1×1 convolutional layer and concatenation respectively. The multiplication and the addition are both pointwise.
Global information is first collected by the statistical functions on the left (Translator's note: at the top in the figure shown here). Then it is fed into a 1×1 convolutional layer along with the original feature map to calculate an attention for the refinement. The final set-level feature z will be extracted by employing MAX on the set of the refined frame-level feature maps. The residual structure can accelerate and stabilize the convergence.
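The attention variant can be sketched roughly as below. Fig. 3 is not reproduced in this post, so several details are assumptions: max is used as the statistical function for the global information, the attention is squashed with a sigmoid, and the 1×1 convolution is a bias-free channel-mixing matrix. Treat it as one plausible reading of the description, not the authors' implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def attention_set_pooling(V, weight):
    """V: frame-level features (n, C, H, W); weight: (C, 2C) bias-free 1x1 convolution."""
    global_info = V.max(axis=0)                                   # (C, H, W), assumed statistic
    tiled = np.broadcast_to(global_info, V.shape)                 # repeat for every frame
    cat = np.concatenate([V, tiled], axis=1)                      # (n, 2C, H, W)
    attention = sigmoid(np.einsum("oc,nchw->nohw", weight, cat))  # element-wise attention map
    refined = V * attention + V                                   # pointwise product plus residual
    return refined.max(axis=0)                                    # MAX over the refined set

rng = np.random.default_rng(0)
n, C = 10, 4
V = rng.random((n, C, 8, 8))
weight = rng.standard_normal((C, 2 * C))

z = attention_set_pooling(V, weight)
z_perm = attention_set_pooling(V[rng.permutation(n)], weight)
print(z.shape, np.allclose(z, z_perm))
```

Because the global information itself comes from a permutation-invariant statistic, each frame is refined in the same way whatever its position, so the final max still satisfies the invariance constraint.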
3.3 Horizontal Pyramid Mapping
In literature, splitting feature map into strips is commonly used in person re-identification task. The images are cropped and resized into uniform size according to pedestrian size whereas the discriminative parts vary from image to image.
(Fu et al. 2018) proposed Horizontal Pyramid Pooling (HPP) to deal with it. HPP has 4 scales and thus can help the deep network focus on features with different sizes to gather both local and global information. We improve HPP to make it adapt better to the gait recognition task.
Instead of applying a 1×1 convolutional layer after the pooling, we use independent fully connected layers (FC) for each pooled feature to map it into the discriminative space, as shown in Fig. 4. We call it Horizontal Pyramid Mapping (HPM).
Fig. 4: The structure of HPM.
Specifically, HPM has S scales. On scale $s \in \{1, 2, \dots, S\}$, the feature map extracted by SP is split into $2^{s-1}$ strips on height dimension, i.e. $\sum_{s=1}^{S} 2^{s-1}$ strips in total.
(Translator's note: for example, if S = 3, a person's feature map is split along the vertical direction at 3 scales as in the figure below. The finest scale has $2^{3-1} = 4$ strips, and the strips of all scales add up to $1 + 2 + 4 = 7 = \sum_{s=1}^{3} 2^{s-1}$.)
Then a Global Pooling is applied to the 3-D strips to get 1-D features. For a strip $z_{s,t}$, where $t \in \{1, 2, \dots, 2^{s-1}\}$ stands for the index of the strip in the scale, the Global Pooling is formulated as $f'_{s,t} = \mathrm{maxpool}(z_{s,t}) + \mathrm{avgpool}(z_{s,t})$, where maxpool and avgpool denote Global Max Pooling and Global Average Pooling respectively. Note that the functions maxpool and avgpool are used at the same time because it outperforms applying either of them alone.
The final step is to employ FCs to map the features f' into a discriminative space. Since strips in different scales depict features of different receptive fields, and different strips in each scale depict features of different spatial positions, it comes naturally to use independent FCs, as shown in Fig. 4.
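The HPM steps can be sketched in NumPy as follows, using the S = 3 case from the 1 + 2 + 4 = 7 example. The feature-map size, the output dimension `d`, the bias-free FC matrices and the function name are all illustrative assumptions.

```python
import numpy as np

def horizontal_pyramid_mapping(z, fc_weights, num_scales):
    """z: set-level feature map (C, H, W); fc_weights: one (C, d) matrix per strip."""
    features = []
    idx = 0
    for s in range(1, num_scales + 1):
        # Scale s splits the height dimension into 2^(s-1) equal strips.
        for strip in np.split(z, 2 ** (s - 1), axis=1):
            # Global max pooling plus global average pooling turns a 3-D strip into a 1-D vector.
            pooled = strip.max(axis=(1, 2)) + strip.mean(axis=(1, 2))   # (C,)
            # Each strip has its own independent fully connected layer (bias omitted here).
            features.append(pooled @ fc_weights[idx])                   # (d,)
            idx += 1
    return np.stack(features)

C, height, width, d, S = 8, 16, 11, 4, 3
rng = np.random.default_rng(0)
z = rng.random((C, height, width))

num_strips = sum(2 ** (s - 1) for s in range(1, S + 1))
fc_weights = [rng.standard_normal((C, d)) for _ in range(num_strips)]

out = horizontal_pyramid_mapping(z, fc_weights, S)
print(num_strips, out.shape)
```

Note that `np.split` requires the height to be divisible by the number of strips at every scale, which is why the toy height is 16.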
3.4 Multilayer Global Pipeline
Different layers of a convolutional network have different receptive fields. The deeper the layer is, the larger the receptive field will be. Thus, pixels in feature maps of a shallow layer focus on local and fine-grained information while those in a deeper layer focus on more global and coarse-grained information.
The set-level features extracted by applying SP on different layers have an analogous property. As shown in the main pipeline of Fig. 2, there is only one SP on the last layer of the convolutional network. To collect set information at various levels, Multilayer Global Pipeline (MGP) is proposed. It has a similar structure to the convolutional network in the main pipeline, and the set-level features extracted in different layers are added to MGP.
The final feature map generated by MGP will also be mapped into $\sum_{s=1}^{S} 2^{s-1}$ features by HPM. Note that the HPM after MGP does not share parameters with the HPM after the main pipeline.
3.5 Training and Testing
Training Loss
As aforementioned, the output of the network is $2 \times \sum_{s=1}^{S} 2^{s-1}$ features with dimension d. The corresponding features among different samples will be used to compute the loss.
In this paper, Batch All (BA+) triplet loss is employed to train the network (Hermans, Beyer, and Leibe 2017).
(Translator's note: BA+ triplet loss is introduced in the sixth paragraph of Sec. 2 of the paper "In Defense of the Triplet Loss for Person Re-Identification".)
A batch with size of p×k is sampled from the training set where p denotes the number of persons and k denotes the number of training samples each person has in the batch.
Note that although the experiment shows that our model performs well when it is fed with the set composed of silhouettes gathered from arbitrary sequences, a sample used for training is actually composed of silhouettes sampled in one sequence.
(Translator's note: as I understand it, at test time the input can mix silhouettes from any of a person's sequences, but during training each sample contains silhouettes from only one sequence of one person.)
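A minimal sketch of a batch-all triplet loss on a p×k batch, for a single d-dimensional feature. The `margin` parameter is hypothetical, and averaging only over the non-zero terms is my assumption about how the BA+ variant is reduced; neither detail is stated in the passage above. In the full model the same computation would be repeated for every strip feature.

```python
import numpy as np

def batch_all_triplet_loss(features, labels, margin):
    """features: (B, d); labels: (B,). Mean hinge loss over all non-zero valid triplets."""
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))                    # pairwise Euclidean distances (B, B)

    same = labels[:, None] == labels[None, :]
    pos_mask = same & ~np.eye(len(labels), dtype=bool)          # (anchor, positive) pairs
    neg_mask = ~same                                            # (anchor, negative) pairs

    # hinge[a, p, n] = max(0, margin + d(a, p) - d(a, n)) for every index combination.
    hinge = np.maximum(0.0, margin + dist[:, :, None] - dist[:, None, :])
    valid = pos_mask[:, :, None] & neg_mask[:, None, :]
    losses = hinge[valid]

    nonzero = losses[losses > 0]
    return float(nonzero.mean()) if nonzero.size else 0.0

# Toy batch with p = 2 persons and k = 2 samples each.
features = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.1, 0.0]])

# Labels that match the two clusters: every triplet already satisfies the margin.
loss_separated = batch_all_triplet_loss(features, np.array([0, 0, 1, 1]), margin=0.2)
# Labels that cut across the clusters: positives are far apart and negatives are close.
loss_mixed = batch_all_triplet_loss(features, np.array([0, 1, 0, 1]), margin=0.2)
print(loss_separated, loss_mixed)
```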
Testing
Given a query Q, the goal is to retrieve all the sets with the same identity in gallery set G. Denote the sample in G as g. The Q is first put into GaitSet net to generate multiscale features, followed by concatenating all these features into a final representation $F_q$ as shown in Fig. 2. The same process is applied on each g to get $F_g$. Finally, $F_q$ is compared with every $F_g$ using Euclidean distance to calculate Rank 1 recognition accuracy.
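The retrieval step can be sketched as a nearest-neighbour search under Euclidean distance. The 2-D features and identity numbers below are made-up toy values standing in for the concatenated representations, and `rank1_accuracy` is a hypothetical helper name.

```python
import numpy as np

def rank1_accuracy(query_feats, query_ids, gallery_feats, gallery_ids):
    """Fraction of queries whose nearest gallery sample has the same identity."""
    diff = query_feats[:, None, :] - gallery_feats[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # (num_query, num_gallery)
    nearest = dist.argmin(axis=1)              # index of the closest gallery sample
    return float((gallery_ids[nearest] == query_ids).mean())

# Toy gallery: one representation per identity.
gallery_feats = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
gallery_ids = np.array([10, 11, 12])

# Toy queries: the last one lies closer to identity 11 although its true identity is 10.
query_feats = np.array([[0.1, 0.2], [4.8, 0.1], [0.3, 4.9], [2.6, 0.0]])
query_ids = np.array([10, 11, 12, 10])

accuracy = rank1_accuracy(query_feats, query_ids, gallery_feats, gallery_ids)
print(accuracy)
```

Because each representation is computed once and then only compared by distance, adding a new gallery sample does not require re-running the network on existing ones.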