1710.10467 generalized end-to-end loss for speaker verification

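Overview

Abstract

In 2017, the authors proposed a new loss function called the generalized end-to-end (GE2E) loss, which makes training speaker verification models more efficient than their earlier (2016) tuple-based end-to-end (TE2E) loss function.

Unlike TE2E, the GE2E loss function updates the network parameters in a way that emphasizes examples that are difficult to verify at each step of the training process. In addition, the GE2E loss does not require an initial stage of example selection. With these properties, the model with the new loss function reduces speaker verification EER by more than 10% while also cutting training time by 60%.

The authors also introduce the MultiReader technique, which enables domain adaptation: training a more accurate model that supports multiple keywords (i.e., "OK Google" and "Hey Google") as well as multiple dialects.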

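Points of confusion

The paper covers quite a few things:

The newly proposed GE2E loss is applied to two tasks: Text-independent Speaker Verification (TI-SV) and Text-dependent Speaker Verification (TD-SV).
For the text-dependent task, the paper also proposes MultiReader (for multiple keywords).
Two concrete implementations of GE2E are proposed: one based on Softmax and one based on Contrast (the contrastive form, which focuses on the examples that are hard to verify at each training step).
As a result, there are many comparison experiments:
Text-independent, 1 comparison: GE2E vs TE2E vs Softmax;
Text-dependent, 4 configurations: (GE2E, TE2E) x (MultiReader, no MultiReader).
Softmax vs. Contrast within GE2E, 1 comparison (Softmax here refers to a step used inside the GE2E loss), but no concrete numbers are given for it.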

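What the paper sidesteps:

The abstract stresses up front that, unlike TE2E, the GE2E loss function updates the network in a way that emphasizes examples that are difficult to verify at each step of the training process. This corresponds to what the paper calls the "contrast" form, which picks out the negative that is hardest to distinguish. However, the paper gives no concrete comparison of the Softmax and Contrast variants of the GE2E loss; it only states that Contrast performs better for TD-SV, while Softmax performs slightly better for TI-SV. Perhaps the Contrast formula in GE2E is too similar to Triplet Loss, so the comparison was simply avoided.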

Background

Several related loss functions are involved here:

Triplet Loss
Softmax
TE2E

Softmax

Softmax is used together with the cross-entropy loss and directly outputs the class probabilities.

Softmax function, a wonderful activation function that turns numbers (aka logits) into probabilities that sum to one. The softmax function outputs a vector that represents the probability distribution over a list of potential outcomes.
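
As a small illustration of the description above, here is a minimal NumPy sketch of softmax and the cross-entropy loss for a single example. The function names and the sample logits are chosen here for illustration only and do not come from the paper.

```python
import numpy as np

def softmax(logits):
    """Turn a 1-D array of logits into probabilities that sum to one."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp.
    exps = np.exp(logits - np.max(logits))
    return exps / np.sum(exps)

def cross_entropy(logits, true_class):
    """Cross-entropy loss for one example: -log p(true_class)."""
    return float(-np.log(softmax(logits)[true_class]))

# Illustrative logits for a 3-class problem.
probs = softmax([2.0, 1.0, 0.1])
print(probs, probs.sum())
```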

Triplet Loss

Triplet Loss was proposed in the 2015 FaceNet paper.

The Triplet Loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.
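
A minimal sketch of this loss for a single triplet, using squared Euclidean distance and a margin as in the FaceNet formulation. The margin value of 0.2 and the toy vectors are illustrative assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(||a - p||^2 - ||a - n||^2 + margin, 0) for one triplet."""
    d_pos = np.sum((anchor - positive) ** 2)   # anchor to same-identity sample
    d_neg = np.sum((anchor - negative) ** 2)   # anchor to different-identity sample
    return float(max(d_pos - d_neg + margin, 0.0))

a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])    # close to the anchor
n = np.array([-1.0, 0.0])   # far from the anchor
print(triplet_loss(a, p, n))   # already satisfied, so the loss is zero
print(triplet_loss(a, n, p))   # roles swapped, so the loss is large
```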

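The problem Triplet Loss solves:

When the number of classes is fixed, a softmax-based cross-entropy loss can be used.
When the number of classes is variable, triplet loss can be used, because triplet loss does not depend on the number of classes. Otherwise, if the number of classes is large, e.g. a speaker dataset on the order of 10k speakers, the weight matrix of the softmax layer becomes too large and is hard to train.

The strength of triplet loss is fine-grained discrimination; its weakness is slow convergence, and sometimes it does not converge at all.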

Offline triplet

Generate triplets offline every n steps, using the most recent network checkpoint and computing the argmin and argmax on a subset of the data.

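Based on the distances between samples, triplets fall into three kinds: semi-hard triplets, hard triplets, and easy triplets. The semi-hard and hard triplets are selected for training.

This approach is not efficient, because every few epochs the negative examples have to be classified again.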
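
The three kinds of triplets can be told apart from the anchor-positive and anchor-negative distances. The sketch below uses the commonly cited definitions (hard: the negative is closer than the positive; semi-hard: the negative is farther than the positive but still inside the margin; easy: the loss is already zero). The function name and the margin value are assumptions made for illustration.

```python
def triplet_kind(d_ap, d_an, margin=0.2):
    """Classify a triplet from its anchor-positive and anchor-negative distances."""
    if d_an < d_ap:
        return "hard"        # negative is closer to the anchor than the positive
    if d_an < d_ap + margin:
        return "semi-hard"   # negative is farther, but still within the margin
    return "easy"            # margin already satisfied, loss is zero

def select_for_training(triplet_distances, margin=0.2):
    """Keep only the hard and semi-hard triplets, given (d_ap, d_an) pairs."""
    return [t for t in triplet_distances if triplet_kind(*t, margin=margin) != "easy"]

print(select_for_training([(1.0, 0.5), (1.0, 1.1), (1.0, 2.0)]))
```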

Online triplet

Generate triplets online. This can be done by selecting the hard positive/negative exemplars from within a mini-batch.
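
One common way to pick hard exemplars inside a mini-batch is "batch-hard" mining: for each anchor, take the farthest sample with the same label and the closest sample with a different label. This is a sketch of that particular strategy, not necessarily the exact selection rule used by FaceNet. It assumes every label in the batch has at least two samples and that the batch contains at least two labels.

```python
import numpy as np

def batch_hard_triplets(embeddings, labels):
    """Return (hardest_positive_index, hardest_negative_index) for every anchor."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances, shape (B, B).
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    same = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    # Hardest positive: farthest sample with the same label (excluding the anchor).
    pos_dist = np.where(same & not_self, dist, -np.inf)
    # Hardest negative: closest sample with a different label.
    neg_dist = np.where(~same, dist, np.inf)
    return pos_dist.argmax(axis=1), neg_dist.argmin(axis=1)

pos_idx, neg_idx = batch_hard_triplets(
    [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 0.0]], [0, 0, 1, 1]
)
print(pos_idx, neg_idx)
```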

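Classification with triplet embeddings

FaceNet is a feature extractor: its output is a vector in Euclidean space, which can then be passed to various machine learning algorithms for classification.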

FaceNet. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.

Previous work: Tuple Based End-to-End Loss

2016 End-to-end text-dependent speaker verification. ICASSP

[Figure: Tuple Based End-to-End Loss]

Pros

Simulates runtime behavior in the loss function
Each (N+1)-tuple contains all utterances involved in a verification decision (unlike triplet)

Cons

Most tuples are easy - training is inefficient

This paper: Generalized End-to-End Loss

2017 Generalized end-to-end loss for speaker verification.

GE2E Loss

[Figure: GE2E loss]

In the figure above, points of the same color belong to the same class, and $c_k$ denotes the centroid of each class.

GE2E loss pushes the embedding towards the centroid of the true speaker, and away from the centroid of the most similar different speaker.

Similarity Matrix

[Figure: similarity matrix]

Construct a similarity matrix for each batch:

Embeddings are L2-normalized:

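$$e_{ji} = \frac{f(x_{ji}; \mathbf{w})}{\lVert f(x_{ji}; \mathbf{w}) \rVert_2}$$

where $x_{ji}$ is the $i$-th utterance of speaker $j$ and $f(\cdot\,; \mathbf{w})$ is the network output.

Centroids:

$$c_k = \frac{1}{M} \sum_{m=1}^{M} e_{km}$$

Similarity:

$$S_{ji,k} = w \cdot \cos(e_{ji}, c_k) + b$$

where $w > 0$ and $b$ are learnable scalars.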
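
The three steps above (L2-normalize, average per speaker, scaled cosine similarity) can be sketched in NumPy as follows. The array layout (N speakers, M utterances, D dimensions) and the function names are assumptions for illustration; the defaults w = 10 and b = -5 follow the initial values suggested in the paper, but in real training they are learned parameters.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def similarity_matrix(raw, w=10.0, b=-5.0):
    """raw: network outputs, shape (N, M, D). Returns S with shape (N, M, N)."""
    e = l2_normalize(raw)                  # e_ji
    c = e.mean(axis=1)                     # c_k, one centroid per speaker, shape (N, D)
    # cos(e_ji, c_k): normalize the centroids so the dot product is a cosine.
    cos = np.einsum("jid,kd->jik", e, l2_normalize(c))
    return w * cos + b                     # S_{ji,k}

# Toy batch: 3 speakers, 4 utterances each, embeddings clustered around 3 axes.
rng = np.random.default_rng(0)
raw = np.eye(3)[:, None, :] + 0.01 * rng.normal(size=(3, 4, 3))
S = similarity_matrix(raw)
print(S.shape)
```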

Softmax vs. Contrast


Each row of $\mathbf{S}$, for a given $e_{ji}$, defines the similarity between $e_{ji}$ and every centroid $c_k$. We want $e_{ji}$ to be close to $c_j$ and far away from $c_k$ for $k \neq j$.

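A comparison of how the Softmax and Contrast losses are computed; the conclusions below are the authors' experimental results.

Softmax: Good for text-independent applications. $L(e_{ji}) = -S_{ji,j} + \log \sum_{k=1}^{N} \exp(S_{ji,k})$
Contrast: Good for keyword-based applications. The contrast loss is defined on positive pairs and most aggressive negative pairs: $L(e_{ji}) = 1 - \sigma(S_{ji,j}) + \max_{1 \le k \le N,\, k \neq j} \sigma(S_{ji,k})$, where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function.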

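An open question: why does Contrast perform worse than Softmax on the text-independent task?

The Contrast approach is in fact similar to Triplet Loss: its loss is the sum of a positive-pair term and the most aggressive (hardest) negative-pair term.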
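
To make the difference between the two variants concrete, here is a minimal NumPy sketch of both losses, summed over all embeddings in a batch. It assumes a similarity tensor S of shape (N, M, N) where S[j, i, k] is the similarity between utterance i of speaker j and the centroid of speaker k; the function names are chosen here for illustration.

```python
import numpy as np

def _own_mask(n_speakers):
    # Shape (N, 1, N); True where the centroid index k equals the speaker index j.
    return np.eye(n_speakers, dtype=bool)[:, None, :]

def ge2e_softmax_loss(S):
    """Sum over j, i of  -S_{ji,j} + log sum_k exp(S_{ji,k})."""
    own = _own_mask(S.shape[0])
    pos = np.where(own, S, 0.0).sum(axis=2)               # S_{ji,j}, shape (N, M)
    m = S.max(axis=2, keepdims=True)                      # for a stable log-sum-exp
    lse = m[..., 0] + np.log(np.exp(S - m).sum(axis=2))
    return float(np.sum(lse - pos))

def ge2e_contrast_loss(S):
    """Sum over j, i of  1 - sigmoid(S_{ji,j}) + max_{k != j} sigmoid(S_{ji,k})."""
    own = _own_mask(S.shape[0])
    sig = 1.0 / (1.0 + np.exp(-S))
    pos = np.where(own, sig, 0.0).sum(axis=2)             # positive pair
    hardest_neg = np.where(own, -np.inf, sig).max(axis=2) # most aggressive negative
    return float(np.sum(1.0 - pos + hardest_neg))

S_flat = np.zeros((2, 3, 2))                                # no separation at all
S_good = np.where(_own_mask(2), 5.0, -5.0) + np.zeros((2, 3, 2))  # well separated
print(ge2e_softmax_loss(S_flat), ge2e_softmax_loss(S_good))
print(ge2e_contrast_loss(S_flat), ge2e_contrast_loss(S_good))
```

Note that the softmax variant uses every centroid in the denominator, while the contrast variant only looks at the single closest wrong centroid.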

Trick about centroid

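For true speaker centroid, we should exclude the embedding itself. That is, when computing $c_j$ for the true speaker, $e_{ji}$ is left out:

$$c_j^{(-i)} = \frac{1}{M-1} \sum_{m=1,\, m \neq i}^{M} e_{jm}$$

This avoids the trivial solution in which all utterances have the same embedding. Combined into a single formula:

$$S_{ji,k} = \begin{cases} w \cdot \cos(e_{ji}, c_j^{(-i)}) + b & \text{if } k = j \\ w \cdot \cos(e_{ji}, c_k) + b & \text{otherwise} \end{cases}$$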
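
A sketch of how the exclusive centroid can be computed without a loop: subtract each embedding from its speaker's sum and divide by M - 1. The function name and array layout are assumptions for illustration, and the sketch requires M >= 2.

```python
import numpy as np

def similarity_matrix_exclusive(e, w=10.0, b=-5.0, eps=1e-12):
    """e: L2-normalized embeddings, shape (N, M, D) with M >= 2. Returns (N, M, N)."""
    n_speakers, n_utts, _ = e.shape

    def unit(x):
        return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

    total = e.sum(axis=1, keepdims=True)          # per-speaker sum, shape (N, 1, D)
    c = total[:, 0, :] / n_utts                   # ordinary centroids c_k
    c_excl = (total - e) / (n_utts - 1)           # c_j^(-i): centroid without e_ji
    cos = np.einsum("jid,kd->jik", e, unit(c))    # cos(e_ji, c_k) for all k
    cos_own = np.sum(e * unit(c_excl), axis=-1)   # cos(e_ji, c_j^(-i)), shape (N, M)
    own = np.eye(n_speakers, dtype=bool)[:, None, :]
    # Use the exclusive centroid only on the k == j entries.
    cos = np.where(own, cos_own[:, :, None], cos)
    return w * cos + b
```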

Efficiency estimate

TE2E vs. GE2E

The main idea: TE2E computes one tuple at a time, whereas a GE2E step is essentially equivalent to a whole batch of tuples computed on the GPU at once, which is more efficient.

For TE2E, assume a training batch has:

N speakers
Each speaker has M utterances
P enrollment utterances per speaker

Number of all possible tuples: $2 \times \max\left(\binom{M}{P},\ (N-1)\binom{M}{P}\right) \geq 2(N-1)$. Theoretically, one GE2E step is equivalent to at least $2(N-1)$ TE2E steps.

TODO: I do not fully understand the calculation here, i.e. why it yields the $2(N-1)$ relationship.
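
One possible reading of this count, following the argument in the GE2E paper: for a given utterance, choosing P enrollment utterances out of M gives C(M, P) positive tuples, and comparing against each of the other N - 1 speakers gives (N - 1) * C(M, P) negative tuples. Positive and negative tuples are balanced one to one, so the total is twice the larger of the two counts; the smallest case, P = M, gives 2(N - 1). The sketch below only checks this arithmetic; the function name and the values N = 64, M = 10 are example choices.

```python
from math import comb

def te2e_tuples_per_utterance(n_speakers, n_utts, n_enroll):
    """Number of TE2E tuples that one GE2E update for a single utterance covers."""
    positive = comb(n_utts, n_enroll)                     # same-speaker tuples
    negative = (n_speakers - 1) * comb(n_utts, n_enroll)  # tuples against other speakers
    # Each positive tuple is balanced with a negative one, hence the factor of 2.
    return 2 * max(positive, negative)

print(te2e_tuples_per_utterance(64, 10, 10))  # lower bound: 2 * (N - 1)
print(te2e_tuples_per_utterance(64, 10, 5))   # far more tuples when P < M
```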

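Comparison with Triplet Loss

The authors' assessment of Triplet Loss:

Pros: Simple, and correctly models the embedding space
Cons: Does NOT simulate runtime behavior.

Here, runtime behavior means the usage flow in a speech/face verification scenario:

Enrollment: enrollment utterances -> utterance embeddings -> average of the embeddings over multiple utterances
Verification: extract an embedding from the utterance recorded at verification time -> compute its similarity (e.g. cosine similarity) to the averaged embedding stored at enrollment -> decide, based on a preset threshold, whether it is the same person.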
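
The enrollment and verification flow above can be sketched as follows. The function names and the threshold of 0.7 are hypothetical; in practice the threshold is tuned on held-out data (for example at the EER operating point).

```python
import numpy as np

def unit(x):
    """Scale a vector to unit L2 norm."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.linalg.norm(x)

def enroll(utterance_embeddings):
    """Average the normalized embeddings of several enrollment utterances."""
    return unit(np.mean([unit(e) for e in utterance_embeddings], axis=0))

def verify(test_embedding, speaker_model, threshold=0.7):
    """Cosine similarity against the enrolled model, then a threshold decision."""
    score = float(np.dot(unit(test_embedding), speaker_model))
    return score >= threshold, score

model = enroll([[1.0, 0.0], [0.9, 0.1]])
print(verify([1.0, 0.05], model))   # similar direction: accepted
print(verify([0.0, 1.0], model))    # nearly orthogonal: rejected
```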


Text-Independent

Text-Independent Speaker Verification

We are interested in identifying the speaker based on arbitrary speech.

Challenge:

Length of utterance can vary
Unlike keyword-based, where we assume fixed-length (0.8s) for keyword segment

Naive solution: Full sequence training?

No batching - very very slow
Dynamic RNN unrolling - even slower...

Solution: Sliding window inference

At inference time, we extract sliding windows, and compute per-window d-vector
For experiments, we use 1.6s window size, with 50% overlap
L2-normalize the per-window d-vectors, then take their average as the embedding of this variable-length utterance.
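
A minimal sketch of the sliding-window procedure above. `embed_fn` is a hypothetical stand-in for the trained network (it maps one window of frames to a d-vector), and the window of 160 frames assumes a 10 ms frame shift so that it corresponds to 1.6 s. The sketch assumes the utterance has at least one full window.

```python
import numpy as np

def sliding_window_dvector(frames, embed_fn, window=160, overlap=0.5):
    """Utterance-level d-vector from overlapping fixed-size windows."""
    step = int(window * (1.0 - overlap))
    starts = range(0, len(frames) - window + 1, step)
    # One d-vector per window.
    d = np.stack([embed_fn(frames[s:s + window]) for s in starts])
    # L2-normalize each window-level d-vector, then average element-wise.
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    return d.mean(axis=0)

# Dummy "network" for illustration: records window lengths, returns a fixed vector.
seen_lengths = []
def fake_embed(segment):
    seen_lengths.append(len(segment))
    return np.array([3.0, 4.0])

utterance = np.ones((400, 40))   # 400 frames of 40-dimensional features
dvec = sliding_window_dvector(utterance, fake_embed)
print(len(seen_lengths), dvec)
```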


Training

Text-independent Training.

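This speeds up training while making full use of the data: because all utterances within a batch have the same length, every utterance takes the same time to compute.

At training time, we need to group utterances by length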
Extract segments by minimal truncation
Form batches of same-length segments
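
One way to read the three steps above: sort utterances by length, cut the sorted list into groups, and truncate every utterance in a group to the shortest length in that group, so that little audio is thrown away. This is a sketch of that interpretation, not the authors' exact batching code; the function name is hypothetical, and truncation here simply keeps the beginning of each utterance.

```python
import numpy as np

def same_length_batches(utterances, batch_size):
    """Group utterances of similar length and truncate each group to a common length."""
    ordered = sorted(utterances, key=len)          # group by length
    batches = []
    for start in range(0, len(ordered), batch_size):
        group = ordered[start:start + batch_size]
        t = len(group[0])                          # shortest utterance in the group
        # Minimal truncation: cut every utterance in the group down to t frames.
        batches.append(np.stack([u[:t] for u in group]))
    return batches

# Four toy utterances with 5, 9, 6 and 10 frames of 2-dimensional features.
utts = [np.zeros((n, 2)) for n in (5, 9, 6, 10)]
for batch in same_length_batches(utts, batch_size=2):
    print(batch.shape)
```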

Experiment

Text-independent experiments.

[Figure: text-independent experiment results]

Text-dependent

TODO

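Training notes

Text-independent

The workflow has two steps: data preprocessing and training.

Audio preprocessing: steps such as converting wav to spectrograms are time-consuming; consider running them on the GPU, e.g. with PyTorch audio (torchaudio).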

Reference

Reference training time:

Dataset:

LibriSpeech: train-other-500 (extract as LibriSpeech/train-other-500). 500 hours of audio.
VoxCeleb1: Dev A - D as well as the metadata file (extract as VoxCeleb1/wav and VoxCeleb1/vox1_meta.csv). About 1,200 speakers.
VoxCeleb2: Dev A - H (extract as VoxCeleb2/dev). About 6,000 speakers.

time: trained 1.56M steps (20 days with a single GPU) with a batch size of 64. GPU: GTX 1080 Ti.

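Training progress:

[Figure: training progress]

Final result: the embeddings are reduced in dimension with UMAP and then plotted.

[Figure: UMAP projection of the embeddings]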