Lantian Li; Tsinghua University
Zhiyuan Tang; Tsinghua University
Ying Shi; Tsinghua University
Dong Wang; Tsinghua University
https://ieeexplore.ieee.org/document/8683245
Gaussian-constrained Training for Speaker Verification
Abstract:
Neural models, in particular the d-vector and x-vector architectures, have produced state-of-the-art performance on many speaker verification tasks.
However, two potential problems of these neural models deserve more investigation.
Firstly, both models suffer from ‘information leak’, which means that some parameters participating in model training will be discarded during inference, i.e., the layers that are used as the classifier.
Secondly, these models do not regulate the distribution of the derived speaker vectors.
This ‘unconstrained distribution’ may degrade the performance of the subsequent scoring component, e.g., PLDA.
This paper proposes a Gaussian-constrained training approach that (1) discards the parametric classifier, and (2) enforces the distribution of the derived speaker vectors to be Gaussian.
Our experiments on the VoxCeleb and SITW databases demonstrated that this new training approach produced more representative and regular speaker embeddings, leading to consistent performance improvement.
SECTION 1. INTRODUCTION
Automatic speaker verification (ASV) is an important biometric authentication technology and has found a broad range of applications.
The current ASV methods can be categorized into two groups: the statistical model approach that has gained the most popularity [1], [2], [3], and the neural model approach that emerged recently but has shown great potential [4], [5], [6].
Perhaps the most famous statistical model is the Gaussian mixture model-universal background model (GMM-UBM) [1].
It factorizes the variance of speech signals by the UBM, and then models individual speakers conditioned on that factorization.
More succinct models design subspace structures to improve the statistical strength, including the joint factor analysis model [2] and the i-vector model [3].
Further improvements were obtained by either discriminative models (e.g., PLDA [7]) or phonetic knowledge transfer (e.g., the DNN-based i-vector model [8], [9]).
The neural model approach has also been studied for many years; however, it was not as popular as the statistical model approach until training large-scale neural models recently became feasible.
The initial success was reported by Ehsan et al. on a text-dependent task [4], where frame-level speaker features were extracted from the last hidden layer of a deep neural network (DNN), and utterance-based speaker vectors (‘d-vectors’) were derived by averaging the frame-level features.
Learning frame-level speaker features offers many advantages and paves the way to a deeper understanding of speech signals.
Researchers followed Ehsan’s work in two directions.
In the first direction, more speech-friendly DNN architectures were designed, with the goal of learning stronger frame-level speaker features while keeping the simple d-vector architecture unchanged [6].
In the second direction, researchers pursued end-to-end solutions that produce utterance-level speaker vectors directly [5], [10], [11], [12].
A representative work in this direction is the x-vector architecture proposed by Snyder et al. [12], which produces the utterance-level speaker vectors (x-vectors) from the first- and second-order statistics of the frame-level features.
For both the d-vector and x-vector architectures, however, there are two potential problems.
Firstly, the DNN models involve a parametric classifier (i.e., the last affine layer) during model training.
This means that part of the knowledge involved in the training data is used to learn a classifier that will be ultimately thrown away during inference, leading to potential ‘information leak’.
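The ‘information leak’ described above can be made concrete with a toy network. The following is a minimal sketch under stated assumptions, not the authors' implementation: the layer sizes, the random weights, and the function names `speaker_feature` and `speaker_posteriors` are all hypothetical. It only shows that the last affine layer holds trainable parameters that the inference path never touches.

```python
import math
import random

random.seed(0)

def affine(weights, bias, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

# Toy sizes, purely illustrative (assumptions, not values from the paper).
FEAT_DIM, EMB_DIM, NUM_SPEAKERS = 4, 3, 5

# Feature extractor: kept at inference time.
W_feat, b_feat = random_matrix(EMB_DIM, FEAT_DIM), [0.0] * EMB_DIM
# Parametric classifier (the last affine layer): used only during training.
W_cls, b_cls = random_matrix(NUM_SPEAKERS, EMB_DIM), [0.0] * NUM_SPEAKERS

def speaker_feature(frame):
    """Inference path: stops at the last hidden layer."""
    return relu(affine(W_feat, b_feat, frame))

def speaker_posteriors(frame):
    """Training path: the hidden feature is passed through the classifier."""
    return softmax(affine(W_cls, b_cls, speaker_feature(frame)))

frame = [0.5, -1.0, 0.25, 2.0]
embedding = speaker_feature(frame)      # what the scoring back-end receives
posteriors = speaker_posteriors(frame)  # what the training loss sees

# Parameters learned from the training data but discarded at inference.
classifier_params = NUM_SPEAKERS * EMB_DIM + NUM_SPEAKERS
print(len(embedding), len(posteriors), classifier_params)  # prints: 3 5 20
```

In this toy setting the classifier accounts for 20 parameters that are fitted during training and then dropped; with thousands of training speakers the discarded layer is correspondingly larger.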
Secondly, these models do not regulate the distribution of the derived speaker vectors, either at the frame level or at the utterance level.
The uncontrolled distribution will degrade the subsequent scoring component, especially the PLDA model that assumes the speaker vectors are Gaussian [7].
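One simple way to see what a non-Gaussian distribution looks like numerically is to compute the skewness and excess kurtosis of a single vector dimension; both are zero for an ideal Gaussian. The sketch below is an illustrative assumption rather than a procedure taken from this section: the helper name `skewness_and_excess_kurtosis` is hypothetical, and the data are synthetic samples instead of real speaker vectors.

```python
import math
import random

def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) using population moments."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    skew = sum(((v - mean) / std) ** 3 for v in values) / n
    kurt = sum(((v - mean) / std) ** 4 for v in values) / n - 3.0
    return skew, kurt

random.seed(0)
# Synthetic stand-ins for one dimension of a set of speaker vectors.
gaussian_dim = [random.gauss(0.0, 1.0) for _ in range(20000)]
skewed_dim = [random.expovariate(1.0) for _ in range(20000)]

g_skew, g_kurt = skewness_and_excess_kurtosis(gaussian_dim)
e_skew, e_kurt = skewness_and_excess_kurtosis(skewed_dim)

print("gaussian-like dimension:", round(g_skew, 2), round(g_kurt, 2))
print("skewed dimension:       ", round(e_skew, 2), round(e_kurt, 2))
```

Values far from zero on many dimensions would suggest that the Gaussian assumption of a back-end such as PLDA is being violated.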
To deal with these two problems, we propose a Gaussian-constrained training approach in this paper.
This new training approach will (1) discard the parametric classifier to mitigate information leak, and (2) enforce the distribution of the derived speaker vectors to be Gaussian to meet the requirement of the scoring component.
Our experiments on two databases demonstrated that the approach can produce more representative and regular speaker vectors than both the d-vector and x-vector models, which in turn leads to consistent performance improvement.
The organization of this paper is as follows.
Section 2 gives a brief overview of the x-vector and d-vector models, and Section 3 presents the proposed Gaussian-constrained training.
Experiments are reported in Section 4, and the paper is concluded in Section 5.
SECTION 2. OVERVIEW OF X-VECTOR AND D-VECTOR MODELS
The x-vector model and the d-vector model are two typical neural models adopted by ASV researchers.
The architectures of these models are shown in Figure 1, in a comparative way.
The x-vector model consists of three components.
The first component is used for frame-level feature learning.
The input of this component is a sequence of acoustic features, e.g., filterbank coefficients.
After several feed-forward neural networks, it outputs a sequence of frame-level speaker features.
The second component is a statistic pooling layer, in which statistics of the frame-level features, e.g., mean and standard deviation, are computed.
This statistic pooling maps a variable-length input to a fixed-dimensional vector.
The third component is used to produce utterance-level speaker vectors.
The output layer of this component is a softmax, in which each output node corresponds to a particular speaker in the training data.
The network is trained to discriminate all the speakers in the training data, conditioned on the input utterance.
Once the model is well trained, utterance-level speaker vectors, i.e., x-vectors, are read from a layer in the last component, and a scoring model, such as PLDA, will be used to score the trials.
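The statistic pooling step in the x-vector model can be sketched as follows. This is a minimal illustration, not the reference implementation: the function name `statistics_pooling` and the toy inputs are hypothetical, and the use of the population standard deviation (dividing by the number of frames) is an assumption. The point is that utterances of different lengths map to a vector of the same dimension.

```python
import math

def statistics_pooling(frame_features):
    """Concatenate the per-dimension mean and standard deviation over frames.

    frame_features: list of frames, each a list of floats of equal length.
    Returns a list of length 2 * dim, regardless of the number of frames.
    """
    num_frames = len(frame_features)
    dim = len(frame_features[0])
    means = [sum(f[d] for f in frame_features) / num_frames for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frame_features) / num_frames)
        for d in range(dim)
    ]
    return means + stds

# Two toy "utterances" with different numbers of frames (2 and 5).
short_utt = [[1.0, 2.0], [3.0, 6.0]]
long_utt = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0], [5.0, 0.0]]

pooled_short = statistics_pooling(short_utt)
pooled_long = statistics_pooling(long_utt)

print(pooled_short)                         # [2.0, 4.0, 1.0, 2.0]
print(len(pooled_short), len(pooled_long))  # 4 4
```

In the full model, the pooled vector would then be passed through the utterance-level layers, and the x-vector would be read from one of them.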
The d-vector model has a similar but simpler architecture.
It consists of two components: one for frame-level feature learning, the same as the first component of the x-vector model, and the other for frame-level speaker embedding, the same as the third component of the x-vector model.
Since the entire architecture is frame-based, a pooling layer is not required.
The network is trained to discriminate speakers in the training data, but conditioned on each frame.
Once the training is done, utterance-level speaker vectors, i.e., d-vectors, are derived by averaging the frame-level features.
Finally, a scoring model such as PLDA will be used to score the trials.
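The d-vector derivation described above amounts to averaging frame-level features. The sketch below illustrates this on toy data; the function names and inputs are hypothetical. Note that the scoring step uses cosine similarity purely as a simple stand-in: the text describes PLDA as the scoring model, which is not implemented here.

```python
import math

def d_vector(frame_features):
    """Average the frame-level speaker features of one utterance."""
    num_frames = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / num_frames for d in range(dim)]

def cosine_score(a, b):
    """Cosine similarity between two vectors (stand-in for PLDA scoring)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy frame-level features for an enrollment and a test utterance.
enroll_frames = [[1.0, 0.0, 1.0], [3.0, 0.0, 1.0]]
test_frames = [[2.0, 0.0, 1.0], [2.0, 0.0, 1.0], [2.0, 0.0, 1.0]]

enroll_vec = d_vector(enroll_frames)  # [2.0, 0.0, 1.0]
test_vec = d_vector(test_frames)      # [2.0, 0.0, 1.0]
score = cosine_score(enroll_vec, test_vec)

print(enroll_vec, test_vec)
```

Because averaging discards the number of frames, the two utterances yield vectors of the same dimension and can be compared directly by the back-end.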