TEXT-INDEPENDENT SPEAKER VERIFICATION WITH ADVERSARIAL LEARNING ON SHORT UTTERANCES
Kai Liu, Huan Zhou
Artificial Intelligence Application Research Center, Huawei Technologies, Shenzhen, PRC
ABSTRACT
A text-independent speaker verification system suffers severe performance degradation under short-utterance conditions. To address the problem, in this paper we propose an adversarially learned embedding mapping model that directly maps a short embedding to an enhanced embedding with increased discriminability. In particular, a Wasserstein GAN with a set of loss criteria is investigated. These loss functions have distinct optimization objectives, and some of them are less favoured in the speaker verification research area. Different from most prior studies, our main objective in this study is to investigate the effectiveness of those loss criteria by conducting numerous ablation studies. Experiments on the Voxceleb dataset showed that some criteria are beneficial to the verification performance while some have trivial effects. Lastly, a Wasserstein GAN with the chosen loss criteria, without fine-tuning, achieves meaningful advancements over the baseline, with a 4% relative improvement in EER and 7% in minDCF in the challenging scenario of short 2-second utterances.
1. INTRODUCTION

However, the performance of an SV system usually degrades in real scenarios, due to prevalent mismatches between development and test conditions, such as channel, domain or duration mismatch [11, 5, 18]. For instance, it has been observed [5] that on the NIST-SRE 2010 test set (female part), the EER of an i-vector/PLDA system degrades from 2.48% to 24.78% when the verification trial is shortened from full duration to 5 seconds.
Numerous research studies have been proposed to mitigate the short-duration effect. An early family of research aimed to modify different aspects of the i-vector based SV system, e.g., feature extraction techniques, intermediate parameter estimation, speaker model generation, and score normalization techniques, as summarized in [11]. Recently, more novel deep learning technologies have been explored. For instance, insufficient phonetic information is compensated by a teacher-student learning framework [17], and the scoring scheme is calibrated by transfer learning [13]. Another research strategy is to design duration-robust speaker embeddings to deal with utterances of arbitrary duration. By applying different neural network architectures and alternative loss functions, the discriminability of embeddings is further enhanced. For example, an Inception Net with triplet loss is deployed in [20], an Inception-ResNet with joint softmax and center loss in [8], and a ResCNN with a novel speaker identity subspace loss in [14].
Generative Adversarial Networks (GANs) [6] are among the most popular deep learning algorithms developed recently. GANs have the potential to generate realistic instances and provide a solution to problems that require a generative solution, most notably in various image-to-image translation tasks.
In this study, we aim to investigate the short-duration issue present in a practical SV system. Contrary to most of the techniques mentioned above, our proposed approach works directly on the speaker embeddings. In particular, given short and long embedding pairs extracted from the same speaker and session, we propose to use adversarial learning of a Wasserstein GAN to learn a new embedding with enhanced discriminability. To test our approach, the G-vector is chosen as the embedding benchmark in our experiments due to its promising performance on short speech. This poses a greater challenge to our study than to those prior studies which benchmarked against i-vectors.
The remainder of this paper is organized as follows: Section 2 briefly introduces the related works of our methods. Section 3 details our proposed Wasserstein GAN based approach. Section 4 presents experimental results and discussions. Finally, our conclusions are given in Section 5.
2. RELATED WORKS

2.1. Wasserstein GAN
GANs [6] are deep generative models comprised of two networks, a generator and a discriminator. The discriminator D tries to learn the difference between a real sample y and a fake sample g generated from noise η, while the generator G tries to fool the discriminator. That is, the following minimax loss function is optimized through alternating optimization, until equilibrium is reached:

$$\min_{G}\max_{D}\; \mathbb{E}_{y\sim P_y}\big[\log D(y)\big] + \mathbb{E}_{\eta\sim P_\eta}\big[\log\big(1 - D(G(\eta))\big)\big] \tag{1}$$
However, training a GAN model is difficult due to the well-known diminishing or exploding gradients issue. These issues have been addressed by the Wasserstein GAN (WGAN) [1], where the discriminator is designed to find a good $f_w$ and a new loss function is configured to measure the Wasserstein distance:

$$W(P_y, P_g) = \max_{w\in\mathcal{W}}\; \mathbb{E}_{y\sim P_y}\big[f_w(y)\big] - \mathbb{E}_{g\sim P_g}\big[f_w(g)\big] \tag{2}$$

where $\{f_w\}_{w\in\mathcal{W}}$ is a family of K-Lipschitz functions.
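To make the objective of Eq. (2) concrete, below is a minimal PyTorch-style sketch of a WGAN critic update; the MLP architecture and the 512-dimensional input are illustrative assumptions, not details from [1].

```python
import torch
import torch.nn as nn

# Illustrative critic f_w: a small MLP that scores D-dimensional embeddings.
critic = nn.Sequential(nn.Linear(512, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

def critic_loss(real, fake):
    """Negated Eq. (2): minimizing this maximizes E[f_w(y)] - E[f_w(g)]."""
    return critic(fake).mean() - critic(real).mean()

def clip_weights(model, c=0.01):
    """WGAN's Lipschitz enforcement: clip every weight to [-c, c]."""
    with torch.no_grad():
        for p in model.parameters():
            p.clamp_(-c, c)
```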
2.2. Deployments of GANs in SV
Motivated by the remarkable success in image-to-image translation, GANs have been actively deployed in the SV research community, mainly to handle the domain-mismatch issue, e.g., transforming i-vectors [15] and x-vectors [19]. In contrast, there are few works using GANs to handle the short-duration issue. To the authors' best knowledge, the only published work proposes compensating i-vectors via a conditional GAN [7]. However, limited performance improvements were observed: the proposed system alone failed to outperform the baseline system, and only the score-wise fusion based system showed better performance than the baseline.
In the authors' opinion, training a GAN is non-trivial, and the reason behind such results might be an oversight of the effects of the loss functions of the conditional GAN. As such, in this study we investigate the problem and seek to reveal some guidelines on choosing beneficial loss functions to make the model perform better.
3. PROPOSED APPROACH

The architecture of our proposed approach is illustrated in Fig. 1. Here x and y are D-dimensional G-vectors corresponding to short and long utterance embeddings from the same speaker session, and z is the speaker identity label. Given x, y and z, the proposed system is trained to learn a D-dimensional embedding g, with the expectation that the g-based SV system can outperform the one based on x.
Overall, the proposed architecture can be decomposed into four core components: an embedding generator $G_f$, a speaker label predictor $G_c$, a distance calculator $G_d$ and a Wasserstein discriminator $D_w$. All components are jointly trained in order to generate enhanced embeddings with carefully handcrafted optimization objectives, as described below.
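As a rough sketch of how the four components could be wired up (layer sizes and depths are our assumptions; the paper does not specify them), assuming D = 512 and the 1,057 training speakers from Section 4.1:

```python
import torch.nn as nn

D = 512          # assumed G-vector dimension
NUM_SPK = 1057   # number of training speakers (Section 4.1)

# Embedding generator G_f: maps a short embedding x to an enhanced embedding g.
G_f = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))

# Speaker label predictor G_c: projection layer feeding the cross-entropy loss.
G_c = nn.Linear(D, NUM_SPK)

# Wasserstein discriminator D_w: scores concatenated pairs [x; y] vs. [x; g].
D_w = nn.Sequential(nn.Linear(2 * D, D), nn.LeakyReLU(0.2), nn.Linear(D, 1))

# The distance calculator G_d needs no parameters here: it is the cosine
# distance between g and its target y used by the generator losses.
```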
3.1. Proposed Discriminator-Related Loss Functions
As aforementioned, the primary task of the proposed approach is to learn embeddings with enhanced discriminability. Letting P denote a data distribution, we propose to achieve this by mapping $P_g$ from the initial $P_x$ to the target $P_y$ via adversarial learning of the WGAN. To this end, in the discriminative model, several loss criteria with different optimization objectives are investigated.
Following the conventional definition of the min-max function, the loss function of the WGAN is:

$$\min_{G_f}\max_{D_w}\; \mathbb{E}_{y\sim P_y}\big[D_w(y)\big] - \mathbb{E}_{x\sim P_x}\big[D_w(G_f(x))\big] \tag{3}$$
Inspired by the idea of the conditional GAN [10], in this study we investigate a novel loss function that optimizes the Wasserstein distance between joint data distributions. That is, the data to be discriminated is controlled by concatenating the short embedding x with the conventional discriminator input. The corresponding min-max function is updated as:

$$\min_{G_f}\max_{D_w}\; \mathbb{E}_{(x,y)\sim P_{xy}}\big[D_w(x \oplus y)\big] - \mathbb{E}_{x\sim P_x}\big[D_w(x \oplus G_f(x))\big] \tag{4}$$

where $\oplus$ denotes vector concatenation.
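A sketch of the conditioning in Eq. (4): the short embedding x is concatenated with the real target y or the generated g before scoring (D_w as in the earlier component sketch):

```python
import torch

def conditional_scores(D_w, x, y, g):
    """Eq. (4): the critic sees [x; y] as a real pair and [x; g] as a fake
    pair, so it must judge realism *given* the short embedding x."""
    real_score = D_w(torch.cat([x, y], dim=1)).mean()
    fake_score = D_w(torch.cat([x, g], dim=1)).mean()
    return real_score, fake_score
```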
In addition, to seek more discriminability, the Fréchet Inception Distance (FID) [9], a popular metric for calculating the distance between feature vectors of real and generated images, is also explored herein. Assuming $P_y$ and $P_g$ are normal distributions with means $\mu_y$, $\mu_g$ and covariance matrices $C_y$, $C_g$, the FID loss can be calculated by:

$$L_{FID} = \lVert \mu_y - \mu_g \rVert_2^2 + \mathrm{Tr}\big(C_y + C_g - 2\,(C_y C_g)^{1/2}\big) \tag{5}$$
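Eq. (5) can be computed directly on batches of embeddings; a minimal NumPy/SciPy sketch, using the common scipy.linalg.sqrtm route for the matrix square root:

```python
import numpy as np
from scipy import linalg

def fid_loss(y_emb: np.ndarray, g_emb: np.ndarray) -> float:
    """Eq. (5) between target embeddings y_emb and generated g_emb, shape (N, D)."""
    mu_y, mu_g = y_emb.mean(axis=0), g_emb.mean(axis=0)
    C_y = np.cov(y_emb, rowvar=False)
    C_g = np.cov(g_emb, rowvar=False)
    covmean = linalg.sqrtm(C_y @ C_g)
    if np.iscomplexobj(covmean):      # drop tiny imaginary numerical residue
        covmean = covmean.real
    return float(((mu_y - mu_g) ** 2).sum()
                 + np.trace(C_y + C_g - 2.0 * covmean))
```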
3.2. Proposed Generator-Related Loss Functions
In order to guide the GAN training with the objective of feature discriminability, four loss criteria are investigated herein as extra training guides.
To verify the speaker label, the widely adopted multiclass cross-entropy (CE) loss is investigated, with the formulation:

$$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{z_i}^{T} g_i + b_{z_i}}}{\sum_{j=1}^{c} e^{W_j^{T} g_i + b_j}} \tag{6}$$
where N is the batch size and c is the number of classes, $g_i$ denotes the i-th generated embedding sample and $z_i$ is the corresponding label index, and $W \in \mathbb{R}^{D\times c}$ and $b \in \mathbb{R}^{c}$ denote the weight matrix and bias of the projection layer.
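In PyTorch terms, Eq. (6) is the standard cross-entropy over the projected embeddings; F.cross_entropy folds the softmax and log together (G_c plays the role of W, b):

```python
import torch.nn.functional as F

def ce_loss(G_c, g, z):
    """Eq. (6): project embeddings g with the layer G_c (i.e. W, b), then
    apply multiclass cross-entropy against the speaker labels z."""
    return F.cross_entropy(G_c(g), z)
```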
To explicitly penalize class-related classification errors, triplet loss is deployed as well, where a baseline (anchor) input is compared to a positive input and a negative input. Let Γ be the set of all possible embedding triplets γ = (g_a, g_p, g_n) in the training set; the loss is defined as:

$$L_{triplet} = \sum_{\gamma\in\Gamma}\max\big(0,\; \lVert g_a - g_p\rVert_2^2 - \lVert g_a - g_n\rVert_2^2 + \Psi\big) \tag{7}$$
where $g_a$ is an anchor input, $g_p$ is a positive input from the same class, $g_n$ is a negative input from a different class, and $\Psi \in \mathbb{R}^{+}$ is the safety margin between positive and negative pairs.
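A direct sketch of Eq. (7) with squared Euclidean distances; the margin value 0.3 is an arbitrary placeholder, since the paper leaves Ψ untuned:

```python
import torch

def triplet_loss(g_a, g_p, g_n, margin=0.3):
    """Eq. (7): hinge on squared distances with safety margin Psi.
    g_a, g_p, g_n: (N, D) anchors, positives (same speaker), negatives."""
    d_pos = ((g_a - g_p) ** 2).sum(dim=1)
    d_neg = ((g_a - g_n) ** 2).sum(dim=1)
    return torch.clamp(d_pos - d_neg + margin, min=0).sum()
```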
Apart from the above, to minimize intra-class variation, the center loss [16] is also adopted. It can be formulated as:

$$L_{center} = \frac{1}{2}\sum_{i=1}^{m}\lVert g_i - c_{z_i}\rVert_2^2 \tag{8}$$

where $c_{z_i}$ denotes the class center of the $z_i$-th class, $g_i$ denotes the i-th deep feature belonging to the $z_i$-th class, and m is the size of the mini-batch.
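A compact PyTorch sketch of Eq. (8); following [16], the class centers are kept as learnable parameters updated alongside the network:

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Eq. (8): half the summed squared distance between each embedding g_i
    and the (learnable) center c_{z_i} of its speaker class."""
    def __init__(self, num_classes: int, dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, dim))

    def forward(self, g, z):
        return 0.5 * ((g - self.centers[z]) ** 2).sum()
```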
To better guide the training process, the similarity between the enhanced embedding and its target is explicitly considered. It is measured by the cosine distance and evaluated as a dot product as follows:

$$L_{cos} = 1 - \bar{g}^{T}\,\bar{y} \tag{9}$$
where $\bar{g}$ and $\bar{y}$ are the normalized versions of the embeddings g and y, respectively.
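Eq. (9) in code: L2-normalize both embeddings, take the dot product, and average over the batch:

```python
import torch.nn.functional as F

def cosine_loss(g, y):
    """Eq. (9): cosine distance between enhanced and target embeddings."""
    g_bar = F.normalize(g, dim=1)   # \bar{g}
    y_bar = F.normalize(y, dim=1)   # \bar{y}
    return (1.0 - (g_bar * y_bar).sum(dim=1)).mean()
```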
In all, we propose to train the generator $G_f$ with the total loss defined as:

$$L_{G_f} = \alpha L_{adv} + \beta L_{CE} + \gamma L_{triplet} + \lambda L_{center} + \varepsilon L_{cos} \tag{10}$$

where $L_{adv} = -\mathbb{E}_{x\sim P_x}\big[D_w(x \oplus G_f(x))\big]$ is the generator's adversarial term from Eq. (4), and to train the discriminator $D_w$ with:

$$L_{D_w} = \mathbb{E}_{x\sim P_x}\big[D_w(x \oplus G_f(x))\big] - \mathbb{E}_{(x,y)\sim P_{xy}}\big[D_w(x \oplus y)\big] \tag{11}$$
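Putting the pieces together, a sketch of one alternating update composing the component and loss sketches above. The unit loss weights and the make_triplets mining helper are hypothetical stand-ins (the paper reports no tuned values), and center_loss is assumed to be an instance of the CenterLoss module shown earlier:

```python
import torch

def train_step(x, y, z, opt_g, opt_d,
               alpha=1.0, beta=1.0, gamma=1.0, lam=1.0, eps=1.0):
    # Critic step: maximize E[D_w([x;y])] - E[D_w([x;g])], i.e. Eq. (11).
    g = G_f(x).detach()
    d_loss = (D_w(torch.cat([x, g], dim=1)).mean()
              - D_w(torch.cat([x, y], dim=1)).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    for p in D_w.parameters():            # Lipschitz constraint via clipping
        p.data.clamp_(-0.01, 0.01)

    # Generator step: adversarial term plus the four auxiliary guides, Eq. (10).
    g = G_f(x)
    adv = -D_w(torch.cat([x, g], dim=1)).mean()
    g_a, g_p, g_n = make_triplets(g, z)   # hypothetical triplet-mining helper
    g_loss = (alpha * adv + beta * ce_loss(G_c, g, z)
              + gamma * triplet_loss(g_a, g_p, g_n)
              + lam * center_loss(g, z) + eps * cosine_loss(g, y))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```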
After the training of the WGAN, the generative model $G_f$ is retained. At the SV test stage, the short embedding x of any given short test utterance can be easily mapped to its enhanced version g by directly applying the feed-forward model $G_f$ to x.
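At test time the mapping is a single forward pass; a minimal sketch:

```python
import torch

@torch.no_grad()
def enhance(G_f, x_short):
    """Map a short-utterance embedding x to its enhanced version g with one
    feed-forward pass through the retained generator."""
    G_f.eval()
    return G_f(x_short)
```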
4. EXPERIMENTS

4.1. Experimental Setup
We use a subset of Voxceleb2 to train our proposed system, where 1,057 speakers are chosen with 164,716 utterances in total. Those utterances are randomly cut to 2 seconds to serve as short utterances. Similarly, a subset of Voxceleb1 with 40 speakers is sampled, and 13,265 utterance pairs in total are used for testing.
The VGG-ResNet34s network is used to extract G-vectors for our baseline system. Regarding the GAN training, the learning rates for both $G_f$ and $D_w$ are 0.0001; Adam optimization is adopted; weight clipping is employed for $D_w$ with thresholds from −0.01 to 0.01; and the batch size is set to 128.
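The stated setup translates directly into optimizer configuration; a sketch, reusing G_f, G_c and D_w from the Section 3 sketches (including the classifier head in opt_g is our assumption):

```python
import torch

# Section 4.1 settings: Adam, lr = 1e-4 for both networks, batch size 128,
# and critic weights clipped to [-0.01, 0.01] after each critic update.
opt_g = torch.optim.Adam(list(G_f.parameters()) + list(G_c.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(D_w.parameters(), lr=1e-4)
BATCH_SIZE = 128
CLIP_THRESHOLD = 0.01
```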
4.2. Ablation Studies on Various Loss Functions
To verify the importance of the proposed loss criteria, a series of ablation studies is conducted with different combinations of them. The overall results are illustrated in Tab. 1, where $L_c$ and $L_t$ denote $L_{center}$ and $L_{triplet}$, respectively. Triplet-a means that triplet inputs are sampled from both y and g, while triplet-b means they are sampled from g only.
In our study, a total of 8 systems (v1–v8), combining different loss criteria with the Wasserstein GAN, are evaluated. Their corresponding detection error trade-off (DET) curves are plotted in Fig. 2.
From the above experimental results, the following conclusions could be drawn:
• FID loss has a positive effect (v1 vs. v2);
• Conditional WGAN outperforms WGAN (v3 vs. v4);
• Triplet loss is preferred (v7 vs. v2);
• Triplet-a greatly outperforms triplet-b (v3 vs. v8);
• Softmax has a positive effect (v3 vs. v5);
• Center loss has a negative effect (v6 vs. v7);
• Cosine loss has a significant positive effect (v6 vs. v8).
The above findings are very interesting, with a twofold outcome. Firstly, they demonstrate that the additional training functions (e.g., traditional softmax, cosine loss and triplet loss) all contribute positively to performance, which verifies our earlier statement that extra training guides might be helpful for feature discriminability. Secondly, loss criteria less favoured in typical SV systems (e.g., FID loss and conditional WGAN loss) are surprisingly helpful, which is an unusual finding and might be worthy of further investigation.
4.3. Comparison with the Baseline System
In the end, we make a performance comparison between our best system (v3) and the G-vector baseline system. Herein the comparison is measured in terms of equal error rate (EER) and minDCF. The results are reported in Tab. 2.
From the table, we can see that our proposed system also generalizes well and behaves consistently over the baseline system across different short durations. In detail, for verification with 2-second enroll-test utterances, our proposed system shows a 4.2% relative EER improvement and a 7.2% relative minDCF improvement. For shorter utterances with a duration of 1 second, it shows a comparable relative EER improvement (3.8%).
It is worth noting that, due to time constraints, the FID loss function has not been added to our final system; besides, there is no fine-tuning of the hyper-parameters, i.e., the loss weights α, β, γ, λ, ε and the triplet margin Ψ. This means there is still a lot of room for improvement in our system.
5. CONCLUSIONS
In this paper, we have successfully applied the WGAN to learn enhanced embeddings for speaker verification on short utterances. Our main contributions are twofold: we proposed a WGAN-based kernel system and, on top of it, validated the effectiveness of a set of loss criteria for GAN training. Our final proposed system outperforms the baseline system in challenging short-utterance speaker verification scenarios. In all, our experiments show both a decent advancement and a potential direction for our further research.
REFERENCES
[1] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. ArXiv, 2017. URL: https://arxiv.org/pdf/1701.07875.pdf.
[2] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur. Deep neural network embeddings for text-independent speaker verification. In Interspeech, page 999, 2017.
[3] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):788–798, May 2011.
[4] Daniel Garcia-Romero and Carol Espy-Wilson. Analysis of i-vector length normalization in speaker recognition systems. In Interspeech, pages 249–252, 2011.
[5] Gautam Bhattacharya, Jahangir Alam, and Patrick Kenny. Deep speaker embeddings for short duration speaker verification. In Interspeech, Stockholm, pages 1517–1521, 2017.
[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680, 2014.
[7] Jiacen Zhang, Nakamasa Inoue, and Koichi Shinoda. I-vector transformation using conditional generative adversarial networks for short utterance speaker verification. In Interspeech, Hyderabad, pages 3613–3617, 2018.
[8] Na Li, Deyi Tuo, Dan Su, Zhifeng Li, and Dong Yu. Deep discriminative embeddings for duration robust speaker verification. In Interspeech, pages 2262–2266, 2018.
[9] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, pages 6626–6637, 2017.
[10] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. ArXiv, 2014. URL: https://arxiv.org/pdf/1411.1784.pdf.
[11] A. Poddar, M. Sahidullah, and G. Saha. Performance comparison of speaker recognition systems in presence of duration variability. In 2015 Annual IEEE India Conference (INDICON), pages 1–6, Dec 2015.
[12] S. J. D. Prince and J. H. Elder. Probabilistic linear discriminant analysis for inferences about identity. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8, Oct 2007.
[13] Qingyang Hong, Lin Li, Lihong Wan, Jun Zhang, and Feng Tong. Transfer learning for speaker verification on short utterances. In Interspeech, pages 1848–1852, 2016.
[14] Ruifang Ji, Xinyuan Cai, and Bo Xu. An end-to-end text-independent speaker identification system on short utterances. In Interspeech, pages 3628–3632, 2018.
[15] Qing Wang, Wei Rao, Sining Sun, Lei Xie, Eng Siong Chng, and Haizhou Li. Unsupervised domain adaptation via domain adversarial training for speaker recognition. In ICASSP, pages 4889–4893, 2018.
[16] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, pages 499–515, 2016.
[17] Jee-weon Jung, Hee-Soo Heo, Hye-jin Shim, and Ha-Jin Yu. Short utterance compensation in speaker verification via cosine-based teacher-student learning of speaker embeddings. ArXiv, 2018. URL: https://arxiv.org/pdf/1810.10884.pdf.
[18] Weidi Xie, Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. Utterance-level aggregation for speaker recognition in the wild. In ICASSP, pages 5791–5795, 2019.
[19] Youzhi Tu, Man-Wai Mak, and Jen-Tzung Chien. Variational domain adversarial learning for speaker verification. In Interspeech, pages 4315–4319, 2019.
[20] Chunlei Zhang and Kazuhito Koishida. End-to-end text-independent speaker verification with triplet loss on short utterances. In Interspeech, pages 1487–1491, 2017.