ICASSP 2019: A Robust Text-independent Speaker Verification Method Based on Speech Separation and Deep Speaker

https://ieeexplore.ieee.org/document/8683762

Fei Zhao; Inner Mongolia University
Hao Li; Inner Mongolia University
Xueliang Zhang; Inner Mongolia University

Abstract:
Recently, deep neural networks (DNNs) have achieved impressive performance in speaker verification.
However, most of these systems remain sensitive to environmental noise.
In this paper, we propose an end-to-end speaker verification framework that enhances robustness against background noise.
The proposed framework first utilizes a convolutional recurrent network (CRN) to perform speech separation.
The output of the middle layer of the CRN is then used as an auxiliary feature and, together with the robust filter-bank (Fbank) features of the noisy speech, is fed to the speaker verification system.
The speech separation and speaker verification modules are jointly optimized.
Systematic evaluation indicates that, compared with deep speaker and DNN/i-vector, the proposed algorithm obtains better performance in noisy conditions.

SECTION 1. INTRODUCTION
Speaker verification (SV) is the task of judging whether an utterance matches a claimed speaker identity based on the information in the speech.
Depending on whether the speech content of enrollment and testing is the same, it can be divided into text-dependent SV [1] and text-independent SV [2], [3].
Since it places no limitation on content during testing, text-independent SV is more user-friendly.
At the same time, it is also more difficult than text-dependent SV.
This work focuses on text-independent SV.
I-vector [4] is a well-known method that greatly improved the performance of SV.
The method consists of several steps:
Firstly, a universal background model (UBM) [5] is trained on a large amount of speech to collect sufficient statistics for extracting i-vectors.
Secondly, the speaker i-vector is extracted, so that the high-dimensional statistics are converted into a single low-dimensional i-vector representing the identity of the speaker.
Finally, a probabilistic linear discriminant analysis (PLDA) [6] model is trained, and verification scores are produced by calculating the distance between i-vectors from different utterances.
Influenced by the powerful modeling ability of DNNs and their successful application in automatic speech recognition (ASR) [7], Lei et al. [8] used a DNN to replace the Gaussian mixture model (GMM) for acoustic modeling when extracting i-vectors.
A DNN can directly model the phoneme state space instead of the complex acoustic space, yielding significant improvements over the traditional GMM.
Another effective technique is to use DNNs to extract deep bottleneck features [9], [10] or to obtain speaker representations directly [11], [12].
Driven by big data and increased computing power, end-to-end SV [12], [13], [14] can achieve better performance than the classic i-vector approach.
The output of the neural network is a low-dimensional vector called an embedding (also known as a d-vector), which is adopted to represent the speaker identity.
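A minimal sketch of how such an embedding is used at verification time: each utterance is mapped to a fixed-dimensional vector, and the decision reduces to thresholding the cosine similarity between the enrollment and test embeddings. A random projection stands in for the trained network here, and the dimensions and threshold are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

EMB_DIM = 64

def embed(frames, W):
    """Stand-in for the DNN: average frames, project, L2-normalize."""
    v = frames.mean(axis=0) @ W
    return v / np.linalg.norm(v)

W = rng.normal(size=(40, EMB_DIM))   # placeholder "network" weights

enroll = rng.normal(size=(200, 40))  # enrollment utterance features
test = rng.normal(size=(200, 40))    # test utterance features

e, t = embed(enroll, W), embed(test, W)
similarity = float(e @ t)            # cosine similarity of unit vectors

THRESHOLD = 0.5                      # tuned on a development set in practice
accept = similarity >= THRESHOLD
print(f"similarity={similarity:.3f}, accept={accept}")
```

The same scoring applies whether the embedding comes from an i-vector extractor or an end-to-end network; only the mapping from frames to vector changes.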
Although research on SV has made great progress, noise is still an unavoidable factor in real environments that impairs the performance of SV systems.
A common strategy is to first apply a frontend processing method to enhance both the training and test sets, and then build the SV system on the enhanced training set. This may improve SV performance, since the features may become cleaner after enhancement.
However, the performance of this approach is highly dependent on the performance of the separation frontend [15].
More recently, Tan and Wang [16] incorporated a convolutional encoder-decoder (CED) and long short-term memory (LSTM) into a CRN architecture for speech separation.
We speculate that the low-dimensional output of the LSTM in the CRN can be used as robust bottleneck features in an SV system.
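The feature fusion implied above can be sketched as a frame-level concatenation: the CRN's LSTM bottleneck output is stacked with the noisy-speech Fbank features before entering the SV network. The CRN output is mocked by a random array, and all dimensions are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(2)

T = 150              # number of frames in an utterance
FBANK_DIM = 40       # Fbank features of the noisy speech
BOTTLENECK_DIM = 64  # LSTM-layer output inside the CRN (assumed size)

fbanks = rng.normal(size=(T, FBANK_DIM))               # noisy-speech Fbanks
crn_bottleneck = rng.normal(size=(T, BOTTLENECK_DIM))  # auxiliary feature

# Frame-wise concatenation: the SV network sees both feature streams.
sv_input = np.concatenate([fbanks, crn_bottleneck], axis=1)
print(sv_input.shape)  # (150, 104)
```

Because the two streams are aligned frame by frame, joint training can backpropagate the SV loss through the bottleneck into the separation network.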
In this paper, we integrate speech separation into an end-to-end speaker verification system and jointly optimize the two modules, both of which are based on deep learning.
Experimental results show that the proposed method outperforms the recently proposed end-to-end SV method deep speaker [12] and DNN/i-vector [8].
The rest of this paper is organized as follows.
Section 2 briefly reviews speech separation, the end-to-end SV framework, and the triplet loss.
Section 3 describes the proposed framework.
Experiments and analysis are presented in Section 4.
Finally, Section 5 concludes.
