Hao Lu1, Jianfeng Zhou2, Miao Zhao1, Wendian Lei3, Qingyang Hong∗1, Lin Li∗2
1School of Informatics, Xiamen University, China
2School of Electronic Science and Engineering, Xiamen University, China
3Xiamen Talentedsoft Co., Ltd., China
{qyhong,lilin}@xmu.edu.cn


In this paper, we present the XMU-TS system submitted to the NIST SRE19 CTS Challenge. This year's evaluation offers only the open training condition. With large amounts of data assimilated into the training set, the diversity of training data sources inevitably leads to domain mismatch, which becomes a key factor affecting system performance. To address this problem, we made a number of attempts. Based on the x-vector framework, we used different network structures and tried to improve the performance of the factorized time delay deep neural network (F-TDNN) and the residual network (ResNet). In addition, in the back-end classifier, we used domain adaptation to reduce the impact of domain mismatch. Finally, we employed Adaptive Symmetric Score Normalization (AS-Norm) to adjust the score distribution. These attempts enriched the diversity of our systems, allowing the fused system to exploit the complementarity of the subsystems and improving the final submission performance.


The Speaker Recognition Evaluation, sponsored by the US National Institute of Standards and Technology (NIST), has been one of the most representative contests in speaker recognition since 1996. Research teams from all over the world constantly explore new algorithms and state-of-the-art technologies for speaker recognition. The 2019 NIST Speaker Recognition Evaluation (SRE19) includes two separate activities:

  • CTS: The evaluation data is conversational telephone speech obtained from the Call My Net 2 (CMN2) corpus.
  • Multimedia: The evaluation data includes audio and visual data obtained from the Video Annotation for Speech Technology (VAST) corpus.

The goal of both tasks is to determine whether the enrollment speaker is present in the test segment. Our system addresses only the CTS task.

Since this evaluation allows open training data, it inevitably leads to the use of large-scale publicly available datasets for system development. A domain mismatch between the individual datasets and the test data can be expected, because the datasets were collected in different environments. In developing our system for this challenge, we tried to reduce the performance degradation caused by this domain mismatch.

Our first idea was to increase the diversity of the subsystems, and the most convenient way to do so is to extract different acoustic features for training. In our experiments, three types of features were employed. It is also necessary to find a robust training system. In the field of speaker recognition, one of the most representative architectures at present is the time delay deep neural network (TDNN) based x-vector [1, 2]. In [3], Snyder et al. used data augmentation to improve the robustness of the x-vector system, which also demonstrates that the x-vector is data driven. This property of the x-vector and its excellent performance match our needs well. Therefore, we mainly explored speaker recognition architectures based on the x-vector. In terms of network structure, we explored TDNN, extended TDNN (E-TDNN) [4], F-TDNN [5] and ResNet [6]. E-TDNN has a deeper network structure than TDNN, which lets it learn more information. F-TDNN factorizes the parameter matrices into smaller matrices, which makes training more efficient. ResNet can learn a lot of detailed temporal information.

Following the extraction of x-vectors, we used probabilistic linear discriminant analysis (PLDA) [7] for back-end scoring. We also applied centering, whitening, LDA, domain adaptation and length normalization to the x-vectors before scoring. These steps played an important role in reducing domain mismatch. After scoring, we used AS-Norm [8] to optimize the distribution of scores. Through this process, the impact of domain mismatch is gradually weakened, which further improves our system performance.

The rest of the paper is organized as follows. Section 2 describes the datasets and acoustic feature extraction. Section 3 details the subsystems we developed for SRE19. Section 4 describes the back-end and score normalization. Section 5 reports the results of our subsystems on the SRE18 evaluation set. Finally, Section 6 concludes our work.


2.1. Datasets

For the open training condition, publicly available datasets, including the corpora of previous speaker recognition evaluations (SREs), Switchboard and VoxCeleb [9, 10], were used for system training and development.
By combining the above individual datasets, three different training sets were obtained:
(i) Train-v1: This set includes the corpora of NIST SRE04, 05, 08, 10, SRE12-tel, MIXER6 and Switchboard. It contains 84,287 recordings from 5,238 speakers.
(ii) Train-v2: This set consists of VoxCeleb1 and VoxCeleb2. It contains 2,040,479 recordings from 7,205 speakers.
(iii) Train-v3: This set includes the corpora of NIST SRE04, 05, 08, 10, MIXER6, VoxCeleb1, VoxCeleb2 and Switchboard. It contains 2,124,766 recordings from 12,443 speakers.

We also augment the training data with additive noise and reverberation (babble, noise and music from MUSAN [11], and reverberation from [12]), as described in [3]. This makes the systems more robust and alleviates the domain mismatch of the training data.
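As an illustration of the additive part of this augmentation, the following minimal sketch mixes a noise clip into an utterance at a chosen signal-to-noise ratio. The function names, the synthetic signals and the 10 dB target are assumptions made for the example; they are not the recipe settings used for the submitted systems.

```python
import math
import random


def mix_at_snr(speech, noise, target_snr_db):
    """Add `noise` to `speech`, scaled so the speech-to-noise power ratio is `target_snr_db`."""
    # Tile and truncate the noise so it covers the whole utterance.
    repeats = len(speech) // len(noise) + 1
    noise = (noise * repeats)[: len(speech)]

    speech_power = sum(x * x for x in speech) / len(speech)
    noise_power = sum(x * x for x in noise) / len(noise)

    # Choose gain g so that speech_power / (g^2 * noise_power) = 10^(snr / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (target_snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixed):
    """Measure the SNR of a mixture, given the clean reference signal."""
    signal = sum(x * x for x in speech)
    residual = sum((m - s) ** 2 for s, m in zip(speech, mixed))
    return 10 * math.log10(signal / residual)


rng = random.Random(0)
speech = [math.sin(0.05 * t) for t in range(16000)]  # stand-in for an utterance
noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]   # stand-in for a noise clip
noisy = mix_at_snr(speech, noise, target_snr_db=10.0)
print(round(measured_snr_db(speech, noisy), 2))
```

Reverberation would be handled separately, by convolving the utterance with a room impulse response before or instead of adding noise.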

2.2. Acoustic feature extraction

2.2.1. Mel frequency cepstral coefficient
For the Mel frequency cepstral coefficient (MFCC) features, all audio was converted to 23-dimensional MFCCs with a frame length of 25 ms and a frame shift of 10 ms. The filter banks were restricted to the range of 20 to 3700 Hz. A frame-level energy-based voice activity detector (VAD) was then applied to the features, followed by local cepstral mean and variance normalization (CMVN) over a 3-second sliding window. All feature extraction operations were based on the Kaldi toolkit [13].
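The sliding-window normalization can be sketched as follows. With a 10 ms frame shift, a 3-second window corresponds to 300 frames. This is a plain-Python illustration rather than Kaldi's implementation; the centered window, the handling of utterance edges and the `eps` floor are assumptions of the sketch.

```python
import math


def sliding_cmvn(frames, window=300, eps=1e-8):
    """Normalize each frame by the mean and std of a centered window of frames."""
    num_frames = len(frames)
    dim = len(frames[0])
    half = window // 2
    out = []
    for t in range(num_frames):
        # Clip the window at the utterance edges; it always contains frame t.
        start = max(0, t - half)
        end = min(num_frames, t + half + 1)
        context = frames[start:end]
        n = len(context)
        normed = []
        for d in range(dim):
            mean = sum(f[d] for f in context) / n
            var = sum((f[d] - mean) ** 2 for f in context) / n
            normed.append((frames[t][d] - mean) / math.sqrt(var + eps))
        out.append(normed)
    return out


# Toy 2-dimensional "features" for a 0.5-second utterance (50 frames).
frames = [[math.sin(0.3 * t) + 5.0, 0.1 * t] for t in range(50)]
normed = sliding_cmvn(frames)
```

When the utterance is shorter than the window, as in the toy input, every frame sees the whole utterance and the result reduces to per-utterance CMVN.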

2.2.2. Perceptual linear predictive features
Perceptual linear predictive (PLP) coefficients are another common acoustic feature. Compared to linear prediction coefficients (LPC), they are more in line with the auditory mechanism of the human ear. 20-dimension PLP features with 3-dimension pitch parameters (PLP-Pitch) were also adopted for performance comparison in this work. As with MFCC, VAD and CMVN were applied in sequence.

2.2.3. Filter bank features
The other subsystems were based on the filter bank (FB) feature. The FB feature retains a lot of raw information, which makes it possible for the neural network to learn more useful information. This in turn requires the neural network itself to have strong modeling capability. The FB feature vectors consist of 40-dimensional FBs and an energy value, extracted from the raw signal with a 25 ms frame length. As with MFCC, VAD and CMVN were applied in sequence.



The final submitted system is based on the fusion of several x-vector systems with different datasets and features. In this section we will introduce the details of each subsystem.

3.1. TDNN x-vector systems

• tdnn-v1: TDNN x-vector with the architecture proposed in [2], trained on 2-fold Train-v1 with 23-dimension PLP-Pitch features.

• tdnn-v2: TDNN x-vector trained on 2-fold Train-v1 with 23-dimension MFCC features.

3.2. Extended TDNN x-vector systems
• e-tdnn-v1: Extended TDNN x-vector trained on 2-fold Train-v3 with 26-dimension MFCC-Pitch features (23-dimension MFCC concatenated with 3-dimension pitch). The configuration of the extended TDNN x-vector can be found in [4].

• e-tdnn-v2: Extended TDNN x-vector trained on 2-fold Train-v3 with 23-dimension PLP-Pitch features.

• e-tdnn-v3: Extended TDNN x-vector trained on 5-fold Train-v2 with 23-dimension MFCC features.

• e-tdnn-v4: Extended TDNN x-vector trained on 2-fold Train-v3 with the same 23-dimension PLP-Pitch features as e-tdnn-v2, but with a different VAD procedure: the VAD threshold is relaxed, allowing some of the noise to be retained.

• e-tdnn-v5: Extended TDNN x-vector trained on 3-fold Train-v3 with 23-dimension MFCC features.

• e-tdnn-v6: Extended TDNN x-vector trained on 3-fold Train-v3 with 23-dimension MFCC features, using focal loss [14] as the training objective. Focal loss was proposed to address the imbalance of samples across classes.
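For reference, the focal loss used as the objective in e-tdnn-v6 scales the cross-entropy of each sample by (1 - p_t)^γ, where p_t is the probability assigned to the true class, so that well-classified samples contribute less. The sketch below shows this for a single sample; the value γ = 2 and the toy logits are assumptions, since the setting used in e-tdnn-v6 is not stated here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t)."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = focal_loss([6.0, 0.0, 0.0], target=0)  # confidently correct sample
hard = focal_loss([0.0, 2.0, 0.0], target=0)  # misclassified sample
# With gamma = 0 the modulating factor is 1 and the loss is plain cross-entropy.
cross_entropy = focal_loss([0.0, 2.0, 0.0], target=0, gamma=0.0)
```

Easy samples are down-weighted much more strongly than hard ones, which is how the loss shifts the training signal toward poorly modeled classes.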

3.3. Factorized TDNN x-vector systems
The core trick of F-TDNN is factorizing each weight matrix into a product of two smaller matrices, one of which is constrained to be semi-orthogonal. This clearly reduces the number of parameters, and it can be shown via singular value decomposition (SVD) that the constraint costs no modeling capability. The configuration of the first two factorized TDNN x-vector systems can be found in [15]. For the third system, we modified the architecture of the factorized TDNN to make it deeper; its configuration is shown in Table 1.
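To make the semi-orthogonal constraint concrete, the sketch below repeatedly applies the update M ← M − ½(MMᵀ − I)M, which pulls a wide matrix toward MMᵀ = I. It illustrates the constraint only and is not the Kaldi training recipe, where such a step is interleaved with ordinary gradient updates. The layer sizes, the initial scaling and the number of iterations are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def semi_orthogonal_step(m):
    """One update M <- M - 0.5 * (M M^T - I) M."""
    p = matmul(m, transpose(m))
    size = len(p)
    q = [[p[i][j] - (1.0 if i == j else 0.0) for j in range(size)] for i in range(size)]
    qm = matmul(q, m)
    return [[m[i][j] - 0.5 * qm[i][j] for j in range(len(m[0]))] for i in range(len(m))]


# Parameter count of a full layer versus a factorized one (hypothetical sizes).
d_in, d_out, bottleneck = 1536, 512, 160
full_params = d_in * d_out
factored_params = bottleneck * (d_in + d_out)

# Small random 4 x 16 factor, scaled so that all singular values are below 1.
rng = random.Random(0)
m = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(4)]
frobenius = sum(x * x for row in m for x in row) ** 0.5
m = [[x / frobenius for x in row] for row in m]
for _ in range(30):
    m = semi_orthogonal_step(m)
gram = matmul(m, transpose(m))  # close to the 4 x 4 identity
```

Each step moves every singular value s of M to s(1.5 − 0.5s²), so singular values that start between 0 and 1 converge to 1 and the rows of M become orthonormal.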

• f-tdnn-v1: Factorized TDNN x-vector trained on 5-fold Train-v2 with 23-dimension MFCC features.

• f-tdnn-v2: Factorized TDNN x-vector trained on 3-fold Train-v3 with 23-dimension MFCC features.

• f-tdnn-v3: Factorized TDNN x-vector trained on 2-fold Train-v3 with 44-dimension FB-Pitch features (41-dimension FB concatenated with 3-dimension pitch).

3.4. ResNet x-vector systems

• res-v1: ResNet trained on 2-fold Train-v3 with 44-dimension FB-Pitch features. The architecture configuration of this system is shown in Table 2.

Most of the subsystems were implemented with the Kaldi toolkit [13]; the exceptions are e-tdnn-v5 and e-tdnn-v6, which were implemented in PyTorch [16].


4.1. Scoring

For all the systems above, PLDA was trained on embeddings of the 2-fold Train-v1 set (Switchboard excluded), since PLDA is sensitive to the domain. The embeddings produced by the embedding extractors were post-processed with length normalization, centering, whitening and an LDA transformation for dimensionality reduction, applied in sequence, followed by PLDA training. Furthermore, the PLDA parameters were adapted on the in-domain development data (SRE18). All subsystem scores were computed using the adapted PLDA.
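Two of the simpler post-processing steps, centering and length normalization, can be sketched as below. Whitening, LDA and PLDA are omitted. The toy embeddings, scaling to unit norm and applying centering before length normalization are assumptions of this sketch, not a description of the submitted pipeline.

```python
import math


def mean_vector(embeddings):
    """Per-dimension mean of a list of embeddings."""
    n = len(embeddings)
    return [sum(e[d] for e in embeddings) / n for d in range(len(embeddings[0]))]


def center(embedding, mean):
    """Subtract a mean vector, e.g. one estimated on in-domain data."""
    return [x - m for x, m in zip(embedding, mean)]


def length_normalize(embedding):
    """Scale an embedding to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in embedding))
    return [x / norm for x in embedding]


# Toy 3-dimensional "embeddings"; real x-vectors have several hundred dimensions.
embeddings = [[1.0, 2.0, 2.0], [3.0, 0.0, 4.0], [-1.0, 1.0, 0.0]]
mean = mean_vector(embeddings)
processed = [length_normalize(center(e, mean)) for e in embeddings]
```

Estimating the mean on in-domain data rather than on the training set is one simple way such a step can help with domain mismatch.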

4.2. Score normalization and fusion
Note that we applied AS-Norm to the scores. For a given score $s_{ij}$ (the score of speaker model $i$ against test segment $j$), the normalized score can be formulated as follows:

$$\tilde{s}_{ij} = \frac{s_{ij} - \mu_i(N_j)}{\sigma_i(N_j)}$$

where $\mu_i(N_j)$ and $\sigma_i(N_j)$ are the mean and standard deviation of the scores between speaker model $i$ and the segments in $N_j$, a subset of the cohort set consisting of the segments with the top $N$ scores against test segment $j$.
In our experiments, the combination of the unlabeled and enrollment parts of the SRE18 development set was used as the cohort set, and the top 2,000 sorted cohort scores were used to calculate the normalization statistics.
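The adaptive cohort selection can be sketched in a few lines. This is a literal reading of the definitions of μi(Nj) and σi(Nj) above: the cohort segments are ranked by their score against the test segment, and the statistics are taken from the speaker model's scores against the top-ranked ones. The function name and the toy scores are hypothetical, and the sketch uses N = 3 where the experiments use 2,000.

```python
import statistics


def adaptive_norm(score, enroll_cohort_scores, test_cohort_scores, top_n):
    """Normalize `score` using the cohort segments closest to the test segment."""
    # N_j: indices of the top-N cohort segments, ranked by score against test segment j.
    ranked = sorted(range(len(test_cohort_scores)),
                    key=lambda k: test_cohort_scores[k], reverse=True)
    selected = ranked[:top_n]
    # mu_i(N_j) and sigma_i(N_j): statistics of model i's scores against those segments.
    subset = [enroll_cohort_scores[k] for k in selected]
    mu = statistics.fmean(subset)
    sigma = statistics.pstdev(subset)
    return (score - mu) / sigma


# Toy cohort of six segments.
enroll_vs_cohort = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # model i against each cohort segment
test_vs_cohort = [0.1, 0.9, 0.8, 0.2, 0.7, 0.3]    # test segment j against each cohort segment
normalized = adaptive_norm(5.0, enroll_vs_cohort, test_vs_cohort, top_n=3)
```

A symmetric variant would compute the same quantity with the roles of the enrollment and test sides exchanged and average the two results.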


We present experimental results on the evaluation part of SRE18, since not every subsystem obtained results on the progress set of SRE19. All results were calculated with the official scoring software. The results of all subsystems are shown in Table 3. We used AS-Norm to adjust the score distribution in order to improve system performance. However, AS-Norm improved only the EER and did not help the primary metric, the actual detection cost (act_C). These results are not listed due to the limited length of the paper. We therefore fused the original scores with the scores after AS-Norm, which further improved the performance of the subsystems. The score-level fusion results of all subsystems are shown in Table 4.
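For readers less familiar with the EER, it is the operating point at which the miss rate on target trials equals the false-alarm rate on non-target trials. The sketch below approximates it by scanning the observed scores as thresholds. It is a simplified illustration with hypothetical scores, not the official scoring software, and it does not compute act_C.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by scanning every observed score as a threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for threshold in thresholds:
        # A trial is accepted when its score is at or above the threshold.
        miss = sum(s < threshold for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


separable = equal_error_rate([3.0, 4.0, 5.0], [0.0, 1.0, 2.0])
overlapping = equal_error_rate([1.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.0, 3.5])
```

Because the EER depends only on how the scores rank target against non-target trials, a normalization can improve it without improving a cost measured at a fixed decision threshold.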
From the above results, it can be seen that the F-TDNN x-vector performs better given the same data, and that the best single system is f-tdnn-v2. The structure in f-tdnn-v3 also shows great potential, as it takes advantage of the deeper configuration of the F-TDNN network. On the feature side, PLP is more robust to noise and seems better suited to this data augmentation mode. We also did a lot of engineering optimization on the PyTorch systems, which gave e-tdnn-v5 the best performance among the systems with the same extended TDNN architecture.

Finally, all twelve subsystems are fused to generate the scores of the primary system submitted to the CTS challenge. The Bosaris toolkit [17] was used to perform the fusion by learning the weights from the scores of the development set. In our experiments, we used the evaluation part of SRE18 as the development set. Results of our primary system on the progress set of NIST SRE19, together with the SRE19 baseline, are shown in Table 5. System fusion makes use of complementary information from different subsystems, so that performance can be improved.
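Score-level fusion of this kind is commonly a weighted sum of subsystem scores plus an offset, with the weights fitted by logistic regression on development trials. The sketch below shows that idea with plain gradient descent. It does not use the Bosaris API; the function names, learning rate, number of epochs and toy trials are assumptions for illustration.

```python
import math


def fit_fusion_weights(score_rows, labels, lr=0.1, epochs=500):
    """Fit weights w and offset b so that sigmoid(w . s + b) predicts target trials."""
    num_systems = len(score_rows[0])
    w = [0.0] * num_systems
    b = 0.0
    n = len(score_rows)
    for _ in range(epochs):
        grad_w = [0.0] * num_systems
        grad_b = 0.0
        for scores, label in zip(score_rows, labels):
            z = sum(wi * si for wi, si in zip(w, scores)) + b
            prob = 1.0 / (1.0 + math.exp(-z))
            err = prob - label
            for k in range(num_systems):
                grad_w[k] += err * scores[k]
            grad_b += err
        # Gradient step on the mean logistic loss.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def fuse(scores, w, b):
    """Fused score for one trial."""
    return sum(wi * si for wi, si in zip(w, scores)) + b


# Each row holds the scores of two subsystems for one development trial.
trials = [[2.0, 1.5], [1.0, 2.5], [1.8, 0.9],      # target trials
          [-1.0, -0.5], [-2.0, 0.3], [0.2, -1.8]]  # non-target trials
labels = [1, 1, 1, 0, 0, 0]
w, b = fit_fusion_weights(trials, labels)
fused = [fuse(s, w, b) for s in trials]
```

The two columns could equally be a subsystem's raw score and its AS-Norm score, which is the kind of fusion applied to the individual subsystems above.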


This paper described the XMU-TS submission to the SRE19 CTS challenge. In view of the large amount of training data and the domain mismatch problem, we made various attempts in network structures, back-end scoring and score normalization. Different acoustic features greatly enhance the diversity and complementarity of our systems. These attempts reduced the impact of domain mismatch to some extent at different stages, allowing our final fusion system to achieve a large improvement over the baseline.


This work is supported by the National Natural Science Foundation of China (Grant No.61876160).


