Speaker and Language Recognition and Characterization: Introduction to the CSL Special Issue

Eduardo Lleida1, Luis Javier Rodriguez-Fuentes2
1 Aragon Institute for Engineering Research (I3A), University of Zaragoza, Spain
2 GTTS, DEE, ZTF/FCT, University of the Basque Country UPV/EHU, Spain

Abstract
Speaker and language recognition and characterization is an exciting area of research that has gained importance in the field of speech science and technology. This special issue features, among other contributions, some of the most remarkable ideas presented and discussed at Odyssey 2016: the Speaker and Language Recognition Workshop, held in Bilbao, Spain, in June 2016. This introduction provides an overview of the selected papers in the context of current challenges.

  1. Introduction
    This special issue compiles state-of-the-art efforts in the field of speaker and language recognition and characterization (Hansen and Hasan, 2015; Li et al., 2013). New ideas about features (Diez et al., 2012; Siniscalchi et al., 2013; Yamada et al., 2013; Matějka et al., 2014), models (Dehak et al., 2011; Lei et al., 2014a; Richardson et al., 2015) and tasks (Bell et al., 2015; Molina et al., 2016) have emerged steadily in recent years, making this a particularly exciting field. Hopefully, the papers selected for this issue will bring the reader closer to the kind of challenges that researchers must face, along with the range of potential applications.
    Speaker recognition (SR) has gained importance in the field of speech science and technology, with new applications beyond forensics (Campbell et al., 2009; Drygajlo et al., 2015), such as large-scale filtering of telephone calls (Ramasubramanian, 2012), automated access through voice profiles (Kinnunen et al., 2016), speaker indexing and diarization (Moattar and Homayounpour, 2012), etc. Current challenges involve the use of increasingly short signals to perform verification (Chen et al., 2016; Kheder et al., 2016; Bhattacharya et al., 2017), the need for algorithms that are robust to all kinds of extrinsic variability, such as noise and channel conditions (Kheder et al., 2017; McCree et al., 2017), while allowing for a certain amount of intrinsic variability (due to health issues, stress, etc.) (Chen et al., 2012), and the development of countermeasures against spoofing and tampering attacks (Wu et al., 2015).
    On the other hand, spoken language recognition (SLR) has also attracted remarkable interest from the community as an auxiliary technology for speech recognition (Gonzalez-Dominguez et al., 2015b), dialogue systems (Lopez-Cozar and Araki, 2005) and multimedia search engines (Lamel, 2012), but especially for large-scale routing or filtering of telephone calls (Muthusamy et al., 1994). An active area of research specific to SLR is dialect and accent identification (Biadsy, 2011). Other issues to be considered, such as short signals, channel and environment variability, etc., are basically the same as for SR.
    The features, modeling approaches and algorithms used in SR and SLR are closely related, though not equally effective, since these two tasks differ in several ways (Li and Ma, 2010). In recent years, and after the success of Deep Learning (DL) in image and speech recognition (Hinton et al., 2012; LeCun et al., 2015; Goodfellow et al., 2016), the use of Deep Neural Networks (DNN) both as feature extractors and classifiers/regressors is opening new and exciting research horizons (Gonzalez-Dominguez et al., 2014; Kenny et al., 2014; Lei et al., 2014a; Lopez-Moreno et al., 2014; Matějka et al., 2014; Lozano-Diez et al., 2015; Richardson et al., 2015; Tian et al., 2015; Ferrer et al., 2016; Heigold et al., 2016; Marcos and Richardson, 2016; Matějka et al., 2016; Milner and Hain, 2016; Pešán et al., 2016; Rouvier and Favre, 2016; Trong et al., 2016; Zazo et al., 2016; Bhattacharya et al., 2017; Garcia-Romero et al., 2017; Jin et al., 2017; Yin et al., 2017; Zhang and Hansen, 2017).
    Until recently, speaker and language recognition technologies were mostly driven by NIST evaluation campaigns: Speaker Recognition Evaluations (SRE) and Language Recognition Evaluations (LRE), which focused on large-scale verification of telephone speech (Martin and Garofolo, 2007). In recent years, other initiatives, such as the 2008/2010/2012 Albayzin LRE (Rodríguez-Fuentes et al., 2013), the 2013 SRE in Mobile Environment (Khoury et al., 2013), the RSR2015 database (Larcher et al., 2014), the RedDots database (Lee et al., 2015), the Speakers in the Wild (SITW) database and evaluation (McLaren et al., 2016a; 2016b) or the 2015 Multi-Genre Broadcast (MGB) Challenge (Bell et al., 2015), have widened the range of applications and the research focus.
    Odyssey, a research workshop organized every two years by the ISCA Speaker and Language Characterization Special Interest Group (SpLC-SIG, http://www.speakerodyssey.com/), brings together researchers from around the world, who present their latest findings and insights on a wide range of topics, covering speaker and language characterization, modelling, evaluation, and applications. The Odyssey 2016 technical program included 59 papers distributed in 5 oral sessions, 3 poster sessions and 2 special sessions: the Special Session on the NIST 2015 Language Recognition i-Vector Machine Learning Challenge and the Special Session on Speaker Recognition in Multimedia Content, the latter organized in collaboration with the ISCA SIG on Speech and Language in Multimedia (SliM, http://slim-sig.irisa.fr/). In addition, a Forensics & Industry track and a Show & Tell session were allocated to give companies, research & development labs, government agencies and other interested parties (e.g. forensic experts and labs) the opportunity to actively participate in the event. Odyssey 2016 also featured 3 invited talks by distinguished lecturers: “Voice conversion and spoofing countermeasures for speaker verification”, by Haizhou Li; “Understanding individual-level speech variability: From novel speech production data to robust speaker recognition”, by Shri Narayanan; and “i-Vector representation based on GMM and DNN for audio classification”, by Najim Dehak. For details, please refer to http://www.odyssey2016.org/.
    The Odyssey 2016 papers presented new research on a wide range of topics, including: confidence estimation for speaker and language recognition; low-resource (lightly supervised) speaker and language recognition; corpora and tools for system development and evaluation; speaker and language characterization; system calibration and fusion; speaker recognition with speech recognition; spoofing and tampering attacks; multispeaker segmentation, detection, and diarization; forensic and investigative speaker recognition; systems and applications; speaker and language clustering; robustness in channels and environment; machine learning for speaker and language recognition; features for speaker and language recognition; language, dialect, and accent recognition; and speaker and language recognition, verification and identification.
    The remaining sections provide an overview of the papers selected for this Special Issue, organized into four broad categories: (1) Speaker Verification, (2) Spoken Language Recognition, (3) Speaker Diarization, and (4) Applications. A brief introduction at the beginning of each section aims to provide the required context, with pointers to recent advances. It is not our intention, however, to provide a thorough review of current research in the area of speaker and language characterization.

  2. Speaker Verification

In the last decade, speaker verification technology has made great advances and has come to be considered a reliable approach to verify a person’s claimed identity. Speaker verification is considered text-dependent when the lexical content of the utterances is fixed to some phrase; otherwise it is text-independent. State-of-the-art speaker verification systems use techniques such as the i-vector representation (Dehak et al., 2011), Probabilistic Linear Discriminant Analysis (PLDA) (Kenny, 2010) and, more recently, DNN (McLaren et al., 2015a).
Until the early 2000s, the Gaussian Mixture Model (GMM) approach (Reynolds et al., 2000) was the basis for most speaker verification systems. Speaker models were adapted from a generic Universal Background Model (UBM). One of the main challenges of this approach was its sensitivity to the effects of inter-session variability. Since then, researchers have made a great effort to make the GMM-UBM approach robust to these effects. Some methods, like feature mapping (Reynolds et al., 2003) and feature warping (Pelecanos et al., 2001), perform feature-level session compensation. Other methods try to compensate by modifying model parameters, the most successful techniques being Support Vector Machines (SVM) + GMM with nuisance attribute projection (NAP) (Campbell et al., 2006) and joint factor analysis (JFA) (Kenny et al., 2007).
直到2000年初,高斯混合模型(GMM)方法(Reynolds等人,2000)是大多数说话人验证系统的基础。说话人模型改编自通用背景模型(UBM)。这种方法的主要挑战之一是对闭会期间可变性的影响的敏感性。从那时起,研究人员就一直在努力使GMM-UBM方法对这些影响具有鲁棒性。一些方法如特征映射(Reynolds等人,2003)和特征扭曲(Pelecanos等人,2001)执行特征级会话补偿。其他方法试图通过修改模型参数进行补偿,最成功的技术是支持向量机(SVM)+带有干扰属性投影的GMM(NAP)(Campbell等人,2006年)和联合因子分析(JFA)(Kenny等人,2007年)。
Joint factor analysis is a generative model that makes it possible to estimate the speaker’s GMM taking into account the different sources of variability (speaker and channel) separately. JFA evolved into a new approach known as i-vectors (identity vectors), which extracts a low-dimensional fixed-length vector from a variable-length speech utterance. The i-vector defines a single space, referred to as the total variability (TV) space, which contains both sources of variability, speaker and channel, simultaneously. The speaker identity is verified by computing the similarity of i-vectors. Several channel compensation techniques have been proposed to compensate the channel distortion on the i-vector. In (Dehak et al., 2011), linear discriminant analysis (LDA) and within-class covariance normalization (WCCN) with cosine or SVM scoring are used. In (Matějka et al., 2011; Kenny, 2010), i-vectors are modeled by PLDA (a single Gaussian simplification of JFA). PLDA (Prince et al., 2007) is a generative model that decomposes i-vectors into a speaker-dependent term and a channel-dependent term.
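In compact form, and using generic notation (ours, not that of the cited papers), the i-vector model and the PLDA decomposition can be written as

\[ \mathbf{M} = \mathbf{m} + \mathbf{T}\mathbf{w}, \qquad \mathbf{w} \sim \mathcal{N}(\mathbf{0},\mathbf{I}) \]
\[ \mathbf{w}_{ij} = \boldsymbol{\mu} + \mathbf{V}\mathbf{y}_{i} + \mathbf{U}\mathbf{x}_{ij} + \boldsymbol{\varepsilon}_{ij} \]

where M is the utterance-dependent GMM mean supervector, m the UBM mean supervector, T the low-rank total variability matrix and w the i-vector; in the PLDA model, y_i is the speaker factor shared by all utterances j of speaker i, x_ij is the channel factor and ε_ij a residual term. Verification then reduces to a likelihood ratio between the hypotheses that two i-vectors do or do not share the same speaker factor.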
In recent years, a great effort has been made towards effectively exploiting deep neural networks for speaker verification. Recent approaches use neural networks as classifiers to produce posterior probabilities for JFA or i-vector extractors (Kenny et al., 2014; Richardson et al., 2015; Dey et al., 2016) or to generate bottleneck features that can be used in an i-vector PLDA approach (Snyder et al., 2015; Matějka et al., 2016). With regard to text-dependent speaker verification, end-to-end solutions are being developed, where a neural network performs all the processing steps and provides the likelihood ratio (Heigold et al., 2016; Miguel et al., 2017).
Despite these important advances, there are still open challenges, such as spoofing attacks (Wu et al., 2015), short utterances (Vogt et al., 2010) or phonetic content (Misra et al., 2014), that require extra effort from the research community. In this special issue, some of these challenges are addressed. Eight papers are directly related to speaker recognition. Three of them deal with the i-vector/PLDA framework for text-independent speaker recognition (Khosravani et al., 2017; Lin et al., 2017; Rahman et al., 2018), one with text-dependent speaker recognition using neural networks and hidden Markov models (HMM) (Zeinali et al., 2017a), one with spoofing countermeasures (Todisco et al., 2017), another with speech activity detection (Sholokhov et al., 2018) and, finally, two papers perform speaker verification with whispered speech (Sarria-Paja et al., 2017) and humming (Patil and Madhavi, 2017).

2.1 PLDA improvements in text-independent speaker verification

State-of-the-art text-independent speaker verification systems are based on the i-vector/PLDA framework, a data-driven approach for which models trained on a specific type of data or acoustic domain may not generalize well to other types of data or domains. In this special issue, Rahman et al. present several domain-compensation approaches to handle the data mismatch between training and test conditions in the LDA and PLDA subspaces (Rahman et al., 2018). The authors propose to estimate the in-domain and out-domain between-class scatter matrices separately. The domain variability scatter matrix is used to train a domain-invariant LDA subspace to compensate domain mismatch. The out-domain variability scatter matrix, together with a small amount of supervised in-domain data, is used to train a domain-invariant PLDA in which a domain factor is introduced along with the standard speaker factor. The paper presents experimental results, using the NIST 2008 and 2010 SRE datasets, with different domain-compensation approaches and score fusion combinations.
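As a rough illustration of the kind of computation involved, the sketch below builds an LDA-like projection that maximizes between-speaker scatter while penalizing within-speaker and domain-variability scatter. The domain scatter definition and the exact objective are our assumptions for illustration; the formulation in Rahman et al. differs in its details.

import numpy as np
from scipy.linalg import eigh

def scatter_matrices(X, spk):
    # within- and between-speaker scatter of row-vector data X grouped by speaker labels spk
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(spk):
        Xc = X[spk == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    return Sw, Sb

def domain_invariant_lda(X_out, spk_out, X_in, spk_in, dim=150):
    # our illustrative reading of a "domain-invariant" LDA, not the paper's exact objective:
    # X_out/spk_out are labeled out-of-domain i-vectors, X_in/spk_in a small in-domain set
    Sw_o, Sb_o = scatter_matrices(X_out, spk_out)
    Sw_i, Sb_i = scatter_matrices(X_in, spk_in)
    diff = X_in.mean(axis=0) - X_out.mean(axis=0)
    S_dom = np.outer(diff, diff)  # domain-variability scatter between the two domain means
    # generalized eigenproblem: between-speaker vs. within-speaker plus domain scatter
    vals, vecs = eigh(Sb_o + Sb_i, Sw_o + Sw_i + S_dom)
    return vecs[:, np.argsort(vals)[::-1][:dim]]  # top `dim` projection directions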

Rahman, M.H., Kanagasundaram, A., Himawan, I., Dean, D., Sridharan, S., 2018. Improving PLDA speaker verification performance using domain mismatch compensation techniques. Computer Speech & Language 47 (2018), pp. 240-258.

Also in this special issue, Khosravani et al. propose a PLDA approach that deals with the effects of the language being spoken, in order to improve the performance of speaker recognition in cross-language conditions (Khosravani et al., 2017). The PLDA framework is extended to directly model the language variability by defining a language factor. A language-independent PLDA is trained by using speech from bilingual speakers. The authors conclude that modeling the language dependency allows a better estimation of the channel subspace and, as a result, a better estimation of the between-speaker subspace.
In a third paper, Lin et al. address the reliability of the i-vector representation when presented with utterances of arbitrary duration (Lin et al., 2017). The utterance duration mismatch between enrolment and test data degrades the performance of a conventional i-vector/PLDA system, and is especially severe when using short utterances. Lin et al. use the uncertainty propagation approach proposed by Kenny et al. (Kenny et al., 2013) to handle the length variability in the PLDA model. The method introduces an extra loading matrix into the PLDA model to represent the reliability of the i-vector. As a result, scoring is more computationally intensive and memory-consuming than the conventional approach. To tackle this problem, Lin et al. propose to group the i-vectors according to utterance duration and to model the reliability in each group by a single representative loading matrix, saving a great deal of computation by pre-computing all the relevant scoring terms on development data. The authors report a scoring time as low as 3.18% of that of the original PLDA with uncertainty propagation, without loss of performance.
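A minimal sketch of the grouping idea follows; the bucket boundaries are hypothetical and stand in for whatever duration partition is chosen on development data.

import numpy as np

# hypothetical duration buckets (seconds): one representative loading matrix,
# and hence one set of pre-computed PLDA scoring terms, per bucket
edges = [0, 5, 10, 20, 40, np.inf]

def duration_group(duration_s):
    # map an utterance duration to its bucket index; at test time, a cheap
    # table lookup of the bucket's pre-computed terms replaces per-utterance
    # uncertainty propagation
    return int(np.digitize(duration_s, edges)) - 1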

2.2 Text-dependent speaker verification

One of the papers included in this Special Issue proposes a text-dependent speaker verification system where, in order to take into account the temporal structure of the pass-phrase, the sufficient statistics for i-vector extraction are collected by frame alignment (Zeinali et al., 2017a). The authors analyze four frame alignment methods. A GMM approach is used as baseline and is compared to a phonemic HMM-based system (Zeinali et al., 2017b), a DNN-based system with the DNN trained for senone classification, and a Bottleneck (BN) based system where a GMM is trained on BN features (Tian et al., 2015). In contrast to HMM alignment, DNN alignment does not need the true phonetic transcription of the enrollment phrases. Besides, the authors propose to use DNN-based bottleneck features along with standard MFCC features. All experiments are performed on the RSR2015 corpus and on the more challenging RedDots corpus (which features higher variability). The best results are obtained when DNN-based bottleneck features are concatenated with the conventional MFCCs. Experimental results show that the phonemic HMM-based and the DNN-based approaches yield similar performance, whereas more effort is needed by the BN-based method due to its weakness in open-phrase scenarios. Also, BN features perform much better for closed-phrase scenarios than for open-phrase scenarios.
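Whatever model produces the frame alignment, the sufficient statistics take the same form. A minimal sketch (our notation, assuming per-frame occupation probabilities are already available):

import numpy as np

def sufficient_stats(features, posteriors):
    # features:   (T, D) frame features, e.g. MFCCs concatenated with bottleneck features
    # posteriors: (T, C) frame alignment, i.e. component/senone occupation probabilities
    #             coming from a GMM, an HMM forced alignment or a DNN
    N = posteriors.sum(axis=0)   # (C,)   zeroth-order statistics
    F = posteriors.T @ features  # (C, D) first-order statistics
    return N, F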

2.3 Spoofing attacks: detection and countermeasures

State-of-the-art speaker verification systems have achieved very good performance in recent times. However, performance is usually measured in an ideal scenario where impostors do not try to disguise their voices to make them similar to the target speakers, and where target speakers do not try to conceal their identities. Spoofing is the attempt to impersonate somebody by employing techniques like playing a recording of the victim’s speech or applying some kind of voice transformation. This topic is drawing attention, motivated by the desire to introduce this technology into new applications. Spoofing techniques can be classified into four groups (Evans et al., 2013): (1) impersonation, imitation or mimicry, where impostors alter their own voices to sound like the target speaker (Lau et al., 2004; Blomberg et al., 2004); (2) speech synthesis, where the impostor uses an artificial production of the human speech (Zen et al., 2009; De Leon et al., 2012); (3) voice conversion, where different techniques are used to make a speaker’s voice sound like that of another person (Perrot et al., 2005; Stylianou, 2009); and (4) replay attacks, which consist of feeding the speaker verification system with a recording acquired from the victim (Lindberg and Blomberg, 1999; Villalba and Lleida, 2011).
In recent years, there have been huge efforts to define common databases, protocols and metrics to compare different experimental results (Evans et al., 2013b). As an example, the ASVspoof initiative has emerged, providing databases, protocols and metrics for speech synthesis and voice conversion spoofing attacks, as in ASVspoof-2015 (Ergunay et al., 2015), and for replay attacks, as in ASVspoof-2017 (Kinnunen et al., 2017a) and BTAS-2016 (Korshunov et al., 2016).
In this special issue, Todisco et al. propose the use of the constant Q transform (CQT) for spoofing detection (Todisco et al., 2017). The CQT (Youngberg and Boll, 1978) employs geometrically spaced frequency bins, imposing a constant Q factor across the entire spectrum. The CQT provides higher frequency resolution at lower frequencies and higher temporal resolution at higher frequencies. In the paper, the authors present an assessment of the use of constant Q cepstral coefficients (CQCC) in different use case scenarios (physical access control and logical access control) and different spoofing attack types and algorithms, and of the generalization of the approach through cross-database evaluation (front-end optimization on one database and evaluation on another). The experimental results show that the new features are effective in capturing the tell-tale signs of manipulation artifacts which are indicative of spoofing attacks, given the high relative improvements obtained over previous best results on the test databases. However, cross-database evaluation experiments show that performance is sensitive to the precise CQCC configuration, suggesting that spoofing attacks of different nature require different solutions, which represents a big challenge for future work.
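A rough sketch of such a feature extractor is given below, assuming librosa for the CQT; the parameterization is illustrative, and the full CQCC recipe of Todisco et al. differs in its details.

import numpy as np
import librosa
from scipy.fftpack import dct

def cqcc_like(y, sr, n_coeff=20, bins_per_octave=96, n_octaves=9, fmin=15.0):
    # constant-Q power spectrum: geometrically spaced bins, constant Q factor
    n_bins = bins_per_octave * n_octaves
    C = np.abs(librosa.cqt(y, sr=sr, fmin=fmin, n_bins=n_bins,
                           bins_per_octave=bins_per_octave)) ** 2
    logC = np.log(C + 1e-10)
    # uniform resampling of the geometric frequency axis before the DCT,
    # the step that turns plain CQT cepstra into CQCC-style features
    f_geo = librosa.cqt_frequencies(n_bins=n_bins, fmin=fmin,
                                    bins_per_octave=bins_per_octave)
    f_lin = np.linspace(f_geo[0], f_geo[-1], n_bins)
    logC_lin = np.stack([np.interp(f_lin, f_geo, logC[:, t])
                         for t in range(logC.shape[1])], axis=1)
    return dct(logC_lin, type=2, axis=0, norm='ortho')[:n_coeff]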

2.4 Miscellaneous topics in speaker recognition
Speech activity detection (SAD) plays an important role in any speaker recognition system when dealing with speech degraded by additive noise (Mak and Yu, 2014; Ferrer et al., 2013; Villalba et al., 2013). However, not much attention has been paid to this issue in the speaker recognition community. In this special issue, Sholokhov et al. propose a semi-supervised SAD with extensive experimentation on speaker recognition tasks (Sholokhov et al., 2018). The authors introduce a SAD method based on a Gaussian mixture model (GMM) trained from scratch for every audio recording. A small fraction of labeled data from the specific recording is used to train the GMM in a semi-supervised way, following a method originally developed for image segmentation. Labeled data can be obtained by human assistance or from a simpler SAD. In the paper, the reader can find extensive experiments with the proposed SAD, assessing the method as a standalone SAD and evaluating the performance of speaker verification systems with the proposed and other unsupervised SAD methods. As a standalone SAD, the main conclusions are that the number of Gaussians has little impact on the performance, but the choice of the covariance matrix structure does: full covariance matrices and covariance matrices shared within each class are helpful. As expected, for semi-supervised learning the initial labeling is a key factor for achieving good performance. Regarding the features used for SAD, all the features tested (MFCC, PLP, FDLP, PNCC) produced reasonable results, but PNCC seems preferable for noisier conditions, while MFCC can be better for relatively clean data. When assessing the performance of the SAD integrated in a speaker verification system, the proposed SAD is best suited for long and noisy data conditions, but it is not a good choice for text-dependent scenarios involving short utterances.
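As a simplified illustration of per-recording, partially supervised SAD (a plain two-class GMM classifier rather than the exact semi-supervised procedure of the paper), one GMM per class can be fitted on the labeled fraction and used to classify the remaining frames:

import numpy as np
from sklearn.mixture import GaussianMixture

def simple_per_recording_sad(feats, labeled_idx, labels, n_comp=4):
    # feats: (T, D) frame features of one recording; labeled_idx/labels: a small
    # labeled fraction, with 0 = non-speech and 1 = speech (human- or SAD-provided)
    models = {}
    for c in (0, 1):
        X = feats[labeled_idx][labels == c]
        models[c] = GaussianMixture(n_components=n_comp,
                                    covariance_type='full').fit(X)
    ll0 = models[0].score_samples(feats)
    ll1 = models[1].score_samples(feats)
    return (ll1 > ll0).astype(int)  # per-frame speech decision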

  3. Spoken Language Recognition

In the last few years, research on Spoken Language Recognition (SLR) has focused on exploring different flavours of DNN, either to improve or to replace baseline approaches, which typically comprised an i-vector front-end and a classifier backend (Gaussian, PLDA, etc.) (Martinez et al., 2011; Sizov et al., 2017). DNN have been used as classifiers (Lopez-Moreno et al., 2014; Richardson et al., 2015; Lozano-Diez et al., 2015), as the underlying global model for the computation of class posteriors in the i-vector approach (Lei et al., 2014a; Lei et al., 2014b), or just as extractors of posteriors or bottleneck features (BNF), typically as part of an i-vector front-end (Matějka et al., 2014; Jiang et al., 2014; Ferrer et al., 2014; Richardson et al., 2015; Ferrer et al., 2016; Lopez-Moreno et al., 2016). More recently, recurrent neural networks (such as LSTM architectures) have been successfully applied to SLR in end-to-end approaches, sometimes in combination with feed-forward neural networks (Gonzalez-Dominguez et al., 2014; Zazo et al., 2016; Trong et al., 2016; Garcia-Romero et al., 2016; Gelly et al., 2016; Gelly et al., 2017). At the same time, alternative ways of leveraging the information extracted by feed-forward DNN and Convolutional Neural Networks (CNN) when dealing with sequences of observations have been proposed (Li et al., 2016; Pešán et al., 2016; Jin et al., 2017).
Research on feature extraction for SLR has gone beyond BNF (Matějka et al., 2014; Jiang et al., 2014), with new proposals such as PLLRs (Diez et al., 2012; Plchot et al., 2014) and variations of them (Irtza et al., 2015; Fernando et al., 2016a), sparse coding (Gwon et al., 2016) and eigenfeatures (Fernando et al., 2016b). Also, efforts have been made to increase the robustness of SLR systems against noise or mismatched conditions (Sadjadi et al., 2015; Nercessian et al., 2016), to deal with short speech utterances (Gonzalez-Dominguez et al., 2015a; Cumani et al., 2015; Fernando et al., 2017) and to better discriminate Out-Of-Set (OOS) from target languages in open-set SLR setups (Zhang et al., 2016; McLaren et al., 2017).
Issues specifically related to dialect/accent recognition have been studied in (Behravana et al., 2015), and performance improvements in these tasks have been reported by using universal speech attributes (Hautamäki et al., 2015), unsupervised BNF together with an i-vector front-end (Zhang et al., 2017), or acoustic, phonotactic, lexical and syntactic features, combined in different ways (Hansen et al., 2016; Ali et al., 2016; Khurana et al., 2017).
Finally, multi-lingual resources have been used to build universal SLR front-ends, potentially capable of providing reliable representations for any set of target languages. Typically, a DNN is trained with utterances from multiple languages, with all hidden layers (including a bottleneck layer) being shared, and the output layer being dependent on the training language. BNF extracted from such a DNN are expected to contain discriminative information in SLR tasks, since the network is trained to distinguish between the sounds of many different languages. Good results have been reported in experiments carried out under this approach (Fér et al., 2015; Marcos et al., 2015).
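A minimal sketch of such a shared-trunk network with language-dependent output layers is shown below (PyTorch, with illustrative layer and bottleneck sizes of our choosing):

import torch.nn as nn

class MultilingualBNF(nn.Module):
    # shared hidden layers (including the bottleneck) + one softmax head per training language
    def __init__(self, feat_dim, targets_per_lang, bn_dim=80):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, bn_dim),  # bottleneck: the features reused for SLR
        )
        self.heads = nn.ModuleList([nn.Linear(bn_dim, n) for n in targets_per_lang])

    def forward(self, x, lang):
        bnf = self.shared(x)          # at test time, only this trunk is kept
        return self.heads[lang](bnf)  # language-dependent output layer (training only)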
In this Special Issue, two papers are devoted to SLR, and both deal with exploiting multi-lingual resources to improve performance. The work presented in (Fér et al., 2017) builds upon previous work by the same authors (Fér et al., 2015), with an in-depth study of practical aspects (multiple configurations of DNN) of the multi-lingual training of BNF and a comparison of mono- and multi-lingual features in terms of both SLR and ASR metrics. In experiments on NIST 2009 LRE datasets, multi-lingual BNF not only outperform BNF trained on a single language, but also contribute complementary information in score-level fusions.
The second paper devoted to SLR explores a different way of leveraging multi-lingual data: the unsupervised cross-lingual adaptation of an existing DNN phone tokenizer (trained on English transcribed data) to different target languages, by using automatically obtained transcriptions (Ng et al., 2017). In this way, the adapted DNNs, though yielding degraded phoneme recognition accuracy, are expected to provide complementary information in score-level fusions of the corresponding SLR systems. Building upon a previous work (Ng et al., 2016), the authors apply cross-entropy and state-level minimum Bayes risk adaptation methods in bottleneck i-vector and phonotactic SLR systems, respectively. In experiments on the NIST 2015 LRE evaluation dataset, the fusion of the adapted systems outperforms baseline configurations without adapted tokenizers.

  4. Speaker Diarization and Clustering

The task of speaker diarization, which consists of determining speaker turns in a stream of audio, has received renewed attention in recent years, mostly due to the emergence of new speaker characterization techniques (i-vectors + PLDA, DNN), but also to the increasing need to enrich huge amounts of speech resources with metadata (speaker and language labels, noise levels, etc.) in an almost unsupervised way.
A speaker diarization system typically consists of three separate stages (Tranter and Reynolds, 2006; Anguera et al., 2012): (1) speech activity detection (SAD); (2) speaker change detection; and (3) clustering of the segments produced in the second stage. Systems differ in the way each stage is implemented, sometimes integrating two of them or making segmentation and clustering iterative (by gradually adapting the models) until a convergence criterion is met.
Nowadays, state-of-the-art technology involves using i-vectors to characterize speech segments (Dehak et al., 2011; Silovsky and Prazak, 2012) and applying PLDA scoring in the clustering phase (Kenny, 2010; Silovsky et al., 2011; Villalba et al., 2015). Recent improvements include replacing UBM-GMM posteriors by DNN posteriors in the computation of i-vectors (Kenny et al., 2014; Sell et al., 2015), using representations suitable for fast as well as reasonably accurate speaker diarization (Delgado et al., 2015), extending spectral features with prosodic and voice-quality features (Woubie et al., 2015; Woubie et al., 2016), using deep speaker embeddings instead of i-vectors (Rouvier and Favre, 2016; Garcia-Romero et al., 2017), using unlabelled resources to train and adapt PLDA models (LeLan et al., 2016; Viñals et al., 2017), using DNN models for clustering (Milner and Hain, 2016) or using recurrent neural networks (bi-directional LSTMs) for speaker segmentation (Yin et al., 2017), among others.
This Special Issue includes two papers on speaker diarization and clustering. The work presented in (Desplanques et al., 2017) builds upon a previous work (Desplanques et al., 2015), where segmentation relied on the distance between two frame-level speaker factor vectors (based on an eigenvoice matrix), and a two-step Agglomerative Hierarchical Clustering (AHC) algorithm was then applied, with an initial clustering stage based on the Bayesian Information Criterion (BIC) followed by i-vector PLDA clustering, which led to performance gains over using log-likelihood ratios and BIC throughout. Here, the authors present several adaptation strategies: (1) a soft SAD (McLaren et al., 2015b), which weighs the contribution of each frame to the speaker factors; (2) a two-pass training of the eigenvoice matrix for segmentation; and (3) a two-pass Cosine Distance Scoring (CDS) of speaker factors (replacing i-vector PLDA scoring), initially with a generic out-of-domain eigenvoice matrix and then, after a first clustering pass, with an adapted eigenvoice matrix, including both generic and in-domain eigenvoices. In experiments on the COST 278 broadcast news database, the authors report consistent performance gains over their baseline systems.
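For reference, one common formulation of the ΔBIC merging criterion used in such a first clustering stage (single-Gaussian segment models with full covariances; our notation) is

\[ \Delta\mathrm{BIC} = (n_1+n_2)\log|\boldsymbol{\Sigma}| - n_1\log|\boldsymbol{\Sigma}_1| - n_2\log|\boldsymbol{\Sigma}_2| - \lambda\,\frac{1}{2}\left(d + \frac{d(d+1)}{2}\right)\log(n_1+n_2) \]

where n_1 and n_2 are the frame counts of the two segments, Σ_1, Σ_2 and Σ the sample covariances of each segment and of the pooled data, d the feature dimension and λ a penalty weight; the pair is merged when ΔBIC is negative, i.e., when the single-cluster model is preferred.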
The work presented in (Salmun et al., 2017) does not deal with diarization but with clustering, since speaker segments are already provided by the application: short push-to-talk cuts from different drivers recorded at a central taxi station. Segments from a given driver may come from different cars and environments, leading to a high intra-speaker variability, which makes the clustering task really challenging. In (Shapiro et al., 2015), the authors studied and optimized the use of Mean Shift clustering with cosine similarity (Senoussaoui et al., 2013; 2014) on i-vectors. In (Salmun et al., 2016), two improvements were introduced to the Mean Shift clustering algorithm: PLDA scoring was used instead of cosine scoring, and the fixed threshold (bandwidth) used in the selection of neighbours was replaced by an adaptive threshold based on the k-nearest neighbours (k-NN). In this work, the authors compare the Expectation-Maximization iterative estimation to the exact algebraic calculation of a two-covariance PLDA model (Sizov et al., 2014), and the use of a Euclidean distance and PLDA scores both as the stopping criterion in the shifting process and as the metric for clustering. Also, the authors propose two variations of the Mean Shift algorithm that are shown to improve performance and to increase the stability of the approach: using normalized i-vectors (Garcia-Romero and Espy-Wilson, 2011) and discarding negative PLDA scores in the k-NN shifting process.
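The core of the k-NN variant can be sketched as follows, using Euclidean shifts on length-normalized i-vectors as a stand-in for the PLDA-based shifts of the paper (parameters and stopping rule are illustrative):

import numpy as np

def mean_shift_knn(X, k=30, n_iter=10):
    # X: (N, D) i-vectors; length normalization first, so that Euclidean and
    # cosine neighbourhoods coincide
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    Z = X.copy()
    for _ in range(n_iter):
        for i in range(len(Z)):
            d = np.linalg.norm(X - Z[i], axis=1)
            nbrs = np.argsort(d)[:k]      # adaptive k-NN neighbourhood, no fixed bandwidth
            Z[i] = X[nbrs].mean(axis=0)   # shift towards the neighbourhood mean
            Z[i] /= np.linalg.norm(Z[i])
    return Z  # points that converge to the same mode belong to the same cluster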

  5. Applications

With the latest developments of speaker and spoken language characterization and recognition technology, the number and diversity of applications have increased remarkably, and new approaches have been integrated into existing systems. Recent SLR applications include robust systems capable of operating in hostile environments such as cars (Ghahabi et al., 2016), multilingual recognition of emotions (Sagha et al., 2016), language identification in singing (Kruspe, 2016), and the new fields of code switching (language segmentation and identification in multi-lingual audio) (Molina et al., 2016; Amazouz et al., 2017) and language clustering (Kacprzak, 2017). On the other hand, the field of forensic speaker verification has witnessed the transition to semi-automatic Likelihood Ratio (LR) based protocols (Solewicz et al., 2017). The field of speaker diarization has expanded to speaker linking or cross-show speaker diarization, where speakers that participate in several TV shows, meetings or conversations, recorded in different channels or environments, must be identified under the same name (Ghaemmaghami et al., 2016; Ferras et al., 2016a; 2016b; Sturim and Campbell, 2016). Other applications of speaker diarization include speaker recognition in multi-speaker audio streams (Novotny et al., 2016), speaker diarization with minimal human supervision (Yu and Hansen, 2017) and speaker diarization for the automatic processing of real-world call center data (Church et al., 2017). Finally, important contributions have also been made regarding the databases and tools needed for the advancement of technology. Outstanding examples are the Multi-Genre Broadcast (MGB) Challenge (Bell et al., 2015), the RedDots Replayed corpus for replay spoofing attacks in text-dependent speaker verification (Kinnunen et al., 2017b) and pyannote.metrics, a toolkit for the evaluation, diagnostic and error analysis of speaker diarization systems (Bredin, 2017).
In this Special Issue, three papers can be regarded as applications. In the first one, Phil Rose studies the performance of higher-level features for forensic voice comparison within the Likelihood Ratio (LR) framework in real-world cases (Rose, 2017). Specifically, the trajectories of the F1, F2 and F3 formants, along with the pitch F0, in a disyllabic word are evaluated on a database of 23 male speakers, using the log-likelihood-ratio cost as metric. The author shows that the score-level fusion of the studied features provides remarkable improvements over the individual features. Moreover, other higher-level features, more local and not reliable as individual features for forensic voice comparison, are shown to further increase performance through score-level fusions. The author discusses in depth how to correctly estimate the strength of forensic voice comparison evidence under the LR framework, giving pros and cons of LRs and higher-level features, and presenting three case studies to illustrate them.
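The metric referred to above is the log-likelihood-ratio cost, commonly written as

\[ C_{llr} = \frac{1}{2}\left( \frac{1}{N_{ss}} \sum_{i=1}^{N_{ss}} \log_2\!\left(1 + \frac{1}{LR_i}\right) + \frac{1}{N_{ds}} \sum_{j=1}^{N_{ds}} \log_2\!\left(1 + LR_j\right) \right) \]

where the LR_i are the likelihood ratios obtained for the N_ss same-speaker comparisons and the LR_j those for the N_ds different-speaker comparisons. Lower is better; a non-informative system outputting LR = 1 everywhere scores C_llr = 1.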
The second application paper (Magariños et al., 2017) deals with the task of reversible speaker de-identification, which is needed to protect privacy and to avoid spoofing attacks in telephone banking, call centers, healthcare records of doctor-patient interviews, etc. The three key features of speaker de-identification, namely universality, naturalness and reversibility, are met by the proposed approach. Following (Pobar and Ipsic, 2014), universality is obtained by using a pool of pre-trained voice conversion transformations between multiple source and target speakers. Then, given an input speaker, the transform that maps from the most similar source speaker to the most dissimilar target speaker is applied. This avoids the need to train a specific transform for each input speaker. As in (Abou-Zleikha et al., 2015; Magariños et al., 2016), similarity is computed by using the i-vector speaker characterization approach, which is state-of-the-art in speaker recognition, thus ensuring high de-identification accuracy. On the other hand, the transformation method, based on frequency warping and amplitude scaling (Erro et al., 2010; Godoy et al., 2012), guarantees naturalness and reversibility. Objective and subjective evaluations show that the proposed approach provides high-quality de-identification and re-identification results with low (real-time) computational cost.
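The transform selection step described above can be sketched in a few lines (cosine similarity on i-vectors; all names are ours, for illustration):

import numpy as np

def pick_transform(ivec_in, src_ivecs, tgt_ivecs):
    # src_ivecs/tgt_ivecs: (N, D) i-vectors of the pool's source/target speakers
    def cos(a, B):
        return (B @ a) / (np.linalg.norm(B, axis=1) * np.linalg.norm(a))
    src = int(np.argmax(cos(ivec_in, src_ivecs)))  # most similar source speaker
    tgt = int(np.argmin(cos(ivec_in, tgt_ivecs)))  # most dissimilar target speaker
    return src, tgt  # indices into the pool of pre-trained VC transforms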

The third application paper (Dubey et al., 2017), which builds upon previous work by the same authors (Dubey et al., 2016a; 2016b), exploits robust speech activity detection (SAD) and speaker diarization methods to extract speech segments from the different participants in small-group conversations. From the segments of each participant, behavioral characteristics such as dominance, curiosity, emphasis and engagement can be estimated based on a small set of acoustic cues: speaking time, speech energy, pitch contour and speech rate. The paper introduces the Peer-Led Team Learning (PLTL) task and corpus, and presents some statistics about the annotated data: SNR, turn duration and the answers to a questionnaire about team and individual dynamics. The speaker diarization system applied in this work combines stacked-autoencoder bottleneck features with an HMM for joint speaker segmentation and clustering. In order to make the task easier, the HMM is fed with side information, including the number of speakers and the minimum duration of turns. Finally, the performance provided by the SAD and speaker diarization modules is presented, and the usefulness of the acoustic cues for detecting curiosity, emphasis and engagement is studied too, with encouraging results.
6. Conclusion
The articles included in this special issue are representative of recent progress made in the field of speaker and language recognition and characterization, a dynamic research area with a steady pace of technological developments. Although the technology is reasonably mature and state-of-the-art performance is good enough to deploy practical applications, there is still a lot of work to be done. Channel mismatch, short speech utterances or spoofing countermeasures, to name just a few, are some of the challenges that will require intensive research. At the same time, new and exciting research horizons are being opened with the introduction of deep neural networks and new machine learning techniques. We hope that this special issue will help introduce new researchers to this area and inspire future developments.

Acknowledgments
The editors would like to thank the editorial and administrative support of the Computer Speech & Language journal, especially Roger Moore for his guidance and help throughout the production of this special issue. The editors also thank all of the reviewers and the participants of the 2016 ISCA Workshop on Speaker and Language Recognition (Odyssey 2016) for their thoughtful contributions to this special issue.

References
Abou-Zleikha, M., Tan, Z.-H., Christensen, M., Jensen, S., 2015. A discriminative approach for speaker selection in speaker de-identification systems. In Proc. EUSIPCO 2015, pp. 2147–2151.
Ali, A., Dehak, N., Cardinal, P., Khurana, S., Yella, S.H., Glass, J., Bell, P., Renals, S., 2016. Automatic Dialect Detection in Arabic Broadcast Speech. In Proc. Interspeech 2016, pp. 2934-2938.
Amazouz, D., Adda-Decker, M., Lamel, L., 2017. Addressing Code-Switching in French/Algerian Arabic Speech. In Proc. Interspeech 2017, pp. 62-66.
Anguera, X., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G., Vinyals, O., 2012. Speaker diarization: A review of recent research. IEEE Transactions on Audio, Speech and Language Processing, Vol. 20, NO. 2, pp. 356–370, February 2012.
Bhattacharya, G., Alam, J., Kenny, P., 2017. Deep Speaker Embeddings for Short-Duration Speaker Verification. in Proc. Interspeech 2017, pp. 1517-1521.
Behravana, H., Hautamäki, V., Kinnunen, T., 2015. Factors affecting i-vector based foreign accent recognition: A case study in spoken Finnish. Speech Communication 66 (2015), pp. 118-129.
Bell, P., Gales, M.J.F., Hain, T., Kilgour, J., Lanchantin, P., Liu, X., McParland, A., Renals, S., Saz, O., Wester, M., Woodland, P.C., 2015. The MGB Challenge: Evaluating Multi-genre Broadcast Media Recognition. In Proc. 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 687-693.
Biadsy, F., 2011. Automatic Dialect and Accent Recognition and its Application to Speech Recognition. Ph.D. Thesis. Columbia University, New York, NY, USA, 2011.
Blomberg, M., Elenius, D., and Zetterholm, E., 2004. Speaker verification scores and acoustic analysis of a professional impersonator. In P. Branderud and H. Traunmüller (Eds.), Proceedings of Fonetik 2004: The 17th Swedish Phonetics Conference, pp. 84–87, Stockholm, Sweden.
Bredin, H., 2017. pyannote.metrics: A Toolkit for Reproducible Evaluation, Diagnostic, and Error Analysis of Speaker Diarization Systems. In Proc. Interspeech 2017, pp. 3587-3591.
Brummer, N., De Villiers, E., 2010. The Speaker Partitioning Problem. In Proc. Odyssey 2010, pp. 194–201.
Campbell, J.P., Shen, W., Campbell, W.M., Schwartz, R., Bonastre, J.-F., Matrouf, D., 2009. Forensic speaker recognition: A need for caution. IEEE Signal Processing Magazine, Vol. 26, NO. 2, pp. 95–103, March 2009.
Chen, L., Lee, K.A., Chng, E.S., Ma, B., Li, H., Dai, L.R., 2016. Content-aware local variability vector for speaker verification with short utterance. In Proc. IEEE ICASSP 2016, pp. 5485-5489.
Chen, S., Xu, M., Pratt, E., 2012. Study on the effects of intrinsic variation using i-vectors in text- independent speaker verification. In Proc. Odyssey 2012, pp. 172-179.
Church, K., Zhu, W., Vopicka, J., Pelecanos, J., Dimitriadis, D., Fousek, P., 2017. Speaker Diarization: A Perspective on Challenges and Opportunities from Theory to Practice. In Proc. IEEE ICASSP 2017, pp. 4950-4954.
Cumani, S., Plchot, O., Fér, R., 2015. Exploiting i-vector posterior covariances for short-duration language recognition. In Proc. Interspeech 2015, pp. 1002-1006.
De Leon, P. L., Pucher, M., Yamagishi, J., Hernaez, I., Saratxaga, I., 2012. Evaluation of Speaker Verification Security and Detection of HMM Based Synthetic Speech. IEEE Transactions on Audio, Speech and Language Processing, Vol. 20, NO. 8, pp. 2280–2290, October 2012.
Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., Ouellet, P., 2011. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 19, NO. 4, pp. 788–798, May 2011.
Delgado, H., Anguera, X., Fredouille, C., Serrano, J., 2015. Fast Single- and Cross-Show Speaker Diarization Using Binary Key Speaker Modeling. IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol. 23, NO. 12, pp. 2286-2297, December 2015.
Desplanques, B., Demuynck, K., Martens, J.P., 2015. Factor analysis for speaker segmentation and improved speaker diarization. In Proc. Interspeech 2015, pp. 3081–3085.
Desplanques, B., Demuynck, K., Martens, J.-P., 2017. Adaptive speaker diarization of broadcast news based on factor analysis. Computer Speech & Language 46 (2017), pp. 72-93.
Dey, S., Madikeri, S., Ferras, M., Motlicek, P., 2016. Deep neural network based posteriors for text-dependent speaker verification. In Proc. IEEE ICASSP 2016, pp. 5050–5054.
Diez, M., Varona, A., Penagarikano, M., Rodriguez-Fuentes, L.J., Bordel, G., 2012. On the use of phone log-likelihood ratios as features in spoken language recognition. In Proc. 2012 IEEE Spoken Language Technology Workshop (SLT), pp. 274-279.
Drygajlo, A., Jessen, M., Gfroerer, S., Wagner, I., Vermeulen, J., Niemi, T., 2015. Methodological guidelines for best practice in forensic semiautomatic and automatic speaker recognition including guidance on the conduct of proficiency testing and collaborative exercises. ENFSI, Verlag für Polizeiwissenschaft, Frankfurt.
Dubey, H., Kaushik, L., Sangwan, A., Hansen, J.H., 2016a. A Speaker Diarization System for Studying Peer-Led Team Learning Groups. In Proc. Interspeech 2016, pp. 2180-2184.
Dubey, H., Sangwan, A., Hansen, J.H.L., 2016b. A Robust Diarization System for Measuring Dominance in Peer-Led Team Learning Groups. In Proc. 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 319-323.
Dubey, H., Sangwan, A., Hansen, J.H.L., 2017. Using speech technology for quantifying behavioral characteristics in peer-led team learning sessions. Computer Speech & Language 46 (2017), pp. 343-366.
Ergunay, S., Khoury, E., Lazaridis, A., Marcel, S., 2015. On the vulnerability of speaker verification to realistic voice spoofing. In Proc. 7th IEEE International Conference on Biometrics Theory, Applications and Systems (BTAS 2015), pp. 1–6.
Erro, D., Moreno, A., Bonafonte, A., 2010. Voice conversion based on weighted frequency warping. Computer Speech & Language 18 (2010), pp. 922–931.
Evans, N., Kinnunen, T., and Yamagishi, J., 2013. Spoofing and Countermeasures for Automatic Speaker Verification. In Proc. Interspeech 2013, pp. 925–929.
Evans, N., Yamagishi, J., Kinnunen, T., 2013b. Spoofing and countermeasures for speaker verification: a need for standard corpora, protocols and metrics. In IEEE Signal Processing Society Speech and Language Technical Committee Newsletter.
Fér, R., Matějka, P., Grézl, F., Plchot, O., Černocký, J., 2015. Multilingual Bottleneck Features for Language Recognition. In Proc. Interspeech 2015, pp. 389-393.
Fér, R., Matějka, P., Grézl, F., Plchot, O., Veselý, K., Černocký, J., 2017. Multilingually trained bottleneck features in spoken language recognition. Computer Speech & Language 46 (2017), pp. 252-267.
Fernando, S., Sethu, V., Ambikairajah, E., 2016a. A Feature Normalisation Technique for PLLR Based Language Identification Systems. In Proc. Interspeech 2016, pp. 2925-2929.
Fernando, S., Sethu, V., Ambikairajah, E., 2016b. Eigenfeatures: An alternative to Shifted Delta Coefficients for Language Identification. In Proc. 16th Speech Science and Technology Conference (SST 2016), pp. 253-256.
Fernando, S., Sethu, V., Ambikairajah, E., Epps, J., 2017. Bidirectional Modelling for Short Duration Language Identification. In Proc. Interspeech 2017, pp. 2809-2813.
Ferras, M., Madikeri, S., Motlicek, P., Bourlard, H., 2016a. System Fusion and Speaker Linking for Longitudinal Diarization of TV Shows. In Proc. IEEE ICASSP 2016, pp. 5495-5499.
Ferras, M., Madikeri, S., Bourlard, H., 2016b. Speaker Diarization and Linking of Meeting Data. IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol. 24, NO. 11, pp. 1935-1945, November 2016.
Ferrer, L., McLaren, M., Scheffer, N., Lei, Y., Graciarena, M., Mitra, V., 2013. A noise-robust system for NIST 2012 speaker recognition evaluation. In Proc. Interspeech 2013, pp. 1981-1985.
Ferrer, L., Lei, Y., McLaren, M., Scheffer, N., 2014. Spoken Language Recognition Based on Senone Posteriors. In Proc. Interspeech 2014, pp. 2150-2154.
Ferrer, L., Lei, Y., McLaren, M., Scheffer, N., 2016. Study of Senone-Based Deep Neural Network Approaches for Spoken Language Recognition. IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol. 24, NO. 1, pp. 105-116, January 2016.
Garcia-Romero, D., Espy-Wilson, C., 2011. Analysis of i-vector length normalization in speaker recognition systems. In Proc. Interspeech 2011, pp. 249–252.
Garcia-Romero, D., McCree, A., 2016. Stacked Long-Term TDNN for Spoken Language Recognition. In Proc. Interspeech 2016, pp. 3226-3230.
Garcia-Romero, D., Snyder, D., Sell, G., Povey, D., McCree, A., 2017. Speaker Diarization Using Deep Neural Network Embeddings. In Proc. IEEE ICASSP 2017, pp. 4930-4934.
Gelly, G., Gauvain, J.L., Le, V., Messaoudi, A., 2016. A Divide-and-Conquer Approach for Language Identification Based on Recurrent Neural Networks. In Proc. Interspeech 2016, pp. 3231-3235.
Gelly, G., Gauvain, J.L., 2017. Spoken Language Identification Using LSTM-Based Angular Proximity. In Proc. Interspeech 2017, pp. 2566-2570.
Ghaemmaghami, H., Dean, D., Sridharan, S., van Leeuwen, D.A., 2016. A study of speaker clustering for speaker attribution in large telephone conversation datasets. Computer Speech & Language 40 (2016), pp. 23-45.
Ghahabi, O., Bonafonte, A., Hernando, J., Moreno, A., 2016. Deep Neural Networks for i-Vector Language Identification of Short Utterances in Cars. In Proc. Interspeech 2016, pp. 367-371.
Ghias, A., Logan, J., Chamberlin, D., and Smith, B.C., 1995. Query By Humming – Musical Information Retrieval in an Audio Database. In Proc. 3rd ACM International Conference on Multimedia, pp. 231-236.
Godoy, E., Rosec, O., Chonavel, T., 2012. Voice conversion using dynamic frequency warping with amplitude scaling, for parallel or nonparallel corpora. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 20, NO. 4, pp. 1313–1323, May 2012.
Gonzalez-Dominguez, J., Lopez-Moreno, I., Sak, H., Gonzalez-Rodriguez, J., Moreno, P.J., 2014. Automatic language identification using long short-term memory recurrent neural networks. In Proc. Interspeech 2014, pp. 2155-2159.
Gonzalez-Dominguez, J., Lopez-Moreno, I., Moreno, P.J., Gonzalez-Rodriguez, J., 2015a. Frame-by-frame language identification in short utterances using deep neural networks. Neural Networks 64 (2015), pp. 49-58.
Gonzalez-Dominguez, J., Eustis, D., Lopez-Moreno, I., Senior, A., Beaufays, F., Moreno, P.J., 2015b. A Real-Time End-to-End Multilingual Speech Recognition Architecture. IEEE Journal of Selected Topics in Signal Processing, Vol. 9, NO. 4, pp. 749-759, June 2015.
Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning. MIT Press, 2016.
Gwon, Y.L., Campbell, W.M., Sturim, D.E., Kung, H., 2016. Language Recognition via Sparse Coding. In Proc. Interspeech 2016, pp. 2920-2924.
Hansen, J.H.L., Hasan, T., 2015. Speaker recognition by machines and humans: A tutorial review. IEEE Signal Processing Magazine 32 (2015), pp. 74-99.
Hansen, J.H.L., Liu, G., 2016. Unsupervised accent classification for deep data fusion of accent and language information. Speech Communication 78 (2016), pp. 19-33.
Hautamäki, V., Siniscalchi, S.M., Behravan, H., Salerno, V.M., Kukanov, I., 2015. Boosting Universal Speech Attributes Classification with Deep Neural Network for Foreign Accent Characterization. In Proc. Interspeech 2015, pp. 408-412.
Heigold, G., Moreno, I., Bengio, S., Shazeer, N., 2016. End-to-end text-dependent speaker verification. In Proc. IEEE ICASSP 2016, pp. 5115-5119.
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. R., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., Kingsbury, B., 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, Vol. 29, NO. 6, pp. 82-97, October 2012.
Irtza, S., Sethu, V., Le, P.N., Ambikairajah, E., Li, H., 2015. Phonemes Frequency based PLLR Dimensionality Reduction for Language Recognition. In Proc. Interspeech 2015, pp. 997-1001.
Jiang, B., Song, Y., Wei, S., Liu, J.-H., McLoughlin, I.V., Dai, L.-R., 2014. Deep Bottleneck Features for Spoken Language Identification. PLOS ONE, July 1, 2014, https://doi.org/10.1371/journal.pone.0100795.
Jin, M., Song, Y., McLoughlin, I., 2017. End-to-end DNN-CNN Classification for Language Identification. In Proc. of the World Congress on Engineering 2017 (WCE 2017), pp. 199-203.
Kacprzak, S., 2017. Spoken language clustering in the i-vectors space. In Proc. 2017 International Conference on Systems, Signals and Image Processing (IWSSIP), pp. 1-5.
Kenny, P., 2010. Bayesian Speaker Verification with Heavy-Tailed Priors. In Proc. Odyssey 2010. Invited Keynote.
Kenny, P., Stafylakis, T., Ouellet, P., Alam, M.J., Dumouchel, P., 2013. PLDA for speaker verification with utterances of arbitrary duration. In Proc. IEEE ICASSP 2013, pp. 7649–7653.
Kenny, P., Gupta, V., Stafylakis, T., Ouellet, P., Alam, J., 2014. Deep neural networks for extracting Baum-Welch statistics for speaker recognition. In Proc. Odyssey 2014, pp. 293–298.
Kheder, W.B., Matrouf, D., Ajili, M., Bonastre, J.F., 2016. Probabilistic Approach Using Joint Long and Short Session i-Vectors Modeling to Deal with Short Utterances for Speaker Recognition. In Proc. Interspeech 2016, pp. 1830-1834.
Kheder, W.B., Matrouf, D., Bousquet, P.M., Bonastre, J.F., Ajili, M., 2017. Fast i-vector denoising using MAP estimation and a noise distributions database for robust speaker recognition. Computer Speech & Language 45 (2017), pp. 104-122.
Khosravani, A., Homayounpour, M.M., 2017. A PLDA approach for language and text independent speaker recognition. Computer Speech & Language 45 (2017), pp. 457-474.
Khoury, E. et al. (37 additional authors), 2013. The 2013 speaker recognition evaluation in mobile environment. In Proc. International Conference on Biometrics (ICB 2013), pp. 1-8.
Khurana, S., Najafian, M., Ali, A., Hanai, T.A., Belinkov, Y., Glass, J., 2017. QMDIS: QCRI-MIT Advanced Dialect Identification System. In Proc. Interspeech 2017, pp. 2591-2595.
Kinnunen, T., Sahidullah, M., Kukanov, I., Delgado, H., Todisco, M., Sarkar, A.K., Thomsen, N.B., Hautamäki, V., Evans, N., Tan, Z., 2016. Utterance Verification for Text-Dependent Speaker Recognition: A Comparative Assessment Using the RedDots Corpus. In Proc. Interspeech 2016, pp. 430-434.
Kinnunen, T., Sahidullah, M., Delgado, H., Todisco, M., Evans, N., Yamagishi, J., Lee, K.A., 2017a. The ASVspoof 2017 Challenge: Assessing the Limits of Replay Spoofing Attack Detection. In Proc. Interspeech 2017, pp. 2-6.
Kinnunen, T., Sahidullah, M., Falcone, M., Costantini, L., Gonzalez-Hautamaki, R., Thomsen, D., Sarkar, A., Tan, Z.-H., Delgado, H., Todisco, M., Evans, N., Hautamaki, V., Lee, K.A., 2017b. RedDots Replayed: A New Replay Spoofing Attack Corpus for Text-Dependent Speaker Verification Research. In Proc. IEEE ICASSP 2017, pp. 5395-5399.
Konno, H., Toyama, J., Shimbo, M., Murata, K., 1996. The effect of formant frequency and spectral tilt of unvoiced vowels on their perceived pitch and phonemic quality. IEICE Technical Report, SP95-140, pp. 39–45, March 1996.
Korshunov, P., Marcel, S., Muckenhirn, H., Gonçalves, A. R., Mello, G. S., Violato, R. P. V., Simoes, F. O., Neto, M. U., de Assis Angeloni, M., Stuchi, J. A., Dinkel, H., Chen, N., Qian, Y., Paul, D., Saha, G. and Sahidullah, M., 2016. Overview of BTAS 2016 speaker anti-spoofing competition. In Proc. IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS 2016), pp. 1-6.
Kruspe, A.M., 2016. Phonotactic Language Identification for Singing. In Proc. Interspeech 2016, pp. 3319-3323.
Lamel, L., 2012. Multilingual Speech Processing Activities in Quaero: Application to Multimedia Search in Unstructured Data. In Human Language Technologies - The Baltic Perspective, A. Tavast et al. (Eds.), pp. 1-8. IOS Press, 2012.
Larcher, A., Lee, K.A., Ma, B., Li, H., 2014. Text-dependent speaker verification: Classifiers, databases and RSR2015. Speech Communication 60 (2014), pp. 56-77.
Lau, Y.W., Wagner, M., Tran, D., 2004. Vulnerability of speaker verification to voice mimicking. In Proc. 2004 IEEE International Symposium on Intelligent Multimedia, Video and Speech Processing, pp. 145–148.
LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521 (7553), pp. 436-444.
Lee, K.A., Larcher, A., Wang, G., Kenny, P., Brümmer, N., van Leeuwen, D., Aronowitz, H., Kockmann, M., Vaquero, C., Ma, B., Li, H., Stafylakis, T., Alam, M.J., Swart, A., Perez, J., 2015. The RedDots data collection for speaker recognition. In Proc. Interspeech 2015, pp. 2996-3000.
Lei, Y., Scheffer, N., Ferrer, L., McLaren, M., 2014a. A novel scheme for speaker recognition using a phonetically-aware deep neural network. In Proc. IEEE ICASSP 2014, pp. 1714-1718.
Lei, Y., Ferrer, L., Lawson, A., McLaren, M., Scheffer, N., 2014b. Application of Convolutional Neural Networks to Language Identification in Noisy Conditions. In Proc. Odyssey 2014, pp. 287-292.
Le Lan, G., Charlet, D., Larcher, A., Meignier, S., 2016. Iterative PLDA Adaptation for Speaker Diarization. In Proc. Interspeech 2016, pp. 2175-2179.
Li, H., Ma, B., 2010. TechWare: Speaker and Spoken Language Recognition Resources [Best of the Web]. IEEE Signal Processing Magazine, Vol. 27, NO. 6, pp. 139-142, November 2010.
Li, H., Ma, B., Lee, K.A., 2013. Spoken language recognition: from fundamentals to practice. Proceedings of the IEEE, Vol. 101, NO. 5, pp. 1136-1159, May 2013.
Li, R., Mallidi, S.H., Burget, L., Plchot, O., Dehak, N., 2016. Exploiting Hidden-Layer Responses of Deep Neural Networks for Language Recognition. In Proc. Interspeech 2016, pp. 3265-3269.
Lin, W.W., Mak, M.W., Chien, J.T., 2017. Fast scoring for PLDA with uncertainty propagation via i-vector grouping. Computer Speech & Language 45 (2017), pp. 503-515.
Lindberg, J., Blomberg, M., 1999. Vulnerability in speaker verification - a study of technical impostor techniques. In Proc. Eurospeech 1999, pp. 1211–1214.
Lopez-Cozar, R., Araki, M., 2005. Spoken, multilingual and multimodal dialogue systems: Development and assessment. John Wiley & Sons, 2005.
Lopez-Moreno, I., Gonzalez-Dominguez, J., Plchot, O., Martinez, D., Gonzalez-Rodriguez, J., Moreno, P.J., 2014. Automatic language identification using deep neural networks. In Proc. IEEE ICASSP 2014, pp. 5337-5341.
Lopez-Moreno, I., Gonzalez-Dominguez, J., Martinez, D., Plchot, O., Gonzalez-Rodriguez, J., Moreno, P.J., 2016. On the use of deep feedforward neural networks for automatic language identification. Computer Speech & Language 40 (2016), pp. 46-59.
Lozano-Diez, A., Zazo-Candil, R., Gonzalez-Dominguez, J., Toledano, D.T., Gonzalez-Rodriguez, J., 2015. An End-to-end Approach to Language Identification in Short Utterances using Convolutional Neural Networks. In Proc. Interspeech 2015, pp. 403-407.
Magariños, C., Lopez-Otero, P., Docio-Fernandez, L., Erro, D., Banga, E., Garcia-Mateo, C., 2016. Piecewise linear definition of transformation functions for speaker de-identification. In Proc. First International Workshop on Sensing, Processing and Learning for Intelligent Machines (SPLINE), pp. 1–5.
Magariños, C., Lopez-Otero, P., Docio-Fernandez, L., Rodriguez-Banga, E., Erro, D., Garcia-Mateo, C., 2017. Reversible speaker de-identification using pre-trained transformation functions. Computer Speech & Language 46 (2017), pp. 36-52.
Mak, M.W., Yu, H., 2014. A study of voice activity detection techniques for NIST speaker recognition evaluations. Computer Speech & Language 28 (2014), pp. 295–313.
Marcos, L.M., Richardson, F., 2016. Multi-lingual deep neural networks for language recognition. In Proc. 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 330-334.
Martin, A., Garofolo, J.S., 2007. NIST Speech Processing Evaluations: LVCSR, Speaker Recognition, Language Recognition. In Proc. 2007 IEEE Workshop on Signal Processing Applications for Public Security and Forensics, pp. 1-7.
Martinez, D.G., Plchot, O., Burget, L., Glembek, O., Matějka, P., 2011. Language recognition in iVectors space. In Proc. Interspeech 2011, pp. 861–864.
Matějka, P., Zhang, L., Ng, T., Mallidi, S.H., Glembek, O., Ma, J., Zhang, B., 2014. Neural Network Bottleneck Features for Language Identification. In Proc. Odyssey 2014, pp. 299-304.
Matějka, P., Glembek, O., Novotny, O., Plchot, O., Grézl, F., Burget, L., Černocký, J., 2016. Analysis of DNN approaches to speaker identification. In Proc. IEEE ICASSP 2016, pp. 5100–5104.
McCree, A., Sell, G., Garcia-Romero, D., 2017. Extended Variability Modeling and Unsupervised Adaptation for PLDA Speaker Recognition. In Proc. Interspeech 2017, pp. 1552-1556.
McLaren, M., Lei, Y., Ferrer, L., 2015a. Advances in deep neural network approaches to speaker recognition. In Proc. IEEE ICASSP 2015, pp. 4814-4818.
McLaren, M., Graciarena, M., Lei, Y., 2015b. SoftSAD: Integrated frame-based speech confidence for speaker recognition. In Proc. IEEE ICASSP 2015, pp. 4694–4698.
McLaren, M., Ferrer, L., Castan, D., Lawson, A., 2016a. The Speakers in the Wild (SITW) Speaker Recognition Database. In Proc. Interspeech 2016, pp. 812-822.
McLaren, M., Ferrer, L., Castan, D., Lawson, A., 2016b. The 2016 Speakers in the Wild Speaker Recognition Evaluation. In Proc. Interspeech 2016, pp. 823-827.
McLaren, M., Ferrer, L., Castan, D., Lawson, A., 2017. Calibration Approaches for Language Detection. In Proc. Interspeech 2017, pp. 2804-2808.
Miguel, A., Llombart, J., Ortega, A., Lleida, E., 2017. Tied Hidden Factors in Neural Networks for End-to-End Speaker Recognition. In Proc. Interspeech 2017, pp. 2819-2823.
Milner, R., Hain, T., 2016. DNN-Based Speaker Clustering for Speaker Diarisation. In Proc. Interspeech 2016, pp. 2185-2189.
Misra, A., Hansen, J.H., 2014. Spoken language mismatch in speaker verification: An investigation with NIST-SRE and CRSS bi-ling corpora. In Proc. 2014 IEEE Spoken Language Technology Workshop (SLT), pp. 372-377.
Moattar, M.H., Homayounpour, M.M., 2012. A review on speaker diarization systems and approaches. Speech Communication 54 (2012), pp. 1065-1103.
Molina, G., Rey-Villamizar, N., Solorio, T., AlGhamdi, F., Ghoneim, M., Hawwari, A., Diab, M., 2016. Overview for the Second Shared Task on Language Identification in Code-Switched Data. In Proc. 2nd Workshop on Computational Approaches to Code Switching, pp. 40–49, Austin, TX, November, 2016.
Muthusamy, Y.K., Barnard, E., Cole, R.A., 1994. Reviewing automatic language identification. IEEE Signal Processing Magazine, Vol. 11, NO. 4, pp. 33–41, October 1994.
Nercessian, S., Torres-Carrasquillo, P., Martinez-Montes, G., 2016. Approaches for language identification in mismatched environments. In Proc. 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 335-340.
Ng, R.W.M., Nicolao, M., Hain, T., 2017. Unsupervised crosslingual adaptation of tokenisers for spoken language recognition. Computer Speech & Language 46 (2017), pp. 327-342.
Novotný, O., Matějka, P., Plchot, O., Glembek, O., Burget, L., Černocký, J., 2016. Analysis of Speaker Recognition Systems in Realistic Scenarios of the SITW 2016 Challenge. In Proc. Interspeech 2016, pp. 828-832.
Patil, H.A., Madhavi, M.C., 2017. Combining evidences from magnitude and phase information using VTEO for person recognition using humming. Computer Speech & Language (in press).
Perrot, P., Aversano, G., Blouet, R., Charbit, M., and Chollet, G., 2005. Voice Forgery Using ALISP: Indexation in a Client Memory. In Proc. IEEE ICASSP 2005, pp. 17–20.
Pešán, J., Burget, L., Černocký, J., 2016. Sequence Summarizing Neural Networks for Spoken Language Recognition. In Proc. Interspeech 2016, pp. 3285-3288.
Plchot, O., Diez, M., Soufifar, M., Burget, L., 2014. PLLR Features in Language Recognition System for RATS. In Proc. Interspeech 2014, pp. 3047-3051.
Pobar, M., Ipsic, I., 2014. Online speaker de-identification using voice transformation. In Proc. 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO 2014), pp. 1264–1267.
Rahman, M.H., Kanagasundaram, A., Himawan, I., Dean, D., Sridharan, S., 2018. Improving PLDA speaker verification performance using domain mismatch compensation techniques. Computer Speech & Language 47 (2018), pp. 240-258.
Ramasubramanian, V., 2012. Speaker Spotting: Automatic Telephony Surveillance for Homeland Security. In A. Neustein and H. A. Patil (Eds.), Forensic Speaker Recognition: Law Enforcement and Counter-Terrorism, Springer-Verlag New York, 2012.
Reynolds, D.A., Quatieri, T.F., Dunn, R.B., 2000. Speaker Verification Using Adapted Gaussian Mixture Models. Digital Signal Processing, Vol. 10, NO. 1-3, pp. 19–41, January 2000.
Reynolds, D. A., 2002. An overview of automatic speaker recognition technology. In Proc. IEEE ICASSP 2002, pp. 4072–4075.
Richardson, F., Reynolds, D., Dehak, N., 2015. Deep neural network approaches to speaker and language recognition. IEEE Signal Processing Letters, Vol. 22, NO. 10, pp. 1671–1675, October 2015.
Rodriguez-Fuentes, L. J., Brümmer, N., Penagarikano M., Varona, A., Diez, M., and Bordel, G., 2013. The Albayzin 2012 Language Recognition Evaluation. In Proc. Interspeech 2013, pp. 1497–1501.
Rose, P., 2017. Likelihood ratio-based forensic voice comparison with higher level features: research and reality. Computer Speech & Language 45 (2017), pp. 475-502.
Rouvier, M., Favre, B., 2016. Investigation of Speaker Embeddings for Cross-Show Speaker Diarization. In Proc. IEEE ICASSP 2016, pp. 5585-5589.
Sadjadi, S.O., Pelecanos, J.W., Ganapathy, S., 2015. Nearest neighbor discriminant analysis for language recognition. In Proc. IEEE ICASSP 2015, pp. 4205-4209.
Sagha, H., Matějka, P., Gavryukova, M., Povolny, F., Marchi, E., Schuller, B., 2016. Enhancing Multilingual Recognition of Emotion in Speech by Language Identification. In Proc. Interspeech 2016, pp. 2949-2953.
Salmun, I., Opher, I., Lapidot, I., 2016. On the Use of PLDA i-vector Scoring for Clustering Short Segments. In Proc. Odyssey 2016, pp. 407-414.
Salmun, I., Shapiro, I., Opher, I., Lapidot, I., 2017. PLDA-based mean shift speakers’ short segments clustering. Computer Speech & Language 45 (2017), pp. 411-436.
Sarria-Paja, M., Falk, T.H., 2017. Fusion of auditory inspired amplitude modulation spectrum and cepstral features for whispered and normal speech speaker verification. Computer Speech & Language 45 (2017), pp. 437-456.
Sell, G., Garcia-Romero, D., McCree, A., 2015. Speaker Diarization with I-Vectors from DNN Senone Posteriors. In Proc. Interspeech 2015, pp. 3096-3099.
Senoussaoui, M., Kenny, P., Dumouchel, P., Stafylakis, T., 2013. Efficient Iterative Mean Shift based Cosine Dissimilarity for Multi-Recording Speaker Clustering. In Proc. IEEE ICASSP 2013, pp. 4311-4315.
Senoussaoui, M., Kenny, P., Stafylakis, T., Dumouchel, P., 2014. A study of the cosine distance-based mean shift for telephone speech diarization. IEEE Transactions on Audio, Speech and Language Processing, Vol. 22, NO. 1, pp. 217–227, January 2014.
Shapiro, I., Rabin, N., Opher, I., Lapidot, I., 2015. Clustering short push-to-talk segments. In Proc. Interspeech 2015, pp. 3031-3035.
Sholokhov, A., Sahidullah, M., Kinnunen, T., 2018. Semi-supervised speech activity detection with an application to automatic speaker verification. Computer Speech & Language 47 (2018), pp. 132-156.
Silovsky, J., Prazak, J., Cerva, P., Zdansky, J., Nouza, J., 2011. PLDA-based clustering for speaker diarization of broadcast streams. In Proc. Interspeech 2011, pp. 2909–2912.
Silovsky, J., Prazak, J., 2012. Speaker diarization of broadcast streams using two-stage clustering based on i-vectors and cosine distance scoring. In Proc. IEEE ICASSP 2012, pp. 4193–4196.
Siniscalchi, S.M., Reed, J., Svendsen, T., Lee, C.-H., 2013. Universal attribute characterization of spoken languages for automatic spoken language recognition. Computer Speech & Language 27 (2013), pp. 209-227.
Sizov, A., Lee, K.A., Kinnunen, T., 2014. Unifying Probabilistic Linear Discriminant Analysis Variants in Biometric Authentication. In P. Franti et al. (Eds.): S+SSPR 2014, LNCS 8621, pp. 464–475. Springer-Verlag Berlin Heidelberg, 2014.
Sizov, A., Lee, K.A., Kinnunen, T., 2017. Direct Optimization of the Detection Cost for I-Vector- Based Spoken Language Recognition. IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol. 25, NO. 3, pp. 588-597, March 2017.
Snyder, D., Garcia-Romero, D., Povey, D., 2015. Time delay deep neural network-based universal background models for speaker recognition. In Proc. 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 92–97.
Solewicz, Y.A., Jessen, M., Vloed, D.V.D., 2017. Null-Hypothesis LLR: A Proposal for Forensic Automatic Speaker Recognition. In Proc. Interspeech 2017, pp. 2849-2853.
Sturim, D.E., Campbell, W.M., 2016. Speaker Linking and Applications Using Non-Parametric Hashing Methods. In Proc. Interspeech 2016, pp. 2170-2174.
Stylianou, Y., 2009. Voice Transformation: A survey. In Proc. IEEE ICASSP 2009, pp. 3585–3588.
Tian, Y., Cai, M., He, L., Liu, J., 2015. Investigation of bottleneck features and multilingual deep neural networks for speaker verification. In Proc. Interspeech 2015, pp. 1151–1155.
Todisco, M., Delgado, H., Evans, N., 2017. Constant Q cepstral coefficients: A spoofing countermeasure for automatic speaker verification. Computer Speech & Language 45 (2017), pp. 516-535.
Tranter, S., Reynolds, D., 2006. An overview of automatic speaker diarization systems. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, NO. 5, pp. 1557–1565, September 2006.
Trong, T.N., Hautamäki, V., Lee, K.A., 2016. Deep Language: a comprehensive deep learning approach to end-to-end language recognition. In Proc. Odyssey 2016, pp. 109-116.
Unal, E., Chew, E., Georgiou, P.G., and Narayanan, S.S., 2008. Challenging Uncertainty in Query by Humming Systems: A Fingerprinting Approach. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 16, NO. 2, pp. 359-371, February 2008.
Villalba, J., Lleida, E., 2011. Detecting replay attacks from far-field recordings on speaker verification systems. In Proc. 2011 European Workshop on Biometrics and ID Management (BioID 2011), Brandenburg (Havel), Germany, pp. 274-285.
Villalba, J., Lleida, E., Ortega, A., Miguel, A., 2013. The I3A speaker recognition system for NIST SRE12: post-evaluation analysis. In Proc. Interspeech 2013, pp. 3689–3693.
Villalba, J., Ortega, A., Miguel, A., Lleida, E., 2015. Variational Bayesian PLDA for Speaker Diarization in the MGB Challenge. In Proc. 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 667-674.
Viñals, I., Ortega, A., Villalba, J., Miguel, A., Lleida, E., 2017. Domain Adaptation of PLDA Models in Broadcast Diarization by Means of Unsupervised Speaker Clustering. In Proc. Interspeech 2017, pp. 2829-2833.
Vogt, R., Sridharan, S., Mason, M., 2010. Making Confident Speaker Verification Decisions With Minimal Speech. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 18, NO. 6, pp. 1182-1192, August 2010.
Woubie, A., Luque, J., Hernando, J., 2015. Using Voice-quality Measurements with Prosodic and Spectral Features for Speaker Diarization. In Proc. Interspeech 2015, pp. 3100-3104.
Woubie, A., Luque, J., Hernando, J., 2016. Improving i-Vector and PLDA Based Speaker Clustering with Long-Term Features. In Proc. Interspeech 2016, pp. 372-376.
Wu, Z., Evans, N., Kinnunen, T., Yamagishi, J., Alegre, F., Li, H., 2015. Spoofing and Countermeasures for Speaker Verification: a Survey. Speech Communication 66 (2015), pp. 130-153.
Yamada, T., Wang, L., Kai, A., 2013. Improvement of distant-talking speaker identification using bottleneck features of DNN. In Proc. Interspeech 2013, pp. 3661–3664.
Yin, R., Bredin, H., Barras, C., 2017. Speaker Change Detection in Broadcast TV Using Bidirectional Long Short-Term Memory Networks. In Proc. Interspeech 2017, pp. 3827-3831.
Yu, C., Hansen, J.H.L., 2017. Active Learning Based Constrained Clustering For Speaker Diarization. IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol. 25, NO. 11, pp. 2188-2198, November 2017.
Zazo, R., Lozano-Diez, A., Gonzalez-Dominguez, J., Toledano, D.T., Gonzalez-Rodriguez, J., 2016. Language Identification in Short Utterances Using Long Short-Term Memory (LSTM) Recurrent Neural Networks. PLOS ONE, January 29, 2016, https://doi.org/10.1371/journal.pone.0146917.
Zeinali, H., Sameti, H., Burget, L., Černocký, J., 2017a. Text-dependent speaker verification based on i-vectors, Neural Networks and Hidden Markov Models. Computer Speech & Language 46 (2017), pp. 53-71.
Zeinali, H., Sameti, H., Burget, L., 2017b. HMM-based phrase-independent i-vector extractor for text-dependent speaker verification. IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol. 25, NO. 7, pp. 1421-1435, July 2017.
Zen, H., Tokuda, K., and Black, A. W., 2009. Statistical parametric speech synthesis. Speech Communication 51 (2009), pp. 1039–1064.
Zhang, Q., Hansen, J.H.L., 2016. Unsupervised k-means clustering based out-of-set candidate selection for robust open-set language recognition. In Proc. 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 324-329.
Zhang, Q., Hansen, J.H.L., 2017. Dialect Recognition Based on Unsupervised Bottleneck Features. In Proc. Interspeech 2017, pp. 2576-2580.
