Tharshini Gunendradasan; University of New South Wales
Saad Irtza; University of New South Wales
Eliathamby Ambikairajah; University of New South Wales
Julien Epps; University of New South Wales
https://ieeexplore.ieee.org/document/8682771
Transmission Line Cochlear Model Based AM-FM Features for Replay Attack Detection
Abstract:
This paper focuses on providing a countermeasure to replay attacks, which are the simplest and most accessible form of attack used to spoof automatic speaker verification systems.
Specifically, it proposes the use of the transmission line cochlear model, which resembles the human cochlea more accurately than parallel filter bank models, in the front-end of replay detection systems.
Here the basilar membrane is modeled as a cascade of digital filters with decreasing resonant frequencies.
In this context, we propose two features – transmission line cochlea-amplitude modulation (TLC-AM) and frequency modulation (TLC-FM) – to extract the modulation features of the speech from the simulated membrane displacements.
TLC-AM is analogous to the output of the inner hair cell bending movement, which accurately captures the amplitude modulation component of the speech.
TLC-FM is extracted by deriving the in-phase and out-of-phase signals of the basilar membrane displacement.
Results show that individual TLC-AM and TLC-FM features perform better than the best parallel filter bank baseline system.
Experiments suggest that higher frequency selectivity is beneficial for replay detection, especially for AM, and the proposed TLC model is better able to achieve this property than parallel filter bank models.
SECTION 1. INTRODUCTION
Automatic speaker verification (ASV) is a mature technology that uses voice biometrics to verify a person’s identity [1].
As speech can be assessed remotely and the deployment of speaker verification is simple and cost-effective, it has been adopted by several applications for secure verification, e.g. telephone banking and physical access control.
Although current ASV systems verify identity with high accuracy and a low equal error rate (EER), their vulnerability to spoofing attacks has been shown to be significant [2], which dramatically affects the reliability and security of these systems.
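As a rough illustration of the EER metric mentioned above, the following minimal sketch sweeps a decision threshold over a set of scores and returns the operating point where the false acceptance rate and the false rejection rate are closest. The function name `equal_error_rate` and the toy scores are assumptions made for this example; evaluation toolkits typically interpolate between thresholds rather than picking the nearest one.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Return (eer, threshold) at the threshold where FAR and FRR are closest.

    A trial is accepted when its score is >= threshold (an assumption of
    this sketch). FAR is the fraction of impostor trials accepted; FRR is
    the fraction of genuine trials rejected.
    """
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best = None
    for t in thresholds:
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        frr = sum(g < t for g in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the midpoint of FAR and FRR as the EER estimate.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores, invented purely for illustration.
genuine = [0.9, 0.8, 0.75, 0.6, 0.4]
impostor = [0.7, 0.5, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.2f} at threshold {threshold}")
```

With these toy scores, one genuine trial in five is rejected and one impostor trial in five is accepted at the selected threshold, so the two error rates coincide.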
There are four main types of spoofing attacks currently under consideration: replay [3], speech-synthesis (SS) [4], voice conversion (VC) [5] and impersonation [6].
Replay attacks involve playing back the recorded speech of the genuine target speaker to spoof the system.
Among all these attacks, replay poses the biggest threat because high-quality recording devices and smartphones are widely available and no advanced technical knowledge or effort is required [2].
This paper focuses on providing countermeasures for replay spoofing attacks, that is, on determining whether a given speech signal is genuine or replayed.
Various countermeasures have recently been proposed for replay detection.
Mel filter bank slope and linear filter bank slope features [7] have been proposed to capture low and high frequency spectral information separately.
In [8], static and dynamic characteristics of the modulation spectrum were fused with short-time magnitude features to improve system performance.
Linear prediction based features were proposed in [9], [10].
Spectral centroid based amplitude and frequency modulation features were proposed in [11].
Apart from these front-end features, different neural network architectures, e.g. convolutional neural networks and Siamese networks, have been proposed and shown to be effective in discriminating replay attacks [12], [13], [14].
Replayed utterances contain both additive and convolutional distortions introduced by recording and playback devices [15].
As replay attacks involve multiple recording and playback stages, the replayed speech is also affected by noise [16].
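The two kinds of distortion described above can be pictured with a simple signal model: the replayed signal is the genuine signal convolved with a channel impulse response (the convolutional part) plus noise (the additive part). The sketch below is only an illustration of that generic model; the function names, the short impulse response, and the Gaussian noise are assumptions and are not taken from [15] or [16].

```python
import random


def convolve(x, h):
    """Full linear convolution of two sequences of floats."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def simulate_replay(x, h, noise_std=0.0, seed=0):
    """Hypothetical replay model: y[n] = (x * h)[n] + noise[n].

    x         : genuine signal samples
    h         : assumed combined impulse response of the recording and
                playback devices (convolutional distortion)
    noise_std : standard deviation of the additive Gaussian noise
    """
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, noise_std) for v in convolve(x, h)]


genuine = [1.0, 2.0, 3.0]
channel = [0.5, 0.25]  # toy impulse response, invented for illustration
replayed_clean = simulate_replay(genuine, channel)
replayed_noisy = simulate_replay(genuine, channel, noise_std=0.01)
print(replayed_clean)
print(replayed_noisy)
```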
The amplitude-based features of the signal can capture these distortions.
It has been suggested that the changes in spectral envelope due to the channel characteristics of intermediate devices can be captured by phase-based features [17].
As instantaneous information can capture the dynamics and time evolution of speech features, instantaneous amplitude and phase-based features can effectively capture these distortions.
The contribution of these two feature types to discriminating genuine and replayed speech has been exploited in the following previous work.
In [18], auditory filter banks were learned using ConvRBM, and AM and FM features were extracted using the conventional energy separation algorithm.
VESA-IACC and VESA-IFCC features were proposed in [19], where instantaneous amplitude (IA) and instantaneous frequency (IF) were estimated using the VESA algorithm from the Gabor filtered subband signal.
Moreover, the importance of these two features for detecting SS and VC was also analyzed in [20].
The above-mentioned methods use parallel filter banks to obtain the subband signals for instantaneous amplitude and phase extraction.
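To make the notions of instantaneous amplitude and instantaneous frequency concrete, here is a minimal sketch that assumes a subband signal is already available as an in-phase and quadrature pair. The amplitude is the magnitude of the pair, and the frequency is the scaled sample-to-sample phase increment. This is the generic textbook definition, not the energy separation or VESA algorithms of the cited work, and the function name and test signal are assumptions made for illustration.

```python
import math


def instantaneous_am_fm(in_phase, quadrature, fs):
    """Estimate instantaneous amplitude and frequency (Hz).

    in_phase, quadrature : equal-length sample sequences of one subband
    fs                   : sampling rate in Hz
    Returns (amplitude, frequency); frequency has one fewer sample
    because it is computed from first differences of the phase.
    """
    amplitude = [math.hypot(i, q) for i, q in zip(in_phase, quadrature)]
    phase = [math.atan2(q, i) for i, q in zip(in_phase, quadrature)]
    frequency = []
    for prev, cur in zip(phase, phase[1:]):
        # Wrap the phase increment into [-pi, pi) to undo atan2 jumps.
        d = (cur - prev + math.pi) % (2 * math.pi) - math.pi
        frequency.append(d * fs / (2 * math.pi))
    return amplitude, frequency


# Synthetic test tone: 500 Hz carrier with a slow 20 Hz amplitude modulation.
fs, f0, n = 8000.0, 500.0, 64
envelope = [1.0 + 0.5 * math.cos(2 * math.pi * 20.0 * k / fs) for k in range(n)]
i_sig = [envelope[k] * math.cos(2 * math.pi * f0 * k / fs) for k in range(n)]
q_sig = [envelope[k] * math.sin(2 * math.pi * f0 * k / fs) for k in range(n)]
amp, freq = instantaneous_am_fm(i_sig, q_sig, fs)
```

On this synthetic tone the recovered amplitude follows the envelope and the recovered frequency stays at the carrier frequency, which is a useful sanity check before applying such a routine to real subband signals.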
The cochlea in the human/mammalian auditory system, which inherently has extraordinary frequency sensitivity and selectivity, can be more accurately approximated by a transmission line model [21].
In the transmission line model, the cochlea is represented as a cascade of digital filters, which helps to achieve a sharper roll-off even with lower-order filters, allowing high frequency selectivity [21].
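The reason a cascade steepens the roll-off is that the magnitude responses of the stages multiply, so their attenuations in dB add. The sketch below illustrates this with first-order low-pass sections, which are simple stand-ins chosen for this example and are not the filters used in the model of [21]; the function names and cutoff values are likewise assumptions.

```python
import cmath
import math


def one_pole_lowpass_gain(freq_hz, cutoff_hz, fs):
    """Magnitude of H(z) = (1 - a) / (1 - a z^-1) at freq_hz."""
    a = math.exp(-2 * math.pi * cutoff_hz / fs)
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs)
    return abs((1 - a) / (1 - a * z_inv))


def cascade_gain_db(freq_hz, cutoffs_hz, fs):
    """Gain in dB at the output of a cascade of one-pole sections.

    The output tap after the last stage sees the product of all the
    stage responses that precede it.
    """
    gain = 1.0
    for fc in cutoffs_hz:
        gain *= one_pole_lowpass_gain(freq_hz, fc, fs)
    return 20 * math.log10(gain)


fs = 16000.0
single = cascade_gain_db(6000.0, [1000.0], fs)
# Stages with decreasing cutoff frequencies, loosely mimicking the
# decreasing resonant frequencies along the basilar membrane.
cascade = cascade_gain_db(6000.0, [4000.0, 3000.0, 2000.0, 1000.0], fs)
print(f"single stage: {single:.1f} dB, four-stage cascade: {cascade:.1f} dB")
```

At a frequency above the cutoffs, the four-stage cascade attenuates more than the final stage alone, even though every stage is only first order.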
The importance of the choice of filter shape in discriminating genuine and replayed speech was reported in [22].
Motivated by these findings, we hypothesize that analyzing the AM and FM components of the signal with high frequency sensitivity and resolution would be effective in capturing differences between genuine and spoofed speech.
Thus, we propose the transmission line cochlear model to extract the AM and FM components of the speech.
We refer to the proposed features as transmission line cochlea AM and FM (TLC-AM, TLC-FM).