ICASSP 2019----Transmission Line Cochlear Model Based AM-FM Features for Replay Attack Detection

Tharshini Gunendradasan; University of New South Wales
Saad Irtza; University of New South Wales
Eliathamby Ambikairajah; University of New South Wales
Julien Epps; University of New South Wales

This paper focuses on providing a countermeasure to replay attacks, which are the simplest and most accessible form of attack used to spoof automatic speaker verification systems.
Specifically, it proposes the use of the transmission line cochlear model, which resembles the human cochlea more accurately than parallel filter bank models, in the front-end of replay detection systems.
Here the basilar membrane is modeled as a cascade of digital filters with decreasing resonant frequencies.
In this context, we propose two features, transmission line cochlea amplitude modulation (TLC-AM) and transmission line cochlea frequency modulation (TLC-FM), to extract the modulation features of speech from the simulated membrane displacements.
TLC-AM is analogous to the output of the inner hair cell bending movement, which accurately captures the amplitude modulation component of the speech.
TLC-FM is extracted by deriving the in-phase and out-of-phase signals of the basilar membrane displacement.
Results show that individual TLC-AM and TLC-FM features perform better than the best parallel filter bank baseline system.
Experiments suggest that higher frequency selectivity is beneficial for replay detection, especially for AM, and the proposed TLC model is better able to achieve this property than parallel filter bank models.

Automatic speaker verification (ASV) is a mature technology that uses voice biometrics to verify a person’s identity [1].
Because speech can be assessed remotely and speaker verification is simple and cost-effective to deploy, it has been adopted in several applications that require secure verification, such as telephone banking and physical access control.
Although current ASV systems verify identity with high accuracy and a low equal error rate (EER), their vulnerability to spoofing attacks has been shown to be significant [2], which dramatically affects the reliability and security of these systems.
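For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how it can be approximated from two lists of scores. The function name, the threshold sweep, and the convention that higher scores mean "more likely genuine" are assumptions made for illustration and are not taken from the cited work.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate the EER by sweeping the threshold over all observed scores.

    Assumption: a higher score means "more likely genuine".
    """
    best_gap, eer = float("inf"), None
    for t in sorted(set(genuine_scores) | set(spoof_scores)):
        # False rejection: a genuine trial scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance: a spoofed trial scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint of the two rates where they are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With finite score lists the two error rates rarely cross exactly, so this sketch reports the midpoint at the threshold where they are closest. Evaluation toolkits typically interpolate instead.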
There are four main types of spoofing attacks currently under consideration: replay [3], speech-synthesis (SS) [4], voice conversion (VC) [5] and impersonation [6].
Replay attacks involve playing back the recorded speech of the genuine target speaker to spoof the system.
Among these attacks, replay poses the greatest threat because high-quality recording devices and smartphones are widely available and no advanced technical knowledge or effort is required [2].
This paper focuses on countermeasures for replay spoofing attacks, which determine whether a given utterance is genuine or replayed.
Different countermeasures have been recently proposed for replay detection.
Mel filter bank slope and linear filter bank slope features [7] have been proposed to capture low and high frequency spectral information separately.
In [8], static and dynamic characteristics of the modulation spectrum were fused with short time magnitude features to improve the system performance.
Linear prediction based features were proposed in [9], [10].
Spectral centroid based amplitude and frequency modulation features were proposed in [11].
Apart from these front-end features, different neural network architectures, e.g. convolutional neural networks and Siamese networks, have been proposed that are effective in discriminating replay attacks [12], [13], [14].
Replayed utterances contain both additive and convolutional distortions introduced by recording and playback devices [15].
As replay attacks involve multiple recording and playback stages, the replayed speech is also affected by noise [16].
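The two kinds of distortion can be made concrete with a toy signal model in which the replayed signal is the genuine signal convolved with a device impulse response, plus additive noise. The sketch below assumes this simple linear model. The names `convolve` and `simulate_replay`, the impulse response `h`, and the Gaussian noise are hypothetical illustrations, not the characterization used in [15] or [16].

```python
import random

def convolve(x, h):
    """Full linear convolution of signal x with impulse response h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y

def simulate_replay(x, h, noise_std=0.0, seed=0):
    """Toy replay model: y[n] = (x * h)[n] + w[n].

    h stands in for the combined recording/playback device response
    (convolutional distortion); w is zero-mean Gaussian noise
    (additive distortion). Both are illustrative assumptions.
    """
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, noise_std) for v in convolve(x, h)]
```

For instance, `simulate_replay(x, [1.0, 0.5], noise_std=0.01)` would apply a short, hypothetical two-tap channel and a small amount of noise to `x`.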
The amplitude-based features of the signal can capture these distortions.
It has been suggested that the changes in spectral envelope due to the channel characteristic of intermediate devices can be captured by phase-based features [17].
As instantaneous information can capture the dynamic and time evolution of the speech features, instantaneous amplitude and phase-based features can effectively capture these distortions.
The contribution of these two features in discriminating genuine and replayed speech has been exploited in the previous work described below.
In [18], auditory filter banks were learned using ConvRBM, and AM and FM features were extracted using the conventional energy separation algorithm.
VESA-IACC and VESA-IFCC features were proposed in [19], where instantaneous amplitude (IA) and instantaneous frequency (IF) were estimated using the VESA algorithm from the Gabor filtered subband signal.
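To illustrate what energy separation does, the sketch below implements DESA-2, one widely used discrete form of the algorithm based on the Teager-Kaiser energy operator. It is offered only as an illustration of the general idea: whether [18] uses this exact variant is not stated here, and the variable-length VESA of [19] is not reproduced. The function names are hypothetical.

```python
import math

def teager_energy(x):
    """Teager-Kaiser energy psi[n] = x[n]^2 - x[n-1]*x[n+1] for interior samples."""
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

def desa2(x):
    """Estimate instantaneous amplitude and frequency (rad/sample) with DESA-2.

    Uses the symmetric difference y[n] = x[n+1] - x[n-1]:
      frequency ~ 0.5 * arccos(1 - psi[y] / (2 * psi[x]))
      amplitude ~ 2 * psi[x] / sqrt(psi[y])
    Estimates are returned for samples n = 2 .. len(x) - 3.
    """
    y = [x[n + 1] - x[n - 1] for n in range(1, len(x) - 1)]
    psi_x = teager_energy(x)  # psi_x[i] belongs to sample n = i + 1
    psi_y = teager_energy(y)  # psi_y[k] belongs to sample n = k + 2
    amplitude, frequency = [], []
    for k, py in enumerate(psi_y):
        px = psi_x[k + 1]  # align both energies on the same sample
        if px <= 0.0 or py <= 0.0:
            # The estimator is undefined here; mark the sample as missing.
            amplitude.append(float("nan"))
            frequency.append(float("nan"))
            continue
        arg = max(-1.0, min(1.0, 1.0 - py / (2.0 * px)))
        frequency.append(0.5 * math.acos(arg))
        amplitude.append(2.0 * px / math.sqrt(py))
    return amplitude, frequency
```

For a pure tone `x[n] = A*cos(W*n + phi)` with `0 < W < pi/2`, the estimates reduce to `A` and `W`. In a feature extractor the function would be applied to each subband signal separately.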
Moreover, the importance of these two features for detecting SS and VC was also analyzed in [20].
The above-mentioned methods use parallel filter banks to obtain the subband signals for instantaneous amplitude and phase extraction.
The cochlea in the human/mammalian auditory system, which inherently has extraordinary frequency sensitivity and selectivity, can be more accurately approximated by a transmission line model [21].
In the transmission line model, the cochlea is represented as a cascade of digital filters, which helps to achieve a sharper roll-off even with lower-order filters, allowing high frequency selectivity [21].
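The effect of cascading on roll-off can be seen in a small numerical sketch. It is not the cochlear model of [21]: it assumes plain second-order Butterworth low-pass sections (a standard bilinear-transform biquad design) with hypothetical, decreasing cutoff frequencies, and it compares the response at a cascade tap with that of the same section used in isolation, as it would be in a parallel bank.

```python
import cmath
import math

def lowpass_biquad(f0, fs, q=1 / math.sqrt(2)):
    """Second-order low-pass coefficients (b, a); q = 1/sqrt(2) is Butterworth."""
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    c = math.cos(w0)
    b = ((1 - c) / 2, 1 - c, (1 - c) / 2)
    a = (1 + alpha, -2 * c, 1 - alpha)
    return b, a

def section_response(section, f, fs):
    """Complex frequency response of one section at f Hz."""
    b, a = section
    z1 = cmath.exp(-2j * math.pi * f / fs)  # z^-1 on the unit circle
    num = b[0] + b[1] * z1 + b[2] * z1 * z1
    den = a[0] + a[1] * z1 + a[2] * z1 * z1
    return num / den

def tap_response(sections, k, f, fs):
    """Response seen at tap k of the cascade: product of sections 0..k."""
    h = 1 + 0j
    for s in sections[: k + 1]:
        h *= section_response(s, f, fs)
    return h

def to_db(h):
    return 20 * math.log10(abs(h))

FS = 16000                           # assumed sampling rate in Hz
CUTOFFS = [4000, 3000, 2000, 1000]   # hypothetical, decreasing along the cascade
SECTIONS = [lowpass_biquad(f0, FS) for f0 in CUTOFFS]

def rolloff_comparison(k=3, probe_hz=2000):
    """Gain in dB at probe_hz for section k alone and for cascade tap k."""
    single = to_db(section_response(SECTIONS[k], probe_hz, FS))
    cascaded = to_db(tap_response(SECTIONS, k, probe_hz, FS))
    return single, cascaded
```

Calling `rolloff_comparison()` returns the attenuation one octave above the last section's cutoff for both cases. Because each earlier section also attenuates the probe frequency, the cascade tap is attenuated more than the isolated second-order section, even though every section has the same low order.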
The importance of the choice of filter shape in discriminating genuine and replayed speech was reported in [22].
Motivated by these observations, we hypothesize that analyzing the AM and FM components of the signal with high frequency sensitivity and resolution would be effective in capturing differences between genuine and spoofed speech.
Thus, we propose a transmission line cochlear model to extract the AM and FM components of speech.
We refer to the proposed features as transmission line cochlea AM and FM (TLC-AM, TLC-FM).
