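In a conventional sound recognition pipeline, the two most critical stages are feature construction and classifier selection. This post covers the traditional time-domain and frequency-domain statistical features of audio signals and their Python implementations.
Environment: Python 3.5/3.6, librosa 0.7
Most of the programs below were verified by re-implementing the formulas in code. If you find a problem, please point it out. Thanks!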
In digital signal processing, the spectral centroid (SC) and the spectral spread (SS) are measures for characterising the distribution of the frequency components of a signal. The spectral centroid is defined as the "center of mass" of the spectrum and is computed as follows:
$$SC = \frac{\sum_{i=1}^{L_F} i\frac{F_s}{L_F}\lvert X(i)\rvert}{\sum_{i=1}^{L_F}\lvert X(i)\rvert},$$
while the spectral spread is computed as the dispersion of the frequency components of the signal around the centroid:
$$SS = \sqrt{\frac{\sum_{i=1}^{L_F}\left[ i\frac{F_s}{L_F}-SC\right]^2 \lvert X(i)\rvert}{\sum_{i=1}^{L_F}\lvert X(i)\rvert}}$$
where $L_F$ and $\lvert X(i) \rvert$ are the length and the magnitude of the FFT of the input signal $x(n)$, respectively, and $F_s$ is the sampling frequency.
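# Code below (verified)
import librosa
import numpy as np

path = 'scream.wav'
y, sr = librosa.load(path, sr=44100)
frame = librosa.util.frame(y, frame_length=1024, hop_length=512)
# Magnitude spectrogram, shape (513, n_frames)
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=512, center=False))
# Centre frequency (Hz) of each FFT bin
fre = librosa.fft_frequencies(sr=sr, n_fft=1024)
sc = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=1024, hop_length=512, center=False)
# Spectral spread: magnitude-weighted standard deviation around the centroid
ss = np.zeros((1, frame.shape[1]))
for i in range(frame.shape[1]):
    ss[:, i] = np.sqrt(np.sum((fre - sc[:, i]) ** 2 * S[:, i]) / np.sum(S[:, i]))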
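As a cross-check that does not need an audio file, the two formulas can be sketched with NumPy alone on a synthetic tone. The helper name `spectral_centroid_spread` and the test tone are illustrative assumptions, not part of the original code. Like librosa, the sketch counts frequency bins from 0 Hz rather than from $i = 1$.

```python
import numpy as np

def spectral_centroid_spread(x, fs):
    """Return (centroid, spread) in Hz for a single frame x sampled at fs.

    Hypothetical helper for illustration only.
    """
    mag = np.abs(np.fft.rfft(x))                 # |X(i)| for non-negative frequencies
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)  # centre frequency of each bin in Hz
    total = mag.sum()
    if total == 0:                               # silent frame: avoid division by zero
        return 0.0, 0.0
    sc = np.sum(freqs * mag) / total             # magnitude-weighted mean frequency
    ss = np.sqrt(np.sum((freqs - sc) ** 2 * mag) / total)  # weighted standard deviation
    return sc, ss

fs = 44100
n = np.arange(1024)
# Exactly 24 cycles per 1024-sample frame, so all energy falls in FFT bin 24
tone = np.sin(2 * np.pi * 24 * n / 1024)
sc, ss = spectral_centroid_spread(tone, fs)
print(sc, ss)  # sc should be close to 24 * fs / 1024 Hz, ss close to 0
```

For a pure tone that sits exactly on a bin, the centroid matches the tone frequency and the spread is nearly zero, which gives a quick sanity check on both formulas.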
The spectral rolloff is a measure of the skewness of the spectrum and is defined as the frequency $f_{ro}$ below which $P\%$ of the spectral components of the signal lie. In our case, we consider $P = 90$ and determine the value $f_{ro}$ from the following relation:
$$\sum_{i=1}^{f_{ro}}\lvert X(i) \rvert = \frac{P}{100}\sum_{i=1}^{F_{max}}\lvert X(i) \rvert$$
# Feature extracted with librosa (this program has not been verified against the formula)
import librosa

path = 'scream.wav'
y, sr = librosa.load(path, sr=44100)
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, n_fft=1024, hop_length=512, center=False, roll_percent=0.9)
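Since the librosa call above is marked as unverified, here is a minimal NumPy sketch of the rolloff relation for a single frame. The function name `spectral_rolloff_frame` and the two-tone test signal are assumptions made for illustration. The sketch picks the first bin whose cumulative magnitude reaches $P\%$ of the total.

```python
import numpy as np

def spectral_rolloff_frame(x, fs, percent=90.0):
    """Frequency (Hz) below which `percent` % of the spectral magnitude lies.

    Hypothetical helper for illustration only.
    """
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    cumulative = np.cumsum(mag)                    # running sum of |X(i)|
    threshold = percent / 100.0 * cumulative[-1]   # P/100 of the total magnitude
    idx = np.searchsorted(cumulative, threshold)   # first bin reaching the threshold
    return freqs[min(idx, len(freqs) - 1)]

fs = 44100
n = np.arange(1024)
# Strong tone in bin 10 plus a weak tone (5 % amplitude) in bin 100
x = np.sin(2 * np.pi * 10 * n / 1024) + 0.05 * np.sin(2 * np.pi * 100 * n / 1024)
r90 = spectral_rolloff_frame(x, fs, percent=90.0)  # reached at the strong tone (bin 10)
r99 = spectral_rolloff_frame(x, fs, percent=99.0)  # needs the weak tone too (bin 100)
print(r90, r99)
```

Raising `percent` pushes the rolloff frequency upward, because more of the weak high-frequency content has to be included before the threshold is met.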
The spectral flux (SF) indicates how quickly the spectral information of a signal is changing and it is computed by considering the squared-difference between the spectra of two consecutive audio frames, as reported in the following equation:
$$SF = \sum_{i=1}^{L_F}\left[ X_n(i) - X_{n-1}(i) \right]^2$$
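import librosa
import numpy as np

path = 'scream.wav'
y, sr = librosa.load(path, sr=44100)
# Magnitude spectrogram, as in the spectral centroid example
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=512, center=False))
# Flux between each pair of consecutive frames
sf = np.zeros((1, S.shape[1] - 1))
for i in range(S.shape[1] - 1):
    sf[:, i] = np.sum((S[:, i + 1] - S[:, i]) ** 2)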
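The frame loop can also be written as one vectorised expression. The sketch below assumes a magnitude spectrogram of shape `(n_bins, n_frames)`. The name `spectral_flux` and the tiny demo matrix are illustrative only.

```python
import numpy as np

def spectral_flux(S):
    """Squared spectral difference between consecutive frames.

    S is a magnitude spectrogram of shape (n_bins, n_frames);
    the result has n_frames - 1 entries. Hypothetical helper.
    """
    diff = np.diff(S, axis=1)           # X_n(i) - X_{n-1}(i) for every bin and frame pair
    return np.sum(diff ** 2, axis=0)    # sum the squared differences over the bins

# Two bins, three frames: frames 0 and 1 are identical, frame 2 differs
S_demo = np.array([[1.0, 1.0, 3.0],
                   [2.0, 2.0, 0.0]])
sf_demo = spectral_flux(S_demo)
print(sf_demo)  # [0. 8.]  since (3-1)^2 + (0-2)^2 = 8
```

An unchanged spectrum gives a flux of 0, and the value grows with the size of the change between frames.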
The energy ratios in sub-bands (ERSB) give a rough approximation of the energy distribution of the spectrum. We divided the spectrum of the signal into four sub-bands, and for each sub-band we computed the ratio between the energy contained in that sub-band and the overall energy of the audio frame:
$$ERSB_n = \frac{\sum_{i=k_{n1}}^{k_{n2}} \lvert X(i)\rvert ^2}{\sum_{i=1}^{F_{max}} \lvert X(i)\rvert ^2},$$
where $k_{n1}$ and $k_{n2}$ are the first and last FFT bins of sub-band $n$. The code below sets the band edges at FFT bins 14, 39, 102 and 512.
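import librosa
import numpy as np

path = 'scream.wav'
y, sr = librosa.load(path, sr=44100)
# Power spectrogram |X(i)|^2, matching the squared terms in the formula
P = np.abs(librosa.stft(y, n_fft=1024, hop_length=512, center=False)) ** 2
ersb = np.zeros((4, P.shape[1]))
sub_band = [14, 39, 102, 512]  # upper bin edge of each sub-band
for i in range(P.shape[1]):
    total = np.sum(P[:, i])  # overall energy of the frame
    ersb[0, i] = np.sum(P[:sub_band[0], i]) / total
    ersb[1, i] = np.sum(P[sub_band[0]:sub_band[1], i]) / total
    ersb[2, i] = np.sum(P[sub_band[1]:sub_band[2], i]) / total
    ersb[3, i] = np.sum(P[sub_band[2]:sub_band[3], i]) / total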
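The same ratio can be sketched for any number of bands without a per-frame loop. The function `energy_ratio_subbands`, its `edges` argument (upper bin edge of each band, exclusive) and the small demo matrix are assumptions for illustration. The sketch squares the magnitudes, as the formula requires.

```python
import numpy as np

def energy_ratio_subbands(S, edges):
    """Energy ratio of each sub-band for every frame.

    S: magnitude spectrogram, shape (n_bins, n_frames).
    edges: increasing upper bin edges (exclusive) of the sub-bands.
    Hypothetical helper for illustration only.
    """
    power = S ** 2                                # |X(i)|^2
    total = power.sum(axis=0)
    total = np.where(total == 0, 1.0, total)      # silent frames give ratios of 0
    starts = [0] + list(edges[:-1])               # each band starts where the previous ends
    return np.stack([power[a:b].sum(axis=0) / total
                     for a, b in zip(starts, edges)])

# Four bins, two frames, three bands: [0:1], [1:3], [3:4]
S_demo = np.array([[1.0, 0.0],
                   [1.0, 2.0],
                   [2.0, 0.0],
                   [0.0, 0.0]])
ersb_demo = energy_ratio_subbands(S_demo, edges=[1, 3, 4])
print(ersb_demo)  # each column sums to 1 when the bands cover all bins
```

When the bands cover the whole spectrum, the ratios of each frame add up to 1, which is a convenient check on the band edges.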
We calculate the volume feature (V) as the root mean square (RMS) of the amplitude value of the samples in an audio frame:
$$V = \sqrt {\frac{1}{L}\sum_{i=1}^{L}x(i)^2}$$
while the energy (E) is the squared-sum of the amplitude value of the audio samples:
$$E = \sum_{i=1}^{L}x(i)^2$$
import librosa
import numpy as np

path = 'scream.wav'
y, sr = librosa.load(path, sr=44100)
frame = librosa.util.frame(y, frame_length=1024, hop_length=512)
# Volume and energy
V = librosa.feature.rms(y=y, frame_length=1024, hop_length=512, center=False)
Energy = np.sum(frame ** 2, axis=0)
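The two definitions are linked by $V = \sqrt{E/L}$, which a short NumPy sketch can confirm on a toy signal. The framing helper `frame_signal` is a hypothetical stand-in for `librosa.util.frame`, and the six-sample signal is made up for the example.

```python
import numpy as np

def frame_signal(x, frame_length, hop_length):
    """Split x into frames; returns shape (frame_length, n_frames).

    Hypothetical stand-in for librosa.util.frame, for illustration only.
    """
    n_frames = 1 + (len(x) - frame_length) // hop_length
    idx = (np.arange(frame_length)[:, None]
           + hop_length * np.arange(n_frames)[None, :])  # sample index of each frame entry
    return x[idx]

x = np.array([3.0, 4.0, 3.0, 4.0, 0.0, 0.0])
frames = frame_signal(x, frame_length=2, hop_length=2)

energy = np.sum(frames ** 2, axis=0)             # E: sum of squared samples per frame
volume = np.sqrt(np.mean(frames ** 2, axis=0))   # V: root mean square per frame
print(energy)  # [25. 25.  0.]
print(volume)  # equals sqrt(energy / 2) because L = 2
```

Because the two features differ only by the frame length and a square root, they carry the same information when every frame has the same length.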
The zero crossing rate (ZCR) is the rate of sign changes along a frame and is especially used to characterise percussive sounds and environmental noise. For a frame $x(i)$ of $L$ samples, the ZCR is computed as follows:
$$ZCR = \frac{1}{2L}\sum_{i=1}^{L}\lvert sgn(x(i+1)) - sgn(x(i))\rvert$$
import librosa

path = 'scream.wav'
y, sr = librosa.load(path, sr=44100)
zcr = librosa.feature.zero_crossing_rate(y, frame_length=1024, hop_length=512, center=False)
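The formula itself can be sketched directly in NumPy for a single frame. The function name `zero_crossing_rate_frame` and the toy frame are assumptions. The sketch uses only the $L - 1$ sample pairs inside the frame, and librosa's handling of exact zeros and frame edges may differ, so the values need not match `librosa.feature.zero_crossing_rate` exactly.

```python
import numpy as np

def zero_crossing_rate_frame(frame):
    """ZCR of one frame: sum of |sgn(x(i+1)) - sgn(x(i))| divided by 2L.

    Hypothetical helper for illustration only.
    """
    signs = np.sign(frame)                       # -1, 0 or +1 per sample
    changes = np.abs(np.diff(signs))             # 2 wherever the sign flips
    return np.sum(changes) / (2.0 * len(frame))

# Four sign flips in a frame of eight samples
frame_demo = np.array([1.0, -1.0, 1.0, -1.0, 1.0, 1.0, 1.0, 1.0])
zcr_demo = zero_crossing_rate_frame(frame_demo)
print(zcr_demo)  # 0.5, i.e. 4 flips * 2 / (2 * 8)
```

A constant-sign frame gives 0, and a frame that flips sign at every sample approaches 1.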
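I am recording these feature extraction methods here so that I do not forget them. More features will be added over time. If there is a particular feature you would like to see, mention it in the comments and we can learn together.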