Python Implementation of Time-Frequency Statistical Features for Common Sounds

Contents

  • Abstract
  • Feature descriptions and Python code
    • 1. Spectral centroid and spectral spread
    • 2. Spectral rolloff
    • 3. Spectral flux
    • 4. Energy ratios in sub-bands
    • 5. Volume and energy
    • 6. Zero crossing rate
  • Summary

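Abstract

In a conventional sound recognition pipeline, the two most important stages are feature construction and classifier selection. This post covers the traditional time-domain and frequency-domain statistical features of sound signals and their Python implementations.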

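Feature descriptions and Python code

Environment: Python 3.6/3.5, librosa 0.7
Most of the programs below were verified by re-implementing the formulas by hand. If you find a problem, please point it out. Thanks!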

1. Spectral centroid and spectral spread

In digital signal processing, the spectral centroid (SC) and the spectral spread (SS) are measures for characterising the distribution of the frequency components of a signal. The spectral centroid is defined as the "center of mass" of the spectrum and is computed as follows:
$$SC = \frac{\sum_{i=1}^{L_F} i\,\frac{F_s}{L_F}\,\lvert X(i)\rvert}{\sum_{i=1}^{L_F}\lvert X(i)\rvert},$$
while the spectral spread is computed as the dispersion of the frequency components of the signal around the centroid:
$$SS = \sqrt{\frac{\sum_{i=1}^{L_F}\left[i\,\frac{F_s}{L_F}-SC\right]^2 \lvert X(i)\rvert}{\sum_{i=1}^{L_F}\lvert X(i)\rvert}}$$
where $L_F$ and $\lvert X(i)\rvert$ are the length and the magnitude of the FFT of the input signal $x(n)$, respectively, and $F_s$ is the sampling rate.

# Code below; verified by the author
import librosa
import numpy as np

path = 'scream.wav'
y, sr = librosa.load(path, sr=44100)
# Magnitude spectrogram, shape (n_fft/2 + 1, n_frames)
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=512, center=False))
# Center frequency (Hz) of each FFT bin
fre = librosa.fft_frequencies(sr=sr, n_fft=1024)
# Spectral centroid, shape (1, n_frames)
sc = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=1024, hop_length=512, center=False)
# Spectral spread: magnitude-weighted standard deviation of frequency around the centroid
weighted = np.sum((fre[:, None] - sc) ** 2 * S, axis=0, keepdims=True)
ss = np.sqrt(weighted / np.sum(S, axis=0, keepdims=True))
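
As a cross-check of the two formulas, here is a minimal NumPy-only sketch that computes SC and SS for a single frame without librosa or an audio file. The function name `centroid_and_spread`, the synthetic test signals, and the choice to return zeros for a silent frame are assumptions made for illustration. Bin indices start at 0 (the DC bin) here, whereas the formulas above start at $i = 1$.

```python
import numpy as np

def centroid_and_spread(frame, sr):
    """Return (SC, SS) in Hz for one frame, computed from its magnitude spectrum."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)  # bin frequencies in Hz
    total = np.sum(mag)
    if total == 0:
        # Silent frame: both features are undefined; returning zeros is an assumed convention
        return 0.0, 0.0
    sc = np.sum(freqs * mag) / total
    ss = np.sqrt(np.sum((freqs - sc) ** 2 * mag) / total)
    return sc, ss

sr = 44100
n = 1024
t = np.arange(n) / sr
window = np.hanning(n)

# A 1 kHz tone concentrates its spectrum around one frequency
tone = np.sin(2 * np.pi * 1000 * t) * window
# White noise spreads its spectrum over the whole band
noise = np.random.default_rng(0).standard_normal(n) * window

sc_tone, ss_tone = centroid_and_spread(tone, sr)
sc_noise, ss_noise = centroid_and_spread(noise, sr)
print(sc_tone, ss_tone)
print(sc_noise, ss_noise)
```

The tone should give a centroid close to 1 kHz with a small spread, while the noise frame should give a much larger spread, which matches the "center of mass" and "dispersion" reading of the two features.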

2. Spectral rolloff

The spectral rolloff is a measure of the skewness of the spectrum and is defined as the frequency $f_{ro}$ below which $P\%$ of the spectral magnitude of the signal lies. In our case, we consider $P = 90$ and determine the value $f_{ro}$ from the following relation:
$$\sum_{i=1}^{f_{ro}}\lvert X(i)\rvert = \frac{P}{100}\sum_{i=1}^{F_{max}}\lvert X(i)\rvert$$

# Extract the feature with librosa; this program has not been verified
import librosa

path = 'scream.wav'
y, sr = librosa.load(path, sr=44100)
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, n_fft=1024, hop_length=512, center=False, roll_percent=0.9)
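
Since the librosa call above is marked as unverified, the relation can also be checked directly. The following is a minimal NumPy-only sketch for one frame, assuming the rolloff is taken as the first bin at which the cumulative magnitude reaches the threshold. The function name `spectral_rolloff_hz` and the synthetic tones are hypothetical, and the result may differ slightly from librosa's output because of windowing and framing details.

```python
import numpy as np

def spectral_rolloff_hz(frame, sr, percent=0.9):
    """Frequency (Hz) below which `percent` of the total spectral magnitude lies."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    cumulative = np.cumsum(mag)
    threshold = percent * cumulative[-1]
    # First bin whose cumulative magnitude is >= the threshold
    idx = np.searchsorted(cumulative, threshold)
    return freqs[min(idx, len(freqs) - 1)]

sr = 44100
n = 1024
t = np.arange(n) / sr
window = np.hanning(n)

r_low = spectral_rolloff_hz(np.sin(2 * np.pi * 500 * t) * window, sr)
r_high = spectral_rolloff_hz(np.sin(2 * np.pi * 8000 * t) * window, sr)
print(r_low, r_high)
```

For a single windowed tone almost all of the magnitude sits in the main lobe, so the rolloff lands within a few bins of the tone frequency.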

3. Spectral flux

The spectral flux (SF) indicates how quickly the spectral information of a signal is changing and it is computed by considering the squared-difference between the spectra of two consecutive audio frames, as reported in the following equation:
$$SF = \sum_{i=1}^{L_F}\left[X_n(i) - X_{n-1}(i)\right]^2$$

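import librosa
import numpy as np

path = 'scream.wav'
y, sr = librosa.load(path, sr=44100)
# Magnitude spectrogram, computed as in section 1
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=512, center=False))
# Sum of squared differences between consecutive frames, shape (1, n_frames - 1)
sf = np.sum(np.diff(S, axis=1) ** 2, axis=0, keepdims=True)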
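
To see how the flux behaves, the sketch below compares a steady tone with a signal that switches frequency halfway through. It is NumPy-only and makes several assumptions: the helper names `magnitude_frames` and `spectral_flux` are hypothetical, frames are Hann-windowed without padding, and the flux is computed on magnitude spectra, as in the listing above.

```python
import numpy as np

def magnitude_frames(y, n_fft=1024, hop=512):
    """Hann-windowed magnitude spectra, one column per frame (no padding)."""
    n_frames = 1 + (len(y) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack(
        [y[i * hop:i * hop + n_fft] * window for i in range(n_frames)], axis=1
    )
    return np.abs(np.fft.rfft(frames, axis=0))

def spectral_flux(S):
    """Sum of squared differences between consecutive columns of S."""
    return np.sum(np.diff(S, axis=1) ** 2, axis=0)

sr = 44100
t = np.arange(sr) / sr  # one second of samples

steady = np.sin(2 * np.pi * 440 * t)
# Same tone for the first half second, then a jump to 3 kHz
switching = np.where(t < 0.5, np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 3000 * t))

flux_steady = spectral_flux(magnitude_frames(steady))
flux_switch = spectral_flux(magnitude_frames(switching))
print(flux_steady.max(), flux_switch.max(), np.argmax(flux_switch))
```

The steady tone gives a flux close to zero in every frame pair, while the switching signal shows a sharp peak at the frames that straddle the 0.5 s transition.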

4. Energy ratios in sub-bands

The energy ratios in sub-bands (ERSB) give a rough approximation of the energy distribution of the spectrum. We divided the spectrum of the signal into four sub-bands and, for each sub-band, computed the ratio between the energy contained in that sub-band and the overall energy of the audio frame:
$$ERSB_n = \frac{\sum_{i=k_{n1}}^{k_{n2}} \lvert X(i)\rvert^2}{\sum_{i=1}^{F_{max}} \lvert X(i)\rvert^2},$$
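where $k_{n1}$ and $k_{n2}$ are the first and last FFT bins of sub-band $n$.
[Figure 1: the four sub-band boundaries; image not recoverable from the source.]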

import librosa
import numpy as np

path = 'scream.wav'
y, sr = librosa.load(path, sr=44100)
# Power spectrogram |X|^2 (the formula is defined on energy), shape (513, n_frames)
P = np.abs(librosa.stft(y, n_fft=1024, hop_length=512, center=False)) ** 2
# Upper edge (FFT bin index, exclusive) of each of the four sub-bands
sub_band = [14, 39, 102, 512]
edges = [0] + sub_band
total = np.sum(P, axis=0)
ersb = np.zeros((4, P.shape[1]))
for n in range(4):
    ersb[n] = np.sum(P[edges[n]:edges[n + 1]], axis=0) / total
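
With `sr = 44100` and `n_fft = 1024`, each FFT bin is 44100 / 1024 ≈ 43.07 Hz wide, so the bin edges 14, 39, 102 and 512 used above correspond to roughly 603 Hz, 1680 Hz, 4393 Hz and 22050 Hz. The following NumPy-only sketch applies the ratio to three synthetic tones, one per band. The function name `energy_ratios`, the half-open bin ranges, and the zero return for silent frames are assumptions for illustration.

```python
import numpy as np

def energy_ratios(frame, edges=(0, 14, 39, 102, 512)):
    """Fraction of the frame's spectral energy in each band [edges[n], edges[n+1])."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    total = np.sum(power)
    if total == 0:
        # Silent frame: ratios are undefined; zeros are an assumed convention
        return np.zeros(len(edges) - 1)
    return np.array(
        [np.sum(power[lo:hi]) / total for lo, hi in zip(edges[:-1], edges[1:])]
    )

sr = 44100
n = 1024
t = np.arange(n) / sr
window = np.hanning(n)

ratios_300 = energy_ratios(np.sin(2 * np.pi * 300 * t) * window)      # first band
ratios_3000 = energy_ratios(np.sin(2 * np.pi * 3000 * t) * window)    # third band
ratios_10000 = energy_ratios(np.sin(2 * np.pi * 10000 * t) * window)  # fourth band
print(ratios_300)
print(ratios_3000)
print(ratios_10000)
```

Each tone puts nearly all of its energy into the band that contains its frequency. The four ratios sum to slightly less than 1 because the last bin (index 512, the Nyquist bin) lies outside the fourth band.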

5. Volume and energy

We calculate the volume feature (V) as the root mean square (RMS) of the amplitude value of the samples in an audio frame:
$$V = \sqrt{\frac{1}{L}\sum_{i=1}^{L}x(i)^2}$$
while the energy (E) is the sum of the squared amplitude values of the audio samples:
$$E = \sum_{i=1}^{L}x(i)^2$$

import librosa
import numpy as np

path = 'scream.wav'
y, sr = librosa.load(path, sr=44100)
# Split the signal into frames, shape (1024, n_frames)
frame = librosa.util.frame(y, frame_length=1024, hop_length=512)
# Volume (RMS per frame) and energy (sum of squares per frame)
V = librosa.feature.rms(y=y, frame_length=1024, hop_length=512, center=False)
Energy = np.sum(frame ** 2, axis=0)
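
The two definitions differ only by a scale factor: $E = L \cdot V^2$. The NumPy-only sketch below checks this identity on a synthetic tone. The helper `frame_signal` is a hypothetical stand-in for `librosa.util.frame`, written here so the example runs without librosa.

```python
import numpy as np

def frame_signal(y, frame_length=1024, hop_length=512):
    """Slice y into overlapping frames, shape (frame_length, n_frames), no padding."""
    n_frames = 1 + (len(y) - frame_length) // hop_length
    idx = np.arange(frame_length)[:, None] + hop_length * np.arange(n_frames)[None, :]
    return y[idx]

sr = 44100
t = np.arange(sr) / sr
y = 0.5 * np.sin(2 * np.pi * 441 * t)  # amplitude 0.5, so RMS is about 0.5 / sqrt(2)

frames = frame_signal(y)
energy = np.sum(frames ** 2, axis=0)            # E per frame
volume = np.sqrt(np.mean(frames ** 2, axis=0))  # V (RMS) per frame

print(frames.shape)
print(volume[:3], energy[:3])
```

Because volume is normalized by the frame length, it is comparable across different frame sizes, whereas energy grows with the number of samples in the frame.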

6. Zero crossing rate

The zero crossing rate (ZCR) is the rate of the sign-changes along a frame and is especially used to characterise percussive sounds and environmental noise. For a frame $x(i)$ of $L$ samples, the ZCR is computed as follows:
$$ZCR = \frac{1}{2L}\sum_{i=1}^{L}\lvert \operatorname{sgn}(x(i+1)) - \operatorname{sgn}(x(i))\rvert$$

import librosa

path = 'scream.wav'
y, sr = librosa.load(path, sr=44100)
zcr = librosa.feature.zero_crossing_rate(y=y, frame_length=1024, hop_length=512, center=False)
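
The formula can also be written directly in NumPy. This is a minimal sketch for a single frame: the function name `zero_crossing_rate` is chosen for illustration, and the sum runs over the $L - 1$ adjacent sample pairs available inside the frame. librosa's implementation may treat zeros and frame edges differently, so small numerical differences from the call above are possible.

```python
import numpy as np

def zero_crossing_rate(frame):
    """ZCR of one frame: half the summed absolute sign differences, divided by L."""
    signs = np.sign(frame)
    return np.sum(np.abs(np.diff(signs))) / (2 * len(frame))

sr = 44100
t = np.arange(1024) / sr

# A 1 kHz tone crosses zero about 2000 times per second
zcr_tone = zero_crossing_rate(np.sin(2 * np.pi * 1000 * t))
# White noise changes sign far more often
zcr_noise = zero_crossing_rate(np.random.default_rng(0).standard_normal(1024))
print(zcr_tone, zcr_noise)
```

A low-frequency tone gives a small ZCR and broadband noise a large one, which is why the feature helps separate tonal sounds from noisy or percussive ones.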

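Summary

I am recording these feature extraction methods here so that I do not forget them. Other features will be added over time; if you would like to see a particular feature covered, please mention it in the comments so we can learn together.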
