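Sound is a wave produced by the vibration of an object, a disturbance that the hearing organs of humans and animals can perceive. It has several attributes, such as timbre, pitch, loudness, and frequency. The human ear tells different kinds of sound apart by these attributes, so how can a machine be made to do the same? Researchers commonly use Mel Frequency Cepstrum Coefficients (MFCC) as the acoustic feature that lets a machine learn to distinguish sounds.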
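The mel frequency scale was proposed by researchers based on the mechanism of human hearing, and it has a nonlinear relationship with frequency in hertz (Hz). MFCC exploits this nonlinear relationship to compute features from the Hz spectrum. MFCC is now widely used for feature extraction from speech data and for reducing the dimensionality of the computation. Because the mapping between Hz and mel is nonlinear, the precision of MFCC drops as frequency rises. In practice, therefore, usually only the low-frequency MFCCs are used, and the mid- and high-frequency ones are discarded.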
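To make the nonlinear Hz-to-mel relationship concrete, here is a minimal sketch in plain Python. It uses the constants 700 and 1127 that also appear in the TensorFlow source quoted later in this post. The helper `mel_band_width_hz` is a hypothetical name introduced only for illustration: it shows that a band of fixed width on the mel scale covers more and more hertz as the center frequency rises.

```python
import math

MEL_BREAK_FREQUENCY_HERTZ = 700.0
MEL_HIGH_FREQUENCY_Q = 1127.0


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the mel scale (natural-log form)."""
    return MEL_HIGH_FREQUENCY_Q * math.log(1.0 + f_hz / MEL_BREAK_FREQUENCY_HERTZ)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return MEL_BREAK_FREQUENCY_HERTZ * (math.exp(mel / MEL_HIGH_FREQUENCY_Q) - 1.0)


def mel_band_width_hz(center_hz, width_mel=100.0):
    """Width in Hz of a band that is `width_mel` mel wide around center_hz.

    Hypothetical helper, for illustration only.
    """
    center_mel = hz_to_mel(center_hz)
    return mel_to_hz(center_mel + width_mel / 2) - mel_to_hz(center_mel - width_mel / 2)


if __name__ == "__main__":
    for f in (100, 1000, 4000):
        print(f, round(hz_to_mel(f), 1), round(mel_band_width_hz(f), 1))
```

The MATLAB code later in the post writes the same mapping as 2595*log10(1+f/700); since 2595/ln(10) is approximately 1127, the two forms agree closely.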
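Computing MFCC involves pre-emphasis, framing, windowing, the fast Fourier transform (FFT), the mel filterbank (conversion to the mel scale), the discrete cosine transform (DCT), and dynamic features. The most important steps are the FFT and the mel filterbank, which perform most of the dimensionality reduction. The concrete implementation of the MFCC algorithm is described below.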
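Pre-emphasis, framing, windowing, and the FFT are covered in detail by material available online, and their parameters are fixed, so they are not repeated here; see the article "MFCC算法原理讲解" for reference.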
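For readers who want a quick picture of the steps skipped here, the following is a textbook-style sketch of pre-emphasis, framing, and windowing in plain Python. It is not TensorFlow's implementation: the coefficient 0.97, the Hamming window, and all function names are assumptions chosen for illustration.

```python
import math


def pre_emphasis(signal, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; alpha=0.97 is a common textbook value."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]


def frame_signal(signal, frame_length, frame_step):
    """Split the signal into frames, dropping a trailing partial frame."""
    if len(signal) < frame_length:
        return []
    count = 1 + (len(signal) - frame_length) // frame_step
    return [signal[i * frame_step:i * frame_step + frame_length]
            for i in range(count)]


def hamming_window(length):
    """Symmetric Hamming window of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def window_frames(frames):
    """Multiply every frame element-wise by a Hamming window."""
    if not frames:
        return []
    window = hamming_window(len(frames[0]))
    return [[s * w for s, w in zip(frame, window)] for frame in frames]
```

Each windowed frame would then be passed to an FFT to obtain one row of the spectrogram.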
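This post mainly shares code for the second half of the MFCC computation, that is, the four steps after the Fourier transform: taking the absolute or squared value, mel filtering, taking the logarithm, and the discrete cosine transform. In effect, it reproduces the contrib_audio.mfcc function in the code below, focusing on how the spectrogram is processed once the FFT has produced it.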
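The code below has two parts. The first part, the contrib_audio.audio_spectrogram function, performs pre-emphasis, framing, windowing, and the Fourier transform. (As far as I can tell, the FFT inside this function has 1024 points but only the first 513 outputs are kept. I have not dug into the reason in detail; from what I have read, the rule is N/2 + 1, which matches the conjugate symmetry of the FFT of a real-valued signal: a 512-point FFT keeps 257 bins, and so on.)
The second part, the contrib_audio.mfcc function, performs the absolute or squared value step (inside TensorFlow this is actually a square root, which is a pitfall), mel filtering, the logarithm, and the discrete cosine transform.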
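# Assumed TF 1.x import; wav_decoder is assumed to be the output of contrib_audio.decode_wav
from tensorflow.contrib.framework.python.ops import audio_ops as contrib_audio

# STFT: get the spectrogram
spectrogram = contrib_audio.audio_spectrogram(
    wav_decoder.audio,
    window_size=640,
    stride=640,
    magnitude_squared=True)
# get MFCC (C, H, W)
_mfcc = contrib_audio.mfcc(
    spectrogram,
    sample_rate=16000,
    upper_frequency_limit=4000,
    lower_frequency_limit=20,
    filterbank_channel_count=40,
    dct_coefficient_count=10)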
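On the 513-of-1024 bins mentioned above: the DFT of a real-valued signal is conjugate-symmetric, so only the bins 0 through N/2 carry distinct information. The sketch below checks this on a short signal with a naive DFT. It also assumes, as my reading and not something the post confirms, that the window of 640 samples is rounded up to the next power of two, which would give the 1024-point FFT. All function names are hypothetical.

```python
import cmath


def next_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    p = 1
    while p < n:
        p *= 2
    return p


def num_unique_bins(fft_length):
    """Distinct bins in the DFT of a real signal: indices 0 .. fft_length/2."""
    return fft_length // 2 + 1


def dft(x):
    """Naive O(N^2) DFT, for illustration on short inputs only."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def is_conjugate_symmetric(spectrum, tol=1e-9):
    """True if X[k] equals conj(X[N-k]) for every k."""
    n = len(spectrum)
    return all(abs(spectrum[k] - spectrum[(n - k) % n].conjugate()) < tol
               for k in range(n))


if __name__ == "__main__":
    fft_length = next_power_of_two(640)
    print(fft_length, num_unique_bins(fft_length))
    print(is_conjugate_symmetric(dft([0.5, -1.0, 2.0, 3.0, 0.0, 1.5, -2.5, 1.0])))
```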
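I searched through a lot of material online and found only one reference that reproduces TensorFlow's internal MFCC computation, and it targets TensorFlow 2, whereas my project needs to reproduce the MFCC computation inside TensorFlow 1.14. The versions differ, but I guessed that the two implementations follow the same principle and differ only in some parameters. When I tried to reproduce the TensorFlow 1.14 output using the TensorFlow 2 parameters, every result was slightly off, which suggests that the parameters do differ. The MATLAB results, however, match the TensorFlow 2 results exactly.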
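Below is the TensorFlow 2 source code that builds the mel filterbank. Here num_mel_bins is the number of mel filters, 40; num_spectrogram_bins is 513, which must match the 513 bins kept from the 1024-point FFT; sample_rate is the sampling rate, 16000 Hz; the lower band edge is lower_edge_hertz=20 and the upper band edge is upper_edge_hertz=4000.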
# Official code
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from tensorflow.python.framework import dtypes
from tensorflow.python.framework import ops
from tensorflow.python.framework import tensor_util
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import math_ops
from tensorflow.python.ops.signal import shape_ops
from tensorflow.python.util.tf_export import tf_export
_MEL_BREAK_FREQUENCY_HERTZ = 700
_MEL_HIGH_FREQUENCY_Q = 1127
def _hertz_to_mel(frequencies_hertz, name=None):
  """Converts frequencies in `frequencies_hertz` in Hertz to the mel scale.
  Args:
    frequencies_hertz: A `Tensor` of frequencies in Hertz.
    name: An optional name for the operation.
  Returns:
    A `Tensor` of the same shape and type of `frequencies_hertz` containing
    frequencies in the mel scale.
  """
  with ops.name_scope(name, 'hertz_to_mel', [frequencies_hertz]):
    frequencies_hertz = ops.convert_to_tensor(frequencies_hertz)
    return _MEL_HIGH_FREQUENCY_Q * math_ops.log(
        1.0 + (frequencies_hertz / _MEL_BREAK_FREQUENCY_HERTZ))
@tf_export('signal.linear_to_mel_weight_matrix')
def linear_to_mel_weight_matrix(num_mel_bins=20,
                                num_spectrogram_bins=129,
                                sample_rate=8000,
                                lower_edge_hertz=125.0,
                                upper_edge_hertz=3800.0,
                                dtype=dtypes.float32,
                                name=None):
  with ops.name_scope(name, 'linear_to_mel_weight_matrix') as name:
    # Convert Tensor `sample_rate` to float, if possible.
    if isinstance(sample_rate, ops.Tensor):
      maybe_const_val = tensor_util.constant_value(sample_rate)
      if maybe_const_val is not None:
        sample_rate = maybe_const_val
    # This function can be constant folded by graph optimization since there are
    # no Tensor inputs.
    sample_rate = math_ops.cast(
        sample_rate, dtype, name='sample_rate')
    lower_edge_hertz = ops.convert_to_tensor(
        lower_edge_hertz, dtype, name='lower_edge_hertz')
    upper_edge_hertz = ops.convert_to_tensor(
        upper_edge_hertz, dtype, name='upper_edge_hertz')
    zero = ops.convert_to_tensor(0.0, dtype)
    # HTK excludes the spectrogram DC bin.
    bands_to_zero = 1
    nyquist_hertz = sample_rate / 2.0
    # spacing: (nyquist_hertz - zero) / (num_spectrogram_bins - 1)
    # shape = (512,) when num_spectrogram_bins = 513
    linear_frequencies = math_ops.linspace(
        zero, nyquist_hertz, num_spectrogram_bins)[bands_to_zero:]
    # hertz to mel
    # shape = [512, 1]
    spectrogram_bins_mel = array_ops.expand_dims(
        _hertz_to_mel(linear_frequencies), 1)
    # Compute num_mel_bins triples of (lower_edge, center, upper_edge). The
    # center of each band is the lower and upper edge of the adjacent bands.
    # Accordingly, we divide [lower_edge_hertz, upper_edge_hertz] into
    # num_mel_bins + 2 pieces.
    # shape = ((num_mel_bins + 2 - frame_length)/frame_step + 1, frame_length)
    band_edges_mel = shape_ops.frame(
        math_ops.linspace(_hertz_to_mel(lower_edge_hertz),
                          _hertz_to_mel(upper_edge_hertz),
                          num_mel_bins + 2), frame_length=3, frame_step=1)
    # Split the triples up and reshape them into [1, num_mel_bins] tensors.
    lower_edge_mel, center_mel, upper_edge_mel = tuple(array_ops.reshape(
        t, [1, num_mel_bins]) for t in array_ops.split(
            band_edges_mel, 3, axis=1))
    # Calculate lower and upper slopes for every spectrogram bin.
    # Line segments are linear in the mel domain, not Hertz.
    lower_slopes = (spectrogram_bins_mel - lower_edge_mel) / (
        center_mel - lower_edge_mel)
    print(lower_slopes)
    upper_slopes = (upper_edge_mel - spectrogram_bins_mel) / (
        upper_edge_mel - center_mel)
    # Intersect the line segments with each other and zero.
    mel_weights_matrix = math_ops.maximum(
        zero, math_ops.minimum(lower_slopes, upper_slopes))
    print(mel_weights_matrix)
    # Re-add the zeroed lower bins we sliced out above.
    # Prepend bands_to_zero rows filled with zeros.
    return array_ops.pad(
        mel_weights_matrix, [[bands_to_zero, 0], [0, 0]], name=name)
melbank = linear_to_mel_weight_matrix(num_mel_bins=40,
                                      num_spectrogram_bins=513,
                                      sample_rate=16000,
                                      lower_edge_hertz=20.0,
                                      upper_edge_hertz=4000.0,
                                      dtype=dtypes.float32,
                                      name=None)
The mel filterbank parameters obtained from this call.
Below is the MATLAB code that computes the mel filterbank parameters.
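The same construction can be sketched without TensorFlow, which makes the triangular filters easier to inspect. The version below is a plain-Python illustration that follows the steps of the TensorFlow code above (drop the DC bin, place the band edges evenly on the mel scale, take the minimum of the two slopes clipped at zero, then put a zero row back for the DC bin). The names `linspace` and `mel_weight_matrix` are my own, and I have not verified that the output matches TensorFlow bit for bit.

```python
import math


def hz_to_mel(f_hz):
    """Hz to mel, with the constants used in the TensorFlow code above."""
    return 1127.0 * math.log(1.0 + f_hz / 700.0)


def linspace(start, stop, num):
    """num evenly spaced values from start to stop inclusive (num >= 2)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def mel_weight_matrix(num_mel_bins=40, num_spectrogram_bins=513,
                      sample_rate=16000, lower_edge_hertz=20.0,
                      upper_edge_hertz=4000.0):
    """Return a num_spectrogram_bins x num_mel_bins list of triangular weights."""
    nyquist = sample_rate / 2.0
    # Skip the DC bin, as the TensorFlow code does (bands_to_zero = 1).
    bins_mel = [hz_to_mel(f)
                for f in linspace(0.0, nyquist, num_spectrogram_bins)[1:]]
    # num_mel_bins + 2 edges: filter j uses edges j, j + 1, and j + 2.
    edges = linspace(hz_to_mel(lower_edge_hertz), hz_to_mel(upper_edge_hertz),
                     num_mel_bins + 2)
    rows = [[0.0] * num_mel_bins]  # the zero row for the DC bin goes first
    for m in bins_mel:
        row = []
        for j in range(num_mel_bins):
            lower, center, upper = edges[j], edges[j + 1], edges[j + 2]
            lower_slope = (m - lower) / (center - lower)
            upper_slope = (upper - m) / (upper - center)
            row.append(max(0.0, min(lower_slope, upper_slope)))
        rows.append(row)
    return rows


if __name__ == "__main__":
    weights = mel_weight_matrix()
    print(len(weights), len(weights[0]))
```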
N=513; % number of spectrogram bins (513 kept from the 1024-point FFT)
fs=16000; % sampling rate (Hz)
fl=20;fh=4000; % frequency range: lower and upper limits (Hz), matching lower_edge_hertz=20 and upper_edge_hertz=4000 above
linear_frequencies=linspace(0,fs/2,N); % N evenly spaced points from 0 to half the sampling rate
linear_frequencies=linear_frequencies(2:N); % keep elements 2 through N (drop the DC bin)
spectrogram_bins_mel=2595*log10(1+linear_frequencies/700); % convert linear_frequencies to the mel scale
bl=2595*log10(1+fl/700); % lower limit on the mel scale
bh=2595*log10(1+fh/700); % upper limit on the mel scale
% band edges on the mel scale
p=40; % number of filters
mm=linspace(bl,bh,p+2); % p+2 evenly spaced mel values from bl to bh
mm_lower_mel=mm(1:40); % lower edge of each filter
mm_center_mel=mm(2:41); % center of each filter
mm_upper_mel=mm(3:42); % upper edge of each filter
spectrogram_bins_mel=repmat(spectrogram_bins_mel',1,40); % transpose (1,512) to (512,1), then tile across 40 columns to get (512,40)
lower_slopes=(spectrogram_bins_mel-repmat(mm_lower_mel,512,1))./repmat((mm_center_mel-mm_lower_mel),512,1); % element-wise division; the (1,40) row vectors are tiled down 512 rows to (512,40)
upper_slopes=(repmat(mm_upper_mel,512,1)-spectrogram_bins_mel)./repmat((mm_upper_mel-mm_center_mel),512,1); % element-wise division; the (1,40) row vectors are tiled down 512 rows to (512,40)
% element-wise minimum of lower_slopes and upper_slopes
for i=1:512
    for j=1:40
        if(lower_slopes(i,j)>upper_slopes(i,j))
            mel_weights_matrix(i,j)=upper_slopes(i,j);
        else
            mel_weights_matrix(i,j)=lower_slopes(i,j);
        end
    end
end
% element-wise maximum of mel_weights_matrix and 0
for i=1:512
    for j=1:40
        if(mel_weights_matrix(i,j)>0)
            mel_weights_matrix(i,j)=mel_weights_matrix(i,j);
        else
            mel_weights_matrix(i,j)=0;
        end
    end
end
mel_weights_matrix=[zeros(1,40);mel_weights_matrix]; % put the zero row for the DC bin back at the top, as the TensorFlow pad does; the result is (513,40), matching the (49,513) spectrogram
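The mel filterbank matrix has shape (513, 40) and the spectrograms have shape (49, 513), so the two are combined by matrix multiplication. Note that the spectrograms must be square-rooted before the multiplication. The relevant TensorFlow code is as follows.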
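# spectrograms holds squared magnitudes (magnitude_squared=True), so take the square root first
spectrograms = tf.sqrt(spectrograms)
# melbank is the (513, 40) matrix returned by linear_to_mel_weight_matrix above
mel_spectrograms = tf.matmul(spectrograms, melbank)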
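On the MATLAB side, I first took the square root of the spectrograms produced by the Fourier transform, saved them to a txt file, and imported that file into MATLAB. In MATLAB, therefore, only the matrix multiplication of the spectrograms with the mel filterbank parameters remains.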
H=E*mel_weights_matrix; % matrix product of the imported (already square-rooted) spectrogram E and mel_weights_matrix
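The square root followed by the matrix product can be written out in a few lines of plain Python, which may help when checking shapes by hand. This is a sketch with a hypothetical function name; it expects a power spectrogram of shape (frames, bins) and a weight matrix of shape (bins, mels), both as nested lists.

```python
import math


def apply_mel_filterbank(power_spectrogram, mel_weights):
    """Square-root each power value, then multiply (frames x bins) by (bins x mels)."""
    num_bins = len(mel_weights)
    num_mels = len(mel_weights[0])
    result = []
    for frame in power_spectrogram:
        if len(frame) != num_bins:
            raise ValueError("spectrogram bins and filterbank rows must match")
        magnitudes = [math.sqrt(v) for v in frame]
        result.append([sum(magnitudes[b] * mel_weights[b][j] for b in range(num_bins))
                       for j in range(num_mels)])
    return result


if __name__ == "__main__":
    # One frame with two bins, and a filterbank with two filters.
    print(apply_mel_filterbank([[4.0, 9.0]], [[1.0, 0.0], [0.5, 0.5]]))
```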
The TensorFlow code for taking the logarithm is as follows.
def log(mel_spectrograms):
    # Compute a stabilized log to get log-magnitude mel-scale spectrograms.
    # shape: (1 + (wav-win_length)/win_step, n_mels)
    # Also known in the literature as filter banks
    log_mel_spectrograms = tf.math.log(mel_spectrograms + 1e-12)
    return log_mel_spectrograms
The corresponding MATLAB code is as follows.
for i=1:49
    for j=1:p
        H(i,j)=log(H(i,j)+0.000001); % stabilized logarithm (note: the TensorFlow code above uses an offset of 1e-12)
    end
end
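The small offset added before the logarithm exists because the log of zero is undefined, and a silent frame can produce a mel energy of exactly zero. A minimal plain-Python sketch of the stabilized log, with a hypothetical function name and the 1e-12 offset taken from the TensorFlow snippet above:

```python
import math


def stabilized_log(mel_spectrogram, offset=1e-12):
    """Natural log of every value plus a small offset, so zeros stay finite."""
    return [[math.log(v + offset) for v in frame] for frame in mel_spectrogram]


if __name__ == "__main__":
    # math.log(0.0) would raise ValueError; the offset avoids that.
    print(stabilized_log([[0.0, 1.0]]))
```

The choice of offset only matters for values that are not much larger than the offset itself.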
The TensorFlow code for the discrete cosine transform:
import tensorflow as tf
# DCT over the mel axis, keeping the first 10 coefficients
mfccs = tf.signal.mfccs_from_log_mel_spectrograms(
    log_mel_spectrograms)[..., :10]
print(mfccs)
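The corresponding MATLAB code:
for i=1:49
    for j=0:9
        % accumulator for coefficient j of frame i
        acc=0;
        % discrete cosine transform (DCT-II) over the 40 log mel energies
        for k=0:39
            acc=acc+H(i,k+1)*cos((pi*j)*(2*k+1)/(2*40));
        end
        mfcc(i,j+1)=((2/40)^0.5)*acc; % scale by sqrt(2/N) with N=40
    end
end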
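The MATLAB loop above can be mirrored in plain Python for a single frame, which is convenient for spot-checking individual coefficients. This sketch takes the sqrt(2/N) scaling and the cosine argument directly from that loop; the function name and its default of 10 coefficients are my own choices for illustration.

```python
import math


def dct2_scaled(log_mel_frame, num_coefficients=10):
    """DCT-II of one frame, scaled by sqrt(2/N), keeping the first coefficients."""
    n = len(log_mel_frame)
    scale = math.sqrt(2.0 / n)
    return [scale * sum(log_mel_frame[k] * math.cos(math.pi * j * (2 * k + 1) / (2.0 * n))
                        for k in range(n))
            for j in range(num_coefficients)]


if __name__ == "__main__":
    # A constant frame has all of its energy in coefficient 0.
    print(dct2_scaled([1.0] * 40))
```

A constant input is a handy sanity check: coefficient 0 equals N*sqrt(2/N) and every other coefficient is zero up to rounding.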
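TensorFlow code reference
MATLAB code reference
Python code reference
MFCC theory reference
Feel free to discuss in the comments. My knowledge is limited and there are surely shortcomings, so corrections are welcome.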