TensorFlow1.14或TensorFlow2内部获取mfcc原理探索(matlab复现或python复现)

MFCC概述

声音是因为物体振动而产生的声波,是可以被人或动物的听觉器官所感知的波动现象。声音有多种特性,比如音色、音调、响度、频率。人耳是能够通过这些特征区分出声音的种类,那么如何让机器去区别不同种类的声音呢?研究者通常采用梅尔频率倒谱系数(Mel Frequency Cepstrum Coefficient, 简称:MFCC)作为声学特征,让机器学会辨别声音。

梅尔(Mel)频率是由研究人员跟据人耳听觉机理提出,它与赫兹(Hz)频率成非线性对应关系。MFCC则利用两者之间的非线性关系,计算得到Hz频谱特征。当前MFCC已经广泛应用于语音数据特征提取和降低运算维度。由于Hz频率与Mel频率之间存在非线性的对应关系,使得当频率提高时,MFCC的计算精度随之下降。通常情况下,在应用时仅使用低频MFCC,而舍弃中频和高频MFCC。

MFCC的计算包括预加重,分帧,加窗,快速傅里叶变换,梅尔滤波器组(梅尔频率转换),离散余弦变换(Discrete CosineTransform,简称:DCT),动态特征等过程。其中最重要的步骤是快速傅里叶变换和梅尔滤波器组,它们对数据进行了主要的降维操作。下面,介绍一下MFCC算法的具体实现过程。

TensorFlow1.14或TensorFlow2内部获取mfcc原理探索(matlab复现或python复现)_第1张图片

预加重、分帧和加窗,以及快速傅里叶变化网上参考资料具体,且参数固定,这里就不再赘述,这里可以参考这篇文章MFCC算法原理讲解。
笔者这篇文章主要分享实现mfcc后半部分的代码,也就是这里主要实现傅里叶变换后的四个步骤,取绝对值或平方值、梅尔滤波、取对数和离散余弦变换。其实就是还原下面代码contrib_audio.mfcc函数,主要分析得到快速傅里叶变换后的spectrogram后如何处理。
下面一共有两个部分的代码,第一部分的代码contrib_audio.audio_spectrogram函数完成了预加重、分帧、加窗以及傅里叶变换(根据我的研究,这个函数内部的傅里叶变换点数为1024,但是保留了前面的513个,这里笔者还没详探原理,感兴趣的可以自己看资料研究一下,网上看的资料总结出来就是如果傅里叶变换的点数为512,那么就取217,以此类推)。
第二部分的代码contrib_audio.mfcc函数完成了取绝对值或平方值(在TensorFlow内部函数实现其实是开根号,这个是一个坑)、梅尔滤波、取对数和离散余弦变换。

MFCC算法实现流程

# stft , get spectrogram
spectrogram = contrib_audio.audio_spectrogram(
    wav_decoder.audio,
    window_size=640,
    stride=640,
    magnitude_squared=True)

# get mfcc (C, H, W)
_mfcc = contrib_audio.mfcc(
    spectrogram,
    sample_rate=16000,
    upper_frequency_limit=4000,
    lower_frequency_limit=20,
    filterbank_channel_count=40,
    dct_coefficient_count=10)

MALTAB复现TensorFlow内部MFCC原理

笔者在网上寻找了需要资料,发现只有一篇参考资料是基于还原TensorFlow2的内部mfcc原理,而笔者所需要完成的项目是还原TensorFlow1.14内部的mfcc原理,虽然版本不一样,但是笔者猜测2个版本的实现原理应该相同,应该只有部分参数不一样。在笔者使用TensorFlow2的参数努力尝试还原想要的TensorFlow1.14结果时,发现结果都是差一点,果然应该是参数不同。不过matlab的结果和TensorFlow2的结果完全相同。

梅尔滤波器

下面TensorFlow2获取梅尔滤波器的源码,其中num_mel_bins为梅尔滤波器组数40;num_spectrogram_bins为513,因为之前的fft为1024,保留了前面的513,这里要对应;sample_rate为采样率16000Hz;梅尔下限频率lower_edge_hertz=20,梅尔上限频率upper_edge_hertz=4000

# 官方代码

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from tensorflow.python.framework import dtypes
from tensorflow.python.framework import ops
from tensorflow.python.framework import tensor_util
from tensorflow.python.ops import array_ops
from tensorflow.python.ops import math_ops
from tensorflow.python.ops.signal import shape_ops
from tensorflow.python.util.tf_export import tf_export

_MEL_BREAK_FREQUENCY_HERTZ = 700
_MEL_HIGH_FREQUENCY_Q = 1127

def _hertz_to_mel(frequencies_hertz, name=None):
  """Converts frequencies in `frequencies_hertz` in Hertz to the mel scale.
  Args:
    frequencies_hertz: A `Tensor` of frequencies in Hertz.
    name: An optional name for the operation.
  Returns:
    A `Tensor` of the same shape and type of `frequencies_hertz` containing
    frequencies in the mel scale.
  """
  with ops.name_scope(name, 'hertz_to_mel', [frequencies_hertz]):
    frequencies_hertz = ops.convert_to_tensor(frequencies_hertz)
    return _MEL_HIGH_FREQUENCY_Q * math_ops.log(
        1.0 + (frequencies_hertz / _MEL_BREAK_FREQUENCY_HERTZ))

@tf_export('signal.linear_to_mel_weight_matrix')
def linear_to_mel_weight_matrix(num_mel_bins=20,
                                num_spectrogram_bins=129,
                                sample_rate=8000,
                                lower_edge_hertz=125.0,
                                upper_edge_hertz=3800.0,
                                dtype=dtypes.float32,
                                name=None):
    with ops.name_scope(name, 'linear_to_mel_weight_matrix') as name:
        # Convert Tensor `sample_rate` to float, if possible.
        if isinstance(sample_rate, ops.Tensor):
            maybe_const_val = tensor_util.constant_value(sample_rate)
            if maybe_const_val is not None:
                sample_rate = maybe_const_val


        # This function can be constant folded by graph optimization since there are
        # no Tensor inputs.
        sample_rate = math_ops.cast(
            sample_rate, dtype, name='sample_rate')
        lower_edge_hertz = ops.convert_to_tensor(
            lower_edge_hertz, dtype, name='lower_edge_hertz')
        upper_edge_hertz = ops.convert_to_tensor(
            upper_edge_hertz, dtype, name='upper_edge_hertz')
        zero = ops.convert_to_tensor(0.0, dtype)

        # HTK excludes the spectrogram DC bin.
        bands_to_zero = 1
        nyquist_hertz = sample_rate / 2.0
        # 间隔 (nyquist_hertz - zero) / (num_spectrogram_bins-1)
        # shape = (512, )
        linear_frequencies = math_ops.linspace(
            zero, nyquist_hertz, num_spectrogram_bins)[bands_to_zero:]
        

        # herz to mel
        # shape = [512, 1]
        spectrogram_bins_mel = array_ops.expand_dims(
            _hertz_to_mel(linear_frequencies), 1)

      
        # Compute num_mel_bins triples of (lower_edge, center, upper_edge). The
        # center of each band is the lower and upper edge of the adjacent bands.
        # Accordingly, we divide [lower_edge_hertz, upper_edge_hertz] into
        # num_mel_bins + 2 pieces.
        # shape = ((num_mel_bins + 2 - frame_length)/frame_step + 1, frame_length)
        band_edges_mel = shape_ops.frame(
            math_ops.linspace(_hertz_to_mel(lower_edge_hertz),
                              _hertz_to_mel(upper_edge_hertz),
                              num_mel_bins + 2), frame_length=3, frame_step=1)


        # Split the triples up and reshape them into [1, num_mel_bins] tensors.
        lower_edge_mel, center_mel, upper_edge_mel = tuple(array_ops.reshape(
            t, [1, num_mel_bins]) for t in array_ops.split(
                band_edges_mel, 3, axis=1))
        
        # Calculate lower and upper slopes for every spectrogram bin.
        # Line segments are linear in the mel domain, not Hertz.
        lower_slopes = (spectrogram_bins_mel - lower_edge_mel) / (
            center_mel - lower_edge_mel)
        print(lower_slopes)
        upper_slopes = (upper_edge_mel - spectrogram_bins_mel) / (
            upper_edge_mel - center_mel)

        # Intersect the line segments with each other and zero.
        mel_weights_matrix = math_ops.maximum(
            zero, math_ops.minimum(lower_slopes, upper_slopes))
        print(mel_weights_matrix)

        # Re-add the zeroed lower bins we sliced out above.
        # 补上bands_to_zero 行, 内容为0
        return array_ops.pad(
            mel_weights_matrix, [[bands_to_zero, 0], [0, 0]], name=name)
    
melbank = linear_to_mel_weight_matrix(num_mel_bins=40,
                            num_spectrogram_bins=513,
                            sample_rate=16000,
                            lower_edge_hertz=20.0,
                            upper_edge_hertz=4000.0,
                            dtype=dtypes.float32,
                            name=None)

得到的梅尔滤波器参数
TensorFlow1.14或TensorFlow2内部获取mfcc原理探索(matlab复现或python复现)_第2张图片
下面是MATLAB获取梅尔滤波器参数代码

N=513; %fft点数
fs=16000;  %采样频率
fl=30;fh=4000;%定义频率范围,低频和高频
linear_frequencies=linspace(0,fs/2,N);%起点频率0,终点频率采样频率的一半,步长为513
linear_frequencies=linear_frequencies(2:N);%取linear_frequencies的第二个到第N个数据
spectrogram_bins_mel=2595*log10(1+linear_frequencies/700);%将linear_frequencies转化为梅尔频率 
bl=2595*log10(1+fl/700);%得到梅尔刻度的最小值
bh=2595*log10(1+fh/700);%得到梅尔刻度的最大值
%梅尔坐标范围
p=40;%滤波器个数
mm=linspace(bl,bh,p+2);%规划p+2个不同的梅尔刻度,起点bl,终点bh,间隔p+2
mm_lower_mel=mm(1:40);%低频
mm_center_mel=mm(2:41);%中频
mm_upper_mel=mm(3:42);%高频
spectrogram_bins_mel=repmat(spectrogram_bins_mel',1,40);%复制,将(1,512)的矩阵先转置成(512,1),然后再按列复制39份,变成(512,40)
lower_slopes=(spectrogram_bins_mel-repmat(mm_lower_mel,512,1))./repmat((mm_center_mel-mm_lower_mel),512,1);%用点除,矩阵对应元素相除,mm_lower_mel(1,40)按行复制511行变成(512,40),(mm_center_mel-mm_lower_mel)按行复制511变成(512,40)
upper_slopes=(repmat(mm_upper_mel,512,1)-spectrogram_bins_mel)./repmat((mm_upper_mel-mm_center_mel),512,1);%用点除,矩阵对应元素相除,mm_upper_mel(1,40)按行复制511行变成(512,40),(mm_upper_mel-mm_center_mel)按行复制511变成(512,40%判断一下upper_slopes和lower_slopes哪个数值小,留下小的给mel_weights_matrix
for i=1:512
    for j=1:40
        if(lower_slopes(i,j)>upper_slopes(i,j))
            mel_weights_matrix(i,j)=upper_slopes(i,j);
        else
            mel_weights_matrix(i,j)=lower_slopes(i,j);
        end
    end
end

%判断一下mel_weights_matrix和0比哪个数大,留下大的给mel_weights_matrix
for i=1:512
    for j=1:40
        if(mel_weights_matrix(i,j)>0)
            mel_weights_matrix(i,j)=mel_weights_matrix(i,j);
        else
            mel_weights_matrix(i,j)=0;
        end
    end
end

mel_weights_matrix=[mel_weights_matrix;zeros(1,40)];%因为spectrogram的维度为(49,513),所以将mel_weights_matrix扩展为(513,40)

MATLAB结果
TensorFlow1.14或TensorFlow2内部获取mfcc原理探索(matlab复现或python复现)_第3张图片

矩阵乘法

梅尔滤波器器得到的参数大小为(513,40),spectrograms参数大小为(49,513),需要进行矩阵乘法,注意在进行矩阵乘法之前需要对spectrograms进行开根号,TensorFlow的相关代码如下

spectrograms = tf.sqrt(spectrograms)
mel_spectrograms = tf.matmul(spectrograms, linear_to_mel_weight_matrix)

MATLB中笔者首先对傅里叶变换得到的spectrograms进行了开根号处理保存成了txt文件,从而导入MATLAB,因此笔者在matlab中仅对spectrograms和梅尔滤波参数进行矩阵乘法操作。

H=E*mel_weights_matrix; %mel_weights_matrix和spectrogram的矩阵乘法

取对数(非线性变换)

TensorFlow代码如下

def log(mel_spectrograms):
    # Compute a stabilized log to get log-magnitude mel-scale spectrograms.
    # shape: (1 + (wav-win_length)/win_step, n_mels)
    # 学术界又叫做filter_banks
    log_mel_spectrograms = tf.math.log(mel_spectrograms + 1e-12)
    return log_mel_spectrograms

MATLAB相关代码如下

for i=1:49
     for j=1:p
         H(i,j)=log(H(i,j)+0.000001);%取对数运算   
     end
end

离散余弦变换(DCT)

TensorFlow离散余弦变换代码

import tensorflow as tf
mfccs = tf.signal.mfccs_from_log_mel_spectrograms(
  log_mel_spectrograms)[..., :10]
print(mfccs)

结果
TensorFlow1.14或TensorFlow2内部获取mfcc原理探索(matlab复现或python复现)_第4张图片
MATLAB离散余弦变换代码

for i=1:49
     for j=0:9
         %先求取每一帧的能量总和
         sum=0;
         %作离散余弦变换
         for k=0:39
             sum=sum+H(i,k+1)*cos((pi*j)*(2*k+1)/(2*40));
         end
         mfcc(i,j+1)=((2/40)^0.5)*sum;
         %完成离散余弦变换
     end    
 end

结果
TensorFlow1.14或TensorFlow2内部获取mfcc原理探索(matlab复现或python复现)_第5张图片

参考文章

TensorFlow代码参考
MATLAB代码参考
python代码参考
MFCC原理参考
欢迎大家在评论区交流,笔者水平有限,还有诸多不足,多多指教

你可能感兴趣的:(mfcc,语音算法,python,matlab)