从音频文件中读取出来的原始语音信号通常称为raw waveform,是一个一维数组,长度是由音频长度和采样率决定,比如采样率Fs为16KHz,表示一秒钟内采样16000个点,这个时候如果音频长度是10秒,那么raw waveform中就有160000个值,值的大小通常表示的是振幅。
声音信号是一维信号,直观上只能看到时域信息,不能看到频域信息。
通过傅里叶变换(FT)可以变换到频域,但是丢失了时域信息,无法看到时频关系。为了解决这个问题,产生了很多方法,短时傅里叶变换,小波等都是很常用的时频分析方法。
短时傅里叶变换(STFT),就是对短时的信号做傅里叶变换。原理如下:对一段长语音信号,分帧、加窗,再对每一帧做傅里叶变换,之后把每一帧的结果沿时间维度堆叠,得到一张图(类似于二维信号),这张图就是声谱图。
librosa.stft(y, *, n_fft=2048, hop_length=None,
win_length=None, window='hann', center=True, dtype=None, pad_mode='constant')
def stft(
y, # 音频时间序列
*,
n_fft=2048,
hop_length=None,
win_length=None,
window="hann",
center=True,
dtype=None,
pad_mode="constant",
)
参数:
y:音频时间序列
n_fft:FFT窗口大小,n_fft=hop_length+overlapping
hop_length:帧移。
spectrum = np.abs(librosa.stft(frame, n_fft=self.nfft)),未指定hop_length时,则默认win_length / 4
spectrum = np.abs(librosa.stft(frame, n_fft=self.nfft, hop_length=len(frame)))时,如果帧移长度小于傅里叶变换点数,librosa.stft输出为hop_length+1
spectrum = np.abs(librosa.stft(frame, n_fft=self.nfft, hop_length=self.nfft))时,无论win_length设置为帧长还是nfft,librosa.stft输出都只有一帧。
最后得出结论librosa.stft的输出帧数为speech_length // hop_length + 1
win_length:每一帧音频都由window()加窗。窗长win_length,然后用零填充以匹配n_fft。
默认win_length=n_fft。
window:字符串,元组,数字,函数 shape =(n_fft, )
窗口(字符串,元组或数字)
窗函数,例如scipy.signal.hanning
长度为n_fft的向量或数组
center:bool
如果为True,则填充信号y,以使帧 D [:, t]以y [t * hop_length]为中心
如果为False,则D [:, t]从y [t * hop_length]开始
dtype:D的复数值类型。默认值为64-bit complex复数
pad_mode:如果center = True,则在信号的边缘使用填充模式。默认情况下,STFT使用reflection padding
返回:一个复数矩阵使得D(f,t) STFT矩阵 shape =(1 + nfft/2,t)其中:
n_fft/2是因为实数FFT信号具有对称性,我们只需要去一般的数据分析即可,全部返回有数据冗余。
n_frames: n_frames = (speech_len) // hop_len + 1。具体可以画图,信号处理之前首先需要padding, padding之后分帧,画图可以看到,真正与帧数有关系的,是hop_len。
参数详解:
import librosa
y, sr = librosa.load("air_conditioner01.wav") # y.shape = (88200,) sr = 22050
print("y.shape = {0}; y = {1}".format(y.shape, y))
print("sr = {0}".format(sr))
n_fft = 2048
win_length = n_fft # 默认
hop_length = win_length // 4 # 默认
D = librosa.stft(y=y, n_fft=n_fft, win_length=win_length, hop_length=hop_length, window='hann', center=True, pad_mode='constant')
print("\nSTFT结果:D.shape = {0}; \nD = {1}".format(D.shape, D))
计算:
打印结果:
y.shape = (88200,); y = [-0.00429636 -0.01181004 -0.01559684 ... -0.02535474 -0.02254227 -0.01510671]
sr = 22050
STFT结果:D.shape = (1025, 173);
D = [[ 4.0991772e-02+0.0000000e+00j 7.6347418e-02+0.0000000e+00j
1.0839665e-01+0.0000000e+00j ... -5.6880221e-02+0.0000000e+00j
-4.0515706e-01+0.0000000e+00j -1.2604758e+00+0.0000000e+00j]
[ 4.9349996e-03+2.2843832e-02j -6.5081336e-02-1.5483504e-03j
-6.4895572e-03+8.5287131e-02j ... 8.6759105e-02+4.4598781e-02j
-1.3626395e-01-3.5130250e-01j 1.2085854e+00-4.9212924e-01j]
[-3.1803373e-02+6.2021937e-02j 1.8289314e-01-2.9990083e-02j
-2.6830113e-01-2.0388773e-02j ... 2.5796932e-01-3.4968770e-01j
8.2743064e-02-4.2531621e-01j -9.6776468e-01+1.2715183e+00j]
...
[ 6.0332339e-04-2.8250517e-07j -3.0159805e-04+3.6464758e-07j
-1.6187376e-07-1.8996661e-07j ... -1.5264601e-07-2.4162804e-07j
-7.4535434e-04-8.3691225e-04j 2.4131173e-03+2.7090199e-03j]
[-6.0317205e-04+8.0741266e-08j -4.6025047e-08+3.0144685e-04j
-8.3143476e-08+1.5921461e-07j ... -1.0696644e-07+1.1993595e-07j
4.5853498e-04-1.0226711e-03j -3.3102881e-03-1.4843964e-03j]
[ 6.0308439e-04+0.0000000e+00j 3.0151531e-04+0.0000000e+00j
-1.1964966e-07+0.0000000e+00j ... 2.1211932e-07+0.0000000e+00j
1.1208834e-03+0.0000000e+00j 3.6279499e-03+0.0000000e+00j]]
Process finished with exit code 0
import numpy as np
import librosa.display
import matplotlib.pyplot as plt
y, sr = librosa.load("air_conditioner01.wav") # y.shape = (88200,) sr = 22050
print("y.shape = {0}; \ny = {1}".format(y.shape, y))
print("\nsr = {0}".format(sr))
n_fft = 2048
win_length = n_fft # 默认
hop_length = win_length // 4 # 默认
D = librosa.stft(y=y, n_fft=n_fft, win_length=win_length, hop_length=hop_length, window='hann', center=True, pad_mode='constant')
print("\nD.shape = {0}; \nD = {1}".format(D.shape, D))
D_abs = np.abs(D) # 4.9349996e-03+2.2843832e-02j ----> 2.3370814e-02【a + bj ----> sqrt(a^2 + b^2)】
print("\nD_abs.shape = {0}; \nD_abs = {1}".format(D_abs.shape, D_abs))
D_pow = D_abs ** 2 # 4.9349996e-03+2.2843832e-02j ----> 5.4619490e-04【a + bj ----> a^2 + b^2】
print("\nD_pow.shape = {0}; \nD_pow = {1}".format(D_pow.shape, D_pow))
# 作图01
fig, ax = plt.subplots()
img = librosa.display.specshow(data=D_pow, x_axis='time', y_axis='mel', sr=sr, fmax=8000, ax=ax)
fig.colorbar(img, ax=ax, format='%+2.2f')
ax.set(title='STFT spectrogram')
fig.show()
D_dB = librosa.power_to_db(D_pow) # 能量转换为分贝
print("\nD_dB.shape = {0}; \nD_dB = {1}".format(D_dB.shape, D_dB))
# 作图02
fig, ax = plt.subplots()
img = librosa.display.specshow(data=D_dB, x_axis='time', y_axis='mel', sr=sr, fmax=8000, ax=ax)
fig.colorbar(img, ax=ax, format='%+2.0f dB')
ax.set(title='STFT(dB) spectrogram')
fig.show()
# 作图03【y_axis='linear'】
fig, ax = plt.subplots()
img = librosa.display.specshow(data=D_dB, x_axis='time', y_axis='linear', sr=sr, fmax=8000, ax=ax)
fig.colorbar(img, ax=ax, format='%+2.0f dB')
ax.set(title='STFT(dB) spectrogram')
fig.show()
打印结果:
y.shape = (88200,);
y = [-0.00429636 -0.01181004 -0.01559684 ... -0.02535474 -0.02254227 -0.01510671]
sr = 22050
D.shape = (1025, 173);
D = [[ 4.0991772e-02+0.0000000e+00j 7.6347418e-02+0.0000000e+00j
1.0839665e-01+0.0000000e+00j ... -5.6880221e-02+0.0000000e+00j
-4.0515706e-01+0.0000000e+00j -1.2604758e+00+0.0000000e+00j]
[ 4.9349996e-03+2.2843832e-02j -6.5081336e-02-1.5483504e-03j
-6.4895572e-03+8.5287131e-02j ... 8.6759105e-02+4.4598781e-02j
-1.3626395e-01-3.5130250e-01j 1.2085854e+00-4.9212924e-01j]
[-3.1803373e-02+6.2021937e-02j 1.8289314e-01-2.9990083e-02j
-2.6830113e-01-2.0388773e-02j ... 2.5796932e-01-3.4968770e-01j
8.2743064e-02-4.2531621e-01j -9.6776468e-01+1.2715183e+00j]
...
[ 6.0332339e-04-2.8250517e-07j -3.0159805e-04+3.6464758e-07j
-1.6187376e-07-1.8996661e-07j ... -1.5264601e-07-2.4162804e-07j
-7.4535434e-04-8.3691225e-04j 2.4131173e-03+2.7090199e-03j]
[-6.0317205e-04+8.0741266e-08j -4.6025047e-08+3.0144685e-04j
-8.3143476e-08+1.5921461e-07j ... -1.0696644e-07+1.1993595e-07j
4.5853498e-04-1.0226711e-03j -3.3102881e-03-1.4843964e-03j]
[ 6.0308439e-04+0.0000000e+00j 3.0151531e-04+0.0000000e+00j
-1.1964966e-07+0.0000000e+00j ... 2.1211932e-07+0.0000000e+00j
1.1208834e-03+0.0000000e+00j 3.6279499e-03+0.0000000e+00j]]
D_abs.shape = (1025, 173);
D_abs = [[4.0991772e-02 7.6347418e-02 1.0839665e-01 ... 5.6880221e-02
4.0515706e-01 1.2604758e+00]
[2.3370814e-02 6.5099753e-02 8.5533671e-02 ... 9.7550981e-02
3.7680408e-01 1.3049406e+00]
[6.9700614e-02 1.8533567e-01 2.6907471e-01 ... 4.3454534e-01
4.3329009e-01 1.5979135e+00]
...
[6.0332345e-04 3.0159828e-04 2.4958049e-07 ... 2.8580573e-07
1.1207029e-03 3.6279366e-03]
[6.0317205e-04 3.0144685e-04 1.7961662e-07 ... 1.6070609e-07
1.1207634e-03 3.6278698e-03]
[6.0308439e-04 3.0151531e-04 1.1964966e-07 ... 2.1211932e-07
1.1208834e-03 3.6279499e-03]]
D_pow.shape = (1025, 173);
D_pow = [[1.6803254e-03 5.8289282e-03 1.1749834e-02 ... 3.2353594e-03
1.6415225e-01 1.5887991e+00]
[5.4619490e-04 4.2379778e-03 7.3160087e-03 ... 9.5161935e-03
1.4198132e-01 1.7028699e+00]
[4.8581758e-03 3.4349307e-02 7.2401196e-02 ... 1.8882965e-01
1.8774031e-01 2.5533276e+00]
...
[3.6399919e-07 9.0961521e-08 6.2290423e-14 ... 8.1684910e-14
1.2559751e-06 1.3161924e-05]
[3.6381653e-07 9.0870202e-08 3.2262130e-14 ... 2.5826448e-14
1.2561105e-06 1.3161439e-05]
[3.6371080e-07 9.0911477e-08 1.4316039e-14 ... 4.4994604e-14
1.2563796e-06 1.3162020e-05]]
D_dB.shape = (1025, 173);
D_dB = [[-27.746067 -22.344112 -19.299683 ... -24.900776 -7.8475313
2.01069 ]
[-32.626526 -23.728413 -21.357258 ... -20.215368 -8.477688
2.3118148]
[-23.13527 -14.64082 -11.402542 ... -7.2392983 -7.2644243
4.071065 ]
...
[-57.84667 -57.84667 -57.84667 ... -57.84667 -57.84667
-48.806805 ]
[-57.84667 -57.84667 -57.84667 ... -57.84667 -57.84667
-48.80697 ]
[-57.84667 -57.84667 -57.84667 ... -57.84667 -57.84667
-48.80677 ]]
Process finished with exit code 0
"""Short-time Fourier transform (STFT).
The STFT represents a signal in the time-frequency domain by
computing discrete Fourier transforms (DFT) over short overlapping
windows.
This function returns a complex-valued matrix D such that
- ``np.abs(D[..., f, t])`` is the magnitude of frequency bin ``f``
at frame ``t``, and
- ``np.angle(D[..., f, t])`` is the phase of frequency bin ``f``
at frame ``t``.
The integers ``t`` and ``f`` can be converted to physical units by means
of the utility functions `frames_to_sample` and `fft_frequencies`.
Parameters
----------
y : np.ndarray [shape=(..., n)], real-valued
input signal. Multi-channel is supported.
n_fft : int > 0 [scalar]
length of the windowed signal after padding with zeros.
The number of rows in the STFT matrix ``D`` is ``(1 + n_fft/2)``.
The default value, ``n_fft=2048`` samples, corresponds to a physical
duration of 93 milliseconds at a sample rate of 22050 Hz, i.e. the
default sample rate in librosa. This value is well adapted for music
signals. However, in speech processing, the recommended value is 512,
corresponding to 23 milliseconds at a sample rate of 22050 Hz.
In any case, we recommend setting ``n_fft`` to a power of two for
optimizing the speed of the fast Fourier transform (FFT) algorithm.
hop_length : int > 0 [scalar]
number of audio samples between adjacent STFT columns.
Smaller values increase the number of columns in ``D`` without
affecting the frequency resolution of the STFT.
If unspecified, defaults to ``win_length // 4`` (see below).
win_length : int <= n_fft [scalar]
Each frame of audio is windowed by ``window`` of length ``win_length``
and then padded with zeros to match ``n_fft``.
Smaller values improve the temporal resolution of the STFT (i.e. the
ability to discriminate impulses that are closely spaced in time)
at the expense of frequency resolution (i.e. the ability to discriminate
pure tones that are closely spaced in frequency). This effect is known
as the time-frequency localization trade-off and needs to be adjusted
according to the properties of the input signal ``y``.
If unspecified, defaults to ``win_length = n_fft``.
window : string, tuple, number, function, or np.ndarray [shape=(n_fft,)]
Either:
- a window specification (string, tuple, or number);
see `scipy.signal.get_window`
- a window function, such as `scipy.signal.windows.hann`
- a vector or array of length ``n_fft``
Defaults to a raised cosine window (`'hann'`), which is adequate for
most applications in audio signal processing.
.. see also:: `filters.get_window`
center : boolean
If ``True``, the signal ``y`` is padded so that frame
``D[:, t]`` is centered at ``y[t * hop_length]``.
If ``False``, then ``D[:, t]`` begins at ``y[t * hop_length]``.
Defaults to ``True``, which simplifies the alignment of ``D`` onto a
time grid by means of `librosa.frames_to_samples`.
Note, however, that ``center`` must be set to `False` when analyzing
signals with `librosa.stream`.
.. see also:: `librosa.stream`
dtype : np.dtype, optional
Complex numeric type for ``D``. Default is inferred to match the
precision of the input signal.
pad_mode : string or function
If ``center=True``, this argument is passed to `np.pad` for padding
the edges of the signal ``y``. By default (``pad_mode="constant"``),
``y`` is padded on both sides with zeros.
If ``center=False``, this argument is ignored.
.. see also:: `numpy.pad`
Returns
-------
D : np.ndarray [shape=(..., 1 + n_fft/2, n_frames), dtype=dtype]
Complex-valued matrix of short-term Fourier transform
coefficients.
See Also
--------
istft : Inverse STFT
reassigned_spectrogram : Time-frequency reassigned spectrogram
Notes
-----
This function caches at level 20.
Examples
--------
>>> y, sr = librosa.load(librosa.ex('trumpet'))
>>> S = np.abs(librosa.stft(y))
>>> S
array([[5.395e-03, 3.332e-03, ..., 9.862e-07, 1.201e-05],
[3.244e-03, 2.690e-03, ..., 9.536e-07, 1.201e-05],
...,
[7.523e-05, 3.722e-05, ..., 1.188e-04, 1.031e-03],
[7.640e-05, 3.944e-05, ..., 5.180e-04, 1.346e-03]],
dtype=float32)
Use left-aligned frames, instead of centered frames
>>> S_left = librosa.stft(y, center=False)
Use a shorter hop length
>>> D_short = librosa.stft(y, hop_length=64)
Display a spectrogram
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
>>> img = librosa.display.specshow(librosa.amplitude_to_db(S,
... ref=np.max),
... y_axis='log', x_axis='time', ax=ax)
>>> ax.set_title('Power spectrogram')
>>> fig.colorbar(img, ax=ax, format="%+2.0f dB")
"""
人耳能听到的频率范围是20-20000HZ,但是人耳对HZ单位不是线性敏感,而是对低HZ敏感,对高HZ不敏感,将HZ频率转化为梅尔频率,则人耳对频率的感知度就变为线性。
例如如果我们适应了1000Hz的音调,如果把音调频率提高到2000Hz,我们的耳朵只能觉察到频率提高了一点点,根本察觉不到频率提高了一倍。
下图是HZ到Mel的映射关系图,由于二者为log关系,在频率较低时,Mel随HZ变化较快;当频率较高时,曲线斜率小,变化缓慢。
举例来说,,则人耳听起来两者的音调也如果两段语音的Mel频率相差两倍相差两倍。
使用python的librosa音频处理库可以轻松实现:
librosa.feature.melspectrogram(*, y=None, sr=22050, S=None, n_fft=2048,
hop_length=512, win_length=None, window='hann',
center=True, pad_mode='constant', power=2.0, **kwargs)
def melspectrogram(
y=None,
sr=22050,
S=None,
n_fft=2048,
hop_length=512,
win_length=None,
window="hann",
center=True,
pad_mode="reflect",
power=2.0,
**kwargs,
):
import librosa.display
import matplotlib.pyplot as plt
y, sr = librosa.load("air_conditioner01.wav") # y.shape = (88200,) sr = 22050
print("y.shape = {0}; y = {1}".format(y.shape, y))
print("\nsr = {0}".format(sr))
n_fft = 2048
win_length = n_fft # 默认
hop_length = win_length // 4 # 默认
n_mels = 80 # 默认 128
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, win_length=win_length, hop_length=hop_length, n_mels=n_mels)
print("\nMel频谱--结果:S.shape = {0}; \nS = {1}".format(S.shape, S))
# 作图01
fig, ax = plt.subplots()
img = librosa.display.specshow(S, x_axis='time', y_axis='mel', sr=sr, fmax=8000, ax=ax)
fig.colorbar(img, ax=ax, format='%+2.2f')
ax.set(title='Mel-frequency spectrogram')
fig.show()
S_dB = librosa.power_to_db(S=S)
print("\nMel频谱-dB能量谱---结果:S_dB.shape = {0}; \nS_dB = {1}".format(S_dB.shape, S_dB))
# 作图02
fig, ax = plt.subplots()
img = librosa.display.specshow(S_dB, x_axis='time', y_axis='mel', sr=sr, fmax=8000, ax=ax)
fig.colorbar(img, ax=ax, format='%+2.0f dB')
ax.set(title='Mel-frequency(dB) spectrogram')
fig.show()
# 作图03
fig, ax = plt.subplots()
img = librosa.display.specshow(S_dB, x_axis='time', y_axis='linear', sr=sr, fmax=8000, ax=ax)
fig.colorbar(img, ax=ax, format='%+2.0f dB')
ax.set(title='Mel-frequency(dB) spectrogram')
fig.show()
打印结果:
y.shape = (88200,); y = [-0.00429636 -0.01181004 -0.01559684 ... -0.02535474 -0.02254227 -0.01510671]
sr = 22050
结果:S.shape = (80, 173);
S = [[2.1666234e-02 2.1649336e-02 1.9549519e-01 ... 7.8533143e-01 1.4178720e+00 7.7648085e-01]
[1.6606098e-01 2.8194109e-01 7.3984677e-01 ... 5.2415431e-01 6.8600404e-01 4.3208513e-01]
[1.0179499e-01 1.3825276e-01 2.5464761e-01 ... 6.1250693e-01 2.6014277e-01 1.3389401e-01]
...
[6.9146539e-05 3.5622125e-04 1.7929733e-04 ... 4.7781770e-05 4.5372890e-05 3.5324701e-05]
[3.0970994e-05 8.2059370e-05 3.9507086e-05 ... 2.9160121e-05 2.3678922e-05 1.5615517e-05]
[2.1788853e-06 5.2237228e-06 3.3358933e-06 ... 1.7063500e-06 2.1911462e-06 2.4011656e-06]]
Process finished with exit code 0
import numpy as np
import librosa.display
import matplotlib.pyplot as plt
y, sr = librosa.load("air_conditioner01.wav") # y.shape = (88200,) sr = 22050
print("y.shape = {0}; y = {1}".format(y.shape, y))
print("\nsr = {0}".format(sr))
n_fft = 2048
win_length = n_fft # 默认
hop_length = win_length // 4 # 默认
n_mels = 80 # 默认 128
D = librosa.stft(y=y, n_fft=n_fft, win_length=win_length, hop_length=hop_length, window='hann', center=True, pad_mode='constant')
print("\nSTFT结果:D.shape = {0}; \nD = {1}".format(D.shape, D))
D_abs = np.abs(D) # 4.9349996e-03+2.2843832e-02j ----> 2.3370814e-02【a + bj ----> sqrt(a^2 + b^2)】
print("\nD_abs.shape = {0}; \nD_abs = {1}".format(D_abs.shape, D_abs))
D_pow = D_abs ** 2 # 4.9349996e-03+2.2843832e-02j ----> 5.4619490e-04【a + bj ----> a^2 + b^2】
print("\nD_pow.shape = {0}; \nD_pow = {1}".format(D_pow.shape, D_pow))
S = librosa.feature.melspectrogram(S=D_pow, sr=sr, n_mels=n_mels)
print("\nMel频谱--结果:S.shape = {0}; \nS = {1}".format(S.shape, S))
# 作图01
fig, ax = plt.subplots()
img = librosa.display.specshow(S, x_axis='time', y_axis='mel', sr=sr, fmax=8000, ax=ax)
fig.colorbar(img, ax=ax, format='%+2.2f')
ax.set(title='Mel-frequency spectrogram')
fig.show()
S_dB = librosa.power_to_db(S=S)
print("\nMel频谱-能量谱---结果:S_dB.shape = {0}; \nS_dB = {1}".format(S_dB.shape, S_dB))
# 作图02
fig, ax = plt.subplots()
img = librosa.display.specshow(S_dB, x_axis='time', y_axis='mel', sr=sr, fmax=8000, ax=ax)
fig.colorbar(img, ax=ax, format='%+2.0f dB')
ax.set(title='Mel-frequency(dB) spectrogram')
fig.show()
# 作图03
fig, ax = plt.subplots()
img = librosa.display.specshow(S_dB, x_axis='time', y_axis='linear', sr=sr, fmax=8000, ax=ax)
fig.colorbar(img, ax=ax, format='%+2.0f dB')
ax.set(title='Mel-frequency(dB) spectrogram')
fig.show()
打印结果:
y.shape = (88200,); y = [-0.00429636 -0.01181004 -0.01559684 ... -0.02535474 -0.02254227 -0.01510671]
sr = 22050
STFT结果:D.shape = (1025, 173);
D = [[ 4.0991772e-02+0.0000000e+00j 7.6347418e-02+0.0000000e+00j
1.0839665e-01+0.0000000e+00j ... -5.6880221e-02+0.0000000e+00j
-4.0515706e-01+0.0000000e+00j -1.2604758e+00+0.0000000e+00j]
[ 4.9349996e-03+2.2843832e-02j -6.5081336e-02-1.5483504e-03j
-6.4895572e-03+8.5287131e-02j ... 8.6759105e-02+4.4598781e-02j
-1.3626395e-01-3.5130250e-01j 1.2085854e+00-4.9212924e-01j]
[-3.1803373e-02+6.2021937e-02j 1.8289314e-01-2.9990083e-02j
-2.6830113e-01-2.0388773e-02j ... 2.5796932e-01-3.4968770e-01j
8.2743064e-02-4.2531621e-01j -9.6776468e-01+1.2715183e+00j]
...
[ 6.0332339e-04-2.8250517e-07j -3.0159805e-04+3.6464758e-07j
-1.6187376e-07-1.8996661e-07j ... -1.5264601e-07-2.4162804e-07j
-7.4535434e-04-8.3691225e-04j 2.4131173e-03+2.7090199e-03j]
[-6.0317205e-04+8.0741266e-08j -4.6025047e-08+3.0144685e-04j
-8.3143476e-08+1.5921461e-07j ... -1.0696644e-07+1.1993595e-07j
4.5853498e-04-1.0226711e-03j -3.3102881e-03-1.4843964e-03j]
[ 6.0308439e-04+0.0000000e+00j 3.0151531e-04+0.0000000e+00j
-1.1964966e-07+0.0000000e+00j ... 2.1211932e-07+0.0000000e+00j
1.1208834e-03+0.0000000e+00j 3.6279499e-03+0.0000000e+00j]]
D_abs.shape = (1025, 173);
D_abs = [[4.0991772e-02 7.6347418e-02 1.0839665e-01 ... 5.6880221e-02
4.0515706e-01 1.2604758e+00]
[2.3370814e-02 6.5099753e-02 8.5533671e-02 ... 9.7550981e-02
3.7680408e-01 1.3049406e+00]
[6.9700614e-02 1.8533567e-01 2.6907471e-01 ... 4.3454534e-01
4.3329009e-01 1.5979135e+00]
...
[6.0332345e-04 3.0159828e-04 2.4958049e-07 ... 2.8580573e-07
1.1207029e-03 3.6279366e-03]
[6.0317205e-04 3.0144685e-04 1.7961662e-07 ... 1.6070609e-07
1.1207634e-03 3.6278698e-03]
[6.0308439e-04 3.0151531e-04 1.1964966e-07 ... 2.1211932e-07
1.1208834e-03 3.6279499e-03]]
D_pow.shape = (1025, 173);
D_pow = [[1.6803254e-03 5.8289282e-03 1.1749834e-02 ... 3.2353594e-03
1.6415225e-01 1.5887991e+00]
[5.4619490e-04 4.2379778e-03 7.3160087e-03 ... 9.5161935e-03
1.4198132e-01 1.7028699e+00]
[4.8581758e-03 3.4349307e-02 7.2401196e-02 ... 1.8882965e-01
1.8774031e-01 2.5533276e+00]
...
[3.6399919e-07 9.0961521e-08 6.2290423e-14 ... 8.1684910e-14
1.2559751e-06 1.3161924e-05]
[3.6381653e-07 9.0870202e-08 3.2262130e-14 ... 2.5826448e-14
1.2561105e-06 1.3161439e-05]
[3.6371080e-07 9.0911477e-08 1.4316039e-14 ... 4.4994604e-14
1.2563796e-06 1.3162020e-05]]
Mel频谱--结果:S.shape = (80, 173);
S = [[2.1666234e-02 2.1649336e-02 1.9549519e-01 ... 7.8533143e-01
1.4178720e+00 7.7648085e-01]
[1.6606098e-01 2.8194109e-01 7.3984677e-01 ... 5.2415431e-01
6.8600404e-01 4.3208513e-01]
[1.0179499e-01 1.3825276e-01 2.5464761e-01 ... 6.1250693e-01
2.6014277e-01 1.3389401e-01]
...
[6.9146539e-05 3.5622125e-04 1.7929733e-04 ... 4.7781770e-05
4.5372890e-05 3.5324701e-05]
[3.0970994e-05 8.2059370e-05 3.9507086e-05 ... 2.9160121e-05
2.3678922e-05 1.5615517e-05]
[2.1788853e-06 5.2237228e-06 3.3358933e-06 ... 1.7063500e-06
2.1911462e-06 2.4011656e-06]]
Mel频谱-能量谱---结果:S_dB.shape = (80, 173);
S_dB = [[-16.642166 -16.645554 -7.0886393 ... -1.0494702 1.5163702
-1.0986925]
[ -7.797324 -5.4984164 -1.3085821 ... -2.8054085 -1.6367333
-3.644307 ]
[ -9.922736 -8.593262 -5.9406037 ... -2.12889 -5.8478827
-8.7323885]
...
[-41.602295 -34.4828 -37.46426 ... -43.20738 -43.432037
-44.519215 ]
[-45.090446 -40.85872 -44.03325 ... -45.352104 -46.25638
-48.064438 ]
[-56.617657 -52.8202 -54.76788 ... -57.67932 -56.593285
-56.195778 ]]
Process finished with exit code 0
"""Compute a mel-scaled spectrogram.
If a spectrogram input ``S`` is provided, then it is mapped directly onto
the mel basis by ``mel_f.dot(S)``.
If a time-series input ``y, sr`` is provided, then its magnitude spectrogram
``S`` is first computed, and then mapped onto the mel scale by
``mel_f.dot(S**power)``.
By default, ``power=2`` operates on a power spectrum.
Parameters
----------
y : np.ndarray [shape=(..., n)] or None
audio time-series. Multi-channel is supported.
sr : number > 0 [scalar]
sampling rate of ``y``
S : np.ndarray [shape=(..., d, t)]
spectrogram
n_fft : int > 0 [scalar]
length of the FFT window
hop_length : int > 0 [scalar]
number of samples between successive frames.
See `librosa.stft`
win_length : int <= n_fft [scalar]
Each frame of audio is windowed by `window()`.
The window will be of length `win_length` and then padded
with zeros to match ``n_fft``.
If unspecified, defaults to ``win_length = n_fft``.
window : string, tuple, number, function, or np.ndarray [shape=(n_fft,)]
- a window specification (string, tuple, or number);
see `scipy.signal.get_window`
- a window function, such as `scipy.signal.windows.hann`
- a vector or array of length ``n_fft``
.. see also:: `librosa.filters.get_window`
center : boolean
- If `True`, the signal ``y`` is padded so that frame
``t`` is centered at ``y[t * hop_length]``.
- If `False`, then frame ``t`` begins at ``y[t * hop_length]``
pad_mode : string
If ``center=True``, the padding mode to use at the edges of the signal.
By default, STFT uses zero padding.
power : float > 0 [scalar]
Exponent for the magnitude melspectrogram.
e.g., 1 for energy, 2 for power, etc.
**kwargs : additional keyword arguments
Mel filter bank parameters.
See `librosa.filters.mel` for details.
Returns
-------
S : np.ndarray [shape=(..., n_mels, t)]
Mel spectrogram
See Also
--------
librosa.filters.mel : Mel filter bank construction
librosa.stft : Short-time Fourier Transform
Examples
--------
>>> y, sr = librosa.load(librosa.ex('trumpet'))
>>> librosa.feature.melspectrogram(y=y, sr=sr)
array([[3.837e-06, 1.451e-06, ..., 8.352e-14, 1.296e-11],
[2.213e-05, 7.866e-06, ..., 8.532e-14, 1.329e-11],
...,
[1.115e-05, 5.192e-06, ..., 3.675e-08, 2.470e-08],
[6.473e-07, 4.402e-07, ..., 1.794e-08, 2.908e-08]],
dtype=float32)
Using a pre-computed power spectrogram would give the same result:
>>> D = np.abs(librosa.stft(y))**2
>>> S = librosa.feature.melspectrogram(S=D, sr=sr)
Display of mel-frequency spectrogram coefficients, with custom
arguments for mel filterbank construction (default is fmax=sr/2):
>>> # Passing through arguments to the Mel filters
>>> S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128,
... fmax=8000)
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
>>> S_dB = librosa.power_to_db(S, ref=np.max)
>>> img = librosa.display.specshow(S_dB, x_axis='time',
... y_axis='mel', sr=sr,
... fmax=8000, ax=ax)
>>> fig.colorbar(img, ax=ax, format='%+2.0f dB')
>>> ax.set(title='Mel-frequency spectrogram')
"""
MFCC的全部组成其实是由:
librosa.feature.mfcc(*, y=None, sr=22050, S=None, n_mfcc=20, dct_type=2, norm='ortho', lifter=0, **kwargs)
def mfcc(
y=None,
sr=22050,
S=None,
n_mfcc=20,
dct_type=2,
norm="ortho",
lifter=0,
**kwargs
):
这里主要设置sr
和n_mfcc
(你要提取特征的个数)
import librosa.display
import matplotlib.pyplot as plt
y, sr = librosa.load("air_conditioner01.wav") # y.shape = (88200,) sr = 22050
print("y.shape = {0}; y = {1}".format(y.shape, y))
print("\nsr = {0}".format(sr))
sr = 22050 # 默认 22050
n_mfcc = 20 # 默认 20
dct_type = 2 # 默认 2
norm = "ortho" # 默认 "ortho"
lifter = 0 # 默认 0
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, dct_type=dct_type, norm=norm, lifter=lifter)
print("\nMFCC--结果:mfccs.shape = {0}; \nmfccs = {1}".format(mfccs.shape, mfccs))
# 作图
fig, ax = plt.subplots()
img = librosa.display.specshow(mfccs, x_axis='time', y_axis='mel', sr=sr, fmax=8000, ax=ax)
fig.colorbar(img, ax=ax, format='%+2.2f')
ax.set(title='MFCC')
fig.show()
打印结果:
y.shape = (88200,); y = [-0.00429636 -0.01181004 -0.01559684 ... -0.02535474 -0.02254227 -0.01510671]
sr = 22050
MFCC--结果:mfccs.shape = (20, 173);
mfccs = [[-287.29108 -247.5219 -249.65224 ... -253.60095 -254.32365 -275.92908 ]
[ 129.54532 118.740906 129.125 ... 154.39165 151.97177 145.18109 ]
[ -30.826519 -26.77267 -32.930305 ... -33.37476 -32.383087 -30.955482 ]
...
[ 9.547321 12.869612 13.215841 ... 7.2535534 7.9752393 6.778652 ]
[ 4.4144416 4.8549047 7.463866 ... 3.3326685 3.2872963 5.157634 ]
[ 2.9182916 4.505571 5.7218165 ... 4.5183744 1.9608486 -1.0470729]]
Process finished with exit code 0
import numpy as np
import librosa.display
import matplotlib.pyplot as plt
y, sr = librosa.load("air_conditioner01.wav") # y.shape = (88200,) sr = 22050
print("y.shape = {0}; y = {1}".format(y.shape, y))
print("\nsr = {0}".format(sr))
n_fft = 2048
win_length = n_fft # 默认
hop_length = win_length // 4 # 默认
n_mels = 80 # 默认 128
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, win_length=win_length, hop_length=hop_length, n_mels=n_mels)
print("\nMel频谱--结果:S.shape = {0}; \nS = {1}".format(S.shape, S))
S_dB = librosa.power_to_db(S=S, ref=np.max)
print("\nMel频谱-dB能量谱---结果:S_dB.shape = {0}; \nS_dB = {1}".format(S_dB.shape, S_dB))
mfccs = librosa.feature.mfcc(S=S_dB)
print("\nMFCC---结果:mfccs.shape = {0}; \nmfccs = {1}".format(mfccs.shape, mfccs))
# 作图
fig, ax = plt.subplots(nrows=2, sharex=True)
img = librosa.display.specshow(S_dB, x_axis='time', y_axis='mel', fmax=8000, ax=ax[0])
fig.colorbar(img, ax=[ax[0]], format='%+2.0f dB')
ax[0].set(title='Mel spectrogram')
ax[0].label_outer()
img = librosa.display.specshow(mfccs, x_axis='time', y_axis='mel', fmax=8000, ax=ax[1])
fig.colorbar(img, ax=[ax[1]], format='%+2.0f dB')
ax[1].set(title='MFCC')
ax[1].label_outer()
fig.show()
"""Mel-frequency cepstral coefficients (MFCCs)
.. warning:: If multi-channel audio input ``y`` is provided, the MFCC
calculation will depend on the peak loudness (in decibels) across
all channels. The result may differ from independent MFCC calculation
of each channel.
Parameters
----------
y : np.ndarray [shape=(..., n,)] or None
audio time series. Multi-channel is supported..
sr : number > 0 [scalar]
sampling rate of ``y``
S : np.ndarray [shape=(..., d, t)] or None
log-power Mel spectrogram
n_mfcc : int > 0 [scalar]
number of MFCCs to return
dct_type : {1, 2, 3}
Discrete cosine transform (DCT) type.
By default, DCT type-2 is used.
norm : None or 'ortho'
If ``dct_type`` is `2 or 3`, setting ``norm='ortho'`` uses an ortho-normal
DCT basis.
Normalization is not supported for ``dct_type=1``.
lifter : number >= 0
If ``lifter>0``, apply *liftering* (cepstral filtering) to the MFCCs::
M[n, :] <- M[n, :] * (1 + sin(pi * (n + 1) / lifter) * lifter / 2)
Setting ``lifter >= 2 * n_mfcc`` emphasizes the higher-order coefficients.
As ``lifter`` increases, the coefficient weighting becomes approximately linear.
**kwargs : additional keyword arguments
Arguments to `melspectrogram`, if operating
on time series input
Returns
-------
M : np.ndarray [shape=(..., n_mfcc, t)]
MFCC sequence
See Also
--------
melspectrogram
scipy.fftpack.dct
Examples
--------
Generate mfccs from a time series
>>> y, sr = librosa.load(librosa.ex('libri1'))
>>> librosa.feature.mfcc(y=y, sr=sr)
array([[-565.919, -564.288, ..., -426.484, -434.668],
[ 10.305, 12.509, ..., 88.43 , 90.12 ],
...,
[ 2.807, 2.068, ..., -6.725, -5.159],
[ 2.822, 2.244, ..., -6.198, -6.177]], dtype=float32)
Using a different hop length and HTK-style Mel frequencies
>>> librosa.feature.mfcc(y=y, sr=sr, hop_length=1024, htk=True)
array([[-5.471e+02, -5.464e+02, ..., -4.446e+02, -4.200e+02],
[ 1.361e+01, 1.402e+01, ..., 9.764e+01, 9.869e+01],
...,
[ 4.097e-01, -2.029e+00, ..., -1.051e+01, -1.130e+01],
[-1.119e-01, -1.688e+00, ..., -3.442e+00, -4.687e+00]],
dtype=float32)
Use a pre-computed log-power Mel spectrogram
>>> S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128,
... fmax=8000)
>>> librosa.feature.mfcc(S=librosa.power_to_db(S))
array([[-559.974, -558.449, ..., -411.96 , -420.458],
[ 11.018, 13.046, ..., 76.972, 80.888],
...,
[ 2.713, 2.379, ..., 1.464, -2.835],
[ 2.712, 2.619, ..., 2.209, 0.648]], dtype=float32)
Get more components
>>> mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
Visualize the MFCC series
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(nrows=2, sharex=True)
>>> img = librosa.display.specshow(librosa.power_to_db(S, ref=np.max),
... x_axis='time', y_axis='mel', fmax=8000,
... ax=ax[0])
>>> fig.colorbar(img, ax=[ax[0]])
>>> ax[0].set(title='Mel spectrogram')
>>> ax[0].label_outer()
>>> img = librosa.display.specshow(mfccs, x_axis='time', ax=ax[1])
>>> fig.colorbar(img, ax=[ax[1]])
>>> ax[1].set(title='MFCC')
Compare different DCT bases
>>> m_slaney = librosa.feature.mfcc(y=y, sr=sr, dct_type=2)
>>> m_htk = librosa.feature.mfcc(y=y, sr=sr, dct_type=3)
>>> fig, ax = plt.subplots(nrows=2, sharex=True, sharey=True)
>>> img1 = librosa.display.specshow(m_slaney, x_axis='time', ax=ax[0])
>>> ax[0].set(title='RASTAMAT / Auditory toolbox (dct_type=2)')
>>> fig.colorbar(img, ax=[ax[0]])
>>> img2 = librosa.display.specshow(m_htk, x_axis='time', ax=ax[1])
>>> ax[1].set(title='HTK-style (dct_type=3)')
>>> fig.colorbar(img2, ax=[ax[1]])
"""
To get MFCC, compute the DCT on the mel-spectrogram. The mel-spectrogram is often log-scaled before.
MFCC is a very compressible representation, often using just 20 or 13 coefficients instead of 32-64 bands in Mel spectrogram.
The MFCC is a bit more decorrelarated, which can be beneficial with linear models like Gaussian Mixture Models.
With lots of data and strong classifiers like Convolutional Neural Networks, mel-spectrogram can often perform better.
梅尔频谱(melspectrogram)与 梅尔倒谱(MFCC)的区别:
# -- Mel spectrogram and MFCCs -- #
def mfcc(y=None, sr=22050, S=None, n_mfcc=20, **kwargs):
if S is None:
S = logamplitude(melspectrogram(y=y, sr=sr, **kwargs))
return np.dot(filters.dct(n_mfcc, S.shape[0]), S)
给定原始的音频信号,通过melspectrogram()函数提取梅尔频谱,然后通过DCT离散余弦变换得到梅尔倒谱系数。(总之一句话,在梅尔频谱上取对数,做DCT变换,就得到了梅尔倒谱)
Mel滤波器对应了频率提高之后人耳会迟钝的客观规律:
参考资料:
声谱图,梅尔语谱,倒谱,梅尔倒谱系数
【Day6】窗涵式,n_fft ,hop_length 到底什麽意思啊?
零基础入门语音识别: 一文详解MFCC特征(附python代码)
论文笔记:语音情感识别(四)语音特征之声谱图,log梅尔谱,MFCC,deltas
声谱图,梅尔谱图
librosa 语音库(二)STFT 的实现
梅尔频谱和梅尔倒谱的初次理解和使用
音频特征提取——常用音频特征
深度学习之语音识别-音频基础知识、声谱图(Spectrogram)
声谱图
STFT和声谱图,梅尔频谱(Mel Bank Features)与梅尔倒谱(MFCCs)
语谱图(四) Mel spectrogram 梅尔语谱图
librosa 语音库(三) librosa.feature. 中的 spectrogram 与 melspectrogram
librosa.feature.melspectrogram 的形状(Shape of librosa.feature.melspectrogram)
librosa.feature.melspectrogram()
【librosa】音频特征提取
信号处理基础——傅里叶变换与短时傅里叶变换
语音信号加窗分帧是起什么作用
声学特征(二) MFCC特征原理
MFCC
librosa–学习笔记(2)(频谱特性 Spectral representations)