Key concepts in speech recognition: sound is, at its core, vibration, and vibration is displacement as a function of time. A waveform file (.wav) records the displacement at successive sampling instants.
Via the Fourier transform, the time-domain signal can be decomposed into a superposition of sinusoids at different frequencies. The characteristic distribution of the spectral lines lets us map audio content to text, which serves as the basis for model training.
Main idea: extract 13 features per frame to build a Mel-Frequency Cepstral Coefficient (MFCC) matrix.
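As a minimal illustration of this decomposition (a sketch using only NumPy; the signal and its frequencies are invented for the example), the FFT of a two-tone signal shows spectral lines at exactly the two input frequencies:

```python
import numpy as np

sample_rate = 1000                       # samples per second (assumed)
t = np.arange(0, 1, 1 / sample_rate)     # one second of sample times
# displacement over time: a 50 Hz and a 120 Hz sine added together
sig = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.abs(np.fft.rfft(sig))      # magnitude of each frequency component
freqs = np.fft.rfftfreq(len(sig), 1 / sample_rate)

# the two largest spectral lines sit at the two input frequencies
peaks = freqs[np.argsort(spectrum)[-2:]]
print(sorted(peaks))                     # [50.0, 120.0]
```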
API:
import scipy.io.wavfile as wf
import python_speech_features as sf
# read the sampling rate and the signal
sample_rate, sigs = wf.read('xxx.wav')
# build the MFCC matrix
mfcc = sf.mfcc(sigs, sample_rate)
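The mel scale underlying these cepstral coefficients maps physical frequency to perceived pitch; the commonly used formula m = 2595 · log10(1 + f / 700) can be sketched directly with plain NumPy (independent of python_speech_features, which applies it internally):

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale (O'Shaughnessy formula)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)

# 1000 Hz lands close to 1000 mel by construction of the scale
print(hz_to_mel(1000))
# roughly linear below 1 kHz, logarithmic above it
print(hz_to_mel([500, 2000, 8000]))
```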
Example:
import json
import numpy as np
import scipy.io.wavfile as wf
# read the note-to-frequency table from a json file
with open('../data/12.json', 'r') as f:
    freqs = json.loads(f.read())
# (note name, duration in seconds) pairs of a short melody
tones = [
    ('G5', 1.5),
    ('A5', 0.5),
    ('G5', 1.5),
    ('E5', 0.5),
    ('D5', 0.5),
    ('E5', 0.25),
    ('D5', 0.25),
    ('C5', 0.5),
    ('A4', 0.5),
    ('C5', 0.75)]
# set the sampling rate
sample_rate = 44100
# empty array that will accumulate the synthesized audio
music = np.empty(shape=0)
for tone, duration in tones:
    times = np.linspace(0, duration, int(duration * sample_rate))
    sound = np.sin(2 * np.pi * freqs[tone] * times)
    music = np.append(music, sound)
# scale to the int16 range (2 ** 15 - 1 avoids overflowing int16)
music *= 2 ** 15 - 1
music = music.astype(np.int16)
wf.write('music.wav', sample_rate, music)
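The json file above supplies the note frequencies. If that file is not at hand, an equivalent table can be derived under twelve-tone equal temperament from the A4 = 440 Hz reference (a hypothetical stand-in for '../data/12.json', not the original data):

```python
import numpy as np

NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def note_freq(name):
    """Frequency of a note like 'A4' or 'G5' in equal temperament (A4 = 440 Hz)."""
    semitone = NAMES.index(name[:-1])    # position within the octave
    octave = int(name[-1])
    midi = 12 * (octave + 1) + semitone  # MIDI note number (A4 -> 69)
    return 440.0 * 2.0 ** ((midi - 69) / 12)

# build a freqs dict covering the octaves the melody uses
freqs = {n + str(o): note_freq(n + str(o)) for o in (4, 5) for n in NAMES}
print(round(freqs['A4'], 2))   # 440.0
print(round(freqs['C5'], 2))   # 523.25
```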
Basic steps: extract the audio information (sampling rate, signal), build the MFCC matrix, train one hidden Markov model per label, then recognize by scoring.
API:
import numpy as np
import scipy.io.wavfile as wf
import python_speech_features as sf
import hmmlearn.hmm as hl
# collect the training samples
# files_list groups the wav paths by label; train_y holds the matching
# labels, e.g. apple, banana, pear
train_x, train_y = [], []
for sound_files in files_list:
    mfccs = np.array([])
    for sound_file in sound_files:
        sample_rate, sigs = wf.read(sound_file)
        mfcc = sf.mfcc(sigs, sample_rate)
        # stack this file's mfcc matrix onto mfccs
        if len(mfccs) == 0:
            mfccs = mfcc
        else:
            mfccs = np.append(mfccs, mfcc, axis=0)
    # one stacked mfcc matrix per label
    train_x.append(mfccs)
# the final train_x holds len(files_list) feature matrices
# build and train one hidden Markov model per label
models = {}
for mfccs, label in zip(train_x, train_y):
    model = hl.GaussianHMM(
        n_components=4, covariance_type='diag',
        n_iter=1000)
    models[label] = model.fit(mfccs)
# build test_x, test_y the same way
# prediction
pred_y = []
for mfccs in test_x:
    # score how well each model matches the current mfcc matrix
    best_score, best_label = None, None
    for label, model in models.items():
        score = model.score(mfccs)
        if (best_score is None) or (best_score < score):
            best_score = score
            best_label = label
    pred_y.append(best_label)
print(test_y)
print(pred_y)
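Beyond eyeballing the two printed lists, the match between them can be summarized as an accuracy score (a small sketch with made-up labels standing in for the actual test data):

```python
import numpy as np

# hypothetical true and predicted labels, standing in for test_y / pred_y
test_y = ['apple', 'banana', 'pear', 'apple']
pred_y = ['apple', 'banana', 'apple', 'apple']

# fraction of positions where the predicted label equals the true one
accuracy = np.mean(np.array(test_y) == np.array(pred_y))
print(accuracy)   # 0.75
```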