This post documents my study notes on speech recognition, drawing on material from other writers online (link), to whom I am grateful. Without further ado, let's get started.
The post is practice-oriented: using the public LibriSpeech English corpus as training data, it builds a speech recognition system based on a joint CTC (Connectionist temporal classification) + BiLSTM model. I will not introduce CTC or BiLSTM themselves here; plenty of references are available online.
First, the imports and environment setup:
import os
import soundfile
import numpy as np
from scipy.fftpack import fft
from random import shuffle
import keras
from keras.layers import Input, Conv2D, BatchNormalization, MaxPooling2D,Bidirectional,LSTM
from keras.layers import Reshape, Dense, Lambda
from keras.optimizers import Adam
from keras import backend as K
from keras.models import Model
from keras.utils import to_categorical
np.random.seed(4)
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # select GPU 0 (the value is a device index, not a TF device string)
print('packages imported')
The acoustic features used here are frequency-domain features, obtained by framing, windowing, and a Fourier transform. MFCCs are more commonly used nowadays; I may add MFCC experiments later if time permits.
def compute_fbank(file):
    # 400-point Hamming window: 25 ms at 16 kHz
    x = np.linspace(0, 400 - 1, 400, dtype=np.int64)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * x / (400 - 1))
    wavsignal, fs = soundfile.read(file)
    time_window = 25  # window length in ms
    wav_arr = np.array(wavsignal)
    # number of frames with a 10 ms hop
    range0_end = int(len(wavsignal) / fs * 1000 - time_window) // 10
    data_input = np.zeros((range0_end, 200), dtype=np.float64)
    for i in range(0, range0_end):
        p_start = i * 160  # 10 ms hop = 160 samples at 16 kHz
        p_end = p_start + 400
        data_line = wav_arr[p_start:p_end]
        data_line = data_line * w              # apply the window
        data_line = np.abs(fft(data_line))
        data_input[i] = data_line[0:200]       # keep the first half of the symmetric spectrum
    data_input = np.log(data_input + 1)
    return data_input
The function returns the frequency-domain features of one utterance with shape (time_steps, number_features), e.g.:
(1406, 200)
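The frame count follows from the 25 ms window and 10 ms hop. A minimal sketch, assuming a synthetic 10 s signal at a 16 kHz sample rate, reproduces the same framing arithmetic without needing an audio file:

```python
import numpy as np

fs = 16000
signal = np.random.randn(fs * 10)  # hypothetical 10 s utterance
time_window = 25                   # window length in ms
w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(400) / 399)  # Hamming window

n_frames = int(len(signal) / fs * 1000 - time_window) // 10  # 10 ms hop
feats = np.zeros((n_frames, 200))
for i in range(n_frames):
    frame = signal[i * 160 : i * 160 + 400] * w   # 400 samples = 25 ms
    feats[i] = np.log(np.abs(np.fft.fft(frame))[:200] + 1)

print(feats.shape)  # (997, 200)
```

So a 10 s clip yields (10000 - 25) // 10 = 997 frames of 200 spectral bins each.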
This part collects the audio files and their transcriptions, producing an array of audio paths and a matching array of labels.
First, a look at a transcript file:
19-198-0000 NORTHANGER ABBEY
19-198-0001 THIS LITTLE WORK WAS FINISHED IN THE YEAR EIGHTEEN O THREE AND INTENDED FOR IMMEDIATE PUBLICATION IT WAS DISPOSED OF TO A BOOKSELLER IT WAS EVEN ADVERTISED
19-198-0002 NEITHER THE AUTHOR NOR THE PUBLIC HAVE ANY OTHER CONCERN THAN AS SOME OBSERVATION IS NECESSARY UPON THOSE PARTS OF THE WORK WHICH THIRTEEN YEARS HAVE MADE COMPARATIVELY OBSOLETE
19-198-0003 THE PUBLIC ARE ENTREATED TO BEAR IN MIND THAT THIRTEEN YEARS HAVE PASSED SINCE IT WAS FINISHED MANY MORE SINCE IT WAS BEGUN AND THAT DURING THAT PERIOD PLACES MANNERS BOOKS AND OPINIONS HAVE UNDERGONE CONSIDERABLE CHANGES
19-198-0004 CHAPTER ONE NO ONE WHO HAD EVER SEEN CATHERINE MORLAND IN HER INFANCY WOULD HAVE SUPPOSED HER BORN TO BE AN HEROINE HER SITUATION IN LIFE
19-198-0005 THE CHARACTER OF HER FATHER AND MOTHER HER OWN PERSON AND DISPOSITION WERE ALL EQUALLY AGAINST HER HER FATHER WAS A CLERGYMAN WITHOUT BEING NEGLECTED OR POOR AND A VERY RESPECTABLE MAN
19-198-0006 HER MOTHER WAS A WOMAN OF USEFUL PLAIN SENSE WITH A GOOD TEMPER AND WHAT IS MORE REMARKABLE WITH A GOOD CONSTITUTION SHE HAD THREE SONS BEFORE CATHERINE WAS BORN
19-198-0007 WHERE THERE ARE HEADS AND ARMS AND LEGS ENOUGH FOR THE NUMBER BUT THE MORLANDS HAD LITTLE OTHER RIGHT TO THE WORD FOR THEY WERE IN GENERAL VERY PLAIN AND CATHERINE FOR MANY YEARS OF HER LIFE AS PLAIN AS ANY
19-198-0008 SHE HAD A THIN AWKWARD FIGURE
Collect all transcript files:
# gather the transcript (.trans.txt) files under a directory
def get_source_txt(path):
    source_txt = []
    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith('.txt'):
                source_txt.append(os.path.join(root, file))
    return source_txt
['LibriSpeech/train-clean-100/train\\103-1240.trans.txt', 'LibriSpeech/train-clean-100/train\\103-1241.trans.txt', ...]
Using the transcript list above, build the audio-file list and the matching transcriptions:
# build the list of audio files and their transcriptions
def read_label(source_txt):
    data_label = []
    file_name = []
    for source in source_txt:
        path, _ = os.path.split(source)  # the .flac files sit next to their transcript
        with open(source, 'r', encoding='utf8') as f:
            data = f.readlines()
        for i in range(len(data)):
            data[i] = data[i].strip('\n')
            data_label.append(' '.join(data[i].split(' ')[1:]))
            file_name.append(os.path.join(path, data[i].split(' ')[0] + '.flac'))
    return data_label, file_name
['LibriSpeech/train-clean-100/train\\103-1240-0000.flac', 'LibriSpeech/train-clean-100/train\\103-1240-0001.flac','...']
['CHAPTER ONE MISSUS RACHEL LYNDE IS SURPRISED MISSUS RACHEL LYNDE LIVED JUST WHERE THE AVONLEA MAIN ROAD DIPPED DOWN INTO A LITTLE HOLLOW FRINGED WITH ALDERS AND LADIES EARDROPS AND TRAVERSED BY A BROOK','THAT HAD ITS SOURCE AWAY BACK IN THE WOODS OF THE OLD ...']
Build a vocabulary from the transcript text: 26 letters + space + apostrophe + an empty placeholder, 29 characters in total.
# build the vocabulary (29 characters)
def mk_vocab():
    # note: this listing has N before M; the numeric results below follow that order
    vocab = ['', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J',
             'K', 'L', 'N', 'M', 'O', 'P', 'Q', 'R', 'S', 'T', 'U',
             'V', 'W', 'X', 'Y', 'Z', ' ', '\'']
    return vocab
Convert a sentence into a sequence of ids:
# map characters to ids
def word2id(line, vocab):
    return [vocab.index(i) for i in line]

VO = mk_vocab()
print(word2id('A DOG', VO))
[1, 27, 4, 15, 7]
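For decoding results later, the inverse mapping is also handy. A minimal sketch (`id2word` is my own helper name, not from the post):

```python
vocab = ['', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J',
         'K', 'L', 'N', 'M', 'O', 'P', 'Q', 'R', 'S', 'T', 'U',
         'V', 'W', 'X', 'Y', 'Z', ' ', "'"]

def word2id(line, vocab):
    # character -> index, exactly as in the post
    return [vocab.index(c) for c in line]

def id2word(ids, vocab):
    # index -> character; index 0 is the empty placeholder and vanishes on join
    return ''.join(vocab[i] for i in ids)

ids = word2id('A DOG', vocab)
print(ids)                  # [1, 27, 4, 15, 7]
print(id2word(ids, vocab))  # A DOG
```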
Since training uses mini-batches, the audio inputs and label sequences within a batch must share one length, so each batch is zero-padded up to the length of its longest utterance.
def wav_padding(wav_data_lst):
    wav_lens = [len(data) for data in wav_data_lst]
    wav_max_len = max(wav_lens)
    # divide by 4: the two max-pooling layers in the CNN shrink the time axis by 4
    wav_lens = np.array([leng // 4 for leng in wav_lens])
    new_wav_data_lst = np.zeros((len(wav_data_lst), wav_max_len, 200, 1))
    for i in range(len(wav_data_lst)):
        new_wav_data_lst[i, :wav_data_lst[i].shape[0], :, 0] = wav_data_lst[i]
    return new_wav_data_lst, wav_lens
Before padding: (8, ?, 200)  # variable lengths
Raw frame counts: [1408, 1596, 1396, 1472, 1252, 1516, 956, 1504]
After padding: (8, 1596, 200, 1)  # padded to the longest utterance, with a channel axis added for the CNN
Returned wav_lens (each length // 4): [352, 399, 349, 368, 313, 379, 239, 376]
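The padding behaviour can be checked on random feature arrays. This sketch restates `wav_padding` from the post so it runs standalone, using three of the lengths from the example above:

```python
import numpy as np

def wav_padding(wav_data_lst):
    # restated from the post for a runnable check
    wav_lens = [len(data) for data in wav_data_lst]
    wav_max_len = max(wav_lens)
    wav_lens = np.array([leng // 4 for leng in wav_lens])  # CNN shrinks time by 4
    new = np.zeros((len(wav_data_lst), wav_max_len, 200, 1))
    for i, wav in enumerate(wav_data_lst):
        new[i, :wav.shape[0], :, 0] = wav
    return new, wav_lens

batch = [np.random.randn(n, 200) for n in (1408, 1596, 956)]
padded, lens = wav_padding(batch)
print(padded.shape)  # (3, 1596, 200, 1)
print(lens)          # [352 399 239]
```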
The labels are the training targets, so each sequence must be aligned with the time resolution of the network output, i.e. the padded audio length divided by 4.
# zero-pad the labels
def label_padding(label_data_lst, wav_data_lst):
    label_lens = np.array([len(label) for label in label_data_lst])
    wav_lens = [len(data) for data in wav_data_lst]
    wav_max_len = max(wav_lens)
    new_label_data_lst = np.zeros((len(label_data_lst), wav_max_len // 4))
    for i in range(len(label_data_lst)):
        new_label_data_lst[i][:len(label_data_lst[i])] = label_data_lst[i]
    return new_label_data_lst, label_lens
Before padding: (8, ?)  # variable lengths
After padding: (8, 399)  # 1596 // 4 = 399 time steps
label_lens: [201 283 250 268 227 263 160 236]
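The same kind of check works for the label padding. This sketch restates `label_padding` and feeds it toy label id sequences (the values are made up) together with dummy feature arrays:

```python
import numpy as np

def label_padding(label_data_lst, wav_data_lst):
    # restated from the post for a runnable check
    label_lens = np.array([len(label) for label in label_data_lst])
    wav_max_len = max(len(data) for data in wav_data_lst)
    new = np.zeros((len(label_data_lst), wav_max_len // 4))
    for i, label in enumerate(label_data_lst):
        new[i][:len(label)] = label
    return new, label_lens

wavs = [np.zeros((1596, 200)), np.zeros((956, 200))]   # dummy padded features
labels = [[20, 8, 5], [4, 15, 7, 27, 1]]               # toy label id sequences
padded, lens = label_padding(labels, wavs)
print(padded.shape)  # (2, 399)
print(lens)          # [3 5]
```

Every label row is stretched to 1596 // 4 = 399 entries, matching the CNN output length.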
To train the joint model, a generator builds the matching input and output dictionaries:
def data_generator(batch_size, shuffle_list, wav_lst, label_data, vocab):
    while True:
        for i in range(len(wav_lst) // batch_size):
            wav_data_lst = []
            label_data_lst = []
            begin = i * batch_size
            end = begin + batch_size
            sub_list = shuffle_list[begin:end]
            for index in sub_list:
                fbank = compute_fbank(wav_lst[index])
                # pad the time axis up to a multiple of 4 for the two pooling layers
                pad_fbank = np.zeros((fbank.shape[0] // 4 * 4 + 4, fbank.shape[1]))
                pad_fbank[:fbank.shape[0], :] = fbank
                label = word2id(label_data[index], vocab)
                wav_data_lst.append(pad_fbank)
                label_data_lst.append(label)
            pad_wav_data, input_length = wav_padding(wav_data_lst)
            pad_label_data, label_length = label_padding(label_data_lst, wav_data_lst)
            inputs = {'the_inputs': pad_wav_data,
                      'the_labels': pad_label_data,
                      'input_length': input_length,
                      'label_length': label_length,
                      }
            categorical_data = to_categorical(pad_label_data, 29)
            outputs = {'ctc': np.zeros(pad_wav_data.shape[0],), 'att': categorical_data}
            yield inputs, outputs
Here, the_inputs holds the zero-padded acoustic features, the_labels the zero-padded label sequences, input_length the (downsampled) length of each utterance in the batch, and label_length the original length of each label sequence. ctc is the output of the CTC branch and att the output of the LSTM branch.
The model uses a CNN as the shared encoder, with CTC and a BiLSTM as the two decoding branches, weighted 0.3 to 0.7. Why this particular setting is analyzed at the end.
# network building blocks
def conv2d(size):
    return Conv2D(size, (3, 3), use_bias=True, activation='relu',
                  padding='same', kernel_initializer='he_normal')

def norm(x):
    return BatchNormalization(axis=-1)(x)

def maxpool(x):
    return MaxPooling2D(pool_size=(2, 2), strides=None, padding="valid")(x)

def dense(units, activation="relu"):
    return Dense(units, activation=activation, use_bias=True,
                 kernel_initializer='he_normal')

def cnn_cell(size, x, pool=True):
    x = norm(conv2d(size)(x))
    x = norm(conv2d(size)(x))
    if pool:
        x = maxpool(x)
    return x

def ctc_lambda(args):
    labels, y_pred, input_length, label_length = args
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)
The full model definition follows; the two branches have different outputs and different loss functions.
# CTC + BiLSTM joint model
class Amodel():
    """Joint CTC/BiLSTM acoustic model."""
    def __init__(self, vocab_size):
        super(Amodel, self).__init__()
        self.vocab_size = vocab_size
        self._model_init()
        self._lstm_init()
        self._ctc_init()
        self.opt_init()

    def _model_init(self):
        self.inputs = Input(name='the_inputs', shape=(None, 200, 1))
        # self.h1 = cnn_cell(32, self.inputs)
        self.h2 = cnn_cell(64, self.inputs)
        self.h3 = cnn_cell(128, self.h2)
        self.h4 = cnn_cell(128, self.h3, pool=False)
        # after two max-pools the 200 frequency bins become 50; 50 * 128 channels = 6400
        self.h6 = Reshape((-1, 6400))(self.h4)
        self.h7 = dense(256)(self.h6)
        self.outputs = dense(self.vocab_size + 1, activation='softmax')(self.h7)  # +1 for the CTC blank

    def _lstm_init(self):
        self.bilstm = Bidirectional(LSTM(256, return_sequences=True))(self.h6)
        self.h8 = dense(256)(self.bilstm)
        self.lstm_outputs = Dense(self.vocab_size, activation='softmax', name='att')(self.h8)
        self.model = Model(inputs=self.inputs, outputs=self.lstm_outputs)

    def _ctc_init(self):
        self.labels = Input(name='the_labels', shape=[None], dtype='float32')
        self.input_length = Input(name='input_length', shape=[1], dtype='int64')
        self.label_length = Input(name='label_length', shape=[1], dtype='int64')
        self.loss_out = Lambda(ctc_lambda, output_shape=(1,), name='ctc')(
            [self.labels, self.outputs, self.input_length, self.label_length])
        self.combine_model = Model(inputs=[self.labels, self.inputs,
                                           self.input_length, self.label_length],
                                   outputs=[self.loss_out, self.lstm_outputs])

    def opt_init(self):
        opt = Adam(lr=0.0008, beta_1=0.9, beta_2=0.999, decay=0.01, epsilon=10e-8)
        # self.combine_model = multi_gpu_model(self.combine_model, gpus=2)
        self.combine_model.compile(
            loss={'ctc': lambda y_true, output: output, 'att': 'categorical_crossentropy'},
            loss_weights={'ctc': 0.3, 'att': 0.7}, optimizer=opt)
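To make the two-headed loss concrete: the 'ctc' head already emits the per-sample CTC loss from the Lambda layer, which is why its loss function is simply `lambda y_true, output: output`, while the 'att' head uses ordinary cross-entropy. A toy numpy sketch of the 0.3/0.7 weighting (the loss values below are made up):

```python
import numpy as np

# The 'ctc' head outputs the per-sample CTC loss directly, so its "loss
# function" is the identity; the 'att' head contributes cross-entropy.
ctc_head_output = np.array([12.7, 9.3])  # made-up per-sample CTC losses
att_ce = np.array([2.1, 1.8])            # made-up cross-entropy losses

total = 0.3 * ctc_head_output + 0.7 * att_ce  # loss_weights={'ctc': 0.3, 'att': 0.7}
print(total)  # [5.28 4.05]
```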
vocab = mk_vocab()
source_txt = get_source_txt('LibriSpeech/train-clean-100/train')
label_data, flac_list = read_label(source_txt)
# fbank = compute_fbank(flac_list[0])
# fbank = fbank[:fbank.shape[0]//8*8, :]
# print(fbank.shape)
total_nums = len(flac_list)
batch_size = 10
batch_num = total_nums // batch_size
epochs = 200
shuffle_list = [i for i in range(total_nums)]
# shuffle(shuffle_list)
am = Amodel(len(vocab))  # the CTC blank is added inside the model
am.combine_model.summary()
batch = data_generator(batch_size, shuffle_list, flac_list, label_data, vocab)
# am.combine_model.load_weights('speech_colab/model_combine_model_weights_200.h5')  # resume training
am.combine_model.fit_generator(batch, steps_per_epoch=batch_num, epochs=epochs)
am.combine_model.save_weights('model_combine_model_weights_200.h5')  # for resuming training
am.model.save_weights('model_lstm_ctc_model_weights_200.h5')  # for standalone recognition
Recognition runs directly on a batch of input speech and returns both the text and the numeric results:
def decode_lstm(num_result, num2word):
    r = np.argmax(num_result, axis=2)
    text = []
    for i in range(len(r)):
        tmp = []
        for j in range(len(r[i])):
            # index 0 maps to the empty string, so padded frames contribute nothing
            tmp.append(num2word[r[i][j]])
        text.append(tmp)
    return text, r
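As a quick sanity check, the argmax step inside decode_lstm can be exercised on a hand-built softmax tensor (the one-hot values below are made up; the vocabulary is the one defined earlier):

```python
import numpy as np

vocab = ['', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J',
         'K', 'L', 'N', 'M', 'O', 'P', 'Q', 'R', 'S', 'T', 'U',
         'V', 'W', 'X', 'Y', 'Z', ' ', "'"]

# Toy "softmax" output: 1 utterance, 3 time steps, 29 classes
num_result = np.zeros((1, 3, 29))
num_result[0, 0, 3] = 1.0   # 'C'
num_result[0, 1, 1] = 1.0   # 'A'
num_result[0, 2, 20] = 1.0  # 'T'

r = np.argmax(num_result, axis=2)  # same reduction decode_lstm performs
text = [''.join(vocab[j] for j in row) for row in r]
print(text)  # ['CAT']
```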
Likewise, recognition is run on one batch of speech data:
batch = data_generator(batch_size, shuffle_list, flac_list, label_data, vocab)
for i in range(10):
    # load the trained model and run recognition
    inputs, outputs = next(batch)
    x = inputs['the_inputs']
    y = inputs['the_labels']
    result = am.model.predict(x)
    # convert the numeric results to text
    text, num = decode_lstm(result, vocab)
    for j in range(len(text)):
        print('Numeric result: ', num[j])
        print('Text result:', ''.join(text[j]))
        print('Reference:', ''.join([vocab[int(k)] for k in y[j]]))
That concludes the code. What follows are the experimental results and a brief analysis.
Results after 90 training epochs:
Numeric result: [ 1 8 1 5 20 5 8 9 20 1 5 27 20 18 27 27 5 5 27 5 27 20 15 5
9 5 8 5 27 5 27 27 27 27 27 19 19 19 27 5 5 27 5 27 27 27 5 27
27 1 1 8 27 8 8 27 27 8 8 27 8 8 8 27 27 27 5 8 27 27 1 1
20 27 1 5 5 27 27 1 15 5 27 27 5 5 8 27 27 1 5 27 5 5 5 27
27 1 12 12 9 8 5 27 20 8 8 27 8 5 5 27 27 8 8 15 15 27 8 15
18 18 1 27 27 1 18 27 20 5 18 5 27 12 15 13 1 8 27 27 9 15 27 1
12 1 9 27 27 1 20 8 27 27 27 27 1 27 27 27 27 27 1 27 27 20 5 13
27 27 27 27 5 27 27 27 27 27 13 8 27 27 27 27 27 20 15 5 27 5 27 13
27 27 5 27 27 15 20 15 27 13 18 27 27 27 27 27 27 27 27 13 27 27 27 27
27 13 1 13 27 9 9 9 5 5 27 1 5 9 20 27 27 1 15 20 1 27 5 18
5 27 27 27 15 1 15 27 5 5 5 5 5 5 5 5 8 5 15 27 5 5 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Text result: AHAETEHITAE TR EE E TOEIEHE E SSS EE E E AAH HH HH HHH EH AAT AEE AOE EEH AE EEE ALLIHE THH HEE HHOO HORRA AR TERE LONAH IO ALAI ATH A A TEN E NH TOE E N E OTO NR N NAN IIIEE AEIT AOTA ERE OAO EEEEEEEEHEO EE
Reference: CHAPTER TWENTY ONE A MOMENT'S GLANCE WAS ENOUGH TO SATISFY CATHERINE THAT HER APARTMENT WAS VERY UNLIKE THE ONE WHICH HENRY HAD ENDEAVOURED TO ALARM HER BY THE DESCRIPTION OF IT WAS BY NO MEANS UNREASONABLY LARGE AND CONTAINED NEITHER TAPESTRY NOR VELVET
Numeric result: [20 8 27 27 27 20 27 27 27 27 5 27 27 13 27 19 20 5 5 5 5 5 27 27
27 27 27 27 1 27 27 27 27 27 1 5 5 19 19 27 27 5 20 13 13 27 20 1
5 27 27 5 5 20 15 15 20 20 20 20 20 27 27 27 18 27 27 5 27 27 27 27
27 13 19 27 5 15 18 27 15 15 27 27 27 18 27 27 27 15 27 5 27 27 27 27
27 20 8 1 27 18 15 27 27 5 18 27 27 9 27 5 15 27 27 27 27 27 19 27
27 27 27 18 27 27 5 7 27 5 5 27 27 27 27 15 27 27 5 27 5 5 27 27
5 27 8 15 20 15 15 1 27 27 27 27 19 27 27 20 20 5 27 27 5 5 27 27
5 20 27 20 27 27 20 27 20 27 1 13 27 1 1 27 27 8 5 13 18 27 1 27
27 27 20 27 20 27 12 5 5 5 18 27 5 19 27 19 27 27 1 1 27 27 15 9
5 27 1 9 9 27 27 27 27 27 5 5 27 27 27 15 8 27 5 27 18 18 27 5
5 27 27 5 27 27 15 27 27 27 15 15 27 27 27 27 5 8 5 27 27 15 5 18
27 0 27 27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Text result: TH T E N STEEEEE A AEESS ETNN TAE EETOOTTTTT R E NS EOR OO R O E THA RO ER I EO S R EG EE O E EE E HOTOOA S TTE EE ET T T T AN AA HENR A T T LEEER ES S AA OIE AII EE OH E RR EE E O OO EHE OER
Reference: THE WALLS WERE PAPERED THE FLOOR WAS CARPETED THE WINDOWS WERE NEITHER LESS PERFECT NOR MORE DIM THAN THOSE OF THE DRAWING ROOM BELOW THE FURNITURE THOUGH NOT OF THE LATEST FASHION WAS HANDSOME AND COMFORTABLE AND THE AIR OF THE ROOM ALTOGETHER FAR FROM UNCHEERFUL
Numeric result: [20 13 4 27 27 27 1 27 27 27 8 8 27 27 8 5 9 27 13 5 5 15 5 12
27 12 27 15 27 27 20 1 5 27 5 5 27 20 8 9 27 27 27 15 9 20 20 27
14 20 5 27 18 15 19 5 13 27 8 15 15 27 20 27 5 15 27 27 20 27 27 27
27 27 27 27 13 1 8 27 9 18 5 27 9 5 27 1 13 9 27 5 5 13 12 13
1 27 27 15 15 13 27 15 27 27 5 13 5 5 8 5 5 27 27 27 27 27 5 15
5 27 27 9 5 27 5 5 27 9 9 5 5 20 1 5 4 27 5 5 27 13 5 1
27 1 5 5 27 27 20 8 5 1 9 5 9 20 27 9 27 27 20 1 27 27 27 27
27 27 5 1 1 27 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 18 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Text result: TND A HH HEI NEEOEL L O TAE EE THI OITT MTE ROSEN HOO T EO T NAH IRE IE ANI EENLNA OON O ENEEHEE EOE IE EE IIEETAED EE NEA AEE THEAIEIT I TA EAA NRA
Reference: HER HEART INSTANTANEOUSLY AT EASE ON THIS POINT SHE RESOLVED TO LOSE NO TIME IN PARTICULAR EXAMINATION OF ANYTHING AS SHE GREATLY DREADED DISOBLIGING THE GENERAL BY ANY DELAY
Results after 180 training epochs:
Numeric result: [ 2 8 27 16 20 5 18 14 20 1 5 13 20 25 27 15 13 5 27 1 27 14 27 14
5 2 20 19 19 23 19 7 5 3 3 1 5 23 1 1 27 3 3 27 27 27 4 27
1 18 15 3 18 1 3 27 13 27 27 3 18 18 5 27 18 9 13 5 27 20 1 1
20 27 8 8 18 20 1 1 1 18 20 18 5 13 20 27 23 1 19 8 22 5 18 19
27 21 13 12 9 8 5 27 20 8 4 27 15 13 5 27 23 8 9 20 9 27 8 5
13 18 25 27 27 1 1 27 5 13 18 8 1 14 15 21 18 5 6 27 20 15 27 1
12 1 1 27 27 8 5 18 27 2 25 27 20 8 5 27 4 5 19 19 18 9 5 5
9 21 25 27 19 6 27 9 27 7 9 1 19 27 25 25 27 20 1 27 14 5 1 13
13 27 21 13 5 5 1 1 15 9 1 9 12 21 27 27 0 1 1 7 27 1 13 0
27 13 13 13 20 9 9 13 5 4 27 4 5 9 20 8 5 5 27 20 1 16 5 19
18 18 25 8 13 13 18 27 1 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Text result: BH PTERMTAENTY ONE A M MEBTSSWSGECCAEWAA CC D AROCRAC N CRRE RINE TAAT HHRTAAARTRENT WASHVERS UNLIHE THD ONE WHITI HENRY AA ENRHAMOUREF TO ALAA HER BY THE DESSRIEEIUY SF I GIAS YY TA MEANN UNEEAAOIAILU AAG AN NNNTIINED DEITHEE TAPESRRYHNNR AE
Reference: CHAPTER TWENTY ONE A MOMENT'S GLANCE WAS ENOUGH TO SATISFY CATHERINE THAT HER APARTMENT WAS VERY UNLIKE THE ONE WHICH HENRY HAD ENDEAVOURED TO ALARM HER BY THE DESCRIPTION OF IT WAS BY NO MEANS UNREASONABLY LARGE AND CONTAINED NEITHER TAPESTRY NOR VELVET
Numeric result: [23 8 5 27 27 1 12 12 19 27 5 5 5 5 5 16 1 18 5 18 5 4 27 5
8 5 27 18 18 18 15 18 27 18 1 19 27 18 1 18 16 5 20 5 25 27 20 1
5 27 23 9 13 13 15 23 5 27 18 5 18 5 27 13 5 9 20 8 5 18 27 12
5 19 27 27 18 18 18 16 5 15 27 27 13 15 18 27 18 15 18 5 27 8 9 8
27 20 8 1 14 27 8 8 5 5 18 27 27 6 27 20 8 5 27 8 18 1 23 9
6 7 27 18 15 15 14 27 2 5 12 15 23 27 20 8 5 27 6 15 18 5 9 20
8 5 5 27 20 8 15 13 5 8 27 13 15 4 27 5 6 5 6 27 5 27 27 1
5 1 3 20 20 6 1 19 8 9 15 13 27 23 1 19 27 8 1 13 4 19 20 8
5 27 8 27 25 27 3 15 12 6 15 18 15 21 2 12 5 27 1 13 6 6 20 8
5 2 1 9 18 27 27 6 27 20 8 5 5 18 27 15 27 27 8 18 0 0 0 0
0 0 5 0 27 25 13 18 27 6 0 0 0 0 0 0 0 8 0 0 18 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Text result: WHE ALLS EEEEEPARERED EHE RRROR RAS RARPETEY TAE WINNOWE RERE NEITHER LES RRRPEO NOR RORE HIH THAM HHEER F THE HRAWIFG ROOM BELOW THE FOREITHEE THONEH NOD EFEF E AEACTTFASHION WAS HANDSTHE H Y COLFOROUBLE ANFFTHEBAIR F THEER O HRE YNR FHR
Reference: THE WALLS WERE PAPERED THE FLOOR WAS CARPETED THE WINDOWS WERE NEITHER LESS PERFECT NOR MORE DIM THAN THOSE OF THE DRAWING ROOM BELOW THE FURNITURE THOUGH NOT OF THE LATEST FASHION WAS HANDSOME AND COMFORTABLE AND THE AIR OF THE ROOM ALTOGETHER FAR FROM UNCHEERFUL
Numeric result: [ 8 5 18 27 8 18 1 18 7 27 9 13 19 19 1 13 20 27 13 5 15 21 19 25
25 27 1 20 27 5 1 19 5 25 9 13 27 20 8 9 9 9 16 15 9 13 20 27
19 8 5 27 18 5 19 15 12 22 5 4 4 20 15 27 12 15 19 5 27 5 15 27
8 27 5 5 13 9 13 27 9 23 25 19 9 3 12 12 1 25 27 5 5 1 14 9
13 1 20 9 15 13 27 9 6 27 1 13 19 9 1 9 13 27 27 27 19 27 19 8
5 27 7 18 5 27 20 18 25 9 4 18 5 1 4 5 4 27 4 19 19 13 22 12
9 13 3 13 7 27 20 0 5 0 7 13 12 5 18 1 12 27 21 9 1 1 1 25
27 4 5 12 1 25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Text result: HER HRARG INSSANT NEOUSYY AT EASEYIN THIIIPOINT SHE RESOLVEDDTO LOSE EO H EENIN IWYSICLLAY EEAMINATION IF ANSIAIN S SHE GRE TRYIDREADED DSSNVLINCNG TEGNLERAL UIAAAY DELAY
Reference: HER HEART INSTANTANEOUSLY AT EASE ON THIS POINT SHE RESOLVED TO LOSE NO TIME IN PARTICULAR EXAMINATION OF ANYTHING AS SHE GREATLY DREADED DISOBLIGING THE GENERAL BY ANY DELAY
Results after 215 training epochs; by this point the model can recognize an almost complete sentence.
Numeric result: [ 4 8 15 16 20 16 18 27 20 23 5 13 20 25 27 8 5 5 27 18 27 14 6 14
5 13 20 19 19 19 19 12 1 3 3 19 27 23 1 19 27 12 12 27 21 27 8 27
9 15 13 19 14 3 12 19 6 19 27 3 1 19 13 15 6 9 13 5 27 20 8 1
20 27 8 5 18 27 1 4 1 18 18 18 5 13 15 27 23 23 19 13 22 5 25 25
27 21 13 22 9 8 8 27 20 8 4 5 11 13 5 27 23 8 9 20 2 21 8 5
13 18 25 27 27 1 4 27 13 13 13 8 1 22 15 21 18 5 6 27 1 15 27 1
12 1 1 14 27 8 5 27 2 2 25 27 20 8 5 27 27 5 12 1 18 9 5 13
14 18 13 27 15 6 27 9 25 7 23 1 19 27 2 25 27 7 15 27 14 5 1 13
19 27 21 13 27 5 1 1 15 13 1 2 12 21 27 27 1 1 7 13 27 1 13 0
0 3 13 13 20 9 9 13 5 4 27 4 4 19 20 8 5 18 27 20 1 16 18 19
20 18 25 27 13 13 18 27 1 5 12 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Text result: DHOPTPR TWENTY HEE R MFMENTSSSSLACCS WAS LL U H IONSMCLSFS CASNOFINE THAT HER ADARRRENO WWSNVEYY UNVIHH THDEKNE WHITBUHENRY AD NNNHAVOUREF AO ALAAM HE BBY THE ELARIENMRN OF IYGWAS BY GO MEANS UN EAAONABLU AAGN ANCNNTIINED DDSTHER TAPRSTRY NNR AEL
Reference: CHAPTER TWENTY ONE A MOMENT'S GLANCE WAS ENOUGH TO SATISFY CATHERINE THAT HER APARTMENT WAS VERY UNLIKE THE ONE WHICH HENRY HAD ENDEAVOURED TO ALARM HER BY THE DESCRIPTION OF IT WAS BY NO MEANS UNREASONABLY LARGE AND CONTAINED NEITHER TAPESTRY NOR VELVET
Numeric result: [23 8 5 27 8 1 12 12 19 27 5 5 5 5 5 16 1 16 5 18 5 4 27 5
8 5 27 6 12 18 6 18 27 23 1 19 27 3 1 18 16 5 20 5 25 27 20 8
5 27 23 9 13 4 15 23 19 27 23 5 18 5 27 13 5 9 20 8 5 6 27 12
5 19 27 27 18 5 18 6 5 15 27 27 13 13 18 27 7 15 18 5 27 18 9 27
27 8 8 1 27 27 25 8 15 19 18 27 27 6 27 20 18 8 14 8 18 1 23 9
7 7 27 18 15 2 14 27 2 5 12 15 23 27 20 8 5 27 6 21 18 5 9 20
12 18 5 27 20 8 18 9 7 8 27 13 15 21 27 6 6 27 6 12 5 27 27 1
1 12 9 20 3 6 1 19 19 9 20 13 27 23 1 19 27 8 1 13 4 19 15 21
8 27 1 13 2 27 27 15 27 6 15 18 1 1 2 12 5 27 1 13 1 6 20 8
5 27 1 9 18 6 21 6 27 20 8 5 5 18 27 15 14 27 18 18 13 0 0 5
20 8 5 0 27 27 13 18 27 6 0 0 0 0 0 0 0 8 0 0 18 6 0 12
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Text result: WHE HALLS EEEEEPAPERED EHE FLRFR WAS CARPETEY THE WINDOWS WERE NEITHEF LES RERFEO NNR GORE RI HHA YHOSR F TRHMHRAWIGG ROBM BELOW THE FUREITLRE THRIGH NOU FF FLE AALITCFASSITN WAS HANDSOUH ANB O FORAABLE ANAFTHE AIRFUF THEER OM RRNETHE NR FHRFL
Reference: THE WALLS WERE PAPERED THE FLOOR WAS CARPETED THE WINDOWS WERE NEITHER LESS PERFECT NOR MORE DIM THAN THOSE OF THE DRAWING ROOM BELOW THE FURNITURE THOUGH NOT OF THE LATEST FASHION WAS HANDSOME AND COMFORTABLE AND THE AIR OF THE ROOM ALTOGETHER FAR FROM UNCHEERFUL
Numeric result: [ 8 5 18 27 8 15 1 18 7 27 9 13 19 20 1 13 20 11 13 5 15 21 5 25
20 3 1 20 27 13 19 19 5 27 15 13 27 20 8 9 9 27 16 15 9 13 20 27
19 8 5 27 18 5 19 15 12 22 5 4 4 4 15 27 12 15 19 5 15 15 15 27
20 27 14 5 13 9 13 19 16 20 5 5 9 3 21 12 5 5 27 5 5 1 14 9
13 1 20 9 15 13 27 13 6 27 1 13 5 20 1 9 13 27 27 20 19 27 19 8
5 27 18 18 5 1 20 18 18 27 4 18 5 1 2 5 4 4 4 0 19 0 22 12
9 13 19 13 7 27 20 0 5 0 7 13 13 13 27 21 12 0 2 9 19 1 13 25
27 4 5 12 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Text result: HER HOARG INSTANTKNEOUEYTCAT NSSE ON THII POINT SHE RESOLVEDDDO LOSEOOO T MENINSPTEEICULEE EEAMINATION NF ANETAIN TS SHE RREATRR DREABEDDDSVLINSNG TEGNNN ULBISANY DELA
Reference: HER HEART INSTANTANEOUSLY AT EASE ON THIS POINT SHE RESOLVED TO LOSE NO TIME IN PARTICULAR EXAMINATION OF ANYTHING AS SHE GREATLY DREADED DISOBLIGING THE GENERAL BY ANY DELAY
Why these loss weights? The setting follows the paper JOINT CTC-ATTENTION BASED END-TO-END SPEECH RECOGNITION USING MULTI-TASK LEARNING, which observes that CTC speeds up training but recognizes less accurately, while the attention branch converges slowly but performs better. In my own experiments, a standalone CTC model converged after about 150 epochs but recognized worse than the LSTM, whereas a standalone LSTM model needed about 550 epochs to converge to good results. The joint model therefore treats the LSTM as the main branch and CTC as an auxiliary that accelerates convergence, improving both recognition quality and training time.
Finally, I hope we can all learn from this together. Because of upload size limits, only about 200 utterances are included here; my own experiments used about 2000. The full corpus can be downloaded from LibriSpeech.
Here is the CSDN download link for the complete code.