【Computer Science】【2014】Deep Neural Network Acoustic Models for Automatic Speech Recognition (ASR)

This is a PhD thesis from the University of Toronto, Canada (author: Abdel-rahman Mohamed), 129 pages in total.

Automatic speech recognition (ASR) is a key core technology of the information age. ASR systems have evolved from discriminating among isolated digits to recognizing telephone-quality, spontaneous speech, finding a growing number of practical applications across many sectors. Nevertheless, speech recognition still faces serious challenges that call for major improvements at almost every stage of the recognition process. For many years the standard approach to ASR remained largely unchanged: hidden Markov models (HMMs) model the sequential structure of the speech signal, and each HMM state uses a mixture of diagonal-covariance Gaussians (GMM) to model a spectral representation of the sound wave.
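To make that baseline concrete, the sketch below (with assumed dimensions and random stand-in parameters, not taken from the thesis) computes the emission log-likelihood of one acoustic frame under a single HMM state's diagonal-covariance GMM:

```python
import numpy as np

def gmm_log_likelihood(frame, weights, means, variances):
    """frame: (D,) feature vector; weights: (M,) mixture weights;
    means, variances: (M, D) parameters of M diagonal-covariance Gaussians."""
    D = frame.shape[0]
    # Per-component log N(frame | mean_m, diag(var_m))
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((frame - means) ** 2 / variances, axis=1)
    log_components = np.log(weights) + log_norm + log_exp
    # Log-sum-exp over mixture components for numerical stability
    m = log_components.max()
    return m + np.log(np.sum(np.exp(log_components - m)))

# Usage: a 39-dimensional feature frame scored against an 8-component state GMM
frame = np.random.randn(39)
weights = np.full(8, 1 / 8)
means = np.random.randn(8, 39)
variances = np.ones((8, 39))
print(gmm_log_likelihood(frame, weights, means, variances))
```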

This thesis describes new acoustic models based on deep neural networks (DNNs), which have begun to replace GMMs. For ASR, the deep structure of a DNN and its distributed representations allow learned features to generalize better to new situations, even when only small amounts of training data are available. In addition, DNN acoustic models scale well to large-vocabulary tasks, improving significantly on the best previous systems. Different input feature representations are analyzed to determine which is better suited to DNN acoustic models: Mel-frequency cepstral coefficients (MFCC) prove inferior to log Mel-frequency spectral coefficients (MFSC), which help the DNN marginalize out speaker-specific information while focusing on discriminative phonetic features. Several speaker adaptation techniques are also introduced to further improve DNN performance. Another deep acoustic model, based on convolutional neural networks (CNNs), is proposed as well. Rather than the fully connected hidden layers of a DNN, a CNN uses pairs of convolutional and pooling layers as its building blocks. The convolution operation scans the frequency axis with learned local spectro-temporal filters, while the pooling layer applies a maximum operation to the learned features, exploiting the smoothness of the input MFSC features to remove speaker variation expressed as shifts along the frequency axis, in a manner similar to vocal tract length normalization (VTLN). The results show that, compared with GMMs, the proposed DNN and CNN acoustic models achieve significant improvements on a variety of small- and large-vocabulary tasks.
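The frequency-axis convolution and pooling described above can be illustrated with a short PyTorch sketch; the layer sizes and framing below are assumptions for illustration, not the exact architecture evaluated in the thesis:

```python
import torch
import torch.nn as nn

class ConvAcousticModel(nn.Module):
    def __init__(self, n_mel=40, context=11, n_states=2000):
        super().__init__()
        # Treat the (context x n_mel) patch of MFSC frames as a 1-channel "image";
        # each kernel spans all context frames but only a local band of
        # frequencies, i.e. a learned local spectro-temporal filter.
        self.conv = nn.Conv2d(1, 64, kernel_size=(context, 8))   # -> (64, 1, n_mel - 7)
        # Max pooling along the frequency axis gives tolerance to small
        # spectral shifts between speakers (the VTLN-like effect).
        self.pool = nn.MaxPool2d(kernel_size=(1, 3))
        conv_out = 64 * 1 * ((n_mel - 8 + 1) // 3)
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(conv_out, 1024), nn.Sigmoid(),
            nn.Linear(1024, n_states),                            # HMM-state scores
        )

    def forward(self, x):                  # x: (batch, 1, context, n_mel)
        h = torch.sigmoid(self.conv(x))
        h = self.pool(h)
        return self.fc(h)                  # unnormalized HMM-state posteriors (logits)

# Usage: a window of 11 frames of 40-band MFSC features
model = ConvAcousticModel()
logits = model(torch.randn(4, 1, 11, 40))
print(logits.shape)                        # torch.Size([4, 2000])
```

The max pooling step is what removes small frequency shifts between speakers before the fully connected layers, which is the property the abstract compares to VTLN.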

Original English abstract:

Automatic speech recognition (ASR) is a key core technology for the information age. ASR systems have evolved from discriminating among isolated digits to recognizing telephone-quality, spontaneous speech, allowing for a growing number of practical applications in various sectors. Nevertheless, there are still serious challenges facing ASR which require major improvement in almost every stage of the speech recognition process. Until very recently, the standard approach to ASR had remained largely unchanged for many years. It used Hidden Markov Models (HMMs) to model the sequential structure of speech signals, with each HMM state using a mixture of diagonal covariance Gaussians (GMM) to model a spectral representation of the sound wave. This thesis describes new acoustic models based on Deep Neural Networks (DNN) that have begun to replace GMMs. For ASR, the deep structure of a DNN as well as its distributed representations allow for better generalization of learned features to new situations, even when only small amounts of training data are available. In addition, DNN acoustic models scale well to large vocabulary tasks, significantly improving upon the best previous systems. Different input feature representations are analyzed to determine which one is more suitable for DNN acoustic models. Mel-frequency cepstral coefficients (MFCC) are inferior to log Mel-frequency spectral coefficients (MFSC), which help DNN models marginalize out speaker-specific information while focusing on discriminant phonetic features. Various speaker adaptation techniques are also introduced to further improve DNN performance. Another deep acoustic model based on Convolutional Neural Networks (CNN) is also proposed. Rather than using fully connected hidden layers as in a DNN, a CNN uses a pair of convolutional and pooling layers as building blocks. The convolution operation scans the frequency axis using a learned local spectro-temporal filter, while in the pooling layer a maximum operation is applied to the learned features, utilizing the smoothness of the input MFSC features to eliminate speaker variations expressed as shifts along the frequency axis in a way similar to vocal tract length normalization (VTLN) techniques. We show that the proposed DNN and CNN acoustic models achieve significant improvements over GMMs on various small and large vocabulary tasks.
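The MFCC/MFSC comparison in the abstract comes down to a single transform: MFSC are the log Mel filterbank energies themselves, while MFCC apply a discrete cosine transform on top of them, which decorrelates the dimensions (convenient for diagonal-covariance GMMs) but discards the spectral locality that DNN and CNN models can exploit. A minimal sketch with random stand-in values:

```python
import numpy as np
from scipy.fftpack import dct

def mfsc_to_mfcc(log_mel_frames, n_ceps=13):
    """log_mel_frames: (T, n_mel) log Mel filterbank energies (MFSC).
    Returns (T, n_ceps) Mel-frequency cepstral coefficients (MFCC)."""
    return dct(log_mel_frames, type=2, norm='ortho', axis=1)[:, :n_ceps]

# Usage: 100 frames of 40-band MFSC features (random stand-in values)
mfsc = np.log(np.random.rand(100, 40) + 1e-6)
mfcc = mfsc_to_mfcc(mfsc)
print(mfsc.shape, mfcc.shape)   # (100, 40) (100, 13)
```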

1 Introduction
2 Background and experimental setup
3 Training DNNs for acoustic modeling
4 Which features to use in DNN acoustic models
5 Speaker adaptive training of DNN acoustic models
6 Convolutional neural network acoustic models
7 Conclusions and future work

