https://github.com/zzw922cn/awesome-speech-recognition-speech-synthesis-papers
A paper roadmap for automatic speech recognition and speech synthesis, covering HMM, DNN, RNN, CNN, Seq2Seq, and attention models
Automatic speech recognition has been investigated for several decades, and its models have evolved from HMM-GMM systems to today's deep neural networks. This paper roadmap is a way to follow that history. I will cover papers from traditional models to currently popular ones, including not only acoustic models and ASR systems but also many interesting language models.
Automatic Speech Recognition
An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition(1982), S. E. Levinson et al. [pdf]
A Maximum Likelihood Approach to Continuous Speech Recognition(1983), Lalit R. Bahl et al. [pdf]
Heterogeneous Acoustic Measurements and Multiple Classifiers for Speech Recognition(1998), Andrew K. Halberstadt. [pdf]
Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition(1986), Lalit R. Bahl et al. [pdf]
A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition(1989), Lawrence R Rabiner. [pdf]
Phoneme recognition using time-delay neural networks(1989), Alexander H. Waibel et al. [pdf]
Speaker-independent phone recognition using hidden Markov models(1989), Kai-Fu Lee et al. [pdf]
Hidden Markov Models for Speech Recognition(1991), B. H. Juang et al. [pdf]
Connectionist Speech Recognition: A Hybrid Approach(1994), Herve Bourlard et al. [pdf]
A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER)(1997), J.G. Fiscus. [pdf]
Review of TDNN (Time Delay Neural Network) Architectures for Speech Recognition(1991), Masahide Sugiyama et al. [pdf]
Framewise phoneme classification with bidirectional LSTM and other neural network architectures(2005), Alex Graves et al. [pdf]
Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks(2006), Alex Graves et al. [pdf]
The Kaldi speech recognition toolkit(2011), Daniel Povey et al. [pdf]
Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition(2012), Ossama Abdel-Hamid et al. [pdf]
Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition(2012), George E. Dahl et al. [pdf]
Deep Neural Networks for Acoustic Modeling in Speech Recognition(2012), Geoffrey Hinton et al. [pdf]
Sequence Transduction with Recurrent Neural Networks(2012), Alex Graves et al. [pdf]
Deep convolutional neural networks for LVCSR(2013), Tara N. Sainath et al. [pdf]
Improving deep neural networks for LVCSR using rectified linear units and dropout(2013), George E. Dahl et al. [pdf]
Improving low-resource CD-DNN-HMM using dropout and multilingual DNN training(2013), Yajie Miao et al. [pdf]
Improvements to deep convolutional neural networks for LVCSR(2013), Tara N. Sainath et al. [pdf]
Machine Learning Paradigms for Speech Recognition: An Overview(2013), Li Deng et al. [pdf]
Recent advances in deep learning for speech research at Microsoft(2013), Li Deng et al. [pdf]
Speech recognition with deep recurrent neural networks(2013), Alex Graves et al. [pdf]
Convolutional deep maxout networks for phone recognition(2014), László Tóth et al. [pdf]
Convolutional Neural Networks for Speech Recognition(2014), Ossama Abdel-Hamid et al. [pdf]
Combining time- and frequency-domain convolution in convolutional neural network-based phone recognition(2014), László Tóth. [pdf]
Deep Speech: Scaling up end-to-end speech recognition(2014), Awni Y. Hannun et al. [pdf]
End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results(2014), Jan Chorowski et al. [pdf]
First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs(2014), Andrew L. Maas et al. [pdf]
Long short-term memory recurrent neural network architectures for large scale acoustic modeling(2014), Hasim Sak et al. [pdf]
Robust CNN-based speech recognition with Gabor filter kernels(2014), Shuo-Yiin Chang et al. [pdf]
Stochastic pooling maxout networks for low-resource speech recognition(2014), Meng Cai et al. [pdf]
Towards End-to-End Speech Recognition with Recurrent Neural Networks(2014), Alex Graves et al. [pdf]
Attention-Based Models for Speech Recognition(2015), Jan Chorowski et al. [pdf]
Analysis of CNN-based speech recognition system using raw speech as input(2015), Dimitri Palaz et al. [pdf]
Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks(2015), Tara N. Sainath et al. [pdf]
Deep convolutional neural networks for acoustic modeling in low resource languages(2015), William Chan et al. [pdf]
Deep Neural Networks for Single-Channel Multi-Talker Speech Recognition(2015), Chao Weng et al. [pdf]
Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition(2015), Hasim Sak et al. [pdf]
Lexicon-Free Conversational Speech Recognition with Neural Networks(2015), Andrew L. Maas et al. [pdf]
Online Sequence Training of Recurrent Neural Networks with Connectionist Temporal Classification(2015), Kyuyeon Hwang et al. [pdf]
Advances in All-Neural Speech Recognition(2016), Geoffrey Zweig et al. [pdf]
Advances in Very Deep Convolutional Neural Networks for LVCSR(2016), Tom Sercu et al. [pdf]
End-to-end attention-based large vocabulary speech recognition(2016), Dzmitry Bahdanau et al. [pdf]
Deep Convolutional Neural Networks with Layer-Wise Context Expansion and Attention(2016), Dong Yu et al. [pdf]
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin(2016), Dario Amodei et al. [pdf]
End-to-end attention-based distant speech recognition with Highway LSTM(2016), Hassan Taherian. [pdf]
Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning(2016), Suyoun Kim et al. [pdf]
Listen, attend and spell: A neural network for large vocabulary conversational speech recognition(2016), William Chan et al. [pdf]
Latent Sequence Decompositions(2016), William Chan et al. [pdf]
Modeling Time-Frequency Patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks(2016), Tara N. Sainath et al. [pdf]
Recurrent Models for Auditory Attention in Multi-Microphone Distance Speech Recognition(2016), Suyoun Kim et al. [pdf]
Segmental Recurrent Neural Networks for End-to-End Speech Recognition(2016), Liang Lu et al. [pdf]
Towards better decoding and language model integration in sequence to sequence models(2016), Jan Chorowski et al. [pdf]
Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition(2016), Yanmin Qian et al. [pdf]
Very Deep Convolutional Networks for End-to-End Speech Recognition(2016), Yu Zhang et al. [pdf]
Very deep multilingual convolutional neural networks for LVCSR(2016), Tom Sercu et al. [pdf]
Wav2Letter: an End-to-End ConvNet-based Speech Recognition System(2016), Ronan Collobert et al. [pdf]
WaveNet: A Generative Model for Raw Audio(2016), Aäron van den Oord et al. [pdf]
Attentive Convolutional Neural Network based Speech Emotion Recognition: A Study on the Impact of Input Features, Signal Length, and Acted Speech(2017), Michael Neumann et al. [pdf]
An enhanced automatic speech recognition system for Arabic(2017), Mohamed Amine Menacer et al. [pdf]
Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM(2017), Takaaki Hori et al. [pdf]
A network of deep neural networks for distant speech recognition(2017), Mirco Ravanelli et al. [pdf]
An online sequence-to-sequence model for noisy speech recognition(2017), Chung-Cheng Chiu et al. [pdf]
An Unsupervised Speaker Clustering Technique based on SOM and I-vectors for Speech Recognition Systems(2017), Hany Ahmed et al. [pdf]
Building DNN acoustic models for large vocabulary speech recognition(2017), Andrew L. Maas et al. [pdf]
Direct Acoustics-to-Word Models for English Conversational Speech Recognition(2017), Kartik Audhkhasi et al. [pdf]
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments(2017), Zixing Zhang et al. [pdf]
English Conversational Telephone Speech Recognition by Humans and Machines(2017), George Saon et al. [pdf]
ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA(2017), Song Han et al. [pdf]
Deep LSTM for Large Vocabulary Continuous Speech Recognition(2017), Xu Tian et al. [pdf]
Gram-CTC: Automatic Unit Selection and Target Decomposition for Sequence Labelling(2017), Hairong Liu et al. [pdf]
Multichannel End-to-end Speech Recognition(2017), Tsubasa Ochiai et al. [pdf]
Multi-task Learning with CTC and Segmental CRF for Speech Recognition(2017), Liang Lu et al. [pdf]
Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition(2017), Tara N. Sainath et al. [pdf]
Optimizing expected word error rate via sampling for speech recognition(2017), Matt Shannon. [pdf]
Residual Convolutional CTC Networks for Automatic Speech Recognition(2017), Yisen Wang et al. [pdf]
Residual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition(2017), Jaeyoung Kim et al. [pdf]
Reducing Bias in Production Speech Models(2017), Eric Battenberg et al. [pdf]
Speech Synthesis
Signal estimation from modified short-time Fourier transform(1993), Daniel W. Griffin et al. [pdf]
A fast Griffin-Lim algorithm(2013), Nathanael Perraudin et al. [pdf]
First Step Towards End-to-End Parametric TTS Synthesis: Generating Spectral Parameters with Neural Attention(2016), Wenfu Wang et al. [pdf]
Recent Advances in Google Real-Time HMM-Driven Unit Selection Synthesizer(2016), Xavi Gonzalvo et al. [pdf]
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model(2016), Soroush Mehri et al. [pdf]
WaveNet: A Generative Model for Raw Audio(2016), Aäron van den Oord et al. [pdf]
Deep Voice: Real-time Neural Text-to-Speech(2017), Sercan O. Arik et al. [pdf]
Deep Voice 2: Multi-Speaker Neural Text-to-Speech(2017), Sercan Arik et al. [pdf]
Tacotron: Towards End-to-End Speech Synthesis(2017), Yuxuan Wang et al. [pdf]
Language Modelling
Class-Based n-gram Models of Natural Language(1992), Peter F. Brown et al. [pdf]
A Neural Probabilistic Language Model(2000), Yoshua Bengio et al. [pdf]
Discriminative n-gram language modeling(2007), Brian Roark et al. [pdf]
Efficient Training and Evaluation of Recurrent Neural Network Language Models for Automatic Speech Recognition(2016), Xie Chen et al. [pdf]
For any questions, feel free to send an email to [email protected]. Thanks!