

Espresso is an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch and the popular neural machine translation toolkit fairseq. Espresso supports distributed training across GPUs and computing nodes, and features various decoding approaches commonly employed in ASR, including look-ahead word-based language model fusion, for which a fast, parallelized decoder is implemented.

We provide state-of-the-art training recipes for the following speech datasets:

What's New:

April 2020: Both E2E LF-MMI (using PyChain) and Cross-Entropy training for hybrid ASR are now supported. WSJ recipes are provided here and here as examples, respectively.

March 2020: SpecAugment is supported and relevant recipes are released.

September 2019: We are working to isolate Espresso from fairseq, which will result in a standalone package that can be installed directly with pip.

Requirements and Installation

PyTorch version >= 1.4.0

Python version >= 3.6

For training new models, you'll also need an NVIDIA GPU and NCCL
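The version floors above can be checked from Python before installing. The following is a minimal sketch, not part of Espresso: the helper names `parse_version` and `check_environment` are hypothetical, and the script assumes PyTorch may or may not be installed yet, so it reports a missing `torch` rather than failing.

```python
import importlib.util
import re
import sys

# Minimum versions taken from the requirements listed above.
MIN_PYTHON = (3, 6)
MIN_TORCH = (1, 4, 0)


def parse_version(text):
    """Turn a version string such as '1.4.0+cu101' into (1, 4, 0)."""
    numbers = []
    # Drop any local build tag after '+', then read up to three numeric fields.
    for piece in text.split("+")[0].split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    # Pad so that '1.4' compares equal to '1.4.0'.
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def check_environment():
    """Return a dict of booleans describing whether each requirement is met."""
    report = {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "torch_ok": False,
        "cuda_available": False,
        "nccl_available": False,
    }
    # Only import torch if it is actually installed.
    if importlib.util.find_spec("torch") is None:
        return report

    import torch
    import torch.distributed as dist

    report["torch_ok"] = parse_version(torch.__version__) >= MIN_TORCH
    report["cuda_available"] = torch.cuda.is_available()
    # NCCL is only queried when the distributed package was built in.
    report["nccl_available"] = dist.is_available() and dist.is_nccl_available()
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {ok}")
```

A GPU and NCCL are needed only for training, so `cuda_available` or `nccl_available` being false does not by itself indicate a broken installation.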

To install Espresso from source and develop locally:

git clone

cd espresso

pip install --editable .

# on macOS:

# CFLAGS="-stdlib=libc++" pip install --editable ./

pip install kaldi_io

pip install sentencepiece

cd espresso/tools; make KALDI=

Add your Python path to the PATH variable in examples/asr_/; the current default is ~/anaconda3/bin.

kaldi_io is required for reading Kaldi scp files. sentencepiece is required for training subword models and encoding text into subword pieces. Kaldi is required for data preparation, feature extraction, scoring for some datasets (e.g., Switchboard), and decoding for all hybrid systems.

If you want to use PyChain for LF-MMI training, you also need to install PyChain (and OpenFst):

Edit the PYTHON_DIR variable in espresso/tools/Makefile (default: ~/anaconda3/bin), and then run:

cd espresso/tools; make openfst pychain

For faster training, install NVIDIA's apex library:

git clone

cd apex

pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \

--global-option="--deprecated_fused_adam" --global-option="--xentropy" \

--global-option="--fast_multihead_attn" ./
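After the installation steps above, it can help to confirm that the Python packages are importable from the interpreter you intend to use. The sketch below is illustrative only: the helper name `missing_modules` is hypothetical, and it assumes the packages expose the top-level module names `kaldi_io`, `sentencepiece`, and `apex`. It only checks that each module can be found, not that it was built with the requested extensions.

```python
import importlib.util

# Module name -> what this README says it is needed for.
EXPECTED_MODULES = {
    "kaldi_io": "reading Kaldi scp files",
    "sentencepiece": "subword training and encoding",
    "apex": "faster training (optional)",
}


def missing_modules(names):
    """Return the module names that cannot be located by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(EXPECTED_MODULES)
    if not missing:
        print("All expected modules were found.")
    for name in missing:
        print(f"Not found: {name} ({EXPECTED_MODULES[name]})")
```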


Espresso is MIT-licensed.


Please cite Espresso as:


title = {Espresso: A Fast End-to-end Neural Speech Recognition Toolkit},

author = {Yiming Wang and Tongfei Chen and Hainan Xu

and Shuoyang Ding and Hang Lv and Yiwen Shao

and Nanyun Peng and Lei Xie and Shinji Watanabe

and Sanjeev Khudanpur},

booktitle = {2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},

year = {2019},

