I recently read the paper on Kaldi, and this post walks through it.
Abstract:
We describe the design of Kaldi, a free, open-source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state transducers (using the freely available OpenFst), together with detailed documentation and scripts for building complete recognition systems. Kaldi is written in C++, and the core library supports modeling of arbitrary phonetic-context sizes, acoustic modeling with subspace Gaussian mixture models (SGMM) as well as standard Gaussian mixture models, together with all commonly used linear and affine transforms. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users.
Notes:
1. Kaldi is a free, open-source toolkit for speech recognition research.
2. A finite-state transducer (FST) is a finite-state automaton with two tapes: an input tape and an output tape.
3. OpenFst is a library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs).
4. In a Subspace Gaussian Mixture Model (SGMM), the means and mixture weights vary in a subspace of the total parameter space.
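To make note 2 concrete, here is a minimal sketch of a weighted transducer in Python. The class `TinyFst` and its methods are hypothetical and are not the OpenFst API. The sketch assumes a deterministic machine and treats weights as costs that add along a path (as in the tropical semiring).

```python
# Hypothetical toy transducer for illustration; this is not the OpenFst API.
from typing import Dict, List, Optional, Tuple

# (state, input symbol) -> (next state, output symbol, arc cost)
Arcs = Dict[Tuple[int, str], Tuple[int, str, float]]


class TinyFst:
    """A deterministic weighted transducer with a single start state."""

    def __init__(self, start: int, finals: Dict[int, float], arcs: Arcs) -> None:
        self.start = start
        self.finals = finals  # final state -> final cost
        self.arcs = arcs

    def transduce(self, inputs: List[str]) -> Optional[Tuple[List[str], float]]:
        """Return (output symbols, total cost), or None if the input is rejected."""
        state, cost, outputs = self.start, 0.0, []
        for symbol in inputs:
            arc = self.arcs.get((state, symbol))
            if arc is None:
                return None  # no matching arc: the input tape is not accepted
            state, out, arc_cost = arc
            outputs.append(out)
            cost += arc_cost  # costs add along a path
        if state not in self.finals:
            return None
        return outputs, cost + self.finals[state]


# Reads the phone sequence "k ae t" on the input tape and writes the word
# "cat" on the output tape ("<eps>" stands for an empty output).
fst = TinyFst(
    start=0,
    finals={3: 0.0},
    arcs={
        (0, "k"): (1, "cat", 0.5),
        (1, "ae"): (2, "<eps>", 0.25),
        (2, "t"): (3, "<eps>", 0.25),
    },
)
result = fst.transduce(["k", "ae", "t"])
```

Real toolkits add operations such as composition and determinization on top of this basic picture; the sketch only shows the two-tape idea.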
I INTRODUCTION
Kaldi is an open-source toolkit for speech recognition written in C++ and licensed under the Apache License v2.0.
The goal of Kaldi is to have modern and flexible code that is easy to understand, modify and extend. Kaldi is available on SourceForge (see http://kaldi.sf.net/). The tools compile on the commonly used Unix-like systems and on Microsoft Windows.
Notes:
1. Kaldi is available on SourceForge (see http://kaldi.sf.net/).
2. The tools compile on Unix-like systems and on Microsoft Windows.
Researchers on automatic speech recognition (ASR) have several potential choices of open-source toolkits for building a recognition system. Notable among these are: HTK [1], Julius [2] (both written in C), Sphinx-4 [3] (written in Java), and the RWTH ASR toolkit [4] (written in C++). Yet our specific requirements, namely a finite-state transducer (FST) based framework, extensive linear algebra support, and a non-restrictive license, led to the development of Kaldi. Important features of Kaldi include:
Integration with Finite State Transducers: We compile against the OpenFst toolkit [5] (using it as a library).
Extensive linear algebra support: We include a matrix library that wraps standard BLAS and LAPACK routines.
Extensible design: We attempt to provide our algorithms in the most generic form possible. For instance, our decoders work with an interface that provides a score for a particular frame and FST input symbol. Thus the decoder could work from any suitable source of scores.
Open license: The code is licensed under Apache v2.0, which is one of the least restrictive licenses available.
Complete recipes: We make available complete recipes for building speech recognition systems that work from widely available databases such as those provided by the Linguistic Data Consortium (LDC).
Thorough testing: The goal is for all or nearly all the code to have corresponding test routines.
The main intended use for Kaldi is acoustic modeling research; thus, we view the closest competitors as being HTK and the RWTH ASR toolkit (RASR). The chief advantage versus HTK is modern, flexible, cleanly structured code and better WFST and math support; also, our license terms are more open than either HTK or RASR.
Notes:
1. The main intended use of Kaldi is acoustic modeling research.
2. Advantages: modern, flexible, cleanly structured code, better WFST and math support, and license terms that are more open than those of HTK or RASR.
The paper is organized as follows: we start by describing the structure of the code and design choices (section II). This is followed by describing the individual components of a speech recognition system that the toolkit supports: feature extraction (section III), acoustic modeling (section IV), phonetic decision trees (section V), language modeling (section VI), and decoders (section VIII). Finally, we provide some benchmarking results in section IX.
II OVERVIEW OF THE TOOLKIT
We give a schematic overview of the Kaldi toolkit in figure 1. The toolkit depends on two external libraries that are also freely available: one is OpenFst [5] for the finite-state framework, and the other is a set of numerical algebra libraries. We use the standard “Basic Linear Algebra Subprograms” (BLAS) and “Linear Algebra PACKage” (LAPACK) routines for the latter.
Notes:
1. External libraries: OpenFst and the numerical algebra libraries (BLAS and LAPACK).
The library modules can be grouped into two distinct halves, each depending on only one of the external libraries (cf. Figure 1). A single module, the DecodableInterface (section VIII), bridges these two halves.
Notes:
1. The DecodableInterface bridges these two halves.
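The bridging role can be pictured as follows: the decoding side asks only for a score given a frame and an index, and never sees how that score was produced. The sketch below is a hypothetical Python analogue written for this post. The class and method names are assumptions and are not Kaldi's C++ declarations, and `best_index_per_frame` is only a stand-in for a real decoder.

```python
from abc import ABC, abstractmethod
from typing import List


class Decodable(ABC):
    """Hypothetical analogue of a decodable interface (names are assumptions)."""

    @abstractmethod
    def log_likelihood(self, frame: int, index: int) -> float:
        """Score of the given index (e.g. an FST input symbol) at the given frame."""

    @abstractmethod
    def num_frames(self) -> int:
        """Number of frames available for scoring."""

    @abstractmethod
    def num_indices(self) -> int:
        """Number of distinct indices that can be scored."""


class TableDecodable(Decodable):
    """Scores read from a precomputed table; a GMM or any other model could sit here."""

    def __init__(self, scores: List[List[float]]) -> None:
        self._scores = scores

    def log_likelihood(self, frame: int, index: int) -> float:
        return self._scores[frame][index]

    def num_frames(self) -> int:
        return len(self._scores)

    def num_indices(self) -> int:
        return len(self._scores[0]) if self._scores else 0


def best_index_per_frame(decodable: Decodable) -> List[int]:
    """Stand-in for a decoder: it touches the model only through the interface."""
    best = []
    for t in range(decodable.num_frames()):
        scores = [decodable.log_likelihood(t, i) for i in range(decodable.num_indices())]
        best.append(scores.index(max(scores)))
    return best


table = TableDecodable([[-1.0, -3.0], [-4.0, -0.5], [-2.0, -2.5]])
best = best_index_per_frame(table)
```

Swapping `TableDecodable` for a class backed by a different model would leave `best_index_per_frame` unchanged, which is the point of the design.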
Access to the library functionalities is provided through command-line tools written in C++, which are then called from a scripting language for building and running a speech recognizer. Each tool has very specific functionality with a small set of command line arguments: for example, there are separate executables for accumulating statistics, summing accumulators, and updating a GMM-based acoustic model using maximum likelihood estimation. Moreover, all the tools can read from and write to pipes, which makes it easy to chain together different tools.
To avoid “code rot”, we have tried to structure the toolkit in such a way that implementing a new feature will generally involve adding new code and command-line tools rather than modifying existing ones.
Notes:
1. The library functionalities are accessed through command-line tools written in C++.
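The split into "accumulate statistics", "sum accumulators", and "update" works because sufficient statistics are additive: partial results from separate jobs can be merged before a single update. A minimal sketch for a one-dimensional Gaussian follows. The function names are hypothetical and do not correspond to any Kaldi executable.

```python
from typing import Iterable, List, Tuple

# Sufficient statistics for a 1-D Gaussian: (count, sum of x, sum of x squared)
Accumulator = Tuple[float, float, float]


def accumulate(data: Iterable[float]) -> Accumulator:
    """Collect statistics over one chunk of data."""
    count = sum_x = sum_xx = 0.0
    for x in data:
        count += 1.0
        sum_x += x
        sum_xx += x * x
    return count, sum_x, sum_xx


def sum_accumulators(accs: List[Accumulator]) -> Accumulator:
    """Merge the statistics produced by several independent jobs."""
    return (
        sum(a[0] for a in accs),
        sum(a[1] for a in accs),
        sum(a[2] for a in accs),
    )


def update(acc: Accumulator) -> Tuple[float, float]:
    """Maximum likelihood estimate from the statistics: returns (mean, variance)."""
    count, sum_x, sum_xx = acc
    mean = sum_x / count
    variance = sum_xx / count - mean * mean
    return mean, variance


# Two "jobs" each process part of the data; the results are merged afterwards.
job1 = accumulate([1.0, 2.0])
job2 = accumulate([3.0, 6.0])
mean, variance = update(sum_accumulators([job1, job2]))
```

The merged estimate equals what a single pass over all four values would give, so the accumulation step can be distributed freely.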
III FEATURE EXTRACTION
Our feature extraction and waveform-reading code aims to create standard MFCC and PLP features, setting reasonable defaults but leaving available the options that people are most likely to want to tweak (for example, the number of mel bins, minimum and maximum frequency cutoffs, etc.). We support most commonly used feature extraction approaches: e.g. VTLN, cepstral mean and variance normalization, LDA, STC/MLLT, HLDA, and so on.
Notes:
1. Features: MFCC and PLP.
2. Feature extraction approaches: VTLN, cepstral mean and variance normalization, LDA, STC/MLLT, HLDA, and so on.
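Of the approaches listed, cepstral mean and variance normalization is the simplest to show. The sketch below normalizes each feature dimension over one utterance to zero mean and unit variance. It is an illustration in plain Python, not Kaldi's implementation; the function name `cmvn` and the variance floor are assumptions.

```python
import math
from typing import List


def cmvn(frames: List[List[float]], floor: float = 1e-10) -> List[List[float]]:
    """Normalize each feature dimension to zero mean and unit variance."""
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    variances = [
        sum((f[d] - means[d]) ** 2 for f in frames) / num_frames for d in range(dim)
    ]
    # Floor the variance so that a constant dimension does not divide by zero.
    stddevs = [math.sqrt(max(v, floor)) for v in variances]
    return [[(f[d] - means[d]) / stddevs[d] for d in range(dim)] for f in frames]


# Three frames of two-dimensional features; the second dimension is constant.
features = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
normalized = cmvn(features)
```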
IV ACOUSTIC MODELING
Our aim is for Kaldi to support conventional models (i.e. diagonal GMMs) and Subspace Gaussian Mixture Models (SGMMs), but to also be easily extensible to new kinds of model.
Notes:
1. Diagonal GMMs
2. Subspace Gaussian Mixture Models (SGMMs)
A. Gaussian mixture models
We support GMMs with diagonal and full covariance structures. Rather than representing individual Gaussian densities separately, we directly implement a GMM class that is parametrized by the natural parameters, i.e. means times inverse covariances and inverse covariances. The GMM classes also store the constant term in likelihood computation, which consists of all the terms that do not depend on the data vector. Such an implementation is suitable for efficient log-likelihood computation with simple dot-products.
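The natural-parameter idea can be sketched for the diagonal case. Expanding the Gaussian log-density gives a constant plus a dot-product of (mean times inverse variance) with x, minus half a dot-product of the inverse variance with the squared x. The class below is a hypothetical illustration and not Kaldi's DiagGmm; the attribute names are assumptions.

```python
import math
from typing import List, Sequence


class DiagGmmSketch:
    """Diagonal-covariance GMM stored by natural parameters (hypothetical sketch)."""

    def __init__(
        self,
        weights: Sequence[float],
        means: Sequence[Sequence[float]],
        variances: Sequence[Sequence[float]],
    ) -> None:
        self.inv_vars = [[1.0 / v for v in var] for var in variances]
        self.means_invvars = [
            [m * iv for m, iv in zip(mean, inv)]
            for mean, inv in zip(means, self.inv_vars)
        ]
        # Per-component constant: every term that does not depend on x,
        # including the log mixture weight.
        self.gconsts = [
            math.log(w)
            - 0.5 * sum(math.log(2.0 * math.pi * v) + m * m / v for m, v in zip(mean, var))
            for w, mean, var in zip(weights, means, variances)
        ]

    def log_likelihood(self, x: Sequence[float]) -> float:
        x_sq = [xi * xi for xi in x]
        logs: List[float] = []
        for g, mi, iv in zip(self.gconsts, self.means_invvars, self.inv_vars):
            linear = sum(a * xi for a, xi in zip(mi, x))         # dot-product with x
            quadratic = sum(b * s for b, s in zip(iv, x_sq))     # dot-product with x^2
            logs.append(g + linear - 0.5 * quadratic)
        # Log-sum-exp over components, shifted by the maximum for stability.
        top = max(logs)
        return top + math.log(sum(math.exp(v - top) for v in logs))


gmm = DiagGmmSketch(
    weights=[0.4, 0.6],
    means=[[0.0, 1.0], [2.0, -1.0]],
    variances=[[1.0, 2.0], [0.5, 1.5]],
)
loglike = gmm.log_likelihood([0.5, 0.0])
```

All per-frame work reduces to two dot-products per component, since the constants and parameter products are computed once at construction.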
B. GMM-based acoustic model
The “acoustic model” class AmDiagGmm represents a collection of DiagGmm objects, indexed by “pdf-ids” that correspond to context-dependent HMM states. This class does not represent any HMM structure, but just a collection of densities (i.e. GMMs). There are separate classes that represent the HMM structure, principally the topology and transition-modeling code and the code responsible for compiling decoding graphs, which provide a mapping between the HMM states and the pdf index of the acoustic model class. Speaker adaptation and other linear transforms like maximum likelihood linear transform (MLLT) [6] or semi-tied covariance (STC) [7] are implemented by separate classes.
C. HMM Topology
It is possible in Kaldi to separately specify the HMM topology for each context-independent phone. The topology format allows nonemitting states, and allows the user to pre-specify tying of the p.d.f.’s in different HMM states.
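As an illustration of what such a per-phone specification might hold, here is a sketch of a topology data structure with a nonemitting final state and tied p.d.f.s. The layout is hypothetical and is not Kaldi's topology file format; `HmmState`, `left_to_right`, and the phone names are assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple


class HmmState:
    """One state of a phone HMM (hypothetical layout, not Kaldi's format)."""

    def __init__(self, pdf_class: Optional[int]) -> None:
        self.pdf_class = pdf_class  # None marks a nonemitting state
        self.transitions: List[Tuple[int, float]] = []  # (next state, probability)


def left_to_right(num_emitting: int, self_loop: float = 0.5) -> List[HmmState]:
    """Build a left-to-right topology that ends in a nonemitting final state."""
    states = []
    for i in range(num_emitting):
        state = HmmState(pdf_class=i)
        state.transitions = [(i, self_loop), (i + 1, 1.0 - self_loop)]
        states.append(state)
    states.append(HmmState(pdf_class=None))
    return states


# A separate topology for each context-independent phone.
topologies: Dict[str, List[HmmState]] = {
    "a": left_to_right(3),
    "sil": left_to_right(5),
}
# Tie the p.d.f.s of the second and third silence states by sharing one class.
topologies["sil"][2].pdf_class = topologies["sil"][1].pdf_class
```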
D. Speaker adaptation
We support both model-space adaptation using maximum likelihood linear regression (MLLR) [8] and feature-space adaptation using feature-space MLLR (fMLLR), also known as constrained MLLR [9]. For both MLLR and fMLLR, multiple transforms can be estimated using a regression tree [10]. When a single fMLLR transform is needed, it can be used as an additional processing step in the feature pipeline. The toolkit also supports speaker normalization using a linear approximation to VTLN, similar to [11], or conventional feature-level VTLN, or a more generic approach for gender normalization which we call the “exponential transform” [12]. Both fMLLR and VTLN can be used for speaker adaptive training (SAT) of the acoustic models.
Notes:
1. Model-space adaptation uses maximum likelihood linear regression (MLLR); feature-space adaptation uses feature-space MLLR (fMLLR).
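A single fMLLR transform acts on the features as an affine map, x → A x + b, which is why it can be one more step in the feature pipeline. The sketch below only applies a given transform; estimating it is not shown. The function name `apply_affine` and the d × (d+1) layout [A | b] are assumptions for illustration.

```python
from typing import List, Sequence


def apply_affine(
    frames: Sequence[Sequence[float]], transform: Sequence[Sequence[float]]
) -> List[List[float]]:
    """Apply x -> A x + b to every frame, with transform rows laid out as [A | b]."""
    adapted = []
    for x in frames:
        extended = list(x) + [1.0]  # append 1 so the last column acts as the offset b
        adapted.append([sum(w * v for w, v in zip(row, extended)) for row in transform])
    return adapted


# A 2 x 3 transform: A = [[2, 0], [0, 0.5]], b = [1, -1] (made-up values).
transform = [[2.0, 0.0, 1.0], [0.0, 0.5, -1.0]]
adapted = apply_affine([[1.0, 4.0], [0.0, 2.0]], transform)
```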
E. Subspace Gaussian Mixture Models
For subspace Gaussian mixture models (SGMMs), the toolkit provides an implementation of the approach described in [13]. There is a single class AmSgmm that represents a whole collection of pdf’s; unlike the GMM case there is no class that represents a single pdf of the SGMM. Similar to the GMM case, however, separate classes handle model estimation and speaker adaptation using fMLLR.