Open-Source Speech Recognition Frameworks
Table of Contents
- Whisper
- ASRT
- DeepSpeech
- DeepSpeech2
- ESPNET
- kaldi
- sherpa-ncnn
- Wenet
- Speechbrain
- Vosk API
- fairseq (conventional end-to-end framework)
- Eesen
- Athena
- PIKA
- SpeechLM (currently unavailable)
- Alibaba-MIT-Speech
Whisper
Features
A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. These tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing a single model to replace many stages of a traditional speech-processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.
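A minimal transcription sketch with the openai-whisper Python package (assuming it is installed via pip install -U openai-whisper and ffmpeg is available; model size and audio file name are placeholders):
```python
import whisper

# Load one of the released checkpoints (tiny/base/small/medium/large).
model = whisper.load_model("base")

# transcribe() handles audio loading, 30-second windowing, and decoding;
# language detection and task selection happen via the special-token scheme
# described above.
result = model.transcribe("audio.mp3")
print(result["text"])
```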
GitHub
https://github.com/openai/whisper
Documentation
https://openai.com/research/whisper
Paper
https://arxiv.org/abs/2212.04356
ASRT
Features
ASRT (Auto Speech Recognition Tool) is a deep-learning-based speech recognition system developed by the AI柠檬 (AI Lemon) blogger and open-sourced on GitHub under the GPL 3.0 license. Its acoustic model combines a convolutional neural network (CNN) with Connectionist Temporal Classification (CTC) and is trained on large Chinese speech datasets to transcribe audio into Chinese pinyin; a language model then converts the pinyin sequence into Chinese text. The model reaches about 80% accuracy on the test set. On top of this model, a Windows speech recognition application has been built with good practical results; it includes a Windows 10 UWP Store app and a Windows .NET desktop app, both also open-sourced on GitHub.
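Purely as an illustration of the two-stage pipeline described above (CNN+CTC acoustic model producing pinyin, then a language model producing hanzi), here is a hypothetical sketch; the object and method names are invented for clarity and are not part of the ASRT codebase:
```python
# Hypothetical two-stage decode mirroring ASRT's design; names are illustrative only.
def recognize(wav_path: str, acoustic_model, language_model) -> str:
    features = acoustic_model.extract_features(wav_path)   # e.g. a spectrogram
    pinyin_seq = acoustic_model.ctc_decode(features)        # ["ni3", "hao3", ...]
    return language_model.pinyin_to_text(pinyin_seq)        # "你好..."
```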
Environment
Hardware
- CPU: 4 cores (x86_64, amd64) or more
- RAM: 16 GB or more
- GPU: NVIDIA, 11 GB+ of graphics memory (GTX 1080 Ti or better)
- Disk: 500 GB HDD (or SSD)
Software
- Linux: Ubuntu 18.04+ / CentOS 7+, or Windows 10/11
- Python: 3.7 - 3.10 and later
- TensorFlow: 2.5 - 2.11 and later
GitHub
https://github.com/nl8590687/ASRT_SpeechRecognition
Documentation
https://wiki.ailemon.net/docs/asrt-doc/asrt-doc-1demhoid4inc6
DeepSpeech
Features
DeepSpeech is an open source Speech-To-Text engine, using a model trained by machine learning techniques based on Baidu’s Deep Speech research paper. Project DeepSpeech uses Google’s TensorFlow to make the implementation easier.
Not phoneme-based: an end-to-end deep learning speech system, with an optimized RNN training setup that uses multiple GPUs.
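A minimal inference sketch with the deepspeech Python package (assuming pip install deepspeech and the released 0.9.3 model files; file paths are placeholders):
```python
import wave
import numpy as np
from deepspeech import Model

# Load the released acoustic model and (optionally) the external scorer.
ds = Model("deepspeech-0.9.3-models.pbmm")
ds.enableExternalScorer("deepspeech-0.9.3-models.scorer")

# DeepSpeech expects 16 kHz, 16-bit mono PCM audio.
with wave.open("audio.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio))
```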
Environment
Based on TensorFlow.
GitHub
https://github.com/mozilla/DeepSpeech
Documentation
https://deepspeech.readthedocs.io/en/r0.9/?badge=latest
Paper
https://arxiv.org/abs/1412.5567
DeepSpeech2
Environment
Based on PaddlePaddle.
Via the easy-to-use, efficient, flexible, and scalable implementation, our vision is to empower both industrial application and academic research, including training, inference & testing modules, and the deployment process. More specifically, this toolkit features:
- Ease of Use: low barriers to install; CLI, Server, and Streaming Server are available to quick-start your journey (a usage sketch follows this list).
- Align to the State-of-the-Art: we provide high-speed and ultra-lightweight models, and also cutting-edge technology.
- Streaming ASR and TTS System: we provide production ready streaming asr and streaming tts system.
- Rule-based Chinese frontend: our frontend contains Text Normalization and Grapheme-to-Phoneme (G2P, including Polyphone and Tone Sandhi). Moreover, we use self-defined linguistic rules to adapt Chinese context.
- Varieties of Functions that Vitalize both Industrial and Academia:
- Implementation of critical audio tasks: this toolkit contains audio functions like Automatic Speech Recognition, Text-to-Speech Synthesis, Speaker Verification, KeyWord Spotting, Audio Classification, and Speech Translation, etc.
- Integration of mainstream models and datasets: the toolkit implements modules that participate in the whole pipeline of the speech tasks, and uses mainstream datasets like LibriSpeech, LJSpeech, AIShell, CSMSC, etc. See also model list for more details.
- Cascaded models application: as an extension of the typical traditional audio tasks, we combine the workflows of the aforementioned tasks with other fields like Natural language processing (NLP) and Computer Vision (CV).
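A short recognition sketch using PaddleSpeech's Python executor (assuming the paddlespeech pip package and its ASRExecutor interface; the wav file name is a placeholder, and a default Chinese model is downloaded on first use):
```python
from paddlespeech.cli.asr.infer import ASRExecutor

# Build the ASR executor; a pretrained Chinese model is fetched automatically.
asr = ASRExecutor()

# Transcribe a 16 kHz mono wav file.
text = asr(audio_file="zh.wav")
print(text)
```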
GitHub
https://github.com/PaddlePaddle/PaddleSpeech
Documentation
https://github.com/PaddlePaddle/PaddleSpeech#documents
Paper
https://arxiv.org/abs/2205.12007
ESPNET
CMU updates its ESPnet tutorials every year.
Features
ESPnet is an end-to-end speech processing toolkit covering end-to-end speech recognition, text-to-speech, speech translation, speech enhancement, speaker diarization, spoken language understanding, and so on. ESPnet uses PyTorch as a deep learning engine and also follows Kaldi-style data processing, feature extraction/format, and recipes to provide a complete setup for various speech processing experiments.
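A hedged inference sketch with ESPnet2 and the espnet_model_zoo downloader (assuming both pip packages are installed; the model tag is a placeholder for any ESPnet2 ASR model in the model zoo):
```python
import soundfile
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_inference import Speech2Text

# Download a pretrained ESPnet2 ASR model and build the inference wrapper.
d = ModelDownloader()
speech2text = Speech2Text(**d.download_and_unpack("<model-tag>"))

# Most pretrained models expect 16 kHz mono audio.
speech, rate = soundfile.read("audio.wav")
nbests = speech2text(speech)
text, tokens, token_ids, hyp = nbests[0]
print(text)
```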
GitHub
https://github.com/espnet/espnet
Documentation
https://espnet.github.io/espnet/
kaldi
Features
C++
Kaldi’s versus other toolkits
Kaldi is similar in aims and scope to HTK. The goal is to have modern and flexible code, written in C++, that is easy to modify and extend. Important features include:
- Code-level integration with Finite State Transducers (FSTs)
- We compile against the OpenFst toolkit (using it as a library).
- Extensive linear algebra support
- We include a matrix library that wraps standard BLAS and LAPACK routines.
- Extensible design
- As far as possible, we provide our algorithms in the most generic form possible. For instance, our decoders are templated on an object that provides a score indexed by a (frame, fst-input-symbol) tuple. This means the decoder could work from any suitable source of scores, such as a neural net.
- Open license
- The code is licensed under Apache 2.0, which is one of the least restrictive licenses available.
- Complete recipes
- Our goal is to make available complete recipes for building speech recognition systems, that work from widely available databases such as those provided by the Linguistic Data Consortium (LDC).
The goal of releasing complete recipes is an important aspect of Kaldi. Since the code is publicly available under a license that permits modifications and re-release, we would like to encourage people to release their code, along with their script directories, in a similar format to Kaldi’s own example script.
We have tried to make Kaldi’s documentation as complete as possible given time constraints, but in the short term we cannot hope to generate documentation that is as thorough as HTK’s. In particular there is a lot of introductory material in the HTKBook, explaining statistical speech recognition for the uninitiated, that will probably never appear in Kaldi’s documentation. Much of Kaldi’s documentation is written in such a way that it will only be accessible to an expert. In the future we hope to make it somewhat more accessible, bearing in mind that our intended audience is speech recognition researchers or researchers-in-training. In general, Kaldi is not a speech recognition toolkit “for dummies.” It will allow you to do many kinds of operations that don’t make sense.
The flavor of Kaldi
In this section we attempt to summarize some of the more generic qualities of the Kaldi toolkit. To some extent this describes the goals of the current developers, as much as it describes the current status of the project. It is not meant to exclude contributions from researchers whose work has a different flavor.
- We emphasize generic algorithms and universal recipes
- By “generic algorithms” we mean things like linear transforms, rather than those that are specific to speech in some way. But we don’t intend to be too dogmatic about this, if more specific algorithms are useful.
- We would like recipes that can be run on any data-set, rather than those that have to be customized.
- We prefer provably correct algorithms
- The recipes have been designed in such a way that in principle they should never fail in a catastrophic way. There has been an effort to avoid recipes and algorithms that could possibly fail, even if they don’t fail in the “normal case” (one example: FST weight-pushing, which normally helps but can crash or make things much worse in certain cases).
- Kaldi code is thoroughly tested.
- The goal is for all or nearly all the code to have corresponding test routines.
- We try to keep the simple cases simple.
- There is a danger when building a large speech toolkit that the code can become a forest of rarely used alternatives. We are trying to avoid this by structuring the toolkit in the following way. Each command-line program generally works for a limited set of cases (e.g. a decoder might just work for GMMs). Thus, when you add a new type of model, you create a new command-line decoder (that calls the same underlying templated code).
- Kaldi code is easy to understand.
- Even though the Kaldi toolkit as a whole may get very large, we aim for each individual part of it to be understandable without too much effort. We will accept some code duplication if it improves the understandability of individual pieces.
- Kaldi code is easy to reuse and refactor.
- We aim for the toolkit to be as loosely coupled as possible. In general this means that any given header should need to #include as few other header files as possible. The matrix library, in particular, only depends on code in one other subdirectory so it can be used independently of almost all the rest of Kaldi.
GitHub
https://github.com/kaldi-asr/kaldi
Documentation
http://kaldi-asr.org/doc/
https://github.com/mravanelli/pytorch-kaldi
sherpa-ncnn
Features
We support using ncnn to replace PyTorch for neural network computation. The code is put in a separate repository, sherpa-ncnn.
In the following, we describe how to build sherpa-ncnn for Linux, macOS, Windows, embedded systems, Android, and iOS.
GitHub
https://github.com/k2-fsa/sherpa-ncnn
Documentation
https://k2-fsa.github.io/sherpa/ncnn/index.html
Wenet
Features
- Production first and production ready: The core design principle, WeNet provides full stack production solutions for speech recognition.
- Accurate: WeNet achieves SOTA results on a lot of public speech datasets.
- Lightweight: WeNet is easy to install, easy to use, well designed, and well documented.
GitHub
https://github.com/wenet-e2e/wenet
Documentation
https://wenet.org.cn/wenet/index.html
Papers
- WeNet: Production Oriented Streaming and Non-streaming End-to-End Speech Recognition Toolkit, accepted by InterSpeech 2021.
- WeNet 2.0: More Productive End-to-End Speech Recognition Toolkit, accepted by InterSpeech 2022.
Speechbrain
Features & Environment
SpeechBrain provides various useful tools to speed up and facilitate research on speech and language technologies:
- Various pretrained models nicely integrated with HuggingFace in our official organization account. These models are coupled with easy-inference interfaces that facilitate their use (see the sketch after this list). To help everyone replicate our results, we also provide all the experimental results and folders (including logs, training curves, etc.) in a shared Google Drive folder.
- The Brain class, a fully-customizable tool for managing training and evaluation loops over data. The annoying details of training loops are handled for you while retaining complete flexibility to override any part of the process when needed.
- A YAML-based hyperparameter file that specifies all the hyperparameters, from individual numbers (e.g., learning rate) to complete objects (e.g., custom models). This elegant solution dramatically simplifies the training script.
- Multi-GPU training and inference with PyTorch Data-Parallel or Distributed Data-Parallel.
- Mixed-precision for faster training.
- A transparent and entirely customizable data input and output pipeline. SpeechBrain follows the PyTorch data loading style and enables users to customize the I/O pipelines (e.g., adding on-the-fly downsampling, BPE tokenization, sorting, threshold ...).
- On-the-fly dynamic batching.
- Efficient reading of large datasets from a shared Network File System (NFS) via WebDataset.
- Interface with HuggingFace for popular models such as wav2vec2 and Hubert.
- Interface with Orion for hyperparameter tuning.
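A minimal easy-inference sketch using SpeechBrain's pretrained interface (assuming pip install speechbrain; the model id is one of the ASR models published under the speechbrain organization on HuggingFace and is used here only as an example):
```python
from speechbrain.pretrained import EncoderDecoderASR

# Downloads the pretrained CRDNN+RNNLM LibriSpeech model from HuggingFace
# on first use and caches it under savedir.
asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)
print(asr_model.transcribe_file("audio.wav"))
```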
GitHub
https://github.com/speechbrain/speechbrain
Documentation
https://speechbrain.readthedocs.io/en/latest/index.html
Vosk API
Features & Environment
Vosk is a speech recognition toolkit. The best things about Vosk are:
- Supports 20+ languages - Chinese, English, Indian English, German, French, Spanish, Portuguese, Russian, Turkish, Vietnamese, Italian, Dutch, Catalan, Arabic, Greek, Farsi, Filipino, Ukrainian, Kazakh, Swedish, Japanese, Esperanto, Hindi, Czech, Polish, Uzbek, Korean
- Works offline, including on mobile devices - Raspberry Pi, Android, iOS
- Installs with a simple pip3 install vosk
- Portable per-language models are only about 50 MB each, but much larger server models are also available
- Provides a streaming API for the best user experience (unlike popular speech recognition Python packages); see the sketch after this list
- Wrappers for different programming languages are also available - Java / C# / JavaScript, etc.
- Vocabulary can be reconfigured quickly for the best accuracy
- Supports speaker identification
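A minimal offline recognition sketch with the vosk Python package (assuming pip3 install vosk, a downloaded and unpacked model directory, and a 16 kHz 16-bit mono PCM wav file; paths are placeholders):
```python
import json
import wave
from vosk import Model, KaldiRecognizer

model = Model("model")                 # path to an unpacked Vosk model directory
wf = wave.open("audio.wav", "rb")      # must be 16-bit mono PCM
rec = KaldiRecognizer(model, wf.getframerate())

# Streaming-style loop: feed audio in chunks, as you would from a microphone.
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)

print(json.loads(rec.FinalResult())["text"])
```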
GitHub
https://github.com/alphacep/vosk-api/
Documentation
https://alphacephei.com/vosk/
fairseq (conventional end-to-end framework)
Features
- multi-GPU training on one machine or across multiple machines (data and model parallel)
- fast generation on both CPU and GPU with multiple search algorithms implemented (a usage sketch follows this list):
- beam search
- Diverse Beam Search (Vijayakumar et al., 2016)
- sampling (unconstrained, top-k and top-p/nucleus)
- lexically constrained decoding (Post & Vilar, 2018)
- gradient accumulation enables training with large mini-batches even on a single GPU
- mixed precision training (trains faster with less GPU memory on NVIDIA tensor cores)
- extensible: easily register new models, criterions, tasks, optimizers and learning rate schedulers
- flexible configuration based on Hydra allowing a combination of code, command-line and file based configuration
- full parameter and optimizer state sharding
- offloading parameters to CPU
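A hedged generation sketch loading a pretrained fairseq model through torch.hub, following the translation example style from the fairseq README (assuming fairseq plus the sacremoses/fastBPE tokenizer dependencies are installed; the model here is a machine translation model, used only to illustrate the beam-search generation interface):
```python
import torch

# Download and load a pretrained fairseq model via torch.hub.
en2de = torch.hub.load(
    "pytorch/fairseq",
    "transformer.wmt19.en-de.single_model",
    tokenizer="moses",
    bpe="fastbpe",
)

# Generate with beam search (beam size 5).
print(en2de.translate("Hello world!", beam=5))
```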
Environment
- PyTorch version >= 1.10.0
- Python version >= 3.8
- For training new models, you’ll also need an NVIDIA GPU and NCCL
GitHub
https://github.com/facebookresearch/fairseq
Eesen
Features & Environment
- The WFST-based decoding approach can incorporate lexicons and language models into CTC decoding in an effective and efficient way.
- The RNN-LM decoding approach does not require a fixed lexicon.
- GPU implementation of LSTM model training and CTC learning, now also using Tensorflow.
- Multiple utterances are processed in parallel for training speed-up.
- Fully-fledged example setups to demonstrate end-to-end system building, with both phonemes and characters as labels, following Kaldi recipes and conventions.
GitHub
https://github.com/srvk/eesen
Paper
https://arxiv.org/abs/1507.08240
Athena
Features & Environment
- Hybrid Attention/CTC based end-to-end and streaming methods (ASR)
- Text-to-Speech (FastSpeech/FastSpeech2/Transformer)
- Voice activity detection (VAD)
- Key Word Spotting with end-to-end and streaming methods (KWS)
- ASR unsupervised pre-training (MPC)
- Multi-GPU training on one machine or across multiple machines with Horovod
- WFST creation and WFST-based decoding with C++
- Deployment with TensorFlow C++ (local server)
GitHub
https://github.com/athena-team/athena
PIKA
Features
PIKA uses PyTorch as the deep learning engine and Kaldi for data formatting and feature extraction.
- On-the-fly data augmentation and feature extraction loader
- TDNN-Transformer encoder, and convolution- and transformer-based decoder model structure
- RNNT training and batch decoding
- RNNT decoding with external Ngram FSTs (on-the-fly rescoring, aka, shallow fusion)
- RNNT Minimum Bayes Risk (MBR) training
- LAS forward and backward rescorer for RNNT
- Efficient BMUF (Block model update filtering) based distributed training
Environment
In general, we recommend Anaconda since it comes with most dependencies. Other major dependencies include:
- PyTorch: see https://pytorch.org/ for installation. The code and scripts should run against PyTorch 0.4.0 and above, but we recommend 1.0.0 or above for compatibility with the RNNT loss module (see below).
- Kaldi and PyKaldi: we use Kaldi (https://github.com/kaldi-asr/kaldi) and PyKaldi (a Python wrapper for Kaldi) for data processing, feature extraction, and FST manipulations. See the PyKaldi website https://github.com/pykaldi/pykaldi for installation, and make sure to build PyKaldi with ninja for efficiency. After following the PyKaldi installation process, you should have both the Kaldi and PyKaldi dependencies ready.
- RNNT loss: we adopt the PyTorch binding at https://github.com/1ytic/warp-rnnt (see the sketch below).
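A hedged sketch of computing the RNN-T loss with that binding, assuming it exposes rnnt_loss(log_probs, labels, frames_lengths, labels_lengths) over (N, T, U+1, V) log-probabilities as described in its README; the tensor shapes, device requirement, and blank index are assumptions for illustration:
```python
import torch
from warp_rnnt import rnnt_loss  # assumed interface of the 1ytic/warp-rnnt binding (CUDA-based)

device = "cuda"                  # the warp-rnnt kernels are assumed to run on GPU
N, T, U, V = 2, 50, 10, 30       # batch, frames, target length, vocab size (blank assumed at index 0)

# Joint-network outputs, log-softmaxed over the vocabulary: shape (N, T, U+1, V).
logits = torch.randn(N, T, U + 1, V, device=device, requires_grad=True)
log_probs = logits.log_softmax(dim=-1)
labels = torch.randint(1, V, (N, U), dtype=torch.int32, device=device)   # targets contain no blanks
frames_lengths = torch.full((N,), T, dtype=torch.int32, device=device)
labels_lengths = torch.full((N,), U, dtype=torch.int32, device=device)

# Per-utterance transducer losses; average and backpropagate to train the model.
costs = rnnt_loss(log_probs, labels, frames_lengths, labels_lengths)
costs.mean().backward()
```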
Others
Check requirements.txt for other dependencies.
GitHub
https://github.com/tencent-ailab/pika
Documentation
https://github.com/tencent-ailab/pika
SpeechLM (currently unavailable)
(Some data bugs were found in the pre-training experiments, which affect all results of the SpeechLM-P Base models. The relevant experiments are being re-run and the paper will be updated with the new results.)
GitHub
https://github.com/microsoft/SpeechT5/tree/main/SpeechLM
Documentation
https://github.com/microsoft/SpeechT5/tree/main/SpeechLM
Paper
https://arxiv.org/abs/2209.15329
Alibaba-MIT-Speech
Features
This is a PATCH file with the DFSMN-related code and example scripts for the LibriSpeech task.
GitHub
https://github.com/alibaba/Alibaba-MIT-Speech