A Simple Hello World for Speech-to-Text

Environment Setup

Python 3.7 plus the packages below:

pip install torch torchaudio omegaconf

Code Practice 1: Torch Hub

1. Download and load a pretrained speech-to-text model.

2. Download an audio clip to test it.

import torch
from glob import glob

device = torch.device('cpu')  # gpu also works, but our models are fast enough for CPU
model, decoder, utils = torch.hub.load(repo_or_dir='snakers4/silero-models',
                                       model='silero_stt',
                                       language='en', # also available 'de', 'es'
                                       device=device, )

(read_batch, split_into_batches, read_audio, prepare_model_input) = utils  # see function signature for details

# download a single file in any format compatible with TorchAudio
torch.hub.download_url_to_file('https://opus-codec.org/static/examples/samples/speech_orig.wav', dst ='speech_orig.wav', progress=True)
test_files = glob('speech_orig.wav')

batches = split_into_batches(test_files, batch_size=10)
model_input = prepare_model_input(read_batch(batches[0]),
                                  device=device)

output = model(model_input)
for example in output:
    print(decoder(example.cpu()))

Output

the boch canoe slit on the smooth planks blew the sheet to the dark blue background it's easy to tell a depth of a well four hours of steady work faced us

Initial Test Summary

How fast the English is spoken noticeably affects the transcription. You can also record an English clip yourself and try it, as sketched below.
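For example, with the model, decoder, and utils already loaded as above, a recording of your own can be transcribed as follows. This is only a sketch: my_recording.wav is a placeholder name for whatever 16 kHz mono WAV you recorded.

# Transcribe your own recording with the model loaded above.
# 'my_recording.wav' is a hypothetical file name -- replace it with your clip.
my_files = glob('my_recording.wav')
my_batches = split_into_batches(my_files, batch_size=10)
my_input = prepare_model_input(read_batch(my_batches[0]), device=device)

for example in model(my_input):
    print(decoder(example.cpu()))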

Code Practice 2: DeepSpeech

Environment

Windows, Python 3.7

1. Download the model files and the test audio, and put them into a model folder and an audio folder respectively (a Python sketch for scripting the download follows the links):

https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/audio-0.9.3.tar.gz
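If you would rather script the download than fetch the files by hand, a minimal Python sketch like the one below works; it assumes the audio archive unpacks into an audio/ folder, which matches the paths used in the later commands.

import tarfile
import urllib.request
from pathlib import Path

Path('model').mkdir(exist_ok=True)

base = 'https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/'
for name in ('deepspeech-0.9.3-models.pbmm', 'deepspeech-0.9.3-models.scorer'):
    urllib.request.urlretrieve(base + name, f'model/{name}')

urllib.request.urlretrieve(base + 'audio-0.9.3.tar.gz', 'audio-0.9.3.tar.gz')
with tarfile.open('audio-0.9.3.tar.gz') as tar:
    tar.extractall('.')  # the archive contains an audio/ folder with the test WAVs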

2. Create a new virtual environment (either Conda or virtualenv works).

3. Install deepspeech:

pip install deepspeech

4. Run the following command:

deepspeech --model model/deepspeech-0.9.3-models.pbmm --scorer model/deepspeech-0.9.3-models.scorer --audio audio/2830-3980-0043.wav

Output

(speed2text_tf) PS ...speech2text_tf> deepspeech --model model/deepspeech-0.9.3-models.pbmm --scorer model/deepspeech-0.9.3-models.scorer --audio audio/2830-3980-0043.wav
Loading model from file model/deepspeech-0.9.3-models.pbmm
TensorFlow: v2.3.0-6-g23ad988fcd
DeepSpeech: v0.9.3-0-gf2e9c858
2022-12-24 13:24:09.108387: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Loaded model in 0.0119s.
Loading scorer from files model/deepspeech-0.9.3-models.scorer
Loaded scorer in 0.0107s.
Running inference.
experience proves this
Inference took 0.670s for 1.975s audio file.
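If you prefer calling DeepSpeech from Python rather than the CLI, the same transcription can be reproduced with a short script. This is a minimal sketch using the deepspeech package's Model API; it assumes the test WAV is 16 kHz, 16-bit, mono PCM, which the bundled samples are.

import wave

import numpy as np
from deepspeech import Model

# Load the acoustic model and attach the external scorer (language model).
ds = Model('model/deepspeech-0.9.3-models.pbmm')
ds.enableExternalScorer('model/deepspeech-0.9.3-models.scorer')

with wave.open('audio/2830-3980-0043.wav', 'rb') as wav:
    assert wav.getframerate() == ds.sampleRate()  # the model expects 16 kHz audio
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(ds.stt(audio))  # -> "experience proves this"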

5. Real-time speech-to-text (from microphone to text)

The mic_vad_streaming example in the mozilla/DeepSpeech-examples repository on GitHub (branch r0.9)

Download mic_vad_streaming.py and requirements.txt from that repository.

Install the required packages with:

pip install -r requirements.txt
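(Optional) If your machine has more than one microphone, it helps to know which input device index to pass to the script (see its --help output for the device option). A quick way to list input devices with pyaudio, one of the packages the example relies on, is shown below.

# List audio input devices so you can pick the right microphone index.
import pyaudio

pa = pyaudio.PyAudio()
for i in range(pa.get_device_count()):
    info = pa.get_device_info_by_index(i)
    if info.get('maxInputChannels', 0) > 0:
        print(i, info['name'])
pa.terminate()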

Then run:

python mic_vad_streaming/mic_vad_streaming.py -m model/deepspeech-0.9.3-models.pbmm -s model/deepspeech-0.9.3-models.scorer

The output looks like this:

(speed2text_tf) PS ...speech2text_tf> python mic_vad_streaming/mic_vad_streaming.py -m model/deepspeech-0.9.3-models.pbmm -s model/deepspeech-0.9.3-models.scorer
Initializing model...
INFO:root:ARGS.model: model/deepspeech-0.9.3-models.pbmm
TensorFlow: v2.3.0-6-g23ad988fcd
DeepSpeech: v0.9.3-0-gf2e9c858
2022-12-24 13:51:58.003296: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:root:ARGS.scorer: model/deepspeech-0.9.3-models.scorer
Listening (ctrl-C to exit)...
Recognized: no
Recognized: he
Recognized: hear me
Recognized: hear me
Recognized: to
Recognized: for i think seven

To be honest, though, the results are fairly mediocre. Perhaps it has to do with my accent; that is the only explanation I can offer.

References

Welcome to DeepSpeech’s documentation! — Mozilla DeepSpeech 0.9.3 documentation

GitHub - mozilla/DeepSpeech: DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
