Inference examples for Chinese speech-to-text in PyTorch using pretrained SpeechBrain and Hugging Face models

import librosa
import torch
import IPython.display as display
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
import warnings
warnings.filterwarnings("ignore")
# !pip install speechbrain

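audio_file = "B31_385.wav"
# Load the audio file and resample it to 16 kHz, the sampling rate the wav2vec2 checkpoints used here expect
audio, sampling_rate = librosa.load(audio_file, sr=16_000)

# Optional: play the audio inside the notebook
# display.Audio(audio_file, autoplay=True)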

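# First model: Hugging Face wav2vec2 XLSR checkpoint fine-tuned for Chinese (zh-CN)
# Load the pretrained model and tokenizer.
# Note: Wav2Vec2Tokenizer is deprecated in transformers, and this checkpoint was saved with
# Wav2Vec2CTCTokenizer, which is why the class-mismatch warning appears in the output below.
tokenizer = Wav2Vec2Tokenizer.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn")
model = Wav2Vec2ForCTC.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn")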

input_values = tokenizer(audio, return_tensors='pt').input_values
input_values

# Compute the logits (unnormalized per-frame scores over the vocabulary);
# no gradients are needed for inference
with torch.no_grad():
    logits = model(input_values).logits
logits

# Greedy decoding: take the highest-scoring token id for each frame
# (argmax over the logits gives the same ids as argmax over their softmax)
predicted_ids = torch.argmax(logits, dim=-1)

# Pass the predicted ids to the tokenizer's decode method to get the transcription
transcriptions = tokenizer.decode(predicted_ids[0])

transcriptions
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.

'地是内部圈层的最外层由封化的土层和坚映的岩石组成所以地也可称为岩石圈'
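
The argmax-then-decode step above is greedy CTC decoding. As a rough illustration of what the decode call does with the per-frame ids, here is a minimal pure-Python sketch. The function name `ctc_greedy_decode`, the toy vocabulary, and `blank_id=0` are assumptions made for this example; the real tokenizer uses its own vocabulary and treats its padding token as the CTC blank.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse a per-frame id sequence into text using the CTC rule."""
    # Step 1: merge each run of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: drop the blank id, which marks "no output" and separates true repeats.
    return "".join(id_to_token[i] for i in collapsed if i != blank_id)


# Hypothetical toy vocabulary and frame-level argmax result (not taken from the model).
vocab = {0: "<pad>", 1: "地", 2: "壳", 3: "层"}
frames = [0, 1, 1, 0, 0, 2, 2, 2, 0, 3, 0, 3]
print(ctc_greedy_decode(frames, vocab))  # prints 地壳层层
```

The blank between the last two `3` ids is what lets the same character appear twice in a row; without it the two frames would be merged into one.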

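# Second model: SpeechBrain Transformer ASR model for AISHELL
# (in SpeechBrain 1.0 and later the same class is exposed as speechbrain.inference.ASR.EncoderDecoderASR)
from speechbrain.pretrained import EncoderDecoderASR

# Download the model into savedir on first use, then transcribe the same audio file
asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-transformer-aishell",
                                           savedir="pretrained_models/asr-transformer-aishell")
asr_model.transcribe_file(audio_file)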

The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.

'地价 是 内部 圈层 的 最 外 层 由 分化 的 吐槽 和 签应 的 延迟 组成 所以 地区 而 也 可 称 为 严实 圈'
# Third model: SpeechBrain wav2vec2 + CTC model for AISHELL, loaded through its custom interface
from speechbrain.pretrained.interfaces import foreign_class

# Run inference on the GPU
asr_model = foreign_class(source="speechbrain/asr-wav2vec2-ctc-aishell", pymodule_file="custom_interface.py",
                          classname="CustomEncoderDecoderASR", run_opts={"device": "cuda"})
asr_model.transcribe_file(audio_file)
Some weights of the model checkpoint at TencentGameMate/chinese-wav2vec2-large were not used when initializing Wav2Vec2Model: ['project_q.bias', 'project_hid.bias', 'quantizer.codevectors', 'quantizer.weight_proj.weight', 'project_q.weight', 'project_hid.weight', 'quantizer.weight_proj.bias']
- This IS expected if you are initializing Wav2Vec2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

['地', '俏', '是', '内', '部', '圈', '层', '的', '最', '外', '层', '由',
 '封', '化', '的', '吐', '层', '和', '接', '应', '的', '沿', '石', '组',
 '成', '所', '以', '地', '俏', '也', '可', '称', '为', '颜', '石', '圈']
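
The three models so far return results in three different shapes: a plain string, a space-separated string, and a list of single characters. To compare them side by side it helps to bring them into one form. The helper below is a small sketch for that; the name `to_plain_text` is made up for this example, and the inputs are short excerpts of the outputs recorded above.

```python
def to_plain_text(result):
    """Return a transcript as a single string with all whitespace removed."""
    # A list or tuple of tokens is first joined into one string.
    if isinstance(result, (list, tuple)):
        result = "".join(result)
    # str.split() with no arguments splits on any whitespace run.
    return "".join(result.split())


spaced = "地价 是 内部 圈层"                          # excerpt of the second model's output
chars = ['地', '俏', '是', '内', '部', '圈', '层']    # excerpt of the third model's output
print(to_plain_text(spaced))  # prints 地价是内部圈层
print(to_plain_text(chars))   # prints 地俏是内部圈层
```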
# Fourth model: another Hugging Face wav2vec2 XLSR checkpoint, loaded through Wav2Vec2Processor
import torch
from datasets import load_dataset  # only needed for the commented-out Common Voice split below
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# test_dataset = load_dataset("common_voice", "zh-CN", split="test")

processor = Wav2Vec2Processor.from_pretrained("ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt")
model = Wav2Vec2ForCTC.from_pretrained("ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt")

# Passing sampling_rate=sampling_rate to the processor would address the
# sampling-rate warning shown in the output below
input_values = processor(audio, return_tensors='pt').input_values
input_values

# Compute the logits (unnormalized per-frame scores over the vocabulary);
# no gradients are needed for inference
with torch.no_grad():
    logits = model(input_values).logits
logits

# Greedy decoding: take the highest-scoring token id for each frame
predicted_ids = torch.argmax(logits, dim=-1)

# Pass the predicted ids to the processor's decode method to get the transcription
transcriptions = processor.decode(predicted_ids[0])

transcriptions

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
It is strongly recommended to pass the ``sampling_rate`` argument to this function. Failing to do so can result in silent errors that might be hard to debug.

'地壳是内部圈层的最外层由丰化的土层和坚硬的岩始组成所以地壳也可称为岩石圈'
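
The transcripts differ by only a few characters, so a character-level edit distance gives a quick way to quantify how much two models disagree. The sketch below uses a plain Levenshtein implementation written for this example (`edit_distance` is not a library function) and compares the first and last transcripts recorded above. No reference transcript is given for this audio file, so the result measures disagreement between two hypotheses, not an error rate.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


# Transcripts copied from the first and fourth model outputs above.
first = '地是内部圈层的最外层由封化的土层和坚映的岩石组成所以地也可称为岩石圈'
last = '地壳是内部圈层的最外层由丰化的土层和坚硬的岩始组成所以地壳也可称为岩石圈'

distance = edit_distance(first, last)
# Normalizing by the length of one transcript gives a rough disagreement rate.
print(distance, distance / len(last))
```

With a trusted reference transcript in place of `last`, the same ratio would be the usual character error rate.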
