
import librosa
import torch
import IPython.display as display
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import warnings
# !pip install speechbrain

audio_file = "B31_385.wav"
# load the audio file, resampled to the 16 kHz rate the model expects
audio, sampling_rate = librosa.load(audio_file, sr=16_000)

# # audio
# display.Audio(audio_file, autoplay=True)

# load the pre-trained model and its processor
# (Wav2Vec2Processor bundles the feature extractor with the checkpoint's
# Wav2Vec2CTCTokenizer; the deprecated Wav2Vec2Tokenizer is a different class)
processor = Wav2Vec2Processor.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn")
model = Wav2Vec2ForCTC.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn")

# pass the sampling rate explicitly so a mismatch raises an error instead of failing silently
input_values = processor(audio, sampling_rate=sampling_rate, return_tensors='pt').input_values

# store logits (non-normalized predictions); no gradients are needed for inference
with torch.no_grad():
    logits = model(input_values).logits

# store predicted ids
# greedy decoding: take the highest-scoring token id at every frame (no softmax needed for argmax)
predicted_ids = torch.argmax(logits, dim=-1)

# pass the predicted ids to the processor's decode to get the transcription
transcriptions = processor.decode(predicted_ids[0])

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.
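
The argmax-then-decode step in the cell above is greedy CTC decoding. As a rough illustration of what the decode call does with the per-frame ids, here is a minimal pure-Python sketch. The function name `ctc_greedy_collapse` and the choice of `0` as the blank id are assumptions made for this example only; in a real checkpoint the blank is the tokenizer's pad token, whose id depends on the vocabulary.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame argmax ids into a label sequence (greedy CTC).

    blank_id=0 is an assumption for this sketch, not the id used by any
    particular checkpoint.
    """
    # Step 1: merge runs of identical consecutive ids into a single id.
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: drop the blank id, which only separates repeated labels.
    return [token_id for token_id in merged if token_id != blank_id]


# Toy per-frame ids: the blank between the two runs of 5 keeps them distinct.
frame_ids = [0, 5, 5, 0, 5, 7, 7, 0, 0]
print(ctc_greedy_collapse(frame_ids))  # [5, 5, 7]
```

The real decode call additionally maps the remaining ids back to characters through the vocabulary, which this sketch leaves out.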


from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-transformer-aishell")
# transcribe the same audio file loaded earlier
asr_model.transcribe_file(audio_file)

The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.
'地价 是 内部 圈层 的 最 外 层 由 分化 的 吐槽 和 签应 的 延迟 组成 所以 地区 而 也 可 称 为 严实 圈'
from speechbrain.pretrained.interfaces import foreign_class

asr_model = foreign_class(source="speechbrain/asr-wav2vec2-ctc-aishell", pymodule_file="",
                          classname="CustomEncoderDecoderASR", run_opts={"device": "cuda"})
Some weights of the model checkpoint at TencentGameMate/chinese-wav2vec2-large were not used when initializing Wav2Vec2Model: ['project_q.bias', 'project_hid.bias', 'quantizer.codevectors', 'quantizer.weight_proj.weight', 'project_q.weight', 'project_hid.weight', 'quantizer.weight_proj.bias']
- This IS expected if you are initializing Wav2Vec2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# test_dataset = load_dataset("common_voice", "zh-CN", split="test")

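# load the pre-trained model and its processor (feature extractor + CTC tokenizer)
processor = Wav2Vec2Processor.from_pretrained("ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt")
model = Wav2Vec2ForCTC.from_pretrained("ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt")

# reuse the 16 kHz audio loaded earlier; pass the sampling rate explicitly
# so a mismatch raises an error instead of failing silently
input_values = processor(audio, sampling_rate=sampling_rate, return_tensors='pt').input_values

# store logits (non-normalized predictions); no gradients are needed for inference
with torch.no_grad():
    logits = model(input_values).logits

# store predicted ids
# greedy decoding: take the highest-scoring token id at every frame (no softmax needed for argmax)
predicted_ids = torch.argmax(logits, dim=-1)

# pass the predicted ids to the processor's decode to get the transcription
transcriptions = processor.decode(predicted_ids[0])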


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
It is strongly recommended to pass the ``sampling_rate`` argument to this function. Failing to do so can result in silent errors that might be hard to debug.

