import soundfile
audio, audio_sample_rate = soundfile.read(r"C:\Users\air\Desktop\asr16.wav", dtype="int16", always_2d=True)
import numpy as np
# Downmix to mono; average in floating point so that summing channels cannot overflow int16.
audio = audio.mean(axis=1).astype(np.int16)
def pcm16to32(audio):
    # Convert 16-bit PCM samples to float32 in the range [-1.0, 1.0).
    assert (audio.dtype == np.int16)
    audio = audio.astype("float32")
    bits = np.iinfo(np.int16).bits
    audio = audio / (2**(bits - 1))
    return audio
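def pcm32to16(audio):
    # Convert float32 samples in [-1.0, 1.0] back to 16-bit PCM.
    assert (audio.dtype == np.float32)
    info = np.iinfo(np.int16)
    audio = audio * (2**(info.bits - 1))
    # Clip before casting: an input of exactly 1.0 scales to 32768, which would wrap around in int16.
    audio = np.clip(np.round(audio), info.min, info.max).astype("int16")
    return audio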
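The two conversions above are meant to be inverses of each other. The following is a minimal round-trip check, assuming NumPy is installed; the helper bodies are compact restatements of the functions above (with clipping added in the float-to-int direction), and the sample values are made up for illustration.

```python
import numpy as np

def pcm16to32(audio):
    # int16 PCM -> float32 in [-1.0, 1.0)
    assert audio.dtype == np.int16
    return audio.astype("float32") / 32768.0

def pcm32to16(audio):
    # float32 -> int16 PCM, clipped so that 1.0 does not wrap around
    assert audio.dtype == np.float32
    scaled = np.clip(np.round(audio * 32768.0), -32768, 32767)
    return scaled.astype("int16")

# Illustrative samples covering both extremes of the int16 range.
samples = np.array([-32768, -1, 0, 1, 32767], dtype=np.int16)
as_float = pcm16to32(samples)
round_trip = pcm32to16(as_float)

# Full-scale float input, which would overflow without clipping.
full_scale = pcm32to16(np.array([1.0], dtype=np.float32))
```

Because the scale factor is a power of two, the int16 to float32 to int16 round trip is exact for every representable sample.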
import pickle
with open("configs.pkl", "rb") as tf:
    configs = pickle.load(tf)
with open("cls.pkl", "rb") as tf:
    cls = pickle.load(tf)
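import paddle

# `model` and `text_feature` are assumed to have been created earlier; they are not defined in this snippet.
result_transcripts = model.decode(
    audio,
    paddle.to_tensor([363], dtype="int64"),
    text_feature=text_feature,
    decoding_method="attention_rescoring",
    beam_size=10,
    ctc_weight=0.5,
    decoding_chunk_size=-1,
    num_decoding_left_chunks=-1,
    simulate_streaming=False)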
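To give some intuition for `decoding_method="attention_rescoring"` and `ctc_weight=0.5`, here is a minimal sketch of one common formulation: each CTC beam-search hypothesis is re-scored by the attention decoder, and the final score is the attention log-probability plus `ctc_weight` times the CTC log-probability. This is an assumption about how the scores are combined, not a copy of the library's code; the function name `attention_rescore` and the example scores are hypothetical.

```python
def attention_rescore(hyps, ctc_weight=0.5):
    """Pick the best hypothesis from an n-best list.

    hyps: list of (text, ctc_score, attention_score) tuples, where both
    scores are log-probabilities (higher is better).
    """
    best = max(hyps, key=lambda h: h[2] + ctc_weight * h[1])
    return best[0]

# Hypothetical 2-best list: the first candidate has the better CTC score,
# the second has the better attention-decoder score.
n_best = [
    ("hello word", -4.0, -6.0),   # combined: -6.0 + 0.5 * -4.0 = -8.0
    ("hello world", -5.0, -3.0),  # combined: -3.0 + 0.5 * -5.0 = -5.5
]

best_default = attention_rescore(n_best)
best_ctc_heavy = attention_rescore(n_best, ctc_weight=10.0)
```

With a large `ctc_weight` the CTC score dominates and the choice flips, which is why the weight is exposed as a decoding parameter.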
The Deepspeech2 online model is based on the Deepspeech2 model, with some changes. It consists mainly of 2D convolution subsampling layers followed by stacked unidirectional RNN layers.
To illustrate the implementation clearly, three parts are described in detail:
- Data Preparation
- Encoder
- Decoder
The training and testing processes are also introduced.
For English data, the vocabulary dictionary is composed of the 26 English characters plus the apostrophe ('), space, `<blank>` and `<eos>`. `<blank>` represents the blank label in CTC, `<unk>` represents the unknown character, and `<eos>` represents both the start and the end characters. For Mandarin, the vocabulary dictionary is composed of the Chinese characters collected from the training set, plus three additional characters: `<blank>`, `<unk>` and `<eos>`. For both English and Mandarin data, the default indices are `<blank>` = 0, `<unk>` = 1 and `<eos>` = last index.
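The index convention described above can be sketched as follows. This is a minimal illustration, not the project's own vocabulary builder: the helper names `build_vocab`, `encode` and `ctc_greedy_collapse` are hypothetical, lowercase letters and sorted ordering of the regular characters are assumptions, and the collapse step shows only the standard CTC rule of merging repeats and dropping the blank label.

```python
import string

BLANK, UNK, EOS = "<blank>", "<unk>", "<eos>"

def build_vocab(characters):
    """Return a token list with <blank> at 0, <unk> at 1 and <eos> last."""
    body = sorted(set(characters))  # assumed ordering of regular characters
    return [BLANK, UNK] + body + [EOS]

# English: 26 letters plus apostrophe and space (lowercase is an assumption).
english_chars = list(string.ascii_lowercase) + ["'", " "]
vocab = build_vocab(english_chars)
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def encode(text):
    # Characters outside the vocabulary map to the <unk> index.
    return [token_to_id.get(ch, token_to_id[UNK]) for ch in text]

def ctc_greedy_collapse(ids, blank_id=0):
    # Standard CTC post-processing: merge repeated labels, then drop blanks.
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out

encoded = encode("a?")
collapsed = ctc_greedy_collapse([0, 4, 4, 0, 4, 5, 5, 0])
```

Fixing `<blank>` at index 0 lets CTC decoding use a constant blank id regardless of vocabulary size, while placing `<eos>` last keeps it out of the way of the regular characters.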