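I have recently been evaluating open-source speech recognition models, and the first one I tested is Baidu's PaddlePaddle.
Before testing, I needed to get familiar with a few basics of audio parsing, hence this introductory article.
The audio parsing libraries I came across are mainly the following:
Installation: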
!pip install librosa -i https://mirror.baidu.com/pypi/simple
!pip install soundfile -i https://mirror.baidu.com/pypi/simple
Reference: librosa
Documentation: https://librosa.org/doc/latest/core.html#audio-loading
signal, sr = librosa.load(path, sr=None)
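The parameters of load are:
librosa.load(path, *, sr=22050, mono=True, offset=0.0, duration=None, dtype=np.float32, res_type='kaiser_best')
Passing sr=None keeps the file's original sampling rate; any other value triggers resampling, which is somewhat slow.
It can read both .wav and .mp3 files.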
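Two other articles found online, "Python audio sample rate conversion" and "Python audio file sample rate conversion", contain code that raises an error when exporting the audio file. Their code is reproduced below.
Code snippet 1: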
def resample_rate(path, new_sample_rate=16000):
    signal, sr = librosa.load(path, sr=None)
    wavfile = path.split('/')[-1]
    wavfile = wavfile.split('.')[0]
    file_name = wavfile + '_new.wav'
    new_signal = librosa.resample(signal, sr, new_sample_rate)
    librosa.output.write_wav(file_name, new_signal, new_sample_rate)
Code snippet 2:
import librosa
import os
noise_name = "/media/dfy/fc0b6513-c379-4548-b391-876575f1493f/home/dfy/PycharmProjects/noise_data/"
noise_name_list = os.listdir(noise_name)
for one_name in noise_name_list:
    data = librosa.load(noise_name + one_name, 16000)
    librosa.output.write_wav(noise_name + one_name, data[0], 16000, norm=False)
if __name__ == '__main__':
    pass
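Both snippets export with librosa.output, which recent versions of librosa have removed. The result is this error:
AttributeError: module librosa has no attribute output
(Related article: "No module named numba.decorators" error resolution)
The output API was removed in version 0.8.0, so you either downgrade librosa, for example to 0.7.2, or write the file another way.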
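So I went back to the official documentation: librosa
It recommends writing files with this library instead: PySoundFile
If you see this error:
Input audio file has sample rate [44100], but decoder expects [16000]
the sampling rate of the audio does not match what the decoder expects, and the file needs to be resampled.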
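One way to catch this mismatch before handing a file to the recognizer is to read the rate from the WAV header. The following is a minimal sketch using only the standard-library wave module, which handles plain PCM WAV files; wav_sample_rate and needs_resample are hypothetical helper names, and the 16000 default is taken from the error message above.

```python
import os
import tempfile
import wave


def wav_sample_rate(path):
    """Return the sampling rate stored in a PCM WAV header."""
    with wave.open(path, "rb") as wf:
        return wf.getframerate()


def needs_resample(path, expected_sr=16000):
    """True if the file's rate differs from what the decoder expects."""
    return wav_sample_rate(path) != expected_sr


# Demo: write 0.1 s of 16-bit mono silence at 44.1 kHz, then check it
demo_path = os.path.join(tempfile.mkdtemp(), "demo_44k.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)       # 2 bytes per sample = 16-bit PCM
    wf.setframerate(44100)
    wf.writeframes(b"\x00\x00" * 4410)

print(wav_sample_rate(demo_path), needs_resample(demo_path))
```

Files flagged by needs_resample can then be passed through a resampling function before recognition.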
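Combining the two libraries above (librosa and PySoundFile), I lightly adapted the code from "Python audio sample rate conversion" and "Python audio file sample rate conversion" to get the following function for changing an audio file's sampling rate: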
import librosa
import os
import numpy as np
import soundfile as sf

def resample_rate(path, new_sample_rate=16000):
    # Load at the file's native sampling rate
    signal, sr = librosa.load(path, sr=None)
    # Build the output name from the input base name, e.g. xxx.wav -> xxx_new.wav
    wavfile = os.path.splitext(os.path.basename(path))[0]
    file_name = wavfile + '_new.wav'
    # Recent librosa releases require orig_sr / target_sr as keyword arguments
    new_signal = librosa.resample(signal, orig_sr=sr, target_sr=new_sample_rate)
    # librosa.output.write_wav(file_name, new_signal, new_sample_rate)
    sf.write(file_name, new_signal, new_sample_rate, subtype='PCM_24')
    print(f'{file_name} has been saved.')

wav_file = 'video/xxx.wav'
resample_rate(wav_file, new_sample_rate=16000)
The result is an audio file with a sample_rate of 16000.
Reference: https://librosa.org/doc/latest/generated/librosa.load.html#librosa.load
Option 1:
# Load using an already open SoundFile object
import soundfile
sfo = soundfile.SoundFile(librosa.ex('brahms'))
y, sr = librosa.load(sfo)
Option 2:
# Load using an already open audioread object
import audioread.ffdec # Use ffmpeg decoder
aro = audioread.ffdec.FFmpegAudioFile(librosa.ex('brahms'))
y, sr = librosa.load(aro)
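python-soundfile is an audio library based on libsndfile, CFFI, and NumPy.
Sound files can be read and written directly with the functions read() and write(). To read a sound file block by block, use blocks(). Alternatively, a sound file can be opened as a SoundFile object.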
Official PySoundFile documentation: readthedocs
Installation:
!pip install soundfile -i https://mirror.baidu.com/pypi/simple
Read files from zip-compressed archives:
import zipfile as zf
import soundfile as sf
import io
with zf.ZipFile('test.zip') as myzip:
    with myzip.open('stereo_file.wav') as myfile:
        tmp = io.BytesIO(myfile.read())
        data, samplerate = sf.read(tmp)
Download and read from URL:
import soundfile as sf
import io
from urllib.request import urlopen
url = "https://raw.githubusercontent.com/librosa/librosa/master/tests/data/test1_44100.wav"
data, samplerate = sf.read(io.BytesIO(urlopen(url).read()))
Exporting audio:
import numpy as np
import soundfile as sf
samplerate = 44100
data = np.random.uniform(-1, 1, size=(samplerate * 10, 2))
# Write out audio as 24bit PCM WAV
sf.write('stereo_file.wav', data, samplerate, subtype='PCM_24')
# Write out audio as 24bit Flac
sf.write('stereo_file.flac', data, samplerate, format='flac', subtype='PCM_24')
# Write out audio as 16bit OGG
sf.write('stereo_file.ogg', data, samplerate, format='ogg', subtype='vorbis')
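Batch-converting the audio sample rate of videos with Python (with code) | Python tools
Installation:
pip install ffmpy -i https://pypi.douban.com/simple
See the original article for the full code; only one excerpt is shown here: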
import os
from ffmpy import FFmpeg

def transfor(video_path: str, tmp_dir: str, result_dir: str):
    file_name = os.path.basename(video_path)
    base_name = file_name.split('.')[0]
    file_ext = file_name.split('.')[-1]
    ext = 'wav'
    # Step 1: extract the audio track as mono 16 kHz WAV
    audio_path = os.path.join(tmp_dir, '{}.{}'.format(base_name, ext))
    print('File: {}, extracting audio'.format(audio_path))
    ff = FFmpeg(
        inputs={video_path: None},
        outputs={audio_path: '-f {} -vn -ac 1 -ar 16000 -y'.format('wav')})
    print(ff.cmd)
    ff.run()
    if os.path.exists(audio_path) is False:
        return None
    # Step 2: write a copy of the video with the audio stripped
    video_tmp_path = os.path.join(
        tmp_dir, '{}_1.{}'.format(base_name, file_ext))
    ff_video = FFmpeg(inputs={video_path: None},
                      outputs={video_tmp_path: '-an'})
    print(ff_video.cmd)
    ff_video.run()
    # Step 3: mux the silent video with the resampled audio
    result_video_path = os.path.join(result_dir, file_name)
    ff_fuse = FFmpeg(inputs={video_tmp_path: None, audio_path: None}, outputs={
        result_video_path: '-map 0:v -map 1:a -c:v copy -c:a aac -shortest'})
    print(ff_fuse.cmd)
    ff_fuse.run()
    return result_video_path
Reference article:
Python | Speech processing | Comparing librosa / AudioSegment / soundfile for reading audio files
import time
import numpy as np
from pydub import AudioSegment  # third-party library pydub; install it before first use
audio_path = './data/example.mp3'
t = time.time()
song = AudioSegment.from_file(audio_path, format='mp3')
# print(len(song))  # duration, in milliseconds
# print(song.frame_rate)  # sampling rate, in Hz
# print(song.sample_width)  # sample width, in bytes
# print(song.channels)  # number of channels; MP3s are commonly stereo, and more channels mean a larger file
wav = np.array(song.get_array_of_samples())
sr = song.frame_rate
print(f"sr={sr}, len={len(wav)}, elapsed: {time.time()-t}")
print(f"(min, max, mean) = ({wav.min()}, {wav.max()}, {wav.mean()})")
wav
Output:
sr=16000, len=64320, elapsed: 0.04667925834655762
(min, max, mean) = (-872, 740, -0.6079446517412935)
array([ 1, -1, -2, ..., -1, 1, -2], dtype=int16)
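Note that pydub returns raw int16 samples here, whereas librosa returns float32 values in the range [-1, 1]. A minimal sketch of converting between the two follows, assuming 16-bit audio (song.sample_width == 2); pcm16_to_float32 is a hypothetical helper name. For multi-channel audio, get_array_of_samples() interleaves the channels, so the helper also reshapes to (frames, channels).

```python
import numpy as np


def pcm16_to_float32(samples, channels=1):
    """Scale 16-bit PCM samples to float32 in [-1, 1)."""
    data = np.asarray(samples, dtype=np.int16).astype(np.float32) / 32768.0
    if channels > 1:
        # Interleaved L, R, L, R, ... becomes one row per frame
        data = data.reshape(-1, channels)
    return data


mono = pcm16_to_float32([0, 16384, -32768, 32767])
stereo = pcm16_to_float32([1, 2, 3, 4], channels=2)
print(mono.dtype, mono[:3])
print(stereo.shape)
```

With the pydub objects above, the call would be pcm16_to_float32(song.get_array_of_samples(), song.channels).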
Installation:
! pip install paddleaudio -i https://mirror.baidu.com/pypi/simple
This is an official wrapper from Paddle; its basic audio operations appear to be built on librosa.
See:
https://paddleaudio-doc.readthedocs.io/en/latest/index.html
import paddleaudio
audio_file = 'XXX.wav'
paddleaudio.load(audio_file, sr=None, mono=True, normal=False)
Result:
(array([-3.9100647e-04, -3.0159950e-05, 1.1110306e-04, ...,
1.4603138e-04, 2.5625229e-03, -7.6780319e-03], dtype=float32),
16000)
That is, the audio samples plus the sampling rate.
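Based on: "[Super simple] Building a personal speech dictation service with PaddleSpeech"
!pip install auditok
The reason for splitting is that PaddleSpeech recognizes at most 50 s of speech at a time, so longer audio has to be cut into pieces; here auditok is called directly to do that.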
from paddlespeech.cli.asr.infer import ASRExecutor
import csv
import moviepy.editor as mp
import auditok
import os
import paddle
from paddlespeech.cli import ASRExecutor, TextExecutor
import soundfile
import librosa
import warnings
warnings.filterwarnings('ignore')

# Input type is audio
def qiefen(path, ty='audio', mmin_dur=1, mmax_dur=100000, mmax_silence=1, menergy_threshold=55):
    audio_file = path
    audio, audio_sample_rate = soundfile.read(
        audio_file, dtype="int16", always_2d=True)
    audio_regions = auditok.split(
        audio_file,
        min_dur=mmin_dur,  # minimum duration of a valid audio event in seconds
        max_dur=mmax_dur,  # maximum duration of an event
        # maximum duration of tolerated continuous silence within an event
        max_silence=mmax_silence,
        energy_threshold=menergy_threshold  # threshold of detection
    )
    # Output directory: change/<ty>/<input file name without extension>.
    # Created before the loop so the return value is defined even when no region is found.
    file_pre = os.path.splitext(os.path.basename(audio_file))[0]
    out_dir = 'change' + '/' + ty + '/' + file_pre
    os.makedirs(out_dir, exist_ok=True)
    for i, r in enumerate(audio_regions):
        # Regions returned by `split` have 'start' and 'end' metadata fields
        print(
            "Region {i}: {r.meta.start:.3f}s -- {r.meta.end:.3f}s".format(i=i, r=r))
        # Zero-padded three-digit index so the saved files sort in order
        s = '000000' + str(i)
        file_save = out_dir + '/' + \
            s[-3:] + '-' + '{meta.start:.3f}-{meta.end:.3f}' + '.wav'
        filename = r.save(file_save)
        print("region saved as: {}".format(filename))
    return out_dir
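Note that qiefen passes mmax_dur=100000 by default, so silence-based splitting alone does not guarantee that every region stays under the 50 s limit mentioned above. As a fallback, a long signal can be cut into fixed-length pieces. The following is a minimal sketch in plain Python; chunk_bounds is a hypothetical helper name, and the 50 s default is taken from the text rather than from any PaddleSpeech API.

```python
def chunk_bounds(n_samples, sample_rate, max_seconds=50.0):
    """Return (start, end) sample indices of consecutive chunks,
    each at most `max_seconds` long."""
    step = int(sample_rate * max_seconds)
    if n_samples < 0 or step < 1:
        raise ValueError("n_samples must be >= 0 and the chunk length >= 1 sample")
    return [(start, min(start + step, n_samples))
            for start in range(0, n_samples, step)]


# A 120 s recording at 16 kHz becomes chunks of 50 s, 50 s and 20 s
bounds = chunk_bounds(16000 * 120, 16000)
print(bounds)
```

Each pair can be used to slice a loaded signal, for example signal[start:end], before writing the piece out for recognition. Cutting at fixed positions may split a word in half, which is why silence-based splitting is tried first.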