I'm starting this post to record my experience learning speaker verification (voiceprint). As a complete beginner, getting started was genuinely painful, and there isn't much material on speaker verification on Chinese sites: most posts are either copied straight from English pages or direct translations of papers. So I'll simply write down the problems I ran into while working through various tutorials; if you're learning this too, you may recognize a few of them.
This d-vector write-up mainly covers the methods below. I'm documenting them mostly so that I can quickly refresh my memory later if I forget the basics. My main learning resource was https://www.w3cschool.cn/pytorch/pytorch-9dfn3bnh.html, which is essentially a translation of the official PyTorch tutorials; it has a few translation issues, so if your English is good you can go straight to the official site.
The basic pattern for defining a network (the layers can't just be used on their own): subclass nn.Module, and the class must implement __init__() and forward().
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # declare your layers here; they can be modules you wrote yourself or
        # ready-made ones from torch.nn, for example:
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, x):
        # this defines what Net(input) computes; don't forget to apply the
        # activation functions and so on here, then return the result
        output = self.fc(x)
        return output
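Once defined, you instantiate the class and call the instance like a function; net(x) runs forward(x) under the hood. A quick usage sketch (input_size and output_size are placeholder values just for this sketch):

input_size, output_size = 20, 10     # placeholder sizes, only for this sketch
net = Net()
x = torch.randn(8, input_size)       # a batch of 8 samples
y = net(x)                           # equivalent to calling net.forward(x)
print(y.shape)                       # torch.Size([8, 10])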
The benefit of building your own Dataset is that you can then use PyTorch's built-in DataLoader, which pulls out data and labels batch by batch according to batch_size and can also shuffle the order. It simplifies a lot of code and the result reads nicely; five-star recommendation ★★★★★
Briefly, the methods a Dataset needs to override are __init__(), __len__() and __getitem__(). A Dataset can also take a transform; for images the transform is usually cropping and scaling, while in my d-vector setup I use the transform to cut the audio to a fixed length.
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, transform=None):
        # initialize lookup tables here so they can be used directly later.
        # Note that it's best if the Dataset can be indexed with a single
        # integer, i.e. dataset[1] rather than dataset[1, 1], because the
        # default DataLoader fetches samples by one random index at a time.
        self.transform = transform

    def __len__(self):
        # the DataLoader needs this, so just write it
        return len(something)

    def __getitem__(self, i):
        # code here decides what dataset[i] returns;
        # usually a dict is returned
        return {'input_feature': input_feature, 'tag': tag}

class Mytransform(object):
    def __init__(self):
        # skipped here
        pass

    def __call__(self, src):
        # the main logic goes here and returns the result; defining __call__
        # is what lets a Mytransform instance be called directly with ()
        return src

# using the DataLoader: initialize it, then just use it
mydataset = MyDataset()
# if num_workers is not set to 0 my Windows machine throws an error,
# no idea why; it should be usable on Linux
dataloader = DataLoader(mydataset, batch_size=4, shuffle=True, num_workers=0)
nn.Linear(in_features: int, out_features: int, bias: bool = True)
Args:
in_features: size of each input sample
out_features: size of each output sample
bias: If set to ``False``, the layer will not learn an additive bias.
Default: ``True``
Using it is basically straightforward. Note that with Linear the batch dimensions just sit in front: e.g. with mat.shape = [5, 4, 20], after nn.Linear(20, 10) it becomes mat.shape = [5, 4, 10]. In short, it transforms the last dimension of mat. Internally the function is essentially output = W*x + b. This Linear() layer is very useful; you can add non-linearity with activation functions, use Dropout layers against overfitting, and so on.
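A minimal sketch to confirm the shape behaviour described above:

import torch
import torch.nn as nn

fc = nn.Linear(20, 10)
mat = torch.randn(5, 4, 20)   # all leading dimensions are treated as batch dimensions
out = fc(mat)                 # the Linear layer only transforms the last dimension
print(out.shape)              # torch.Size([5, 4, 10])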
Anyway, the basic PyTorch training loop is always more or less the same; just follow this pattern and you're done.
import torch
import torch.nn as nn
import torch.optim as optim

net = Net()                                  # initialize the network
opt = optim.Adam(net.parameters(), lr=0.01)  # PyTorch ships optimizers, which saves coding time
crit = nn.CrossEntropyLoss()                 # PyTorch treats the loss function as just another layer
running_loss = 0
for epoch in range(num):
    for data in dataloader:
        input, label = data['input_feature'], data['tag']
        opt.zero_grad()
        output = net(input)
        loss = crit(output, label)
        loss.backward()
        opt.step()                           # the parameters are updated here
        running_loss += loss.item()
    if epoch % show_times == show_times - 1:
        print("in [%d] epoch, the average loss is [%.5f]" % (epoch, running_loss / show_times))
        running_loss = 0
To sum up, there were a couple of causes of underfitting:
Once I fixed those two things, the network could overfit: with the input wavs converted to 40-dimensional logfbank features over 3 s, it reached the point of overfitting, i.e. the cross-entropy loss went to 0. I have to say, dropout needs to be used carefully; it nearly drove me crazy.
After digging through various resources: vanishing and exploding gradients are apparently quite normal, because when the network has many layers, every layer multiplies and adds things on top of each other, so values become very large or very small and the gradients naturally do the same.
Some remedies I found:
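For what it's worth, the full code further down does use two standard ones: batch normalization on the input features and orthogonal weight initialization with a tiny constant bias. A minimal sketch of both (toy_net here is just a throwaway stack of layers for illustration):

import torch
import torch.nn as nn

# orthogonal init keeps the scale of activations/gradients roughly stable
# across layers (same as the param_init() defined in the full code below)
def param_init(m):
    if isinstance(m, nn.Linear):
        nn.init.orthogonal_(m.weight)
        nn.init.constant_(m.bias, 1e-10)

# BatchNorm1d on the input features keeps activations from drifting to
# extreme scales as they pass through the stack
toy_net = nn.Sequential(
    nn.BatchNorm1d(40),
    nn.Linear(40, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
)
toy_net.apply(param_init)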
The training side uses a Dataset and a DataLoader to read the data. The data is the Aishell corpus from openslr, with the first 3 s of each utterance taken.
The model runs on a GPU on the server, so when moving this code to the server it needs small changes: load the model, the inputs and the labels onto the GPU with .cuda()
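Roughly, those changes look like this (just a sketch; dnet, mfcc and label are the names used in the training code below):

# after building the model, move it to the GPU once
dnet = dnet.cuda()

# inside the training loop, move each batch to the GPU as well
mfcc = mfcc.cuda()
label = label.cuda()
output = dnet(mfcc.float())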
from torch.utils.data import Dataset, DataLoader
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as opt
import torch.multiprocessing as mp
import librosa
import os
import numpy as np
import matplotlib.pyplot as plt
import python_speech_features as psf
import pandas as pd
# like kaldi, use python to extract wav.scp and features
# r'D:/VMShare/wav/test'
def get_utt2spk(src_dir, target_dir):
spks = os.listdir(src_dir)
wav_addr = target_dir + '/wav.scp'
utt2spk_addr = target_dir + '/utt2spk'
with open(wav_addr, 'w+') as scp, open(utt2spk_addr, 'w+') as utt2spk:
for spk in spks:
wavs_dir = os.path.join(src_dir, spk)
wavs = os.listdir(wavs_dir)
for wav in wavs:
print(wav[:-4], os.path.join(wavs_dir, wav), sep='\t', file=scp)
print(wav[:-4], spk, sep='\t', file=utt2spk)
class SameLenth(object):
def __init__(self, length):
self.length = length
def __call__(self, src):
if (len(src) >= self.length):
return src[:self.length]
else :
zeros = np.zeros(self.length - len(src))
src = np.hstack((src, zeros))
return src
def get_features(wav_scp, target_dir, method = 'logfbank', transform = SameLenth(3 * 16000)):
wavs_frame = pd.read_table(wav_scp, header=None, names=('name', 'addr'))
features_addr = r'data/features'
features_dict = {}
for i in range(len(wavs_frame)):
wav = wavs_frame.iloc[i]
wav_name = str(wav['name'])
wav_addr = str(wav['addr'])
src, sr = librosa.load(path = wav_addr, sr = None)
if(transform):
src = transform(src)
feature = None
if(method == 'mfcc'):
feature = psf.mfcc(src, sr, numcep = 40, nfilt = 40)
elif(method == 'logfbank'):
feature = psf.logfbank(src, sr, nfilt = 40)
elif(method == 'fbank'):
feature = psf.fbank(src, sr, nfilt = 40)[0]
features_dict[wav_name] = feature
np.save(target_dir + '/features', features_dict)
#D:\VMShare\wav
class WavDataset(Dataset):
def __init__(self, src_dir = r'D:/VMShare/wav/test', transform = None):
# init wav, features, utt2spk
sub_dir = os.path.basename(src_dir)
base_dir = 'data/' + sub_dir
utt2spk_addr = base_dir + '/utt2spk'
features_addr = base_dir + '/features.npy'
# make data dir
if not (os.path.exists(base_dir)):
os.makedirs(base_dir)
# make utt2spk
if not (os.path.exists(utt2spk_addr)):
get_utt2spk(src_dir, base_dir)
# make features
if not (os.path.exists(features_addr)):
get_features(base_dir + '/wav.scp', base_dir)
self.src_dir = src_dir
self.spks = os.listdir(src_dir)
self.utt2spk_frame = pd.read_table(utt2spk_addr, header=None, names=('name', 'spk'))
self.feature_dict = np.load(features_addr, allow_pickle=True).item()
def __len__(self):
return len(self.utt2spk_frame)
def __getitem__(self, i):
if(i >= len(self.utt2spk_frame)):
raise Exception('Invalid index', i)
wav = self.utt2spk_frame.iloc[i]
wav_name = str(wav['name'])
spkid = self.spks.index(str(wav['spk']))
mfccs = self.feature_dict[wav_name]
mfccs = torch.tensor(np.squeeze(mfccs.reshape(1, -1)))
sample = { 'mfccs': mfccs, 'mark': spkid }
return sample
def get_spekaer_num(self):
return len(os.listdir(self.src_dir))
class DNet(nn.Module):
def __init__(self, input_frame, output_speakers):
super(DNet, self).__init__()
# network start here
self.batchnorm = nn.BatchNorm1d(input_frame)
self.fc1 = nn.Linear(input_frame, 256)
self.fc2 = nn.Linear(256, 256)
self.fc3 = nn.Linear(256, 256)
self.dropout1 = nn.Dropout(p = 0.5)
self.fc4 = nn.Linear(256, 256)
self.dropout2 = nn.Dropout(p = 0.5)
self.output = nn.Linear(256, output_speakers)
def forward(self, x):
x = x.to(torch.float32)
x = self.batchnorm(x)
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
x = F.relu(self.fc3(x))
x = self.dropout1(x)
x = self.fc4(x)
x = self.dropout2(x)
x = self.output(x)
return x
def param_init(m):
if(isinstance(m, nn.Linear)):
nn.init.orthogonal_(m.weight)
nn.init.constant_(m.bias, 1e-10)
# training process
def dvector_train(dnet, process_i):
# dataset prepare
wavSet = WavDataset(r'D:/VMShare/wav/test')
dataloader = DataLoader(wavSet, batch_size=4, shuffle=True, num_workers=0)
criterion = nn.CrossEntropyLoss()
optimer = opt.Adam(dnet.parameters(), lr = 0.001)#optimer = opt.SGD(dnet.parameters(), lr = 0.01, momentum = 0.9)
log_file = r'data\log' + str(process_i)
with open(log_file,'w+') as wf:
for epoch in range(10):
running_loss = 0
for i, wavs in enumerate(dataloader):
mfcc = wavs['mfccs']
label = torch.tensor(wavs['mark']).long()
optimer.zero_grad()
output = dnet(mfcc.float())
print(output.shape)
print(label.shape)
loss = criterion(output, label)
loss.backward()
optimer.step()
running_loss += loss.item()
if (i % 250 == 249):
print("in %s thread, %s epoch, %s itertory. The average loss is %5f"
% (process_i, epoch, i + 1 , running_loss / 250), file = wf)
_, predict = torch.max(output, 1)
print("predict =", predict, file = wf)
print("but it should be", label, file = wf)
running_loss = 0
if(__name__ == '__main__'):
muilti_mode = False
dnet = DNet(40 * 299, len(os.listdir(r'D:/VMShare/wav/test')))
dnet.train()
if(muilti_mode == True):
# muiltiprocess
num_processes = 5
dnet.share_memory()
processes = []
for i in range(num_processes):
p = mp.Process(target=dvector_train, args=(dnet, i, ))
p.start()
processes.append(p)
for p in processes:
p.join()
else:
dvector_train(dnet, 0)
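One thing the script above doesn't show: the evaluation code below loads the trained weights from a file called 'dnet_stat', so after training finishes the state_dict needs to be saved, e.g.:

# save the trained parameters so the eval script can reload them later
torch.save(dnet.state_dict(), 'dnet_stat')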
Since this is just for learning, I was lazy (yes, too lazy to split a proper enrollment set and a verification set), so I simply read 12 utterances from each of two speakers to check the results of the trained model.
Theory: following the paper [1] E. Variani, X. Lei, E. McDermott, et al., "Deep neural networks for small footprint text-dependent speaker verification," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, the verification stage scores trials with the cosine distance. At first I thought cosine distance was some fancy clustering method, but it's just the high-school cos θ = dot product / product of norms. Why the cosine distance works for this system can be found online, so I won't repeat it here. In short, you get the d-vector (in my case a 256-dimensional vector), L2-normalize it, and after normalization a plain dot product is the cosine distance; then you make the accept/reject decision against a threshold.
Here is the code:
# Eval net
class DNet_eval(nn.Module):
def __init__(self, input_frame):
super(DNet_eval, self).__init__()
# network start here
self.batchnorm = nn.BatchNorm1d(input_frame)
self.fc1 = nn.Linear(input_frame, 256)
self.fc2 = nn.Linear(256, 256)
self.fc3 = nn.Linear(256, 256)
self.dropout1 = nn.Dropout(p = 0.5)
self.fc4 = nn.Linear(256, 256)
self.dropout2 = nn.Dropout(p = 0.5)
def forward(self, x):
x = x.to(torch.float32)
x = self.batchnorm(x)
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
x = F.relu(self.fc3(x))
x = self.dropout1(x)
x = self.fc4(x)
x = self.dropout2(x)
return x
# register
def extract_dvector(dnet, input):
output = dnet(input)
# L2 norm
norm = torch.sum(output ** 2, 1) ** 0.5
dvector = output
for row in range(dvector.shape[0]):
# print(dvector[row,:].shape)
# print(norm[row])
dvector[row, :] = dvector[row, :] / norm[row]
dvector = torch.mean(dvector, 0)
return dvector
def build_eval_model(model_path):
dnet = DNet_eval(40 * 299)
dnet_stat = torch.load(model_path, map_location=torch.device('cpu'))
dnet_stat.pop('output.weight')
dnet_stat.pop('output.bias')
dnet.load_state_dict(dnet_stat)
dnet.eval()
return dnet
# dnet eval procedure
# score a test d-vector against a registered d-vector: both are already
# L2-normalized by extract_dvector, so the dot product is the cosine score
def dnet_eval(rdvector, tdvector):
    return torch.dot(rdvector, tdvector)

# load the state_dict and init the eval model
dnet = build_eval_model('dnet_stat')
# spk1_mfcc / spk2_mfcc are feature tensors I prepared in advance: 12 utterances
# per speaker, the first 3 used for enrollment and the last 9 for verification
spk1_rdvector = extract_dvector(dnet, spk1_mfcc[:3,:])
spk2_rdvector = extract_dvector(dnet, spk2_mfcc[:3,:])
for i in range(3, 12):
spk1_eval = extract_dvector(dnet, spk1_mfcc[i,:].unsqueeze(0))
spk2_eval = extract_dvector(dnet, spk2_mfcc[i,:].unsqueeze(0))
    score1_1 = dnet_eval(spk1_rdvector, spk1_eval)
    score1_2 = dnet_eval(spk2_rdvector, spk1_eval)
    score2_1 = dnet_eval(spk1_rdvector, spk2_eval)
    score2_2 = dnet_eval(spk2_rdvector, spk2_eval)
print("in ( %d ), 1_1: %.5f, 1_2: %.5f, 2_1: %.5f, 2_2: %.5f" % (i, score1_1, score1_2, score2_1, score2_2) )
The results are below; 1_1 is the score of speaker spk1's test utterance against spk1's enrollment, and so on. I haven't set a threshold here, but you can roughly see that the 1_1 scores are generally higher than 1_2, and 2_2 higher than 2_1, so the system does indeed tend to assign spk1's genuine speech to spk1 and spk2's to spk2. There are also a few misjudged cases, such as the row at i = 7, where 2_1 comes out higher than 2_2. If you're interested, you can run a full evaluation yourself.
in ( 3 ), 1_1: 0.49472, 1_2: 0.40563, 2_1: 0.21910, 2_2: 0.47767
in ( 4 ), 1_1: 0.58923, 1_2: 0.38271, 2_1: 0.24586, 2_2: 0.50004
in ( 5 ), 1_1: 0.62884, 1_2: 0.32944, 2_1: 0.01235, 2_2: 0.30427
in ( 6 ), 1_1: 0.54063, 1_2: 0.37800, 2_1: 0.27561, 2_2: 0.46045
in ( 7 ), 1_1: 0.48017, 1_2: 0.42385, 2_1: 0.37885, 2_2: 0.29421
in ( 8 ), 1_1: 0.56035, 1_2: 0.38387, 2_1: 0.20613, 2_2: 0.46929
in ( 9 ), 1_1: 0.57202, 1_2: 0.37773, 2_1: 0.20084, 2_2: 0.38082
in ( 10 ), 1_1: 0.50321, 1_2: 0.39833, 2_1: 0.10777, 2_2: 0.46831
in ( 11 ), 1_1: 0.52713, 1_2: 0.38174, 2_1: -0.03108, 2_2: 0.31025
Recommended path for building a network: test on a small minibatch ——> first get the model to overfit ——> then bring the overfitting back down with regularization (dropout, L2 regularization, batch normalization) ——> test.