tomato Study Notes - d-vector and Other Basics

Table of Contents

  1. Introduction
  2. PyTorch Basics
  3. Overfitting and Underfitting
  4. Vanishing and Exploding Gradients
  5. A d-vector Example
  6. Summary

1. Introduction

        I am starting this post to record my journey learning speaker verification. As a complete beginner, getting started was genuinely painful, and the Chinese-language web has relatively few posts and learning materials on speaker verification; most are either direct copies translated from English sites or straight translations of papers. Here I simply note the problems I ran into while following various materials, so readers going through the same process may find something familiar.

2. PyTorch Basics

        This d-vector project mainly uses the methods below; I describe them mostly so that I can recall the basics quickly later. My main reference was https://www.w3cschool.cn/pytorch/pytorch-9dfn3bnh.html, which is essentially a translation of the official PyTorch tutorials. It has some translation issues, so if your English is good you can go straight to the official site.

1. Defining a custom network:

        A network cannot be used without this basic definition pattern: the class must implement the __init__() and forward() methods.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # declare the layers here; they can be modules you wrote yourself or
        # ready-made layers from torch.nn, for example:
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, x):
        # define what Net(input) computes; don't forget the activation
        # functions here, and return the result at the end
        output = self.fc(x)
        return output

2. Writing your own Dataset and using the built-in DataLoader

        The benefit of writing your own Dataset is that you can then use PyTorch's built-in DataLoader, which fetches data and labels batch by batch according to batch_size and also offers shuffle for randomized order. It cuts out a lot of code and the result reads well. Five-star recommendation ★★★★★

        In short, the methods the Dataset has to override are __init__(), __len__() and __getitem__(). A Dataset can also take a transform; for images a transform is typically cropping and scaling, while in my d-vector code I use a transform to crop the audio to a fixed length.

from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):

    def __init__(self, transform=None):
        # Initialize lookup tables here for later use. Note that the Dataset
        # should be indexable with a single integer, e.g. dataset[1] rather
        # than dataset[1, 1], because the default DataLoader samples items
        # that way.
        self.transform = transform

    def __len__(self):
        # required by DataLoader, just write it
        return len(something)

    def __getitem__(self, i):
        # implement whatever dataset[i] should return here
        # usually a dict is returned
        return {'input_feature': input_feature, 'tag': tag}

class Mytransform(object):

    def __init__(self):
        pass  # skipped here

    def __call__(self, sample):
        # put the main logic here and return the result; implementing __call__
        # lets Mytransform be invoked directly with ()
        return sample


# using the DataLoader: initialize it, then iterate over it
mydataset = MyDataset()
# if num_workers is not 0 this errors out on my Windows machine, not sure why;
# it should be usable on Linux
dataloader = DataLoader(mydataset, batch_size=4, shuffle=True, num_workers=0)

3. Notes on nn.Linear()

nn.Linear(in_features: int, out_features: int, bias: bool = True)
Args:
    in_features: size of each input sample
    out_features: size of each output sample
    bias: If set to ``False``, the layer will not learn an additive bias.
        Default: ``True``

        Basic usage is straightforward. Note that in Linear the batch dimension(s) simply stay in front: for example, if mat.shape = [5, 4, 20], then after nn.Linear(20, 10) you get mat.shape = [5, 4, 10]. In short, it only transforms the last dimension of mat. Internally it computes output = x·Wᵀ + b. This Linear() layer is very useful: you can add non-linearity with an activation function, use a Dropout layer against overfitting, and so on.
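
        A quick runnable sketch of the shape behavior described above (the sizes are just the ones from the example):

import torch
import torch.nn as nn

linear = nn.Linear(20, 10)       # only the last dimension is transformed: 20 -> 10
mat = torch.randn(5, 4, 20)      # all leading dimensions are treated as batch dimensions
out = linear(mat)                # internally: out = mat @ W.T + b
print(out.shape)                 # torch.Size([5, 4, 10])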

4. Training loop

        The basic PyTorch training loop is pretty much always the same; just follow this template and you're done.

import torch
import torch.nn as nn
import torch.optim as optim

net = Net()  # initialize the network
opt = optim.Adam(net.parameters(), lr=0.01)  # PyTorch ships optimizers, saves coding time
crit = nn.CrossEntropyLoss()  # PyTorch treats the loss function as just another layer

running_loss = 0
for epoch in range(num_epochs):
    for i, data in enumerate(dataloader):
        inputs, labels = data['input_feature'], data['tag']

        opt.zero_grad()
        output = net(inputs)
        loss = crit(output, labels)
        loss.backward()
        opt.step()  # parameters are updated here

        running_loss += loss.item()
        if i % show_times == show_times - 1:
            print("in [%d] epoch, the average loss is [%.5f]" % (epoch, running_loss / show_times))
            running_loss = 0
        

3. Overfitting and Underfitting

        A few causes of underfitting, in summary:

  • The data was not normalized (mainly an input-data problem); note, though, that normalization changes the input so it no longer matches the original data distribution
  • Dropout was enabled

        Once I took care of these two items the network was able to overfit: with the wav audio converted to 40-dimensional log-fbank features over 3 s, it reached the point of overfitting, i.e. the cross-entropy loss went to 0. Dropout really has to be used with care; it can drive you crazy.
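
        For reference, a minimal sketch of the kind of input normalization meant in the first bullet above: standardize each feature dimension over the training data (the array below is a random stand-in, not my actual features):

import numpy as np

# toy stand-in: one row per utterance, one column per flattened log-fbank dimension
features = np.random.randn(100, 40 * 299) * 5.0 + 3.0

mean = features.mean(axis=0)
std = features.std(axis=0) + 1e-8           # avoid division by zero
features_norm = (features - mean) / std     # zero mean, unit variance per dimension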

4. Vanishing and Exploding Gradients

        After digging through various sources: vanishing and exploding gradients are apparently perfectly normal. When a network has many layers, the multiplications and additions stacked layer upon layer make values become very large or very small, so the gradients naturally blow up or vanish as well.
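
        A tiny toy illustration of that effect (nothing from my model): stacking many Linear + Sigmoid layers makes the gradient that reaches the input shrink toward zero.

import torch
import torch.nn as nn

# 30 stacked Linear + Sigmoid blocks
deep = nn.Sequential(*[nn.Sequential(nn.Linear(32, 32), nn.Sigmoid()) for _ in range(30)])

x = torch.randn(8, 32, requires_grad=True)
deep(x).sum().backward()
print(x.grad.abs().mean())   # a very small number: the gradient has essentially vanished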

        Some remedies I found:

  • The network itself is simply wrong: this is the most troublesome case, since I mostly adapt other people's networks anyway, so what am I supposed to do if those are broken?
  • Add a batchnorm layer: PyTorch has nn.BatchNorm2d (or nn.BatchNorm1d for flat features), used before each activation. It helps, but it does change the distribution of the inputs, so use your own judgment.
  • A huge pitfall: pay close attention to the difference between NLLLoss and CrossEntropyLoss in PyTorch. CrossEntropyLoss already combines a LogSoftmax layer with NLLLoss, so if you apply Softmax yourself, do not also use CrossEntropyLoss; when I did, the gradients disappeared completely (see the sketch right after this list).
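
        A minimal check of that last point (toy tensors only): CrossEntropyLoss on raw logits gives the same value as LogSoftmax followed by NLLLoss, so adding your own Softmax in front of CrossEntropyLoss normalizes twice and flattens the gradients.

import torch
import torch.nn as nn

logits = torch.randn(4, 10)                   # raw, unnormalized network outputs
labels = torch.randint(0, 10, (4,))

ce  = nn.CrossEntropyLoss()(logits, labels)   # expects raw logits
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), labels)

print(torch.allclose(ce, nll))                # True: they are the same loss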

5. A d-vector Example

(1) Training

        The training part uses a Dataset and a DataLoader to read the data. The dataset is the Aishell corpus obtained from openslr, with the first 3 s of each utterance kept.

        The full model was run on a GPU server, so if you port this code to a server it needs small modifications: load the model, the inputs and the labels onto the GPU with .cuda().
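
        A minimal sketch of that adaptation (a toy module and toy batch, just to show the pattern; in the real code below, dnet, mfcc and label would be moved the same way):

import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

net = nn.Linear(40 * 299, 256).to(device)       # move the model onto the GPU (same effect as .cuda())
mfcc = torch.randn(4, 40 * 299).to(device)      # move the inputs onto the GPU
label = torch.randint(0, 10, (4,)).to(device)   # move the labels onto the GPU
output = net(mfcc)                              # the forward pass now runs on the GPU when available

        The actual training code follows.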

from torch.utils.data import Dataset, DataLoader
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as opt
import torch.multiprocessing as mp
import librosa
import os
import numpy as np
import matplotlib.pyplot as plt
import python_speech_features as psf
import pandas as pd

# like kaldi, use python to extract wav.scp and features
# r'D:/VMShare/wav/test'
def get_utt2spk(src_dir, target_dir):
    
    spks = os.listdir(src_dir)
    wav_addr = target_dir + '/wav.scp'
    utt2spk_addr = target_dir + '/utt2spk'
    with open(wav_addr, 'w+') as scp, open(utt2spk_addr, 'w+') as utt2spk:
        for spk in spks:
            wavs_dir = os.path.join(src_dir, spk)
            wavs = os.listdir(wavs_dir)
            for wav in wavs:
                print(wav[:-4], os.path.join(wavs_dir, wav), sep='\t', file=scp)
                print(wav[:-4], spk, sep='\t', file=utt2spk)

class SameLenth(object):
    
    def __init__(self, length):
        self.length = length
        
    def __call__(self, src):
        if (len(src) >= self.length):
            return src[:self.length]
        else :
            zeros = np.zeros(self.length - len(src))
            src = np.hstack((src, zeros))
            return src
        

def get_features(wav_scp, target_dir, method = 'logfbank', transform = SameLenth(3 * 16000)):
    
    wavs_frame = pd.read_table(wav_scp, header=None, names=('name', 'addr'))
    features_addr = r'data/features'
    features_dict = {}
    
    for i in range(len(wavs_frame)):
        wav = wavs_frame.iloc[i]
        wav_name = str(wav['name'])
        wav_addr = str(wav['addr'])

        src, sr = librosa.load(path = wav_addr, sr = None)
        if(transform):
            src = transform(src)

        feature = None
        if(method == 'mfcc'):
            feature = psf.mfcc(src, sr, numcep = 40, nfilt = 40)
        elif(method == 'logfbank'):
            feature = psf.logfbank(src, sr, nfilt = 40)
        elif(method == 'fbank'):
            feature = psf.fbank(src, sr, nfilt = 40)[0]
        features_dict[wav_name] = feature
        
    np.save(target_dir + '/features', features_dict)

#D:\VMShare\wav
class WavDataset(Dataset):
    
    def __init__(self, src_dir = r'D:/VMShare/wav/test', transform = None):
        # init wav, features, utt2spk
        sub_dir = os.path.basename(src_dir)
        base_dir = 'data/' + sub_dir
        utt2spk_addr = base_dir + '/utt2spk'
        features_addr = base_dir + '/features.npy'
        # make data dir
        if not (os.path.exists(base_dir)):
            os.makedirs(base_dir)
        # make utt2spk
        if not (os.path.exists(utt2spk_addr)):
            get_utt2spk(src_dir, base_dir)
        # make features
        if not (os.path.exists(features_addr)):
            get_features(base_dir + '/wav.scp', base_dir)
            
        self.src_dir = src_dir
        self.spks = os.listdir(src_dir)
        self.utt2spk_frame = pd.read_table(utt2spk_addr, header=None, names=('name', 'spk'))
        self.feature_dict = np.load(features_addr, allow_pickle=True).item()
    
    def __len__(self):
        return len(self.utt2spk_frame)
    
    def __getitem__(self, i):
        if(i >= len(self.utt2spk_frame)):
            raise Exception('Invalid index', i)
        
        wav = self.utt2spk_frame.iloc[i]
        wav_name = str(wav['name'])
        spkid = self.spks.index(str(wav['spk']))
        mfccs = self.feature_dict[wav_name]
        mfccs = torch.tensor(np.squeeze(mfccs.reshape(1, -1)))
        sample = { 'mfccs': mfccs, 'mark': spkid }
        return sample
    
    def get_speaker_num(self):
        return len(os.listdir(self.src_dir))

class DNet(nn.Module):
    
    def __init__(self, input_frame, output_speakers):
        super(DNet, self).__init__()
        # network start here
        self.batchnorm = nn.BatchNorm1d(input_frame)
        self.fc1 = nn.Linear(input_frame, 256)
        self.fc2 = nn.Linear(256, 256)
        self.fc3 = nn.Linear(256, 256)
        self.dropout1 = nn.Dropout(p = 0.5)
        self.fc4 = nn.Linear(256, 256)
        self.dropout2 = nn.Dropout(p = 0.5)
        self.output = nn.Linear(256, output_speakers)
        
    def forward(self, x):
        x = x.to(torch.float32)
        x = self.batchnorm(x)
        
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.dropout1(x)
        x = self.fc4(x)
        x = self.dropout2(x)
        x = self.output(x)
        return x
    
def param_init(m):
    
    if(isinstance(m, nn.Linear)):
        nn.init.orthogonal_(m.weight)
        nn.init.constant_(m.bias, 1e-10)

# training process
def dvector_train(dnet, process_i):
    # dataset prepare
    wavSet = WavDataset(r'D:/VMShare/wav/test')
    dataloader = DataLoader(wavSet, batch_size=4, shuffle=True, num_workers=0)
    criterion = nn.CrossEntropyLoss()
    optimer = opt.Adam(dnet.parameters(), lr = 0.001)#optimer = opt.SGD(dnet.parameters(), lr = 0.01, momentum = 0.9)
    
    log_file = r'data\log' + str(process_i)
    with open(log_file,'w+') as wf:
        for epoch in range(10):

            running_loss = 0
            for i, wavs in enumerate(dataloader):
                mfcc = wavs['mfccs']
                label = wavs['mark'].long()

                optimer.zero_grad()
                output = dnet(mfcc.float())
                # debug check: output and label shapes
                # print(output.shape)
                # print(label.shape)
                loss = criterion(output, label)
                loss.backward()
                optimer.step()

                running_loss += loss.item()
                if (i % 250 == 249): 
                    print("in %s thread, %s epoch, %s itertory. The average loss is %5f" 
                          % (process_i, epoch, i + 1 , running_loss / 250), file = wf)
                    _, predict = torch.max(output, 1)
                    print("predict =", predict, file = wf)
                    print("but it should be", label, file = wf)
                    running_loss = 0

if __name__ == '__main__':
    multi_mode = False
    dnet = DNet(40 * 299, len(os.listdir(r'D:/VMShare/wav/test')))
    dnet.train()

    if multi_mode:
        # multiprocess training with a shared model
        num_processes = 5
        dnet.share_memory()
        processes = []
        for i in range(num_processes):
            p = mp.Process(target=dvector_train, args=(dnet, i, ))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()
    else:
        dvector_train(dnet, 0)

(2) Verification

        Since this is only for learning, I was lazy (yes, too lazy to split a proper enrollment set and a verification set), so I simply read 12 utterances from each of two speakers to check the trained model.

        Theory: following the paper [1] Variani E., Lei X., McDermott E., et al. Deep neural networks for small footprint text-dependent speaker verification. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2014, the verification stage scores trials with the cosine distance. At first I thought the cosine distance was some sophisticated clustering method; it turns out it is just the high-school cos θ = dot product / product of norms. Why cosine scoring applies to this system can be found online, so I won't repeat it. In short: take the d-vector, here a 256-dimensional vector, L2-normalize it, and the dot product of two normalized vectors is then the cosine score; compare it against a threshold to decide.

        The code:

# Eval net
class DNet_eval(nn.Module):
    
    def __init__(self, input_frame):
        super(DNet_eval, self).__init__()
        # network start here
        self.batchnorm = nn.BatchNorm1d(input_frame)
        self.fc1 = nn.Linear(input_frame, 256)
        self.fc2 = nn.Linear(256, 256)
        self.fc3 = nn.Linear(256, 256)
        self.dropout1 = nn.Dropout(p = 0.5)
        self.fc4 = nn.Linear(256, 256)
        self.dropout2 = nn.Dropout(p = 0.5)
        
    def forward(self, x):
        x = x.to(torch.float32)
        x = self.batchnorm(x)
        
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.dropout1(x)
        x = self.fc4(x)
        x = self.dropout2(x)
        return x

# register
def extract_dvector(dnet, input):
    output = dnet(input)
    # L2 norm
    norm = torch.sum(output ** 2, 1) ** 0.5
    dvector = output
    for row in range(dvector.shape[0]):
#         print(dvector[row,:].shape)
#         print(norm[row])
        dvector[row, :] = dvector[row, :] / norm[row]
    
    dvector = torch.mean(dvector, 0)
    return dvector


def build_eval_model(model_path):
    dnet = DNet_eval(40 * 299)
    dnet_stat = torch.load(model_path, map_location=torch.device('cpu'))
    dnet_stat.pop('output.weight')
    dnet_stat.pop('output.bias')
    dnet.load_state_dict(dnet_stat)
    dnet.eval()
    return dnet


# scoring helper: dot-product (cosine) score between a registered d-vector and a
# test d-vector from extract_dvector, matching the torch.dot calls used below
def dnet_eval(rdvector, tdvector):
    return torch.dot(rdvector, tdvector).item()
    

dnet = build_eval_model('dnet_stat')
# spk1_mfcc / spk2_mfcc are feature tensors I prepared in advance: 12 utterances per
# speaker, the first 3 used for enrollment and the remaining 9 for verification
spk1_rdvector = extract_dvector(dnet, spk1_mfcc[:3,:]) 
spk2_rdvector = extract_dvector(dnet, spk2_mfcc[:3,:])

for i in range(3, 12):
    spk1_eval = extract_dvector(dnet, spk1_mfcc[i,:].unsqueeze(0))
    spk2_eval = extract_dvector(dnet, spk2_mfcc[i,:].unsqueeze(0))
    
    score1_1 = torch.dot(spk1_eval, spk1_rdvector)
    score1_2 = torch.dot(spk1_eval, spk2_rdvector)
    
    score2_1 = torch.dot(spk2_eval, spk1_rdvector)
    score2_2 = torch.dot(spk2_eval, spk2_rdvector)
    
    print("in ( %d ), 1_1: %.5f, 1_2: %.5f, 2_1: %.5f, 2_2: %.5f" % (i, score1_1, score1_2, score2_1, score2_2) )

        The results are below; 1_1 is the score of a spk1 test utterance against the spk1 enrollment d-vector, and so on. I haven't set a threshold here, but you can roughly see that the 1_1 scores are generally higher than 1_2, and 2_2 higher than 2_1, so the system does indeed tend to assign spk1's genuine speech to spk1 and spk2's to spk2. There are still some misclassified cases, such as row (7), where the 2_1 score exceeds 2_2. If you are interested, you can run a full test yourself.

in ( 3 ), 1_1: 0.49472, 1_2: 0.40563,         2_1: 0.21910, 2_2: 0.47767
in ( 4 ), 1_1: 0.58923, 1_2: 0.38271,         2_1: 0.24586, 2_2: 0.50004
in ( 5 ), 1_1: 0.62884, 1_2: 0.32944,         2_1: 0.01235, 2_2: 0.30427
in ( 6 ), 1_1: 0.54063, 1_2: 0.37800,         2_1: 0.27561, 2_2: 0.46045
in ( 7 ), 1_1: 0.48017, 1_2: 0.42385,         2_1: 0.37885, 2_2: 0.29421
in ( 8 ), 1_1: 0.56035, 1_2: 0.38387,         2_1: 0.20613, 2_2: 0.46929
in ( 9 ), 1_1: 0.57202, 1_2: 0.37773,         2_1: 0.20084, 2_2: 0.38082
in ( 10 ), 1_1: 0.50321, 1_2: 0.39833,         2_1: 0.10777, 2_2: 0.46831
in ( 11 ), 1_1: 0.52713, 1_2: 0.38174,         2_1: -0.03108, 2_2: 0.31025

6. Summary

        Recommended path for building a network: test on a mini-batch -> first get the model to overfit -> use regularization such as dropout, L2 regularization and batch normalization to remove the overfitting -> test.
