The speaker-adaptation part of GMM-UBM in sidekit

---- Updated 2016.12.18 ----
sidekit is actually quite nice and simple; the documentation essentially hands you the source code. If you can get the environment set up and have some background, running it end to end within a day should be no problem.
Below is a detailed analysis of the speaker adaptation and score computation in GMM-UBM, including some rewritten code, since all those h5 files get tiresome to look at. Before reading on, make sure you are already familiar with sidekit and clear on the format of its h5 files; otherwise there is no point in continuing.

Here is the source code of the adaptation part. utils is my own module; please ignore the gmm_score and EER parts for now (they come up later) and focus on the MAP part:

import sidekit
import numpy as np
from utils import EER, gmm_score
import h5py

'''
this standalone version runs end to end and produces the predicted results
'''
enroll_idmap = sidekit.IdMap('task/enroll_spks2utt.h5')
ubm = sidekit.Mixture()
ubm.read("task/ubm.h5")
nj = 10

server_eval = sidekit.FeaturesServer(feature_filename_structure="./mfcc_eval/{}.h5",
                                     dataset_list=["energy", "cep", "vad"],
                                     mask=None,
                                     feat_norm="cmvn",
                                     keep_all_features=False,
                                     delta=True,
                                     double_delta=True,
                                     rasta=True,
                                     context=None)

print('Compute the sufficient statistics')
enroll_stat = sidekit.StatServer(enroll_idmap, ubm)
enroll_stat.accumulate_stat(ubm=ubm, feature_server=server_eval,
                            seg_indices=range(enroll_stat.segset.shape[0]), num_thread=nj)
enroll_stat.write('task/stat_enroll_stand.h5')

print('MAP adaptation of the speaker models')

regulation_factor = 3  # MAP regulation factor
enroll_sv = enroll_stat.adapt_mean_map_multisession(ubm, regulation_factor)
enroll_sv.write('task/map_enroll_stand.h5')


print('Compute trial scores')
enroll = sidekit.StatServer('task/map_enroll_stand.h5')

s = np.zeros((59, 1024))

gscore = gmm_score(ubm, enroll, server_eval, s)
scores = gscore.compute_scores()

eer = EER(scores)
eer.compute_eer()

The script above mainly involves two methods: accumulate_stat(), which computes the sufficient statistics, and adapt_mean_map_multisession(), which performs the MAP update of those statistics. Let's look at each of them. Note that some of the parameter passing here differs from the original source; I had wanted to rewrite it, but my version wasn't as good as theirs:
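For reference, these two methods together implement the classical MAP mean adaptation of Reynolds et al. (2000); writing the equations out up front makes the code easier to follow (here r is the relevance factor, regulation_factor in the script above):

n_i = \sum_{t=1}^{T} Pr(\lambda_i | x_t)
E_i(x) = \frac{1}{n_i} \sum_{t=1}^{T} Pr(\lambda_i | x_t) \, x_t
\alpha_i = \frac{n_i}{n_i + r}
\hat{\mu}_i = \alpha_i E_i(x) + (1 - \alpha_i) \mu_i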

    def accumulate_stat(self, feature_server):

        '''
        result: enroll_stat.write('task/stat_enroll.h5')
        stat0.shape = (228, 1024)
        stat1.shape = (228, 64512)
        start.shape = (228, )  # discarded; start/stop have no effect on the result
        stop.shape = (228, )   # discarded
        segset.shape = (228, )
        modelset.shape = (228, )
        Statistics are computed here per utterance; the per-speaker statistics are
        accumulated later, in the MAP step. Apart from sum_log_probabilities(), this
        function is fairly easy to follow, and with the paper at hand the rest
        should all make sense.
        '''

        for idx in range(self.segset.shape[0]):

            print('Compute statistics for {}'.format(self.segset[idx]))
            show = self.segset[idx]
            cep, vad = feature_server.load(show)
            # Verify that frame dimension is equal to gmm dimension
            lp = self.ubm.compute_log_posterior_probabilities(cep)
            pp, foo = self.sum_log_probabilities(lp)
            # Compute 0th-order statistics
            self.stat0[idx, :] = pp.sum(0)  # stat0_i = n_i = \sum_{t=1}^{T} Pr(\lambda_i | x_t)
            # Compute 1st-order statistics; 1024 is num_components, 63 is the feature dimension
            self.stat1[idx, :] = np.reshape(np.transpose(np.dot(cep.transpose(), pp)), 1024 * 63)
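To make the two statistics concrete, here is a small standalone sketch (my own toy example, not sidekit code) that computes stat0 and stat1 for one utterance from a posterior matrix pp, exactly as the two lines above do:

import numpy as np

nframes, num_components, dim = 4, 2, 3          # toy sizes instead of 1024 and 63
pp = np.array([[0.9, 0.1],
               [0.8, 0.2],
               [0.3, 0.7],
               [0.5, 0.5]])                     # Pr(\lambda_i | x_t), each row sums to 1
cep = np.arange(nframes * dim, dtype=float).reshape(nframes, dim)  # toy feature frames

stat0 = pp.sum(0)                                                  # n_i, soft frame counts
stat1 = np.reshape(np.transpose(np.dot(cep.transpose(), pp)), num_components * dim)

print(stat0)        # [2.5 1.5]
print(stat1.shape)  # (6,) = num_components * dim, matching stat1[idx, :] above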
Here is the implementation that computes lp:
    def compute_log_posterior_probabilities(self, cep, mu=None):
        """ Compute log posterior probabilities for a set of feature frames.

        :param cep: a set of feature frames in a ndarray, one feature per row
        :param mu: a mean super-vector to replace the ubm's one. If it is an empty 
              vector, use the UBM

        :return: A ndarray of log-posterior probabilities corresponding to the 
              input feature set.
        """
        if cep.ndim == 1:
            cep = cep[numpy.newaxis, :]
        A = self.A
        if mu is None:
            mu = self.mu
        else:
            # for MAP, Compute the data independent term
            A = (numpy.square(mu.reshape(self.mu.shape)) * self.invcov).sum(1) \
               - 2.0 * (numpy.log(self.w) + numpy.log(self.cst))

        # Compute the data dependent term
        B = numpy.dot(numpy.square(cep), self.invcov.T) \
            - 2.0 * numpy.dot(cep, numpy.transpose(mu.reshape(self.mu.shape) * self.invcov))

        # Compute the exponential term
        lp = -0.5 * (B + A)
        return lp
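As a sanity check on what lp actually is (my own expansion, assuming self.cst is the per-component Gaussian normalization constant, which is how sidekit's Mixture precomputes it), substituting A and B into lp = -0.5 * (B + A) gives

lp_{ti} = \log w_i + \log cst_i - \frac{1}{2} \sum_{d} \frac{(x_{td} - \mu_{id})^2}{\sigma_{id}^2} = \log\big( w_i \, P(x_t \mid \lambda_i) \big)

so each entry of lp is the log of w_i * P(x_t | \lambda_i), with one row per frame and one column per component.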
Q1: The lp matrix here should be the (nframes, num_components) matrix of w_i * P(x_t | \lambda_i), so I didn't understand why logarithms are taken; at this point we are only computing statistics, not yet computing scores. (Emailed the author, awaiting a reply.)
I went back over the questions I had sent the author; after asking Prof. Ke about them yesterday, I think they are all clear now:
A: The values of w, \mu, \sigma are all very small, so computing P(x_t | \lambda_i) directly can underflow.
Also, in the denominator of the Bayes formula, \sum_{i=1}^{M} w_i * P(x_t | \lambda_i),
some terms can be extremely small while others are relatively large (0.1 + 10^{-10} = 0.1), which throws away a lot of information; losing data in the middle of the computation is not acceptable,
hence the computation is done in the log domain.
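A minimal numpy illustration of this (my own sketch, not sidekit code): summing the raw likelihoods underflows to zero, while the log-sum-exp trick that sum_log_probabilities() uses below recovers an accurate value:

import numpy as np

# toy per-component values of log(w_i * P(x_t | lambda_i)), all very negative
lp = np.array([-800.0, -801.0, -805.0])

naive = np.sum(np.exp(lp))   # exp(-800) underflows to 0.0 in float64
print(naive)                 # 0.0, so log(naive) would be -inf

m = lp.max()
log_sum = m + np.log(np.sum(np.exp(lp - m)))   # log-sum-exp trick
print(log_sum)               # approx. -799.68, finite and accurate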

Once lp is computed, the Bayes formula is needed to convert it into Pr(\lambda_i | x_t); the pp matrix is supposed to be exactly that:


    def sum_log_probabilities(self, lp):
        '''
        Args:
            lp: (nframes, num_components), the log of w_i * P(x_t | \lambda_i); not yet n_i
        Returns:
            pp: the posteriors Pr(\lambda_i | x_t); log_lk: the per-frame log-likelihood
        '''
        pp_max = np.max(lp, axis=1)  # maximum of each row (per frame), for numerical stability
        log_lk = pp_max + np.log(np.sum(np.exp((lp.transpose() - pp_max).transpose()), axis=1))
        ind = ~np.isfinite(pp_max)
        # print("ind: ", ind)  # all False here
        if sum(ind) != 0:
            log_lk[ind] = pp_max[ind]
        pp = np.exp((lp.transpose() - log_lk).transpose())
        return pp, log_lk
Q2: But I could not see the justification for computing pp this way. (Emailed the author, awaiting a reply.)
A: Once the previous question is understood, this one follows; it is worth deriving carefully on paper.
Everything has been moved into the log domain, so it looks unintuitive, but the result really is Pr(\lambda_i | x_t).
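A quick sanity check (again my own toy sketch) that the log-domain computation really yields the Bayes posterior: on values safe enough to exponentiate directly, pp matches the straightforward normalization and each row sums to 1:

import numpy as np

lp = np.log(np.array([[0.2, 0.5, 0.3],
                      [0.1, 0.1, 0.8]]))   # toy w_i * P(x_t | lambda_i), safe to exp

pp_max = np.max(lp, axis=1)
log_lk = pp_max + np.log(np.sum(np.exp((lp.T - pp_max).T), axis=1))
pp = np.exp((lp.T - log_lk).T)             # same computation as sum_log_probabilities()

direct = np.exp(lp) / np.exp(lp).sum(axis=1, keepdims=True)   # Bayes formula directly
print(np.allclose(pp, direct))   # True
print(pp.sum(axis=1))            # [1. 1.], each row is Pr(\lambda_i | x_t)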

Some of the numpy usage in the second function is quite clever: everything is turned into matrix operations, which speeds things up a lot. Once the small helper above is understood, there is nothing hard to grasp here; see the inline comments for the details:

    def adapt_mean_map_multisession(self, regulation_factor):

        gsv_statserver = enroll_stat()  # empty container for the adapted models (sidekit itself constructs a new StatServer here)
        gsv_statserver.modelset = np.unique(self.modelset)
        gsv_statserver.segset = np.unique(self.modelset)
        gsv_statserver.stat0 = np.ones((np.unique(self.modelset).shape[0], 1))
        num_components = 1024
        dim_feature = 63
        index_map = np.repeat(np.arange(num_components), dim_feature)

        # Sum the statistics per model
        # modelStat = self.sum_stat_per_model()[0]
        modelStat = self.sum_stat_per_model()

        # Adapt mean vectors
        alpha = modelStat.stat0 / (modelStat.stat0 + regulation_factor)  # alpha_i = n_i / (n_i + r), with n_i = stat0

        '''
        a = np.array([['a','b','c'],['d','e','f']])
        >>> a[:, [1,1,2,2]]
        array([['b', 'b', 'c', 'c'],
                ['e', 'e', 'f', 'f']],
        dtype='|S1')

        All the arithmetic here is replaced by matrix operations, which is hard
        to grasp intuitively at first.
        modelStat.stat0 is (59, 1024); in the formulas, n_i = \sum_{t=1}^{T} Pr(i | x_t).
        modelStat.stat1 is (59, 1024*63); in the formulas, E_i(x) = \sum_{t=1}^{T} Pr(i | x_t) x_t / n_i.
        Every model's mean supervector in stat1 has to be divided by its n_i.
        modelStat.stat0[:, index_map] replicates each of the original 1024 n_i
        coefficients 63 times, expanding them to 1024 * 63 columns, so that all
        63 mean dimensions of the first component are divided by the first
        coefficient, and so on. This implementation is really efficient.

        >>> c
        array(['d', 'e', 'f'],
        dtype='|S1')
        >>> np.tile(c, (3,1))
        array([['d', 'e', 'f'],
        ['d', 'e', 'f'],
        ['d', 'e', 'f']],
        dtype='|S1')
        >>> np.tile(c, 3)
        array(['d', 'e', 'f', 'd', 'e', 'f', 'd', 'e', 'f'],
        dtype='|S1')

        Get a feel for what np.tile does when the second argument is a tuple;
        then the line that updates M makes sense.

        NOTE:
        But this way the whole update happens only once, and there is no
        sequential dependence between utterances.
        '''
        M = modelStat.stat1 / modelStat.stat0[:, index_map]  # (59, 1024*63)
        M[np.isnan(M)] = 0  # Replace NaN due to divide by zeros
        M = alpha[:, index_map] * M \
            + (1 - alpha[:, index_map]) * np.tile(self.ubm.mu.flatten(), (M.shape[0], 1))
        gsv_statserver.stat1 = M
        return gsv_statserver
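To make the index_map trick concrete, here is a tiny standalone sketch (my own example, with 3 components and 2 feature dimensions instead of 1024 and 63) showing how stat0[:, index_map] stretches each component's n_i across all of that component's feature dimensions:

import numpy as np

num_components, dim_feature = 3, 2
index_map = np.repeat(np.arange(num_components), dim_feature)
print(index_map)                             # [0 0 1 1 2 2]

stat0 = np.array([[10., 20., 30.]])          # n_i per component, single model
stat1 = np.array([[1., 2., 3., 4., 5., 6.]]) # flattened first-order statistics

print(stat0[:, index_map])                   # [[10. 10. 20. 20. 30. 30.]]
M = stat1 / stat0[:, index_map]              # each component's dims divided by its n_i
print(M)                                     # [[0.1 0.2 0.15 0.2 0.1667 0.2]] (approx.)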

    def sum_stat_per_model(self):
        """Sum the zero- and first-order statistics per model and store them
        in a new StatServer.

        :return: a StatServer with the statistics summed per model
        """
        sts_per_model = enroll_stat()  # empty container for the per-model statistics (sidekit itself constructs a new StatServer here)
        sts_per_model.modelset = np.unique(self.modelset)
        sts_per_model.segset = sts_per_model.modelset
        sts_per_model.stat0 = np.zeros((sts_per_model.modelset.shape[0], self.stat0.shape[1]))  # (59, 1024)
        sts_per_model.stat1 = np.zeros((sts_per_model.modelset.shape[0], self.stat1.shape[1]))  # (59, 1024*63)

        #session_per_model = np.zeros(np.unique(self.modelset).shape[0])

        '''
            print("idx is ", idx, "model is ", model)
            print("stat0 --> ", self.stat0.shape)  # (228, 1024): enrollment has 228 utterances in total, and statistics were computed per utterance
            print("how to sum", self.stat0[self.modelset == model, :].shape)  # (4, 1024): select the (here four) utterances belonging to this speaker (model), then sum them
            print("sum to sts_per_model", sts_per_model.stat0)  # the sum is assigned to row idx of sts_per_model.stat0
            With that, the logic is much clearer.

        '''

        for idx, model in enumerate(sts_per_model.modelset):

            sts_per_model.stat0[idx, :] = self.stat0[self.modelset == model, :].sum(axis=0)
            sts_per_model.stat1[idx, :] = self.stat1[self.modelset == model, :].sum(axis=0)
            #session_per_model[idx] += self.stat1[self.modelset == model, :].shape[0]
        #return sts_per_model, session_per_model
        return sts_per_model
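And a tiny sketch (toy data, my own example) of the boolean-mask summation in the loop above: all rows belonging to the same model are selected and collapsed into one row:

import numpy as np

modelset = np.array(['spk1', 'spk2', 'spk1'])   # one entry per utterance
stat0 = np.array([[1., 2.],
                  [3., 4.],
                  [5., 6.]])                    # per-utterance statistics

for model in np.unique(modelset):
    print(model, stat0[modelset == model, :].sum(axis=0))
# spk1 [6. 8.]
# spk2 [3. 4.]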

The speaker model here is updated only once, yet the results are quite good. As far as I can tell, this matches the classic recipe: in Reynolds et al. (2000) the MAP mean update is a single closed-form step computed from the sufficient statistics, although it can in principle be iterated. I still don't know how other frameworks implement this, or whether there is a more standard GMM-UBM implementation; if you know of one, please leave a comment or send me a message. And if there are mistakes in this post, please point them out as well. Many thanks!

