Personal Study Notes: The EM Algorithm and GMMs

These are my personal notes from studying the EM algorithm framework. The main references are Li Hang's Statistical Learning Methods and PRML; some passages were excerpted from various online materials whose sources are too scattered to list one by one — please contact me for removal in case of infringement.

1. The EM Algorithm

1.1 Example: A Coin-Tossing Problem

There are two biased coins. One coin is chosen at random and tossed 10 times, then a coin is chosen again and tossed 10 times, for a total of 5 such selections. Writing H for heads and T for tails, the results are:

Trial 1: 5H 5T
Trial 2: 9H 1T
Trial 3: 8H 2T
Trial 4: 4H 6T
Trial 5: 7H 3T

If we knew which coin was chosen in each trial, the parameters could be estimated directly by MLE. For example, if the sequence of choices were ABBAB, then:

$$\overline{\theta}_{A} = \frac{5H+4H}{20} = 0.45, \qquad \overline{\theta}_{B} = \frac{9H+8H+7H}{30} = 0.8$$

But what if we do not know which coin was chosen in each trial? How do we proceed?
The coin identity of each trial is then a latent variable, and we can apply the EM framework:

Step 1. Initialize the parameters, i.e. $\theta^{(0)} = (\theta_{A}^{0}, \theta_{B}^{0})$.
Step 2. Iterate until convergence, or until the stopping condition is reached:

E step: compute the Q distribution, i.e. the conditional probability of each trial's coin class under the current parameters and the data:
$$Q_{i}(z^{(i)}) = P(z^{(i)} \mid x^{(i)}, \theta^{(t)})$$
M step: maximize the expected complete-data log-likelihood over the free parameter $\theta$:
$$\max_{\theta}\; \sum_{i=1}^{m}\sum_{z} Q_{i}(z^{(i)}) \log P(x^{(i)}, z^{(i)} \mid \theta)$$

The maximization in the M step is where the actual work lies.
For the current round, the $Q_{i}(z^{(i)})$ term is essentially a fixed constant with respect to the optimization, so the problem reduces to the joint probability $P(x^{(i)}, z^{(i)} \mid \theta)$: we look for a $\theta^{(t+1)}$ that makes the weighted log of this joint probability larger than before, which in turn increases the expectation.
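To make the E step concrete, here is a worked single-trial computation (my own numbers, not from the notes): take current parameters $\theta_A^{(t)} = 0.6$, $\theta_B^{(t)} = 0.5$ and a uniform prior over the two coins. For trial 2 (9H 1T), the posterior probability that coin A was used is
$$Q_2(A) = \frac{0.6^{9} \cdot 0.4}{0.6^{9} \cdot 0.4 + 0.5^{10}} \approx \frac{0.00403}{0.00403 + 0.00098} \approx 0.80,$$
so the M step would credit roughly 80% of this trial's 9 heads and 1 tail to coin A.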
The implementation follows.

1.2 EM Implementation for the Coin-Tossing Problem

1.2.1 Implementing the Algorithm Framework

import numpy as np

class calac_EM():
    """
    EM solver for the two-coin problem.
    :sample: the observations, a list of tuples, one (H, T) tuple per trial;
    :h_prob: the empirical frequency of heads within each trial
             (not used by the algorithm itself, kept for inspection).
    """
    def __init__(self, sample):
        self.sample = np.array(sample)
        self.h_prob = [e[0] / sum(e) for e in sample]

    def step_E(self, prob_a_old, prob_b_old):
        # Likelihood of each trial under coin A and under coin B with the current parameters
        prob_a_x_theta_old = np.prod(np.power([prob_a_old, 1 - prob_a_old], self.sample),
                                     axis=1)
        prob_b_x_theta_old = np.prod(np.power([prob_b_old, 1 - prob_b_old], self.sample),
                                     axis=1)
        # Posterior probability that each trial was generated by coin A
        Q_a = prob_a_x_theta_old / (prob_a_x_theta_old + prob_b_x_theta_old)
        new_shape = (Q_a.shape[0], 1)
        Q_a_repeat = np.repeat(np.reshape(Q_a, newshape=new_shape), repeats=2, axis=1)
        Q_b_repeat = 1 - Q_a_repeat
        # Expected (H, T) counts attributed to each coin
        expectation_a = np.sum(Q_a_repeat * self.sample, axis=0)
        expectation_b = np.sum(Q_b_repeat * self.sample, axis=0)
        return expectation_a, expectation_b

    def step_M(self, expectation_a, expectation_b):
        # MLE on the expected counts: fraction of heads attributed to each coin
        prob_a_theta_new = expectation_a[0] / np.sum(expectation_a)
        prob_b_theta_new = expectation_b[0] / np.sum(expectation_b)
        return prob_a_theta_new, prob_b_theta_new

    def iter_train(self, prob_A, prob_B, stop_e, max_iter=11):
        """
        prob_A and prob_B are the initial parameters, i.e. the heads
        probabilities of coins A and B;
        stop_e is the tolerance: iteration stops once the parameter change
        between round t and round t+1 falls below it;
        max_iter is the maximum number of iterations.
        """
        prob_a_theta_old = prob_A
        prob_b_theta_old = prob_B
        iter_ = 0
        print("Initial parameters (theta_a, theta_b):", prob_a_theta_old, prob_b_theta_old)

        while True:
            iter_ += 1
            print(f"Iteration {iter_}:")
            expectation_a, expectation_b = self.step_E(prob_a_old=prob_a_theta_old,
                                                       prob_b_old=prob_b_theta_old)
            prob_a_theta_new, prob_b_theta_new = self.step_M(expectation_a, expectation_b)
            print("===Updated parameters:", (prob_a_theta_new, prob_b_theta_new))
            iter_stop = np.abs(prob_a_theta_old - prob_a_theta_new)\
                        + np.abs(prob_b_theta_new - prob_b_theta_old)
            if iter_stop <= stop_e:
                print("Stopping tolerance reached, training finished!")
                break
            elif iter_ >= max_iter:
                print("Maximum number of iterations reached, training finished")
                break
            else:
                prob_a_theta_old = prob_a_theta_new
                prob_b_theta_old = prob_b_theta_new
        return None

1.2.2 Training

sample = [(5,5),(9,1),(8,2),(4,6),(7,3)]
coin_EM = calac_EM(sample)
coin_EM.iter_train(0.7,0.8,0.00001,max_iter = 20)
Initial parameters (theta_a, theta_b): 0.7 0.8
Iteration 1:
===Updated parameters: (0.5946099775593936, 0.7566461012979029)
Iteration 2:
===Updated parameters: (0.5591603804756445, 0.7686246125972386)
Iteration 3:
===Updated parameters: (0.539026245983948, 0.7823340890613432)
Iteration 4:
===Updated parameters: (0.5282554145171112, 0.7904987455040994)
Iteration 5:
===Updated parameters: (0.5232313429883886, 0.7942798192885937)
Iteration 6:
===Updated parameters: (0.5210779710123444, 0.7958282879518048)
Iteration 7:
===Updated parameters: (0.5201895676222431, 0.7964279328153537)
Iteration 8:
===Updated parameters: (0.5198283779239912, 0.7966545148106672)
Iteration 9:
===Updated parameters: (0.5196822508906818, 0.7967391807109708)
Iteration 10:
===Updated parameters: (0.5196232024601661, 0.7967706356141095)
Iteration 11:
===Updated parameters: (0.5195993378840494, 0.7967822783202753)
Iteration 12:
===Updated parameters: (0.519589686991515, 0.796786574548503)
Iteration 13:
===Updated parameters: (0.5195857811947752, 0.7967881551002687)
Stopping tolerance reached, training finished!
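
One property of this scheme is worth noting (an illustrative check of my own, not from the original notes): if both coins are initialized with the same value, the E step assigns posterior 0.5 to every trial, so a single update sends both parameters to the overall heads frequency 33/50 = 0.66 and they stay there — EM cannot break the symmetry of the starting point.

# Hypothetical run: a symmetric initialization is a fixed point of EM here
coin_EM.iter_train(0.6, 0.6, 0.00001, max_iter=20)
# Expected behaviour: both parameters jump to 33/50 = 0.66 after the first
# update and reproduce themselves in the second, so the loop stops with
# theta_a == theta_b == 0.66 — no separation between the two coins.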

1.3 Modeling the Three-Coin Problem

There are three coins A, B, and C whose probabilities of landing heads are $\pi, p, q$ respectively. The experiment runs as follows: first toss coin A; if it lands heads, toss coin B once and record the outcome; if it lands tails, toss coin C once and record the outcome. The experiment is repeated 10 times, writing 1 for heads and 0 for tails, with the following results:

1,1,0,1,0,0,1,0,1,1

If only the outcomes can be observed, not the tossing process, how do we estimate the model parameters, i.e. the heads probability of each of the three coins?
The model is set up as follows:

$x$ denotes the observation, taking values in $\{0, 1\}$ for tails and heads respectively;
$z$ denotes the latent variable, taking values in $\{0, 1\}$ according to whether the observation was produced by coin C or coin B;
$\theta$ holds the model parameters, so $\theta = (\pi, p, q)$.

The marginal distribution of a single observation $x$ is then:

$$p(x \mid \theta) = \pi p^{x}(1-p)^{1-x} + (1-\pi) q^{x}(1-q)^{1-x}$$
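As a quick sanity check (spelling out a step the notes skip), the two cases of this marginal are
$$p(x{=}1 \mid \theta) = \pi p + (1-\pi)q, \qquad p(x{=}0 \mid \theta) = \pi(1-p) + (1-\pi)(1-q),$$
which sum to 1, as a probability distribution must.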

Applying the EM framework:

Key point 1: compute the conditional distribution of the latent variable given the current parameters and the sample:
$$p(z{=}0 \mid x^{(i)}, \theta) = \frac{(1-\pi) q^{x^{(i)}}(1-q)^{1-x^{(i)}}}{p(x^{(i)} \mid \pi, p, q)}, \qquad p(z{=}1 \mid x^{(i)}, \theta) = \frac{\pi p^{x^{(i)}}(1-p)^{1-x^{(i)}}}{p(x^{(i)} \mid \pi, p, q)}$$
Both cases in one expression (a product over the two cases, since the exponents $z$ and $1-z$ switch one factor off at a time):
$$p(z \mid x^{(i)}, \theta) = \frac{\big[\pi p^{x^{(i)}}(1-p)^{1-x^{(i)}}\big]^{z}\,\big[(1-\pi) q^{x^{(i)}}(1-q)^{1-x^{(i)}}\big]^{1-z}}{p(x^{(i)} \mid \pi, p, q)}$$

Key point 2: the optimization objective:
$$Q(\theta, \theta^{(t)}) = \sum_{i=1}^{m}\sum_{z^{i}} p(z^{i} \mid x^{i}, \theta^{t}) \log p(x^{i}, z^{i} \mid \theta)$$

The joint distribution of the observed variable $x$ and the latent variable $z$ is (again a product, not a sum; summing it over $z \in \{0, 1\}$ recovers the marginal $p(x \mid \theta)$ above):

$$p(x, z \mid \theta) = \big[\pi p^{x}(1-p)^{1-x}\big]^{z}\,\big[(1-\pi) q^{x}(1-q)^{1-x}\big]^{1-z}$$
The EM algorithm for the three-coin model can therefore be stated as:
Step 1. Initialize $\theta^{(0)} = (\pi^{0}, p^{0}, q^{0})$;
Step 2. For $t = 0, \dots, \text{MaxIter}-1$, run iteration $t$:

E step: for every sample $x^{i}$, $i = 1, \dots, m$, compute the conditional distribution of the latent variable:
$$p(z^{i} \mid x^{i}, \pi^{t}, p^{t}, q^{t}) = \frac{\big[\pi^{t}(p^{t})^{x^{i}}(1-p^{t})^{1-x^{i}}\big]^{z^{i}}\,\big[(1-\pi^{t})(q^{t})^{x^{i}}(1-q^{t})^{1-x^{i}}\big]^{1-z^{i}}}{p(x^{i} \mid \pi^{t}, p^{t}, q^{t})}$$
M step: maximize the objective, i.e. solve:
$$\theta^{(t+1)} = (\pi^{(t+1)}, p^{(t+1)}, q^{(t+1)}) = \arg\max_{\pi, p, q} Q(\theta, \theta^{t})$$
Taking derivatives term by term, note that the conditional distribution of the latent variable is a constant under the fixed parameters $\theta^{t}$, so the differentiation acts only on the joint distribution of the latent and observed variables. Write $\mu^{i}$ for the probability that sample $x^{i}$ was produced by coin B, i.e. that $z^{i} = 1$:
$$\mu^{i} = p(z{=}1 \mid x^{(i)}, \theta^{t}) = \frac{\pi^{t}(p^{t})^{x^{i}}(1-p^{t})^{1-x^{i}}}{p(x^{i} \mid \pi^{t}, p^{t}, q^{t})}$$
The optimization objective then expands to:
$$Q(\theta, \theta^{t}) = \sum_{i=1}^{m}\Big[\mu^{i}\log\big(\pi p^{x^{i}}(1-p)^{1-x^{i}}\big) + (1-\mu^{i})\log\big((1-\pi) q^{x^{i}}(1-q)^{1-x^{i}}\big)\Big]$$
Differentiating the objective, with $\mu^{i}$ treated as a known constant computed under $\theta^{t}$:

  • Differentiating with respect to $\pi$:
    $$\frac{\partial Q}{\partial \pi} = \sum_{i=1}^{m}\left(\frac{\mu^{i}}{\pi} - \frac{1-\mu^{i}}{1-\pi}\right) = 0$$
    Solving gives the update:
    $$\pi^{t+1} = \frac{1}{m}\sum_{i=1}^{m}\mu^{i}$$
  • Differentiating with respect to $p$ (and analogously $q$):
    $$\frac{\partial Q}{\partial p} = \sum_{i=1}^{m}\mu^{i}\left(\frac{x^{i}}{p} - \frac{1-x^{i}}{1-p}\right) = 0$$
    which rearranges to:
    $$p^{t+1} = \frac{\sum_{i=1}^{m}\mu^{i}x^{i}}{\sum_{i=1}^{m}\mu^{i}}$$
    Likewise for $q$:
    $$q^{t+1} = \frac{\sum_{i=1}^{m}(1-\mu^{i})x^{i}}{\sum_{i=1}^{m}(1-\mu^{i})}$$

1.3.1 Solving with the EM Algorithm

import numpy as np

class Three_Coin_EM():
    """
    Models the three-coin problem and solves it iteratively with EM.
    """
    def __init__(self, sample_seq):
        """
        sample_seq is the observed sequence: 1 for heads, 0 for tails.
        """
        self.sample = np.array(sample_seq)
        self.theta_solution = None
        return None

    def _Expectation_Step(self, theta_t, sample_array):
        """
        E step: compute the conditional distribution of the latent variable
        under the parameters from the previous update.
        theta_t is the round-t parameter tuple (pi, p, q);
        the output is a length-m array whose i-th entry is
        mu_i = p(z=1 | x_i, theta_t), the posterior that sample i came from coin B.
        """
        pi_t, p_t, q_t = theta_t
        # Joint probabilities p(x, z=1) and p(x, z=0) for every sample
        prob_z_x_B = pi_t*np.power(p_t, sample_array)*np.power(1-p_t, 1-sample_array)
        prob_z_x_C = (1-pi_t)*np.power(q_t, sample_array)*np.power(1-q_t, 1-sample_array)
        mu_B_t = prob_z_x_B/(prob_z_x_B + prob_z_x_C)
        return mu_B_t

    def _Maximization_Step(self, mu_B_t, sample_array):
        """
        M step: apply the closed-form updates obtained by differentiating
        the Q function above.
        """
        pi_tp1 = np.mean(mu_B_t)
        p_tp1 = np.sum(mu_B_t*sample_array)/np.sum(mu_B_t)
        q_tp1 = np.sum((1-mu_B_t)*sample_array)/np.sum(1-mu_B_t)
        return [pi_tp1, p_tp1, q_tp1]

    def train(self, theta_0, stop_error=0.0001, max_iter=20):
        """
        theta_0 is the initial parameter tuple (pi, p, q).
        """
        sample_array = np.array(self.sample)
        theta_old = theta_0
        iter_ = 0
        while True:
            iter_ += 1
            print(f"Starting iteration {iter_}:")
            print("              initial parameters:", theta_old)
            mu_B_t = self._Expectation_Step(theta_t=theta_old,
                                            sample_array=sample_array)
            mu_B_t = np.array(mu_B_t)
            theta_new = self._Maximization_Step(mu_B_t=mu_B_t,
                                                sample_array=sample_array)
            print("              updated parameters:", theta_new)
            step_error = np.sum(np.abs(np.array(theta_old)-np.array(theta_new)))
            if step_error <= stop_error:
                print("Stopping tolerance reached, iteration finished!")
                self.theta_solution = theta_new
                break
            elif iter_ >= max_iter:
                print("Maximum number of iterations reached, iteration finished!")
                self.theta_solution = theta_new
                break
            else:
                theta_old = theta_new
        return None

1.3.2 Training the Three-Coin Model

# Run the training
theta_0 = (0.5,0.5,0.5)
sample = [1,1,0,1,0,0,1,0,1,1]
tc = Three_Coin_EM(sample)
tc.train(theta_0=(0.46,0.55,0.67),stop_error=0.0000001)
Starting iteration 1:
              initial parameters: (0.46, 0.55, 0.67)
              updated parameters: [0.461862835113919, 0.5345950037850112, 0.6561346417857326]
Starting iteration 2:
              initial parameters: [0.461862835113919, 0.5345950037850112, 0.6561346417857326]
              updated parameters: [0.46186283511391907, 0.5345950037850111, 0.6561346417857326]
Stopping tolerance reached, iteration finished!
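
Initialization matters here just as much. One more run worth trying (my own check, consistent with the closed-form updates above) uses the symmetric starting point theta_0 = (0.5, 0.5, 0.5) that was defined but not used above:

# Hypothetical run with the symmetric initialization defined earlier
tc.train(theta_0=theta_0, stop_error=0.0000001)
# Expected behaviour: with pi = p = q = 0.5 every posterior mu_i equals 0.5,
# so the first M step gives pi = 0.5 and p = q = mean(sample) = 0.6; the
# second iteration reproduces (0.5, 0.6, 0.6) exactly and the loop stops.
# EM lands on a different stationary point than the run above.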

2. Gaussian Mixture Models and Their EM Solution

2.1 The Gaussian Mixture Model

A mixture probability model is defined as:
$$p(x \mid \theta) = \sum_{k=1}^{K}\alpha_{k}\,\Phi(x \mid \theta_{k})$$
where:

$\alpha_{k}$ is the weight of each component model, satisfying $\alpha_{k} \geq 0$ and
$$\sum_{k=1}^{K}\alpha_{k} = 1$$
$\Phi(x \mid \theta_{k})$ are the component models of the mixture, which may be different probability distributions.

If every component is a Gaussian, the mixture is a Gaussian mixture model:
$$p(x \mid \theta) = \sum_{k=1}^{K}\alpha_{k}\,\Phi(x \mid \mu_{k}, \Sigma_{k})$$
$$\Phi(x \mid \mu_{k}, \Sigma_{k}) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_{k}|^{\frac{1}{2}}}\exp\left(-\frac{(x-\mu_{k})^{T}\Sigma_{k}^{-1}(x-\mu_{k})}{2}\right)$$
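To make the definition concrete, here is a small sketch (my own illustration; the two components' parameters are made up) that evaluates a two-component one-dimensional mixture density with scipy:

import numpy as np
import scipy.stats as stats

# Hypothetical two-component 1D Gaussian mixture: weights, means, std devs
alphas = np.array([0.3, 0.7])
mus = np.array([-2.0, 3.0])
sigmas = np.array([1.0, 2.0])

def mixture_pdf(x):
    # p(x) = sum_k alpha_k * N(x | mu_k, sigma_k^2)
    return np.sum(alphas * stats.norm(loc=mus, scale=sigmas).pdf(x))

print(mixture_pdf(0.0))  # density of the mixture at x = 0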

2.2 Solving the GMM with EM

Algorithm framework:
Following the EM framework, the key elements are as follows:
(1) Parameters: there are K Gaussian components, each with two distribution parameters (mean and variance, in the one-dimensional case) and one weight; under the sum-to-one weight constraint, the K components have $(3K - 1)$ free parameters in total;

(2) Expectation: compute the conditional distribution $p(z \mid x, \theta)$ of the latent variable given the sample (writing $\pi_{k}$ for the component weights $\alpha_{k}$ above):
$$p(z_{k}{=}1 \mid x, \theta) = \frac{\pi_{k}\,\Phi(x \mid \mu_{k}, \Sigma_{k})}{\sum_{j=1}^{K}\pi_{j}\,\Phi(x \mid \mu_{j}, \Sigma_{j})}$$
The key steps below spell out how this is obtained.

(1) Key step: introduce the latent variable $z^{i} = (z_{1}^{i}, z_{2}^{i}, \dots, z_{K}^{i})$; if sample $x^{i}$ belongs to the $k$-th Gaussian component, then $z_{k}^{i} = 1$ and the remaining entries of $z^{i}$ are 0;
(2) Key step: express the Gaussian mixture in terms of the latent variable:
$$p(x \mid \theta) = \sum_{z} p(z)\,p(x \mid z, \theta) = \sum_{k=1}^{K}\pi_{k}\,\Phi(x \mid \mu_{k}, \Sigma_{k})$$
(3) Key step: the marginal distribution $p(z)$ of the latent variable. For deciding which Gaussian a sample belongs to, $p(z)$ plays the role of a prior over class membership, i.e. the marginal of the joint distribution with respect to the latent variable:
$$p(z_{k}{=}1) = \pi_{k} \implies p(z) = \prod_{k=1}^{K}\pi_{k}^{z_{k}}$$
(4) Key step: the conditional distribution $p(x \mid z, \theta)$ of the sample given the latent variable (the weights belong to $p(z)$, not here):
$$p(x \mid z, \theta) = \prod_{k=1}^{K}\Phi(x \mid \mu_{k}, \Sigma_{k})^{z_{k}}$$
(5) Key step: the joint distribution $p(x, z \mid \theta)$ of the latent variable and the sample:
$$p(x, z \mid \theta) = p(z)\,p(x \mid z, \theta) = \prod_{k=1}^{K}\big[\pi_{k}\,\Phi(x \mid \mu_{k}, \Sigma_{k})\big]^{z_{k}}$$
(6) For the sample sequence $i = 1, \dots, m$, the conditional probability above defines the responsibility of mixture component $k$ for sample $x^{i}$:
$$\gamma_{k}^{i} = p(z_{k}^{i}{=}1 \mid x^{i}, \theta)$$

(3) Maximization: differentiate the surrogate of the maximum likelihood to obtain the updated $\theta$, i.e. update $\pi, \mu, \Sigma$;

Key step: build the log-likelihood containing the latent variable, yielding the optimization objective (see PRML for the derivation of the Q function):
$$\theta^{t+1} = \arg\max_{\theta} Q(\theta, \theta^{t}), \qquad Q(\theta, \theta^{t}) = \sum_{i=1}^{m}\sum_{k=1}^{K}\gamma_{k}^{i}\big[\log\pi_{k} + \log\Phi(x^{i} \mid \mu_{k}, \Sigma_{k})\big]$$
Key step: take the partial derivatives with respect to each parameter (with a Lagrange multiplier for the constraint $\sum_{k}\pi_{k} = 1$), which gives:
$$\mu_{k}^{t+1} = \frac{\sum_{i=1}^{m}\gamma_{k}^{i}\,x^{i}}{\sum_{i=1}^{m}\gamma_{k}^{i}}$$
$$\Sigma_{k}^{t+1} = \frac{\sum_{i=1}^{m}\gamma_{k}^{i}\,(x^{i}-\mu_{k}^{t+1})(x^{i}-\mu_{k}^{t+1})^{T}}{\sum_{i=1}^{m}\gamma_{k}^{i}}$$
$$\pi_{k}^{t+1} = \frac{\sum_{i=1}^{m}\gamma_{k}^{i}}{m}$$
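Before the full class in the next subsection, here is a minimal sketch of a single E/M update for a one-dimensional GMM, condensing the formulas above (the function and variable names are mine):

import numpy as np
import scipy.stats as stats

def em_step_1d(x, mu, var, pi):
    """One EM update for a 1D GMM. x: (m,) samples; mu, var, pi: (K,) arrays."""
    # E step: responsibilities gamma[i, k] proportional to pi_k * N(x_i | mu_k, var_k)
    dens = stats.norm(loc=mu, scale=np.sqrt(var)).pdf(x[:, None])  # shape (m, K)
    gamma = pi * dens
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M step: the closed-form updates derived above
    Nk = gamma.sum(axis=0)                                  # effective counts per component
    mu_new = (gamma * x[:, None]).sum(axis=0) / Nk
    var_new = (gamma * (x[:, None] - mu_new) ** 2).sum(axis=0) / Nk
    pi_new = Nk / len(x)
    return mu_new, var_new, pi_new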

2.2.1 An EM Framework for the Gaussian Mixture Model

import numpy as np
import pandas as pd
import scipy.stats as stats
from collections import namedtuple

class GMM_EM():
    """
    Solves a Gaussian mixture model and assigns each sample to a component;
    besides the labels, it exposes each sample's responsibility under every
    Gaussian component, which can be read as a membership probability.
    """
    def __init__(self, sample):
        np.set_printoptions(precision=4)
        self.sample_num = len(sample)
        self.sample_seq = sample
        self.k = None
        self.__gaussian_par = None
        self.sample_label = np.array([])
        self.__score = None
        return None

    @property
    def model_par(self):
        return self.__gaussian_par

    @property
    def model_score(self):
        return self.__score

    def __repr__(self):
        model = namedtuple("GMM", ["mu", "sigma", "pi"])
        mu, sigma, pi = self.__gaussian_par
        return str(model(mu, sigma, pi))

    def _gmm_em_Expectation(self, sample, theta_old, sub_model, training=True):
        """
        E step.
        theta_old holds the current round's parameters, a 3-row, K-column
        array whose rows are mean, variance, and weight (pi);
        sample is the sample sequence (one-dimensional data only);
        sub_model is the number of mixture components K, chosen by the user.
        Output: an m*K array whose (i, k) entry is the responsibility of
        component k for sample i.
        """
        if theta_old.shape[0] != 3:
            print("Malformed parameter matrix: it must have 3 rows!")
            return None
        # Unpack the parameters
        mu_old, sigma_old, pi_old = theta_old
        K = sub_model
        # Validate the input
        if type(sample) != np.ndarray:
            sample = np.array(sample)
        sample_matrix = sample.reshape(sample.shape[0], 1).repeat(repeats=K, axis=1)

        # Density of every sample under each component Gaussian: an m*K matrix
        pdf_func = stats.norm(loc=mu_old, scale=np.sqrt(sigma_old)).pdf
        prob_x_condition_theta_z = pdf_func(sample_matrix)
        # Responsibilities: weight each density by pi_k, then normalize per sample
        weighted = pi_old * prob_x_condition_theta_z
        sample_responsibility = weighted / np.sum(weighted, axis=1, keepdims=True)
        sample_label = np.argmax(sample_responsibility, axis=1)
        if training:
            self.sample_label = sample_label
            return sample_responsibility
        else:
            return sample_responsibility, sample_label

    def _gmm_em_Maximization(self, sample, sample_responsibility):
        """
        M step: apply the closed-form updates derived above to the sample
        responsibilities computed in the E step.
        sample is the input sample (one-dimensional), an array or sequence;
        sample_responsibility is the m*K responsibility array.
        """
        # Make sure the sample is an m*1 column so broadcasting works
        if type(sample) != np.ndarray:
            sample = np.array(sample)
        sample = sample.reshape(len(sample), 1)

        responsibility = sample_responsibility
        K = responsibility.shape[1]
        responsibility_sum = np.sum(responsibility, axis=0)
        sample_respon_sum = np.sum(responsibility*sample, axis=0)
        mu_new = sample_respon_sum/responsibility_sum
        sample_respon_sd_sum = np.sum(responsibility*(sample.repeat(K, axis=1)-mu_new)**2, axis=0)
        sigma_new = sample_respon_sd_sum/responsibility_sum
        pi_new = responsibility_sum/self.sample_num
        theta_ = np.vstack((mu_new, sigma_new, pi_new))
        return theta_

    def Q_func(self, sample, theta_new, responsibility):
        """
        Evaluate the Q function:
        Q = sum_i sum_k gamma_ik * (log pi_k + log N(x_i | mu_k, sigma_k)).
        sample is a one-dimensional array of length m;
        theta_new is a 3*K array; responsibility is the m*K array.
        """
        mu_new, sigma_new, pi_new = theta_new
        K = len(pi_new)
        sample_matrix = np.array(sample).reshape(self.sample_num, 1).repeat(K, axis=1)
        # Log-density of each sample under each component (sigma_new is the variance)
        log_dens = -0.5*np.log(2*np.pi*sigma_new) \
                   - (sample_matrix-mu_new)**2/(2*sigma_new)
        Q = responsibility*(np.log(pi_new) + log_dens)
        return np.sum(Q)

    def train_GMM(self, sample_, theta_0, K, stop_error=0.0001, max_iter=10,
                  print_detail=False):
        """
        theta_0 must be a numpy array with 3 rows and K columns:
            row 1: initial mu parameters;
            row 2: initial sigma (variance) parameters;
            row 3: initial pi parameters, which must sum to 1;
        K is a tuning parameter specifying the number of mixture components.
        """
        sample_ = self.sample_seq   # the sample stored at construction is used
        self.k = K
        theta_new = theta_0
        iter_ = 0
        # Note: Q is a (negative) expected log-likelihood, so with Q_old = 0 the
        # first increment Q_new - Q_old is negative and the loop may stop after
        # a single round, as in the runs shown below.
        Q_old = 0
        Q_new = Q_old
        while True:
            iter_ += 1
            if iter_ > max_iter:
                self.__gaussian_par = theta_new
                self.__score = Q_new
                print("Maximum number of iterations reached, stopping training!")
                break
            else:
                print(f"Starting training round {iter_}!")
                responsibility = self._gmm_em_Expectation(sample=sample_,
                                                          theta_old=theta_new,
                                                          sub_model=K)
                theta_new = self._gmm_em_Maximization(sample=sample_,
                                                      sample_responsibility=responsibility)
                Q_new = self.Q_func(sample=sample_,
                                    theta_new=theta_new,
                                    responsibility=responsibility)
                iter_error = Q_new - Q_old
                if print_detail:
                    print("===>Updated parameters:\n", theta_new)
                    print("===>Training labels:", self.sample_label)
                if iter_error >= stop_error:
                    Q_old = Q_new
                    continue
                else:
                    self.__gaussian_par = theta_new
                    self.__score = Q_new
                    print("Stopping condition met, training finished!")
                    break
        return None

    def fit(self, new_data):
        """
        Fit new data: compute each component's responsibility for the data
        and label each point with the component of maximum responsibility.
        new_data must be a one-dimensional array or sequence.
        """
        if type(new_data) != np.ndarray:
            new_data = np.array(new_data)
        sample_matrix = new_data.reshape(new_data.shape[0], 1)
        mu_old, sigma_old, pi_old = self.__gaussian_par
        # Density under each fitted component, weighted by the fitted pi
        pdf_func = stats.norm(loc=mu_old, scale=np.sqrt(sigma_old)).pdf
        weighted = pi_old * pdf_func(sample_matrix)
        sample_responsibility = weighted / np.sum(weighted, axis=1, keepdims=True)
        sample_label = np.argmax(sample_responsibility, axis=1)
        return sample_responsibility, sample_label

2.2.2 Training

## Call the implementation above
sample = pd.Series([-67,-48,6,8,14,16,23,24,28,29,41,49,56,60,75])
gmm = GMM_EM(sample)
## EM is sensitive to its initial values: different starting points yield different clusterings
theta_ = np.array([[-60,10],[10,10],[0.5,0.5]])
gmm.train_GMM(sample_=sample,theta_0=theta_,K = 2,max_iter=1000)
print("Results:\n")
print(f"Means: {gmm.model_par[0,:]};")
print(f"Variances: {gmm.model_par[1,:]};")
print(f"Component weights: {gmm.model_par[2,:]};")
print(f"Sample labels: {gmm.sample_label};")
Starting training round 1!
Stopping condition met, training finished!
Results:

Means: [-57.5  33. ];
Variances: [ 90.25   428.3077];
Component weights: [0.1333 0.8667];
Sample labels: [0 0 1 1 1 1 1 1 1 1 1 1 1 1 1];
## Change the initial values
theta_ = np.array([[10,1],[10,10],[0.5,0.5]])
gmm.train_GMM(sample_=sample,theta_0=theta_,K = 2,max_iter=1000)
print("Results:\n")
print(f"Means: {gmm.model_par[0,:]};")
print(f"Variances: {gmm.model_par[1,:]};")
print(f"Component weights: {gmm.model_par[2,:]};")
print(f"Sample labels: {gmm.sample_label};")
Starting training round 1!
Stopping condition met, training finished!
Results:

Means: [ 34.0313 -45.0226];
Variances: [416.3937 714.4159];
Component weights: [0.8343 0.1657];
Sample labels: [1 1 0 0 0 0 0 0 0 0 0 0 0 0 0];

2.2.3 The GMM Model in sklearn

The Gaussian mixture model in sklearn is imported from the mixture module. It likewise solves the model with the EM framework; the difference is that only the number of components needs to be specified (the K parameter, corresponding to the constructor argument n_components). The initial EM parameters can be chosen in two ways, kmeans or random selection (set via the constructor argument init_params).

2.2.3.1 Creating the GMM Class
GaussianMixture(n_components=1, *, covariance_type='full', tol=0.001, reg_covar=1e-06, max_iter=100, n_init=1, init_params='kmeans', weights_init=None, means_init=None, precisions_init=None, random_state=None, warm_start=False, verbose=0, verbose_interval=10)

This class builds a probabilistic representation of a Gaussian mixture distribution and estimates its parameters.
Its main parameters are as follows:

  • n_components: the number of mixture components;
  • covariance_type: which kind of covariance to use when building the density expression, one of {'full' (default), 'tied', 'diag', 'spherical'}, meaning:

'full': each component (component Gaussian) has its own covariance matrix, a full covariance matrix with no structural constraints;
'tied': all components share one and the same covariance matrix;
'diag': each component has its own diagonal covariance matrix, i.e. the variables are mutually independent; the diagonal holds each variable's variance and the off-diagonal entries are 0;
'spherical': each component has a single variance value, i.e. the diagonal entries are equal and the off-diagonal entries are 0;

  • tol: the stopping threshold; iteration stops when the average improvement of the lower bound on the objective falls below it;
  • reg_covar: a non-negative regularization term added to the diagonal of the covariance matrices, which keeps the estimated covariance matrices positive (definite);
  • max_iter: the maximum number of EM iterations, 100 by default;
  • n_init: the number of initializations to perform, 1 by default;
  • init_params: the initialization method, either "kmeans" or "random"; by default the "kmeans" method is used to initialize the component weights, means, and precisions;
  • weights_init: optional initial weights, of shape (n_components,); when not given, the weights are initialized via init_params;
  • means_init: optional initial means, of shape (n_components, n_features), as above;
  • precisions_init: optional initial precisions, the inverses of the covariance matrices; when None, they are initialized with the chosen initialization method. The shape depends on the covariance type:

when spherical: (n_components,);
when tied: (n_features, n_features);
when diag: (n_components, n_features);
when full: (n_components, n_features, n_features);

  • random_state: an integer controlling the random seed used by the initialization method, since both kmeans and random initialization involve randomness;
  • warm_start: whether to start warm or cold; if True, the result of the last fit is used as the initialization of the next call to fit(); mainly useful to speed up training when similar problems are fitted repeatedly;
  • verbose: whether to print details; if 1, the initialization and each iteration step are printed; if greater than 1, the log-probability and the time of each step are printed as well;
  • verbose_interval: the number of iterations between prints, 10 by default.
from sklearn.mixture import GaussianMixture as GMM_skl
sample = pd.Series([-67,-48,6,8,14,16,23,24,28,29,41,49,56,60,75])
gmm_skl = GMM_skl(n_components=2,init_params="random")
sample_skl = np.array(sample).reshape(len(sample),1)
gmm_model_fit = gmm_skl.fit(sample_skl)
gmm_model_fit.fit_predict(sample_skl)

array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
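Since covariance_type changes the shape of the fitted covariances, a quick sketch (my own, not in the original notes; the expected shapes follow the tables in this section) makes the differences visible:

# Illustrative check of the covariance shapes for each covariance_type
for cov_type in ["full", "tied", "diag", "spherical"]:
    m = GMM_skl(n_components=2, covariance_type=cov_type).fit(sample_skl)
    # With n_features = 1, the expected shapes are: full (2, 1, 1),
    # tied (1, 1), diag (2, 1), spherical (2,)
    print(cov_type, m.covariances_.shape)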
2.2.3.2 Class Attributes

The class exposes the attributes weights_, means_, covariances_, precisions_, precisions_cholesky_, converged_, n_iter_, and lower_bound_, whose meanings are:

  • weights_: the weight of each mixture component, an array of shape (n_components,);
  • means_: the mean of each component, of shape (n_components, n_features);
  • covariances_: the covariance of each component; its shape again depends on the covariance type chosen at construction (covariance_type):

when spherical: (n_components,);
when tied: (n_features, n_features);
when diag: (n_components, n_features);
when full: (n_components, n_features, n_features);

  • precisions_: the precision matrix of each component, the inverse of its covariance matrix; for a Gaussian the covariance matrix is symmetric positive definite, so the distribution can equivalently be parameterized by the precision matrices; the advantage of storing precisions instead of covariances is that computing the log-likelihood of new samples is more efficient;
  • converged_: boolean; True when fit() converged, False otherwise;
  • n_iter_: the number of EM iterations used by the best fit to reach convergence;
  • lower_bound_: the best lower bound on the log-likelihood of the fitted model.
print("EM迭代步数:",gmm_model_fit.n_iter_)
print("对数似然的最佳下界值:",gmm_model_fit.lower_bound_)
EM迭代步数: 22
对数似然的最佳下界值: -4.737557457654969
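The fitted distribution parameters can be read off the remaining attributes in the same way (output omitted here, since it depends on the random initialization):

print("Component weights:", gmm_model_fit.weights_)
print("Component means:", gmm_model_fit.means_.ravel())
print("Component covariances:", gmm_model_fit.covariances_.ravel())
print("Converged:", gmm_model_fit.converged_)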
2.2.3.3 Main Methods of the GMM Class

(1) gmm.aic(X): the Akaike information criterion for the current model on the input X;
(2) gmm.bic(X): the Bayesian information criterion, the counterpart of the AIC, which scores the model's complexity against its log-likelihood;
(3) gmm.fit(X, y=None): inherited from sklearn.mixture._base.BaseMixture, fits the model to the data (training); the model is fitted n_init times and the parameters of the best run are kept;
(4) gmm.fit_predict(X, y=None): like fit(), except that it also returns the predicted label of each sample;
(5) gmm.predict(X): predicts labels for X with the fitted model parameters;
(6) gmm.predict_proba(X): like predict(), but returns each component's responsibility for the new samples, i.e. the probability that each sample belongs to each mixture component;
(7) gmm.sample(n_samples=1): draws samples from the fitted Gaussian mixture distribution;
(8) gmm.score(X, y=None): the average per-sample log-likelihood of the given data;
(9) gmm.score_samples(X): the weighted log probability of each sample;
(10) gmm.get_params(deep=True): inherited from sklearn.base.BaseEstimator, returns the model's parameters;
(11) gmm.set_params(**params): inherited from sklearn.base.BaseEstimator, sets the model's parameters manually.

sample = np.array(sample).reshape(len(sample),1)
print("Average log-likelihood of the model:", gmm_skl.score(sample))
print("Weighted log probability of each sample:\n", gmm_skl.score_samples(sample))
Average log-likelihood of the model: -4.737557457647792
Weighted log probability of each sample:
 [-5.6851 -5.6851 -4.9409 -4.8199 -4.5127 -4.429  -4.2092 -4.1871 -4.122
 -4.1116 -4.1679 -4.3917 -4.7098 -4.9428 -6.1483]
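
Finally, a sketch exercising a few of the other listed methods (mine, not in the original notes; outputs depend on the fitted run and are omitted):

# Responsibilities of each component for the training data
print(gmm_skl.predict_proba(sample_skl))
# Model-selection criteria on the same data
print("AIC:", gmm_skl.aic(sample_skl), "BIC:", gmm_skl.bic(sample_skl))
# Draw 5 new points from the fitted mixture, together with their component labels
new_X, new_labels = gmm_skl.sample(n_samples=5)
print(new_X.ravel(), new_labels)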
