These are my personal study notes on the EM algorithm framework, drawing mainly on Li Hang (李航)'s Statistical Learning Methods (《统计学习方法》) and PRML. Some passages are excerpted from various other online materials; the sources are too scattered to list one by one, so please contact me for removal in case of infringement.
There are two biased coins, A and B. In each round one coin is chosen at random and tossed 10 times; five rounds are carried out in total. Recording H for heads and T for tails, the results are:
Round 1: 5H 5T
Round 2: 9H 1T
Round 3: 8H 2T
Round 4: 4H 6T
Round 5: 7H 3T
If we knew which coin was chosen in each round, we could estimate the heads probabilities directly by MLE. For example, if the choices were A B B A B, then:
$\overline{\theta}_{A} = \frac{5H+4H}{20}$
$\overline{\theta}_{B} = \frac{9H+8H+7H}{30}$
But what if we do not know which coin was chosen in each round? How do we estimate the parameters then?
The coin identity of each round is now a latent variable, and the EM framework applies:
step 1. Initialize the parameters, i.e. $\vec{\theta^{0}} = (\theta_{A}^{0},\theta_{B}^{0})$;
step 2. Iterate until convergence or until the stopping condition is met. E-step: build the Q function, i.e. compute the conditional distribution of each round's coin identity under the current parameters and the sample:
$Q_{i}(z^{(i)}) = P(z^{(i)} \mid x^{(i)}, \theta^{(t)})$
M-step: maximize the expectation, i.e. maximize over $\theta$ the objective:
$\sum_{i=1}^{m}\sum_{z^{(i)}} Q_{i}(z^{(i)})\,\log P(x^{(i)}, z^{(i)} \mid \theta)$
The M-step is the part that takes real work. Under the current round's parameters the factors $Q_{i}(z^{(i)})$ are fixed constants, so the optimization falls entirely on $\log P(x^{(i)}, z^{(i)} \mid \theta)$: we look for a $\theta^{(t+1)}$ that makes this expected complete-data log-likelihood larger than before.
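As a concrete check of the E-step (using the initialization $\theta_{A}^{0}=0.7$, $\theta_{B}^{0}=0.8$ that the code below also starts from), the posterior for round 1 (5H 5T) works out to
$Q_{1}(z=A) = \frac{0.7^5 \cdot 0.3^5}{0.7^5 \cdot 0.3^5 + 0.8^5 \cdot 0.2^5} \approx \frac{4.084\times 10^{-4}}{4.084\times 10^{-4} + 1.049\times 10^{-4}} \approx 0.796$
so round 1 contributes its counts to coin A with weight $\approx 0.796$ and to coin B with weight $\approx 0.204$; the binomial coefficient cancels in the ratio, which is why the code below can drop it.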
The code is as follows.
import pandas as pd
import numpy as np

class calac_EM():
    """
    EM for the two-coin problem;
    :sample: the observations, a list of tuples, one tuple per round, [(H, T)];
    :h_prob: the frequency of heads within each round;
    """
    def __init__(self, sample):
        # store as an array of shape (m, 2): (heads, tails) counts per round
        self.sample = np.array(sample)
        self.h_prob = [e[0]/sum(e) for e in sample]
    def step_E(self, prob_a_old, prob_b_old):
        # likelihood of each round under coin A and under coin B
        # (the binomial coefficient cancels in the posterior ratio)
        prob_a_x_theta_old = np.prod(np.power([prob_a_old, 1-prob_a_old], self.sample),
                                     axis=1)
        prob_b_x_theta_old = np.prod(np.power([prob_b_old, 1-prob_b_old], self.sample),
                                     axis=1)
        # posterior probability that each round was tossed with coin A
        Q_a = prob_a_x_theta_old/(prob_a_x_theta_old + prob_b_x_theta_old)
        Q_b = 1 - Q_a
        new_shape = (Q_a.shape[0], 1)
        Q_a_repeat = np.repeat(np.reshape(Q_a, new_shape), repeats=2, axis=1)
        Q_b_repeat = 1 - Q_a_repeat
        # expected (heads, tails) counts attributed to each coin
        Exception_a = np.sum(Q_a_repeat*self.sample, axis=0)
        Exception_b = np.sum(Q_b_repeat*self.sample, axis=0)
        return Exception_a, Exception_b
    def step_M(self, Exception_a, Exception_b):
        # new heads probabilities = expected heads / expected total tosses
        prob_a_theta_new = Exception_a[0]/np.sum(Exception_a)
        prob_b_theta_new = Exception_b[0]/np.sum(Exception_b)
        return prob_a_theta_new, prob_b_theta_new
    def iter_train(self, prob_A, prob_B, stop_e, max_iter=11):
        """
        prob_A and prob_B are the initial parameters, i.e. the heads
        probabilities of coins A and B;
        stop_e is the error threshold: iteration stops when the parameter
        change between round t and round t+1 falls below it;
        max_iter is the maximum number of iterations.
        """
        prob_a_theta_old = prob_A
        prob_b_theta_old = prob_B
        iter_ = 0
        print("Initial parameters (theta_a, theta_b):", prob_a_theta_old, prob_b_theta_old)
        while True:
            iter_ += 1
            print(f"Training round {iter_}:")
            Exception_a, Exception_b = self.step_E(prob_a_old=prob_a_theta_old,
                                                   prob_b_old=prob_b_theta_old)
            prob_a_theta_new, prob_b_theta_new = self.step_M(Exception_a, Exception_b)
            print("===Updated parameters:", (prob_a_theta_new, prob_b_theta_new))
            iter_stop = np.abs(prob_a_theta_old-prob_a_theta_new)\
                        + np.abs(prob_b_theta_new-prob_b_theta_old)
            if iter_stop <= stop_e:
                print("Convergence criterion met, training finished!")
                break
            elif iter_ >= max_iter:
                print("Maximum iterations reached, training stopped")
                break
            else:
                prob_a_theta_old = prob_a_theta_new
                prob_b_theta_old = prob_b_theta_new
        return None
sample = [(5,5),(9,1),(8,2),(4,6),(7,3)]
coin_EM = calac_EM(sample)
coin_EM.iter_train(0.7,0.8,0.00001,max_iter = 20)
Initial parameters (theta_a, theta_b): 0.7 0.8
Training round 1:
===Updated parameters: (0.5946099775593936, 0.7566461012979029)
Training round 2:
===Updated parameters: (0.5591603804756445, 0.7686246125972386)
Training round 3:
===Updated parameters: (0.539026245983948, 0.7823340890613432)
Training round 4:
===Updated parameters: (0.5282554145171112, 0.7904987455040994)
Training round 5:
===Updated parameters: (0.5232313429883886, 0.7942798192885937)
Training round 6:
===Updated parameters: (0.5210779710123444, 0.7958282879518048)
Training round 7:
===Updated parameters: (0.5201895676222431, 0.7964279328153537)
Training round 8:
===Updated parameters: (0.5198283779239912, 0.7966545148106672)
Training round 9:
===Updated parameters: (0.5196822508906818, 0.7967391807109708)
Training round 10:
===Updated parameters: (0.5196232024601661, 0.7967706356141095)
Training round 11:
===Updated parameters: (0.5195993378840494, 0.7967822783202753)
Training round 12:
===Updated parameters: (0.519589686991515, 0.796786574548503)
Training round 13:
===Updated parameters: (0.5195857811947752, 0.7967881551002687)
Convergence criterion met, training finished!
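Note that the likelihood is symmetric in A and B: swapping the two initial values just swaps the roles of the coins, so the mirrored run sketched below should converge to the same pair of values with theta_a and theta_b exchanged (about 0.797 and 0.520). A quick check:

# Mirrored initialization: by the A/B symmetry of the model the E-step
# posteriors swap exactly, so the labels of the two coins are exchanged
# while the estimated pair of probabilities stays the same.
coin_EM.iter_train(0.8, 0.7, 0.00001, max_iter=20)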
There are three coins A, B and C, with heads probabilities $\pi, p, q$ respectively. The experiment: first toss coin A; if it lands heads, toss coin B once and record the result; if it lands tails, toss coin C once and record the result. The experiment is run 10 times, recording heads as 1 and tails as 0, giving:
1,1,0,1,0,0,1,0,1,1
If only the outcomes can be observed, not the tossing process, how do we estimate the model parameters, i.e. the heads probabilities of the three coins?
The model is set up as follows:
Let $x$ denote the observed variable, taking values in $\{0,1\}$ for tails and heads respectively;
Let $z$ denote the latent variable, taking values in $\{0,1\}$ according to whether the observation was produced by coin C or coin B;
Let $\theta$ be the model parameters, so $\theta=(\pi,p,q)$;
Then the marginal distribution of a single observation $x$ is:
$p(x \mid \theta)=\pi p^x (1-p)^{1-x}+(1-\pi) q^x (1-q)^{1-x}$
Applying the EM framework:
Key step 1: compute the conditional distribution of the latent variable given the current parameters and the sample:
$p(z=0 \mid x^{(i)},\theta)=\frac{(1-\pi) q^{x}(1-q)^{1-x}}{p(x \mid \pi,p,q)}$
$p(z=1 \mid x^{(i)},\theta)=\frac{\pi p^{x}(1-p)^{1-x}}{p(x \mid \pi,p,q)}$
Written as a single expression (a product, since exactly one exponent is 1 and the other factor reduces to 1):
$p(z \mid x^{(i)},\theta)=\frac{[\pi p^x (1-p)^{1-x}]^{z}\,[(1-\pi) q^x (1-q)^{1-x}]^{1-z}}{p(x \mid \pi,p,q)}$
Key step 2: the objective function (Q function):
$Q(\theta,\theta^{(t)})=\sum_{i=1}^{m}\sum_{z^i} p(z^i \mid x^i,\theta^t)\,\log p(x^i,z^i \mid \theta)$
The joint distribution of the observed variable $x$ and the latent variable $z$ is likewise a product over the two cases:
$p(x,z \mid \theta) = [\pi p^x (1-p)^{1-x}]^{z}\,[(1-\pi) q^x (1-q)^{1-x}]^{1-z}$
The EM algorithm for the three-coin model is therefore:
step 1. Initialize $\theta^{(0)}=(\pi^0,p^0,q^0)$;
step 2. For $t$ = 0, ..., MaxIter-1, iteration $t$ consists of:
E-step: for every sample $x^i$, $i=1,...,m$, compute the conditional distribution of the latent variable:
$p(z^i \mid x^i,\pi^t,p^t,q^t)=\frac{[\pi^t (p^t)^{x^i}(1-p^t)^{1-x^i}]^{z^i}\,[(1-\pi^t)(q^t)^{x^i}(1-q^t)^{1-x^i}]^{1-z^i}}{p(x^i \mid \pi^t,p^t,q^t)}$
M-step: maximize the objective:
$\theta^{(t+1)} = (\pi^{(t+1)},p^{(t+1)},q^{(t+1)}) = \arg\max_{\pi,p,q} Q(\theta,\theta^{t})$
Differentiating, note that the conditional distribution of the latent variable is a constant under the fixed current parameters, so the derivatives act on the joint distribution of the latent and observed variables. Write $\mu^i$ for the probability that sample $x^i$ was produced by coin B, i.e. that $z^i=1$:
$\mu^i = p(z^i=1 \mid x^{i},\theta^t)=\frac{\pi^t (p^t)^{x^i}(1-p^t)^{1-x^i}}{p(x^i \mid \pi^t,p^t,q^t)}$
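To make the E-step concrete: under the initialization $\theta^{0} = (0.46, 0.55, 0.67)$ used in the code below, $\mu^i$ depends only on whether $x^i$ is 1 or 0:
$\mu^i\big|_{x^i=1} = \frac{0.46 \times 0.55}{0.46 \times 0.55 + 0.54 \times 0.67} \approx 0.4115, \qquad \mu^i\big|_{x^i=0} = \frac{0.46 \times 0.45}{0.46 \times 0.45 + 0.54 \times 0.33} \approx 0.5374$
Averaging these over the 6 ones and 4 zeros reproduces the first-round update $\pi^{1} \approx 0.4619$ shown in the run below.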
The objective then becomes:
$Q(\theta,\theta^{(t)}) = \sum_{i=1}^{m}\left\{\mu^i \log\left[\pi p^{x^i}(1-p)^{1-x^i}\right] + (1-\mu^i)\log\left[(1-\pi) q^{x^i}(1-q)^{1-x^i}\right]\right\}$
Differentiating with respect to each parameter, with $\mu^i$ treated as a known constant computed under $\theta^t$, and setting the derivatives to zero gives the updates:
$\pi^{t+1}=\frac{1}{m}\sum_{i=1}^{m}\mu^i, \qquad p^{t+1}=\frac{\sum_{i=1}^{m}\mu^i x^i}{\sum_{i=1}^{m}\mu^i}, \qquad q^{t+1}=\frac{\sum_{i=1}^{m}(1-\mu^i)\, x^i}{\sum_{i=1}^{m}(1-\mu^i)}$
These are exactly the updates implemented in the M-step below:
import numpy as np
import pandas as pd

class Three_Coin_EM():
    """
    Models the three-coin problem and solves it iteratively with EM;
    """
    def __init__(self, sample_seq):
        """
        sample_seq is a sequence of observations, 1 for heads, 0 for tails;
        """
        self.sample = np.array(sample_seq)
        self.theta_solution = None
        return None
    def _Exceptation_Step(self, theta_t, sample_array):
        """
        E-step: compute the conditional distribution of the latent variable
        under the parameters updated in the previous round;
        theta_t is the parameter tuple of round t (the previous update);
        For each sample i=1,...,m this yields the posterior probability mu_B
        that the observation was produced by coin B (z=1); the probability
        for coin C is its complement.
        """
        pi_t, p_t, q_t = theta_t
        prob_z_x_B = pi_t*np.power(p_t, sample_array)*np.power(1-p_t, 1-sample_array)
        prob_z_x_C = (1-pi_t)*np.power(q_t, sample_array)*np.power(1-q_t, 1-sample_array)
        mu_B_t = prob_z_x_B/(prob_z_x_B + prob_z_x_C)
        return mu_B_t
    def _Maximization_Step(self, mu_B_t, sample_array):
        """
        M-step: apply the closed-form updates obtained by setting the
        partial derivatives of the Q function to zero;
        """
        pi_tp1 = np.mean(mu_B_t)
        p_tp1 = np.sum(mu_B_t*sample_array)/np.sum(mu_B_t)
        q_tp1 = np.sum((1-mu_B_t)*sample_array)/np.sum(1-mu_B_t)
        return [pi_tp1, p_tp1, q_tp1]
    def train(self, theta_0, stop_error=0.0001, max_iter=20):
        """
        theta_0 is the initial parameter tuple (pi, p, q);
        stop_error is the threshold on the parameter change between rounds;
        max_iter is the maximum number of iterations.
        """
        sample_array = np.array(self.sample)
        theta_old = theta_0
        iter_ = 0
        while True:
            iter_ += 1
            print(f"Starting iteration {iter_}:")
            print("  Initial parameters:", theta_old)
            mu_B_t = self._Exceptation_Step(theta_t=theta_old,
                                            sample_array=sample_array)
            mu_B_t = np.array(mu_B_t)
            theta_new = self._Maximization_Step(mu_B_t=mu_B_t,
                                                sample_array=sample_array)
            print("  Updated parameters:", theta_new)
            step_error = np.sum(np.abs(np.array(theta_old)-np.array(theta_new)))
            if step_error < stop_error:
                print("Convergence criterion met, stopping iteration!")
                self.theta_solution = theta_new
                break
            elif iter_ > max_iter:
                print("Maximum iterations reached, stopping iteration!")
                self.theta_solution = theta_new
                break
            else:
                theta_old = theta_new
        return None

# run training
theta_0 = (0.5, 0.5, 0.5)
sample = [1,1,0,1,0,0,1,0,1,1]
tc = Three_Coin_EM(sample)
tc.train(theta_0=(0.46, 0.55, 0.67), stop_error=0.0000001)
Starting iteration 1:
  Initial parameters: (0.46, 0.55, 0.67)
  Updated parameters: [0.461862835113919, 0.5345950037850112, 0.6561346417857326]
Starting iteration 2:
  Initial parameters: [0.461862835113919, 0.5345950037850112, 0.6561346417857326]
  Updated parameters: [0.46186283511391907, 0.5345950037850111, 0.6561346417857326]
Convergence criterion met, stopping iteration!
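As a side note, with the symmetric initialization theta_0 = (0.5, 0.5, 0.5) defined above (but not actually passed to train), the E-step gives $\mu^i = 0.5$ for every sample, so the first M-step already lands on the fixed point $(0.5, 0.6, 0.6)$, where 0.6 is the sample mean; this matches the initialization-sensitivity example in Li Hang's book, and is easy to check:

# Symmetric initialization: every mu^i equals 0.5 regardless of x^i, so pi
# stays at 0.5 while p and q both jump to the sample mean 0.6; EM is stuck
# at the fixed point (0.5, 0.6, 0.6) and stops once the step error hits 0.
tc_sym = Three_Coin_EM(sample)
tc_sym.train(theta_0=theta_0, stop_error=0.0000001)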
A mixture model is defined as:
$p(x \mid \theta)=\sum_{k=1}^{K}\alpha_{k}\,\Phi(x \mid \theta_{k})$
where:
$\alpha_{k}$ is the weight of the $k$-th component, satisfying $\alpha_{k}\geq 0$ and
$\sum_{k=1}^{K}\alpha_{k}=1$
$\Phi(x \mid \theta_{k})$ are the component models, which may be distributions of different families;
If every component is Gaussian, the mixture is a Gaussian mixture model:
$p(x \mid \theta)=\sum_{k=1}^{K}\pi_{k}\,\Phi(x \mid \mu_{k},\Sigma_{k})$
$\Phi(x \mid \mu_{k},\Sigma_{k}) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_{k}|^{1/2}}\exp\left(-\frac{(x-\mu_{k})^T\Sigma_{k}^{-1}(x-\mu_{k})}{2}\right)$
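In the one-dimensional case used below ($D=1$, $\Sigma_k=\sigma_k^2$), $\Phi$ reduces to the ordinary normal pdf, which the implementation evaluates via scipy.stats.norm. A minimal sketch of the correspondence, with made-up values for $x$, $\mu_k$ and $\sigma_k^2$:

import numpy as np
import scipy.stats as stats

# Phi(x | mu_k, sigma_k^2) for a single 1-D point, by hand and via scipy
x, mu_k, var_k = 1.5, 0.0, 4.0
by_formula = np.exp(-(x - mu_k)**2/(2*var_k))/np.sqrt(2*np.pi*var_k)
by_scipy = stats.norm(loc=mu_k, scale=np.sqrt(var_k)).pdf(x)
assert np.isclose(by_formula, by_scipy)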
Algorithm framework:
Following the EM framework, the key ingredients are:
(1) Parameters: $K$ Gaussians, each with two distribution parameters (mean and variance) plus a weight coefficient; under the sum-to-one constraint on the weights the $K$ components have $(3K-1)$ free parameters in total (e.g. $K=2$ gives 5);
(2) Expectation — compute the conditional distribution of the latent variable given the sample, $p(z \mid x,\theta)$, via Bayes' rule:
$p(z \mid x,\theta)=\frac{p(x,z \mid \theta)}{p(x \mid \theta)}=\frac{p(z)\,p(x \mid z,\theta)}{\sum_{z} p(z)\,p(x \mid z,\theta)}$
The pieces of this expression are built up in the following key steps.
(1) Key step — introduce the latent variable $z^i=(z_{1}^i,z_{2}^i,...,z_{K}^i)$: if sample $x^i$ belongs to the $k$-th Gaussian, then $z_{k}^{i}=1$ and the remaining components of $z^i$ are 0;
(2) Key step — express the mixture under this latent coding:
$p(x \mid \theta)=\sum_{k=1}^{K}\pi_k\,\Phi(x \mid \mu_{k},\Sigma_k)=\sum_{z} p(z)\,p(x \mid z,\theta)$
(3) Key step — the marginal distribution $p(z)$ of the latent variable: for deciding which Gaussian a sample belongs to, $p(z)$ plays the role of a prior over the component assignment, namely:
$p(z_k=1)=\pi_k \implies p(z)=\prod_{k=1}^{K}\pi_{k}^{z_{k}}$
(4) Key step — the conditional distribution $p(x \mid z,\theta)$ of the sample given the latent variable (only the selected component's density survives):
$p(x \mid z,\theta)=\prod_{k=1}^{K}\Phi(x \mid \mu_{k},\Sigma_k)^{z_k}$
(5) Key step — the joint distribution $p(x,z \mid \theta)$ of the sample and the latent variable:
$p(x,z \mid \theta)=p(z)\,p(x \mid z,\theta)=\prod_{k=1}^{K}[\pi_k\,\Phi(x \mid \mu_{k},\Sigma_k)]^{z_k}$
(6) For the samples $i=1,...,m$, the conditional probability above defines the responsibility of component $k$ for sample $x^i$:
$\gamma_{k}^{i}=p(z_{k}^i=1 \mid x^i,\theta)=\frac{\pi_k\,\Phi(x^i \mid \mu_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\,\Phi(x^i \mid \mu_j,\Sigma_j)}$
(3) Maximization — differentiate the surrogate of the maximum-likelihood objective to obtain the updated $\theta$, i.e. update $\pi,\mu,\Sigma$, as follows.
Key step — build the log-likelihood with the latent variable to obtain the objective (see PRML for the derivation of the Q function):
$\theta^{t+1}=\arg\max_{\theta} Q(\theta,\theta^{t}), \qquad Q(\theta,\theta^{t})=\sum_{i=1}^{m}\sum_{k=1}^{K}\gamma_{k}^{i}\left[\log\pi_k + \log\Phi(x^i \mid \mu_k,\Sigma_k)\right]$
Key step — setting the partial derivatives to zero gives:
$\mu_{k}^{t+1}=\frac{\sum_{i=1}^{m}\gamma_{k}^{i}\,x^i}{\sum_{i=1}^{m}\gamma_{k}^{i}}$
$\Sigma_{k}^{t+1}=\frac{\sum_{i=1}^{m}\gamma_{k}^{i}\,(x^i-\mu_{k}^{t+1})(x^i-\mu_{k}^{t+1})^T}{\sum_{i=1}^{m}\gamma_{k}^{i}}$
$\pi_{k}^{t+1}=\frac{\sum_{i=1}^{m}\gamma_{k}^{i}}{m}$
import numpy as np
import pandas as pd
import scipy.stats as stats
from collections import namedtuple

class GMM_EM():
    """
    Fits a Gaussian mixture model: besides assigning a label to each sample,
    it also exposes the probability density of each sample under every
    Gaussian component, which can be read as a soft membership probability;
    """
    def __init__(self, sample):
        np.set_printoptions(precision=4)
        self.sample_num = len(sample)
        self.sample_seq = sample
        self.k = None
        self.__guassian_par = None
        self.sample_label = np.array([])
        self.__score = None
        return None
    @property
    def model_par(self):
        return self.__guassian_par
    @property
    def model_score(self):
        return self.__score
    def __repr__(self):
        model = namedtuple("GMM", ["mu", "sigma", "pi"])
        mu, sigma, pi = self.__guassian_par
        return str(model(mu, sigma, pi))
    def _gmm_em_Exceptation(self, sample, theta_old, sub_model, training=True):
        """
        E-step;
        theta_old holds the current round's parameters as a 3-row, K-column
        array: the rows are mean, variance and pi respectively;
        sample is the sample sequence (only 1-D data is supported);
        sub_model is the number of mixture components, a user-chosen
        training parameter;
        Returns an m*K array whose elements are the responsibility of
        component k for sample i;
        """
        if theta_old.shape[0] != 3:
            print("Bad parameter matrix: it must have 3 rows!")
            return None
        # unpack the parameters
        mu_old, sigma_old, pi_old = theta_old
        K = sub_model
        # validate the input
        if type(sample) != np.ndarray:
            sample = np.array(sample)
        sample_shape = sample.shape
        sample_matrix = sample.reshape(sample_shape[0], 1).repeat(repeats=K, axis=1)
        # probability density functions of the K components
        pdf_func = stats.norm(loc=mu_old, scale=np.sqrt(sigma_old)).pdf
        # conditional density of every sample under each component: an m*K matrix
        prob_x_condition_theta_z = pdf_func(sample_matrix)
        # responsibility gamma_k^i = pi_k * N(x_i|mu_k) / sum_j pi_j * N(x_i|mu_j)
        weighted_prob = pi_old*prob_x_condition_theta_z
        sample_responsibility = weighted_prob/np.sum(weighted_prob, axis=1, keepdims=True)
        sample_label = np.argmax(sample_responsibility, axis=1)
        if training:
            self.sample_label = sample_label
            return sample_responsibility
        else:
            return sample_responsibility, sample_label
    def _gmm_em_Maximization(self, sample, sample_responsibility):
        """
        M-step: plug the responsibilities from the E-step into the
        closed-form updates and return the new parameters;
        sample is the input sample (1-D only), a numpy array;
        sample_responsibility is the m*K responsibility array;
        """
        ## validate the sample
        if type(sample) != np.ndarray:
            new_shape = (len(sample), 1)
            sample = np.array(sample).reshape(new_shape)
        responsibility = sample_responsibility
        K = responsibility.shape[1]
        responsibility_sum = np.sum(responsibility, axis=0)
        sample_respon_sum = np.sum(responsibility*sample, axis=0)
        mu_new = sample_respon_sum/responsibility_sum
        sample_respon_sd_sum = np.sum(responsibility*(sample.repeat(K, axis=1)-mu_new)**2, axis=0)
        sigma_new = sample_respon_sd_sum/responsibility_sum
        pi_new = responsibility_sum/self.sample_num
        theta_ = np.vstack((mu_new, sigma_new, pi_new))
        return theta_
    def Q_func(self, sample, theta_new, responsibility):
        """
        Evaluate the Q function (expected complete-data log-likelihood);
        sample is a 1-D array of length m;
        theta_new is the 3*K parameter array;
        responsibility is the m*K responsibility array;
        """
        mu_new, sigma_new, pi_new = theta_new
        K = len(pi_new)
        sample_matrix = np.array(sample).reshape(self.sample_num, 1).repeat(K, axis=1)
        # log N(x_i | mu_k, sigma_k) for every sample/component pair: an m*K matrix
        log_pdf = np.log(1/np.sqrt(2*np.pi*sigma_new)) - (sample_matrix-mu_new)**2/(2*sigma_new)
        # Q = sum_i sum_k gamma_ik * (log pi_k + log N(x_i | mu_k, sigma_k))
        Q = responsibility*(np.log(pi_new) + log_pdf)
        return np.sum(Q)
    def train_GMM(self, sample_, theta_0, K, stop_error=0.0001, max_iter=10,
                  print_detail=False):
        """
        theta_0 must be a numpy array with three rows and K columns:
        row 1: initial mu parameters;
        row 2: initial sigma parameters;
        row 3: pi parameters, which must sum to 1;
        K is a tuning parameter specifying the number of mixture components;
        """
        sample_ = self.sample_seq  # the constructor's sample overrides the argument
        self.k = K
        theta_new = theta_0
        iter_ = 0
        # NOTE: Q is an expected log-likelihood and is typically negative, so
        # with Q_old starting at 0 the improvement test below can already fail
        # on the first round; initialize Q_old = -np.inf to iterate further.
        Q_old = 0
        while True:
            iter_ += 1
            if iter_ > max_iter:
                self.__guassian_par = theta_new
                self.__score = Q_new
                print("Maximum iterations reached, stopping training!")
                break
            else:
                print(f"Starting training round {iter_}!")
                responsibility = self._gmm_em_Exceptation(sample=sample_,
                                                          theta_old=theta_new,
                                                          sub_model=K)
                theta_new = self._gmm_em_Maximization(sample=sample_,
                                                      sample_responsibility=responsibility)
                Q_new = self.Q_func(sample=sample_,
                                    theta_new=theta_new,
                                    responsibility=responsibility)
                iter_error = Q_new - Q_old
                if print_detail:
                    print("===> Updated parameters:\n", theta_new)
                    print("===> Training labels:", self.sample_label)
                if iter_error >= stop_error:
                    Q_old = Q_new
                    continue
                else:
                    self.__guassian_par = theta_new
                    self.__score = Q_new
                    print("Stopping criterion met, training finished!")
                    break
        return None
    def fit(self, new_data):
        """
        Fit new data: compute each component's responsibility for the data;
        the component with the largest responsibility gives the label.
        new_data must be a 1-D array or sequence
        """
        if type(new_data) != np.ndarray:
            new_data = np.array(new_data)
        mu_old, sigma_old, pi_old = self.__guassian_par
        sample = new_data
        sample_shape = sample.shape
        sample_matrix = sample.reshape(sample_shape[0], 1)
        # probability density functions of the fitted components
        pdf_func = stats.norm(loc=mu_old, scale=np.sqrt(sigma_old)).pdf
        # conditional density of every sample under each component: an m*K matrix
        prob_x_condition_theta_z = pdf_func(sample_matrix)
        # responsibilities, weighted by the fitted component weights pi
        weighted_prob = pi_old*prob_x_condition_theta_z
        sample_responsibility = weighted_prob/np.sum(weighted_prob, axis=1, keepdims=True)
        sample_label = np.argmax(sample_responsibility, axis=1)
        return sample_responsibility, sample_label
## run our own implementation
sample = pd.Series([-67,-48,6,8,14,16,23,24,28,29,41,49,56,60,75])
gmm = GMM_EM(sample)
## EM is sensitive to its initial values; different initializations give different clusterings
theta_ = np.array([[-60,10],[10,10],[0.5,0.5]])
gmm.train_GMM(sample_=sample, theta_0=theta_, K=2, max_iter=1000)
print("Results:")
print(f"Means: {gmm.model_par[0,:]};")
print(f"Variances: {gmm.model_par[1,:]};")
print(f"Component weights: {gmm.model_par[2,:]};")
print(f"Sample labels: {gmm.sample_label};")
Starting training round 1!
Stopping criterion met, training finished!
Results:
Means: [-57.5 33. ];
Variances: [ 90.25 428.3077];
Component weights: [0.1333 0.8667];
Sample labels: [0 0 1 1 1 1 1 1 1 1 1 1 1 1 1];
## change the initial values
theta_ = np.array([[10,1],[10,10],[0.5,0.5]])
gmm.train_GMM(sample_=sample, theta_0=theta_, K=2, max_iter=1000)
print("Results:")
print(f"Means: {gmm.model_par[0,:]};")
print(f"Variances: {gmm.model_par[1,:]};")
print(f"Component weights: {gmm.model_par[2,:]};")
print(f"Sample labels: {gmm.sample_label};")
Starting training round 1!
Stopping criterion met, training finished!
Results:
Means: [ 34.0313 -45.0226];
Variances: [416.3937 714.4159];
Component weights: [0.8343 0.1657];
Sample labels: [1 1 0 0 0 0 0 0 0 0 0 0 0 0 0];
The Gaussian mixture model in sklearn is imported from the mixture module. It likewise solves the model within the EM framework; the difference is that you only need to specify the number of components (the K parameter, corresponding to the constructor argument n_components). The initial EM parameters can be chosen in two ways, kmeans or random (set via the constructor argument init_params).
GaussianMixture(n_components=1, *, covariance_type='full', tol=0.001, reg_covar=1e-06, max_iter=100, n_init=1, init_params='kmeans', weights_init=None, means_init=None, precisions_init=None, random_state=None, warm_start=False, verbose=0, verbose_interval=10)
This class builds a probabilistic representation of a Gaussian mixture and estimates the distribution parameters.
Its main parameter is covariance_type, with the following options:
"full": every component (sub-Gaussian) has its own full covariance matrix, with all entries in general non-zero;
"tied": all components share one and the same covariance matrix;
"diag": every component has its own diagonal covariance matrix, i.e. the variables are mutually independent, the diagonal holds the per-variable variances and the off-diagonal entries are 0;
"spherical": every component has a single variance value, i.e. equal diagonal entries and zero off-diagonal entries;
The shape of the fitted covariances depends on this choice:
for spherical it is (n_components,);
for tied it is (n_features, n_features);
for diag it is (n_components, n_features);
for full it is (n_components, n_features, n_features);
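A minimal sketch of how this choice is passed in; each value below is a valid covariance_type, and the shape of the fitted covariances_ attribute follows the list above:

from sklearn.mixture import GaussianMixture
# On 1-D data (n_features = 1) the four options coincide numerically,
# but the stored covariances_ shapes still differ as listed above.
gm_full = GaussianMixture(n_components=2, covariance_type="full")
gm_tied = GaussianMixture(n_components=2, covariance_type="tied")
gm_diag = GaussianMixture(n_components=2, covariance_type="diag")
gm_sph = GaussianMixture(n_components=2, covariance_type="spherical")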
from sklearn.mixture import GaussianMixture as GMM_skl
sample = pd.Series([-67,-48,6,8,14,16,23,24,28,29,41,49,56,60,75])
gmm_skl = GMM_skl(n_components=2,init_params="random")
sample_skl = np.array(sample).reshape(len(sample),1)
gmm_model_fit = gmm_skl.fit(sample_skl)
gmm_model_fit.fit_predict(sample_skl)
array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
The fitted object exposes the attributes weights_, means_, covariances_, precisions_, precisions_cholesky_, converged_ and lower_bound_; the shapes of covariances_ and precisions_ follow the covariance_type table above.
print("EM迭代步数:",gmm_model_fit.n_iter_)
print("对数似然的最佳下界值:",gmm_model_fit.lower_bound_)
EM迭代步数: 22
对数似然的最佳下界值: -4.737557457654969
(1) gmm.aic(X): computes the Akaike information criterion of the current model on the input X;
(2) gmm.bic(X): the Bayesian counterpart of the AIC; computes the Bayesian information criterion, which scores the log-likelihood while penalizing model complexity;
(3) gmm.fit(X, y=None): inherited from the sklearn.mixture._base.BaseMixture class; fits the model to the data (model training);
(4) gmm.fit_predict(X, y=None): like fit(), but additionally returns the predicted labels for the samples;
(5) gmm.predict(X): predicts labels for X with the fitted model parameters;
(6) gmm.predict_proba(X): like predict(), but returns each component's responsibility for the new samples, i.e. the probability of each sample belonging to each component;
(7) gmm.sample(n_samples=1): draws samples from the fitted Gaussian mixture;
(8) gmm.score(X, y=None): computes the average per-sample log-likelihood of the given data;
(9) gmm.score_samples(X): computes the weighted log probability of each sample;
(10) gmm.get_params(deep=True): inherited from the sklearn.base.BaseEstimator class; returns the model's parameters;
(11) gmm.set_params(**params): inherited from the sklearn.base.BaseEstimator class; sets the model's parameters manually;
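A short usage sketch of a few of these methods on the model fitted above; new_x is a made-up query batch introduced here purely for illustration (outputs not shown):

# Two hypothetical query points, shaped (n_samples, n_features) as sklearn expects
new_x = np.array([[0.0], [50.0]])
labels = gmm_model_fit.predict(new_x)         # hard component labels
respons = gmm_model_fit.predict_proba(new_x)  # per-component responsibilities
draws, draw_labels = gmm_model_fit.sample(5)  # 5 draws from the fitted mixture
print(labels, respons.round(4), draws.ravel().round(2))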
sample = np.array(sample).reshape(len(sample),1)
print("Average log-likelihood of the model:", gmm_skl.score(sample))
print("Weighted log probability of each sample:\n", gmm_skl.score_samples(sample))
Average log-likelihood of the model: -4.737557457647792
Weighted log probability of each sample:
[-5.6851 -5.6851 -4.9409 -4.8199 -4.5127 -4.429 -4.2092 -4.1871 -4.122
-4.1116 -4.1679 -4.3917 -4.7098 -4.9428 -6.1483]