Introduction: the idea behind the maximum entropy model is that, when the information is uncertain, we let the entropy reach its maximum. This logic agrees with everyday experience.

This article follows Dr. Li Hang's *Statistical Learning Methods*.

The formulas below are explained using the example in the following table:
ID | age | working | house | credit_situation | label |
---|---|---|---|---|---|
1 | youth | no | no | 1 | refuse |
2 | youth | no | no | 2 | refuse |
3 | youth | yes | no | 2 | agree |
4 | youth | yes | yes | 1 | agree |
5 | youth | no | no | 1 | refuse |
6 | mid | no | no | 1 | refuse |
7 | mid | no | no | 2 | refuse |
8 | mid | yes | yes | 2 | agree |
9 | mid | no | yes | 3 | agree |
10 | mid | no | yes | 3 | agree |
11 | elder | no | yes | 3 | agree |
12 | elder | no | yes | 2 | agree |
13 | elder | yes | no | 2 | agree |
14 | elder | yes | no | 3 | agree |
15 | elder | no | no | 1 | refuse |
$$H(P) = -\sum_x P(x) \log P(x)$$
The larger the entropy, the more uncertain the information. The idea of the maximum entropy model is to let the entropy reach its maximum when the information is uncertain, which agrees with everyday experience.
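As a quick numerical illustration (not part of the original derivation), the uniform distribution is the most uncertain and therefore has the largest entropy — a minimal sketch:

```python
import numpy as np

def entropy(p):
    """H(P) = -sum_x P(x) log P(x); zero-probability entries are skipped."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# The uniform distribution maximizes entropy; a skewed one is "more certain".
h_uniform = entropy([0.5, 0.5])   # ln 2
h_skewed = entropy([0.9, 0.1])
```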
The maximum entropy model, given an input $X$, produces the output $Y$ according to the conditional probability $P(Y|X)$.
Given a training data set:

$$T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$$
From the training data we can determine the empirical distribution of the joint distribution $P(X,Y)$ and the empirical distribution of the marginal distribution $P(X)$, denoted $\hat P(X,Y)$ and $\hat P(X)$ respectively:

$$\hat P(X=x, Y=y) = \frac{v(X=x, Y=y)}{N} \tag{1}$$

$$\hat P(X=x) = \frac{v(X=x)}{N} \tag{2}$$
where $v(X=x, Y=y)$ is the number of times the sample $(x, y)$ appears in the training data, $v(X=x)$ is the number of times the input $x$ appears, and $N$ is the number of training samples.
In this model, to make the feature values mutually exclusive, the input data is preprocessed. For example, the first sample becomes:

['age_youth', 'working_no', 'house_no', 'credit_situation_1']

A partial result of the statistics $\hat P(X=x, Y=y) = \frac{v(X=x, Y=y)}{N}$:

{('age_youth', 'refuse'): 0.25, ('working_no', 'refuse'): 0.4166666666666667}
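These partial counts can be reproduced directly. A minimal sketch, counting over the first 12 rows of the table (the training split used by the code later in this article):

```python
from collections import Counter

# First 12 rows of the table, after prefixing each value with its column name.
columns = ['age', 'working', 'house', 'credit_situation']
rows = [
    (['youth', 'no', 'no', '1'], 'refuse'),
    (['youth', 'no', 'no', '2'], 'refuse'),
    (['youth', 'yes', 'no', '2'], 'agree'),
    (['youth', 'yes', 'yes', '1'], 'agree'),
    (['youth', 'no', 'no', '1'], 'refuse'),
    (['mid', 'no', 'no', '1'], 'refuse'),
    (['mid', 'no', 'no', '2'], 'refuse'),
    (['mid', 'yes', 'yes', '2'], 'agree'),
    (['mid', 'no', 'yes', '3'], 'agree'),
    (['mid', 'no', 'yes', '3'], 'agree'),
    (['elder', 'no', 'yes', '3'], 'agree'),
    (['elder', 'no', 'yes', '2'], 'agree'),
]
N = len(rows)

# hat P(X=x, Y=y) = v(X=x, Y=y) / N, equation (1), per feature value
counts = Counter((c + '_' + v, y) for x, y in rows for c, v in zip(columns, x))
hat_p_x_y = {key: cnt / N for key, cnt in counts.items()}
```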
A feature function $f(x, y)$ describes some fact relating the input $x$ and the output $y$; it is defined as

$$f(x,y) = \begin{cases} 1, & x \text{ and } y \text{ satisfy the fact} \\ 0, & \text{otherwise} \end{cases}$$

It is a binary-valued function.
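Such a feature function can be sketched as a small closure; `make_feature` and its arguments are illustrative names, not part of the original implementation:

```python
def make_feature(x_value, y_value):
    """Binary feature f(x, y): 1 if the input contains x_value and the label equals y_value."""
    def f(x, y):
        return 1 if (x_value in x and y == y_value) else 0
    return f

# One concrete feature from the example data
f = make_feature('age_youth', 'refuse')
```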
The expectation of the feature function $f(x,y)$ with respect to the empirical distribution $\hat P(X,Y)$ is denoted $E_{\hat P}(f)$:

$$E_{\hat P}(f) = \sum_{x,y} \hat P(x,y) f(x,y) \tag{3}$$
The expectation of $f(x,y)$ with respect to the model $P(Y|X)$ and the empirical distribution $\hat P(X)$ is denoted $E_P(f)$:

$$E_P(f) = \sum_{x,y} \hat P(x) P(y|x) f(x,y) \tag{4}$$
In the implementation, equations (3) and (4) can be stored in exactly the same structure as (1) and (2).
If the model is able to capture the information in the training data, we assume these two expectations are equal:

$$E_{\hat P}(f) = E_P(f) \tag{5}$$
The maximum entropy model:

Suppose all of the constraints are satisfied:

$$E_{\hat P}(f_i) = E_P(f_i), \quad i = 1, 2, \dots, n \tag{6}$$
Define the conditional entropy of the conditional distribution $P(Y|X)$ as:

$$H(P) = -\sum_{x,y} \hat P(x) P(y|x) \log P(y|x) \tag{7}$$

Among the models satisfying the constraints, the one with the largest conditional entropy is called the maximum entropy model.
A brief note on equation (7): for a fixed $x$, the entropy of the conditional distribution $P(\cdot|x)$ is

$$-\sum_y P(y|x) \log P(y|x)$$

Weighting this by $\hat P(x)$ and summing over $x$, i.e. taking its expectation over $x$, yields equation (7).
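A tiny numeric check of equation (7); the empirical weights and conditional distributions below are made up for illustration:

```python
import numpy as np

# Two inputs with empirical weights hat_p, each with a conditional
# distribution over two labels.
hat_p = {'x1': 0.6, 'x2': 0.4}
p_y_x = {'x1': [0.5, 0.5], 'x2': [0.9, 0.1]}

# H(P) = -sum_{x,y} hat_p(x) P(y|x) log P(y|x), equation (7)
H = -sum(hat_p[x] * sum(p * np.log(p) for p in p_y_x[x] if p > 0)
         for x in hat_p)
```

With only two labels, the conditional entropy is bounded above by $\log 2$, reached when every $P(\cdot|x)$ is uniform.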
Learning the maximum entropy model amounts to solving for the model, which can be formalized as a constrained optimization problem:

$$\begin{aligned} \max_{P(y|x)} \quad & H(P) = -\sum_{x,y} \hat P(x) P(y|x) \log P(y|x) \\ s.t. \quad & E_{\hat P}(f_i) = E_P(f_i), \quad i = 1, 2, \dots, n \\ & \sum_y P(y|x) = 1 \end{aligned}$$
Following the convention in optimization, rewrite the maximization as an equivalent minimization:

$$\begin{aligned} \min_{P(y|x)} \quad & -H(P) = \sum_{x,y} \hat P(x) P(y|x) \log P(y|x) \\ s.t. \quad & E_{\hat P}(f_i) = E_P(f_i), \quad i = 1, 2, \dots, n \\ & \sum_y P(y|x) = 1 \end{aligned}$$
Convert the constrained primal problem into an unconstrained dual problem. First, introduce Lagrange multipliers $w_0, w_1, w_2, \dots, w_n$ and define the Lagrangian $L(P, w)$:

$$\begin{aligned} L(P,w) &= -H(P) + w_0\left(\sum_y P(y|x) - 1\right) + \sum_{i=1}^n w_i\left(E_{\hat P}(f_i) - E_P(f_i)\right) \\ &= \sum_{x,y} \hat P(x) P(y|x) \log P(y|x) + w_0\left(\sum_y P(y|x) - 1\right) + \sum_{i=1}^n w_i\left(E_{\hat P}(f_i) - E_P(f_i)\right) \end{aligned} \tag{8}$$
The primal optimization problem is:

$$\min_P \max_w L(P, w)$$

and the dual problem is:

$$\max_w \min_P L(P, w)$$
First, solve the inner minimization of the dual problem, $\min_P L(P, w)$, which is a function of $w$; write it as

$$\Psi(w) = \min_P L(P, w) = L(P_w, w)$$

This is the dual function. Denote its solution by

$$P_w = \arg\min_P L(P, w) = P_w(y|x)$$
Take the derivative of $L(P, w)$ with respect to $P(y|x)$:

$$\begin{aligned} \frac{\partial L(P,w)}{\partial P(y|x)} &= \sum_{x,y} \hat P(x)\left(\log P(y|x) + 1\right) + w_0 - \sum_{x,y}\left(\hat P(x) \sum_{i=1}^n w_i f_i(x,y)\right) \\ &= \sum_{x,y} \hat P(x)\left(\log P(y|x) + 1 + w_0 - \sum_{i=1}^n w_i f_i(x,y)\right) \end{aligned}$$
When $\hat P(x) > 0$, setting the partial derivative to zero gives:

$$P(y|x) = \exp\left(\sum_{i=1}^n w_i f_i(x,y) - 1 - w_0\right) = \frac{\exp\left(\sum_{i=1}^n w_i f_i(x,y)\right)}{\exp(1 - w_0)}$$
Since $\sum_y P(y|x) = 1$:

$$1 = \frac{\sum_y \exp\left(\sum_{i=1}^n w_i f_i(x,y)\right)}{\exp(1 - w_0)}$$

Dividing the two equations above gives:

$$P_w(y|x) = \frac{1}{Z_w(x)} \exp\left(\sum_{i=1}^n w_i f_i(x,y)\right) \tag{9}$$
where

$$Z_w(x) = \sum_y \exp\left(\sum_{i=1}^n w_i f_i(x,y)\right) \tag{10}$$

$Z_w(x)$ is called the normalization factor.
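Equations (9) and (10) amount to a softmax over summed feature weights. A minimal sketch, with made-up weights rather than trained ones:

```python
import numpy as np

# Illustrative weights: only features paired with 'refuse' carry weight here.
w = {('age_youth', 'refuse'): 0.66, ('working_no', 'refuse'): 0.23}
labels = ['refuse', 'agree']

def p_w(x, y):
    """P_w(y|x) per equation (9): exp of summed active weights over Z_w(x)."""
    scores = {lab: np.exp(sum(w.get((x_s, lab), 0.0) for x_s in x))
              for lab in labels}
    z = sum(scores.values())  # Z_w(x), the normalization factor, equation (10)
    return scores[y] / z

x = ['age_youth', 'working_no']
probs = [p_w(x, lab) for lab in labels]
```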
Next, solve the outer maximization of the dual problem,

$$\max_w \Psi(w)$$

Denote its solution by $w^*$:

$$w^* = \arg\max_w \Psi(w)$$
Now substitute (9) and (10) back into (8) in preparation for finding $w^*$:
$$\begin{aligned} \Psi(w) &= \sum_{x,y} \hat P(x) P_w(y|x) \log P_w(y|x) + w_0\left(\sum_y P_w(y|x) - 1\right) + \sum_{i=1}^n w_i\left(E_{\hat P}(f_i) - E_{P_w}(f_i)\right) \\ &= \sum_{x,y} \hat P(x) P_w(y|x) \log P_w(y|x) + \sum_{i=1}^n w_i\left(\sum_{x,y} \hat P(x,y) f_i(x,y) - \sum_{x,y} \hat P(x) P_w(y|x) f_i(x,y)\right) \\ &= \sum_{x,y} \hat P(x,y) \sum_{i=1}^n w_i f_i(x,y) + \sum_{x,y} \hat P(x) P_w(y|x)\left[\log P_w(y|x) - \sum_{i=1}^n w_i f_i(x,y)\right] \\ &= \sum_{x,y} \hat P(x,y) \sum_{i=1}^n w_i f_i(x,y) - \sum_{x,y} \hat P(x) P_w(y|x)\left[\log\left(\exp \sum_{i=1}^n w_i f_i(x,y)\right) - \log P_w(y|x)\right] \\ &= \sum_{x,y} \hat P(x,y) \sum_{i=1}^n w_i f_i(x,y) - \sum_{x,y} \hat P(x) P_w(y|x) \log Z_w(x) \\ &= \sum_{x,y} \hat P(x,y) \sum_{i=1}^n w_i f_i(x,y) - \sum_x \hat P(x) \log Z_w(x) \end{aligned}$$
Finding the maximizer $w^*$ of the dual function:

The idea of IIS (Improved Iterative Scaling) matches the reasoning used to derive the EM algorithm. Suppose the current parameter vector of the maximum entropy model is $w = (w_1, w_2, \dots, w_n)^T$; we look for a new parameter vector $w + \delta = (w_1 + \delta_1, w_2 + \delta_2, \dots, w_n + \delta_n)^T$ that increases the log-likelihood of the model, so the parameters can be solved for by iterating $w \to w + \delta$.
$$\begin{aligned} \Psi(w+\delta) - \Psi(w) &= \left\{ \sum_{x,y} \hat P(x,y) \sum_{i=1}^n (w_i+\delta_i) f_i(x,y) - \sum_x \hat P(x) \log Z_{w+\delta}(x) \right\} \\ &\quad - \left\{ \sum_{x,y} \hat P(x,y) \sum_{i=1}^n w_i f_i(x,y) - \sum_x \hat P(x) \log Z_w(x) \right\} \\ &= \sum_{x,y} \hat P(x,y) \sum_{i=1}^n \delta_i f_i(x,y) - \sum_x \hat P(x) \log \frac{Z_{w+\delta}(x)}{Z_w(x)} \end{aligned}$$
Using the inequality

$$-\log \alpha \ge 1 - \alpha, \quad \alpha > 0$$
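A quick numeric spot-check of this inequality (equality holds at $\alpha = 1$):

```python
import numpy as np

# -log(a) >= 1 - a for all a > 0, with equality exactly at a = 1.
alphas = np.array([0.1, 0.5, 1.0, 2.0, 10.0])
lhs = -np.log(alphas)
rhs = 1 - alphas
```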
we establish a lower bound on the change of the dual function:

$$\begin{aligned} \Psi(w+\delta) - \Psi(w) &= \sum_{x,y} \hat P(x,y) \sum_{i=1}^n \delta_i f_i(x,y) + \sum_x \hat P(x)\left(-\log \frac{Z_{w+\delta}(x)}{Z_w(x)}\right) \\ &\ge \sum_{x,y} \hat P(x,y) \sum_{i=1}^n \delta_i f_i(x,y) + \sum_x \hat P(x)\left(1 - \frac{Z_{w+\delta}(x)}{Z_w(x)}\right) \\ &= \sum_{x,y} \hat P(x,y) \sum_{i=1}^n \delta_i f_i(x,y) + 1 - \sum_x \hat P(x) \frac{Z_{w+\delta}(x)}{Z_w(x)} \end{aligned}$$
where, from (9) and by definition:

$$Z_w(x) = \frac{1}{P_w(y|x)} \exp\left(\sum_{i=1}^n w_i f_i(x,y)\right)$$

$$Z_{w+\delta}(x) = \sum_y \exp\left(\sum_{i=1}^n (w_i+\delta_i) f_i(x,y)\right)$$
so that:

$$\frac{Z_{w+\delta}(x)}{Z_w(x)} = \sum_y P_w(y|x) \exp\left(\sum_{i=1}^n \delta_i f_i(x,y)\right)$$
Therefore:

$$\Psi(w+\delta) - \Psi(w) \ge \sum_{x,y} \hat P(x,y) \sum_{i=1}^n \delta_i f_i(x,y) + 1 - \sum_x \hat P(x) \sum_y P_w(y|x) \exp\left(\sum_{i=1}^n \delta_i f_i(x,y)\right)$$
Denote the right-hand side by:

$$A(\delta|w) = \sum_{x,y} \hat P(x,y) \sum_{i=1}^n \delta_i f_i(x,y) + 1 - \sum_x \hat P(x) \sum_y P_w(y|x) \exp\left(\sum_{i=1}^n \delta_i f_i(x,y)\right) \tag{11}$$
That is, $A(\delta|w)$ is a lower bound on the change of the dual function.
Next, we fix the other variables and optimize a single variable $\delta_i$. To do this, we loosen the lower bound $A(\delta|w)$ further by introducing the quantity

$$f^*(x,y) = \sum_i f_i(x,y)$$

which is the number of features active at $(x, y)$. $A(\delta|w)$ can then be written as:
$$A(\delta|w) = \sum_{x,y} \hat P(x,y) \sum_{i=1}^n \delta_i f_i(x,y) + 1 - \sum_x \hat P(x) \sum_y P_w(y|x) \exp\left[\sum_{i=1}^n \frac{f_i(x,y)}{f^*(x,y)}\left(f^*(x,y)\,\delta_i\right)\right]$$
Since the exponential function is convex, and for every $i$ we have $\frac{f_i(x,y)}{f^*(x,y)} \ge 0$ with $\sum_{i=1}^n \frac{f_i(x,y)}{f^*(x,y)} = 1$, Jensen's inequality gives:

$$\exp\left[\sum_{i=1}^n \frac{f_i(x,y)}{f^*(x,y)}\left(f^*(x,y)\,\delta_i\right)\right] \le \sum_{i=1}^n \frac{f_i(x,y)}{f^*(x,y)} \exp\left(f^*(x,y)\,\delta_i\right)$$
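This Jensen step can be spot-checked numerically; the arrays below play the roles of the weights $f_i/f^*$ (non-negative, summing to 1) and the values $f^*\delta_i$:

```python
import numpy as np

# For convex exp: exp(sum q_i v_i) <= sum q_i exp(v_i) when q_i >= 0, sum q_i = 1.
q = np.array([0.2, 0.3, 0.5])    # stands in for f_i(x,y) / f*(x,y)
v = np.array([1.0, -2.0, 0.5])   # stands in for f*(x,y) * delta_i
lhs = np.exp(np.dot(q, v))
rhs = np.dot(q, np.exp(v))
```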
So we obtain:

$$A(\delta|w) \ge \sum_{x,y} \hat P(x,y) \sum_{i=1}^n \delta_i f_i(x,y) + 1 - \sum_x \hat P(x) \sum_y P_w(y|x) \sum_{i=1}^n \frac{f_i(x,y)}{f^*(x,y)} \exp\left(f^*(x,y)\,\delta_i\right)$$
Denote the right-hand side by

$$B(\delta|w) = \sum_{x,y} \hat P(x,y) \sum_{i=1}^n \delta_i f_i(x,y) + 1 - \sum_x \hat P(x) \sum_y P_w(y|x) \sum_{i=1}^n \frac{f_i(x,y)}{f^*(x,y)} \exp\left(f^*(x,y)\,\delta_i\right) \tag{12}$$

$B(\delta|w)$ is a new, looser lower bound on the change of the dual function.
Take the partial derivative of $B(\delta|w)$ with respect to $\delta_i$:

$$\frac{\partial B(\delta|w)}{\partial \delta_i} = \sum_{x,y} \hat P(x,y) f_i(x,y) - \sum_x \hat P(x) \sum_y P_w(y|x) f_i(x,y) \exp\left(f^*(x,y)\,\delta_i\right)$$
Setting the partial derivative to zero:

$$\sum_{x,y} \hat P(x,y) f_i(x,y) = \sum_x \hat P(x) \sum_y P_w(y|x) f_i(x,y) \exp\left(f^*(x,y)\,\delta_i\right)$$
Substituting (3) and (4) into the equation above (strictly, this step treats $f^*(x,y)$ as a constant so the exponential can be pulled out of the sums; when $f^*$ is not constant across samples, $\delta_i$ has to be solved for numerically instead):

$$E_{\hat P}(f_i) = E_P(f_i) \exp\left(f^*(x,y)\,\delta_i\right)$$
Solving for $\delta_i$:

$$\delta_i = \frac{1}{f^*(x,y)} \log \frac{E_{\hat P}(f_i)}{E_P(f_i)} \tag{13}$$
The parameter $w_i$ can then be obtained by iterating:

$$w_i \leftarrow w_i + \delta_i \tag{14}$$
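A minimal sketch of one update via (13) and (14), assuming $f^*(x,y)$ is a constant $M$ and using made-up expectation values:

```python
import numpy as np

M = 4.0         # f*(x,y): number of active features per sample (assumed constant)
E_hat_p = 0.25  # E_{hat P}(f_i), from the empirical distribution
E_p = 0.20      # E_P(f_i), from the current model

# Equation (13): delta_i = (1/M) log(E_hat_p / E_p); positive here because the
# model underestimates the feature, so its weight should grow.
delta_i = (1.0 / M) * np.log(E_hat_p / E_p)

# Equation (14): w_i <- w_i + delta_i
w_i = 0.5 + delta_i
```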
A partial view of the learned parameters $w$:

{'w': {('age_youth', 'refuse'): 0.6607178829713231, ('working_no', 'refuse'): 0.22746736807343207}}
```python
import numpy as np

np.random.seed(10)


class MyMaxEntropy(object):
    def __init__(self, lr=0.0001):
        """
        A maximum entropy model. For readability, most quantities are stored
        as dictionaries.
        :param lr: learning rate, default 0.0001
        Other attributes:
        w: model parameters (a dict inside self.params)
        N: number of training samples
        label: the output (label) space
        hat_p_x: empirical distribution of the marginal P(X)
        hat_p_x_y: empirical distribution of the joint P(X, Y)
        E_p: expectation of f(x,y) w.r.t. the model P(Y|X) and hatP(X)
        E_hat_p: expectation of f(x,y) w.r.t. the empirical hatP(X, Y)
        eps: a tiny positive value used inside log() to avoid errors
        """
        self.lr = lr
        self.params = {'w': None}
        self.N = None
        self.label = None
        self.hat_p_x = {}
        self.hat_p_x_y = {}
        self.E_p = {}
        self.E_hat_p = {}
        self.eps = np.finfo(np.float32).eps

    def _init_params(self):
        """Randomly initialize the model parameters w."""
        w = {}
        for key in self.hat_p_x_y.keys():
            w[key] = np.random.rand()
        self.params['w'] = w

    def _rebuild_X(self, X):
        """
        Prefix every value with its column name so that feature values from
        different columns cannot collide.
        :param X: raw inputs
        """
        X_result = []
        for x in X:
            X_result.append([y_s + '_' + x_s for x_s, y_s in zip(x, self.X_columns)])
        return X_result

    def _build_mapping(self, X, Y):
        """
        Compute the empirical distributions, equations (1) and (2).
        :param X: training inputs
        :param Y: training outputs
        """
        for x, y in zip(X, Y):
            for x_s in x:
                if x_s in self.hat_p_x.keys():
                    self.hat_p_x[x_s] += 1
                else:
                    self.hat_p_x[x_s] = 1
                if (x_s, y) in self.hat_p_x_y.keys():
                    self.hat_p_x_y[(x_s, y)] += 1
                else:
                    self.hat_p_x_y[(x_s, y)] = 1
        self.hat_p_x = {key: count / self.N for key, count in self.hat_p_x.items()}
        self.hat_p_x_y = {key: count / self.N for key, count in self.hat_p_x_y.items()}

    def _cal_E_hat_p(self):
        """
        Expectation of f(x,y) w.r.t. the empirical distribution hatP(X, Y),
        equation (3).
        """
        self.E_hat_p = self.hat_p_x_y

    def _cal_E_p(self, X):
        """
        Expectation of f(x,y) w.r.t. the model P(Y|X) and the empirical
        distribution hatP(X), equation (4).
        """
        for key in self.params['w'].keys():
            self.E_p[key] = 0
        for x in X:
            p_y_x = self._cal_prob(x)
            for x_s in x:
                for (p_y_x_s, y) in p_y_x:
                    if (x_s, y) not in self.E_p.keys():
                        continue
                    self.E_p[(x_s, y)] += (1 / self.N) * p_y_x_s

    def _cal_p_y_x(self, x, y):
        """
        Unnormalized conditional probability: the exponential part of (9).
        :param x: a single input
        :param y: a single output
        """
        score = 0.0
        for x_s in x:
            score += self.params['w'].get((x_s, y), 0)
        return np.exp(score), y

    def _cal_prob(self, x):
        """
        Conditional probability of the model, equation (9).
        :param x: a single input
        """
        p_y_x = [self._cal_p_y_x(x, y) for y in self.label]
        sum_y = np.sum([p_y_x_s for p_y_x_s, y in p_y_x])  # Z_w(x), equation (10)
        return [(p_y_x_s / sum_y, y) for p_y_x_s, y in p_y_x]

    def fit(self, X, X_columns, Y, label, max_iter=20000):
        """
        Training entry point.
        :param X: training inputs
        :param X_columns: column names of the inputs
        :param Y: training outputs
        :param label: output space of the training data
        :param max_iter: maximum number of iterations
        """
        self.N = len(X)
        self.label = label
        self.X_columns = X_columns
        X = self._rebuild_X(X)
        self._build_mapping(X, Y)
        self._cal_E_hat_p()
        self._init_params()
        for _ in range(max_iter):
            self._cal_E_p(X)
            for key in self.params['w'].keys():
                # Scaled variant of the IIS update (13)-(14): the learning
                # rate lr stands in for the 1/f*(x,y) factor.
                sigma = self.lr * np.log(self.E_hat_p.get(key, self.eps) / self.E_p.get(key, self.eps))
                self.params['w'][key] += sigma

    def predict(self, X):
        """
        Predict the label with the largest conditional probability.
        :param X: samples
        """
        X = self._rebuild_X(X)
        result_list = []
        for x in X:
            max_result = 0
            y_result = self.label[0]
            p_y_x = self._cal_prob(x)
            for (p_y_x_s, y) in p_y_x:
                if p_y_x_s > max_result:
                    max_result = p_y_x_s
                    y_result = y
            result_list.append((max_result, y_result))
        return result_list


def run_my_model():
    data_set = [['youth', 'no', 'no', '1', 'refuse'],
                ['youth', 'no', 'no', '2', 'refuse'],
                ['youth', 'yes', 'no', '2', 'agree'],
                ['youth', 'yes', 'yes', '1', 'agree'],
                ['youth', 'no', 'no', '1', 'refuse'],
                ['mid', 'no', 'no', '1', 'refuse'],
                ['mid', 'no', 'no', '2', 'refuse'],
                ['mid', 'yes', 'yes', '2', 'agree'],
                ['mid', 'no', 'yes', '3', 'agree'],
                ['mid', 'no', 'yes', '3', 'agree'],
                ['elder', 'no', 'yes', '3', 'agree'],
                ['elder', 'no', 'yes', '2', 'agree'],
                ['elder', 'yes', 'no', '2', 'agree'],
                ['elder', 'yes', 'no', '3', 'agree'],
                ['elder', 'no', 'no', '1', 'refuse'],
                ]
    columns = ['age', 'working', 'house', 'credit_situation', 'label']
    X = [i[:-1] for i in data_set]
    X_columns = columns[:-1]
    Y = [i[-1] for i in data_set]
    print(X)
    print(Y)
    my = MyMaxEntropy()
    train_X = X[:12]
    test_X = X[12:]
    train_Y = Y[:12]
    test_Y = Y[12:]
    my.fit(train_X, X_columns, train_Y, label=['refuse', 'agree'])
    print(my.params)
    pred_Y = my.predict(test_X)
    print('result: ')
    print('test: ', test_Y)
    print('pred: ', pred_Y)


if __name__ == '__main__':
    run_my_model()
```
Output:

result:
test: ['agree', 'agree', 'refuse']
pred: [(0.7958750339709215, 'agree'), (0.9026238777607725, 'agree'), (0.7143440316123404, 'refuse')]
References:

*Statistical Learning Methods*, Li Hang.