Machine Learning: The Maximum Entropy Model and a Python Implementation

The Maximum Entropy Model

Introduction: the idea of the maximum entropy model is that when the information is uncertain, we let the entropy reach its maximum; this logic agrees with our everyday intuition.

This article follows Dr. Hang Li's Statistical Learning Methods.

1. Principles of the Maximum Entropy Model

The formulas below are illustrated with the following example dataset:

| ID | age | working | house | credit_situation | label |
| --- | --- | --- | --- | --- | --- |
| 1 | youth | no | no | 1 | refuse |
| 2 | youth | no | no | 2 | refuse |
| 3 | youth | yes | no | 2 | agree |
| 4 | youth | yes | yes | 1 | agree |
| 5 | youth | no | no | 1 | refuse |
| 6 | mid | no | no | 1 | refuse |
| 7 | mid | no | no | 2 | refuse |
| 8 | mid | yes | yes | 2 | agree |
| 9 | mid | no | yes | 3 | agree |
| 10 | mid | no | yes | 3 | agree |
| 11 | elder | no | yes | 3 | agree |
| 12 | elder | no | yes | 2 | agree |
| 13 | elder | yes | no | 2 | agree |
| 14 | elder | yes | no | 3 | agree |
| 15 | elder | no | no | 1 | refuse |
1.1 Definition of the Maximum Entropy Model

$$H(P) = -\sum_x P(x) \log P(x)$$

The larger the entropy, the more uncertain the information. The idea of the maximum entropy model is that when the information is uncertain, we let the entropy reach its maximum; this logic agrees with our everyday intuition.

The maximum entropy model says that, for a given input $X$, the output $Y$ follows the conditional probability $P(Y|X)$.

Given a training dataset:

$$T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$$

From the training dataset we can determine the empirical distribution of the joint distribution $P(X,Y)$ and the empirical distribution of the marginal $P(X)$, denoted $\hat P(X,Y)$ and $\hat P(X)$ respectively:

$$\hat P(X=x, Y=y) = \frac{v(X=x, Y=y)}{N} \tag{1}$$

$$\hat P(X=x) = \frac{v(X=x)}{N} \tag{2}$$

where $v(X=x, Y=y)$ is the number of times the sample $(x,y)$ appears in the training data, $v(X=x)$ is the number of times the input $x$ appears, and $N$ is the number of training samples.

In this implementation, to make the feature values mutually exclusive, the input data is preprocessed. After preprocessing, the first sample's input becomes ['age_youth', 'working_no', 'house_no', 'credit_situation_1'].

A partial result of computing $\hat P(X=x, Y=y) = \frac{v(X=x, Y=y)}{N}$ on the training split:
{('age_youth', 'refuse'): 0.25, ('working_no', 'refuse'): 0.4166666666666667}
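These counts can be reproduced directly from the table. Below is a minimal sketch in pure Python, assuming (as in the test code of Section 2.2) that the first 12 rows serve as the training split:

```python
from collections import Counter

# First 12 rows of the table, preprocessed into mutually exclusive feature names.
rows = [
    (['age_youth', 'working_no', 'house_no', 'credit_situation_1'], 'refuse'),
    (['age_youth', 'working_no', 'house_no', 'credit_situation_2'], 'refuse'),
    (['age_youth', 'working_yes', 'house_no', 'credit_situation_2'], 'agree'),
    (['age_youth', 'working_yes', 'house_yes', 'credit_situation_1'], 'agree'),
    (['age_youth', 'working_no', 'house_no', 'credit_situation_1'], 'refuse'),
    (['age_mid', 'working_no', 'house_no', 'credit_situation_1'], 'refuse'),
    (['age_mid', 'working_no', 'house_no', 'credit_situation_2'], 'refuse'),
    (['age_mid', 'working_yes', 'house_yes', 'credit_situation_2'], 'agree'),
    (['age_mid', 'working_no', 'house_yes', 'credit_situation_3'], 'agree'),
    (['age_mid', 'working_no', 'house_yes', 'credit_situation_3'], 'agree'),
    (['age_elder', 'working_no', 'house_yes', 'credit_situation_3'], 'agree'),
    (['age_elder', 'working_no', 'house_yes', 'credit_situation_2'], 'agree'),
]
N = len(rows)

# Equations (1) and (2): frequency counts divided by N.
hat_p_xy = Counter((f, y) for feats, y in rows for f in feats)
hat_p_x = Counter(f for feats, _ in rows for f in feats)
hat_p_xy = {k: v / N for k, v in hat_p_xy.items()}
hat_p_x = {k: v / N for k, v in hat_p_x.items()}

print(hat_p_xy[('age_youth', 'refuse')])   # 3/12 = 0.25
print(hat_p_xy[('working_no', 'refuse')])  # 5/12 ≈ 0.4167
```

('age_youth', 'refuse') appears in rows 1, 2, and 5, and ('working_no', 'refuse') in rows 1, 2, 5, 6, and 7, which reproduces the two values quoted above.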

A feature function $f(x,y)$ describes a fact relating the input $x$ and the output $y$; it is defined as

$$f(x,y) = \begin{cases} 1, & x \text{ and } y \text{ satisfy the fact} \\ 0, & \text{otherwise} \end{cases}$$

It is a binary-valued function.

The expected value of the feature function $f(x,y)$ with respect to the empirical distribution $\hat P(X,Y)$, written $E_{\hat P}(f)$:

$$E_{\hat P}(f) = \sum_{x,y} \hat P(x,y) f(x,y) \tag{3}$$

The expected value of $f(x,y)$ with respect to the model $P(Y|X)$ and the empirical distribution $\hat P(X)$, written $E_P(f)$:

$$E_P(f) = \sum_{x,y} \hat P(x) P(y|x) f(x,y) \tag{4}$$

In the implementation, (3) and (4) are stored with exactly the same structure as (1) and (2): dictionaries keyed by (feature, label) pairs.

If the model is to capture the information in the training data, we assume these two expectations are equal:

$$E_{\hat P}(f) = E_P(f) \tag{5}$$

The maximum entropy model:

Suppose all the constraints are satisfied:

$$E_{\hat P}(f_i) = E_P(f_i), \quad i = 1, 2, \dots, n \tag{6}$$

Define the conditional entropy of the conditional distribution $P(Y|X)$ as:

$$H(P) = -\sum_{x,y} \hat P(x) P(y|x) \log P(y|x) \tag{7}$$

Among all models satisfying the constraints, the one with the largest conditional entropy is called the maximum entropy model.

A brief note on (7): for a fixed $x$, the entropy of the conditional distribution $P(y|x)$ is

$$H(P \mid x) = -\sum_{y} P(y|x) \log P(y|x)$$

Taking its expectation over $x$ under $\hat P(x)$ gives (7).
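As a quick concrete check of (7): among conditional models, the uniform one attains the largest conditional entropy. A toy sketch with two inputs and two labels (the distributions are made up for illustration):

```python
import numpy as np

hat_p_x = {'a': 0.5, 'b': 0.5}  # empirical marginal P̂(x)

def cond_entropy(p_y_x):
    # Equation (7): H(P) = -sum_{x,y} P̂(x) P(y|x) log P(y|x)
    return -sum(hat_p_x[x] * p * np.log(p) for (x, _y), p in p_y_x.items())

uniform = {('a', '0'): 0.5, ('a', '1'): 0.5, ('b', '0'): 0.5, ('b', '1'): 0.5}
skewed  = {('a', '0'): 0.5, ('a', '1'): 0.5, ('b', '0'): 0.9, ('b', '1'): 0.1}

# The uniform conditional model attains the larger conditional entropy (log 2).
print(cond_entropy(uniform), cond_entropy(skewed))
```

With no constraints, maximizing (7) would therefore pick the uniform conditional model; the constraints (6) are what pull the solution toward the data.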

1.2 Learning the Maximum Entropy Model

Learning the maximum entropy model amounts to solving for it, and this can be formalized as a constrained optimization problem:

$$\begin{aligned} \max_{P(y|x)} \quad & H(P) = -\sum_{x,y} \hat P(x) P(y|x) \log P(y|x) \\ \text{s.t.} \quad & E_{\hat P}(f_i) = E_P(f_i), \quad i = 1, 2, \dots, n \\ & \sum_y P(y|x) = 1 \end{aligned}$$

Following optimization convention, rewrite the maximization as the equivalent minimization:

$$\begin{aligned} \min_{P(y|x)} \quad & -H(P) = \sum_{x,y} \hat P(x) P(y|x) \log P(y|x) \\ \text{s.t.} \quad & E_{\hat P}(f_i) = E_P(f_i), \quad i = 1, 2, \dots, n \\ & \sum_y P(y|x) = 1 \end{aligned}$$

Convert the constrained primal problem into an unconstrained dual problem. First, introduce Lagrange multipliers $w_0, w_1, w_2, \dots, w_n$ and define the Lagrangian $L(P,w)$:

$$\begin{aligned} L(P,w) &= -H(P) + w_0 \Big( \sum_y P(y|x) - 1 \Big) + \sum_{i=1}^n w_i \big( E_{\hat P}(f_i) - E_P(f_i) \big) \\ &= \sum_{x,y} \hat P(x) P(y|x) \log P(y|x) + w_0 \Big( \sum_y P(y|x) - 1 \Big) + \sum_{i=1}^n w_i \big( E_{\hat P}(f_i) - E_P(f_i) \big) \end{aligned} \tag{8}$$

The primal optimization problem is:

$$\min_P \max_w L(P,w)$$

The dual problem is:

$$\max_w \min_P L(P,w)$$

First solve the inner minimization of the dual problem, $\min_P L(P,w)$, which is a function of $w$; write it as:

$$\Psi(w) = \min_P L(P,w) = L(P_w, w)$$

This is the dual function. Meanwhile, write its solution as:

$$P_w = \arg\min_P L(P,w) = P_w(y|x)$$

Take the derivative of $L(P,w)$ with respect to $P(y|x)$:

$$\begin{aligned} \frac{\partial L(P,w)}{\partial P(y|x)} &= \sum_{x,y} \hat P(x) \big( \log P(y|x) + 1 \big) + w_0 - \sum_{x,y} \Big( \hat P(x) \sum_{i=1}^n w_i f_i(x,y) \Big) \\ &= \sum_{x,y} \hat P(x) \Big( \log P(y|x) + 1 + w_0 - \sum_{i=1}^n w_i f_i(x,y) \Big) \end{aligned}$$

Since $\hat P(x) > 0$, setting the partial derivative to zero gives:

$$P(y|x) = \exp\Big( \sum_{i=1}^n w_i f_i(x,y) - 1 - w_0 \Big) = \frac{\exp\big( \sum_{i=1}^n w_i f_i(x,y) \big)}{\exp(1 + w_0)}$$

Since $\sum_y P(y|x) = 1$:

$$1 = \frac{\sum_y \exp\big( \sum_{i=1}^n w_i f_i(x,y) \big)}{\exp(1 + w_0)}$$

Dividing the two preceding equations gives:

$$P_w(y|x) = \frac{1}{Z_w(x)} \exp\Big( \sum_{i=1}^n w_i f_i(x,y) \Big) \tag{9}$$

where

$$Z_w(x) = \sum_y \exp\Big( \sum_{i=1}^n w_i f_i(x,y) \Big) \tag{10}$$

$Z_w(x)$ is called the normalization factor.
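Equations (9) and (10) amount to a softmax over the summed weights of the active features. A minimal sketch with hypothetical weights (the (feature, label) keys mirror the dictionary layout used in Section 2; the numbers are made up):

```python
import numpy as np

# Hypothetical weights w_i, keyed by (feature, label) pairs.
w = {('working_yes', 'agree'): 1.2, ('house_yes', 'agree'): 0.8,
     ('working_yes', 'refuse'): -0.5, ('house_yes', 'refuse'): -0.3}
labels = ['refuse', 'agree']
x = ['working_yes', 'house_yes']  # active features of one sample

# Numerator of (9): exp of the summed weights of features active with label y.
scores = {y: np.exp(sum(w.get((f, y), 0.0) for f in x)) for y in labels}
Z = sum(scores.values())                   # (10): normalization factor Z_w(x)
p = {y: s / Z for y, s in scores.items()}  # (9): P_w(y|x)

print(p)  # the probabilities sum to 1
```

Missing (feature, label) keys contribute weight 0, exactly as a feature function returning 0 would.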
Next, solve the outer maximization of the dual problem:

$$\max_w \Psi(w)$$

Write its solution as $w^*$:

$$w^* = \arg\max_w \Psi(w)$$

Substituting (9) and (10) back into (8) in preparation for solving for $w^*$:

$$\begin{aligned} \Psi(w) &= \sum_{x,y} \hat P(x) P_w(y|x) \log P_w(y|x) + w_0 \Big( \sum_y P_w(y|x) - 1 \Big) + \sum_{i=1}^n w_i \big( E_{\hat P}(f_i) - E_{P_w}(f_i) \big) \\ &= \sum_{x,y} \hat P(x) P_w(y|x) \log P_w(y|x) + \sum_{i=1}^n w_i \Big( \sum_{x,y} \hat P(x,y) f_i(x,y) - \sum_{x,y} \hat P(x) P_w(y|x) f_i(x,y) \Big) \\ &= \sum_{x,y} \hat P(x,y) \sum_{i=1}^n w_i f_i(x,y) + \sum_{x,y} \hat P(x) P_w(y|x) \Big[ \log P_w(y|x) - \sum_{i=1}^n w_i f_i(x,y) \Big] \\ &= \sum_{x,y} \hat P(x,y) \sum_{i=1}^n w_i f_i(x,y) - \sum_{x,y} \hat P(x) P_w(y|x) \Big[ \log \Big( \exp \sum_{i=1}^n w_i f_i(x,y) \Big) - \log P_w(y|x) \Big] \\ &= \sum_{x,y} \hat P(x,y) \sum_{i=1}^n w_i f_i(x,y) - \sum_{x,y} \hat P(x) P_w(y|x) \log Z_w(x) \\ &= \sum_{x,y} \hat P(x,y) \sum_{i=1}^n w_i f_i(x,y) - \sum_x \hat P(x) \log Z_w(x) \end{aligned}$$

1.3 Improved Iterative Scaling

We now seek the maximizer $w^*$ of the dual function.

The idea behind IIS is similar to the derivation of the EM algorithm: suppose the current parameter vector of the maximum entropy model is $w = (w_1, w_2, \dots, w_n)^T$. We look for a new parameter vector $w + \delta = (w_1 + \delta_1, w_2 + \delta_2, \dots, w_n + \delta_n)^T$ that increases the model's log-likelihood (equivalently, the dual function), so that iterating $w \to w + \delta$ solves for the parameters.

$$\begin{aligned} \Psi(w+\delta) - \Psi(w) &= \Big\{ \sum_{x,y} \hat P(x,y) \sum_{i=1}^n (w_i + \delta_i) f_i(x,y) - \sum_x \hat P(x) \log Z_{w+\delta}(x) \Big\} \\ &\quad - \Big\{ \sum_{x,y} \hat P(x,y) \sum_{i=1}^n w_i f_i(x,y) - \sum_x \hat P(x) \log Z_w(x) \Big\} \\ &= \sum_{x,y} \hat P(x,y) \sum_{i=1}^n \delta_i f_i(x,y) - \sum_x \hat P(x) \log \frac{Z_{w+\delta}(x)}{Z_w(x)} \end{aligned}$$

Using the inequality

$$-\log \alpha \ge 1 - \alpha, \quad \alpha > 0$$

we can establish a lower bound on the change in the dual function:

$$\begin{aligned} \Psi(w+\delta) - \Psi(w) &= \sum_{x,y} \hat P(x,y) \sum_{i=1}^n \delta_i f_i(x,y) + \sum_x \hat P(x) \Big( -\log \frac{Z_{w+\delta}(x)}{Z_w(x)} \Big) \\ &\ge \sum_{x,y} \hat P(x,y) \sum_{i=1}^n \delta_i f_i(x,y) + \sum_x \hat P(x) \Big( 1 - \frac{Z_{w+\delta}(x)}{Z_w(x)} \Big) \\ &= \sum_{x,y} \hat P(x,y) \sum_{i=1}^n \delta_i f_i(x,y) + 1 - \sum_x \hat P(x) \frac{Z_{w+\delta}(x)}{Z_w(x)} \end{aligned}$$

where, from (9),

$$Z_w(x) = \frac{1}{P_w(y|x)} \exp\Big( \sum_{i=1}^n w_i f_i(x,y) \Big)$$

and

$$Z_{w+\delta}(x) = \sum_y \exp\Big( \sum_{i=1}^n (w_i + \delta_i) f_i(x,y) \Big)$$

so that:

$$\frac{Z_{w+\delta}(x)}{Z_w(x)} = \sum_y P_w(y|x) \exp\Big( \sum_{i=1}^n \delta_i f_i(x,y) \Big)$$

Therefore:

$$\Psi(w+\delta) - \Psi(w) \ge \sum_{x,y} \hat P(x,y) \sum_{i=1}^n \delta_i f_i(x,y) + 1 - \sum_x \hat P(x) \sum_y P_w(y|x) \exp\Big( \sum_{i=1}^n \delta_i f_i(x,y) \Big)$$

Write the right-hand side as

$$A(\delta|w) = \sum_{x,y} \hat P(x,y) \sum_{i=1}^n \delta_i f_i(x,y) + 1 - \sum_x \hat P(x) \sum_y P_w(y|x) \exp\Big( \sum_{i=1}^n \delta_i f_i(x,y) \Big) \tag{11}$$

$A(\delta|w)$ is a lower bound on the change in the dual function.

Next we fix all but one variable and optimize a single $\delta_i$ at a time. To make this possible, we relax the lower bound $A(\delta|w)$ further. Introduce the quantity

$$f^*(x,y) = \sum_i f_i(x,y)$$

which is the total number of features active at $(x,y)$. Then $A(\delta|w)$ can be rewritten as:

$$A(\delta|w) = \sum_{x,y} \hat P(x,y) \sum_{i=1}^n \delta_i f_i(x,y) + 1 - \sum_x \hat P(x) \sum_y P_w(y|x) \exp\Big[ \sum_{i=1}^n \frac{f_i(x,y)}{f^*(x,y)} \, \delta_i f^*(x,y) \Big]$$

Since the exponential function is convex, and for every $i$ we have $\frac{f_i(x,y)}{f^*(x,y)} \ge 0$ and $\sum_{i=1}^n \frac{f_i(x,y)}{f^*(x,y)} = 1$, Jensen's inequality gives:

$$\exp\Big[ \sum_{i=1}^n \frac{f_i(x,y)}{f^*(x,y)} \big( \delta_i f^*(x,y) \big) \Big] \le \sum_{i=1}^n \frac{f_i(x,y)}{f^*(x,y)} \exp\big( \delta_i f^*(x,y) \big)$$

Hence:

$$A(\delta|w) \ge \sum_{x,y} \hat P(x,y) \sum_{i=1}^n \delta_i f_i(x,y) + 1 - \sum_x \hat P(x) \sum_y P_w(y|x) \sum_{i=1}^n \frac{f_i(x,y)}{f^*(x,y)} \exp\big( \delta_i f^*(x,y) \big)$$
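This Jensen step can be sanity-checked numerically: for nonnegative weights summing to one, the exp of a weighted average never exceeds the weighted average of the exps. A quick randomized check (the values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    q = rng.random(5)
    q /= q.sum()                 # weights f_i/f*: nonnegative, summing to 1
    v = rng.normal(size=5) * 3   # arbitrary values playing the role of delta_i * f*
    lhs = np.exp(np.dot(q, v))   # exp of the weighted average
    rhs = np.dot(q, np.exp(v))   # weighted average of the exps
    assert lhs <= rhs + 1e-12    # Jensen's inequality for the convex exp
print("Jensen inequality held in all trials")
```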

Write the right-hand side as

$$B(\delta|w) = \sum_{x,y} \hat P(x,y) \sum_{i=1}^n \delta_i f_i(x,y) + 1 - \sum_x \hat P(x) \sum_y P_w(y|x) \sum_{i=1}^n \frac{f_i(x,y)}{f^*(x,y)} \exp\big( \delta_i f^*(x,y) \big) \tag{12}$$

$B(\delta|w)$ is a new, looser lower bound on the change in the dual function. Take its partial derivative with respect to $\delta_i$:

$$\frac{\partial B(\delta|w)}{\partial \delta_i} = \sum_{x,y} \hat P(x,y) f_i(x,y) - \sum_x \hat P(x) \sum_y P_w(y|x) f_i(x,y) \exp\big( \delta_i f^*(x,y) \big)$$

Setting the partial derivative to zero:

$$\sum_{x,y} \hat P(x,y) f_i(x,y) = \sum_x \hat P(x) \sum_y P_w(y|x) f_i(x,y) \exp\big( \delta_i f^*(x,y) \big)$$

When $f^*(x,y)$ is a constant $M$ for every sample (as in our example, where each sample activates exactly four features), the exponential factor can be pulled out of the sums; substituting (3) and (4) then gives:

$$E_{\hat P}(f_i) = E_P(f_i) \cdot \exp\big( M \delta_i \big)$$

which solves to:

$$\delta_i = \frac{1}{M} \log \frac{E_{\hat P}(f_i)}{E_P(f_i)} \tag{13}$$

The parameters $w_i$ are then obtained iteratively:

$$w_i \leftarrow w_i + \delta_i \tag{14}$$

A partial view of the learned parameters $w$:
{'w': {('age_youth', 'refuse'): 0.6607178829713231, ('working_no', 'refuse'): 0.22746736807343207}}
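A single update of (13)-(14) can be sketched as follows; the expectation values here are made up, and M denotes the constant number of active features per sample (4 in the example table):

```python
import numpy as np

M = 4          # every sample in the example activates exactly 4 features
# Hypothetical expectations for one feature f_i:
E_hat = 0.25   # empirical expectation E_hatP(f_i), equation (3)
E_p = 0.18     # model expectation E_P(f_i), equation (4)

delta = (1.0 / M) * np.log(E_hat / E_p)  # equation (13)
w_i = 0.1
w_i += delta                             # equation (14)

# The weight increases when the model underestimates the feature,
# pushing E_P(f_i) up toward E_hatP(f_i) on the next iteration.
print(delta, w_i)
```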

2. Implementing the Maximum Entropy Model in Python
2.1 Model implementation
import numpy as np
np.random.seed(10)


class MyMaxEntropy(object):
    def __init__(self, lr=0.0001):
        """
        最大熵模型的实现,为了方便理解,尽可能的将参数都存储为字典形式
        :param lr: 学习率,默认值为0.0001

        其他参数:
        :param w: 模型的参数,字典
        :param N: 样本数量
        :param label: 标签空间
        :param hat_p_x: 边缘分布P(X)的经验分布
        :param hat_p_x_y: 联合分布P(X,Y)的经验分布
        :param E_p: 特征函数f(x,y)关于模型P(X|Y)与经验分布hatP(X)的期望值
        :param E_hat_p: 特征函数f(x,y)关于经验分布hatP(X,Y)的期望值
        :param eps: 一个接近于0的正数极小值,这个值放在log的计算中,防止报错
        """
        self.lr = lr
        self.params = {'w': None}

        self.N = None
        self.label = None

        self.hat_p_x = {}
        self.hat_p_x_y = {}

        self.E_p = {}
        self.E_hat_p = {}

        self.eps = np.finfo(np.float32).eps


    def _init_params(self):
        """
        随机初始化模型参数w
        :return:
        """
        w = {}
        for key in self.hat_p_x_y.keys():
            w[key] = np.random.rand()
        self.params['w'] = w

    def _rebuild_X(self, X):
        """
        为了自变量的差异化处理,重新命名自变量
        :param X: 原始自变量
        :return:
        """
        X_result = []
        for x in X:
            X_result.append([y_s + '_' + x_s for x_s, y_s in zip(x, self.X_columns)])
        return X_result

    def _build_mapping(self, X, Y):
        """
        求取经验分布,参照公式(1)(2)
        :param X: 训练样本的输入值
        :param Y: 训练样本的输出值
        :return:
        """
        for x, y in zip(X, Y):
            for x_s in x:
                if x_s in self.hat_p_x.keys():
                    self.hat_p_x[x_s] += 1
                else:
                    self.hat_p_x[x_s] = 1
                if (x_s, y) in self.hat_p_x_y.keys():
                    self.hat_p_x_y[(x_s, y)] += 1
                else:
                    self.hat_p_x_y[(x_s, y)] = 1

        self.hat_p_x = {key: count / self.N for key, count in self.hat_p_x.items()}
        self.hat_p_x_y = {key: count / self.N for key, count in self.hat_p_x_y.items()}

    def _cal_E_hat_p(self):
        """
        计算特征函数f(x,y)关于经验分布hatP(X,Y)的期望值,参照公式(3)
        :return:
        """
        self.E_hat_p = self.hat_p_x_y


    def _cal_E_p(self, X):
        """
        计算特征函数f(x,y)关于模型P(X|Y)与经验分布hatP(X)的期望值,参照公式(4)
        :param X:
        :return:
        """
        for key in self.params['w'].keys():
            self.E_p[key] = 0
        for x in X:
            p_y_x = self._cal_prob(x)
            for x_s in x:
                for (p_y_x_s, y) in p_y_x:
                    if (x_s, y) not in self.E_p.keys():
                        continue
                    self.E_p[(x_s, y)] += (1/self.N) * p_y_x_s

    def _cal_p_y_x(self, x, y):
        """
        计算模型条件概率值,参照公式(9)的指数部分
        :param x: 单个样本的输入值
        :param y: 单个样本的输出值
        :return:
        """

        s = 0.0
        for x_s in x:
            s += self.params['w'].get((x_s, y), 0)
        return np.exp(s), y


    def _cal_prob(self, x):
        """
        计算模型条件概率值,参照公式(9)
        :param x: 单个样本的输入值
        :return:
        """
        p_y_x = [(self._cal_p_y_x(x, y)) for y in self.label]
        sum_y = np.sum([p_y_x_s for p_y_x_s, y in p_y_x])
        return [(p_y_x_s / sum_y, y) for p_y_x_s, y in p_y_x]


    def fit(self, X, X_columns, Y, label, max_iter=20000):
        """
        模型训练入口
        :param X: 训练样本输入值
        :param X_columns: 训练样本的columns
        :param Y: 训练样本的输出值
        :param label: 训练样本的输出空间
        :param max_iter: 最大训练次数
        :return:
        """
        self.N = len(X)
        self.label = label
        self.X_columns = X_columns

        X = self._rebuild_X(X)

        self._build_mapping(X, Y)

        self._cal_E_hat_p()

        self._init_params()

        for iter in range(max_iter):

            self._cal_E_p(X)

            for key in self.params['w'].keys():
                # update following equations (13)-(14); lr plays the role of 1/f*
                sigma = self.lr * np.log(self.E_hat_p.get(key, self.eps) / self.E_p.get(key, self.eps))
                self.params['w'][key] += sigma

    def predict(self, X):
        """
        预测结果
        :param X: 样本
        :return:
        """
        X = self._rebuild_X(X)
        result_list = []

        for x in X:
            max_result = 0
            y_result = self.label[0]
            p_y_x = self._cal_prob(x)
            for (p_y_x_s, y) in p_y_x:
                if p_y_x_s > max_result:
                    max_result = p_y_x_s
                    y_result = y
            result_list.append((max_result, y_result))
        return result_list
2.2 Model test
def run_my_model():
    data_set = [['youth', 'no', 'no', '1', 'refuse'],
               ['youth', 'no', 'no', '2', 'refuse'],
               ['youth', 'yes', 'no', '2', 'agree'],
               ['youth', 'yes', 'yes', '1', 'agree'],
               ['youth', 'no', 'no', '1', 'refuse'],
               ['mid', 'no', 'no', '1', 'refuse'],
               ['mid', 'no', 'no', '2', 'refuse'],
               ['mid', 'yes', 'yes', '2', 'agree'],
               ['mid', 'no', 'yes', '3', 'agree'],
               ['mid', 'no', 'yes', '3', 'agree'],
               ['elder', 'no', 'yes', '3', 'agree'],
               ['elder', 'no', 'yes', '2', 'agree'],
               ['elder', 'yes', 'no', '2', 'agree'],
               ['elder', 'yes', 'no', '3', 'agree'],
               ['elder', 'no', 'no', '1', 'refuse'],
               ]
    columns = ['age', 'working', 'house', 'credit_situation', 'label']
    X = [i[:-1] for i in data_set]
    X_columns = columns[:-1]
    Y = [i[-1] for i in data_set]
    print(X)
    print(Y)

    my = MyMaxEntropy()
    train_X = X[:12]
    test_X = X[12:]
    train_Y = Y[:12]
    test_Y = Y[12:]
    my.fit(train_X, X_columns, train_Y, label=['refuse', 'agree'])

    print(my.params)

    pred_Y= my.predict(test_X)
    print('result: ')
    print('test: ', test_Y)
    print('pred: ', pred_Y)

Result:
result: 
test:  ['agree', 'agree', 'refuse']
pred:  [(0.7958750339709215, 'agree'), (0.9026238777607725, 'agree'), (0.7143440316123404, 'refuse')]

References:
Statistical Learning Methods, by Hang Li
