Before diving into the maximum entropy model, we first need to understand a concept that goes hand in hand with it: the exponential family.
The exponential family is a family of distributions that includes many of the probability distributions we encounter every day. Both the Bernoulli distribution (a representative discrete distribution) and the Gaussian distribution (a representative continuous distribution) belong to it. Abstracting these distributions into the exponential family exposes shared properties that make certain problems easier to solve. The general form of the exponential family is:
$$
p(x \mid \theta) = h(x)\exp\left(\theta^{\mathrm{T}}\phi(x) - A(\theta)\right)
$$
Here $h(x)$ is usually a simple multiplicative term, most often a constant; $\theta$ denotes the model parameters; $\phi(x)$ is called the sufficient statistic; and $A(\theta)$ is called the log partition function, which normalizes the probabilities. Concretely:
$$
A(\theta) = \log \int_{x \in X} h(x)\exp\left(\theta^{\mathrm{T}}\phi(x)\right)\mathrm{d}x
$$
So how do common distributions transform into this form?
The Bernoulli distribution has the form:
$$
P(X = x) = p^{x}(1-p)^{1-x}
$$
where $p$ is the probability that $X = 1$. A few manipulations give:
$$
\begin{aligned}
P(X=x) &= \exp\left(\log\left(p^{x}(1-p)^{1-x}\right)\right) \\
&= \exp\left(x\log p + (1-x)\log(1-p)\right) \\
&= \exp\left(x\left(\log p - \log(1-p)\right) + \log(1-p)\right) \\
&= \exp\left(x\log\frac{p}{1-p} - \log\frac{1}{1-p}\right)
\end{aligned}
$$
Here $h(x) = 1$, $\phi(x) = x$, $\theta = \log\frac{p}{1-p}$, and $A(\theta) = \log\frac{1}{1-p} = \log\left(1 + \frac{p}{1-p}\right) = \log\left(1 + e^{\theta}\right)$.
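As a quick sanity check, here is a minimal sketch (plain Python; the function names are my own) that evaluates the Bernoulli probability both directly and through the exponential-family form $h(x)\exp(\theta\phi(x) - A(\theta))$; the two should agree:

```python
import math

def bernoulli_direct(x, p):
    """P(X=x) = p^x * (1-p)^(1-x), computed directly."""
    return p ** x * (1 - p) ** (1 - x)

def bernoulli_expfam(x, p):
    """Same probability via h(x) * exp(theta * phi(x) - A(theta))."""
    theta = math.log(p / (1 - p))      # natural parameter
    A = math.log(1 + math.exp(theta))  # log partition function, = log(1/(1-p))
    h, phi = 1.0, x                    # base measure and sufficient statistic
    return h * math.exp(theta * phi - A)

p = 0.3
for x in (0, 1):
    print(x, bernoulli_direct(x, p), bernoulli_expfam(x, p))  # pairs match
```

Similarly, the Gaussian distribution can be rewritten in exponential-family form: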
$$
\begin{aligned}
N\left(x \mid \mu, \sigma^{2}\right) &= \frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left[-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right] \\
&= \frac{1}{\sqrt{2\pi}}\exp\left[-\log\sigma - \frac{(x-\mu)^{2}}{2\sigma^{2}}\right] \\
&= \frac{1}{\sqrt{2\pi}}\exp\left[-\log\sigma - \frac{1}{2\sigma^{2}}\left(x^{2} - 2x\mu + \mu^{2}\right)\right] \\
&= \frac{1}{\sqrt{2\pi}}\exp\left[-\log\sigma - \frac{\mu^{2}}{2\sigma^{2}} + \frac{\mu}{\sigma^{2}}x - \frac{1}{2\sigma^{2}}x^{2}\right]
\end{aligned}
$$
Here $h(x) = \frac{1}{\sqrt{2\pi}}$, $\theta = \left[\frac{\mu}{\sigma^{2}}, -\frac{1}{2\sigma^{2}}\right]$, and $\phi(x) = \left[x, x^{2}\right]$. For $A(\theta)$ we have:
$$
\begin{aligned}
A(\theta) &= \log\sigma + \frac{\mu^{2}}{2\sigma^{2}} \\
&= \log\sqrt{\sigma^{2}} + \left(\frac{\mu}{\sigma^{2}}\right)^{2}\times\frac{1}{2}\sigma^{2} \\
&= -\log\sqrt{\frac{1}{\sigma^{2}}} + \theta_{1}^{2}\times\frac{1}{2\times\frac{1}{\sigma^{2}}} \\
&= -\log\sqrt{-2\times\left(-\frac{1}{2\sigma^{2}}\right)} - \theta_{1}^{2}\times\frac{1}{4\times\left(-\frac{1}{2\sigma^{2}}\right)} \\
&= -\log\sqrt{-2\theta_{2}} - \theta_{1}^{2}\times\frac{1}{4\theta_{2}} \\
&= -\frac{1}{2}\log\left(-2\theta_{2}\right) - \frac{\theta_{1}^{2}}{4\theta_{2}}
\end{aligned}
$$
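To make the parameter mapping concrete, the following sketch (my own illustration) evaluates the Gaussian density both directly and via $h(x)\exp(\theta^{\mathrm{T}}\phi(x) - A(\theta))$ with $\theta = \left[\mu/\sigma^{2}, -1/(2\sigma^{2})\right]$:

```python
import math

def gaussian_direct(x, mu, sigma2):
    """N(x | mu, sigma^2) evaluated from the usual formula."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def gaussian_expfam(x, mu, sigma2):
    """Same density via the exponential-family parameterization."""
    theta1, theta2 = mu / sigma2, -1 / (2 * sigma2)  # natural parameters
    A = -0.5 * math.log(-2 * theta2) - theta1 ** 2 / (4 * theta2)
    h = 1 / math.sqrt(2 * math.pi)                   # base measure
    return h * math.exp(theta1 * x + theta2 * x ** 2 - A)

print(gaussian_direct(0.7, mu=1.0, sigma2=2.0))
print(gaussian_expfam(0.7, mu=1.0, sigma2=2.0))  # should match
```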
The log partition function has a remarkable property: differentiating it with respect to $\theta$ yields the moments of the sufficient statistic $\phi(x)$. Taking the first derivative:

$$
\begin{aligned}
\frac{\partial A(\theta)}{\partial \theta} &= \frac{\partial}{\partial \theta}\left(\log \int h(x)\exp\left\{\theta^{\mathrm{T}}\phi(x)\right\}\mathrm{d}x\right) \\
&= \frac{\int h(x)\exp\left\{\theta^{\mathrm{T}}\phi(x)\right\}\phi(x)\,\mathrm{d}x}{\exp(A(\theta))} \\
&= \int h(x)\exp\left\{\theta^{\mathrm{T}}\phi(x) - A(\theta)\right\}\phi(x)\,\mathrm{d}x \\
&= \int p(x \mid \theta)\phi(x)\,\mathrm{d}x \\
&= E_{\theta}[\phi(x)]
\end{aligned}
$$
Taking the second derivative:

$$
\begin{aligned}
\frac{\partial^{2} A(\theta)}{\partial \theta\,\partial \theta^{\mathrm{T}}} &= \int h(x)\exp\left(\theta^{\mathrm{T}}\phi(x) - A(\theta)\right)\phi(x)\left(\phi(x) - \frac{\partial A(\theta)}{\partial \theta}\right)\mathrm{d}x \\
&= \int p(x \mid \theta)\phi(x)\left(\phi(x) - E_{\theta}[\phi(x)]\right)\mathrm{d}x \\
&= \int p(x \mid \theta)\phi^{2}(x)\,\mathrm{d}x - E_{\theta}[\phi(x)]\int p(x \mid \theta)\phi(x)\,\mathrm{d}x \\
&= E_{\theta}\left[\phi^{2}(x)\right] - E_{\theta}^{2}[\phi(x)] \\
&= \operatorname{Var}_{\theta}[\phi(x)]
\end{aligned}
$$
Therefore, once a distribution is expressed in exponential-family form, the expectation and variance of $\phi(x)$ can be obtained in two ways: directly, by integrating against the density, or indirectly, by differentiating the log partition function. In practice, pick whichever is more convenient.
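For example, the Bernoulli distribution has $A(\theta) = \log(1 + e^{\theta})$, so $A'(\theta)$ should equal $E[\phi(x)] = p$ and $A''(\theta)$ should equal $\operatorname{Var}[\phi(x)] = p(1-p)$. The sketch below (my own, using simple finite differences) checks this numerically:

```python
import math

p = 0.3
theta = math.log(p / (1 - p))
A = lambda t: math.log(1 + math.exp(t))  # Bernoulli log partition function

eps = 1e-5
dA  = (A(theta + eps) - A(theta - eps)) / (2 * eps)                # A'(theta)
d2A = (A(theta + eps) - 2 * A(theta) + A(theta - eps)) / eps ** 2  # A''(theta)

print(dA,  p)            # A'(theta)  == E[phi(x)]   = p
print(d2A, p * (1 - p))  # A''(theta) == Var[phi(x)] = p(1-p)
```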
From here on, for convenience, we write the exponential family in the following simpler form:
$$
p(x \mid \theta) = \frac{1}{Z(\theta)}h(x)\,\mathrm{e}^{\theta^{\mathrm{T}}\phi(x)}
$$

where $Z(\theta) = \exp(A(\theta))$ is the partition function.
Now let's turn to entropy itself. Let the random variable $X$ have probability distribution $P(X)$; its entropy is:
$$
H(P) = -\sum_{X} P(X)\log P(X)
$$
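A direct computation of this formula (a small sketch of my own; note that $0\log 0$ is taken as $0$):

```python
import math

def entropy(P):
    """H(P) = -sum_x P(x) * log P(x); terms with P(x)=0 contribute 0."""
    return -sum(p * math.log(p) for p in P if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform: log(4) ~= 1.386, the maximum
print(entropy([0.70, 0.10, 0.10, 0.10]))  # skewed distribution: lower entropy
```

Among distributions over the same support, the uniform one attains the largest entropy, which is exactly the intuition behind the principle discussed next.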
When we build a probabilistic model, it usually comes with constraints. Often these constraints do not pin down a unique model: many models satisfy them. Such models behave essentially identically on the constrained part of the space, but differ on the unconstrained subspace. The maximum entropy idea says that the unconstrained subspace should receive equal probability, and this determines a unique distribution.
The maximum entropy principle is therefore: among all models satisfying the known constraints, choose the one with the largest entropy. (By contrast, a decision tree keeps reducing the uncertainty about which class an instance belongs to until it can assign a suitable class; since that is a process of steadily decreasing uncertainty, it chooses the split that minimizes entropy.)
We usually want the model to learn the distributional characteristics of the data, so we can impose the constraint that the feature expectation under the model distribution equals the feature expectation under the empirical (training sample) distribution. These two expectations are, respectively:
$$
p(f) = \sum_{x, y}\tilde{p}(x)\,p(y \mid x)\,f(x, y)
$$
$$
\tilde{p}(f) = \sum_{x, y}\tilde{p}(x, y)f(x, y) = \sum_{x, y}\frac{1}{N}\operatorname{count}(x, y)f(x, y)
$$
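For instance, the empirical expectation can be computed from counts, as in the following sketch (the toy data and the feature $f$ are made-up illustrations):

```python
from collections import Counter

# toy training sample of (x, y) pairs (made up for illustration)
data = [("sunny", "play"), ("sunny", "play"), ("rainy", "stay"), ("sunny", "stay")]
N = len(data)

def f(x, y):
    """An example indicator feature: fires when x='sunny' and y='play'."""
    return 1.0 if (x, y) == ("sunny", "play") else 0.0

# empirical expectation: sum over (x,y) of count(x,y)/N * f(x,y)
counts = Counter(data)
p_tilde_f = sum(c / N * f(x, y) for (x, y), c in counts.items())
print(p_tilde_f)  # 2/4 = 0.5
```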
This gives us a constraint:
$$
\sum_{x, y}\tilde{p}(x, y)f(x, y) = \sum_{x, y}\tilde{p}(x)\,p(y \mid x)\,f(x, y)
$$
At the same time, we want the model distribution to maximize the entropy $H(p)$:
$$
H(p) = -\sum_{x, y}\tilde{p}(x)\,p(y \mid x)\log p(y \mid x)
$$
The problem has now become a constrained optimization problem, with objective and constraints:
$$
\begin{aligned}
p^{*} = \operatorname{argmax}_{p} H(p) &= \operatorname{argmax}_{p}\left(-\sum_{x, y}\tilde{p}(x)\,p(y \mid x)\log p(y \mid x)\right) \\
\text{s.t.}\quad & p(y \mid x) \geqslant 0, \quad \forall x, y \\
& \sum_{y}p(y \mid x) = 1, \quad \forall x \\
& \sum_{x, y}\tilde{p}(x)\,p(y \mid x)\,f_{i}(x, y) = \sum_{x, y}\tilde{p}(x, y)f_{i}(x, y), \quad i \in \{1, 2, \ldots, n\}
\end{aligned}
$$
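Before solving this program analytically, we can sanity-check it numerically. The sketch below (my own illustration, using scipy's general-purpose SLSQP solver rather than any maxent-specific routine) maximizes entropy for a single context $x$ with three labels under one made-up feature constraint:

```python
import numpy as np
from scipy.optimize import minimize

# A single context x (so p~(x) = 1), three labels y, and one feature whose
# empirical expectation is 0.6 -- all values are made up for illustration.
f = np.array([1.0, 0.0, 0.0])   # f(x, y) evaluated at each label y
target = 0.6                    # empirical feature expectation p~(f)

neg_entropy = lambda p: float(np.sum(p * np.log(p)))  # we minimize -H(p)
constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},   # sum_y p(y|x) = 1
    {"type": "eq", "fun": lambda p: p @ f - target},  # feature constraint
]
result = minimize(neg_entropy, x0=np.full(3, 1.0 / 3), method="SLSQP",
                  bounds=[(1e-9, 1.0)] * 3, constraints=constraints)
print(result.x)  # ~= [0.6, 0.2, 0.2]
```

The solver puts exactly the constrained mass on the first label and spreads the rest uniformly, just as the maximum entropy principle predicts.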
How do we solve for $p$? Using the method of Lagrange multipliers, we form the Lagrangian $\xi(p, \Lambda, \gamma)$:
$$
\begin{aligned}
\xi(p, \Lambda, \gamma) =& -\sum_{x, y}\tilde{p}(x)\,p(y \mid x)\log p(y \mid x) \\
&+ \sum_{i=1}^{n}\lambda_{i}\left(\sum_{x, y}\tilde{p}(x)\,p(y \mid x)\,f_{i}(x, y) - \sum_{x, y}\tilde{p}(x, y)f_{i}(x, y)\right) \\
&+ \gamma\left(\sum_{y}p(y \mid x) - 1\right)
\end{aligned}
$$
where $\Lambda$ denotes the set of multipliers $\lambda_{1}, \cdots, \lambda_{n}$. Holding $\Lambda$ and $\gamma$ fixed and differentiating with respect to $p$, we get:
$$
\frac{\partial \xi}{\partial p(y \mid x)} = -\tilde{p}(x)\left(\log p(y \mid x) + 1\right) + \sum_{i=1}^{n}\lambda_{i}\tilde{p}(x)f_{i}(x, y) + \gamma
$$
Setting the derivative to zero yields the extremum of $p$:
$$
-\tilde{p}(x)\left(\log p(y \mid x) + 1\right) + \sum_{i=1}^{n}\lambda_{i}\tilde{p}(x)f_{i}(x, y) + \gamma = 0
$$
Rearranging:
$$
p(y \mid x) = \exp\left(\sum_{i=1}^{n}\lambda_{i}f_{i}(x, y)\right)\exp\left(\frac{\gamma}{\tilde{p}(x)} - 1\right)
$$
Applying the earlier normalization constraint $\sum_{y} p(y \mid x) = 1, \forall x$ once more and substituting:
$$
\begin{aligned}
\sum_{y} p(y \mid x) &= \sum_{y}\exp\left(\sum_{i=1}^{n}\lambda_{i}f_{i}(x, y)\right)\exp\left(\frac{\gamma}{\tilde{p}(x)} - 1\right) \\
\Rightarrow\quad 1 &= \sum_{y}\exp\left(\sum_{i=1}^{n}\lambda_{i}f_{i}(x, y)\right)\exp\left(\frac{\gamma}{\tilde{p}(x)} - 1\right) \\
\Rightarrow\quad \exp\left(\frac{\gamma}{\tilde{p}(x)} - 1\right) &= \frac{1}{\sum_{y}\exp\left(\sum_{i=1}^{n}\lambda_{i}f_{i}(x, y)\right)}
\end{aligned}
$$
Now $p(y \mid x)$ can be written as:
$$
p(y \mid x) = \exp\left(\sum_{i=1}^{n}\lambda_{i}f_{i}(x, y)\right)\frac{1}{\sum_{y}\exp\left(\sum_{i=1}^{n}\lambda_{i}f_{i}(x, y)\right)}
$$
Defining $Z(x)$ as:
$$
Z(x) = \sum_{y}\exp\left(\sum_{i=1}^{n}\lambda_{i}f_{i}(x, y)\right)
$$
the optimal model is:
$$
p(y \mid x) = \frac{1}{Z(x)}\exp\left(\sum_{i=1}^{n}\lambda_{i}f_{i}(x, y)\right)
$$
After the Lagrange-multiplier computation, the variable has changed from the model $p$ to the multipliers $\lambda$. Each multiplier $\lambda_i$ now behaves like the weight of its feature: the model multiplies each feature by its weight to judge the probability that the data belongs to a given $y$.
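In code, the resulting model is just a softmax over weighted feature sums. A minimal sketch (the feature functions and weights here are made-up illustrations):

```python
import math

def predict(x, labels, features, lam):
    """p(y|x) = exp(sum_i lam_i * f_i(x, y)) / Z(x)."""
    scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lam, features)))
              for y in labels}
    Z = sum(scores.values())               # Z(x) normalizes over all labels
    return {y: s / Z for y, s in scores.items()}

# illustrative feature functions and weights (made up)
features = [lambda x, y: 1.0 if (x, y) == ("sunny", "play") else 0.0,
            lambda x, y: 1.0 if (x, y) == ("rainy", "stay") else 0.0]
lam = [1.2, 0.8]
print(predict("sunny", ["play", "stay"], features, lam))
```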
So how do we estimate the parameters $\lambda$? Via maximum likelihood. The model expands to:
$$
p(y \mid x, \lambda) = \frac{\exp\left[\sum_{i}\lambda_{i}f_{i}(x, y)\right]}{\sum_{y^{\prime}}\exp\left[\sum_{i}\lambda_{i}f_{i}\left(x, y^{\prime}\right)\right]}
$$
To maximize the log-likelihood $\log p(y \mid x, \lambda)$, we first expand it over the training set:
$$
\begin{aligned}
\log p(Y \mid X) &= \sum_{(x, y) \in (X, Y)}\log p(y \mid x, \lambda) \\
&= \sum_{(x, y) \in (X, Y)}\log\frac{\exp\left[\sum_{i}\lambda_{i}f_{i}(x, y)\right]}{\sum_{y^{\prime}}\exp\left[\sum_{i}\lambda_{i}f_{i}\left(x, y^{\prime}\right)\right]} \\
&= \sum_{(x, y) \in (X, Y)}\sum_{i}\lambda_{i}f_{i}(x, y) - \sum_{(x, y) \in (X, Y)}\log\sum_{y^{\prime}}\exp\left[\sum_{i}\lambda_{i}f_{i}\left(x, y^{\prime}\right)\right]
\end{aligned}
$$
Differentiating with respect to $\lambda_{i}$ (writing the inner sums over features with index $j$ to avoid a clash with the differentiation index):
$$
\begin{aligned}
\frac{\partial \log p(Y \mid X)}{\partial \lambda_{i}}
&= \sum_{(x, y) \in (X, Y)} f_{i}(x, y) - \sum_{(x, y) \in (X, Y)}\frac{1}{\sum_{y^{\prime\prime}}\exp\left[\sum_{j}\lambda_{j}f_{j}\left(x, y^{\prime\prime}\right)\right]}\frac{\partial\sum_{y^{\prime}}\exp\left[\sum_{j}\lambda_{j}f_{j}\left(x, y^{\prime}\right)\right]}{\partial \lambda_{i}} \\
&= \sum_{(x, y) \in (X, Y)} f_{i}(x, y) - \sum_{(x, y) \in (X, Y)}\frac{\sum_{y^{\prime}}\exp\left[\sum_{j}\lambda_{j}f_{j}\left(x, y^{\prime}\right)\right]f_{i}\left(x, y^{\prime}\right)}{\sum_{y^{\prime\prime}}\exp\left[\sum_{j}\lambda_{j}f_{j}\left(x, y^{\prime\prime}\right)\right]} \\
&= \sum_{(x, y) \in (X, Y)} f_{i}(x, y) - \sum_{(x, y) \in (X, Y)}\sum_{y^{\prime}}\frac{\exp\left[\sum_{j}\lambda_{j}f_{j}\left(x, y^{\prime}\right)\right]}{\sum_{y^{\prime\prime}}\exp\left[\sum_{j}\lambda_{j}f_{j}\left(x, y^{\prime\prime}\right)\right]}f_{i}\left(x, y^{\prime}\right) \\
&= \sum_{(x, y) \in (X, Y)} f_{i}(x, y) - \sum_{(x, y) \in (X, Y)}\sum_{y^{\prime}}p\left(y^{\prime} \mid x, \lambda\right)f_{i}\left(x, y^{\prime}\right)
\end{aligned}
$$
Notice that the gradient of the log-likelihood is the difference between the feature expectation under the training data distribution and the feature expectation under the model. When the gradient is zero, the constraint is satisfied exactly:
$$
\sum_{(x, y) \in (X, Y)} f_{i}(x, y) = \sum_{(x, y) \in (X, Y)}\sum_{y^{\prime}}p\left(y^{\prime} \mid x, \lambda\right)f_{i}\left(x, y^{\prime}\right)
$$
So we can update the parameters $\lambda$ iteratively by gradient ascent on the log-likelihood (equivalently, gradient descent on the negative log-likelihood).
We have now derived the gradient of the maximum entropy objective. Suppose $X$ takes $dim_{x}$ distinct values and $Y$ takes $dim_{y}$ distinct values. We can then define $dim_{x} \times dim_{y}$ features, each indicating whether a particular $(x, y)$ pair is present:
$$
f_{i, j}(x, y) = \begin{cases} 1 & i = x \text{ and } j = y \\ 0 & \text{otherwise} \end{cases}
$$
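These indicator features are easy to generate programmatically, for example (a sketch with made-up value sets):

```python
def make_indicator(i, j):
    """Returns f_{i,j}(x, y): fires exactly when x == i and y == j."""
    return lambda x, y: 1.0 if (x, y) == (i, j) else 0.0

# one indicator per (x value, y value) pair -> dim_x * dim_y features
xs, ys = ["sunny", "rainy"], ["play", "stay"]
features = [make_indicator(i, j) for i in xs for j in ys]
```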
The model has $dim_{x} \times dim_{y}$ parameters $\lambda_{i,j}$, each the weight of the corresponding feature. Given a sample $x$ and its active features, the probability of outcome $y_{j}$ is:
$$
p\left(y_{j} \mid x\right) = \frac{\exp\left(\sum_{i}\lambda_{i, j}f_{i, j}\left(x_{i}, y_{j}\right)\right)}{\sum_{j^{\prime}}\exp\left(\sum_{i}\lambda_{i, j^{\prime}}f_{i, j^{\prime}}\left(x_{i}, y_{j^{\prime}}\right)\right)}
$$
With this probability in hand, we can compute a single sample's expected contribution to every feature; the contribution to feature $(i, j)$ is:
$$
f_{i, j}(x) = \sum_{j^{\prime}}p\left(y_{j^{\prime}} \mid x\right)f_{i, j}\left(x_{i}, y_{j^{\prime}}\right)
$$
and likewise the accumulated contribution of a batch of samples to each feature:
$$
f_{i, j}(X) = \sum_{(x, y) \in (X, Y)}\sum_{j^{\prime}}p\left(y_{j^{\prime}} \mid x\right)f_{i, j}\left(x_{i}, y_{j^{\prime}}\right)
$$
The complete algorithm is as follows.
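Below is a minimal end-to-end training sketch of my own (the toy data, learning rate, and iteration count are all assumptions): it builds the $dim_{x} \times dim_{y}$ indicator features, computes $p(y \mid x)$ in the closed form above, and performs gradient ascent using the gradient we derived, namely empirical feature counts minus model-expected feature counts:

```python
import math
from itertools import product

# toy training data (made up): contexts x and labels y
data = [("sunny", "play"), ("sunny", "play"), ("sunny", "stay"),
        ("rainy", "stay"), ("rainy", "stay"), ("rainy", "play")]
xs, ys = ["sunny", "rainy"], ["play", "stay"]

# dim_x * dim_y indicator features f_{i,j}(x, y), one weight per feature
pairs = list(product(xs, ys))
def f(k, x, y):
    """k-th indicator feature: fires when (x, y) equals the k-th pair."""
    return 1.0 if (x, y) == pairs[k] else 0.0
lam = [0.0] * len(pairs)

def predict(x):
    """p(y|x) = exp(sum_k lam_k f_k(x, y)) / Z(x) for each label y."""
    scores = [math.exp(sum(lam[k] * f(k, x, y) for k in range(len(lam))))
              for y in ys]
    Z = sum(scores)
    return [s / Z for s in scores]

# gradient ascent on the log-likelihood:
# grad_k = (empirical count of feature k) - (model-expected count of feature k)
eta = 0.5
for step in range(200):
    grad = [0.0] * len(lam)
    for x, y in data:
        probs = predict(x)
        for k in range(len(lam)):
            grad[k] += f(k, x, y)                                    # empirical
            grad[k] -= sum(p * f(k, x, y2) for y2, p in zip(ys, probs))  # model
    lam = [l + eta * g / len(data) for l, g in zip(lam, grad)]

print(dict(zip(ys, predict("sunny"))))  # ~= {'play': 0.667, 'stay': 0.333}
```

With indicator features of this kind, the learned conditional converges to the empirical conditional distribution of the training data, e.g. $p(\text{play} \mid \text{sunny}) \to 2/3$ on this toy sample.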