The one-dimensional Gaussian distribution can be written as:
$$p(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
Rewriting this expression:
$$\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(x^2-2\mu x+\mu^2)\right)=\exp\left(\log(2\pi\sigma^2)^{-1/2}\right)\exp\left(-\frac{1}{2\sigma^2}\begin{pmatrix}-2\mu&1\end{pmatrix}\begin{pmatrix}x\\x^2\end{pmatrix}-\frac{\mu^2}{2\sigma^2}\right)$$
Hence:
$$\eta=\begin{pmatrix}\frac{\mu}{\sigma^2}\\-\frac{1}{2\sigma^2}\end{pmatrix}=\begin{pmatrix}\eta_1\\\eta_2\end{pmatrix}$$
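The rearrangement above can be checked numerically: evaluating the standard Gaussian density and the exponential-family form $\exp(\eta^T\phi(x))$ with $\phi(x)=(x,x^2)$ at a few points should give identical values. A minimal sketch (the particular values of $\mu$, $\sigma$, and the test points are arbitrary choices):

```python
import math

mu, sigma = 1.5, 0.8

# Natural parameters from the derivation above
eta1 = mu / sigma**2
eta2 = -1 / (2 * sigma**2)

def gaussian_pdf(x):
    # Standard form of the 1-D Gaussian density
    return 1 / (math.sqrt(2 * math.pi) * sigma) * math.exp(-(x - mu)**2 / (2 * sigma**2))

def exp_family_pdf(x):
    # exp(log(2πσ²)^{-1/2}) · exp(η^T φ(x) − μ²/(2σ²)),  φ(x) = (x, x²)
    log_norm = -0.5 * math.log(2 * math.pi * sigma**2)
    return math.exp(log_norm + eta1 * x + eta2 * x**2 - mu**2 / (2 * sigma**2))

for x in (-1.0, 0.0, 2.3):
    assert abs(gaussian_pdf(x) - exp_family_pdf(x)) < 1e-12
```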
The log-partition function $A(\eta)$ is then:
$$A(\eta)=-\frac{\eta_1^2}{4\eta_2}+\frac{1}{2}\log\left(-\frac{\pi}{\eta_2}\right)$$
Integrating the probability density function:
$$\exp(A(\eta))=\int h(x)\exp(\eta^T\phi(x))dx$$
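For the Gaussian case $h(x)=1$, so the closed form of $A(\eta)$ above can be verified against a direct numerical integration of $\exp(\eta_1 x+\eta_2 x^2)$. A minimal sketch (the integration bounds and step count are arbitrary; any $\eta_2<0$ works):

```python
import math

eta1, eta2 = 2.0, -0.5  # any natural parameters with eta2 < 0

# Closed-form log-partition function derived above
A = -eta1**2 / (4 * eta2) + 0.5 * math.log(-math.pi / eta2)

# Riemann-sum check: exp(A(η)) = ∫ exp(η1·x + η2·x²) dx  (h(x) = 1)
lo, hi, n = -30.0, 30.0, 200000
step = (hi - lo) / n
integral = sum(math.exp(eta1 * x + eta2 * x**2) * step
               for x in (lo + i * step for i in range(n)))

assert abs(integral - math.exp(A)) < 1e-5
```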
Differentiating both sides with respect to the parameter:
$$\exp(A(\eta))A'(\eta)=\int h(x)\exp(\eta^T\phi(x))\phi(x)dx\Longrightarrow A'(\eta)=\mathbb{E}_{p(x|\eta)}[\phi(x)]$$
Similarly:
$$A''(\eta)=\mathrm{Var}_{p(x|\eta)}[\phi(x)]$$
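For the Gaussian, the first component gives $\partial A/\partial\eta_1=\mathbb{E}[x]=\mu$ and $\partial^2A/\partial\eta_1^2=\mathrm{Var}[x]=\sigma^2$. A minimal finite-difference sketch checking both identities (the values of $\mu$, $\sigma^2$, and the step $h$ are arbitrary choices):

```python
import math

def A(eta1, eta2):
    # Log-partition function of the 1-D Gaussian in natural parameters
    return -eta1**2 / (4 * eta2) + 0.5 * math.log(-math.pi / eta2)

mu, sigma2 = 1.2, 0.7
eta1, eta2 = mu / sigma2, -1 / (2 * sigma2)

h = 1e-4
# Central finite differences in η1
dA  = (A(eta1 + h, eta2) - A(eta1 - h, eta2)) / (2 * h)
d2A = (A(eta1 + h, eta2) - 2 * A(eta1, eta2) + A(eta1 - h, eta2)) / h**2

# ∂A/∂η1 = E[x] = μ,   ∂²A/∂η1² = Var[x] = σ²
assert abs(dA - mu) < 1e-6
assert abs(d2A - sigma2) < 1e-3
```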
Since the variance is positive, $A(\eta)$ must be a convex function.
For a dataset $\mathcal{D}=\{x_1,x_2,\cdots,x_N\}$ obtained by independent and identically distributed (i.i.d.) sampling:
$$\eta_{MLE}=\mathop{argmax}_\eta\sum\limits_{i=1}^N\log p(x_i|\eta)=\mathop{argmax}_\eta\sum\limits_{i=1}^N(\eta^T\phi(x_i)-A(\eta))\Longrightarrow A'(\eta_{MLE})=\frac{1}{N}\sum\limits_{i=1}^N\phi(x_i)$$
This shows that, to estimate the parameter, it suffices to know the sufficient statistic.
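For the Gaussian, solving $A'(\eta_{MLE})=\frac{1}{N}\sum_i\phi(x_i)$ with $\phi(x)=(x,x^2)$ gives $\hat\mu=\frac{1}{N}\sum_i x_i$ and $\hat\sigma^2=\frac{1}{N}\sum_i x_i^2-\hat\mu^2$, so only the two sufficient statistics are needed. A minimal sketch on simulated data (the true parameters, seed, and sample size are arbitrary choices):

```python
import math
import random

random.seed(0)
mu_true, sigma_true = 2.0, 1.5
data = [random.gauss(mu_true, sigma_true) for _ in range(200000)]
N = len(data)

# Sufficient statistics: (1/N) Σ φ(x_i) with φ(x) = (x, x²)
s1 = sum(data) / N               # (1/N) Σ x_i
s2 = sum(x * x for x in data) / N  # (1/N) Σ x_i²

# MLE recovered from the sufficient statistics alone
mu_mle = s1
sigma2_mle = s2 - s1**2

assert abs(mu_mle - mu_true) < 0.05
assert abs(sigma2_mle - sigma_true**2) < 0.1
```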
The information entropy is written as:
$$\text{Entropy}=-\int p(x)\log p(x)dx$$
In general, a completely random (equiprobable) variable has maximum entropy.
Our assumption is the maximum entropy principle. Suppose the data are discretely distributed, with the probabilities of the $K$ outcomes being $p_k$; the maximum entropy principle can then be stated as:
$$\max\{H(p)\}=\min\left\{\sum\limits_{k=1}^Kp_k\log p_k\right\}\ s.t.\ \sum\limits_{k=1}^Kp_k=1$$
Using the method of Lagrange multipliers:
$$L(p,\lambda)=\sum\limits_{k=1}^Kp_k\log p_k+\lambda(1-\sum\limits_{k=1}^Kp_k)$$
we obtain:
$$p_1=p_2=\cdots=p_K=\frac{1}{K}$$
So the entropy is maximized in the equiprobable case.
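This conclusion is easy to check numerically: the entropy of the uniform distribution over $K$ outcomes is $\log K$, and any other distribution scores lower. A minimal sketch (the comparison distributions are arbitrary examples):

```python
import math

def entropy(p):
    # Discrete entropy H(p) = -Σ p_k log p_k  (0·log 0 treated as 0)
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

K = 4
uniform = [1 / K] * K

# A few non-uniform distributions over the same K outcomes
others = [[0.7, 0.1, 0.1, 0.1],
          [0.4, 0.3, 0.2, 0.1],
          [0.25, 0.25, 0.3, 0.2]]

for p in others:
    assert entropy(p) < entropy(uniform)

# Entropy of the uniform distribution equals log K
assert abs(entropy(uniform) - math.log(K)) < 1e-12
```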
For a dataset $\mathcal{D}$, the empirical distribution is $\hat{p}(x)=\frac{Count(x)}{N}$. In practice it is impossible for all empirical probabilities to be equal, so this empirical distribution must be added as a constraint to the maximum entropy principle above.
For an arbitrary function, the empirical expectation under the empirical distribution can be computed and recorded as:
$$\mathbb{E}_{\hat{p}}[f(x)]=\Delta$$
Thus:
$$\max\{H(p)\}=\min\left\{\sum\limits_{k=1}^Np_k\log p_k\right\}\ s.t.\ \sum\limits_{k=1}^Np_k=1,\ \mathbb{E}_p[f(x)]=\Delta$$
The Lagrangian is:
$$L(p,\lambda_0,\lambda)=\sum\limits_{k=1}^Np_k\log p_k+\lambda_0(1-\sum\limits_{k=1}^Np_k)+\lambda^T(\Delta-\mathbb{E}_p[f(x)])$$
Taking the derivative gives:
$$\frac{\partial}{\partial p(x)}L=\sum\limits_{k=1}^N(\log p(x)+1)-\sum\limits_{k=1}^N\lambda_0-\sum\limits_{k=1}^N\lambda^Tf(x)\Longrightarrow\sum\limits_{k=1}^N\left(\log p(x)+1-\lambda_0-\lambda^Tf(x)\right)=0$$
Since the dataset is arbitrary, this sum over the dataset being zero means every term in the sum is zero:
$$p(x)=\exp(\lambda^Tf(x)+\lambda_0-1)$$
This is exactly an exponential family distribution.
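The result can be illustrated numerically: with support $\{0,1,2\}$, feature $f(x)=x$, and constraint $\mathbb{E}_p[f]=\Delta$, the maximum-entropy distribution has the exponential-family form $p(x)\propto\exp(\lambda x)$, and its entropy beats any other distribution satisfying the same constraint. A minimal sketch (the support, $\Delta=0.7$, and the comparison distribution are arbitrary choices; $\lambda$ is found by bisection since the mean is increasing in $\lambda$):

```python
import math

xs = [0, 1, 2]   # support, with feature f(x) = x
delta = 0.7      # moment constraint: E_p[f(x)] = Δ

def maxent_p(lam):
    # p(x) ∝ exp(λ·f(x)) — the exponential-family form derived above
    w = [math.exp(lam * x) for x in xs]
    Z = sum(w)
    return [wi / Z for wi in w]

# Bisection on λ so that the mean constraint holds
lo, hi = -20.0, 20.0
for _ in range(200):
    mid = (lo + hi) / 2
    mean = sum(x * px for x, px in zip(xs, maxent_p(mid)))
    if mean < delta:
        lo = mid
    else:
        hi = mid
p_star = maxent_p((lo + hi) / 2)

def entropy(p):
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

# Another distribution with the same mean 0.7
p_other = [0.35, 0.6, 0.05]
assert abs(sum(x * px for x, px in zip(xs, p_other)) - delta) < 1e-12
# The exponential-family solution has strictly higher entropy
assert entropy(p_star) > entropy(p_other)
```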