Maximum Likelihood Estimation and Minimizing the Cross-Entropy Loss (KL Divergence)

  • 1. Unlabeled samples
    • 1.1. Dataset
    • 1.2. Derivation
  • 2. Labeled samples
    • 2.1. Dataset
    • 2.2. Derivation
  • 3. References

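Conclusion first: in the large-sample limit, maximum likelihood estimation is equivalent to minimizing the cross-entropy loss, and therefore to minimizing the KL divergence.
We prove this for unlabeled samples and for labeled samples in turn.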

1. Unlabeled samples

1.1. Dataset

Suppose the dataset is $\mathcal{D}=\{x_1,x_2,\cdots,x_N\}$, where each sample $x_i=\begin{bmatrix}x_i^1\\x_i^2\\\vdots\\x_i^p\end{bmatrix}$ is a $p$-dimensional column vector. The samples are assumed to be drawn independently from the true data distribution $p(x)$. We use maximum likelihood estimation to fit an approximating distribution $q(x|\theta)$.

1.2. Derivation

The maximum likelihood estimate is:

$$
\begin{aligned}
\theta&=\underset{\theta}{\operatorname{argmax}}\ q(x|\theta)\\
&=\underset{\theta}{\operatorname{argmax}}\prod_{i=1}^N q(x_i|\theta)\\
&=\underset{\theta}{\operatorname{argmax}}\sum_{i=1}^N \log q(x_i|\theta)\\
&=\underset{\theta}{\operatorname{argmax}}\frac{1}{N}\sum_{i=1}^N \log q(x_i|\theta)
\end{aligned}\tag{1}
$$

Taking the logarithm and scaling by the positive constant $\frac{1}{N}$ do not change the maximizer, because $\log$ is strictly increasing.

Recall the law of large numbers: if $X$ follows a distribution $p(x)$ and we draw $n$ independent samples $x_1,\cdots,x_n$ from $p(x)$, then as $n$ grows the sample mean $\frac{1}{n}\sum_{i=1}^n x_i$ converges in probability to $E_{x\sim p(x)}[X]$.
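The law of large numbers step can be checked numerically. The sketch below is only an illustration under assumptions that are not in the original: the true distribution is taken to be Bernoulli(0.3), the model $q(x|\theta)$ is Bernoulli with $\theta=0.4$, and the names `mean_log_q` and `expected_log_q` are hypothetical helpers. It compares the sample average of $\log q(x_i|\theta)$ with its exact expectation for growing sample sizes.

```python
import math
import random


def mean_log_q(samples, theta):
    """Sample average of log q(x|theta) for a Bernoulli model q (assumed model)."""
    total = 0.0
    for x in samples:
        total += math.log(theta) if x == 1 else math.log(1.0 - theta)
    return total / len(samples)


def expected_log_q(p_true, theta):
    """Exact E_{x~p}[log q(x|theta)] when p is Bernoulli(p_true)."""
    return p_true * math.log(theta) + (1.0 - p_true) * math.log(1.0 - theta)


rng = random.Random(0)          # fixed seed so the run is reproducible
p_true, theta = 0.3, 0.4        # illustrative values, not from the article
exact = expected_log_q(p_true, theta)

gaps = {}
for n in (100, 10_000, 200_000):
    samples = [1 if rng.random() < p_true else 0 for _ in range(n)]
    gaps[n] = abs(mean_log_q(samples, theta) - exact)
    print(f"n={n:>7}  |sample mean - expectation| = {gaps[n]:.6f}")
```

The gap typically shrinks as `n` grows, which is what justifies replacing the sample average by an expectation in the next step.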
Applying the law of large numbers to $\log q(x|\theta)$, (1) can be rewritten as:

$$
\begin{aligned}
\theta&=\underset{\theta}{\operatorname{argmax}}\frac{1}{N}\sum_{i=1}^N \log q(x_i|\theta)\\
&\approx\underset{\theta}{\operatorname{argmax}}\ E_{x\sim p(x)}[\log q(x|\theta)]\\
&=\underset{\theta}{\operatorname{argmax}}\int p(x)\log q(x|\theta)\,dx\\
&=\underset{\theta}{\operatorname{argmin}}\int -p(x)\log q(x|\theta)\,dx\\
&=\underset{\theta}{\operatorname{argmin}}\int p(x)\log \frac{1}{q(x|\theta)}\,dx\\
&=\underset{\theta}{\operatorname{argmin}}\int p(x)\log \frac{p(x)}{q(x|\theta)}\,dx\\
&=\underset{\theta}{\operatorname{argmin}}\ KL(p(x)||q(x|\theta))
\end{aligned}\tag{2}
$$

The integral $-\int p(x)\log q(x|\theta)\,dx$ in the fourth line is the cross-entropy between $p$ and $q$. The sixth line adds $\int p(x)\log p(x)\,dx$, which does not depend on $\theta$ and therefore leaves the minimizer unchanged. From (1) and (2), maximum likelihood estimation, minimizing the cross-entropy, and minimizing the $KL$ divergence are equivalent.
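As a rough numerical check of this chain, the sketch below again assumes a Bernoulli setting that is not in the original (true parameter 0.3, a grid search over $\theta$ instead of a closed-form solution, hypothetical helper names). It finds the grid maximizer of the average log-likelihood and the grid minimizers of the cross-entropy and of the KL divergence, and verifies that cross-entropy and KL differ only by the entropy of $p$, which is constant in $\theta$.

```python
import math
import random


def avg_log_lik(k, n, theta):
    """(1/N) * sum_i log q(x_i|theta) for Bernoulli data with k ones out of n."""
    return (k / n) * math.log(theta) + (1 - k / n) * math.log(1 - theta)


def cross_entropy(p, theta):
    """H(p, q_theta) = -E_{x~p}[log q(x|theta)] for Bernoulli p and q."""
    return -(p * math.log(theta) + (1 - p) * math.log(1 - theta))


def kl(p, theta):
    """KL(p || q_theta) for two Bernoulli distributions."""
    return p * math.log(p / theta) + (1 - p) * math.log((1 - p) / (1 - theta))


rng = random.Random(0)            # fixed seed for reproducibility
p_true, n = 0.3, 200_000          # illustrative values, not from the article
k = sum(1 for _ in range(n) if rng.random() < p_true)   # number of ones drawn

grid = [i / 100 for i in range(1, 100)]                  # candidate theta values
theta_mle = max(grid, key=lambda t: avg_log_lik(k, n, t))
theta_ce = min(grid, key=lambda t: cross_entropy(p_true, t))
theta_kl = min(grid, key=lambda t: kl(p_true, t))

# Cross-entropy minus KL should equal the entropy of p for every theta.
entropy = -(p_true * math.log(p_true) + (1 - p_true) * math.log(1 - p_true))
max_gap = max(abs(cross_entropy(p_true, t) - kl(p_true, t) - entropy) for t in grid)

print("argmax average log-likelihood:", theta_mle)
print("argmin cross-entropy:         ", theta_ce)
print("argmin KL divergence:         ", theta_kl)
print("max |CE - KL - H(p)|:         ", max_gap)
```

The cross-entropy and KL minimizers coincide exactly because the two objectives differ by a constant; the likelihood maximizer matches them only up to sampling noise, which reflects the $\approx$ in (2).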

2. Labeled samples

2.1. Dataset

The labeled dataset has the same form as the linear regression dataset from my earlier post:

$$
\mathcal{D}=\{(x_1, y_1),(x_2, y_2),\cdots,(x_N, y_N)\}
$$

Each input $x_i=\begin{bmatrix}x_i^1\\x_i^2\\\vdots\\x_i^p\end{bmatrix}$ is a $p$-dimensional column vector with a corresponding scalar label $y_i$.

2.2. Derivation

The maximum likelihood estimate for the conditional model $q(y|x,\theta)$ is:

$$
\begin{aligned}
\theta&=\underset{\theta}{\operatorname{argmax}}\ q(Y|X,\theta)\\
&=\underset{\theta}{\operatorname{argmax}}\prod_{i=1}^N q(y_i|x_i,\theta)\\
&=\underset{\theta}{\operatorname{argmax}}\sum_{i=1}^N \log q(y_i|x_i,\theta)\\
&=\underset{\theta}{\operatorname{argmax}}\frac{1}{N}\sum_{i=1}^N \log q(y_i|x_i,\theta)
\end{aligned}\tag{3}
$$

By the law of large numbers, (3) becomes:

$$
\begin{aligned}
\theta&=\underset{\theta}{\operatorname{argmax}}\frac{1}{N}\sum_{i=1}^N \log q(y_i|x_i,\theta)\\
&\approx\underset{\theta}{\operatorname{argmax}}\ E_{x,y\sim p(x,y)}[\log q(y|x,\theta)]\\
&=\underset{\theta}{\operatorname{argmax}}\int\int p(x,y)\log q(y|x,\theta)\,dx\,dy\\
&=\underset{\theta}{\operatorname{argmax}}\int\int p(y|x)p(x)\log q(y|x,\theta)\,dx\,dy\\
&=\underset{\theta}{\operatorname{argmax}}\int p(x)\left[\int p(y|x)\log q(y|x,\theta)\,dy\right]dx\\
&=\underset{\theta}{\operatorname{argmax}}\ E_{x\sim p(x)}\left[\int p(y|x)\log q(y|x,\theta)\,dy\right]\\
&=\underset{\theta}{\operatorname{argmin}}\ E_{x\sim p(x)}\left[\int -p(y|x)\log q(y|x,\theta)\,dy\right]\\
&=\underset{\theta}{\operatorname{argmin}}\ E_{x\sim p(x)}\left[\int p(y|x)\log \frac{1}{q(y|x,\theta)}\,dy\right]\\
&=\underset{\theta}{\operatorname{argmin}}\ E_{x\sim p(x)}\left[\int p(y|x)\log \frac{p(y|x)}{q(y|x,\theta)}\,dy\right]\\
&=\underset{\theta}{\operatorname{argmin}}\ E_{x\sim p(x)}\left[KL(p(y|x)||q(y|x,\theta))\right]
\end{aligned}\tag{4}
$$

The only approximation is the second line; rewriting the outer integral over $p(x)$ as an expectation is exact. The second-to-last line adds $E_{x\sim p(x)}\left[\int p(y|x)\log p(y|x)\,dy\right]$, which does not depend on $\theta$. Because we are modeling the conditional distribution $q(y|x,\theta)$, maximum likelihood estimation is again equivalent to minimizing the KL divergence, here the KL divergence between $p(y|x)$ and $q(y|x,\theta)$ averaged over $p(x)$.
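The labeled case can be illustrated the same way. The sketch below is a toy example under assumptions not in the original: $x$ and $y$ are both binary, the ground-truth tables `P_X` and `P_Y1_GIVEN_X` are made up, and the model $q(y=1|x,\theta)$ is a logistic regression with hypothetical parameters $\theta=(w,b)$. It computes the population cross-entropy loss and the expected KL divergence exactly and checks that their difference is the conditional entropy of $y$ given $x$, regardless of $\theta$.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    return math.log(p / (1.0 - p))


# Made-up ground truth for illustration: p(x) and p(y=1|x) with x, y in {0, 1}.
P_X = {0: 0.4, 1: 0.6}
P_Y1_GIVEN_X = {0: 0.2, 1: 0.7}


def bernoulli_ce(p1, q1):
    """Cross-entropy between Bernoulli(p1) and Bernoulli(q1)."""
    return -(p1 * math.log(q1) + (1 - p1) * math.log(1 - q1))


def bernoulli_kl(p1, q1):
    """KL(Bernoulli(p1) || Bernoulli(q1))."""
    return p1 * math.log(p1 / q1) + (1 - p1) * math.log((1 - p1) / (1 - q1))


def q_y1(x, w, b):
    """Assumed model q(y=1|x,theta) with theta = (w, b)."""
    return sigmoid(w * x + b)


def expected_nll(w, b):
    """E_{(x,y)~p}[-log q(y|x,theta)], the population cross-entropy loss."""
    return sum(px * bernoulli_ce(P_Y1_GIVEN_X[x], q_y1(x, w, b)) for x, px in P_X.items())


def expected_kl(w, b):
    """E_{x~p(x)}[KL(p(y|x) || q(y|x,theta))], the last line of (4)."""
    return sum(px * bernoulli_kl(P_Y1_GIVEN_X[x], q_y1(x, w, b)) for x, px in P_X.items())


# Conditional entropy H(y|x): the theta-independent constant added in (4).
cond_entropy = sum(px * bernoulli_ce(P_Y1_GIVEN_X[x], P_Y1_GIVEN_X[x]) for x, px in P_X.items())

# Parameters at which the model reproduces p(y|x) exactly.
b_star = logit(P_Y1_GIVEN_X[0])
w_star = logit(P_Y1_GIVEN_X[1]) - b_star

thetas = [(0.0, 0.0), (1.0, -1.0), (2.5, -1.5), (w_star, b_star)]
gaps = [expected_nll(w, b) - expected_kl(w, b) - cond_entropy for w, b in thetas]
for (w, b), gap in zip(thetas, gaps):
    print(f"w={w:.3f} b={b:.3f}  NLL={expected_nll(w, b):.6f}  "
          f"KL={expected_kl(w, b):.6f}  NLL-KL-H={gap:.2e}")
```

Since the two objectives differ by a constant for every $\theta$, any optimizer that lowers the cross-entropy loss lowers the expected KL divergence by the same amount, and the expected KL vanishes at `(w_star, b_star)` where the model matches $p(y|x)$.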

3. References

https://zhuanlan.zhihu.com/p/84764177
