Suppose our dataset is $\mathcal{D}=\{x_1,x_2,\cdots,x_N\}$, where each sample $x_i=\begin{bmatrix}x_i^1\\x_i^2\\\vdots\\x_i^p\end{bmatrix}$ is a $p$-dimensional column vector. Let the true probability distribution of the data be $p(x)$. We use maximum likelihood estimation to fit an approximate distribution $q(x|\theta)$.
The maximum likelihood estimate is as follows (the product form assumes the samples are drawn independently from the same distribution):

$$
\begin{aligned}
\theta&=\underset{\theta}{\arg\max}\ q(x|\theta)\\
&=\underset{\theta}{\arg\max}\prod_{i=1}^N q(x_i|\theta)\\
&=\underset{\theta}{\arg\max}\sum_{i=1}^N \log q(x_i|\theta)\\
&=\underset{\theta}{\arg\max}\frac{1}{N}\sum_{i=1}^N \log q(x_i|\theta)
\end{aligned}\tag{1}
$$

Recall the law of large numbers: if $X$ follows a distribution $p(x)$ and we draw $n$ independent samples $x_1,\cdots,x_n$ from $p(x)$, then as $n$ grows large the sample mean $\frac{1}{n}\sum_{i=1}^n x_i$ converges in probability to $E_{x\sim p(x)}[X]$.
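The law of large numbers is applied here to $\log q(x_i|\theta)$ rather than to $x_i$ itself. The following is a minimal sketch of that step, under assumptions that are not part of the original text: the true distribution is taken to be $p(x)=\mathcal{N}(0,1)$ and the model is $q(x|\theta)=\mathcal{N}(\theta,1)$, so the expectation $E_{x\sim p(x)}[\log q(x|\theta)]$ has a closed form to compare against. The function names are illustrative only.

```python
import math
import random

def log_q(x, theta):
    """Log-density of the assumed model q(x | theta) = N(theta, 1)."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - theta) ** 2

def expected_log_q(theta):
    """Closed form of E_{x ~ N(0, 1)}[log q(x | theta)].

    Uses E[(x - theta)^2] = Var(x) + theta^2 = 1 + theta^2 for x ~ N(0, 1).
    """
    return -0.5 * math.log(2 * math.pi) - 0.5 * (1.0 + theta ** 2)

def sample_average_log_q(theta, n, seed=0):
    """Monte Carlo average (1/n) * sum_i log q(x_i | theta), x_i ~ N(0, 1)."""
    rng = random.Random(seed)
    return sum(log_q(rng.gauss(0.0, 1.0), theta) for _ in range(n)) / n

theta = 0.5
for n in (100, 10_000, 200_000):
    gap = abs(sample_average_log_q(theta, n) - expected_log_q(theta))
    print(f"n={n:>7d}  |sample average - expectation| = {gap:.5f}")
```

For large $n$ the printed gap should typically be small, which is what justifies replacing the sample average by the expectation.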
So (1) can be rewritten as

$$
\begin{aligned}
\theta&=\underset{\theta}{\arg\max}\frac{1}{N}\sum_{i=1}^N \log q(x_i|\theta)\\
&\approx\underset{\theta}{\arg\max}\ E_{x\sim p(x)}[\log q(x|\theta)]\\
&=\underset{\theta}{\arg\max}\int p(x)\log q(x|\theta)\,dx\\
&=\underset{\theta}{\arg\min}\int -p(x)\log q(x|\theta)\,dx\\
&=\underset{\theta}{\arg\min}\int p(x)\log \frac{1}{q(x|\theta)}\,dx\\
&=\underset{\theta}{\arg\min}\int p(x)\log \frac{p(x)}{q(x|\theta)}\,dx\\
&=\underset{\theta}{\arg\min}\ KL(p(x)||q(x|\theta))
\end{aligned}\tag{2}
$$

The second-to-last step adds the term $\int p(x)\log p(x)\,dx$, which does not depend on $\theta$ and therefore does not change the minimizer. From (1) and (2), maximizing the likelihood is equivalent (for sufficiently large $N$) to minimizing the $KL$ divergence.
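The equivalence can be checked numerically on a small example. The sketch below makes assumptions that are not in the original text: the true distribution is taken to be Bernoulli with $p(x=1)=0.3$, the model $q(x|\theta)$ is Bernoulli with parameter $\theta$, and both objectives are optimized by a simple grid search. All names are illustrative.

```python
import math
import random

P_TRUE = 0.3  # assumed true distribution p(x): Bernoulli(0.3)

def avg_log_likelihood(ones, n, theta):
    """(1/N) * sum_i log q(x_i | theta) for a Bernoulli(theta) model.

    `ones` is the number of samples equal to 1 out of `n` samples.
    """
    zeros = n - ones
    return (ones * math.log(theta) + zeros * math.log(1.0 - theta)) / n

def kl_bernoulli(p, theta):
    """KL(p || q_theta) between Bernoulli(p) and Bernoulli(theta)."""
    return (p * math.log(p / theta)
            + (1.0 - p) * math.log((1.0 - p) / (1.0 - theta)))

# Draw N samples from the assumed true distribution.
rng = random.Random(0)
n_samples = 100_000
ones = sum(1 for _ in range(n_samples) if rng.random() < P_TRUE)

# Candidate theta values 0.01, 0.02, ..., 0.99.
grid = [k / 100 for k in range(1, 100)]
theta_mle = max(grid, key=lambda t: avg_log_likelihood(ones, n_samples, t))
theta_kl = min(grid, key=lambda t: kl_bernoulli(P_TRUE, t))

print("argmax of average log-likelihood:", theta_mle)
print("argmin of KL(p || q_theta):      ", theta_kl)
```

With enough samples the two grid optima should coincide or differ by at most one grid step, since the sample average only approximates the expectation.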
A dataset of labeled samples has the same form as the linear-regression dataset I wrote about previously: $\mathcal{D}=\{(x_1, y_1),(x_2, y_2),\cdots,(x_N, y_N)\}$, where each sample $x_i=\begin{bmatrix}x_i^1\\x_i^2\\\vdots\\x_i^p\end{bmatrix}$ is a $p$-dimensional column vector with a corresponding scalar label $y_i$.
$$
\begin{aligned}
\theta&=\underset{\theta}{\arg\max}\ q(Y|X,\theta)\\
&=\underset{\theta}{\arg\max}\prod_{i=1}^N q(y_i|x_i,\theta)\\
&=\underset{\theta}{\arg\max}\sum_{i=1}^N \log q(y_i|x_i,\theta)\\
&=\underset{\theta}{\arg\max}\frac{1}{N}\sum_{i=1}^N \log q(y_i|x_i,\theta)
\end{aligned}\tag{3}
$$
By the law of large numbers, (3) becomes:

$$
\begin{aligned}
\theta&=\underset{\theta}{\arg\max}\frac{1}{N}\sum_{i=1}^N \log q(y_i|x_i,\theta)\\
&\approx\underset{\theta}{\arg\max}\ E_{x,y\sim p(x,y)}[\log q(y|x,\theta)]\\
&=\underset{\theta}{\arg\max}\int\int p(x,y)\log q(y|x,\theta)\,dx\,dy\\
&=\underset{\theta}{\arg\max}\int\int p(y|x)p(x)\log q(y|x,\theta)\,dx\,dy\\
&=\underset{\theta}{\arg\max}\int p(x)\left[\int p(y|x)\log q(y|x,\theta)\,dy\right]dx\\
&=\underset{\theta}{\arg\max}\ E_{x\sim p(x)}\left[\int p(y|x)\log q(y|x,\theta)\,dy\right]\\
&=\underset{\theta}{\arg\min}\ E_{x\sim p(x)}\left[\int -p(y|x)\log q(y|x,\theta)\,dy\right]\\
&=\underset{\theta}{\arg\min}\ E_{x\sim p(x)}\left[\int p(y|x)\log\frac{1}{q(y|x,\theta)}\,dy\right]\\
&=\underset{\theta}{\arg\min}\ E_{x\sim p(x)}\left[\int p(y|x)\log\frac{p(y|x)}{q(y|x,\theta)}\,dy\right]\\
&=\underset{\theta}{\arg\min}\ E_{x\sim p(x)}\left[KL(p(y|x)||q(y|x,\theta))\right]
\end{aligned}\tag{4}
$$

As in (2), the second-to-last step adds a term, $E_{x\sim p(x)}\left[\int p(y|x)\log p(y|x)\,dy\right]$, that does not depend on $\theta$. Since we are modeling the conditional probability $q(y|x)$, maximum likelihood estimation is again equivalent to minimizing the KL divergence, here the KL divergence between the conditionals averaged over $x\sim p(x)$.
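The key fact behind (4) is that the expected negative log-likelihood and the expected conditional KL divergence differ only by a term that is constant in $\theta$ (the conditional entropy of $y$ given $x$). The sketch below checks this exactly on a toy discrete problem. Everything in it is an assumption for illustration, not taken from the original text: binary $x$ and $y$, the particular probability tables, and a logistic model for $q(y=1|x,\theta)$ with $\theta=(\text{bias},\text{weight})$.

```python
import math

# Assumed toy distributions: x in {0, 1}, y in {0, 1}.
P_X = {0: 0.6, 1: 0.4}           # p(x)
P_Y1_GIVEN_X = {0: 0.2, 1: 0.7}  # p(y = 1 | x)

def q_y1(x, theta):
    """Assumed model q(y = 1 | x, theta): logistic in x, theta = (bias, weight)."""
    bias, weight = theta
    return 1.0 / (1.0 + math.exp(-(bias + weight * x)))

def expected_nll(theta):
    """E_{(x, y) ~ p(x, y)}[-log q(y | x, theta)], computed exactly."""
    total = 0.0
    for x, px in P_X.items():
        p1, q1 = P_Y1_GIVEN_X[x], q_y1(x, theta)
        total += px * -(p1 * math.log(q1) + (1 - p1) * math.log(1 - q1))
    return total

def expected_kl(theta):
    """E_{x ~ p(x)}[KL(p(y | x) || q(y | x, theta))], computed exactly."""
    total = 0.0
    for x, px in P_X.items():
        p1, q1 = P_Y1_GIVEN_X[x], q_y1(x, theta)
        total += px * (p1 * math.log(p1 / q1)
                       + (1 - p1) * math.log((1 - p1) / (1 - q1)))
    return total

# The gap between the two objectives should be the same for every theta.
for theta in [(0.0, 0.0), (-1.0, 2.0), (0.5, -0.5)]:
    nll, kl = expected_nll(theta), expected_kl(theta)
    print(f"theta={theta}: NLL={nll:.4f}  KL={kl:.4f}  gap={nll - kl:.6f}")
```

Because the gap does not change with $\theta$, any $\theta$ that minimizes one objective also minimizes the other.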
https://zhuanlan.zhihu.com/p/84764177