[Problem 1] Maximum-likelihood estimation can also be used to estimate prior probabilities. Assume the samples are drawn independently, one after another, from the states of nature $\omega_i$, where each state of nature has probability $P(\omega_i)$. If the $k$-th sample belongs to state of nature $\omega_i$, write $z_{ik}=1$; otherwise $z_{ik}=0$. Show that the maximum-likelihood estimate of $P(\omega_i)$ is $\hat{P}(\omega_i)=\frac{1}{n}\sum_{k=1}^{n} z_{ik}$.
[Solution] Given the prior $P(\omega_i)$, the event $z_{i1}=1$ (the first sample belongs to class $i$) occurs with probability $P(\omega_i)$, and the event $z_{i1}=0$ (the first sample does not belong to class $i$) occurs with probability $1-P(\omega_i)$. Both cases can be written in a single expression:
$$P(z_{i1} \mid P(\omega_i)) = P(\omega_i)^{z_{i1}}\,\bigl(1-P(\omega_i)\bigr)^{1-z_{i1}}$$
Since the samples are drawn independently,
$$\begin{aligned} P(z_{i1},\cdots,z_{in} \mid P(\omega_i)) &= P(z_{i1}\mid P(\omega_i))\,P(z_{i2}\mid P(\omega_i))\cdots P(z_{in}\mid P(\omega_i)) \\ &= \prod_{k=1}^{n} P(\omega_i)^{z_{ik}}\,\bigl(1-P(\omega_i)\bigr)^{1-z_{ik}} \end{aligned}$$
[Solution] From the likelihood obtained in part (1), the log-likelihood function is:
$$\ln P(z_{i1},\cdots,z_{in}\mid P(\omega_i)) = \sum_{k=1}^{n} z_{ik}\ln P(\omega_i) + \sum_{k=1}^{n}\bigl(1-z_{ik}\bigr)\ln\bigl(1-P(\omega_i)\bigr)$$
From
$$\frac{\partial \ln P}{\partial P(\omega_i)} = \sum_{k=1}^{n} z_{ik}\,\frac{1}{P(\omega_i)} - \sum_{k=1}^{n}\bigl(1-z_{ik}\bigr)\frac{1}{1-P(\omega_i)} = 0$$
we obtain:
$$\sum_{k=1}^{n} z_{ik}\bigl(1-P(\omega_i)\bigr) - \sum_{k=1}^{n}\bigl(1-z_{ik}\bigr)P(\omega_i) = 0$$
which simplifies to the maximum-likelihood estimate:
$$\hat{P}(\omega_i) = \frac{1}{n}\sum_{k=1}^{n} z_{ik}$$
In other words, the maximum-likelihood estimate of a class prior is simply the fraction of the total samples that belong to that class.
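As a quick sanity check (my own addition, not part of the original solution), the short Python sketch below draws labels from an assumed prior and verifies that the empirical class frequencies recover it; the prior vector and sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

true_prior = np.array([0.2, 0.5, 0.3])   # assumed P(omega_i) for three classes
n = 10_000                               # number of independent draws

# labels[k] = i  is equivalent to  z_{ik} = 1
labels = rng.choice(len(true_prior), size=n, p=true_prior)

# MLE of each prior: the fraction of samples falling in that class
p_hat = np.bincount(labels, minlength=len(true_prior)) / n
print(p_hat)   # close to [0.2, 0.5, 0.3]
```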
[Problem 2] Let the density of $x$ be uniform:
$$p(x \mid \theta) \sim U(0,\theta) = \begin{cases} 1/\theta, & 0 \le x \le \theta \\ 0, & \text{otherwise} \end{cases}$$
(1) Suppose $n$ samples $x_1,\ldots,x_n$ are drawn independently from $p(x\mid\theta)$; find the maximum-likelihood estimate of $\theta$. (2) Suppose 5 points are drawn and the largest of them happens to be 0.6; plot the likelihood $p(\mathcal{D}\mid\theta)$ on the interval $[0,1]$.
[Solution] The $n$ samples are independent and identically distributed, so:
$$P(\mathcal{D}\mid\theta) = \prod_{k=1}^{n} p(x_k\mid\theta) = \begin{cases} \dfrac{1}{\theta^{n}}, & 0 \le x_1, x_2, \ldots, x_n \le \theta \\ 0, & \text{otherwise} \end{cases}$$
The log-likelihood function is:
$$L(\mathcal{D}\mid\theta) = \ln P(\mathcal{D}\mid\theta) = \begin{cases} -n\ln\theta, & 0 \le x_1, x_2, \ldots, x_n \le \theta \\ -\infty, & \text{otherwise} \end{cases}$$
Since $-n\ln\theta$ is a decreasing function of $\theta$, the likelihood grows as $\theta$ shrinks; but $\theta$ is constrained by $0 \le x_1, x_2, \ldots, x_n \le \theta$. Hence the maximum-likelihood estimate of $\theta$ is $\max[\mathcal{D}]$, the largest sample.
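A numerical illustration (again my own sketch, not in the original): evaluate this likelihood on a grid of $\theta$ values for a random uniform sample and confirm that it peaks at the sample maximum.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 2.0, size=20)          # 20 samples from U(0, 2)
thetas = np.linspace(0.01, 3.0, 1000)        # candidate values of theta

# p(D | theta) = theta^{-n} when theta >= max(x), and 0 otherwise
likelihood = np.where(thetas >= x.max(), thetas ** (-len(x)), 0.0)

best = thetas[np.argmax(likelihood)]
print(x.max(), best)    # the grid maximizer sits right at the sample maximum
```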
[Solution] From part (1), the likelihood function is:
$$P(\mathcal{D}\mid\theta) = \begin{cases} \dfrac{1}{\theta^{5}}, & 0 \le x_1, x_2, \ldots, x_5 \le \theta \\ 0, & \text{otherwise} \end{cases}$$
The likelihood $p(\mathcal{D}\mid\theta)$ on the interval $[0,1]$ is shown in Figure 1. Because $\theta \ge \max[\mathcal{D}]$, the curve can be drawn without knowing the other four sample values. (Say the largest sample is $x_1 = 0.6$: for $\theta < 0.6$ we have $p(x_1\mid\theta)=0$ and therefore $p(\mathcal{D}\mid\theta)=0$; for $\theta \ge 0.6$, $p(\mathcal{D}\mid\theta) = (1/\theta)^{5}$.) I drew the figure in MATLAB.
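An equivalent plot can be produced with a few lines of Python/matplotlib (a sketch under the same assumption that the largest of the five samples is 0.6):

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0.01, 1.0, 500)
# Likelihood of 5 i.i.d. U(0, theta) samples whose largest value is 0.6
likelihood = np.where(theta >= 0.6, theta ** (-5.0), 0.0)

plt.plot(theta, likelihood)
plt.xlabel(r"$\theta$")
plt.ylabel(r"$p(\mathcal{D} \mid \theta)$")
plt.title("Likelihood for 5 uniform samples with maximum 0.6")
plt.show()
```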
[Problem 3] One way to measure the distance between two different distributions over the same space is the Kullback-Leibler divergence (KL divergence):
$$D_{KL}\bigl(p_2(\mathbf{x}) \,\|\, p_1(\mathbf{x})\bigr) = \int p_2(\mathbf{x}) \ln\frac{p_2(\mathbf{x})}{p_1(\mathbf{x})}\,d\mathbf{x}$$
This measure does not satisfy the symmetry and triangle-inequality properties required of a true metric. Suppose we use a normal distribution $p_1(\mathbf{x}) \sim N(\boldsymbol{\mu}, \Sigma)$ to approximate an arbitrary distribution $p_2(\mathbf{x})$. Show that the KL divergence is minimized by the following (intuitively obvious) choice:
$$\begin{aligned} \boldsymbol{\mu} &= \mathcal{E}_2[\mathbf{x}] \\ \Sigma &= \mathcal{E}_2\bigl[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^t\bigr] \end{aligned}$$
where the expectations are taken with respect to the density $p_2(\mathbf{x})$.
[Solution] Substituting the normal form of $p_1(\mathbf{x})$ gives
$$D_{KL}\bigl(p_2(\mathbf{x}) \,\|\, p_1(\mathbf{x})\bigr) = \int \Bigl[ p_2(\mathbf{x})\ln p_2(\mathbf{x}) + \tfrac{1}{2}\,p_2(\mathbf{x})\bigl(d\ln 2\pi + \ln|\Sigma|\bigr) + \tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^t \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\,p_2(\mathbf{x}) \Bigr]\,d\mathbf{x}$$
Dropping the terms that do not involve $\boldsymbol{\mu}$ or $\Sigma$, define
$$f(\boldsymbol{\mu}, \Sigma) = \int p_2(\mathbf{x})\Bigl(\ln|\Sigma| + (\mathbf{x}-\boldsymbol{\mu})^t \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\Bigr)\,d\mathbf{x}$$
and take the partial derivatives with respect to $\boldsymbol{\mu}$ and $\Sigma$:
$$\frac{\partial f(\boldsymbol{\mu}, \Sigma)}{\partial \boldsymbol{\mu}} = \bigl(\Sigma^{-1}+\Sigma^{-t}\bigr)\Bigl(\boldsymbol{\mu} - \int \mathbf{x}\,p_2(\mathbf{x})\,d\mathbf{x}\Bigr) = \bigl(\Sigma^{-1}+\Sigma^{-t}\bigr)\bigl(\boldsymbol{\mu} - \mathcal{E}_2[\mathbf{x}]\bigr)$$

$$\begin{aligned} \frac{\partial f(\boldsymbol{\mu}, \Sigma)}{\partial \Sigma} &= \int p_2(\mathbf{x})\,\Sigma^{-t} + p_2(\mathbf{x})\bigl[-\Sigma^{-t}(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^t \Sigma^{-t}\bigr]\,d\mathbf{x} \\ &= \Sigma^{-t}\int p_2(\mathbf{x})\bigl[\Sigma^t - (\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^t\bigr]\Sigma^{-t}\,d\mathbf{x} \\ &= \Sigma^{-t}\Bigl(\Sigma - \mathcal{E}_2\bigl[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^t\bigr]\Bigr)\Sigma^{-t} \end{aligned}$$

Setting both partial derivatives to zero yields
$$\begin{aligned} \boldsymbol{\mu} &= \mathcal{E}_2[\mathbf{x}] \\ \Sigma &= \mathcal{E}_2\bigl[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^t\bigr] \end{aligned}$$
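The result can also be checked numerically. The sketch below (my own addition) takes a 1-D Gaussian mixture as the target $p_2$, minimizes $D_{KL}\bigl(p_2 \,\|\, N(\mu, \sigma^2)\bigr)$ over $(\mu, \sigma)$, and confirms that the minimizer matches the mean and variance of $p_2$; the mixture parameters and integration grid are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# An arbitrary 1-D "target" density p2: a two-component Gaussian mixture
def p2(x):
    return 0.3 * norm.pdf(x, -1.0, 0.5) + 0.7 * norm.pdf(x, 2.0, 1.2)

xs = np.linspace(-10.0, 12.0, 20001)
px = p2(xs)

# First and second moments of p2 (numerical integration)
mean2 = np.trapz(xs * px, xs)
var2 = np.trapz((xs - mean2) ** 2 * px, xs)

# D_KL(p2 || N(mu, sigma^2)) up to a (mu, sigma)-independent entropy term
def kl_objective(params):
    mu, log_sigma = params
    return -np.trapz(px * norm.logpdf(xs, mu, np.exp(log_sigma)), xs)

res = minimize(kl_objective, x0=np.array([0.0, 0.0]))
mu_opt, var_opt = res.x[0], np.exp(res.x[1]) ** 2
print(mu_opt, var_opt)      # approximately mean2 and var2
print(mean2, var2)
```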
[Problem 4] The samples in the data set $\mathcal{D}=\left\{\binom{1}{1}, \binom{3}{3}, \binom{2}{*}\right\}$ are drawn independently from the two-dimensional separable distribution $p(x_1, x_2) = p(x_1)\,p(x_2)$, where $*$ denotes a missing value and
$$p(x_1) = \begin{cases} \dfrac{1}{\theta_1} e^{-x_1/\theta_1}, & x_1 \ge 0 \\ 0, & \text{otherwise} \end{cases} \qquad p(x_2) \sim U(0,\theta_2) = \begin{cases} \dfrac{1}{\theta_2}, & 0 \le x_2 \le \theta_2 \\ 0, & \text{otherwise} \end{cases}$$
Use the EM algorithm to estimate $\boldsymbol{\theta} = (\theta_1, \theta_2)^t$, starting from the initial estimate $\boldsymbol{\theta}^0 = (2, 4)^t$.
[Solution] For the E step:
$$\begin{aligned} Q(\boldsymbol{\theta}; \boldsymbol{\theta}^0) &= \mathcal{E}_{x_{32}}\bigl[\ln p(\mathbf{x}_g, \mathbf{x}_b; \boldsymbol{\theta}) \mid \boldsymbol{\theta}^0, \mathcal{D}_g\bigr] \\ &= \int_{-\infty}^{\infty}\bigl[\ln p(\mathbf{x}_1\mid\boldsymbol{\theta}) + \ln p(\mathbf{x}_2\mid\boldsymbol{\theta}) + \ln p(\mathbf{x}_3\mid\boldsymbol{\theta})\bigr]\,p(x_{32}\mid\boldsymbol{\theta}^0; x_{31}=2)\,\mathrm{d}x_{32} \\ &= \ln p(\mathbf{x}_1\mid\boldsymbol{\theta}) + \ln p(\mathbf{x}_2\mid\boldsymbol{\theta}) + \int_{-\infty}^{\infty} \ln p(\mathbf{x}_3\mid\boldsymbol{\theta})\, p(x_{32}\mid\boldsymbol{\theta}^0; x_{31}=2)\,\mathrm{d}x_{32} \\ &= \ln p(\mathbf{x}_1\mid\boldsymbol{\theta}) + \ln p(\mathbf{x}_2\mid\boldsymbol{\theta}) + \int_{-\infty}^{\infty} \ln p\!\left(\begin{pmatrix} 2 \\ x_{32} \end{pmatrix} \Big|\, \boldsymbol{\theta}\right) \frac{p\!\left(\begin{pmatrix} 2 \\ x_{32} \end{pmatrix} \Big|\, \boldsymbol{\theta}^0\right)}{\underbrace{\displaystyle\int_{-\infty}^{\infty} p\!\left(\begin{pmatrix} 2 \\ x_{32}' \end{pmatrix} \Big|\, \boldsymbol{\theta}^0\right)\mathrm{d}x_{32}'}_{1/(2e)}}\,\mathrm{d}x_{32} \\ &= \ln p(\mathbf{x}_1\mid\boldsymbol{\theta}) + \ln p(\mathbf{x}_2\mid\boldsymbol{\theta}) + 2e\int_{-\infty}^{\infty} \ln p\!\left(\begin{pmatrix} 2 \\ x_{32} \end{pmatrix} \Big|\, \boldsymbol{\theta}\right) p\!\left(\begin{pmatrix} 2 \\ x_{32} \end{pmatrix} \Big|\, \boldsymbol{\theta}^0\right)\mathrm{d}x_{32} \\ &= \ln p(\mathbf{x}_1\mid\boldsymbol{\theta}) + \ln p(\mathbf{x}_2\mid\boldsymbol{\theta}) + C \end{aligned}\tag{1}$$
The normalization term appearing in Eq. (1) is computed as
$$\begin{aligned} \int_{-\infty}^{\infty} p\!\left(\begin{pmatrix} 2 \\ x_{32}' \end{pmatrix} \Big|\, \boldsymbol{\theta}^0\right)\mathrm{d}x_{32}' &= \int_{-\infty}^{\infty} p(x_{31}=2\mid\theta_1^0=2)\cdot p(x_{32}'\mid\theta_2^0=4)\,\mathrm{d}x_{32}' \\ &= \int_0^4 \frac{1}{2}\,e^{-2/2}\cdot\frac{1}{4}\,\mathrm{d}x_{32}' \\ &= \frac{1}{2e} \end{aligned}\tag{2}$$
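A quick numerical check of the normalization constant in Eq. (2) (my own sketch, using the densities as stated in the problem and the initial estimate $\boldsymbol{\theta}^0=(2,4)^t$):

```python
import numpy as np
from scipy import integrate

theta1_0, theta2_0 = 2.0, 4.0     # initial estimate theta^0 = (2, 4)^t

def p_x1(x, t1):
    # exponential density (1/t1) * exp(-x/t1) for x >= 0
    return np.exp(-x / t1) / t1 if x >= 0 else 0.0

def p_x2(x, t2):
    # uniform density on [0, t2]
    return 1.0 / t2 if 0.0 <= x <= t2 else 0.0

# Normalization constant: integrate p((2, x32)^t | theta^0) over x32
val, _ = integrate.quad(lambda x32: p_x1(2.0, theta1_0) * p_x2(x32, theta2_0),
                        0.0, theta2_0)
print(val, 1.0 / (2.0 * np.e))    # both approximately 0.18394
```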
The value of $C$ in Eq. (1) then depends on $\theta_2$. Since the largest observed $x_2$ value is $x_{22}=3$, we must have $\theta_2 \ge 3$. The cases are discussed separately as follows:
[Solution] For the M step, the estimation criterion is:
$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}}\, Q(\boldsymbol{\theta}; \boldsymbol{\theta}^0)$$
Combining the two cases, $Q$ is maximized at $\boldsymbol{\theta} = \binom{2}{3}$.
(There are two more problems after this, but they are beyond my knowledge; if they show up on the exam I'm done for.)