1. Hard classification:
a. Linear discriminant analysis (LDA)
b. Perceptron
2. Soft classification
(1) Generative:
a. GDA
b. Naive Bayes
(2) Discriminative: logistic regression
For the perceptron, first choose the activation function
$$
sign(a)=\begin{cases}+1, & a\geq 0\\ -1, & a<0\end{cases}
$$
This function maps the output of a linear model to one of two classes. Next we define the loss function. Here the loss is defined as the number of misclassified samples, which can be written with an indicator function; the indicator function, however, is not differentiable. It is defined as follows:
$$
L(W)=\sum_{i=1}^N I\{y_iW^Tx_i<0\}
$$
This definition works because $y_iW^Tx_i\geq 0$ whenever a sample is correctly classified. Under an arbitrarily small change $\Delta W$ in $W$, the indicator's value can jump from 0 to 1 or from 1 to 0, so the indicator function is not differentiable, and minimizing it directly is an NP-hard problem. Since $y_iW^Tx_i$ is a linear function of $W$, we instead define the loss over the misclassified set as
$$
L(W)=\sum_{x_i\in D_{error}}-y_iW^Tx_i
$$
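As a minimal sketch of how this loss is minimized in practice (assuming a numpy array `X` of shape (N, P) and labels `y` in {+1, -1}; the function name and hyperparameters are illustrative), stochastic gradient descent on $L(W)$ recovers the classic perceptron update:

```python
import numpy as np

def perceptron(X, y, lr=1.0, epochs=100):
    """SGD on L(W) = sum over misclassified i of -y_i W^T x_i (labels +1/-1)."""
    W = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (W @ xi) <= 0:       # misclassified (or on the boundary)
                W += lr * yi * xi        # gradient of -y_i W^T x_i is -y_i x_i
    return W
```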
The basic idea of LDA is to pick a direction, with direction vector $W$, and project the training samples onto it; the projections should have the following two properties:
1. Small within-class distance (high cohesion): projections of samples from the same class stay close together.
2. Large between-class distance (low coupling): projections of samples from different classes stay far apart.
Now we formalize the algorithm. Let the sample vectors be $x_i$; projecting $x_i$ onto the direction $W$ gives
$$
z_i=W^Tx_i=|W||x_i|\cos\theta
$$
Suppose the two classes contain $N_1$ and $N_2$ samples respectively, and let $C_1, C_2$ denote class 1 and class 2:
$$
C_1:\quad \bar{Z}_1=\frac{1}{N_1}\sum_{i=1}^{N_1}W^Tx_i,\qquad
S_{Z_1}=\frac{1}{N_1}\sum_{i=1}^{N_1}(z_i-\bar{Z}_1)(z_i-\bar{Z}_1)^T=\frac{1}{N_1}\sum_{i=1}^{N_1}(W^Tx_i-\bar{Z}_1)(W^Tx_i-\bar{Z}_1)^T
$$

$$
C_2:\quad \bar{Z}_2=\frac{1}{N_2}\sum_{i=1}^{N_2}W^Tx_i,\qquad
S_{Z_2}=\frac{1}{N_2}\sum_{i=1}^{N_2}(z_i-\bar{Z}_2)(z_i-\bar{Z}_2)^T=\frac{1}{N_2}\sum_{i=1}^{N_2}(W^Tx_i-\bar{Z}_2)(W^Tx_i-\bar{Z}_2)^T
$$
The between-class distance can be written as $(\bar{Z}_1-\bar{Z}_2)^2$ and the within-class distance as $S_1+S_2$ (writing $S_k$ for $S_{Z_k}$). Let $J(W)=\frac{(\bar{Z}_1-\bar{Z}_2)^2}{S_1+S_2}$; the optimization objective is then
$$
\hat{W}=\mathop{argmax}\limits_{W}J(W)
$$
From
$$
(\bar{Z}_1-\bar{Z}_2)^2=\left(\frac{1}{N_1}\sum_{i=1}^{N_1}W^Tx_i-\frac{1}{N_2}\sum_{i=1}^{N_2}W^Tx_i\right)^2=W^T(\bar{X}_{c_1}-\bar{X}_{c_2})(\bar{X}_{c_1}-\bar{X}_{c_2})^TW
$$
$$
S_1+S_2=\frac{1}{N_1}\sum_{i=1}^{N_1}(W^Tx_i-\bar{Z}_1)(W^Tx_i-\bar{Z}_1)^T+\frac{1}{N_2}\sum_{i=1}^{N_2}(W^Tx_i-\bar{Z}_2)(W^Tx_i-\bar{Z}_2)^T\\
=\frac{1}{N_1}\sum_{i=1}^{N_1}W^T(x_i-\bar{X}_{c_1})(x_i-\bar{X}_{c_1})^TW+\frac{1}{N_2}\sum_{i=1}^{N_2}W^T(x_i-\bar{X}_{c_2})(x_i-\bar{X}_{c_2})^TW\\
=W^TS_{c_1}W+W^TS_{c_2}W=W^T(S_{c_1}+S_{c_2})W
$$
where $S_{c_1}, S_{c_2}$ are the covariance matrices of class 1 and class 2 respectively, it follows that
$$
J(W)=\frac{W^T(\bar{X}_{c_1}-\bar{X}_{c_2})(\bar{X}_{c_1}-\bar{X}_{c_2})^TW}{W^T(S_{c_1}+S_{c_2})W}=\frac{W^TS_bW}{W^TS_wW}=W^TS_bW(W^TS_wW)^{-1}
$$
Solving for the optimum:
$$
\frac{\partial J(W)}{\partial W}=2S_bW(W^TS_wW)^{-1}-W^TS_bW(W^TS_wW)^{-2}\,2S_wW=0\\
\Rightarrow S_bW(W^TS_wW)-(W^TS_bW)S_wW=0\\
\Rightarrow (W^TS_bW)S_wW=S_bW(W^TS_wW)
$$
where $W\in\mathbb{R}^{P\times 1}$, $W^T\in\mathbb{R}^{1\times P}$, $S_w\in\mathbb{R}^{P\times P}$, so $W^TS_bW$ and $W^TS_wW$ are real scalars.
Hence
$$
W=\frac{W^TS_wW}{W^TS_bW}S_w^{-1}S_bW
$$
In practice we only care about the direction of $W$, not its magnitude, so
$$
W=\frac{W^TS_wW}{W^TS_bW}S_w^{-1}S_bW\propto S_w^{-1}S_bW=S_w^{-1}(\bar{X}_{c_1}-\bar{X}_{c_2})(\bar{X}_{c_1}-\bar{X}_{c_2})^TW
$$
Since $(\bar{X}_{c_1}-\bar{X}_{c_2})^TW\in\mathbb{R}$ is a scalar,
$$
W\propto S_w^{-1}(\bar{X}_{c_1}-\bar{X}_{c_2})
$$
If $S_w$ is diagonal and isotropic, then $S_w^{-1}\propto I\Rightarrow W\propto(\bar{X}_{c_1}-\bar{X}_{c_2})$.
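A minimal numpy sketch of this closed form (assuming two matrices `X1`, `X2` holding the samples of each class, one row per sample; the function name is illustrative):

```python
import numpy as np

def lda_direction(X1, X2):
    """Fisher's direction W ∝ S_w^{-1} (mean1 - mean2); rows of X1/X2 are samples."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1) / len(X1)   # within-class scatter of class 1
    S2 = (X2 - m2).T @ (X2 - m2) / len(X2)   # within-class scatter of class 2
    W = np.linalg.solve(S1 + S2, m1 - m2)    # avoids forming S_w^{-1} explicitly
    return W / np.linalg.norm(W)             # only the direction matters
```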
Logistic regression introduces a nonlinear function that maps real values to probabilities in $[0, 1]$; the nonlinear function here is the sigmoid:
$$
\sigma(z)=\frac{1}{1+e^{-z}}
$$
The logistic regression model is defined as
$$
P_1=P(Y=1|X)=\sigma(W^TX)=\frac{1}{1+e^{-W^TX}}\\
P_0=P(Y=0|X)=1-P(Y=1|X)=\frac{e^{-W^TX}}{1+e^{-W^TX}}
$$
which can be written compactly as
$$
P(Y|X)=P_1^YP_0^{1-Y},\qquad Y=1\Rightarrow P=P_1,\quad Y=0\Rightarrow P=P_0
$$
Using maximum likelihood estimation,
$$
\hat{W}=\mathop{argmax}\limits_{W}\log P(Y|X)=\mathop{argmax}\limits_{W}\log\prod_{i=1}^N P(y_i|x_i)=\mathop{argmax}\limits_{W}\sum_{i=1}^N\log P(y_i|x_i)\\
=\mathop{argmax}\limits_{W}\sum_{i=1}^N\big(y_i\log P_1+(1-y_i)\log P_0\big)\\
=\mathop{argmin}\limits_{W}\sum_{i=1}^N\big(-y_i\log P_1-(1-y_i)\log P_0\big)
$$
The final expression is the cross-entropy loss, which can also be derived from information theory.
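There is no closed-form maximizer, so the log-likelihood is usually optimized iteratively; its gradient works out to $\sum_{i=1}^N(y_i-\sigma(W^Tx_i))x_i$. A minimal gradient-ascent sketch in numpy (assuming `X` of shape (N, P) and labels `y` in {0, 1}; names and hyperparameters are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, lr=0.1, epochs=1000):
    """Gradient ascent on the log-likelihood; the gradient is X^T (y - sigma(XW))."""
    W = np.zeros(X.shape[1])
    for _ in range(epochs):
        W += lr * X.T @ (y - sigmoid(X @ W))   # labels are 0/1
    return W
```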
A probabilistic discriminative model computes, for each sample, the probability of belonging to each class and assigns the sample to the most probable one; a generative model works differently. Since what we really want is which class has the larger probability, a generative model applies Bayes' theorem $P(Y|X)=\frac{P(X|Y)P(Y)}{P(X)}$ and compares class probabilities through $\frac{P(X|Y)P(Y)}{P(X)}$; because $P(X)$ does not depend on the class, we have
$$
P(Y|X)\propto P(X|Y)P(Y)=P(X,Y)
$$
A generative model thus actually models the joint probability. For GDA, assume:
$$
1.\ y\sim Bernoulli(\phi)\\
2.\ x|y=1\sim N(\mu_1,\Sigma)\\
3.\ x|y=0\sim N(\mu_0,\Sigma)
$$
The log-likelihood is
$$
L(\theta)=\log\prod_{i=1}^N p(x_i,y_i)=\sum_{i=1}^N\big[\log p(x_i|y_i)+\log p(y_i)\big]\\
=\sum_{i=1}^N\big[(1-y_i)\log N(\mu_0,\Sigma)+y_i\log N(\mu_1,\Sigma)+y_i\log\phi+(1-y_i)\log(1-\phi)\big]
$$
Using maximum likelihood estimation, we solve $\mathop{argmax}\limits_{\phi,\mu_0,\mu_1,\Sigma}L(\theta)$.
For $\phi$, set
$$
\frac{\partial L(\theta)}{\partial\phi}=\sum_{i=1}^N\left(\frac{y_i}{\phi}+\frac{y_i-1}{1-\phi}\right)=0
$$
which gives
$$
\phi=\frac{\sum_{i=1}^N y_i}{N}=\frac{N_1}{N}
$$
For $\mu_1$, the relevant term is the second one:
$$
\sum_{i=1}^N y_i\log N(\mu_1,\Sigma)=\sum_{i=1}^N y_i\log\frac{1}{(2\pi)^{P/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x_i-\mu_1)^T\Sigma^{-1}(x_i-\mu_1)}\\
=\sum_{i=1}^N y_i\left(\log\frac{1}{(2\pi)^{P/2}|\Sigma|^{1/2}}-\frac{1}{2}(x_i-\mu_1)^T\Sigma^{-1}(x_i-\mu_1)\right)
$$
Only the quadratic term depends on $\mu_1$, so differentiating with respect to $\mu_1$ concerns
$$
\sum_{i=1}^N y_i\left(-\frac{1}{2}(x_i-\mu_1)^T\Sigma^{-1}(x_i-\mu_1)\right)=\sum_{i=1}^N y_i\left(-\frac{1}{2}(x_i^T\Sigma^{-1}-\mu_1^T\Sigma^{-1})(x_i-\mu_1)\right)\\
=-\frac{1}{2}\sum_{i=1}^N y_i\left(x_i^T\Sigma^{-1}x_i-\mu_1^T\Sigma^{-1}x_i-x_i^T\Sigma^{-1}\mu_1+\mu_1^T\Sigma^{-1}\mu_1\right)\\
=-\frac{1}{2}\sum_{i=1}^N y_i\left(x_i^T\Sigma^{-1}x_i-2x_i^T\Sigma^{-1}\mu_1+\mu_1^T\Sigma^{-1}\mu_1\right)
$$
Let $\Delta=-\frac{1}{2}\sum_{i=1}^N y_i\big(x_i^T\Sigma^{-1}x_i-2x_i^T\Sigma^{-1}\mu_1+\mu_1^T\Sigma^{-1}\mu_1\big)$. Differentiating $\Delta$ with respect to $\mu_1$ and setting the result to zero,
$$
\frac{\partial\Delta}{\partial\mu_1}=-\frac{1}{2}\sum_{i=1}^N y_i(-2\Sigma^{-1}x_i+2\Sigma^{-1}\mu_1)=0\\
\Rightarrow \sum_{i=1}^N(y_i\Sigma^{-1}x_i-y_i\Sigma^{-1}\mu_1)=0\\
\Rightarrow \sum_{i=1}^N y_ix_i=\sum_{i=1}^N y_i\mu_1\\
\Rightarrow \mu_1=\frac{\sum_{i=1}^N y_ix_i}{N_1}
$$
For $\mu_0$, by symmetry with the $\mu_1$ case (with $y_i$ replaced by $1-y_i$),
$$
\mu_0=\frac{\sum_{i=1}^N(1-y_i)x_i}{N_0}
$$
For $\Sigma$: it appears in the first two terms of $L(\theta)$. We will use the following matrix identities:
$$
\frac{\partial\,tr(AB)}{\partial A}=B^T\\
\frac{\partial|A|}{\partial A}=|A|A^{-1}\quad(A\ \text{symmetric})\\
tr(AB)=tr(BA)\\
tr(ABC)=tr(CAB)=tr(BCA)
$$
Maximizing the likelihood over $\Sigma$:
$$
\hat{\Sigma}=\mathop{argmax}\limits_{\Sigma}\left(\sum_{x_i\in c_1}\log N(\mu_1,\Sigma)+\sum_{x_i\in c_2}\log N(\mu_0,\Sigma)\right)
$$
From
$$
\sum_{i=1}^N\log N(\mu,\Sigma)=\sum_{i=1}^N\log\frac{1}{(2\pi)^{P/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x_i-\mu)^T\Sigma^{-1}(x_i-\mu)}\\
=\sum_{i=1}^N\left(\log\frac{1}{(2\pi)^{P/2}}-\frac{1}{2}\log|\Sigma|-\frac{1}{2}(x_i-\mu)^T\Sigma^{-1}(x_i-\mu)\right)\\
=Const-\frac{1}{2}N\log|\Sigma|-\frac{1}{2}\sum_{i=1}^N(x_i-\mu)^T\Sigma^{-1}(x_i-\mu)\\
=Const-\frac{1}{2}N\log|\Sigma|-\frac{1}{2}\sum_{i=1}^N tr\big((x_i-\mu)^T\Sigma^{-1}(x_i-\mu)\big)\\
=Const-\frac{1}{2}N\log|\Sigma|-\frac{1}{2}tr\Big(\sum_{i=1}^N(x_i-\mu)(x_i-\mu)^T\,\Sigma^{-1}\Big)\\
=Const-\frac{1}{2}N\log|\Sigma|-\frac{1}{2}N\,tr(S\Sigma^{-1})
$$
where $S$ is the sample covariance matrix of the corresponding class. Hence
$$
\sum_{x_i\in c_1}\log N(\mu_1,\Sigma)+\sum_{x_i\in c_2}\log N(\mu_0,\Sigma)=Const-\frac{1}{2}N\log|\Sigma|-\frac{1}{2}N_1tr(S_1\Sigma^{-1})-\frac{1}{2}N_2tr(S_2\Sigma^{-1})
$$
Let $\Delta=Const-\frac{1}{2}N\log|\Sigma|-\frac{1}{2}N_1tr(S_1\Sigma^{-1})-\frac{1}{2}N_2tr(S_2\Sigma^{-1})$. Setting its derivative with respect to $\Sigma$ to zero (using the identities above, with $S_1, S_2, \Sigma$ symmetric),
$$
\frac{\partial\Delta}{\partial\Sigma}=-\frac{1}{2}N\Sigma^{-1}+\frac{1}{2}N_1\Sigma^{-1}S_1\Sigma^{-1}+\frac{1}{2}N_2\Sigma^{-1}S_2\Sigma^{-1}=0\\
\Rightarrow N\Sigma=N_1S_1+N_2S_2
$$
Solving gives $\Sigma=\frac{N_1S_1+N_2S_2}{N}$.
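A minimal numpy sketch of these closed-form MLE estimates (assuming `X` with rows as samples and binary numpy labels `y` in {0, 1}; the function name is illustrative):

```python
import numpy as np

def gda_fit(X, y):
    """Closed-form MLE for GDA: phi, mu_0, mu_1 and the shared covariance Sigma."""
    N = len(y)
    N1 = int(np.sum(y == 1))
    phi = N1 / N                             # phi = N_1 / N
    mu1 = X[y == 1].mean(axis=0)             # mu_1 = sum_i y_i x_i / N_1
    mu0 = X[y == 0].mean(axis=0)             # mu_0 = sum_i (1 - y_i) x_i / N_0
    S1 = np.cov(X[y == 1].T, bias=True)      # class-1 scatter S_1 (divides by N_1)
    S2 = np.cov(X[y == 0].T, bias=True)      # class-0 scatter S_2 (divides by N_2)
    Sigma = (N1 * S1 + (N - N1) * S2) / N    # Sigma = (N_1 S_1 + N_2 S_2) / N
    return phi, mu0, mu1, Sigma
```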
The core of naive Bayes is the naive Bayes assumption, also called the conditional independence assumption; from the viewpoint of probabilistic graphical models, it is the simplest directed graphical model.
Assume the individual attributes are independent given the class:
$$
p(X|Y)=\prod_{i=1}^P P(x_i|Y)
$$
that is,
$$
x_i\perp x_j\,|\,y,\quad\forall i\neq j
$$
Applying Bayes' theorem to a single observation,
$$
P(y|x)=\frac{P(x|y)P(y)}{P(x)}=\frac{\prod_{i=1}^P P(x_i|y)\,P(y)}{P(x)}
$$
For binary classification, assume $y\sim Bernoulli$; for multi-class classification, assume $y\sim Categorical$. If $x_i$ is a continuous variable, assume $P(x_i|y)\sim N(\mu_i,\sigma_i^2)$; if $x_i$ is discrete, assume a categorical distribution. These parameters are usually estimated by MLE directly on the dataset; because the dependencies between dimensions need not be modeled, the amount of data required is greatly reduced. Once the parameters are estimated, substituting them into Bayes' theorem yields the posterior distribution over classes.
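A minimal Gaussian naive Bayes sketch under these assumptions (numpy, continuous features, integer class labels; all names are illustrative):

```python
import numpy as np

def gnb_fit(X, y):
    """Per-class MLE: class prior plus per-dimension mean/variance of P(x_i|y)."""
    classes = np.unique(y)
    priors = {c: float(np.mean(y == c)) for c in classes}
    means = {c: X[y == c].mean(axis=0) for c in classes}
    varis = {c: X[y == c].var(axis=0) + 1e-9 for c in classes}  # jitter for stability
    return priors, means, varis

def gnb_predict(x, priors, means, varis):
    """argmax_c  log P(y=c) + sum_i log N(x_i; mu_ci, sigma_ci^2)."""
    def score(c):
        ll = -0.5 * np.sum(np.log(2 * np.pi * varis[c]) + (x - means[c]) ** 2 / varis[c])
        return np.log(priors[c]) + ll
    return max(priors, key=score)
```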