Compiled from the author's class notes, [email protected]
Classification: assign an input $x$ to one of two classes, $y\in\{0,1\}$.

A discriminative learning algorithm learns the conditional probability $p(y|x)$, or learns a direct function mapping from inputs to labels. Examples: linear regression, logistic regression, K-nearest neighbors, ...

A generative learning algorithm learns the joint probability $p(x,y)$.
By Bayes' rule,

$$p(y|x)=\frac{p(x|y)p(y)}{p(x)}$$

$$\arg\max_y p(y|x)=\arg\max_y\frac{p(x|y)p(y)}{p(x)}=\arg\max_y p(x|y)p(y)$$

so there is no need to compute $p(x)$ when making a prediction.
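A toy numeric check of this point: the argmax over the unnormalized scores $p(x|y)p(y)$ matches the argmax over the normalized posterior. The prior and likelihood values below are made up for illustration.

```python
# Made-up class priors and likelihoods for one observed x
priors = {0: 0.7, 1: 0.3}          # p(y)
likelihoods = {0: 0.02, 1: 0.10}   # p(x|y)

# Unnormalized scores p(x|y)p(y) are enough for the decision
scores = {y: likelihoods[y] * priors[y] for y in (0, 1)}
y_hat = max(scores, key=scores.get)

# Dividing by p(x) = sum_y p(x|y)p(y) rescales, but cannot change the argmax
evidence = sum(scores.values())
posterior = {y: scores[y] / evidence for y in (0, 1)}
```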
Generative classification algorithm (Gaussian discriminant analysis, GDA): given parameters $\phi,\mu_0,\mu_1,\Sigma$, the model assumes

$$y\sim \mathrm{Bernoulli}(\phi)\qquad x\mid y=0\sim N(\mu_0,\Sigma)\qquad x\mid y=1\sim N(\mu_1,\Sigma)$$
$$p(y)=\phi^y(1-\phi)^{1-y}$$

$$p(x\mid y=0)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x-\mu_0)^T\Sigma^{-1}(x-\mu_0)}$$

$$p(x\mid y=1)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)}$$
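The generative model above can be sketched numerically by sampling: first draw $y$, then draw $x$ from the corresponding Gaussian. The 2-D parameter values below are invented for illustration, not fitted quantities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 2-D parameters for illustration only
phi = 0.4                                  # p(y = 1)
mu0 = np.array([0.0, 0.0])                 # mean of x | y = 0
mu1 = np.array([2.0, 2.0])                 # mean of x | y = 1
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])             # shared covariance

def sample(m):
    """Draw m (x, y) pairs: y ~ Bernoulli(phi), x | y ~ N(mu_y, Sigma)."""
    y = rng.binomial(1, phi, size=m)
    x = np.array([rng.multivariate_normal(mu1 if yi == 1 else mu0, Sigma)
                  for yi in y])
    return x, y

X, y = sample(1000)
```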
Log-likelihood of the data:

$$\ell(\phi,\mu_0,\mu_1,\Sigma)=\log\prod_{i=1}^{m}p(x^{(i)},y^{(i)};\phi,\mu_0,\mu_1,\Sigma)=\log\prod_{i=1}^{m}p(x^{(i)}\mid y^{(i)};\mu_0,\mu_1,\Sigma)\,p(y^{(i)};\phi)$$
Maximum likelihood estimation:

$$\begin{aligned} \ell(\phi,\mu_0,\mu_1,\Sigma) &=\log\prod_{i=1}^m p(x^{(i)},y^{(i)})=\log\prod_{i=1}^m p(x^{(i)}\mid y^{(i)})\,p(y^{(i)}) \\ &=\sum_{i=1}^m \log p(x^{(i)}\mid y^{(i)})+\sum_{i=1}^m \log p(y^{(i)}) \\ &=\sum_{i=1}^m \log\left( p(x^{(i)}\mid y^{(i)}=0)^{1-y^{(i)}}\cdot p(x^{(i)}\mid y^{(i)}=1)^{y^{(i)}} \right)+\sum_{i=1}^m \log p(y^{(i)}) \\ &=\sum_{i=1}^m (1-y^{(i)})\log p(x^{(i)}\mid y^{(i)}=0)+\sum_{i=1}^m y^{(i)}\log p(x^{(i)}\mid y^{(i)}=1)+\sum_{i=1}^m \log p(y^{(i)}) \end{aligned}$$
Taking the derivative with respect to $\phi$:

$$\begin{aligned} \frac{\partial\,\ell(\phi,\mu_0,\mu_1,\Sigma)}{\partial\phi}&=\frac{\partial\sum_{i=1}^m \log p(y^{(i)})}{\partial\phi} = \frac{\partial\sum_{i=1}^m \log\left(\phi^{y^{(i)}}(1-\phi)^{1-y^{(i)}}\right)}{\partial\phi} \\&=\frac{\partial\sum_{i=1}^m \left(y^{(i)}\log\phi+(1-y^{(i)})\log(1-\phi)\right)}{\partial\phi} \\&=\sum_{i=1}^m\left(y^{(i)}\frac{1}{\phi}-(1-y^{(i)})\frac{1}{1-\phi}\right) \\&=\sum_{i=1}^m\left(I(y^{(i)}=1)\frac{1}{\phi}-I(y^{(i)}=0)\frac{1}{1-\phi}\right) \end{aligned}$$
Setting this to zero gives:

$$\phi=\frac{\sum_{i=1}^m I(y^{(i)}=1)}{\sum_{i=1}^m I(y^{(i)}=0)+\sum_{i=1}^m I(y^{(i)}=1)}=\frac{\sum_{i=1}^m I(y^{(i)}=1)}{m}$$
Taking the derivative with respect to $\mu_0$:

$$\begin{aligned} \frac{\partial\,\ell(\phi,\mu_0,\mu_1,\Sigma)}{\partial\mu_0}&=\frac{\partial\sum_{i=1}^m (1-y^{(i)})\log p(x^{(i)}\mid y^{(i)}=0)}{\partial\mu_0} \\&=\frac{\partial\sum_{i=1}^m (1-y^{(i)})\left(\log\frac{1}{\sqrt{(2\pi)^n|\Sigma|}}-\frac{1}{2}(x^{(i)}-\mu_0)^T\Sigma^{-1}(x^{(i)}-\mu_0)\right)}{\partial\mu_0} \\&=\sum_{i=1}^m (1-y^{(i)})\,\Sigma^{-1}(x^{(i)}-\mu_0) \\&=\sum_{i=1}^m I(y^{(i)}=0)\,\Sigma^{-1}(x^{(i)}-\mu_0) \end{aligned}$$
Setting this to zero (and, by symmetry, repeating the argument for $\mu_1$):

$$\mu_0=\frac{\sum_{i=1}^m I(y^{(i)}=0)\,x^{(i)}}{\sum_{i=1}^m I(y^{(i)}=0)}\qquad \mu_1=\frac{\sum_{i=1}^m I(y^{(i)}=1)\,x^{(i)}}{\sum_{i=1}^m I(y^{(i)}=1)}$$
To estimate $\Sigma$, first rewrite the terms of the log-likelihood that involve it:

$$\begin{aligned} &\sum_{i=1}^m(1-y^{(i)})\log p(x^{(i)}\mid y^{(i)}=0)+\sum_{i=1}^m y^{(i)}\log p(x^{(i)}\mid y^{(i)}=1)\\&=\sum_{i=1}^m(1-y^{(i)})\left(\log\frac{1}{\sqrt{(2\pi)^n|\Sigma|}}-\frac{1}{2}(x^{(i)}-\mu_0)^T\Sigma^{-1}(x^{(i)}-\mu_0)\right)+\sum_{i=1}^m y^{(i)}\left(\log\frac{1}{\sqrt{(2\pi)^n|\Sigma|}}-\frac{1}{2}(x^{(i)}-\mu_1)^T\Sigma^{-1}(x^{(i)}-\mu_1)\right)\\&=\sum_{i=1}^m\log\frac{1}{\sqrt{(2\pi)^n|\Sigma|}}-\frac{1}{2}\sum_{i=1}^m(x^{(i)}-\mu_{y^{(i)}})^T\Sigma^{-1}(x^{(i)}-\mu_{y^{(i)}})\\&=\sum_{i=1}^m\left(-\frac{n}{2}\log(2\pi)-\frac{1}{2}\log|\Sigma|\right)-\frac{1}{2}\sum_{i=1}^m(x^{(i)}-\mu_{y^{(i)}})^T\Sigma^{-1}(x^{(i)}-\mu_{y^{(i)}})\end{aligned}$$
Then take the derivative with respect to $\Sigma$, using the matrix identities (valid for symmetric $\Sigma$)

$$\frac{\partial\log|\Sigma|}{\partial\Sigma}=\Sigma^{-1}\qquad \frac{\partial\,a^T\Sigma^{-1}a}{\partial\Sigma}=-\Sigma^{-1}aa^T\Sigma^{-1}$$

which gives

$$\begin{aligned} \frac{\partial\,\ell(\phi,\mu_0,\mu_1,\Sigma)}{\partial\Sigma}&=-\frac{m}{2}\Sigma^{-1}+\frac{1}{2}\sum_{i=1}^m\Sigma^{-1}(x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T\Sigma^{-1} \end{aligned}$$
Setting this to zero yields:

$$\Sigma=\frac{1}{m}\sum_{i=1}^m(x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T$$
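The closed-form estimates derived above can be collected into a short fitting routine. This is a sketch under the assumption that labels are 0/1 and both classes appear in the data; the function and variable names are my own.

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form MLEs for GDA: phi, mu0, mu1, and the shared Sigma.
    X: (m, n) design matrix; y: (m,) array of 0/1 labels."""
    m = X.shape[0]
    phi = y.mean()                              # fraction of y = 1 examples
    mu0 = X[y == 0].mean(axis=0)                # class-0 sample mean
    mu1 = X[y == 1].mean(axis=0)                # class-1 sample mean
    mus = np.where(y[:, None] == 1, mu1, mu0)   # mu_{y^{(i)}} for each row
    diff = X - mus
    Sigma = diff.T @ diff / m                   # pooled outer products / m
    return phi, mu0, mu1, Sigma

# Tiny made-up dataset to exercise the routine
X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
y = np.array([0, 0, 1, 1])
phi, mu0, mu1, Sigma = fit_gda(X, y)
```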
The posterior $p(y=1\mid x;\phi,\mu_0,\mu_1,\Sigma)$ can be written in logistic form (taking $x$ to be augmented with a constant coordinate $x_0=1$):

$$p(y=1\mid x;\phi,\Sigma,\mu_0,\mu_1)=\frac{1}{1+e^{-\theta^Tx}}$$

$$\theta=\begin{bmatrix}\log\frac{\phi}{1-\phi}-\frac{1}{2}(\mu_1^T\Sigma^{-1}\mu_1-\mu_0^T\Sigma^{-1}\mu_0)\\\Sigma^{-1}(\mu_1-\mu_0)\end{bmatrix}$$
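This logistic form can be checked numerically: the posterior computed directly from Bayes' rule with Gaussian densities should match a sigmoid of an affine function of $x$. The parameter values below are made up, and the code uses the convention $p(y=1\mid x)=\sigma(\theta_0+\theta^\top x)$ with a separate intercept.

```python
import numpy as np

def gauss(x, mu, Sigma):
    """Multivariate normal density N(x; mu, Sigma)."""
    n = len(mu)
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / \
           np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))

# Made-up parameters for illustration
phi = 0.4
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 2.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
Si = np.linalg.inv(Sigma)

x = np.array([1.5, 0.5])

# Posterior directly from Bayes' rule
num = gauss(x, mu1, Sigma) * phi
p_direct = num / (num + gauss(x, mu0, Sigma) * (1 - phi))

# Same posterior in logistic form: sigma(theta0 + theta . x)
theta0 = np.log(phi / (1 - phi)) - 0.5 * (mu1 @ Si @ mu1 - mu0 @ Si @ mu0)
theta = Si @ (mu1 - mu0)
p_logistic = 1.0 / (1.0 + np.exp(-(theta0 + theta @ x)))
```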
GDA: makes the stronger assumption that $p(x\mid y)$ is multivariate Gaussian; when that assumption is (approximately) correct, it uses the data more efficiently.
Logistic regression: makes weaker assumptions, so it is more robust when the Gaussian assumption fails.
Naive Bayes is a simple generative learning algorithm for discrete-valued inputs.
In practice: spam classification
Project source code:
Given an email $x$, decide whether it is spam ($y=1$) or not ($y=0$).
Each email is represented by a binary vector whose dimension equals the vocabulary size: $x_i=1$ means the $i$-th vocabulary word appears in the email, and $x_i=0$ means it does not.
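This representation can be illustrated with a toy vocabulary; the word list below is invented, and a real system would use a dictionary of thousands of words.

```python
# Invented toy vocabulary; x has one coordinate per vocabulary word
vocab = ["buy", "cheap", "hello", "meeting", "now"]

def to_vector(email_text):
    """Binary feature vector: x_i = 1 iff vocabulary word i occurs."""
    words = set(email_text.lower().split())
    return [1 if w in words else 0 for w in vocab]

x = to_vector("Buy cheap now")
```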
Naive Bayes assumption: the features are conditionally independent given the class,

$$p(x_1,x_2,\dots,x_n\mid y)=\prod_{i=1}^n p(x_i\mid y)$$
Parameter learning:
Multivariate Bernoulli event model:

$$p(x,y)=p(y)\prod_{i=1}^n p(x_i\mid y)$$
Maximum likelihood:

$$L=\prod_{i=1}^m p(x^{(i)},y^{(i)})$$

Maximizing this yields the parameter estimates: for each word $j$,

$$\phi_{j\mid y=1}=\frac{\sum_{i=1}^m I(x_j^{(i)}=1\wedge y^{(i)}=1)}{\sum_{i=1}^m I(y^{(i)}=1)}\qquad \phi_{j\mid y=0}=\frac{\sum_{i=1}^m I(x_j^{(i)}=1\wedge y^{(i)}=0)}{\sum_{i=1}^m I(y^{(i)}=0)}\qquad \phi_y=\frac{\sum_{i=1}^m I(y^{(i)}=1)}{m}$$
Prediction
Given a new sample, compute $p(y=1\mid x)$:

$$p(y=1\mid x)=\frac{p(x\mid y=1)p(y=1)}{p(x)}=\frac{p(x\mid y=1)p(y=1)}{p(x\mid y=1)p(y=1)+p(x\mid y=0)p(y=0)}$$

$$\dots=\frac{\prod_{i=1}^n p(x_i\mid y=1)\,p(y=1)}{\prod_{i=1}^n p(x_i\mid y=1)\,p(y=1)+\prod_{i=1}^n p(x_i\mid y=0)\,p(y=0)}$$
If $p(y=1\mid x)>0.5$, predict $y=1$.
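Putting the pieces together, a multivariate-Bernoulli Naive Bayes classifier fits in a few lines. This sketch adds Laplace smoothing (not derived above) so that words unseen in one class do not produce zero probabilities; the function names, the toy data, and the choice of $y=1$ for spam are my own.

```python
import numpy as np

def fit_nb(X, y, alpha=1.0):
    """Estimate p(y=1) and p(x_j=1 | y) for the Bernoulli model.
    X: (m, n) binary matrix; y: (m,) labels in {0, 1}.
    alpha: Laplace smoothing count (alpha=0 gives the raw MLE)."""
    phi_y = y.mean()
    p1 = (X[y == 1].sum(axis=0) + alpha) / ((y == 1).sum() + 2 * alpha)
    p0 = (X[y == 0].sum(axis=0) + alpha) / ((y == 0).sum() + 2 * alpha)
    return phi_y, p0, p1

def posterior_nb(x, phi_y, p0, p1):
    """p(y=1 | x), computed in log space for numerical stability."""
    log1 = np.log(phi_y) + np.sum(x * np.log(p1) + (1 - x) * np.log(1 - p1))
    log0 = np.log(1 - phi_y) + np.sum(x * np.log(p0) + (1 - x) * np.log(1 - p0))
    return 1.0 / (1.0 + np.exp(log0 - log1))

# Tiny made-up training set: rows are emails, columns are vocabulary words
X = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]])
y = np.array([1, 1, 0, 0])  # assuming y = 1 marks spam
phi_y, p0, p1 = fit_nb(X, y)
p_spam = posterior_nb(np.array([1, 1, 0]), phi_y, p0, p1)
```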
Source code: https://github.com/Miraclemin/Quadratic-Discriminant-Analysis