Bayesian Learning

Table of Contents

    • 2.1 Overview
    • 2.2 Bayesian Decision Theory
    • 2.3 Bayesian Classifiers
    • 2.4 Bayesian Learning and Parameter Estimation

2.1 Overview

2.2 Bayesian Decision Theory

Probability basics:

  • Probability of an event A: $0 \leq P(A) \leq 1$
  • Conditional probability: $P(A|B)=\frac{P(AB)}{P(B)}$, $P(B|A)=\frac{P(AB)}{P(A)}$
  • Multiplication rule: $P(AB)=P(A|B)P(B)=P(B|A)P(A)$
  • Law of total probability: if $B_{1}\cup B_{2}\cup \cdots \cup B_{n}=\Omega$ and $B_{i}\cap B_{j}=\varnothing$ for $i\neq j$, then $P(A)=\sum_{i=1}^{n}P(A|B_{i})P(B_{i})$
  • Bayes' formula (a quick numerical sketch follows this list):
    • $P(A_{i}|B)=\frac{P(B|A_{i})P(A_{i})}{\sum_{j=1}^{n}P(B|A_{j})P(A_{j})}$
    • $P(A_{i}|B)\propto P(B|A_{i})P(A_{i})$
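
To make Bayes' formula concrete, here is a minimal Python sketch that computes the posterior $P(A_{i}|B)$ over three hypotheses; the priors and likelihoods are made-up illustrative numbers, not values from the text:

```python
# Posterior from prior and likelihood via Bayes' formula (illustrative numbers).
priors = [0.5, 0.3, 0.2]          # P(A_i): assumed priors over three hypotheses
likelihoods = [0.10, 0.40, 0.70]  # P(B|A_i): assumed likelihoods of observing B

evidence = sum(l * p for l, p in zip(likelihoods, priors))  # P(B), by total probability
posteriors = [l * p / evidence for l, p in zip(likelihoods, priors)]

print(posteriors)       # [0.161..., 0.387..., 0.451...]
print(sum(posteriors))  # 1.0 up to floating point
```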

Bayes decision:

  • Bayes' formula over the observed feature x and the classes $\omega_{i}$
  • $P(\omega_{i}|x)=\frac{P(x|\omega_{i})P(\omega_{i})}{P(x)}$ $\left(\text{posterior}=\frac{\text{likelihood}\times\text{prior}}{\text{evidence}}\right)$
  • $=\frac{P(x|\omega_{i})P(\omega_{i})}{\sum_{j} P(x|\omega_{j})P(\omega_{j})}$
  • $P(\omega_{i}|x)\propto P(x|\omega_{i})P(\omega_{i})$ $(\text{posterior}\propto\text{likelihood}\times\text{prior})$
  • Bayes decision rule (two classes):
    • $\text{Decide } \begin{cases} \omega_{1} & p(\omega_{1}|x)>p(\omega_{2}|x) \\ \omega_{2} & \text{otherwise} \end{cases} \Rightarrow \begin{cases} \omega_{1} & p(x|\omega_{1})p(\omega_{1})>p(x|\omega_{2})p(\omega_{2}) \\ \omega_{2} & \text{otherwise} \end{cases}$
    • $\begin{cases} \omega_{1} & \frac{p(x|\omega_{1})}{p(x|\omega_{2})}>\frac{p(\omega_{2})}{p(\omega_{1})} \\ \omega_{2} & \text{otherwise} \end{cases}$
  • Class discriminant (similarity) functions, any of which yields the same decision:
    • $g_{i}(x)=p(\omega_{i}|x)=\frac{p(x|\omega_{i})p(\omega_{i})}{\sum_{j=1}^{c}p(x|\omega_{j})p(\omega_{j})}$
    • $g_{i}(x)=p(x|\omega_{i})p(\omega_{i})$
    • $g_{i}(x)=\ln p(x|\omega_{i})+\ln p(\omega_{i})$
  • Decision function for two classes, deciding $\omega_{1}$ if $g(x)>0$ (a code sketch follows this list):
    • $g(x)=g_{1}(x)-g_{2}(x)$
    • $g(x)=p(\omega_{1}|x)-p(\omega_{2}|x)$
    • $g(x)=\ln\frac{p(x|\omega_{1})}{p(x|\omega_{2})}+\ln\frac{p(\omega_{1})}{p(\omega_{2})}$
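
The following Python sketch applies the log-likelihood-ratio form of $g(x)$ to a two-class problem; the 1-D Gaussian class-conditional densities, their parameters, and the priors are all assumed for illustration:

```python
import math

# Two-class Bayes decision via the log-likelihood-ratio discriminant
#   g(x) = ln p(x|w1)/p(x|w2) + ln P(w1)/P(w2);  decide w1 if g(x) > 0.
# Class-conditional densities are assumed 1-D Gaussians with made-up parameters.

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def g(x, prior1=0.6, prior2=0.4):
    likelihood_ratio = gaussian_pdf(x, 0.0, 1.0) / gaussian_pdf(x, 2.0, 1.0)
    return math.log(likelihood_ratio) + math.log(prior1 / prior2)

for x in [-1.0, 1.0, 3.0]:
    print(f"x = {x}: decide", "w1" if g(x) > 0 else "w2")
```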

2.3 Bayesian Classifiers

Bayesian classifiers:

  • Naive Bayes classifier: assumes the attributes (dimensions) of the feature vector x in $p(x|c)$ are mutually independent given the class (a minimal code sketch follows this list)
    • Adopts the "attribute independence assumption"
    • $p(c|x)=\frac{p(c)p(x|c)}{p(x)} \propto p(c)p(x|c)=p(c)\prod_{i=1}^{d}p(x_{i}|c)$
    • Key problem: learning the class-conditional probabilities $p(x_{i}|c)$ and the class priors $p(c)$ from the training samples
    • With $k$ classes and $d$ attributes, $1+k\cdot d$ probability distributions must be estimated in total
    • Class-prior estimation: $p(c)=\frac{|D_{c}|}{|D|}$, where $D_{c}$ is the set of class-$c$ samples in the training set $D$
    • Class-conditional probability estimation:
      • discrete $x_{i}$: $p(x_{i}|c)=\frac{|D_{c,x_{i}}|}{|D_{c}|}$
      • continuous $x_{i}$: $p(x_{i}|c)=\frac{1}{\sqrt{2\pi}\,\sigma_{c,i}}\exp\!\left(-\frac{(x_{i}-\mu_{c,i})^{2}}{2\sigma_{c,i}^{2}}\right)$ (the class-conditional probability is estimated via an assumed parametric distribution)
    • Learning process:
      • estimate the class priors
      • estimate the class-conditional probabilities
    • Decision process:
      • look up the estimated class priors
      • look up the estimated class-conditional probabilities
      • Bayes decision: $h(x)=\underset{c\in \mathcal{Y}}{\arg\max}\;P(c)\prod_{i=1}^{d}P(x_{i}|c)$
  • Semi-naive Bayes classifier: allows some dependencies among the attributes of x in $p(x|c)$
  • Normal-distribution Bayes classifier: assumes the class-conditional density $p(x|c;\theta)$ is a normal distribution
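
A minimal Python sketch of the counting-based learning and decision steps above, for discrete attributes only; the toy data and attribute values are made up, and a real implementation would add Laplace smoothing to avoid zero counts:

```python
from collections import Counter, defaultdict

# Minimal naive Bayes for discrete attributes, using the counting estimates
# p(c) = |D_c|/|D| and p(x_i|c) = |D_{c,x_i}|/|D_c|. Toy data is made up.

def train(samples):
    """samples: list of (attribute_tuple, class_label)."""
    class_counts = Counter(label for _, label in samples)
    # cond_counts[(c, i)][v] = number of class-c samples whose attribute i equals v
    cond_counts = defaultdict(Counter)
    for x, c in samples:
        for i, v in enumerate(x):
            cond_counts[(c, i)][v] += 1
    n = len(samples)
    prior = {c: cnt / n for c, cnt in class_counts.items()}
    def cond(v, i, c):
        return cond_counts[(c, i)][v] / class_counts[c]
    return prior, cond

def predict(x, prior, cond):
    # h(x) = argmax_c P(c) * prod_i P(x_i|c)
    def score(c):
        s = prior[c]
        for i, v in enumerate(x):
            s *= cond(v, i, c)
        return s
    return max(prior, key=score)

data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rainy", "mild"), "yes"), (("rainy", "hot"), "yes"),
        (("sunny", "mild"), "yes")]
prior, cond = train(data)
print(predict(("sunny", "mild"), prior, cond))  # -> "no" on this toy data
```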

2.4 Bayesian Learning and Parameter Estimation

Bayesian learning

The likelihood of the observed data D is used to update the prior over the model, giving the posterior distribution:
$p(\theta|D,\alpha)\propto p(D|\theta)\,p(\theta|\alpha)$
where $\alpha$ is a hyperparameter, not a parameter being estimated.

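As one concrete illustration (chosen here for simplicity, not taken from the text), the conjugate Beta-Bernoulli model makes this prior-to-posterior update explicit in closed form:

```python
# Conjugate Beta-Bernoulli illustration of p(theta|D,alpha) ∝ p(D|theta) p(theta|alpha).
# The hyperparameters alpha = (a, b) and the data are made-up numbers.
a, b = 2.0, 2.0                   # Beta(a, b) prior over theta = P(outcome 1)
data = [1, 1, 0, 1, 0, 1, 1, 1]   # observed Bernoulli outcomes D

heads, tails = sum(data), len(data) - sum(data)
a_post, b_post = a + heads, b + tails  # posterior is Beta(a + #1s, b + #0s)

print("posterior mean:", a_post / (a_post + b_post))              # 8/12 ≈ 0.667
print("posterior mode (MAP):", (a_post - 1) / (a_post + b_post - 2))  # 7/10 = 0.7
```
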
Maximum likelihood estimation (MLE)

  • Maximize the probability of the observed data
    • $p(\theta|D,\alpha)\propto p(D|\theta)p(\theta|\alpha)$
    • i.e., maximize only the likelihood term $p(D|\theta)$
  • Likelihood function:
    • $p(D|\theta)=p(x_{1},\dots,x_{n}|\theta)=\prod_{i=1}^{n}p(x_{i}|\theta)$ (assuming i.i.d. samples)
  • Maximum likelihood estimate:
    • $\hat{\theta}=\arg\max_{\theta} p(D|\theta)$

This turns into maximizing the log-likelihood:

$\hat{\theta}=\arg\max_{\theta}\sum_{i=1}^{n}\log p(x_{i}|\theta)$

Solution: set the gradient of the log-likelihood with respect to $\theta$ to zero

$\sum_{i=1}^{n}\nabla_{\theta}\log p(x_{i}|\theta)=0$
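
For a 1-D Gaussian, this gradient equation has a well-known closed-form solution: the sample mean and the biased ($1/n$) sample variance. A small Python sketch on synthetic data:

```python
import random

# MLE for a 1-D Gaussian: setting the gradient of the log-likelihood to zero
# yields mu_hat = sample mean and sigma2_hat = 1/n sample variance.
random.seed(0)
data = [random.gauss(3.0, 1.5) for _ in range(10_000)]  # synthetic samples

n = len(data)
mu_hat = sum(data) / n                                 # solves d/d(mu) log-lik = 0
sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n  # solves d/d(sigma^2) log-lik = 0

print(mu_hat, sigma2_hat ** 0.5)  # close to the true parameters (3.0, 1.5)
```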

Maximum a posteriori (MAP) estimation

Problem statement

  • Find the model or parameters ($\theta$) that maximize the posterior probability
    • $p(\theta|D,\alpha)\propto p(D|\theta)p(\theta|\alpha)$
    • maximize $p(\theta|D,\alpha)$
  • By Bayes' formula (a closed-form sketch follows this list):
    • $p(\theta|D,\alpha)=\frac{p(D|\theta)p(\theta|\alpha)}{p(D|\alpha)}$
    • $\hat{\theta}_{MAP}: \frac{\partial}{\partial \theta}p(\theta|D,\alpha)=0 \ \text{or} \ \frac{\partial}{\partial \theta}p(D|\theta)p(\theta|\alpha)=0$, since the evidence $p(D|\alpha)$ does not depend on $\theta$
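
As a sketch under assumed conditions (not a setup from the text): for a Gaussian likelihood with known variance and a Gaussian prior on the mean $\theta$, setting the derivative of $\log p(D|\theta)+\log p(\theta|\alpha)$ to zero gives a closed-form MAP estimate that shrinks the sample mean toward the prior mean:

```python
# MAP estimate of a Gaussian mean theta with known noise variance sigma2 and a
# Gaussian prior N(mu0, tau2) on theta (hyperparameters alpha = (mu0, tau2)).
# Setting d/d(theta) [log p(D|theta) + log p(theta|alpha)] = 0 gives a closed
# form blending the prior mean and the sample mean. All numbers are illustrative.
data = [2.9, 3.4, 3.1, 2.7, 3.3]
sigma2 = 1.0          # known likelihood variance
mu0, tau2 = 0.0, 0.5  # prior mean and prior variance

n = len(data)
x_bar = sum(data) / n
theta_map = (n * x_bar / sigma2 + mu0 / tau2) / (n / sigma2 + 1 / tau2)

print("MLE (sample mean):", x_bar)  # ignores the prior
print("MAP estimate:", theta_map)   # shrunk toward mu0 = 0
```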
