Bayes' theorem was introduced in an earlier article, and plenty of material is available online, so it will not be repeated here.
Suppose we have data samples of the form:
$$(X_1, X_2, \dots, X_n, Y)$$
There are N samples, each with n features, and the output takes one of K classes.
From these samples we can learn the prior probability:
$$P(Y=c_k), \quad k=1,2,\dots,K$$
and then the conditional probability:
$$P(X=x|Y=c_k) = P(X^{(1)}=x^{(1)}, X^{(2)}=x^{(2)}, \dots, X^{(n)}=x^{(n)}|Y=c_k)$$
Naive Bayes is "naive" because it places a conditional independence assumption on this distribution: given the class, the features used for classification are assumed to be mutually independent. Concretely:
$$P(X=x|Y=c_k) = P(X^{(1)}=x^{(1)}, \dots, X^{(n)}=x^{(n)}|Y=c_k) = \prod_{i=1}^{n} P(X^{(i)}=x^{(i)}|Y=c_k)$$
This assumption greatly simplifies the computation of the conditional distribution, but it sacrifices some classification accuracy; when the features are strongly dependent, other classifiers should be considered first.
Given a feature vector X = x, we obtain the posterior probability distribution:
$$P(Y=c_k|X=x)$$
Rearranging with Bayes' theorem gives:
$$P(Y=c_k|X=x) = \frac{\prod_{i=1}^{n} P(X^{(i)}=x^{(i)}|Y=c_k)\, P(Y=c_k)}{\sum_{k=1}^{K} P(Y=c_k) \prod_{i=1}^{n} P(X^{(i)}=x^{(i)}|Y=c_k)}$$
The class with the largest posterior probability is taken as the output for X = x.
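Since the denominator is identical for every class, maximizing the posterior is equivalent to maximizing the numerator, which yields the Naive Bayes classifier:

$$y = \arg\max_{c_k} P(Y=c_k) \prod_{i=1}^{n} P(X^{(i)}=x^{(i)}|Y=c_k)$$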
The procedure above is the maximum likelihood estimate for Naive Bayes. In practice, to prevent any estimated probability from being 0, we add a Laplace smoothing parameter λ (λ ≥ 0) when computing the prior and conditional probabilities; this is known as Bayesian estimation.
Bayesian estimate of the prior probability:
$$P_\lambda(Y=c_k) = \frac{\sum_{i=1}^{N} I(y_i=c_k) + \lambda}{N + K\lambda}$$
Bayesian estimate of the conditional probability, where $a_{jl}$ is the $l$-th distinct value of the $j$-th feature and $S_j$ is the number of distinct values that feature can take:
$$P_\lambda(X^{(j)}=a_{jl}|Y=c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl},\, y_i=c_k) + \lambda}{\sum_{i=1}^{N} I(y_i=c_k) + S_j\lambda}$$
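As a quick sanity check of the smoothing (toy counts, not from the actual data): with $N=10$ samples, $K=2$ classes, 3 samples in class $c_1$, and $\lambda=1$,

$$P_\lambda(Y=c_1) = \frac{3+1}{10+2 \times 1} = \frac{1}{3}$$

and a class with no samples at all would still receive probability $1/12$ rather than 0.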
For more explanation and proofs, see Li Hang's 《统计学习方法》 (Statistical Learning Methods).
The data for this example comes from MNIST, a database of handwritten digits. The implementation below builds a multinomial Naive Bayes model from scratch on Spark.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

import scala.beans.BeanProperty
import scala.collection.mutable
/**
  * Created by WZZC on 2019/12/10
  **/
case class MultinomilNaiveBayesModel(data: DataFrame, labelColName: String) {

  private val spark: SparkSession = data.sparkSession

  // feature columns: every column except the label column
  @BeanProperty var fts: Array[String] =
    data.columns.filterNot(_ == labelColName)

  private val ftsName: String = Identifiable.randomUID("NaiveBayesModel")

  // Laplace smoothing parameter
  @BeanProperty var lamada: Double = 1.0

  /**
    * Assemble the feature columns into a single vector column
    *
    * @param dataFrame input data
    * @return input data with an extra assembled vector column
    */
  def dataTransForm(dataFrame: DataFrame): DataFrame = {
    new VectorAssembler()
      .setInputCols(fts)
      .setOutputCol(ftsName)
      .transform(dataFrame)
  }

  // per class: (sample count, one value->probability map per feature)
  var contingentProb: Array[(Double, (Int, Seq[mutable.Map[Double, Double]]))] = _
  // per class: prior probability
  var priorProb: Map[Double, Double] = _
  def fit(): Unit = {
    // fold one sample into the per-class accumulator:
    // count the sample and increment the frequency of each of its feature values
    def seqOp(c: (Int, Seq[mutable.Map[Double, Int]]),
              v: (Int, Seq[Double])) = {
      val w: Int = c._1 + v._1
      val tuples = c._2
        .zip(v._2)
        .map(tp => {
          val ftsValueNum: Int = tp._1.getOrElse(tp._2, 0)
          tp._1 += tp._2 -> (ftsValueNum + 1)
          tp._1
        })
      (w, tuples)
    }

    // merge two partition-level accumulators by summing counts and value frequencies
    def combOp(c1: (Int, Seq[mutable.Map[Double, Int]]),
               c2: (Int, Seq[mutable.Map[Double, Int]])) = {
      val w = c2._1 + c1._1
      val resMap = c1._2
        .zip(c2._2)
        .map(tp => {
          val m1: mutable.Map[Double, Int] = tp._1
          val m2: mutable.Map[Double, Int] = tp._2
          m1.foreach(kv => {
            val i = m2.getOrElse(kv._1, 0)
            m2 += kv._1 -> (kv._2 + i)
          })
          m2
        })
      (w, resMap)
    }

    // zero value: one empty value-frequency map per feature
    val nilSeq: Seq[mutable.Map[Double, Int]] =
      Seq.fill(fts.length)(mutable.Map.empty[Double, Int])

    val agged = dataTransForm(data).rdd
      .map(row => {
        val label: Double = row.getAs[Double](labelColName)
        val fts: Seq[Double] = row.getAs[Vector](ftsName).toArray.toSeq
        (label, (1, fts))
      })
      .aggregateByKey[(Int, Seq[mutable.Map[Double, Int]])]((0, nilSeq))(
        seqOp,
        combOp
      )

    val numLabels: Long = agged.count()           // number of distinct classes K
    val numDocuments: Double = agged.map(_._2._1).sum // number of samples N

    // Laplace-smoothed denominator for the priors: N + K * lambda
    val lpsamples: Double = numDocuments + lamada * numLabels

    // conditional probabilities P(X^(i) = value | Y = label)
    contingentProb = agged
      .mapValues(tp => {
        val freq = tp._1
        val lprob = tp._2.map(
          m =>
            m.map {
              case (k, v) =>
                val prob = (v / freq.toDouble).formatted("%.4f")
                // val prob = math.log(v / freq.toDouble).formatted("%.4f")
                (k, prob.toDouble)
            }
        )
        (freq, lprob)
      })
      .collect()

    // Laplace-smoothed prior probabilities P(Y = label)
    priorProb = agged
      .map(tp => (tp._1, tp._2._1))
      .mapValues(v => (v + lamada) / lpsamples)
      // .mapValues(v => math.log((v + lamada) / lpsamples))
      .collect()
      .toMap
  }
  def predict(predictData: Seq[Double]): Double = {
    val posteriorProb: Array[(Double, Double)] = contingentProb
      .map(tp3 => {
        val label: Double = tp3._1
        val tp: (Int, Seq[mutable.Map[Double, Double]]) = tp3._2
        // fallback probability for feature values never seen with this class
        val missProb: Double = lamada / (tp._2.length * lamada)
        // sum of the log conditional probabilities over all features
        val sum: Double = tp._2
          .zip(predictData)
          .map {
            case (pmap, ftValue) =>
              val d: Double = pmap.getOrElse(ftValue, missProb)
              math.log(d)
          }
          .sum
        (label, sum)
      })
      // unnormalized log posterior: log(prior) plus the summed log conditionals
      .map(tp => (tp._1, math.log(priorProb.getOrElse(tp._1, 0.0)) + tp._2))
    posteriorProb.maxBy(_._2)._1
  }
  def predict(predictData: DataFrame): DataFrame = {
    // debug output: inspect the learned conditional and prior probabilities
    contingentProb.foreach(println)
    priorProb.foreach(println)

    val predictudf = udf((vec: Vector) => predict(vec.toArray.toSeq))
    dataTransForm(predictData)
      .withColumn(labelColName, predictudf(col(ftsName)))
      .drop(ftsName)
  }
}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DoubleType

/**
  * Created by WZZC on 2019/4/27
  **/
object NaiveBayesRunner {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder()
      .appName(s"${this.getClass.getSimpleName}")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // load the data
    val data = spark.read
      .option("header", true)
      .option("inferSchema", true)
      .csv("data/naviebayes.csv")

    // index the categorical column x2 into numeric values
    val model = new StringIndexer()
      .setInputCol("x2")
      .setOutputCol("indexX2")
      .fit(data)

    val dataFrame = model
      .transform(data)
      .withColumn("x1", $"x1".cast(DoubleType))
      .withColumn("y", $"y".cast(DoubleType))

    // train the hand-rolled model and predict on the training data
    val bayes = MultinomilNaiveBayesModel(dataFrame, "y")
    bayes.setFts(Array("x1", "indexX2"))
    bayes.fts.foreach(println)
    bayes.fit()
    bayes.predict(dataFrame).show()

    spark.stop()
  }
}
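For comparison, Spark MLlib ships a built-in NaiveBayes estimator. Below is a minimal sketch, assuming the same dataFrame built in the runner above and nonnegative feature values, which multinomial Naive Bayes requires:

import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.VectorAssembler

// assemble the same feature columns into a single vector column
val assembled = new VectorAssembler()
  .setInputCols(Array("x1", "indexX2"))
  .setOutputCol("features")
  .transform(dataFrame)

// built-in multinomial Naive Bayes with the same Laplace smoothing (lambda = 1.0)
val nbModel = new NaiveBayes()
  .setModelType("multinomial")
  .setSmoothing(1.0)
  .setFeaturesCol("features")
  .setLabelCol("y")
  .fit(assembled)

nbModel.transform(assembled).select("y", "prediction").show()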
References:
《统计学习方法》 (Statistical Learning Methods), 李航 (Li Hang)
https://www.cnblogs.com/pinard/p/6069267.html