CH4: Naive Bayes and Its Spark Implementation

Bayes' theorem was covered in an earlier article, and there is plenty of material about it online, so I won't repeat it here.

The Naive Bayes Model

Suppose we have data samples of the form:

$(X_1, X_2, \dots, X_n, Y)$

There are $m$ samples, each with $n$ features, and the output takes one of $K$ classes.
From these samples we can learn the prior probability

$P(Y=c_k), \quad k=1,2,\dots,K$
and then the conditional probability

$P(X=x \mid Y=c_k) = P(X^{(1)}=x^{(1)}, X^{(2)}=x^{(2)}, \dots, X^{(n)}=x^{(n)} \mid Y=c_k)$
Naive Bayes is called "naive" because it makes a conditional independence assumption on this conditional distribution (i.e., given the class, the features used for classification are mutually independent), namely:

$P(X=x \mid Y=c_k) = \prod_{i=1}^{n} P(X^{(i)}=x^{(i)} \mid Y=c_k)$
This assumption greatly simplifies the computation of the conditional distribution, but it costs some classification accuracy, so when the features are strongly dependent it is worth considering other classifiers first.
Given a feature vector $X=x$, we want the posterior probability distribution learned from the data:

$P(Y=c_k \mid X=x)$
By Bayes' theorem this becomes:

$P(Y=c_k \mid X=x) = \dfrac{P(Y=c_k)\prod_{i=1}^{n} P(X^{(i)}=x^{(i)} \mid Y=c_k)}{\sum_{k=1}^{K} P(Y=c_k) \prod_{i=1}^{n} P(X^{(i)}=x^{(i)} \mid Y=c_k)}$
The class with the largest posterior probability is taken as the output for $X=x$; since the denominator is the same for every class, only the numerator needs to be compared.
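To make the decision rule concrete, here is a tiny worked example in Scala with made-up numbers (two classes, two binary features; none of these values come from the dataset used later):

// Toy illustration of the decision rule (hypothetical numbers).
val prior = Map(0 -> 0.6, 1 -> 0.4)                        // P(Y=k)
val condProb = Map(0 -> Seq(0.2, 0.7), 1 -> Seq(0.8, 0.3)) // P(X^(i)=1 | Y=k)
// For the observed sample x = (1, 1), score each class by
// log P(Y=k) + sum_i log P(X^(i)=1 | Y=k); the shared denominator is dropped.
val scores = prior.map { case (k, p) =>
  k -> (math.log(p) + condProb(k).map(math.log).sum)
}
println(scores.maxBy(_._2)._1) // prints 1: class 1 has the larger score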

The above is the maximum likelihood estimate for Naive Bayes. In practice, to keep estimated probabilities from becoming zero, a Laplace smoothing parameter $\lambda$ ($\lambda \ge 0$) is added when computing the prior and conditional probabilities; this is known as the Bayesian estimate.
Bayesian estimate of the prior probability:

$P_\lambda(Y=c_k) = \dfrac{\sum_{i=1}^{N} I(y_i=c_k)+\lambda}{N+K\lambda}$
Bayesian estimate of the conditional probability (where $S_j$ is the number of distinct values feature $j$ can take):

$P_\lambda(X^{(j)}=a_{jl} \mid Y=c_k) = \dfrac{\sum_{i=1}^{N} I(x_i^{(j)}=a_{jl},\, y_i=c_k)+\lambda}{\sum_{i=1}^{N} I(y_i=c_k)+S_j\lambda}$
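A minimal Scala sketch of these two smoothed estimates, with hypothetical counts (all names and numbers below are made up for illustration):

// Laplace-smoothed (Bayesian) estimates with hypothetical counts.
val lambda = 1.0
val bigN = 15.0      // total number of samples N
val bigK = 2.0       // number of classes K
val countCk = 9.0    // number of samples with y_i = c_k
val countAjlCk = 3.0 // samples with x_i^(j) = a_jl and y_i = c_k
val sJ = 3.0         // S_j: distinct values feature j can take

val prior = (countCk + lambda) / (bigN + bigK * lambda)    // P_λ(Y=c_k)
val cond = (countAjlCk + lambda) / (countCk + sJ * lambda) // P_λ(X^(j)=a_jl | Y=c_k)
// With λ = 0 these reduce to the maximum likelihood estimates.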

For further explanation and proofs, see Li Hang's Statistical Learning Methods (《统计学习方法》).

Data

The data for this case study comes from the MNIST database of handwritten digits.
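The test code in section 2 reads it from data/naviebayes.csv, a CSV with a header row and columns x1, x2 and y, where x2 is a categorical string column. A few hypothetical rows, just to show the expected shape:

x1,x2,y
1,S,0
1,M,0
2,M,1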

Implementation Code

1. The Naive Bayes model
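The model below works in three steps: it assembles the configured feature columns into a single vector, uses aggregateByKey to count, per class, the number of samples and the frequency of every value of every feature, and finally turns those counts into the prior and conditional probabilities used by predict.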


import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import scala.beans.BeanProperty
import scala.collection.mutable

/**
  * Created by WZZC on 2019/12/10
  **/
case class MultinomilNaiveBayesModel(data: DataFrame, labelColName: String) {

  private val spark: SparkSession = data.sparkSession
  @BeanProperty var fts: Array[String] =
    data.columns.filterNot(_ == labelColName)
  private val ftsName: String = Identifiable.randomUID("NaiveBayesModel")

  // Laplace smoothing parameter λ
  @BeanProperty var lamada: Double = 1.0

  /**
    * Assemble the configured feature columns into a single vector column.
    *
    * @param dataFrame input data
    * @return the input DataFrame with an added feature-vector column
    */
  def dataTransForm(dataFrame: DataFrame): DataFrame = {
    new VectorAssembler()
      .setInputCols(fts)
      .setOutputCol(ftsName)
      .transform(dataFrame)
  }

  // Per class: (sample count, and for each feature a map of
  // feature value -> conditional probability), filled in by fit
  var contingentProb: Array[(Double, (Int, Seq[mutable.Map[Double, Double]]))] =
    _

  // Class label -> (smoothed) prior probability, filled in by fit
  var priorProb: Map[Double, Double] = _

  def fit: Unit = {

    // seqOp: fold one sample (count 1, its feature values) into the per-class
    // accumulator of (sample count, per-feature value-frequency maps)
    def seqOp(c: (Int, Seq[mutable.Map[Double, Int]]),
              v: (Int, Seq[Double])) = {

      val w: Int = c._1 + v._1

      val tuples = c._2
        .zip(v._2)
        .map(tp => {
          // increment the occurrence count of this feature value
          val ftsvalueNum: Int = tp._1.getOrElse(tp._2, 0)
          tp._1 += tp._2 -> (ftsvalueNum + 1)
          tp._1
        })

      (w, tuples)

    }

    // combOp: merge two partial accumulators by summing the sample counts
    // and the per-feature value frequencies
    def combOp(c1: (Int, Seq[mutable.Map[Double, Int]]),
               c2: (Int, Seq[mutable.Map[Double, Int]])) = {

      val w = c2._1 + c1._1

      val resMap = c1._2
        .zip(c2._2)
        .map(tp => {
          val m1: mutable.Map[Double, Int] = tp._1
          val m2: mutable.Map[Double, Int] = tp._2

          m1.foreach(kv => {
            val i = m2.getOrElse(kv._1, 0)
            m2 += kv._1 -> (kv._2 + i)
          })

          m2
        })

      (w, resMap)
    }

    // one empty value-frequency map per feature (the zero value for aggregateByKey)
    val nilSeq: Seq[mutable.Map[Double, Int]] =
      Seq.fill(fts.length)(mutable.Map[Double, Int]())

    val agged = dataTransForm(data).rdd
      .map(row => {
        val label: Double = row.getAs[Double](labelColName)
        val ftValues: Seq[Double] = row.getAs[Vector](ftsName).toArray.toSeq

        (label, (1, ftValues))
      })
      .aggregateByKey[(Int, Seq[mutable.Map[Double, Int]])]((0, nilSeq))(
        seqOp,
        combOp
      )

    val numLabels: Long = agged.count()               // number of classes K
    val numDocuments: Double = agged.map(_._2._1).sum // total number of samples N

    // Laplace smoothing: denominator of the smoothed prior, N + K * λ
    val lpsamples: Double = numDocuments + lamada * numLabels

    // Conditional probabilities: per class, P(feature value | class) as the
    // relative frequency of each feature value within the class
    contingentProb = agged
      .mapValues(tp => {
        val freq = tp._1

        val prob = tp._2.map(
          m =>
            m.map {
              case (k, v) => (k, (v / freq.toDouble).formatted("%.4f").toDouble)
          }
        )
        (freq, prob)
      })
      .collect()

    // Prior probabilities with Laplace smoothing: (count + λ) / (N + K * λ)
    priorProb = agged
      .map(tp => (tp._1, tp._2._1))
      .mapValues(v => (v + lamada) / lpsamples)
      .collect()
      .toMap

  }

  /**
    * Predict the class of a single sample.
    *
    * @param predictData feature values, in the same order as fts
    * @return the predicted class label
    */
  def predict(predictData: Seq[Double]): Double = {

    val posteriorProb: Array[(Double, Double)] = contingentProb
      .map(tp3 => {
        val label: Double = tp3._1

        val tp: (Int, Seq[mutable.Map[Double, Double]]) = tp3._2

        // fallback probability for a feature value never seen with this class
        // (a simple heuristic; equals 1 / number of features for lamada > 0)
        val missProb: Double = lamada / (tp._2.length * lamada)

        // sum of log conditional probabilities over all features
        val sum: Double = tp._2
          .zip(predictData)
          .map {
            case (pmap, ftValue) => math.log(pmap.getOrElse(ftValue, missProb))
          }
          .sum

        (label, sum)
      })
      // posterior score: log P(Y=c_k) + Σ_i log P(X^(i)=x^(i) | Y=c_k)
      .map(tp => (tp._1, math.log(priorProb.getOrElse(tp._1, Double.MinPositiveValue)) + tp._2))

    // the label with the largest posterior score wins
    posteriorProb.maxBy(_._2)._1
  }

  /**
    * Predict classes for every row of a DataFrame.
    */
  def predict(predictData: DataFrame): DataFrame = {

    val predictudf = udf((vec: Vector) => predict(vec.toArray.toSeq))

    dataTransForm(predictData)
      .withColumn(labelColName, predictudf(col(ftsName)))
      .drop(ftsName)
  }

}
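Once fit has run, the Seq overload of predict can also score a single sample directly. A usage sketch (the feature values here are made up and must follow the order of fts):

// assuming a fitted model `bayes` with two features, e.g. x1 and indexX2
val predicted: Double = bayes.predict(Seq(1.0, 0.0))
println(s"predicted class: $predicted")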

2. Testing the algorithm

import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DoubleType

/**
  * Created by WZZC on 2019/4/27
  **/
object NaiveBayesRunner {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName(s"${this.getClass.getSimpleName}")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._
    // load the data (CSV with a header row; schema inferred)
    val data = spark.read
      .option("header", true)
      .option("inferSchema", true)
      .csv("data/naviebayes.csv")

    // encode the categorical column x2 as numeric indices
    val model = new StringIndexer()
      .setInputCol("x2")
      .setOutputCol("indexX2")
      .fit(data)

    // cast the feature and label columns to Double, as the model expects
    val dataFrame = model
      .transform(data)
      .withColumn("x1", $"x1".cast(DoubleType))
      .withColumn("y", $"y".cast(DoubleType))

    val bayes = MultinomilNaiveBayesModel(dataFrame, "y")

    bayes.setFts(Array("x1", "indexX2"))

    bayes.fts.foreach(println)
    bayes.fit

    bayes.predict(dataFrame).show()

    spark.stop()
  }

}
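As a cross-check, the result can be compared with Spark ML's built-in multinomial Naive Bayes. A minimal sketch, assuming the same dataFrame with feature columns x1 and indexX2 and label column y as above:

import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.VectorAssembler

// assemble the same features into the vector column Spark ML expects
val assembled = new VectorAssembler()
  .setInputCols(Array("x1", "indexX2"))
  .setOutputCol("features")
  .transform(dataFrame)

val nb = new NaiveBayes()
  .setModelType("multinomial")
  .setSmoothing(1.0) // Laplace smoothing, analogous to lamada above
  .setLabelCol("y")
  .setFeaturesCol("features")

nb.fit(assembled).transform(assembled).show()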

References:

Li Hang, Statistical Learning Methods (《统计学习方法》)

https://www.cnblogs.com/pinard/p/6069267.html
