Spark Naive Bayes Prediction Code with Annotations


I. Introduction

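Bayesian classifiers are commonly used for classification: they classify a given tuple by predicting the probability that it belongs to a particular class. The naive Bayes classifier assumes that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class-conditional independence.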
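
Written out, the independence assumption lets the class-conditional probability factor into a product of per-attribute terms. The notation here (a tuple X = (x_1, ..., x_n) and classes C_i) is an assumption of this sketch and does not come from the original post.

```latex
P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)},
\qquad
P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)
```

Because P(X) is the same for every class, the classifier only needs to pick the class C_i that maximizes P(C_i) multiplied by the product of the P(x_k | C_i) terms.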


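II. Example

1. Data

PS: The following is an excerpt. The file is named SMSSpamCollection.txt; download source: the "machine learning data files" package.

After downloading, I preprocessed the data by appending // (a double slash) after each ham and spam label, which makes the lines easier to split later. A sample is shown below.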

......
ham // Dear how is chechi. Did you talk to her
spam // 100 dating service cal;l 09064012103 box334sk38ch
......
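
To make the role of the // separator concrete, here is a minimal sketch of splitting one line into a label and a message. The object and function names (ParseLineSketch, parseLine) are hypothetical and not part of the original program; the label encoding (ham = 1, spam = 0) follows the code in the next section. Unlike a plain split, it returns None for a line without the separator instead of failing.

```scala
// Hypothetical helper for illustration; not part of the original program.
object ParseLineSketch {
  // Returns Some((label, text)) for "ham // text" or "spam // text", otherwise None.
  // The limit of 2 keeps any later " // " inside the message text.
  def parseLine(line: String): Option[(Int, String)] =
    line.split(" // ", 2) match {
      case Array("ham", text)  => Some((1, text))
      case Array("spam", text) => Some((0, text))
      case _                   => None
    }

  def main(args: Array[String]): Unit = {
    println(parseLine("ham // Dear how is chechi. Did you talk to her"))
    println(parseLine("a line without the separator"))
  }
}
```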

2. Code

package com.naiveBayes

import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, IDF, LabeledPoint, Tokenizer}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{Row, SparkSession}

object FraudDemo {
    def main(args: Array[String]): Unit = {
        val session = SparkSession.builder().master("local").appName(this.getClass.getSimpleName).getOrCreate()
        import session.implicits._
        // Load the file
        val file = session.read.textFile("src/main/resources/bayes/SMSSpamCollection.txt")

        val parseData = file.map { line =>
            val strs = line.split(" // ")
            val label = if (strs(0) == "ham") 1 else 0
            val features = strs(1)
            (label, features)
        }.toDF("label", "context")

        // Tokenizer: splits each sentence into words
        val tokenizer = new Tokenizer().setInputCol("context").setOutputCol("words")
        val trainWords = tokenizer.transform(parseData)
        /*
            Extract features with TF-IDF
         */
        // setNumFeatures sets the number of hash buckets. A larger value means fewer hash collisions
        // (better accuracy) but higher cost; ideally set it to, or close to, the number of distinct words
        val trainData = new HashingTF().setInputCol("words").setOutputCol("raw_features").setNumFeatures(15000).transform(trainWords)
        val trainForm = new IDF().setInputCol("raw_features").setOutputCol("features").fit(trainData).transform(trainData)
        // Print the extracted features
        trainForm.select("label", "context", "features").show(5)

        // Build the training set as LabeledPoint rows (label cast to Double, features copied into a dense vector)
        val trainDF = trainForm.select($"label", $"features").map {
            case Row(label: Int, features: Vector) =>
                LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
        }

        // Build the naive Bayes model
        val bayes = new NaiveBayes().setSmoothing(1.0).setFeaturesCol("features").setLabelCol("label").setPredictionCol("prediction")
        val model = bayes.fit(trainDF)
        val prediction = model.transform(trainDF)

        //         Model output
        //        prediction.show(5)

        //         Rows where the prediction does not match the label
        //        prediction.filter($"label" =!= $"prediction").show()

        // Evaluate the model (note: this is measured on the training data itself, not on held-out data)
        val evaluator = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("prediction").setMetricName("accuracy")
        val accuracy = evaluator.evaluate(prediction)
        println("Accuracy on the training data: " + accuracy + "\r\n")

        /*
            Predict on new data
        */
        // Load the data
        val testFile = session.read.textFile("src/main/resources/bayes/test.txt").toDF("context")
        // Tokenize
        val wordsData = new Tokenizer().setInputCol("context").setOutputCol("words").transform(testFile)
        // Extract features with TF-IDF
        // setNumFeatures must match the value used for the training data above
        val preData = new HashingTF().setInputCol("words").setOutputCol("raw_features").setNumFeatures(15000).transform(wordsData)
        // Fit the IDF weights on the training term frequencies (trainData), not on the new data,
        // so the new features are weighted the same way as the features the model was trained on
        val preModel = new IDF().setInputCol("raw_features").setOutputCol("features").fit(trainData).transform(preData)
        // Print the predictions
        model.transform(preModel).select("context", "prediction").show()
    }
}
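
To illustrate what the number of hash buckets controls, here is a small plain-Scala sketch of the hashing trick followed by IDF weighting. It is only an approximation of what Spark does: as an assumption for brevity it hashes with String.hashCode (Spark's HashingTF uses MurmurHash3), and the names HashingTfIdfSketch, bucket, termFrequencies, and idf are made up for this example. The IDF formula log((m + 1) / (df + 1)) is the smoothed form documented for Spark's IDF.

```scala
// Illustrative sketch only; Spark's HashingTF uses MurmurHash3, not String.hashCode.
object HashingTfIdfSketch {
  // Map a term to a bucket index in [0, numFeatures).
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Term frequency per bucket for one document (lowercased, split on whitespace).
  def termFrequencies(doc: String, numFeatures: Int): Map[Int, Double] =
    doc.toLowerCase.split("\\s+").filter(_.nonEmpty)
      .groupBy(bucket(_, numFeatures))
      .map { case (idx, terms) => idx -> terms.length.toDouble }

  // Smoothed IDF per bucket: log((m + 1) / (df + 1)), where m is the number of
  // documents and df is the number of documents that hit the bucket.
  def idf(tfs: Seq[Map[Int, Double]]): Map[Int, Double] = {
    val m = tfs.size
    tfs.flatMap(_.keys).groupBy(identity).map { case (idx, occurrences) =>
      idx -> math.log((m + 1.0) / (occurrences.size + 1.0))
    }
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq("free prize call now", "call me now", "see you at lunch")
    // With only 16 buckets, unrelated words may collide; a larger value makes that less likely.
    val tfs = docs.map(termFrequencies(_, 16))
    val weights = idf(tfs)
    val tfidf = tfs.map(_.map { case (idx, tf) => idx -> tf * weights(idx) })
    tfidf.foreach(println)
  }
}
```

Words that land in the same bucket become indistinguishable to the model, which is why a bucket count close to the vocabulary size tends to work better than a very small one.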

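PS: The test.txt used for prediction can contain any text you like. It only needs the feature column (the message text, without a label), as shown below.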

......
I dont know exactly could you ask chechi.
Dunno lei shd b driving lor cos i go sch 1 hr oni.
As in i want custom officer discount oh.
That's necessarily respectful
Hi. Hope you had a good day. Have a better night.
And he's apparently bffs with carly quick now
HARD BUT TRUE: How much you show &  express your love to someone....that much it will hurt when they leave you or you get seperated...! ud evening...
Babes I think I got ur brolly I left it in English wil bring it in 2mrw 4 u luv Franxx
Hi babe its me thanks for coming even though it didnt go that well!i just wanted my bed! Hope to see you soon love and kisses xxx
So gd got free ice cream... I oso wan...
Pls give her prometazine syrup. 5mls then  <#> mins later feed.
So how many days since then?
......
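
For scoring unlabeled lines like these, one option is Spark ML's Pipeline, which fits the tokenizer, HashingTF, IDF, and NaiveBayes stages together and then applies the same fitted stages to new text. The following is a minimal sketch under stated assumptions: the object name PipelineSketch is hypothetical, and the two in-memory training rows are toy data taken from the sample above, far too few for a useful model.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

// Hypothetical object name; the rows below are toy data for illustration only.
object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("PipelineSketch").getOrCreate()
    import spark.implicits._

    // label: 1.0 = ham, 0.0 = spam
    val train = Seq(
      (1.0, "Dear how is chechi. Did you talk to her"),
      (0.0, "100 dating service cal;l 09064012103 box334sk38ch")
    ).toDF("label", "context")

    val tokenizer = new Tokenizer().setInputCol("context").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("raw_features").setNumFeatures(15000)
    val idf = new IDF().setInputCol("raw_features").setOutputCol("features")
    // NaiveBayes reads the "features" and "label" columns by default
    val bayes = new NaiveBayes().setSmoothing(1.0)

    // Fitting the pipeline fits IDF and NaiveBayes on the training data only
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, bayes))
    val model = pipeline.fit(train)

    // The fitted pipeline reuses the training IDF weights when transforming new text
    val test = Seq("So how many days since then?").toDF("context")
    model.transform(test).select("context", "prediction").show(false)

    spark.stop()
  }
}
```

Bundling the stages this way means the prediction step cannot accidentally use a different bucket count or different IDF weights than training did.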
