Spark: Training Classification Models, Practice (1)

These are my study notes for Chapter 5 of Machine Learning with Spark.
The data used can be downloaded here: experimental dataset train.tsv

The columns are:
"url" "urlid" "boilerplate" "alchemy_category" "alchemy_category_score" "avglinksize" "commonlinkratio_1" "commonlinkratio_2" "commonlinkratio_3" "commonlinkratio_4" "compression_ratio" "embed_ratio" "framebased" "frameTagRatio" "hasDomainLink" "html_ratio" "image_ratio" "is_news" "lengthyLinkDomain" "linkwordscore" "news_front_page" "non_markup_alphanum_characters" "numberOfLinks" "numwords_in_url" "parametrizedLinkRatio" "spelling_errors_ratio" "label"

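The first four columns are: the URL, the page ID, the page content, and the category assigned to the page.
The next 22 columns are: assorted numeric or categorical features.
The last column is the target: 1 means evergreen, 0 means non-evergreen.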

On the Linux command line, strip the header row with sed and redirect the result to a new file:

$ sed 1d train.tsv > train_noheader.tsv

Start spark-shell:

val rawData = sc.textFile("file:///home/hadoop/train_noheader.tsv")
val records = rawData.map(line => line.split("\t"))
records.first()

The output is:
Array("http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html", "4042", "{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose California Photographer Tony Avelar Bloomberg Buildings stand at the International Business Machines Corp IBM Almaden Research Center campus in the Santa Teresa Hills of San Jose California Photographer Tony Avelar Bloomberg By 2015 your mobile phone will project a 3 D image of anyone who calls and your laptop will be powered by kinetic energy At least that s what International Business Machines Corp sees …

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Strip the extra " characters, replace missing values ("?") with 0.0,
// and build the LabeledPoint training data
val data = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val features = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
  LabeledPoint(label, Vectors.dense(features))
}
// Set negative feature values to 0, since the naive Bayes model requires non-negative features
val nbData = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val features = trimmed.slice(4, r.size - 1)
    .map(d => if (d == "?") 0.0 else d.toDouble)
    .map(d => if (d < 0) 0.0 else d)
  LabeledPoint(label, Vectors.dense(features))
}
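
To make the parsing step concrete, here is a minimal plain-Scala sketch (no Spark needed) of what the two closures above do to a single record. The object name `ParseSketch`, its helpers, and the shortened sample line are made up for illustration; real rows in train.tsv have 27 fields.

```scala
object ParseSketch {
  // Returns (label, features) for one tab-separated record.
  // Mirrors the map above: strip quotes, skip the first four text columns,
  // read "?" as 0.0, and take the last column as the label.
  def parse(line: String): (Double, Array[Double]) = {
    val trimmed = line.split("\t").map(_.replaceAll("\"", ""))
    val label = trimmed(trimmed.length - 1).toInt.toDouble
    val features = trimmed
      .slice(4, trimmed.length - 1)
      .map(d => if (d == "?") 0.0 else d.toDouble)
    (label, features)
  }

  // Extra step used for the naive Bayes data: negative values become 0.0.
  def clampNegatives(features: Array[Double]): Array[Double] =
    features.map(d => if (d < 0) 0.0 else d)

  // Hypothetical shortened record: 4 text columns, 3 features, 1 label.
  val sample: String =
    "\"http://example.com\"\t\"1\"\t\"{}\"\t\"business\"\t\"0.5\"\t\"?\"\t\"-1.5\"\t\"1\""
}
```

Calling `ParseSketch.parse(ParseSketch.sample)` yields the label and the three numeric features, with the missing value read as 0.0; `clampNegatives` then zeroes the negative entry.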

// Train the classification models:
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD // logistic regression
import org.apache.spark.mllib.classification.SVMWithSGD // SVM
import org.apache.spark.mllib.classification.NaiveBayes // naive Bayes
import org.apache.spark.mllib.tree.DecisionTree         // decision tree
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.Entropy     // entropy impurity
val numIterations = 10
val maxTreeDepth = 5

// Train each model
val lrModel = LogisticRegressionWithSGD.train(data, numIterations)
val svmModel = SVMWithSGD.train(data, numIterations)
val nbModel = NaiveBayes.train(nbData)
val dtModel = DecisionTree.train(data, Algo.Classification, Entropy, maxTreeDepth)

Using a trained model to predict on a data point is just as simple. Taking logistic regression as an example:

val dataPoint = data.first
val prediction = lrModel.predict(dataPoint.features) // predict from the feature vector
                                                    // dataPoint.label
                                                    // dataPoint.features

Output: prediction: Double = 1.0
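
For intuition about what `predict` returns here: a binary logistic regression model passes the weighted sum of the features through the sigmoid function and compares the result with a threshold (0.5 by default in MLlib). The plain-Scala sketch below illustrates that idea only; the object name `LrPredictSketch` and any weights or intercept passed to it are made-up values, not those learned by `lrModel`.

```scala
object LrPredictSketch {
  // Logistic (sigmoid) function: maps any real number into (0, 1).
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Estimated probability of class 1 for feature vector x.
  def score(weights: Array[Double], intercept: Double, x: Array[Double]): Double = {
    val margin = weights.zip(x).map { case (w, v) => w * v }.sum + intercept
    sigmoid(margin)
  }

  // Class prediction: 1.0 if the score exceeds the threshold, otherwise 0.0.
  def predict(weights: Array[Double], intercept: Double, x: Array[Double],
              threshold: Double = 0.5): Double =
    if (score(weights, intercept, x) > threshold) 1.0 else 0.0
}
```

With the hypothetical weights `Array(2.0, -1.0)` and intercept 0.0, the point `Array(1.0, 1.0)` has margin 1.0 and is assigned class 1.0, while `Array(0.0, 3.0)` has margin -3.0 and is assigned class 0.0.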

2 Evaluating classification performance

2.1 Prediction accuracy and error rate

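Accuracy: the number of correctly classified training samples divided by the total number of samples (positive + negative).
Error rate: the number of misclassified training samples divided by the total number of samples (positive + negative).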
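
As a small worked example of these two definitions, the following plain-Scala sketch computes both from a list of (prediction, label) pairs. The object name `AccuracySketch` and the four sample pairs are invented for illustration.

```scala
object AccuracySketch {
  // Fraction of (prediction, label) pairs in which the two values agree.
  def accuracy(pairs: Seq[(Double, Double)]): Double =
    pairs.count { case (p, l) => p == l }.toDouble / pairs.size

  // The error rate is the complement of the accuracy.
  def errorRate(pairs: Seq[(Double, Double)]): Double =
    1.0 - accuracy(pairs)

  // Hypothetical (prediction, label) pairs: three correct, one wrong.
  val sample: Seq[(Double, Double)] =
    Seq((1.0, 1.0), (0.0, 0.0), (1.0, 0.0), (0.0, 0.0))
}
```

For the sample above, three of the four predictions match their labels, so the accuracy is 0.75 and the error rate is 0.25.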

// Accuracy of each model on the training data
val numData = data.count
// LR
val lrTotalCorrect = data.map { point =>
  if (lrModel.predict(point.features) == point.label) 1 else 0
}.sum
val lrAccuracy = lrTotalCorrect / numData
// SVM
val svmTotalCorrect = data.map { point =>
  if (svmModel.predict(point.features) == point.label) 1 else 0
}.sum
// NB
val nbTotalCorrect = nbData.map { point =>
  if (nbModel.predict(point.features) == point.label) 1 else 0
}.sum
// DT: a threshold must be applied to the prediction
val dtTotalCorrect = data.map { point =>
  val score = dtModel.predict(point.features)
  val predicted = if (score > 0.5) 1 else 0
  if (predicted == point.label) 1 else 0
}.sum
// Compute the accuracies:
val svmAccuracy = svmTotalCorrect / numData
val nbAccuracy = nbTotalCorrect / numData
val dtAccuracy = dtTotalCorrect / numData

Output: [image 1]

2.2 Precision, recall, and the PR curve

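Definitions:

In binary classification:
Precision: the number of true positives divided by the sum of true positives and false positives. (A true positive is a class-1 sample correctly predicted as class 1; a false positive is a sample wrongly predicted as class 1.)
Meaning: the fraction of the returned results that are relevant (it measures the quality of the results).
Recall: the number of true positives divided by the sum of true positives and false negatives, where a false negative is a class-1 sample that was predicted as class 0.
Meaning: a recall of 100% means every positive sample is detected (it measures the completeness of the results).
The PR curve plots precision on the vertical axis against recall on the horizontal axis.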
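
The definitions above can be checked on a handful of hand-made predictions. This is a plain-Scala sketch with invented names (`PrecisionRecallSketch`, `Counts`) and invented sample data; it does not use MLlib.

```scala
object PrecisionRecallSketch {
  // Confusion-matrix counts for a binary classifier.
  final case class Counts(tp: Int, fp: Int, fn: Int, tn: Int)

  // Tally counts from (prediction, label) pairs; 1.0 is the positive class.
  def counts(pairs: Seq[(Double, Double)]): Counts = Counts(
    tp = pairs.count { case (p, l) => p == 1.0 && l == 1.0 },
    fp = pairs.count { case (p, l) => p == 1.0 && l == 0.0 },
    fn = pairs.count { case (p, l) => p == 0.0 && l == 1.0 },
    tn = pairs.count { case (p, l) => p == 0.0 && l == 0.0 }
  )

  // Precision = TP / (TP + FP). The result is NaN if nothing was predicted positive.
  def precision(c: Counts): Double = c.tp.toDouble / (c.tp + c.fp)

  // Recall = TP / (TP + FN). The result is NaN if there are no positive samples.
  def recall(c: Counts): Double = c.tp.toDouble / (c.tp + c.fn)

  // Hypothetical (prediction, label) pairs: TP = 2, FP = 1, FN = 2, TN = 2.
  val sample: Seq[(Double, Double)] = Seq(
    (1.0, 1.0), (1.0, 1.0), (1.0, 0.0),
    (0.0, 1.0), (0.0, 1.0), (0.0, 0.0), (0.0, 0.0)
  )
}
```

On the sample, two of the three positive predictions are correct (precision 2/3), and two of the four actual positives are found (recall 0.5).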

2.3 ROC curve and AUC

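The ROC curve is a plot of the true positive rate against the false positive rate.

True positive rate (TPR): the number of true positives divided by the sum of true positives and false negatives.
False positive rate (FPR): the number of false positives divided by the sum of false positives and true negatives.
In the ideal case the area under the ROC curve (AUC) is 1; the closer to 1, the better.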
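
To show how the curve and its area follow from these two rates, here is a plain-Scala sketch that sweeps a threshold over scored samples and integrates with the trapezoid rule. It is an illustration under simple assumptions (at least one positive and one negative sample), not MLlib's implementation; the object name `RocSketch` and the sample scores are invented.

```scala
object RocSketch {
  // Input: (score, label) pairs, where label is 1.0 (positive) or 0.0 (negative).
  // Output: (FPR, TPR) points from (0, 0) to (1, 1), one per distinct threshold,
  // where a sample is predicted positive when its score >= threshold.
  def rocPoints(scored: Seq[(Double, Double)]): Seq[(Double, Double)] = {
    val positives = scored.count(_._2 == 1.0).toDouble
    val negatives = scored.size - positives
    val thresholds = scored.map(_._1).distinct.sorted.reverse
    val points = thresholds.map { t =>
      val predictedPositive = scored.filter(_._1 >= t)
      val tp = predictedPositive.count(_._2 == 1.0)
      val fp = predictedPositive.size - tp
      (fp / negatives, tp / positives)
    }
    (0.0, 0.0) +: points
  }

  // Area under a piecewise-linear curve, computed with the trapezoid rule.
  def auc(points: Seq[(Double, Double)]): Double =
    points.sliding(2).collect { case Seq((x1, y1), (x2, y2)) =>
      (x2 - x1) * (y1 + y2) / 2.0
    }.sum

  // Hypothetical scores: one negative (0.8) is ranked above one positive (0.7).
  val sample: Seq[(Double, Double)] =
    Seq((0.9, 1.0), (0.8, 0.0), (0.7, 1.0), (0.2, 0.0))
}
```

For the sample, three of the four positive-negative pairs are ranked correctly, and the computed AUC is 0.75; a perfect ranking gives 1.0. When the scores are already thresholded to 0 or 1, the sweep has only two distinct thresholds, so the curve has a single interior point.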

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// LR and SVM
val metrics = Seq(lrModel, svmModel).map { model =>
  val scoreAndLabels = data.map { point =>
    (model.predict(point.features), point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR, metrics.areaUnderROC)
}

// NB (uses nbData, whose negative feature values were set to 0)
val nbMetrics = Seq(nbModel).map { model =>
  val scoreAndLabels = nbData.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR, metrics.areaUnderROC)
}

// DT (decision tree)
val dtMetrics = Seq(dtModel).map { model =>
  val scoreAndLabels = data.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR, metrics.areaUnderROC)
}

// Print the results for all models:
val allMetrics = metrics ++ nbMetrics ++ dtMetrics
allMetrics.foreach { case (m, pr, roc) =>
  println(f"$m, Area under PR: ${pr * 100.0}%2.4f%%, Area under ROC: ${roc * 100.0}%2.4f%%")
}

Output: [image]

The results from these models are not ideal; the next part looks at parameter tuning.
