Notes on "Machine Learning with Spark": Spark Classification Models (Logistic Regression, Naive Bayes, Decision Trees, Support Vector Machines)

1. Types of classification models

1.1 Linear models

1.1.1 Logistic regression

1.1.2 Linear support vector machines

1.2 Naive Bayes models

1.3 Decision tree models

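2. Extracting suitable features from the data

MLlib's classification models operate on LabeledPoint(label: Double, features: Vector) objects, which wrap the target variable (the label) together with the feature vector.

We extract features from the Kaggle/StumbleUpon evergreen classification dataset.

The dataset concerns whether a recommended web page is ephemeral (popular only briefly before it falls out of fashion) or evergreen (popular for a long time).

The header row can be removed with: sed 1d train.tsv > train_noheader.tsv

Now let's look at the code.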

import org.apache.spark.mllib.classification.{ClassificationModel, LogisticRegressionWithSGD, NaiveBayes, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.optimization.{SimpleUpdater, SquaredL2Updater, Updater}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.{Entropy, Gini, Impurity}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Evergreen {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Evergreen").setMaster("local")
    // run in local mode
    val BASEDIR = "hdfs://pc1:9000/"
    // files on HDFS
    //val BASEDIR = "file:///home/chenjie/"
    // files on the local file system
    //val sparkConf = new SparkConf().setAppName("Evergreen-cluster").setMaster("spark://pc1:7077").setJars(List("untitled2.jar"))
    // run in cluster mode
    val sc = new SparkContext(sparkConf)
    // initialize the SparkContext
    val rawData = sc.textFile(BASEDIR + "train_noheader.tsv") // load the data
    println("rawData.first()=" + rawData.first()) // print the first record
    //"http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html"
    // "4042"
    // "{""title"":""IBM hic calies"",
    // ""body"":""A sign the tahe cwlett Packard Co t last."",
    // ""url"":""bloomberg news 2010 12 23 ibm predicts holographic calls air breathing batteries by 2015 html""}"
    // "business"  "0.789131" "2.055555556"  "0.676470588"  "0.205882353"  "0.047058824"  "0.023529412"  "0.443783175"  "0"  "0"  "0.09077381" "0"  "0.245831182"  "0.003883495"  "1"  "1"  "24" "0"  "5424" "170"  "8"  "0.152941176"  "0.079129575"  "0"
  }
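The code above loads the dataset and inspects its first row. The row contains the URL, the page ID, the raw text content, and the category assigned to the page. The next 22 columns hold various numeric or categorical features. The last column is the target: 1 for evergreen, 0 for ephemeral.
Because of the data format, the data has to be cleaned: the extra quotes are stripped, and missing values, which appear as "?" in the raw data, are replaced with 0.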
val records = rawData.map(line => line.split("\t"))
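println(records.first().mkString(" | ")) // an Array prints as an object reference, so join its fields first

val data = records.map{ r =>
  val trimmed = r.map(_.replaceAll("\"","")) // strip the quotes
  val label = trimmed(r.size - 1).toInt // the last column is the class label
  val features = trimmed.slice(4, r.size - 1).map(  d => if(d == "?") 0.0 else d.toDouble) // replace missing values ("?") with 0
  LabeledPoint(label, Vectors.dense(features))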
}
data.cache()
val numData = data.count
println("numData=" + numData)
//numData=7395

val nbData = records.map{ r =>
  val trimmed = r.map(_.replaceAll("\"",""))
  val label = trimmed(r.size - 1).toInt
  val features = trimmed.slice(4, r.size - 1).map(  d => if(d == "?") 0.0 else d.toDouble)
    .map( d => if (d < 0) 0.0 else d)
  LabeledPoint(label, Vectors.dense(features))
}
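// Before processing the dataset further, note that the numeric features contain negative values. Naive Bayes requires non-negative feature values and throws an exception when it meets a negative one,
// so a separate input dataset (nbData above) is built for the naive Bayes model, with negative feature values set to 0.
The code in the following sections is added to the main function in order.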
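
The cleaning steps above can be tried on a single line without a Spark cluster. Below is a minimal sketch in plain Scala. The names RecordCleaning, parseRecord and parseRecordNonNegative are hypothetical and exist only for this illustration; the sketch assumes the same column layout as the dataset (four non-numeric columns, then the numeric features, then the label).

```scala
object RecordCleaning {
  // Split one tab-separated line, strip the double quotes, replace missing
  // values ("?") with 0.0 and return (label, features).
  // Columns 0-3 (URL, id, text, category) are skipped, as in the code above.
  def parseRecord(line: String): (Double, Array[Double]) = {
    val trimmed = line.split("\t").map(_.replaceAll("\"", ""))
    val label = trimmed.last.toDouble
    val features = trimmed.slice(4, trimmed.length - 1)
      .map(d => if (d == "?") 0.0 else d.toDouble)
    (label, features)
  }

  // Variant for naive Bayes: additionally clamp negative feature values to 0.0.
  def parseRecordNonNegative(line: String): (Double, Array[Double]) = {
    val (label, features) = parseRecord(line)
    (label, features.map(d => if (d < 0) 0.0 else d))
  }
}
```

Calling RecordCleaning.parseRecord on a quoted, tab-separated line returns the label and a plain Array[Double], which is the same information the Spark code wraps in a LabeledPoint.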

3. Training the classification models

//------------ Training the classification models ------------
val numItetations = 10
val maxTreeDepth = 5
val lrModel = LogisticRegressionWithSGD.train(data, numItetations)
val svmModel = SVMWithSGD.train(data, numItetations)
val nbModel = NaiveBayes.train(nbData)
val dtModel = DecisionTree.train(data, Algo.Classification, Entropy, maxTreeDepth)
// For the decision tree, the mode (Algo) is set to Classification and Entropy is used as the impurity measure

4. Using the classification models

val dataPoint = data.first()
val trueLabel = dataPoint.label
println("True label: " + trueLabel)
val prediction1 = lrModel.predict(dataPoint.features)
val prediction2 = svmModel.predict(dataPoint.features)
val prediction3 = nbModel.predict(dataPoint.features)
val prediction4 = dtModel.predict(dataPoint.features)
println("lrModel prediction: " + prediction1)
println("svmModel prediction: " + prediction2)
println("nbModel prediction: " + prediction3)
println("dtModel prediction: " + prediction4)
/*
* True label: 0.0
  lrModel prediction: 1.0
  svmModel prediction: 1.0
  nbModel prediction: 1.0
  dtModel prediction: 0.0
* */


// A whole RDD[Vector] can also be passed in to predict in bulk
/* val predictions = lrModel.predict(data.map(lp => lp.features))
 predictions.take(5).foreach(println)*/

5. Evaluating the performance of classification models

5.1 Accuracy and prediction error

//-------- Evaluating classification models: accuracy and prediction error --------
val lrTotalCorrect = data.map{  point =>
  if(lrModel.predict(point.features) == point.label) 1 else 0
}.sum

val svmTotalCorrect = data.map{ point =>
  if(svmModel.predict(point.features) == point.label) 1 else 0
}.sum
val nbTotalCorrect = nbData.map{ point =>
  if(nbModel.predict(point.features) == point.label) 1 else 0
}.sum
val dtTotalCorrect = data.map{  point =>
  val score = dtModel.predict(point.features)
  val predicted = if(score > 0.5) 1 else 0
  if(predicted == point.label) 1 else 0
}.sum
val lrAccuracy = lrTotalCorrect / numData
val svmAccuracy = svmTotalCorrect / numData
val nbAccuracy = nbTotalCorrect / numData
val dtAccuracy = dtTotalCorrect / numData
println("lrModel accuracy: " + lrAccuracy)
println("svmModel accuracy: " + svmAccuracy)
println("nbModel accuracy: " + nbAccuracy)
println("dtModel accuracy: " + dtAccuracy)
/*
* lrModel accuracy: 0.5146720757268425
  svmModel accuracy: 0.5146720757268425
  nbModel accuracy: 0.5803921568627451
  dtModel accuracy: 0.6482758620689655
* */


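5.2 Precision and recall

//-------- Evaluating classification models: precision and recall --------
/** Precision measures the quality of the results; recall measures their completeness.
  *
  * For binary classification:
  *
  *   precision = TP / (TP + FP)
  *   recall    = TP / (TP + FN)
  *
  * where TP (true positives) is the number of class-1 examples correctly predicted as class 1,
  * FP (false positives) is the number of examples wrongly predicted as class 1, and
  * FN (false negatives) is the number of examples wrongly predicted as class 0.
  *
  * The area under the precision-recall (PR) curve is the average precision.
  */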
val metrics = Seq(lrModel, svmModel).map{ model =>
  val scoreAndLabels = data.map{  point =>
    (model.predict(point.features), point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val nbMetrics = Seq(nbModel).map{ model =>
  val scoreAndLabels = nbData.map{  point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val dtMetrics = Seq(dtModel).map{ model =>
  val scoreAndLabels = data.map { point => // evaluate the tree on data, the set it was trained on
    val score  = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val allMetrics = metrics ++ nbMetrics ++ dtMetrics
allMetrics.foreach{ case (model,pr,roc) =>
  println(f"$model, Area under PR : ${pr * 100.0}%2.4f%%,Area under ROC: ${roc * 100.0}%2.4f%%")
}

//LogisticRegressionModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
//SVMModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
//NaiveBayesModel, Area under PR : 68.0851%,Area under ROC: 58.3559%
//DecisionTreeModel, Area under PR : 74.3081%,Area under ROC: 64.8837%
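
To make the definitions of precision and recall concrete, here is a small sketch in plain Scala that counts true/false positives and negatives from (prediction, label) pairs. The object ConfusionCounts and its functions are hypothetical names for this illustration, not part of MLlib; returning 0.0 when a denominator is zero is a convention chosen here.

```scala
object ConfusionCounts {
  // Each pair is (predicted class, true class), both encoded as 0.0 or 1.0.
  // Returns (TP, FP, FN, TN).
  def counts(pairs: Seq[(Double, Double)]): (Int, Int, Int, Int) = {
    val tp = pairs.count { case (p, l) => p == 1.0 && l == 1.0 }
    val fp = pairs.count { case (p, l) => p == 1.0 && l == 0.0 }
    val fn = pairs.count { case (p, l) => p == 0.0 && l == 1.0 }
    val tn = pairs.count { case (p, l) => p == 0.0 && l == 0.0 }
    (tp, fp, fn, tn)
  }

  // precision = TP / (TP + FP)
  def precision(pairs: Seq[(Double, Double)]): Double = {
    val (tp, fp, _, _) = counts(pairs)
    if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
  }

  // recall = TP / (TP + FN)
  def recall(pairs: Seq[(Double, Double)]): Double = {
    val (tp, _, fn, _) = counts(pairs)
    if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
  }

  // accuracy = (TP + TN) / total
  def accuracy(pairs: Seq[(Double, Double)]): Double = {
    val (tp, _, _, tn) = counts(pairs)
    if (pairs.isEmpty) 0.0 else (tp + tn).toDouble / pairs.size
  }
}
```

For the five pairs (1,1), (1,0), (0,1), (1,1), (0,0) there are 2 true positives, 1 false positive, 1 false negative and 1 true negative, so precision and recall are both 2/3 and accuracy is 3/5.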

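5.3 ROC curve and AUC

//-------- Evaluating classification models: ROC curve and AUC --------
/** The ROC curve is conceptually similar to the PR curve: it plots the classifier's
  * true positive rate (TPR) against its false positive rate (FPR).
  *
  *   TPR = TP / (TP + FN)    (the same quantity as recall, also called sensitivity)
  *   FPR = FP / (FP + TN)
  *
  * The ROC curve shows the trade-off between TPR and FPR as the decision threshold varies.
  * The area under the ROC curve, called the AUC, summarizes this trade-off as an average value.
  */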
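
As a rough illustration of what the AUC measures, the sketch below computes it in plain Scala as the fraction of (positive, negative) pairs in which the positive example gets the higher score, counting ties as one half. This is the rank-based view of the AUC, not how BinaryClassificationMetrics computes it internally, and RocAuc is a hypothetical name. It also shows why hard 0/1 predictions, as passed to the metrics in the code above, give a coarse estimate: with only two distinct scores most pairs are either decided by a single threshold or tied.

```scala
object RocAuc {
  // Each pair is (score, true label) with labels 0.0 or 1.0.
  // AUC = P(score of a random positive > score of a random negative),
  // with ties counted as 0.5. Runs in O(P * N) time, which is fine for a demo.
  def auc(scoreAndLabels: Seq[(Double, Double)]): Double = {
    val pos = scoreAndLabels.collect { case (s, l) if l == 1.0 => s }
    val neg = scoreAndLabels.collect { case (s, l) if l == 0.0 => s }
    require(pos.nonEmpty && neg.nonEmpty, "need at least one example of each class")
    val wins = for (p <- pos; n <- neg) yield {
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    }
    wins.sum / (pos.size.toLong * neg.size)
  }
}
```

With real-valued scores (0.9, 0.7 for the positives and 0.8, 0.1 for the negatives) three of the four pairs are ordered correctly, giving 0.75. In MLlib, LogisticRegressionModel and SVMModel return raw scores instead of 0/1 classes once clearThreshold() has been called on them, which is one way to obtain a finer-grained curve.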


6. Improving model performance and tuning parameters

6.1 Feature standardization

//------ Improving model performance and tuning parameters ------
val vectors = data.map(lp => lp.features)
val matrix = new RowMatrix(vectors)
val matrixSummary = matrix.computeColumnSummaryStatistics()
println("Column means:")
println(matrixSummary.mean)
//[0.41225805299526774,2.76182319198661,0.46823047328613876,0.21407992638350257,0.0920623607189991,0.04926216043908034,2.255103452212025,-0.10375042752143329,0.0,0.05642274498417848,0.02123056118999324,0.23377817665490225,0.2757090373659231,0.615551048005409,0.6603110209601082,30.077079107505178,0.03975659229208925,5716.598242055454,178.75456389452327,4.960649087221106,0.17286405047031753,0.10122079189276531]
println("Column minimums:")
println(matrixSummary.min)
//[0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.045564223,-1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0]
println("Column maximums:")
println(matrixSummary.max)
//[0.999426,363.0,1.0,1.0,0.980392157,0.980392157,21.0,0.25,0.0,0.444444444,1.0,0.716883117,113.3333333,1.0,1.0,100.0,1.0,207952.0,4997.0,22.0,1.0,1.0]
println("Column variances:")
println(matrixSummary.variance)
//[0.10974244167559023,74.30082476809655,0.04126316989120245,0.021533436332001124,0.009211817450882448,0.005274933469767929,32.53918714591818,0.09396988697611537,0.0,0.001717741034662896,0.020782634824610638,0.0027548394224293023,3.6837889196744116,0.2366799607085986,0.22433071201674218,415.87855895438463,0.03818116876739597,7.877330081138441E7,32208.11624742624,10.453009045764313,0.03359363403832387,0.0062775328842146995]
println("Number of non-zero entries per column:")
println(matrixSummary.numNonzeros)
//[5053.0,7354.0,7172.0,6821.0,6160.0,5128.0,7350.0,1257.0,0.0,7362.0,157.0,7395.0,7355.0,4552.0,4883.0,7347.0,294.0,7378.0,7395.0,6782.0,6868.0,7235.0]

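// Some columns have a much larger mean and variance than the others: the second column, for example, and even more so the 18th (mean about 5716.6, variance about 7.9E7).
// To make the data fit the models' assumptions better, each feature can be standardized to zero mean and unit standard deviation:
// subtract the column mean from each feature value, then divide by the column standard deviation.
// Spark's StandardScaler does this conveniently.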

val scaler = new StandardScaler(withMean = true, withStd = true).fit(vectors)
val scaledData = data.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))
println("Before standardization: " + data.first().features)
println("After standardization: " + scaledData.first().features)
//Before standardization: [0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575]
//After standardization: [1.1376473364976751,-0.08193557169294784,1.0251398128933333,-0.05586356442541853,-0.4688932531289351,-0.35430532630793654,-0.3175352172363122,0.3384507982396541,0.0,0.8288221733153222,-0.14726894334628504,0.22963982357812907,-0.14162596909880876,0.7902380499177364,0.7171947294529865,-0.29799681649642484,-0.2034625779299476,-0.03296720969690467,-0.04878112975579767,0.9400699751165406,-0.10869848852526329,-0.27882078231369967]
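
The first standardized value can be checked by hand from the column statistics printed earlier: (0.789131 - 0.412258) / sqrt(0.109742) is approximately 1.1376. The sketch below does the same computation for a whole column in plain Scala. Standardize is a hypothetical name, and the sketch assumes the sample standard deviation (dividing by n - 1); with 7395 rows the difference from dividing by n is negligible.

```scala
object Standardize {
  // Standardize one feature column to zero mean and unit standard deviation.
  // Assumption: the sample standard deviation (n - 1 in the denominator) is used.
  def standardize(column: Seq[Double]): Seq[Double] = {
    require(column.size > 1, "need at least two values")
    val mean = column.sum / column.size
    val variance = column.map(x => (x - mean) * (x - mean)).sum / (column.size - 1)
    val std = math.sqrt(variance)
    // A constant column has std 0; map it to 0.0 instead of dividing by zero.
    column.map(x => if (std == 0.0) 0.0 else (x - mean) / std)
  }
}
```

A column whose values are all equal comes out as all zeros in this sketch, which matches the 0.0 in the ninth position of the standardized vector above (that column has variance 0).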

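// Next the model is retrained on the standardized data. Only logistic regression is retrained here, because decision trees and naive Bayes are not affected by feature standardization.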

val lrModelScaled = LogisticRegressionWithSGD.train(scaledData, numItetations)
val lrTotalCorrectScaled = scaledData.map{  point =>
  if(lrModelScaled.predict(point.features) == point.label) 1 else 0
}.sum
val lrAccuracyScaled = lrTotalCorrectScaled / numData
val lrPreditionsVsTrue = scaledData.map{  point =>
  (lrModelScaled.predict(point.features), point.label)
}
val lrMetricsScaled = new BinaryClassificationMetrics(lrPreditionsVsTrue)
val lrPr = lrMetricsScaled.areaUnderPR()
val lrRoc = lrMetricsScaled.areaUnderROC()
println(f"${lrModelScaled.getClass.getSimpleName}\n Accuracy:${lrAccuracyScaled * 100}%2.4f%%\n Area under PR : ${lrPr * 100.0}%2.4f%%,Area under ROC: ${lrRoc * 100.0}%2.4f%%")
//LogisticRegressionModel
//Accuracy:62.0419%
// Area under PR : 72.7254%,Area under ROC: 61.9663%

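// Compare with the earlier results:
// lrModel accuracy: 0.5146720757268425
// LogisticRegressionModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
// Accuracy rises from about 51.5% to 62.0% and the area under ROC from about 50.1% to 62.0%, while the area under PR drops slightly; this is the effect of feature standardization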


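6.2 Using additional features

//------------- Additional features -------------
// So far only part of the available features has been used: the numeric columns. The category column (index 3) was ignored.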

val categories = records.map(r => r(3)).distinct().collect().zipWithIndex.toMap
val numCategories = categories.size
println(categories)
println("Number of categories: " + numCategories)
//Map("weather" -> 0, "sports" -> 1, "unknown" -> 10, "computer_internet" -> 11, "?" -> 8, "culture_politics" -> 9, "religion" -> 4, "recreation" -> 7, "arts_entertainment" -> 5, "health" -> 12, "law_crime" -> 6, "gaming" -> 13, "business" -> 2, "science_technology" -> 3)
//Number of categories: 14
// The category is represented by a vector of length 14: the dimension at the index of the example's category is set to 1 and all others to 0. The new feature vector is then treated like the other numeric features

val dataCategories = records.map{ r =>
  val trimmed = r.map(_.replaceAll("\"",""))
  val label = trimmed(r.size - 1).toInt
  val categoryIdx = categories(r(3))
  val categoryFeatures = Array.ofDim[Double](numCategories)
  categoryFeatures(categoryIdx) = 1.0
  val otherFeatures = trimmed.slice(4, r.size - 1).map(d => if(d == "?") 0.0 else d.toDouble)
  val features = categoryFeatures ++ otherFeatures
  LabeledPoint(label, Vectors.dense(features))
}
println("First row: " + dataCategories.first())
//First row: (0.0,[0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575])
// The category feature has been turned into the first 14 dimensions of the vector
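
The 1-of-k encoding used above can be sketched in a few lines of plain Scala. OneHot, buildIndex and encode are hypothetical names for this illustration. Unlike the Spark code, the sketch sorts the distinct categories so that the index assignment is deterministic; the actual indices in the Map printed above depend on the order returned by distinct().collect().

```scala
object OneHot {
  // Build a category -> index map from the distinct values.
  // Sorting makes the mapping deterministic (an assumption of this sketch).
  def buildIndex(values: Seq[String]): Map[String, Int] =
    values.distinct.sorted.zipWithIndex.toMap

  // Encode one category as a vector with a single 1.0 at its index (1-of-k encoding).
  def encode(category: String, index: Map[String, Int]): Array[Double] = {
    val vec = Array.ofDim[Double](index.size)
    vec(index(category)) = 1.0
    vec
  }
}
```

With the three categories business, sports and weather, encode("sports", index) yields the vector (0.0, 1.0, 0.0). Concatenating such a vector with the numeric features gives the 36-dimensional rows shown above.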

val scalerCats = new StandardScaler(withMean = true, withStd = true).fit(dataCategories.map(lp => lp.features))
val scaledDataCasts = dataCategories.map( lp =>
  LabeledPoint(lp.label, scalerCats.transform(lp.features))
)
scaledDataCasts.cache()
println("After standardization: " + scaledDataCasts.first())
//After standardization: (0.0,[-0.02326210589837061,-0.23272797709480803,2.7207366564548514,-0.2016540523193296,-0.09914991930875496,-0.38181322324318134,-0.06487757239262681,-0.4464212047941535,-0.6807527904251456,-0.22052688457880879,-0.028494000387023734,-0.20418221057887365,-0.2709990696925828,-0.10189469097220732,1.1376473364976751,-0.08193557169294784,1.0251398128933333,-0.05586356442541853,-0.4688932531289351,-0.35430532630793654,-0.3175352172363122,0.3384507982396541,0.0,0.8288221733153222,-0.14726894334628504,0.22963982357812907,-0.14162596909880876,0.7902380499177364,0.7171947294529865,-0.29799681649642484,-0.2034625779299476,-0.03296720969690467,-0.04878112975579767,0.9400699751165406,-0.10869848852526329,-0.27882078231369967])


val nbDataCategories = records.map{ r =>
  val trimmed = r.map(_.replaceAll("\"",""))
  val label = trimmed(r.size - 1).toInt
  val categoryIdx = categories(r(3))
  val categoryFeatures = Array.ofDim[Double](numCategories)
  categoryFeatures(categoryIdx) = 1.0
  val otherFeatures = trimmed.slice(4, r.size - 1).map(d => if(d == "?") 0.0 else d.toDouble)
    .map( d => if (d < 0) 0.0 else d)
  val features = categoryFeatures ++ otherFeatures
  LabeledPoint(label, Vectors.dense(features))
}
println("First row: " + nbDataCategories.first())
//First row: (0.0,[0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575])




nbDataCategories.cache()

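val lrModelScaledCats = LogisticRegressionWithSGD.train(scaledDataCasts, numItetations) // logistic regression on the standardized data with category features
val svmModelScaledCats = SVMWithSGD.train(scaledDataCasts, numItetations) // SVM on the standardized data with category features
val nbModelScaledCats = NaiveBayes.train(nbDataCategories) // naive Bayes on the non-negative, unstandardized data with category features
val dtModelScaledCats = DecisionTree.train(dataCategories, Algo.Classification, Entropy, maxTreeDepth) // decision tree on the unstandardized data with category features
// Note: decision trees and naive Bayes are not affected by feature standardization. Moreover, standardization produces negative values, which naive Bayes cannot handle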


val lrTotalCorrectScaledCats = scaledDataCasts.map{  point =>
  if(lrModelScaledCats.predict(point.features) == point.label) 1 else 0
}.sum

val svmTotalCorrectScaledCats = scaledDataCasts.map{ point =>
  if(svmModelScaledCats.predict(point.features) == point.label) 1 else 0
}.sum
val nbTotalCorrectScaledCats = nbDataCategories.map{ point =>
  if(nbModelScaledCats.predict(point.features) == point.label) 1 else 0
}.sum
val dtTotalCorrectScaledCats = dataCategories.map{  point =>
  val score = dtModelScaledCats.predict(point.features)
  val predicted = if(score > 0.5) 1 else 0
  if(predicted == point.label) 1 else 0
}.sum
val  lrAccuracyScaledCats =  lrTotalCorrectScaledCats / numData
val svmAccuracyScaledCats = svmTotalCorrectScaledCats / numData
val  nbAccuracyScaledCats =  nbTotalCorrectScaledCats / numData
val  dtAccuracyScaledCats =  dtTotalCorrectScaledCats / numData
println(" lrModel accuracy: " +  lrAccuracyScaledCats)
println("svmModel accuracy: " + svmAccuracyScaledCats)
println(" nbModel accuracy: " +  nbAccuracyScaledCats)
println(" dtModel accuracy: " +  dtAccuracyScaledCats)
/*
Earlier results:
*  lrModel accuracy: 0.5146720757268425
  svmModel accuracy: 0.5146720757268425
   nbModel accuracy: 0.5803921568627451
   dtModel accuracy: 0.6482758620689655
* */
/***
  * Results with the category features added:
  *  lrModel accuracy: 0.6657200811359026
  * svmModel accuracy: 0.6645030425963488
  *  nbModel accuracy: 0.5832319134550372
  *  dtModel accuracy: 0.6655848546315077
  */

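// Evaluate on scaledDataCasts, the standardized data the model was trained on
val lrPreditionsVsTrueScaledCats = scaledDataCasts.map{  point =>
  (lrModelScaledCats.predict(point.features), point.label)
}
val lrMetricsScaledCats = new BinaryClassificationMetrics(lrPreditionsVsTrueScaledCats)
val lrPrScaledCats = lrMetricsScaledCats.areaUnderPR()
val lrRocScaledCats = lrMetricsScaledCats.areaUnderROC()
println(f"${lrModelScaledCats.getClass.getSimpleName}\n Accuracy:${lrAccuracyScaledCats * 100}%2.4f%%\n Area under PR : ${lrPrScaledCats * 100.0}%2.4f%%,Area under ROC: ${lrRocScaledCats * 100.0}%2.4f%%")
// Output recorded when the PR and ROC metrics were computed on the unstandardized dataCategories:
//LogisticRegressionModel
//Accuracy:66.5720%
//Area under PR : 75.6015%,Area under ROC: 52.1977%
// The low area under ROC reflects that mismatch between training and evaluation features; the tuning runs in section 6.4.1, which evaluate on scaledDataCasts, report an AUC of about 66.5% for comparable settings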

val metrics2 = Seq(lrModelScaledCats, svmModelScaledCats).map{ model =>
  // the linear models were trained on scaledDataCasts, so they are evaluated on it as well
  val scoreAndLabels = scaledDataCasts.map{  point =>
    (model.predict(point.features), point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val nbMetrics2 = Seq(nbModelScaledCats).map{ model =>
  val scoreAndLabels = nbDataCategories.map{  point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val dtMetrics2 = Seq(dtModelScaledCats).map{ model =>
  val scoreAndLabels = dataCategories.map { point =>
    val score  = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val allMetrics2 = metrics2 ++ nbMetrics2 ++ dtMetrics2
allMetrics2.foreach{ case (model,pr,roc) =>
  println(f"$model, Area under PR : ${pr * 100.0}%2.4f%%,Area under ROC: ${roc * 100.0}%2.4f%%")
}

// Earlier results (numeric features only):
//LogisticRegressionModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
//SVMModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
//NaiveBayesModel, Area under PR : 68.0851%,Area under ROC: 58.3559%
//DecisionTreeModel, Area under PR : 74.3081%,Area under ROC: 64.8837%

// Results with the category features added (the two linear-model lines were recorded with evaluation on the unstandardized dataCategories):
//LogisticRegressionModel, Area under PR : 75.6015%,Area under ROC: 52.1977%
//SVMModel, Area under PR : 75.5180%,Area under ROC: 54.1606%
//NaiveBayesModel, Area under PR : 68.3386%,Area under ROC: 58.6397%
//DecisionTreeModel, Area under PR : 75.8784%,Area under ROC: 66.5005%

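6.3 Using the correct form of data

//-------- Using the correct form of data --------
// Now only the category feature is used, that is, only the first 14 dimensions of the feature vector, because 1-of-k encoded categorical features are a better fit for the naive Bayes model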
val nbDataOnlyCategories = records.map{ r =>
  val trimmed = r.map(_.replaceAll("\"",""))
  val label = trimmed(r.size - 1).toInt
  val categoryIdx = categories(r(3))
  val categoryFeatures = Array.ofDim[Double](numCategories)
  categoryFeatures(categoryIdx) = 1.0
  LabeledPoint(label, Vectors.dense(categoryFeatures))
}
println("First row: " + nbDataOnlyCategories.first())

val nbModelScaledOnlyCats = NaiveBayes.train(nbDataOnlyCategories) // naive Bayes on the 1-of-k category features only (no standardization is applied)
val nbMetricsOnlyCats = Seq(nbModelScaledOnlyCats).map{ model =>
  val scoreAndLabels = nbDataOnlyCategories.map{  point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
nbMetricsOnlyCats.foreach{ case (model,pr,roc) =>
  println(f"$model, Area under PR : ${pr * 100.0}%2.4f%%,Area under ROC: ${roc * 100.0}%2.4f%%")
}
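//NaiveBayesModel, Area under PR : 74.0522%,Area under ROC: 60.5138%
// Compare with the earlier result:
//NaiveBayesModel, Area under PR : 68.0851%,Area under ROC: 58.3559%
// The area under ROC improves by about 2 percentage points and the area under PR by about 6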

val nbTotalCorrectScaledOnlyCats = nbDataOnlyCategories.map{ point =>
  if(nbModelScaledOnlyCats.predict(point.features) == point.label) 1 else 0
}.sum
val  nbAccuracyScaledOnlyCats =  nbTotalCorrectScaledOnlyCats / numData
println(" nbModel accuracy: " +  nbAccuracyScaledOnlyCats)
// nbModel accuracy: 0.6096010818120352
// Compare with the earlier result:
// nbModel accuracy: 0.5832319134550372
// An improvement of about 2.6 percentage points


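6.4 Tuning model parameters

6.4.1 Tuning the linear models

//-------- Parameter tuning: linear models --------
// trainLRWithParams and createMetrics are helper functions that are not shown in this section.
// Judging from the calls below, the arguments of trainLRWithParams are: input data, regularization parameter, number of iterations, updater, step size
scaledDataCasts.cache()

// (1) Effect of the number of iterations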
val iterResults = Seq(1,5,10,50).map{ param =>
  val model = trainLRWithParams(scaledDataCasts, 0.0, param, new SimpleUpdater, 1.0)
  createMetrics(s"$param iterations", scaledDataCasts, model)
}
iterResults.foreach{  case (param, auc) =>
  println(f"$param, AUC=${auc * 100}%2.4f%%")
}
/*1 iterations, AUC=64.9520%
  5 iterations, AUC=66.6161%
  10 iterations, AUC=66.5483%
  50 iterations, AUC=66.8143%*/

// (2) Effect of the step size
val stepResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map{ param =>
  val model = trainLRWithParams(scaledDataCasts, 0.0, numItetations, new SimpleUpdater, param)
  createMetrics(s"$param step size", scaledDataCasts, model)
}
stepResults.foreach{  case (param, auc) =>
  println(f"$param, AUC=${auc * 100}%2.4f%%")
}
/*0.001 step size, AUC=64.9659%
  0.01 step size, AUC=64.9644%
  0.1 step size, AUC=65.5211%
  1.0 step size, AUC=66.5483%
  10.0 step size, AUC=61.9228%*/


// (3) Effect of regularization
val regResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map{  param =>
  val model = trainLRWithParams(scaledDataCasts, param, numItetations, new SquaredL2Updater, 1.0)
  createMetrics(s"$param L2 regularization parameter", scaledDataCasts, model)
}
regResults.foreach{  case (param, auc) =>
  println(f"$param, AUC=${auc * 100}%2.4f%%")
}
/*0.001 L2 regularization parameter, AUC=66.5475%
  0.01 L2 regularization parameter, AUC=66.5475%
  0.1 L2 regularization parameter, AUC=66.5475%
  1.0 L2 regularization parameter, AUC=66.5475%
  10.0 L2 regularization parameter, AUC=66.5475%*/

6.4.2 Tuning the decision tree

//-------- Parameter tuning: decision tree --------
// trainDTWithParams is a helper function that is not shown in this section

// Vary the tree depth
val dtResultsEntropy = Seq(1, 2, 3, 4, 5, 10, 20).map{  param =>
  val model = trainDTWithParams(scaledDataCasts,param, Entropy)
  val scoreAndLabels = scaledDataCasts.map{ point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (s"$param tree depth with Entropy", metrics.areaUnderROC())
}
dtResultsEntropy.foreach { case (param, auc) =>
  println(f"$param,AUC=${auc * 100}%2.4f%%")
}
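/*1 tree depth with Entropy,AUC=59.3268%
  2 tree depth with Entropy,AUC=59.3268%
  3 tree depth with Entropy,AUC=61.8313%
  4 tree depth with Entropy,AUC=62.1519%
  5 tree depth with Entropy,AUC=66.5005%
  10 tree depth with Entropy,AUC=75.9120%
  20 tree depth with Entropy,AUC=96.4347%*/
// These AUC values are measured on the training data itself, so the very high values at depths 10 and 20 may largely reflect overfitting rather than better generalization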

// Change the impurity measure: Gini instead of Entropy
val dtResultsGini = Seq(1, 2, 3, 4, 5, 10, 20).map{  param =>
  val model = trainDTWithParams(scaledDataCasts,param, Gini)
  val scoreAndLabels = scaledDataCasts.map{ point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (s"$param tree depth with Gini", metrics.areaUnderROC())
}
dtResultsGini.foreach { case (param, auc) =>
  println(f"$param,AUC=${auc * 100}%2.4f%%")
}
/* 1 tree depth with Gini,AUC=59.3268%
 2 tree depth with Gini,AUC=61.6106%
   3 tree depth with Gini,AUC=61.8349%
   4 tree depth with Gini,AUC=62.0433%
   5 tree depth with Gini,AUC=66.4518%
   10 tree depth with Gini,AUC=76.8962%
   20 tree depth with Gini,AUC=98.3514%*/
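
The two impurity measures compared above can be written down directly. The sketch below, in plain Scala, computes entropy (in bits) and Gini impurity from per-class example counts at a tree node; Impurities is a hypothetical name, and this is an illustration of the formulas rather than MLlib's implementation.

```scala
object Impurities {
  // Class proportions from per-class example counts; empty classes are dropped
  // because they contribute nothing to either measure.
  private def proportions(counts: Seq[Int]): Seq[Double] = {
    require(counts.sum > 0, "need at least one example")
    val total = counts.sum.toDouble
    counts.filter(_ > 0).map(_ / total)
  }

  // Entropy in bits: -sum(p * log2(p)).
  def entropy(counts: Seq[Int]): Double =
    -proportions(counts).map(p => p * math.log(p) / math.log(2)).sum

  // Gini impurity: 1 - sum(p^2).
  def gini(counts: Seq[Int]): Double =
    1.0 - proportions(counts).map(p => p * p).sum
}
```

A node with a 50/50 class mix has the maximum impurity for two classes (entropy 1 bit, Gini 0.5), while a pure node scores 0 under both measures. The two measures rank most splits similarly, which fits the closely matching AUC values in the two result listings above.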

6.4.3 Tuning naive Bayes

//-------- Parameter tuning: naive Bayes --------
// trainNBWithParams is a helper function that is not shown in this section; the parameter varied here is lambda, the additive smoothing parameter
val nbResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map{ param =>
  val model = trainNBWithParams(nbDataCategories, param)
  // evaluate on nbDataCategories, the non-negative data the model was trained on
  val scoreAndLabels = nbDataCategories.map{ point =>
    (model.predict(point.features), point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (s"$param lambda", metrics.areaUnderROC())
}
nbResults.foreach { case (param, auc) =>
  println(f"$param,AUC=${auc * 100}%2.4f%%")
}

// Output recorded when the model was evaluated on the standardized scaledDataCasts:
/*0.001 lambda,AUC=61.2364%
  0.01 lambda,AUC=61.3334%
  0.1 lambda,AUC=61.4714%
  1.0 lambda,AUC=61.5605%
  10.0 lambda,AUC=61.8360%*/
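
To see what the lambda parameter does, here is a small plain-Scala sketch of additive (Laplace) smoothing as it is commonly defined for multinomial naive Bayes. AdditiveSmoothing and smoothedProbabilities are hypothetical names; the formula is the textbook one and is not copied from MLlib's source.

```scala
object AdditiveSmoothing {
  // Smoothed estimate of P(feature j | class):
  //   (count_j + lambda) / (total + lambda * k)
  // where counts holds the per-feature totals within one class and k = counts.size.
  def smoothedProbabilities(counts: Seq[Double], lambda: Double): Seq[Double] = {
    require(lambda >= 0.0, "lambda must be non-negative")
    val denom = counts.sum + lambda * counts.size
    counts.map(c => (c + lambda) / denom)
  }
}
```

With counts (3, 0, 1) and lambda = 1 the estimates are 4/7, 1/7 and 2/7, so the feature that never occurred in this class no longer gets probability 0. Larger lambda values pull all estimates toward the uniform distribution.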

6.4.4 Cross-validation

Split the dataset into a training set and a test set, train on the former, and evaluate on the latter.

//--------- Cross-validation ---------
val trainTestSplit = scaledDataCasts.randomSplit(Array(0.6,0.4),123)
val train = trainTestSplit(0)
val test = trainTestSplit(1)
val regResultsTest = Seq(0.0, 0.001, 0.0025, 0.005, 0.01).map{  param =>
  val model = trainLRWithParams(train, param, numItetations, new SquaredL2Updater, 1.0)
  createMetrics(s"$param L2 regularization parameter", test, model)
}
regResultsTest.foreach { case (param, auc) =>
  println(f"$param,AUC=${auc * 100}%2.4f%%")
}
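
The code above uses a single 60/40 hold-out split rather than a full k-fold cross-validation. The idea of a reproducible random split can be sketched in plain Scala as follows. TrainTestSplit and split are hypothetical names, and the sketch only mirrors the spirit of randomSplit(Array(0.6,0.4),123): each element is assigned independently, so the set sizes are only approximately 60% and 40%, and the elements chosen will differ from Spark's.

```scala
import scala.util.Random

object TrainTestSplit {
  // Assign each element independently to the training set with probability
  // trainFraction, using a seeded generator so that the split is reproducible.
  // Intended for strict (non-lazy) collections.
  def split[A](data: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    require(trainFraction >= 0.0 && trainFraction <= 1.0, "fraction must be in [0, 1]")
    val rng = new Random(seed)
    val tagged = data.map(x => (x, rng.nextDouble() < trainFraction))
    (tagged.collect { case (x, true) => x }, tagged.collect { case (x, false) => x })
  }
}
```

Calling TrainTestSplit.split(1 to 1000, 0.6, 123L) twice returns the same two sets, and every element lands in exactly one of them. Evaluating on the held-out part is what makes the reported AUC an estimate of performance on unseen data.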

Complete code:

import org.apache.spark.mllib.classification.{ClassificationModel, LogisticRegressionWithSGD, NaiveBayes, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.optimization.{SimpleUpdater, SquaredL2Updater, Updater}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.{Entropy, Gini, Impurity}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Evergreen {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Evergreen").setMaster("local")
    // run in local mode
    val BASEDIR = "hdfs://pc1:9000/"
    // files on HDFS
    //val BASEDIR = "file:///home/chenjie/"
    // files on the local file system
    //val sparkConf = new SparkConf().setAppName("Evergreen-cluster").setMaster("spark://pc1:7077").setJars(List("untitled2.jar"))
    // run in cluster mode
    val sc = new SparkContext(sparkConf)
    // initialize the SparkContext
    val rawData = sc.textFile(BASEDIR + "train_noheader.tsv") // load the data
    println("rawData.first()=" + rawData.first()) // print the first record
    //"http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html"
    // "4042"
    // "{""title"":""IBM hic calies"",
    // ""body"":""A sign the tahe cwlett Packard Co t last."",
    // ""url"":""bloomberg news 2010 12 23 ibm predicts holographic calls air breathing batteries by 2015 html""}"
    // "business"  "0.789131" "2.055555556"  "0.676470588"  "0.205882353"  "0.047058824"  "0.023529412"  "0.443783175"  "0"  "0"  "0.09077381" "0"  "0.245831182"  "0.003883495"  "1"  "1"  "24" "0"  "5424" "170"  "8"  "0.152941176"  "0.079129575"  "0"


    val records = rawData.map(line => line.split("\t"))
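    println(records.first().mkString(" | ")) // an Array prints as an object reference, so join its fields first

    val data = records.map{ r =>
      val trimmed = r.map(_.replaceAll("\"","")) // strip the quotes
      val label = trimmed(r.size - 1).toInt // the last column is the class label
      val features = trimmed.slice(4, r.size - 1).map(  d => if(d == "?") 0.0 else d.toDouble) // replace missing values ("?") with 0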
      LabeledPoint(label, Vectors.dense(features))
    }
    data.cache()
    val numData = data.count
    println("numData=" + numData)
    //numData=7395

    val nbData = records.map{ r =>
      val trimmed = r.map(_.replaceAll("\"",""))
      val label = trimmed(r.size - 1).toInt
      val features = trimmed.slice(4, r.size - 1).map(  d => if(d == "?") 0.0 else d.toDouble)
        .map( d => if (d < 0) 0.0 else d)
      LabeledPoint(label, Vectors.dense(features))
    }
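    // Before processing the dataset further, note that the numeric features contain negative values. Naive Bayes requires non-negative feature values and throws an exception when it meets a negative one,
    // so a separate input dataset (nbData above) is built for the naive Bayes model, with negative feature values set to 0.

    //------------ Training the classification models ------------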
    val numItetations = 10
    val maxTreeDepth = 5
    val lrModel = LogisticRegressionWithSGD.train(data, numItetations)
    val svmModel = SVMWithSGD.train(data, numItetations)
    val nbModel = NaiveBayes.train(nbData)
    val dtModel = DecisionTree.train(data, Algo.Classification, Entropy, maxTreeDepth)
    // For the decision tree, the mode (Algo) is set to Classification and Entropy is used as the impurity measure

    //----------- Using the classification models -----------
    val dataPoint = data.first()
    val trueLabel = dataPoint.label
    println("True label: " + trueLabel)
    val prediction1 = lrModel.predict(dataPoint.features)
    val prediction2 = svmModel.predict(dataPoint.features)
    val prediction3 = nbModel.predict(dataPoint.features)
    val prediction4 = dtModel.predict(dataPoint.features)
    println("lrModel prediction: " + prediction1)
    println("svmModel prediction: " + prediction2)
    println("nbModel prediction: " + prediction3)
    println("dtModel prediction: " + prediction4)
    /*
    * True label: 0.0
      lrModel prediction: 1.0
      svmModel prediction: 1.0
      nbModel prediction: 1.0
      dtModel prediction: 0.0
    * */


    // A whole RDD[Vector] can also be passed in to predict in bulk
    /* val predictions = lrModel.predict(data.map(lp => lp.features))
     predictions.take(5).foreach(println)*/

    //-------- Evaluating classification models: accuracy and prediction error --------
    val lrTotalCorrect = data.map{  point =>
      if(lrModel.predict(point.features) == point.label) 1 else 0
    }.sum

    val svmTotalCorrect = data.map{ point =>
      if(svmModel.predict(point.features) == point.label) 1 else 0
    }.sum
    val nbTotalCorrect = nbData.map{ point =>
      if(nbModel.predict(point.features) == point.label) 1 else 0
    }.sum
    val dtTotalCorrect = data.map{  point =>
      val score = dtModel.predict(point.features)
      val predicted = if(score > 0.5) 1 else 0
      if(predicted == point.label) 1 else 0
    }.sum
    val lrAccuracy = lrTotalCorrect / numData
    val svmAccuracy = svmTotalCorrect / numData
    val nbAccuracy = nbTotalCorrect / numData
    val dtAccuracy = dtTotalCorrect / numData
    println("lrModel accuracy: " + lrAccuracy)
    println("svmModel accuracy: " + svmAccuracy)
    println("nbModel accuracy: " + nbAccuracy)
    println("dtModel accuracy: " + dtAccuracy)
    /*
    * lrModel accuracy: 0.5146720757268425
      svmModel accuracy: 0.5146720757268425
      nbModel accuracy: 0.5803921568627451
      dtModel accuracy: 0.6482758620689655
    * */

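    //-------- Evaluating classification models: precision and recall --------
    /** Precision measures the quality of the results; recall measures their completeness.
      *
      * For binary classification:
      *
      *   precision = TP / (TP + FP)
      *   recall    = TP / (TP + FN)
      *
      * where TP (true positives) is the number of class-1 examples correctly predicted as class 1,
      * FP (false positives) is the number of examples wrongly predicted as class 1, and
      * FN (false negatives) is the number of examples wrongly predicted as class 0.
      *
      * The area under the precision-recall (PR) curve is the average precision.
      */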
    val metrics = Seq(lrModel, svmModel).map{ model =>
      val scoreAndLabels = data.map{  point =>
        (model.predict(point.features), point.label)
      }
      val metrics = new BinaryClassificationMetrics(scoreAndLabels)
      (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
    }
    val nbMetrics = Seq(nbModel).map{ model =>
      val scoreAndLabels = nbData.map{  point =>
        val score = model.predict(point.features)
        (if (score > 0.5) 1.0 else 0.0, point.label)
      }
      val metrics = new BinaryClassificationMetrics(scoreAndLabels)
      (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
    }
    val dtMetrics = Seq(dtModel).map{ model =>
      val scoreAndLabels = data.map { point => // evaluate the tree on data, the set it was trained on
        val score  = model.predict(point.features)
        (if (score > 0.5) 1.0 else 0.0, point.label)
      }
      val metrics = new BinaryClassificationMetrics(scoreAndLabels)
      (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
    }
    val allMetrics = metrics ++ nbMetrics ++ dtMetrics
    allMetrics.foreach{ case (model,pr,roc) =>
      println(f"$model, Area under PR : ${pr * 100.0}%2.4f%%,Area under ROC: ${roc * 100.0}%2.4f%%")
    }

    //LogisticRegressionModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
    //SVMModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
    //NaiveBayesModel, Area under PR : 68.0851%,Area under ROC: 58.3559%
    //DecisionTreeModel, Area under PR : 74.3081%,Area under ROC: 64.8837%


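    //--------Evaluating classifier performance: ROC curve and AUC--------
    /** The ROC curve is conceptually similar to the PR curve: it plots a classifier's
      * true positive rate (TPR) against its false positive rate (FPR).
      *
      *   TPR = TP / (TP + FN)   (identical to recall, also called sensitivity)
      *   FPR = FP / (FP + TN)
      *
      * where TP and FN count class-1 samples predicted as 1 and 0 respectively,
      * and FP and TN count class-0 samples predicted as 1 and 0 respectively.
      *
      * The ROC curve shows the trade-off between TPR and FPR across decision thresholds.
      * The area under it, the AUC, summarizes that trade-off as a single average value.
      */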


    //------Improving model performance and tuning parameters------
    val vectors = data.map(lp => lp.features)
    val matrix = new RowMatrix(vectors)
    val matrixSummary = matrix.computeColumnSummaryStatistics()
    println("Column means:")
    println(matrixSummary.mean)
    //[0.41225805299526774,2.76182319198661,0.46823047328613876,0.21407992638350257,0.0920623607189991,0.04926216043908034,2.255103452212025,-0.10375042752143329,0.0,0.05642274498417848,0.02123056118999324,0.23377817665490225,0.2757090373659231,0.615551048005409,0.6603110209601082,30.077079107505178,0.03975659229208925,5716.598242055454,178.75456389452327,4.960649087221106,0.17286405047031753,0.10122079189276531]
    println("Column minimums:")
    println(matrixSummary.min)
    //[0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.045564223,-1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0]
    println("Column maximums:")
    println(matrixSummary.max)
    //[0.999426,363.0,1.0,1.0,0.980392157,0.980392157,21.0,0.25,0.0,0.444444444,1.0,0.716883117,113.3333333,1.0,1.0,100.0,1.0,207952.0,4997.0,22.0,1.0,1.0]
    println("Column variances:")
    println(matrixSummary.variance)
    //[0.10974244167559023,74.30082476809655,0.04126316989120245,0.021533436332001124,0.009211817450882448,0.005274933469767929,32.53918714591818,0.09396988697611537,0.0,0.001717741034662896,0.020782634824610638,0.0027548394224293023,3.6837889196744116,0.2366799607085986,0.22433071201674218,415.87855895438463,0.03818116876739597,7.877330081138441E7,32208.11624742624,10.453009045764313,0.03359363403832387,0.0062775328842146995]
    println("Number of non-zero entries per column:")
    println(matrixSummary.numNonzeros)
    //[5053.0,7354.0,7172.0,6821.0,6160.0,5128.0,7350.0,1257.0,0.0,7362.0,157.0,7395.0,7355.0,4552.0,4883.0,7347.0,294.0,7378.0,7395.0,6782.0,6868.0,7235.0]

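    //The second column has a much larger mean and variance than most of the others (and some, such as the 18th, are larger still).
    //To make the data fit the model's assumptions better, standardize each feature to zero mean and unit standard deviation:
    //subtract the column mean from each value, then divide by the column standard deviation.
    //Spark's StandardScaler does this conveniently.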

    val scaler = new StandardScaler(withMean = true, withStd = true).fit(vectors)
    val scaledData = data.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))
    println("Before standardization: " + data.first().features)
    println("After standardization: " + scaledData.first().features)
    //Before standardization: [0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575]
    //After standardization: [1.1376473364976751,-0.08193557169294784,1.0251398128933333,-0.05586356442541853,-0.4688932531289351,-0.35430532630793654,-0.3175352172363122,0.3384507982396541,0.0,0.8288221733153222,-0.14726894334628504,0.22963982357812907,-0.14162596909880876,0.7902380499177364,0.7171947294529865,-0.29799681649642484,-0.2034625779299476,-0.03296720969690467,-0.04878112975579767,0.9400699751165406,-0.10869848852526329,-0.27882078231369967]

    //Next, retrain on the standardized data. Only logistic regression is retrained here, because decision trees and naive Bayes are not affected by feature standardization.

    val lrModelScaled = LogisticRegressionWithSGD.train(scaledData, numItetations)
    val lrTotalCorrectScaled = scaledData.map{  point =>
      if(lrModelScaled.predict(point.features) == point.label) 1 else 0
    }.sum
    val lrAccuracyScaled = lrTotalCorrectScaled / numData
    val lrPredictionsVsTrue = scaledData.map{  point =>
      (lrModelScaled.predict(point.features), point.label)
    }
    val lrMetricsScaled = new BinaryClassificationMetrics(lrPredictionsVsTrue)
    val lrPr = lrMetricsScaled.areaUnderPR()
    val lrRoc = lrMetricsScaled.areaUnderROC()
    println(f"${lrModelScaled.getClass.getSimpleName}\n Accuracy:${lrAccuracyScaled * 100}%2.4f%%\n Area under PR : ${lrPr * 100.0}%2.4f%%,Area under ROC: ${lrRoc * 100.0}%2.4f%%")
    //LogisticRegressionModel
    //Accuracy:62.0419%
    // Area under PR : 72.7254%,Area under ROC: 61.9663%

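    //For comparison, the earlier results were:
    //lrModel accuracy: 0.5146720757268425
    // LogisticRegressionModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
    //Accuracy and area under ROC improve substantially (area under PR drops slightly); this is the effect of feature standardization.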

    //-------------Additional features--------------------------------------------------
    //So far we have used only some of the features; the category column (index 3) was left out.

    val categories = records.map(r => r(3)).distinct().collect().zipWithIndex.toMap
    val numCategories = categories.size
    println(categories)
    println("Number of categories: " + numCategories)
    //Map("weather" -> 0, "sports" -> 1, "unknown" -> 10, "computer_internet" -> 11, "?" -> 8, "culture_politics" -> 9, "religion" -> 4, "recreation" -> 7, "arts_entertainment" -> 5, "health" -> 12, "law_crime" -> 6, "gaming" -> 13, "business" -> 2, "science_technology" -> 3)
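    //Number of categories: 14
    //Next, represent the category as a vector of length 14: the dimension at the sample's category index is set to 1 and all others to 0. The resulting 1-of-k vector is then treated like the other numeric features.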

    val dataCategories = records.map{ r =>
      val trimmed = r.map(_.replaceAll("\"",""))
      val label = trimmed(r.size - 1).toInt
      val categoryIdx = categories(r(3))
      val categoryFeatures = Array.ofDim[Double](numCategories)
      categoryFeatures(categoryIdx) = 1.0
      val otherFeatures = trimmed.slice(4, r.size - 1).map(d => if(d == "?") 0.0 else d.toDouble)
      val features = categoryFeatures ++ otherFeatures
      LabeledPoint(label, Vectors.dense(features))
    }
    println("First row: " + dataCategories.first())
    //First row: (0.0,[0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575])
    //The category feature is now a 14-dimensional vector at the front of the feature vector.

    val scalerCats = new StandardScaler(withMean = true, withStd = true).fit(dataCategories.map(lp => lp.features))
    val scaledDataCasts = dataCategories.map( lp =>
      LabeledPoint(lp.label, scalerCats.transform(lp.features))
    )
    scaledDataCasts.cache()
    println("After standardization: " + scaledDataCasts.first())
    //After standardization: (0.0,[-0.02326210589837061,-0.23272797709480803,2.7207366564548514,-0.2016540523193296,-0.09914991930875496,-0.38181322324318134,-0.06487757239262681,-0.4464212047941535,-0.6807527904251456,-0.22052688457880879,-0.028494000387023734,-0.20418221057887365,-0.2709990696925828,-0.10189469097220732,1.1376473364976751,-0.08193557169294784,1.0251398128933333,-0.05586356442541853,-0.4688932531289351,-0.35430532630793654,-0.3175352172363122,0.3384507982396541,0.0,0.8288221733153222,-0.14726894334628504,0.22963982357812907,-0.14162596909880876,0.7902380499177364,0.7171947294529865,-0.29799681649642484,-0.2034625779299476,-0.03296720969690467,-0.04878112975579767,0.9400699751165406,-0.10869848852526329,-0.27882078231369967])


    val nbDataCategories = records.map{ r =>
      val trimmed = r.map(_.replaceAll("\"",""))
      val label = trimmed(r.size - 1).toInt
      val categoryIdx = categories(r(3))
      val categoryFeatures = Array.ofDim[Double](numCategories)
      categoryFeatures(categoryIdx) = 1.0
      val otherFeatures = trimmed.slice(4, r.size - 1).map(d => if(d == "?") 0.0 else d.toDouble)
        .map( d => if (d < 0) 0.0 else d)
      val features = categoryFeatures ++ otherFeatures
      LabeledPoint(label, Vectors.dense(features))
    }
    println("First row: " + nbDataCategories.first())
    //First row: (0.0,[0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575])




    nbDataCategories.cache()

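    val lrModelScaledCats = LogisticRegressionWithSGD.train(scaledDataCasts, numItetations) // logistic regression on the standardized data with category features
    val svmModelScaledCats = SVMWithSGD.train(scaledDataCasts, numItetations) // linear SVM on the standardized data with category features
    val nbModelScaledCats = NaiveBayes.train(nbDataCategories) // naive Bayes on the unstandardized, non-negative data with category features
    val dtModelScaledCats = DecisionTree.train(dataCategories, Algo.Classification, Entropy, maxTreeDepth) // decision tree on the unstandardized data with category features
    //Note: decision trees and naive Bayes are not affected by feature standardization. Standardization also produces negative values, which naive Bayes cannot handle.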


    val lrTotalCorrectScaledCats = scaledDataCasts.map{  point =>
      if(lrModelScaledCats.predict(point.features) == point.label) 1 else 0
    }.sum

    val svmTotalCorrectScaledCats = scaledDataCasts.map{ point =>
      if(svmModelScaledCats.predict(point.features) == point.label) 1 else 0
    }.sum
    val nbTotalCorrectScaledCats = nbDataCategories.map{ point =>
      if(nbModelScaledCats.predict(point.features) == point.label) 1 else 0
    }.sum
    val dtTotalCorrectScaledCats = dataCategories.map{  point =>
      val score = dtModelScaledCats.predict(point.features)
      val predicted = if(score > 0.5) 1 else 0
      if(predicted == point.label) 1 else 0
    }.sum
    val  lrAccuracyScaledCats =  lrTotalCorrectScaledCats / numData
    val svmAccuracyScaledCats = svmTotalCorrectScaledCats / numData
    val  nbAccuracyScaledCats =  nbTotalCorrectScaledCats / numData
    val  dtAccuracyScaledCats =  dtTotalCorrectScaledCats / numData
    println(" lrModel accuracy: " +  lrAccuracyScaledCats)
    println("svmModel accuracy: " + svmAccuracyScaledCats)
    println(" nbModel accuracy: " +  nbAccuracyScaledCats)
    println(" dtModel accuracy: " +  dtAccuracyScaledCats)
    /*
     * Earlier results:
     *  lrModel accuracy: 0.5146720757268425
     * svmModel accuracy: 0.5146720757268425
     *  nbModel accuracy: 0.5803921568627451
     *  dtModel accuracy: 0.6482758620689655
     */
    /*
     * Results with the category features added:
     *  lrModel accuracy: 0.6657200811359026
     * svmModel accuracy: 0.6645030425963488
     *  nbModel accuracy: 0.5832319134550372
     *  dtModel accuracy: 0.6655848546315077
     */

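    // Evaluate on scaledDataCasts, the data the model was trained on
    // (the output recorded below was obtained on the unscaled dataCategories).
    val lrPredictionsVsTrueScaledCats = scaledDataCasts.map{  point =>
      (lrModelScaledCats.predict(point.features), point.label)
    }
    val lrMetricsScaledCats = new BinaryClassificationMetrics(lrPredictionsVsTrueScaledCats)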
    val lrPrScaledCats = lrMetricsScaledCats.areaUnderPR()
    val lrRocScaledCats = lrMetricsScaledCats.areaUnderROC()
    println(f"${lrModelScaledCats.getClass.getSimpleName}\n Accuracy:${lrAccuracyScaledCats * 100}%2.4f%%\n Area under PR : ${lrPrScaledCats * 100.0}%2.4f%%,Area under ROC: ${lrRocScaledCats * 100.0}%2.4f%%")
    //LogisticRegressionModel
    //Accuracy:66.5720%
    //Area under PR : 75.6015%,Area under ROC: 52.1977%

    val metrics2 = Seq(lrModelScaledCats, svmModelScaledCats).map{ model =>
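      // Evaluate on scaledDataCasts, the data these models were trained on
      // (the figures recorded below were obtained on the unscaled dataCategories).
      val scoreAndLabels = scaledDataCasts.map{  point =>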
        (model.predict(point.features), point.label)
      }
      val metrics = new BinaryClassificationMetrics(scoreAndLabels)
      (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
    }
    val nbMetrics2 = Seq(nbModelScaledCats).map{ model =>
      val scoreAndLabels = nbDataCategories.map{  point =>
        val score = model.predict(point.features)
        (if (score > 0.5) 1.0 else 0.0, point.label)
      }
      val metrics = new BinaryClassificationMetrics(scoreAndLabels)
      (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
    }
    val dtMetrics2 = Seq(dtModelScaledCats).map{ model =>
      val scoreAndLabels = dataCategories.map { point =>
        val score  = model.predict(point.features)
        (if (score > 0.5) 1.0 else 0.0, point.label)
      }
      val metrics = new BinaryClassificationMetrics(scoreAndLabels)
      (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
    }
    val allMetrics2 = metrics2 ++ nbMetrics2 ++ dtMetrics2
    allMetrics2.foreach{ case (model,pr,roc) =>
      println(f"$model, Area under PR : ${pr * 100.0}%2.4f%%,Area under ROC: ${roc * 100.0}%2.4f%%")
    }

    //Earlier results (numeric features only):
    //LogisticRegressionModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
    //SVMModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
    //NaiveBayesModel, Area under PR : 68.0851%,Area under ROC: 58.3559%
    //DecisionTreeModel, Area under PR : 74.3081%,Area under ROC: 64.8837%

    //Results recorded with the category features added:
    //LogisticRegressionModel, Area under PR : 75.6015%,Area under ROC: 52.1977%
    //SVMModel, Area under PR : 75.5180%,Area under ROC: 54.1606%
    //NaiveBayesModel, Area under PR : 68.3386%,Area under ROC: 58.6397%
    //DecisionTreeModel, Area under PR : 75.8784%,Area under ROC: 66.5005%


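    //--------Using the correct form of data----------------------------------------------
    //Now use only the category features, i.e. the first 14 dimensions, because 1-of-k encoded categorical features fit the naive Bayes model better.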
    val nbDataOnlyCategories = records.map{ r =>
      val trimmed = r.map(_.replaceAll("\"",""))
      val label = trimmed(r.size - 1).toInt
      val categoryIdx = categories(r(3))
      val categoryFeatures = Array.ofDim[Double](numCategories)
      categoryFeatures(categoryIdx) = 1.0
      LabeledPoint(label, Vectors.dense(categoryFeatures))
    }
    println("First row: " + nbDataOnlyCategories.first())

    val nbModelScaledOnlyCats = NaiveBayes.train(nbDataOnlyCategories) // naive Bayes trained on the category features only (not standardized)
    val nbMetricsOnlyCats = Seq(nbModelScaledOnlyCats).map{ model =>
      val scoreAndLabels = nbDataOnlyCategories.map{  point =>
        val score = model.predict(point.features)
        (if (score > 0.5) 1.0 else 0.0, point.label)
      }
      val metrics = new BinaryClassificationMetrics(scoreAndLabels)
      (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
    }
    nbMetricsOnlyCats.foreach{ case (model,pr,roc) =>
      println(f"$model, Area under PR : ${pr * 100.0}%2.4f%%,Area under ROC: ${roc * 100.0}%2.4f%%")
    }
    //NaiveBayesModel, Area under PR : 74.0522%,Area under ROC: 60.5138%
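    //Compared with the earlier result:
    //NaiveBayesModel, Area under PR : 68.0851%,Area under ROC: 58.3559%
    //area under ROC improves by about 2 percentage points and area under PR by about 6.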

    val nbTotalCorrectScaledOnlyCats = nbDataOnlyCategories.map{ point =>
      if(nbModelScaledOnlyCats.predict(point.features) == point.label) 1 else 0
    }.sum
    val  nbAccuracyScaledOnlyCats =  nbTotalCorrectScaledOnlyCats / numData
    println(" nbModel accuracy: " +  nbAccuracyScaledOnlyCats)
    // nbModel accuracy: 0.6096010818120352
    //Compared with the earlier result:
    // nbModel accuracy: 0.5832319134550372
    //an improvement of about 2.6 percentage points.


    //--------Parameter tuning: linear models--------
    scaledDataCasts.cache()

    //(1) Effect of the number of iterations
    val iterResults = Seq(1,5,10,50).map{ param =>
      val model = trainLRWithParams(scaledDataCasts, 0.0, param, new SimpleUpdater, 1.0)
      createMetrics(s"$param iterations", scaledDataCasts, model)
    }
    iterResults.foreach{  case (param, auc) =>
      println(f"$param, AUC=${auc * 100}%2.4f%%")
    }
    /*1 iterations, AUC=64.9520%
      5 iterations, AUC=66.6161%
      10 iterations, AUC=66.5483%
      50 iterations, AUC=66.8143%*/

    //(2) Effect of the step size
    val stepResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map{ param =>
      val model = trainLRWithParams(scaledDataCasts, 0.0, numItetations, new SimpleUpdater, param)
      createMetrics(s"$param step size", scaledDataCasts, model)
    }
    stepResults.foreach{  case (param, auc) =>
      println(f"$param, AUC=${auc * 100}%2.4f%%")
    }
    /*0.001 step size, AUC=64.9659%
      0.01 step size, AUC=64.9644%
      0.1 step size, AUC=65.5211%
      1.0 step size, AUC=66.5483%
      10.0 step size, AUC=61.9228%*/


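    //(3) Effect of regularization
    //Note: trainLRWithParams has to pass this parameter on to the optimizer (setRegParam). The identical AUC values recorded below came from a run in which it was not passed, so every setting trained the same model.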
    val regResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map{  param =>
      val model = trainLRWithParams(scaledDataCasts, param, numItetations, new SquaredL2Updater, 1.0)
      createMetrics(s"$param L2 regularization parameter", scaledDataCasts, model)
    }
    regResults.foreach{  case (param, auc) =>
      println(f"$param, AUC=${auc * 100}%2.4f%%")
    }
    /*0.001 L2 regularization parameter, AUC=66.5475%
      0.01 L2 regularization parameter, AUC=66.5475%
      0.1 L2 regularization parameter, AUC=66.5475%
      1.0 L2 regularization parameter, AUC=66.5475%
      10.0 L2 regularization parameter, AUC=66.5475%*/

    //--------Parameter tuning: decision tree--------

    //Vary the maximum tree depth (AUC is measured on the training data, so deeper trees also score higher through overfitting)
    val dtResultsEntropy = Seq(1, 2, 3, 4, 5, 10, 20).map{  param =>
      val model = trainDTWithParams(scaledDataCasts,param, Entropy)
      val scoreAndLabels = scaledDataCasts.map{ point =>
        val score = model.predict(point.features)
        (if (score > 0.5) 1.0 else 0.0, point.label)
      }
      val metrics = new BinaryClassificationMetrics(scoreAndLabels)
      (s"$param tree depth with Entropy", metrics.areaUnderROC())
    }
    dtResultsEntropy.foreach { case (param, auc) =>
      println(f"$param,AUC=${auc * 100}%2.4f%%")
    }
    /*1 tree depth,AUC=59.3268%
      2 tree depth,AUC=59.3268%
      3 tree depth,AUC=61.8313%
      4 tree depth,AUC=62.1519%
      5 tree depth,AUC=66.5005%
      10 tree depth,AUC=75.9120%
      20 tree depth,AUC=96.4347%*/

    //Vary the impurity measure: Gini instead of Entropy
    val dtResultsGini = Seq(1, 2, 3, 4, 5, 10, 20).map{  param =>
      val model = trainDTWithParams(scaledDataCasts,param, Gini)
      val scoreAndLabels = scaledDataCasts.map{ point =>
        val score = model.predict(point.features)
        (if (score > 0.5) 1.0 else 0.0, point.label)
      }
      val metrics = new BinaryClassificationMetrics(scoreAndLabels)
      (s"$param tree depth with Gini", metrics.areaUnderROC())
    }
    dtResultsGini.foreach { case (param, auc) =>
      println(f"$param,AUC=${auc * 100}%2.4f%%")
    }
    /* 1 tree depth with Gini,AUC=59.3268%
     2 tree depth with Gini,AUC=61.6106%
       3 tree depth with Gini,AUC=61.8349%
       4 tree depth with Gini,AUC=62.0433%
       5 tree depth with Gini,AUC=66.4518%
       10 tree depth with Gini,AUC=76.8962%
       20 tree depth with Gini,AUC=98.3514%*/

    //--------Parameter tuning: naive Bayes--------
    val nbResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map{ param =>
      val model = trainNBWithParams(nbDataCategories, param)
      // Score on nbDataCategories, the data the model was trained on
      // (the AUC values recorded below were obtained on scaledDataCasts).
      val scoreAndLabels = nbDataCategories.map{ point =>
        (model.predict(point.features), point.label)
      }
      val metrics = new BinaryClassificationMetrics(scoreAndLabels)
      (s"$param lambda", metrics.areaUnderROC())
    }
    nbResults.foreach { case (param, auc) =>
      println(f"$param,AUC=${auc * 100}%2.4f%%")
    }

    /*0.001 lambda,AUC=61.2364%
      0.01 lambda,AUC=61.3334%
      0.1 lambda,AUC=61.4714%
      1.0 lambda,AUC=61.5605%
      10.0 lambda,AUC=61.8360%*/


    //---------Cross-validation (here a single 60/40 train/test split)---------
    val trainTestSplit = scaledDataCasts.randomSplit(Array(0.6,0.4),123)
    val train = trainTestSplit(0)
    val test = trainTestSplit(1)
    val regResultsTest = Seq(0.0, 0.001, 0.0025, 0.005, 0.01).map{  param =>
      val model = trainLRWithParams(train, param, numItetations, new SquaredL2Updater, 1.0)
      createMetrics(s"$param L2 regularization parameter", test, model)
    }
    regResultsTest.foreach { case (param, auc) =>
      println(f"$param,AUC=${auc * 100}%2.4f%%")
    }
  }

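  /**
    * Trains a logistic regression model with the given parameters.
    * @param input training data
    * @param regParams regularization parameter
    * @param numIterations number of iterations
    * @param updater updater that applies the regularization (e.g. SimpleUpdater, SquaredL2Updater)
    * @param stepSize step size
    * @return the trained logistic regression model
    */
  def trainLRWithParams(input: RDD[LabeledPoint], regParams: Double, numIterations: Int, updater: Updater, stepSize: Double) = {
    val lr = new LogisticRegressionWithSGD()
    // Pass every tuning parameter, including regParams, to the optimizer
    lr.optimizer.setNumIterations(numIterations).setUpdater(updater).setRegParam(regParams).setStepSize(stepSize)
    lr.run(input)
  }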

  def createMetrics(label: String, data: RDD[LabeledPoint], model: ClassificationModel) = {
    val scoreAndLabels = data.map{  point =>
      (model.predict(point.features), point.label)
    }
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    (label, metrics.areaUnderROC())
  }

  def trainDTWithParams(input: RDD[LabeledPoint], maxDepth: Int, impurity: Impurity) = {
    DecisionTree.train(input, Algo.Classification, impurity, maxDepth)
  }

  def trainNBWithParams(input: RDD[LabeledPoint], lambda: Double) = {
    val nb = new NaiveBayes()
    nb.setLambda(lambda)
    nb.run(input)
  }
}
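
The precision and recall formulas given in the comments of the listing can be checked by hand on a small sample. The following is a minimal sketch in plain Scala (no Spark), assuming predictions and labels are 0.0/1.0 doubles as in the listing. `ConfusionSketch` and `Counts` are hypothetical names introduced only for illustration and are not part of the original code.

```scala
// Hypothetical helper for illustration; not part of the original listing.
object ConfusionSketch {
  // Confusion-matrix counts for a binary classifier.
  final case class Counts(tp: Long, fp: Long, tn: Long, fn: Long) {
    // precision = TP / (TP + FP); defined as 0.0 when nothing was predicted positive
    def precision: Double = if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
    // recall (true positive rate) = TP / (TP + FN)
    def recall: Double = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
    // false positive rate = FP / (FP + TN)
    def falsePositiveRate: Double = if (fp + tn == 0) 0.0 else fp.toDouble / (fp + tn)
  }

  // Tally (prediction, label) pairs; values above 0.5 are treated as class 1.
  def count(predictionAndLabels: Seq[(Double, Double)]): Counts =
    predictionAndLabels.foldLeft(Counts(0L, 0L, 0L, 0L)) { case (c, (p, l)) =>
      (p > 0.5, l > 0.5) match {
        case (true, true)   => c.copy(tp = c.tp + 1)
        case (true, false)  => c.copy(fp = c.fp + 1)
        case (false, false) => c.copy(tn = c.tn + 1)
        case (false, true)  => c.copy(fn = c.fn + 1)
      }
    }
}
```

On an RDD the same tally could be done with `aggregate`, but a local `Seq` keeps the arithmetic easy to verify.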

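
The standardization step in the listing (subtract the column mean, divide by the column standard deviation) can be sketched without Spark as follows. This is only an illustration: `StandardizeSketch` is a hypothetical name, and the sketch assumes the sample (n - 1) variance and maps constant columns to 0.0, which matches the zero entry seen for the constant column in the recorded output but is not taken from the library source.

```scala
// Hypothetical helper for illustration; not part of the original listing.
object StandardizeSketch {
  /** Column-wise standardization: (x - mean) / stddev, with stddev from the sample (n - 1) variance. */
  def standardize(rows: Seq[Array[Double]]): Seq[Array[Double]] = {
    require(rows.nonEmpty, "need at least one row")
    val n = rows.size
    val dims = rows.head.length
    // Per-column mean
    val means = Array.tabulate(dims)(j => rows.map(_(j)).sum / n)
    // Per-column sample standard deviation (0.0 when fewer than two rows)
    val stds = Array.tabulate(dims) { j =>
      if (n < 2) 0.0
      else math.sqrt(rows.map(r => math.pow(r(j) - means(j), 2)).sum / (n - 1))
    }
    rows.map { r =>
      Array.tabulate(dims) { j =>
        // A constant column has zero deviation; emit 0.0 instead of dividing by zero.
        if (stds(j) == 0.0) 0.0 else (r(j) - means(j)) / stds(j)
      }
    }
  }
}
```

After this transformation every non-constant column has mean 0 and standard deviation 1, which is why features with very different ranges stop dominating the gradient updates of the linear models.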

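
The 1-of-k encoding used for the category column can be isolated into a small, Spark-free sketch. `OneOfKSketch`, `buildIndex` and `encode` are hypothetical names; the listing does the same thing inline with `categories` and `Array.ofDim`. Returning an all-zero vector for an unseen category is an assumption of this sketch, since the listing would instead fail on a missing key.

```scala
// Hypothetical helper for illustration; not part of the original listing.
object OneOfKSketch {
  // Assign each distinct category a stable index in order of first appearance.
  def buildIndex(categories: Seq[String]): Map[String, Int] =
    categories.distinct.zipWithIndex.toMap

  // Vector of length k with 1.0 at the category's index and 0.0 elsewhere.
  // Assumption: a category missing from the index yields an all-zero vector.
  def encode(index: Map[String, Int], category: String): Array[Double] = {
    val v = Array.ofDim[Double](index.size)
    index.get(category).foreach(i => v(i) = 1.0)
    v
  }
}
```

For example, with the index built from `Seq("business", "sports", "business", "?")`, encoding `"sports"` gives `[0.0, 1.0, 0.0]`.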

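
The AUC figures in the listing are easier to interpret with a direct definition in hand: AUC is the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, counting ties as one half. The sketch below computes this pairwise in plain Scala. `AucSketch` is a hypothetical name, and the quadratic loop is meant for small samples only, not as a replacement for `BinaryClassificationMetrics`.

```scala
// Hypothetical helper for illustration; not part of the original listing.
object AucSketch {
  /** Pairwise AUC: fraction of (positive, negative) pairs ranked correctly, ties count 1/2. */
  def auc(scoreAndLabels: Seq[(Double, Double)]): Double = {
    val pos = scoreAndLabels.collect { case (s, l) if l > 0.5 => s }
    val neg = scoreAndLabels.collect { case (s, l) if l <= 0.5 => s }
    require(pos.nonEmpty && neg.nonEmpty, "need at least one sample of each class")
    // Compare every positive score with every negative score.
    val wins = pos.flatMap(p => neg.map(n => if (p > n) 1.0 else if (p == n) 0.5 else 0.0))
    wins.sum / (pos.size.toLong * neg.size)
  }
}
```

With raw scores `Seq((0.9, 1.0), (0.6, 0.0), (0.4, 1.0), (0.1, 0.0))` the result is 0.75, while the same scores thresholded at 0.5 give 0.5, because thresholding discards the ranking within each predicted class. This may be relevant to the listing: in MLlib, `LogisticRegressionModel` and `SVMModel` return 0/1 labels from `predict` unless `clearThreshold()` has been called, so the curves there are built from thresholded predictions.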
