Spark.ML Classification Models: Decision Tree (on the KDD99 Dataset)

Environment: packaged with IDEA + SBT, then uploaded to a Spark cluster to run.

If you are not sure how to package and run a project this way, refer to this blog post (blog link).

First, add the spark-mllib dependency to the project's build.sbt. Note: the packages that need to be imported are listed in the complete code below.

"org.apache.spark" % "spark-mllib_2.11" % "2.3.2" % "provided"

Next, the overall flow. spark.ml is built around the DataFrame API, unlike spark.mllib, which is RDD-based. To load the training and test data you can either use sc.textFile() and then convert the RDD to a DataFrame, or read the data directly as a DataFrame by creating a SparkSession and calling its read interface:

val spark = SparkSession.builder().appName("Kdd99").config("example", "some-value").getOrCreate()
val data = spark.read.csv("/user/Tian/data/kddcup.data")
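The sc.textFile() route mentioned above would look roughly like the sketch below; only the first three of the 42 fields are mapped here for brevity, and spark.sparkContext plays the role of sc.

import spark.implicits._

val dfFromRdd = spark.sparkContext.textFile("/user/Tian/data/kddcup.data")
  .map(_.split(","))
  .map(f => (f(0), f(1), f(2)))   // illustrative: in practice map all 42 fields
  .toDF("duration", "protocol_type", "service")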

A DataFrame is addressed through its header, i.e. the column names. The loaded data has no header, so one has to be added, like this:

val df = data.toDF("duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes", "land",
    "wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in", "num_compromised", "root_shell",
    "su_attempted", "num_root", "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds",
    "is_host_login", "is_guest_login", "count", "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate",
    "srv_rerror_rate", "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate", "dst_host_count",
    "dst_host_srv_count", "dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate", "dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate", "label")

The code above calls .toDF(), passing in the names of all 42 columns.

The machine-learning algorithms in the ML framework expect a DataFrame with just two columns: a features column holding the feature vector of each row, and a label column holding its label. The next steps turn the KDD99 dataset into this format through a series of transformations.
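For reference, a sketch of the schema the final DataFrame should end up with (it is called dataset in the code below; nullable flags are omitted):

dataset.printSchema()
// root
//  |-- label: double
//  |-- features: vector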


Take a look at the format of the dataset:

0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.

The first 41 columns are features and the last column is the label. Columns 2, 3, 4, and 42 (protocol_type, service, flag, and label) are categorical rather than numerical. Since the decision tree is trained on a numeric feature vector, the categorical features must first be converted to numbers, which is what StringIndexer does. A minimal example is shown below; see the official Spark documentation for more details.
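The sketch below mirrors the pattern used in the complete code further down: fit a StringIndexer on one categorical column and append the corresponding numeric index column.

import org.apache.spark.ml.feature.StringIndexer

// Map the string values of protocol_type to numeric indices (the most frequent value becomes 0.0)
val indexer = new StringIndexer()
  .setInputCol("protocol_type")
  .setOutputCol("protocol_typeIndex")
val indexed = indexer.fit(df).transform(df)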

After the conversion, the original categorical columns (now redundant next to the new index columns) can be removed with the DataFrame drop() method, which takes the names of the columns to delete.
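For example, using the variable names from the complete code below, the four original string columns are removed like this:

val df_final = indexed_df.drop("protocol_type").drop("service").drop("flag").drop("label")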

Next, the 41 feature columns are merged into a single column named features; each row of that column becomes one vector (stored as sparse or dense, whichever is more compact). Because spark.read.csv without a schema loads every column as a string, and VectorAssembler requires numeric input columns, the columns are cast to double before merging:

val cols = df_final.columns.map(f => col(f).cast(DoubleType))

Now the columns can be merged, using VectorAssembler:

val assembler = new VectorAssembler().setInputCols(Array("duration", "src_bytes", "dst_bytes", "land",
    "wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in", "num_compromised", "root_shell",
    "su_attempted", "num_root", "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds",
    "is_host_login", "is_guest_login", "count", "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate",
    "srv_rerror_rate", "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate", "dst_host_count",
    "dst_host_srv_count", "dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate", "dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate", "protocol_typeIndex", "serviceIndex", "flagIndex")).setOutputCol("features")

var data_fea: DataFrame = assembler.transform(df_final.select(cols: _*))

In the code above, .setInputCols() takes an array of strings: the names of the 41 columns to merge (the 38 numeric feature columns plus the three new index columns). .setOutputCol() takes the name of the resulting column.
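To sanity-check the result you can peek at the new column; the output shown in the comment is only illustrative, and each row may print as a dense or a sparse vector:

data_fea.select("features").show(3, truncate = false)
// e.g. [0.0,181.0,5450.0,...] or (41,[1,2,11,...],[181.0,5450.0,1.0,...])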

After the merge, the original 41 columns are no longer needed. Since there are quite a few of them, a simple for loop removes them:

  val colNames = data_fea.columns
  var dataset = data_fea

  for (colId <- 0 to 40) {
    dataset = dataset.drop(colNames(colId))
  }
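Alternatively, drop() also accepts several column names in one call, so the loop can be replaced by a single statement:

val dataset = data_fea.drop(colNames.take(41): _*)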

The data is now in the required shape; next, split it randomly into a training set and a test set:

val Array(trainingData, testData) = dataset.randomSplit(Array(0.7, 0.3))
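If you want the split to be reproducible between runs, randomSplit also accepts a seed (the value 1234L here is arbitrary):

val Array(trainingData, testData) = dataset.randomSplit(Array(0.7, 0.3), seed = 1234L)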

The remaining steps follow the standard pipeline pattern: build a feature indexer, create the decision tree model, assemble the Pipeline, define the parameter grid, instantiate the cross-validator, fit it to obtain the best parameter set, apply the best model to the test set, and finally compute the prediction accuracy.

The complete code is as follows:

package ml

import org.apache.spark.ml.classification.{DecisionTreeClassificationModel, DecisionTreeClassifier}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

object Kdd99 extends App {

  val conf = new SparkConf().setAppName("DecisionTree").setMaster("local[16]")
  val sc = new SparkContext(conf)
  sc.setLogLevel("ERROR")

  val spark = SparkSession.builder().appName("Kdd99").config("example", "some-value").getOrCreate()


  // Read the data and add a header (column names)
  val data = spark.read.csv("/user/Tian/data/kddcup.data")

  val df = data.toDF("duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes", "land",
    "wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in", "num_compromised", "root_shell",
    "su_attempted", "num_root", "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds",
    "is_host_login", "is_guest_login", "count", "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate",
    "srv_rerror_rate", "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate", "dst_host_count",
    "dst_host_srv_count", "dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate", "dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate", "label")

  //修改"protocol_type"这一列为数值型
  val indexer_2 = new StringIndexer().setInputCol("protocol_type").setOutputCol("protocol_typeIndex")
  val indexed_2 = indexer_2.fit(df).transform(df)

  //修改"service"这一列为数值型
  val indexer_3 = new StringIndexer().setInputCol("service").setOutputCol("serviceIndex")
  val indexed_3 = indexer_3.fit(indexed_2).transform(indexed_2)

  //修改"flag"这一列为数值型
  val indexer_4 = new StringIndexer().setInputCol("flag").setOutputCol("flagIndex")
  val indexed_4 = indexer_4.fit(indexed_3).transform(indexed_3)

  //修改"label"这一列为数值型
  val indexer_final = new StringIndexer().setInputCol("label").setOutputCol("labelIndex")
  val indexed_df = indexer_final.fit(indexed_4).transform(indexed_4)

  // Drop the original categorical columns
  val df_final = indexed_df.drop("protocol_type").drop("service")
    .drop("flag").drop("label")

  // Assemble the 41 feature columns into a single "features" column
  val assembler = new VectorAssembler().setInputCols(Array("duration", "src_bytes", "dst_bytes", "land",
    "wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in", "num_compromised", "root_shell",
    "su_attempted", "num_root", "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds",
    "is_host_login", "is_guest_login", "count", "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate",
    "srv_rerror_rate", "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate", "dst_host_count",
    "dst_host_srv_count", "dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate", "dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate", "protocol_typeIndex", "serviceIndex", "flagIndex")).setOutputCol("features")

  // Cast every column to double
  val cols = df_final.columns.map(f => col(f).cast(DoubleType))
  var data_fea: DataFrame = assembler.transform(df_final.select(cols: _*))

  // Drop the first 41 columns, keeping only labelIndex and features (data_fea.columns.length = 43)
  val colNames = data_fea.columns
  var dataset = data_fea

  for (colId <- 0 to 40) {
    dataset = dataset.drop(colNames(colId))
  }

  dataset = dataset.withColumnRenamed("labelIndex", "label")

  /*
  Putting the pipeline together
   */
  // Randomly split the data into a training set and a test set
  val Array(trainingData, testData) = dataset.randomSplit(Array(0.7, 0.3))

  // Build the feature indexer
  val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").fit(dataset)

  // Create the decision tree model
  val decisionTree = new DecisionTreeClassifier().setLabelCol("label")
    .setFeaturesCol("indexedFeatures").setImpurity("entropy").setMaxBins(100).setMaxDepth(5).setMinInfoGain(0.01)
  println("创建决策树模型...")

  // Assemble the pipeline
  val dtPipline = new Pipeline().setStages(Array(featureIndexer, decisionTree))
  println("配置流水线...")

  /*
  Model tuning
   */
  // Set up the parameter grid
  val dtParamGrid = new ParamGridBuilder()
    .addGrid(decisionTree.maxDepth, Array(3, 5, 7))
    .build()

  // Instantiate the cross-validator
  val evaluator = new BinaryClassificationEvaluator()
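  // Note (not part of the original post): the indexed KDD99 label has more than two
  // classes, while BinaryClassificationEvaluator is intended for binary labels.
  // A MulticlassClassificationEvaluator would be an alternative here, e.g.:
  //   import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
  //   val evaluator = new MulticlassClassificationEvaluator().setMetricName("accuracy")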
  val dtCV = new CrossValidator()
    .setEstimator(dtPipline)
    .setEvaluator(evaluator)
    .setEstimatorParamMaps(dtParamGrid)
    .setNumFolds(2)

  // Fit the cross-validation model to find the best parameter set, then apply it to the test set
  val dtCVModel = dtCV.fit(trainingData)
  val dtPrediction = dtCVModel.transform(testData)

  // Inspect the parameters of the best decision tree model
  val dtBestModel = dtCVModel.bestModel.asInstanceOf[PipelineModel]
  val dtModel = dtBestModel.stages(1).asInstanceOf[DecisionTreeClassificationModel]
  print("决策树模型深度:")
  println(dtModel.getMaxDepth)

  // Compute the prediction accuracy
  // t: array of predicted values
  // label: array of label values for the same rows (taken from dtPrediction so the two arrays stay aligned)
  // count: number of rows in the test set
  val (t, label, count) = (dtPrediction.select("prediction").collect(),
    dtPrediction.select("label").collect(),
    testData.count().toInt)
  var dt = 0
  for (i <- 0 to count - 1) {
    if (t(i) == label(i)) {
      dt += 1
    }
  }
  // Print the accuracy
  println("Accuracy: " + 1.0 * dt / count)

}

The final run results are shown below:

[Figure: screenshot of the run output]

This is an original article; when reposting, please credit the source: https://blog.csdn.net/weixin_43135846/article/details/83541415
