If you are not sure how to package and run the project, refer to this blog post (blog address).
First, add the spark-mllib dependency to the project's build.sbt. Note: the packages that need to be imported are listed in the complete code below.
"org.apache.spark" % "spark-mllib_2.11" % "2.3.2" % "provided"
The overall workflow is as follows. spark.ml is built on the DataFrame format, unlike spark.mllib, which is RDD-based. When loading the training and test data, you can either use sc.textFile() and then convert the resulting RDD into a DataFrame, or read the data directly as a DataFrame by creating a SparkSession and calling spark.read.csv():
val spark = SparkSession.builder().appName("Kdd99").config("example", "some-value").getOrCreate()
val data = spark.read.csv("/user/Tian/data/kddcup.data")
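For completeness, here is a minimal sketch of the sc.textFile() route mentioned above, assuming a SparkContext named sc is available; only three columns are shown for brevity (converting all 42 columns would use an explicit schema with createDataFrame rather than a tuple):
import spark.implicits._
val rdd = sc.textFile("/user/Tian/data/kddcup.data")
  .map(_.split(","))
  .map(fields => (fields(0), fields(1), fields(2)))  // keep only a few fields for illustration
val dfFromRdd = rdd.toDF("duration", "protocol_type", "service")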
A DataFrame works with a header, i.e. column names. Since the loaded data has no header, one has to be added, as follows:
val df = data.toDF("duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes", "land",
"wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in", "num_compromised", "root_shell",
"su_attempted", "num_root", "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds",
"is_host_login", "is_guest_login", "count", "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate",
"srv_rerror_rate", "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate", "dst_host_count",
"dst_host_srv_count", "dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate", "dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
"dst_host_srv_rerror_rate", "label")
The code above uses the .toDF() method, passing in the names of all 42 columns.
The machine-learning algorithms in the ML framework expect a DataFrame with just two columns: a features column holding the feature vector, and a label column holding the label value. The next step is to turn the KDD99 dataset into this format through a series of transformations.
Take a look at the format of the dataset:
0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.
As you can see, the first 41 columns are features and the last column is the label. Columns 2, 3, 4, and 42 are categorical rather than numeric. Since a decision tree is used here, the categorical features must be converted to numeric values, which can be done with the StringIndexer transformer; see the official documentation for usage examples.
After the conversion, the original categorical columns (the indexed columns are added alongside them) can be removed with the DataFrame drop() method, which takes the column name as its argument, as sketched below.
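A minimal sketch of both steps for a single column (protocol_type); the output column name protocol_typeIndex matches the one used in the complete code further below:
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("protocol_type")
  .setOutputCol("protocol_typeIndex")
val indexed = indexer.fit(df).transform(df)
// drop the original categorical column once the indexed column exists
val cleaned = indexed.drop("protocol_type")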
Next, the 41 feature columns are merged into a single column named features; each row of the merged column becomes a (sparse) vector. Because the columns being merged must be numeric, they are first cast to double:
val cols = df_final.columns.map(f => col(f).cast(DoubleType))
Now the columns can be merged, using VectorAssembler as follows:
val assembler = new VectorAssembler().setInputCols(Array("duration", "src_bytes", "dst_bytes", "land",
"wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in", "num_compromised", "root_shell",
"su_attempted", "num_root", "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds",
"is_host_login", "is_guest_login", "count", "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate",
"srv_rerror_rate", "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate", "dst_host_count",
"dst_host_srv_count", "dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate", "dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
"dst_host_srv_rerror_rate", "protocol_typeIndex", "serviceIndex", "flagIndex")).setOutputCol("features")
var data_fea: DataFrame = assembler.transform(df_final.select(cols: _*))
In the code above, .setInputCols() takes a string array containing the names of the 41 feature columns to be merged, and .setOutputCol() takes the name of the new merged column.
After merging, the original 41 columns need to be removed. Since there are quite a few of them, a for loop does the job:
val colNames = data_fea.columns
var dataset = data_fea
for (colId <- 0 to 40) {
dataset = dataset.drop(colNames(colId))
}
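As a side note, the loop above is equivalent to a single call to the variadic drop() overload available in Spark 2.x (a sketch, not used in the complete code below):
val dataset = data_fea.drop(colNames.take(41): _*)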
Now that the data is in the right format, it needs to be randomly split into a training set and a test set:
val Array(trainingData, testData) = dataset.randomSplit(Array(0.7, 0.3))
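If a reproducible split is needed, randomSplit also accepts a seed; this is a usage note, not part of the original code:
val Array(trainingData, testData) = dataset.randomSplit(Array(0.7, 0.3), seed = 1234L)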
The remaining steps follow the standard pipeline pattern: build the feature indexer, create the decision tree model, configure the pipeline, configure the parameter grid, instantiate the cross-validator, fit the cross-validation model to obtain the best parameter set, test the model, and finally compute the prediction accuracy.
The complete code is as follows:
package ml
import org.apache.spark.ml.classification.{DecisionTreeClassificationModel, DecisionTreeClassifier}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
object Kdd99 extends App {
val conf = new SparkConf().setAppName("DecisionTree").setMaster("local[16]")
val sc = new SparkContext(conf)
sc.setLogLevel("ERROR")
val spark = SparkSession.builder().appName("Kdd99").config("example", "some-value").getOrCreate()
//Read the data and add column names
val data = spark.read.csv("/user/Tian/data/kddcup.data")
val df = data.toDF("duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes", "land",
"wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in", "num_compromised", "root_shell",
"su_attempted", "num_root", "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds",
"is_host_login", "is_guest_login", "count", "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate",
"srv_rerror_rate", "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate", "dst_host_count",
"dst_host_srv_count", "dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate", "dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
"dst_host_srv_rerror_rate", "label")
//Convert the "protocol_type" column to a numeric index
val indexer_2 = new StringIndexer().setInputCol("protocol_type").setOutputCol("protocol_typeIndex")
val indexed_2 = indexer_2.fit(df).transform(df)
//Convert the "service" column to a numeric index
val indexer_3 = new StringIndexer().setInputCol("service").setOutputCol("serviceIndex")
val indexed_3 = indexer_3.fit(indexed_2).transform(indexed_2)
//Convert the "flag" column to a numeric index
val indexer_4 = new StringIndexer().setInputCol("flag").setOutputCol("flagIndex")
val indexed_4 = indexer_4.fit(indexed_3).transform(indexed_3)
//Convert the "label" column to a numeric index
val indexer_final = new StringIndexer().setInputCol("label").setOutputCol("labelIndex")
val indexed_df = indexer_final.fit(indexed_4).transform(indexed_4)
//Drop the original categorical columns
val df_final = indexed_df.drop("protocol_type").drop("service")
.drop("flag").drop("label")
//Merge the 41 feature columns into a single features column
val assembler = new VectorAssembler().setInputCols(Array("duration", "src_bytes", "dst_bytes", "land",
"wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in", "num_compromised", "root_shell",
"su_attempted", "num_root", "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds",
"is_host_login", "is_guest_login", "count", "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate",
"srv_rerror_rate", "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate", "dst_host_count",
"dst_host_srv_count", "dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate", "dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
"dst_host_srv_rerror_rate", "protocol_typeIndex", "serviceIndex", "flagIndex")).setOutputCol("features")
//Cast the columns to double
val cols = df_final.columns.map(f => col(f).cast(DoubleType))
var data_fea: DataFrame = assembler.transform(df_final.select(cols: _*))
//Drop the first 41 columns, keeping only the labelIndex and features columns (data_fea.columns.length = 43)
val colNames = data_fea.columns
var dataset = data_fea
for (colId <- 0 to 40) {
dataset = dataset.drop(colNames(colId))
}
dataset = dataset.withColumnRenamed("labelIndex", "label")
/*
Assemble the pipeline
*/
//Randomly split the data into a training set and a test set
val Array(trainingData, testData) = dataset.randomSplit(Array(0.7, 0.3))
//Build the feature indexer
val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").fit(dataset)
//Create the decision tree model
val decisionTree = new DecisionTreeClassifier().setLabelCol("label")
.setFeaturesCol("indexedFeatures").setImpurity("entropy").setMaxBins(100).setMaxDepth(5).setMinInfoGain(0.01)
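// Hyper-parameter notes (added for clarity): "entropy" selects the information-gain
// impurity measure; maxBins(100) caps the number of bins used to discretize continuous
// features and must be at least the number of categories of any categorical feature;
// maxDepth(5) limits the tree depth; minInfoGain(0.01) discards splits whose gain is below 0.01.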
println("Creating the decision tree model...")
//Configure the pipeline
val dtPipline = new Pipeline().setStages(Array(featureIndexer, decisionTree))
println("Configuring the pipeline...")
/*
Model optimization
*/
//Configure the parameter grid
val dtParamGrid = new ParamGridBuilder()
.addGrid(decisionTree.maxDepth, Array(3, 5, 7))
.build()
//Instantiate the cross-validator
val evaluator = new BinaryClassificationEvaluator()
val dtCV = new CrossValidator()
.setEstimator(dtPipline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(dtParamGrid)
.setNumFolds(2)
//Fit the cross-validation model to obtain the best parameter set, then test the model
val dtCVModel = dtCV.fit(trainingData)
val dtPrediction = dtCVModel.transform(testData)
//Inspect the parameters of the best decision tree model
val dtBestModel = dtCVModel.bestModel.asInstanceOf[PipelineModel]
val dtModel = dtBestModel.stages(1).asInstanceOf[DecisionTreeClassificationModel]
print("Decision tree depth: ")
println(dtModel.getMaxDepth)
//Compute the prediction accuracy
//t: array of decision tree predictions
//label: array of test-set label values
//count: number of rows in the test set
val (t, label, count) = (dtPrediction.select("prediction").collect,
testData.select("label").collect(),
testData.count().toInt)
var dt = 0
for (i <- 0 to count - 1) {
if (t(i) == label(i)) {
dt += 1
}
}
//Print the accuracy
println("Accuracy: " + 1.0 * dt / count)
}
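As a side note, the accuracy can also be computed without collecting the rows to the driver, by comparing the prediction and label columns directly; this is a sketch of an alternative, not the approach used above:
val correct = dtPrediction.filter(col("prediction") === col("label")).count()
println("Accuracy: " + correct.toDouble / testData.count())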
The final output of the run is shown in the figure below:
Original article; please credit the source when reposting: https://blog.csdn.net/weixin_43135846/article/details/83541415