A Simple Spark ML Example

Contents

  • 1-Setup
  • 2-Workflow
  • 3-Notes
  • 4-Project

1-Setup

First, I set up a standalone Spark 2.4.1 (without Hadoop) on a virtual machine. Then, from IDEA on my local machine, I ran Spark remotely to work through a small SVM example.
The sbt file:

name := "spark_ml_examples"
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.1"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.4.1"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.4.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.1"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.4.1"
libraryDependencies += "org.json4s" %% "json4s-jackson" % "{latestVersion}"

The kafka and streaming dependencies are not actually used here; they are leftovers from earlier code that I never removed. As for finding the jar packages, here are a few sites you can search, links given directly:
https://www.mvnjar.com/org.apache.spark/list.html
https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-assembly
https://search.maven.org/artifact/org.apache.spark/spark-streaming-kafka-0-10_2.11/2.4.1/jar

2-Workflow

The whole workflow is fairly simple, just the usual machine-learning procedure, except that the processing runs on Spark.
The code, pasted directly:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature._
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.classification.LinearSVC
import org.apache.spark.sql.SparkSession

object spark_svm {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "F:\\hadoop-common-2.2.0-bin")
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)

//    Create the SparkSession
    val spark = SparkSession
      .builder
      .appName("svm_example")
      .master("local[2]")
      .getOrCreate()

//    Load the data (LIBSVM format)
    val data = spark.read.format("libsvm").load("./data/sample_libsvm_data.txt")
    data.show(5)

//    Standardize the features (zero mean, unit variance)
    val scaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("scaledfeatures")
      .setWithMean(true)
      .setWithStd(true)
    val scalerdata = scaler.fit(data)
    val scaleddata = scalerdata.transform(data).select("label","scaledfeatures").toDF("label","features")
    scaleddata.show(5)

//    Dimensionality reduction with PCA
    val pca = new PCA()
      .setInputCol("features")
      .setOutputCol("pcafeatures")
      .setK(5)
      .fit(scaleddata)
    val pcadata = pca.transform(scaleddata).select("label","pcafeatures").toDF("label","features")
    pcadata.show(5)

//    Split the raw data; the fitted pipeline below applies scaling and PCA itself
    val Array(trainData, testData) = data.randomSplit(Array(0.5, 0.5), seed = 20)
    trainData.count()

//    Create the linear SVM; it reads the PCA output column
    val lsvc = new LinearSVC()
      .setFeaturesCol("pcafeatures")
      .setMaxIter(10)
      .setRegParam(0.1)

//    Build the pipeline: scaling -> PCA -> SVM. Each stage must read the previous
//    stage's output column, so the pipeline gets a fresh PCA estimator wired to
//    "scaledfeatures" (the pca fitted above reads "features" and cannot be reused)
    val pipelinePca = new PCA()
      .setInputCol("scaledfeatures")
      .setOutputCol("pcafeatures")
      .setK(5)
    val pipeline = new Pipeline()
      .setStages(Array(scaler, pipelinePca, lsvc))
//    Train: fitting the pipeline fits scaler, PCA, and SVM on the training split
    val lsvcmodel = pipeline.fit(trainData)

//    Evaluate accuracy on the test set
    val res = lsvcmodel.transform(testData).select("prediction","label")
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")

    val accuracy = evaluator.evaluate(res)
    println(s"Accuracy = ${accuracy}")

    spark.stop()

  }
}
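To reuse the trained pipeline later, the fitted model can be saved and reloaded as a whole. A minimal sketch continuing from the code above; the ./model/svm_pipeline path is illustrative, not part of the project:

import org.apache.spark.ml.PipelineModel

// Persist the fitted scaler, PCA, and SVM together
lsvcmodel.write.overwrite().save("./model/svm_pipeline")
// Reload the whole pipeline and score the test set again
val reloaded = PipelineModel.load("./model/svm_pipeline")
reloaded.transform(testData).select("prediction", "label").show(5)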


3-Notes

Two problems came up while running this. The first was an error saying Hadoop was missing. The fix, found by searching, is to download hadoop-bin.zip, unzip it, and then configure System.setProperty("hadoop.home.dir", "F:\\hadoop-common-2.2.0-bin"). Whatever directory you unzip it into is the directory to point at, as in the sketch below.
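A minimal sketch of that fix, assuming the archive was unzipped to F:\hadoop-common-2.2.0-bin (substitute your own path); the check for winutils.exe is my addition, since on Windows that binary is what hadoop.home.dir needs to locate:

import java.nio.file.{Files, Paths}

// Point hadoop.home.dir at the unzip directory before creating the SparkSession
val hadoopHome = "F:\\hadoop-common-2.2.0-bin"
System.setProperty("hadoop.home.dir", hadoopHome)
// Fail fast with a clear message if winutils.exe is missing from bin\
val winutils = Paths.get(hadoopHome, "bin", "winutils.exe")
require(Files.exists(winutils), s"winutils.exe not found at $winutils")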
The second problem was the master URL, xx.master("local[2]"), which has to match how Spark is being run; a few common forms are sketched below.
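For reference, a sketch of the common master URL forms; the cluster address below is a placeholder, not an actual host:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("svm_example")
  .master("local[2]")                      // local mode with 2 worker threads
  // .master("local[*]")                   // local mode, one thread per CPU core
  // .master("spark://<master-host>:7077") // standalone cluster (placeholder host)
  .getOrCreate()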

4-Project

The whole project, including the jar packages, is in my GitHub and can be downloaded directly.
GitHub repo: https://github.com/Great1414/spark_ml_learn
Dataset: https://github.com/Great1414/spark_ml_learn/tree/master/data
Reference: https://spark.apache.org/docs/latest/ml-classification-regression.html
