Air Quality Prediction and Evaluation with Spark Machine Learning

Evaluate air quality from measured pollutant concentrations.

A sample of the air quality data:

ID,DAYTIME,CITYCODE,SO2,CO,NO2,O3,PM10,PM2_5,AQI,MEASURE,TIMEPOINT

0:110000:20141120,20141120,110000,31,3.939,141,8,368,301,351,6,2014-11-20
0:110000:20141208,20141208,110000,32,1.431,65,37,89,60,82,2,2014-12-08
0:110000:20141220,20141220,110000,10,0.478,25,48,18,9,32,1,2014-12-20
0:110000:20150108,20150108,110000,53,3.305,101,12,176,143,190,4,2015-01-08
0:110000:20150120,20150120,110000,45,2.029,76,23,112,85,113,3,2015-01-20

0:110000:20150212,20150212,110000,17,0.832,47,74,49,36,59,2,2015-02-12

Full dataset: https://pan.baidu.com/s/1uVSpjx4-yQe1gXVpnzHNeQ

Fields are comma-separated. MEASURE is the air quality rating (1: excellent, 2: good, 3: light pollution, 4: moderate pollution, 5: heavy pollution, 6: severe pollution).
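As a quick illustration of the record layout (plain Scala, no Spark needed; the field indices follow the header above), one line can be split on "," and the MEASURE code mapped to its rating name:

```scala
// Field indices per the header: ID(0), DAYTIME(1), CITYCODE(2), SO2(3),
// CO(4), NO2(5), O3(6), PM10(7), PM2_5(8), AQI(9), MEASURE(10), TIMEPOINT(11)
val ratings = Map(1 -> "excellent", 2 -> "good", 3 -> "light pollution",
  4 -> "moderate pollution", 5 -> "heavy pollution", 6 -> "severe pollution")

val line = "0:110000:20141120,20141120,110000,31,3.939,141,8,368,301,351,6,2014-11-20"
val f = line.split(",")
val so2     = f(3).toDouble   // 31.0
val measure = f(10).toInt     // 6
println(s"SO2=$so2, MEASURE=$measure (${ratings(measure)})")
```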

The task: predict the rating from the pollutant measurements.

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

object TestML {

  // Case class for one record; defined at top level so Spark's
  // reflection-based encoders can see it
  case class Air(SO2: Double, CO: Double,
                 NO2: Double, O3: Double, PM10: Double,
                 PM2_5: Double, AQI: Double, MEASURE: Double)

  def main(args: Array[String]): Unit = {
    val dataDir = "file:///d:/docment/air/data/logs.txt"
    val sess = SparkSession.builder().appName("wangjk").master("local[2]")
      .config("spark.testing.memory", "2147480000").getOrCreate()
    val sc = sess.sparkContext

    // Parse: skip the header line, split on ",", extract the numeric fields
    val rd1 = sc.textFile(dataDir)
      .filter(!_.startsWith("ID"))
      .map(_.split(","))
      .map(e => Air(e(3).toDouble, e(4).toDouble, e(5).toDouble, e(6).toDouble,
        e(7).toDouble, e(8).toDouble, e(9).toDouble, e(10).toDouble))

    // Convert the RDD to a DataFrame of (label, features)
    import sess.implicits._
    val trainDF = rd1.map(w => (
      w.MEASURE, Vectors.dense(w.SO2, w.CO, w.NO2, w.O3, w.PM10, w.PM2_5, w.AQI)))
      .toDF("label", "features")

    trainDF.show()

    // Linear regression, capped at 20 iterations
    val lr = new LinearRegression().setMaxIter(20)
    // Fit the model on the training data
    val model = lr.fit(trainDF)

    // Test data: three rows taken from the sample above
    val testDF = sess.createDataFrame(Seq(
      (6.0, Vectors.dense(31, 3.939, 141, 8, 368, 301, 351)),
      (2.0, Vectors.dense(32, 1.431, 65, 37, 89, 60, 82)),
      (1.0, Vectors.dense(10, 0.478, 25, 48, 18, 9, 32)))).toDF("label", "features")

    // Save the model for later reuse
    model.write.overwrite().save("file:///d:/docment/air/model/")

    val tested = model.transform(testDF).select("features", "label", "prediction")
    tested.show()
  }
}

The output:

+--------------------+-----+------------------+
|            features|label|        prediction|
+--------------------+-----+------------------+
|[31.0,3.939,141.0...|  6.0| 6.424981475397239|
|[32.0,1.431,65.0,...|  2.0|2.1950808612353887|
|[10.0,0.478,25.0,...|  1.0|1.2091385222200972|
+--------------------+-----+------------------+

Here label is the true rating from the source data, and prediction is the model's predicted value.
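The regression outputs a continuous value; if a discrete rating is needed, one option (a sketch, not part of the original pipeline) is to round the prediction and clamp it to the valid range 1 to 6:

```scala
// Clamp-and-round a continuous prediction to the nearest rating 1..6.
// Illustrative helper only; the original code keeps the raw prediction.
def toRating(pred: Double): Int = math.max(1, math.min(6, math.round(pred).toInt))

println(toRating(6.424981475397239))  // 6
println(toRating(2.1950808612353887)) // 2
println(toRating(1.2091385222200972)) // 1
```

Applied to the three rows above, the rounded predictions match the true ratings exactly.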


Next, the saved model is loaded to score data:

The test data is the same as above, but the label values are arbitrary: the previous 6.0, 2.0, 1.0 are changed to 1.0, 6.0, 6.0.

This checks whether the supplied label affects the predictions.


import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegressionModel
import org.apache.spark.sql.SparkSession

object MLTest {

  def main(args: Array[String]): Unit = {

    val sess = SparkSession.builder().appName("wangjk").master("local[2]")
      .config("spark.testing.memory", "2147480000").getOrCreate()

    // Load the saved model
    val model = LinearRegressionModel.load("file:///d:/docment/air/model/")

    // Same features as before, but with deliberately wrong labels
    val testDF = sess.createDataFrame(Seq(
      (1.0, Vectors.dense(31, 3.939, 141, 8, 368, 301, 351)),
      (6.0, Vectors.dense(32, 1.431, 65, 37, 89, 60, 82)),
      (6.0, Vectors.dense(10, 0.478, 25, 48, 18, 9, 32)))).toDF("label", "features")

    val tested = model.transform(testDF).select("features", "label", "prediction")
    tested.show()
  }
}

The output:

+--------------------+-----+------------------+
|            features|label|        prediction|
+--------------------+-----+------------------+
|[31.0,3.939,141.0...|  1.0| 6.424981475397239|
|[32.0,1.431,65.0,...|  6.0|2.1950808612353887|
|[10.0,0.478,25.0,...|  6.0|1.2091385222200972|
+--------------------+-----+------------------+

The prediction values are identical to the previous run: transform computes predictions from the features column only, so the label has no effect on the result.
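This is easy to see from what a linear regression model actually computes: the prediction is just the dot product of the learned weights with the feature vector, plus the intercept; the label column never enters the calculation. A minimal sketch (the weights and intercept here are hypothetical, not the ones the model learned):

```scala
// Prediction of a linear model: weights . features + intercept.
// The label plays no role at scoring time.
def predict(weights: Array[Double], intercept: Double, features: Array[Double]): Double =
  weights.zip(features).map { case (w, x) => w * x }.sum + intercept

val w = Array(0.01, 0.1, 0.005)          // hypothetical coefficients
val b = 0.5                              // hypothetical intercept
val x = Array(31.0, 3.939, 141.0)        // a feature vector
// Same features always give the same prediction, whatever label is attached
println(predict(w, b, x))
```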


