Evaluating air quality from measured air data.
Below is a sample of the air data:
ID,DAYTIME,CITYCODE,SO2,CO,NO2,O3,PM10,PM2_5,AQI,MEASURE,TIMEPOINT
0:110000:20141120,20141120,110000,31,3.939,141,8,368,301,351,6,2014-11-20
0:110000:20141208,20141208,110000,32,1.431,65,37,89,60,82,2,2014-12-08
0:110000:20141220,20141220,110000,10,0.478,25,48,18,9,32,1,2014-12-20
0:110000:20150108,20150108,110000,53,3.305,101,12,176,143,190,4,2015-01-08
0:110000:20150120,20150120,110000,45,2.029,76,23,112,85,113,3,2015-01-20
0:110000:20150212,20150212,110000,17,0.832,47,74,49,36,59,2,2015-02-12
More data: https://pan.baidu.com/s/1uVSpjx4-yQe1gXVpnzHNeQ
Fields are separated by commas. MEASURE is the air-quality rating: 1 = excellent, 2 = good, 3 = light pollution, 4 = moderate pollution, 5 = heavy pollution, 6 = severe pollution.
Goal: rate the air quality from the measured data.
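Before the Spark program, the per-record parsing can be sketched in plain Scala (no Spark needed). The Air case class mirrors the columns SO2 through MEASURE; the field indices follow the header above, skipping ID, DAYTIME, CITYCODE, and TIMEPOINT:

```scala
// Feature fields extracted from one comma-separated record.
case class Air(SO2: Double, CO: Double, NO2: Double, O3: Double,
               PM10: Double, PM2_5: Double, AQI: Double, MEASURE: Double)

object ParseDemo {
  // Split on "," and keep columns 3..10:
  // SO2, CO, NO2, O3, PM10, PM2_5, AQI, MEASURE
  def parse(line: String): Air = {
    val e = line.split(",").map(_.trim)
    Air(e(3).toDouble, e(4).toDouble, e(5).toDouble, e(6).toDouble,
        e(7).toDouble, e(8).toDouble, e(9).toDouble, e(10).toDouble)
  }

  def main(args: Array[String]): Unit = {
    // First sample row from the data above
    val sample = "0:110000:20141120,20141120,110000,31,3.939,141,8,368,301,351,6,2014-11-20"
    println(parse(sample))
  }
}
```

The same split-and-index logic appears inside the Spark job below, applied to each line of the input file.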
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

object TestML {
  def main(args: Array[String]): Unit = {
    val dataDir = "file:///d:/docment/air/data/logs.txt"
    val sess = SparkSession.builder().appName("wangjk").master("local[2]")
      .config("spark.testing.memory", "2147480000").getOrCreate()
    val sc = sess.sparkContext

    // Case class holding the fields of one record
    case class Air(SO2: Double, CO: Double, NO2: Double, O3: Double,
                   PM10: Double, PM2_5: Double, AQI: Double, MEASURE: Double)

    // Transform: split each line on "," and keep columns 3..10
    val rd1 = sc.textFile(dataDir).map(_.split(","))
      .map(e => Air(e(3).toDouble, e(4).toDouble, e(5).toDouble, e(6).toDouble,
                    e(7).toDouble, e(8).toDouble, e(9).toDouble, e(10).toDouble))

    // Convert the RDD to a DataFrame with the (label, features) layout
    // expected by Spark ML
    import sess.implicits._
    val trainDF = rd1.map(w => (w.MEASURE,
      Vectors.dense(w.SO2, w.CO, w.NO2, w.O3, w.PM10, w.PM2_5, w.AQI)))
      .toDF("label", "features")
    trainDF.show()

    // Create the linear regression estimator
    val lr = new LinearRegression()
    // Number of iterations
    lr.setMaxIter(20)

    // Fit the model
    val model = lr.fit(trainDF)

    // Test data
    val testDF = sess.createDataFrame(Seq(
      (6.0, Vectors.dense(31, 3.939, 141, 8, 368, 301, 351)),
      (2.0, Vectors.dense(32, 1.431, 65, 37, 89, 60, 82)),
      (1.0, Vectors.dense(10, 0.478, 25, 48, 18, 9, 32))
    )).toDF("label", "features")

    // Save the model for later reuse
    model.write.overwrite().save("file:///d:/docment/air/model/")

    val tested = model.transform(testDF).select("features", "label", "prediction")
    tested.show()
  }
}
The output:
+--------------------+-----+------------------+
| features|label| prediction|
+--------------------+-----+------------------+
|[31.0,3.939,141.0...| 6.0| 6.424981475397239|
|[32.0,1.431,65.0,...| 2.0|2.1950808612353887|
|[10.0,0.478,25.0,...| 1.0|1.2091385222200972|
+--------------------+-----+------------------+
Here label is the rating from the original data, i.e. the true rating stored in the database, and prediction is the predicted value.
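As a quick sanity check, the error on the three rows shown above can be computed by hand in plain Scala. This is only an eyeball check, not a proper evaluation (in practice Spark's RegressionEvaluator would be run on a held-out set):

```scala
object RmseCheck {
  // (label, prediction) pairs copied from the table above
  val rows = Seq(
    (6.0, 6.424981475397239),
    (2.0, 2.1950808612353887),
    (1.0, 1.2091385222200972))

  // Root mean squared error over the given pairs
  def rmse(pairs: Seq[(Double, Double)]): Double =
    math.sqrt(pairs.map { case (l, p) => (l - p) * (l - p) }.sum / pairs.size)

  def main(args: Array[String]): Unit =
    println(f"RMSE on the three rows shown above: ${rmse(rows)}%.4f") // roughly 0.30
}
```

An RMSE well under half a rating step suggests the rounded prediction would land on the correct MEASURE class for these rows.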
Next, the saved model is loaded and used to score data.
The test data are the same as above, but the label values are arbitrary: the previous 6.0, 2.0, 1.0 are changed to 1.0, 6.0, 6.0.
This mainly tests whether the supplied rating influences the result.
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegressionModel
import org.apache.spark.sql.SparkSession

object MLTest {
  def main(args: Array[String]): Unit = {
    val sess = SparkSession.builder().appName("wangjk").master("local[2]")
      .config("spark.testing.memory", "2147480000").getOrCreate()

    // Load the previously saved model
    val model = LinearRegressionModel.load("file:///d:/docment/air/model/")

    // Same feature vectors as before, but with deliberately wrong labels
    val testDF = sess.createDataFrame(Seq(
      (1.0, Vectors.dense(31, 3.939, 141, 8, 368, 301, 351)),
      (6.0, Vectors.dense(32, 1.431, 65, 37, 89, 60, 82)),
      (6.0, Vectors.dense(10, 0.478, 25, 48, 18, 9, 32))
    )).toDF("label", "features")

    val tested = model.transform(testDF).select("features", "label", "prediction")
    tested.show()
  }
}
The output:
+--------------------+-----+------------------+
| features|label| prediction|
+--------------------+-----+------------------+
|[31.0,3.939,141.0...| 1.0| 6.424981475397239|
|[32.0,1.431,65.0,...| 6.0|2.1950808612353887|
|[10.0,0.478,25.0,...| 6.0|1.2091385222200972|
+--------------------+-----+------------------+
The prediction values are identical to the previous run: transform reads only the features column, so the label values have no effect on the predictions.
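Since linear regression outputs a real number while MEASURE is a discrete rating from 1 to 6, a final post-processing step is needed to report a class. Rounding and clamping is one simple choice (this step is an addition, not part of the program above):

```scala
object MeasureMapper {
  // Map a continuous regression output back to the discrete MEASURE
  // scale: round to the nearest integer and clamp into [1, 6].
  def toMeasure(prediction: Double): Int =
    math.min(6, math.max(1, math.round(prediction).toInt))

  def main(args: Array[String]): Unit = {
    // Predictions from the tables above
    val predictions = Seq(6.424981475397239, 2.1950808612353887, 1.2091385222200972)
    predictions.foreach(p => println(s"$p -> ${toMeasure(p)}"))
  }
}
```

For the three rows shown, rounding recovers the true ratings 6, 2, and 1. An alternative would be to treat this as classification from the start, e.g. with Spark's LogisticRegression or a decision tree, which predicts the class directly.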