Converting a Spark RDD to a DataFrame and changing column types

Text data:

<BaseValue::CD type=全数>
@ id name value unit
// 序号 基准值名 基准值 量纲
# 1 CD.功率.0 100 MVA
# 2 CD.电压.1000 1000 KV
# 3 CD.电压.800 800 KV
# 4 CD.电压.750 750 KV
# 5 CD.电压.660 660 KV
# 6 CD.电压.600 600 KV
# 7 CD.电压.525 525 KV
# 8 CD.电压.500 500 KV
# 9 CD.电压.400 400 KV
# 10 CD.电压.330 330 KV

Walkthrough:

// Read the data
val fileRDD: RDD[String] = sparkContext.textFile("C:\\Users\\yanni\\Desktop\\QS文件\\out1\\BaseValueData.txt",3)

// Keep only the data rows (lines beginning with "# ")
val DataRDD: RDD[String] = fileRDD.filter(_.startsWith("#"))
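The filter keeps only the `#`-prefixed data rows and drops the section header, column header, and comment lines. A minimal pure-Scala sketch of the same predicate, using sample lines from the text file above:

```scala
object FilterDemo extends App {
  // sample lines copied from the BaseValueData.txt excerpt above
  val lines = Seq(
    "<BaseValue::CD type=全数>",   // section header, dropped
    "@ id name value unit",        // column header, dropped
    "// 序号 基准值名 基准值 量纲", // comment line, dropped
    "# 1 CD.功率.0 100 MVA",       // data row, kept
    "# 2 CD.电压.1000 1000 KV"     // data row, kept
  )
  // startsWith("#") is slightly safer than contains("#"),
  // since it cannot match a "#" appearing mid-line
  val dataLines = lines.filter(_.startsWith("#"))
  dataLines.foreach(println)
}
```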

// Import implicit conversions (needed for toDF)
import sparkSession.implicits._

// Build the DataFrame and print its schema
val DataDF = DataRDD.map { row =>
  // strip the leading "# " and split the remaining fields on spaces (once, not per field)
  val fields = row.substring(2).split(" ")
  (fields(0), fields(1), fields(2), fields(3))
}.toDF("id", "name", "value", "unit")
DataDF.printSchema()
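Each surviving row is parsed by stripping the leading `# ` and splitting the rest on single spaces. A pure-Scala sketch of that per-row logic, on one data row from the file:

```scala
object ParseDemo extends App {
  val row = "# 2 CD.电压.1000 1000 KV"
  // drop the two-character "# " prefix, then split the fields on spaces
  val fields = row.substring(2).split(" ")
  val (id, name, value, unit) = (fields(0), fields(1), fields(2), fields(3))
  println(s"id=$id name=$name value=$value unit=$unit")
}
```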

// Change the column data types and print the new schema
val newDF = DataDF.selectExpr("cast(id as int) id","name","cast(value as int) value","unit")
newDF.printSchema()
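Besides `selectExpr` with SQL `cast` expressions, the same type change can be done with `withColumn` and `Column.cast`. A sketch assuming the same `DataDF` and `sparkSession` from above (`newDF2` is a hypothetical name for the result):

```scala
import org.apache.spark.sql.types.IntegerType
import sparkSession.implicits._

// cast id and value to integer; the other columns are left untouched
val newDF2 = DataDF
  .withColumn("id", $"id".cast(IntegerType))
  .withColumn("value", $"value".cast(IntegerType))
newDF2.printSchema()
```

This form avoids embedding SQL strings and lets the compiler check the column expressions.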

// Register the DataFrame as a temporary view
newDF.createTempView("BVDF")

// Show the data
sparkSession.sql("select * from BVDF").show()

Output:

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- value: string (nullable = true)
 |-- unit: string (nullable = true)

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- value: integer (nullable = true)
 |-- unit: string (nullable = true)

+---+----------+-----+----+
| id|      name|value|unit|
+---+----------+-----+----+
|  1|   CD.功率.0|  100| MVA|
|  2|CD.电压.1000| 1000|  KV|
|  3| CD.电压.800|  800|  KV|
|  4| CD.电压.750|  750|  KV|
|  5| CD.电压.660|  660|  KV|
|  6| CD.电压.600|  600|  KV|
|  7| CD.电压.525|  525|  KV|
|  8| CD.电压.500|  500|  KV|
|  9| CD.电压.400|  400|  KV|
| 10| CD.电压.330|  330|  KV|
+---+----------+-----+----+
