1. From a Seq to a DataFrame, using the toDF method (requires an implicit conversion; a setup sketch follows the output below)
val df = Seq((1,"brr"),(2,"hrr"),(3,"xxr")).toDF("id","name")
df.show()
Output:
+---+----+
| id|name|
+---+----+
|  1| brr|
|  2| hrr|
|  3| xxr|
+---+----+
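The toDF method only becomes available after importing the implicit conversions from a SparkSession. A minimal setup sketch, assuming you are running outside spark-shell (in spark-shell, spark and sc already exist; the appName and master values here are just placeholders):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("df-demo").master("local[*]").getOrCreate()
import spark.implicits._ // brings toDF into scope for Seq, List and RDD
val sc = spark.sparkContext // SparkContext used by the RDD examples below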
2. From a List to a DataFrame; works the same as the Seq case above (a note on default column names follows the output below)
val df = List((1,"brr"),(2,"hrr"),(3,"xxr")).toDF("id","name")
df.show()
Output:
+---+----+
| id|name|
+---+----+
|  1| brr|
|  2| hrr|
|  3| xxr|
+---+----+
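If the column names are omitted, toDF keeps the tuple field names _1 and _2; a quick sketch:
val df2 = List((1,"brr"),(2,"hrr"),(3,"xxr")).toDF() // columns are named _1 and _2
df2.printSchema()
printSchema output:
root
 |-- _1: integer (nullable = false)
 |-- _2: string (nullable = true)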
3. From an RDD to a DataFrame, using the toDF method (requires the same implicit conversion; a case-class variant is sketched after the output below)
val rdd = sc.parallelize(List((1,"brr"),(2,"hrr"),(3,"xxr"))) // sc is an existing SparkContext
val df = rdd.toDF("id","name")
df.show()
Output:
+---+----+
| id|name|
+---+----+
|  1| brr|
|  2| hrr|
|  3| xxr|
+---+----+
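A common variant is an RDD of a case class, where toDF picks the column names up from the case class fields; a minimal sketch, reusing sc and the implicit import from step 1 (Person is a hypothetical case class for illustration):
case class Person(id: Int, name: String)
val personRdd = sc.parallelize(List(Person(1,"brr"), Person(2,"hrr"), Person(3,"xxr")))
val personDf = personRdd.toDF() // columns: id, name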
4. From an RDD to a DataFrame, using the createDataFrame method of a SparkSession
First define a schema (StructType) describing the fields, then wrap an RDD of Row objects in it to get a DataFrame with explicit column types and nullability (the resulting schema is sketched after the output below)
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
val schema = StructType(List(StructField("id",IntegerType,nullable = false),StructField("name",StringType,nullable = true)))
val rdd = sc.parallelize(Seq(Row(1,"brr"),Row(2,"hrr"),Row(3,"xxr"))) // sc is an existing SparkContext
val df = sparkSession.createDataFrame(rdd,schema)
df.show()
Output:
+---+----+
| id|name|
+---+----+
|  1| brr|
|  2| hrr|
|  3| xxr|
+---+----+
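The column types and nullability from the schema carry over to the DataFrame and can be checked with printSchema:
df.printSchema()
printSchema output:
root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)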
5. Read a file into a DataFrame via the read method of a SparkSession
The returned DataFrameReader provides methods including: csv, format, jdbc, json, load, option, options, orc, parquet, schema, table, text, textFile
Reading a JSON file (a CSV example follows the JSON output below):
JSON file contents (one JSON object per line):
{"id":1,"name":"brr"}
{"id":2,"name":"hrr"}
{"id":3,"name":"xxr"}
val df = spark.read.json(json_path)
df.show()
Output:
+---+----+
| id|name|
+---+----+
|  1| brr|
|  2| hrr|
|  3| xxr|
+---+----+
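Reading a CSV file works the same way through the read method; a minimal sketch, assuming a file at csv_path (a placeholder like json_path above) whose first line holds the column names:
val csvDf = spark.read.option("header","true").option("inferSchema","true").csv(csv_path) // inferSchema lets Spark guess column types
csvDf.show()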
6. Read a Hive table into a DataFrame via the sql method of a SparkSession (the Hive-enabled session setup is sketched after the output below)
Data in the Hive table rr of database xfy:
rr.id rr.name
1 brr
2 hrr
3 xxr
val df = spark.sql("select * from xfy.rr")
df.show()
Output:
+---+----+
| id|name|
+---+----+
|  1| brr|
|  2| hrr|
|  3| xxr|
+---+----+
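spark.sql can only see Hive tables when the SparkSession was built with Hive support enabled; a minimal setup sketch (appName is a placeholder, and hive-site.xml must be on the classpath):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("hive-demo").enableHiveSupport().getOrCreate()
val df = spark.sql("select * from xfy.rr")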
7. Read a MySQL table into a DataFrame via the read method of a SparkSession (an equivalent read.jdbc form is sketched after the output below)
Data in the MySQL table rr of database xfy:
rr.id rr.name
1 brr
2 hrr
3 xxr
val url = "jdbc:mysql://localhost:3306/xfy"
val df = spark.read.format("jdbc").option("url",url).option("dbtable","rr").option("user","root").option("password","123456").load()
df.show()
Output:
+---+----+
| id|name|
+---+----+
|  1| brr|
|  2| hrr|
|  3| xxr|
+---+----+
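An equivalent form is the read.jdbc shortcut, which takes the connection settings as a java.util.Properties object; a minimal sketch, reusing the url defined above (the MySQL connector jar must be on the driver and executor classpath):
import java.util.Properties
val props = new Properties()
props.setProperty("user", "root")
props.setProperty("password", "123456")
val jdbcDf = spark.read.jdbc(url, "rr", props)
jdbcDf.show()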