Ways to construct a Spark DataFrame

1. From a Seq: call toDF on the sequence; this relies on Spark's implicit conversions (see the setup sketch after this example)

import spark.implicits._   // brings toDF into scope; spark is an existing SparkSession
val df = Seq((1,"brr"),(2,"hrr"),(3,"xxr")).toDF("id","name")
df.show()
Output:
+---+----+
| id|name|
+---+----+
|  1| brr|
|  2| hrr|
|  3| xxr|
+---+----+
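
All of the toDF examples in this post assume an active SparkSession named spark (with its implicits imported) and a SparkContext named sc. A minimal setup sketch; the app name and master are illustrative choices, not part of the original post:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dataframe-demo")   // illustrative app name
  .master("local[*]")          // illustrative master; omit when submitting to a cluster
  .getOrCreate()

import spark.implicits._       // enables toDF on Seq, List and RDD
val sc = spark.sparkContext    // used by the RDD examples below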

2. From a List: same pattern as the Seq case, via toDF

val df = List((1,"brr"),(2,"hrr"),(3,"xxr")).toDF("id","name")
df.show()
Output:
+---+----+
| id|name|
+---+----+
|  1| brr|
|  2| hrr|
|  3| xxr|
+---+----+
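
A small aside: if you call toDF() with no column names, Spark falls back to the default names _1, _2, ...; a quick sketch, assuming the same spark session and implicits import as above:

val unnamedDf = List((1,"brr"),(2,"hrr"),(3,"xxr")).toDF()
unnamedDf.show()   // columns appear as _1 and _2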

3. From an RDD: call toDF on the RDD; this also relies on the implicit conversions

val rdd = sc.parallelize(List((1,"brr"),(2,"hrr"),(3,"xxr")))   // sc is an existing SparkContext
val df = rdd.toDF("id","name")
df.show()
Output:
+---+----+
| id|name|
+---+----+
|  1| brr|
|  2| hrr|
|  3| xxr|
+---+----+
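
A related variant (not shown in the original post) lets Spark infer the column names and types from a case class instead of passing them to toDF; a sketch, with the Person class introduced purely for illustration:

case class Person(id: Int, name: String)   // hypothetical case class for this example

val personRdd = sc.parallelize(Seq(Person(1,"brr"), Person(2,"hrr"), Person(3,"xxr")))
val personDf = personRdd.toDF()   // column names and types come from the case class fields
personDf.show()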

4. From an RDD, using the SparkSession's createDataFrame method
Define a schema (StructType) first, then combine it with an RDD of Row objects to get a DataFrame with explicit field types and nullability

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}
val schema = StructType(List(StructField("id", IntegerType, nullable = false), StructField("name", StringType, nullable = true)))
val rdd = sc.parallelize(Seq(Row(1,"brr"), Row(2,"hrr"), Row(3,"xxr")))   // sc is an existing SparkContext
val df = spark.createDataFrame(rdd, schema)
df.show()
Output:
+---+----+
| id|name|
+---+----+
|  1| brr|
|  2| hrr|
|  3| xxr|
+---+----+
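
To check that the declared field types and nullability were applied, you can print the schema; a quick sketch (the commented output is approximately what Spark prints):

df.printSchema()
// root
//  |-- id: integer (nullable = false)
//  |-- name: string (nullable = true)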

5. Read a file through the SparkSession's read method (a DataFrameReader) to get a DataFrame
The reader handles csv, json, orc, parquet, text/textFile, jdbc and table sources, plus the generic format/load, option/options and schema methods.
Reading a JSON file (a CSV sketch follows the JSON example below):

JSON file contents (one JSON object per line):
{"id":1,"name":"brr"}
{"id":2,"name":"hrr"}
{"id":3,"name":"xxr"}

val df = spark.read.json(json_path)   // json_path: path to the JSON file above
df.show()
Output:
+---+----+
| id|name|
+---+----+
|  1| brr|
|  2| hrr|
|  3| xxr|
+---+----+
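
The other sources follow the same pattern; for example, a CSV file could be read like this (a sketch; csv_path and the header/inferSchema options are illustrative assumptions, not from the original post):

val csv_path = "/path/to/file.csv"   // hypothetical path
val csvDf = spark.read
  .option("header", "true")        // first line holds the column names
  .option("inferSchema", "true")   // infer column types instead of defaulting to string
  .csv(csv_path)
csvDf.show()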

6. Read a Hive table with the SparkSession's sql method to get a DataFrame

Data in Hive table rr of database xfy:
rr.id	rr.name
1	brr
2	hrr
3	xxr

val df = spark.sql("select * from xfy.rr")
df.show()
Output:
+---+----+
| id|name|
+---+----+
|  1| brr|
|  2| hrr|
|  3| xxr|
+---+----+
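
Reading Hive tables this way assumes the SparkSession was built with Hive support enabled (and that the Hive configuration, e.g. hive-site.xml, is on the classpath); a minimal sketch:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-demo")     // illustrative app name
  .enableHiveSupport()      // lets spark.sql resolve tables from the Hive metastore
  .getOrCreate()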

7. Read a MySQL table through the SparkSession's read method (JDBC) to get a DataFrame

Data in MySQL table rr of database xfy:
rr.id	rr.name
1	brr
2	hrr
3	xxr

val url = "jdbc:mysql://localhost:3306/xfy"
val df = spark.read.format("jdbc")
  .option("url", url)
  .option("dbtable", "rr")
  .option("user", "root")
  .option("password", "123456")
  .load()
df.show()
Output:
+---+----+
| id|name|
+---+----+
|  1| brr|
|  2| hrr|
|  3| xxr|
+---+----+
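
Note that this requires the MySQL JDBC driver jar on the classpath (for example via spark-submit --jars or --packages). An equivalent sketch using the read.jdbc overload with a java.util.Properties object:

import java.util.Properties

val props = new Properties()
props.setProperty("user", "root")
props.setProperty("password", "123456")
val jdbcDf = spark.read.jdbc(url, "rr", props)   // url is the same JDBC URL as above
jdbcDf.show()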
