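Define a case class. An RDD whose elements are instances of a case class can be converted implicitly to a DataFrame (called SchemaRDD in early Spark versions), registered as a table, and then queried through sqlContext or SparkSession.
Given a sample movie data file, film.txt, we define a case class (Film), read the data file, implicitly convert it into a DataFrame (filmDF), register filmDF with the SparkSession as the temporary view film, and finally query that view.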
hdfs dfs -put /opt/data/film.txt /user/hadoop
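import org.apache.spark.sql.SparkSession

// Field order and types follow the schema listed in the createDataFrame example.
// Define the case class outside the method that calls toDF() so its schema can be derived.
case class Film(filmid: String, director: String, name: String, release_time: String, box_office: Int, style: String, score: Float)

val session = SparkSession.builder()
  .appName("Case Class To Define RDD")
  .config("spark.some.config.option", "some-value")
  .master("local[*]")
  .getOrCreate()
// Brings toDF() and the encoders needed by map() into scope
import session.implicits._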
// Read the file as an RDD[String], one record per line
val filmRdd = session.sparkContext.textFile("hdfs://hdfs-cluster/user/hadoop/film.txt")
val filmDF = filmRdd
  .map(_.split(","))
  .map(fields => Film(fields(0), fields(1), fields(2), fields(3), fields(4).trim.toInt, fields(5), fields(6).trim.toFloat))
  .toDF()
// Register the DataFrame as a temporary view so it can be queried with SQL
filmDF.createOrReplaceTempView("film")
val results = session.sql("SELECT name, director, style, score FROM film WHERE score > 5.0")
// Convert each Row of the result into a typed tuple
val filmDS = results.map(film => {
  val name = film.getAs[String]("name")
  val director = film.getAs[String]("director")
  val style = film.getAs[String]("style")
  val score = film.getAs[Float]("score")
  (name, director, style, score)
})
filmDS.show(10)
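The map step above indexes fields(0) through fields(6) directly and calls toInt and toFloat, so a single short or malformed line would fail the whole job. A minimal sketch of a more defensive parser follows. It assumes the seven-column layout given by the schema in the createDataFrame example; the FilmParser object is a hypothetical helper, not part of Spark, and the Film definition is repeated only so the block stands alone.

```scala
import scala.util.Try

// Repeated here so the sketch is self-contained; field order is an assumption
// taken from the schema in the createDataFrame example.
final case class Film(filmid: String, director: String, name: String,
                      release_time: String, box_office: Int, style: String, score: Float)

// Hypothetical helper: turns one CSV line into a Film, or None if the line is unusable.
object FilmParser {
  def parse(line: String): Option[Film] = {
    // limit -1 keeps trailing empty fields, so the length check stays meaningful
    val f = line.split(",", -1).map(_.trim)
    if (f.length != 7) None
    else Try(Film(f(0), f(1), f(2), f(3), f(4).toInt, f(5), f(6).toFloat)).toOption
  }
}
```

With this helper, the two map calls could be replaced by something like filmRdd.flatMap(line => FilmParser.parse(line)).toDF(), which silently drops bad lines. Whether dropping is acceptable, or bad lines should be logged or counted instead, depends on the data.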
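Defining a DataFrame with createDataFrame usually takes three steps:
# Create the initial RDD
# Convert it into an RDD of Row objects
# Build the schema that matches that RDD
Then call the createDataFrame method.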
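import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{FloatType, IntegerType, StringType, StructField, StructType}

val session = SparkSession.builder()
  .appName("Create DataFrame API To Define RDD")
  .config("spark.some.config.option", "some-value")
  .master("local[*]")
  .getOrCreate()
// Brings the encoders needed by map() on the query result into scope
import session.implicits._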
// Step 1: create the initial RDD
val filmRdd = session.sparkContext.textFile("hdfs://hdfs-cluster/user/hadoop/film.txt")
// Step 2: convert each line into a Row
val rowRdd = filmRdd
  .map(_.split(","))
  .map(fields =>
    Row(fields(0), fields(1), fields(2), fields(3), fields(4).trim.toInt, fields(5), fields(6).trim.toFloat)
  )
// Step 3: build the schema. Each field's type must match the type of the value
// at the same position in the Row.
val schema: StructType = StructType(Array(
  StructField("filmid", StringType),
  StructField("director", StringType),
  StructField("name", StringType),
  StructField("release_time", StringType),
  StructField("box_office", IntegerType),
  StructField("style", StringType),
  StructField("score", FloatType)
))
val filmDF = session.createDataFrame(rowRdd, schema)
filmDF.createOrReplaceTempView("film")
val results = session.sql("SELECT name, director, style, score FROM film WHERE score > 5.0")
// Convert each Row of the result into a typed tuple
val filmDS = results.map(film => {
  val name = film.getAs[String]("name")
  val director = film.getAs[String]("director")
  val style = film.getAs[String]("style")
  val score = film.getAs[Float]("score")
  (name, director, style, score)
})
filmDS.show(10)
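One reason to prefer createDataFrame over a case class is that the schema can be assembled at runtime, for example when column names and types are only known from a configuration value. The following is a minimal sketch under that assumption: the columnSpec string format and the FilmSchema object are hypothetical, made up for illustration, while StructType, StructField and the data types come from org.apache.spark.sql.types.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper that builds a StructType from a "name:type" description.
object FilmSchema {
  // Assumed format: comma-separated "column:type" pairs, e.g. loaded from a config file.
  val columnSpec: String =
    "filmid:string,director:string,name:string,release_time:string,box_office:int,style:string,score:float"

  // Map a type name to a Spark SQL DataType; unknown names fall back to StringType.
  def toDataType(typeName: String): DataType = typeName.trim.toLowerCase match {
    case "int"   => IntegerType
    case "float" => FloatType
    case _       => StringType
  }

  // Build one StructField per "column:type" pair, preserving the order of the spec.
  def buildSchema(spec: String): StructType =
    StructType(spec.split(",").map { column =>
      val Array(name, typeName) = column.split(":")
      StructField(name.trim, toDataType(typeName), nullable = true)
    })
}
```

The result of FilmSchema.buildSchema(FilmSchema.columnSpec) could then be passed to session.createDataFrame(rowRdd, schema) in place of the hand-written schema, provided the Row values are converted to the same types in the same order.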