A Deeper Look at DataFrame, Dataset, and RDD

Dataset = RDD + schema

A Dataset is essentially an RDD plus a schema. In many cases the schema is inferred automatically; the simplest schema consists of a single column named value, whose type may be String, Int, and so on.
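For example, a Dataset built directly from plain String values gets exactly that single value column (a minimal sketch; it assumes spark.implicits._ has been imported as in the next snippet, and the res numbers are illustrative):

scala> val names = Seq("bluejoe", "alex").toDS
names: org.apache.spark.sql.Dataset[String] = [value: string]

scala> names.schema
res8: org.apache.spark.sql.types.StructType = StructType(StructField(value,StringType,true))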

The following code creates a Dataset from a Seq of tuples:

scala> import spark.implicits._
import spark.implicits._

scala> val ds = Seq(("bluejoe", 100), ("alex", 200)).toDS
ds: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]

scala> ds.schema
res0: org.apache.spark.sql.types.StructType = StructType(StructField(_1,StringType,true), StructField(_2,IntegerType,false))

scala> ds.collect
res1: Array[(String, Int)] = Array((bluejoe,100), (alex,200))

This Dataset contains two rows, and each record is a Tuple2, e.g. (bluejoe,100).
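To tie this back to "Dataset = RDD + schema": the underlying RDD is always reachable through the rdd method (a quick sketch on the same ds; the res number is illustrative):

scala> ds.rdd.collect
res2: Array[(String, Int)] = Array((bluejoe,100), (alex,200))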

SQL-style queries can be run against this Dataset:

scala> ds.select("_1").collect
res4: Array[org.apache.spark.sql.Row] = Array([bluejoe], [alex])

scala> ds.show
+-------+---+
|     _1| _2|
+-------+---+
|bluejoe|100|
|   alex|200|
+-------+---+
scala> ds.select("_1").show
+-------+
|     _1|
+-------+
|bluejoe|
|   alex|
+-------+
scala> ds.select(ds("_1")).show
+-------+
|     _1|
+-------+
|bluejoe|
|   alex|
+-------+
scala> ds.select($"_1").show
+-------+
|     _1|
+-------+
|bluejoe|
|   alex|
+-------+

ds.select("_1")与ds.select(ds("_1")),以及ds.select($"_1")等价
$"_1"神奇吗?一点都不神奇,$()其实是一个函数:

  implicit class StringToColumn(val sc: StringContext) {
    def $(args: Any*): ColumnName = {
      new ColumnName(sc.s(args: _*))
    }
  }
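So $"_1" simply builds a ColumnName from the string "_1"; a more explicit equivalent uses functions.col (a sketch, producing the same output as above):

scala> import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.col

scala> ds.select(col("_1")).show
+-------+
|     _1|
+-------+
|bluejoe|
|   alex|
+-------+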

Columns also support arithmetic operations:

scala> ds.select(ds("_2")+10).show
+---------+
|(_2 + 10)|
+---------+
|      110|
|      210|
+---------+
scala> ds.select($"_2"+10).show
+---------+
|(_2 + 10)|
+---------+
|      110|
|      210|
+---------+

Operators such as + and - are actually defined on Column (the parent class of ColumnName), so we will not go into detail here.
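For instance, multiplication and comparison work the same way on the same ds (a sketch; the derived column names follow the same "(expr)" pattern shown above):

scala> ds.select(ds("_1"), ds("_2") * 2, ds("_2") > 150).show
+-------+--------+----------+
|     _1|(_2 * 2)|(_2 > 150)|
+-------+--------+----------+
|bluejoe|     200|     false|
|   alex|     400|      true|
+-------+--------+----------+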

A Dataset can also be reshaped with map():

scala> ds.map(x=>(x._1.toUpperCase, x._2+10)).show
+-------+---+
|     _1| _2|
+-------+---+
|BLUEJOE|110|
|   ALEX|210|
+-------+---+

As you can see, map() generates a new schema for the result:

scala> ds.map(x=>(x._1.toUpperCase, x._2+10, true)).show
+-------+---+----+
|     _1| _2|  _3|
+-------+---+----+
|BLUEJOE|110|true|
|   ALEX|210|true|
+-------+---+----+

Besides converting one Tuple into another, a record can also be converted into a case class (the Scala analogue of a JavaBean):

scala> case class Person(name: String, age: Int)
defined class Person

scala> val ds2=ds.map(x=>Person(x._1.toUpperCase, x._2+10))
ds2: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]

scala> ds2.show
+-------+---+
|   name|age|
+-------+---+
|BLUEJOE|110|
|   ALEX|210|
+-------+---+

Note that each row of this new Dataset is now a Person object:

scala> ds2.collect
res36: Array[Person] = Array(Person(BLUEJOE,110), Person(ALEX,210))

Note, however, that not just any object can be put into a Dataset:

scala> import org.apache.spark.sql._
import org.apache.spark.sql._

scala> ds.map(x=>Row(x._1.toUpperCase, x._2+10)).show
<console>:32: error: Unable to find encoder for type stored in a Dataset.  Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._  Support for serializing other types will be added in future releases.
       ds.map(x=>Row(x._1.toUpperCase, x._2+10)).show

DataFrame is an alias for Dataset[Row]
A DataFrame is a Dataset organized into named columns.
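In fact the alias is defined right in the Spark source, in the org.apache.spark.sql package object:

  type DataFrame = Dataset[Row]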

A Dataset can be converted to a DataFrame:

scala> val df=ds.toDF
df: org.apache.spark.sql.DataFrame = [_1: string, _2: int]

scala> df.collect
res33: Array[org.apache.spark.sql.Row] = Array([bluejoe,100], [alex,200])

Notice that each row of the DataFrame is indeed a Row. Looking at the source of toDF():

 def toDF(): DataFrame = new Dataset[Row](sparkSession, queryExecution, RowEncoder(schema))

In effect, toDF() uses a RowEncoder to encode each Tuple as a Row.
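As a side note, toDF() also accepts explicit column names, which is handy for replacing the auto-generated _1/_2 (a sketch; the res number is illustrative):

scala> ds.toDF("name", "score")
res20: org.apache.spark.sql.DataFrame = [name: string, score: int]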

The as() function can also be used to perform the conversion (this requires importing RowEncoder from org.apache.spark.sql.catalyst.encoders):

scala> ds.as[Row](RowEncoder(ds.schema)).collect
res55: Array[org.apache.spark.sql.Row] = Array([bluejoe,100], [alex,200])

The map() function of a DataFrame holds a trap: since a DataFrame is still a Dataset, each row can still be mapped to an arbitrary object (even a non-Row object!):

scala> df.map(x=>(x(0).asInstanceOf[String].toLowerCase, x(1).asInstanceOf[Int]-10)).collect
res43: Array[(String, Int)] = Array((bluejoe,90), (alex,190))

See that? The result of this map() is no longer a DataFrame! If you insist on getting a DataFrame back, you have to fall back on the somewhat awkward toDF():

scala> df.map(x=>(x(0).asInstanceOf[String].toLowerCase, x(1).asInstanceOf[Int]-10)).toDF.collect
res44: Array[org.apache.spark.sql.Row] = Array([bluejoe,90], [alex,190])

Or specify an Encoder explicitly:

scala> df.map{x:Row=>Row(x(0).asInstanceOf[String].toLowerCase, x(1).asInstanceOf[Int]-10)}(RowEncoder(ds.schema)).collect
res52: Array[org.apache.spark.sql.Row] = Array([bluejoe,90], [alex,190])

Awkward? Yes, really awkward!
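If a Dataset of case classes is acceptable in place of a DataFrame, the pain largely goes away, because spark.implicits._ already supplies encoders for case classes (a sketch reusing the Person class defined earlier; the res number is illustrative):

scala> df.map(x => Person(x(0).asInstanceOf[String].toUpperCase, x(1).asInstanceOf[Int] - 10)).collect
res60: Array[Person] = Array(Person(BLUEJOE,90), Person(ALEX,190))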
