Error when applying transformation operations to a Spark DataFrame

    In Spark 2.0 and later, applying a transformation operation to a DataFrame does not produce an error at compile time, but an exception is thrown when the code actually runs, with an error message like this:

<console>:26: error: Unable to find encoder for type stored in a Dataset.  Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._  Support for serializing other types will be added in future releases.
       df.map(_.get(0))
             ^

    Roughly, it says that no encoder can be found for the type stored in the current Dataset. This problem bothered me for quite a while: in the old 1.x versions, map/filter and similar operations could be used directly, and I could not find any related material online either.
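
    For context, here is a minimal sketch of how the error can be reproduced (the DataFrame below is my own illustration; it assumes a Spark 2.x spark-shell session where spark is the SparkSession):

   val df = spark.range(3).toDF("id")    // a DataFrame, i.e. Dataset[Row]

   // Row.get(0) returns Any, and Spark provides no Encoder[Any], so this line
   // triggers the "Unable to find encoder" error shown above; spark-shell
   // already imports spark.implicits._ automatically, but there is no encoder for Any:
   df.map(_.get(0))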

    So I went through the official site and found the following description in the definition of Dataset:

Operations available on Datasets are divided into transformations and actions. Transformations are the ones that produce new Datasets, and actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions count, show, or writing data out to file systems.

Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. To explore the logical plan as well as optimized physical plan, use the explain function.

To efficiently support domain-specific objects, an Encoder is required. The encoder maps the domain specific type T to Spark's internal type system. For example, given a class Person with two fields, name (string) and age (int), an encoder is used to tell Spark to generate code at runtime to serialize the Person object into a binary structure. This binary structure often has much lower memory footprint as well as are optimized for efficiency in data processing (e.g. in a columnar format). To understand the internal binary representation for data, use the schema function.

There are typically two ways to create a Dataset. The most common way is by pointing Spark to some files on storage systems, using the read function available on a SparkSession.


   val people = spark.read.parquet("...").as[Person]  // Scala
   Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class)); // Java

    The passage above says that to work with domain-specific objects efficiently, an encoder must be specified; the encoder maps the specified type T to Spark's internal type system. This is presumably a change the Spark developers made to the DataFrame structure to improve performance. As for "mapping to Spark's internal types", my guess was that this refers to the RDD layer; in other words, the encoder is used to convert the elements of the DataFrame so that transformation operations can then be applied to them. The official docs also give an example of this, namely the as[T] operation.
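
    Here is a small sketch of that as[T] route (the Person case class and the sample rows are my own illustration; it again assumes a spark-shell style session where spark is the SparkSession):

   case class Person(name: String, age: Int)

   import spark.implicits._                  // brings Encoder[Person] and the toDF/toDS helpers into scope

   val peopleDF = Seq(Person("Ann", 30), Person("Bob", 25)).toDF()  // plain DataFrame, i.e. Dataset[Row]
   val people   = peopleDF.as[Person]        // attach the encoder: Dataset[Person]
   people.map(_.name).show()                 // the typed map now compiles and runs

    This is essentially what the quoted docs demonstrate; outside spark-shell the only extra step is importing spark.implicits._ yourself.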

    Later I went through the source code: the Dataset class has a lazy val named rdd, whose interface looks like this:

lazy val rdd : org.apache.spark.rdd.RDD[T] = { /* compiled code */ }

    This should be what made things work in 1.x; in 2.x it was presumably made lazy to encourage people to specify an Encoder by hand and thus improve computation performance. If you really do not want to specify an encoder, you can call the rdd member directly to get an RDD[T] (an RDD[Row] in the case of a DataFrame) and then apply the transformation operations on that.
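
    A minimal sketch of that fallback, under the same assumptions as the earlier snippets (a spark-shell session, reusing the df DataFrame from the first example):

   // For a DataFrame, .rdd yields an RDD[Row]; RDD.map needs no Encoder.
   val firstColumns = df.rdd.map(_.get(0))   // RDD[Any], compiles without an encoder
   firstColumns.collect().foreach(println)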


    
