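Spark Documentation - SQL Programming Guide

Overview

Spark SQL is Spark's module for structured data processing. Unlike the basic RDD API, the Spark SQL APIs give Spark more information about the structure of both the data and the computation being performed. Spark SQL uses this extra information to perform optimizations. There are several ways to interact with Spark SQL, including SQL and the Dataset API; whichever one is used to express a computation, the same execution engine runs it. This means developers can switch freely between the APIs.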

SQL

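Spark SQL can execute SQL queries and can also read data from an existing Hive installation.

Datasets and DataFrames

A Dataset is a distributed collection of data. The Dataset API is available only in Scala and Java.

A DataFrame is a Dataset organized into named columns. Conceptually, it is roughly equivalent to a table in a relational database or a data frame in R/Python. DataFrames can be constructed from a wide range of sources, such as structured data files, Hive tables, external databases, or existing RDDs. In Scala and Java, a DataFrame is represented by a Dataset of Row objects; in the Scala API, DataFrame is simply a type alias of Dataset[Row].

Getting Started

Starting Point: SparkSession

The SparkSession class is the entry point to all Spark functionality. Create one with SparkSession.builder: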

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._

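Since Spark 2.0, SparkSession has built-in support for Hive features, including writing queries in HiveQL, using Hive UDFs, and reading data from Hive tables.

Creating DataFrames

With a SparkSession, applications can create DataFrames from an existing RDD, a Hive table, or other data sources. For example, the following creates a DataFrame from a JSON file: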

val df = spark.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

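Untyped Dataset Operations (DataFrame Operations)

DataFrames provide a domain-specific language (DSL) for structured data manipulation. These operations are called "untyped transformations", in contrast to the "typed transformations" available on strongly typed Datasets. Some basic examples of structured data processing follow: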

// This import is needed to use the $-notation
import spark.implicits._
// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)

// Select only the "name" column
df.select("name").show()
// +-------+
// |   name|
// +-------+
// |Michael|
// |   Andy|
// | Justin|
// +-------+

// Select everybody, but increment the age by 1
df.select($"name", $"age" + 1).show()
// +-------+---------+
// |   name|(age + 1)|
// +-------+---------+
// |Michael|     null|
// |   Andy|       31|
// | Justin|       20|
// +-------+---------+

// Select people older than 21
df.filter($"age" > 21).show()
// +---+----+
// |age|name|
// +---+----+
// | 30|Andy|
// +---+----+

// Count people by age
df.groupBy("age").count().show()
// +----+-----+
// | age|count|
// +----+-----+
// |  19|    1|
// |null|    1|
// |  30|    1|
// +----+-----+

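The full list of operations supported by Dataset is available in the API Documentation.

In addition to simple column references and expressions, Datasets also provide functions for string manipulation, date arithmetic, common math operations and more. The complete list is available in the DataFrame Function Reference.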
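As a brief sketch of such functions, the snippet below applies a string function and a null-handling function from org.apache.spark.sql.functions. It assumes an existing SparkSession named spark, as in the examples above; the in-memory people data and the column aliases are illustrative stand-ins, not part of the original guide.

```scala
import org.apache.spark.sql.functions.{coalesce, lit, upper}
import spark.implicits._

// Hypothetical in-memory copy of the people data used above
val people = Seq[(String, Option[Long])](
  ("Michael", None), ("Andy", Some(30L)), ("Justin", Some(19L))
).toDF("name", "age")

val shouted = people.select(
  upper($"name").as("name_upper"),            // string function
  coalesce($"age", lit(0L)).as("age_or_zero") // replace null ages with 0
)
shouted.show()
```

Functions like these return Column expressions, so they compose with the $-notation used in the earlier examples.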

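Running SQL Queries Programmatically

The sql function on a SparkSession lets applications run SQL queries programmatically; the result is returned as a DataFrame: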

// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

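Global Temporary View

Temporary views in Spark SQL are session-scoped and disappear when the session that created them terminates. For a temporary view that lives at the application level and is shared across sessions, create a global temporary view. Global temporary views are tied to the system-preserved database global_temp and must be referred to by their qualified name, e.g. SELECT * FROM global_temp.view1.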

// Register the DataFrame as a global temporary view
df.createGlobalTempView("people")

// Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

// Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

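Creating Datasets

Datasets are similar to RDDs. However, instead of Java serialization or Kryo, they use a specialized Encoder to serialize objects for processing or for transmission over the network. Both encoders and standard serialization turn an object into bytes, but encoders are generated dynamically and use a format that lets Spark perform operations such as filtering and sorting directly on the encoded bytes.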

case class Person(name: String, age: Long)

// Encoders are created for case classes
val caseClassDS = Seq(Person("Andy", 32)).toDS()
caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+

// Encoders for most common types are automatically provided by importing spark.implicits._
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)

// DataFrames can be converted to a Dataset by providing a class. Mapping will be done by name
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

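Interoperating with RDDs

Spark SQL supports two methods for converting existing RDDs into Datasets. The first uses reflection to infer the schema of the data in an RDD. This reflection-based approach leads to more concise code and works well when the schema is already known.

The second method uses a programmatic interface to construct a schema and then apply it to an existing RDD. This method is more verbose.

Inferring the Schema Using Reflection

Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table. The names of the arguments to the case class are read using reflection and become the names of the columns. Case classes can also be nested or contain complex types such as Seq or Array.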

// For implicit conversions from RDDs to DataFrames
import spark.implicits._

// Create an RDD of Person objects from a text file, convert it to a Dataframe
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")

// The columns of a row in the result can be accessed by field index
teenagersDF.map(teenager => "Name: " + teenager(0)).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

// or by field name
teenagersDF.map(teenager => "Name: " + teenager.getAs[String]("name")).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

// No pre-defined encoders for Dataset[Map[K,V]], define explicitly
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
// Primitive types and case classes can be also defined as
// implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder()

// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagersDF.map(teenager => teenager.getValuesMap[Any](List("name", "age"))).collect()
// Array(Map("name" -> "Justin", "age" -> 19))

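Programmatically Specifying the Schema

When case classes cannot be defined ahead of time, a DataFrame can be created programmatically in three steps:

1. Create an RDD of Rows from the original RDD.
2. Create the schema, represented by a StructType, that matches the structure of the Rows.
3. Apply the schema to the RDD of Rows via the createDataFrame method.

For example: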

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Convert records of the RDD (people) to Rows
val rowRDD = peopleRDD
  .map(_.split(","))
  .map(attributes => Row(attributes(0), attributes(1).trim))

// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()
// +-------------+
// |        value|
// +-------------+
// |Name: Michael|
// |   Name: Andy|
// | Name: Justin|
// +-------------+

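Aggregations

The built-in DataFrame functions provide common aggregations such as count(), countDistinct(), avg(), max(), min(), etc. These functions are designed for DataFrames, and some of them can also be used with strongly typed Datasets. Users can also define their own aggregate functions.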
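A minimal sketch of the built-in aggregations, before moving on to user-defined ones. It assumes an existing SparkSession named spark; the in-memory employees data mirrors the employees.json values shown later in this guide, and the result column aliases are hypothetical.

```scala
import org.apache.spark.sql.functions.{avg, count, max, min}
import spark.implicits._

// Hypothetical in-memory data mirroring employees.json
val employees = Seq(
  ("Michael", 3000L), ("Andy", 4500L), ("Justin", 3500L), ("Berta", 4000L)
).toDF("name", "salary")

// Several built-in aggregates computed in a single pass over the data
val stats = employees.agg(
  count("salary").as("n"),
  avg("salary").as("avg_salary"),
  min("salary").as("min_salary"),
  max("salary").as("max_salary")
)
stats.show()
```

The average computed here is the same quantity that the custom MyAverage functions in the following sections reimplement by hand.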

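Untyped User-Defined Aggregate Functions

Users extend the UserDefinedAggregateFunction abstract class to implement a custom untyped aggregate function. For example: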

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.types._

object MyAverage extends UserDefinedAggregateFunction {
  // Data types of input arguments of this aggregate function
  def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil)
  // Data types of values in the aggregation buffer
  def bufferSchema: StructType = {
    StructType(StructField("sum", LongType) :: StructField("count", LongType) :: Nil)
  }
  // The data type of the returned value
  def dataType: DataType = DoubleType
  // Whether this function always returns the same output on the identical input
  def deterministic: Boolean = true
  // Initializes the given aggregation buffer. The buffer itself is a `Row` that in addition to
  // standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides
  // the opportunity to update its values. Note that arrays and maps inside the buffer are still
  // immutable.
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0L
  }
  // Updates the given aggregation buffer `buffer` with new input data from `input`
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + input.getLong(0)
      buffer(1) = buffer.getLong(1) + 1
    }
  }
  // Merges two aggregation buffers and stores the updated buffer values back to `buffer1`
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  // Calculates the final result
  def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}

// Register the function to access it
spark.udf.register("myAverage", MyAverage)

val df = spark.read.json("examples/src/main/resources/employees.json")
df.createOrReplaceTempView("employees")
df.show()
// +-------+------+
// |   name|salary|
// +-------+------+
// |Michael|  3000|
// |   Andy|  4500|
// | Justin|  3500|
// |  Berta|  4000|
// +-------+------+

val result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees")
result.show()
// +--------------+
// |average_salary|
// +--------------+
// |        3750.0|
// +--------------+

Type-Safe User-Defined Aggregate Functions

User-defined aggregations for strongly typed Datasets extend the Aggregator abstract class. For example:

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)

object MyAverage extends Aggregator[Employee, Average, Double] {
  // A zero value for this aggregation. Should satisfy the property that any b + zero = b
  def zero: Average = Average(0L, 0L)
  // Combine two values to produce a new value. For performance, the function may modify `buffer`
  // and return it instead of constructing a new object
  def reduce(buffer: Average, employee: Employee): Average = {
    buffer.sum += employee.salary
    buffer.count += 1
    buffer
  }
  // Merge two intermediate values
  def merge(b1: Average, b2: Average): Average = {
    b1.sum += b2.sum
    b1.count += b2.count
    b1
  }
  // Transform the output of the reduction
  def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
  // Specifies the Encoder for the intermediate value type
  def bufferEncoder: Encoder[Average] = Encoders.product
  // Specifies the Encoder for the final output value type
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

val ds = spark.read.json("examples/src/main/resources/employees.json").as[Employee]
ds.show()
// +-------+------+
// |   name|salary|
// +-------+------+
// |Michael|  3000|
// |   Andy|  4500|
// | Justin|  3500|
// |  Berta|  4000|
// +-------+------+

// Convert the function to a `TypedColumn` and give it a name
val averageSalary = MyAverage.toColumn.name("average_salary")
val result = ds.select(averageSalary)
result.show()
// +--------------+
// |average_salary|
// +--------------+
// |        3750.0|
// +--------------+

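Data Sources

Spark SQL supports operating on a variety of data sources through the DataFrame interface. A DataFrame can be operated on using relational transformations and can also be used to create a temporary view. Once a DataFrame is registered as a temporary view, SQL queries can be run over its data. This section describes the general methods for loading and saving data, along with their options.

Generic Load/Save Functions

In the simplest form, the default data source is used (parquet, unless otherwise configured by spark.sql.sources.default).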

val usersDF = spark.read.load("examples/src/main/resources/users.parquet")
usersDF.select("name", "favorite_color").write.save("namesAndFavColors.parquet")

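Manually Specifying Options

You can also specify the data source type manually, along with any extra options to pass to it. Data sources are specified by their fully qualified name (e.g. org.apache.spark.sql.parquet), but built-in sources can use their short names (json, parquet, jdbc, orc, libsvm, csv, text). Data loaded from one format can be saved in any other.

To load a JSON file: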

val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")

To load a CSV file:

val peopleDFCsv = spark.read.format("csv")
  .option("sep", ";")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("examples/src/main/resources/people.csv")

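Extra options can also be set during write operations. For example, Bloom filters and dictionary encoding can be controlled for ORC data sources, as in the following ORC example: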

usersDF.write.format("orc")
  .option("orc.bloom.filter.columns", "favorite_color")
  .option("orc.dictionary.key.threshold", "1.0")
  .save("users_with_options.orc")

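Run SQL on Files Directly

Instead of using the read API to load a file into a DataFrame and query it, you can also query the file directly with SQL.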

val sqlDF = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")

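Save Modes

Save operations can optionally take a SaveMode, which specifies how to handle existing data if present. Note that none of these modes are atomic. With overwrite, the existing data is deleted before the new data is written.

Scala/Java	Any Language	Meaning
`SaveMode.ErrorIfExists` (default)	`error` or `errorifexists` (default)	When saving, throw an exception if data already exists.
`SaveMode.Append`	`append`	When saving, append the new data to any existing data.
`SaveMode.Overwrite`	`overwrite`	When saving, existing data is overwritten by the new data.
`SaveMode.Ignore`	`ignore`	When saving, if data already exists, the new data is not written and the existing data is left unchanged.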
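The modes in the table can be sketched as follows. This assumes an existing SparkSession named spark; the output path and the sample rows are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.sql.SaveMode
import spark.implicits._

// Hypothetical output path, for illustration only
val out = "/tmp/save_mode_demo.parquet"
val batch = Seq(("Andy", 30L), ("Justin", 19L)).toDF("name", "age")

batch.write.mode(SaveMode.Overwrite).parquet(out) // replaces any existing data
batch.write.mode("append").parquet(out)           // adds the same rows a second time
batch.write.mode(SaveMode.Ignore).parquet(out)    // no-op: data already exists

// Two rows from the overwrite plus two from the append
val total = spark.read.parquet(out).count()
```

Either the SaveMode enum or its string name can be passed to mode; the string form is the one available from every language binding.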

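Saving to Persistent Tables

DataFrames can also be saved as persistent tables in the Hive metastore using the saveAsTable command. If Hive is not configured, Spark creates a default local Hive metastore using Derby. Unlike createOrReplaceTempView, saveAsTable persists the data and stores the table schema in the Hive metastore. A DataFrame for a persistent table can be created by calling the table method on a SparkSession with the name of the table.

For file-based data sources such as text, parquet and json, you can specify a custom table path via the path option, e.g. df.write.option("path", "/some/path").saveAsTable("t"). When the table is dropped, the data under the custom path is not removed. If no path is specified, the data is written to the default warehouse directory and is removed when the table is dropped.

Starting from Spark 2.1, per-partition metadata for persistent tables is also stored in the Hive metastore. This brings several benefits:

- The metastore can return only the partitions a query needs.
- Statements such as ALTER TABLE PARTITION ... SET LOCATION become available.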

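Bucketing, Sorting and Partitioning

For file-based data sources, it is also possible to bucket, sort or partition the output. Bucketing and sorting apply only to persistent tables: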

peopleDF.write.bucketBy(42, "name").sortBy("age").saveAsTable("people_bucketed")

Partitioning can be used with both save and saveAsTable:

usersDF.write.partitionBy("favorite_color").format("parquet").save("namesPartByColor.parquet")

These operations can also be combined:

usersDF
  .write
  .partitionBy("favorite_color")
  .bucketBy(42, "name")
  .saveAsTable("users_partitioned_bucketed")

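Parquet Files

Parquet is a columnar storage format. Spark SQL supports both reading and writing Parquet files and automatically preserves the schema of the data. When writing Parquet files, all columns are automatically converted to be nullable for compatibility reasons.

Loading Data Programmatically

For example: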

// Encoders for most common types are automatically provided by importing spark.implicits._
import spark.implicits._

val peopleDF = spark.read.json("examples/src/main/resources/people.json")

// DataFrames can be saved as Parquet files, maintaining the schema information
peopleDF.write.parquet("people.parquet")

// Read in the parquet file created above
// Parquet files are self-describing so the schema is preserved
// The result of loading a Parquet file is also a DataFrame
val parquetFileDF = spark.read.parquet("people.parquet")

// Parquet files can also be used to create a temporary view and then used in SQL statements
parquetFileDF.createOrReplaceTempView("parquetFile")
val namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19")
namesDF.map(attributes => "Name: " + attributes(0)).show()
// +------------+
// |       value|
// +------------+
// |Name: Justin|
// +------------+

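Partition Discovery

Table partitioning is a common optimization approach in systems like Hive. In a partitioned table, data are usually stored in different directories, with the partitioning column values encoded in the path of each partition directory. All built-in file-based data sources (including Text/CSV/ORC/Parquet) can discover and infer partitioning information automatically. For example, the data used earlier can be stored in a partitioned table with the following directory structure, using gender and country as partitioning columns: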

path
└── to
    └── table
        ├── gender=male
        │   ├── ...
        │   │
        │   ├── country=US
        │   │   └── data.parquet
        │   ├── country=CN
        │   │   └── data.parquet
        │   └── ...
        └── gender=female
            ├── ...
            │
            ├── country=US
            │   └── data.parquet
            ├── country=CN
            │   └── data.parquet
            └── ...

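When path/to/table is passed to either SparkSession.read.parquet or SparkSession.read.load, Spark SQL automatically extracts the partitioning information from the paths. The schema of the returned DataFrame becomes: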

root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- gender: string (nullable = true)
|-- country: string (nullable = true)

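The data types of the partitioning columns are inferred automatically. Numeric, date, timestamp and string types are currently supported. To disable automatic type inference, set spark.sql.sources.partitionColumnTypeInference.enabled to false. With type inference disabled, string type is used for the partitioning columns.

Starting from Spark 1.6, partition discovery by default only finds partitions under the given paths. For example, if path/to/table/gender=male is passed to SparkSession.read.parquet or SparkSession.read.load, gender is not considered a partitioning column. To specify the base path that partition discovery should start from, set basePath in the data source options. For example, when path/to/table/gender=male is the path of the data and basePath is set to path/to/table, gender is a partitioning column.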
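The effect of basePath can be sketched as follows. The snippet assumes an existing SparkSession named spark; the root directory and the two sample rows are hypothetical, written out first so that the example does not depend on pre-existing files.

```scala
import spark.implicits._

// Hypothetical root directory and sample rows, for illustration only
val root = "/tmp/base_path_demo"
Seq(("Andy", 30L, "male", "US"), ("Yin", 25L, "female", "CN"))
  .toDF("name", "age", "gender", "country")
  .write.mode("overwrite")
  .partitionBy("gender", "country")
  .parquet(root)

// Without basePath, discovery starts at gender=male, so gender is not a column
val withoutBase = spark.read.parquet(s"$root/gender=male")

// With basePath, discovery starts at the table root, so gender is recovered
val withBase = spark.read.option("basePath", root).parquet(s"$root/gender=male")

withoutBase.printSchema()
withBase.printSchema()
```

In both reads only the gender=male directory is scanned; basePath changes which path segments are interpreted as partition columns, not which files are loaded.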

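Schema Merging

Like Protocol Buffers, Avro and Thrift, Parquet also supports schema evolution. Users can start with a simple schema and gradually add more columns as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source can automatically detect this case and merge the schemas of all these files.

Since schema merging is a relatively expensive operation and is not needed in most cases, it is turned off by default. It can be enabled in either of two ways:

- set the data source option mergeSchema to true when reading Parquet files, or
- set the global SQL option spark.sql.parquet.mergeSchema to true.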

// This is used to implicitly convert an RDD to a DataFrame.
import spark.implicits._

// Create a simple DataFrame, store into a partition directory
val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")
squaresDF.write.parquet("data/test_table/key=1")

// Create another DataFrame in a new partition directory,
// adding a new column and dropping an existing column
val cubesDF = spark.sparkContext.makeRDD(6 to 10).map(i => (i, i * i * i)).toDF("value", "cube")
cubesDF.write.parquet("data/test_table/key=2")

// Read the partitioned table
val mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
mergedDF.printSchema()

// The final schema consists of all 3 columns in the Parquet files together
// with the partitioning column appeared in the partition directory paths
// root
//  |-- value: int (nullable = true)
//  |-- square: int (nullable = true)
//  |-- cube: int (nullable = true)
//  |-- key: int (nullable = true)

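Hive Metastore Parquet Table Conversion

When reading from and writing to Hive metastore Parquet tables, Spark SQL uses its own Parquet support instead of the Hive SerDe for better performance. This behavior is controlled by the spark.sql.hive.convertMetastoreParquet option, which is turned on by default.

Hive/Parquet Schema Reconciliation

From the perspective of table schema processing, there are two key differences between Hive and Parquet:

- Hive is case insensitive, while Parquet is not.
- Hive considers all columns nullable, while nullability in Parquet is significant.

For these reasons, the Hive metastore schema must be reconciled with the Parquet schema when converting a Hive metastore Parquet table to a Spark SQL Parquet table. The reconciliation rules are:

Fields that have the same name in both schemas must have the same data type. The reconciled field uses the data type of the Parquet side.
Fields that appear in only one of the schemas are handled as follows:
- Fields that appear only in the Parquet schema are dropped from the reconciled schema.
- Fields that appear only in the Hive metastore schema are added to the reconciled schema as nullable fields.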

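Metadata Refreshing

Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table conversion is enabled, the metadata of those converted tables is also cached. If these tables are updated by Hive or other external tools, the metadata must be refreshed manually.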

// spark is an existing SparkSession
spark.catalog.refreshTable("my_table")

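Configuration

Parquet can be configured using the setConf method on SparkSession or by running SET key=value commands in SQL.

Property Name	Default	Meaning
`spark.sql.parquet.binaryAsString`	false	Some Parquet-producing systems, such as Impala and Hive, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems.
`spark.sql.parquet.int96AsTimestamp`	true	For compatibility, this flag tells Spark SQL to interpret INT96 data as a timestamp.
`spark.sql.parquet.compression.codec`	snappy	Sets the compression codec used for Parquet files.
`spark.sql.parquet.filterPushdown`	true	Enables Parquet filter push-down optimization.
`spark.sql.hive.convertMetastoreParquet`	true	When set to false, Spark SQL uses the Hive SerDe for Parquet tables instead of its built-in support.
`spark.sql.parquet.mergeSchema`	false	Whether to merge schemas across Parquet files.
`spark.sql.optimizer.metadataOnly`	true	Enables the metadata-only query optimization, which can avoid full table scans.
`spark.sql.parquet.writeLegacyFormat`	false	When true, data is written in the format used by Spark 1.4 and earlier.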
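A minimal sketch of both ways of setting these properties. It assumes an existing SparkSession named spark whose runtime configuration is reachable as spark.conf (as in Spark 2.x); the chosen values are illustrative, not recommendations.

```scala
// Programmatic form, through the session's runtime configuration
spark.conf.set("spark.sql.parquet.compression.codec", "gzip")

// Equivalent SQL form
spark.sql("SET spark.sql.parquet.mergeSchema=true")

// Read the values back to confirm they took effect
val codec = spark.conf.get("spark.sql.parquet.compression.codec")
val merge = spark.conf.get("spark.sql.parquet.mergeSchema")
println(s"codec=$codec mergeSchema=$merge")
```

Both forms affect only the current session, so a setting made this way does not leak into sessions created with spark.newSession().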

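ORC Files

Since Spark 2.3, Spark provides a vectorized ORC reader for ORC files, along with the configurations described here. The vectorized reader is used for native ORC tables (those created with the clause USING ORC) when spark.sql.orc.impl is set to native and spark.sql.orc.enableVectorizedReader is set to true. For Hive ORC serde tables (those created with USING HIVE OPTIONS (fileFormat 'ORC')), spark.sql.hive.convertMetastoreOrc must also be set to true before the vectorized reader is used.

JSON Files

Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset[Row].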

// Primitive types (Int, String, etc) and Product types (case classes) encoders are
// supported by importing this when creating a Dataset.
import spark.implicits._

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files
val path = "examples/src/main/resources/people.json"
val peopleDF = spark.read.json(path)

// The inferred schema can be visualized using the printSchema() method
peopleDF.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by spark
val teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()
// +------+
// |  name|
// +------+
// |Justin|
// +------+

// Alternatively, a DataFrame can be created for a JSON dataset represented by
// a Dataset[String] storing one JSON object per string
val otherPeopleDataset = spark.createDataset(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val otherPeople = spark.read.json(otherPeopleDataset)
otherPeople.show()
// +---------------+----+
// |        address|name|
// +---------------+----+
// |[Columbus,Ohio]| Yin|
// +---------------+----+

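Hive Tables

Spark SQL supports reading and writing data stored in Hive tables. However, the dependencies that Hive requires are not included in the Spark distribution. To use Hive, its libraries must be placed on the classpath of every node in the Spark cluster; Spark then loads them automatically.

Hive is configured by placing hive-site.xml, core-site.xml (for security configuration) and hdfs-site.xml (for HDFS configuration) in the conf/ directory of the Spark installation.

When working with Hive, the SparkSession must be created with Hive support enabled, which includes connectivity to the Hive metastore, support for Hive SerDes, and Hive user-defined functions. If hive-site.xml is not configured, the Spark context automatically creates metastore_db in the current directory and creates the warehouse directory configured by spark.sql.warehouse.dir.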

import java.io.File

import org.apache.spark.sql.{Row, SaveMode, SparkSession}

case class Record(key: Int, value: String)

// warehouseLocation points to the default location for managed databases and tables
val warehouseLocation = new File("spark-warehouse").getAbsolutePath

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._
import spark.sql

sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sql("SELECT * FROM src").show()
// +---+-------+
// |key|  value|
// +---+-------+
// |238|val_238|
// | 86| val_86|
// |311|val_311|
// ...

// Aggregation queries are also supported.
sql("SELECT COUNT(*) FROM src").show()
// +--------+
// |count(1)|
// +--------+
// |     500|
// +--------+

// The results of SQL queries are themselves DataFrames and support all normal functions.
val sqlDF = sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key")

// The items in DataFrames are of type Row, which allows you to access each column by ordinal.
val stringsDS = sqlDF.map {
  case Row(key: Int, value: String) => s"Key: $key, Value: $value"
}
stringsDS.show()
// +--------------------+
// |               value|
// +--------------------+
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// |Key: 0, Value: val_0|
// ...

// You can also use DataFrames to create temporary views within a SparkSession.
val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.createOrReplaceTempView("records")

// Queries can then join DataFrame data with data stored in Hive.
sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show()
// +---+------+---+------+
// |key| value|key| value|
// +---+------+---+------+
// |  2| val_2|  2| val_2|
// |  4| val_4|  4| val_4|
// |  5| val_5|  5| val_5|
// ...

// Create a Hive managed Parquet table, with HQL syntax instead of the Spark SQL native syntax
// `USING hive`
sql("CREATE TABLE hive_records(key int, value string) STORED AS PARQUET")
// Save DataFrame to the Hive managed table
val df = spark.table("src")
df.write.mode(SaveMode.Overwrite).saveAsTable("hive_records")
// After insertion, the Hive managed table has data now
sql("SELECT * FROM hive_records").show()
// +---+-------+
// |key|  value|
// +---+-------+
// |238|val_238|
// | 86| val_86|
// |311|val_311|
// ...

// Prepare a Parquet data directory
val dataDir = "/tmp/parquet_data"
spark.range(10).write.parquet(dataDir)
// Create a Hive external Parquet table
sql(s"CREATE EXTERNAL TABLE hive_ints(key int) STORED AS PARQUET LOCATION '$dataDir'")
// The Hive external table should already have data
sql("SELECT * FROM hive_ints").show()
// +---+
// |key|
// +---+
// |  0|
// |  1|
// |  2|
// ...

// Turn on flag for Hive Dynamic Partitioning
spark.sqlContext.setConf("hive.exec.dynamic.partition", "true")
spark.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
// Create a Hive partitioned table using DataFrame API
df.write.partitionBy("key").format("hive").saveAsTable("hive_part_tbl")
// Partitioned column `key` will be moved to the end of the schema.
sql("SELECT * FROM hive_part_tbl").show()
// +-------+---+
// |  value|key|
// +-------+---+
// |val_238|238|
// | val_86| 86|
// |val_311|311|
// ...

spark.stop()

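Specifying Storage Format for Hive Tables

When creating a Hive table, you need to define its input format and output format, as well as how its data is serialized and deserialized (the serde). The following options specify these storage settings, e.g. CREATE TABLE src(id int) USING hive OPTIONS(fileFormat 'parquet'). By default, table data is read as plain text.

Property Name	Meaning
`fileFormat`	Six file formats are currently supported: `sequencefile`, `rcfile`, `orc`, `parquet`, `textfile` and `avro`.
`inputFormat` and `outputFormat`	These two options must be specified as a pair.
`serde`	Specifies the name of the serde class.
`fieldDelim`, `escapeDelim`, `collectionDelim`, `mapkeyDelim` and `lineDelim`	These options can only be used with the `textfile` file format.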

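Interacting with Different Versions of Hive

(Omitted)

Connecting to Other Databases with JDBC

Spark SQL can read data from a variety of databases using JDBC. This functionality should be preferred over JdbcRDD, because the results are returned as a DataFrame, which is easier to work with in Spark SQL.

Before using JDBC, the JDBC driver for the database must be placed on the Spark classpath. For example, to connect to PostgreSQL from the Spark shell, run: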

bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar

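Tables from the database can be loaded as a DataFrame or as a temporary view. Users can specify JDBC connection properties in the data source options. The main properties are:

Property Name	Meaning
`url`	The JDBC URL to connect to, e.g. `jdbc:postgresql://localhost/test?user=fred&password=secret`.
`dbtable`	The database table to use.
`query`	A query statement. The query is placed as a subquery in the `FROM` clause.
`driver`	The class name of the JDBC driver.
`partitionColumn`, `lowerBound`, `upperBound`	These options must be used together, and `numPartitions` must be specified as well. They describe how the table is partitioned when it is read in parallel by multiple workers.
`numPartitions`	The maximum number of partitions that can be used for parallel operations.
`queryTimeout`	The timeout for query statements; 0 means no limit.
`fetchSize`	The number of rows to fetch per round trip.
`batchSize`	The number of rows to insert per batch.
`isolationLevel`	The transaction isolation level.
`sessionInitStatement`	A statement executed after each database session is opened, used for session initialization.
`truncate`	A write-related option. When `SaveMode.Overwrite` is enabled, this option causes Spark to truncate the existing table instead of dropping and recreating it.
`cascadeTruncate`	A write-related option. It controls whether the truncation is executed as a cascading truncate, where the database supports one.
`createTableOptions`	A write-related option. It allows database-specific table options to be set when a table is created.
`createTableColumnTypes`	The column data types to use when creating the table.
`customSchema`	A custom schema to use when reading data.
`pushDownPredicate`	Enables or disables predicate push-down into the JDBC data source. Defaults to true.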

import java.util.Properties

// Note: JDBC loading and saving can be achieved via either the load/save or jdbc methods
// Loading data from a JDBC source
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .load()

val connectionProperties = new Properties()
connectionProperties.put("user", "username")
connectionProperties.put("password", "password")
val jdbcDF2 = spark.read
  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
// Specifying the custom data types of the read schema
connectionProperties.put("customSchema", "id DECIMAL(38, 0), name STRING")
val jdbcDF3 = spark.read
  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)

// Saving data to a JDBC source
jdbcDF.write
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .save()

jdbcDF2.write
  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)

// Specifying create table column data types on write
jdbcDF.write
  .option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
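The partitioning options from the table above can be combined as in the following sketch of a parallel read. It assumes an existing SparkSession named spark and a reachable database; the connection details are the placeholder values used in the examples above, and the id column, the bounds and the partition count are hypothetical.

```scala
// Hypothetical connection details and column name; replace with real ones
val partitionedDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .option("partitionColumn", "id") // assumed numeric column
  .option("lowerBound", "1")       // bounds set the partition stride only
  .option("upperBound", "100000")
  .option("numPartitions", "8")
  .load()
```

Note that lowerBound and upperBound only determine how the range of the partition column is split into strides; they do not filter rows, so rows outside the bounds are still read.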

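Avro Files

(Omitted)

Troubleshooting

(Omitted)

Performance Tuning

For some workloads, performance can be improved by caching data in memory or by turning on certain options.

Caching Data

Spark SQL can cache tables in an in-memory columnar format by calling spark.catalog.cacheTable("tablename") or dataFrame.cache(). Spark SQL then scans only the required columns and automatically tunes compression to minimize memory usage and GC pressure. Call spark.catalog.uncacheTable("tablename") to remove the table from memory.

In-memory caching can be configured using the setConf method on SparkSession or by running SET key=value commands in SQL:

Property Name	Default	Meaning
`spark.sql.inMemoryColumnarStorage.compressed`	true	Whether to compress the cached data.
`spark.sql.inMemoryColumnarStorage.batchSize`	10000	Controls the number of records per batch for columnar caching.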
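Other Configuration Options

The following options can also be used to tune the performance of query execution.

Property Name	Default	Meaning
`spark.sql.files.maxPartitionBytes`	134217728 (128 MB)	The maximum number of bytes to pack into a single partition when reading files.
`spark.sql.files.openCostInBytes`	4194304 (4 MB)	The estimated cost of opening a file, expressed in bytes.
`spark.sql.broadcastTimeout`	300	Timeout in seconds for broadcast joins.
`spark.sql.autoBroadcastJoinThreshold`	10485760 (10 MB)	The maximum size in bytes of a table that can be broadcast.
`spark.sql.shuffle.partitions`	200	The number of partitions to use for shuffle operations.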
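Broadcast Hint for SQL Queries

The BROADCAST hint guides Spark to broadcast the specified table when joining it with another table or view. When Spark decides the join method, the broadcast hash join (BHJ) is preferred, even if the statistics are above the configuration spark.sql.autoBroadcastJoinThreshold. When both sides of a join are specified, Spark broadcasts the one with the lower statistics. BHJ cannot be used in every case (for example, full outer joins do not support it).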

import org.apache.spark.sql.functions.broadcast
broadcast(spark.table("src")).join(spark.table("records"), "key").show()
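The same hint can also be written inside the SQL text itself. The sketch below assumes an existing SparkSession named spark; the two temporary views and their rows are hypothetical stand-ins for the src and records tables used above.

```scala
import spark.implicits._

// Hypothetical stand-ins for the src and records tables
val srcDemo = Seq((1, "val_1"), (2, "val_2")).toDF("key", "value")
val recordsDemo = Seq((2, "rec_2"), (3, "rec_3")).toDF("key", "value")
srcDemo.createOrReplaceTempView("src_hint_demo")
recordsDemo.createOrReplaceTempView("records_hint_demo")

// Ask Spark to broadcast the side aliased as r
val joined = spark.sql(
  """SELECT /*+ BROADCAST(r) */ r.key, s.value
    |FROM src_hint_demo s JOIN records_hint_demo r ON s.key = r.key""".stripMargin)
joined.show()
```

Calling joined.explain() is one way to check which join strategy was actually planned, since the hint is a request rather than a guarantee.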

Distributed SQL Engine

(Omitted)

PySpark Usage Guide

(Omitted)