Spark SQL and DataFrame Guide



// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

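The SQL dialect used to parse queries can also be selected with the spark.sql.dialect option. This parameter can be changed either with the setConf method on a SQLContext or with a SET key=value command in SQL. For a SQLContext, the only dialect available is "sql", which uses the simple SQL parser provided by Spark SQL. In a HiveContext, the default is "hiveql", whose parser is much more complete.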
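As a minimal sketch of the two ways of setting the dialect described above, assuming a Spark 1.x SQLContext is already available: the object name DialectExample and the method useSimpleDialect are hypothetical names chosen for illustration, not part of Spark. The calls setConf, getConf and sql are regular SQLContext methods.

```scala
import org.apache.spark.sql.SQLContext

// Hypothetical helper object for illustration only (not a Spark API).
object DialectExample {
  // Selects the "sql" dialect in both supported ways and returns the value now in effect.
  def useSimpleDialect(sqlContext: SQLContext): String = {
    // Way 1: set the option programmatically on the context.
    sqlContext.setConf("spark.sql.dialect", "sql")
    // Way 2: the same setting expressed as a SQL SET command.
    sqlContext.sql("SET spark.sql.dialect=sql")
    // Read the option back from the context's configuration.
    sqlContext.getConf("spark.sql.dialect")
  }
}
```

In spark-shell this could be called as DialectExample.useSimpleDialect(sqlContext). Either of the two ways is enough on its own; both are shown only to make the equivalence visible.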

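hadoop fs -put examples/src/main/resources/people.json /user/hadoop/people.json
// The relative path below resolves against the HDFS home directory, /user/hadoop.
val df = sqlContext.read.json("people.json")

// Displays the content of the DataFrame to stdout
df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+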



val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create the DataFrame
val df = sqlContext.read.json("examples/src/main/resources/people.json")

// Show the content of the DataFrame
df.show()
// age  name
// null Michael
// 30   Andy
// 19   Justin

// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)

// Select only the "name" column
df.select("name").show()
// name
// Michael
// Andy
// Justin

// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()
// name    (age + 1)
// Michael null
// Andy    31
// Justin  20

// Select people older than 21
df.filter(df("age") > 21).show()
// age name
// 30  Andy

// Count people by age
df.groupBy("age").count().show()
// age  count
// null 1
// 19   1
// 30   1



val sqlContext = ... // An existing SQLContext
val df = sqlContext.sql("SELECT * FROM table")


SQLContext sqlContext = ... // An existing SQLContext
DataFrame df = sqlContext.sql("SELECT * FROM table")


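Spark SQL supports two methods for converting existing RDDs into DataFrames. The first method uses reflection to infer the schema of an RDD that contains objects of a specific type.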


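The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table: the names of its arguments are read using reflection and become the column names. Case classes can also be nested or contain complex types such as Sequences or Arrays. The RDD can be converted to a DataFrame and registered as a table, and the table can then be queried with SQL statements.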

 hadoop fs -put people.txt /user/hadoop
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)

// Create an RDD of Person objects, convert it to a DataFrame, and register it as a table.
val people = sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
// Output:
// 15/08/27 09:07:17 INFO spark.SparkContext: Created broadcast 23 from textFile at <console>:28
// people: org.apache.spark.sql.DataFrame = [name: string, age: int]
people.registerTempTable("people")


// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by field index:
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
// Name: Justin

// or by field name:
teenagers.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)

// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagers.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)
// Map("name" -> "Justin", "age" -> 19)


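When case classes cannot be defined ahead of time (for example, when the structure of the records is encoded in a string, or when a text dataset is parsed differently for different users), a DataFrame can be created programmatically in three steps:

  1. Create an RDD of Rows from the original RDD.
  2. Create the schema, represented by a StructType, that matches the structure of the Rows in the RDD created in step 1.
  3. Apply the schema to the RDD of Rows with the createDataFrame method of SQLContext (called applySchema before Spark 1.3, where it was deprecated).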
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create an RDD. people.txt must be on HDFS at /user/hadoop/people.txt
val people = sc.textFile("people.txt")

// Output printed by the complete example:
// Name: Michael
// Name: Andy
// Name: Justin
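The three steps listed above can be sketched as follows, assuming Spark 1.3 or later and input lines of the form "name, age" as in people.txt. The object name ProgrammaticSchema and the method linesToPeople are hypothetical names used only for illustration, and keeping both columns as strings is a simplifying assumption. Row, StructType, StructField and createDataFrame are regular Spark SQL APIs.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical helper object for illustration only (not a Spark API).
object ProgrammaticSchema {
  // Builds a DataFrame with string columns "name" and "age" from "name, age" text lines.
  def linesToPeople(sqlContext: SQLContext, lines: RDD[String]): DataFrame = {
    // Step 1: convert each record of the original RDD into a Row.
    val rowRDD = lines.map(_.split(",")).map(p => Row(p(0), p(1).trim))

    // Step 2: describe the Rows with a StructType (both columns kept as strings here).
    val schema = StructType(
      Seq("name", "age").map(field => StructField(field, StringType, nullable = true)))

    // Step 3: apply the schema to the RDD of Rows.
    sqlContext.createDataFrame(rowRDD, schema)
  }
}
```

In spark-shell, the result of ProgrammaticSchema.linesToPeople(sqlContext, sc.textFile("people.txt")) can be registered with registerTempTable("people") and queried through sqlContext.sql like any other table. Because the schema is an ordinary value, the list of field names could just as well be read from a string at runtime, which is the case this approach is meant for.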




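// load and save use the default data source, which is Parquet unless spark.sql.sources.default is changed.
val df = sqlContext.read.load("examples/src/main/resources/users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")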





Parquet Files


// sqlContext from the previous example is used in this example.
// This is used to convert an RDD to a DataFrame with toDF().
import sqlContext.implicits._

val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.

// The RDD is converted to a DataFrame with toDF(), allowing it to be stored using Parquet.
people.toDF().write.parquet("people.parquet")

// Read in the parquet file created above. Parquet files are self-describing so the schema is preserved.
// The result of loading a Parquet file is also a DataFrame.
val parquetFile = sqlContext.read.parquet("people.parquet")

// Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerTempTable("parquetFile")
val teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)



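Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This can be done with sqlContext.read.json(), which accepts either an RDD of String or a JSON file.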

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files.
val path = "examples/src/main/resources/people.json"
val people = sqlContext.read.json(path)

// The inferred schema can be visualized using the printSchema() method.
people.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)

// Register this DataFrame as a table.
people.registerTempTable("people")
// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// Alternatively, a DataFrame can be created for a JSON dataset represented by
// an RDD[String] storing one JSON object per string.
val anotherPeopleRDD = sc.parallelize(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.read.json(anotherPeopleRDD)

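Spark SQL also supports reading and writing data stored in Hive. Hive has a large number of dependencies, so it is not included in the default Spark assembly. To enable Hive support, add the -Phive and -Phive-thriftserver flags when building Spark. The resulting assembly jar must be present on all worker nodes, because they need the Hive serialization and deserialization libraries (SerDes) to access data stored in Hive.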


// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)

