数据框是分布式数据集,组织为有名字的列。概念上等价于关系数据库或者R/Python的数据框,只是有更丰富的操作。数据框可以有结构化数据文件,hive表,外部数据库或者RDD来创建
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
另外,也可以创建HiveContext,使用它,不需要安装hive,且可以获得SQLContext的所有数据源。HiveContext分开打包而没在spark安装包中。
用于解析语句的SQL指定变量也可以用spark.sql.dialect选项查询。这个参数可以用SQLContext的setConf方法改变或者在SQL中用SET key=value命令改变。对于一个SQLContext,唯一的对话变量是sql,它使用Spark SQL提供的简单SQL解析。在HiveContext,默认的是hiveql, 它更完备;
下面基于JSON文件创建一个数据框;
hadoop fs -put examples/src/main/resources/people.json /user/hadoop/people.json
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("people.json")
// Displays the content of the DataFrame to stdout
df.show()
15/08/27 08:46:00 INFO rdd.HadoopRDD: Input split: hdfs://localhost:9000/user/hadoop/people.json:36+37
15/08/27 08:46:00 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 3). 2117 bytes result sent to driver
15/08/27 08:46:00 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 3) in 28 ms on localhost (1/1)
15/08/27 08:46:00 INFO scheduler.DAGScheduler: ResultStage 2 (show at <console>:26) finished in 0.028 s
15/08/27 08:46:00 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
15/08/27 08:46:00 INFO scheduler.DAGScheduler: Job 2 finished: show at <console>:26, took 0.040699 s +----+-------+
| age| name| +----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin| +----+-------+
scala>
val sc: SparkContext // An existing SparkContext. val sqlContext = new org.apache.spark.sql.SQLContext(sc) // Create the DataFrame val df = sqlContext.read.json("examples/src/main/resources/people.json") // Show the content of the DataFrame df.show() // age name // null Michael // 30 Andy // 19 Justin // Print the schema in a tree format df.printSchema() // root // |-- age: long (nullable = true) // |-- name: string (nullable = true) // Select only the "name" column df.select("name").show() // name // Michael // Andy // Justin // Select everybody, but increment the age by 1 df.select(df("name"), df("age") + 1).show() // name (age + 1) // Michael null // Andy 31 // Justin 20 // Select people older than 21 df.filter(df("age") > 21).show() // age name // 30 Andy // Count people by age df.groupBy("age").count().show() // age count // null 1 // 19 1 // 30 1
SQLContext的sql方法返回数据框
val sqlContext = ... // An existing SQLContext val df = sqlContext.sql("SELECT * FROM table")
Java:
SQLContext sqlContext = ... // An existing SQLContext DataFrame df = sqlContext.sql("SELECT * FROM table")
Spark SQL支持两种方法把RDD转化为数据框,第一种方法使用反射推导包含指定对象类型的RDD的模式。
第二种方法是通过编程接口,允许你建立一个(模式)schema并且把它应用到RDD.允许你创建数据框,即使在运行前你不知道它的类型和列。
sparkSQL的scala接口支持自动把包含case class的RDD转化为数据框。这些case class定义了表的schema,case class的参数名用反射读取并且转化为列名。case class也可以嵌套或者包含像sequences或者Array的复杂类型。RDD可以转化为数据框并且注册为表。表可以用SQL语句
hadoop fs -put people.txt /user/hadoop
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)
// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
输出:15/08/27 09:07:17 INFO spark.SparkContext: Created broadcast 23 from textFile at <console>:28
people: org.apache.spark.sql.DataFrame = [name: string, age: int]
people.registerTempTable("people")
// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by field index:
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
//Name: Justin
// or by field name:
teenagers.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)
// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagers.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)
// Map("name" -> "Justin", "age" -> 19)
当事先不能定义case class(例如:以字符串编码的记录或文本数据集,不同用户解析的方式不同)三步创建数据框
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Create an RDD ,people.txt要放在hdfs上/user/hadoop/people.txt
scala> val people = sc.textFile("people.txt")
15/08/28 08:55:50 INFO storage.MemoryStore: ensureFreeSpace(182784) called with curMem=97823, maxMem=278302556
15/08/28 08:55:50 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 178.5 KB, free 265.1 MB)
15/08/28 08:55:50 INFO storage.MemoryStore: ensureFreeSpace(17383) called with curMem=280607, maxMem=278302556
15/08/28 08:55:50 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 17.0 KB, free 265.1 MB)
15/08/28 08:55:50 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:37236 (size: 17.0 KB, free: 265.4 MB)
15/08/28 08:55:50 INFO spark.SparkContext: Created broadcast 1 from textFile at <console>:23
people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at textFile at <console>:23
scala> // The schema is encoded in a string
scala> val schemaString = "name age"
schemaString: String = name age
scala>
scala> // Import Row.
scala> import org.apache.spark.sql.Row;
import org.apache.spark.sql.Row
scala>
scala> // Import Spark SQL data types
scala> import org.apache.spark.sql.types.{StructType,StructField,StringType};
import org.apache.spark.sql.types.{StructType, StructField, StringType}
scala>
scala> // Generate the schema based on the string of schema
scala> val schema =
| StructType(
| schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,StringType,true))
scala>
scala> // Convert records of the RDD (people) to Rows.
scala> val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[11] at map at <console>:28
scala>
scala> // Apply the schema to the RDD.
scala> val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
peopleDataFrame: org.apache.spark.sql.DataFrame = [name: string, age: string]
scala>
scala> // Register the DataFrames as a table.
scala> peopleDataFrame.registerTempTable("people")
scala>
scala> // SQL statements can be run by using the sql methods provided by sqlContext.
scala> val results = sqlContext.sql("SELECT name FROM people")
results: org.apache.spark.sql.DataFrame = [name: string]
scala>
scala> // The results of SQL queries are DataFrames and support all the normal RDD operations.
scala> // The columns of a row in the result can be accessed by field index or by field name.
scala> results.map(t => "Name: " + t(0)).collect().foreach(println)
15/08/28 08:56:01 INFO mapred.FileInputFormat: Total input paths to process : 1
15/08/28 08:56:02 INFO spark.SparkContext: Starting job: collect at <console>:31
15/08/28 08:56:02 INFO scheduler.DAGScheduler: Got job 0 (collect at <console>:31) with 2 output partitions (allowLocal=false)
15/08/28 08:56:02 INFO scheduler.DAGScheduler: Final stage: ResultStage 0(collect at <console>:31)
15/08/28 08:56:02 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/08/28 08:56:02 INFO scheduler.DAGScheduler: Missing parents: List()
15/08/28 08:56:02 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[15] at map at <console>:31), which has no missing parents
15/08/28 08:56:02 INFO storage.MemoryStore: ensureFreeSpace(5992) called with curMem=297990, maxMem=278302556
15/08/28 08:56:02 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 5.9 KB, free 265.1 MB)
15/08/28 08:56:02 INFO storage.MemoryStore: ensureFreeSpace(3049) called with curMem=303982, maxMem=278302556
15/08/28 08:56:02 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 3.0 KB, free 265.1 MB)
15/08/28 08:56:02 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:37236 (size: 3.0 KB, free: 265.4 MB)
15/08/28 08:56:02 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:874
15/08/28 08:56:02 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[15] at map at <console>:31)
15/08/28 08:56:02 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/08/28 08:56:02 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, ANY, 1417 bytes)
15/08/28 08:56:02 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, ANY, 1417 bytes)
15/08/28 08:56:02 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
15/08/28 08:56:02 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
15/08/28 08:56:02 INFO rdd.HadoopRDD: Input split: hdfs://localhost:9000/user/hadoop/people.txt:0+16
15/08/28 08:56:02 INFO rdd.HadoopRDD: Input split: hdfs://localhost:9000/user/hadoop/people.txt:16+16
15/08/28 08:56:02 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/08/28 08:56:02 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/08/28 08:56:02 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/08/28 08:56:02 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/08/28 08:56:02 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/08/28 08:56:02 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 1821 bytes result sent to driver
15/08/28 08:56:02 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 1807 bytes result sent to driver
15/08/28 08:56:02 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 417 ms on localhost (1/2)
15/08/28 08:56:02 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 410 ms on localhost (2/2)
15/08/28 08:56:02 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/08/28 08:56:02 INFO scheduler.DAGScheduler: ResultStage 0 (collect at <console>:31) finished in 0.432 s
15/08/28 08:56:02 INFO scheduler.DAGScheduler: Job 0 finished: collect at <console>:31, took 0.549062 s
Name: Michael
Name: Andy
Name: Justin
sparkSQL通过数据框接口支持很多数据源,数据框可以作为RDD操作,也可以注册为临时表。把数据框注册为表从而可以使用SQL语句。
val df = sqlContext.read.load("examples/src/main/resources/users.parquet") df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
指定怎么处理已经存在的数据
使用HiveContext时,数据框可以用saveAsTable命令保存为永久表。不像registerTempTable,saveAsTable将物化数据框的内容并在HiveMetastore里创建一个指针指向这个数据。只要连接到同一个metastore,重启程序后,持久表仍然存在。持久表的数据框可以通过SQLContext的table方法及表名来创建。
默认saveAsTable将会创建一个“管理表”,即数据的位置有metastore控制。当表drop后,管理表自动删除。
SparkSQL支持读写parquet文件且自动保存原始数据的模式
编程加载数据
// sqlContext from the previous example is used in this example. // This is used to implicitly convert an RDD to a DataFrame. import sqlContext.implicits._ val people: RDD[Person] = ... // An RDD of case class objects, from the previous example. // The RDD is implicitly converted to a DataFrame by implicits, allowing it to be stored using Parquet. people.write.parquet("people.parquet") // Read in the parquet file created above. Parquet files are self-describing so the schema is preserved. // The result of loading a Parquet file is also a DataFrame. val parquetFile = sqlContext.read.parquet("people.parquet") //Parquet files can also be registered as tables and then used in SQL statements. parquetFile.registerTempTable("parquetFile") val teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19") teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
未完待续。。。。。。。。。。。。
SparkSQL可以自动推导JSON数据集的schema并把它加载为数据框。这可以用SQLContext.read.json()作用与String RDD或JSON文件完成
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files.
val path = "examples/src/main/resources/people.json"
val people = sqlContext.read.json(path)
// The inferred schema can be visualized using the printSchema() method.
people.printSchema()
// root
// |-- age: integer (nullable = true)
// |-- name: string (nullable = true)
// Register this DataFrame as a table.
people.registerTempTable("people")
// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.show
// Alternatively, a DataFrame can be created for a JSON dataset represented by
// an RDD[String] storing one JSON object per string.
val anotherPeopleRDD = sc.parallelize(
"""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.read.json(anotherPeopleRDD)
SparkSQL支持读写存贮在hive中的数据,它有很多依赖,不再spark默认的安装包中,因此要支持hive,需要在编译spark时加上-Phive -Phive-thriftserver
.这个jar必须放在所有的worker节点上,因为hive序列化和反序列化需要这些库获得hive中的数据
配置hive需要把hive-site.xml放在conf/
目录下
必须创建HiveContext
,它继承自SQLContext
且添加支持查找元数据中的表和使用HiveQL
写队列。当步配置hive-site.xml时,context将自动在当前目录中创建metastore_db和warehouse.
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src") // Queries are expressed in HiveQL sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
英文原文