1.数据源
SparkSQL的数据源:结构化的文件(json,parquet),或者是Hive的表,或者是外部的数据库(mysql),也或者是已经存在的RDD。
2.load和save
SparkSQL默认的数据源的文件格式是parquet
Load是用来读取文件的时候加载文件中的数据
Save是用来往外写文件的时候存储写出的数据
val df = sqlContext.read.load("examples/src/main/resources/users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
SparkSQL的默认的数据源的文件的格式是parquet,但是,我们可以手动的指定数据源的文件格式,可以指定为json、jdbc等数据源的格式,使用format函数就可以指定。
val df = sqlContext.read.format("json").load("examples/src/main/resources/people.json")
df.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
4.save mode
Df在保存数据的时候,可以设置保存的模式,这样就可以对现存的数据做对应的处理,具体的模式如下:
Scala/Java |
Any Language |
Meaning |
SaveMode.ErrorIfExists(default) |
"error"(default) |
When saving a DataFrame to a data source, if data already exists, an exception is expected to be thrown. |
SaveMode.Append |
"append" |
When saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data. |
SaveMode.Overwrite |
"overwrite" |
Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame. |
SaveMode.Ignore |
"ignore" |
Ignore mode means that when saving a DataFrame to a data source, if data already exists, the save operation is expected to not save the contents of the DataFrame and to not change the existing data. This is similar to a CREATE TABLE IF NOT EXISTS in SQL. |
val df = sqlContext.sql("select *from parquet.`examples\\src\\main\\resources\\users.parquet`")
df.show()
df.write.mode(saveMode=SaveMode.Append).format("json").save("c:\\saveMode")
df.write.format("json").mode(saveMode = SaveMode.Overwrite).save("c:\\saveMode1")
Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using SQLContext.read.json(path) orSQLContext.read.format(“json”).load(path) on either an RDD of String, or a JSON file.
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// A JSON dataset is pointed to by path.// The path can be either a single text file or a directory storing text files.
val path = "examples/src/main/resources/people.json"
val people = sqlContext.read.json(path)
// The inferred schema can be visualized using the printSchema() method.
people.printSchema()
// root// |-- age: integer (nullable = true)
// |-- name: string (nullable = true)
// Register this DataFrame as a table.
people.registerTempTable("people")
// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
// Alternatively, a DataFrame can be created for a JSON dataset represented by
// an RDD[String] storing one JSON object per string.
//并行化的方式创建json数据
val anotherPeopleRDD = sc.parallelize(
"""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.read.json(anotherPeopleRDD)
MySQL(postgresql这也是一个关系型数据库,最近也很火(也经常会使用这个做为Hive的元数据库))
package sparksql
import java.util.Properties
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import scala.collection.mutable
object JdbcTest {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("test").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
TODO 方式一:
val map = new mutable.HashMap[String,String]()
map.put("url","jdbc:mysql://hadoop1:3306/aura")
map.put("user","aura")
map.put("password","aura")
map.put("dbtable","people")
val df = sqlContext.read.format("jdbc").options(
map
).load()
//TODO 方式二
val properties = new Properties()
properties.put("password","aura");
properties.put("user","aura")
/**
* def jdbc(url: String, table: String, properties: Properties): DataFrame = {
jdbc(url, table, JDBCRelation.columnPartition(null), properties)
}
*/
val df = sqlContext.read.jdbc(
"jdbc:mysql://hadoop1:3306/aura","people", properties
)
df.registerTempTable("people")
sqlContext.sql("select * from people").show()
}
}
hive (我们最重要的一种数据源,或者我们的SparkSQL的数据仓库就是借助Hive去实现的)
这里的hive不是hive on spark,只是SparkSQL其中的一种数据源而已。
(1)我们在IDEA中创建项目写SparkSQL语句(数据源来源于Hive)的时候,要加下面的这个依赖:
val conf = new SparkConf().setAppName("sparkSQL").setMaster("local")
val sc = new SparkContext(conf)
val hiveSQLcontext = new HiveContext(sc)
hiveSQLcontext.sql("create table student(id int,name string) row format delimited fields terminated by ','")
hiveSQLcontext.sql("load data local inpath 'root/hadoop/student.txt' into table student")
hiveSQLcontext.sql("select *from student")
输入::paste
可以进行多行的编辑模式
输入:Ctrl +d
退出多行编辑的模式
总结性代码:
package lesson02
import java.util.Properties
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode}
import scala.collection.mutable
object DataSource {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("sparkSQL").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
/*
Generic Load/Save Functions
In the simplest form, the default data source
(parquet unless otherwise configured by spark.sql.sources.default) will be used for all operations.
sparkSQL默认的文件的格式是parquet
*/
val df = sqlContext.read.load("examples\\src\\main\\resources\\users.parquet")
df.select("name","favorite_color").write.save("c:\\namesAndFavColors.parquet")
/*
Manually Specifying Options
You can also manually specify the data source that will be used along with any extra options that you
would like to pass to the data source. Data sources are specified by their fully qualified name
(i.e., org.apache.spark.sql.parquet), but for built-in sources you can also use their
short names (json, parquet, jdbc). DataFrames of any type can be converted into other types using this syntax.
*/
val df = sqlContext.read.format("json").load("examples\\src\\main\\resources\\people.json")
df.select("name","age").write.format("parquet").save("c:\\namesAndages.parquet")
/*
Run SQL on files directly
Instead of using read API to load a file into DataFrame and query it,
you can also query that file directly with SQL.
*/
val df = sqlContext.sql("select *from parquet.`examples\\src\\main\\resources\\users.parquet`")
df.show()
/*
Save Modes
Save operations can optionally take a SaveMode, that specifies how to handle existing data if present.
It is important to realize that these save modes do not utilize any locking and are not atomic. Additionally,
when performing a Overwrite, the data will be deleted before writing out the new data.
*/
df.write.mode(saveMode =SaveMode.Append).format("json").save("c:\\saveMode")
df.write.format("json").mode(saveMode = SaveMode.Overwrite).save("c:\\saveMode1")
//jdbc
//style1
val map = new mutable.HashMap[String,String]()
map.put("url","jdbc:mysql://hadoop03:3306/test")
map.put("user","root")
map.put("password","root")
map.put("dbtable","student")
val df1 = sqlContext.read.format("jdbc").options(
map
).load()
df1.registerTempTable("student")
sqlContext.sql("select *from student").show()
//style2
val properties = new Properties()
properties.put("user","root")
properties.put("password","root")
val df2 = sqlContext.read.jdbc(
"jdbc:mysql://hadoop03:3306/test", "student", properties
)
df2.registerTempTable("student")
sqlContext.sql("select *from student")
//Hive
val hiveSQLcontext = new HiveContext(sc)
hiveSQLcontext.sql("create table student(id int,name string) row format delimited fields terminated by ','")
hiveSQLcontext.sql("load data local inpath 'root/hadoop/student.txt' into table student")
hiveSQLcontext.sql("select *from student")
}
}
以上使用的是Spark1.6.0的版本
目前2.2.0的版本对SparkSQL进行初始化的时候提供了统一的方式SparkSession,只是当数据源是Hive的时候需要多配置参数
非Hive:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.config("spark.some.config.option", "some-value")
.getOrCreate()
Hive:
val spark = SparkSession
.builder()
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()
// Register the DataFrame as a global temporary view
df.createGlobalTempView("people")
// Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
// Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
http://spark.apache.org/docs/latest/sql-programming-guide.html#getting-started