SparkSQL can handle many kinds of data sources. This article gives a brief walkthrough of how SparkSQL processes them, using Parquet, JSON, and relational databases as the main thread.
In row-based storage, all the fields of a record are stored together, one record after another. Parquet instead stores the data column by column, keeping all the values of each column together.
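A minimal sketch of the difference, using three made-up person records (the names and ages here are purely illustrative):

Row-oriented layout (one complete record after another):
  Barack,Obama,53 | Bill,Clinton,68 | Hillary,Clinton,28

Columnar layout (Parquet keeps all the values of a column together):
  firstName: Barack | Bill    | Hillary
  lastName:  Obama  | Clinton | Clinton
  age:       53     | 68      | 28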
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext

/**
 * Created by Administrator on 2015/10/17.
 */
object Parquet {

  case class Person(firstName: String, lastName: String, age: Int)

  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage: <data path> <parquet file>")
      System.exit(1)
    }

    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    // Create a SQLContext explicitly (in spark-shell the pre-built sqlContext is actually a HiveContext)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val data = sc.textFile(args(0))
    val personRDD = data.map(_.split(",")).map(person => Person(person(0), person(1), person(2).toInt))

    // Convert the personRDD into the personDF DataFrame
    val personDF = personRDD.toDF()
    // Register the personDF as a temporary table
    personDF.registerTempTable("person")
    // Run a SQL query against it
    val people = sqlContext.sql("SELECT * FROM person")
    // people.collect().foreach(println)

    /************************************ Do Parquet ************************************/
    /**
     * Save the people DataFrame in the Parquet format.
     * If you are using Spark 1.3.x, write people.saveAsParquetFile("/spark/sql/people.parquet")
     * instead; the result is saved in the people.parquet directory.
     */
    people.write.parquet(args(1))

    /**
     * Load the contents of the Parquet files.
     * If you are using Spark 1.3.x, write
     * val personParquetDF = sqlContext.load("/spark/sql/people.parquet") instead.
     */
    val personParquetDF = sqlContext.read.parquet(args(1))
    // Register the loaded Parquet DataFrame as a temp table
    personParquetDF.registerTempTable("person")
    // Run a SQL query against it
    val peopleParquet = sqlContext.sql("SELECT * FROM person WHERE age < 30")
    peopleParquet.collect().foreach(println)

    sc.stop()
  }
}
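For reference, a hedged example of how this job might be launched; the jar name and paths below are placeholders, and the input is assumed to be a comma-separated text file with one person per line:

person.txt (firstName,lastName,age):
Barack,Obama,53
Bill,Clinton,68
Hillary,Clinton,28

spark-submit --class Parquet sparksql-examples.jar /spark/sql/person.txt /spark/sql/people.parquet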
JSON (short for JavaScript Object Notation) is a lightweight data-interchange format. JSON shares many of XML's characteristics, such as being easy for people to write and read and easy for machines to generate and parse, but JSON is much more efficient than XML for data transfer. JSON is completely independent of any programming language and is stored as plain text.
JSON data has two structures: objects (unordered collections of name/value pairs) and arrays (ordered lists of values).
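For instance, a small people.json file might look like this (hedged sample data; note that Spark SQL's JSON reader expects one complete JSON object per line, not a single pretty-printed document):

{"first_name": "Barack", "last_name": "Obama", "age": 53}
{"first_name": "Bill", "last_name": "Clinton", "age": 68}
{"first_name": "Hillary", "last_name": "Clinton", "age": 28}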
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext

/**
 * Created by Administrator on 2015/10/17.
 */
object Json {

  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      System.err.println("Usage: <data path>")
      System.exit(1)
    }

    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    // Create a SQLContext explicitly (in spark-shell the pre-built sqlContext is actually a HiveContext)
    val sqlContext = new SQLContext(sc)

    /**
     * Load the contents of the JSON files.
     * If you are using Spark 1.3.x, write
     * val personJsonDF = sqlContext.jsonFile("/spark/sql/people.json") instead.
     */
    val personJsonDF = sqlContext.read.json(args(0))
    // Register the loaded JSON DataFrame as a temp table
    personJsonDF.registerTempTable("person")
    // Run a SQL query against it
    val peopleJson = sqlContext.sql("SELECT * FROM person WHERE age < 30")
    peopleJson.collect().foreach(println)

    /**
     * Save the result in the JSON format.
     * If you are using Spark 1.3.x, write
     * personJsonDF.toJSON.saveAsTextFile("/spark/sql/peopleresult.json") instead;
     * the result is saved in the peopleresult.json directory.
     */
    // personJsonDF.write.json("/spark/sql/peopleresult.json")

    sc.stop()
  }
}
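One thing worth knowing is that read.json infers the schema by scanning the data, so you can check what Spark derived before querying. A minimal sketch (the field names match the sample file above, and the commented output is what a typical run would print):

// Inspect the schema that Spark SQL inferred from the JSON data
personJsonDF.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- first_name: string (nullable = true)
//  |-- last_name: string (nullable = true)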
There is not much to say about relational databases themselves, although personally I am not quite used to the way SparkSQL operates on them.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext

/**
 * Created by Administrator on 2015/10/17.
 */
object RelationDB {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    // Create a SQLContext explicitly (in spark-shell the pre-built sqlContext is actually a HiveContext)
    val sqlContext = new SQLContext(sc)

    // Construct a JDBC URL
    val url = "jdbc:mysql://localhost:3306/test"
    // Create a connection properties object with username and password
    val prop = new java.util.Properties
    prop.setProperty("user", "root")
    prop.setProperty("password", "admin")

    // Load a DataFrame from the JDBC data source (url, table name, properties)
    val people = sqlContext.read.jdbc(url, "person", prop)
    // Show the results in a nice tabular format by executing the following command
    people.show()

    /**
     * This has loaded the whole table. What if I only would like to load males
     * (url, table name, predicates, properties)? To do this, run the following command
     */
    val males = sqlContext.read.jdbc(url, "person", Array("gender = 'M'"), prop)
    males.show()

    // Show only first names by executing the following command
    val first_names = people.select("first_name")
    first_names.show()

    // Show only people below age 60 by executing the following command
    val below60 = people.filter(people("age") < 60)
    below60.show()

    // Group people by gender as follows
    val grouped = people.groupBy("gender")

    // Find the number of males and females by executing the following command
    val gender_count = grouped.count()
    gender_count.show()

    // Find the average age of males and females by executing the following command
    val avg_age = grouped.avg("age")
    avg_age.show()

    // Now if you'd like to save the gender_count data to a new table, run the following command
    gender_count.write.jdbc(url, "gender_count", prop)

    sc.stop()
  }
}
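A practical note that the code above assumes: the MySQL JDBC driver has to be visible to both the driver and the executors. A hedged example of how the job might be submitted (the connector jar version and application jar name are placeholders):

spark-submit --driver-class-path mysql-connector-java-5.1.36-bin.jar \
  --jars mysql-connector-java-5.1.36-bin.jar \
  --class RelationDB sparksql-examples.jar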
In fact, SparkSQL DataFrames can also load and save source data in other formats; the operations are much the same as the ones above.
// Load the data from Parquet; since parquet is the default data source, you do not have to specify it
val people = sqlContext.read.load("/spark/sql/people.parquet")

// Load the data from Parquet by manually specifying the full format name
val peopleParquet_1 = sqlContext.read.format("org.apache.spark.sql.parquet").load("/spark/sql/people.parquet")

/**
 * For built-in data sources (parquet, json, and jdbc), you do not have to spell out the full
 * format name; specifying just "parquet", "json", or "jdbc" works
 */
val peopleParquet_2 = sqlContext.read.format("parquet").load("/spark/sql/people.parquet")

// Save people as JSON in the append mode
people.write.format("json").mode("append").save("/spark/sql/people.json")
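The save mode can also be given with the SaveMode enum instead of a string; a minimal sketch, assuming the same people DataFrame and a placeholder output path:

import org.apache.spark.sql.SaveMode

// "error" (the default) fails if the target already exists,
// "append" adds the new rows to the existing data,
// "overwrite" replaces it, and "ignore" silently skips the write.
people.write.format("json").mode(SaveMode.Overwrite).save("/spark/sql/people_json")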