<Zhuuu_ZZ> Spark (Part 6): Spark SQL, DataFrame & Dataset

Spark SQL

  • I. Spark SQL Architecture
  • II. How It Works: the Catalyst Optimizer
    • 1. Execution flow
    • 2. Logical plan
    • 3. Optimization
    • 4. Physical plan
  • III. Spark SQL API
    • 1. SparkSession
    • 2. Dataset
    • 3. Creating a Dataset with a case class
    • 4. RDD -> Dataset
    • 5. DataFrame
      • What is a DataFrame
      • Common DataFrame API operations
      • RDD -> DataFrame
      • Seq/List -> DataFrame
      • DataFrame -> RDD
      • DataFrame -> Dataset
  • IV. Working with External Data Sources
    • Data sources supported by Spark SQL
    • Parquet files
    • Inserting into and reading from Hive tables with Spark
      • spark-shell on a Linux VM
      • IDEA development environment
    • Working with MySQL tables
      • Reading from MySQL
      • Writing to MySQL
  • V. Spark SQL Functions
    • Built-in functions
    • case class
  • VI. Spark UDF, UDAF & UDTF
    • UDF
    • UDAF
    • UDTF
  • VII. Spark SQL CLI
  • VIII. Spark Performance Tuning
    • Serialization
    • Tuning tips
      • Partition tuning
      • Join operations

I. Spark SQL Architecture

  • Spark SQL is one of Spark's core components (Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX)
  • It can directly access existing Hive data
  • It exposes JDBC/ODBC interfaces so that third-party tools can process data through Spark
  • It offers higher-level interfaces that make data processing convenient
  • It supports multiple ways of working: SQL and the programmatic API (a short sketch follows)
  • It supports multiple external data sources: Parquet, JSON, RDBMS, and more
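As a quick illustration of the two access styles, here is a minimal sketch, assuming an existing SparkSession named spark (as in spark-shell) and a hypothetical users.json file:

// Minimal sketch: the same query through the SQL interface and through the DataFrame API.
val users = spark.read.json("file:///tmp/users.json")   // hypothetical path

// SQL style
users.createOrReplaceTempView("users")
spark.sql("SELECT name FROM users WHERE age > 20").show()

// DataFrame API style
users.filter(users("age") > 20).select("name").show()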

II. How It Works: the Catalyst Optimizer

1. Execution flow

  • The Catalyst optimizer is the core of Spark SQL
  • It transforms the logical plan into a physical plan

2. Logical plan

(figure omitted)

3. Optimization

  • Catalyst notices a filter sitting on top of a projection
  • It checks whether the filter can be pushed down (predicate pushdown)

4. Physical plan

(figure omitted)
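To watch Catalyst go through these stages on a real query, explain(true) prints the parsed, analyzed and optimized logical plans together with the chosen physical plan. A minimal sketch, assuming an existing SparkSession spark and a hypothetical users.json file:

val df = spark.read.json("file:///tmp/users.json")
// Prints the parsed/analyzed/optimized logical plans and the selected physical plan;
// with a filter on top of a projection, the optimized plan typically shows the predicate pushed down.
df.select("name", "age").filter("age > 21").explain(true)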

III. Spark SQL API

1. SparkSession

  • SparkContext
  • SQLContext
    • The original programming entry point for Spark SQL
  • HiveContext
    • A subclass of SQLContext that adds more (Hive-related) functionality
  • SparkSession (recommended since Spark 2.x)
    • SparkSession merges SQLContext and HiveContext
    • Provides a single entry point for interacting with Spark and lets you program against it with the DataFrame and Dataset APIs
// Creating a SparkSession when developing in IDEA
// (in spark-shell, "sc" and "spark" are created automatically)
val conf: SparkConf = new SparkConf().setAppName("spark").setMaster("local[*]")
val spark = SparkSession.builder
  .master("local[*]")        // master URL, e.g. local[*] or spark://host:7077
  .appName("appName")
  .getOrCreate()
// or
val spark = SparkSession.builder.config(conf).getOrCreate()

2. Dataset

  • A strongly typed collection of domain-specific objects
  • From a Seq
scala> spark.createDataset(1 to 3).show
/*

+-----+
|value|
+-----+
|    1|
|    2|
|    3|
+-----+

  • From a List (or an Array)
scala> spark.createDataset(List(("a",1),("b",2),("c",3))).show
/*

+---+---+
| _1| _2|
+---+---+
|  a|  1|
|  b|  2|
|  c|  3|
+---+---+

  • From an RDD
scala> spark.createDataset(sc.parallelize(List(("a",1,1),("b",2,2)))).show
/*
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  a|  1|  1|
|  b|  2|  2|
+---+---+---+

  • createDataset() accepts a Seq, an Array, or an RDD as its argument
  • The three snippets above produce Dataset[Int], Dataset[(String, Int)] and Dataset[(String, Int, Int)] respectively
  • Dataset = RDD + schema, so a Dataset shares most of the RDD operators, such as map and filter (a short sketch follows)
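A quick sketch of RDD-style operators on a Dataset (run in spark-shell, where spark and the implicits are already available):

scala> spark.createDataset(1 to 10).filter(_ % 2 == 0).map(_ * 10).show()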

3. Creating a Dataset with a case class

  • In Scala, putting the case keyword in front of class turns the class into a case class. A case class differs from a regular class in that:
    • (1) Objects can be created without new
    • (2) It implements the serialization interface by default
    • (3) toString(), equals() and hashCode() are overridden automatically
case class Point(label:String,x:Double,y:Double)
case class Category(id:Long,name:String)
val points=Seq(Point("bar",3.0,5.6),Point("foo",-1.0,3.0)).toDS
val categories=Seq(Category(1,"foo"), Category(2,"bar")).toDS
points.join(categories,points("label")===categories("name")).show
/*

+-----+----+---+---+----+
|label|   x|  y| id|name|
+-----+----+---+---+----+
|  bar| 3.0|5.6|  2| bar|
|  foo|-1.0|3.0|  1| foo|
+-----+----+---+---+----+

4. RDD -> Dataset

case class Point(label:String,x:Double,y:Double)
case class Category(id:Long,name:String)
val pointsRDD=sc.parallelize(List(("bar",3.0,5.6),("foo",-1.0,3.0)))
val categoriesRDD=sc.parallelize(List((1,"foo"),(2,"bar")))
val points=pointsRDD.map(line=>Point(line._1,line._2,line._3)).toDS
val categories=categoriesRDD.map(line=>Category(line._1,line._2)).toDS
points.join(categories,points("label")===categories("name")).show
/*

+-----+----+---+---+----+
|label|   x|  y| id|name|
+-----+----+---+---+----+
|  bar| 3.0|5.6|  2| bar|
|  foo|-1.0|3.0|  1| foo|
+-----+----+---+---+----+

5. DataFrame

What is a DataFrame

  • DataFrame = Dataset[Row]; in Spark 2.x, DataFrame is a type alias for Dataset[Row]
  • Similar to a two-dimensional table in a traditional database
  • Adds a schema (information about the data's structure) on top of the RDD
  • A DataFrame schema supports nested data types (a short sketch follows this list):
    • struct
    • map
    • array
  • Provides a richer set of SQL-like operations in its API
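A minimal sketch of a schema containing nested struct, array and map fields; it assumes an existing SparkSession spark, and the field names are made up for illustration:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val nestedSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("address", StructType(Seq(                    // struct
    StructField("city", StringType),
    StructField("zip", StringType)))),
  StructField("phones", ArrayType(StringType)),             // array
  StructField("scores", MapType(StringType, IntegerType))   // map
))

val rows = spark.sparkContext.parallelize(Seq(
  Row("zhangsan", Row("Beijing", "100000"), Seq("13900000000"), Map("math" -> 90))
))
val nestedDF = spark.createDataFrame(rows, nestedSchema)
nestedDF.printSchema()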

Common DataFrame API operations

  • Creating a DataFrame
 /** Turn a JSON file into a DataFrame.
      * users.json contains:
      * {"name":"Michael"}
      * {"name":"Andy", "age":30}
      * {"name":"Justin", "age":19}
      */
val df = spark.read.json("file:///software/wordcount/users.json")
// Print the contents of the DataFrame with show
df.show
/*
+----+-------+
| Age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

  • Use printSchema to print the DataFrame's schema
df.printSchema()
/*
root
 |-- Age: long (nullable = true)
 |-- name: string (nullable = true)

  • Use select to pick the columns we need
df.select("name").show()
/*
+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+
  • Use select to pick the columns we need and add 1 to the age column
df.select(df("name"), df("age") + 1).show()
//等价于
df.select(df.col("name"), df.col("age") + 1).show()
/*
+-------+---------+
|   name|(age + 1)|
+-------+---------+
|Michael|     null|
|   Andy|       31|
| Justin|       20|
+-------+---------+
  • Use filter for conditional filtering
df.filter(df("age") > 21).show()
/*
+---+----+
|Age|name|
+---+----+
| 30|Andy|
+---+----+
  • Use groupBy to group rows, then count each group
df.groupBy("age").count().show()
/*
+----+-----+
| age|count|
+----+-----+
|  19|    1|
|null|    1|
|  30|    1|
+----+-----+
  • Run SQL queries with the sql() method
df.registerTempTable("people")   // deprecated in Spark 2.x; prefer df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people").show
/*
+----+-------+
| Age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

RDD -> DataFrame

  • Infer the schema inside the RDD via reflection
  • Any conversion from another type to a DataFrame requires importing the implicits: import spark.implicits._
case class People(name: String, age: Int)
object rddToDF {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName("sparksql").master("local[*]").getOrCreate()
    // First way to get a SparkContext:
    //val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("rddToDF")
    //val sc: SparkContext = SparkContext.getOrCreate(conf)
    // Second way:
    //val sc: SparkContext = new SparkContext(conf)
    // Third way: obtain it from the SparkSession
    val sc: SparkContext = spark.sparkContext

    // Way 1 to turn an RDD into a DataFrame: reflection on a case class
    import spark.implicits._    // "spark" here is the SparkSession defined above; the import follows its name
    val df: DataFrame = sc.textFile("in/people.txt").map(x => x.split(",")).map(x => People(x(0), x(1).toInt)).toDF()
    df.printSchema()
    df.show()
  }
}
  • Specifying the schema through the programmatic interface
// Way 2: specify the schema programmatically
case class Person(name: String, age: Int)   // not actually used below; the schema is built programmatically
val people: RDD[String] = sc.textFile("file:///home/hadoop/data/people.txt")
// Define the DataFrame's schema as a plain string
val schemaString = "name age"
// Import the required classes
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// Build the DataFrame schema from the schema string
val schema: StructType = StructType(schemaString.split(" ").map(fieldName => {
  if (fieldName.equals("name"))
    StructField(fieldName, StringType, true)
  else
    StructField(fieldName, IntegerType, true)
}))
// Convert the RDD[String] into an RDD[Row]
val rowRDD: RDD[Row] = people.map(_.split(",")).map(p => Row(p(0), p(1).toInt))
// Apply the schema to the RDD
val peopleDataFrame: DataFrame = spark.createDataFrame(rowRDD, schema)
// Register the DataFrame as a temporary table (deprecated in Spark 2.x; prefer createOrReplaceTempView)
peopleDataFrame.registerTempTable("people")
val results = spark.sql("SELECT name FROM people")
results.show

Seq/List -> DataFrame

  • Any number of elements, of any type.
 case class Student(id:Int,name:String,sex:String,age:Int)
   val stuDF: DataFrame = Seq(
      Student(1001, "zhangsan", "F", 20),
      Student(1002, "lisi", "M", 16),
      Student(1003, "wangwu", "M", 21),
      Student(1004, "zhaoliu", "F", 21),
      Student(1005, "zhouqi", "M", 22),
      Student(1006, "qianba", "M", 22),
      Student(1007, "liuliu", "F", 23)
    ).toDF()  // a Seq/List can be converted with toDF directly

 val df: DataFrame = List((1,20),(3,40)).toDF("id","age")
    df.show()
/*
+---+---+
| id|age|
+---+---+
|  1| 20|
|  3| 40|
+---+---+

DataFrame -> RDD

val rdd: RDD[Row] = peopleDataFrame.rdd

DataFrame -> Dataset

 val frame: DataFrame = sqlContext.sql("select name, count(cn) as cn from tbwordcount group by name")

    // Method 1: map each Row into a tuple (needs import spark.implicits._ for the encoder)
    val DS1: Dataset[(String, Long)] = frame.map(row => {
      val name: String = row.getAs[String]("name")
      val cn: Long = row.getAs[Long]("cn")     // the alias "cn" in the SQL above makes this lookup work
      (name, cn)
    })

    // Method 2: cast the DataFrame with as[...]
    val DS2: Dataset[(String, Long)] = frame.as[(String, Long)]
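The same conversion also works against a case class instead of a tuple; a minimal sketch (the WordCount name is made up for illustration, and the column names must match the case class fields):

case class WordCount(name: String, cn: Long)
// requires import spark.implicits._ for the encoder
val typedDS: Dataset[WordCount] = frame.as[WordCount]
typedDS.show()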

IV. Working with External Data Sources

Data sources supported by Spark SQL

(figure omitted)
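As a quick reference before the individual sources, a minimal sketch of the generic read/write API (the file paths are made up for illustration):

// Built-in formats include json, parquet, csv, orc, text, jdbc, and Hive tables
val jsonDF    = spark.read.json("file:///tmp/users.json")
val parquetDF = spark.read.parquet("file:///tmp/users.parquet")
val csvDF     = spark.read.option("header", "true").csv("file:///tmp/users.csv")

// The generic form: specify the format explicitly
val genericDF = spark.read.format("json").load("file:///tmp/users.json")
genericDF.write.format("parquet").mode("overwrite").save("file:///tmp/out/users_parquet")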

Parquet files

  • A popular columnar storage format; it is stored in binary form, and the files contain both data and metadata
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types._

object ParDemo {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[*]").appName("ParquetDemo").getOrCreate()
     import spark.implicits._
    val sc: SparkContext = spark.sparkContext
    val list = List(
      ("zhangsan", "red", Array(3, 4, 5)),
      ("lisi", "blue", Array(7, 8, 9)),
      ("wangwu", "black", Array(12, 15, 19)),
      ("zhaoliu", "orange", Array(7, 9, 6))
    )
    val rdd1: RDD[(String, String, Array[Int])] = sc.parallelize(list)
    val schema = StructType(
      Array(
        StructField("name", StringType),
        StructField("color", StringType),
        StructField("numbers", ArrayType(IntegerType))
      )
    )
    val rowRDD: RDD[Row] = rdd1.map(x => Row(x._1, x._2, x._3))
    val df: DataFrame = spark.createDataFrame(rowRDD, schema)
    df.show()
    df.write.parquet("out/color")
    val frame: DataFrame = spark.read.parquet("out/color")
    frame.printSchema()
    frame.show()
  }
}

Inserting into and reading from Hive tables with Spark

  • Integrating Spark SQL with Hive:

spark-shell on a Linux VM

- Copy hive-site.xml from Hive into Spark's conf directory (ln -s /opt/hive/conf/hive-site.xml  /opt/spark/conf/hive-site.xml)
- Copy the MySQL driver into Spark's jars directory (cp /opt/hive/lib/mysql-connector-java-5.1.38.jar  /opt/spark/jars/)
- Start the metastore service: nohup hive --service metastore &
- Start spark-shell
- Then just write SQL inside spark.sql("...") (e.g. scala> spark.sql("select * from stu").show())
- As with the IDEA environment described below, Hive databases could not be created from spark-shell either.

IDEA development environment

  • On the Linux VM, run the command below; once jps shows a RunJar process you are ready to go
    • nohup hive --service metastore &
  • Creating a Hive database from IDEA hits a permission problem: -chgrp: 'LAPTOP-F4OELHQ8\86187' does not match expected pattern for groupUsage: hadoop fs [generic options] -chgrp [-R] GROUP PATH.... The table itself can be created, but its data ends up under the IDEA project directory instead of being uploaded to HDFS, and the created database is just a plain folder without the .db suffix. I have not found a fix yet; my suggestion is to create the databases on the VM in advance and then simply use them.
  • Dependencies to add in IDEA
 <!-- spark-sql -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.1.1</version>
    </dependency>
    
    <!-- spark-hive -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.11</artifactId>
      <version>2.1.1</version>
    </dependency>
    
     <!-- mysql-connector-java -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.38</version>
        </dependency>
  • Code developed in IDEA
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object SparksqlOnHiveDemo {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName("sparkHive")
      .master("local[*]")
       .config("hive.metastore.uris","thrift://192.168.198.201:9083")
      .enableHiveSupport().getOrCreate()

    spark.sql("show databases").collect().foreach(println)

     // Spark connects to Hive's default database by default;
     // to use tables in other databases, qualify them as dbName.tableName
     val df: DataFrame = spark.sql("select * from toronto")
     df.printSchema()
     df.show()

     val df2: Dataset[Row] = df.where(df("ssn").startsWith("158"))
     val df3: Dataset[Row] = df.filter(df("ssn").startsWith("158"))
  }
}
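The heading also mentions inserting data into Hive; here is a minimal sketch of writing a DataFrame into a Hive table, assuming the SparkSession above with enableHiveSupport() (the table name people_bak is made up for illustration):

import org.apache.spark.sql.SaveMode
import spark.implicits._

val people = Seq(("zs", 21), ("ls", 23)).toDF("name", "age")

// Write through the DataFrame API (creates the table if it does not exist)
people.write.mode(SaveMode.Append).saveAsTable("default.people_bak")

// Or insert through SQL
people.createOrReplaceTempView("people_tmp")
spark.sql("insert into table default.people_bak select name, age from people_tmp")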

Working with MySQL tables

  • On the VM, copy mysql-connector-java-5.1.38.jar into Spark's jars directory
  • In IDEA, add the following dependency
 <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.38</version>
    </dependency>

Reading from MySQL

  • Way 1: read with the generic load method
 spark.read.format("jdbc")
      .option("url", "jdbc:mysql://192.168.198.201:3306/hive")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("user", "root")
      .option("password", "ok")
      .option("dbtable", "TBLS")
      .load().show
  • Way 2: another form of the generic load method
spark.read.format("jdbc")
      .options(Map("url"->"jdbc:mysql://192.168.198.201:3306/hive?user=root&password=ok",
        "dbtable"->"TBLS","driver"->"com.mysql.jdbc.Driver")).load().show
  • Way 3: read with the jdbc method
import org.apache.spark.sql.{DataFrame, SparkSession}

object SparksqlOnMysqlDemo {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[*]")
      .appName("sparksqlOnmysql")
      .getOrCreate()

    val url="jdbc:mysql://192.168.198.201:3306/hive"
    val user="root"
    val pwd="ok"
    val driver="com.mysql.jdbc.Driver"

    val prop=new java.util.Properties()
    prop.setProperty("user",user)
    prop.setProperty("password",pwd)
    prop.setProperty("driver",driver)

    val df: DataFrame = spark.read.jdbc(url,"TBLS",prop)
    df.show()
    df.where(df("CREATE_TIME").startsWith("159")).show()
    val frame: DataFrame = df.groupBy(df("DB_ID")).count()
    frame.printSchema()
    frame.orderBy(frame("count").desc).show()
  }
}

Writing to MySQL

  • The database must already exist
  • If the table does not exist, it is created automatically
import java.util.Properties
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object SparkSQL03_Datasource {
  def main(args: Array[String]): Unit = {
    // Create the configuration object
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkSQL01_Demo")
    // Create the SparkSession
    val spark: SparkSession = SparkSession.builder().config(conf).getOrCreate()
    import spark.implicits._

    val rdd: RDD[(String, Int)] = spark.sparkContext.parallelize(List(("zs", 21), ("ls", 23), ("ww", 26)))
    val df: DataFrame = rdd.toDF("name", "age")
  • Way 1: the generic way, using format to specify the output type
df.write
  .format("jdbc")
  .option("url", "jdbc:mysql://192.168.198.201:3306/test")
  .option("user", "root")
  .option("password", "ok")
  .option("dbtable", "users")
  .mode(SaveMode.Append)
  .save()
/*

mysql> select * from users;
+------+------+
| name | age  |
+------+------+
| ls   |   23 |
| zs   |   21 |
| ww   |   26 |
+------+------+
3 rows in set (0.00 sec)

  • Way 2: through the jdbc method
  val props: Properties = new Properties()
    props.setProperty("user", "root")
    props.setProperty("password", "ok")
    df.write.mode(SaveMode.Append).jdbc("jdbc:mysql://192.168.198.201:3306/test", "users", props)
/*

mysql> select * from users;
+------+------+
| name | age  |
+------+------+
| ls   |   23 |
| zs   |   21 |
| ww   |   26 |
+------+------+
3 rows in set (0.00 sec)

V. Spark SQL Functions

Built-in functions


package sparkSQL
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession, types}

object InnerFunctionDemo {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName("innerfunction").master("local[*]").getOrCreate()

    import spark.implicits._
    val sc: SparkContext = spark.sparkContext
    val accessLog = Array(
      "2016-12-27,001",
      "2016-12-27,001",
      "2016-12-27,002",
      "2016-12-28,003",
      "2016-12-28,004",
      "2016-12-28,002",
      "2016-12-28,002",
      "2016-12-28,001"
    )
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._
//    val rdd1: RDD[String] = sc.parallelize(accessLog)
//   val rdd2: RDD[(String, String)] = rdd1.map(x=>x.split(",")).map(x=>(x(0),x(1)))
//     val frame: DataFrame = rdd2.toDF("date","user_id")
//    frame.groupBy("date").agg(count("user_id")).select("date","count(user_id)").show()

    val rdd1: RDD[Row] = sc.parallelize(accessLog).map(x=>x.split(",")).map(x=>Row(x(0),x(1).toInt))
    val structType= StructType(Array(
      StructField("day", StringType, true),
      StructField("user_id", IntegerType, true)
    ))
    val frame: DataFrame = spark.createDataFrame(rdd1,structType)
    frame.groupBy("day").agg(countDistinct("user_id").as("pv")).select("day","pv")
      .collect().foreach(println)

  }
}

case class

import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SparkSession}



object caseClass {
  case class Student(id:Int,name:String,sex:String,age:Int)
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[*]").appName("case").getOrCreate()
    val sc: SparkContext = spark.sparkContext
   import spark.implicits._
    val stuDF: DataFrame = Seq(
      Student(1001, "zhangsan", "F", 20),
      Student(1002, "lisi", "M", 16),
      Student(1003, "wangwu", "M", 21),
      Student(1004, "zhaoliu", "F", 21),
      Student(1005, "zhouqi", "M", 22),
      Student(1006, "qianba", "M", 22),
      Student(1007, "liuliu", "F", 23)
    ).toDF()  // a Seq/List can be converted with toDF directly
//    stuDF.printSchema()
//    stuDF.show()

    import org.apache.spark.sql.functions._
    stuDF.groupBy(stuDF("sex")).agg(count(stuDF("age")).as("num")).show()
    stuDF.groupBy(stuDF("sex")).agg(max(stuDF("age")).as("max")).show()
    stuDF.groupBy(stuDF("sex")).agg(min(stuDF("age")).as("min")).show()

    stuDF.groupBy(stuDF("sex")).agg("age"->"max","age"->"min","age"->"avg","id"->"count").show()

    stuDF.groupBy("sex","age").count().show()

  }
}

VI. Spark UDF, UDAF & UDTF

UDF

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object SparkUDFDemo {
  case class Hobbies(name:String,hobbies:String)
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName("innerfunction").master("local[*]").getOrCreate()

    import spark.implicits._
    val sc: SparkContext = spark.sparkContext
    val rdd: RDD[String] = sc.textFile("in/hobbies.txt")
    val df: DataFrame = rdd.map(x=>x.split(" ")).map(x=>Hobbies(x(0),x(1))).toDF()
     //df.printSchema()
   // df.show()
    df.registerTempTable("hobbies")   // deprecated in Spark 2.x; prefer createOrReplaceTempView
    // Register a UDF that counts the comma-separated hobbies
    spark.udf.register("hobby_num",
      (v: String) => v.split(",").size
    )
    val frame: DataFrame = spark.sql("select name,hobbies,hobby_num(hobbies) as hobnum from hobbies")
    frame.show()
  }
}
/*
+-----+--------------------+------+
| name|             hobbies|hobnum|
+-----+--------------------+------+
|alice|jogging,Coding,co...|     3|
| lina|        travel,dance|     2|
+-----+--------------------+------+

UDAF

import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
// A custom UDAF
object SparkUDAFDemo {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName("udaf").master("local[*]").getOrCreate()
    val sc: SparkContext = spark.sparkContext

    val df: DataFrame = spark.read.json("in/user.json")
//    df.printSchema()
    println("Contents of the input file")
    df.show()

    // Create and register the custom UDAF
    val myUdaf = new MyAgeAvgFunction
    spark.udf.register("myAvgAge", myUdaf)

    df.createTempView("userinfo")

    val resultDF: DataFrame = spark.sql("select  sex, myAvgAge(age) from userinfo group by sex")

  //  resultDF.printSchema()
    println("Result after applying the UDAF")
    resultDF.show()
  }
}


class MyAgeAvgFunction extends UserDefinedAggregateFunction{
  // Input schema of the aggregate function
  override def inputSchema: StructType = {
    new StructType().add("age",LongType)
//    StructType(StructField("age",LongType)::Nil)  // equivalent to the line above
  }
  // Schema of the aggregation buffer
  override def bufferSchema: StructType = {
    new StructType().add("sum",LongType).add("count",LongType)
//    StructType(StructField("sum",LongType)::StructField("count",LongType)::Nil)
  }
  // Data type of the aggregate function's result
  override def dataType: DataType = DoubleType
  // Whether the function is deterministic, i.e. the same input always produces the same output
  override def deterministic: Boolean = true
  // Initialize the buffer
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0)=0L
    buffer(1)=0L
  }
  // Process one new input row
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    buffer(0) = buffer.getLong(0)+input.getLong(0)
    buffer(1) = buffer.getLong(1)+1
  }
  // Merge two aggregation buffers
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    // total age
    buffer1(0)=buffer1.getLong(0)+buffer2.getLong(0)
    // total count
    buffer1(1)=buffer1.getLong(1)+buffer2.getLong(1)
  }
  // Compute the final result
  override def evaluate(buffer: Row): Any = {
    buffer.getLong(0).toDouble/buffer.getLong(1)
  }
}

/*
Contents of the input file
+---+----+----+-----+
|age|  id|name|  sex|
+---+----+----+-----+
| 20|1001| foo|  man|
| 24|1002| bar|  man|
| 18|1003| baz|  man|
| 17|1004|foo1|woman|
| 19|1005|bar2|woman|
| 20|1006|baz3|woman|
+---+----+----+-----+

Result after applying the UDAF
+-----+---------------------+
|  sex|myageavgfunction(age)|
+-----+---------------------+
|  man|   20.666666666666668|
|woman|   18.666666666666668|
+-----+---------------------+

UDTF

import java.util
import org.apache.hadoop.hive.ql.exec.UDFArgumentException
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory
import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, ObjectInspectorFactory, PrimitiveObjectInspector, StructObjectInspector}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}


object SparkUDTFDemo {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName("udtf").master("local[*]")
      .enableHiveSupport()        // Spark itself does not provide UDTFs, so Hive support is enabled here
      .getOrCreate()
    val sc: SparkContext = spark.sparkContext
    import spark.implicits._
    val lines: RDD[String] = sc.textFile("in/udtf.txt")
    val stuDF: DataFrame = lines.map(_.split("//")).filter(x => x(1).equals("ls"))
      .map(x => (x(0), x(1), x(2))).toDF("id", "name", "class")
//    stuDF.printSchema()
//    stuDF.show()
    stuDF.createOrReplaceTempView("student")

   spark.sql("CREATE TEMPORARY FUNCTION MyUDTF AS 'sparkSQL.myUDTF' ")
    spark.sql("select MyUDTF(class) from student").show()

  }
}

class myUDTF extends GenericUDTF{
  override def initialize(argOIs: Array[ObjectInspector]): StructObjectInspector = {
    // Validate the input arguments
    if(argOIs.length != 1){
      throw new UDFArgumentException("Exactly one argument is required")
    }
    if(argOIs(0).getCategory!=ObjectInspector.Category.PRIMITIVE){
      throw new UDFArgumentException("Argument type mismatch")
    }

    val fieldNames=new util.ArrayList[String]
    val fieldOIs=new util.ArrayList[ObjectInspector]()
    // Name of the output column
    fieldNames.add("type")
    // Type of the output column
    fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector)
    ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames,fieldOIs)
  }

  // This method does the actual processing; the input array holds a single row,
  // i.e. each call to process handles one row.
  // Input:  "Hadoop scala kafka hive ..."
  // Output: one column named "type" (String), one row per word:
  //          Hadoop
  //          scala
  //          kafka
  //          hive
  //          ...
  override def process(objects: Array[AnyRef]): Unit ={
    // Split the string into an array of words (split on spaces)
    val strings: Array[String] = objects(0).toString.split(" ")
    println(strings.mkString(","))
    for (elem <- strings) {
      val tmp = new Array[String](1)
      tmp(0)=elem
      forward(tmp)
    }
  }
  override def close(): Unit = {}
}

/*
+------+
|Hadoop|
| scala|
| kafka|
|  hive|
| hbase|
| Oozie|
+------+

VII. Spark SQL CLI

  • The Spark SQL CLI is a convenient tool for running the Hive metastore service in local mode and executing queries typed on the command line
  • Note that the Spark SQL CLI cannot communicate with the Thrift JDBC server
  • The Spark SQL CLI is the counterpart of the Hive CLI (the old CLI) and Beeline (the new CLI)
  • Copy hive-site.xml, hdfs-site.xml and core-site.xml into $SPARK_HOME/conf
  • To start the Spark SQL CLI, run ./bin/spark-sql from the Spark directory
$spark-sql
spark-sql> show databases;
default
spark-sql> show tables;
default		toronto		false

spark-sql> select * from toronto where ssn like '111%';
John S. 111-222-333 123 Yonge Street

spark-sql> create table montreal(full_name string, ssn string, office_address string);
spark-sql> insert into montreal values('Winnie K. ', '111-222-333 ', '62 John Street');
spark-sql> select t.full_name, m.ssn, t.office_address, m.office_address from toronto t inner join montreal m on t.ssn = m.ssn;
John S. 	111-222-333 	123 Yonge Street 	62 John Street

VIII. Spark Performance Tuning

Serialization

  • Java serialization: Spark's default
  • Kryo serialization: roughly 10x faster than Java serialization, but it does not support every serializable type
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Register custom classes with Kryo
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
  • If a class that needs to be serialized is not registered, Kryo still works, but it stores the full class name with every object, which often wastes more space than the default Java serialization

Tuning tips

  • Use arrays of objects and primitive types instead of Java/Scala collection classes (such as HashMap)
  • Avoid nested structures
  • Prefer numeric keys over string keys
  • Use MEMORY_ONLY_SER for larger RDDs
  • When loading CSV or JSON, load only the columns you need
  • Persist intermediate results (RDD/DS/DF) only when they are actually reused
  • Avoid generating unnecessary intermediate results (RDD/DS/DF)
  • DataFrames execute roughly 3x faster than Datasets (simpler structure: only Row objects); a short sketch of the column-pruning and serialized-caching tips follows this list
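A minimal sketch of two of these tips, column pruning at load time and serialized caching of a large RDD (the paths and column names are made up for illustration):

import org.apache.spark.storage.StorageLevel

// Load only the columns that are actually needed from a CSV file
val orders = spark.read.option("header", "true").csv("file:///tmp/orders.csv")
  .select("order_id", "amount")
orders.show()

// Cache a large RDD in serialized form to reduce its memory footprint
val bigRDD = spark.sparkContext.textFile("file:///tmp/big.log")
bigRDD.persist(StorageLevel.MEMORY_ONLY_SER)
bigRDD.count()   // materializes the cache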

Partition tuning

  • Tune RDD partitioning and spark.default.parallelism
    • This parameter sets the default number of tasks per stage
  • Broadcast large variables instead of referencing them directly (see the sketch below)
  • Try to process data locally and minimize data transfer across worker nodes
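A minimal sketch of setting the default parallelism and broadcasting a large lookup variable (the names and data are made up for illustration):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setAppName("tuning")
  .setMaster("local[*]")
  .set("spark.default.parallelism", "8")   // default number of tasks per stage for RDD shuffles

val spark = SparkSession.builder().config(conf).getOrCreate()
val sc = spark.sparkContext

// Broadcast a large lookup map once per executor instead of shipping it with every task
val cityById = Map(1 -> "Beijing", 2 -> "Shanghai")
val cityBC = sc.broadcast(cityById)

sc.parallelize(Seq((1, "zs"), (2, "ls")))
  .map { case (cityId, name) => (name, cityBC.value.getOrElse(cityId, "unknown")) }
  .collect()
  .foreach(println)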

Join operations

  • Put the small table on the left side of the join: it is cached in memory and each row of the large table on the right is matched against it, which is faster
  • Another rule of thumb is to put the table with fewer duplicate join keys on the left, because every duplicate key in the left-hand table adds one more round of processing underneath. When two tables are joined, the scan keeps going even after a match is found, so if a join-key value is duplicated in one table you will see multiple duplicate rows in the output. A related Spark SQL technique, the explicit broadcast hint, is sketched below.
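In Spark SQL, the small side can also be marked explicitly with a broadcast hint, which lets Catalyst choose a broadcast (map-side) join regardless of which side the table is written on; a minimal sketch with made-up tables:

import org.apache.spark.sql.functions.broadcast
import spark.implicits._

val largeFactDF = Seq((1, 100.0), (2, 50.0), (1, 80.0)).toDF("id", "amount")
val smallDimDF  = Seq((1, "Beijing"), (2, "Shanghai")).toDF("id", "city")

// Hint Catalyst to broadcast the small table to every executor (map-side join)
val joined = largeFactDF.join(broadcast(smallDimDF), Seq("id"))
joined.explain()   // the physical plan should show a BroadcastHashJoin
joined.show()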
