Spark Scala Dataset, DataFrame, RDD, and SQL usage

Dataset[String] to Dataset[case class]: converting a Dataset[String] into a Dataset[ThrDynamicRowV001]

`// Dataset[String] -> Dataset[ThrDynamicRowV001]: parse each CSV line into the case class.
// import spark.implicits._ is required for the Encoder of the case class.
import spark.implicits._

val ds: Dataset[ThrDynamicRowV001] = spark.read.textFile(inputThrFile).map(row => {
  // split with limit -1 so trailing empty fields are kept
  val split_str = row.split(",", -1)
  // replace empty numeric fields with "-1" before conversion
  for (i <- 0 to 13) {
    if (split_str(i).isEmpty) {
      split_str(i) = "-1"
    }
  }
  val uniqueId = split_str(0).toLong
  val acqTime = split_str(1).toLong
  val targetType = split_str(2).toShort
  val dataSupplier = split_str(3).toShort
  val dataSource = split_str(4).toShort
  val status = split_str(5).toShort
  val longitude = split_str(6).toLong
  val latitude = split_str(7).toLong
  val areaId = split_str(8).toLong
  val speed: Int = split_str(9).toInt
  val convertion: Double = split_str(10).toDouble
  val cog: Int = split_str(11).toInt
  val trueHead: Int = split_str(12).toInt
  val power: Int = split_str(13).toInt
  val ext: String = split_str(14)
  val extend: String = split_str(15)
  ThrDynamicRowV001(
    uniqueId, acqTime, targetType, dataSupplier, dataSource, status,
    longitude, latitude, areaId, speed, convertion, cog, trueHead,
    power, ext, extend
  )
})`
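
The ThrDynamicRowV001 case class itself is not shown in the post; based on the conversions above, a minimal sketch could look like this:

`// Hypothetical sketch of the case class used above; field types are inferred
// from the toLong/toShort/toInt/toDouble conversions in the map().
case class ThrDynamicRowV001(
  uniqueId: Long, acqTime: Long, targetType: Short, dataSupplier: Short,
  dataSource: Short, status: Short, longitude: Long, latitude: Long,
  areaId: Long, speed: Int, convertion: Double, cog: Int, trueHead: Int,
  power: Int, ext: String, extend: String
)`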

Spark DataFrame or Dataset: groupByKey and flatMapGroups

  • groupByKey
  • Similar to the SQL "GROUP BY" clause, the Spark groupBy() function is used to collect identical data into groups on a DataFrame/Dataset and perform aggregate functions on the grouped data.
  • Syntax: calling groupBy() on a Spark DataFrame returns a RelationalGroupedDataset object, which provides the aggregate functions.
    • groupBy(col1 : scala.Predef.String, cols : scala.Predef.String*) :
      org.apache.spark.sql.RelationalGroupedDataset
  • flatMapGroups
  • Applies the given function to each group of data: the user-defined function is applied to every group and returns a new Dataset[case class]; the per-group results are then flattened into one large Dataset.
    • Applies the given function to each group of data. For each unique group, the function will be passed the group key and an iterator that contains all of the elements in the group. The function can return an iterator containing elements of an arbitrary type which will be returned as a new [[Dataset]].
  • groupByKey(key1).flatMapGroups((key, iters) => { function1 })
    • demo:
      `// readBooksDF is assumed to be a Dataset[AISship] (e.g. obtained via .as[AISship])
      val shipData = readBooksDF.groupByKey(_.mmsi)
        .flatMapGroups((_, group) => {
          var i = 0
          var list1: List[AISship] = List()
          // sort this group's points by acquisition time,
          // e.g. List(AISship(309787000,1440246594,-5760000,129434402), …)
          val pointList = group.toList.sortBy(_.acqtime)
          var lastTime = pointList.head.acqtime
          var index = 0
          for (row <- pointList) {
            val nTime = row.acqtime

            // difference between the two timestamps, converted to hours
            val timeDiff = getDiff(nTime, lastTime) / 3600.0
            if ((timeDiff > 48) && ((index + 3 + 1) <= pointList.length)) {
              // take the next three consecutive points and check that each pair of
              // consecutive points is less than 30 minutes apart; if so, start a new batch
              val tmp = pointList.apply(index + 1).acqtime
              val tmp2 = pointList.apply(index + 2).acqtime
              val tmp3 = pointList.apply(index + 3).acqtime
              val diff1 = getDiff(tmp, nTime) / 60.0
              val diff2 = getDiff(tmp2, tmp) / 60.0
              val diff3 = getDiff(tmp3, tmp2) / 60.0
              if ((diff1 < 30) && (diff2 < 30) && (diff3 < 30)) {
                i += 1
              }
            }

            list1 = list1 :+ AISship(row.mmsi, row.acqtime, row.latitude, row.longitude, i.toString)
            lastTime = nTime
            index += 1
          }

          list1
        })`
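
The demo relies on an AISship case class and a getDiff helper that are defined elsewhere in the project and not shown in the post; a minimal sketch of what they might look like, assuming acqtime is an epoch timestamp in seconds:

`// Hypothetical definitions assumed by the flatMapGroups demo above.
// The last field of AISship carries the batch index produced by the grouping logic.
case class AISship(mmsi: Long, acqtime: Long, latitude: Long, longitude: Long, batch: String = "0")

// Absolute difference between two epoch-second timestamps, in seconds.
def getDiff(t1: Long, t2: Long): Long = math.abs(t1 - t2)`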

Spark SQL

Spark write to HBase table

`// HBaseTableCatalog comes from the hbase-spark (shc) connector:
// import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog
df.write.options(
  Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "4"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()`
  • Store the DataFrame to an HBase table using the save() function on the DataFrame. format takes "org.apache.spark.sql.execution.datasources.hbase", the DataSource defined in the "hbase-spark" API, which lets us use DataFrames with HBase tables. df.write.options takes the catalog and specifies that 4 regions should be used in the cluster. Finally, save() writes the data to the HBase table. A hypothetical catalog definition is sketched below.
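
The catalog passed to the writer is a JSON string that maps the DataFrame schema onto an HBase table and its column families. The actual catalog used in this post is not shown; a hypothetical sketch covering a few of the ship-dynamics columns (the table name, column family, and column mappings below are placeholders) could look like this:

`// Hypothetical catalog; "ship_dynamic", column family "cf" and the mappings
// are placeholders, not the post's real schema.
val catalog: String =
  """{
    |  "table":{"namespace":"default", "name":"ship_dynamic"},
    |  "rowkey":"key",
    |  "columns":{
    |    "uniqueId":{"cf":"rowkey", "col":"key", "type":"long"},
    |    "acqTime":{"cf":"cf", "col":"acqTime", "type":"long"},
    |    "longitude":{"cf":"cf", "col":"longitude", "type":"long"},
    |    "latitude":{"cf":"cf", "col":"latitude", "type":"long"}
    |  }
    |}""".stripMargin`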

Spark write to Cassandra keyspace table: Dataset[case class]

`// requires the spark-cassandra-connector; SaveMode comes from org.apache.spark.sql.SaveMode
formatChange.write
  .option("keyspace", keyspace).option("table", table)
  .option("spark.cassandra.output.consistency.level", consistencyLevel)
  .format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Append)
  .save()`

demo: store a Dataset[CassandraAisDynamicCompressRow] to a Cassandra keyspace table. The keyspace, table, and connection settings it relies on are sketched below.
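
Neither the keyspace, table, consistencyLevel values nor the Cassandra contact point are shown in the post; a minimal sketch of the configuration the write (and the read further below) relies on, assuming a local Cassandra node and the spark-cassandra-connector on the classpath:

`// Minimal sketch; the contact point, keyspace and table names are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ais-cassandra-demo")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .getOrCreate()

val keyspace = "ais"               // hypothetical keyspace
val table = "dynamic_compress"     // hypothetical table
val consistencyLevel = "LOCAL_ONE" // passed to spark.cassandra.output.consistency.level`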

Spark read HBase table: reading from HBase into a DataFrame

`val hbaseDF = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()`

Spark read Cassandra table

`val readBooksDF = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> table, "keyspace" -> keyspace))
  .load()`

Spark DataFrame where or filter: where and filter work the same way; the two forms below are equivalent.

  • The Spark filter() or where() function is used to filter rows from a DataFrame or Dataset based on one or more conditions or a SQL expression.
  • Spark DataFrame filter or where signatures (a self-contained example follows this list):
      1. filter(condition: Column): Dataset[T]
      2. filter(conditionExpr: String): Dataset[T] // using a SQL expression
      3. filter(func: T => Boolean): Dataset[T]
      4. filter(func: FilterFunction[T]): Dataset[T]
    • With the first signature you can refer to column names using any of the following syntaxes: $"colname", col("colname"), 'colname, df("colname"), combined with a condition expression.
    • The second signature is used to provide a SQL expression to filter rows.
    • The third signature is used with a Scala function that is applied to each row.
    • The fourth signature is used with the FilterFunction class.
    • demo1: DataFrame where() with a Column condition
      • demo: find all rows where the state column equals "OH"
        df.where(df("state") === "OH").show(false)
        df.where('state === "OH").show(false)
        df.where($"state" === "OH").show(false)
        df.where(col("state") === "OH").show(false)
    • demo2: where() with multiple conditions. Find all rows where the state column equals "OH" and the gender column equals "M"
      • df.where(df("state") === "OH" && df("gender") === "M")
    • demo3: find all rows whose ymxy column value is in the cqlQuery list
      • df.where($"ymxy" isInCollection cqlQuery.toList)
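
A self-contained sketch of the four signatures listed above; the sample rows and column names (name, state, gender) are made up for illustration:

`// Runnable sketch of where()/filter(); the sample data is made up.
import org.apache.spark.api.java.function.FilterFunction
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("filter-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("James", "OH", "M"), ("Anna", "NY", "F"), ("Robert", "OH", "M"))
  .toDF("name", "state", "gender")

df.where(col("state") === "OH").show(false)                             // 1. Column condition
df.where("gender = 'M' and state = 'OH'").show(false)                   // 2. SQL expression
df.filter((row: Row) => row.getAs[String]("state") == "OH").show(false) // 3. Scala function
df.filter(new FilterFunction[Row] {                                     // 4. FilterFunction
  override def call(row: Row): Boolean = row.getAs[String]("gender") == "M"
}).show(false)`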

Spark select: select multiple columns

  • Select only the specified columns
    • demo: df.select("mmsi", "acqtime", "latitude", "longitude", "ext")

Spark dataframe to Dataset

  • Spark DataFrame to Dataset[AISship], with the prerequisite that the DataFrame's column names and types match those of the case class
    • demo: val readBookDataset = dataframe.as[AISship] (a sketch with the required import follows)
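
A minimal sketch of the conversion, assuming the AISship case class sketched earlier and a SparkSession named spark:

`// Requires the implicit Encoder[AISship] from spark.implicits._;
// the DataFrame's column names and types must match AISship's fields.
import org.apache.spark.sql.Dataset
import spark.implicits._

val readBookDataset: Dataset[AISship] = dataframe.as[AISship]`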

Spark Dataset write to CSV files

  • demo: IoPolyEvent has type Dataset[IoPortEventRow]; repartition(1) forces a single output part file, and the path passed to csv() is written as a directory
    `val outputIoPolyEventPath = "data6.csv"
    IoPolyEvent
      .repartition(1)
      .write
      .option("header", "true")
      .csv(outputIoPolyEventPath)`

Spark examples: https://sparkbyexamples.com/
https://www.javatpoint.com/apache-spark-groupbykey-function

DataSet

  • Advantages of Dataset (a short type-safety sketch follows this list)
    • Combines the strengths of RDD and DataFrame, supporting both structured and unstructured data
    • Like RDD, supports storing custom domain objects
    • Like DataFrame, supports SQL queries over structured data
    • Uses off-heap memory storage, which is GC-friendly
    • Type-safe conversions, cleaner code
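
A small illustration of the type-safety point, assuming the AISship Dataset ds and its equivalent DataFrame df from the earlier examples, with spark.implicits._ in scope:

`// ds: Dataset[AISship] and df: its DataFrame view are assumed from the earlier sketches.
val acqTimes = ds.map(_.acqtime + 1)   // field access is checked by the Scala compiler
// ds.map(_.acqtimee)                  // typo: would not compile

val acqTimesDf = df.select("acqtime")  // column name is just a string
// df.select("acqtimee")               // typo: compiles, but fails at runtime with AnalysisException`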

Dataset (API overview)

  • A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
  • Operations available on Datasets are divided into transformations and actions. Transformations are the ones that produce new Datasets, and actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions include count, show, or writing data out to file systems.
  • Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. To explore the logical plan as well as the optimized physical plan, use the explain function.
  • To efficiently support domain-specific objects, an Encoder is required. The encoder maps the domain-specific type T to Spark's internal type system. For example, given a class Person with two fields, name (String) and age (Int), an encoder is used to tell Spark to generate code at runtime to serialize the Person object into a binary structure. This binary structure often has a much lower memory footprint and is optimized for efficiency in data processing (e.g. in a columnar format). To understand the internal binary representation of the data, use the schema function; a small runnable version of this example follows.
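
The Person example quoted above can be reproduced in a few lines; printSchema shows the schema of the internal representation and explain shows the logical and physical plans:

`// Reproduces the Person example from the API-doc excerpt above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("encoder-demo").master("local[*]").getOrCreate()
import spark.implicits._

case class Person(name: String, age: Int)

val people = Seq(Person("Andy", 32), Person("Justin", 19)).toDS()
people.printSchema()                    // schema of the internal representation
people.filter(_.age > 20).explain(true) // logical plan and optimized physical plan`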

RDD
[Figure 1: RDD[Person] (left) vs. a DataFrame with a schema (right)]

  • The figure above illustrates the difference between a DataFrame and an RDD. The RDD[Person] on the left takes Person as its type parameter, but the Spark framework itself knows nothing about the internal structure of the Person class. The DataFrame on the right, by contrast, provides detailed structural information, so Spark SQL knows exactly which columns the dataset contains and the name and type of each column. In other words, a DataFrame adds structural information about the data, i.e. a schema. An RDD is a distributed collection of Java objects, whereas a DataFrame is a distributed collection of Row objects. Besides offering richer operators than RDD, the more important advantages of DataFrame are better execution efficiency, reduced data reads, and execution-plan optimizations such as filter pushdown and column pruning.

RDD vs. Dataset

  • A Dataset is represented as a Catalyst logical plan, and its data is stored in an encoded binary form; operations such as sorting and shuffling can be performed without deserializing the data
  • Creating a Dataset requires an explicit Encoder, which serializes objects into that binary form and maps the object's schema to Spark SQL types, whereas an RDD relies on runtime reflection
  • Because of these two points, Dataset performance is much better than RDD (a short contrast sketch follows this list)
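
As a small follow-up to these points, reusing the Person case class and SparkSession from the previous sketch:

`// The same records as an RDD (plain JVM objects) and as a Dataset (encoded binary rows).
val personRDD = spark.sparkContext.parallelize(Seq(Person("Andy", 32), Person("Justin", 19)))
val personDS = Seq(Person("Andy", 32), Person("Justin", 19)).toDS()

personDS.explain()          // Catalyst logical/physical plan; an RDD has no such plan
personDS.sort("age").show() // sorting runs on the encoded binary form`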

Dataset vs. DataFrame

  • DataFrame is a special case of Dataset: each record of a Dataset is a strongly typed value (a case class), while each record of a DataFrame is a Row

spark dataset
http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html

Differences between Spark RDD, DataFrame, and Dataset
https://www.jianshu.com/p/711ded043053

spark rdd api
http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#groupByKey

