[11] Integrating Spark Streaming with Spark SQL in Local Mode (Scala)

DataFrame and SQL operations can be applied to streaming data.

First, create a SparkSession using the same SparkContext that the StreamingContext uses, so the streaming and SQL sides share one context.
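A minimal sketch of that idea (the full code below does the same thing through a lazily instantiated singleton):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Build the SparkSession on top of the StreamingContext's SparkContext,
// so both APIs share a single SparkContext.
val sparkConf = new SparkConf().setMaster("local[2]").setAppName("SqlNetworkWordCount")
val ssc = new StreamingContext(sparkConf, Seconds(5))
val spark = SparkSession.builder.config(ssc.sparkContext.getConf).getOrCreate()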


Project directory:


pom.xml


<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.sid.spark</groupId>
  <artifactId>spark-train</artifactId>
  <version>1.0</version>
  <inceptionYear>2008</inceptionYear>
  <properties>
    <scala.version>2.11.8</scala.version>
    <kafka.version>0.9.0.0</kafka.version>
    <spark.version>2.2.0</spark.version>
    <hadoop.version>2.9.0</hadoop.version>
    <hbase.version>1.4.4</hbase.version>
  </properties>

  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
  </repositories>

  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka_2.11</artifactId>
      <version>${kafka.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
      <exclusions>
        <exclusion>
          <artifactId>servlet-api</artifactId>
          <groupId>javax.servlet</groupId>
        </exclusion>
      </exclusions>
    </dependency>

    <!-- Two dependency blocks were commented out here in the original,
         presumably HBase (note the hbase.version property above). -->

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>${spark.version}</version>
    </dependency>

    <dependency>
      <groupId>net.jpountz.lz4</groupId>
      <artifactId>lz4</artifactId>
      <version>1.3.0</version>
    </dependency>

    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.31</version>
    </dependency>

  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
          <args>
            <arg>-target:jvm-1.5</arg>
          </args>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-eclipse-plugin</artifactId>
        <configuration>
          <downloadSources>true</downloadSources>
          <buildcommands>
            <buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
          </buildcommands>
          <additionalProjectnatures>
            <projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
          </additionalProjectnatures>
          <classpathContainers>
            <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
            <classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
          </classpathContainers>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <reporting>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
        </configuration>
      </plugin>
    </plugins>
  </reporting>
</project>
Code

package com.sid.spark

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

/**
  * Use DataFrames and SQL to count words in UTF8 encoded, '\n' delimited text received from the
  * network every batch interval.
  *
  * Adapted from the Spark example SqlNetworkWordCount, whose usage is
  *   SqlNetworkWordCount <hostname> <port>
  * where <hostname> and <port> describe the TCP server that Spark Streaming would connect to
  * receive data. In this version the host and port are hardcoded to node1:6789.
  *
  * To run it, first start a Netcat server
  *    `$ nc -lk 6789`
  * and then launch this object from the IDE.
  *
  * Word frequency count with Spark Streaming integrated with Spark SQL.
  */

object SqlNetworkWordCount {
  def main(args: Array[String]) {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("SqlNetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // Create a socket stream on the target host:port and count the
    // words in the input stream of \n delimited text (e.g. generated by 'nc')
    val lines = ssc.socketTextStream("node1", 6789)
    val words = lines.flatMap(_.split(" "))

    // Convert RDDs of the words DStream to DataFrame and run SQL query
    words.foreachRDD { (rdd: RDD[String], time: Time) =>
      // Get the singleton instance of SparkSession
      val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
      import spark.implicits._

      // Convert RDD[String] to RDD[case class] to DataFrame
      val wordsDataFrame = rdd.map(w => Record(w)).toDF()

      // Creates a temporary view using the DataFrame
      wordsDataFrame.createOrReplaceTempView("words")

      // Do word count on table using SQL and print it
      val wordCountsDataFrame =
        spark.sql("select word, count(*) as total from words group by word")
      println(s"========= $time =========")
      wordCountsDataFrame.show()
    }

    ssc.start()
    ssc.awaitTermination()
  }

  /** Case class for converting RDD to DataFrame */
  case class Record(word: String)


  /** Lazily instantiated singleton instance of SparkSession */
  object SparkSessionSingleton {

    @transient  private var instance: SparkSession = _

    def getInstance(sparkConf: SparkConf): SparkSession = {
      if (instance == null) {
        instance = SparkSession
          .builder
          .config(sparkConf)
          .getOrCreate()
      }
      instance
    }
  }
}

Start a Netcat server: nc -lk 6789

Run the project from IDEA.


In nc, enter: a a a a a

Result in the IDEA console:
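For that input, the word-count table printed for the batch looks roughly like this (batch timestamp elided):

========= <batch time> ms =========
+----+-----+
|word|total|
+----+-----+
|   a|    5|
+----+-----+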

Then in nc, enter: a a

Result in the IDEA console:
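Roughly, the second batch prints the following. Note the count is 2, not a cumulative 7: each batch is counted independently, which is exactly the gap described below.

========= <batch time> ms =========
+----+-----+
|word|total|
+----+-----+
|   a|    2|
+----+-----+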


To do:

When combining Spark Streaming with Spark SQL for word counting, the state from previous batches needs to be carried forward; the expected result for this example is a 7, not a per-batch count. I'll come back and complete this later.
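One way to carry counts across batches is a stateful DStream transformation. Here is a minimal sketch using updateStateByKey, under the assumption that this is the kind of fix intended (the checkpoint path is a placeholder, and this is not the original author's final solution); it would go in main() before ssc.start(), using the words DStream from the code above:

// Cumulative word count across batches with updateStateByKey.
// After the inputs above ("a a a a a", then "a a") this prints (a,7).
ssc.checkpoint("/tmp/spark-wordcount-checkpoint") // placeholder path; stateful ops require checkpointing

val runningCounts = words
  .map(word => (word, 1))
  .updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
    // Add this batch's occurrences to the running total kept in state.
    Some(newValues.sum + state.getOrElse(0))
  }

runningCounts.print()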
