Getting Started with Big Data: The Distributed Computing Framework Spark (3) -- Spark Streaming

1. Overview

Spark Streaming takes data from different sources, processes it, and writes the results out to external file systems.

Features: low latency; efficient recovery from failures; scales out to hundreds or thousands of nodes; can be combined with the batch, machine learning, and graph processing sub-frameworks.

How it works:

    Coarse-grained view: Spark Streaming receives a live data stream, slices it into small data blocks by the configured time interval, and hands those blocks to the Spark engine for processing.

    Fine-grained view: the Spark application runs on the Driver (SparkContext, StreamingContext). The Driver asks the Executors to start Receivers; when input data (an input stream) arrives, a Receiver splits it into blocks and stores them in memory first, replicating the blocks to other Executors if multiple replicas are configured. Once replication is done, the Receiver reports the block information (which machines hold which blocks) back to the StreamingContext. At every batch interval (a few seconds), the StreamingContext asks the SparkContext to launch jobs, which are then distributed to the Executors for execution.

2. Core Concepts

DStream: a continuous stream of data, represented as a series of RDDs, where each RDD contains the data of one batch.

    Operators applied to a DStream, such as map/flatMap, are translated under the hood into the same operation applied to every RDD in the DStream, because a DStream is made up of the RDDs of its successive batches (see the sketch after these definitions).

Input DStreams and Receivers: every input DStream (except those backed by a file system) is associated with a Receiver, which receives data from the source and stores it in Spark's memory.

Transformations: derive new DStreams from an input DStream (map, flatMap, filter, ...).

Output Operations: write the data of a DStream to an external system (a database or a file system).
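
The following minimal sketch (not from the original course code; the localhost:6789 socket source and 5-second batch interval are assumptions) ties these concepts together and illustrates that an operator on a DStream is simply the same operation applied to each of its underlying RDDs:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Minimal concept sketch (assumed source/port): input DStream -> transformation -> output operation
  */
object DStreamConceptSketch {

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("DStreamConceptSketch")
    val ssc = new StreamingContext(sparkConf, Seconds(5))            // batch interval: 5 seconds

    val lines = ssc.socketTextStream("localhost", 6789)              // Input DStream (backed by a Receiver)

    val upper1 = lines.map(_.toUpperCase)                            // operator on the DStream ...
    val upper2 = lines.transform(rdd => rdd.map(_.toUpperCase))      // ... equivalent to mapping each underlying RDD

    upper1.print()                                                   // Output Operation
    upper2.print()

    ssc.start()
    ssc.awaitTermination()
  }
}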

3. Examples

3.1 Processing socket data with Spark Streaming

Dependencies (pom.xml excerpt):

    <properties>
        <scala.version>2.11.8</scala.version>
        <kafka.version>0.8.2.1</kafka.version>
        <spark.version>2.2.0</spark.version>
        <hadoop.version>2.6.0-cdh5.7.0</hadoop.version>
        <hbase.version>1.2.0-cdh5.7.0</hbase.version>
    </properties>

    <!-- Cloudera repository for the CDH artifacts -->
    <repositories>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>

        <!-- Kafka -->
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka_2.11</artifactId>
            <version>${kafka.version}</version>
        </dependency>

        <!-- Hadoop -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

        <!-- HBase -->
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>${hbase.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>${hbase.version}</version>
        </dependency>

        <!-- Spark Streaming -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <!-- Spark Streaming integration with Flume (push) -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-flume_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <!-- Spark Streaming integration with Flume (pull / custom sink) -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-flume-sink_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.5</version>
        </dependency>

        <!-- Spark Streaming integration with Kafka -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <!-- Spark SQL -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>com.fasterxml.jackson.module</groupId>
            <artifactId>jackson-module-scala_2.11</artifactId>
            <version>2.6.5</version>
        </dependency>

        <dependency>
            <groupId>net.jpountz.lz4</groupId>
            <artifactId>lz4</artifactId>
            <version>1.3.0</version>
        </dependency>

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.37</version>
        </dependency>

        <!-- Flume log4j appender -->
        <dependency>
            <groupId>org.apache.flume.flume-ng-clients</groupId>
            <artifactId>flume-ng-log4jappender</artifactId>
            <version>1.6.0</version>
        </dependency>
    </dependencies>

/**
  * Spark Streaming word count over socket data
  * Test with: nc
  */
object NetworkWordCount {

  def main(args: Array[String]): Unit = {

    //Needed because this machine has no Hadoop environment configured
    System.setProperty("hadoop.home.dir", "E:/winutils/")

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")

    //Creating a StreamingContext requires two parameters: a SparkConf and the batch interval
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 6789)

    val result = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    result.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Test by running nc -lk 6789 and typing words into it.

3.2 Processing file system data with Spark Streaming

/**
  * Word count over file system (local/HDFS) data with Spark Streaming
  */
object FileWordCount {

  def main(args: Array[String]): Unit = {

    //Needed because this machine has no Hadoop environment configured
    System.setProperty("hadoop.home.dir", "E:/winutils/")

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("FileWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // Reading from a file system does not require a Receiver
    val lines = ssc.textFileStream("file:///D:/BidDataTestFile/ss/")
    val result = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    result.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

    Note: it only reads data from files newly added to the directory; files that have already been processed are not reprocessed.

4. Advanced Spark Streaming

4.1 Counting cumulative word occurrences

/**
  * Stateful (cumulative) word count with Spark Streaming
  */
object StatefulWordCount {

  def main(args: Array[String]): Unit = {

    //Needed because this machine has no Hadoop environment configured
    System.setProperty("hadoop.home.dir", "E:/winutils/")

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("StatefulWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    //When using a stateful operator, a checkpoint directory must be set
    //In production, it is recommended to point the checkpoint to a directory on HDFS
    //"." means the checkpoint files are written to the current directory
    ssc.checkpoint(".")

    val lines = ssc.socketTextStream("localhost", 6789)

    val result = lines.flatMap(_.split(" ")).map((_, 1))
    //Use the updateStateByKey operator to count how many times each word has appeared globally
    val state = result.updateStateByKey[Int](updateFunction _)

    state.print()

    ssc.start()
    ssc.awaitTermination()

  }

  /**
    * Merge the current batch's values into the existing (old) state
    * @param currentValues the values of the current batch
    * @param preValues the previously accumulated value
    * @return
    */
  def updateFunction(currentValues: Seq[Int], preValues: Option[Int]): Option[Int] = {

    //Sum the values of the current batch
    val current = currentValues.sum
    val pre = preValues.getOrElse(0)  //Fall back to 0 if there is no previous value

    Some(current + pre)
  }
}

4.2 Writing word counts to a MySQL database

/**
  * Word count with Spark Streaming, with the results written to a MySQL database
  */
object ForeachRDDApp {

  def main(args: Array[String]): Unit = {

    //Needed because this machine has no Hadoop environment configured
    System.setProperty("hadoop.home.dir", "E:/winutils/")

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("ForeachRDDApp")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 6789)
    val result = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    result.print()

    //Write the results to the MySQL database
    result.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        val connection = createConnection()
        partitionOfRecords.foreach(record => {
          val sql = "insert into wordcount(word, wordcount) values('" + record._1 + "'," + record._2 + ")"
          connection.createStatement().execute(sql)
        })
        connection.close()
      })
    })


    ssc.start()
    ssc.awaitTermination()

  }

  /**
    * Obtain a MySQL connection
    * @return
    */
  def createConnection() = {
    Class.forName("com.mysql.jdbc.Driver")
    DriverManager.getConnection("jdbc:mysql://localhost:3306/imooc_spark", "root", "MySQLKenan_07")
  }
}
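
The example above builds the SQL by string concatenation and issues one statement per record. The sketch below is one possible improvement (an assumption, not part of the original code): it keeps the same wordcount(word, wordcount) table and connection settings, but goes through a PreparedStatement with batched inserts; in production a connection pool would usually replace DriverManager.

import java.sql.DriverManager

  /**
    * Hypothetical helper: write one partition of (word, count) pairs using a PreparedStatement
    */
  def savePartition(records: Iterator[(String, Int)]): Unit = {
    Class.forName("com.mysql.jdbc.Driver")
    val connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/imooc_spark", "root", "MySQLKenan_07")
    val statement = connection.prepareStatement("insert into wordcount(word, wordcount) values (?, ?)")
    try {
      records.foreach { case (word, count) =>
        statement.setString(1, word)
        statement.setInt(2, count)
        statement.addBatch()          // accumulate the inserts of this partition
      }
      statement.executeBatch()        // send them to MySQL in one round trip
    } finally {
      statement.close()
      connection.close()
    }
  }

  // Usage inside the job above:
  // result.foreachRDD(rdd => rdd.foreachPartition(savePartition))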

4.3 Using windows

Perform computations periodically over the data in a sliding time range.

window length: the duration of the window

sliding interval: the interval at which the windowed computation is performed

Both parameters must be integer multiples of the batch interval.

e.g. compute the word count of the previous 10 minutes every 10 seconds ==> sliding interval 10s, window length 10min

// (window length, sliding interval)
val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int, b:Int) => (a+b), Seconds(30), Seconds(10))
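
For context, a complete sketch of a windowed word count (the localhost:6789 source and the 10-second batch interval are assumptions; both window parameters are multiples of that interval):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Sketch: count words over the last 30 seconds, recomputed every 10 seconds
  */
object WindowWordCountSketch {

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("WindowWordCountSketch")
    val ssc = new StreamingContext(sparkConf, Seconds(10))    // batch interval: 10 seconds

    val lines = ssc.socketTextStream("localhost", 6789)
    val pairs = lines.flatMap(_.split(" ")).map((_, 1))

    // window length = 30s, sliding interval = 10s (both multiples of the batch interval)
    val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
    windowedWordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}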

4.4 Blacklist filtering

Access log: 20180808,zs    20180808,ls    20180808,ww  【 ==> keyed as (zs, "20180808,zs") 】

Blacklist: zs, ls  【 ==> (zs, true) 】

left join ==> (zs, ("20180808,zs", Some(true)))    (ww, ("20180808,ww", None))   ===> keep only the records that are not blacklisted, i.e. "20180808,ww"

/**
  * Blacklist filtering
  */
object TransformApp {

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("TransformApp")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    //Build the blacklist
    val blacks = List("zs", "ls")
    val blacksRDD = ssc.sparkContext.parallelize(blacks).map(x => (x, true))

    val lines = ssc.socketTextStream("localhost", 6789)
    //Key the records by name; the value is the complete log line
    val clickLog = lines.map(x=>(x.split(",")(1), x)).transform(rdd => {
      rdd.leftOuterJoin(blacksRDD)
        .filter(x => x._2._2.getOrElse(false) != true)
        .map(x => x._2._1)
    })

    clickLog.print()

    ssc.start()
    ssc.awaitTermination()

  }
}

4.5 Integrating Spark Streaming with Spark SQL

/**
  * Word count with Spark Streaming combined with Spark SQL
  */
object SqlNetworkWordCount {

  def main(args: Array[String]): Unit = {

    //Needed because this machine has no Hadoop environment configured
    System.setProperty("hadoop.home.dir", "E:/winutils/")

    val sparkConf = new SparkConf().setAppName("SqlNetworkWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 6789)
    val words = lines.flatMap(_.split(" "))

    //Process each batch (RDD) of the DStream
    words.foreachRDD((rdd, time) => {
      val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
      import spark.implicits._

      val wordsDataFrame = rdd.map(w => Record(w)).toDF()

      wordsDataFrame.createOrReplaceTempView("words")

      val wordCountsDataFrame = spark.sql("select word, count(*) as total from words group by word")
      println(s"=========== $time ===========")
      wordCountsDataFrame.show()
    })

    ssc.start()
    ssc.awaitTermination()
  }

  //Case class used to convert the RDD into a DataFrame
  case class Record(word: String)

  object SparkSessionSingleton {

    @transient private var instance: SparkSession = _

    def getInstance(sparkConf: SparkConf): SparkSession = {
      if (instance == null) {
        instance = SparkSession
          .builder
          .config(sparkConf)
          .getOrCreate()
      }
      instance
    }
  }
}

5. Integrating Spark Streaming with Flume

5.1 Push-based approach

Flume collects the data and pushes it directly to Spark Streaming.

vim  flume_push_streaming.conf

simple-agent.sources = netcat-source
simple-agent.sinks = avro-sink
simple-agent.channels = memory-channel

simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = hadoop000
simple-agent.sources.netcat-source.port = 44444

simple-agent.sinks.avro-sink.type = avro
# For local testing, this should be the IP of the local machine running the Spark application
simple-agent.sinks.avro-sink.hostname = hadoop000
simple-agent.sinks.avro-sink.port = 41414 

simple-agent.channels.memory-channel.type = memory

simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.avro-sink.channel = memory-channel

Dependencies (pom.xml excerpt):

        <!-- Spark Streaming integration with Flume -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-flume_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
/**
  * First way to integrate Spark Streaming with Flume --- Push
  */
object FlumePushWordCount {

  def main(args: Array[String]): Unit = {

    //Needed because this machine has no Hadoop environment configured
    //System.setProperty("hadoop.home.dir", "E:/winutils/")

    if (args.length != 2) {
      System.err.println("Usage: FlumePushWordCount  ")
      System.exit(1)
    }

    val Array(hostname, port) = args

    //When submitting to the server with spark-submit, the line below must stay commented out
    val sparkConf = new SparkConf()
      //.setMaster("local[2]").setAppName("FlumePushWordCount")

    val ssc = new StreamingContext(sparkConf, Seconds(5))

    //How Spark Streaming integrates with Flume:
    //receive the data pushed to hostname:port
    val flumeStream = FlumeUtils.createStream(ssc, hostname, port.toInt)
    flumeStream.map(x => new String(x.event.getBody.array()).trim)
        .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print(10)



    ssc.start()
    ssc.awaitTermination()

  }
}

Testing: start the Spark application first, then start Flume.

flume-ng agent \
--name simple-agent \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/flume_push_streaming.conf \
-Dflume.root.logger=INFO,console
Local testing summary:
1> Start the Spark Streaming job
2> Start the Flume agent
3> Send data via telnet and watch the output in the IDEA console
Packaging: mvn clean package -DskipTests

Submitting to the Spark server for testing

./spark-submit \
--class com.xq.spark.examples.FlumePushWordCount \
--name FlumePushWordCount \
--master local[2] \
--executor-memory 1G \
--packages org.apache.spark:spark-streaming-flume_2.11:2.2.0,org.apache.flume:flume-ng-sdk:1.6.0 \
/home/Kiku/lib/sparktrain-1.0-SNAPSHOT.jar \
hadoop000 41414
flume-ng agent \
--name simple-agent \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/flume_push_streaming.conf \
-Dflume.root.logger=INFO,console
telnet hadoop000 44444

Test succeeded!

5.2 Pull-based approach (recommended!)

Flume collects the data and hands it to a custom sink; Spark Streaming then pulls the data from that sink.

Dependencies (pom.xml excerpt):

        <!-- Spark Streaming integration with Flume (pull / custom sink) -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-flume-sink_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.5</version>
        </dependency>

vim flume_pull_streaming.conf 

simple-agent.sources = netcat-source
simple-agent.sinks = spark-sink
simple-agent.channels = memory-channel

simple-agent.sources.netcat-source.type = netcat
simple-agent.sources.netcat-source.bind = hadoop000
simple-agent.sources.netcat-source.port = 44444

simple-agent.sinks.spark-sink.type = org.apache.spark.streaming.flume.sink.SparkSink
simple-agent.sinks.spark-sink.hostname = hadoop000
simple-agent.sinks.spark-sink.port = 41414 

simple-agent.channels.memory-channel.type = memory

simple-agent.sources.netcat-source.channels = memory-channel
simple-agent.sinks.spark-sink.channel = memory-channel
/**
  * Second way to integrate Spark Streaming with Flume --- Pull
  */
object FlumePullWordCount {

  def main(args: Array[String]): Unit = {

    //Needed because this machine has no Hadoop environment configured
    //System.setProperty("hadoop.home.dir", "E:/winutils/")

    if (args.length != 2) {
      System.err.println("Usage: FlumePullWordCount  ")
      System.exit(1)
    }

    val Array(hostname, port) = args

    //When submitting to the server with spark-submit, the line below must stay commented out
    val sparkConf = new SparkConf()
      //.setMaster("local[2]").setAppName("FlumePullWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    val flumeStream = FlumeUtils.createPollingStream(ssc, hostname, port.toInt)

    flumeStream.map(x => new String(x.event.getBody.array()).trim)
        .flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}

  Local testing:

  Start Flume first, then start the Spark application.

flume-ng agent \
--name simple-agent \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/flume_pull_streaming.conf \
-Dflume.root.logger=INFO,console

Server testing:

Package the project (note: comment out the local[2] and appName settings)

./spark-submit \
--class com.xq.spark.examples.FlumePullWordCount \
--name FlumePullWordCount \
--master local[2] \
--executor-memory 1G \
--packages org.apache.spark:spark-streaming-flume_2.11:2.2.0,org.apache.flume:flume-ng-sdk:1.6.0,org.apache.spark:spark-streaming-flume-sink_2.11:2.2.0 \
/home/Kiku/lib/sparktrain-1.0-SNAPSHOT.jar \
hadoop000 41414

6. Integrating Spark Streaming with Kafka

6.1 Receiver-based integration

A Receiver receives the data coming from Kafka and stores it on the Spark Executors; when a job is launched, it processes the stored data.

1) Start ZooKeeper:    zkServer.sh start

2) Start Kafka:    kafka-server-start.sh -daemon /home/Kiku/app/kafka_2.11-0.9.0.0/config/server.properties

3) Create a topic:    kafka-topics.sh --create --zookeeper hadoop000:2181 --replication-factor 1 --partitions 1 --topic xxxxxx

4) Verify from the console that the topic can produce and consume messages correctly

kafka-console-producer.sh --broker-list hadoop000:9092 --topic kafka_streaming_topic

kafka-console-consumer.sh --zookeeper hadoop000:2181 --topic kafka_streaming_topic

Dependencies (pom.xml excerpt):

        <!-- Spark Streaming integration with Kafka -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
/**
  * First way to connect Spark Streaming to Kafka -- Receiver
  */
object KafkaReceiverWordCount {

  def main(args: Array[String]): Unit = {

    //Needed because this machine has no Hadoop environment configured
    //System.setProperty("hadoop.home.dir", "E:/winutils/")

    if (args.length != 4) {
      System.err.println("Usage: KafkaReceiverWordCount <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }

    val Array(zkQuorum, group, topics, numThreads) = args

    val sparkConf = new SparkConf()
        //.setMaster("local[2]").setAppName("KafkaReceiverWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))


    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    //How Spark Streaming connects to Kafka (Receiver-based)
    val messages = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)

    //The second element of the tuple is the message value
    messages.map(_._2).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()


    ssc.start()
    ssc.awaitTermination()
  }
}

Submitting to the server

Comment out local[2] and the appName in the code.

spark-submit \
--class com.xq.spark.examples.KafkaReceiverWordCount \
--name KafkaReceiverWordCount \
--master local[2] \
--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 \
/home/Kiku/lib/sparktrain-1.0-SNAPSHOT.jar \
192.168.6.130:2181 test kafka_streaming_topic 1

6.2 Direct integration

No Receiver, with end-to-end delivery guarantees. Spark periodically queries each Kafka topic partition for the latest offsets and processes the data of that offset range batch by batch.

Advantages: simplified parallelism -- there is no need to create and union multiple input streams; a single direct stream handles all partitions.

    Zero data loss: with the Receiver approach, a Write Ahead Log (WAL) plus replicated storage is required to guarantee no data loss, which is inefficient; the Direct approach no longer needs a WAL, so performance improves.

Drawback: offsets are not updated in ZooKeeper automatically and must be tracked manually (see the sketch after the example below).

/**
  * Second way to connect Spark Streaming to Kafka -- Direct
  */
object KafkaDirectWordCount {

  def main(args: Array[String]): Unit = {

    //Needed because this machine has no Hadoop environment configured
    //System.setProperty("hadoop.home.dir", "E:/winutils/")

    if (args.length != 2) {
      System.err.println("Usage: KafkaReceiverWordCount  ")
      System.exit(1)
    }

    val Array(brokers, topics) = args

    val sparkConf = new SparkConf()
      //.setMaster("local[2]").setAppName("KafkaDirectWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(5))


    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)

    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet
    )

    //The second element of the tuple is the message value
    messages.map(_._2).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()

  }
}
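
Since the Direct approach does not commit offsets to ZooKeeper, the application has to capture and persist them itself if it needs to track progress. Below is a minimal sketch based on the HasOffsetRanges pattern from the Spark documentation (the broker address and topic name are placeholders; where to store the offsets is left as a comment, since that is application-specific):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

/**
  * Sketch: Direct stream with manual offset tracking
  */
object KafkaDirectOffsetSketch {

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("KafkaDirectOffsetSketch")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    val kafkaParams = Map[String, String]("metadata.broker.list" -> "192.168.6.130:9092")
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("kafka_streaming_topic"))

    var offsetRanges = Array.empty[OffsetRange]

    messages.transform { rdd =>
      // capture the offset ranges of this batch before any shuffle breaks the Kafka-partition mapping
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }.map(_._2).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).foreachRDD { rdd =>
      rdd.collect().foreach(println)
      offsetRanges.foreach { o =>
        // persist these wherever the application requires (ZooKeeper, MySQL, HBase, ...)
        println(s"topic=${o.topic} partition=${o.partition} from=${o.fromOffset} until=${o.untilOffset}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}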

Submitting to the server

spark-submit \
--class com.xq.spark.examples.KafkaDirectWordCount \
--name KafkaDirectWordCount \
--master local[2] \
--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 \
/home/Kiku/lib/sparktrain-1.0-SNAPSHOT.jar \
192.168.6.130:9092 kafka_streaming_topic

 
