WordCount with Flink and Spark Streaming

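Flink is Apache's open-source stream processing framework, and Spark is Apache's open-source fast, general-purpose engine for large-scale data processing. Both are distributed frameworks that support several languages, including Java and Scala. The two examples below show how each one is used.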

WordCount with Flink

1.  Add the dependency to pom.xml


     <dependency>
         <groupId>org.apache.flink</groupId>
         <artifactId>flink-streaming-scala_2.11</artifactId>
         <version>1.8.0</version>
     </dependency>

2.  Write the WordCount code in IntelliJ IDEA

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.scala._

object SocketWordCount {

  def main(args: Array[String]): Unit = {

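    // Obtain the execution environment and read newline-delimited text from the socket
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val lines: DataStream[String] = env.socketTextStream("localhost", 9999, '\n')

    val wordCount = lines
      .flatMap(_.split(" "))                         // split each line into words
      .map((_, 1))                                   // pair each word with a count of 1
      .keyBy(_._1)                                   // partition the stream by word
      .timeWindow(Time.seconds(5), Time.seconds(1))  // 5-second window sliding every 1 second
      .reduce((x, y) => (x._1, x._2 + y._2))         // sum the counts for each word

    // Print with parallelism 1 so the output comes from a single subtask
    wordCount.print().setParallelism(1)
    env.execute("SocketWordCount")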

  }
}
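
To make the transformation chain easier to follow, here is a minimal sketch of the same split, pair, and sum steps applied to the lines of a single window, using only the Scala standard library. The object name WordCountSketch and the function countWords are hypothetical, and the filter that drops empty strings is an addition that the Flink job above does not have.

```scala
object WordCountSketch {

  // Counts the words contained in the lines of one window.
  // This mirrors flatMap -> map -> keyBy -> reduce on an in-memory collection.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))   // split each line into words
      .filter(_.nonEmpty)      // assumption: ignore empty tokens (not in the original job)
      .map((_, 1))             // pair each word with a count of 1
      .groupBy(_._1)           // group the pairs by word
      .map { case (word, pairs) => word -> pairs.map(_._2).reduce(_ + _) }

  def main(args: Array[String]): Unit = {
    // Print the counts sorted by word so the order is stable
    countWords(Seq("hello world hello")).toSeq.sortBy(_._1).foreach(println)
  }
}
```

Running main on the single line "hello world hello" prints (hello,2) followed by (world,1), the same pairs the streaming job reports for that input.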

3. Start a local server with netcat

$ nc -l 9999
hello world hello

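4. Run the SocketWordCount code. The result is shown below. Because the 5-second window slides every 1 second, the same counts may be printed several times, once for each window that contains the input line.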

(hello,2)
(world,1)
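
The repeated output follows from how a sliding window assigns elements: with a size of 5 seconds and a slide of 1 second, each element falls into five overlapping windows. The following is a minimal sketch of that assignment in plain Scala. The names SlidingWindowSketch and windowStarts are hypothetical, timestamps are in milliseconds with no window offset, and the code illustrates the idea rather than reproducing Flink's own implementation.

```scala
object SlidingWindowSketch {

  // Returns the start timestamps (ms) of every sliding window that contains `ts`.
  // A window starting at `start` covers the interval [start, start + sizeMs).
  def windowStarts(ts: Long, sizeMs: Long, slideMs: Long): List[Long] = {
    val lastStart = ts - (ts % slideMs)          // latest window start that is <= ts
    Iterator
      .iterate(lastStart)(_ - slideMs)           // step backwards one slide at a time
      .takeWhile(start => start > ts - sizeMs)   // keep only windows that still cover ts
      .toList
  }

  def main(args: Array[String]): Unit = {
    // An element arriving at 7300 ms with a 5 s window sliding every 1 s
    println(windowStarts(ts = 7300L, sizeMs = 5000L, slideMs = 1000L).mkString(", "))
  }
}
```

For the timestamp 7300 this prints 7000, 6000, 5000, 4000, 3000, so the element contributes to five windows, and each of them emits its own result when it closes.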

 

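WordCount with Spark Streaming

1.  Add the dependency to pom.xml. The provided scope suits submission to a cluster; when running directly in the IDE, the dependency has to be on the runtime classpath, for example by removing the scope or by enabling IDEA's option to include provided-scope dependencies.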


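    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.4.2</version>
        <scope>provided</scope>
    </dependency>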

2.  Write the WordCount code in IntelliJ IDEA

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreamingSocketWordCount {

  def main(args: Array[String]): Unit = {

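    // local[2]: at least two threads are needed, one for the socket receiver and one for processing
    val conf = new SparkConf().setAppName("SparkStreamingSocketWordCount").setMaster("local[2]")
    // Group the incoming data into 5-second micro-batches
    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)

    val wordCount = lines
      .flatMap(_.split(" "))  // split each line into words
      .map((_, 1))            // pair each word with a count of 1
      .reduceByKey(_ + _)     // sum the counts per word within each batch

    wordCount.print()
    // Start the computation and block until it is stopped
    ssc.start()
    ssc.awaitTermination()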
  }
}
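
The job above counts words separately for every 5-second batch. Here is a minimal sketch of that micro-batch behaviour in plain Scala, without Spark. The names MicroBatchSketch, Event, and batchCounts are hypothetical, and labelling each batch by the end of its interval is an assumption of this sketch rather than a description of Spark's internals.

```scala
object MicroBatchSketch {

  // Hypothetical event type: arrival time in milliseconds and one line of text
  final case class Event(timeMs: Long, line: String)

  // Groups events into fixed-size batches and counts the words of each batch independently.
  // Each batch is labelled with the end of its interval (an assumption of this sketch).
  def batchCounts(events: Seq[Event], batchMs: Long): Seq[(Long, Map[String, Int])] =
    events
      .groupBy(e => (e.timeMs / batchMs + 1) * batchMs)  // batch label for each event
      .toSeq
      .sortBy(_._1)                                      // order batches by time
      .map { case (batchTime, batchEvents) =>
        val counts = batchEvents
          .flatMap(_.line.split(" "))                    // split each line into words
          .groupBy(identity)                             // group identical words
          .map { case (word, occurrences) => word -> occurrences.size }
        batchTime -> counts
      }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(1000L, "hello world"), Event(4000L, "hello"), Event(6000L, "world"))
    batchCounts(events, batchMs = 5000L).foreach { case (time, counts) =>
      println(s"Time: $time ms -> $counts")
    }
  }
}
```

The first two events fall into the batch labelled 5000 and the third into the batch labelled 10000, so counts are never carried over from one batch to the next, which matches the stateless reduceByKey in the job above.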

3. Start a local server with netcat

$ nc -l 9999
hello world hello

4. Run the SparkStreamingSocketWordCount code. The result is as follows:

-------------------------------------------
Time: 1560395435000 ms
-------------------------------------------
(hello,2)
(world,1)

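Summary: the two implementations above have a lot in common. The main difference is latency: Flink processes records as they arrive, whereas Spark Streaming processes them in micro-batches (5 seconds in this example). Flink is therefore the better choice when low latency matters, while Spark Streaming is a reasonable choice when somewhat higher latency is acceptable, because its API is richer and more convenient to develop with.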
