Getting started with Flink: a Flink stream is unbounded data. Let's use an example to compare Flink with Spark.
Flink is Event-based: every Event is independent, and operations and operators are applied to individual Events.
Spark is RDD-based: operations and operators are applied to collections. This is the most fundamental difference between Spark and Flink.
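To make the per-Event model concrete before the two WordCount examples, here is a minimal sketch, assuming the same Flink 1.x Scala DataStream API used below (the object name PerEventDemo is made up for illustration): the map function is invoked once for every element that flows through the operator.
import org.apache.flink.streaming.api.scala._
object PerEventDemo extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.fromElements("a", "b", "c")
    .map { e => println(s"processing one Event: $e"); e } // invoked element by element
    .print()
  env.execute("per-event demo")
}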
1: Spark WordCount example
import org.apache.spark.{SparkConf, SparkContext}
object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkWordCount Demo, compare to Flink !!!").setMaster("local")
    val sc = new SparkContext(conf)
    val text = sc.makeRDD(
      List(
        "import org apache flink streaming api scala"
        , "import org apache flink streaming api windowing time Time"
      ))
    println("default NumPartitions : " + text.getNumPartitions)
    val wordCounts = text.flatMap(line => line.split(" ")).map(word => (word.toLowerCase, 1)).reduceByKey((a, b) => a + b)
    println("wordCounts NumPartitions : " + wordCounts.getNumPartitions)
    wordCounts.foreach(println(_))
  }
}
The output is as follows. It shows that Spark computes the entire input in a single pass.
(scala,1)
(import,2)
(flink,2)
(apache,2)
(org,2)
(windowing,1)
(streaming,2)
(time,2)
(api,2)
2: Flink WordCount
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
object FlinkNoTime extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  val text = env.fromElements(
    "import org apache flink streaming api scala"
    , "import org apache flink streaming api windowing time Time"
  )
  val counts = text
    .flatMap {
      _.toLowerCase.split("\\W+") filter {
        _.nonEmpty
      }
    }
    .map((_, 1))
    .keyBy(0) // group by the tuple field "0" and sum up tuple field "1"
    //.timeWindow(Time.seconds(5)) // adding this line yields no output, because no time semantics are specified
    .sum(1)
  counts.print()
  env.execute("Window Stream WordCount")
}
The output is as follows. Each word's count climbs from 1 to 2: Flink treats every word as an independent Event, the Events enter the Flink system one by one, and so the word counts accumulate step by step.
6> (windowing,1)
3> (org,1)
3> (org,2)
2> (streaming,1)
4> (import,1)
2> (streaming,2)
7> (apache,1)
7> (flink,1)
1> (api,1)
1> (api,2)
1> (scala,1)
5> (time,1)
5> (time,2)
4> (import,2)
7> (apache,2)
7> (flink,2)
Is there a way to make Flink produce output like Spark's? Note that we are comparing the Flink stream API with Spark; Flink batch is not considered here.
Of course there is: add a Window operation to Flink so that all the Events are collected into one window, and then apply the aggregation.
3: Flink WordCount like Spark batch computation
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
object FlinkLikeSpark extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)
  // There is no processing-time logic here, so setting ProcessingTime yields no output; use IngestionTime for now
  // env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
  val text = env.fromElements(
    "import org apache flink streaming api scala"
    , "import org apache flink streaming api windowing time Time"
  )
  val counts = text
    .flatMap {_.toLowerCase.split("\\W+") filter {_.nonEmpty}}
    .map((_, 1))
    .keyBy(0)
    .timeWindow(Time.seconds(5)) // adding a window operation gives the batch-like aggregation
    .sum(1)
    .print()
  env.execute("Window Stream WordCount")
}
The output is as follows. It now looks just like Spark's batch output, which confirms the idea above.
7> (apache,2)
7> (flink,2)
6> (windowing,1)
4> (import,2)
3> (org,2)
5> (time,2)
1> (api,2)
1> (scala,1)
2> (streaming,2)
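As a follow-up, the same batch-like result should also be reachable with event time instead of ingestion time. The sketch below is only an illustration under the assumption of the same Flink 1.x Scala API (FlinkEventTimeLikeSpark is a made-up name): it assigns the same constant event timestamp to every line, so all Events fall into one event-time window, which fires when the bounded source finishes.
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
object FlinkEventTimeLikeSpark extends App {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
  val text = env.fromElements(
    "import org apache flink streaming api scala"
    , "import org apache flink streaming api windowing time Time"
  )
  text
    .assignAscendingTimestamps(_ => 0L) // assumption: give every line the same event timestamp
    .flatMap {_.toLowerCase.split("\\W+") filter {_.nonEmpty}}
    .map((_, 1))
    .keyBy(0)
    .timeWindow(Time.seconds(5)) // all Events share one timestamp, so they land in the same window
    .sum(1)
    .print()
  env.execute("Event Time Stream WordCount")
}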