Flink, as a third-generation stream-computing engine, also splits jobs into a DAG of stages, but on top of that it builds a pure stream-processing framework; it is therefore commonly called the third-generation big-data processing solution. Its design philosophy starts from the exact opposite point of Spark's: Spark treats streaming as micro-batches on a batch engine, while Flink treats batch as a bounded special case of streaming.
First generation: 2006, Hadoop (HDFS, MapReduce); in September 2014 Storm became an Apache top-level project.
Second generation: February 2014, Spark (Spark RDD / DStream) became an Apache top-level project.
Third generation: December 2014, Flink became an Apache top-level project.
The reason for this ordering is that, early on, most people's understanding of big-data analytics and most business scenarios stayed in the batch-processing domain, which is why Flink's adoption grew more slowly than Spark's; only around 2017 did the industry gradually start shifting from batch processing to stream processing.
Typical stream-computing scenarios: real-time analytics, system monitoring, public-opinion monitoring, traffic prediction, the national power grid, disease prediction, and risk control in banking/finance.
Spark architecture vs. Flink architecture
**Summary:** It is easy to see that Flink's architecture is designed just as elegantly as Spark's. For resource management, Flink can likewise run on Standalone, YARN, Kubernetes, and so on. On top of that it abstracts two processing dimensions, stream processing and batch processing, handling unbounded and bounded data respectively. Above the DataStream and DataSet APIs it provides corresponding libraries such as SQL, CEP (Complex Event Processing) and Machine Learning, which is why it is naturally referred to as the third-generation big-data processing solution.
Reference: https://ci.apache.org/projects/flink/flink-docs-master/concepts/runtime.html
Flink uses operator chaining to merge several operations into a single subtask, and each subtask runs as one thread. This chaining is similar to how Spark splits its DAG into stages; it optimizes the computation by reducing thread-to-thread hand-over costs. The figure below illustrates Flink's chaining of streaming operators.
Flink architecture roles: JobManager (similar to the Spark Master), TaskManager (similar to a Spark Worker), Client (similar to the Spark Driver).
There is always at least one JobManager. A high-availability setup will have multiple JobManagers, one of which is always the leader, and the others are standby.
There must always be at least one TaskManager.
Each TaskManager is a JVM process that executes one or more subtasks (each subtask runs in its own thread). Task slots control how many tasks a TaskManager JVM accepts, so every TaskManager has at least one task slot.
Each task slot represents a fixed subset of the TaskManager's resources. For example, a TaskManager with 3 slots dedicates 1/3 of its managed memory to each slot. Because a task slot can only be assigned to one job, the slot mechanism isolates the computations of different jobs from each other. Continuing the example above, if a job is given 6 slots but its subtasks only need 5 of them, the assignment looks as follows:
Each thread occupies one slot, and the remaining slot is wasted. When writing Flink programs you therefore need to know fairly precisely how many slots a job requires and what parallelism its operators use, because Flink shares task slots within the same job.
By default, the number of task slots a Flink job needs equals the maximum parallelism among its tasks.
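The following sketch (not taken from the original notes; the class name and numbers are only illustrative) shows how operator parallelism drives the number of slots a job asks for:
import org.apache.flink.streaming.api.scala._

object ParallelismDemo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4)                 // default parallelism for every operator
    env.socketTextStream("CentOS", 9999)  // socket sources always run with parallelism 1
      .flatMap(_.split("\\s+"))
      .map((_, 1)).setParallelism(6)      // this operator now has the job's largest parallelism
      .keyBy(0)
      .sum(1)
      .print()
    env.execute("parallelism demo")       // the job therefore requests 6 task slots
  }
}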
Prerequisites
Upload and extract Flink
[root@CentOS ~]# tar -zxf flink-1.8.1-bin-scala_2.11.tgz -C /usr/
[root@CentOS ~]# cd /usr/flink-1.8.1/
[root@CentOS flink-1.8.1]# vi conf/flink-conf.yaml
jobmanager.rpc.address: CentOS
taskmanager.numberOfTaskSlots: 4
[root@CentOS flink-1.8.1]# vi conf/slaves
CentOS
[root@CentOS flink-1.8.1]# ./bin/start-cluster.sh
Starting cluster.
Starting standalonesession daemon on host CentOS.
Starting taskexecutor daemon on host CentOS.
[root@CentOS flink-1.8.1]# jps
4721 SecondaryNameNode
4420 DataNode
36311 TaskManagerRunner
35850 StandaloneSessionClusterEntrypoint
2730 QuorumPeerMain
3963 Kafka
36350 Jps
4287 NameNode
If Flink needs to write data to HDFS, pay attention to matching the Flink build with your Hadoop version. In general you download flink-shaded-hadoop-2-uber-xxxx.jar and place it in Flink's lib directory, so that Flink can talk to HBase, HDFS and YARN directly. The alternative is to configure HADOOP_CLASSPATH as an environment variable.
Open http://centos:8081/#/overview to view the Flink web UI.
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-core</artifactId>
    <version>1.8.1</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients_2.11</artifactId>
    <version>1.8.1</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-scala_2.11</artifactId>
    <version>1.8.1</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_2.11</artifactId>
    <version>1.8.1</version>
</dependency>
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
object FlinkStreamWordCount {
def main(args: Array[String]): Unit = {
//1. Create the StreamExecutionEnvironment
val env=StreamExecutionEnvironment.getExecutionEnvironment
//2. Configure the source
val lines:DataStream[String]=env.socketTextStream("CentOS",9999)
//3. Apply the usual transformations to the lines stream
lines.flatMap(_.split("\\s+"))
.map(WordPair(_,1))
.keyBy("word")
.sum("count")
.print()
//4. Execute the job
env.execute("wordcount")
}
}
case class WordPair(word:String,count:Int)
[root@CentOS flink-1.8.1]# ./bin/flink run --class com.baizhi.demo01.FlinkStreamWordCount -p 3 /root/flink-1.0-SNAPSHOT.jar
[root@CentOS flink-1.8.1]# ./bin/flink list
Waiting for response...
------------------ Running/Restarting Jobs -------------------
26.08.2019 04:21:26 : 8b03648cbd94c37a200349ccf3ff0331 : wordcount (RUNNING)
--------------------------------------------------------------
No scheduled jobs.
[root@CentOS flink-1.8.1]# ./bin/flink cancel 8b03648cbd94c37a200349ccf3ff0331
1) Create the execution environment (StreamExecutionEnvironment)
2) Build a DataStream
3) Apply transformation operators on the DataStream (lazy)
4) Specify how the computed results are output
5) Launch the job with env.execute("job name")
val env=StreamExecutionEnvironment.getExecutionEnvironment
This variant detects the environment the program is deployed in and picks the right execution context automatically; it works both for local execution and in a distributed cluster.
val env=StreamExecutionEnvironment.createLocalEnvironment(4)
Explicitly creates a local test environment (here with parallelism 4).
val jarFiles="D:\\IDEA_WorkSpace\\BigDataProject\\20190813\\FlinkDataStream\\target\\flink-1.0-SNAPSHOT.jar"
val env=StreamExecutionEnvironment.createRemoteEnvironment("CentOS",8081,jarFiles)
val env=StreamExecutionEnvironment.getExecutionEnvironment
//2. Configure the source
val lines:DataStream[String]=env.socketTextStream("CentOS",9999)
//3. Apply the usual transformations to the lines stream
lines.flatMap(_.split("\\s+"))
.map(WordPair(_,1))
.keyBy("word")
.sum("count")
.print()
println(env.getExecutionPlan)
{"nodes":[{"id":1,"type":"Source: Socket Stream","pact":"Data Source","contents":"Source: Socket Stream","parallelism":1},{"id":2,"type":"Flat Map","pact":"Operator","contents":"Flat Map","parallelism":16,"predecessors":[{"id":1,"ship_strategy":"REBALANCE","side":"second"}]},{"id":3,"type":"Map","pact":"Operator","contents":"Map","parallelism":16,"predecessors":[{"id":2,"ship_strategy":"FORWARD","side":"second"}]},{"id":5,"type":"aggregation","pact":"Operator","contents":"aggregation","parallelism":16,"predecessors":[{"id":3,"ship_strategy":"HASH","side":"second"}]},{"id":6,"type":"Sink: Print to Std. Out","pact":"Data Sink","contents":"Sink: Print to Std. Out","parallelism":16,"predecessors":[{"id":5,"ship_strategy":"FORWARD","side":"second"}]}]}
Open https://flink.apache.org/visualizer/ and paste the JSON above into the page.
A source is the input of a streaming application. You register one with `StreamExecutionEnvironment.addSource(sourceFunction)`, where sourceFunction implements `SourceFunction`, or `ParallelSourceFunction` / `RichParallelSourceFunction` for a parallel custom source. Flink also ships several built-in sources that are convenient for testing:
readTextFile uses TextInputFormat underneath and reads the file exactly once.
//1. Create the StreamExecutionEnvironment
val env=StreamExecutionEnvironment.getExecutionEnvironment
//2. Configure the source
val lines:DataStream[String]=env.readTextFile("hdfs://CentOS:9000/demo/words")
//3. Apply the usual transformations to the lines stream
lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.print()
//4. Execute the job
env.execute("wordcount")
Note: reading from HDFS requires these additional dependencies:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.9.2</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.9.2</version>
</dependency>
//1. Create the StreamExecutionEnvironment
val env=StreamExecutionEnvironment.getExecutionEnvironment
//2. Configure the source
val p="hdfs://CentOS:9000/demo/words"
val inputFormat=new TextInputFormat(new Path(p))//the path p can be omitted here
val lines:DataStream[String]=env.readFile(inputFormat,p)
//3. Apply the usual transformations to the lines stream
lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.print()
//4. Execute the job
env.execute("wordcount")
//1. Create the StreamExecutionEnvironment
val env=StreamExecutionEnvironment.getExecutionEnvironment
//2. Configure the source
val inputFormat=new TextInputFormat(new Path())
val lines:DataStream[String]=env.readFile(inputFormat,"file:///D:/demo/words",
FileProcessingMode.PROCESS_CONTINUOUSLY,1000)
//3. Apply the usual transformations to the lines stream
lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.print()
//4. Execute the job
env.execute("wordcount")
If a monitored file is modified, the whole file is re-read, which leads to duplicate results. For this reason, streaming jobs normally do not modify files in place; new files are added instead.
val env=StreamExecutionEnvironment.getExecutionEnvironment
//2. Configure the source
val lines:DataStream[String]=env.fromElements("this is a demo","hello flink")
//3. Apply the usual transformations to the lines stream
lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.print()
//4. Execute the job
env.execute("wordcount")
Custom sources are registered with the addSource method; for example, data can be read from Apache Kafka, as shown below.
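Before the Kafka example, here is a minimal hand-written source as a sketch (illustrative only; the class name is made up): it implements SourceFunction and emits an incrementing counter once per second until the job is cancelled.
import org.apache.flink.streaming.api.functions.source.SourceFunction

class CounterSource extends SourceFunction[Long] {
  @volatile private var running = true

  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    var counter = 0L
    while (running) {
      ctx.collect(counter) // emit one element downstream
      counter += 1
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = running = false
}
// registered like any other source: env.addSource(new CounterSource)
The Kafka connector used in the rest of this section requires the following dependency: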
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>1.8.1</version>
</dependency>
//1. Create the StreamExecutionEnvironment
val env=StreamExecutionEnvironment.getExecutionEnvironment
val props=new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,"CentOS:9092")
props.put(ConsumerConfig.GROUP_ID_CONFIG,"g1")
val kafkaConsumer=new FlinkKafkaConsumer("topic01",new SimpleStringSchema(),props)
//2. Configure the source
val lines:DataStream[String]=env.addSource[String](kafkaConsumer)
//3. Apply the usual transformations to the lines stream
lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.print()
//4. Execute the job
env.execute("wordcount")
If you use SimpleStringSchema you only get the record value. If you also need the key, partition and offset, you can implement a subclass of KafkaDeserializationSchema and customize the deserialization:
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.flink.api.scala._
class CustomKafkaDeserializationSchema extends KafkaDeserializationSchema[(String,String,Int,Long)]{
//this method always returns false (the stream never ends)
override def isEndOfStream(t: (String, String, Int, Long)): Boolean = false
//decode the fields the user needs
override def deserialize(record: ConsumerRecord[Array[Byte], Array[Byte]]): (String, String, Int, Long) = {
var key=""
if(record.key()!=null && record.key().size!=0){
key=new String(record.key())
}
val value=new String(record.value())
(key,value,record.partition(),record.offset())
}
//the produced result type
override def getProducedType: TypeInformation[(String, String, Int, Long)] = {
createTypeInformation[(String, String, Int, Long)]
}
}
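A sketch of how this schema could be wired into a consumer (it reuses env and props from the earlier Kafka example and assumes the usual org.apache.flink.api.scala._ import is in scope; the mapping is only illustrative):
val kafkaConsumer = new FlinkKafkaConsumer[(String, String, Int, Long)](
  "topic01", new CustomKafkaDeserializationSchema, props)

env.addSource(kafkaConsumer)
  .map(t => s"key=${t._1} value=${t._2} partition=${t._3} offset=${t._4}")
  .print()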
If Kafka stores JSON strings, you can use one of the JSON-aware schemas that ship with Flink; JSONKeyValueDeserializationSchema is the recommended one.
val env=StreamExecutionEnvironment.getExecutionEnvironment
val props=new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,"CentOS:9092")
props.put(ConsumerConfig.GROUP_ID_CONFIG,"g1")
//{"name":"zs","age":18}
val kafkaConsumer=new FlinkKafkaConsumer("topic01",new JSONKeyValueDeserializationSchema(true),props)
val lines:DataStream[ObjectNode]=env.addSource(kafkaConsumer)
lines.print()
//4. Execute the job
env.execute("wordcount")
}
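A sketch of how the resulting ObjectNode could be unpacked (the field names name/age are assumptions based on the sample record above; with includeMetadata = true the node also carries a metadata object with topic, partition and offset):
lines.map(node => {
    val value  = node.get("value")
    val name   = value.get("name").asText()              // assumed field
    val age    = value.get("age").asInt()                // assumed field
    val offset = node.get("metadata").get("offset").asLong()
    (name, age, offset)
  })
  .print()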
Sinks write a DataStream to file systems, sockets, standard output or external systems (Kafka, Redis, ...).
writeAsText/writeAsCsv do not participate in Flink's checkpointing, so they can only guarantee at-least-once semantics; moreover the data is not flushed to the external system immediately, so a failure in the meantime may lose records.
val env=StreamExecutionEnvironment.getExecutionEnvironment
val props=new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,"CentOS:9092")
props.put(ConsumerConfig.GROUP_ID_CONFIG,"g1")
val kafkaConsumer=new FlinkKafkaConsumer("topic01",new SimpleStringSchema(),props)
//2. Configure the source
val lines:DataStream[String]=env.addSource[String](kafkaConsumer)
lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.writeAsText("file:///D:/results/words",WriteMode.OVERWRITE)
//4. Execute the job
env.execute("wordcount")
If you need reliable, exactly-once delivery of a DataStream to an external system, use flink-connector-filesystem to write the data out.
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-filesystem_2.11</artifactId>
    <version>1.8.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.9.2</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.9.2</version>
</dependency>
//1. Create the StreamExecutionEnvironment
val env=StreamExecutionEnvironment.getExecutionEnvironment
val props=new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,"CentOS:9092")
props.put(ConsumerConfig.GROUP_ID_CONFIG,"g1")
val kafkaConsumer=new FlinkKafkaConsumer("topic01",new SimpleStringSchema(),props)
//2. Configure the source
val lines:DataStream[String]=env.addSource[String](kafkaConsumer)
val bucketingSink = new BucketingSink[(String,Int)]("hdfs://CentOS:9000/BucketSink")
bucketingSink.setBucketer(new DateTimeBucketer[(String, Int)]("yyyy-MM-dd-HH",ZoneId.of("Asia/Shanghai")))
bucketingSink.setBatchSize(1024)//1KB
bucketingSink.setBatchRolloverInterval(20 * 60 * 1000) // this is 20 mins
lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.addSink(bucketingSink)
//4. Execute the job
env.execute("wordcount")
//1. Create the StreamExecutionEnvironment
val env=StreamExecutionEnvironment.getExecutionEnvironment
val props=new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,"CentOS:9092")
props.put(ConsumerConfig.GROUP_ID_CONFIG,"g1")
val kafkaConsumer=new FlinkKafkaConsumer("topic01",new SimpleStringSchema(),props)
//2. Configure the source
val lines:DataStream[String]=env.addSource[String](kafkaConsumer)
lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.print("debug")//prefix for the output; if omitted, the default prefix is the task id
//4. Execute the job
env.execute("wordcount")
<dependency>
    <groupId>org.apache.bahir</groupId>
    <artifactId>flink-connector-redis_2.11</artifactId>
    <version>1.0</version>
</dependency>
//1. Create the StreamExecutionEnvironment
val env=StreamExecutionEnvironment.getExecutionEnvironment
val props=new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,"CentOS:9092")
props.put(ConsumerConfig.GROUP_ID_CONFIG,"g1")
val kafkaConsumer=new FlinkKafkaConsumer("topic01",new SimpleStringSchema(),props)
val redisConfig=new FlinkJedisPoolConfig.Builder()
.setHost("CentOS")
.setPort(6379)
.build()
val redisSink= new RedisSink(redisConfig,new WordPairRedisMapper)
//2. Configure the source
val lines:DataStream[String]=env.addSource[String](kafkaConsumer)
lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.addSink(redisSink)
//4. Execute the job
env.execute("wordcount")
If you connect to a Redis cluster, use FlinkJedisClusterConfig; for sentinel mode, use FlinkJedisSentinelConfig.
Cluster:
FlinkJedisClusterConfig conf = new FlinkJedisClusterConfig.Builder()
.setNodes(new HashSet<InetSocketAddress>(Arrays.asList(new InetSocketAddress(5601)))).build();
Sentinel:
val conf = new FlinkJedisSentinelConfig.Builder()
.setMasterName("master")
.setSentinels(...)
.build()
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>1.8.1</version>
</dependency>
//1. Create the StreamExecutionEnvironment
val env=StreamExecutionEnvironment.getExecutionEnvironment
val props1=new Properties()
props1.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,"CentOS:9092")
props1.put(ConsumerConfig.GROUP_ID_CONFIG,"g1")
val props2=new Properties()
props2.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,"CentOS:9092")
val kafkaConsumer=new FlinkKafkaConsumer("topic01",new SimpleStringSchema(),props1)
val kafkaProducer=new FlinkKafkaProducer[(String,Int)]("topic02",new CustomKeyedSerializationSchema,props2)
//2. Configure the source
val lines:DataStream[String]=env.addSource[String](kafkaConsumer)
lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.addSink(kafkaProducer)
//4. Execute the job
env.execute("wordcount")
[root@CentOS kafka_2.11-0.11.0.0]# ./bin/kafka-console-consumer.sh --bootstrap-server CentOS:9092
--topic topic02
--key-deserializer org.apache.kafka.common.serialization.StringDeserializer
--value-deserializer org.apache.kafka.common.serialization.StringDeserializer
--property print.key=true
class CustomKeyedSerializationSchema extends KeyedSerializationSchema[(String,Int)]{
override def serializeKey(t: (String, Int)): Array[Byte] = {
t._1.getBytes()
}
override def serializeValue(t: (String, Int)): Array[Byte] = {
t._2.toString.getBytes()
}
override def getTargetTopic(t: (String, Int)): String = {
null
}
}
Depending on your needs you can implement SinkFunction (no recovery support) or use RichSinkFunction, which provides the hooks needed for failure recovery (covered in later sections).
class CustomSinkFunction extends SinkFunction[(String,Int)] {
override def invoke(value: (String, Int), context: SinkFunction.Context[_]): Unit = {
println(value)
}
}
//1. Create the StreamExecutionEnvironment
val env=StreamExecutionEnvironment.getExecutionEnvironment
val props1=new Properties()
props1.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,"CentOS:9092")
props1.put(ConsumerConfig.GROUP_ID_CONFIG,"g1")
val kafkaConsumer=new FlinkKafkaConsumer("topic01",new SimpleStringSchema(),props1)
//2. Configure the source
val lines:DataStream[String]=env.addSource[String](kafkaConsumer)
lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.addSink(new CustomSinkFunction)//
//4. Execute the job
env.execute("wordcount")
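If the sink needs lifecycle hooks (for example to open and close a connection to an external system), a RichSinkFunction can be used instead. The sketch below is illustrative only and simply logs instead of talking to a real system:
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}

class CustomRichSinkFunction extends RichSinkFunction[(String,Int)] {
  override def open(parameters: Configuration): Unit = {
    // acquire resources here (connection pool, client, ...)
    println("open sink on subtask " + getRuntimeContext.getIndexOfThisSubtask)
  }
  override def invoke(value: (String, Int), context: SinkFunction.Context[_]): Unit = {
    // write a single record to the external system
    println(value)
  }
  override def close(): Unit = {
    // release resources here
    println("close sink")
  }
}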
Takes one element and produces one element. A map function that doubles the values of the input stream:
dataStream.map(x=>x*2)
Takes one element and produces zero, one, or more elements. A flatmap function that splits sentences to words:
dataStream.flatMap(str => str.split("\\s+"))
Evaluates a boolean function for each element and retains those for which the function returns true.
dataStream.filter(item => !item.contains("error"))
Union of two or more data streams creating a new stream containing all the elements from all the streams. Note: If you union a data stream with itself you will get each element twice in the resulting stream.
val stream1= env.socketTextStream("CentOS",9999)
val stream2= env.socketTextStream("CentOS",8888)
stream1
.union(stream2)
.print()
“Connects” two data streams retaining their types, allowing for shared state between the two streams.
val stream1= env.socketTextStream("CentOS",9999)
val stream2= env.socketTextStream("CentOS",8888)
stream1.connect(stream2)
.flatMap(
(line:String)=>line.split("\\s+"),//stream1
(line:String)=>line.split("\\s+") //stream2
)
.map((_,1))
.keyBy(0)
.sum(1)
.print()
split: Split the stream into two or more streams according to some criterion.
select: Select one or more streams from a split stream.
var splitStream:SplitStream[String]= env.socketTextStream("CentOS",9999)
.split((line:String)=>{
if(line.contains("error")){
List("error")
}else{
List("info")
}
})
splitStream.select("error").print("error:")
splitStream.select("info").print("info:")
The split/select operators above are deprecated; side outputs are now the recommended replacement:
val outputTag = new OutputTag[String]("error") {}
var stream=env.socketTextStream("CentOS",9999)
.process(new ProcessFunction[String,String] {
override def processElement(value: String, ctx: ProcessFunction[String, String]#Context, out: Collector[String]): Unit = {
if(value.contains("error")){
ctx.output(outputTag,value)
}else{
out.collect(value)
}
}
})
stream.print("info:")
stream.getSideOutput(outputTag).print("error")
Logically partitions a stream into disjoint partitions, each partition containing elements of the same key. Internally, this is implemented with hash partitioning.
dataStream.keyBy("someKey") // Key by field "someKey"
dataStream.keyBy(0) // Key by the first element of a Tuple
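keyBy also accepts a key-selector function, which is usually clearer than positional or string keys; a small sketch (the streams textStream and wordStream are assumed for illustration):
case class Word(word: String, count: Int)
textStream.keyBy(line => line.toLowerCase)  // key a DataStream[String] by its lowercase value
wordStream.keyBy(_.word)                    // key a DataStream[Word] by its "word" field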
A “rolling” reduce on a keyed data stream. Combines the current element with the last reduced value and emits the new value.
env.socketTextStream("CentOS",9999)
.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.reduce((v1,v2)=>(v1._1,v1._2+v2._2))
.print()
A “rolling” fold on a keyed data stream with an initial value. Combines the current element with the last folded value and emits the new value.
env.socketTextStream("CentOS",9999)
.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.fold(("",0))((r,v)=>(v._1,v._2+r._2))
.print()
max / maxBy, min / minBy, sum
//1 zs 10000 1
//2 ls 15000 1
//3 ww 8000 1
env.socketTextStream("CentOS",9999)
.map(line=>line.split("\\s+"))
.map(tokens=>Employee(tokens(0).toInt,tokens(1),tokens(2).toDouble,tokens(3).toInt))
.keyBy("dept")
.minBy("salary")
.print()
11> Employee(1,zs,10000.0,1)
11> Employee(1,zs,10000.0,1)
11> Employee(3,ww,8000.0,1)
//1 zs 10000 1
//2 ls 15000 1
//3 ww 8000 1
env.socketTextStream("CentOS",9999)
.map(line=>line.split("\\s+"))
.map(tokens=>Employee(tokens(0).toInt,tokens(1),tokens(2).toDouble,tokens(3).toInt))
.keyBy("dept")
.min("salary")
.print()
11> Employee(1,zs,10000.0,1)
11> Employee(1,zs,10000.0,1)
11> Employee(1,zs,8000.0,1)
Apache Flink is built as stateful computation over data streams, so stateful computation is at the heart of Flink, and state management is an important building block. Typical use cases for stateful computation: 1. state lookups, 2. window aggregation and statistics, 3. storing trained models (formulas) in machine learning, 4. querying historical data. Flink uses checkpoints to make state fault tolerant and savepoints to restore stateful computations. Flink programs can be rescaled arbitrarily, and when they are, Flink redistributes the internal state; while a job is running, its state can also be queried from the outside. Flink offers several storage options for state, for example the in-memory MemoryStateBackend, the FsStateBackend and the RocksDBStateBackend.
Flink distinguishes two kinds of state: keyed state, a set of state primitives designed specifically for KeyedStream, and operator state, which covers every kind of state that is not bound to a KeyedStream.
Keyed State
Keyed State is always relative to keys and can only be used in functions and operators on a KeyedStream. You can think of Keyed State as Operator State that has been partitioned, or sharded, with exactly one state-partition per key. Each keyed-state is logically bound to a unique composite of <parallel-operator-instance, key>, and since each key "belongs" to exactly one parallel instance of a keyed operator, we can think of this simply as <operator, key>. Keyed State is further organized into so-called Key Groups. Key Groups are the atomic unit by which Flink can redistribute Keyed State; there are exactly as many Key Groups as the defined maximum parallelism. During execution each parallel instance of a keyed operator works with the keys for one or more Key Groups.
This kind of state must be bound to a key and can only be used in operators on a KeyedStream. Each piece of state is bound to a <parallel-operator-instance, key> pair; because each key belongs to exactly one parallel instance (the shuffle guarantees that identical keys land in the same instance), you can think of keyed state as being bound to <operator, key>.
All keyed state is ultimately organized into Key Groups, which are the unit Flink uses when redistributing state; the number of key groups equals the configured maximum parallelism. Each parallel instance of a keyed operator therefore works with the keys of one or more key groups.
Operator State
With Operator State (or non-keyed state), each operator state is bound to one parallel operator instance. The Kafka Connector is a good motivating example for the use of Operator State in Flink. Each parallel instance of the Kafka consumer maintains a map of topic partitions and offsets as its Operator State.
The Operator State interfaces support redistributing state among parallel operator instances when the parallelism is changed. There can be different schemes for doing this redistribution.
All state that is not tied to a KeyedStream is operator state. Unlike keyed state, operator state is bound to a parallel operator instance; it is obtained through the operator state store (for example in CheckpointedFunction#initializeState), and it can be redistributed across operator instances using different redistribution schemes.
Whether keyed or operator state, state in Flink exists in only two forms: managed state and raw state. Managed state is supported by all operators; Flink provides rich state types and APIs for it, manages its data structures, and can use its own serializers to store it in the configured backend during checkpoints. Raw state is only used inside custom operators: the user has to handle serialization, and during a checkpoint Flink just stores the state as bytes without understanding its structure. In practice you should almost always use managed state, because Flink can redistribute it and optimize how it is stored.
In general, only managed state is used.
All keyed state is bound to a key; different keys never see each other's state. Flink currently provides the state types below. A key can be associated with several states, but each state instance belongs to exactly one key.
State type | Description |
---|---|
ValueState<T> | Stores a single value T; read it with T value(), update it with update(T) |
ListState<T> | Stores a list of T elements; add(T), addAll(List<T>), Iterable<T> get(), update(List<T>) |
ReducingState<T> | Stores a single value and folds every add(T) into it automatically; add(T), T get(); requires a ReduceFunction |
AggregatingState<IN, OUT> | Stores a single aggregated value, where the IN and OUT types may differ; add(IN), OUT get(); requires an AggregateFunction |
FoldingState<T, ACC> | Equivalent to ReducingState but driven by a FoldFunction (deprecated since Flink 1.4) |
MapState<UK, UV> | Stores a map; put(UK, UV), putAll(Map<UK, UV>), get(UK), entries(), keys(), values() |
Every one of these states also has a clear() method that removes the state for the current key.
Using these state types requires two things:
- Obtain the state from the RuntimeContext, which means the function must be a rich function (for example RichMapFunction instead of MapFunction):
class MyMapFunction implements MapFunction<String, Integer> {
  public Integer map(String value) { return Integer.parseInt(value); }
};
class MyMapFunction extends RichMapFunction<String, Integer> {
  public Integer map(String value) { return Integer.parseInt(value); }
};
The RuntimeContext exposes one accessor per state type:
- ValueState<T> getState(ValueStateDescriptor<T>)
- ReducingState<T> getReducingState(ReducingStateDescriptor<T>)
- ListState<T> getListState(ListStateDescriptor<T>)
- AggregatingState<IN, OUT> getAggregatingState(AggregatingStateDescriptor<IN, ACC, OUT>)
- FoldingState<T, ACC> getFoldingState(FoldingStateDescriptor<T, ACC>)
- MapState<UK, UV> getMapState(MapStateDescriptor<UK, UV>)
- Create the matching StateDescriptor: ValueStateDescriptor, ListStateDescriptor, ReducingStateDescriptor, FoldingStateDescriptor, AggregatingStateDescriptor, or MapStateDescriptor.
class CountMapFunction extends RichMapFunction[(String,Int),(String,Int)]{
var state:ValueState[Int]=_
override def map(value: (String, Int)): (String,Int) = {
var history:Int= state.value()
if(history==null){
  history=0
}
state.update(history+value._2)
(value._1,history+value._2)
}
override def open(parameters: Configuration): Unit = {
val vsd = new ValueStateDescriptor[Int]("count",createTypeInformation[Int])
val runtimeContext = getRuntimeContext()
state = runtimeContext.getState(vsd)
}
}
//1. Create the StreamExecutionEnvironment
val env=StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("CentOS",9999)
.flatMap(line=>line.split("\\s+"))
.map((_,1))
.keyBy(0)
.map(new CountMapFunction)
.print()
//4. Execute the job
env.execute("wordcount")
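The same pattern carries over to the other state types. As an illustrative sketch (class and state names are made up), a keyed ListState that remembers every value seen for the current key could look like this; it would be used just like the ValueState example above, i.e. keyBy(0) followed by flatMap(new HistoryFlatMapFunction):
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector
import org.apache.flink.api.scala._
import scala.collection.JavaConverters._

class HistoryFlatMapFunction extends RichFlatMapFunction[(String, Int), (String, String)] {
  @transient private var historyState: ListState[Int] = _

  override def open(parameters: Configuration): Unit = {
    val lsd = new ListStateDescriptor[Int]("history", createTypeInformation[Int])
    historyState = getRuntimeContext.getListState(lsd)
  }

  override def flatMap(value: (String, Int), out: Collector[(String, String)]): Unit = {
    historyState.add(value._2)                          // append the new value for this key
    val all = historyState.get().asScala.mkString(",")  // read everything stored so far
    out.collect((value._1, all))
  }
}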
import org.apache.flink.api.common.functions.AggregateFunction
// IN = input type, ACC = accumulator type, OUT = output type
class AvgAggFunction extends AggregateFunction[(Int,Double),(Double,Int,Int),(Int,Double)]{
override def createAccumulator(): (Double, Int, Int) = {
(0.0,0,-1)
}
override def add(value: (Int, Double), accumulator: (Double, Int, Int)): (Double, Int, Int) = {
var total=accumulator._1 + value._2
var count=accumulator._2+1
(total,count,value._1)
}
override def getResult(accumulator: (Double, Int, Int)): (Int, Double) = {
(accumulator._3,accumulator._1/accumulator._2)
}
override def merge(a: (Double, Int, Int), b: (Double, Int, Int)): (Double, Int, Int) = {
(a._1+b._1,a._2+b._2,a._3)
}
}
import org.apache.flink.api.common.functions.{AggregateFunction, RichMapFunction}
import org.apache.flink.api.common.state.{AggregatingState, AggregatingStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.api.scala._
class AvgMapFunction extends RichMapFunction[(Int,Double),(Int,Double)]{
var avgState:AggregatingState[(Int,Double),(Int,Double)]=_
override def map(value: (Int, Double)): (Int, Double) = {
avgState.add(value)
avgState.get()
}
override def open(parameters: Configuration): Unit = {
var asd=new AggregatingStateDescriptor("avgcost", new AvgAggFunction, createTypeInformation[(Double,Int,Int)])
avgState=getRuntimeContext.getAggregatingState(asd)
}
}
//1. Create the StreamExecutionEnvironment
val env=StreamExecutionEnvironment.getExecutionEnvironment
//1 zhansan 2 4.5
env.socketTextStream("CentOS",9999)
.map(line=>line.split("\\s+"))
.map(tokens=>(tokens(0).toInt,tokens(2).toInt * tokens(3).toDouble))
.keyBy(0)
.map(new AvgMapFunction)
.print()
//4. Execute the job
env.execute("wordcount")
Keyed state can be given a time-to-live (TTL). Once TTL is configured, Flink makes a best effort to remove expired state, reducing the memory footprint of the state.
class CountMapFunction extends RichMapFunction[(String,Int),(String,Int)]{
var state:ValueState[Int]=_
override def map(value: (String, Int)): (String,Int) = {
var history:Int= state.value()
if(history==null){
  history=0
}
state.update(history+value._2)
(value._1,history+value._2)
}
//create the state (with TTL enabled)
override def open(parameters: Configuration): Unit = {
val vsd = new ValueStateDescriptor[Int]("count",createTypeInformation[Int])
//1. Build the TTL configuration
val ttlConfig = StateTtlConfig
  .newBuilder(Time.seconds(5)) //the state stays alive for 5 s
  .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)//when the TTL timestamp is refreshed
  .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)//never return expired state
  .build
//2. Enable TTL on the descriptor
vsd.enableTimeToLive(ttlConfig)
val runtimeContext = getRuntimeContext()
state = runtimeContext.getState(vsd)
}
}
Let's break down what these parameters mean:
val ttlConfig = StateTtlConfig
.newBuilder(Time.seconds(5)) //①
.setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)//②
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)//③
.build
① is mandatory: it sets the state's time-to-live.
② sets the update policy, i.e. when the TTL timestamp is refreshed (default OnCreateAndWrite):
- OnCreateAndWrite: creating or writing the state refreshes the timestamp.
- OnReadAndWrite: reading or writing the state refreshes the timestamp.
③ controls whether expired state may still be returned (default NeverReturnExpired):
- NeverReturnExpired: an expired value is never returned.
- ReturnExpiredIfNotCleanedUp: an expired value may still be returned if it has not been cleaned up yet.
Notes:
- Enabling TTL means the system stores a timestamp with every state entry, which increases the memory cost of the state.
- The TTL clock is the (processing) time of the compute node.
- If the state you are restoring was written without TTL enabled and you then change the TTL configuration, the state cannot be restored (see ①) and the system throws a StateMigrationException.
- Enabling TTL has no effect on checkpoints and savepoints themselves; it only tells Flink how to treat the state.
Expired state is removed when it is read. If some keyed state has expired but is never accessed again, the system keeps it.
val ttlConfig = StateTtlConfig
.newBuilder(Time.seconds(5)) //the state stays alive for 5 s
.setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)//when the TTL timestamp is refreshed
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)//never return expired state
.cleanupFullSnapshot()
.build
val ttlConfig = StateTtlConfig
.newBuilder(Time.seconds(5)) //the state stays alive for 5 s
.setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)//when the TTL timestamp is refreshed
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)//never return expired state
.cleanupInBackground()
.build
For in-memory (heap) state backends:
val ttlConfig = StateTtlConfig
.newBuilder(Time.seconds(5)) //the state stays alive for 5 s
.setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)//when the TTL timestamp is refreshed
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)//never return expired state
.cleanupIncrementally(5,true)
.build
In cleanupIncrementally(5, true), the first argument is how many state entries are checked for expiration per cleanup pass (expired entries are deleted); true means a cleanup pass is additionally triggered for every processed record, while false means the check only happens lazily, when the state itself is accessed or updated.
Cleanup with the RocksDB state backend
RocksDB (a key-value store) compacts its state asynchronously in the background, merging entries with the same key to keep the state files small, but compaction does not remove expired state by itself. You can therefore enable a compaction filter so that RocksDB drops expired state while compacting. This filter is disabled by default; enable it in flink-conf.yaml with state.backend.rocksdb.ttl.compaction.filter.enabled: true
or through the API:
RocksDBStateBackend::enableTtlCompactionFilter
val ttlConfig = StateTtlConfig
.newBuilder(Time.seconds(5)) //the state stays alive for 5 s
.setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)//when the TTL timestamp is refreshed
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)//never return expired state
.cleanupInRocksdbCompactFilter(1000)//re-check the current time after every 1000 state entries processed during compaction
.build
The 1000 here means that the filter re-reads the current timestamp after every 1000 state entries it processes during compaction and uses it to decide which entries are expired.
If you want to use operator state, the RichFunction you define needs to implement CheckpointedFunction or ListCheckpointed:
//take a snapshot of the current state so it can be persisted
void snapshotState(FunctionSnapshotContext context) throws Exception;
//initialize the state, or restore it after a failure
void initializeState(FunctionInitializationContext context) throws Exception;
snapshotState is called whenever the system takes a checkpoint or savepoint. initializeState is called when the operator is first initialized and again when it is restored after a failure, so it usually contains two pieces of logic: 1. initialization, 2. recovery.
Note that operator state currently only supports list-style managed state: the state is a list of elements that are independent of each other, so Flink can hand different list elements to different operator instances when redistributing state.
Flink currently supports two redistribution schemes for operator state (even-split and union, see below):
A buffering sink example:
class BufferSink(threshold: Int = 0) extends SinkFunction[(String,Int)] with CheckpointedFunction {
@transient
private var checkpointedState: ListState[(String, Int)] = _
private val bufferedElements = ListBuffer[(String, Int)]()
//buffer the record and write the results out once the threshold is reached
override def invoke(value: (String, Int), context: SinkFunction.Context[_]): Unit = {
bufferedElements += value
if (bufferedElements.size == threshold) {
for (element <- bufferedElements) {
println(element)
}
bufferedElements.clear()
}
}
//snapshot logic
override def snapshotState(context: FunctionSnapshotContext): Unit = {
checkpointedState.clear()//drop the previously snapshotted content
for (element <- bufferedElements) {
checkpointedState.add(element)
}
}
//initialize / restore the state
override def initializeState(context: FunctionInitializationContext): Unit = {
val descriptor = new ListStateDescriptor[(String, Int)]("buffered-elements", createTypeInformation[(String, Int)])
checkpointedState = context.getOperatorStateStore.getListState(descriptor)
if(context.isRestored) {//we are recovering from a previous snapshot
for(element <- checkpointedState.get().asScala) {
bufferedElements += element
}
}
}
}
If you obtain the state with context.getOperatorStateStore.getListState, the list elements are split evenly across the parallel instances on redistribution; if you want every instance to receive the full copy, use context.getOperatorStateStore.getUnionListState, as in the sketch below.
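A sketch of the only change needed in the BufferSink above to switch redistribution modes (illustrative):
override def initializeState(context: FunctionInitializationContext): Unit = {
  val descriptor = new ListStateDescriptor[(String, Int)]("buffered-elements", createTypeInformation[(String, Int)])
  // even-split redistribution: each parallel instance receives a slice of the list
  // checkpointedState = context.getOperatorStateStore.getListState(descriptor)
  // union redistribution: each parallel instance receives the complete list on restore
  checkpointedState = context.getOperatorStateStore.getUnionListState(descriptor)
  if (context.isRestored) {
    for (element <- checkpointedState.get().asScala) {
      bufferedElements += element
    }
  }
}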
Test steps
#==============================================================================
# Fault tolerance and checkpointing
#==============================================================================
# The backend that will be used to store operator state checkpoints if
# checkpointing is enabled.
#
# Supported backends are 'jobmanager', 'filesystem', 'rocksdb', or the
# .
#
state.backend: rocksdb
# Directory for checkpoints filesystem, when using any of the default bundled
# state backends.
#
state.checkpoints.dir: hdfs:///flink-checkpoints
# Default target directory for savepoints, optional.
#
state.savepoints.dir: hdfs:///flink-savepoints
# Flag to enable/disable incremental checkpoints for backends that
# support incremental checkpoints (like the RocksDB state backend).
#
state.backend.incremental: true
state.backend.rocksdb.ttl.compaction.filter.enabled: true
#==============================================================================
# HistoryServer
#==============================================================================
# The HistoryServer is started and stopped via bin/historyserver.sh (start|stop)
# Directory to upload completed jobs to. Add this directory to the list of
# monitored directories of the HistoryServer as well (see below).
jobmanager.archive.fs.dir: hdfs:///completed-jobs/
# The address under which the web-based HistoryServer listens.
historyserver.web.address: CentOS
# The port under which the web-based HistoryServer listens.
historyserver.web.port: 8082
# Comma separated list of directories to monitor for completed jobs.
historyserver.archive.fs.dir: hdfs:///completed-jobs/
# Interval in milliseconds for refreshing the monitored directories.
historyserver.archive.fs.refresh-interval: 10000
[root@CentOS flink-1.8.1]# ./bin/flink list -m CentOS:8081
------------------ Running/Restarting Jobs -------------------
27.08.2019 21:07:34 : a623ae600438c52010e73b6f808af8a6 : wordcount (RUNNING)
--------------------------------------------------------------
[root@CentOS flink-1.8.1]# ./bin/flink cancel -s a623ae600438c52010e73b6f808af8a6
Cancelling job a623ae600438c52010e73b6f808af8a6 with savepoint to default savepoint directory.
Cancelled job a623ae600438c52010e73b6f808af8a6. Savepoint stored in hdfs://CentOS:9000/flink-savepoints/savepoint-a623ae-0f339a1004f0.
This interface is a variant of CheckpointedFunction with more restrictions: the state must be a list, and on recovery it only supports even-split redistribution of that list.
//return the state that the system should persist
List<T> snapshotState(long checkpointId, long timestamp) throws Exception;
//receives the previously persisted state
void restoreState(List<T> state) throws Exception;
When a checkpoint/savepoint is taken, the system calls snapshotState and persists the returned List<T>; during recovery it calls restoreState with that list.
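A minimal ListCheckpointed sketch (illustrative; the names are made up): a counter source whose current offset is snapshotted as a one-element list and restored after a failure.
import java.lang.{Long => JLong}
import java.util.Collections
import org.apache.flink.streaming.api.checkpoint.ListCheckpointed
import org.apache.flink.streaming.api.functions.source.SourceFunction
import scala.collection.JavaConverters._

class CheckpointedCounterSource extends SourceFunction[Long] with ListCheckpointed[JLong] {
  @volatile private var running = true
  private var offset = 0L

  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    while (running) {
      // emit and advance the offset under the checkpoint lock so snapshots are consistent
      ctx.getCheckpointLock.synchronized {
        ctx.collect(offset)
        offset += 1
      }
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = running = false

  // called on checkpoint/savepoint: return the state to persist
  override def snapshotState(checkpointId: Long, timestamp: Long): java.util.List[JLong] =
    Collections.singletonList(JLong.valueOf(offset))

  // called on recovery: restore the previously persisted state
  override def restoreState(state: java.util.List[JLong]): Unit =
    for (s <- state.asScala) offset = s.longValue()
}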
Besides keyed state and operator state, Flink has a third kind of state: broadcast state, which is a special form of operator state. With broadcast state you take the records of one stream and broadcast them to every downstream task, so that all tasks share them; typically these are small configuration or rule records. Each task can then consult this broadcast data while processing the records of its own stream.
To use it, create a DataStream (keyed or non-keyed) and a broadcast stream, then connect the DataStream to the broadcast stream with connect(); every downstream task of the DataStream can then access the state carried by the broadcast stream.
class UserOrderKeyedBroadcastProcessFunction(msd:MapStateDescriptor[String,String])
extends KeyedBroadcastProcessFunction[String,(String,String,Double),(String,String),(String,String,Double)]{
//processes elements from the keyed-stream side
override def processElement(value: (String, String, Double), ctx: KeyedBroadcastProcessFunction[String, (String, String, Double), (String, String), (String, String, Double)]#ReadOnlyContext, out: Collector[(String, String, Double)]): Unit = {
val broadcastState = ctx.getBroadcastState(msd)
println("=================")
for(i <- broadcastState.immutableEntries().asScala){
  println(i.getKey+"\t"+i.getValue)
}
var name=broadcastState.get(value._1)//look up the user name by id
// user name, item, price
out.collect((name,value._2,value._3))
}
//processes elements from the broadcast-stream side
override def processBroadcastElement(value: (String, String), ctx: KeyedBroadcastProcessFunction[String, (String, String, Double), (String, String), (String, String, Double)]#Context, out: Collector[(String, String, Double)]): Unit = {
val state: BroadcastState[String, String] = ctx.getBroadcastState(msd)
// user id, user name
state.put(value._1,value._2) //store the user info in the broadcast map state
}
}
//1. Create the StreamExecutionEnvironment
val env=StreamExecutionEnvironment.getExecutionEnvironment
//1 apple 10
val keyedStream=env.socketTextStream("CentOS",9999)
.map(line=>line.split("\\s+"))
.map(tokens=>(tokens(0),tokens(1),tokens(2).toDouble))
.keyBy(0)
//descriptor for the state carried by the broadcast stream
val msd=new MapStateDescriptor[String,String]("user-state",createTypeInformation[String],createTypeInformation[String])
//1 zhansan
val broadcast = env.socketTextStream("CentOS", 8888)
.map(line => line.split("\\s+"))
.map(toknes=>(toknes(0), toknes(1)))
.broadcast(msd)
keyedStream.connect(broadcast)
.process(new UserOrderKeyedBroadcastProcessFunction(msd))
.print()
//4. Execute the job
env.execute("counter")
class UserLevelBroadcastProcessFunction(msd:MapStateDescriptor[String,Int]) extends BroadcastProcessFunction[(String,String),(String,Double),(String,String,Int)] {
override def processElement(value: (String, String),
ctx: BroadcastProcessFunction[(String, String), (String, Double), (String, String, Int)]#ReadOnlyContext,
out: Collector[(String, String, Int)]): Unit = {
out.collect(value._1,value._2,ctx.getBroadcastState(msd).get(value._1))
}
//level:0 1 2 3
override def processBroadcastElement(value: (String, Double),
ctx: BroadcastProcessFunction[(String, String), (String, Double), (String, String, Int)]#Context,
out: Collector[(String, String, Int)]): Unit = {
val state = ctx.getBroadcastState(msd)
if(value._2<100){
state.put(value._1,0)
}else if(value._2 < 1000){
state.put(value._1,1)
}else if(value._2 < 5000){
state.put(value._1,2)
}else{
state.put(value._1,3)
}
}
}
object FlinkStreamBroadcaststate {
def main(args: Array[String]): Unit = {
//1. Create the StreamExecutionEnvironment
val env=StreamExecutionEnvironment.getExecutionEnvironment
val msd=new MapStateDescriptor[String,Int]("user-level",createTypeInformation[String],createTypeInformation[Int])
//1 apple 10
val broadcaststream=env.socketTextStream("CentOS",9999)
.map(line=>line.split("\\s+"))
.map(tokens=>(tokens(0),tokens(2).toDouble))
.keyBy(0)
.sum(1)
.broadcast(msd)
//1 zhansan
val userstream = env.socketTextStream("CentOS", 8888)
.map(line => line.split("\\s+"))
.map(toknes=>(toknes(0), toknes(1)))
userstream.connect(broadcaststream).process(new UserLevelBroadcastProcessFunction(msd))
.print()
//4. Execute the job
env.execute("counter")
}
}
val env=StreamExecutionEnvironment.getExecutionEnvironment
//enable checkpointing every 7 s with exactly-once semantics
env.enableCheckpointing(7000,CheckpointingMode.EXACTLY_ONCE)
//a checkpoint must complete within 4 s, otherwise it is aborted
env.getCheckpointConfig.setCheckpointTimeout(4000)
//wait at least 5 s after the previous checkpoint completes before starting the next one
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(5000)
env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)//at most one checkpoint in flight at a time
//keep the checkpoint data when the job is cancelled
env.getCheckpointConfig.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
//fail the job if taking a checkpoint fails
env.getCheckpointConfig.setFailOnCheckpointingErrors(true)
env.socketTextStream("CentOS",9999)
.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.print()
env.execute("wordcount")
A checkpoint is a recovery point that the system creates automatically so that the computation can be recovered after a failure. In addition, Flink provides savepoints: manually triggered state backups that let you roll the system back to a chosen state later.
[root@CentOS flink-1.7.2]# ./bin/flink savepoint 7c46aa11163ecd995c81f12ff92c14cc hdfs://CentOS:9000/2019-08-29
[root@CentOS flink-1.7.2]# ./bin/flink run -s <savepoint-directory> -c <fully-qualified-class-name> <path-to-jar>
Flink provides different state backends that determine how and where state is stored. Depending on the backend, state lives on the Java heap or off-heap. Flink manages the application's state, which means it handles the memory management (spilling to disk if necessary) so that applications can hold very large state. By default the flink-conf.yaml configuration file determines the state backend for all Flink jobs, but the default can be overridden per job, as shown below.
val env=StreamExecutionEnvironment.getExecutionEnvironment
env.setStateBackend(... )
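For example (a sketch; the HDFS path is only an assumption, and the RocksDB variant additionally needs the flink-statebackend-rocksdb_2.11 dependency):
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend

val env = StreamExecutionEnvironment.getExecutionEnvironment
// keep working state on the TaskManager heap, checkpoints on HDFS
env.setStateBackend(new FsStateBackend("hdfs://CentOS:9000/flink-checkpoints"))
// or: keep working state in RocksDB on local disk, with incremental checkpoints enabled
env.setStateBackend(new RocksDBStateBackend("hdfs://CentOS:9000/flink-checkpoints", true))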
Window computation is at the heart of stream processing: it slices an unbounded stream into chunks of finite size, usually bounded by time or by element count, and ordinary computations are then run over these finite windows. Let's first look at the basic code skeleton of Flink's window API:
keyed streams
stream
.keyBy(...)                    <- turn the non-keyed stream into a keyed stream
.window(...)                   <- required: the window "assigner" (window type)
[.trigger(...)]                <- optional: "trigger" (every window type has a default trigger)
[.evictor(...)]                <- optional: "evictor" (no window has an evictor by default) - removes elements
[.allowedLateness(...)]        <- optional: "lateness" (late data is dropped by default)
[.sideOutputLateData(...)]     <- optional: "output tag" (route late data to a side output)
.reduce/aggregate/fold/apply() <- required: the window "function"
[.getSideOutput(...)]          <- optional: "output tag" - retrieve the late data
non-keyed
stream
.windowAll(...)                <- required: the window "assigner" (window type)
[.trigger(...)]                <- optional: "trigger" (every window type has a default trigger)
[.evictor(...)]                <- optional: "evictor" (no window has an evictor by default) - removes elements
[.allowedLateness(...)]        <- optional: "lateness" (late data is dropped by default)
[.sideOutputLateData(...)]     <- optional: "output tag" (route late data to a side output)
.reduce/aggregate/fold/apply() <- required: the window "function"
[.getSideOutput(...)]          <- optional: "output tag" - retrieve the late data
A window is created when the first element that falls into its time range arrives, and it is removed automatically once the watermark passes the window's end time.
Flink only guarantees this removal for time-based windows: sliding, tumbling and session windows. It does not apply to global windows, because global windows partition elements by count rather than by time.
Every window has a trigger and a window function: the trigger decides when the window fires, the window function does the computation. All window types have a default trigger except global windows. A window can additionally be given an evictor that removes elements before or after the trigger fires.
The window assigner defines how elements are assigned to windows (i.e. the window type). Flink already defines the common assigners: tumbling windows, sliding windows, session windows and global windows. All of them except global windows are time based and can work on event time or processing time. A time window has a start time and an end time marking its range; the interval is closed at the start and open at the end, and the window's maxTimestamp method returns the largest timestamp an element of that window may carry.
Tumbling windows have a fixed length, and the slide equals the window length, so windows do not overlap.
val env=StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("CentOS",9999)
.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.reduce((v1,v2)=>(v1._1,v1._2+v2._2))
.print()
env.execute("wordcount")
val env=StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("CentOS",9999)
.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.window(SlidingProcessingTimeWindows.of(Time.seconds(10),Time.seconds(5)))
.reduce((v1,v2)=>(v1._1,v1._2+v2._2))
.print()
env.execute("wordcount")
Session windows look at the time gap between elements: if the gap is smaller than the session gap, the element is merged into the current window; if it is larger, the current window closes and subsequent elements belong to a new window. Unlike tumbling and sliding windows, session windows have no fixed size; under the hood they work by merging windows.
val env=StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("CentOS",9999)
.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.window(ProcessingTimeSessionWindows.withGap(Time.seconds(5)))
.reduce((v1,v2)=>(v1._1,v1._2+v2._2))
.print()
env.execute("wordcount")
val env=StreamExecutionEnvironment.getExecutionEnvironment
//001 5000 100
//002 10000 10
env.socketTextStream("CentOS",9999)
.map(line=>line.split("\\s+"))
.map(ts=>(ts(0),ts(1).toLong,ts(2).toDouble))
.keyBy(0)
.window(ProcessingTimeSessionWindows.withDynamicGap[(String,Long,Double)](new SessionWindowTimeGapExtractor[(String,Long,Double)]{
override def extract(element: (String,Long,Double)): Long = {
println("element:"+element)
element._2 //the dynamic gap, in milliseconds
}
}))
.reduce((v1,v2)=>(v1._1,v1._2,v1._3+v2._3))
.print()
env.execute("wordcount")
A global window puts all elements with the same key into one global window that by default never closes (it never fires), because it has no default trigger; the user therefore has to provide a Trigger.
val env=StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("CentOS",9999)
.flatMap(line=>line.split("\\s+"))
.map((_,1))
.keyBy(0)
.window(GlobalWindows.create())
.trigger(CountTrigger.of(3)) //fire the window once 3 elements with the same key have accumulated
.reduce((v1,v2)=>(v1._1,v1._2+v2._2))
.print()
env.execute("wordcount")
Once the window assigner is chosen, you specify how the elements of each window are aggregated or computed; that is the job of the window function, which operates on the window's elements.
A window function can be a ReduceFunction, AggregateFunction, FoldFunction (not usable with session windows) or ProcessWindowFunction. ReduceFunction and AggregateFunction are the most efficient because they aggregate incrementally. A ProcessWindowFunction receives all elements of the window, so it computes over the full contents and is less efficient than the first two, but it also gets access to the window's metadata. Because the system has to buffer every element until the window fires, a ProcessWindowFunction uses more memory; it can, however, be combined with a ReduceFunction, AggregateFunction or FoldFunction to reduce that cost.
val env=StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("CentOS",9999)
.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.reduce(new ReduceFunction[(String,Int)]{
override def reduce(v1: (String, Int), v2: (String, Int)): (String, Int) = {
(v1._1,v1._2+v2._2)
}
})
.print()
env.execute("wordcount")
An AggregateFunction
is a generalized version of a ReduceFunction
that has three types: an input type (IN
), accumulator type (ACC
), and an output type (OUT
). The input type is the type of elements in the input stream and the AggregateFunction
has a method for adding one input element to an accumulator. The interface also has methods for creating an initial accumulator, for merging two accumulators into one accumulator and for extracting an output (of type OUT
) from an accumulator.
val env=StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("CentOS",9999)
.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.aggregate(new AggregateFunction[(String,Int),(String,Int),(String,Int)]{
override def createAccumulator(): (String, Int) = {
("",0)
}
override def add(value: (String, Int), accumulator: (String, Int)): (String, Int) = {
(value._1,value._2+accumulator._2)
}
override def getResult(accumulator: (String, Int)): (String, Int) = {
accumulator
}
override def merge(a: (String, Int), b: (String, Int)): (String, Int) = {
(a._1,a._2+b._2)
}
})
.print()
env.execute("wordcount")
A ProcessWindowFunction gets an Iterable containing all the elements of the window, and a Context object with access to time and state information, which enables it to provide more flexibility than other window functions. This comes at the cost of performance and resource consumption, because elements cannot be incrementally aggregated but instead need to be buffered internally until the window is considered ready for processing.
In short: maximum flexibility, at the cost of buffering every element of the window internally until the window is considered ready for processing.
val env=StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("CentOS",9999)
.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(_._1)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.process(new ProcessWindowFunction[(String,Int),(String,Int,Int),String,TimeWindow]{
override def process(key: String,
context: Context,
elements: Iterable[(String, Int)],
out: Collector[(String, Int,Int)]): Unit = {
var total=0
for(i<- elements){
total += i._2
}
//per-window state: its lifecycle is tied to the window
val windowState= context.windowState.getState[Int](new ValueStateDescriptor[Int](key+"windowCount",createTypeInformation[Int]))
var currentCount=windowState.value()+total
windowState.update(currentCount)
//global (per-key) state: independent of any window
val globalState= context.globalState.getState[Int](new ValueStateDescriptor[Int](key+"globalcount",createTypeInformation[Int]))
val globalCount=globalState.value()+total
globalState.update(globalCount)
out.collect((key,currentCount,globalCount))
}
})
.print()
env.execute("wordcount")
A ProcessWindowFunction
can be combined with either a ReduceFunction
, an AggregateFunction
, or a FoldFunction
to incrementally aggregate elements as they arrive in the window. When the window is closed, the ProcessWindowFunction
will be provided with the aggregated result. This allows it to incrementally compute windows while having access to the additional window meta information of the ProcessWindowFunction
.
val env=StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("CentOS",9999)
.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(_._1)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.fold(
("",0),
(acc:(String,Int),v:(String,Int))=>(v._1,acc._2+v._2),
new ProcessWindowFunction[(String,Int),(String,Int,Int),String,TimeWindow] {
override def process(key: String, context: Context, elements: Iterable[(String, Int)], out: Collector[(String, Int,Int)]): Unit = {
var total=0
for(i<- elements){
total += i._2
}
//per-window state: its lifecycle is tied to the window
val windowState= context.windowState.getState[Int](new ValueStateDescriptor[Int](key+"windowCount",createTypeInformation[Int]))
var currentCount=windowState.value()+total
windowState.update(currentCount)
//global (per-key) state: independent of any window
val globalState= context.globalState.getState[Int](new ValueStateDescriptor[Int](key+"globalcount",createTypeInformation[Int]))
val globalCount=globalState.value()+total
globalState.update(globalCount)
out.collect((key,currentCount,globalCount))
}
}
)
.print()
env.execute("wordcount")
In some places where a ProcessWindowFunction
can be used you can also use a WindowFunction
. This is an older version of ProcessWindowFunction
that provides less contextual information and does not have some advanced features, such as per-window keyed state.
val env=StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("CentOS",9999)
.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(_._1)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.apply(new WindowFunction[(String,Int),(String,Int),String,TimeWindow]{
override def apply(key: String, window: TimeWindow,
input: Iterable[(String, Int)],
out: Collector[(String, Int)]): Unit = {
out.collect((key,input.map(_._2).sum))
}
})
.print()
env.execute("wordcount")
The trigger decides when a window is ready for its window function to run. Every WindowAssigner comes with a default trigger; only when the default trigger does not meet your needs do you implement your own.
A Trigger defines five callback methods that react to events:
- onElement(): called when an element falls into the window.
- onEventTime(): called when a registered event-time timer fires.
- onProcessingTime(): called when a registered processing-time timer fires.
- onMerge(): called when session windows are merged; the triggers' state is merged along with the windows.
- clear(): called when the window is removed, so the trigger can clean up its state.
Note: the first three methods return a TriggerResult, which decides whether the window fires:
- CONTINUE: keep the window and do not fire.
- FIRE: the window is ready; invoke the window function.
- PURGE: clear the window's elements and discard the window.
- FIRE_AND_PURGE: fire the window, then clear its contents (rarely used).
Default triggers of the built-in WindowAssigners:
WindowAssigner type | Default trigger |
---|---|
event-time window | EventTimeTrigger |
processing-time window | ProcessingTimeTrigger |
GlobalWindow | NeverTrigger |
public static class NeverTrigger extends Trigger<Object, GlobalWindow>
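If none of the default triggers fit, you can implement your own. The sketch below is illustrative only (class name and firing rule are made up): it fires as soon as an element containing "error" arrives and otherwise falls back to an end-of-window processing-time timer. The example that follows uses the built-in DeltaTrigger instead.
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

class ErrorOrTimeTrigger extends Trigger[String, TimeWindow] {

  override def onElement(element: String, timestamp: Long, window: TimeWindow,
                         ctx: Trigger.TriggerContext): TriggerResult = {
    // register the normal end-of-window timer
    ctx.registerProcessingTimeTimer(window.maxTimestamp())
    if (element.contains("error")) TriggerResult.FIRE else TriggerResult.CONTINUE
  }

  override def onProcessingTime(time: Long, window: TimeWindow,
                                ctx: Trigger.TriggerContext): TriggerResult =
    TriggerResult.FIRE_AND_PURGE

  override def onEventTime(time: Long, window: TimeWindow,
                           ctx: Trigger.TriggerContext): TriggerResult =
    TriggerResult.CONTINUE

  override def clear(window: TimeWindow, ctx: Trigger.TriggerContext): Unit =
    ctx.deleteProcessingTimeTimer(window.maxTimestamp())
}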
val env=StreamExecutionEnvironment.getExecutionEnvironment
//001 70
env.socketTextStream("CentOS",9999)
.map(line=>line.split("\\s+"))
.map(tokens=>(tokens(0),tokens(1).toDouble))
.keyBy(_._1)
.window(GlobalWindows.create())
.trigger(DeltaTrigger.of[(String,Double),GlobalWindow](20.0,new DeltaFunction[(String,Double)] {//fire the window when the delta exceeds 20
override def getDelta(oldDataPoint: (String, Double), newDataPoint: (String, Double)): Double = {
println(oldDataPoint+"\t"+newDataPoint)
newDataPoint._2-oldDataPoint._2
}
},createTypeInformation[(String,Double)].createSerializer(env.getConfig)))
.process(new ProcessWindowFunction[(String,Double),(String,Int,Int),String,GlobalWindow]{
override def process(key: String,
context: Context,
elements: Iterable[(String, Double)],
out: Collector[(String, Int,Int)]): Unit = {
elements.foreach(println)
}
})
.print()
env.execute("wordcount")
Flink’s windowing model allows specifying an optional Evictor
in addition to the WindowAssigner
and the Trigger
. This can be done using the evictor(...)
method (shown in the beginning of this document). The evictor has the ability to remove elements from a window after the trigger fires and before and/or after the window function is applied.
void evictBefore(Iterable<TimestampedValue<T>> elements, int size, W window, EvictorContext evictorContext);
void evictAfter(Iterable<TimestampedValue<T>> elements, int size, W window, EvictorContext evictorContext);
val env=StreamExecutionEnvironment.getExecutionEnvironment
//001 70
env.socketTextStream("CentOS",9999)
.map(line=>line.split("\\s+"))
.map(tokens=>(tokens(0),tokens(1).toDouble))
.keyBy(_._1)
.window(ProcessingTimeSessionWindows.withGap(Time.seconds(10)))
.evictor(CountEvictor.of(3))
.process(new ProcessWindowFunction[(String,Double),(String,Int,Int),String,TimeWindow]{
override def process(key: String,
context: Context,
elements: Iterable[(String, Double)],
out: Collector[(String, Int,Int)]): Unit = {
elements.foreach(println)
}
})
.print()
env.execute("wordcount")
Flink supports several notions of time:
- Processing time: the system clock of the node executing the operator (the default).
- Event time: the time at which the event occurred, usually embedded in the event itself.
- Ingestion time: the time at which the event enters the Flink cluster (at the source).
Of the three, ingestion time and processing time cannot deal with late or out-of-order data. Event time can, but it is somewhat more involved: the user has to specify a watermark-generation strategy that is used to decide when windows fire.
val env=StreamExecutionEnvironment.getExecutionEnvironment
//choose the time characteristic (pick one of the three)
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime|ProcessingTime|EventTime)
The first two strategies do not require the system to maintain watermarks, so they are used in essentially the same way. If you choose event time, you must specify a watermark-generation strategy in the pipeline.
Specifying the watermark-generation strategy (periodic watermarks, AssignerWithPeriodicWatermarks):
//compute the current watermark; called periodically by the system
Watermark getCurrentWatermark();
//extract the event time from an element
long extractTimestamp(T element, long previousElementTimestamp);
val maxOrderness:Long=2000 //maximum allowed out-of-orderness: 2 s
var maxCurrentTimestamp:Long=0
val sdf=new SimpleDateFormat("HH:mm:ss")
//periodically recompute the latest watermark
override def getCurrentWatermark: Watermark = {
val w=maxCurrentTimestamp-maxOrderness
new Watermark(w)
}
//extract the current event's timestamp
override def extractTimestamp(element: (String, String, Double, Long), previousElementTimestamp: Long): Long = {
maxCurrentTimestamp=Math.max(maxCurrentTimestamp,element._4)
println("currentwatermarker:"+ sdf.format(maxCurrentTimestamp-maxOrderness)+",crentEventTime:"+sdf.format(element._4))
element._4
}
By setting env.getConfig.setAutoWatermarkInterval(1000) you control how often the watermark is recomputed (this periodic approach is the recommended one).
//a watermark is computed each time an event arrives
Watermark checkAndGetNextWatermark(T lastElement, long extractedTimestamp);
//extract the event time from an element
long extractTimestamp(T element, long previousElementTimestamp);
val maxOrderness:Long=2000 //maximum allowed lateness: 2 s
var maxCurrentTimestamp:Long=0
val sdf=new SimpleDateFormat("HH:mm:ss")
//periodically recompute the latest watermark
override def getCurrentWatermark: Watermark = {
val w=maxCurrentTimestamp-maxOrderness
new Watermark(w)
}
//extract the current event's timestamp
override def extractTimestamp(element: (String, String, Double, Long), previousElementTimestamp: Long): Long = {
maxCurrentTimestamp=Math.max(maxCurrentTimestamp,element._4)
println("Thread:"+Thread.currentThread().getId+"\tW:"+ sdf.format(maxCurrentTimestamp-maxOrderness)+",crentEventTime:"+sdf.format(element._4))
element._4
}
val env=StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//choose the time characteristic
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
env.getConfig.setAutoWatermarkInterval(1000)
//001 zs 4.5 <timestamp>
env.socketTextStream("CentOS",9999)
.map(line=>line.split("\\s+"))
.map(tokens=>(tokens(0),tokens(1),tokens(2).toDouble,tokens(3).toLong))
.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks[(String, String, Double, Long)] {
val maxOrderness:Long=2000 //maximum allowed out-of-orderness: 2 s
var maxCurrentTimestamp:Long=0
val sdf=new SimpleDateFormat("HH:mm:ss")
//periodically recompute the latest watermark
override def getCurrentWatermark: Watermark = {
val w=maxCurrentTimestamp-maxOrderness
new Watermark(w)
}
//extract the current event's timestamp
override def extractTimestamp(element: (String, String, Double, Long), previousElementTimestamp: Long): Long = {
maxCurrentTimestamp=Math.max(maxCurrentTimestamp,element._4)
println("currentwatermarker:"+ sdf.format(maxCurrentTimestamp-maxOrderness)+",crentEventTime:"+sdf.format(element._4))
element._4
}
})
.keyBy(_._1)
.timeWindow(Time.seconds(5))
.process(new ProcessWindowFunction[(String,String,Double,Long),(String,String,Double,Long),String,TimeWindow]{
override def process(key: String,
context: Context,
elements: Iterable[(String,String,Double,Long)],
out: Collector[(String,String,Double,Long)]): Unit = {
val sdf=new SimpleDateFormat("HH:mm:ss")
val start=sdf.format(context.window.getStart)
val end=sdf.format(context.window.getEnd)
val waterMarker=sdf.format(context.currentWatermark)
println(s"=========${start} \tw:${waterMarker}=========")
elements.foreach(println)
println(s"=========${end} \tw:${waterMarker}=========")
println()
println()
}
})
.print()
env.execute("wordcount")
Note: to make the effect easier to observe, the parallelism is set to 1 here.
Watermarks in Parallel Streams
When watermarks arrive from multiple parallel input streams, the operator uses the smallest of them for its time calculations.
By default Flink drops late data: once the watermark w(T) has passed a window w1's end time T', any further element that falls into w1 is discarded.
w = max(EventTime) - allowed out-of-orderness
Flink can also handle late data: as long as w(T) - T' (the window end) is smaller than the allowed lateness, the element is still added to the window.
.timeWindow(Time.seconds(5))
.allowedLateness(Time.seconds(2))//maximum allowed lateness
.timeWindow(Time.seconds(5))
.allowedLateness(Time.seconds(2))
.sideOutputLateData(lateTag)
.reduce/flod/aggreate/apply
.getSideOutput
val env=StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//choose the time characteristic
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
env.getConfig.setAutoWatermarkInterval(1000)
val lateTag=new OutputTag[(String,String,Double,Long)]("late")
//001 zs 4.5 <timestamp>
val windowStream = env.socketTextStream("CentOS", 9999)
.map(line => line.split("\\s+"))
.map(tokens => (tokens(0), tokens(1), tokens(2).toDouble, tokens(3).toLong))
.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks[(String, String, Double, Long)] {
val maxOrderness: Long = 2000 //maximum allowed lateness: 2 s
var maxCurrentTimestamp: Long = 0
val sdf = new SimpleDateFormat("HH:mm:ss")
//periodically recompute the latest watermark
override def getCurrentWatermark: Watermark = {
val w = maxCurrentTimestamp - maxOrderness
new Watermark(w)
}
//extract the current event's timestamp
override def extractTimestamp(element: (String, String, Double, Long), previousElementTimestamp: Long): Long = {
maxCurrentTimestamp = Math.max(maxCurrentTimestamp, element._4)
println("Thread:" + Thread.currentThread().getId + "\tW:" + sdf.format(maxCurrentTimestamp - maxOrderness) + ",crentEventTime:" + sdf.format(element._4))
element._4
}
})
.keyBy(_._1)
.timeWindow(Time.seconds(5))
.allowedLateness(Time.seconds(2))
.sideOutputLateData(lateTag)
.process(new ProcessWindowFunction[(String, String, Double, Long), (String, String, Double, Long), String, TimeWindow] {
override def process(key: String,
context: Context,
elements: Iterable[(String, String, Double, Long)],
out: Collector[(String, String, Double, Long)]): Unit = {
val sdf = new SimpleDateFormat("HH:mm:ss")
val start = sdf.format(context.window.getStart)
val end = sdf.format(context.window.getEnd)
val waterMarker = sdf.format(context.currentWatermark)
println(s"=========${start} \tw:${waterMarker}=========")
elements.foreach(println)
println(s"=========${end} \tw:${waterMarker}=========")
println()
println()
}
})
windowStream.print()
windowStream.getSideOutput(lateTag).print("late:")
env.execute("wordcount")
A window join only joins elements of two streams that ① share a common key (the join condition) and ② fall into the same time window. The joined pairs are then passed to a `JoinFunction` or `FlatJoinFunction`. The usual structure of a window join:
stream.join(otherStream)
.where(<KeySelector>)
.equalTo(<KeySelector>)
.window(<WindowAssigner>)
.apply(<JoinFunction>)
Points to note:
When performing a tumbling window join, all elements with a common key and a common tumbling window are joined as pairwise combinations and passed on to a JoinFunction
or FlatJoinFunction
. Because this behaves like an inner join, elements of one stream that do not have elements from another stream in their tumbling window are not emitted!
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
fsEnv.setParallelism(1)
//set the time characteristic
fsEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
//compute the watermark every 1 s
fsEnv.getConfig.setAutoWatermarkInterval(1000)
//1 zhansan 1567392721000
val userDataStream = fsEnv.socketTextStream("CentOS", 9999)
.map(line => line.split("\\s+"))
.map(tokens => (tokens(0), tokens(1),tokens(2).toLong))
.assignTimestampsAndWatermarks(new UserWaterMarker)
//1 apple 4.5 1567392721000
val orderDataStream = fsEnv.socketTextStream("CentOS", 8888)
.map(line => line.split("\\s+"))
.map(tokens => (tokens(0), tokens(1), tokens(2).toDouble,tokens(3).toLong))
.assignTimestampsAndWatermarks(new OrderWaterMarker)
userDataStream.join(orderDataStream)
.where(user=>user._1)
.equalTo(order=>order._1)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.apply((user,order)=>{
(user._1,user._2,order._2,order._3)
})
.print()
fsEnv.execute("UserOrderJoin")
class OrderWaterMarker extends AssignerWithPeriodicWatermarks[(String,String,Double,Long)]{
val maxOrderness:Long=2000
var currentMaxTimestamp:Long=0
var sdf=new SimpleDateFormat("HH:mm:ss")
override def getCurrentWatermark: Watermark = {
return new Watermark(currentMaxTimestamp-maxOrderness)
}
override def extractTimestamp(element: (String,String,Double,Long), previousElementTimestamp: Long): Long = {
currentMaxTimestamp=Math.max(currentMaxTimestamp,element._4)
println(s"Watermark:${sdf.format(currentMaxTimestamp-maxOrderness)}\tEventTime:${sdf.format(element._4)}")
element._4
}
}
class UserWaterMarker extends AssignerWithPeriodicWatermarks[(String,String,Long)]{
val maxOrderness:Long=2000
var currentMaxTimestamp:Long=0
var sdf=new SimpleDateFormat("HH:mm:ss")
override def getCurrentWatermark: Watermark = {
return new Watermark(currentMaxTimestamp-maxOrderness)
}
override def extractTimestamp(element: (String, String, Long), previousElementTimestamp: Long): Long = {
currentMaxTimestamp=Math.max(currentMaxTimestamp,element._3)
println(s"Watermark:${sdf.format(currentMaxTimestamp-maxOrderness)}\tEventTime:${sdf.format(element._3)}")
element._3
}
}
When performing a sliding window join, all elements with a common key and common sliding window are joined as pairwise combinations and passed on to the JoinFunction
or FlatJoinFunction
. Elements of one stream that do not have elements from the other stream in the current sliding window are not emitted! Note that some elements might be joined in one sliding window but not in another!
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
fsEnv.setParallelism(1)
//set the time characteristic
fsEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
//compute the watermark every 1 s
fsEnv.getConfig.setAutoWatermarkInterval(1000)
//1 zhansan 1567392721000
val userDataStream = fsEnv.socketTextStream("CentOS", 9999)
.map(line => line.split("\\s+"))
.map(tokens => (tokens(0), tokens(1),tokens(2).toLong))
.assignTimestampsAndWatermarks(new UserWaterMarker)
//1 apple 4.5 1567392721000
val orderDataStream = fsEnv.socketTextStream("CentOS", 8888)
.map(line => line.split("\\s+"))
.map(tokens => (tokens(0), tokens(1), tokens(2).toDouble,tokens(3).toLong))
.assignTimestampsAndWatermarks(new OrderWaterMarker)
userDataStream.join(orderDataStream)
.where(user=>user._1)
.equalTo(order=>order._1)
.window(SlidingEventTimeWindows.of(Time.seconds(4),Time.seconds(2)))
.apply((user,order)=>{
(user._1,user._2,order._2,order._3)
})
.print()
fsEnv.execute("UserOrderJoin")
When performing a session window join, all elements with the same key that when “combined” fulfill the session criteria are joined in pairwise combinations and passed on to the JoinFunction
or FlatJoinFunction
. Again this performs an inner join, so if there is a session window that only contains elements from one stream, no output will be emitted!
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
fsEnv.setParallelism(1)
//set the time characteristic
fsEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
//compute the watermark every 1 s
fsEnv.getConfig.setAutoWatermarkInterval(1000)
//1 zhansan 1567392721000
val userDataStream = fsEnv.socketTextStream("CentOS", 9999)
.map(line => line.split("\\s+"))
.map(tokens => (tokens(0), tokens(1),tokens(2).toLong))
.assignTimestampsAndWatermarks(new UserWaterMarker)
//1 apple 4.5 1567392721000
val orderDataStream = fsEnv.socketTextStream("CentOS", 8888)
.map(line => line.split("\\s+"))
.map(tokens => (tokens(0), tokens(1), tokens(2).toDouble,tokens(3).toLong))
.assignTimestampsAndWatermarks(new OrderWaterMarker)
userDataStream.join(orderDataStream)
.where(user=>user._1)
.equalTo(order=>order._1)
.window(EventTimeSessionWindows.withGap(Time.seconds(2)))
.apply((user,order)=>{
(user._1,user._2,order._2,order._3)
})
.print()
fsEnv.execute("UserOrderJoin")
The interval join joins elements of two streams (we’ll call them A & B for now) with a common key and where elements of stream B have timestamps that lie in a relative time interval to timestamps of elements in stream A.
This can also be expressed more formally as b.timestamp ∈ [a.timestamp + lowerBound; a.timestamp + upperBound]
ora.timestamp + lowerBound <= b.timestamp <= a.timestamp + upperBound
If the current watermark has already passed the time interval of an element from the orange stream, that element can no longer be joined with any future element; any data that later falls into such an interval is dropped by default.
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
fsEnv.setParallelism(1)
//set the time characteristic
fsEnv.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
//compute the watermark every 1 s
fsEnv.getConfig.setAutoWatermarkInterval(1000)
//1 zhansan 1567392721000
val userDataStream = fsEnv.socketTextStream("CentOS", 9999)
.map(line => line.split("\\s+"))
.map(tokens => (tokens(0), tokens(1),tokens(2).toLong))
.assignTimestampsAndWatermarks(new UserWaterMarker)
.keyBy(_._1)
//1 apple 4.5 1567392721000
val orderDataStream = fsEnv.socketTextStream("CentOS", 8888)
.map(line => line.split("\\s+"))
.map(tokens => (tokens(0), tokens(1), tokens(2).toDouble,tokens(3).toLong))
.assignTimestampsAndWatermarks(new OrderWaterMarker)
.keyBy(_._1)
userDataStream.intervalJoin(orderDataStream)
.between(Time.seconds(-2),Time.seconds(2))
.process(new ProcessJoinFunction[(String,String,Long),(String,String,Double,Long),(String,String,String,Double)] {
override def processElement(left: (String, String, Long),
right: (String, String, Double, Long),
ctx: ProcessJoinFunction[(String, String, Long), (String, String, Double, Long), (String, String, String, Double)]#Context,
out: Collector[(String, String, String, Double)]): Unit = {
val timestamp = ctx.getTimestamp
val userts = ctx.getLeftTimestamp
val orderts = ctx.getRightTimestamp
println(s"${timestamp},${userts},${orderts},${left.toString()},${right.toString()}")
println()
}
}).print()
fsEnv.execute("IntervalJoin")