Flink Stateful Computation Examples and State Recovery (Checkpoint)

Stateful computation, put simply, means that the current computation depends on the results of previous computations, for example a running sum per key:

key,value              running result

1001,3000 -----------> (1001,3000)

1002,500  -----------> (1002,500)

1001,400  -----------> (1001,3400)

When the third record, 1001,400, is processed, it has to be added to the previously computed 1001,3000 to produce the correct result (1001,3400). Older stream-processing frameworks such as Storm and Spark Streaming do not offer this capability out of the box; to achieve it you would have to store the intermediate results in HBase or another external store and read them back for the next computation, which adds complexity and error risk to the application. Flink solves this problem.
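
As a minimal illustration (not part of the article's examples, and assuming an existing DataStream[(String, Int)] named input), the running sum above can be expressed directly; Flink keeps the per-key state for you:

input
  .keyBy(_._1)   //group by the key field, e.g. 1001 / 1002
  .sum(1)        //running sum of the value field per key
  .print()       //emits (1001,3000), (1002,500), (1001,3400), ...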

Broadly speaking, Flink state falls into two categories:

1.Keyed State

2.Operator State

Each category comes in a Managed and a Raw flavor. Managed state is the approach Flink recommends: Flink ships with a number of ready-made data structures for storing state that can be used directly. Raw state lets users define their own state data structure, but this is not recommended. There is also Broadcast State, which is less commonly used; it is typically applied when one stream is small (for example rule data) and needs to be broadcast to another stream for the computation. It is not covered in detail here, but a rough sketch is shown below.
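
For completeness, a minimal sketch of the Broadcast State pattern follows. The names ruleStream and recordStream are hypothetical (a small DataStream[String] of rules and a DataStream[(String, Int)] of records), the import org.apache.flink.api.scala._ used elsewhere in this article is assumed, and the example simply keeps records whose key appears in the broadcast rules:

import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.api.common.typeinfo.BasicTypeInfo
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.util.Collector

//descriptor for the broadcast (rule) state; the same descriptor is used to broadcast and to read the state
val ruleDescriptor = new MapStateDescriptor[String, String](
  "rules", BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO)

val broadcastRules = ruleStream.broadcast(ruleDescriptor)

recordStream
  .connect(broadcastRules)
  .process(new BroadcastProcessFunction[(String, Int), String, (String, Int)] {

    override def processElement(value: (String, Int),
                                ctx: BroadcastProcessFunction[(String, Int), String, (String, Int)]#ReadOnlyContext,
                                out: Collector[(String, Int)]): Unit = {
      //only keep records whose key is present in the broadcast rules
      if (ctx.getBroadcastState(ruleDescriptor).contains(value._1)) {
        out.collect(value)
      }
    }

    override def processBroadcastElement(rule: String,
                                         ctx: BroadcastProcessFunction[(String, Int), String, (String, Int)]#Context,
                                         out: Collector[(String, Int)]): Unit = {
      //store every incoming rule in the broadcast state, visible to all parallel subtasks
      ctx.getBroadcastState(ruleDescriptor).put(rule, rule)
    }
  })
  .print()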

Reference: https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/stream/state/state.html

The examples below all use the Managed flavor.

I. Keyed State

Using managed keyed state requires the stream to be keyed (grouped by key). A complete example:

package com.xxx.flink.demo


import java.util.Properties

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.common.state.{StateTtlConfig, ValueState, ValueStateDescriptor}
import org.apache.flink.api.common.time.Time
import org.apache.flink.configuration.Configuration
import org.apache.flink.core.fs.Path
import org.apache.flink.runtime.state.StateBackend
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup
import org.apache.flink.streaming.api.{CheckpointingMode, TimeCharacteristic}
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.util.Collector
import org.apache.flink.api.scala._
import org.slf4j.LoggerFactory


object ManagedKeyedStateStreaming {

  private val LOG = LoggerFactory.getLogger(ManagedKeyedStateStreaming.getClass)
  private val KAFKA_CONSUMER_TOPIC="test"
  private val KAFKA_BROKERS="hadoop01:9092,hadoop02:9092,hadoop03:9092"
  private val KAFKA_ZOOKEEPER_CONNECTION="hadoop01:2181,hadoop02:2181,hadoop03:2181"
  private val KAFKA_GROUP_ID="flink-demo-group"
  private val KAFKA_PROP: Properties = new Properties() {
    setProperty("bootstrap.servers", KAFKA_BROKERS)
    setProperty("zookeeper.connect", KAFKA_ZOOKEEPER_CONNECTION)
    setProperty("group.id", KAFKA_GROUP_ID)
    setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  }

  def main(args: Array[String]): Unit = {
    LOG.info("===Stateful Computation Demo===")
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(5000)//take a checkpoint every 5 seconds
    env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)//use processing time as the time characteristic
    env.setRestartStrategy(RestartStrategies.noRestart())//restart strategy: do not restart on failure
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)//exactly-once semantics

    val checkPointPath = new Path("hdfs:///flink/checkpoints")//FsStateBackend path; with file:/// the checkpoints are written to the TaskManager's local file system
    val fsStateBackend: StateBackend= new FsStateBackend(checkPointPath)
    env.setStateBackend(fsStateBackend)
    env.getCheckpointConfig.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)//keep the checkpoint when the job is cancelled

    val dataStream = env.addSource(new FlinkKafkaConsumer010[String](KAFKA_CONSUMER_TOPIC,new SimpleStringSchema(),KAFKA_PROP))
    dataStream.filter(_.split("\\|").length==3)
      .map(line=>{
      val arr = line.split("\\|")
      (arr(0),arr(2).toInt)
    }).keyBy(_._1)
      .flatMap(new SalesAmountCalculation())
      .print()

    //Flink also provides a convenient shorthand for this (mapWithState)
//    dataStream.filter(_.split("\\|").length==3)
//      .map(line=>{
//        val arr = line.split("\\|")
//        (arr(0),arr(2).toInt)
//      }).keyBy(_._1)
//       .mapWithState((in: (String, Int), state: Option[(String, Int)]) => {
//         state match {
//           case Some(c) => ((in._1, in._2 + c._2), Some((in._1, in._2 + c._2)))//emit and store the accumulated sum
//           case None    => ((in._1, in._2), Some((in._1, in._2)))//first value seen for this key
//         }
//       })
//        .print()

    env.execute("Stateful Computation Demo")

  }
}

//computes the running sales total per key
class SalesAmountCalculation extends RichFlatMapFunction[(String, Int), (String, Int)] {
  private var sum: ValueState[(String, Int)] = _

  override def flatMap(input: (String, Int), out: Collector[(String, Int)]): Unit = {
    //Expired state is removed when it is explicitly accessed; cleanup of expired entries when a full snapshot is taken can also be configured, e.g.:
//    val ttlConfig = StateTtlConfig
//      .newBuilder(Time.seconds(1))
//      .cleanupFullSnapshot
//      .build
    val tmpCurrentSum = sum.value
    val currentSum = if (tmpCurrentSum != null) {
      tmpCurrentSum
    } else {
      (input._1, 0)
    }
    val newSum = (currentSum._1, currentSum._2 + input._2)
    sum.update(newSum)
    out.collect(newSum)
  }

  override def open(parameters: Configuration): Unit = {
    //configure a time-to-live (TTL) for the state value
//    val ttlConfig = StateTtlConfig
//      .newBuilder(Time.seconds(1))//TTL of 1 second
//      .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)//refresh the TTL on creation and on every write
//      .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)//never return an expired value on access
//      .build
    val valueStateDescriptor = new ValueStateDescriptor[(String, Int)]("sum", createTypeInformation[(String, Int)])
//    valueStateDescriptor.enableTimeToLive(ttlConfig)//enable the TTL configuration on the descriptor
    sum = getRuntimeContext.getState(valueStateDescriptor)
  }
}

Code walkthrough: the example above computes a running sum per key.

val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(5000)//take a checkpoint every 5 seconds
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)//use processing time as the time characteristic
env.setRestartStrategy(RestartStrategies.noRestart())//restart strategy: do not restart on failure
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)//exactly-once semantics

val checkPointPath = new Path("hdfs:///flink/checkpoints")//FsStateBackend path; with file:/// the checkpoints are written to the TaskManager's local file system
val fsStateBackend: StateBackend= new FsStateBackend(checkPointPath)
env.setStateBackend(fsStateBackend)
env.getCheckpointConfig.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)//keep the checkpoint when the job is cancelled

This part configures the execution environment: the checkpoint interval, the time characteristic of the stream, the restart strategy, the FsStateBackend as state backend, and so on. Checkpointing is disabled by default; for fault tolerance it should be enabled and configured. Note that if checkpointing is enabled but neither an FsStateBackend nor a RocksDBStateBackend is set, the working state is kept in the TaskManager's memory and the checkpoints are kept in the JobManager's memory. In fact only the RocksDBStateBackend keeps the in-flight state outside of memory; the other two backends keep the working state in TaskManager memory.
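
A minimal sketch of switching to the RocksDBStateBackend (assuming the flink-statebackend-rocksdb dependency is added to the project; env and the HDFS checkpoint path are the same as in the example above):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend

//RocksDB keeps the working state on local disk and checkpoints it to the given path;
//the second argument enables incremental checkpoints
val rocksDbBackend: StateBackend = new RocksDBStateBackend("hdfs:///flink/checkpoints", true)
env.setStateBackend(rocksDbBackend)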

Next, the data is read from Kafka and processed:

val dataStream = env.addSource(new FlinkKafkaConsumer010[String](KAFKA_CONSUMER_TOPIC,new SimpleStringSchema(),KAFKA_PROP)) //read data from the given Kafka topic
dataStream.filter(_.split("\\|").length==3)
  .map(line=>{
  val arr = line.split("\\|")//split on |
  (arr(0),arr(2).toInt)//take the first and the third field as a tuple
}).keyBy(_._1)//key the stream by the first element
  .flatMap(new SalesAmountCalculation())
  .print()

The core piece is the SalesAmountCalculation class, which holds the stateful computation logic; the essential work is overriding the open and flatMap methods. open obtains the state handle used in the subsequent computation (here a ValueState; the other managed keyed state types are obtained the same way, see the sketch below), and flatMap implements the actual computation logic. With that the program is complete and can be debugged locally in IDEA. The remainder of this section focuses on state recovery.
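
A brief sketch of the other managed keyed state types (the descriptors and names such as amounts and amountByProduct are illustrative only; the calls are assumed to sit inside the open() method of a rich function, with org.apache.flink.api.scala._ imported as in the example above):

import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.api.common.state.{ListStateDescriptor, MapStateDescriptor, ReducingStateDescriptor}

//ListState: a list of elements per key
val amounts = getRuntimeContext.getListState(
  new ListStateDescriptor[Int]("amounts", createTypeInformation[Int]))

//MapState: a map per key
val amountByProduct = getRuntimeContext.getMapState(
  new MapStateDescriptor[String, Int]("amountByProduct",
    createTypeInformation[String], createTypeInformation[Int]))

//ReducingState: a single value per key, merged with a ReduceFunction on every add()
val total = getRuntimeContext.getReducingState(
  new ReducingStateDescriptor[Int]("total",
    new ReduceFunction[Int] { override def reduce(a: Int, b: Int): Int = a + b },
    createTypeInformation[Int]))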

The program configures a checkpoint path so that state data is saved and can be recovered after a failure. The full test procedure is as follows (Flink runs on YARN):

Build the job: mvn clean package

Upload the jar to /home/ on the hadoop01 node, then start the job:
/opt/flink-1.6.1/bin/flink run -d -p 1 -m yarn-cluster -yn 1 -yjm 1024 -ytm 1024 -ynm stateful-computation -c com.xxx.flink.demo.ManagedKeyedStateStreaming /home/flink-demo-1.0.jar

Run the console producer on any Kafka node (adjust the command path to your installation):
/opt/cloudera/parcels/KAFKA-3.1.1-1.3.1.1.p0.2/lib/kafka/bin/kafka-console-producer.sh --broker-list hadoop01:9092,hadoop02:9092,hadoop03:9092 --topic test
Produce the following messages to the test topic:
1001|张辉|2000
1002|李峰|5500
1003|骁海|8000
1001|张辉|3600
1001|张辉|7600
1002|李峰|8800
On the TaskManager that runs the job, check taskmanager.out under /yarn/container-logs/application_id/container_xxxx/ and verify that the output matches the expectation:
(1001,13200)
(1002,14300)
(1003,8000)
Stop the job:
yarn application -kill application_id

In the checkpoint directory configured in the program, locate the latest checkpoint's _metadata file.
Restore from the checkpoint to recover the previously computed state:
/opt/flink-1.6.1/bin/flink run -d -p 1 -m yarn-cluster -yn 1 -yjm 1024 -ytm 1024 -ynm stateful-computation -s hdfs:///flink/checkpoints/2f5b04350f1c05661a05a93d00ec4bd8/chk-38/_metadata -c com.xxx.flink.demo.ManagedKeyedStateStreaming /home/flink-demo-1.0.jar

Verify that the logic is still correct:
Produce to the test topic:
1001|张辉|2000
and check that the output is (1001,15200).

This completes the verification: the result matches the expectation and the state data was recovered successfully.

II. Operator State

This type of state is not tied to a key and can be used with any operator. It is implemented either by implementing the CheckpointedFunction interface or the ListCheckpointed interface. The following example uses CheckpointedFunction:

package com.xxx.flink.demo

import java.util.Properties

import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.common.state._
import org.apache.flink.api.common.typeinfo.{TypeHint, TypeInformation}
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.Path
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.runtime.state.{FunctionInitializationContext, FunctionSnapshotContext, StateBackend}
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.{CheckpointingMode, TimeCharacteristic}
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.slf4j.LoggerFactory

import scala.collection.mutable.ListBuffer


object ManagedOperatorStateStreaming {

  private val LOG = LoggerFactory.getLogger(ManagedOperatorStateStreaming.getClass)
  private val KAFKA_CONSUMER_TOPIC = "test"
  private val KAFKA_BROKERS = "hadoop01:9092,hadoop02:9092,hadoop03:9092"
  private val KAFKA_ZOOKEEPER_CONNECTION = "hadoop01:2181,hadoop02:2181,hadoop03:2181"
  private val KAFKA_GROUP_ID = "flink-demo-group"
  private val KAFKA_PROP: Properties = new Properties() {
    setProperty("bootstrap.servers", KAFKA_BROKERS)
    setProperty("zookeeper.connect", KAFKA_ZOOKEEPER_CONNECTION)
    setProperty("group.id", KAFKA_GROUP_ID)
    setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  }

  def main(args: Array[String]): Unit = {
    LOG.info("===Stateful Computation Demo===")
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(5000) //take a checkpoint every 5 seconds
    env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime) //use processing time as the time characteristic
    env.setRestartStrategy(RestartStrategies.noRestart()) //restart strategy: do not restart on failure
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE) //exactly-once semantics

//    val checkPointPath = new Path("file:///flink/checkpoints")//for local test
    val checkPointPath = new Path("hdfs:///flink/checkpoints")
    //FsStateBackend path; with file:/// the checkpoints are written to the TaskManager's local file system
    val fsStateBackend: StateBackend = new FsStateBackend(checkPointPath)
    env.setStateBackend(fsStateBackend)
    env.getCheckpointConfig.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION) //keep the checkpoint when the job is cancelled

    val dataStream = env.addSource(new FlinkKafkaConsumer010[String](KAFKA_CONSUMER_TOPIC, new SimpleStringSchema(), KAFKA_PROP))
    dataStream.filter(_.split("\\|").length == 3)
      .map(line => {
        val arr = line.split("\\|")
        (arr(0), arr(2).toInt)
      }).addSink(new BufferingSink(3))

    env.execute("Stateful Computation Demo")

  }
}

class BufferingSink(threshold: Int = 0) extends SinkFunction[(String, Int)] with CheckpointedFunction {

  private val LOG = LoggerFactory.getLogger("BufferingSink")
  @transient
  private var checkPointedState: ListState[(String, Int)] = _
  private val bufferedElements = ListBuffer[(String, Int)]()

  override def invoke(value: (String, Int)):Unit = {
    bufferedElements += value
    if (bufferedElements.size == threshold) {
      for (element <- bufferedElements) {
        LOG.error(s"==================BufferingSink invoke,elements approach the threshold,Print==================$element")
      }
      bufferedElements.clear()
    }
  }

  //called every time a snapshot (checkpoint) is taken
  override def snapshotState(context: FunctionSnapshotContext): Unit = {
    checkPointedState.clear()//clear the previously stored state before writing the current buffer
    for (element <- bufferedElements) {
      checkPointedState.add(element)
    }
  }

  //called on first initialization, or when restoring from a previous snapshot
  override def initializeState(context: FunctionInitializationContext): Unit = {
    val descriptor = new ListStateDescriptor[(String, Int)]("buffered-elements", TypeInformation.of(new TypeHint[(String, Int)]() {}))
    checkPointedState = context.getOperatorStateStore.getListState(descriptor)
    LOG.info(s"********************context state********************"+context.isRestored)
    //if context.isRestored returns true, the state was restored from a checkpoint
    if (context.isRestored) {
      LOG.info("********************has element state?********************"+checkPointedState.get().iterator().hasNext)
      val it = checkPointedState.get().iterator()
      while(it.hasNext){
        val element = it.next()
        LOG.error(s"********************checkpoint restore,the prefix state value********************($element)")
        bufferedElements+=element
      }
//      for (element <- checkPointedState.get()) {
//        bufferedElements += element
//      }
    }
  }
}

The essential work is overriding the following two methods:

void snapshotState(FunctionSnapshotContext context) throws Exception;

void initializeState(FunctionInitializationContext context) throws Exception;

snapshotState is called every time a snapshot is taken, i.e. on every checkpoint; initializeState is called on initialization or when a snapshot is restored. A state recovery test was run for this example as well:

Build the job: mvn clean package

Upload the jar to /home/ on the hadoop01 node, then start the job:
/opt/flink-1.6.1/bin/flink run -d -p 1 -m yarn-cluster -yn 1 -yjm 1024 -ytm 1024 -ynm stateful-computation -c com.xxx.flink.demo.ManagedOperatorStateStreaming /home/flink-demo-1.0.jar

Run the console producer on any Kafka node:
/opt/cloudera/parcels/KAFKA-3.1.1-1.3.1.1.p0.2/lib/kafka/bin/kafka-console-producer.sh --broker-list hadoop01:9092,hadoop02:9092,hadoop03:9092 --topic test
Produce the following messages to the test topic:
1001|张辉|2000
1002|李峰|5500
On the TaskManager that runs the job, check the taskmanager.log file under /yarn/container-logs/application_id/container_xxxx/.
Stop the job:
yarn application -kill application_id

In the checkpoint directory configured in the program, locate the latest checkpoint's _metadata file.
Restore from the checkpoint to recover the previously computed state:
/opt/flink-1.6.1/bin/flink run -s hdfs:///flink/checkpoints/d32f0b85e75e4b5c12500b1041938955/chk-533 -d -p 1 -m yarn-cluster -yn 1 -yjm 1024 -ytm 1024 -ynm stateful-computation -c com.xxx.flink.demo.ManagedOperatorStateStreaming /home/flink-demo-1.0.jar
Note: do not replace or re-upload the jar in between.

Verification:
Check the taskmanager.log on the task node for log lines like the following:
2019-01-07 14:39:13,003 INFO  BufferingSink                                                 - ********************context state********************true
2019-01-07 14:39:13,003 INFO  BufferingSink                                                 - ********************has element state?********************true
2019-01-07 14:39:13,004 ERROR BufferingSink                                                 - ********************checkpoint restore,the prefix state value********************((1001,2000))
2019-01-07 14:39:13,004 ERROR BufferingSink                                                 - ********************checkpoint restore,the prefix state value********************((1002,5500))
If they are present, the restore succeeded; otherwise it failed.

The test shows that the state data was restored successfully. The alternative approach, implementing ListCheckpointed, is not described in detail in this article; a minimal sketch is given below.
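
Although ListCheckpointed is not covered in detail here, a rough sketch of what it might look like is shown below. The CountingSink class is a hypothetical example that simply counts the records it has seen and keeps that counter across failures:

import java.lang.{Long => JLong}
import java.util.Collections

import org.apache.flink.streaming.api.checkpoint.ListCheckpointed
import org.apache.flink.streaming.api.functions.sink.SinkFunction

//counts all records seen so far; the counter survives failures via ListCheckpointed
class CountingSink extends SinkFunction[(String, Int)] with ListCheckpointed[JLong] {

  private var count: Long = 0L

  override def invoke(value: (String, Int)): Unit = {
    count += 1
  }

  //return the state to store in the checkpoint; on rescaling Flink redistributes the list entries
  override def snapshotState(checkpointId: Long, timestamp: Long): java.util.List[JLong] =
    Collections.singletonList(JLong.valueOf(count))

  //called on recovery with the (possibly merged) list of previously stored entries
  override def restoreState(state: java.util.List[JLong]): Unit = {
    count = 0L
    val it = state.iterator()
    while (it.hasNext) {
      count += it.next().longValue()
    }
  }
}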
