In real-world Flink streaming jobs, broadcast state is sometimes only needed for a limited time, say for the current day, and is never read again afterwards. If such state is never cleaned up, it keeps growing, eats memory over time, and can eventually cause an out-of-memory failure, so expired broadcast state has to be cleaned up. However, Flink's state TTL mechanism currently only works for keyed state; it has no effect on broadcast state, so the cleanup must be implemented by hand.
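For comparison, this is how TTL is configured for keyed state; a minimal sketch (the one-day TTL and the descriptor name "salaryState" are just illustrative):
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

// entries expire one day after creation or last write, and expired values are never returned
StateTtlConfig ttlConfig = StateTtlConfig
.newBuilder(Time.days(1))
.setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
.build();
ValueStateDescriptor<String> salaryStateDesc = new ValueStateDescriptor<>("salaryState", String.class);
salaryStateDesc.enableTimeToLive(ttlConfig);// this only takes effect for keyed state, which is why the broadcast state below needs manual cleanup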
Borrowing the idea behind Flink's keyed state TTL, we can clean up expired broadcast state manually. The main logic is:
(1) Reserve one dedicated entry in the broadcast state that stores a timestamp in milliseconds, e.g. current time + 24 hours; comparing it against the current time tells us whether a cleanup pass is due, so state is cleaned about once a day;
(2) Stamp every entry with its own expiry timestamp when it is written, in the same way as above; during a cleanup pass, remove each entry whose timestamp has passed.
The implementation looks like this:
package cn.china.test.main;
import com.alibaba.fastjson.JSON;
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;
import cn.china.test.data.EmployeeInfo;
import cn.china.test.data.RankSalaryInfo;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.state.*;
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.runtime.state.StateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.BroadcastConnectedStream;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.util.Collector;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.Map;
import java.util.Properties;
public class Test {
public static void main(String[] args) throws Exception {
Config config = ConfigFactory.load(Test.class.getClassLoader());
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Disable operator chaining globally so operators run in separate tasks (by default Flink chains eligible operators into one task, and the web UI then shows the chain as a single operator box)
env.disableOperatorChaining();
String brokers = config.getString("consumer.kafka.brokers");
String employeeSalary = config.getString("kafka.topic.employee.salary");
String employeeInfo = config.getString("kafka.topic.employee.info");
String pushTopic = config.getString("kafka.topic.employee.push");
String groupId = config.getString("kafka.groupId");
String checkPointPath = config.getString("check.point.path.prefix");
StateBackend backend = new EmbeddedRocksDBStateBackend(true);
env.setStateBackend(backend);
CheckpointConfig conf = env.getCheckpointConfig();
conf.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);// externalize checkpoints and retain them when the job is cancelled
conf.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);// exactly-once (at-least-once is the alternative)
conf.setCheckpointInterval(30 * 1000);// checkpoint trigger interval, in milliseconds
conf.setCheckpointTimeout(30 * 60 * 1000);// checkpoint timeout in milliseconds; on timeout the JobManager cancels the checkpoint and triggers a new one
conf.setCheckpointStorage(checkPointPath);// where checkpoint data is stored
conf.setMinPauseBetweenCheckpoints(10 * 1000);// minimum pause required between two checkpoints
conf.setMaxConcurrentCheckpoints(30);// maximum number of checkpoints that may be in flight at the same time
Properties props = new Properties();
props.setProperty("bootstrap.servers", brokers);
props.setProperty("group.id", groupId);
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("auto.offset.reset", "earliest");
props.put("max.poll.records", 1000);
props.put("session.timeout.ms", 90000);
props.put("request.timeout.ms", 120000);
props.put("enable.auto.commit", true);
props.put("auto.commit.interval.ms", 100);
// Consume rank-salary records and broadcast them
FlinkKafkaConsumer<String> rankSalaryConsumer = new FlinkKafkaConsumer<>(employeeSalary, new SimpleStringSchema(), props);
rankSalaryConsumer.setCommitOffsetsOnCheckpoints(true);
DataStream<String> rankSalaryKafkaData = env.addSource(rankSalaryConsumer).name("RankSalarySource");
MapStateDescriptor<String, RankSalaryInfo> rankSalaryBroadcastDesc = new MapStateDescriptor<>("RankSalaryBroadcast", String.class, RankSalaryInfo.class);
BroadcastStream<String> rankSalaryBroadcast = rankSalaryKafkaData.broadcast(rankSalaryBroadcastDesc);
// Consume employee records
FlinkKafkaConsumer<String> employeeInfoConsumer = new FlinkKafkaConsumer<>(employeeInfo, new SimpleStringSchema(), props);
employeeInfoConsumer.setCommitOffsetsOnCheckpoints(true);
DataStream<String> employeeInfoKafkaData = env.addSource(employeeInfoConsumer).name("EmployeeInfoSource");
// Join employee records against the broadcast rank-salary state to fill in salary
BroadcastConnectedStream<String, String> employeeInfoConnectRankSalary = employeeInfoKafkaData.connect(rankSalaryBroadcast);
DataStream<String> employeeInfoDataStream = employeeInfoConnectRankSalary.process(new BroadcastProcessFunction<String, String, String>() {
MapStateDescriptor<String, RankSalaryInfo> rankSalaryBroadcastDesc = new MapStateDescriptor<>("RankSalaryBroadcast", String.class, RankSalaryInfo.class);
@Override
public void processElement(String value, ReadOnlyContext ctx, Collector<String> out) throws Exception {
try {
ReadOnlyBroadcastState<String, RankSalaryInfo> broadcastState = ctx.getBroadcastState(rankSalaryBroadcastDesc);
EmployeeInfo employeeInfo = JSON.parseObject(value, EmployeeInfo.class);
String rank = employeeInfo.rank;
if (broadcastState.contains(rank)) {
RankSalaryInfo rankSalaryInfo = broadcastState.get(rank);
String salary = rankSalaryInfo.salary;
employeeInfo.setSalary(salary);
}
String employeeInfoString = JSON.toJSONString(employeeInfo);
out.collect(employeeInfoString);
} catch (Exception e) {
e.printStackTrace();
}
}
@Override
public void processBroadcastElement(String value, Context ctx, Collector<String> out) throws Exception {
try {
BroadcastState<String, RankSalaryInfo> broadcastState = ctx.getBroadcastState(rankSalaryBroadcastDesc);
// Keep one dedicated "timer" entry that stores the next cleanup time, i.e. current time + 24 hours in milliseconds; expired entries are cleaned once that time is reached, so cleanup runs about once a day
if (!broadcastState.contains("timer_state")) {
RankSalaryInfo timerState = new RankSalaryInfo("timer_state", "timer_state", System.currentTimeMillis() + 1000 * 60 * 60 * 24);
broadcastState.put("timer_state", timerState);
}
// If the current time has passed the timer, the last cleanup was more than 24 hours ago: remove expired entries and reset the timer
RankSalaryInfo timerState = broadcastState.get("timer_state");
Long timer = timerState.ttl;
if (System.currentTimeMillis() >= timer) {
// Clean up expired entries
Iterator<Map.Entry<String, RankSalaryInfo>> iterator = broadcastState.iterator();
// Collect the keys of expired entries into a list first, then delete them after the iteration
ArrayList<String> waitToDeleteKey = new ArrayList<>();
while (iterator.hasNext()) {
Map.Entry<String, RankSalaryInfo> next = iterator.next();
String key = next.getKey();
RankSalaryInfo rankSalaryInfo = next.getValue();
Long stateTtl = rankSalaryInfo.ttl;
if (!"timer_state".equals(key) && System.currentTimeMillis() >= stateTtl) {
waitToDeleteKey.add(key);
}
}
// Remove the expired entries by key
for (int i = 0; i < waitToDeleteKey.size(); i++) {
broadcastState.remove(waitToDeleteKey.get(i));
}
// Reset the timer to 24 hours from now
timerState.setTtl(System.currentTimeMillis() + 1000 * 60 * 60 * 24);
broadcastState.put("timer_state", timerState);
}
// Store the incoming record into the broadcast state
RankSalaryInfo rankSalaryInfo = JSON.parseObject(value, RankSalaryInfo.class);
// Stamp the entry with a ttl of current time + 24 hours, in milliseconds
rankSalaryInfo.setTtl(System.currentTimeMillis() + 1000 * 60 * 60 * 24);
String rank = rankSalaryInfo.rank;
broadcastState.put(rank, rankSalaryInfo);
} catch (Exception e) {
e.printStackTrace();
}
}
});
employeeInfoDataStream.addSink(new FlinkKafkaProducer<>(
brokers,
pushTopic,
new SimpleStringSchema())).name("PushInfo");
env.execute("BroadcastStateTtlTest");
}
}
The two entity classes involved are listed below:
1)EmployeeInfo
public class EmployeeInfo {
public String name;
public String age;
public String rank;
public String salary;
public EmployeeInfo(String name, String age, String rank, String salary) {
this.name = name;
this.age = age;
this.rank = rank;
this.salary = salary;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public String getAge() {
return age;
}
public void setAge(String age) {
this.age = age;
}
public String getRank() {
return rank;
}
public void setRank(String rank) {
this.rank = rank;
}
public String getSalary() {
return salary;
}
public void setSalary(String salary) {
this.salary = salary;
}
}
2)RankSalaryInfo
public class RankSalaryInfo {
public String rank;
public String salary;
public Long ttl;
public RankSalaryInfo() {
}
public RankSalaryInfo(String rank, String salary, Long ttl) {
this.rank = rank;
this.salary = salary;
this.ttl = ttl;
}
public String getRank() {
return rank;
}
public void setRank(String rank) {
this.rank = rank;
}
public String getSalary() {
return salary;
}
public void setSalary(String salary) {
this.salary = salary;
}
public Long getTtl() {
return ttl;
}
public void setTtl(Long ttl) {
this.ttl = ttl;
}
}
The logic of the code above:
1) Rank-salary records (RankSalaryInfo, containing rank and salary; ttl is filled in when the entry is written to state) are read from Kafka and put into the broadcast state;
2) Employee records (EmployeeInfo, containing name, age and rank) are read from Kafka, and salary is looked up in the broadcast state by rank; if the rank is not found, salary stays null (substitute a default that suits your case);
3) The enriched employee record is serialized to a JSON string and sent to Kafka; a hypothetical sample of the messages is shown below.
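To make the flow concrete, here is a hypothetical pair of input messages and the corresponding output (the values are invented for illustration, and the field order of the fastjson output may differ):
// rank-salary topic (broadcast side):
{"rank":"P6","salary":"20000"}
// employee topic (non-broadcast side):
{"name":"zhangsan","age":"28","rank":"P6"}
// push topic (output of the broadcast join):
{"age":"28","name":"zhangsan","rank":"P6","salary":"20000"}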
The code above uses a POJO class as the broadcast state type; such a class must satisfy Flink's POJO requirements. For details, see:
Flink POJO类状态使用注意事项_Johnson8702的博客-CSDN博客
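Broadly, Flink's POJO rules require that the class be public and not a non-static inner class, that it have a public no-argument constructor, and that every field be either public or accessible through public getters and setters; RankSalaryInfo above satisfies these rules, which is also why it declares an explicit no-argument constructor.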