Flink Basics (14): StateBackend Stress Test for a Large Job

I recently had a requirement to interval-join messages from two topics.
One topic carries the detailed information of the displayed list items (impressions), sent mostly by the server side: roughly 100 GB every 20 minutes.
The other topic carries the user's operations on the list (click, order), sent by the front end, and is tiny: a few tens of MB every 10 minutes. In each message, action is the operation,
itemId is the unique identifier of the plan, time is the operation time, and unionId is the unique identifier of the user.

{"action":"click","itemId":"ce1e6cc4f6a09b058df716ff9b21f82b@TT@2","time":"1560272397","unionId":"ohmdTtxmXB0q-2_CuTNQ1v3YVKuo"}

The Kafka streams for the two topics:

// detailed information: the large message stream
DataStream<PlanClickBO> detailStream
// user click/action stream
DataStream<PlanActionBO> clickStream

The goal is to reconstruct the detailed information of the plan the user clicked at that moment; follow-up operations are out of scope for now.
Assume the server sends a user about 50 displayed plans and the user clicks within roughly 0–8 minutes (clicks after 8 minutes are ignored); the clicked plan's details are recovered by joining on itemId.

        DataStream<String> joinStream = detailStream.keyBy(PlanClickBO::getItemId)
                .intervalJoin(clickStream.keyBy(PlanActionBO::getItemId))
                .between(seconds(-10), seconds(60 * 8))
                .process(new ProcessJoinFunction<PlanClickBO, PlanActionBO, String>() {
                    @Override
                    public void processElement(PlanClickBO left, PlanActionBO right, Context ctx, Collector<String> out) throws Exception {
                        out.collect(left + ":" + right);
                    }
                });
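Note that the interval join runs on event time, so in a real job both streams need event timestamps and watermarks; the test source at the end of this post does not include that step. A minimal sketch for the click stream, assuming getTime() is the unix-seconds string from the sample message and tolerating up to 10 s of out-of-order data (the detail stream would need the same treatment with its own time field):

import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

// Sketch only: assign event-time timestamps and watermarks before keyBy/intervalJoin.
DataStream<PlanActionBO> clickWithTs = clickStream.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<PlanActionBO>(Time.seconds(10)) {
            @Override
            public long extractTimestamp(PlanActionBO element) {
                // "time" in the message is unix seconds; Flink timestamps are milliseconds
                return Long.parseLong(element.getTime()) * 1000L;
            }
        });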

Because the interval join buffers data as state, a state backend has to be configured.

MemoryStateBackend

The default is MemoryStateBackend (in memory). With it, tasks occasionally get lost and trigger restarts, and sometimes the JobManager crashes and the job platform restarts the job automatically; either way the state is lost.
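For reference, the in-memory backend can also be configured explicitly. A minimal sketch; the 5 MB cap and the async flag are just the knobs the constructor exposes, and checkpoints still live on the JobManager heap, which is exactly why they do not survive a JobManager failure:

import org.apache.flink.runtime.state.StateBackend;
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// 5 MB cap per serialized state object, asynchronous snapshots.
MemoryStateBackend memoryBackend = new MemoryStateBackend(5 * 1024 * 1024, true);
env.setStateBackend((StateBackend) memoryBackend);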
To keep checkpoints on physical storage there are two options: FsStateBackend and RocksDBStateBackend.

FsStateBackend

FsStateBackend takes full (non-incremental) checkpoints; snapshots can be synchronous or asynchronous, and the constructor flag below explicitly enables asynchronous snapshots.
Setting it up is simple:

        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        FsStateBackend backend = new FsStateBackend(path, true); // true = asynchronous snapshots
        env.setStateBackend((StateBackend) backend);
        env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION);
        env.enableCheckpointing(1000 * 60); // checkpoint every 60 s

But then the problem: the detailStream is very large, and the join buffers 8 minutes of it. Each checkpoint took longer and longer: at first roughly 10–20 s for about 10 GB of files, but later checkpoints routinely exceeded 10 minutes (my cluster configures a 10-minute checkpoint timeout), at which point checkpointing effectively stopped working.
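The 10-minute limit in my case is a cluster setting, but it can also be raised per job. A one-line sketch; raising the timeout only delays the symptom, it does not shrink the state:

// Sketch: allow each checkpoint up to 20 minutes before it is declared failed.
env.getCheckpointConfig().setCheckpointTimeout(20 * 60 * 1000L);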

RocksDBStateBackend

RocksDBStateBackend checkpoints asynchronously and supports incremental checkpoints, with compaction automatically merging the incremental files.
Add the Maven dependency:

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-statebackend-rocksdb_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>

The calling code:

        String path = args.length > 0 ? args[0] : CHECK_POINT_PATH;
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        RocksDBStateBackend backend = new RocksDBStateBackend(path, true); // true = incremental checkpoints
        backend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM);
        env.setStateBackend((StateBackend) backend);
        env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION);
        env.getCheckpointConfig().setFailOnCheckpointingErrors(false); // a failed checkpoint no longer fails the job
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(1000 * 30); // at least 30 s between checkpoints
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE);
        env.enableCheckpointing(1000 * 60); // checkpoint every 60 s

CheckpointingMode defaults to EXACTLY_ONCE; I switched to AT_LEAST_ONCE to reduce the performance overhead.
A few important settings first. env.getCheckpointConfig().setFailOnCheckpointingErrors(false) tolerates checkpoint failures; otherwise a failed checkpoint brings the whole job down. Be sure to set setMinPauseBetweenCheckpoints (the minimum pause between two checkpoints): without it checkpoints run back to back and compaction never gets time to merge files.
[Figure: HDFS listing of the checkpoint directory, showing incremental checkpoint directories chk-545 through chk-549 and the shared directory]
As the figure shows, five incremental checkpoint directories are kept by default, numbered from 1 (545–549 in the figure); incremental checkpointing automatically deletes the older ones. The shared directory holds the compaction files, which are merged automatically, and the small files inside it have a lifecycle of their own.
RocksDBStateBackend is clearly much faster than FsStateBackend, roughly 2–3 s per checkpoint. But eventually its checkpoints also time out:
the shared files keep growing, and after a restart the job goes straight to OOM because it cannot load the (by now huge) files from HDFS. In the end it just runs longer than FsStateBackend did (FsStateBackend dies from ever-longer checkpoints; RocksDBStateBackend's biggest problem is OOM after restart).


Conclusion: the problem is not solved, but after several rounds of testing RocksDBStateBackend still comes out ahead; it just does not cope well with caching such large objects as state. It also matters that I only requested 7–8 containers: the job dies on its own after about two hours. With 30 containers it runs for a day without trouble, but that is far too much hardware for this requirement.

The test source code is attached below.

package com.tc.flink.demo.stream;

import com.alibaba.fastjson.JSON;
import com.tc.flink.analysis.label.input.PlanActionBO;
import com.tc.flink.demo.bean.PlanClickBO;
import com.tc.flink.conf.KafkaConfig;
import com.tc.flink.conf.KafkaTopicName;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.contrib.streaming.state.OptionsFactory;
import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.runtime.state.StateBackend;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
import org.apache.flink.util.Collector;
import org.rocksdb.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.List;
import java.util.Properties;

import static org.apache.flink.streaming.api.windowing.time.Time.*;

public class UserClickedDetailBak {

    protected static Logger logger = LoggerFactory.getLogger(UserClickedDetailBak.class);

    public static String CHECK_POINT_PATH="hdfs://hadoopcluster/traffichuixing/checkpoint/user_click";

    public static void main(String[] args) throws Exception {
        String path = args.length > 0 ? args[0] : CHECK_POINT_PATH;
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//        // Alternative: FsStateBackend
//        FsStateBackend backend = new FsStateBackend(path, true);
        // Use RocksDBStateBackend (the second argument enables incremental checkpoints)
        RocksDBStateBackend backend = new RocksDBStateBackend(path, true);
        backend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM);
//        // Custom RocksDB options; remove the setPredefinedOptions call above when using these
//        backend.setOptions(new OptionsFactory(){
//            @Override
//            public DBOptions createDBOptions(DBOptions currentOptions) {
//                currentOptions.setWalTtlSeconds(60*20);
//                return currentOptions;
//            }
//
//            @Override
//            public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions currentOptions) {
//                final long blockCacheSize = 256 * 1024 * 1024;
//                final long blockSize = 128 * 1024;
//                final long targetFileSize = 256 * 1024 * 1024;
//                final long writeBufferSize = 64 * 1024 * 1024;
//                return new ColumnFamilyOptions()
//                        .setCompactionStyle(CompactionStyle.FIFO)
//                        .setLevelCompactionDynamicLevelBytes(true)
//                        .setTargetFileSizeBase(targetFileSize)
//                        .setMaxBytesForLevelBase(4 * targetFileSize)
//                        .setWriteBufferSize(writeBufferSize)
//                        .setMinWriteBufferNumberToMerge(3)
//                        .setMaxWriteBufferNumber(4)
//                        .setTableFormatConfig(
//                                new BlockBasedTableConfig()
//                                        .setBlockCacheSize(blockCacheSize)
//                                        .setBlockSize(blockSize)
//                                        .setFilter(new BloomFilter())
//                        );
//            }
//        });


        env.setStateBackend((StateBackend)backend);
        env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION);
        env.getCheckpointConfig().setFailOnCheckpointingErrors(false);
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(1000*30);
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE);
        env.enableCheckpointing(1000*60);


        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        Properties propsConsumer = new Properties();
        propsConsumer.setProperty("bootstrap.servers", KafkaConfig.KAFKA_BROKER_LIST);
        propsConsumer.setProperty("group.id", "trafficwisdom-streaming");
        propsConsumer.put("enable.auto.commit", false);//处理完再提交offset,防止kafka消息读完,任务没计算就挂,丢失数据
        propsConsumer.put("max.poll.records", 1000);

        FlinkKafkaConsumer011<String> detailLog = new FlinkKafkaConsumer011<>("plan_detail_log", new SimpleStringSchema(), propsConsumer);
        detailLog.setStartFromLatest();
        DataStream<String> detailsStream = env.addSource(detailLog).name("detail_source").disableChaining().setParallelism(12);

        DataStream<PlanClickBO> detailStream = detailsStream.flatMap(new FlatMapFunction<String, PlanClickBO>() {
            @Override
            public void flatMap(String value, Collector<PlanClickBO> out) throws Exception {
                try {
                    List<PlanClickBO> array = JSON.parseArray(value, PlanClickBO.class);
                    for (PlanClickBO planClickBO : array) {
                        out.collect(planClickBO);
                    }
                } catch (Exception e) {
                    logger.error(e.getMessage(), e);
                }
            }
        }).filter(s -> s != null && s.getTransferPlanId() != null);

//        detail.print();

        FlinkKafkaConsumer011<String> actionLog = new FlinkKafkaConsumer011<>("transfer_action_log", new SimpleStringSchema(), propsConsumer);
        actionLog.setStartFromLatest();
        DataStream<String> actionStream = env.addSource(actionLog).name("action_source").disableChaining().setParallelism(12);

        DataStream<PlanActionBO> clickStream = actionStream.map(new MapFunction<String, PlanActionBO>() {
            @Override
            public PlanActionBO map(String value) throws Exception {
                try {
                    PlanActionBO planActionBO = JSON.parseObject(value, PlanActionBO.class);
                    if (planActionBO.getAction().equals("click")) {
                        return planActionBO;
                    }
                } catch (Exception e) {
                    // malformed messages are dropped
                }
                return null;
            }
        }).filter(s -> s != null);


        DataStream<PlanClickBO> joinStream = detailStream.keyBy(PlanClickBO::getTransferPlanId)
                .intervalJoin(clickStream.keyBy(PlanActionBO::getItemId))
                .between(seconds(-10), seconds(60 * 8))
                .process(new ProcessJoinFunction<PlanClickBO, PlanActionBO, PlanClickBO>() {
                    @Override
                    public void processElement(PlanClickBO left, PlanActionBO right, Context ctx, Collector<PlanClickBO> out) throws Exception {
                        Long clickTime = Long.parseLong(right.getTime()) * 1000;
                        left.setClickTime(clickTime);
                        out.collect(left);
                    }
                });
        Properties propsProducer = new Properties();
        propsProducer.setProperty("bootstrap.servers", KafkaConfig.KAFKA_BROKER_LIST);
        FlinkKafkaProducer011<String> flinkKafkaProducer = new FlinkKafkaProducer011<>(KafkaTopicName.TRANSFER_LINE_STATS, new SimpleStringSchema(), propsProducer);
        joinStream.map(record->{
            return JSON.toJSONString(record);
        }).returns(Types.STRING).addSink(flinkKafkaProducer).name("user_click_detail_sink").setParallelism(3);
        env.execute("UserLatestClick");

    }
}
