flink 多种实时去重方案

目录

基于状态后端

1.MemoryStateBackend

2.FsStateBackend 

3.RocksDBStateBackend

基于HyperLogLog

基于布隆过滤器(BloomFilter)

基于BitMap

 


基于状态后端

默认flink状态分为3种。

1.MemoryStateBackend

(默认,内存)

2.FsStateBackend 

(正在进行的数据保存在TaskManager内存中,checkpoint时会将状态快照写入checkpoint文件中。)适合用于生产环境

3.RocksDBStateBackend

(唯一支持增量checkpoint,正在进行的数据保存在rockdb数据库中,默认数据存储在taskManager运行节点的数据目录下)适合用于生产环境

示例代码

public class MapStateDistinctFunction extends KeyedProcessFunction, Tuple2> {

    private transient ValueState counts;

    @Override
    public void open(Configuration parameters) throws Exception {
        //设置ValueState的TTL生命周期为24小时,到期自动清除状态
        final StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(Time.minutes(24 * 60))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();
        //设置ValueState的默认值
        final ValueStateDescriptor descriptor = new ValueStateDescriptor<>("skuNum", Integer.class);
        descriptor.enableTimeToLive(ttlConfig);
        counts = getRuntimeContext().getState(descriptor);
        super.open(parameters);
    }

    @Override
    public void processElement(Tuple2 value, Context context, Collector> out) throws Exception {
        final String f0 = value.f0;
        //如果不存在则新增
        if(counts.value() == null){
            counts.update(1);
        }else{
            //如果存在则加1
            counts.update(counts.value()+1);
        }
        out.collect(Tuple2.of(f0,counts.value()));
    }
}

 

基于HyperLogLog

是一种估计统计算法,被用来统计一个集合中不同数据的个数,在不需要100%精确的业务场景下,可以使用这种方法来进行统计。


  net.agkn
  hll
  1.6.0

 示例代码

public class HyperLogLogDistinct implements AggregateFunction, HLL, Long> {
    @Override
    public HLL createAccumulator() {
        return new HLL(14, 5);
    }

    @Override
    public HLL add(Tuple2 value, HLL accumulator) {
        //value为访问记录<商品,id>
        //向HyperLogLog中插入元素
        accumulator.addRaw(value.f1);
        return accumulator;
    }

    @Override
    public Long getResult(HLL accumulator) {
        //计算HyperLogLog中元素的基数
        final long cardinality = accumulator.cardinality();
        return cardinality;
    }

    @Override
    public HLL merge(HLL a, HLL b) {
        a.union(b);
        return a;
    }
}

 

 

基于布隆过滤器(BloomFilter)

类似于HashSet,用于快速判断某个元素是否在于集合中,和HyperLogLog一样,不能保证100%精确,但是插入和查询效率都很高.

示例代码

public class BloomFilterDistinct extends KeyedProcessFunction {

    //import org.apache.flink.shaded.guava18.com.google.common.hash.BloomFilter;
    private transient ValueState bloomState;
    private transient ValueState countState;

    @Override
    public void processElement(String value, Context context, Collector out) throws Exception {

        final BloomFilter bloomFilter = bloomState.value();
        Long skuCount = countState.value();

        if (skuCount == null) {
            skuCount = 0L;
        }
        if (!bloomFilter.mightContain(value)) {
            bloomFilter.put(value);
            skuCount = skuCount + 1;
        }
        bloomState.update(bloomFilter);
        countState.update(skuCount);
        out.collect(countState.value());

    }
}

 

基于BitMap

不仅可以减少存储,而且还可以做到完全准确


   org.roaringbitmap
   RoaringBitmap
   0.8.0

 示例代码

public class BitMapDistinct implements AggregateFunction {

    @Override
    public Roaring64NavigableMap createAccumulator() {
        return new Roaring64NavigableMap();
    }
    @Override
    public Roaring64NavigableMap add(Long value, Roaring64NavigableMap accumulator) {
        accumulator.add(value);
        return accumulator;
    }
    @Override
    public Long getResult(Roaring64NavigableMap accumulator) {
        return accumulator.getLongCardinality();
    }
    @Override
    public Roaring64NavigableMap merge(Roaring64NavigableMap a, Roaring64NavigableMap b) {
        return null;
    }
}

 

你可能感兴趣的:(flink)