Typical Flink ETL Scenarios

Contents

  • 1-Dimension table joins
    • 1.1-Preloaded dimension table
    • 1.2-Hot-storage dimension table
    • 1.3-Broadcast dimension table
  • 2-Two-stream join
    • 2.1-window join
      • 2.1.1-Tumbling Window Join
      • 2.1.2-Sliding Window Join
      • 2.1.3-Session Window Join
    • 2.2-Interval join

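1-Dimension table joins

1.1-Preloaded dimension table

Implement a RichMapFunction and, in its open method, read the dimension data from the database and load all of it into memory.

Pros: simple to implement.
Cons: only suitable for small dimension tables, because the whole table has to fit in memory.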
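The pattern can be sketched in plain Java without any Flink dependency. This is only an illustration: the class name `PreloadedDimensionLookup`, the `loader` supplier (standing in for the database read), and the "unknown" fallback are all hypothetical. In a real job, `open()` and `map()` would be the corresponding RichMapFunction methods.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Plain-Java sketch of a preloaded dimension lookup (no Flink dependency). */
class PreloadedDimensionLookup {
    // Hypothetical stand-in for a full-table read from the database.
    private final Supplier<Map<Integer, String>> loader;
    // In-memory copy of the dimension table, filled once in open().
    private Map<Integer, String> dimensions;

    PreloadedDimensionLookup(Supplier<Map<Integer, String>> loader) {
        this.loader = loader;
    }

    /** Plays the role of RichMapFunction#open: load the whole table once. */
    void open() {
        dimensions = new HashMap<>(loader.get());
    }

    /** Plays the role of map(): enrich a fact key with its dimension value. */
    String map(int key) {
        return key + "," + dimensions.getOrDefault(key, "unknown");
    }
}
```

Because the table is read only once in `open()`, later changes in the database are not seen unless the data is reloaded.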

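1.2-Hot-storage dimension table

Store the dimension data in HBase or Redis, query that hot store through asynchronous I/O, and use a cache to keep recently used dimension data in memory.

Pros: supports larger dimension tables.
Cons: dimension updates become visible only after a delay, because cached entries can be stale.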
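The update delay comes from the cache. The plain-Java sketch below shows one way this can look, assuming a time-to-live (TTL) cache in front of an asynchronous lookup. The class `CachedDimensionLookup`, the `store` function (standing in for an HBase or Redis client call), and the injected `clock` are hypothetical names, not part of any library.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Plain-Java sketch: asynchronous dimension lookup behind a TTL cache. */
class CachedDimensionLookup {
    private static final class Entry {
        final String value;
        final long loadedAtMillis;

        Entry(String value, long loadedAtMillis) {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    // Hypothetical async call into the hot store (e.g. a Redis or HBase client).
    private final Function<String, CompletableFuture<String>> store;
    private final long ttlMillis;
    private final LongSupplier clock;
    private final Map<String, Entry> cache = new ConcurrentHashMap<>();

    CachedDimensionLookup(Function<String, CompletableFuture<String>> store,
                          long ttlMillis, LongSupplier clock) {
        this.store = store;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Serves from the cache while the entry is fresh, otherwise queries the store. */
    CompletableFuture<String> lookup(String key) {
        long now = clock.getAsLong();
        Entry cached = cache.get(key);
        if (cached != null && now - cached.loadedAtMillis < ttlMillis) {
            // May be stale: the store could have changed since the entry was loaded.
            return CompletableFuture.completedFuture(cached.value);
        }
        return store.apply(key).thenApply(value -> {
            cache.put(key, new Entry(value, now));
            return value;
        });
    }
}
```

With a TTL of 100 ms, a value changed in the store right after a lookup is still served from the cache until the entry is 100 ms old. A shorter TTL reduces staleness at the cost of more store queries.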

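1.3-Broadcast dimension table

Use broadcast state to broadcast the stream of dimension data to every parallel task.

Pros: dimension changes are reflected in the results promptly.
Cons: the data is kept in memory, so only a fairly small volume of dimension data is supported.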
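A plain-Java sketch of the idea, with no Flink dependency: dimension updates and fact records arrive through two separate callbacks, and an update takes effect for the very next fact record. The method names mirror the two callbacks of Flink's BroadcastProcessFunction, but the class `BroadcastDimensionSketch`, its simplified signatures, and the "unknown" fallback are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java sketch of enriching facts from broadcast dimension updates. */
class BroadcastDimensionSketch {
    // Stands in for the broadcast state that each parallel task would hold.
    private final Map<Integer, String> broadcastState = new HashMap<>();

    /** Analogue of processBroadcastElement: apply one dimension update. */
    void processBroadcastElement(int key, String value) {
        broadcastState.put(key, value);
    }

    /** Analogue of processElement: enrich a fact with the current dimension value. */
    String processElement(int key) {
        return key + "," + broadcastState.getOrDefault(key, "unknown");
    }
}
```

A fact that arrives before its dimension record is enriched with the fallback value here; a real job would need to decide whether to buffer, drop, or emit such records.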

2-Two-stream join

2.1-window join

2.1.1-Tumbling Window Join


import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

...

DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

// Join elements that share a key and fall into the same 2 ms tumbling window.
orangeStream.join(greenStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(TumblingEventTimeWindows.of(Time.milliseconds(2)))
    .apply(new JoinFunction<Integer, Integer, String>() {
        @Override
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });
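To make the semantics concrete, here is a plain-Java simulation without Flink. It assumes that every element's value is also its event timestamp in milliseconds, that all elements share one key, and that timestamps are non-negative with a window offset of 0. The class `TumblingWindowJoinSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java simulation of a tumbling window join on a single key. */
class TumblingWindowJoinSketch {
    /** Start of the tumbling window that contains a non-negative timestamp. */
    static long windowStart(long timestamp, long size) {
        return timestamp - (timestamp % size);
    }

    /** Pairs every left element with every right element of the same window. */
    static List<String> join(int[] left, int[] right, long size) {
        List<String> out = new ArrayList<>();
        for (int l : left) {
            for (int r : right) {
                if (windowStart(l, size) == windowStart(r, size)) {
                    out.add(l + "," + r);
                }
            }
        }
        return out;
    }
}
```

With a 2 ms window, left elements 0, 1, 2, 5, 6 and right elements 0, 3, 4 yield the pairs 0,0 and 1,0 from window [0, 2), 2,3 from [2, 4), and 5,4 from [4, 6). The left element 6 produces nothing because its window contains no right element, which reflects the inner-join behavior.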

2.1.2-Sliding Window Join


import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

...

DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

// Join elements that share a key and fall into the same sliding window (size 2 ms, slide 1 ms).
orangeStream.join(greenStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(SlidingEventTimeWindows.of(Time.milliseconds(2) /* size */, Time.milliseconds(1) /* slide */))
    .apply(new JoinFunction<Integer, Integer, String>() {
        @Override
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });
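Sliding windows overlap, so one element belongs to several windows and the same pair can be joined more than once. The plain-Java sketch below lists the windows that contain a timestamp and counts how many windows two timestamps share. It assumes non-negative timestamps and a window offset of 0; the class `SlidingWindowSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of sliding window assignment. */
class SlidingWindowSketch {
    /** Start times of all sliding windows that contain a non-negative timestamp. */
    static List<Long> windowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - (timestamp % slide);
        // Walk backwards one slide at a time while the window still covers the timestamp.
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    /** Number of windows that contain both timestamps. */
    static int sharedWindows(long a, long b, long size, long slide) {
        List<Long> shared = windowStarts(a, size, slide);
        shared.retainAll(windowStarts(b, size, slide));
        return shared.size();
    }
}
```

With size 2 ms and slide 1 ms, timestamp 3 falls into the windows starting at 3 and 2. Timestamps 3 and 4 share only the window starting at 3, while 3 and 5 share none and therefore never join.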

2.1.3-Session Window Join


import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

...

DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

// Join elements that share a key and fall into the same session (gap 1 ms).
orangeStream.join(greenStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(EventTimeSessionWindows.withGap(Time.milliseconds(1)))
    .apply(new JoinFunction<Integer, Integer, String>() {
        @Override
        public String join(Integer first, Integer second) {
            return first + "," + second;
        }
    });
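Session windows have no fixed length; they are closed by a period of inactivity. The plain-Java sketch below groups the pooled timestamps of one key into sessions. It is a simplification: it assumes a new session starts when the distance to the previous timestamp exceeds the gap, and it does not try to reproduce Flink's exact handling of timestamps that lie exactly on the boundary. The class `SessionWindowSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of grouping timestamps into session windows. */
class SessionWindowSketch {
    /** Splits timestamps into sessions wherever consecutive timestamps are more than gap apart. */
    static List<List<Long>> sessions(long[] timestamps, long gap) {
        long[] sorted = timestamps.clone();
        Arrays.sort(sorted);
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sorted) {
            boolean gapExceeded = !current.isEmpty()
                    && ts - current.get(current.size() - 1) > gap;
            if (gapExceeded) {
                result.add(current);
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }
}
```

With a gap of 2 ms, the timestamps 0, 1, 5, 6, 10 form the sessions [0, 1], [5, 6], and [10]. In a join, a session only produces output if it contains elements from both streams.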

2.2-Interval join


import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

...

DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

// Join each orange element with green elements of the same key whose timestamp
// lies between 2 ms before and 1 ms after the orange element's timestamp.
orangeStream
    .keyBy(<KeySelector>)
    .intervalJoin(greenStream.keyBy(<KeySelector>))
    .between(Time.milliseconds(-2), Time.milliseconds(1))
    .process(new ProcessJoinFunction<Integer, Integer, String>() {
        @Override
        public void processElement(Integer left, Integer right, Context ctx, Collector<String> out) {
            out.collect(left + "," + right);
        }
    });
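The join condition is left.ts + lowerBound <= right.ts <= left.ts + upperBound, with both bounds inclusive by default. The plain-Java simulation below applies that condition directly, assuming each element's value is also its event timestamp in milliseconds and all elements share one key. The class `IntervalJoinSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java simulation of an interval join on a single key. */
class IntervalJoinSketch {
    /** True when leftTs + lower <= rightTs <= leftTs + upper (bounds inclusive). */
    static boolean matches(long leftTs, long rightTs, long lower, long upper) {
        return leftTs + lower <= rightTs && rightTs <= leftTs + upper;
    }

    /** Pairs each left element with every right element inside its interval. */
    static List<String> join(int[] left, int[] right, long lower, long upper) {
        List<String> out = new ArrayList<>();
        for (int l : left) {
            for (int r : right) {
                if (matches(l, r, lower, upper)) {
                    out.add(l + "," + r);
                }
            }
        }
        return out;
    }
}
```

With bounds -2 and 1, the left element 2 accepts right timestamps in [0, 3] and the left element 5 accepts [3, 6]. For right elements 0, 1, 6, 7 this gives the pairs 2,0, 2,1, and 5,6; the right element 7 matches nothing.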
