Contents
- Series index
- API layers
- pom file
- DataStream
- A minimal Flink streaming word count
- flatMap, similar to Java 8
- Reading from Kafka
- Writing to Kafka
- Time and windows
- Process functions
- Multi-stream transformations
- Stateful programming
- Fault tolerance
- Table and SQL
- CEP
Series index
API layers
- SQL -> Table API -> DataStream -> stateful process functions; the further down the stack, the lower-level the API.
pom file
<properties>
    <flink.version>1.13.0</flink.version>
    <java.version>1.8</java.version>
    <scala.binary.version>2.12</scala.binary.version>
    <slf4j.version>1.7.30</slf4j.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.bahir</groupId>
        <artifactId>flink-connector-redis_2.11</artifactId>
        <version>1.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-elasticsearch6_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-jdbc_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.47</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-statebackend-rocksdb_${scala.binary.version}</artifactId>
        <version>1.13.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-api-java-bridge_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-planner-blink_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-csv</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-cep_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>${slf4j.version}</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>${slf4j.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-to-slf4j</artifactId>
        <version>2.14.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.5</version>
        <scope>provided</scope>
    </dependency>
</dependencies>
Contents of input/words.txt, used by the word-count example below:
hello world
hello flink
hello java
public class ClickSource implements SourceFunction<Event> {
    // flag that controls whether the source keeps generating data
    private Boolean running = true;
    @Override
    public void run(SourceContext<Event> ctx) throws Exception {
        Random random = new Random(); // pick randomly from the predefined values
        String[] users = {"Mary", "Alice", "Bob", "Cary"};
        String[] urls = {"./home", "./cart", "./fav", "./prod?id=1", "./prod?id=2"};
        while (running) {
            ctx.collect(new Event(
                    users[random.nextInt(users.length)],
                    urls[random.nextInt(urls.length)],
                    Calendar.getInstance().getTimeInMillis()
            ));
            // emit one click event per second so the output is easy to follow
            Thread.sleep(1000);
        }
    }
    @Override
    public void cancel() {
        running = false;
    }
}
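ClickSource and the later examples construct an Event POJO that isn't shown in this post. A minimal sketch, assuming only the three public fields used above (user, url, timestamp) plus the no-argument constructor Flink's POJO serializer requires:
import java.sql.Timestamp;

public class Event {
    // public fields and a public no-arg constructor, so Flink treats this class as a POJO type
    public String user;
    public String url;
    public Long timestamp;
    public Event() {
    }
    public Event(String user, String url, Long timestamp) {
        this.user = user;
        this.url = url;
        this.timestamp = timestamp;
    }
    @Override
    public String toString() {
        return "Event{user='" + user + "', url='" + url + "', timestamp=" + new Timestamp(timestamp) + "}";
    }
}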
DataStream
A minimal Flink streaming word count
- A minimal Flink job that counts how many times each word appears in a stream; you can run it directly. In step 2, the file read could be replaced by reading from Kafka (or another MQ) or from a service; in step 6, the print could instead sink to Redis, Elasticsearch, ClickHouse, etc.
public class BoundedStreamWordCount {
    public static void main(String[] args) throws Exception {
        // 1. create the streaming execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 2. read the file
        DataStreamSource<String> lineDSS = env.readTextFile("input/words.txt");
        // 3. transform the data
        SingleOutputStreamOperator<Tuple2<String, Long>> wordAndOne = lineDSS
                .flatMap((String line, Collector<String> words) -> {
                    Arrays.stream(line.split(" ")).forEach(words::collect);
                })
                .returns(Types.STRING)
                .map(word -> Tuple2.of(word, 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG));
        // 4. key by word
        KeyedStream<Tuple2<String, Long>, String> wordAndOneKS = wordAndOne
                .keyBy(t -> t.f0);
        // 5. sum
        SingleOutputStreamOperator<Tuple2<String, Long>> result = wordAndOneKS
                .sum(1);
        // 6. print
        result.print();
        // 7. execute
        env.execute();
    }
}
flatMap, similar to Java 8's flatMap
public class TransFlatmapTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        DataStreamSource<Event> stream = env.fromElements(
                new Event("Mary", "./home", 1000L),
                new Event("Bob", "./cart", 2000L)
        );
        stream.flatMap(new MyFlatMap()).print();
        env.execute();
    }
    public static class MyFlatMap implements FlatMapFunction<Event, String> {
        @Override
        public void flatMap(Event value, Collector<String> out) throws Exception {
            if (value.user.equals("Mary")) {
                out.collect(value.user);
            } else if (value.user.equals("Bob")) {
                out.collect(value.user);
                out.collect(value.url);
            }
        }
    }
}
Reading from Kafka
- This source corresponds to step 2 (reading from a file) in BoundedStreamWordCount.
public class SourceKafkaTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "hadoop102:9092");
        properties.setProperty("group.id", "consumer-group");
        properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("auto.offset.reset", "latest");
        DataStreamSource<String> stream = env.addSource(new FlinkKafkaConsumer<String>(
                "clicks",
                new SimpleStringSchema(),
                properties
        ));
        stream.print("Kafka");
        env.execute();
    }
}
Writing to Kafka
- The addSink call here corresponds to the print() in BoundedStreamWordCount.
public class SinkToKafkaTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        Properties properties = new Properties();
        properties.put("bootstrap.servers", "hadoop102:9092");
        DataStreamSource<String> stream = env.readTextFile("input/clicks.csv");
        stream
                .addSink(new FlinkKafkaProducer<String>(
                        "clicks",
                        new SimpleStringSchema(),
                        properties
                ));
        env.execute();
    }
}
Time and windows
public class WindowAggregateTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // assign timestamps and watermarks
        SingleOutputStreamOperator<Event> stream = env.addSource(new ClickSource())
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forMonotonousTimestamps()
                        .withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
                            @Override
                            public long extractTimestamp(Event element, long recordTimestamp) {
                                return element.timestamp;
                            }
                        }));
        // give all records the same key so they land in one partition, count PV and UV there, then divide
        stream.keyBy(data -> true)
                // sliding window
                .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(2)))
                .aggregate(new AvgPv())
                .print();
        env.execute();
    }
    public static class AvgPv implements AggregateFunction<Event, Tuple2<HashSet<String>, Long>, Double> {
        @Override
        public Tuple2<HashSet<String>, Long> createAccumulator() {
            // create the accumulator
            return Tuple2.of(new HashSet<String>(), 0L);
        }
        @Override
        public Tuple2<HashSet<String>, Long> add(Event value, Tuple2<HashSet<String>, Long> accumulator) {
            // every element belonging to this window updates the accumulator once
            accumulator.f0.add(value.user);
            return Tuple2.of(accumulator.f0, accumulator.f1 + 1L);
        }
        @Override
        public Double getResult(Tuple2<HashSet<String>, Long> accumulator) {
            // when the window closes, the incremental aggregation ends and the result is sent downstream
            return (double) accumulator.f1 / accumulator.f0.size();
        }
        @Override
        public Tuple2<HashSet<String>, Long> merge(Tuple2<HashSet<String>, Long> a, Tuple2<HashSet<String>, Long> b) {
            return null;
        }
    }
}
Process functions
- See the TopN class: open() can set up resources such as database connections, processElement() handles each element, and onTimer() runs timer callbacks. This is the lowest-level processing API, below even the DataStream transformations.
public class KeyedProcessTopN {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // read from the custom source
        SingleOutputStreamOperator<Event> eventStream = env.addSource(new ClickSource())
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forMonotonousTimestamps()
                        .withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
                            @Override
                            public long extractTimestamp(Event element, long recordTimestamp) {
                                return element.timestamp;
                            }
                        }));
        // key by url and compute the visit count per url
        SingleOutputStreamOperator<UrlViewCount> urlCountStream =
                eventStream.keyBy(data -> data.url)
                        .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
                        .aggregate(new UrlViewCountAgg(), new UrlViewCountResult());
        // sort the counts that belong to the same window
        SingleOutputStreamOperator<String> result = urlCountStream.keyBy(data -> data.windowEnd)
                .process(new TopN(2));
        result.print("result");
        env.execute();
    }
    // custom incremental aggregation
    public static class UrlViewCountAgg implements AggregateFunction<Event, Long, Long> {
        @Override
        public Long createAccumulator() {
            return 0L;
        }
        @Override
        public Long add(Event value, Long accumulator) {
            return accumulator + 1;
        }
        @Override
        public Long getResult(Long accumulator) {
            return accumulator;
        }
        @Override
        public Long merge(Long a, Long b) {
            return null;
        }
    }
    // custom full-window function that only wraps the window information around the count
    public static class UrlViewCountResult extends ProcessWindowFunction<Long, UrlViewCount, String, TimeWindow> {
        @Override
        public void process(String url, Context context, Iterable<Long> elements, Collector<UrlViewCount> out) throws Exception {
            // combine the count with the window information
            Long start = context.window().getStart();
            Long end = context.window().getEnd();
            out.collect(new UrlViewCount(url, elements.iterator().next(), start, end));
        }
    }
    // custom process function that sorts the counts and emits the top n
    public static class TopN extends KeyedProcessFunction<Long, UrlViewCount, String> {
        // keep n as a field
        private Integer n;
        // list state holding the counts of one window
        private ListState<UrlViewCount> urlViewCountListState;
        public TopN(Integer n) {
            this.n = n;
        }
        @Override
        public void open(Configuration parameters) throws Exception {
            // get the list state handle from the runtime context
            urlViewCountListState = getRuntimeContext().getListState(
                    new ListStateDescriptor<UrlViewCount>("url-view-count-list",
                            Types.POJO(UrlViewCount.class)));
        }
        @Override
        public void processElement(UrlViewCount value, Context ctx, Collector<String> out) throws Exception {
            // save the count into the list state
            urlViewCountListState.add(value);
            // register a timer at window end + 1ms, so sorting starts once all counts of the window have arrived
            ctx.timerService().registerEventTimeTimer(ctx.getCurrentKey() + 1);
        }
        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
            // copy the list state into an ArrayList so it can be sorted
            ArrayList<UrlViewCount> urlViewCountArrayList = new ArrayList<>();
            for (UrlViewCount urlViewCount : urlViewCountListState.get()) {
                urlViewCountArrayList.add(urlViewCount);
            }
            // clear the state to free resources
            urlViewCountListState.clear();
            // sort descending by count
            urlViewCountArrayList.sort(new Comparator<UrlViewCount>() {
                @Override
                public int compare(UrlViewCount o1, UrlViewCount o2) {
                    return o2.count.intValue() - o1.count.intValue();
                }
            });
            // take the top n (or fewer, if the window had fewer urls) and build the output
            StringBuilder result = new StringBuilder();
            result.append("========================================\n");
            result.append("window end: " + new Timestamp(timestamp - 1) + "\n");
            for (int i = 0; i < Math.min(this.n, urlViewCountArrayList.size()); i++) {
                UrlViewCount urlViewCount = urlViewCountArrayList.get(i);
                String info = "No." + (i + 1) + " "
                        + "url: " + urlViewCount.url + " "
                        + "views: " + urlViewCount.count + "\n";
                result.append(info);
            }
            result.append("========================================\n");
            out.collect(result.toString());
        }
    }
}
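KeyedProcessTopN relies on a UrlViewCount POJO that isn't shown above. A minimal sketch, assuming only the fields the code actually references (url, count, windowStart, windowEnd):
public class UrlViewCount {
    // public fields and a public no-arg constructor, as required by Types.POJO(UrlViewCount.class)
    public String url;
    public Long count;
    public Long windowStart;
    public Long windowEnd;
    public UrlViewCount() {
    }
    public UrlViewCount(String url, Long count, Long windowStart, Long windowEnd) {
        this.url = url;
        this.count = count;
        this.windowStart = windowStart;
        this.windowEnd = windowEnd;
    }
    @Override
    public String toString() {
        return "UrlViewCount{url='" + url + "', count=" + count
                + ", windowStart=" + windowStart + ", windowEnd=" + windowEnd + "}";
    }
}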
Multi-stream transformations
- Covers two-stream joins and union. A union of n streams merges them into one stream without changing the element type, much like list.add; a two-stream interval-join sketch follows the union example below.
// union
public class UnionTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        SingleOutputStreamOperator<Event> stream1 = env.socketTextStream("hadoop102", 7777)
                .map(data -> {
                    String[] field = data.split(",");
                    return new Event(field[0].trim(), field[1].trim(), Long.valueOf(field[2].trim()));
                })
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(2))
                        .withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
                            @Override
                            public long extractTimestamp(Event element, long recordTimestamp) {
                                return element.timestamp;
                            }
                        })
                );
        stream1.print("stream1");
        SingleOutputStreamOperator<Event> stream2 = env.socketTextStream("hadoop103", 7777)
                .map(data -> {
                    String[] field = data.split(",");
                    return new Event(field[0].trim(), field[1].trim(), Long.valueOf(field[2].trim()));
                })
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        .withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
                            @Override
                            public long extractTimestamp(Event element, long recordTimestamp) {
                                return element.timestamp;
                            }
                        })
                );
        stream2.print("stream2");
        // union the two streams
        stream1.union(stream2)
                .process(new ProcessFunction<Event, String>() {
                    @Override
                    public void processElement(Event value, Context ctx, Collector<String> out) throws Exception {
                        out.collect("watermark: " + ctx.timerService().currentWatermark());
                    }
                })
                .print();
        env.execute();
    }
}
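Union keeps each stream's elements unchanged; for an actual two-stream join, here is a minimal interval-join sketch. The class name and the choice of two ClickSource inputs are illustrative, not from the original post; it assumes the Event and ClickSource classes shown above.
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class IntervalJoinSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // two event-time streams, watermarks assigned the same way as in the examples above
        SingleOutputStreamOperator<Event> stream1 = env.addSource(new ClickSource())
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forMonotonousTimestamps()
                        .withTimestampAssigner((event, ts) -> event.timestamp));
        SingleOutputStreamOperator<Event> stream2 = env.addSource(new ClickSource())
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forMonotonousTimestamps()
                        .withTimestampAssigner((event, ts) -> event.timestamp));
        // for every element of stream1, join it with the stream2 elements of the same user
        // whose timestamps lie within [-5s, +10s] of it
        stream1.keyBy(e -> e.user)
                .intervalJoin(stream2.keyBy(e -> e.user))
                .between(Time.seconds(-5), Time.seconds(10))
                .process(new ProcessJoinFunction<Event, Event, String>() {
                    @Override
                    public void processElement(Event left, Event right, Context ctx, Collector<String> out) throws Exception {
                        out.collect(left + " => " + right);
                    }
                })
                .print();
        env.execute();
    }
}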
Stateful programming
- In MyFlatMap, the open() method initializes each kind of state. Run the code below and watch the values accumulate.
public class StateTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        SingleOutputStreamOperator<Event> stream = env.addSource(new ClickSource())
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forMonotonousTimestamps()
                        .withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
                            @Override
                            public long extractTimestamp(Event element, long recordTimestamp) {
                                return element.timestamp;
                            }
                        })
                );
        stream.keyBy(data -> data.user)
                .flatMap(new MyFlatMap())
                .print();
        env.execute();
    }
    // custom FlatMapFunction used to exercise keyed state
    public static class MyFlatMap extends RichFlatMapFunction<Event, String> {
        // declare the state
        ValueState<Event> myValueState;
        ListState<Event> myListState;
        MapState<String, Long> myMapState;
        ReducingState<Event> myReducingState;
        AggregatingState<Event, String> myAggregatingState;
        // a plain local variable for comparison
        Long count = 0L;
        // initialize the state
        @Override
        public void open(Configuration parameters) throws Exception {
            ValueStateDescriptor<Event> valueStateDescriptor = new ValueStateDescriptor<>("my-state", Event.class);
            myValueState = getRuntimeContext().getState(valueStateDescriptor);
            myListState = getRuntimeContext().getListState(new ListStateDescriptor<Event>("my-list", Event.class));
            myMapState = getRuntimeContext().getMapState(new MapStateDescriptor<String, Long>("my-map", String.class, Long.class));
            myReducingState = getRuntimeContext().getReducingState(new ReducingStateDescriptor<Event>("my-reduce",
                    new ReduceFunction<Event>() {
                        @Override
                        public Event reduce(Event value1, Event value2) throws Exception {
                            return new Event(value1.user, value1.url, value2.timestamp);
                        }
                    }
                    , Event.class));
            myAggregatingState = getRuntimeContext().getAggregatingState(new AggregatingStateDescriptor<Event, Long, String>("my-agg",
                    new AggregateFunction<Event, Long, String>() {
                        @Override
                        public Long createAccumulator() {
                            return 0L;
                        }
                        @Override
                        public Long add(Event value, Long accumulator) {
                            return accumulator + 1;
                        }
                        @Override
                        public String getResult(Long accumulator) {
                            return "count: " + accumulator;
                        }
                        @Override
                        public Long merge(Long a, Long b) {
                            return a + b;
                        }
                    }
                    , Long.class));
            // configure state TTL
            StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(Time.hours(1))
                    .setUpdateType(StateTtlConfig.UpdateType.OnReadAndWrite)
                    .setStateVisibility(StateTtlConfig.StateVisibility.ReturnExpiredIfNotCleanedUp)
                    .build();
            valueStateDescriptor.enableTimeToLive(ttlConfig);
        }
        @Override
        public void flatMap(Event value, Collector<String> out) throws Exception {
            // read and update the state
            System.out.println("my value: " + myValueState.value());
            myValueState.update(value);
            System.out.println("my value1: " + myValueState.value());
            myListState.add(value);
            myMapState.put(value.user, myMapState.get(value.user) == null ? 1L : myMapState.get(value.user) + 1);
            System.out.println("my map value: " + myMapState.get(value.user));
            myReducingState.add(value);
            System.out.println("my reducing value: " + myReducingState.get());
            myAggregatingState.add(value);
            System.out.println("my agg value: " + myAggregatingState.get());
            count++;
            System.out.println("count: " + count);
        }
    }
}
- For broadcast streams, see the reference article Flink广播流——BroadcastStream; a minimal broadcast-state sketch is given below.
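A minimal broadcast-state sketch under the Event type from above. The rule format, the state-descriptor name, and the socket "rule" source are illustrative assumptions, not from the original post:
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class BroadcastStateSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // descriptor for the broadcast state: url -> label
        MapStateDescriptor<String, String> ruleDescriptor =
                new MapStateDescriptor<>("rules", Types.STRING, Types.STRING);
        // low-throughput "rule" stream, broadcast to every parallel task of the main stream
        BroadcastStream<String> ruleStream = env.socketTextStream("hadoop102", 7777)
                .broadcast(ruleDescriptor);
        env.addSource(new ClickSource())
                .connect(ruleStream)
                .process(new BroadcastProcessFunction<Event, String, String>() {
                    @Override
                    public void processElement(Event value, ReadOnlyContext ctx, Collector<String> out) throws Exception {
                        // main stream: read-only access to the broadcast state
                        String label = ctx.getBroadcastState(ruleDescriptor).get(value.url);
                        out.collect(value.user + " -> " + value.url + " label=" + label);
                    }
                    @Override
                    public void processBroadcastElement(String rule, Context ctx, Collector<String> out) throws Exception {
                        // rule stream: update the broadcast state; expects lines like "./home,homepage"
                        String[] kv = rule.split(",");
                        if (kv.length == 2) {
                            ctx.getBroadcastState(ruleDescriptor).put(kv[0].trim(), kv[1].trim());
                        }
                    }
                })
                .print();
        env.execute();
    }
}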
Fault tolerance
- Checkpoints and state consistency; for the underlying principles see 【Flink】状态一致性、端到端的精确一次(ecactly-once)保证 in the references.
- Flink provides a fault-tolerance mechanism that can restore a streaming application to a consistent state. It guarantees that after a failure, the program's state ultimately reflects every record of the data stream exactly once.
- The code below writes every element of a local buffer into the checkpoint; when recovering from a failure, it copies every element of the ListState back into the local buffer.
public class BufferingSinkExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        env.enableCheckpointing(10000L);
        // env.setStateBackend(new EmbeddedRocksDBStateBackend());
        // env.getCheckpointConfig().setCheckpointStorage(new FileSystemCheckpointStorage(""));
        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        checkpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        checkpointConfig.setMinPauseBetweenCheckpoints(500);
        checkpointConfig.setCheckpointTimeout(60000);
        checkpointConfig.setMaxConcurrentCheckpoints(1);
        checkpointConfig.enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        checkpointConfig.enableUnalignedCheckpoints();
        SingleOutputStreamOperator<Event> stream = env.addSource(new ClickSource())
                .assignTimestampsAndWatermarks(WatermarkStrategy.<Event>forMonotonousTimestamps()
                        .withTimestampAssigner(new SerializableTimestampAssigner<Event>() {
                            @Override
                            public long extractTimestamp(Event element, long recordTimestamp) {
                                return element.timestamp;
                            }
                        })
                );
        stream.print("input");
        // buffer the output and flush it in batches
        stream.addSink(new BufferingSink(10));
        env.execute();
    }
    public static class BufferingSink implements SinkFunction<Event>, CheckpointedFunction {
        private final int threshold;
        private transient ListState<Event> checkpointedState;
        private List<Event> bufferedElements;
        public BufferingSink(int threshold) {
            this.threshold = threshold;
            this.bufferedElements = new ArrayList<>();
        }
        @Override
        public void invoke(Event value, Context context) throws Exception {
            bufferedElements.add(value);
            if (bufferedElements.size() == threshold) {
                for (Event element : bufferedElements) {
                    // write to the external system; simulated here with console output
                    System.out.println("writing to external system: " + element);
                }
                System.out.println("========== flush complete ==========");
                bufferedElements.clear();
            }
        }
        @Override
        public void snapshotState(FunctionSnapshotContext context) throws Exception {
            checkpointedState.clear();
            // write every element of the local buffer into the checkpoint
            for (Event element : bufferedElements) {
                checkpointedState.add(element);
            }
        }
        @Override
        public void initializeState(FunctionInitializationContext context) throws Exception {
            ListStateDescriptor<Event> descriptor = new ListStateDescriptor<>(
                    "buffered-elements",
                    Types.POJO(Event.class));
            checkpointedState = context.getOperatorStateStore().getListState(descriptor);
            // when restoring from a failure, copy every element of the ListState back into the local buffer
            if (context.isRestored()) {
                for (Event element : checkpointedState.get()) {
                    bufferedElements.add(element);
                }
            }
        }
    }
}
Table and SQL
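A minimal sketch of bridging between DataStream and Table/SQL over the Event stream from above, using the flink-table-api-java-bridge and blink-planner dependencies declared in the pom. The class name, table name, and column selection are illustrative assumptions; note that user is a reserved word in Flink SQL and is escaped with backticks:
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

import static org.apache.flink.table.api.Expressions.$;

public class SimpleTableExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        SingleOutputStreamOperator<Event> eventStream = env.addSource(new ClickSource());
        // create a table environment on top of the streaming environment
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
        // turn the DataStream into a Table and register it for SQL queries
        Table eventTable = tableEnv.fromDataStream(eventStream);
        tableEnv.createTemporaryView("EventTable", eventTable);
        // query with SQL
        Table sqlResult = tableEnv.sqlQuery("SELECT url, `user` FROM EventTable");
        // or with the Table API
        Table apiResult = eventTable.select($("url"), $("user"));
        // convert back to DataStream and print
        tableEnv.toDataStream(sqlResult).print("sql");
        tableEnv.toDataStream(apiResult).print("table");
        env.execute();
    }
}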
CEP
public class LoginFailDetectExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // 1. get the login event stream, extract timestamps and generate watermarks
        KeyedStream<LoginEvent, String> stream = env
                .fromElements(
                        new LoginEvent("user_1", "192.168.0.1", "fail", 2000L),
                        new LoginEvent("user_1", "192.168.0.2", "fail", 3000L),
                        new LoginEvent("user_2", "192.168.1.29", "fail", 4000L),
                        new LoginEvent("user_1", "171.56.23.10", "fail", 5000L),
                        new LoginEvent("user_2", "192.168.1.29", "fail", 7000L),
                        new LoginEvent("user_2", "192.168.1.29", "fail", 8000L),
                        new LoginEvent("user_2", "192.168.1.29", "success", 6000L)
                )
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<LoginEvent>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                                .withTimestampAssigner(
                                        new SerializableTimestampAssigner<LoginEvent>() {
                                            @Override
                                            public long extractTimestamp(LoginEvent loginEvent, long l) {
                                                return loginEvent.timestamp;
                                            }
                                        }
                                )
                )
                .keyBy(r -> r.userId);
        // 2. define the pattern: three consecutive login failures
        Pattern<LoginEvent, LoginEvent> pattern = Pattern.<LoginEvent>begin("first") // the first login failure
                .where(new SimpleCondition<LoginEvent>() {
                    @Override
                    public boolean filter(LoginEvent loginEvent) throws Exception {
                        return loginEvent.eventType.equals("fail");
                    }
                })
                .next("second") // followed by the second login failure
                .where(new SimpleCondition<LoginEvent>() {
                    @Override
                    public boolean filter(LoginEvent loginEvent) throws Exception {
                        return loginEvent.eventType.equals("fail");
                    }
                })
                .next("third") // followed by the third login failure
                .where(new SimpleCondition<LoginEvent>() {
                    @Override
                    public boolean filter(LoginEvent loginEvent) throws Exception {
                        return loginEvent.eventType.equals("fail");
                    }
                });
        // 3. apply the pattern to the stream and get a PatternStream of matched complex events
        PatternStream<LoginEvent> patternStream = CEP.pattern(stream, pattern);
        // 4. select the matched complex events and emit them as alert strings
        patternStream
                .select(new PatternSelectFunction<LoginEvent, String>() {
                    @Override
                    public String select(Map<String, List<LoginEvent>> map) throws Exception {
                        LoginEvent first = map.get("first").get(0);
                        LoginEvent second = map.get("second").get(0);
                        LoginEvent third = map.get("third").get(0);
                        return first.userId + " failed to log in three times in a row! Login times: "
                                + first.timestamp + ", " + second.timestamp + ", " + third.timestamp;
                    }
                })
                .print("warning");
        env.execute();
    }
}
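LoginFailDetectExample constructs a LoginEvent POJO that isn't shown above. A minimal sketch with the fields referenced in the code (userId, eventType, timestamp); the name of the IP-address field is an assumption:
public class LoginEvent {
    public String userId;
    public String ipAddress;
    public String eventType;
    public Long timestamp;
    public LoginEvent() {
    }
    public LoginEvent(String userId, String ipAddress, String eventType, Long timestamp) {
        this.userId = userId;
        this.ipAddress = ipAddress;
        this.eventType = eventType;
        this.timestamp = timestamp;
    }
    @Override
    public String toString() {
        return "LoginEvent{userId='" + userId + "', ipAddress='" + ipAddress
                + "', eventType='" + eventType + "', timestamp=" + timestamp + "}";
    }
}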
References
- 尚硅谷大数据Flink 2.0
- Flink广播流——BroadcastStream
- Flink 容错机制
- 【Flink】状态一致性、端到端的精确一次(ecactly-once)保证