An Introduction to Flink's Kafka Connector

How to Use the Kafka Connector

Introduction

Flink's Kafka Connector provides the ability to read data from and write data to Kafka. Combined with Flink's checkpoint mechanism, it offers exactly-once processing semantics, so that data is read and written accurately while the corresponding offsets remain queryable.

KafkaConsumer

Basic Usage

Flink uses a Kafka consumer to read data from one (or more) Kafka topics and turn it into a data stream.
To use the Flink Kafka Connector, add the following dependency to the project's pom.xml:

    
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka-0.8_2.11</artifactId>
        <version>1.4.2</version>
    </dependency>

Note: to use the Kafka 0.9 connector instead, change the artifactId to:

    flink-connector-kafka-0.9_2.11

A data stream that reads from Kafka can then be created as follows:

Properties properties = new Properties();
// comma-separated list of Kafka brokers
properties.setProperty("bootstrap.servers", "localhost:9092");
// comma-separated list of ZooKeeper servers (required by the Kafka 0.8 consumer)
properties.setProperty("zookeeper.connect", "localhost:2181");
// consumer group id
properties.setProperty("group.id", "test");
FlinkKafkaConsumer08<String> kafkaConsumer = new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);
DataStream<String> stream = env.addSource(kafkaConsumer);

The signature of the FlinkKafkaConsumer08 constructor used above is:

public FlinkKafkaConsumer08(String topic, DeserializationSchema<T> valueDeserializer, Properties props)

Its three parameters are:
1. The name of the topic to read from (an overloaded constructor accepts a List<String> of topic names for reading multiple topics).
2. A DeserializationSchema defining how the data read from Kafka is deserialized; see "Custom Deserialization" under Advanced Usage.
3. A Properties object containing the remaining configuration parameters.

A complete example:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010;

import javax.annotation.Nullable;

/**
 * A simple example that shows how to read from and write to Kafka. This will read String messages
 * from the input topic, parse them into a POJO type {@link KafkaEvent}, group by some key, and finally
 * perform a rolling addition on each key for which the results are written back to another topic.
 *
 * <p>This example also demonstrates using a watermark assigner to generate per-partition
 * watermarks directly in the Flink Kafka consumer. For demonstration purposes, it is assumed that
 * the String messages are formatted as a (word,frequency,timestamp) tuple.
 *
 * <p>Example usage:
 * --input-topic test-input --output-topic test-output --bootstrap.servers localhost:9092 --zookeeper.connect localhost:2181 --group.id myconsumer
 */
public class Kafka010Example {

    public static void main(String[] args) throws Exception {
        // parse input arguments
        final ParameterTool parameterTool = ParameterTool.fromArgs(args);

        if (parameterTool.getNumberOfParameters() < 5) {
            System.out.println("Missing parameters!\n" +
                    "Usage: Kafka --input-topic <topic> --output-topic <topic> " +
                    "--bootstrap.servers <kafka brokers> " +
                    "--zookeeper.connect <zk quorum> --group.id <some id>");
            return;
        }

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.getConfig().disableSysoutLogging();
        env.getConfig().setRestartStrategy(RestartStrategies.fixedDelayRestart(4, 10000));
        env.enableCheckpointing(5000); // create a checkpoint every 5 seconds
        env.getConfig().setGlobalJobParameters(parameterTool); // make parameters available in the web interface
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        DataStream<KafkaEvent> input = env
                .addSource(
                    new FlinkKafkaConsumer010<>(
                        parameterTool.getRequired("input-topic"),
                        new KafkaEventSchema(),
                        parameterTool.getProperties())
                    .assignTimestampsAndWatermarks(new CustomWatermarkExtractor()))
                .keyBy("word")
                .map(new RollingAdditionMapper());

        input.addSink(
                new FlinkKafkaProducer010<>(
                        parameterTool.getRequired("output-topic"),
                        new KafkaEventSchema(),
                        parameterTool.getProperties()));

        env.execute("Kafka 0.10 Example");
    }

    /**
     * A {@link RichMapFunction} that continuously outputs the current total frequency count of a key.
     * The current total count is keyed state managed by Flink.
     */
    private static class RollingAdditionMapper extends RichMapFunction<KafkaEvent, KafkaEvent> {

        private static final long serialVersionUID = 1180234853172462378L;

        private transient ValueState<Integer> currentTotalCount;

        @Override
        public KafkaEvent map(KafkaEvent event) throws Exception {
            Integer totalCount = currentTotalCount.value();
            if (totalCount == null) {
                totalCount = 0;
            }
            totalCount += event.getFrequency();

            currentTotalCount.update(totalCount);

            return new KafkaEvent(event.getWord(), totalCount, event.getTimestamp());
        }

        @Override
        public void open(Configuration parameters) throws Exception {
            currentTotalCount = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("currentTotalCount", Integer.class));
        }
    }

    /**
     * A custom {@link AssignerWithPeriodicWatermarks}, that simply assumes that the input stream
     * records are strictly ascending.
     *
     * <p>Flink also ships some built-in convenience assigners, such as the
     * {@link BoundedOutOfOrdernessTimestampExtractor} and {@link AscendingTimestampExtractor}.
     */
    private static class CustomWatermarkExtractor implements AssignerWithPeriodicWatermarks<KafkaEvent> {

        private static final long serialVersionUID = -742759155861320823L;

        private long currentTimestamp = Long.MIN_VALUE;

        @Override
        public long extractTimestamp(KafkaEvent event, long previousElementTimestamp) {
            // the inputs are assumed to be of format (message,timestamp)
            this.currentTimestamp = event.getTimestamp();
            return event.getTimestamp();
        }

        @Nullable
        @Override
        public Watermark getCurrentWatermark() {
            return new Watermark(currentTimestamp == Long.MIN_VALUE ? Long.MIN_VALUE : currentTimestamp - 1);
        }
    }
}

Advanced Usage

1. Consuming from Multiple Topics

FlinkKafkaConsumer can consume from several topics at once by passing a list of topic names to the constructor:

new FlinkKafkaConsumer08<>(Arrays.asList("topic1", "topic2", "topic3"), new SimpleStringSchema(), properties)

In addition, the topics can be specified as a regular expression, so that topics matching the pattern that are created after the job has started can also be discovered and consumed (provided partition discovery is enabled, see below). For example:

FlinkKafkaConsumer011<String> myConsumer = new FlinkKafkaConsumer011<>(
    java.util.regex.Pattern.compile("test-topic-[0-9]"),
    new SimpleStringSchema(),
    properties);

Topic and partition discovery is controlled by the following property (set in the consumer's Properties, in milliseconds); it determines how often the job checks for new topics and partitions, which are then consumed automatically:

flink.partition-discovery.interval-millis
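
A minimal sketch of enabling discovery follows; the 10-second interval and the topic pattern are illustrative values only, and if the property is not set, discovery stays disabled:

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "test");
// check for newly created topics/partitions every 10 seconds
properties.setProperty("flink.partition-discovery.interval-millis", "10000");

FlinkKafkaConsumer011<String> myConsumer = new FlinkKafkaConsumer011<>(
    java.util.regex.Pattern.compile("test-topic-[0-9]"),
    new SimpleStringSchema(),
    properties);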

2. Configuring the Start Position

FlinkKafkaConsumer allows a job to start consuming at a configurable position in the topic. The following modes are available:
1. Start from the earliest offset of every partition:

FlinkKafkaConsumer08<String> kafkaConsumer = new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);
kafkaConsumer.setStartFromEarliest();

2. Start from the latest offset of every partition:

FlinkKafkaConsumer08<String> kafkaConsumer = new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);
kafkaConsumer.setStartFromLatest();

3. Start from the offsets committed for the consumer group (the default behavior; requires that a group.id has been configured for the consumer):

FlinkKafkaConsumer08<String> kafkaConsumer = new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);
kafkaConsumer.setStartFromGroupOffsets();

4. Start from explicitly specified offsets for individual partitions:

FlinkKafkaConsumer08<String> kafkaConsumer = new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);

Map<KafkaTopicPartition, Long> specificStartOffsets = new HashMap<>();
specificStartOffsets.put(new KafkaTopicPartition("topic", 0), 23L);
specificStartOffsets.put(new KafkaTopicPartition("topic", 1), 31L);
specificStartOffsets.put(new KafkaTopicPartition("topic", 2), 43L);

kafkaConsumer.setStartFromSpecificOffsets(specificStartOffsets);

3. Checkpointing and Offset Committing

To guarantee exactly-once processing, enable checkpointing for the job:

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000); // checkpoint every 5000 msecs

1. With checkpointing enabled, the Kafka consumer commits the offsets of the consumed data to ZooKeeper (Kafka 0.8) or to the Kafka brokers (Kafka 0.9+) every time a checkpoint completes.
2. With checkpointing enabled, offset committing can be disabled as follows; the job will then not commit any offsets when checkpoints complete:

kafkaConsumer.setCommitOffsetsOnCheckpoints(false);

3. Without checkpointing, the Kafka consumer commits the current offsets periodically. The commit interval (default: 60 s) can be configured as follows:

Properties properties = new Properties();
properties.setProperty("auto.commit.interval.ms", "60000"); // commit every 60 seconds

FlinkKafkaConsumer08<String> kafkaConsumer = new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);

4. Periodic offset committing can be switched off with the following property; without checkpointing, the consumer will then not commit offsets at all:

Properties properties = new Properties();
properties.setProperty("auto.commit.enable", "false"); // named "enable.auto.commit" for Kafka 0.9+

FlinkKafkaConsumer08<String> kafkaConsumer = new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);

4. Custom Deserialization

To consume data in other formats from Kafka, the consumer needs a DeserializationSchema that turns the consumed byte[] arrays into the corresponding Java/Scala objects. The examples above all use SimpleStringSchema, which deserializes every record into a String. Custom deserialization can be set up in the following ways:
1. Construct Flink's built-in TypeInformationSerializationSchema from a TypeInformation to deserialize any type natively supported by Flink, for example:

// Tuple2<String, Long> is used here only as an example element type
TypeInformationSerializationSchema<Tuple2<String, Long>> serializationSchema =
        new TypeInformationSerializationSchema<>(
                TypeInformation.of(new TypeHint<Tuple2<String, Long>>() {}),
                env.getConfig());
FlinkKafkaConsumer08<Tuple2<String, Long>> kafkaConsumer =
        new FlinkKafkaConsumer08<>("topic", serializationSchema, properties);
DataStream<Tuple2<String, Long>> stream = env.addSource(kafkaConsumer);

2. Use JSONDeserializationSchema to deserialize JSON-formatted data into Jackson ObjectNode objects:

JSONDeserializationSchema deserializationSchema = new JSONDeserializationSchema();
FlinkKafkaConsumer08<ObjectNode> kafkaConsumer = new FlinkKafkaConsumer08<>("topic", deserializationSchema, properties);
DataStream<ObjectNode> stream = env.addSource(kafkaConsumer);

3. Extend the AbstractDeserializationSchema abstract class to implement a fully custom deserializer; the two relevant methods are:

public abstract T deserialize(byte[] message) throws IOException;
public TypeInformation<T> getProducedType();

Here, deserialize() performs the actual deserialization, while getProducedType() returns the type information of the produced objects; AbstractDeserializationSchema already derives the latter from its type parameter, so subclasses usually only need to implement deserialize().
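
A minimal sketch of such a custom schema is shown below. The record format (a UTF-8 "word,count" string) and the WordCount POJO are assumptions made purely for this example; depending on the Flink version, AbstractDeserializationSchema lives in org.apache.flink.api.common.serialization (1.4+) or org.apache.flink.streaming.util.serialization (older releases).

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.serialization.AbstractDeserializationSchema;

/** Hypothetical POJO produced by the schema below (example type, not part of Flink). */
public class WordCount {
    public String word;
    public int count;

    public WordCount() {}

    public WordCount(String word, int count) {
        this.word = word;
        this.count = count;
    }
}

/** Custom schema: parses a UTF-8 "word,count" record into a WordCount object. */
class WordCountSchema extends AbstractDeserializationSchema<WordCount> {

    @Override
    public WordCount deserialize(byte[] message) throws IOException {
        // the Kafka record value arrives as a raw byte[]; it is assumed to hold "word,count"
        String[] fields = new String(message, StandardCharsets.UTF_8).split(",");
        return new WordCount(fields[0], Integer.parseInt(fields[1]));
    }
}

The schema is then passed to the consumer exactly like SimpleStringSchema, e.g. new FlinkKafkaConsumer08<>("topic", new WordCountSchema(), properties).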

Note: when a record cannot be deserialized, there are two options:
1. Let the deserializer throw an exception, which fails and restarts the whole job.
2. Skip the record and continue with the next one. The job keeps running, but the completeness of the processed data may be affected; see the sketch below.
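
One way to implement the second option (a sketch that modifies the deserialize() method of the hypothetical WordCountSchema above) is to return null for unparseable records, which the Flink Kafka consumer treats as "skip this record and continue":

@Override
public WordCount deserialize(byte[] message) throws IOException {
    try {
        String[] fields = new String(message, StandardCharsets.UTF_8).split(",");
        return new WordCount(fields[0], Integer.parseInt(fields[1]));
    } catch (RuntimeException e) {
        // returning null causes the Flink Kafka consumer to silently skip the corrupted record
        return null;
    }
}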
