Predefined Sources and Sinks
A SocketWindowWordCount example
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class SocketWindowWordCount {

    public static void main(String[] args) throws Exception {
        // the port is declared final, so it cannot be reassigned after parsing
        final int port;
        try {
            final ParameterTool params = ParameterTool.fromArgs(args);
            port = params.getInt("port");
        } catch (Exception e) {
            System.err.println("No port specified. Please run 'SocketWindowWordCount --port <port>'");
            return;
        }

        // (1) obtain the execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // (2) obtain the input data: here the text typed by the user is read from a socket on the given port
        DataStream<String> text = env.socketTextStream("localhost", port, "\n");

        // (3) transformations: implement the algorithm on the data stream
        DataStream<WordWithCount> windowCounts = text
            // split the input text on whitespace into single words and emit them through the Collector named "out"
            .flatMap(new FlatMapFunction<String, WordWithCount>() {
                @Override
                public void flatMap(String value, Collector<WordWithCount> out) {
                    for (String word : value.split("\\s")) {
                        out.collect(new WordWithCount(word, 1L));
                    }
                }
            })
            // partition the stream into disjoint groups, each holding elements with the same key;
            // identical words end up in the same group, and the following reduce counts them per group
            .keyBy("word")
            // sliding window: every 1 second, evaluate the last 5 seconds of data
            .timeWindow(Time.seconds(5), Time.seconds(1))
            // a "rolling" reduce on the keyed stream: the previously reduced value is combined with the
            // current element to produce a new value, which is emitted downstream.
            // Here the two objects are merged by summing the counts of the word
            .reduce(new ReduceFunction<WordWithCount>() {
                @Override
                public WordWithCount reduce(WordWithCount a, WordWithCount b) {
                    return new WordWithCount(a.word, a.count + b.count);
                }
            });

        // print the results with a single thread, rather than in parallel
        windowCounts.print().setParallelism(1);

        env.execute("Socket Window WordCount");
    }

    // data type for the word count, with two fields and three methods
    public static class WordWithCount {
        // the word and how many times it was seen
        public String word;
        public long count;

        public WordWithCount() {}

        public WordWithCount(String word, long count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return word + " : " + count;
        }
    }
}
The following stream connectors are part of the Flink project, but they are not included in Flink's binary distribution
Flink Kafka Consumer
Flink Kafka Producer
Producer example
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer010;

public class WriteIntoKafka {

    public static void main(String[] args) throws Exception {
        // create execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Map<String, String> properties = new HashMap<>();
        properties.put("bootstrap.servers", "/*broker address*/");
        properties.put("topic", "/*topic*/");

        // parse user parameters
        ParameterTool parameterTool = ParameterTool.fromMap(properties);

        // add a simple source which is writing some strings
        DataStream<String> messageStream = env.addSource(new SimpleStringGenerator());

        // write the stream to Kafka
        messageStream.addSink(new FlinkKafkaProducer010<>(
                parameterTool.getRequired("bootstrap.servers"),
                parameterTool.getRequired("topic"),
                new SimpleStringSchema()));

        messageStream.rebalance().map(new MapFunction<String, String>() {
            // serialization version id
            private static final long serialVersionUID = 1L;

            @Override
            public String map(String value) throws Exception {
                return value;
            }
        });

        messageStream.print();

        env.execute();
    }

    public static class SimpleStringGenerator implements SourceFunction<String> {
        // serialization version id
        private static final long serialVersionUID = 1L;

        volatile boolean running = true;

        @Override
        public void run(SourceContext<String> ctx) throws Exception {
            long i = 0;
            while (running) {
                // emit a simple generated message (replace with your own message/JSON generator if needed)
                ctx.collect("message-" + (i++));
                Thread.sleep(100);
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }
}
Consumer example
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;

public class ReadFromKafka {

    public static void main(String[] args) throws Exception {
        // create execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Map<String, String> properties = new HashMap<>();
        properties.put("bootstrap.servers", "/*broker address*/");
        properties.put("group.id", "test");
        properties.put("enable.auto.commit", "true");
        properties.put("auto.commit.interval.ms", "1000");
        properties.put("auto.offset.reset", "earliest");
        properties.put("session.timeout.ms", "30000");
        properties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.put("topic", "/*topic*/");

        // parse user parameters
        ParameterTool parameterTool = ParameterTool.fromMap(properties);

        FlinkKafkaConsumer010<String> consumer010 = new FlinkKafkaConsumer010<>(
                parameterTool.getRequired("topic"), new SimpleStringSchema(), parameterTool.getProperties());

        DataStream<String> messageStream = env.addSource(consumer010);

        // print() will write the contents of the stream to the TaskManager's standard out stream.
        // The rebalance call causes a repartitioning of the data so that all machines
        // see the messages (for example when "num kafka partitions" < "num flink operators")
        messageStream.rebalance().map(new MapFunction<String, String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public String map(String value) throws Exception {
                return value;
            }
        });

        messageStream.print();

        env.execute();
    }
}
Kafka Consumer: deserializing records
Commonly used schemas
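As an illustration (not from the original notes): SimpleStringSchema, used in the examples above, is the most common built-in schema; a custom DeserializationSchema turns the raw Kafka bytes into Java objects. A minimal sketch, assuming a Flink 1.x dependency (the class name MyStringSchema is hypothetical):

import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;

public class MyStringSchema implements DeserializationSchema<String> {
    @Override
    public String deserialize(byte[] message) {
        // turn the raw Kafka record bytes into a String
        return new String(message, StandardCharsets.UTF_8);
    }

    @Override
    public boolean isEndOfStream(String nextElement) {
        // the Kafka stream is unbounded, so never signal end of stream
        return false;
    }

    @Override
    public TypeInformation<String> getProducedType() {
        return TypeInformation.of(String.class);
    }
}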
Kafka Consumer: start position for consumption
When a job recovers automatically from a checkpoint, or is restored from a manually taken savepoint, the consumption position is recovered from the saved state and this setting is ignored
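A minimal sketch of setting the start position, reusing consumer010 from the ReadFromKafka example above (only one of the calls would normally be used):

// default: start from the offsets committed in Kafka for the configured group.id
consumer010.setStartFromGroupOffsets();
// or: start from the earliest record of each partition
consumer010.setStartFromEarliest();
// or: start from the latest record of each partition
consumer010.setStartFromLatest();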
Kafka Consumer: automatic topic/partition discovery
How it works: a dedicated internal thread periodically fetches Kafka metadata and updates the consumer's partition assignment
flink.partition-discovery.interval-millis: the discovery interval; disabled by default, set a non-negative value to enable it
Partition discovery
Topic discovery
Pattern topicPattern = java.util.regex.Pattern.compile("topic[0-9]");
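A sketch of enabling both kinds of discovery, assuming the same FlinkKafkaConsumer010/SimpleStringSchema imports as in the ReadFromKafka example plus java.util.Properties and java.util.regex.Pattern; the Pattern-based constructor is available in connector versions that support topic discovery, and the broker address and interval are placeholders:

Properties props = new Properties();
props.setProperty("bootstrap.servers", "/*broker address*/");
props.setProperty("group.id", "test");
// partition discovery: look for newly added partitions of the subscribed topics every 10 seconds
props.setProperty("flink.partition-discovery.interval-millis", "10000");

// topic discovery: subscribe to every topic whose name matches the pattern
Pattern topicPattern = Pattern.compile("topic[0-9]");
FlinkKafkaConsumer010<String> consumer =
        new FlinkKafkaConsumer010<>(topicPattern, new SimpleStringSchema(), props);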
Kafka Consumer: how offsets are committed back to Kafka
With checkpointing disabled: the Kafka client's auto-commit settings (enable.auto.commit / auto.commit.interval.ms) drive the commits
With checkpointing enabled: offsets are committed when a checkpoint completes
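A sketch of the checkpointing-enabled case, reusing env and consumer010 from the ReadFromKafka example (the interval is a placeholder):

// offsets are committed back to Kafka each time a checkpoint completes
env.enableCheckpointing(5000);                    // take a checkpoint every 5 s
consumer010.setCommitOffsetsOnCheckpoints(true);  // this is the default behaviour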
Kafka Consumer: timestamp extraction and watermark generation
Per Kafka partition watermarks
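A sketch of per-partition watermarks: calling assignTimestampsAndWatermarks on the consumer itself, rather than on the resulting DataStream, makes the Kafka source generate timestamps and watermarks per Kafka partition. It reuses consumer010 and env from the ReadFromKafka example, assumes imports of BoundedOutOfOrdernessTimestampExtractor and Time, and the record layout used for the timestamp is an assumption:

consumer010.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<String>(Time.seconds(5)) {
            @Override
            public long extractTimestamp(String element) {
                // assumption: the event time is the first comma-separated field of the record
                return Long.parseLong(element.split(",")[0]);
            }
        });
DataStream<String> stream = env.addSource(consumer010);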
Kafka Producer: partitioning
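By default the 0.10 producer uses FlinkFixedPartitioner, so each sink subtask writes to a single Kafka partition; passing a custom FlinkKafkaPartitioner through one of the FlinkKafkaProducer010 constructors changes that. A minimal sketch of a custom partitioner (the class itself is an illustration, not part of the original notes; it assumes an import of org.apache.flink.streaming.connectors.kafka.partitioner.FlinkKafkaPartitioner):

public class RoundRobinPartitioner<T> extends FlinkKafkaPartitioner<T> {
    private int next = 0;

    @Override
    public int partition(T record, byte[] key, byte[] value, String targetTopic, int[] partitions) {
        // cycle through the partitions of the target topic
        next = (next + 1) % partitions.length;
        return partitions[next];
    }
}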
Kafka Producer: fault tolerance
Kafka 0.11: FlinkKafkaProducer011 combines a two-phase-commit sink with Kafka transactions, which can guarantee end-to-end exactly-once processing
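A sketch of an exactly-once producer for Kafka 0.11+, assuming imports of FlinkKafkaProducer011, KeyedSerializationSchemaWrapper, SimpleStringSchema and java.util.Properties, and given some DataStream<String> stream; broker address and topic are placeholders. Note that transaction.timeout.ms must not exceed the broker's transaction.max.timeout.ms:

Properties producerProps = new Properties();
producerProps.setProperty("bootstrap.servers", "/*broker address*/");
producerProps.setProperty("transaction.timeout.ms", "600000");

FlinkKafkaProducer011<String> producer = new FlinkKafkaProducer011<>(
        "/*topic*/",
        new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
        producerProps,
        FlinkKafkaProducer011.Semantic.EXACTLY_ONCE);
stream.addSink(producer);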
An Overview of End-to-End Exactly-Once Processing in Apache Flink® (with Apache Kafka, too!) (www.ververica.com)