In the previous hands-on post we wrote a custom Sink function for the WordWithCount job.
Now let's turn to custom Sources.
Using a Kafka Source as the example, this post covers two ways of registering a Table Source, plus a few tips.
To turn a DataStream into a table, we first need the DataStream itself.
With Kafka as the external data source, the following dependencies need to be added to the pom file:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka-0.11_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>
Add the Kafka DataStream:
DataStream<Tuple1<String>> kafkaStream = blinkStreamEnv.addSource(new FlinkKafkaConsumer011<>("test", new AbstractDeserializationSchema<Tuple1<String>>() {
    @Override
    public Tuple1<String> deserialize(byte[] bytes) throws IOException {
        return new Tuple1<>(new String(bytes, "utf-8"));
    }
}, kafkaProperties));
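For completeness, the blinkStreamEnv and kafkaProperties referenced above have to be created first. A minimal setup sketch (the broker address and group id are placeholders; the same setup reappears in the complete examples further down):

// Minimal setup assumed by the snippet above; adjust the broker address and group id as needed.
StreamExecutionEnvironment blinkStreamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
EnvironmentSettings blinkStreamSettings = EnvironmentSettings.newInstance()
        .useBlinkPlanner()
        .inStreamingMode()
        .build();
StreamTableEnvironment blinkStreamTabelEnv = StreamTableEnvironment.create(blinkStreamEnv, blinkStreamSettings);

Properties kafkaProperties = new Properties();
kafkaProperties.setProperty("bootstrap.servers", "127.0.0.1:9092"); // placeholder broker address
kafkaProperties.setProperty("group.id", "test");                    // placeholder consumer group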
Register the table:
// for multiple columns this would be: fromDataStream(kafkaStream, "f0, f1, f2");
Table source=blinkStreamTabelEnv.fromDataStream(kafkaStream,"word");
blinkStreamTabelEnv.registerTable("kafkaDataStream",source);
Although we never declared it as a Table Source explicitly, the registered kafkaDataStream table can be used in all subsequent processing.
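For example, it can be queried with SQL right away (a sketch; the same word-count query shows up again in the complete program below):

// Query the table registered from the DataStream.
Table wordWithCount = blinkStreamTabelEnv.sqlQuery(
        "SELECT count(word) AS _count, word FROM kafkaDataStream GROUP BY word");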
Flink's DataStream and DataSet APIs support a very wide range of types. Composite types such as tuples, POJOs, Scala case classes, and Flink's Row type allow nested data structures with multiple fields, and those fields can be accessed in table expressions.
Such composite types can be consumed by writing a custom deserialization schema.
I like to wrap these custom functions in their own classes: concise and reusable.
import org.apache.flink.api.common.serialization.AbstractDeserializationSchema;
import org.apache.flink.api.java.tuple.Tuple1;

import java.io.IOException;

public final class Tuple1Schema extends AbstractDeserializationSchema<Tuple1<String>> {
    @Override
    public Tuple1<String> deserialize(byte[] bytes) throws IOException {
        return new Tuple1<>(new String(bytes, "utf-8"));
    }
}
Use it in main:
DataStream<Tuple1<String>> kafkaStream = blinkStreamEnv.addSource(new FlinkKafkaConsumer011<>("test", new Tuple1Schema(), kafkaProperties));
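If each Kafka record carries more than one field, the same pattern extends naturally. Below is a hypothetical Tuple2Schema, for illustration only (the record layout, the tab delimiter, and the column names are assumptions, not part of the original job); its stream could then be registered with fromDataStream(kafkaStream, "word, cnt"), matching the multi-column comment above.

import org.apache.flink.api.common.serialization.AbstractDeserializationSchema;
import org.apache.flink.api.java.tuple.Tuple2;

import java.io.IOException;

// Hypothetical schema for illustration: assumes each record looks like "word<TAB>count".
public final class Tuple2Schema extends AbstractDeserializationSchema<Tuple2<String, Long>> {
    @Override
    public Tuple2<String, Long> deserialize(byte[] bytes) throws IOException {
        String[] parts = new String(bytes, "utf-8").split("\t");
        return new Tuple2<>(parts[0], Long.valueOf(parts[1]));
    }
}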
At this point a usable DataStream source table has been registered.
But we can take the customization one step further:
The StreamTableSource interface extends the TableSource interface; the DataStream can be built directly inside its getDataStream method.
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.TableSchema;
import org.apache.flink.table.sources.StreamTableSource;
import org.apache.flink.table.types.DataType;
import java.util.Properties;
public class KafkaTabelSource implements StreamTableSource<String> {
@Override
public DataType getProducedDataType() {
return DataTypes.STRING();
}
@Override
public TableSchema getTableSchema() {
return TableSchema.builder().fields(new String[]{"word"},new DataType[]{DataTypes.STRING()}).build();
}
@Override
public DataStream<String> getDataStream(StreamExecutionEnvironment env) {
Properties kafkaProperties=new Properties();
kafkaProperties.setProperty("bootstrap.servers", "0.0.0.0:9092");
kafkaProperties.setProperty("group.id", "test");
DataStream<String> kafkaStream = env.addSource(new FlinkKafkaConsumer011<>("test", new SimpleStringSchema(), kafkaProperties));
return kafkaStream;
}
}
Usage:
import kmops.Custom.KafkaTabelSource;
import kmops.Custom.MyRetractStreamTableSink;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.table.sinks.RetractStreamTableSink;
import org.apache.flink.table.types.DataType;
import org.apache.flink.types.Row;
public class KafkaSource2 {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment blinkStreamEnv=StreamExecutionEnvironment.getExecutionEnvironment();
EnvironmentSettings blinkStreamSettings= EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
StreamTableEnvironment blinkStreamTabelEnv= StreamTableEnvironment.create(blinkStreamEnv,blinkStreamSettings);
blinkStreamTabelEnv.registerTableSource("kafkaDataStream", new KafkaTabelSource()); // use the custom TableSource
RetractStreamTableSink<Row> retractStreamTableSink = new MyRetractStreamTableSink(new String[]{"_count", "word"}, new DataType[]{DataTypes.BIGINT(), DataTypes.STRING()});
blinkStreamTabelEnv.registerTableSink("sinkTable",retractStreamTableSink);
Table wordWithCount = blinkStreamTabelEnv.sqlQuery("SELECT count(word) AS _count,word FROM kafkaDataStream GROUP BY word ");
wordWithCount.insertInto("sinkTable");
blinkStreamTabelEnv.execute("BLINK STREAMING QUERY");
}
}
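As a side note, the same aggregation could also be expressed with the string-based Table API instead of SQL; a sketch, assuming the same environment and the table/sink names registered above:

// Equivalent of the SQL query above, written with the Table API.
Table wordWithCountApi = blinkStreamTabelEnv
        .scan("kafkaDataStream")                 // read the registered source table
        .groupBy("word")
        .select("word.count as _count, word");   // count per word
wordWithCountApi.insertInto("sinkTable");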
And with that, the custom Source and Sink are done, quite concisely.
It is worth sharing how Kafka is used in the development environment.
For a quick start, see the official quickstart:
http://kafka.apache.org/quickstart
With a Python environment installed and the kafka-python dependency pulled in via pip, the following small program pushes a batch of messages to Kafka:
# https://pypi.org/project/kafka-python/
import time

from kafka import KafkaProducer

# Plain byte values: the Flink job reads them back as UTF-8 strings,
# so no key/value serializer is needed here.
producer = KafkaProducer(bootstrap_servers=['127.0.0.1:9092'])

start_time = time.time()
for i in range(0, 10000):
    print('------{}---------'.format(i))
    producer.send('test', b'Xiaomi')
    producer.send('test', b'Xiaomi')
    producer.send('test', b'Xiaomi')
    producer.send('test', b'Apple')
    producer.send('test', b'Apple')
    producer.send('test', b'Huawei')
    # future = producer.send('test', key='num', value=i, partition=0)

# Push all buffered messages to the broker.
producer.flush()
producer.close()

end_time = time.time()
time_counts = end_time - start_time
Output:
26,Xiaomi
18,Apple
27,Xiaomi
28,Xiaomi
19,Apple
10,Huawei
29,Xiaomi
20,Apple
30,Xiaomi
21,Apple
11,Huawei
31,Xiaomi
22,Apple
32,Xiaomi
33,Xiaomi
12,Huawei
23,Apple
34,Xiaomi
35,Xiaomi
24,Apple
36,Xiaomi
The source code and the small Python Kafka/socket mock programs have been uploaded to GitHub:
https://github.com/StarPlatinumStudio/Flink-SQL-Practice