Consume Kafka data with Flink and write the results to Elasticsearch
Development tools: IntelliJ IDEA, JDK 1.8
Local components: Kafka, Flink, Elasticsearch
Setting up each component on macOS:
1. Install Kafka locally (kafka_2.12-2.2.0)
Download Kafka: https://kafka.apache.org/quickstart#quickstart_download
In local mode you can use the ZooKeeper bundled with Kafka; no separate installation is needed.
Start ZooKeeper and Kafka:
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
Create a topic:
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic test-cash
Produce data from the command line:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test-cash
View the topic's data:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-cash --from-beginning
Delete the topic:
bin/kafka-topics.sh --delete --bootstrap-server localhost:9092 --topic test-cash
2. Install Flink & Elasticsearch locally
Download flink-1.7.2:
https://flink.apache.org/downloads.html#apache-flink-172
Go into the Flink directory, then start & stop Flink with:
./bin/start-cluster.sh
./bin/stop-cluster.sh
Download Elasticsearch 5.6:
https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.16.tar.gz
Run it locally: ./bin/elasticsearch
Create an index named myindex-idx: curl -XPUT "http://localhost:9200/myindex-idx"
Create a mapping type named name_count:
curl -XPUT "http://localhost:9200/myindex-idx/_mapping/name_count" -d'{"name_count":{"dynamic":"strict","properties":{"name":{"type":"keyword"},"count":{"type":"long"},"time":{"type":"long"}}}}'
Check the version: curl -XGET "http://localhost:9200"
Query data: curl -XPOST 'http://localhost:9200/myindex-idx/_search?pretty' -d '{"query": { "match_all": {} }}'
Delete data: curl -XPOST 'http://localhost:9200/myindex-idx/_delete_by_query?pretty' -d '{"query": { "match_all": {} }}'
Check the index mapping: curl -XGET "http://localhost:9200/myindex-idx/_mapping/name_count"
3. Java development: install Maven
Download apache-maven-3.6.1: http://maven.apache.org/download.cgi
Configure Maven's environment variables:
vi ~/.bash_profile
then add:
M3_HOME=/usr/local/maven/apache-maven-3.6.1 (the directory where Maven is installed)
PATH=$M3_HOME/bin:$PATH
export M3_HOME
export PATH
Save the file, run source ~/.bash_profile, and then check:
mvn -version
You should now see Maven's version information.
4. Add flink-kafka-connector and related dependencies in Maven
Note that the dependency versions must match the versions of the components you installed; see the official documentation for the mapping between Maven artifacts and component versions. The dependencies used here:
<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.11</artifactId>
        <version>1.8.0</version>
        <exclusions>
            <exclusion>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-shaded-jackson</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-test-utils_2.11</artifactId>
        <version>1.4.0</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-shaded-jackson</artifactId>
        <version>2.7.9-2.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka_2.11</artifactId>
        <version>1.7.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-elasticsearch5_2.11</artifactId>
        <version>1.3.1</version>
    </dependency>
</dependencies>
5. Run a Java program to write test data into Kafka
The producer code is as follows:
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

import java.util.Properties;
import java.util.Random;

public class KafkaProducerDemo {
    public static void main(String[] args) throws InterruptedException {
        // Test messages; any parseable format (delimited string, JSON, ...) works,
        // e.g. "name":"abc","time":"1534445000","count":"80", as long as step 6 parses it accordingly
        String[] data = {...};
        Properties props = new Properties();
        props.put("bootstrap.servers", "127.0.0.1:9092");
        props.put("acks", "all");
        props.put("retries", 0);
        props.put("batch.size", 16384);
        props.put("linger.ms", 1);
        props.put("buffer.memory", 33554432);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        Random random = new Random();
        Producer<String, String> producer = new KafkaProducer<>(props);
        int totalMessageCount = 200;
        for (int i = 0; i < totalMessageCount; i++) {
            String value = data[random.nextInt(data.length)];
            producer.send(new ProducerRecord<String, String>("test", value), new Callback() {
                public void onCompletion(RecordMetadata metadata, Exception exception) {
                    if (exception != null) {
                        System.out.println("Failed to send message with exception " + exception);
                    }
                }
            });
            Thread.sleep(10L);
        }
        producer.close();
    }
}
Once the producer has run, you can use kafka-console-consumer.sh to check the data in the test topic:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
6. Process the Kafka data with Flink
The main code is below. It sums the count for every 10 records per key, using countWindow and a ReduceFunction:
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Properties;

public class KafkaDemo {
    public static void main(String[] args) {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);
        Properties properties = new Properties();
        properties.setProperty("zookeeper.connect", "127.0.0.1:2181");
        properties.setProperty("bootstrap.servers", "127.0.0.1:9092");
        properties.setProperty("group.id", "test");
        DataStreamSource<String> source = env.addSource(
                new FlinkKafkaConsumer<>("test", new SimpleStringSchema(), properties).setStartFromEarliest());
        DataStream<Tuple3<String, Long, Long>> stream = source.map(new MessageSplitter())
                .keyBy(new KeySelector<Tuple3<String, Long, Long>, Tuple2<String, Long>>() {
                    public Tuple2<String, Long> getKey(Tuple3<String, Long, Long> tuple) {
                        return Tuple2.of(tuple.f0, tuple.f1);
                    }
                }) // group by name and time
                .countWindow(10)
                .reduce(new SummingReducer()); // sum up the count field
        try {
            stream.writeAsText("data.txt/res"); // write the results to a file
            env.execute("Kafka sum test");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    static class SummingReducer implements ReduceFunction<Tuple3<String, Long, Long>> { // Flink aggregation function
        @Override
        public Tuple3<String, Long, Long> reduce(Tuple3<String, Long, Long> value1, Tuple3<String, Long, Long> value2) {
            return new Tuple3<>(value1.f0, value1.f1, value1.f2 + value2.f2);
        }
    }

    static class MessageSplitter implements MapFunction<String, Tuple3<String, Long, Long>> { // parse the Kafka string into a tuple
        private final String[] NAMES = {"name", "time", "count"};

        public Tuple3<String, Long, Long> map(String value) throws Exception {
            if (value != null && value.contains(",")) {
                ... // parse the message into a Map<String, String> named map (see the sketch below)
                return new Tuple3<>(map.get(NAMES[0]),
                        Long.parseLong(map.get(NAMES[1])),
                        Long.parseLong(map.get(NAMES[2])));
            }
            return new Tuple3<>();
        }
    }
}
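The parsing elided above ("...") depends on the message format you chose in step 5. As a rough sketch only, assuming the "name":"abc","time":"1534445000","count":"80" format emitted by KafkaProducerDemo, a hypothetical helper (parseMessage, not part of the original code) could build the map like this:
    // Hypothetical helper for MessageSplitter: turns a line such as
    // "name":"abc","time":"1534445000","count":"80" into a Map<String, String>.
    // Requires java.util.Map and java.util.HashMap; adjust the splitting to whatever format your producer actually emits.
    private static Map<String, String> parseMessage(String value) {
        Map<String, String> map = new HashMap<>();
        for (String pair : value.split(",")) {            // each pair looks like "name":"abc"
            String[] kv = pair.split(":", 2);
            if (kv.length == 2) {
                map.put(kv[0].replace("\"", "").trim(), kv[1].replace("\"", "").trim());
            }
        }
        return map;
    }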
7. Write the data to Elasticsearch
// Comment out the line that writes to a local file, stream.writeAsText("data.txt/res"), and replace it with the code below, which uses Elasticsearch as the sink
// Needs imports for ElasticsearchSink (org.apache.flink.streaming.connectors.elasticsearch5),
// ElasticsearchSinkFunction, RequestIndexer (org.apache.flink.streaming.connectors.elasticsearch),
// RuntimeContext (org.apache.flink.api.common.functions), and IndexRequest / Requests from the Elasticsearch client
Map<String, String> config = new HashMap<>();
config.put("cluster.name", "elasticsearch"); // Elasticsearch's default cluster name is "elasticsearch"
config.put("bulk.flush.max.actions", "1"); // flush every record to ES immediately; a larger value enables buffering
List<InetSocketAddress> transportAddresses = new ArrayList<>();
transportAddresses.add(new InetSocketAddress(InetAddress.getByName("127.0.0.1"), 9300));
// note that the port here is the transport port 9300, not the HTTP port 9200
stream.addSink(new ElasticsearchSink<>(config, transportAddresses, new ElasticsearchSinkFunction<Tuple3<String, Long, Long>>() {
    final String[] OUT_NAMES = {"name", "time", "count"};

    private IndexRequest createIndexRequest(Tuple3<String, Long, Long> element) {
        Map<String, Object> json = new HashMap<>();
        for (int i = 0; i < 3; i++) {
            json.put(OUT_NAMES[i], element.getField(i));
        }
        return Requests.indexRequest()
                .index("myindex-idx")
                .type("name_count") // the mapping type created in step 2
                .source(json);
    }

    public void process(Tuple3<String, Long, Long> element, RuntimeContext ctx, RequestIndexer indexer) {
        indexer.add(createIndexRequest(element));
    }
}));
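Once the Flink job has processed a few windows, the search query from step 2 (curl -XPOST 'http://localhost:9200/myindex-idx/_search?pretty' -d '{"query": { "match_all": {} }}') should return documents containing the name, time, and count fields.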