Consuming Kafka data with Flink and writing the results to Elasticsearch

Development tools: IntelliJ IDEA, JDK 1.8

Local components: Kafka, Flink, Elasticsearch

The setup of each component on macOS is as follows:

1. Install Kafka locally (kafka_2.12-2.2.0)

Download Kafka: https://kafka.apache.org/quickstart#quickstart_download

In local mode you can use the ZooKeeper bundled with Kafka; there is no need to install it separately.

Start ZooKeeper and Kafka:

bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties

Create a topic:
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic test-cash
Produce data from the command line (see the sample input after this list of commands):
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test-cash

Consume the topic's data:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-cash --from-beginning

Delete the topic:
bin/kafka-topics.sh --delete --bootstrap-server localhost:9092 --topic test-cash
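For reference, each line typed into the console producer can use the comma-separated key/value format that the Flink job in step 6 parses. A hypothetical sample (any format works as long as it matches the parser in MessageSplitter):

"name":"abc","time":"1534445000","count":"80"
"name":"xyz","time":"1534445010","count":"35"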

2. Install Flink & Elasticsearch locally

Download flink-1.7.2:

https://flink.apache.org/downloads.html#apache-flink-172

Enter the Flink directory; start and stop Flink with:

./bin/start-cluster.sh
./bin/stop-cluster.sh

Download Elasticsearch 5.6:

https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.16.tar.gz

Run it locally:  ./bin/elasticsearch

Create an index named myindex-idx:
curl -XPUT "http://localhost:9200/myindex-idx"

Create a mapping for the type name_count:
curl -XPUT "http://localhost:9200/myindex-idx/_mapping/name_count" -d'{"name_count":{"dynamic":"strict","properties":{"name":{"type":"keyword"},"count":{"type":"long"},"time":{"type":"long"}}}}'
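To sanity-check the mapping, a single test document can be indexed by hand (the field values below are just illustrative):

curl -XPOST 'http://localhost:9200/myindex-idx/name_count?pretty' -d '{"name":"abc","count":80,"time":1534445000}'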

Check the version:  curl -XGET "http://localhost:9200"
 
Query data:  curl -XPOST 'http://localhost:9200/myindex-idx/_search?pretty' -d '{"query": { "match_all": {} }}'

Delete data:  curl -XPOST 'http://localhost:9200/myindex-idx/_delete_by_query?pretty' -d '{"query": { "match_all": {} }}'

Inspect the index mapping:  curl -XGET "http://localhost:9200/myindex-idx/_mapping/name_count"

3. Java development: install Maven

Download apache-maven-3.6.1: http://maven.apache.org/download.cgi

Configure the Maven environment variables:

vi ~/.bash_profile

Then add:

M3_HOME=/usr/local/maven/maven3.2.1   (your Maven installation directory)
PATH=$M3_HOME/bin:$PATH
export M3_HOME
export PATH

Save the file and source it:

source ~/.bash_profile

mvn -version
You should now see Maven's version information.

4. Add the flink-kafka-connector and other dependencies in Maven

Note that the dependency versions depend on the versions of the installed components; check the official documentation for the exact mapping between Maven artifacts and component versions. One way to keep the versions centralized is shown after the dependency list below.


    
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.11</artifactId>
    <version>1.8.0</version>
</dependency>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-shaded-jackson</artifactId>
    <version>2.7.9-2.0</version>
</dependency>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-test-utils_2.11</artifactId>
    <version>1.4.0</version>
    <scope>test</scope>
</dependency>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>1.7.0</version>
</dependency>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-elasticsearch5_2.11</artifactId>
    <version>1.3.1</version>
</dependency>

5. Run a Java program that writes test data to Kafka

The code for writing test data to Kafka is as follows:

import java.util.Properties;
import java.util.Random;

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class KafkaProducerDemo {
    public static void main(String[] args) throws InterruptedException {
        // Each entry is a string in a parseable format, e.g. "name":"abc","time":"1534445000","count":"80"
        String[] data = {...};

        Properties props = new Properties();
        props.put("bootstrap.servers", "127.0.0.1:9092");
        props.put("acks", "all");
        props.put("retries", 0);
        props.put("batch.size", 16384);
        props.put("linger.ms", 1);
        props.put("buffer.memory", 33554432);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        Random random = new Random();

        Producer<String, String> producer = new KafkaProducer<>(props);
        int totalMessageCount = 200;
        for (int i = 0; i < totalMessageCount; i++) {
            // Pick a random sample record and send it to the "test" topic
            String value = data[random.nextInt(data.length)];
            producer.send(new ProducerRecord<>("test", value), new Callback() {
                public void onCompletion(RecordMetadata metadata, Exception exception) {
                    if (exception != null) {
                        System.out.println("Failed to send message with exception " + exception);
                    }
                }
            });
            Thread.sleep(10L);
        }
        producer.close();
    }
}

After the producer has run, you can use kafka-console-consumer.sh to check the data in the test topic:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
 

6. Process the Kafka data with Flink

The main code is shown below. It accumulates the count field once per every 10 records for each key, using countWindow and a ReduceFunction:

import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaDemo {
    public static void main(String[] args) {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);

        Properties properties = new Properties();
        properties.setProperty("zookeeper.connect", "127.0.0.1:2181");
        properties.setProperty("bootstrap.servers", "127.0.0.1:9092");
        properties.setProperty("group.id", "test");

        DataStreamSource<String> source = env.addSource(
                new FlinkKafkaConsumer<>("test", new SimpleStringSchema(), properties).setStartFromEarliest());

        DataStream<Tuple3<String, Long, Long>> stream = source
                .map(new MessageSplitter())
                .keyBy(new KeySelector<Tuple3<String, Long, Long>, Tuple2<String, Long>>() {
                    public Tuple2<String, Long> getKey(Tuple3<String, Long, Long> tuple) {
                        return Tuple2.of(tuple.f0, tuple.f1);
                    }
                })                              // group by name and time
                .countWindow(10)                // every 10 records per key
                .reduce(new SummingReducer());  // sum the count field

        try {
            stream.writeAsText("data.txt/res"); // write the result to a file
            env.execute("Kafka sum test");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

// Flink reduce function: adds up the count field (f2)
class SummingReducer implements ReduceFunction<Tuple3<String, Long, Long>> {
    @Override
    public Tuple3<String, Long, Long> reduce(Tuple3<String, Long, Long> value1, Tuple3<String, Long, Long> value2) {
        return new Tuple3<>(value1.f0, value1.f1, value1.f2 + value2.f2);
    }
}

// Parses a Kafka message string into a Tuple3 of (name, time, count)
class MessageSplitter implements MapFunction<String, Tuple3<String, Long, Long>> {
    private final String[] NAMES = {"name", "time", "count"};

    public Tuple3<String, Long, Long> map(String value) throws Exception {
        if (value != null && value.contains(",")) {
            ... // parse value into a Map<String, String> named "map" (omitted in the original; see the sketch below)
            return new Tuple3<>(map.get(NAMES[0]),
                    Long.parseLong(map.get(NAMES[1])),
                    Long.parseLong(map.get(NAMES[2])));
        }
        return new Tuple3<>();
    }
}
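The parsing inside MessageSplitter.map() is elided above. A minimal sketch of one possible implementation, assuming each message follows the comma-separated "key":"value" format from step 5 (the helper name parseMessage and this parsing logic are assumptions, not the original code; it also needs java.util.Map and java.util.HashMap imports):

// Hypothetical helper: parses a line like "name":"abc","time":"1534445000","count":"80" into a Map
private static Map<String, String> parseMessage(String value) {
    Map<String, String> map = new HashMap<>();
    for (String pair : value.split(",")) {
        String[] kv = pair.split(":");
        if (kv.length == 2) {
            // strip the surrounding double quotes from both key and value
            map.put(kv[0].replace("\"", ""), kv[1].replace("\"", ""));
        }
    }
    return map;
}

Inside map(), the elided line would then become: Map<String, String> map = parseMessage(value);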

7. Write the data to Elasticsearch

// Comment out the line that writes to a local file, stream.writeAsText("data.txt/res"), and replace it with the following code, which uses Elasticsearch as the sink

// Additional imports needed at the top of the file:
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch5.ElasticsearchSink;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;

Map<String, String> config = new HashMap<>();
config.put("cluster.name", "elasticsearch");   // Elasticsearch's default cluster name is "elasticsearch"
config.put("bulk.flush.max.actions", "1");     // flush every record to ES; a larger value enables buffering

List<InetSocketAddress> transportAddresses = new ArrayList<>();
transportAddresses.add(new InetSocketAddress(InetAddress.getByName("127.0.0.1"), 9300));
// note: this is the transport port 9300, not the 9200 REST port

stream.addSink(new ElasticsearchSink<>(config, transportAddresses,
        new ElasticsearchSinkFunction<Tuple3<String, Long, Long>>() {
    final String[] OUT_NAMES = {"name", "time", "count"};

    private IndexRequest createIndexRequest(Tuple3<String, Long, Long> element) {
        Map<String, Object> json = new HashMap<>();
        for (int i = 0; i < 3; i++) {
            json.put(OUT_NAMES[i], element.getField(i));
        }
        return Requests.indexRequest()
                .index("myindex-idx")
                .type("name_count")   // matches the mapping created in step 2
                .source(json);
    }

    public void process(Tuple3<String, Long, Long> element, RuntimeContext ctx, RequestIndexer indexer) {
        indexer.add(createIndexRequest(element));
    }
}));
