Reading Kafka data with Flink and writing it to HDFS in Parquet format


A common pattern in big-data pipelines: external systems push data into Kafka, Flink sits in the middle consuming the Kafka data and applying the business logic, and the processed results then need to land in a database or a file system such as HDFS.
Since Spark is currently the mainstream engine for the downstream computation, the data written to HDFS can be read back as Parquet: spark.read.parquet(path)

Data entity:

public class Prti {
 
    private String passingTime;
 
    private String plateNo;
 
    public Prti() {
    }
 
    // getter and setter methods ...
}

public class FlinkReadKafkaToHdfs {
 
    private final static StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
 
    private final static Properties properties = new Properties();
 
    /**
     * JSON format of the messages sent to Kafka:
     * {"passingTime":"1546676393000","plateNo":"1"}
     */
    public static void main(String[] args) throws Exception {
        init();
        readKafkaToHdfsByReflect(environment, properties);
    }
 
    private static void init() {
        // checkpointing must be enabled: with a bulk format the StreamingFileSink only finalizes Parquet part files on checkpoints
        environment.enableCheckpointing(5000);
        environment.setParallelism(1);
        environment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        // IPs or hostnames of the Kafka brokers, comma separated
        properties.setProperty("bootstrap.servers", "192.168.0.10:9092");
        // only required for Kafka 0.8
//        properties.setProperty("zookeeper.connect", "192.168.0.10:2181");
        // group.id of the Flink consumer
        properties.setProperty("group.id", "test-consumer-group");
        // option 1: point fs.hdfs.hadoopconf at the Hadoop config files shipped with your code
//        properties.setProperty("fs.hdfs.hadoopconf", "...\\src\\main\\resources");
        // option 2: just set the default file-system scheme
        properties.setProperty("fs.default-scheme", "hdfs://hostname:8020");

        properties.setProperty("kafka.topic", "test");
        properties.setProperty("hdfs.path", "hdfs://hostname/test");
        properties.setProperty("hdfs.path.date.format", "yyyy-MM-dd");
        properties.setProperty("hdfs.path.date.zone", "Asia/Shanghai");
        properties.setProperty("window.time.second", "60");
    }

    public static void readKafkaToHdfsByReflect(StreamExecutionEnvironment environment, Properties properties) throws Exception {
        String topic = properties.getProperty("kafka.topic");
        String path = properties.getProperty("hdfs.path");
        String pathFormat = properties.getProperty("hdfs.path.date.format");
        String zone = properties.getProperty("hdfs.path.date.zone");
        Long windowTime = Long.valueOf(properties.getProperty("window.time.second"));
        FlinkKafkaConsumer010<String> flinkKafkaConsumer010 = new FlinkKafkaConsumer010<>(topic, new SimpleStringSchema(), properties);
        KeyedStream<Prti, String> keyedStream = environment.addSource(flinkKafkaConsumer010)
            .map(FlinkReadKafkaToHdfs::transformData)
            .assignTimestampsAndWatermarks(new CustomWatermarks())
            .keyBy(Prti::getPlateNo);

        DataStream<Prti> output = keyedStream.window(TumblingEventTimeWindows.of(Time.seconds(windowTime)))
            .apply(new WindowFunction<Prti, Prti, String, TimeWindow>() {
                @Override
                public void apply(String key, TimeWindow timeWindow, Iterable<Prti> iterable, Collector<Prti> collector) throws Exception {
                    System.out.println("keyBy: " + key + ", window: " + timeWindow.toString());
                    iterable.forEach(collector::collect);
                }
            });
        // write to HDFS in Parquet format
        DateTimeBucketAssigner<Prti> bucketAssigner = new DateTimeBucketAssigner<>(pathFormat, ZoneId.of(zone));
        StreamingFileSink<Prti> streamingFileSink = StreamingFileSink
            .forBulkFormat(new Path(path), ParquetAvroWriters.forReflectRecord(Prti.class))
            .withBucketAssigner(bucketAssigner)
            .build();
        output.addSink(streamingFileSink).name("Hdfs Sink");
        environment.execute("PrtiData");
    }
 
    private static Prti transformData(String data) {
        if (data != null && !data.isEmpty()) {
            JSONObject value = JSON.parseObject(data);
            Prti prti = new Prti();
            // field names match the JSON format documented above
            prti.setPlateNo(value.getString("plateNo"));
            prti.setPassingTime(value.getString("passingTime"));
            return prti;
        } else {
            return new Prti();
        }
    }
 
    private static class CustomWatermarks implements AssignerWithPunctuatedWatermarks<Prti> {

        private Long currentTime = 0L;

        @Nullable
        @Override
        public Watermark checkAndGetNextWatermark(Prti prti, long extractedTimestamp) {
            return new Watermark(currentTime);
        }

        @Override
        public long extractTimestamp(Prti prti, long previousElementTimestamp) {
            Long passingTime = Long.valueOf(prti.getPassingTime());
            currentTime = Math.max(passingTime, currentTime);
            return passingTime;
        }
    }
}

Sending test data to Kafka is omitted in the original post; a minimal producer sketch follows.
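For completeness, here is a minimal producer sketch (not from the original post) that pushes the JSON layout documented in FlinkReadKafkaToHdfs into the test topic; it assumes the standard kafka-clients API and the broker address used in init(), so adjust both to your environment.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PrtiProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "192.168.0.10:9092");
        props.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // KafkaProducer is Closeable, so try-with-resources flushes and closes it
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 1; i <= 10; i++) {
                // same JSON layout as documented in FlinkReadKafkaToHdfs
                String json = "{\"passingTime\":\"" + System.currentTimeMillis()
                        + "\",\"plateNo\":\"" + i + "\"}";
                producer.send(new ProducerRecord<>("test", String.valueOf(i), json));
            }
        }
    }
}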

In the spark shell, run spark.read.parquet("/test/<date path>") to read the data back.
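The same read works from a standalone Spark job; a sketch follows (the date sub-directory, e.g. /test/2019-01-05, is whatever bucket the DateTimeBucketAssigner actually created):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadPrtiParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("ReadPrtiParquet").getOrCreate();
        // the yyyy-MM-dd bucket assigner creates one sub-directory per day
        Dataset<Row> prti = spark.read().parquet("/test/2019-01-05");
        prti.show();
        spark.stop();
    }
}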

Points to note:

StreamingFileSink<Prti> streamingFileSink = StreamingFileSink
            .forBulkFormat(new Path(path), ParquetAvroWriters.forReflectRecord(Prti.class))
            .withBucketAssigner(bucketAssigner)
            .build();

Option 1, the simplest:
ParquetAvroWriters.forReflectRecord(Prti.class)
Option 2: this one puts much stricter requirements on the entity class, which has to be generated from an Avro schema with the avro-maven-plugin:
ParquetAvroWriters.forSpecificRecord(Prti.class)

Write a prti.avsc file with the following content:

{"namespace": "com.xxx.streaming.entity",
  "type": "record",
  "name": "Prti",
  "fields": [
      {"name": "passingTime", "type": "string"},
      {"name": "plateNo",  "type": "string"}
  ]
}

Here com.xxx.streaming.entity is the package into which the generated entity class is placed.

Then add the plugin to the pom:

 
            <plugin>
                <groupId>org.apache.avro</groupId>
                <artifactId>avro-maven-plugin</artifactId>
                <version>1.8.2</version>
                <executions>
                    <execution>
                        <phase>generate-sources</phase>
                        <goals>
                            <goal>schema</goal>
                        </goals>
                        <configuration>
                            <sourceDirectory>${project.basedir}/src/main/resources/</sourceDirectory>
                            <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
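Once mvn generate-sources has produced the Avro-generated Prti (which extends SpecificRecordBase), the sink is built the same way, only with a different writer factory; a sketch under that assumption:

StreamingFileSink<Prti> streamingFileSink = StreamingFileSink
    .forBulkFormat(new Path(path), ParquetAvroWriters.forSpecificRecord(Prti.class))
    .withBucketAssigner(bucketAssigner)
    .build();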

Option 3:
ParquetAvroWriters.forGenericRecord(schema)
This variant takes the Avro Schema parsed from the avsc file and writes GenericRecord elements.
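Not shown in the original post: a sketch of how the generic-record variant could be wired, assuming a DataStream<GenericRecord> named records already exists and that prti.avsc is on the classpath; note that forGenericRecord takes the parsed org.apache.avro.Schema, not the file name.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;

// Schema.Parser#parse(InputStream) throws IOException, so wrap or declare it in real code
Schema schema = new Schema.Parser().parse(
        FlinkReadKafkaToHdfs.class.getResourceAsStream("/prti.avsc"));

StreamingFileSink<GenericRecord> genericSink = StreamingFileSink
    .forBulkFormat(new Path(path), ParquetAvroWriters.forGenericRecord(schema))
    .withBucketAssigner(new DateTimeBucketAssigner<>(pathFormat, ZoneId.of(zone)))
    .build();

records.addSink(genericSink).name("Hdfs Generic Sink");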

A follow-up note:

About the pom dependencies: some readers reported that the core dependency does not show up when searching their Maven repository, yet it downloads fine when declared directly in the pom; it is available in the central repository:

http://repo1.maven.org/maven2/org/apache/flink/flink-parquet/

The core dependency:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-parquet</artifactId>
    <version>1.7.0</version>
</dependency>

The other dependencies (not tidied up; a few of them are not strictly needed):


    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <flink.version>1.7.0</flink.version>
        <slf4j.version>1.7.7</slf4j.version>
        <log4j.version>1.2.17</log4j.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>${slf4j.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>${log4j.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-runtime-web_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-wikiedits_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka-0.10_2.11</artifactId>
            <version>${flink.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>log4j</groupId>
                    <artifactId>log4j</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-filesystem_2.11</artifactId>
            <version>${flink.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>log4j</groupId>
                    <artifactId>log4j</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-avro</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-avro</artifactId>
            <version>1.10.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.parquet</groupId>
            <artifactId>parquet-hadoop</artifactId>
            <version>1.10.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.0</version>
            <exclusions>
                <exclusion>
                    <groupId>log4j</groupId>
                    <artifactId>log4j</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-parquet</artifactId>
            <version>1.7.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-hadoop-compatibility_2.11</artifactId>
            <version>1.7.0</version>
        </dependency>
    </dependencies>

Original article (Chinese): https://blog.csdn.net/u012798083/article/details/85852830
