Building a Big Data ETL Pipeline -- Streaming JSON Conversion -- JSON to Parquet (Part 3)

If the generated log data is already in Avro format, you can use the approach from the previous post ( https://blog.csdn.net/qq_29829081/article/details/80518671 ) to dump the Avro data directly to Parquet. In practice, however, log data is usually not Avro; most of it is JSON. This post therefore covers how to stream JSON into Parquet with Morphline. The example here is deliberately simple; in a real production environment our JSON data is far more complex, but it can still be handled with Morphline in a generic way, which the next post will cover in detail.

This post shows how to use Flume's Morphline Interceptor to convert each JSON event to Avro on the fly, and then use the Kite Dataset sink to persist the data in Parquet format. The key pieces are the Flume configuration and the chain of Morphline commands.
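
The end-to-end flow, as wired together by the configurations below, looks like this:

/home/litao/litao.json (JSON lines)
  -> Flume agent on the nginx host: exec source (tail -F) -> memory channel -> Kafka sink
  -> Kafka topic test_2018-03-14
  -> Flume agent on the Kafka side: Kafka source
       -> static interceptor (injects the flume.avro.schema.url header)
       -> Morphline interceptor (readJson -> extractJsonPaths -> toAvro -> writeAvroToByteArray)
       -> memory channel
  -> Kite DatasetSink -> Parquet files under hdfs://bigocluster/flume/hellotalk/parquet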

1 Flume configuration:

(1) Flume configuration on the nginx side

# Name the components on this agent
a1.sources = r
a1.sinks = k_kafka
a1.channels = c_mem

# Channels info
a1.channels.c_mem.type = memory
a1.channels.c_mem.capacity = 2000
a1.channels.c_mem.transactionCapacity = 300
a1.channels.c_mem.keep-alive = 60

# Sources info
a1.sources.r.type = exec
a1.sources.r.shell = /bin/bash -c
a1.sources.r.command = tail -F /home/litao/litao.json
a1.sources.r.channels = c_mem

# Sinks info
a1.sinks.k_kafka.channel  = c_mem
a1.sinks.k_kafka.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k_kafka.kafka.bootstrap.servers = kafka1:9093,kafka2:9093,kafka3:9093,kafka4:9093,kafka5:9093,kafka6:9093
a1.sinks.k_kafka.kafka.topic = test_2018-03-14
a1.sinks.k_kafka.kafka.flumeBatchSize = 5
a1.sinks.k_kafka.kafka.producer.acks = 1
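
For context, the exec source above assumes each line of /home/litao/litao.json is a standalone JSON object. A hypothetical example line matching the name/age fields extracted later (not taken from the original data):

{"name": "litao", "age": 30}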

(2) Flume configuration on the Kafka side
# Name the components on this agent
a1.channels = c1
a1.sources = r1
a1.sinks  = k1

# Channel config
a1.channels.c1.type = memory
a1.channels.c1.capacity = 500000
a1.channels.c1.transactionCapacity = 100000
a1.channels.c1.keep-alive = 50

# Sources info
a1.sources.r1.type = com.bigo.flume.source.kafka.KafkaSource
a1.sources.r1.channels = c1
a1.sources.r1.kafka.bootstrap.servers = kafka1:9093,kafka2:9093,kafka3:9093,kafka4:9093,kafka5:9093,kafka6:9093
a1.sources.r1.kafka.topics = test_2018-03-14
a1.sources.r1.kafka.consumer.group.id = test_2018-03-14.conf_flume_group
a1.sources.r1.kafka.consumer.timeout.ms = 100
a1.sources.r1.batchSize = 2000

# Config Interceptors
a1.sources.r1.interceptors = i1 morphline

# Inject the Avro schema URL into the event header so the downstream
# Kite DatasetSink can resolve the schema of the Avro event body
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = flume.avro.schema.url
#a1.sources.r1.interceptors.i1.value = file:/home/litao/litao.avsc
a1.sources.r1.interceptors.i1.value = hdfs://bigocluster/user/litao/litao.avsc

# Morphline interceptor config
a1.sources.r1.interceptors.morphline.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
a1.sources.r1.interceptors.morphline.morphlineFile = /etc/flume/conf/a1/morphline.conf
a1.sources.r1.interceptors.morphline.morphlineId = convertJsonToAvro

# Sink config
a1.sinks.k1.type  = org.apache.flume.sink.kite.DatasetSink
a1.sinks.k1.channel  = c1
a1.sinks.k1.kite.dataset.uri  = dataset:hdfs://bigocluster/flume/hellotalk/parquet
a1.sinks.k1.kite.batchSize = 100
a1.sinks.k1.kite.rollInterval = 30
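
The Kite DatasetSink writes into an existing dataset, so the target Parquet dataset at dataset:hdfs://bigocluster/flume/hellotalk/parquet is typically created up front. A minimal sketch using the Kite SDK CLI, assuming the schema file is the litao.avsc referenced by the interceptor (the exact flags may vary with your Kite version):

# Create the Parquet-backed Kite dataset before starting the agent (hypothetical paths)
kite-dataset create dataset:hdfs://bigocluster/flume/hellotalk/parquet \
    --schema /home/litao/litao.avsc \
    --format parquet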

2 Morphline configuration:
morphlines: [
  {
    id: convertJsonToAvro
    importCommands: [ "org.kitesdk.**" ]
    commands: [
      # read the JSON blob
      { readJson: {} }
      # extract JSON objects into fields
      { extractJsonPaths {
        flatten: true
        paths: {
          name: /name
          age: /age
        }
      } }
      # add a creation timestamp to the record
      #{ addCurrentTime {
      #  field: timestamp
      #  preserveExisting: true
      #} }
      # convert the extracted fields to an avro object
      # described by the schema in this field
      { toAvro {
        schemaFile: /home/litao/litao.avsc
      } }
      # serialize the object as avro
      { writeAvroToByteArray: {
        format: containerlessBinary
      } }
    ]
  }
]
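
The toAvro command reads the Avro schema from /home/litao/litao.avsc (the same schema is published to HDFS for the static interceptor). The original schema is not shown in the post, so the following is only an assumed minimal sketch covering the name/age fields extracted above:

{
  "type": "record",
  "name": "litao",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": "int"}
  ]
}
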
3 Required JAR dependencies:
config-1.3.1.jar
metrics-healthchecks-3.0.2.jar
kite-morphlines-core-1.1.0.jar
kite-morphlines-json-1.1.0.jar
kite-morphlines-avro-1.1.0.jar
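
These JARs must be on the Flume agent's classpath. One common approach (paths here are placeholders, adjust to your installation) is to drop them into a plugins.d directory:

# Hypothetical layout; $FLUME_HOME and the plugin name "morphline" are placeholders
mkdir -p $FLUME_HOME/plugins.d/morphline/lib
cp config-1.3.1.jar metrics-healthchecks-3.0.2.jar \
   kite-morphlines-core-1.1.0.jar kite-morphlines-json-1.1.0.jar \
   kite-morphlines-avro-1.1.0.jar \
   $FLUME_HOME/plugins.d/morphline/lib/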
