Importing Data into Hive with the Flume HDFS Sink

Overall flow: an Avro source receives the data, it passes through a SPILLABLEMEMORY channel, the HDFS sink lands it in HDFS, and finally a script run by the scheduling system loads it into Hive.

The original plan was to use the Hive sink:

logger.sources = r1
logger.sinks = k1
logger.channels = c1

# Describe/configure the source
logger.sources.r1.type = Avro
logger.sources.r1.bind = 0.0.0.0
logger.sources.r1.port = 6666

#Spillable Memory Channel
logger.channels.c1.type=SPILLABLEMEMORY
logger.channels.c1.checkpointDir = /data/flume/checkpoint
logger.channels.c1.dataDirs = /data/flume

# Describe the sink
logger.sinks.k1.type = hive
logger.sinks.k1.hive.metastore = thrift://hadoop01.com:9083
logger.sinks.k1.hive.database = tmp
logger.sinks.k1.hive.table = app_log
logger.sinks.k1.hive.partition = %y-%m-%d-%H-%M
logger.sinks.k1.batchSize = 10000
logger.sinks.k1.useLocalTimeStamp = true
logger.sinks.k1.round = true
logger.sinks.k1.roundValue = 10
logger.sinks.k1.roundUnit = minute
logger.sinks.k1.serializer = DELIMITED
logger.sinks.k1.serializer.delimiter = "\n"
logger.sinks.k1.serializer.serdeSeparator = '\t'
logger.sinks.k1.serializer.fieldnames =log

# Bind the source and sink to the channel
logger.sources.r1.channels = c1
logger.sinks.k1.channel=c1

But development with it ran into one pitfall after another and all kinds of inexplicable errors, so in the end it was abandoned.
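For reference, the Hive sink writes through Hive transactions, so its target table has to be bucketed, stored as ORC, and marked transactional, which adds setup on the Hive side. A minimal sketch of that kind of DDL (table name, column list, and bucket count are illustrative, not from the original setup):

# Sketch only: the Flume Hive sink streams via Hive transactions, so the
# target table must be bucketed, ORC-backed, and transactional.
# Table name, columns and bucket count here are placeholders.
hive -e "
CREATE TABLE tmp.app_log_stream (
  log STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (log) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
"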

1. flume.conf

logger.sources = r1
logger.sinks = k1
logger.channels = c1

# Describe/configure the source
logger.sources.r1.type = Avro
logger.sources.r1.bind = 0.0.0.0
logger.sources.r1.port = 6666

#Spillable Memory Channel
logger.channels.c1.type=SPILLABLEMEMORY
logger.channels.c1.checkpointDir = /data/flume/checkpoint
logger.channels.c1.dataDirs = /data/flume

# Describe the sink
logger.sinks.k1.type = hdfs
logger.sinks.k1.hdfs.path = hdfs://zsCluster/collection-logs/buried-logs/dt=%Y-%m-%d/
logger.sinks.k1.hdfs.filePrefix = collection-%Y-%m-%d_%H
logger.sinks.k1.hdfs.fileSuffix = .log
logger.sinks.k1.hdfs.useLocalTimeStamp = true
logger.sinks.k1.hdfs.round = false
logger.sinks.k1.hdfs.roundValue = 10
logger.sinks.k1.hdfs.roundUnit = minute
logger.sinks.k1.hdfs.batchSize = 1000
logger.sinks.k1.hdfs.minBlockReplicas=1
# fileType: the default is SequenceFile.
# With DataStream the file is not compressed and hdfs.codeC does not need to be set; with CompressedStream a valid hdfs.codeC must be set.
logger.sinks.k1.hdfs.fileType=DataStream
logger.sinks.k1.hdfs.writeFormat=Text
logger.sinks.k1.hdfs.rollSize=0
logger.sinks.k1.hdfs.rollInterval=600
logger.sinks.k1.hdfs.rollCount=0
logger.sinks.k1.hdfs.callTimeout = 60000

# Bind the source and sink to the channel
logger.sources.r1.channels = c1
logger.sinks.k1.channel=c1

A few things to note here:

1) In logger.sinks.k1.hdfs.path, the path ends with dt=%Y-%m-%d; dt is the partition column of the Hive table created later.

2) logger.sinks.k1.hdfs.writeFormat is set to Text (the default is Writable).

For what each of the remaining options means, see the official Flume 1.9.0 User Guide: https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#hdfs-sink

A Chinese translation (probably the most complete one available) is at: https://flume.liyifeng.org/#hdfs-sink
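Once the configuration above is saved as flume.conf, the agent can be launched roughly like this (the --name value must match the "logger" property prefix used in the file; the conf directory and file paths are placeholders for your environment):

# Launch the agent; --name must match the property prefix ("logger") in flume.conf.
# The paths below are placeholders.
flume-ng agent \
  --conf /opt/flume/conf \
  --conf-file /opt/flume/conf/flume.conf \
  --name logger \
  -Dflume.root.logger=INFO,console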

2. Create the temporary Hive table

CREATE EXTERNAL TABLE tmp.app_log (
  log STRING
)
COMMENT 'log information'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/collection-logs/buried-logs/';

Since my data is JSON, the table has only a single column for now; the fields are split out later in the business layer. The partition column name must match the one used in hdfs.path in the previous step (dt in both places). Because writeFormat was set to Text in the previous step, STORED AS must use TextInputFormat here. LOCATION must match the hdfs.path directory above.
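For illustration, the later splitting can be done with Hive's built-in get_json_object on the single log column; the JSON keys and the partition value below are made up, since the actual log schema isn't shown here:

# Illustrative only: $.uid, $.event, $.ts and the partition value are placeholders.
# Outer single quotes keep the shell from expanding the $ signs.
hive -e '
SELECT get_json_object(log, "$.uid")   AS uid,
       get_json_object(log, "$.event") AS event,
       get_json_object(log, "$.ts")    AS ts
FROM tmp.app_log
WHERE dt = "2024-01-01";
'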

3. Add the partition

ALTER TABLE tmp.app_log ADD PARTITION (dt = '${last_date}');

4. Load the temporary table into the ODS layer

INSERT OVERWRITE TABLE ods.app_log PARTITION (dt = '${last_date}')
SELECT log FROM tmp.app_log WHERE dt = '${last_date}' AND log <> '';
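Steps 3 and 4 are what the scheduling system actually executes each day. A minimal sketch of such a job script, assuming GNU date and the hive CLI are available (the script itself and its date arithmetic are not part of the original post):

#!/usr/bin/env bash
# Daily job sketch: register yesterday's partition on the staging table,
# then load it into the ODS table. Assumes GNU date and the hive CLI.
set -euo pipefail

last_date=$(date -d "1 day ago" +%Y-%m-%d)

hive -e "
ALTER TABLE tmp.app_log ADD IF NOT EXISTS PARTITION (dt = '${last_date}');

INSERT OVERWRITE TABLE ods.app_log PARTITION (dt = '${last_date}')
SELECT log FROM tmp.app_log WHERE dt = '${last_date}' AND log <> '';
"

IF NOT EXISTS keeps reruns from failing when the partition has already been added.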
