Collecting HDFS Audit Logs into HDFS with Flume

1. Background

The HDFS audit log grows quickly and in large volume. On the local system disk it must be persisted to HDFS promptly, otherwise entries are lost to log rotation or the disk fills up.
The collected logs feed data governance: detecting and cleaning up obsolete HDFS files and obsolete Hive tables.

2. Implementation

① Download the latest Flume release from the Apache Flume website and unpack it, as sketched below.
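A minimal sketch, assuming version 1.9.0 (the version that appears in the paths later in this post) and /opt/app as the install directory; adjust the version and mirror as needed:

# Fetch and unpack the Flume binary distribution
wget https://archive.apache.org/dist/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz
tar -xzf apache-flume-1.9.0-bin.tar.gz -C /opt/app
cd /opt/app/apache-flume-1.9.0-bin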
② Create audit_log_hdfs.conf with the following contents:

# One source and one channel, with three sinks
a1.sources = r1
a1.sinks = k1 k2 k3
a1.channels = c1

# Source: run tail -F on hdfs-audit.log via an exec source and feed events into c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/logs/hdfs/hdfs-audit.log
a1.sources.r1.channels = c1

# Sink configuration
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
## Partition output by date; the 20%y%m%d escape sequences are resolved from the event timestamp (see useLocalTimeStamp below)
a1.sinks.k1.hdfs.path = /user/hive/warehouse/ods.db/ods_hdfs_audit_log_d/stat_day=20%y%m%d
## File name prefix; the '.' in-use prefix hides files still being written, so Hive/Spark jobs reading the current partition don't hit FileNotFoundException when a temporary file they listed is renamed to its final name
a1.sinks.k1.hdfs.filePrefix = audit-log-sink1
a1.sinks.k1.hdfs.inUsePrefix = .
## With rollInterval and rollCount set to 0, files roll only by size; set rollSize to match the HDFS block size (128 MB)
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream
## Timeout for HDFS operations; since the files are large, allow a generous value
a1.sinks.k1.hdfs.callTimeout = 60000

a1.sinks.k2.type = hdfs
a1.sinks.k2.channel = c1
a1.sinks.k2.hdfs.path = /user/hive/warehouse/ods.db/ods_hdfs_audit_log_d/stat_day=20%y%m%d
a1.sinks.k2.hdfs.filePrefix = audit-log-sink2
a1.sinks.k2.hdfs.inUsePrefix = .
a1.sinks.k2.hdfs.rollInterval = 0
a1.sinks.k2.hdfs.rollSize = 134217728
a1.sinks.k2.hdfs.rollCount = 0
a1.sinks.k2.hdfs.useLocalTimeStamp = true
a1.sinks.k2.hdfs.fileType = DataStream
a1.sinks.k2.hdfs.callTimeout = 60000

a1.sinks.k3.type = hdfs
a1.sinks.k3.channel = c1
a1.sinks.k3.hdfs.path = /user/hive/warehouse/ods.db/ods_hdfs_audit_log_d/stat_day=20%y%m%d
a1.sinks.k3.hdfs.filePrefix = audit-log-sink3
a1.sinks.k3.hdfs.inUsePrefix = .
a1.sinks.k3.hdfs.rollInterval = 0
a1.sinks.k3.hdfs.rollSize = 134217728
a1.sinks.k3.hdfs.rollCount = 0
a1.sinks.k3.hdfs.useLocalTimeStamp = true
a1.sinks.k3.hdfs.fileType = DataStream
a1.sinks.k3.hdfs.callTimeout = 60000

# Channel configuration: event capacity, transaction size, byte capacity and buffer headroom
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacity = 100000000
a1.channels.c1.byteCapacityBufferPercentage = 20

# Wire the source and sinks to the channel (these repeat the bindings already set inline above with identical values, so the duplication is harmless)
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
a1.sinks.k3.channel = c1
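
Two notes on this layout. First, all three HDFS sinks drain the same channel, and each event is taken by exactly one sink, so the sinks act as three parallel writers into the same partition (distinguished by their filePrefix) rather than duplicating data. Second, Flume only creates the stat_day directories; Hive will not see a new partition until it is registered. A minimal sketch for registering today's partition, assuming the table ods.ods_hdfs_audit_log_d already exists in the metastore:

# Run once per day (e.g. from cron) before downstream jobs read the partition
hive -e "ALTER TABLE ods.ods_hdfs_audit_log_d ADD IF NOT EXISTS PARTITION (stat_day='$(date +%Y%m%d)')"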

③ JVM configuration
Increase the JVM heap to avoid OOM and frequent GC pauses.
Since the NameNode host is well provisioned, the heap can be sized generously.
Edit ./conf/flume-env.sh and add:

export JAVA_OPTS="-Xms5g -Xmx20g -XX:+UseG1GC -XX:G1HeapRegionSize=8m -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=45 -XX:G1ReservePercent=10 -XX:+HeapDumpOnOutOfMemoryError -Xloggc:/opt/app/apache-flume-1.9.0-bin/bin/audit_log_gc_$(date +%Y%m%d-%H%M%S).log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:HeapDumpPath=/opt/app/apache-flume-1.9.0-bin/bin"
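
The GC log name is stamped once, when flume-env.sh is sourced at agent startup. To watch GC behavior under load, tail the newest log; a sketch assuming the install path used above:

# Follow the most recently created GC log
tail -f "$(ls -t /opt/app/apache-flume-1.9.0-bin/bin/audit_log_gc_*.log | head -1)"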

④ Startup

nohup ./flume-ng agent -c ../conf -f audit_log_hdfs.conf -n a1 -Dflume.root.logger=INFO,console > audit_log_hdfs.log 2>&1 &
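
Once the agent is running, two quick sanity checks; the paths follow the config above, and the date format matches the stat_day pattern:

# Agent process and effective JVM flags (jps ships with the JDK)
jps -lv | grep org.apache.flume.node.Application
# Files landing in today's partition; in-progress files are hidden by the '.' inUsePrefix
hdfs dfs -ls /user/hive/warehouse/ods.db/ods_hdfs_audit_log_d/stat_day=$(date +%Y%m%d)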
