Flume in Practice (2): collecting multiple files with TAILDIR into their corresponding HDFS directories

Requirements:

  • Different services produce different log files, e.g. server/test_a_20181217.log and server/test_b_20181217.log; the logs are continuously appended to.
  • Flume should collect each log into its own HDFS directory, i.e.:

 server/test_a_20181217.log ——>  /user/hive/logs/ymd=20181217/testa/xxxx.txt

 server/test_b_20181217.log ——>  /user/hive/logs/ymd=20181217/testb/xxxx.txt
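
For a quick test, the "continuously written" logs can be simulated by simply appending lines to the files by hand (the paths follow the examples above; the log line format is only an illustration):

# simulate services writing to their log files
echo "2018-12-17 10:00:00 INFO sample line from test_a" >> /server/test_a_20181217.log
echo "2018-12-17 10:00:00 INFO sample line from test_b" >> /server/test_b_20181217.log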

Solution:

  • Flume turns each log line into an event consisting of a header and a body.
  • At the source, add an identifier for each file group to the header; the HDFS sink then reads that header variable to build the target directory (a debug sketch for checking the headers follows the figure below).
  • Implementation: Taildir Source with filegroups.
  • Architecture:

(Figure: architecture diagram)
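
Before adding the HDFS sink, it is worth confirming that the Taildir source really attaches the headers. A minimal debug sketch (not the production config; the position file path is just an example) that prints events, headers included, to the agent log via a logger sink:

# debug-only agent: tail one file group and print events (headers + body) to the log
a1.sources = r
a1.channels = c
a1.sinks = k
a1.sources.r.type = TAILDIR
a1.sources.r.positionFile = /tmp/taildir_debug.json
a1.sources.r.filegroups = f1
a1.sources.r.filegroups.f1 = /server/test_a_.*log
a1.sources.r.headers.f1.headerKey1 = test_a
a1.sources.r.fileHeader = true
a1.sources.r.fileHeaderKey = file
a1.channels.c.type = memory
a1.sinks.k.type = logger
a1.sources.r.channels = c
a1.sinks.k.channel = c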

Flume configuration on the log machine:

# Agent name is "a1"

a1.sources = r
a1.sinks = k-1 k-2 k-3
a1.channels = c

# *** Log collection ***
# Source configuration
a1.sources.r.type = TAILDIR
# Position file that stores the read offsets (metadata)
a1.sources.r.positionFile = /home/apache-flume-1.8.0-bin/logs/test.json
# File groups: multiple directories/patterns can be monitored
a1.sources.r.filegroups = f1 f2
# test_a
a1.sources.r.filegroups.f1 = /server/test_a_.*log
# header variable consumed by the HDFS sink path
a1.sources.r.headers.f1.headerKey1 = test_a
# test_b
a1.sources.r.filegroups.f2 = /server/test_b_.*log
# header variable consumed by the HDFS sink path
a1.sources.r.headers.f2.headerKey1 = test_b
# Also store the file's absolute path in the header
a1.sources.r.fileHeader = true
a1.sources.r.fileHeaderKey = file
# Sink group
a1.sinkgroups=g 
a1.sinkgroups.g.sinks=k-1 k-2 k-3
a1.sinkgroups.g.processor.type=failover
a1.sinkgroups.g.processor.priority.k-1=10
a1.sinkgroups.g.processor.priority.k-2=5
a1.sinkgroups.g.processor.priority.k-3=1
a1.sinkgroups.g.processor.maxpenalty=10000
# Sink configuration (failover; priorities from high to low — in a real deployment each sink would normally point to a different collector host)
a1.sinks.k-1.type = avro
a1.sinks.k-1.hostname = 192.168.0.1
a1.sinks.k-1.port = 41401
a1.sinks.k-2.type = avro
a1.sinks.k-2.hostname = 192.168.0.1
a1.sinks.k-2.port = 41401
a1.sinks.k-3.type = avro
a1.sinks.k-3.hostname = 192.168.0.1
a1.sinks.k-3.port = 41401
# Channel configuration
a1.channels.c.type = memory
a1.channels.c.capacity = 1000
a1.channels.c.transactionCapacity = 100
# Bind the source and sinks to this channel
a1.sources.r.channels = c
a1.sinks.k-1.channel = c
a1.sinks.k-2.channel = c
a1.sinks.k-3.channel = c
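
Assuming the configuration above is saved as, say, conf/taildir.conf (the file name is only an example), the agent on the log machine can be started with the standard flume-ng command:

bin/flume-ng agent --conf conf --conf-file conf/taildir.conf --name a1 -Dflume.root.logger=INFO,console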

Flume configuration on CDH:

# Source configuration
a1.sources.r.type = avro
a1.sources.r.bind = 0.0.0.0
a1.sources.r.port = 41401
# Sink configuration
a1.sinks.k.type = hdfs
a1.sinks.k.hdfs.fileType=DataStream  
a1.sinks.k.hdfs.useLocalTimeStamp=true
# Target directory built from the header variable set by the Taildir source
a1.sinks.k.hdfs.path = /user/hive/logs/ymd=%Y%m%d/%{headerKey1}
a1.sinks.k.hdfs.filePrefix = log
a1.sinks.k.hdfs.inUseSuffix=.txt
a1.sinks.k.hdfs.writeFormat = Text
a1.sinks.k.hdfs.idleTimeout = 3600
a1.sinks.k.hdfs.batchSize=10
a1.sinks.k.hdfs.rollSize = 0
a1.sinks.k.hdfs.rollInterval = 0
a1.sinks.k.hdfs.rollCount = 0
a1.sinks.k.hdfs.minBlockReplicas = 1
a1.sinks.k.hdfs.round = true
a1.sinks.k.hdfs.roundValue = 1
# Channel configuration
a1.channels.c.type = memory
a1.channels.c.capacity = 1000
a1.channels.c.transactionCapacity = 100
# Bind the source and sink to this channel
a1.sources.r.channels = c
a1.sinks.k.channel = c
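
Once both agents are running, a quick check on HDFS should show each service writing into its own directory (assuming the hdfs.path above and a run date of 20181217):

hdfs dfs -ls /user/hive/logs/ymd=20181217/test_a
hdfs dfs -ls /user/hive/logs/ymd=20181217/test_b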

Appendix

  • Filters (interceptors) in practice, recommended reading on filters: http://www.cnblogs.com/zlslch/p/7244211.html
  • Collection with the Spooling Directory Source (does not support files that are still being written to): https://blog.csdn.net/xiao_jun_0820/article/details/41576999
