Flume: Collecting a File into HDFS (Tailing File Content)

Collection requirement: a business system produces logs with log4j, and the log file keeps growing; the data appended to that log file needs to be collected into HDFS in real time.

For offline analysis, put the data in HDFS; for real-time analysis, put it in Kafka.
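This article follows the HDFS path. For reference, Flume also ships a Kafka sink, so the same source could feed Kafka for the real-time case. A minimal sketch, assuming Flume 1.7+ property names; the broker address and topic here are placeholders:

agent1.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink
# Placeholder broker list and topic; substitute your own Kafka cluster and topic
agent1.sinks.sink1.kafka.bootstrap.servers = hdp-node-01:9092
agent1.sinks.sink1.kafka.topic = weblog
agent1.sinks.sink1.channel = channel1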

Based on this requirement, first define the following three key elements:

  1. The collection source (source): monitor updates to the file content, using an exec source with 'tail -F file'
  2. The sink target (sink): the HDFS file system, i.e. the hdfs sink
  3. The channel between source and sink: either a file channel or a memory channel can be used

 

Writing the configuration file:

agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Describe/configure tail -F source1
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -F /home/hadoop/logs/access_log
agent1.sources.source1.channels = channel1

# Configure the host interceptor for the source
agent1.sources.source1.interceptors = i1
agent1.sources.source1.interceptors.i1.type = host
agent1.sources.source1.interceptors.i1.hostHeader = hostname

# Describe sink1
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://hdp-node-01:9000/weblog/flume-collection/%y-%m-%d/%H-%M
agent1.sinks.sink1.hdfs.filePrefix = access_log
agent1.sinks.sink1.hdfs.maxOpenFiles = 5000
agent1.sinks.sink1.hdfs.batchSize = 100
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
# Roll a new HDFS file every 100 KB, 1,000,000 events, or 60 seconds, whichever comes first
agent1.sinks.sink1.hdfs.rollSize = 102400
agent1.sinks.sink1.hdfs.rollCount = 1000000
agent1.sinks.sink1.hdfs.rollInterval = 60
# Round the time escapes in the path down to 10-minute buckets
agent1.sinks.sink1.hdfs.round = true
agent1.sinks.sink1.hdfs.roundValue = 10
agent1.sinks.sink1.hdfs.roundUnit = minute
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true

# Use a channel which buffers events in memory
agent1.channels.channel1.type = memory
agent1.channels.channel1.keep-alive = 120
agent1.channels.channel1.capacity = 500000
agent1.channels.channel1.transactionCapacity = 600

# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1

The key lines are these two:

agent1.sources.source1.type = exec

agent1.sources.source1.command = tail -F /home/hadoop/logs/access_log

The exec source type is described in the official Flume documentation.

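One caveat: exec with tail -F does not track its read position, so events can be lost or duplicated if the agent restarts. On Flume 1.7 and later, the TAILDIR source is a more robust alternative because it records its offsets in a position file. A minimal sketch (the position-file path is only an example):

agent1.sources.source1.type = TAILDIR
# Where the source stores its read offsets between restarts (example path)
agent1.sources.source1.positionFile = /home/hadoop/flume/taildir_position.json
agent1.sources.source1.filegroups = f1
agent1.sources.source1.filegroups.f1 = /home/hadoop/logs/access_log
agent1.sources.source1.channels = channel1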

In the Flume installation directory:

vi tail-hdfs.conf

Paste the configuration above into this file.

Change into the log directory (this should be the directory that the exec source tails, /home/hadoop/logs in the configuration above):

cd log

Create a shell script, makelog.sh, to generate test log data:

vi makelog.sh

# Append the current timestamp to the tailed log file every 0.1 s
# (the file name matches the one configured for the exec source)
while true
do
echo `date` >> access_log
sleep 0.1
done

Make the script executable:

chmod +x makelog.sh

Run makelog.sh to simulate continuously generated log data:

sh makelog.sh

You can follow the output with tail -F access_log to confirm that new lines are being appended.
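If you want the generator to keep running while you start Flume from the same session, you can background it; this is just an ordinary shell convenience, not part of the original steps:

nohup sh makelog.sh > /dev/null 2>&1 &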

Start the Flume collection agent:

From Flume's bin directory:

./flume-ng agent -c ../conf/ -f ../tail-hdfs.conf -n agent1 -Dflume.root.logger=INFO,console

Note that the configuration file referenced here is tail-hdfs.conf, and that the agent name passed with -n must match the agent name used in the configuration file (agent1).
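Once events start flowing, you can confirm that Flume is writing files under the configured path; this assumes your HDFS client points at the same NameNode (hdp-node-01:9000) used in the sink configuration:

hdfs dfs -ls -R /weblog/flume-collection/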
