

  • Collects files from disk and sends them to HDFS
  • Data sources: system log files, Python crawler output, port data
  • Data destinations: HDFS, Kafka

Flume components

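  • agent: an independent Flume process that contains the Source, Channel, and Sink components. (The agent runs Flume inside a JVM. Each machine runs one agent, but a single agent can contain multiple sources and sinks.)
  • source: after capturing an event, the Source formats it in a source-specific way and then pushes it into one or more Channels.
  • channel: the Channel is a buffer that holds each event until the Sink has finished processing it.
  • sink: the Sink persists the logs or pushes the events on to another Source.
  • Client: produces the data and runs in its own thread.
  • Event: a unit of data made up of headers and a body. (An event can be a log record, an Avro object, and so on.)
  • Flow: an abstraction of the movement of an Event from its origin to its destination.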

Flume sources

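  • Taildir: data from files in local directories
  • Avro: a lightweight RPC framework, used to connect one Flume agent to another
  • Kafka
  • NetCat: a networking tool on Linux (the nc command)
  • Exec: a shell command, for example tail -f on a data file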
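Taildir appears in the list above but is not used in the examples later on. A minimal sketch of a Taildir source definition follows, written to a scratch file so it can be inspected before being merged into a real agent config. The agent name a1, source name r1, file group name f1, the log pattern, and the position-file path are placeholders chosen for illustration; the channel and sink sections are left out.

```shell
#!/bin/sh
# Sketch only: writes a Taildir source fragment to a scratch directory.
# CONF_DIR, the file group name f1, and the paths below are illustrative placeholders.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
CONF_FILE="$CONF_DIR/taildir-source.conf"

cat > "$CONF_FILE" <<'EOF'
# Taildir source: tails every file that matches the file group pattern
a1.sources = r1
a1.sources.r1.type = TAILDIR
# JSON file in which the source records how far it has read in each file
a1.sources.r1.positionFile = /usr/local/flume/taildir_position.json
a1.sources.r1.filegroups = f1
# Absolute directory, followed by a regex for the file name
a1.sources.r1.filegroups.f1 = /var/log/.*\.log
a1.sources.r1.channels = c1
EOF

echo "wrote $CONF_FILE"
```

Because the read position is stored in the position file, a restarted agent can resume from where it stopped, which tail -f under the Exec source does not do.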

Flume sinks

  • Logger: writes to the console, mostly used for debugging
  • Avro: a lightweight RPC framework, used to connect one Flume agent to another
  • Kafka

Flume channels

  • Memory
  • File
  • Kafka
  • JDBC
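All the examples later on use the memory channel, which loses buffered events if the agent stops. As a hedged sketch, the fragment below shows what a file channel definition could look like instead; the checkpoint and data directories are placeholder paths, not values from these notes.

```shell
#!/bin/sh
# Sketch only: writes a file channel fragment to a scratch directory.
# CONF_DIR and both directories in the fragment are illustrative placeholders.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
CONF_FILE="$CONF_DIR/file-channel.conf"

cat > "$CONF_FILE" <<'EOF'
# File channel: events are persisted on local disk until the sink takes them
a1.channels = c1
a1.channels.c1.type = file
# Where the channel keeps its checkpoint
a1.channels.c1.checkpointDir = /usr/local/flume/checkpoint
# Comma-separated list of directories for the event log files
a1.channels.c1.dataDirs = /usr/local/flume/data
EOF

echo "wrote $CONF_FILE"
```

The trade-off is throughput: the memory channel is faster, the file channel survives an agent restart.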

Installing Flume

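  • Flume 1.9.0 download link
  • mkdir -p /usr/local/flume && tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /usr/local/flume --strip-components=1
  • cd /usr/local/flume/conf && cp flume-env.sh.template flume-env.sh
  • Add the environment variables: open /etc/profile with vim, add the lines below, then run source /etc/profile
export FLUME_HOME=/usr/local/flume
export PATH=$PATH:$FLUME_HOME/bin
  • Check the Flume version: flume-ng version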


  • Create the agent config file: touch /usr/local/flume/job/netcat-flume-logger.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
  • Start the agent
flume-ng agent -n a1 -c /usr/local/flume/conf -f /usr/local/flume/job/netcat-flume-logger.conf -Dflume.root.logger=INFO,console
  • Send data to the Flume source: install netcat, connect to the port, and type a few lines
yum -y install netcat
nc localhost 44444


  • Create the agent config file: touch /usr/local/flume/job/file-flume-logger.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /var/log/hive.log

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
  • Start the agent: flume-ng agent -n a1 -c /usr/local/flume/conf -f /usr/local/flume/job/file-flume-logger.conf -Dflume.root.logger=INFO,console
  • Write data to the log file to test the agent
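One way to carry out that test is to append a few lines to the tailed file and watch the agent console. The sketch below defaults to a scratch file so it is safe to run anywhere; LOG_FILE is a variable introduced here for illustration, and setting it to /var/log/hive.log targets the file tailed by the agent above.

```shell
#!/bin/sh
# LOG_FILE defaults to a scratch file; run with LOG_FILE=/var/log/hive.log
# to feed the agent configured above (requires write access to that file).
LOG_FILE="${LOG_FILE:-$(mktemp)}"

i=1
while [ "$i" -le 3 ]; do
    # Each appended line becomes one event once tail -f picks it up
    echo "test event $i" >> "$LOG_FILE"
    i=$((i + 1))
done

echo "appended 3 lines to $LOG_FILE"
```

With the logger sink, each line should then show up as an Event entry in the agent's console output.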


  • Add the Hadoop jar files required by the HDFS sink to Flume's lib directory

  • Create the agent config file: touch /usr/local/flume/job/file-flume-hdfs.conf
# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -f /var/log/hive.log

# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://hadoop01:9000/flume/%Y%m%d/%H
a2.sinks.k2.hdfs.filePrefix = logs-
a2.sinks.k2.hdfs.round = true
a2.sinks.k2.hdfs.roundValue = 1
a2.sinks.k2.hdfs.roundUnit = hour
a2.sinks.k2.hdfs.useLocalTimeStamp = true
a2.sinks.k2.hdfs.batchSize = 1000
a2.sinks.k2.hdfs.fileType = DataStream
a2.sinks.k2.hdfs.rollInterval = 30
a2.sinks.k2.hdfs.rollSize = 134217700
a2.sinks.k2.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
  • Start the agent: flume-ng agent -n a2 -c /usr/local/flume/conf -f /usr/local/flume/job/file-flume-hdfs.conf
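Two values in the HDFS sink config above are easier to read with a little arithmetic. The sketch below assumes the cluster uses a 128 MiB HDFS block size (a common default, not stated in these notes) and shows that hdfs.rollSize sits just under one block, and how the %Y%m%d/%H escapes in hdfs.path expand for the current local time.

```shell
#!/bin/sh
# Assumption: the cluster uses a 128 MiB HDFS block size
BLOCK_SIZE=$((128 * 1024 * 1024))
ROLL_SIZE=134217700   # hdfs.rollSize from the config above

echo "block size: $BLOCK_SIZE bytes"
echo "rollSize is $((BLOCK_SIZE - ROLL_SIZE)) bytes below one block"

# %Y, %m, %d and %H in hdfs.path carry the same meaning as in date(1);
# with useLocalTimeStamp = true they are filled from the agent's local clock.
HDFS_DIR="/flume/$(date +%Y%m%d/%H)"
echo "events arriving now would land under $HDFS_DIR"
```

With rollInterval = 30 and rollCount = 0, a file is closed after 30 seconds or once it reaches rollSize, whichever comes first, and never because of the event count.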


  • Create the agent config file: touch /usr/local/flume/job/dir-flume-hdfs.conf

# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = spooldir
a2.sources.r2.spoolDir = /home/hadoop/upload
a2.sources.r2.ignorePattern = ([^ ]*\.tmp)

# Describe the sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://hadoop01:9000/flume/%Y%m%d/%H
a2.sinks.k2.hdfs.filePrefix = logs-
a2.sinks.k2.hdfs.round = true
a2.sinks.k2.hdfs.roundValue = 1
a2.sinks.k2.hdfs.roundUnit = hour
a2.sinks.k2.hdfs.useLocalTimeStamp = true
a2.sinks.k2.hdfs.batchSize = 1000
a2.sinks.k2.hdfs.fileType = DataStream
a2.sinks.k2.hdfs.rollInterval = 30
a2.sinks.k2.hdfs.rollSize = 134217700
a2.sinks.k2.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
  • Start the agent: flume-ng agent -n a2 -c /usr/local/flume/conf -f /usr/local/flume/job/dir-flume-hdfs.conf
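The ignorePattern in the spooldir config above decides which files in the upload directory are skipped. The sketch below approximates it with grep against a scratch directory; the three file names are made up for illustration, and grep -x is used on the assumption that the pattern has to match the whole file name.

```shell
#!/bin/sh
# Scratch directory standing in for /home/hadoop/upload; file names are made up
SPOOL_DIR="$(mktemp -d)"
touch "$SPOOL_DIR/app.log" "$SPOOL_DIR/part-0001.tmp" "$SPOOL_DIR/notes.txt"

# Same regex as a2.sources.r2.ignorePattern; -x requires a whole-name match
IGNORED="$(ls "$SPOOL_DIR" | grep -Ex '([^ ]*\.tmp)')"
PICKED_UP="$(ls "$SPOOL_DIR" | grep -Evx '([^ ]*\.tmp)')"

echo "ignored:   $IGNORED"
echo "picked up: $PICKED_UP"
```

In practice this lets a producer write to a .tmp name first and rename the file once it is complete, so the source never reads a file that is still being written.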



a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind =
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start: flume-ng agent -n a1 -c /usr/local/src/flume/conf/ -f /usr/local/src/flume/conf/avro-master.conf -Dflume.root.logger=INFO,console


a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
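a1.sources.r1.command = tail -f /var/log/hive.log

# Describe the sink
a1.sinks.k1.type = avro
# hostname is required by the Avro sink: set it to the address the receiving agent binds to
a1.sinks.k1.hostname =
a1.sinks.k1.port = 44444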
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Start: flume-ng agent -n a1 -c /usr/local/src/flume/conf/ -f /usr/local/src/flume/conf/avro-slave.conf -Dflume.root.logger=INFO,console



a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /var/log/hive.log

# Describe the sink
# Configure the KafkaSink; the class name is case-sensitive
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = flume_topic
# Kafka broker addresses with ports; separate multiple addresses with commas
a1.sinks.k1.kafka.bootstrap.servers =
a1.sinks.k1.serializer.class = kafka.serializer.StringEncoder
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

2. Start Flume: flume-ng agent -n a1 -c /usr/local/src/flume/conf/ -f /usr/local/src/flume/conf/flume-kafka.conf -Dflume.root.logger=INFO,console
3. Start a consumer: kafka-console-consumer.sh --topic flume_topic --from-beginning --bootstrap-server localhost:9092
4. Test: echo 666 >> /var/log/hive.log
