Flume Learning: A Mini Project Example

Series Articles

  1. Flume Architecture and Practice
  2. Flume Learning: A Mini Project Example

0x01 Abstract

Flume is a distributed, reliable, and efficient log collection service. To learn Flume in more depth, the author built a simple distributed log collection system; it is intended as a reference for beginners.

Note: make sure $FLUME_HOME is set; when flume-ng starts, it looks for jars under $FLUME_HOME/lib.
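
For example, you might add something like the following to your shell profile (the install path below matches the one used in the configs later in this article; adjust it to your environment):

export FLUME_HOME=/Users/chengc/cc/apps/apache-flume-1.8.0-bin
export PATH=$PATH:$FLUME_HOME/bin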

0x02 Project Background

Using a layered log collection architecture, design a highly available distributed log collection system that delivers the collected log data both to the Kafka topic ad_log and to rolling log files on disk.

0x03 Architecture Design

[Figure 1: Architecture diagram of this project]

N log-collection Flume agents are deployed on the target business machines and send their logs to an active/standby pair of aggregation-layer Flume agents. There, a ReplicatingChannelSelector replicates each event into two channels, and the corresponding sinks write the events to Kafka and to rolling log files on disk, respectively.
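
Schematically, the data flows like this:

  log collection layer          log aggregation layer               persistence layer

 +---------------------+  avro  +----------------------+
 | agents 1..N         |------->| collector1 (primary) |--c1--> KafkaSink --> Kafka topic ad_log
 | TAILDIR source      |        | avro source          |
 | file channel        |  fail- | replicating selector |--c2--> file_roll --> rolling files
 | avro sinks k1/k2    |  over  +----------------------+
 +----------+----------+
            |                   +----------------------+
            +------------------>| collector2 (backup)  |--(same pair of sinks)
                                +----------------------+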

0x04 Detailed Design

4.1 Log Collection Layer

Multiple flume-agents can be deployed on the ad_log business machines as needed. The design choices for Flume's three main components are described below.

4.1.1 flume-source

The source uses the TAILDIR type. This type supports watching multiple files or directories via regular expressions, detects newly added files, resumes from a recorded position after restarts, and provides at-least-once delivery (events are not lost, though duplicates are possible after a crash).
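
A minimal TAILDIR sketch, assuming two hypothetical file groups (the group names and paths below are illustrative, not from this project):

a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /var/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
# f1 tails one known file; f2 tails every .log file in a directory
a1.sources.r1.filegroups.f1 = /var/log/app/app.log
a1.sources.r1.filegroups.f2 = /var/log/batch/.*\\.log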

4.1.2 flume-channel

The channel uses a file channel with backup checkpointing enabled, so that after a crash the restarted agent can quickly replay the checkpoint files and recover its data.

4.1.3 flume-sink

The sink group uses the failover type with two sinks, one of which is given a higher priority. If the higher-priority sink goes down, traffic switches over to the other sink and the business is unaffected.
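
A minimal sketch of such a failover group (maxpenalty is an optional knob that caps the backoff applied to a failed sink before it is retried; the values here are illustrative):

agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = k1 k2
agent1.sinkgroups.g1.processor.type = failover
# the sink with the higher priority is used while it is healthy
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 5
# cap the failure backoff at 10 seconds
agent1.sinkgroups.g1.processor.maxpenalty = 10000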

4.2 Log Aggregation Layer

4.2.1 flume-source

The source uses the avro type, which listens on a configured port and receives events sent over RPC from upstream agents.
This project also adds a replicating-type channel selector. A replicating selector writes each event to all of its channels at once, so that sinks serving different purposes can read the same event from different channels.
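
By default, every channel a replicating selector writes to is required: if the write to any required channel fails, the whole transaction fails. Channels can instead be marked optional, in which case failed writes to them are ignored. A minimal sketch (the names mirror this project's collectors, but marking c2 optional is an illustration, not what this project does):

collector1.sources.r1.selector.type = replicating
collector1.sources.r1.channels = c1 c2
# a failed write to c2 no longer fails the event
collector1.sources.r1.selector.optional = c2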

4.2.2 flume-channel

Same as the channel in the collection layer: a file channel with backup checkpointing enabled, so that after a crash the restarted agent can quickly replay the checkpoint files and recover its data.

4.2.3 flume-sink

  • sink1 uses KafkaSink and writes the data to the Kafka topic ad_log. acks is set to 1, balancing the risk of data loss against throughput (other producer settings can be passed through as well, as sketched after this list).
  • sink2 uses a file-rolling sink and writes the data to rolling files on disk.
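
KafkaSink forwards any property prefixed with kafka.producer. to the underlying Kafka producer, so this trade-off can be tuned further. A minimal sketch, assuming you wanted stronger durability plus batch compression (the values are illustrative, not part of this project's config):

# wait for all in-sync replicas instead of just the leader
collector1.sinks.k1.kafka.producer.acks = -1
# compress batches to trade CPU for network and disk
collector1.sinks.k1.kafka.producer.compression.type = snappy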

4.3 Persistence Layer

  • Kafka: one copy of the data is written to Kafka, which, as a message queue, can serve multiple consumer groups.
  • Roll-file: the other copy is written to rolling files on disk.

0x05 Flume-config

5.1 ad_log_agent.conf

agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1 k2

agent1.sources.r1.type = TAILDIR
agent1.sources.r1.positionFile = /Users/chengc/cc/apps/apache-flume-1.8.0-bin/test/ad_log/taildir/position/taildir_position.json
agent1.sources.r1.filegroups = f1
# watch files ending in .log under this directory
agent1.sources.r1.filegroups.f1 = /Users/chengc/cc/apps/apache-flume-1.8.0-bin/test/ad_log/input/.*\\.log
agent1.sources.r1.channels = c1

agent1.channels.c1.type = file
agent1.channels.c1.dataDirs = /Users/chengc/cc/apps/apache-flume-1.8.0-bin/test/ad_log/filechannel/agent1/data
agent1.channels.c1.checkpointDir = /Users/chengc/cc/apps/apache-flume-1.8.0-bin/test/ad_log/filechannel/agent1/checkpoint
agent1.channels.c1.useDualCheckpoints = true
agent1.channels.c1.backupCheckpointDir = /Users/chengc/cc/apps/apache-flume-1.8.0-bin/test/ad_log/filechannel/agent1/backup-checkpoint

agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = k1 k2
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 5

agent1.sinks.k1.type = avro
agent1.sinks.k1.channel = c1
agent1.sinks.k1.hostname = 127.0.0.1
agent1.sinks.k1.port = 8888

agent1.sinks.k2.type = avro
agent1.sinks.k2.channel = c1
agent1.sinks.k2.hostname = 127.0.0.1
agent1.sinks.k2.port = 8889

5.2 ad_log_collect1.conf

collector1.sources = r1
collector1.channels = c1 c2
collector1.sinks = k1 k2

# define an avro source listening on port 8888
collector1.sources.r1.type = avro
collector1.sources.r1.bind = 127.0.0.1
collector1.sources.r1.port = 8888
collector1.sources.r1.threads = 3
# use a replicating selector
collector1.sources.r1.selector.type = replicating
# both channels are required
collector1.sources.r1.channels = c1 c2

# channel c1
collector1.channels.c1.type = file
collector1.channels.c1.dataDirs = /Users/chengc/cc/apps/apache-flume-1.8.0-bin/test/ad_log/filechannel/collector1/c1/data
collector1.channels.c1.checkpointDir = /Users/chengc/cc/apps/apache-flume-1.8.0-bin/test/ad_log/filechannel/collector1/c1/checkpoint
collector1.channels.c1.useDualCheckpoints = true
collector1.channels.c1.backupCheckpointDir = /Users/chengc/cc/apps/apache-flume-1.8.0-bin/test/ad_log/filechannel/collector1/c1/backup-checkpoint

# channel c2
collector1.channels.c2.type = file
collector1.channels.c2.dataDirs = /Users/chengc/cc/apps/apache-flume-1.8.0-bin/test/ad_log/filechannel/collector1/c2/data
collector1.channels.c2.checkpointDir = /Users/chengc/cc/apps/apache-flume-1.8.0-bin/test/ad_log/filechannel/collector1/c2/checkpoint
collector1.channels.c2.useDualCheckpoints = true
collector1.channels.c2.backupCheckpointDir = /Users/chengc/cc/apps/apache-flume-1.8.0-bin/test/ad_log/filechannel/collector1/c2/backup-checkpoint

# sink1: KafkaSink writing to topic ad_log
collector1.sinks.k1.channel = c1
collector1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
collector1.sinks.k1.kafka.topic = ad_log
collector1.sinks.k1.kafka.bootstrap.servers = 127.0.0.1:9092
collector1.sinks.k1.kafka.flumeBatchSize = 10
collector1.sinks.k1.kafka.producer.acks = 1

# sink2: file-rolling sink
collector1.sinks.k2.channel = c2
collector1.sinks.k2.type = file_roll
collector1.sinks.k2.sink.directory = /Users/chengc/cc/apps/apache-flume-1.8.0-bin/test/ad_log/output
collector1.sinks.k2.sink.rollInterval = 60

5.3 ad_log_collect2.conf

collector2.sources = r1
collector2.channels = c1 c2
collector2.sinks = k1 k2

# define an avro source listening on port 8889
collector2.sources.r1.type = avro
collector2.sources.r1.bind = 127.0.0.1
collector2.sources.r1.port = 8889
collector2.sources.r1.threads = 3
# use a replicating selector
collector2.sources.r1.selector.type = replicating
# both channels are required
collector2.sources.r1.channels = c1 c2

# channel c1
collector2.channels.c1.type = file
collector2.channels.c1.dataDirs = /Users/chengc/cc/apps/apache-flume-1.8.0-bin/test/ad_log/filechannel/collector2/c1/data
collector2.channels.c1.checkpointDir = /Users/chengc/cc/apps/apache-flume-1.8.0-bin/test/ad_log/filechannel/collector2/c1/checkpoint
collector2.channels.c1.useDualCheckpoints = true
collector2.channels.c1.backupCheckpointDir = /Users/chengc/cc/apps/apache-flume-1.8.0-bin/test/ad_log/filechannel/collector2/c1/backup-checkpoint

# channel c2
collector2.channels.c2.type = file
collector2.channels.c2.dataDirs = /Users/chengc/cc/apps/apache-flume-1.8.0-bin/test/ad_log/filechannel/collector2/c2/data
collector2.channels.c2.checkpointDir = /Users/chengc/cc/apps/apache-flume-1.8.0-bin/test/ad_log/filechannel/collector2/c2/checkpoint
collector2.channels.c2.useDualCheckpoints = true
collector2.channels.c2.backupCheckpointDir = /Users/chengc/cc/apps/apache-flume-1.8.0-bin/test/ad_log/filechannel/collector2/c2/backup-checkpoint

# sink1: KafkaSink writing to topic ad_log
collector2.sinks.k1.channel = c1
collector2.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
collector2.sinks.k1.kafka.topic = ad_log
collector2.sinks.k1.kafka.bootstrap.servers = 127.0.0.1:9092
collector2.sinks.k1.kafka.flumeBatchSize = 10
collector2.sinks.k1.kafka.producer.acks = 1

# sink2: file-rolling sink
collector2.sinks.k2.channel = c2
collector2.sinks.k2.type = file_roll
collector2.sinks.k2.sink.directory = /Users/chengc/cc/apps/apache-flume-1.8.0-bin/test/ad_log/output
collector2.sinks.k2.sink.rollInterval = 60

0x06 Project Startup Steps

  1. Create the Kafka topic ad_log
    bin/kafka-topics.sh --create --zookeeper 127.0.0.1:2181 --replication-factor 1 --partitions 3 --topic ad_log
  2. Start the ad_log aggregation agent collector1
    bin/flume-ng agent --conf conf --conf-file conf/ad_log/ad_log_collect1.conf --name collector1 -Dflume.root.logger=INFO,console
  3. Start the ad_log aggregation agent collector2
    bin/flume-ng agent --conf conf --conf-file conf/ad_log/ad_log_collect2.conf --name collector2 -Dflume.root.logger=INFO,console
  4. Start the ad_log collection agent
    bin/flume-ng agent --conf conf --conf-file conf/ad_log/ad_log_agent.conf --name agent1 -Dflume.root.logger=INFO,console
  5. Consume the data from the Kafka topic ad_log
    bin/kafka-console-consumer.sh --zookeeper 127.0.0.1:2181 --from-beginning --topic ad_log
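
To verify the end-to-end flow, append a test line to a .log file under the watched input directory (test.log below is an arbitrary name that matches the .*\\.log pattern); the line should appear in the console consumer and, within the 60-second roll interval, in a file under the output directory:

echo "hello ad_log" >> /Users/chengc/cc/apps/apache-flume-1.8.0-bin/test/ad_log/input/test.log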

0xFE Summary

This article is only a small example from the author's process of learning Flume. For more details on using Flume, see the official documentation.

0xFF References

Apache Flume: https://flume.apache.org/
