If reading the English documentation feels inefficient, you can also search for Chinese-language Flume references.
This article records the process I went through when learning Flume, along with its core technical points, so that Flume can be picked up quickly.
Quick Overview
[Overview]Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
Classic use cases:
- Flume collects logs from log servers and streams them to Kafka in real time
- Flume consumes data from a Kafka cluster and lands it in HDFS
The Flume User Guide page has a parent menu on the left as well as an on-page table of contents, so you can quickly jump to what you need (see Appendix 1). Common needs include:
- Quickly getting familiar with the available types of a component, e.g. how to choose a source
- Looking up parameter conventions when defining a component, e.g. the configuration parameters for a given source
- Quickly developing a custom component; see the [Developer Guide]
Core Concepts
- Flume data flow (a minimal wiring sketch follows this list)
- The source reads from the data source and wraps the data into Events (in its process method)
- The event batch then passes through the interceptor chain, where it is processed link by link (events may be modified or dropped)
- The channel selector then picks the target channel(s) according to the configured policy
- Events are written to the selected channel(s) for buffering (the source pushes here)
- The sink processor picks a sink according to the configured policy
- Events are delivered to the selected sink (the sink pulls by polling here)
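For orientation, here is a minimal sketch of how such a flow is wired in an agent configuration. The agent name a1 and the netcat/memory/logger component types are illustrative choices, not taken from the original text:
# one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# netcat source listening on localhost:44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# in-memory channel buffering events between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# logger sink that prints events to the log
a1.sinks.k1.type = logger
# wire the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1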
[Data flow model]A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).
[Flume Interceptors]Flume has the capability to modify/drop events in-flight. This is done with the help of interceptors. Interceptors are classes that implement org.apache.flume.interceptor.Interceptor interface. An interceptor can modify or even drop events based on any criteria chosen by the developer of the interceptor.Flume supports chaining of interceptors.
This is made possible by specifying the list of interceptor builder class names in the configuration. Interceptors are specified as a whitespace separated list in the source configuration. The order in which the interceptors are specified is the order in which they are invoked. The list of events returned by one interceptor is passed to the next interceptor in the chain. Interceptors can modify or drop events. If an interceptor needs to drop events, it just does not return that event in the list that it returns. If it is to drop all events, then it simply returns an empty list. Interceptors are named components, here is an example of how they are created through configuration:
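The configuration example referred to at the end of that excerpt is not reproduced here; the following is a sketch in the same spirit, chaining a host interceptor and a timestamp interceptor on a source (the names a1, r1, i1, i2 are illustrative):
a1.sources = r1
a1.sources.r1.channels = c1
# interceptors are invoked in the order they are listed
a1.sources.r1.interceptors = i1 i2
# i1: put the agent host into the event headers
a1.sources.r1.interceptors.i1.type = host
a1.sources.r1.interceptors.i1.hostHeader = hostname
# i2: add a timestamp header (used later for time-based bucketing, e.g. by the HDFS sink)
a1.sources.r1.interceptors.i2.type = timestamp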
- Core concepts (must understand):
Name | Description |
---|---|
Event | The unit of transfer in the Flume data flow; it is essentially a byte array (body) plus an optional set of attributes (headers) |
Agent | The JVM process that defines where events come from and where they go; think of it as a job or a deployed instance |
Source | Consumes data from an external system and hands it to the channel(s) |
Channel | A staging area (think of a queue, or a channel in Go) that decouples the source and the sink asynchronously |
Sink | Consumes events from the channel and writes them downstream (to another agent or to an external system) |
Channel Selector | Configured on the source; as the name suggests, the selection policy when there are multiple channels. The default is replicating (each event is sent to every channel); multiplexing is optional (events are routed by rules on event attributes) |
Sink Processors | Configured on sinkgroups; the selection policy when there are multiple sinks: failover, or load_balance (with round_robin or random as the balancing strategy). A failover sketch follows the notes below. |
Serializers | Choose which part of the event to serialize (e.g. only the body) and the compression format |
Interceptors | Modify or even drop events (e.g. format validation, adding a timestamp, adding a unique ID, or, when the sink is Kafka, setting a key based on the event content) |
- Notes:
1. Sources and sinks each run in their own threads.
2. The Event interface exposes headers (a map) and body (a byte array); the main implementations are FlumeEvent, JSONEvent, MockEvent, PersistableEvent and SimpleEvent.
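As referenced in the Sink Processors row above, a minimal failover sink group might look like the sketch below, assuming an agent a1 that already defines two sinks k1 and k2 (all names are illustrative):
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
# the sink with the higher priority is used; k2 only takes over when k1 fails
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
# how long (ms) a failed sink is penalized before being retried
a1.sinkgroups.g1.processor.maxpenalty = 10000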
Network Topology
[Setting multi-agent flow]In order to flow the data across multiple agents or hops, the sink of the previous agent and source of the current hop need to be avro type with the sink pointing to the hostname (or IP address) and port of the source.
- When multiple agents are chained, the previous agent's sink and the next agent's source must both be of type avro, with the sink pointing to the hostname (or IP) and port of the source; a sketch follows.
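A minimal sketch of such a hop, assuming agent1 runs on the log-producing machine and agent2 on a host named collector-host (the host name and port 4545 are illustrative):
# agent1: avro sink forwarding events to the next hop
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = collector-host
agent1.sinks.k1.port = 4545
agent1.sinks.k1.channel = c1
# agent2: avro source listening for events from the previous hop
agent2.sources.r1.type = avro
agent2.sources.r1.bind = 0.0.0.0
agent2.sources.r1.port = 4545
agent2.sources.r1.channels = c1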
[Consolidation]A very common scenario in log collection is a large number of log producing clients sending data to a few consumer agents that are attached to the storage subsystem. For example, logs collected from hundreds of web servers sent to a dozen of agents that write to HDFS cluster.
- A very common scenario: many machines produce logs, but only a few are connected to the storage system, so the former collect the logs and the latter consolidate them (in production the consolidation tier is usually more than one machine, i.e. you need to configure sink groups; see the sketch below)
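A sketch of the collecting side: a log-collecting agent load-balances its events across two consolidation agents. The host names collector1/collector2 and port 4545 are illustrative assumptions:
# two avro sinks, one per consolidation agent
agent1.sinks = k1 k2
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = collector1
agent1.sinks.k1.port = 4545
agent1.sinks.k1.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = collector2
agent1.sinks.k2.port = 4545
agent1.sinks.k2.channel = c1
# load-balance events across the sink group
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = k1 k2
agent1.sinkgroups.g1.processor.type = load_balance
agent1.sinkgroups.g1.processor.selector = round_robin
agent1.sinkgroups.g1.processor.backoff = true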
[Multiplexing the flow]Flume supports multiplexing the event flow to one or more destinations. This is achieved by defining a flow multiplexer that can replicate or selectively route an event to one or more channels.
- Flume supports multiplexing the event flow (by configuring a channel selector you can route data to specific destinations: either replicate one event to several channels, or route each event to a particular channel based on its attributes); see the sketch below
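A sketch of a multiplexing channel selector that routes events by the value of a hypothetical header named state:
a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.type = multiplexing
# route by the value of the "state" header
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
# events with any other value fall through to the default channel
a1.sources.r1.selector.default = c3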
How to Use
Almost everything you need is covered in the User Guide, and its examples are fairly detailed.
Here is a common production example: Kafka -> HDFS.
Check the Kafka Source and HDFS Sink documentation as needed (skip this if you already know them).
Kafka Source
[Kafka Source]Kafka Source is an Apache Kafka consumer that reads messages from Kafka topics.
[Security and Kafka Source]Secure authentication as well as data encryption is supported on the communication channel between Flume and Kafka.
SASL_PLAINTEXT - Kerberos or plaintext authentication with no data encryption
- Key parameters:
Property Name | Default | Description |
---|---|---|
kafka.bootstrap.servers | - | Comma-separated list of brokers in the Kafka cluster |
kafka.consumer.group.id | flume | Unique identifier of the consumer group; to increase parallelism, give multiple agents the same consumer group |
kafka.topics | - | Comma-separated list of topics to subscribe to |
kafka.topics.regex | - | Regex defining the set of topics to subscribe to; overrides kafka.topics |
kafka.consumer.security.protocol | PLAINTEXT | Security protocol used to connect to Kafka |
- For example:
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.channels = channel1
# write to the channel in batches: flush once 5000 events have accumulated...
tier1.sources.source1.batchSize = 5000
# ...or once 2000 ms have elapsed, whichever comes first
tier1.sources.source1.batchDurationMillis = 2000
tier1.sources.source1.kafka.bootstrap.servers = localhost:9092
tier1.sources.source1.kafka.topics = test1, test2
# this regex overrides the topics subscribed above
tier1.sources.source1.kafka.topics.regex = ^topic[0-9]$
# consumer groups are commonly named like flume.<topicName>.<index>
tier1.sources.source1.kafka.consumer.group.id = custom.g.id
# communication between Flume and Kafka supports authentication and encryption;
# in enterprises SASL_PLAINTEXT (Kerberos authentication without data encryption) is the most common
tier1.sources.source1.kafka.consumer.security.protocol = SASL_PLAINTEXT
tier1.sources.source1.kafka.consumer.sasl.mechanism = GSSAPI
tier1.sources.source1.kafka.consumer.sasl.kerberos.service.name = kafka
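For the SASL_PLAINTEXT / Kerberos setup above, the Kafka client also needs Kerberos and JAAS configuration passed to the Flume JVM. A sketch, assuming a hypothetical keytab path and principal:
# flume-env.sh: point the JVM at the Kerberos and JAAS configuration
JAVA_OPTS="$JAVA_OPTS -Djava.security.krb5.conf=/etc/krb5.conf"
JAVA_OPTS="$JAVA_OPTS -Djava.security.auth.login.config=/path/to/flume_jaas.conf"
# flume_jaas.conf: credentials used by the Kafka consumer (KafkaClient section)
KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  keyTab="/path/to/keytabs/flume.keytab"
  principal="flume/flumehost1.example.com@EXAMPLE.COM";
};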
- KafkaSource (it is essentially a wrapper around a KafkaConsumer). Only the core logic is shown below, lightly adapted for readability (doProcess runs once per batch); check the source code if you need the full details.
public class KafkaSource extends AbstractPollableSource implements Configurable, BatchSizeSupported {
  protected Status doProcess() throws EventDeliveryException {
    // generate an id for this batch; it is stored in the committed offset metadata
    final String batchUUID = UUID.randomUUID().toString();
    // the whole batch write is wrapped in a try block: offsets are committed only if
    // the write to the channel succeeds, otherwise the source returns BACKOFF
    try {
      // keep consuming until the batch is full or the batch time window has elapsed;
      // each ConsumerRecord is turned into an Event and added to eventList
      while (eventList.size() < batchUpperLimit && System.currentTimeMillis() < maxBatchEndTime) {
        if (it == null || !it.hasNext()) {
          ConsumerRecords<String, byte[]> records = consumer.poll(duration);
          it = records.iterator();
        }
        ConsumerRecord<String, byte[]> message = it.next();
        // headers (topic, partition, timestamp, ...) are built from the record (omitted here)
        byte[] eventBody = message.value();
        event = EventBuilder.withBody(eventBody, headers);
        eventList.add(event);
        // record the next offset to commit for this topic partition
        tpAndOffsetMetadata.put(new TopicPartition(message.topic(), message.partition()),
            new OffsetAndMetadata(message.offset() + 1, batchUUID));
      }
      // hand the batch to the channel processor, which ultimately calls channel.put
      getChannelProcessor().processEventBatch(eventList);
      // commit offsets only after the batch has been written to the channel
      consumer.commitSync(tpAndOffsetMetadata);
      return Status.READY;
    } catch (Exception e) {
      return Status.BACKOFF;
    }
  }
}
HDFS Sink
[HDFS Sink]This sink writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text and sequence files. It supports compression in both file types. The files can be rolled (close current file and create a new one) periodically based on the elapsed time or size of data or number of events. It also buckets/partitions data by attributes like timestamp or machine where the event originated. The HDFS directory path may contain formatting escape sequences that will be replaced by the HDFS sink to generate a directory/file name to store the events.
- Key parameters:
Property Name | Default | Description |
---|---|---|
hdfs.path | - | HDFS directory path (e.g. hdfs://namenode/flume/webdata/) |
hdfs.filePrefix | FlumeData | Prefix of the files Flume creates in the HDFS directory (commonly extended with the instance IP or host name to make troubleshooting easier) |
hdfs.rollInterval | 30 | Roll to a new file based on elapsed time, in seconds (i.e. a new file every 30 s); 0 disables time-based rolling |
hdfs.rollSize | 1024 | Roll to a new file once it reaches this size, in bytes; 0 disables size-based rolling |
hdfs.rollCount | 10 | Roll to a new file after this many events; 0 disables count-based rolling |
hdfs.batchSize | 100 | Number of events written to HDFS per batch before flushing |
hdfs.codeC | - | Compression codec, e.g. snappy |
- For example:
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# custom output path built from escape sequences; see the User Guide for the full list of escapes
a1.sinks.k1.hdfs.path = /user/log/business/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = %[IP]-log-%t
# the leading dot of the suffix is not added automatically
a1.sinks.k1.hdfs.fileSuffix = .snappy
a1.sinks.k1.hdfs.codeC = snappy
# use a large roll size (128 MB here) to avoid lots of small files
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.batchSize = 1000
# the following controls how timestamps are rounded down for directory bucketing: one directory per 10 minutes
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
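To make the rounding concrete: with the settings above, an event whose timestamp header corresponds to, say, 11:54:34 on 2023-06-12 is rounded down to 11:50:00, so the path escapes resolve roughly as follows (the date is only illustrative):
# /user/log/business/%y-%m-%d/%H%M/%S resolves to
/user/log/business/23-06-12/1150/00/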
- HDFSEventSink, core logic (abridged):
public class HDFSEventSink extends AbstractSink implements Configurable, BatchSizeSupported {
  public Status process() throws EventDeliveryException {
    Channel channel = getChannel();
    // uses the Transaction interface directly (begin -> commit / rollback -> close)
    Transaction transaction = channel.getTransaction();
    transaction.begin();
    try {
      // take up to batchSize events from the channel
      for (txnEventCount = 0; txnEventCount < batchSize; txnEventCount++) {
        Event event = channel.take();
        // callback that removes the writer from the sfWriters map once it is closed,
        // so the reference is dropped and the writer can be garbage-collected
        WriterCallback closeCallback = new WriterCallback() {
          @Override
          public void run(String bucketPath) {
            synchronized (sfWritersLock) {
              sfWriters.remove(bucketPath);
            }
          }
        };
        // look up / create the BucketWriter for this event's bucket path (omitted),
        // then append the event via the HDFSWriter; HDFSWriter implementations differ
        // mainly in file format, chosen by fileType in the configuration
        bucketWriter.append(event);
      }
      // flush all pending buckets before committing the transaction
      for (BucketWriter bucketWriter : writers) {
        // HDFSWriter sync: flush all buffered data to the file
        bucketWriter.flush();
      }
      // commit the transaction
      transaction.commit();
      return Status.READY;
    } catch (IOException eIO) {
      transaction.rollback();
      return Status.BACKOFF;
    } catch (Throwable th) {
      transaction.rollback();
      throw new EventDeliveryException(th);
    } finally {
      transaction.close();
    }
  }
}
Conclusion
The goal here is to give a reasonably complete picture of how Flume works, so that development tasks can be completed quickly in almost any scenario, and to provide a basis for designing an automated Flume transport platform.
Appendix
[Appendix 1: table of contents of the Flume User Guide page]
Flume 1.9.0 User Guide
Introduction
Overview
System Requirements
Architecture
Setup
Setting up an agent
Data ingestion
Setting multi-agent flow
Consolidation
Multiplexing the flow
Configuration
Defining the flow
Configuring individual components
Adding multiple flows in an agent
Configuring a multi agent flow
Fan out flow
SSL/TLS support
Source and sink batch sizes and channel transaction capacities
Flume Sources
Avro Source
Thrift Source
Exec Source
JMS Source
Spooling Directory Source
Taildir Source
Twitter 1% firehose Source (experimental)
Kafka Source
NetCat TCP Source
NetCat UDP Source
Sequence Generator Source
Syslog Sources
HTTP Source
Stress Source
Legacy Sources
Custom Source
Scribe Source
Flume Sinks
HDFS Sink
Hive Sink
Logger Sink
Avro Sink
Thrift Sink
IRC Sink
File Roll Sink
Null Sink
HBaseSinks
MorphlineSolrSink
ElasticSearchSink
Kite Dataset Sink
Kafka Sink
HTTP Sink
Custom Sink
Flume Channels
Memory Channel
JDBC Channel
Kafka Channel
File Channel
Spillable Memory Channel
Pseudo Transaction Channel
Custom Channel
Flume Channel Selectors
Replicating Channel Selector (default)
Multiplexing Channel Selector
Custom Channel Selector
Flume Sink Processors
Default Sink Processor
Failover Sink Processor
Load balancing Sink Processor
Custom Sink Processor
Event Serializers
Body Text Serializer
“Flume Event” Avro Event Serializer
Avro Event Serializer
Flume Interceptors
Timestamp Interceptor
Host Interceptor
Static Interceptor
Remove Header Interceptor
UUID Interceptor
Morphline Interceptor
Search and Replace Interceptor
Regex Filtering Interceptor
Regex Extractor Interceptor
Example 1:
Example 2:
Flume Properties
Property: flume.called.from.service
Configuration Filters
Common usage of config filters
Environment Variable Config Filter
Example
External Process Config Filter
Example
Example 2
Hadoop Credential Store Config Filter
Example
Log4J Appender
Load Balancing Log4J Appender
Security
Monitoring
Available Component Metrics
Sources 1
Sources 2
Sinks 1
Sinks 2
Channels
JMX Reporting
Ganglia Reporting
JSON Reporting
Custom Reporting
Reporting metrics from custom components
Tools
File Channel Integrity Tool
Event Validator Tool
Topology Design Considerations
Is Flume a good fit for your problem?
Flow reliability in Flume
Flume topology design
Sizing a Flume deployment
Troubleshooting
Handling agent failures
Compatibility
HDFS
AVRO
Additional version requirements
Tracing
More Sample Configs
Component Summary
Alias Conventions