If reading the English documentation feels inefficient, you can also search for Chinese-language Flume references.
This article records the process I went through when learning Flume, along with its core technical points, so that Flume can be picked up quickly.
Quick Overview
[Overview]Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
Classic use cases:
- Flume collects logs from log servers and streams them to Kafka in real time
- Flume consumes data from a Kafka cluster and lands it in HDFS
The Flume User Guide page has a parent menu on the left as well as an on-page table of contents, so you can quickly jump to what you need (see Appendix 1). Common needs include:
- Quickly getting familiar with the available types of a component, e.g. how to choose a source
- Looking up parameter conventions when defining a component, e.g. the configuration parameters for a given source
- Quickly developing a custom component; see the [Developer Guide]
Core Concepts
- Flume data flow (a minimal wiring sketch follows this list)
- The source reads from the data source and wraps the data into Events (in its process method)
- The event batch then passes through the interceptor chain, where it is processed link by link (events may be modified or dropped)
- The channel selector then picks the target channel(s) according to the configured policy
- Events are written to the selected channel(s) for buffering (the source pushes here)
- The sink processor picks a sink according to the configured policy
- Events are delivered to the selected sink (the sink pulls by polling here)
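For orientation, here is a minimal sketch of how such a flow is wired in an agent configuration. The agent name a1 and the netcat/memory/logger component types are illustrative choices, not taken from the original text:
# one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# netcat source listening on localhost:44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# in-memory channel buffering events between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# logger sink that prints events to the log
a1.sinks.k1.type = logger
# wire the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1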
[Data flow model]A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).
[Flume Interceptors]Flume has the capability to modify/drop events in-flight. This is done with the help of interceptors. Interceptors are classes that implement org.apache.flume.interceptor.Interceptor interface. An interceptor can modify or even drop events based on any criteria chosen by the developer of the interceptor.Flume supports chaining of interceptors.
This is made possible by specifying the list of interceptor builder class names in the configuration. Interceptors are specified as a whitespace separated list in the source configuration. The order in which the interceptors are specified is the order in which they are invoked. The list of events returned by one interceptor is passed to the next interceptor in the chain. Interceptors can modify or drop events. If an interceptor needs to drop events, it just does not return that event in the list that it returns. If it is to drop all events, then it simply returns an empty list. Interceptors are named components, here is an example of how they are created through configuration:
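The configuration example referred to at the end of that excerpt is not reproduced here; the following is a sketch in the same spirit, chaining a host interceptor and a timestamp interceptor on a source (the names a1, r1, i1, i2 are illustrative):
a1.sources = r1
a1.sources.r1.channels = c1
# interceptors are invoked in the order they are listed
a1.sources.r1.interceptors = i1 i2
# i1: put the agent host into the event headers
a1.sources.r1.interceptors.i1.type = host
a1.sources.r1.interceptors.i1.hostHeader = hostname
# i2: add a timestamp header (used later for time-based bucketing, e.g. by the HDFS sink)
a1.sources.r1.interceptors.i2.type = timestamp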
- Core concepts (must understand):
Name | Description |
---|---|
Event | The unit of transfer in the Flume data flow; it is essentially a byte array (body) plus an optional set of attributes (headers) |
Agent | The JVM process that defines where events come from and where they go; think of it as a job or a deployed instance |
Source | Consumes data from an external system and hands it to the channel(s) |
Channel | A staging area (think of a queue, or a channel in Go) that decouples the source and the sink asynchronously |
Sink | Consumes events from the channel and writes them downstream (to another agent or to an external system) |
Channel Selector | Configured on the source; as the name suggests, the selection policy when there are multiple channels. The default is replicating (each event is sent to every channel); multiplexing is optional (events are routed by rules on event attributes) |
Sink Processors | Configured on sinkgroups; the selection policy when there are multiple sinks: failover, or load_balance (with round_robin or random as the balancing strategy). A failover sketch follows the notes below. |
Serializers | Choose which part of the event to serialize (e.g. only the body) and the compression format |
Interceptors | Modify or even drop events (e.g. format validation, adding a timestamp, adding a unique ID, or, when the sink is Kafka, setting a key based on the event content) |
- Notes:
1. Sources and sinks each run in their own threads.
2. The Event interface exposes headers (a map) and body (a byte array); the main implementations are FlumeEvent, JSONEvent, MockEvent, PersistableEvent and SimpleEvent.
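As referenced in the Sink Processors row above, a minimal failover sink group might look like the sketch below, assuming an agent a1 that already defines two sinks k1 and k2 (all names are illustrative):
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
# the sink with the higher priority is used; k2 only takes over when k1 fails
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
# how long (ms) a failed sink is penalized before being retried
a1.sinkgroups.g1.processor.maxpenalty = 10000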
Network Topology
[Setting multi-agent flow]In order to flow the data across multiple agents or hops, the sink of the previous agent and source of the current hop need to be avro type with the sink pointing to the hostname (or IP address) and port of the source.
- When multiple agents are chained, the previous agent's sink and the next agent's source must both be of type avro, with the sink pointing to the hostname (or IP) and port of the source; a sketch follows.
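A minimal sketch of such a hop, assuming agent1 runs on the log-producing machine and agent2 on a host named collector-host (the host name and port 4545 are illustrative):
# agent1: avro sink forwarding events to the next hop
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = collector-host
agent1.sinks.k1.port = 4545
agent1.sinks.k1.channel = c1
# agent2: avro source listening for events from the previous hop
agent2.sources.r1.type = avro
agent2.sources.r1.bind = 0.0.0.0
agent2.sources.r1.port = 4545
agent2.sources.r1.channels = c1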
[Consolidation]A very common scenario in log collection is a large number of log producing clients sending data to a few consumer agents that are attached to the storage subsystem. For example, logs collected from hundreds of web servers sent to a dozen of agents that write to HDFS cluster.
- A very common scenario: many machines produce logs, but only a few are connected to the storage system, so the former collect the logs and the latter consolidate them (in production the consolidation tier is usually more than one machine, i.e. you need to configure sink groups; see the sketch below)
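A sketch of the collecting side: a log-collecting agent load-balances its events across two consolidation agents. The host names collector1/collector2 and port 4545 are illustrative assumptions:
# two avro sinks, one per consolidation agent
agent1.sinks = k1 k2
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = collector1
agent1.sinks.k1.port = 4545
agent1.sinks.k1.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = collector2
agent1.sinks.k2.port = 4545
agent1.sinks.k2.channel = c1
# load-balance events across the sink group
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = k1 k2
agent1.sinkgroups.g1.processor.type = load_balance
agent1.sinkgroups.g1.processor.selector = round_robin
agent1.sinkgroups.g1.processor.backoff = true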
[Multiplexing the flow]Flume supports multiplexing the event flow to one or more destinations. This is achieved by defining a flow multiplexer that can replicate or selectively route an event to one or more channels.
- Flume supports multiplexing the event flow (by configuring a channel selector you can route data to specific destinations: either replicate one event to several channels, or route each event to a particular channel based on its attributes); see the sketch below
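A sketch of a multiplexing channel selector that routes events by the value of a hypothetical header named state:
a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.type = multiplexing
# route by the value of the "state" header
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
# events with any other value fall through to the default channel
a1.sources.r1.selector.default = c3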
How to Use
Almost everything you need is covered in the User Guide, and its examples are fairly detailed.
Here is a common production example: Kafka -> HDFS.
Check the Kafka Source and HDFS Sink documentation as needed (skip this if you already know them).
Kafka Source
[Kafka Source]Kafka Source is an Apache Kafka consumer that reads messages from Kafka topics.
[Security and Kafka Source]Secure authentication as well as data encryption is supported on the communication channel between Flume and Kafka.
SASL_PLAINTEXT - Kerberos or plaintext authentication with no data encryption
- Key parameters:
Property Name | Default | Description |
---|---|---|
kafka.bootstrap.servers | - | Comma-separated list of brokers in the Kafka cluster |
kafka.consumer.group.id | flume | Unique identifier of the consumer group; to increase parallelism, give multiple agents the same consumer group |
kafka.topics | - | Comma-separated list of topics to subscribe to |
kafka.topics.regex | - | Regex defining the set of topics to subscribe to; overrides kafka.topics |
kafka.consumer.security.protocol | PLAINTEXT | Security protocol used to connect to Kafka |
- For example:
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.channels = channel1
# write to the channel in batches: flush once 5000 events have accumulated...
tier1.sources.source1.batchSize = 5000
# ...or once 2000 ms have elapsed, whichever comes first
tier1.sources.source1.batchDurationMillis = 2000
tier1.sources.source1.kafka.bootstrap.servers = localhost:9092
tier1.sources.source1.kafka.topics = test1, test2
# this regex overrides the topics subscribed above
tier1.sources.source1.kafka.topics.regex = ^topic[0-9]$
# consumer groups are commonly named like flume.<topicName>.<index>
tier1.sources.source1.kafka.consumer.group.id = custom.g.id
# communication between Flume and Kafka supports authentication and encryption;
# in enterprises SASL_PLAINTEXT (Kerberos authentication without data encryption) is the most common
tier1.sources.source1.kafka.consumer.security.protocol = SASL_PLAINTEXT
tier1.sources.source1.kafka.consumer.sasl.mechanism = GSSAPI
tier1.sources.source1.kafka.consumer.sasl.kerberos.service.name = kafka
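For the SASL_PLAINTEXT / Kerberos setup above, the Kafka client also needs Kerberos and JAAS configuration passed to the Flume JVM. A sketch, assuming a hypothetical keytab path and principal:
# flume-env.sh: point the JVM at the Kerberos and JAAS configuration
JAVA_OPTS="$JAVA_OPTS -Djava.security.krb5.conf=/etc/krb5.conf"
JAVA_OPTS="$JAVA_OPTS -Djava.security.auth.login.config=/path/to/flume_jaas.conf"
# flume_jaas.conf: credentials used by the Kafka consumer (KafkaClient section)
KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  keyTab="/path/to/keytabs/flume.keytab"
  principal="flume/flumehost1.example.com@EXAMPLE.COM";
};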
- KafkaSource (it is essentially a wrapper around a KafkaConsumer). Only the core logic is shown below, lightly adapted for readability (doProcess runs once per batch); check the source code if you need the full details.
public class KafkaSource extends AbstractPollableSource implements Configurable, BatchSizeSupported {
  protected Status doProcess() throws EventDeliveryException {
    // generate an id for this batch; it is stored in the committed offset metadata
    final String batchUUID = UUID.randomUUID().toString();
    // the whole batch write is wrapped in a try block: offsets are committed only if
    // the write to the channel succeeds, otherwise the source returns BACKOFF
    try {
      // keep consuming until the batch is full or the batch time window has elapsed;
      // each ConsumerRecord is turned into an Event and added to eventList
      while (eventList.size() < batchUpperLimit && System.currentTimeMillis() < maxBatchEndTime) {
        if (it == null || !it.hasNext()) {
          ConsumerRecords<String, byte[]> records = consumer.poll(duration);
          it = records.iterator();
        }
        ConsumerRecord<String, byte[]> message = it.next();
        // headers (topic, partition, timestamp, ...) are built from the record (omitted here)
        byte[] eventBody = message.value();
        event = EventBuilder.withBody(eventBody, headers);
        eventList.add(event);
        // record the next offset to commit for this topic partition
        tpAndOffsetMetadata.put(new TopicPartition(message.topic(), message.partition()),
            new OffsetAndMetadata(message.offset() + 1, batchUUID));
      }
      // hand the batch to the channel processor, which ultimately calls channel.put
      getChannelProcessor().processEventBatch(eventList);
      // commit offsets only after the batch has been written to the channel
      consumer.commitSync(tpAndOffsetMetadata);
      return Status.READY;
    } catch (Exception e) {
      return Status.BACKOFF;
    }
  }
}
HDFS Sink
[HDFS Sink]This sink writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text and sequence files. It supports compression in both file types. The files can be rolled (close current file and create a new one) periodically based on the elapsed time or size of data or number of events. It also buckets/partitions data by attributes like timestamp or machine where the event originated. The HDFS directory path may contain formatting escape sequences that will be replaced by the HDFS sink to generate a directory/file name to store the events.
- Key parameters:
Property Name | Default | Description |
---|---|---|
hdfs.path | - | HDFS directory path (e.g. hdfs://namenode/flume/webdata/) |
hdfs.filePrefix | FlumeData | Prefix of the files Flume creates in the HDFS directory (commonly extended with the instance IP or host name to make troubleshooting easier) |
hdfs.rollInterval | 30 | Roll to a new file based on elapsed time, in seconds (i.e. a new file every 30 s); 0 disables time-based rolling |
hdfs.rollSize | 1024 | Roll to a new file once it reaches this size, in bytes; 0 disables size-based rolling |
hdfs.rollCount | 10 | Roll to a new file after this many events; 0 disables count-based rolling |
hdfs.batchSize | 100 | Number of events written to HDFS per batch before flushing |
hdfs.codeC | - | Compression codec, e.g. snappy |
- For example:
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# custom output path built from escape sequences; see the User Guide for the full list of escapes
a1.sinks.k1.hdfs.path = /user/log/business/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = %[IP]-log-%t
# the leading dot of the suffix is not added automatically
a1.sinks.k1.hdfs.fileSuffix = .snappy
a1.sinks.k1.hdfs.codeC = snappy
# use a large roll size (128 MB here) to avoid lots of small files
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.batchSize = 1000
# the following controls how timestamps are rounded down for directory bucketing: one directory per 10 minutes
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
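To make the rounding concrete: with the settings above, an event whose timestamp header corresponds to, say, 11:54:34 on 2023-06-12 is rounded down to 11:50:00, so the path escapes resolve roughly as follows (the date is only illustrative):
# /user/log/business/%y-%m-%d/%H%M/%S resolves to
/user/log/business/23-06-12/1150/00/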
- HDFSEventSink, core logic (abridged):
public class HDFSEventSink extends AbstractSink implements Configurable, BatchSizeSupported {
  public Status process() throws EventDeliveryException {
    Channel channel = getChannel();
    // uses the Transaction interface directly (begin -> commit / rollback -> close)
    Transaction transaction = channel.getTransaction();
    transaction.begin();
    try {
      // take up to batchSize events from the channel
      for (txnEventCount = 0; txnEventCount < batchSize; txnEventCount++) {
        Event event = channel.take();
        // callback that removes the writer from the sfWriters map once it is closed,
        // so the reference is dropped and the writer can be garbage-collected
        WriterCallback closeCallback = new WriterCallback() {
          @Override
          public void run(String bucketPath) {
            synchronized (sfWritersLock) {
              sfWriters.remove(bucketPath);
            }
          }
        };
        // look up / create the BucketWriter for this event's bucket path (omitted),
        // then append the event via the HDFSWriter; HDFSWriter implementations differ
        // mainly in file format, chosen by fileType in the configuration
        bucketWriter.append(event);
      }
      // flush all pending buckets before committing the transaction
      for (BucketWriter bucketWriter : writers) {
        // HDFSWriter sync: flush all buffered data to the file
        bucketWriter.flush();
      }
      // commit the transaction
      transaction.commit();
      return Status.READY;
    } catch (IOException eIO) {
      transaction.rollback();
      return Status.BACKOFF;
    } catch (Throwable th) {
      transaction.rollback();
      throw new EventDeliveryException(th);
    } finally {
      transaction.close();
    }
  }
}
Conclusion
The goal here is to give a reasonably complete picture of how Flume works, so that development tasks can be completed quickly in almost any scenario, and to provide a basis for designing an automated Flume transport platform.
Appendix
[Appendix 1: table of contents of the Flume User Guide page]
Flume 1.9.0 User Guide
Introduction
Overview
System Requirements
Architecture
Setup
Setting up an agent
Data ingestion
Setting multi-agent flow
Consolidation
Multiplexing the flow
Configuration
Defining the flow
Configuring individual components
Adding multiple flows in an agent
Configuring a multi agent flow
Fan out flow
SSL/TLS support
Source and sink batch sizes and channel transaction capacities
Flume Sources
Avro Source
Thrift Source
Exec Source
JMS Source
Spooling Directory Source
Taildir Source
Twitter 1% firehose Source (experimental)
Kafka Source
NetCat TCP Source
NetCat UDP Source
Sequence Generator Source
Syslog Sources
HTTP Source
Stress Source
Legacy Sources
Custom Source
Scribe Source
Flume Sinks
HDFS Sink
Hive Sink
Logger Sink
Avro Sink
Thrift Sink
IRC Sink
File Roll Sink
Null Sink
HBaseSinks
MorphlineSolrSink
ElasticSearchSink
Kite Dataset Sink
Kafka Sink
HTTP Sink
Custom Sink
Flume Channels
Memory Channel
JDBC Channel
Kafka Channel
File Channel
Spillable Memory Channel
Pseudo Transaction Channel
Custom Channel
Flume Channel Selectors
Replicating Channel Selector (default)
Multiplexing Channel Selector
Custom Channel Selector
Flume Sink Processors
Default Sink Processor
Failover Sink Processor
Load balancing Sink Processor
Custom Sink Processor
Event Serializers
Body Text Serializer
“Flume Event” Avro Event Serializer
Avro Event Serializer
Flume Interceptors
Timestamp Interceptor
Host Interceptor
Static Interceptor
Remove Header Interceptor
UUID Interceptor
Morphline Interceptor
Search and Replace Interceptor
Regex Filtering Interceptor
Regex Extractor Interceptor
Example 1:
Example 2:
Flume Properties
Property: flume.called.from.service
Configuration Filters
Common usage of config filters
Environment Variable Config Filter
Example
External Process Config Filter
Example
Example 2
Hadoop Credential Store Config Filter
Example
Log4J Appender
Load Balancing Log4J Appender
Security
Monitoring
Available Component Metrics
Sources 1
Sources 2
Sinks 1
Sinks 2
Channels
JMX Reporting
Ganglia Reporting
JSON Reporting
Custom Reporting
Reporting metrics from custom components
Tools
File Channel Integrity Tool
Event Validator Tool
Topology Design Considerations
Is Flume a good fit for your problem?
Flow reliability in Flume
Flume topology design
Sizing a Flume deployment
Troubleshooting
Handling agent failures
Compatibility
HDFS
AVRO
Additional version requirements
Tracing
More Sample Configs
Component Summary
Alias Conventions