本文是Flume官方开发者文档的翻译。
Flume 1.8.0开发者指南
简介
概览
Apache Flume是一个分布式的,可靠的,高可用系统,用于将大量来自不同源的日志数据高效地收集,统计,搬迁到一个集中存储仓库。
Apache Flume是Apache Software Foundation的高级别项目。现在有两个可用的代码分支。0.9.x版本和1.x版本。本文档适用于1.x版本。0.9.x版本的文档请参阅Flume 0.9.x 开发者文档。
结构
数据流模型
Event
是流经Flume 代理(agent)的一个数据单元。Event
从Source
经Channel
流向Sink
,这个过程通过对Event
接口的实现来代表。一个Event
传输了一份载荷(字节数字)和一个与之相伴的可选头部(字符串属性)。一个Flume agent就是一个进程(JVM),管理着允许Event从一个外部源流向外部目标的各组件。
Source
消费Event
s有着固定的格式,Event
s从外部源传送到Source
的时候就像一个web服务器。例如,AvroSource
从客户端或者流中的其他Flume agents接收 Avro Event
s。Source
接收到一个Event
,就向一个或多个Channel
进行存储。
Channel
是一个被动的仓库,它持有Event
直到Event
被Sink
消费。FileChannel
就是Flume中的一类Channel
,使用本地文件系统作为自己的仓库。
Sink
从Channel
中移除Event
并推送到外部仓库(repository),例如HDFS(如果是HDFSEventSink
),或者流里下一个节点的Source
。agent中的Source
和Sink
同Channel
中排列的Event
s异步运行。
可靠性
Event
在Flume agent的Channel中暂存。Sink
负责把Event
传给流中的下一个agent或者终端仓库(例如HDFS)。只有当Event
被存进下一个agent或者终端仓库后,Sink
才会把它从Channel中移除。这是Flume中单跳的消息传递语义提供的流的端到端的可靠性。Flume使用了一种传统途径来保证Event传递的可靠性。Source
和Sink
包含由Channel提供的Transaction
中Event
的存储和检索。这确保了Event在流中的点之间可靠的传递。如果是多跳的流,前一跳的Sink
和下一跳的Source
打开各自的Transaction来保证Event数据被安全存放到下一跳的Channel中。
构建Flume
获取源
用Git 检查代码,见Git仓库。
Flume 1.x在“trunk”分支下开发,因此可以使用一下命令:
git clone https://git-wip-us.apache.org/repos/asf/flume.git
编译/测试 Flume
Flume的编译已经maven化了,可以使用标准Maven命令来编译Flume:
- 仅编译:
mvn clean compile
- 编译并运行单元测试:
mvn clean test
- 运行单体测试:
mvn clean test -Dtest=
, ,... -DfailIfNotTests=false - 创建源码(tarball)包:
mvn clean install
- 创建源码包(跳过单元测试):
mvn clean install -DskipTests
请注意Flume的构建需要路径中存在Google Protocol Buffers 编译器。它的下载和安装在这个教程中。
更新Protocol Buffer版本
文件管道依赖于Protocol Buffer。在用Flume更新Protocol Buffer的版本时,需要用protocol 编译器来重新生成数据访问类,它是Protocol Buffer的一部分。步骤如下:
- 在本地机器上安装指定版本的Protocol Buffer。
- 更新pom.xml中Protocol Buffer的版本
- 在Flume中生成新的Protocol Buffer数据接收类:
cd flume-ng-channels/flume-file-channel; mvn -P compile-proto clean package -DskipTests
- 在所有生成的文件中添加Apache许可证头部。
- 重新构建并测试Flume:
cd ../..; mvn clean install
下载用户组件
客户端
客户端在events的源头进行操作并将它们传输到Flume agent。通常,从哪里的应用消费数据,客户端就在哪个应用的进程空间中操作。Flume现在支持Arvo,log4j,syslog,和Http POST(JSON body)作为外部数据源。另外,有一个ExecSource
能消费本地进程的输出,作为Flume的输入。
很有可能现有的操作在实际使用中不够用。这种情况下,可以自定义一个机制来向Flume发送数据,有两种方法实现:一种方法是创建自定义的客户端和一个Flume现有的Source
例如AvroSource
或者SyslogTcpSource
相连,这时客户端应该将其数据转换成能被Flume Source
理解的消息;另一种方法是写一个自定义Flume Source
直接用IPC或者RPC协议和现有的客户端应用通信,然后将客户端数据转换成Flume Event
向下游发送。注意Flume agent中所有在Channel
中存放的event必须作为Flume Event
存在。
客户端SDK
尽管Flume包含了一些构建机制(例如Source
)来获取数据,我们还是希望能够有直接将自定义的应用和Flume连接起来。Flume Client SDK就是这样一个库,能够使应用和Flume相连,并通过RPC向Flume发送数据。
RPC客户端接口
Flume的RPCClient接口的实现包含了Flume支持的RPC机制。可以直接调用Flume Client SDK的append(Event)
或者appendBatch(List
方法来发送数据且不用担心背后的消息交换的细节。要提供需要的Event
可以通过以下任何一种方式:
- 直接实现
Event
接口; - 使用一个便捷的实现,例如SimpleEvent类;
- 调用EventBuilder重载的
withBody()
静态帮助方法。
RPC客户端-Avro和Thrift
在Flume 1.4.0中,Avro是默认的RPC协议。NettyAvroRpcClient
和ThriftRpcClient
是RpcClient
接口的实现。客户端需要用目标Flume agent的host和port来创建对象,然后就能用RpcClent
来向agent发送数据了。下面是一个用户的数据生成应用中Flume Client SDK API的使用:
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;
import java.nio.charset.Charset;
public class MyApp {
public static void main(String[] args) {
MyRpcClientFacade client = new MyRpcClientFacade();
// Initialize client with the remote Flume agent's host and port
client.init("host.example.org", 41414);
// Send 10 events to the remote Flume agent. That agent should be
// configured to listen with an AvroSource.
String sampleData = "Hello Flume!";
for (int i = 0; i < 10; i++) {
client.sendDataToFlume(sampleData);
}
client.cleanUp();
}
}
class MyRpcClientFacade {
private RpcClient client;
private String hostname;
private int port;
public void init(String hostname, int port) {
// Setup the RPC connection
this.hostname = hostname;
this.port = port;
this.client = RpcClientFactory.getDefaultInstance(hostname, port);
// Use the following method to create a thrift client (instead of the above line):
// this.client = RpcClientFactory.getThriftInstance(hostname, port);
}
public void sendDataToFlume(String data) {
// Create a Flume Event object that encapsulates the sample data
Event event = EventBuilder.withBody(data, Charset.forName("UTF-8"));
// Send the event
try {
client.append(event);
} catch (EventDeliveryException e) {
// clean up and recreate the client
client.close();
client = null;
client = RpcClientFactory.getDefaultInstance(hostname, port);
// Use the following method to create a thrift client (instead of the above line):
// this.client = RpcClientFactory.getThriftInstance(hostname, port);
}
}
public void cleanUp() {
// Close the RPC connection
client.close();
}
}
远程的Flume agent需要有一个AvroSource(如果用的是Tfrift客户端需要就是ThriftSource)在一些端口上监听。
等待MyApp连接的Flume agent配置:
a1.channels = c1
a1.sources = r1
a1.sinks = k1
a1.channels.c1.type = memory
a1.sources.r1.channels = c1
a1.sources.r1.type = avro
# For using a thrift source set the following instead of the above line.
# a1.source.r1.type = thrift
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
a1.sinks.k1.channel = c1
a1.sinks.k1.type = logger
为了更好的伸缩性,默认的Flume客户端实现(NettyAvroRpcClient 和 ThriftRpcClient)能用这些属性配置:
client.type = default (for avro) or thrift (for thrift)
hosts = h1 # default client accepts only 1 host
# (additional hosts will be ignored)
hosts.h1 = host1.example.org:41414 # host and port must both be specified
# (neither has a default)
batch-size = 100 # Must be >=1 (default: 100)
connect-timeout = 20000 # Must be >=1000 (default: 20000)
request-timeout = 20000 # Must be >=1000 (default: 20000)
安全的RPC客户端-Thrift
Flume 1.6.0中,Thrift source和sink支持基于验证的kerberos。客户端需要用SecureRpcClientFactory
类的的getThriftInstance
方法来持有一个SecureThriftRpcClient
类。SecureThriftRpcClient
类继承了ThriftRpcClient
类,后者实现了RpcClient
接口。在使用SecureRpcClientFactory
的过程中,kerberos 认证模块属于flume-ng-auth的一部分,需要再classpath中配置。客户端的principal和keytab都需要通过属性值作为参数传进去,他们反映了客户端通过kerberos KDC认证的资格。另外,客户端连接的目标Thrift source,也要提供服务端principle。
在用户数据生成数据的应用中使用SecureRpcClientFactory
的例子:
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.api.SecureRpcClientFactory;
import org.apache.flume.api.RpcClientConfigurationConstants;
import org.apache.flume.api.RpcClient;
import java.nio.charset.Charset;
import java.util.Properties;
public class MyApp {
public static void main(String[] args) {
MySecureRpcClientFacade client = new MySecureRpcClientFacade();
// Initialize client with the remote Flume agent's host, port
Properties props = new Properties();
props.setProperty(RpcClientConfigurationConstants.CONFIG_CLIENT_TYPE, "thrift");
props.setProperty("hosts", "h1");
props.setProperty("hosts.h1", "client.example.org"+":"+ String.valueOf(41414));
// Initialize client with the kerberos authentication related properties
props.setProperty("kerberos", "true");
props.setProperty("client-principal", "flumeclient/[email protected]");
props.setProperty("client-keytab", "/tmp/flumeclient.keytab");
props.setProperty("server-principal", "flume/[email protected]");
client.init(props);
// Send 10 events to the remote Flume agent. That agent should be
// configured to listen with an AvroSource.
String sampleData = "Hello Flume!";
for (int i = 0; i < 10; i++) {
client.sendDataToFlume(sampleData);
}
client.cleanUp();
}
}
class MySecureRpcClientFacade {
private RpcClient client;
private Properties properties;
public void init(Properties properties) {
// Setup the RPC connection
this.properties = properties;
// Create the ThriftSecureRpcClient instance by using SecureRpcClientFactory
this.client = SecureRpcClientFactory.getThriftInstance(properties);
}
public void sendDataToFlume(String data) {
// Create a Flume Event object that encapsulates the sample data
Event event = EventBuilder.withBody(data, Charset.forName("UTF-8"));
// Send the event
try {
client.append(event);
} catch (EventDeliveryException e) {
// clean up and recreate the client
client.close();
client = null;
client = SecureRpcClientFactory.getThriftInstance(properties);
}
}
public void cleanUp() {
// Close the RPC connection
client.close();
}
}
远程ThriftSource
需要再kerberos模块中启动。等待MyApp连接的Flume agent配置的例子:
a1.channels = c1
a1.sources = r1
a1.sinks = k1
a1.channels.c1.type = memory
a1.sources.r1.channels = c1
a1.sources.r1.type = thrift
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
a1.sources.r1.kerberos = true
a1.sources.r1.agent-principal = flume/[email protected]
a1.sources.r1.agent-keytab = /tmp/flume.keytab
a1.sinks.k1.channel = c1
a1.sinks.k1.type = logger
Failover Client
这个类包装了Avro RPC客户端来向客户端提供failover处理能力。需要一个空格分隔的
// Setup properties for the failover
Properties props = new Properties();
props.put("client.type", "default_failover");
// List of hosts (space-separated list of user-chosen host aliases)
props.put("hosts", "h1 h2 h3");
// host/port pair for each host alias
String host1 = "host1.example.org:41414";
String host2 = "host2.example.org:41414";
String host3 = "host3.example.org:41414";
props.put("hosts.h1", host1);
props.put("hosts.h2", host2);
props.put("hosts.h3", host3);
// create the client with failover properties
RpcClient client = RpcClientFactory.getInstance(props);
为了更好的伸缩性,failover Flume客户端实现(FailoverRpcClient
)可以配置如下属性:
client.type = default_failover
hosts = h1 h2 h3 # at least one is required, but 2 or
# more makes better sense
hosts.h1 = host1.example.org:41414
hosts.h2 = host2.example.org:41414
hosts.h3 = host3.example.org:41414
max-attempts = 3 # Must be >=0 (default: number of hosts
# specified, 3 in this case). A '0'
# value doesn't make much sense because
# it will just cause an append call to
# immmediately fail. A '1' value means
# that the failover client will try only
# once to send the Event, and if it
# fails then there will be no failover
# to a second client, so this value
# causes the failover client to
# degenerate into just a default client.
# It makes sense to set this value to at
# least the number of hosts that you
# specified.
batch-size = 100 # Must be >=1 (default: 100)
connect-timeout = 20000 # Must be >=1000 (default: 20000)
request-timeout = 20000 # Must be >=1000 (default: 20000)
负载均衡RPC客户端
Flume Client SDK也支持在多主机之间均衡负载的RPCClient。这种客户端需要一个由空格分隔的
这个客户端可以用以下负载策略进行配置:
- 从配置的主机中随机选择一个
- 用轮训方式选择主机
可以写一个类,实现LoadBalancingRpcClient$HostSelector
接口,这样就能自定义一个主机的选择顺序。这是,自定义类的FQCN需要配置作为host-selector属性的值。LoadBalancing RPC Client现在不支持thrift。
如果backoff
配置为可用,客户端就会把失败的主机列到黑名单中,这样,在指定的超时时间之前这些主机会作为failover 主机拒绝被选择。在超时时间过去后,如果主机仍然未响应,就进入失效队列,超时时间指数增长,以避免等待未响应主机造成的阻塞。
最大backoff时间通过maxBackoff
属性配置,以毫秒为单位,默认30秒(在OrderSelector
类中指定,它的超类加载了均衡策略)。达到最大backoff时间之后,backoff超时时间会指数增长。最大可以达到65536秒(大约18.2小时)。
例如:
// Setup properties for the load balancing
Properties props = new Properties();
props.put("client.type", "default_loadbalance");
// List of hosts (space-separated list of user-chosen host aliases)
props.put("hosts", "h1 h2 h3");
// host/port pair for each host alias
String host1 = "host1.example.org:41414";
String host2 = "host2.example.org:41414";
String host3 = "host3.example.org:41414";
props.put("hosts.h1", host1);
props.put("hosts.h2", host2);
props.put("hosts.h3", host3);
props.put("host-selector", "random"); // For random host selection
// props.put("host-selector", "round_robin"); // For round-robin host
// // selection
props.put("backoff", "true"); // Disabled by default.
props.put("maxBackoff", "10000"); // Defaults 0, which effectively
// becomes 30000 ms
// Create the client with load balancing properties
RpcClient client = RpcClientFactory.getInstance(props);
为了更好的伸缩性,负载均衡的Flume client实现(LoadBalancingRpcClient)可以按如下配置:
client.type = default_loadbalance
hosts = h1 h2 h3 # At least 2 hosts are required
hosts.h1 = host1.example.org:41414
hosts.h2 = host2.example.org:41414
hosts.h3 = host3.example.org:41414
backoff = false # Specifies whether the client should
# back-off from (i.e. temporarily
# blacklist) a failed host
# (default: false).
maxBackoff = 0 # Max timeout in millis that a will
# remain inactive due to a previous
# failure with that host (default: 0,
# which effectively becomes 30000)
host-selector = round_robin # The host selection strategy used
# when load-balancing among hosts
# (default: round_robin).
# Other values are include "random"
# or the FQCN of a custom class
# that implements
# LoadBalancingRpcClient$HostSelector
batch-size = 100 # Must be >=1 (default: 100)
connect-timeout = 20000 # Must be >=1000 (default: 20000)
request-timeout = 20000 # Must be >=1000 (default: 20000)
Embedded agent
Flume有一个嵌入式agent api允许用户在自己的应用中嵌入agent。这意味着这个agent很轻量,但不是所有sources,sinks,channels都可用。特别是使用的source是一个嵌入式source并且event需要通过EmbeddedAgent
对象的put,putAll方法发送到这个source的时候。只有File Channel和Memory Channel是可用的channel,Avro Sink是可用的sink,嵌入式agent也支持Interceptors。
注意:嵌入式agent依赖hadoop-core.jar。
Embedded agent的配置和全版本Agent配置很相似。
以下是全面的配置列表,必选的属性加粗了:
属性名 | 默认值 | 描述 |
---|---|---|
source.type | embedded | 嵌入的source中唯一可用的source |
channel.type | - | 可选memory 和file 分别相当于MemoryChannel和FileChannel |
channel.* | - | 配置指定channel的选项,查阅MemoryChannel或者FileChannel用户指南获得详细清单 |
sinks | - | sink名列表 |
sink.type | - | sink的配置选项,AvroSink 用户指南中有详细清单,但是注意AvroSink至少需要主机名和接口 |
processor.type | - | 可选failover 和load_balance 分别代表FailoverSinksProcessor 和LoadBalancingSinkProcessor |
processor.* | - | 选择的sink处理器的配置项,FailoverSinksProcessor和LoadBalanceingSinkProcessor用户指南中有详细清单 |
source.interceptor | - | 空格隔开的拦截器列表 |
source.interceptor.* | - | source.interceptors属性中中各拦截器的配置项 |
下面使用改agent的例子:
Map properties = new HashMap();
properties.put("channel.type", "memory");
properties.put("channel.capacity", "200");
properties.put("sinks", "sink1 sink2");
properties.put("sink1.type", "avro");
properties.put("sink2.type", "avro");
properties.put("sink1.hostname", "collector1.apache.org");
properties.put("sink1.port", "5564");
properties.put("sink2.hostname", "collector2.apache.org");
properties.put("sink2.port", "5565");
properties.put("processor.type", "load_balance");
properties.put("source.interceptors", "i1");
properties.put("source.interceptors.i1.type", "static");
properties.put("source.interceptors.i1.key", "key1");
properties.put("source.interceptors.i1.value", "value1");
EmbeddedAgent agent = new EmbeddedAgent("myagent");
agent.configure(properties);
agent.start();
List events = Lists.newArrayList();
events.add(event);
events.add(event);
events.add(event);
events.add(event);
agent.putAll(events);
...
agent.stop();
Transaction接口
Transaction
接口Flume可靠性的基础。所有主要的组件(例如Source
,Sink
和Channel
)必须使用Flume Transaction
。
Transaction
是在Channel
实现中实现的。每个连接到Channel
的Source
和Sink
都必须包含一个Transaction
对象。Source
用ChannelProcessor
来管理Transaction
,Sink
则直接通过配置的Channel
来管理它们。存放(把event放到Channel
中)或者取出(从Channel
中取出)Event
的操作在活动的Transaction
中完成。
例如:
Channel ch = new MemoryChannel();
Transaction txn = ch.getTransaction();
txn.begin();
try {
// This try clause includes whatever Channel operations you want to do
Event eventToStage = EventBuilder.withBody("Hello Flume!",
Charset.forName("UTF-8"));
ch.put(eventToStage);
// Event takenEvent = ch.take();
// ...
txn.commit();
} catch (Throwable t) {
txn.rollback();
// Log exception, handle individual exceptions as needed
// re-throw all Errors
if (t instanceof Error) {
throw (Error)t;
}
} finally {
txn.close();
}
这里从一个Channel
中持有Transaction
。在begin()
方法返回之后,Transaction
就是active/open的,然后Event
就被put进Channel
了。如果put成功,Transaction
就提交然后关闭。
Sink
Sink
的目标是从Channel
中取出Event
然后传输向流中的下一个Flume Agent或者存储到外部仓库。一个Sink
和一个Channel
关联,在Flume属性文件中配置。有一个SinkRunner
实例将每一个配置的Sink
联系起来,在Flume框架调用SinkRunner.start()
方法的时候,创建一个新的线程来驱动Sink
(用SinkRunner.PollingRunner
作为线程的Runnable
)。这个线程管理着Sink的生命周期。Sink需要实现start()
和stop()
方法,它们都是LifecycleAware
接口的一部分。Sink.start()
需要初始化Sink
并且让Sink
达到能把Event
发送到目的地的状态。Sink.process()
方法要完成从Channel
取出Event
并发送的核心工作。Sink.stop()
方法要处理必要的清理工作(例如资源回收)。Sink实现也需要实现Configurable接口来处理它自己的配置:
public class MySink extends AbstractSink implements Configurable {
private String myProp;
@Override
public void configure(Context context) {
String myProp = context.getString("myProp", "defaultValue");
// Process the myProp value (e.g. validation)
// Store myProp for later retrieval by process() method
this.myProp = myProp;
}
@Override
public void start() {
// Initialize the connection to the external repository (e.g. HDFS) that
// this Sink will forward Events to ..
}
@Override
public void stop () {
// Disconnect from the external respository and do any
// additional cleanup (e.g. releasing resources or nulling-out
// field values) ..
}
@Override
public Status process() throws EventDeliveryException {
Status status = null;
// Start transaction
Channel ch = getChannel();
Transaction txn = ch.getTransaction();
txn.begin();
try {
// This try clause includes whatever Channel operations you want to do
Event event = ch.take();
// Send the Event to the external repository.
// storeSomeData(e);
txn.commit();
status = Status.READY;
} catch (Throwable t) {
txn.rollback();
// Log exception, handle individual exceptions as needed
status = Status.BACKOFF;
// re-throw all Errors
if (t instanceof Error) {
throw (Error)t;
}
}
return status;
}
}
Source
Source 的目标是从外部客户端接受数据并将其存放到配置好的Channel中。一个Source能从它自己的ChannelProcessor中获得实例来连续处理Event,并在Channel本地transaction中提交。如果出现异常,相应的Channel将会传递这个异常,所有这些Channel将会回滚它们的transaction,但是之前其他Channel处理的event仍然会提交。
和SinkRunner.PollingRunner
, Runnable
类似,Flume框架调用PollableSourceRunner.start()
方法的时候会创建一个线程,在这个线程上面会执行PollingRunner
,Runnable
。每个配置的PollableSource
和它自己的作为PollingRunner
运行的线程关联起来。这个线程管理着PollableSource的生命周期,例如开始和结束。PollableSource实现必须实现定义在LifecycleAware接口中的start()和stop()方法。PollableSource的runner调用Source的process()方法。process()方法需要检车新数据并将之作为Flume Event存放到Channel中去。
注意,实际上有两类Source。除了已经提到的PollableSource,一定要有自己的回调机制来捕获新数据,存放到Channel,还有一种EventDrivenSource,和PollableSource不同,不是由自己的线程驱动。
下面是一个自定义的PollableSource:
public class MySource extends AbstractSource implements Configurable, PollableSource {
private String myProp;
@Override
public void configure(Context context) {
String myProp = context.getString("myProp", "defaultValue");
// Process the myProp value (e.g. validation, convert to another type, ...)
// Store myProp for later retrieval by process() method
this.myProp = myProp;
}
@Override
public void start() {
// Initialize the connection to the external client
}
@Override
public void stop () {
// Disconnect from external client and do any additional cleanup
// (e.g. releasing resources or nulling-out field values) ..
}
@Override
public Status process() throws EventDeliveryException {
Status status = null;
try {
// This try clause includes whatever Channel/Event operations you want to do
// Receive new data
Event e = getSomeData();
// Store the Event into this Source's associated Channel(s)
getChannelProcessor().processEvent(e);
status = Status.READY;
} catch (Throwable t) {
// Log exception, handle individual exceptions as needed
status = Status.BACKOFF;
// re-throw all Errors
if (t instanceof Error) {
throw (Error)t;
}
} finally {
txn.close();
}
return status;
}
}