Flume简介

本文是Flume官方开发者文档的翻译。

Flume 1.8.0开发者指南

简介

概览

Apache Flume是一个分布式的,可靠的,高可用系统,用于将大量来自不同源的日志数据高效地收集,统计,搬迁到一个集中存储仓库。

Apache Flume是Apache Software Foundation的高级别项目。现在有两个可用的代码分支。0.9.x版本和1.x版本。本文档适用于1.x版本。0.9.x版本的文档请参阅Flume 0.9.x 开发者文档。

结构

数据流模型

Event是流经Flume 代理(agent)的一个数据单元。EventSourceChannel流向Sink,这个过程通过对Event接口的实现来代表。一个Event传输了一份载荷(字节数字)和一个与之相伴的可选头部(字符串属性)。一个Flume agent就是一个进程(JVM),管理着允许Event从一个外部源流向外部目标的各组件。

Flume数据流模型

Source 消费Events有着固定的格式,Events从外部源传送到Source的时候就像一个web服务器。例如,AvroSource从客户端或者流中的其他Flume agents接收 Avro Events。Source接收到一个Event,就向一个或多个Channel进行存储。
Channel是一个被动的仓库,它持有Event直到EventSink消费。FileChannel就是Flume中的一类Channel,使用本地文件系统作为自己的仓库。
SinkChannel中移除Event并推送到外部仓库(repository),例如HDFS(如果是HDFSEventSink),或者流里下一个节点的Source 。agent中的SourceSinkChannel中排列的Events异步运行。

可靠性

Event在Flume agent的Channel中暂存。Sink负责把Event传给流中的下一个agent或者终端仓库(例如HDFS)。只有当Event被存进下一个agent或者终端仓库后,Sink才会把它从Channel中移除。这是Flume中单跳的消息传递语义提供的流的端到端的可靠性。Flume使用了一种传统途径来保证Event传递的可靠性。SourceSink包含由Channel提供的TransactionEvent的存储和检索。这确保了Event在流中的点之间可靠的传递。如果是多跳的流,前一跳的Sink和下一跳的Source打开各自的Transaction来保证Event数据被安全存放到下一跳的Channel中。

构建Flume

获取源

用Git 检查代码,见Git仓库。
Flume 1.x在“trunk”分支下开发,因此可以使用一下命令:

git clone https://git-wip-us.apache.org/repos/asf/flume.git

编译/测试 Flume

Flume的编译已经maven化了,可以使用标准Maven命令来编译Flume:

  1. 仅编译:mvn clean compile
  2. 编译并运行单元测试:mvn clean test
  3. 运行单体测试:mvn clean test -Dtest=,,... -DfailIfNotTests=false
  4. 创建源码(tarball)包:mvn clean install
  5. 创建源码包(跳过单元测试):mvn clean install -DskipTests

请注意Flume的构建需要路径中存在Google Protocol Buffers 编译器。它的下载和安装在这个教程中。

更新Protocol Buffer版本

文件管道依赖于Protocol Buffer。在用Flume更新Protocol Buffer的版本时,需要用protocol 编译器来重新生成数据访问类,它是Protocol Buffer的一部分。步骤如下:

  1. 在本地机器上安装指定版本的Protocol Buffer。
  2. 更新pom.xml中Protocol Buffer的版本
  3. 在Flume中生成新的Protocol Buffer数据接收类:cd flume-ng-channels/flume-file-channel; mvn -P compile-proto clean package -DskipTests
  4. 在所有生成的文件中添加Apache许可证头部。
  5. 重新构建并测试Flume:cd ../..; mvn clean install

下载用户组件

客户端

客户端在events的源头进行操作并将它们传输到Flume agent。通常,从哪里的应用消费数据,客户端就在哪个应用的进程空间中操作。Flume现在支持Arvo,log4j,syslog,和Http POST(JSON body)作为外部数据源。另外,有一个ExecSource能消费本地进程的输出,作为Flume的输入。
很有可能现有的操作在实际使用中不够用。这种情况下,可以自定义一个机制来向Flume发送数据,有两种方法实现:一种方法是创建自定义的客户端和一个Flume现有的Source例如AvroSource或者SyslogTcpSource相连,这时客户端应该将其数据转换成能被Flume Source理解的消息;另一种方法是写一个自定义Flume Source直接用IPC或者RPC协议和现有的客户端应用通信,然后将客户端数据转换成Flume Event向下游发送。注意Flume agent中所有在Channel中存放的event必须作为Flume Event存在。

客户端SDK

尽管Flume包含了一些构建机制(例如Source)来获取数据,我们还是希望能够有直接将自定义的应用和Flume连接起来。Flume Client SDK就是这样一个库,能够使应用和Flume相连,并通过RPC向Flume发送数据。

RPC客户端接口

Flume的RPCClient接口的实现包含了Flume支持的RPC机制。可以直接调用Flume Client SDK的append(Event)或者appendBatch(List)方法来发送数据且不用担心背后的消息交换的细节。要提供需要的Event可以通过以下任何一种方式:

  • 直接实现Event接口;
  • 使用一个便捷的实现,例如SimpleEvent类;
  • 调用EventBuilder重载的withBody()静态帮助方法。

RPC客户端-Avro和Thrift

在Flume 1.4.0中,Avro是默认的RPC协议。NettyAvroRpcClientThriftRpcClientRpcClient接口的实现。客户端需要用目标Flume agent的host和port来创建对象,然后就能用RpcClent来向agent发送数据了。下面是一个用户的数据生成应用中Flume Client SDK API的使用:

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;
import java.nio.charset.Charset;

public class MyApp {
  public static void main(String[] args) {
    MyRpcClientFacade client = new MyRpcClientFacade();
    // Initialize client with the remote Flume agent's host and port
    client.init("host.example.org", 41414);

    // Send 10 events to the remote Flume agent. That agent should be
    // configured to listen with an AvroSource.
    String sampleData = "Hello Flume!";
    for (int i = 0; i < 10; i++) {
      client.sendDataToFlume(sampleData);
    }

    client.cleanUp();
  }
}

class MyRpcClientFacade {
  private RpcClient client;
  private String hostname;
  private int port;

  public void init(String hostname, int port) {
    // Setup the RPC connection
    this.hostname = hostname;
    this.port = port;
    this.client = RpcClientFactory.getDefaultInstance(hostname, port);
    // Use the following method to create a thrift client (instead of the above line):
    // this.client = RpcClientFactory.getThriftInstance(hostname, port);
  }

  public void sendDataToFlume(String data) {
    // Create a Flume Event object that encapsulates the sample data
    Event event = EventBuilder.withBody(data, Charset.forName("UTF-8"));

    // Send the event
    try {
      client.append(event);
    } catch (EventDeliveryException e) {
      // clean up and recreate the client
      client.close();
      client = null;
      client = RpcClientFactory.getDefaultInstance(hostname, port);
      // Use the following method to create a thrift client (instead of the above line):
      // this.client = RpcClientFactory.getThriftInstance(hostname, port);
    }
  }

  public void cleanUp() {
    // Close the RPC connection
    client.close();
  }

}

远程的Flume agent需要有一个AvroSource(如果用的是Tfrift客户端需要就是ThriftSource)在一些端口上监听。
等待MyApp连接的Flume agent配置:

a1.channels = c1
a1.sources = r1
a1.sinks = k1

a1.channels.c1.type = memory

a1.sources.r1.channels = c1
a1.sources.r1.type = avro
# For using a thrift source set the following instead of the above line.
# a1.source.r1.type = thrift
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414

a1.sinks.k1.channel = c1
a1.sinks.k1.type = logger

为了更好的伸缩性,默认的Flume客户端实现(NettyAvroRpcClient 和 ThriftRpcClient)能用这些属性配置:

client.type = default (for avro) or thrift (for thrift)

hosts = h1                           # default client accepts only 1 host
                                     # (additional hosts will be ignored)

hosts.h1 = host1.example.org:41414   # host and port must both be specified
                                     # (neither has a default)

batch-size = 100                     # Must be >=1 (default: 100)
connect-timeout = 20000              # Must be >=1000 (default: 20000)
request-timeout = 20000              # Must be >=1000 (default: 20000)

安全的RPC客户端-Thrift

Flume 1.6.0中,Thrift source和sink支持基于验证的kerberos。客户端需要用SecureRpcClientFactory类的的getThriftInstance方法来持有一个SecureThriftRpcClient类。SecureThriftRpcClient类继承了ThriftRpcClient类,后者实现了RpcClient接口。在使用SecureRpcClientFactory的过程中,kerberos 认证模块属于flume-ng-auth的一部分,需要再classpath中配置。客户端的principal和keytab都需要通过属性值作为参数传进去,他们反映了客户端通过kerberos KDC认证的资格。另外,客户端连接的目标Thrift source,也要提供服务端principle。
在用户数据生成数据的应用中使用SecureRpcClientFactory的例子:

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.api.SecureRpcClientFactory;
import org.apache.flume.api.RpcClientConfigurationConstants;
import org.apache.flume.api.RpcClient;
import java.nio.charset.Charset;
import java.util.Properties;

public class MyApp {
  public static void main(String[] args) {
    MySecureRpcClientFacade client = new MySecureRpcClientFacade();
    // Initialize client with the remote Flume agent's host, port
    Properties props = new Properties();
    props.setProperty(RpcClientConfigurationConstants.CONFIG_CLIENT_TYPE, "thrift");
    props.setProperty("hosts", "h1");
    props.setProperty("hosts.h1", "client.example.org"+":"+ String.valueOf(41414));

    // Initialize client with the kerberos authentication related properties
    props.setProperty("kerberos", "true");
    props.setProperty("client-principal", "flumeclient/[email protected]");
    props.setProperty("client-keytab", "/tmp/flumeclient.keytab");
    props.setProperty("server-principal", "flume/[email protected]");
    client.init(props);

    // Send 10 events to the remote Flume agent. That agent should be
    // configured to listen with an AvroSource.
    String sampleData = "Hello Flume!";
    for (int i = 0; i < 10; i++) {
      client.sendDataToFlume(sampleData);
    }

    client.cleanUp();
  }
}

class MySecureRpcClientFacade {
  private RpcClient client;
  private Properties properties;

  public void init(Properties properties) {
    // Setup the RPC connection
    this.properties = properties;
    // Create the ThriftSecureRpcClient instance by using SecureRpcClientFactory
    this.client = SecureRpcClientFactory.getThriftInstance(properties);
  }

  public void sendDataToFlume(String data) {
    // Create a Flume Event object that encapsulates the sample data
    Event event = EventBuilder.withBody(data, Charset.forName("UTF-8"));

    // Send the event
    try {
      client.append(event);
    } catch (EventDeliveryException e) {
      // clean up and recreate the client
      client.close();
      client = null;
      client = SecureRpcClientFactory.getThriftInstance(properties);
    }
  }

  public void cleanUp() {
    // Close the RPC connection
    client.close();
  }
}

远程ThriftSource需要再kerberos模块中启动。等待MyApp连接的Flume agent配置的例子:

a1.channels = c1
a1.sources = r1
a1.sinks = k1

a1.channels.c1.type = memory

a1.sources.r1.channels = c1
a1.sources.r1.type = thrift
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
a1.sources.r1.kerberos = true
a1.sources.r1.agent-principal = flume/[email protected]
a1.sources.r1.agent-keytab = /tmp/flume.keytab

a1.sinks.k1.channel = c1
a1.sinks.k1.type = logger

Failover Client

这个类包装了Avro RPC客户端来向客户端提供failover处理能力。需要一个空格分隔的:列表,代表构成failover集合的Flume agent组。Failover RPC Client现在不支持thrift。如果和当前主机agent的连接出现错误,failover客户端自动切换到列表中的下一个agent,例如:

// Setup properties for the failover
Properties props = new Properties();
props.put("client.type", "default_failover");

// List of hosts (space-separated list of user-chosen host aliases)
props.put("hosts", "h1 h2 h3");

// host/port pair for each host alias
String host1 = "host1.example.org:41414";
String host2 = "host2.example.org:41414";
String host3 = "host3.example.org:41414";
props.put("hosts.h1", host1);
props.put("hosts.h2", host2);
props.put("hosts.h3", host3);

// create the client with failover properties
RpcClient client = RpcClientFactory.getInstance(props);

为了更好的伸缩性,failover Flume客户端实现(FailoverRpcClient)可以配置如下属性:

client.type = default_failover

hosts = h1 h2 h3                     # at least one is required, but 2 or
                                     # more makes better sense

hosts.h1 = host1.example.org:41414

hosts.h2 = host2.example.org:41414

hosts.h3 = host3.example.org:41414

max-attempts = 3                     # Must be >=0 (default: number of hosts
                                     # specified, 3 in this case). A '0'
                                     # value doesn't make much sense because
                                     # it will just cause an append call to
                                     # immmediately fail. A '1' value means
                                     # that the failover client will try only
                                     # once to send the Event, and if it
                                     # fails then there will be no failover
                                     # to a second client, so this value
                                     # causes the failover client to
                                     # degenerate into just a default client.
                                     # It makes sense to set this value to at
                                     # least the number of hosts that you
                                     # specified.

batch-size = 100                     # Must be >=1 (default: 100)

connect-timeout = 20000              # Must be >=1000 (default: 20000)

request-timeout = 20000              # Must be >=1000 (default: 20000)

负载均衡RPC客户端

Flume Client SDK也支持在多主机之间均衡负载的RPCClient。这种客户端需要一个由空格分隔的:列表代表构成负载均衡组的Flume agent列表。
这个客户端可以用以下负载策略进行配置:

  • 从配置的主机中随机选择一个
  • 用轮训方式选择主机

可以写一个类,实现LoadBalancingRpcClient$HostSelector接口,这样就能自定义一个主机的选择顺序。这是,自定义类的FQCN需要配置作为host-selector属性的值。LoadBalancing RPC Client现在不支持thrift。
如果backoff配置为可用,客户端就会把失败的主机列到黑名单中,这样,在指定的超时时间之前这些主机会作为failover 主机拒绝被选择。在超时时间过去后,如果主机仍然未响应,就进入失效队列,超时时间指数增长,以避免等待未响应主机造成的阻塞。
最大backoff时间通过maxBackoff属性配置,以毫秒为单位,默认30秒(在OrderSelector类中指定,它的超类加载了均衡策略)。达到最大backoff时间之后,backoff超时时间会指数增长。最大可以达到65536秒(大约18.2小时)。
例如:

// Setup properties for the load balancing
Properties props = new Properties();
props.put("client.type", "default_loadbalance");

// List of hosts (space-separated list of user-chosen host aliases)
props.put("hosts", "h1 h2 h3");

// host/port pair for each host alias
String host1 = "host1.example.org:41414";
String host2 = "host2.example.org:41414";
String host3 = "host3.example.org:41414";
props.put("hosts.h1", host1);
props.put("hosts.h2", host2);
props.put("hosts.h3", host3);

props.put("host-selector", "random"); // For random host selection
// props.put("host-selector", "round_robin"); // For round-robin host
//                                            // selection
props.put("backoff", "true"); // Disabled by default.

props.put("maxBackoff", "10000"); // Defaults 0, which effectively
                                  // becomes 30000 ms

// Create the client with load balancing properties
RpcClient client = RpcClientFactory.getInstance(props);

为了更好的伸缩性,负载均衡的Flume client实现(LoadBalancingRpcClient)可以按如下配置:

client.type = default_loadbalance

hosts = h1 h2 h3                     # At least 2 hosts are required

hosts.h1 = host1.example.org:41414

hosts.h2 = host2.example.org:41414

hosts.h3 = host3.example.org:41414

backoff = false                      # Specifies whether the client should
                                     # back-off from (i.e. temporarily
                                     # blacklist) a failed host
                                     # (default: false).

maxBackoff = 0                       # Max timeout in millis that a will
                                     # remain inactive due to a previous
                                     # failure with that host (default: 0,
                                     # which effectively becomes 30000)

host-selector = round_robin          # The host selection strategy used
                                     # when load-balancing among hosts
                                     # (default: round_robin).
                                     # Other values are include "random"
                                     # or the FQCN of a custom class
                                     # that implements
                                     # LoadBalancingRpcClient$HostSelector

batch-size = 100                     # Must be >=1 (default: 100)

connect-timeout = 20000              # Must be >=1000 (default: 20000)

request-timeout = 20000              # Must be >=1000 (default: 20000)

Embedded agent

Flume有一个嵌入式agent api允许用户在自己的应用中嵌入agent。这意味着这个agent很轻量,但不是所有sources,sinks,channels都可用。特别是使用的source是一个嵌入式source并且event需要通过EmbeddedAgent对象的put,putAll方法发送到这个source的时候。只有File Channel和Memory Channel是可用的channel,Avro Sink是可用的sink,嵌入式agent也支持Interceptors。
注意:嵌入式agent依赖hadoop-core.jar。
Embedded agent的配置和全版本Agent配置很相似。
以下是全面的配置列表,必选的属性加粗了:

属性名 默认值 描述
source.type embedded 嵌入的source中唯一可用的source
channel.type - 可选memoryfile分别相当于MemoryChannel和FileChannel
channel.* - 配置指定channel的选项,查阅MemoryChannel或者FileChannel用户指南获得详细清单
sinks - sink名列表
sink.type - sink的配置选项,AvroSink 用户指南中有详细清单,但是注意AvroSink至少需要主机名和接口
processor.type - 可选failoverload_balance分别代表FailoverSinksProcessor 和LoadBalancingSinkProcessor
processor.* - 选择的sink处理器的配置项,FailoverSinksProcessor和LoadBalanceingSinkProcessor用户指南中有详细清单
source.interceptor - 空格隔开的拦截器列表
source.interceptor.* - source.interceptors属性中中各拦截器的配置项

下面使用改agent的例子:

Map properties = new HashMap();
properties.put("channel.type", "memory");
properties.put("channel.capacity", "200");
properties.put("sinks", "sink1 sink2");
properties.put("sink1.type", "avro");
properties.put("sink2.type", "avro");
properties.put("sink1.hostname", "collector1.apache.org");
properties.put("sink1.port", "5564");
properties.put("sink2.hostname", "collector2.apache.org");
properties.put("sink2.port",  "5565");
properties.put("processor.type", "load_balance");
properties.put("source.interceptors", "i1");
properties.put("source.interceptors.i1.type", "static");
properties.put("source.interceptors.i1.key", "key1");
properties.put("source.interceptors.i1.value", "value1");

EmbeddedAgent agent = new EmbeddedAgent("myagent");

agent.configure(properties);
agent.start();

List events = Lists.newArrayList();

events.add(event);
events.add(event);
events.add(event);
events.add(event);

agent.putAll(events);

...

agent.stop();

Transaction接口

Transaction接口Flume可靠性的基础。所有主要的组件(例如Source,SinkChannel)必须使用Flume Transaction

Transaction接口

Transaction是在Channel实现中实现的。每个连接到ChannelSourceSink都必须包含一个Transaction对象。SourceChannelProcessor来管理TransactionSink则直接通过配置的Channel来管理它们。存放(把event放到Channel中)或者取出(从Channel中取出)Event的操作在活动的Transaction中完成。
例如:

Channel ch = new MemoryChannel();
Transaction txn = ch.getTransaction();
txn.begin();
try {
  // This try clause includes whatever Channel operations you want to do

  Event eventToStage = EventBuilder.withBody("Hello Flume!",
                       Charset.forName("UTF-8"));
  ch.put(eventToStage);
  // Event takenEvent = ch.take();
  // ...
  txn.commit();
} catch (Throwable t) {
  txn.rollback();

  // Log exception, handle individual exceptions as needed

  // re-throw all Errors
  if (t instanceof Error) {
    throw (Error)t;
  }
} finally {
  txn.close();
}

这里从一个Channel中持有Transaction。在begin()方法返回之后,Transaction就是active/open的,然后Event就被put进Channel了。如果put成功,Transaction就提交然后关闭。

Sink

Sink的目标是从Channel中取出Event然后传输向流中的下一个Flume Agent或者存储到外部仓库。一个Sink和一个Channel关联,在Flume属性文件中配置。有一个SinkRunner实例将每一个配置的Sink联系起来,在Flume框架调用SinkRunner.start()方法的时候,创建一个新的线程来驱动Sink(用SinkRunner.PollingRunner作为线程的Runnable)。这个线程管理着Sink的生命周期。Sink需要实现start()stop()方法,它们都是LifecycleAware接口的一部分。Sink.start()需要初始化Sink并且让Sink达到能把Event发送到目的地的状态。Sink.process()方法要完成从Channel取出Event并发送的核心工作。Sink.stop()方法要处理必要的清理工作(例如资源回收)。Sink实现也需要实现Configurable接口来处理它自己的配置:

public class MySink extends AbstractSink implements Configurable {
  private String myProp;

  @Override
  public void configure(Context context) {
    String myProp = context.getString("myProp", "defaultValue");

    // Process the myProp value (e.g. validation)

    // Store myProp for later retrieval by process() method
    this.myProp = myProp;
  }

  @Override
  public void start() {
    // Initialize the connection to the external repository (e.g. HDFS) that
    // this Sink will forward Events to ..
  }

  @Override
  public void stop () {
    // Disconnect from the external respository and do any
    // additional cleanup (e.g. releasing resources or nulling-out
    // field values) ..
  }

  @Override
  public Status process() throws EventDeliveryException {
    Status status = null;

    // Start transaction
    Channel ch = getChannel();
    Transaction txn = ch.getTransaction();
    txn.begin();
    try {
      // This try clause includes whatever Channel operations you want to do

      Event event = ch.take();

      // Send the Event to the external repository.
      // storeSomeData(e);

      txn.commit();
      status = Status.READY;
    } catch (Throwable t) {
      txn.rollback();

      // Log exception, handle individual exceptions as needed

      status = Status.BACKOFF;

      // re-throw all Errors
      if (t instanceof Error) {
        throw (Error)t;
      }
    }
    return status;
  }
}

Source

Source 的目标是从外部客户端接受数据并将其存放到配置好的Channel中。一个Source能从它自己的ChannelProcessor中获得实例来连续处理Event,并在Channel本地transaction中提交。如果出现异常,相应的Channel将会传递这个异常,所有这些Channel将会回滚它们的transaction,但是之前其他Channel处理的event仍然会提交。
SinkRunner.PollingRunner, Runnable类似,Flume框架调用PollableSourceRunner.start()方法的时候会创建一个线程,在这个线程上面会执行PollingRunner,Runnable。每个配置的PollableSource和它自己的作为PollingRunner运行的线程关联起来。这个线程管理着PollableSource的生命周期,例如开始和结束。PollableSource实现必须实现定义在LifecycleAware接口中的start()和stop()方法。PollableSource的runner调用Source的process()方法。process()方法需要检车新数据并将之作为Flume Event存放到Channel中去。
注意,实际上有两类Source。除了已经提到的PollableSource,一定要有自己的回调机制来捕获新数据,存放到Channel,还有一种EventDrivenSource,和PollableSource不同,不是由自己的线程驱动。
下面是一个自定义的PollableSource:

public class MySource extends AbstractSource implements Configurable, PollableSource {
  private String myProp;

  @Override
  public void configure(Context context) {
    String myProp = context.getString("myProp", "defaultValue");

    // Process the myProp value (e.g. validation, convert to another type, ...)

    // Store myProp for later retrieval by process() method
    this.myProp = myProp;
  }

  @Override
  public void start() {
    // Initialize the connection to the external client
  }

  @Override
  public void stop () {
    // Disconnect from external client and do any additional cleanup
    // (e.g. releasing resources or nulling-out field values) ..
  }

  @Override
  public Status process() throws EventDeliveryException {
    Status status = null;

    try {
      // This try clause includes whatever Channel/Event operations you want to do

      // Receive new data
      Event e = getSomeData();

      // Store the Event into this Source's associated Channel(s)
      getChannelProcessor().processEvent(e);

      status = Status.READY;
    } catch (Throwable t) {
      // Log exception, handle individual exceptions as needed

      status = Status.BACKOFF;

      // re-throw all Errors
      if (t instanceof Error) {
        throw (Error)t;
      }
    } finally {
      txn.close();
    }
    return status;
  }
}

你可能感兴趣的:(Flume简介)