A Brief Introduction to Flume

From the official site: Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.

The use of Apache Flume is not only restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data including but not limited to network traffic data, social-media-generated data, email messages and pretty much any data source possible.

This document uses the apache-flume-1.8.0-bin release.

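System requirements

Java 8 or later
Sufficient memory for the configurations used by sources, channels, or sinks
Sufficient disk space for the configurations used by channels or sinks
Directory permissions: the agent needs read/write access to the directories it uses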

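Features

Complex flows
Reliability
Recoverability
Environment variables in configuration files: (1) add the JVM option -DpropertiesImplementation=org.apache.flume.node.EnvVarResolverProperties in the startup script; (2) define the variables in conf/flume-env.sh
ZooKeeper-based configuration
Third-party plugins: at startup, Flume loads the plugins found under the FLUME_HOME/plugins.d directory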
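
As a rough illustration of the environment-variable feature listed above, the fragment below assumes a hypothetical variable NC_PORT (exported in conf/flume-env.sh or in the shell) and a hypothetical agent a1 with a netcat source r1. It is only a sketch, not a complete agent definition: the channel c1 still has to be declared elsewhere. Note that substitution applies to property values only, not to keys.

```properties
# Assumption: NC_PORT is exported before startup, e.g. "export NC_PORT=44444" in conf/flume-env.sh,
# and the agent is started with
# -DpropertiesImplementation=org.apache.flume.node.EnvVarResolverProperties

# Hypothetical agent "a1": the netcat source takes its port from the environment
a1.sources = r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = ${NC_PORT}
a1.sources.r1.channels = c1
```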

Setting up Flume

Flume works out of the box, but some of its default settings are quite small, so a few parameters need to be adjusted.

Only two files need to change: bin/flume-ng and conf/flume-env.sh.

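Edit bin/flume-ng: change line 225 of the file to JAVA_OPTS="-Xmx1024m". The default heap limit is only 20 MB (-Xmx20m).

To enable remote debugging, change that line to the following instead (suspend=y makes the JVM wait until a debugger attaches on port 5005):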
```
 JAVA_OPTS="-Xmx2048m -Xdebug -Xrunjdwp:transport=dt_socket,address=5005,server=y,suspend=y"
```
Edit conf/flume-env.sh: rename flume-env.sh.template in the conf directory to flume-env.sh, then set the following in that file:
```
 export JAVA_OPTS="-Xms1024m -Xmx1024m -Dcom.sun.management.jmxremote"
```

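A simple getting-started example

Create a directory named otherconf under FLUME_HOME (any name will do) to hold the configuration files we write.
In the new otherconf directory, create a configuration file named example.conf (again, any name will do).

Add the following configuration to the new example.conf file: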

 # Name the components on this agent
 a1.sources = r1
 a1.sinks = k1
 a1.channels = c1

 # Describe/configure the source
 a1.sources.r1.type = seq

 # Use a channel which buffers events in memory
 a1.channels.c1.type = memory
 a1.channels.c1.capacity = 1000
 a1.channels.c1.transactionCapacity = 100

 # Describe the sink
 a1.sinks.k1.type = file_roll
 a1.sinks.k1.sink.directory = D:/flume-test

 # Bind the source and sink to the channel
 a1.sources.r1.channels = c1
 a1.sinks.k1.channel = c1

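Starting the Flume agent

Before starting, make sure the directory configured for the sink already exists; otherwise Flume throws a directory-not-found error.

Change to the FLUME_HOME directory and run the following command.

 Startup command on Windows: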
 .\bin\flume-ng.cmd agent --conf conf --conf-file otherconf\example.conf --name a1 -property flume.root.logger=INFO,console

 Startup command on Linux:
 ./bin/flume-ng agent --conf conf --conf-file otherconf/example.conf --name a1 -Dflume.root.logger=INFO,console

Data ingestion

Flume supports several mechanisms for ingesting data from external sources:

RPC
Executing commands

Network streams

  Avro
  Thrift
  Syslog
  Netcat

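Multi-hop flows

  For data to flow across multiple agents (hops), the sink of the upstream agent and the source of the downstream agent must both be of type avro, and the sink must point to the hostname (or IP address) and port of the source.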
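
The wiring described above can be sketched roughly as follows. The agent names front and back, the component names, the address 10.0.0.2, and the port 4545 are all placeholders chosen for illustration; sources, channels, and the remaining sinks of each agent are omitted.

```properties
# Upstream agent (hypothetical name "front"): its Avro sink sends events
# to the host and port where the downstream Avro source listens
front.sinks = k1
front.sinks.k1.type = avro
front.sinks.k1.hostname = 10.0.0.2
front.sinks.k1.port = 4545
front.sinks.k1.channel = c1

# Downstream agent (hypothetical name "back"): its Avro source listens
# on the port that the upstream sink points to
back.sources = r1
back.sources.r1.type = avro
back.sources.r1.bind = 0.0.0.0
back.sources.r1.port = 4545
back.sources.r1.channels = c1
```

Each agent runs from its own configuration file and is started with its own --name value.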

Adding multiple flows in an agent

# list the sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source1 exec-tail-source2
agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
agent_foo.channels = mem-channel-1 file-channel-2

# flow #1 configuration
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1
agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1

# flow #2 configuration
agent_foo.sources.exec-tail-source2.channels = file-channel-2
agent_foo.sinks.avro-forward-sink2.channel = file-channel-2

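Fan-in and fan-out

Fan-in means that multiple sources write to the same channel at the same time. Fan-out means that a single source writes to multiple channels at the same time.

Fan-out has two modes: replicating and multiplexing. In replicating mode, every event is sent to all of the configured channels. In multiplexing mode, events are routed selectively: the selector compares the value of a configured header against its mappings to choose the channels, and if nothing matches, the event goes to the default channel. Each mapping can also list more than one channel.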
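
For comparison with multiplexing, a replicating fan-out might look like the fragment below. It reuses the agent_foo component names from this document as an assumed setup and leaves out the sinks and the component types.

```properties
# Assumed variant of the agent_foo setup: one source fanned out to two channels
agent_foo.sources = avro-AppSrv-source1
agent_foo.channels = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1 file-channel-2

# "replicating" is the default selector type, so this line is optional;
# every event is copied to both channels
agent_foo.sources.avro-AppSrv-source1.selector.type = replicating
```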

# list the sources, sinks and channels in the agent
agent_foo.sources = avro-AppSrv-source1
agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
agent_foo.channels = mem-channel-1 file-channel-2

# set channels for source
agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1 file-channel-2

# set channel for sinks
agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1
agent_foo.sinks.avro-forward-sink2.channel = file-channel-2

# channel selector configuration
agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
agent_foo.sources.avro-AppSrv-source1.selector.header = State
agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1

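The example above applies multiplexing to avro-AppSrv-source1. The selector checks the event header named State: if the value is CA, the event goes to mem-channel-1; if AZ, to file-channel-2; if NY, to both; and if nothing matches, to the default mem-channel-1.

The selector also supports optional channels, configured as follows: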
agent_foo.sources.avro-AppSrv-source1.selector.optional.CA = mem-channel-1 file-channel-2

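The selector first attempts to write to the required channels, and fails the transaction if even one of them cannot consume the events. The transaction is then retried on all of the channels. Once all required channels have consumed the events, the selector attempts to write to the optional channels. A failure by any optional channel to consume an event is simply ignored and not retried.

If the optional and required channels for a given header value overlap, the channel is treated as required, and a failure in it causes the whole set of required channels to be retried. In the example above, mem-channel-1 counts as a required channel for the header value CA even though it is listed as both required and optional, so a failed write to it causes the event to be retried on all channels configured for the selector.

Note that if a header value has no required channels, events are written to the default channels and the selector also attempts to write them to that value's optional channels. In other words, specifying optional channels without any required channels still results in events being written to the default channels. If there are neither required nor default channels, the selector attempts to write the events to the optional channels only. In that case, any failures are simply ignored.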
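
To make the last case more concrete, here is a small sketch that extends the agent_foo example with a header value that has only an optional channel. The value TX is invented for illustration and does not appear in the original configuration.

```properties
# Hypothetical addition: the State value "TX" has no required (mapping.*) entry
agent_foo.sources.avro-AppSrv-source1.selector.optional.TX = file-channel-2

# With selector.default = mem-channel-1 still in place, an event whose State
# header is TX is written to mem-channel-1 (the default channel) and then
# attempted on file-channel-2 (optional); a failure on file-channel-2 is
# ignored and not retried.
```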

Flume getting-started examples