Flume Sink是将事件写入到Hadoop分布式文件系统(HDFS)中。主要是Flume在Hadoop环境中的应用,即Flume采集数据输出到HDFS,适用大数据日志场景。
目前,它支持HDFS的文本和序列文件格式,以及支持两个文件类型的压缩。支持将所用的时间、数据大小、事件的数量为操作参数,对HDFS文件进行关闭(关闭当前文件,并创建一个新的)。它还可以对事源的机器名(hostname)及时间属性分离数据,即通过时间戳将数据分布到对应的文件路径。 HDFS目录路径可能包含格式转义序列用于取代由HDFS Sink生成一个目录/文件名存储的事件。
注意:Hadoop的版本需要支持sync()方法调用,当然首先得按照Hadoop。
下面是HDFS Sinks转义符的支持目录:
Alias |
Description |
%{host} |
Substitute value of event header named “host”. Arbitrary header names are supported. |
%t |
Unix time in milliseconds |
%a |
locale’s short weekday name (Mon, Tue, ...) |
%A |
locale’s full weekday name (Monday, Tuesday, ...) |
%b |
locale’s short month name (Jan, Feb, ...) |
%B |
locale’s long month name (January, February, ...) |
%c |
locale’s date and time (Thu Mar 3 23:05:25 2005) |
%d |
day of month (01) 每月中的第几天 |
%D |
date; same as %m/%d/%y |
%H |
hour (00..23) |
%I |
hour (01..12) |
%j |
day of year (001..366) 一年中的第几天 |
%k |
hour ( 0..23) |
%m |
month (01..12) |
%M |
minute (00..59) |
%p |
locale’s equivalent of am or pm |
%s |
seconds since 1970-01-01 00:00:00 UTC |
%S |
second (00..59) |
%y |
last two digits of year (00..99) 年的后两位 |
%Y |
year (2010) |
%z |
+hhmm numeric timezone (for example, -0400) |
下面是官网给出的HDFS Sinks的配置,加粗的参数是必选,可选项十分丰富,这里就不一一列出来了
Name |
Default |
Description |
channel |
– |
|
type |
– |
The component type name, needs to be hdfs |
hdfs.path |
– |
HDFS directory path (eg hdfs://namenode/flume/webdata/) |
hdfs.filePrefix |
FlumeData |
Name prefixed to files created by Flume in hdfs directory 文件前缀 |
hdfs.fileType |
SequenceFile |
File format: currently SequenceFile, DataStream or CompressedStream |
hdfs.useLocalTimeStamp |
false |
Use the local time (instead of the timestamp from the event header) while replacing the escape sequences. |
hdfs.codeC |
– |
Compression codec. one of following : gzip, bzip2, lzo, lzop, snappy |
hdfs.round |
false |
Should the timestamp be rounded down (if true, affects all time based escape sequences except %t) 定时间用 |
hdfs.roundValue |
1 |
Rounded down to the highest multiple of this (in the unit configured using hdfs.roundUnit), less than current time.(需要hdfs.round为true) |
hdfs.roundUnit |
second |
The unit of the round down value - second, minute or hour.(同上) |
下面是官网的例子,他的三个round*配置是将向下舍入到最后10分钟的时间戳记录。
假设现在是上午10时56分20秒等等,2014年10月24日的Flume Sinks的数据到输出到HDFS的路径为/flume/events/2014-10-24/1050/00的。。
a1.channels=c1
a1.sinks=k1
a1.sinks.k1.type=hdfs
a1.sinks.k1.channel=c1
a1.sinks.k1.hdfs.path=/flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix=events-
a1.sinks.k1.hdfs.round=true
a1.sinks.k1.hdfs.roundValue=10
a1.sinks.k1.hdfs.roundUnit=minute
下面是实际的例子:
#配置文件:hdfs_case9.conf
#Name the components on this agent
a1.sources= r1
a1.sinks= k1
a1.channels= c1
#Describe/configure the source
a1.sources.r1.type= syslogtcp
a1.sources.r1.bind= 192.168.233.128
a1.sources.r1.port= 50000
a1.sources.r1.channels= c1
#Describe the sink
a1.sinks.k1.type= hdfs
a1.sinks.k1.channel= c1
a1.sinks.k1.hdfs.path= hdfs://carl:9000/flume/
a1.sinks.k1.hdfs.filePrefix= carl
a1.sinks.k1.hdfs.round= true
a1.sinks.k1.hdfs.roundValue= 1
a1.sinks.k1.hdfs.roundUnit= minute
a1.sinks.k1.hdfs.fileType=DataStream
# Usea channel which buffers events in memory
a1.channels.c1.type= memory
a1.channels.c1.capacity= 1000
a1.channels.c1.transactionCapacity= 100
这里我们偷懒拷了上节TCP的例子,然后加入sinks为HDFS中。我们设置数据是放入在HDFS的目录为hdfs://carl:9000/flume/,文件前缀为carl,其中这里有个设置要说明下:a1.sinks.k1.hdfs.fileType=DataStream,因为文件格式默认是 SequenceFile,如果直接打开是乱码,这个不方便演示,因此我们设置成普通数据格式。
#敲命令
flume-ng agent -cconf -f conf/hdfs_case9.conf -n a1 -Dflume.root.logger=INFO,console
启动成功后
打开另一个终端输入,往侦听端口送数据
echo "hello looklook7hello hdfs" | nc 192.168.233.128 50000
#在启动的终端查看console输出
这里可以看到他报了一个错误,说isfileclosed不可用。。。这个是这样的,这边的Hadoop是cdh3版本的,而flume ng 是说支持cdh4版本的,所以版本不匹配。不过这个无妨,下面看他们数据已经插入进去了,一开始生成一个hdfs://carl:9000/flume//carl.1414122459804.tmp,
然后数据进去了生成文件hdfs://carl:9000/flume/carl.1414122459804
那我们看下数据文件,hdfs://carl:9000/flume/carl.1414122459804
我们看到日志文件的生成过程,最后数据已经进去了。
然后我对配置文件里的这这个参数改下,参照官网的例子
a1.sinks.k1.hdfs.path= hdfs://carl:9000/flume/%y-%m-%d/%H%M/%S
然后加上这个参数
a1.sinks.k1.hdfs.useLocalTimeStamp=true
启动
打开另一个终端输入,往侦听端口送数据
echo "hello looklook7hello hdfs" | nc 192.168.233.128 50000
这里如果不加上面的参数a1.sinks.k1.hdfs.useLocalTimeStamp=true,会需要向事件里面明确header,否则会报错,如下
数据成功发送后,会生成数据文件
数据目录是/flume/14-10-24/1354/00
因为我们设的参数是1分钟a1.sinks.k1.hdfs.roundValue= 1 这个与官网讲的一致
INFO级别的日志事件。通常有用的测试/调试目的。之前的测试里有些,下面就不多赘述
下面是官网配置
Property Name |
Default |
Description |
channel |
– |
|
type |
– |
The component type name, needs to be logger |
Avro Sink主要用于Flume分层结构。Flumeevent 发送给这个sink的事件都会转换成Avro事件,发送到配置好的Avro主机和端口上。这些事件可以批量传输给通道。
下面是官网配置,加粗为必须,可选项太多就不一一列了
Property Name |
Default Description |
|
channel |
– |
|
type |
– |
The component type name, needs to be avro. |
hostname |
– |
The hostname or IP address to bind to. |
port |
– |
The port # to listen on. |
下面是官网例子
a1.channels=c1
a1.sinks=k1
a1.sinks.k1.type=avro
a1.sinks.k1.channel=c1
a1.sinks.k1.hostname=10.10.10.10
a1.sinks.k1.port=4545
因为Avro Sink主要用于Flume分层结构,那么这边都会想到我们学习心得(二)关于集群配置的列子就是关于Avro Sink与Avro Source的一个实例,其中pull.cof是关于Avro Source的例子,而push.conf 是Avro Sink的例子,具体内容大家可以去第二节看,这里不做赘述。
三、Avro Sink
Thrift也是用来支持Flume分层结构。Flumeevent 发送给这个sink的事件都会转换成Thrift事件,发送到配置好的Thrift主机和端口上。这些事件可以批量传输给通道。和Avro Sink一模一样。这边也就略过了。
IRC Sink 从通道中取得信息到IRCServer,这个没有IRC Server。。。无法测试,也略过吧。。。
存储到本地存储中。他有个滚动间隔的设置,设置多长时间去生成文件(默认是30秒)。
下面是官网配置
Property Name |
Default |
Description |
channel |
– |
|
type |
– |
The component type name, needs to be file_roll. |
sink.directory |
– |
The directory where files will be stored |
sink.rollInterval |
30 |
Roll the file every 30 seconds. Specifying 0 will disable rolling and cause all events to be written to a single file. |
sink.serializer |
TEXT |
Other possible options include avro_event or the FQCN of an implementation of EventSerializer.Builder interface. |
batchSize |
100 |
|
接下去是官网例子
a1.channels=c1
a1.sinks=k1
a1.sinks.k1.type=file_roll
a1.sinks.k1.channel=c1
a1.sinks.k1.sink.directory=/var/log/flume
下面是测试例子:
#配置文件:fileroll_case10.conf
#Name the components on this agent
a1.sources= r1
a1.sinks= k1
a1.channels= c1
#Describe/configure the source
a1.sources.r1.type= syslogtcp
a1.sources.r1.port= 50000
a1.sources.r1.host= 192.168.233.128
a1.sources.r1.channels= c1
#Describe the sink
a1.sinks.k1.type= file_roll
a1.sinks.k1.channel= c1
a1.sinks.k1.sink.directory= /tmp/logs
# Usea channel which buffers events in memory
a1.channels.c1.type= memory
a1.channels.c1.capacity= 1000
a1.channels.c1.transactionCapacity= 100
#敲命令
flume-ng agent -cconf -f conf/fileroll_case10.conf -n a1 -Dflume.root.logger=INFO,console
启动成功后
打开另一个终端输入,往侦听端口送数据
echo "hello looklook5hello hdfs" | nc 192.168.233.128 50000
#在启动的终端查看console输出
可以看到数据传过来并生成文件,然后无论是否有数据传过来,都会每过30秒就会生成文件。
丢弃从通道接收的所有事件。。。这边就不测试了。。
下面是官网配置
Property Name |
Default |
Description |
channel |
– |
|
type |
– |
The component type name, needs to be null. |
batchSize |
100 |
|
下面是官网例子
a1.channels=c1
a1.sinks=k1
a1.sinks.k1.type=null
a1.sinks.k1.channel=c1
HBaseSinks负责将数据写入到Hbase中。Hbase的配置信息从classpath路径里面遇到的第一个hbase-site.xml文件中获取。在配置文件中指定的实现了HbaseEventSerializer 接口的类,用于将事件转换成Hbase所表示的事件或者增量。然后将这些事件和增量写入Hbase中。
Hbase Sink支持写数据到安全的Hbase。为了将数据写入安全的Hbase,用户代理运行必须对配置的table表有写权限。主要用来验证对KDC的密钥表可以在配置中指定。在Flume Agent的classpath路径下的Hbase-site.xml文件必须设置到Kerberos认证。
注意有一定很重要,就是这个sinks 对格式的规范要求非常高。
至于 AsyncHBaseSink则是异步的HBaseSinks。
这边没有HBase环境,因此也就不演示了。。
一个自定义 Sinks其实是对Sinks接口的实现。当我们开始flume代理的时候必须将自定义Sinks和相依赖的jar包放到代理的classpath下面。自定义 Sinks的type就是我们实现Sinks接口对应的类全路径。
这里后面的内容里会详细介绍,这里不做赘述。
Source通过通道添加事件,Sinks通过通道取事件。所以通道类似缓存的存在。
Memory Channel是事件存储在一个内存队列中。速度快,吞吐量大。但会有代理出现故障后数据丢失的情况。
下面是官网配置
Property Name |
Default |
Description |
type |
– |
The component type name, needs to be memory |
capacity |
100 |
The maximum number of events stored in the channel |
transactionCapacity |
100 |
The maximum number of events the channel will take from a source or give to a sink per transaction |
keep-alive |
3 |
Timeout in seconds for adding or removing an event |
byteCapacityBufferPercentage |
20 |
Defines the percent of buffer between byteCapacity and the estimated total size of all events in the channel, to account for data in headers. See below. |
byteCapacity |
see description |
Maximum total bytes of memory allowed as a sum of all events in this channel. The implementation only counts the Event body, which is the reason for providing thebyteCapacityBufferPercentage configuration parameter as well. Defaults to a computed value equal to 80% of the maximum memory available to the JVM (i.e. 80% of the -Xmx value passed on the command line). Note that if you have multiple memory channels on a single JVM, and they happen to hold the same physical events (i.e. if you are using a replicating channel selector from a single source) then those event sizes may be double-counted for channel byteCapacity purposes. Setting this value to 0 will cause this value to fall back to a hard internal limit of about 200 GB. |
以及官网例子
a1.channels=c1
a1.channels.c1.type=memory
a1.channels.c1.capacity=10000
a1.channels.c1.transactionCapacity=10000
a1.channels.c1.byteCapacityBufferPercentage=20
a1.channels.c1.byteCapacity=800000
之前的例子全部是Memory Channel。关于Channel的列子不好演示,后面就不会有例子了。
JDBC Channel是把事件存储在数据库。目前的JDBC Channel支持嵌入式Derby。主要是为了数据持久化,并且可恢复的特性。
Property Name |
Default |
Description |
type |
– |
The component type name, needs to be jdbc |
db.type |
DERBY |
Database vendor, needs to be DERBY. |
driver.class |
org.apache.derby.jdbc.EmbeddedDriver |
Class for vendor’s JDBC driver |
driver.url |
(constructed from other properties) |
JDBC connection URL |
db.username |
“sa” |
User id for db connection |
db.password |
– |
password for db connection |
下面是官网例子:
a1.channels=c1
a1.channels.c1.type=jdbc
注意默认情况下,File Channel使用检查点(checkpointDir)和在用户目录(dataDirs)上指定的数据目录。所以在一个agent下面启动多个File Channel实例,只会有一个File channel能锁住文件目录,其他的都将初始化失败。因此,有必要提供明确的路径的所有已配置的通道,同时考虑最大吞吐率,检查点与数据目录最好是在不同的磁盘上。
Property Name Default |
Description |
|
type |
– |
The component type name, needs to be file. |
checkpointDir |
~/.flume/file-channel/checkpoint |
The directory where checkp |
dataDirs |
~/.flume/file-channel/data |
Comma separated list of directories for storing log files. Using multiple directories on separate disks can improve file channel peformance |
下面是官网例子
a1.channels=c1
a1.channels.c1.type=file
a1.channels.c1.checkpointDir=/mnt/flume/checkpoint
a1.channels.c1.dataDirs=/mnt/flume/data
File Channel 加密官网也给出了相应的配置
Generating a key with a password seperate from the key store password:
keytool -genseckey -alias key-0 -keypasskeyPassword -keyalg AES\
-keysize 128 -validity 9000 -keystore test.keystore\
-storetype jceks -storepass keyStorePassword
Generating a key with the password the same as the key store password:
keytool -genseckey -alias key-1 -keyalgAES -keysize 128 -validity 9000\
-keystore src/test/resources/test.keystore -storetype jceks\
-storepass keyStorePassword
a1.channels.c1.encryption.activeKey=key-0
a1.channels.c1.encryption.cipherProvider=AESCTRNOPADDING
a1.channels.c1.encryption.keyProvider=key-provider-0
a1.channels.c1.encryption.keyProvider=JCEKSFILE
a1.channels.c1.encryption.keyProvider.keyStoreFile=/path/to/my.keystore
a1.channels.c1.encryption.keyProvider.keyStorePasswordFile=/path/to/my.keystore.password
a1.channels.c1.encryption.keyProvider.keys=key-0
Let’s say you have aged key-0 out and new files should be encrypted withkey-1:
a1.channels.c1.encryption.activeKey=key-1
a1.channels.c1.encryption.cipherProvider=AESCTRNOPADDING
a1.channels.c1.encryption.keyProvider=JCEKSFILE
a1.channels.c1.encryption.keyProvider.keyStoreFile=/path/to/my.keystore
a1.channels.c1.encryption.keyProvider.keyStorePasswordFile=/path/to/my.keystore.password
a1.channels.c1.encryption.keyProvider.keys=key-0 key-1
The same scenerio as above, however key-0 has its own password:
a1.channels.c1.encryption.activeKey=key-1
a1.channels.c1.encryption.cipherProvider=AESCTRNOPADDING
a1.channels.c1.encryption.keyProvider=JCEKSFILE
a1.channels.c1.encryption.keyProvider.keyStoreFile=/path/to/my.keystore
a1.channels.c1.encryption.keyProvider.keyStorePasswordFile=/path/to/my.keystore.password
a1.channels.c1.encryption.keyProvider.keys=key-0 key-1
a1.channels.c1.encryption.keyProvider.keys.key-0.passwordFile=/path/to/key-0.password
前者还在试验阶段。。后者仅仅用来测试目的,不是在生产环境中使用,所以略过。
Custom Channel是对channel接口的实现。需要在classpath中引入实现类和相关的jar文件。这Channel对应的type是该类的完整路径
下面是官网配置
Property Name |
Default |
Description |
type |
– |
The component type name, needs to be a FQCN |
后面是官网例子
a1.channels=c1
a1.channels.c1.type=org.example.MyChannel