Flume


Environment
Flume-ng 1.9.0
source: TAILDIR
channel: memory
sink: hdfs

1. Performance Testing

Flume was performance-tested against the following scenarios:
Scenario 1: one channel, with LZO compression
Scenario 2: one channel, without compression
Scenario 3: two channels, with LZO compression
Scenario 4: two channels, without compression

Configuration file


# Name the components on this agent
exec-hdfs-agent.sources = r1
exec-hdfs-agent.sinks = s1 s2
exec-hdfs-agent.channels = c1 c2

# Describe/configure the source
exec-hdfs-agent.sources.r1.selector.type = com.sjj.ParityChannelSelector
exec-hdfs-agent.sources.r1.type = TAILDIR
exec-hdfs-agent.sources.r1.channels = c1 c2
exec-hdfs-agent.sources.r1.positionFile = /tmp/flume/logs/taildir_position.json
exec-hdfs-agent.sources.r1.filegroups = test
exec-hdfs-agent.sources.r1.filegroups.test = /tmp/flume/data/.*.test.log

# Describe sink s1
exec-hdfs-agent.sinks.s1.channel = c1
exec-hdfs-agent.sinks.s1.type = hdfs
exec-hdfs-agent.sinks.s1.hdfs.path = hdfs://namenode/test/%y-%m-%d
exec-hdfs-agent.sinks.s1.hdfs.fileType = DataStream
# For the LZO scenarios, use CompressedStream instead of DataStream:
#exec-hdfs-agent.sinks.s1.hdfs.fileType = CompressedStream
#exec-hdfs-agent.sinks.s1.hdfs.codeC = com.hadoop.compression.lzo.LzopCodec
exec-hdfs-agent.sinks.s1.hdfs.writeFormat = Text
exec-hdfs-agent.sinks.s1.hdfs.batchSize = 20000
exec-hdfs-agent.sinks.s1.hdfs.rollSize = 128000000
exec-hdfs-agent.sinks.s1.hdfs.rollCount = 0
exec-hdfs-agent.sinks.s1.hdfs.rollInterval = 0
exec-hdfs-agent.sinks.s1.hdfs.minBlockReplicas = 1
exec-hdfs-agent.sinks.s1.hdfs.callTimeout = 20000
exec-hdfs-agent.sinks.s1.hdfs.useLocalTimeStamp = true
exec-hdfs-agent.sinks.s1.hdfs.fileSuffix = .lzo
exec-hdfs-agent.sinks.s1.hdfs.filePrefix = c1

# Describe sink s2
exec-hdfs-agent.sinks.s2.channel = c2
exec-hdfs-agent.sinks.s2.type = hdfs
exec-hdfs-agent.sinks.s2.hdfs.path = hdfs://namenode/test/%y-%m-%d
exec-hdfs-agent.sinks.s2.hdfs.fileType = DataStream
exec-hdfs-agent.sinks.s2.hdfs.writeFormat = Text
exec-hdfs-agent.sinks.s2.hdfs.batchSize = 10000
exec-hdfs-agent.sinks.s2.hdfs.rollSize = 128000000
exec-hdfs-agent.sinks.s2.hdfs.rollCount = 0
exec-hdfs-agent.sinks.s2.hdfs.rollInterval = 0
exec-hdfs-agent.sinks.s2.hdfs.minBlockReplicas = 1
exec-hdfs-agent.sinks.s2.hdfs.callTimeout = 20000
exec-hdfs-agent.sinks.s2.hdfs.useLocalTimeStamp = true
exec-hdfs-agent.sinks.s2.hdfs.fileSuffix = .lzo
exec-hdfs-agent.sinks.s2.hdfs.filePrefix = c2

# Channel c1 buffers events in memory
exec-hdfs-agent.channels.c1.type = memory
exec-hdfs-agent.channels.c1.capacity = 50000
exec-hdfs-agent.channels.c1.transactionCapacity = 10000

# Channel c2 buffers events in memory
exec-hdfs-agent.channels.c2.type = memory
exec-hdfs-agent.channels.c2.capacity = 50000
exec-hdfs-agent.channels.c2.transactionCapacity = 10000
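A note on the selector: com.sjj.ParityChannelSelector is a custom channel selector whose source is not included in this post; judging by the name, it splits events between c1 and c2 by some parity rule so both sinks share the load in the two-channel scenarios. For the one-channel scenarios, the selector line and the c2/s2 definitions are simply removed.

Assuming the configuration above is saved as conf/test.conf (the file name is arbitrary), the agent can be started with the standard flume-ng launcher; -n must match the agent name used as the property prefix:

$FLUME_HOME/bin/flume-ng agent -n exec-hdfs-agent -c $FLUME_HOME/conf -f conf/test.conf -Dflume.root.logger=INFO,console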

2. Flume writing to HDFS (with Kerberos enabled)

(1) Add the configuration
Add the following to the Flume configuration file.
Replace [email protected] and /usr/local/flume/conf/test.keytab with your own principal and keytab path.

[email protected]
exec-hdfs-agent.sinks.s2.hdfs.kerberosKeytab=/usr/local/flume/conf/test.keytab
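The snippet above only configures sink s2. If you are running the two-sink layout from section 1, s1 presumably needs the same pair of properties (same principal and keytab, different sink name):

[email protected]
exec-hdfs-agent.sinks.s1.hdfs.kerberosKeytab=/usr/local/flume/conf/test.keytab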

(2) Copy the files

  1. Copy the keytab file into Flume's conf directory (for creating the keytab and related steps, see my other post).
  2. Copy core-site.xml and hdfs-site.xml from the Hadoop cluster into Flume's conf directory.

3. Flume writing to HDFS (with LZO compression)

(1) Add the configuration
fileType = CompressedStream turns on compression, and codeC selects the codec; keep fileSuffix = .lzo so downstream consumers recognize the files.

exec-hdfs-agent.sinks.s2.hdfs.fileType = CompressedStream
exec-hdfs-agent.sinks.s2.hdfs.codeC = com.hadoop.compression.lzo.LzopCodec

(2) Copy the JAR files

  1. If you compiled lzo and hadoop-lzo by hand, just place the resulting jar under plugins.d (see the layout sketch below).
  2. If hadoop-lzo was installed through a Cloudera parcel, place both the jar and the links under its native directory into plugins.d.
    For why this works, see the difference between hadoop-lzo.jar and hadoop-gpl-compression.jar: http://guoyunsky.iteye.com/blog/1289475
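Flume picks up third-party code from $FLUME_HOME/plugins.d, where each plugin directory follows the lib / libext / native convention from the Flume user guide. A plausible layout for hadoop-lzo (file names are illustrative, not exact):

plugins.d/
  hadoop-lzo/
    lib/
      hadoop-lzo.jar          <- the jar you built, or the one shipped in the parcel
    native/
      libgplcompression.so    <- native LZO bindings (copies or links)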

(3) Copy the configuration file
Pull core-site.xml from the Hadoop cluster into flume/conf. The part that actually matters is the io.compression.codecs property:


<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,org.apache.hadoop.io.compress.SnappyCodec,org.apache.hadoop.io.compress.Lz4Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>

4. HA
Because of the HDFS cluster's HA mechanism, when the NameNode roles switch, Flume starts throwing exceptions on delivery: Operation category READ (WRITE) is not supported in state standby, since the standby NameNode does not serve client requests. Flume is then effectively down until someone manually edits the NameNode in the sink's hdfs.path and restarts the agent. With many log-collection servers this costs real operator effort; worse, if delivery monitoring is not in place, the failure may only be discovered when the logs are actually needed.

The fix is simple: copy hdfs-site.xml from the cluster into Flume's conf directory. Flume can then resolve the active NameNode itself, so logs keep reaching HDFS correctly across a failover. With this in place, hdfs.path no longer needs a hard-coded NameNode host.
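For reference, the HA-related entries Flume relies on in hdfs-site.xml look roughly like this (the nameservice mycluster and the hosts namenode1/namenode2 are placeholders for your cluster's actual values):

<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

With the nameservice defined, hdfs.path can reference it explicitly (hdfs://mycluster/test/%y-%m-%d) or omit the scheme and host entirely: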

exec-hdfs-agent.sinks.s1.hdfs.path = /test/%y-%m-%d
