- Introduction to Flume
- Flume External Architecture
Data generators (e.g., Facebook, Twitter) produce data that is collected by individual agents running on the servers where the data originates; a collector then aggregates the data from these agents and stores it in HDFS or HBase.
- Event: Flume's Basic Unit of Data Transfer
An Event consists of a set of string headers plus a byte-array body, and is the unit in which Flume moves data from source to sink.
- Agent: The Core of Flume
The core of Flume is the Agent. A Flume deployment contains one or more Agents, each of which runs as an independent daemon process. As shown in the figure above, an Agent is made up of three components: source, channel, and sink.
- How Flume Works
The heart of Flume is the agent, which interacts with the outside world at two points: the source, which accepts incoming data, and the sink, which sends data on to an external destination. When the source receives data, it hands it to the channel; the channel acts as a buffer and holds the data temporarily, and the sink then delivers the data from the channel to the configured destination, such as HDFS or HBase.
Note: the channel deletes its temporary copy of the data only after the sink has successfully sent it on. This mechanism guarantees reliable, safe data transfer.
This is exactly where Flume's reliability comes from: an Event is removed from the channel only once delivery is confirmed, so if any hop fails, the Event stays in the channel and is retried.
- Agent Interceptor
- Agent Selector
Channel selectors come in two types:
– replicating (the default): each Event is copied to every configured channel
– multiplexing: each Event is routed to a channel based on the value of one of its headers
- Installing Flume
wget http://www.apache.org/dist/flume/stable/apache-flume-1.9.0-bin.tar.gz
tar -zxvf apache-flume-1.9.0-bin.tar.gz
vim .bash_profile
Add the Flume environment variables:
export FLUME_HOME=/app/apache-flume-1.9.0-bin
export PATH=$PATH:$FLUME_HOME/bin
After saving the file, source it so the configuration takes effect:
source .bash_profile
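A quick sanity check that flume-ng is now on the PATH (this assumes Java is already visible to the shell; otherwise set JAVA_HOME in flume-env.sh as shown below):
flume-ng version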
Next, switch to Flume's conf directory and create flume-env.sh from the template:
cd $FLUME_HOME/conf
cp flume-env.sh.template flume-env.sh
vim flume-env.sh
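The main setting in flume-env.sh is JAVA_HOME; the path below is only an example, point it at your local JDK:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Then distribute the installation to the other nodes: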
scp -r apache-flume-1.9.0-bin/ hongqiang@slaver1:/app/
scp -r apache-flume-1.9.0-bin/ hongqiang@slaver2:/app/
- Flume in Practice
– Practice 1: netcat source → logger sink
vim flume_netcat.conf
# Name the components on this agent
# (defines an agent named a1 with source r1, sink k1, and channel c1;
# note that comments must sit on their own line in a Flume properties file)
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
# netcat source: events arrive as lines of text over the network
a1.sources.r1.type = netcat
# listen on the master node
a1.sources.r1.bind = master
# listening port
a1.sources.r1.port = 44444
# Describe the sink
# logger sink: write each event to the agent's log
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
# maximum number of Events the channel can hold
a1.channels.c1.capacity = 1000
# maximum number of Events moved per transaction
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Run the command:
flume-ng agent --conf conf --conf-file ./flume_netcat.conf --name a1 -Dflume.root.logger=INFO,console
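With the agent running, you can test it from another terminal by sending lines to port 44444; each line becomes an Event and shows up in the agent's console log (nc is used here, telnet works as well):
nc master 44444
hello flume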
– Practice 2: exec source → logger sink
vim flume_exec.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /app/apache-flume-1.9.0-bin/test_data/1.log
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Run the command:
flume-ng agent --conf conf --conf-file ./flume_exec.conf --name a1 -Dflume.root.logger=INFO,console
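To generate some traffic for the exec source, append a line to the tailed file from another terminal; it should appear in the agent's console output:
echo 'hello exec source' >> /app/apache-flume-1.9.0-bin/test_data/1.log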
– Practice 3: exec source → HDFS sink
vim flume_hdfs_webpy.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
## exec means Flume runs the given command and reads data from its output
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /app/apache-flume-1.9.0-bin/test_data/1.log
a1.sources.r1.channels = c1
# Describe the sink
## sink to HDFS; the sink type determines which parameters below apply
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
## where on HDFS to write the files; the path is not hard-coded,
## the output directory name changes dynamically with the timestamp
a1.sinks.k1.hdfs.path = /flume/tailout/%y-%m-%d/%H%M/
## use the local timestamp when resolving the escape sequences in the path
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# file format of the generated files; the default is SequenceFile, DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
## use an in-memory channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Run the command:
flume-ng agent --conf conf --conf-file ./flume_hdfs_webpy.conf --name a1 -Dflume.root.logger=INFO,console
The result: when new data is detected in 1.log, the appended lines (tail -F initially emits the last 10 lines) are written to the corresponding path in HDFS.
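The output can be checked directly on HDFS; the directory layout follows the %y-%m-%d/%H%M pattern in hdfs.path:
hdfs dfs -ls -R /flume/tailout/
hdfs dfs -cat /flume/tailout/*/*/*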
– Practice 4: failover sink group (the master fans out to slaver1/slaver2 with automatic failover)
Master node configuration:
vim flume-client.properties_failover
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /app/apache-flume-1.9.0-bin/test_data/1.log
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = slaver1
a1.sinks.k1.port = 52020
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = slaver2
a1.sinks.k2.port = 52020
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
# the sink with the higher priority value is the primary, so slaver1 is the primary node here
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 1
a1.sinkgroups.g1.processor.maxpenalty = 10000
The Flume configuration file on the slaver1 node:
vim flume-server-failover.conf
# agent1 name
a1.channels = c1
a1.sources = r1
a1.sinks = k1
#set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# avro source: receive data forwarded from the master node
a1.sources.r1.type = avro
# listen on slaver1
a1.sources.r1.bind = slaver1
a1.sources.r1.port = 52020
# sink: logger for this demo (an HDFS sink alternative is commented out below)
a1.sinks.k1.type = logger
# a1.sinks.k1.type = hdfs
# a1.sinks.k1.hdfs.path=/flume_data_pool
# a1.sinks.k1.hdfs.fileType=DataStream
# a1.sinks.k1.hdfs.writeFormat=TEXT
# a1.sinks.k1.hdfs.rollInterval=1
# a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
The Flume configuration file on the slaver2 node:
vim flume-server-failover.conf
# agent1 name
a1.channels = c1
a1.sources = r1
a1.sinks = k1
#set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# avro source: receive data forwarded from the master node
a1.sources.r1.type = avro
# listen on slaver2
a1.sources.r1.bind = slaver2
a1.sources.r1.port = 52020
# sink: logger for this demo (an HDFS sink alternative is commented out below)
a1.sinks.k1.type = logger
# a1.sinks.k1.type = hdfs
# a1.sinks.k1.hdfs.path=/flume_data_pool
# a1.sinks.k1.hdfs.fileType=DataStream
# a1.sinks.k1.hdfs.writeFormat=TEXT
# a1.sinks.k1.hdfs.rollInterval=1
# a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Once the configuration is complete, start Flume on slaver1 and slaver2 first, then start Flume on the master node. When data is written to the 1.log file, the primary node slaver1 receives it. If Flume on slaver1 is killed manually and another message is sent, the standby node slaver2 receives the data; once Flume on slaver1 is restarted, slaver1 receives data again.
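A sketch of the startup order (the file names follow the examples above; adjust paths to your own layout):
# on slaver1 and slaver2 first
flume-ng agent --conf conf --conf-file ./flume-server-failover.conf --name a1 -Dflume.root.logger=INFO,console
# then on master
flume-ng agent --conf conf --conf-file ./flume-client.properties_failover --name a1 -Dflume.root.logger=INFO,console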
– Practice 5: replicating selector (each Event is copied to every channel, so both slaver1 and slaver2 receive it)
Master node configuration:
vim flume_client_replicating.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Describe/configure the source
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 50000
a1.sources.r1.host = master
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = slaver1
a1.sinks.k1.port = 50000
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c2
a1.sinks.k2.hostname = slaver2
a1.sinks.k2.port = 50000
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
For the slaver1 and slaver2 node configuration, refer to Practice 4.
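To exercise the replicating selector, send a message to the syslogtcp source; the leading <13> is a syslog priority header so the source does not flag the event as malformed, and the same Event should arrive on both slaver1 and slaver2:
echo '<13>hello replicating' | nc master 50000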
– Practice 6: multiplexing selector (routes each Event to a channel based on the value of one of its headers)
Master node configuration:
vim flume_client_multiplexing.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Describe/configure the source
a1.sources.r1.type = org.apache.flume.source.http.HTTPSource
a1.sources.r1.port = 50000
a1.sources.r1.host = master
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.channels = c1 c2
# route on the value of the "areyouok" header
a1.sources.r1.selector.header = areyouok
a1.sources.r1.selector.mapping.OK = c1
a1.sources.r1.selector.mapping.NO = c2
a1.sources.r1.selector.default = c1
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = slaver1
a1.sinks.k1.port = 50000
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c2
a1.sinks.k2.hostname = slaver2
a1.sinks.k2.port = 50000
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
For the slaver1 and slaver2 node configuration, refer to Practice 4.
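To test the routing, POST JSON events to the HTTP source with the areyouok header set; the default JSONHandler of HTTPSource accepts an array of events, each with headers and a body:
# routed to c1, delivered to slaver1
curl -X POST -d '[{"headers":{"areyouok":"OK"},"body":"to slaver1"}]' http://master:50000
# routed to c2, delivered to slaver2
curl -X POST -d '[{"headers":{"areyouok":"NO"},"body":"to slaver2"}]' http://master:50000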
For more examples, see the official Flume user guide: http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html
If you spot any problems, feel free to leave a comment!