The implementation flow is as follows.
Preparation:
First start the clusters and confirm that HBase and Flume work normally. Copy the jars needed by dom4j (used to parse XML files), dom4j-1.6.1.jar and jaxen-1.1-beta-7.jar (jaxen provides the XPath support that lets you locate content inside an XML document much like SQL locates rows), into Flume's lib directory. Copy the XML file to be read to every node, and package the project with Eclipse into a jar named core.jar (pay attention to character encodings here; inconsistent encodings will cause errors). Two clusters were started for this test. Then copy the XML configuration file into the /home/hadoop directory on every node of both clusters. The content of dao.xml is as follows:
[hadoop@h71 ~]$ cat dao.xml
(The XML markup of dao.xml was lost when this post was captured; only the element values survive. The file defines the three listener ports 4141, 4040 and 5151, each followed by what appear to be field-index lists such as 3, 4, 5, 6.)
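Since the original dao.xml markup did not survive, here is a minimal sketch of the XPath-style lookup that dom4j and jaxen are used for, written in Python with the standard library. The tag names and layout below are assumptions for illustration, not the project's actual schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical layout -- the real dao.xml tag names were lost with the post.
DAO_XML = """
<clusters>
  <cluster name="h71">
    <port>4141</port>
    <fields>3 4 5 6</fields>
  </cluster>
  <cluster name="h21">
    <port>5151</port>
    <fields>3 4</fields>
  </cluster>
</clusters>
"""

def load_ports(xml_text):
    """Locate values with an XPath-like path, as dom4j's selectNodes() would."""
    root = ET.fromstring(xml_text)
    # ".//cluster" plays the role of an XPath expression like //cluster
    return {c.get("name"): int(c.findtext("port"))
            for c in root.findall(".//cluster")}

print(load_ports(DAO_XML))  # one (cluster, port) entry per <cluster> element
```

In the real project dom4j's SAXReader plus jaxen would do the same job against the file on disk, with the added benefit of full XPath syntax.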
The goal of this project is to have one sink side open connections to ports on several clusters and insert the data into the corresponding HBase table in each cluster. The project also implements resuming from the last read position after a restart, automatic table creation in HBase, and generating files with the desired names in file_roll mode.
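The multi-cluster fan-out described above can be sketched as follows; the target list and the injected send function are hypothetical stand-ins for the project's per-cluster Avro RPC clients:

```python
# Hypothetical stand-ins for the project's per-cluster Avro RPC clients:
# each event read from the source is forwarded to every configured target.
TARGETS = [
    ("192.168.8.71", 4141),  # HBase sink on cluster 1
    ("192.168.8.71", 4040),  # file_roll sink on cluster 1
    ("192.168.8.21", 5151),  # HBase sink on cluster 2
]

def fan_out(event, send):
    """Deliver one event to all targets; return how many sends succeeded."""
    ok = 0
    for host, port in TARGETS:
        try:
            send(host, port, event)
            ok += 1
        except OSError:
            pass  # a real sink would retry / back off here
    return ok

# usage with a dummy sender that always succeeds:
print(fan_out(b"Jan 24 19:59:01 ...", lambda h, p, e: None))
```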
Before starting the Flume agent on h71 that feeds the three ports, first start the source agents listening on those three ports:
hadoop@h21:~/apache-flume-1.6.0-cdh5.5.2-bin/conf$ cat messages5.conf
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Describe/configure the source
a1.sources.r1.type = hui.avrosource.AvroSource
a1.sources.r1.channels = c1
a1.sources.r1.bind = 192.168.8.21
a1.sources.r1.port = 5151
# Describe the sink
a1.sinks.k1.type = hui.org.apache.flume.sink.hbase.HBaseSink
a1.sinks.k1.serializer = com.tcloud.flume.AsyncHbaseLogEventSerializer
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
hadoop@h21:~/apache-flume-1.6.0-cdh5.5.2-bin$ bin/flume-ng agent -c . -f conf/messages5.conf -n a1 -Dflume.root.logger=INFO,console
hbase(main):013:0> list
TABLE
hui
messages
2 row(s) in 0.0220 seconds
[hadoop@h71 conf]$ cat messages5.conf
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Describe/configure the source
a1.sources.r1.type = hui.avrosource.AvroSource
a1.sources.r1.channels = c1
a1.sources.r1.bind = 192.168.8.71
a1.sources.r1.port = 4141
# Describe the sink
a1.sinks.k1.type = hui.org.apache.flume.sink.hbase.HBaseSink
a1.sinks.k1.serializer = com.tcloud.flume.AsyncHbaseLogEventSerializer
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
[hadoop@h71 apache-flume-1.6.0-cdh5.5.2-bin]$ bin/flume-ng agent -c . -f conf/messages5.conf -n a1 -Dflume.root.logger=INFO,console
[hadoop@h71 conf]$ cat messages6.conf
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 192.168.8.71
a1.sources.r1.port = 4040
# Describe the sink
a1.sinks.k1.type = cn.huyanping.flume.sinks.SafeRollingFileSink
a1.sinks.k1.channel = c1
a1.sinks.k1.sink.directory = /home/hadoop/hui
a1.sinks.k1.sink.rollInterval = 0
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
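The SafeRollingFileSink used above is a third-party sink whose point is to keep downstream readers from ever seeing a half-written file. A common way to achieve that, sketched here in Python under that assumption (the function name is illustrative, not the sink's actual code), is to write to a temporary name and rename only once the file is complete:

```python
import os
import tempfile

def safe_roll(directory, final_name, lines):
    """Write to a temp file, then rename it to its final name, so readers
    of `directory` never observe a partially written file."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        for line in lines:
            f.write(line + "\n")
    final_path = os.path.join(directory, final_name)
    os.rename(tmp_path, final_path)  # atomic on POSIX within one filesystem
    return final_path

# usage: safe_roll("/home/hadoop/hui", "messages.txt", ["line1", "line2"])
```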
[hadoop@h71 hui]$ ls
(the /home/hadoop/hui directory is empty at this point)
[hadoop@h71 conf]$ cat messages4.conf
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Describe/configure the source
a1.sources.r1.type = org.apache.flume.chiwei.filemonitor.FileMonitorSource
a1.sources.r1.channels = c1
a1.sources.r1.file = /home/hadoop/messages
a1.sources.r1.positionDir = /home/hadoop
# Describe the sink
a1.sinks.k1.type = hui.avrosink.AvroSink
a1.sinks.k1.batch-size = 2
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
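FileMonitorSource tails /home/hadoop/messages and keeps its read offset under positionDir, which is what makes resuming after a restart possible. A minimal sketch of that position-file mechanism (the function and file names here are illustrative, not the plugin's actual ones):

```python
import os

def read_new_lines(log_path, position_path):
    """Return lines appended since the last call, persisting the byte
    offset to `position_path` so a restart resumes where it left off."""
    offset = 0
    if os.path.exists(position_path):
        with open(position_path) as f:
            offset = int(f.read().strip() or 0)
    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()
    with open(position_path, "w") as f:
        f.write(str(offset + len(data)))
    return data.decode().splitlines()
```

Each call picks up exactly the bytes written since the stored offset; deleting the position file makes the next call re-read the whole log from the beginning.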
[hadoop@h71 ~]$ cat messages (the source log file whose data will be ingested)
[hadoop@h71 apache-flume-1.6.0-cdh5.5.2-bin]$ bin/flume-ng agent -c . -f conf/messages4.conf -n a1 -Dflume.root.logger=INFO,console
(After this agent starts, connections are successfully established to all three ports that the previously started source agents listen on, as shown below:
17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xabab0b15, /192.168.8.71:33975 => /192.168.8.71:4040] OPEN
17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xabab0b15, /192.168.8.71:33975 => /192.168.8.71:4040] BOUND: /192.168.8.71:4040
17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xabab0b15, /192.168.8.71:33975 => /192.168.8.71:4040] CONNECTED: /192.168.8.71:33975
17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xb9dd8531, /192.168.8.71:51345 => /192.168.8.71:4141] OPEN
17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xb9dd8531, /192.168.8.71:51345 => /192.168.8.71:4141] BOUND: /192.168.8.71:4141
17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xb9dd8531, /192.168.8.71:51345 => /192.168.8.71:4141] CONNECTED: /192.168.8.71:51345
12/12/13 01:19:05 INFO ipc.NettyServer: [id: 0x65dc160e, /192.168.8.71:50634 => /192.168.8.21:5151] OPEN
12/12/13 01:19:05 INFO ipc.NettyServer: [id: 0x65dc160e, /192.168.8.71:50634 => /192.168.8.21:5151] BOUND: /192.168.8.21:5151
12/12/13 01:19:05 INFO ipc.NettyServer: [id: 0x65dc160e, /192.168.8.71:50634 => /192.168.8.21:5151] CONNECTED: /192.168.8.71:50634)
[hadoop@h71 ~]$ echo "Jan 24 19:59:01 :: s_sys@hui trafficlogger: empty map for 1:4097 in classnames" >> messages
Jan 23 19:59:00 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Feb 20 06:25:04 h107 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="22204" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Jan 24 19:59:01 :: s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Jan 3 19:59:02 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Jan 24 19:59:01 :: s_sys@hui trafficlogger: empty map for 1:4097 in classnames
17/03/18 15:46:46 INFO avrosink.AvroSink: Attempting to create Avro Rpc client.
17/03/18 15:46:46 WARN api.NettyAvroRpcClient: Using default maxIOWorkers
client-->NettyAvroRpcClient { host: h71, port: 4141 }
17/03/18 15:46:47 INFO avrosink.AvroSink: Attempting to create Avro Rpc client.
17/03/18 15:46:47 WARN api.NettyAvroRpcClient: Using default maxIOWorkers
client-->NettyAvroRpcClient { host: h71, port: 4040 }
17/03/18 15:46:47 INFO avrosink.AvroSink: Attempting to create Avro Rpc client.
17/03/18 15:46:47 WARN api.NettyAvroRpcClient: Using default maxIOWorkers
client-->NettyAvroRpcClient { host: 192.168.8.21, port: 5151 }
Check the corresponding HBase tables on h21:
hbase(main):014:0> scan 'messages'
ROW COLUMN+CELL
2012-12-13 01:23:00 column=cf:host, timestamp=1355379762701, value=s_sys@hui
2012-12-13 01:23:00 column=cf:ip, timestamp=1355379762628, value=192.168.101.254
2012-12-13 01:23:00 column=df:leixing, timestamp=1355379762741, value=trafficlogger:
2012-12-13 01:23:00 column=df:xinxi, timestamp=1355379762791, value=empty
2012-12-13 01:23:01 column=cf:host, timestamp=1355379763516, value=s_sys@hui
2012-12-13 01:23:01 column=cf:ip, timestamp=1355379763488, value=::
2012-12-13 01:23:01 column=df:leixing, timestamp=1355379763544, value=trafficlogger:
2012-12-13 01:23:01 column=df:xinxi, timestamp=1355379763573, value=empty
2 row(s) in 0.0610 seconds
hbase(main):015:0> scan 'hui'
ROW COLUMN+CELL
2012-12-13 01:23:01 column=ef:haha, timestamp=1355379763422, value=19:59:02
2012-12-13 01:23:01 column=ef:hehe, timestamp=1355379763452, value=192.168.101.254
2012-12-13 01:23:02 column=ef:haha, timestamp=1355379763607, value=::
2012-12-13 01:23:02 column=ef:hehe, timestamp=1355379763635, value=s_sys@hui
2 row(s) in 0.0500 seconds
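Judging from the scans above, AsyncHbaseLogEventSerializer splits each syslog line on whitespace and maps selected fields to columns (cf:ip, cf:host, df:leixing, df:xinxi). A rough sketch of that mapping, with field positions inferred from the sample output rather than taken from the project source:

```python
def to_columns(line):
    """Split one syslog line and map fields to the column layout seen in
    the 'messages' table scans. Field positions are assumptions inferred
    from the sample output, not the serializer's actual code."""
    parts = line.split()
    # parts: [month, day, time, ip, host, type, first-word-of-message, ...]
    return {
        "cf:ip": parts[3],
        "cf:host": parts[4],
        "df:leixing": parts[5],
        "df:xinxi": parts[6],
    }

sample = ("Jan 23 19:59:00 192.168.101.254 s_sys@hui "
          "trafficlogger: empty map for 1:4097 in classnames")
print(to_columns(sample))
```

Running this on the first line of the messages file reproduces the values seen in the h21 scan: cf:ip=192.168.101.254, cf:host=s_sys@hui, df:leixing=trafficlogger:, df:xinxi=empty.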
Check the corresponding HBase tables on h71:
hbase(main):012:0> scan 'messages'
ROW COLUMN+CELL
2017-03-18 15:46:47 column=cf:host, timestamp=1489823233223, value=192.168.101.254
2017-03-18 15:46:47 column=cf:ip, timestamp=1489823233185, value=19:59:02
2017-03-18 15:46:47 column=df:leixing, timestamp=1489823233263, value=s_sys@hui
2017-03-18 15:46:47 column=df:xinxi, timestamp=1489823233297, value=trafficlogger:
2017-03-18 15:46:48 column=cf:host, timestamp=1489823233439, value=s_sys@hui
2017-03-18 15:46:48 column=cf:ip, timestamp=1489823233406, value=::
2017-03-18 15:46:48 column=df:leixing, timestamp=1489823233471, value=trafficlogger:
2017-03-18 15:46:48 column=df:xinxi, timestamp=1489823233505, value=empty
2 row(s) in 0.3660 seconds
hbase(main):013:0> scan 'hui'
ROW COLUMN+CELL
2017-03-18 15:46:47 column=ef:haha, timestamp=1489823233106, value=::
2017-03-18 15:46:47 column=ef:hehe, timestamp=1489823233145, value=s_sys@hui
2017-03-18 15:46:48 column=ef:haha, timestamp=1489823233544, value=::
2017-03-18 15:46:48 column=ef:hehe, timestamp=1489823233578, value=s_sys@hui
2 row(s) in 0.0160 seconds
[hadoop@h71 hui]$ cat messages.txt
Jan 23 19:59:00 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Feb 20 06:25:04 h107 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="22204" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Jan 24 19:59:01 :: s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Jan 3 19:59:02 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Jan 24 19:59:01 :: s_sys@hui trafficlogger: empty map for 1:4097 in classnames
A position.log file is also generated in the /home/hadoop/ directory; it records the read offset and enables resuming from where the last run left off.
The project code has been uploaded to http://download.csdn.net/download/m0_37739193/10154814