Implementing Custom Features in Flume

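This feature lets Flume read an XML configuration file so that, in avro sink mode, a single sink can open several ports at once, process the data according to the customer-defined XML, and load it into the corresponding HBase tables on multiple clusters.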

The workflow is as follows:


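Preparation:
First start the clusters and confirm that HBase and Flume work normally on each of them. Copy the jars that dom4j needs to parse the XML file, dom4j-1.6.1.jar and jaxen-1.1-beta-7.jar (jaxen provides XPath, which locates content inside XML much as SQL does in a database), into Flume's lib directory. Also add core.jar, the project packaged with Eclipse (watch the character encoding here; a mismatch causes errors). I started two clusters for this test. Finally, copy the XML configuration file to /home/hadoop on every node of both clusters. The content of dao.xml is as follows: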
[hadoop@h71 ~]$ cat dao.xml 



   
      4141
      4040
      
         
            
               3
               4
            
            
               5
               6
            
         
          
      
      
         
            
               3
               4
            
         
          
      
   
   
      5151
      
         
            
               3
               4
            
            
               5
               6
            
         
          
      
      
         
            
               3
               4
            
         
          
      
   
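To illustrate the XPath lookup mentioned in the preparation step, here is a minimal sketch that pulls the port numbers out of a configuration string. The element names `clusters`, `cluster` and `port` are my own assumptions, since the tags of dao.xml are not visible in the listing above, and the sketch uses the JDK's built-in `javax.xml.xpath` instead of dom4j and jaxen so that it runs without extra jars.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PortConfigReader {
    /**
     * Returns every port in document order.
     * The path //cluster/port assumes a hypothetical schema, not the real dao.xml.
     */
    static List<Integer> readPorts(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//cluster/port",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<Integer> ports = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ports.add(Integer.parseInt(nodes.item(i).getTextContent().trim()));
            }
            return ports;
        } catch (XPathExpressionException e) {
            // Malformed XML or a bad expression: report it as a configuration error.
            throw new IllegalArgumentException("cannot read ports from XML", e);
        }
    }
}
```

Called with a two-cluster document whose first cluster lists 4141 and 4040 and whose second lists 5151, `readPorts` returns the three ports in that order.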
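The project aims to have a single sink connect to ports on multiple clusters and insert the data into the corresponding HBase tables on each cluster. It also implements resumable transfer, automatic table creation in HBase, and generation of the desired output file in file_roll mode.
Here I start one Flume process on h71 that opens three ports: the first automatically creates the corresponding tables in h71's HBase and inserts data; the second, through a file_roll-style sink, produces messages.txt under /home/hadoop/hui on h71; the third creates the corresponding tables in h21's HBase and inserts data.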

Before starting the Flume process on h71 that opens the three ports, first start the source side for each of those ports:

hadoop@h21:~/apache-flume-1.6.0-cdh5.5.2-bin/conf$ cat messages5.conf 

# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Describe/configure the source
a1.sources.r1.type = hui.avrosource.AvroSource
a1.sources.r1.channels = c1
a1.sources.r1.bind = 192.168.8.21
a1.sources.r1.port = 5151

# Describe the sink
a1.sinks.k1.type = hui.org.apache.flume.sink.hbase.HBaseSink
a1.sinks.k1.serializer = com.tcloud.flume.AsyncHbaseLogEventSerializer

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
hadoop@h21:~/apache-flume-1.6.0-cdh5.5.2-bin$ bin/flume-ng agent -c . -f conf/messages5.conf -n a1 -Dflume.root.logger=INFO,console
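(If HBase lacks a table named in the XML, it is created; otherwise nothing is created. This HBase had no such tables, so the process prints: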
Create messages SUCCESS!
Create hui SUCCESS!
12/12/13 01:08:58 INFO sink.DefaultSinkFactory: Creating instance of sink: k1, type: hui.org.apache.flume.sink.hbase.HBaseSink
表名:messages
列族名:cf)
hbase(main):013:0> list
TABLE                                                                                                                                                                                                                                        
hui                                                                                                                                                                                                                                          
messages
2 row(s) in 0.0220 seconds
[hadoop@h71 conf]$ cat messages5.conf 
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Describe/configure the source
a1.sources.r1.type = hui.avrosource.AvroSource
a1.sources.r1.channels = c1
a1.sources.r1.bind = 192.168.8.71
a1.sources.r1.port = 4141

# Describe the sink
a1.sinks.k1.type = hui.org.apache.flume.sink.hbase.HBaseSink
a1.sinks.k1.serializer = com.tcloud.flume.AsyncHbaseLogEventSerializer

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
 
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
[hadoop@h71 apache-flume-1.6.0-cdh5.5.2-bin]$ bin/flume-ng agent -c . -f conf/messages5.conf -n a1 -Dflume.root.logger=INFO,console
(The corresponding tables already exist in HBase, so the process prints:
messages exists!
hui exists!
17/03/18 15:36:46 INFO sink.DefaultSinkFactory: Creating instance of sink: k1, type: hui.org.apache.flume.sink.hbase.HBaseSink
表名:messages
列族名:cf)
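The create-if-missing behaviour behind the two startup logs can be sketched as follows. This is only an illustration: `TableAdmin` is a hypothetical interface standing in for whatever HBase admin calls the project actually makes, and `InMemoryAdmin` exists solely so the sketch runs without a cluster. The status strings mirror the ones printed in the logs above.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical stand-in for the HBase admin calls; not the real HBase API. */
interface TableAdmin {
    boolean tableExists(String name);
    void createTable(String name, String... columnFamilies);
}

/** In-memory implementation so the sketch can run without HBase. */
class InMemoryAdmin implements TableAdmin {
    private final Set<String> tables = new LinkedHashSet<>();

    public boolean tableExists(String name) {
        return tables.contains(name);
    }

    public void createTable(String name, String... columnFamilies) {
        tables.add(name); // column families are ignored in this toy version
    }
}

class TableBootstrap {
    /** Creates the table only when it is missing and returns the status line. */
    static String ensureTable(TableAdmin admin, String name, String... families) {
        if (admin.tableExists(name)) {
            return name + " exists!";
        }
        admin.createTable(name, families);
        return "Create " + name + " SUCCESS!";
    }
}
```

Calling `ensureTable` twice for the same table name reproduces the two cases: the first call reports the creation, the second reports that the table exists.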


[hadoop@h71 conf]$ cat messages6.conf 

# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 192.168.8.71
a1.sources.r1.port = 4040

# Describe the sink
a1.sinks.k1.type = cn.huyanping.flume.sinks.SafeRollingFileSink
a1.sinks.k1.channel = c1
a1.sinks.k1.sink.directory = /home/hadoop/hui
a1.sinks.k1.sink.rollInterval = 0

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
[hadoop@h71 hui]$ ls (the /home/hadoop/hui directory is empty)
[hadoop@h71 apache-flume-1.6.0-cdh5.5.2-bin]$ bin/flume-ng agent -c . -f conf/messages6.conf -n a1 -Dflume.root.logger=INFO,console
[hadoop@h71 hui]$ ls
messages.txt
(After the process starts, an empty messages.txt file is created.)

Start the sink side:
[hadoop@h71 conf]$ cat messages4.conf 
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Describe/configure the source
a1.sources.r1.type = org.apache.flume.chiwei.filemonitor.FileMonitorSource
a1.sources.r1.channels = c1
a1.sources.r1.file = /home/hadoop/messages
a1.sources.r1.positionDir = /home/hadoop

# Describe the sink
a1.sinks.k1.type = hui.avrosink.AvroSink
a1.sinks.k1.batch-size = 2

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
[hadoop@h71 ~]$ cat messages (the source log file whose data will be imported)
Jan 23 19:59:00 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Feb 20 06:25:04 h107 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="22204" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Jan 24 19:59:01 :: s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Jan  3 19:59:02 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames


[hadoop@h71 apache-flume-1.6.0-cdh5.5.2-bin]$ bin/flume-ng agent -c . -f conf/messages4.conf -n a1 -Dflume.root.logger=INFO,console
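(Once this process starts, the ports listened on by the three source agents started earlier all report a successful connection, as shown below.)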

17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xabab0b15, /192.168.8.71:33975 => /192.168.8.71:4040] OPEN
17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xabab0b15, /192.168.8.71:33975 => /192.168.8.71:4040] BOUND: /192.168.8.71:4040
17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xabab0b15, /192.168.8.71:33975 => /192.168.8.71:4040] CONNECTED: /192.168.8.71:33975

17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xb9dd8531, /192.168.8.71:51345 => /192.168.8.71:4141] OPEN
17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xb9dd8531, /192.168.8.71:51345 => /192.168.8.71:4141] BOUND: /192.168.8.71:4141
17/03/18 15:42:51 INFO ipc.NettyServer: [id: 0xb9dd8531, /192.168.8.71:51345 => /192.168.8.71:4141] CONNECTED: /192.168.8.71:51345

12/12/13 01:19:05 INFO ipc.NettyServer: [id: 0x65dc160e, /192.168.8.71:50634 => /192.168.8.21:5151] OPEN
12/12/13 01:19:05 INFO ipc.NettyServer: [id: 0x65dc160e, /192.168.8.71:50634 => /192.168.8.21:5151] BOUND: /192.168.8.21:5151
12/12/13 01:19:05 INFO ipc.NettyServer: [id: 0x65dc160e, /192.168.8.71:50634 => /192.168.8.21:5151] CONNECTED: /192.168.8.71:50634
[hadoop@h71 ~]$ echo "Jan 24 19:59:01 :: s_sys@hui trafficlogger: empty map for 1:4097 in classnames" >> messages
Output on the sink side on h71:
Jan 23 19:59:00 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Feb 20 06:25:04 h107 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="22204" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Jan 24 19:59:01 :: s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Jan  3 19:59:02 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Jan 24 19:59:01 :: s_sys@hui trafficlogger: empty map for 1:4097 in classnames

17/03/18 15:46:46 INFO avrosink.AvroSink: Attempting to create Avro Rpc client.
17/03/18 15:46:46 WARN api.NettyAvroRpcClient: Using default maxIOWorkers
client-->NettyAvroRpcClient { host: h71, port: 4141 }
17/03/18 15:46:47 INFO avrosink.AvroSink: Attempting to create Avro Rpc client.
17/03/18 15:46:47 WARN api.NettyAvroRpcClient: Using default maxIOWorkers
client-->NettyAvroRpcClient { host: h71, port: 4040 }
17/03/18 15:46:47 INFO avrosink.AvroSink: Attempting to create Avro Rpc client.
17/03/18 15:46:47 WARN api.NettyAvroRpcClient: Using default maxIOWorkers
client-->NettyAvroRpcClient { host: 192.168.8.21, port: 5151 }
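The log above shows the custom sink creating one Avro RPC client per configured endpoint. A rough sketch of that fan-out idea, under my own assumptions, looks like this: `Endpoint` is a hypothetical interface (in a real sink it would presumably wrap an Avro RPC client such as the one returned by Flume's `RpcClientFactory.getDefaultInstance(host, port)`), events are plain strings, and Flume's channel transactions are left out entirely. The batch size of 2 used in the example matches `a1.sinks.k1.batch-size = 2` in messages4.conf.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical abstraction over one downstream Avro source (host and port). */
interface Endpoint {
    void send(List<String> batch);
}

class FanOutSender {
    private final List<Endpoint> endpoints;
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();

    FanOutSender(List<Endpoint> endpoints, int batchSize) {
        this.endpoints = new ArrayList<>(endpoints);
        this.batchSize = batchSize;
    }

    /** Buffers one event and flushes once the batch is full. */
    void append(String event) {
        buffer.add(event);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Sends a copy of the buffered events to every endpoint, then clears the buffer. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<String> batch = new ArrayList<>(buffer);
        for (Endpoint endpoint : endpoints) {
            endpoint.send(batch);
        }
        buffer.clear();
    }
}
```

Every endpoint receives the same events in the same order, which is consistent with the same log lines ending up in h71's HBase, in messages.txt, and in h21's HBase.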
Check the corresponding HBase tables on h21:
hbase(main):014:0> scan 'messages'
ROW                                                          COLUMN+CELL                                                                                                                                                                     
 2012-12-13 01:23:00                                         column=cf:host, timestamp=1355379762701, value=s_sys@hui                                                                                                                    
 2012-12-13 01:23:00                                         column=cf:ip, timestamp=1355379762628, value=192.168.101.254                                                                                                                    
 2012-12-13 01:23:00                                         column=df:leixing, timestamp=1355379762741, value=trafficlogger:                                                                                                                
 2012-12-13 01:23:00                                         column=df:xinxi, timestamp=1355379762791, value=empty                                                                                                                           
 2012-12-13 01:23:01                                         column=cf:host, timestamp=1355379763516, value=s_sys@hui                                                                                                                    
 2012-12-13 01:23:01                                         column=cf:ip, timestamp=1355379763488, value=::                                                                                                                                 
 2012-12-13 01:23:01                                         column=df:leixing, timestamp=1355379763544, value=trafficlogger:                                                                                                                
 2012-12-13 01:23:01                                         column=df:xinxi, timestamp=1355379763573, value=empty                                                                                                                           
2 row(s) in 0.0610 seconds

hbase(main):015:0> scan 'hui'
ROW                                                          COLUMN+CELL                                                                                                                                                                     
 2012-12-13 01:23:01                                         column=ef:haha, timestamp=1355379763422, value=19:59:02                                                                                                                         
 2012-12-13 01:23:01                                         column=ef:hehe, timestamp=1355379763452, value=192.168.101.254                                                                                                                  
 2012-12-13 01:23:02                                         column=ef:haha, timestamp=1355379763607, value=::                                                                                                                               
 2012-12-13 01:23:02                                         column=ef:hehe, timestamp=1355379763635, value=s_sys@hui                                                                                                                    
2 row(s) in 0.0500 seconds
Check the corresponding HBase tables on h71:
hbase(main):012:0> scan 'messages'
ROW                                                          COLUMN+CELL                                                                                                                                                                     
 2017-03-18 15:46:47                                         column=cf:host, timestamp=1489823233223, value=192.168.101.254                                                                                                                  
 2017-03-18 15:46:47                                         column=cf:ip, timestamp=1489823233185, value=19:59:02                                                                                                                           
 2017-03-18 15:46:47                                         column=df:leixing, timestamp=1489823233263, value=s_sys@hui                                                                                                                 
 2017-03-18 15:46:47                                         column=df:xinxi, timestamp=1489823233297, value=trafficlogger:                                                                                                                  
 2017-03-18 15:46:48                                         column=cf:host, timestamp=1489823233439, value=s_sys@hui                                                                                                                    
 2017-03-18 15:46:48                                         column=cf:ip, timestamp=1489823233406, value=::                                                                                                                                 
 2017-03-18 15:46:48                                         column=df:leixing, timestamp=1489823233471, value=trafficlogger:                                                                                                                
 2017-03-18 15:46:48                                         column=df:xinxi, timestamp=1489823233505, value=empty                                                                                                                           
2 row(s) in 0.3660 seconds

hbase(main):013:0> scan 'hui'
ROW                                                          COLUMN+CELL                                                                                                                                                                     
 2017-03-18 15:46:47                                         column=ef:haha, timestamp=1489823233106, value=::                                                                                                                               
 2017-03-18 15:46:47                                         column=ef:hehe, timestamp=1489823233145, value=s_sys@hui                                                                                                                    
 2017-03-18 15:46:48                                         column=ef:haha, timestamp=1489823233544, value=::                                                                                                                               
 2017-03-18 15:46:48                                         column=ef:hehe, timestamp=1489823233578, value=s_sys@hui                                                                                                                    
2 row(s) in 0.0160 seconds
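Reading the scans above, the numbers 3, 4, 5 and 6 in dao.xml look like zero-based indexes into each log line split on a single space, with one index per HBase column. That is my inference from the output, not something the project states, so treat the following as a sketch of that reading. The column names are taken from the scans; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FieldMapper {
    /**
     * Splits the line on single spaces and picks one field per column.
     * Columns whose index lies beyond the end of the line are skipped.
     */
    static Map<String, String> extract(String line, Map<String, Integer> columnToIndex) {
        String[] fields = line.split(" ");
        Map<String, String> cells = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : columnToIndex.entrySet()) {
            int index = entry.getValue();
            if (index >= 0 && index < fields.length) {
                cells.put(entry.getKey(), fields[index]);
            }
        }
        return cells;
    }
}
```

Under this reading, a line such as `Jan  3 19:59:02 ...` with two spaces after the month yields an empty field at index 1, so every later field shifts by one. That would explain rows where cf:ip holds 19:59:02 instead of an address. Splitting on a run of whitespace would avoid the shift, if that is the intended behaviour.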
[hadoop@h71 hui]$ cat messages.txt 
Jan 23 19:59:00 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Feb 20 06:25:04 h107 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="22204" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Jan 24 19:59:01 :: s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Jan  3 19:59:02 192.168.101.254 s_sys@hui trafficlogger: empty map for 1:4097 in classnames
Jan 24 19:59:01 :: s_sys@hui trafficlogger: empty map for 1:4097 in classnames
In addition, position.log is generated under /home/hadoop/ to support resumable transfer.
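Resumable transfer with a position file can be sketched roughly like this. The format of the real position.log is not shown in this post, so the sketch simply assumes it holds a single byte offset as text; `ResumableReader` and `readNewLines` are hypothetical names and not part of FileMonitorSource.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class ResumableReader {
    /** Reads the lines added since the saved offset, then saves the new offset. */
    static List<String> readNewLines(Path dataFile, Path positionFile) {
        try {
            long offset = 0L;
            if (Files.exists(positionFile)) {
                String saved = new String(Files.readAllBytes(positionFile), StandardCharsets.UTF_8);
                offset = Long.parseLong(saved.trim());
            }
            List<String> lines = new ArrayList<>();
            try (RandomAccessFile file = new RandomAccessFile(dataFile.toFile(), "r")) {
                file.seek(offset);
                String raw;
                while ((raw = file.readLine()) != null) {
                    // readLine maps each byte to one char, so decode the bytes again as UTF-8.
                    lines.add(new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8));
                }
                offset = file.getFilePointer();
            }
            Files.write(positionFile, Long.toString(offset).getBytes(StandardCharsets.UTF_8));
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

After a restart, a second call returns only what was appended in the meantime, which is the behaviour seen when the extra line was echoed into messages. A production version would also need to handle a half-written last line and log rotation, which this sketch ignores.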


The project code has been uploaded to http://download.csdn.net/download/m0_37739193/10154814
