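Introduction to and Use of Interceptors in Flume
Flume interceptors act on events as they travel from a Source toward a Sink. They add useful information to the event headers or filter events by content, which gives a first pass of data cleaning.
Flume NG 1.7.0 currently provides interceptors including the following: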
Timestamp Interceptor;
Host Interceptor;
Static Interceptor;
Search and Replace Interceptor;
Regex Filtering Interceptor;
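Multiple interceptors can be attached to a single Source.
Step 1. Procedure: start Hadoop by running start-all.sh, then write the configuration files in the conf directory under the Flume installation directory.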
Timestamp Interceptor:
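The timestamp interceptor adds the current time in milliseconds to the event headers, under the key timestamp. It is not used very often on its own. A typical case is the HDFS Sink, which can generate its output files based on the event timestamp, for example hdfs.path = hdfs://master:9000/flume/%Y%m%d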
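To make the link between the timestamp header and the output directory concrete, here is a minimal Python sketch. It is not Flume's implementation: it only relies on the fact that Python's strftime understands the same %Y, %m, %d, %H and %M escapes used in the path. The function name resolve_path is hypothetical, and the time zone is fixed to UTC purely so the result is reproducible.

```python
from datetime import datetime, timezone


def resolve_path(template: str, timestamp_ms: int, tz=timezone.utc) -> str:
    """Expand strftime-style escapes in `template` from an epoch-millisecond timestamp.

    Hypothetical helper for illustration; Flume performs its own substitution.
    """
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=tz)
    return moment.strftime(template)


# An event header as the timestamp interceptor would fill it in (values are strings).
headers = {"timestamp": "1533082885173"}
print(resolve_path("hdfs://master:9000/flume/%Y-%m-%d/%H%M", int(headers["timestamp"])))
```

An event without a timestamp header gives the sink nothing to substitute into the path, which is why the interceptor is paired with a time-based hdfs.path.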
# Configuration file: mytime.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 50000
# Set the host to match your environment
a1.sources.r1.host = 192.168.27.174
a1.sources.r1.channels = c1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.preserveExisting= false
a1.sources.r1.interceptors.i1.type = timestamp
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://master:9000/flume/%Y-%m-%d/%H%M
a1.sinks.k1.hdfs.filePrefix = looklook5.
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
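With the environment variables set, change to the Flume directory and run the flume command: flume-ng agent -c conf -f conf/mytime.conf -n a1 -Dflume.root.logger=INFO,console
Then open another terminal and use telnet to send a message to port 50000, so that Flume receives it and produces a timestamped log entry:
telnet 192.168.27.174 50000
The generated log files can then be viewed on HDFS.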
Host Interceptor:
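The host interceptor adds the host name or IP address of the machine running the Flume agent to the event headers. The key is host by default and can be customized. Create the configuration file myhost.conf: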
# define agent name, source/sink/channel name
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source,http,jsonhandler
a1.sources.r1.type = http
a1.sources.r1.bind = master
a1.sources.r1.port = 50000
a1.sources.r1.handler = org.apache.flume.source.http.JSONHandler
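# timestamp and host interceptors process events after the source receives them and before they reach the channel
a1.sources.r1.interceptors = i1 i2
# The two interceptors are chained and applied to each event in order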
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i1.preserveExisting = false
a1.sources.r1.interceptors.i2.type = host
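# The event header gains an entry "hostname": <actual host name>
a1.sources.r1.interceptors.i2.hostHeader = hostname
# hostHeader sets the key; with useIP = false the value is the host name of the node running the Flume agent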
a1.sources.r1.interceptors.i2.useIP = false
# 04 hdfs sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://master:9000/flume/%Y-%m-%d/
# The HDFS sink fills in the date escapes of the path above from the timestamp in the event header
a1.sinks.k1.hdfs.filePrefix = %{hostname}
# The HDFS sink replaces %{hostname} with the value of the hostname header
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollCount = 10
a1.sinks.k1.hdfs.rollSize = 1024000
# channel,memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# bind source,sink to channel
a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1
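This configuration saves the Source's events on HDFS in a directory of the form hdfs://master:9000/flume/%Y-%m-%d/, with file names prefixed by the agent's host name.
With the environment variables set, change to the Flume directory and run the flume command: flume-ng agent -c conf -f conf/myhost.conf -n a1 -Dflume.root.logger=INFO,console
Flume starts in this terminal and, on a successful start, prints its startup log to the console.
Then open another terminal and use telnet to send a request to port 50000:
telnet master 50000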
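Because this source is an HTTP source with JSONHandler, the request has to be an HTTP POST whose body is a JSON array of events, each with a headers object and a body string. Typing that by hand over telnet is awkward, so the following is a hedged Python sketch that builds and posts such a payload. The helper names build_payload and post_events are hypothetical, and the URL assumes the host and port from the configuration above.

```python
import json
import urllib.request


def build_payload(bodies, headers=None):
    """Encode event bodies as the JSON array of {"headers", "body"} objects JSONHandler expects."""
    events = [{"headers": dict(headers or {}), "body": body} for body in bodies]
    return json.dumps(events).encode("utf-8")


def post_events(url, bodies):
    """POST the events to a Flume HTTP source and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=build_payload(bodies),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


# Example call (needs the agent from myhost.conf to be running):
# post_events("http://master:50000", ["hello flume"])
```

The headers are left empty here on purpose, since the timestamp and host interceptors fill them in on the agent side.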
The generated log files can then be viewed on HDFS.
Static Interceptor:
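The static interceptor adds a fixed key and value to the event headers.
It lets the user define one static header and assigns the same value to it on every event that passes through the source.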
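The behaviour can be modelled in a few lines of Python. This is only a sketch of the idea, not Flume's code: events are represented as plain dictionaries, and the function name static_interceptor and its preserve_existing parameter are illustrative stand-ins for the interceptor and its preserveExisting option.

```python
def static_interceptor(event, key, value, preserve_existing=True):
    """Add a fixed header to an event, optionally keeping a header that is already set."""
    headers = event["headers"]
    if preserve_existing and key in headers:
        return event  # leave the existing value untouched
    headers[key] = value
    return event


events = [
    {"headers": {}, "body": "first"},
    {"headers": {"static_key": "already-set"}, "body": "second"},
]
for event in events:
    print(static_interceptor(event, "static_key", "static_value"))
```

Every event ends up with the same key, which makes the header useful for tagging all traffic from one source, for example with a data centre or application name.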
## Source interceptor
# Configuration file: static_case18.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 50000
a1.sources.r1.host = 192.168.27.174
a1.sources.r1.channels = c1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = static_key
a1.sources.r1.interceptors.i1.value =static_value
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
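With the environment variables set, change to the Flume directory and run the flume command: flume-ng agent -c conf -f conf/static_case18.conf -n a1 -Dflume.root.logger=INFO,console
Flume starts in this terminal and, on a successful start, prints its startup log to the console.
Then open another terminal and use telnet to send a message to port 50000, for example with telnet 192.168.27.174 50000.
The Flume terminal now logs the event together with the configured static_key and static_value header.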
Body Text Serializer:
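The Body Text Serializer is an event serializer rather than an interceptor. It writes the body of each event to the output stream without any conversion or modification, and it ignores the event headers entirely.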
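As a rough model of what that means, the Python sketch below writes only the bodies of a list of events to a byte stream. It is an illustration under assumptions, not Flume's serializer: events are plain dictionaries, and write_bodies and append_newline are hypothetical names, the latter mirroring the appendNewline option used in the configuration.

```python
import io


def write_bodies(events, out, append_newline=True):
    """Write each event body to `out` unchanged; headers are never written."""
    for event in events:
        out.write(event["body"])
        if append_newline:
            out.write(b"\n")


buffer = io.BytesIO()
write_bodies(
    [{"headers": {"host": "master"}, "body": b"Hello"}, {"headers": {}, "body": b"looklook"}],
    buffer,
    append_newline=False,
)
print(buffer.getvalue())
```

With append_newline turned off, consecutive bodies run together in the output file, so the option is usually left on unless the bodies already end with a newline.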
# Create the configuration file body_case15.conf
# Name the sources
a1.sources = r1
# Name the sinks
a1.sinks = k1
# Name the channels
a1.channels = c1
# Source configuration
a1.sources.r1.type = http
a1.sources.r1.port = 50000
# Set the bind address to match your environment
a1.sources.r1.bind = 192.168.27.174
a1.sources.r1.channels = c1
# Sink configuration
a1.sinks.k1.type = file_roll
a1.sinks.k1.channel = c1
a1.sinks.k1.sink.directory = /tmp/logs
a1.sinks.k1.sink.serializer = text
a1.sinks.k1.sink.serializer.appendNewline =false
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
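Run flume-ng agent -c conf -f conf/body_case15.conf -n a1 -Dflume.root.logger=INFO,console to start Flume.
After it starts, open another terminal and send data to the listening port:
telnet 192.168.27.174 50000
(Change the IP address here to match your environment.)
Now change to the /tmp/logs directory to view what the sink has written: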
hadoop@master:/tmp/logs$ ll
total 32
drwxrwxrwx 2 root root 20480 Jul 31 20:14 ./
drwxrwxrwt 19 root root 4096 Jul 31 20:14 ../
-rw-rw-r-- 1 hadoop hadoop 14 Jul 31 19:48 1533082885173-294
hadoop@master:/tmp/logs$ cat 1533082885173-294
Hellolooklook
The file contains exactly the content given in the body of the request we sent.
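Regex Filtering Interceptor:
The Regex Filtering Interceptor filters events by matching them against a configured regular expression. It can be used either to include matching events or to exclude them. It is commonly used for data cleaning, selecting the wanted data with a regular expression.
# Create the configuration file: regex_filter_case19.conf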
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 50000
a1.sources.r1.host = 192.168.233.128
a1.sources.r1.channels = c1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = ^[0-9]*$
a1.sources.r1.interceptors.i1.excludeEvents = true
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
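The regular expression defined here is ^[0-9]*$, which matches a body that consists entirely of digits. Because excludeEvents is set to true, every event that matches is dropped.
Next, start the Flume agent, for example with flume-ng agent -c conf -f conf/regex_filter_case19.conf -n a1 -Dflume.root.logger=INFO,console
After it starts, open another terminal and send the values a, 1222 and a222 to the listening port:
telnet 192.168.233.128 50000
Check the output in the terminal where the Flume agent is running.
The output shows that of the three values sent to port 50000, "a", "1222" and "a222", only "a" and "a222" appear after the interceptor has run. "1222" was treated as invalid data and was not passed on.
The regular expression given to the Regex Filtering Interceptor can be rewritten as needed, so that it matches whatever content a particular use case requires.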
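The filtering rule can be checked outside Flume with a short Python sketch. It uses Python's re module as a stand-in for the Java regex engine Flume runs on, which is an assumption that holds for a simple pattern like this one. The function name regex_filter and its exclude_events parameter are illustrative and mirror the excludeEvents option.

```python
import re


def regex_filter(bodies, pattern, exclude_events=False):
    """Keep bodies that match `pattern`, or those that do not when exclude_events is True."""
    compiled = re.compile(pattern)
    kept = []
    for body in bodies:
        matched = compiled.search(body) is not None
        if matched != exclude_events:
            kept.append(body)
    return kept


sent = ["a", "1222", "a222"]
print(regex_filter(sent, r"^[0-9]*$", exclude_events=True))   # the configuration above
print(regex_filter(sent, r"^[0-9]*$", exclude_events=False))  # the opposite mode
```

Switching exclude_events to False inverts the result, so the same pattern keeps only the all-digit bodies.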
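Regex Extractor Interceptor:
Purpose:
① Matches the content of the event body against the regular expression given in the configuration
② If the content matches, pairs each captured group with the key given in the configuration file and adds the resulting key:value to the event header
③ Leaves the content of the event body unchanged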
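The three points above can be sketched in Python as follows. This is a model of the idea rather than Flume's code: Python's re module stands in for the Java regex engine, the function name regex_extract is hypothetical, and the pattern and the key names word and digital are the ones used in the configuration for this section.

```python
import re


def regex_extract(body, pattern, names):
    """Return headers built from the capture groups of `pattern`, or {} when there is no match."""
    match = re.search(pattern, body)
    if match is None:
        return {}
    # Group 1 goes to the first name, group 2 to the second, and so on.
    return {name: match.group(index) for index, name in enumerate(names, start=1)}


pattern = r"(^[a-zA-Z]*)\s([0-9]*$)"
event = {"headers": {}, "body": "hello 123"}
event["headers"].update(regex_extract(event["body"], pattern, ["word", "digital"]))
print(event)
```

The body is only read, never modified, which corresponds to point ③.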
# Configuration file: regex_exter_case.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# 02 source,http,jsonhandler
a1.sources.r1.type = http
a1.sources.r1.bind = 127.0.0.1
a1.sources.r1.port = 6666
a1.sources.r1.handler = org.apache.flume.source.http.JSONHandler
# 03 regex extractor interceptor, match event body to extract character and digital
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
# The regex defines two groups, so a match yields two parts; note that the backslash of the whitespace class \s must be escaped
a1.sources.r1.interceptors.i1.regex = (^[a-zA-Z]*)\\s([0-9]*$)
# specify key for 2 matched part
a1.sources.r1.interceptors.i1.serializers = s1 s2
# key name
a1.sources.r1.interceptors.i1.serializers.s1.name = word
a1.sources.r1.interceptors.i1.serializers.s2.name = digital
# logger sink
a1.sinks.k1.type = logger
# channel,memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
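Next, start the Flume agent, for example with flume-ng agent -c conf -f conf/regex_exter_case.conf -n a1 -Dflume.root.logger=INFO,console
After it starts, open another terminal and send data to the listening port:
telnet 127.0.0.1 6666
Check the output in the terminal where the Flume agent is running.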