Flume Interceptors (regex filter interceptor, and a custom interceptor written in IDEA)

Flume Interceptors

  • I. Using the regex filter interceptor (dropping the header line)
  • II. Writing a custom interceptor
    • 1. Create a Maven project
    • 2. Write the custom interceptor in IDEA
    • 3. Package the JAR and copy it to $FLUME_HOME/lib
    • 4. Write the agent configuration file
    • 5. Results

I. Using the regex filter interceptor (dropping the header line)

Configuration properties:

  • type: the component type, regex_filter
  • regex: a regular expression matched against the event body
  • excludeEvents: if true, events matching the regex are dropped; if false, events not matching the regex are dropped (see the sketch below)
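A minimal sketch of that keep/drop rule in plain Java (an illustration, not Flume's actual source; the header line "user_id,age,gender" is a hypothetical example):

import java.util.regex.Pattern;

// Illustration of regex_filter semantics -- not Flume's source code.
public class RegexFilterSketch {
    // returns true when an event with this body would be dropped
    static boolean shouldDrop(String body, String regex, boolean excludeEvents) {
        boolean matched = Pattern.compile(regex).matcher(body).find();
        return excludeEvents ? matched : !matched;
    }

    public static void main(String[] args) {
        // with regex=^user_id and excludeEvents=true, only the header line is dropped
        System.out.println(shouldDrop("user_id,age,gender", "^user_id", true)); // true
        System.out.println(shouldDrop("1,24,M", "^user_id", true));             // false
    }
}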

Requirements:

use a spooling directory source to pick up files whose names match the pattern users_YYYY-MM-DD.csv;

use the regex filter interceptor to drop the header line;

buffer events in a file channel;

upload the data to the designated directory on HDFS in the specified file format.

[root@hadoop1 user]#mkdir /opt/flume160/conf/jobkb09/dataSourceFile/user
[root@hadoop1 user]#mkdir /opt/flume160/conf/jobkb09/checkPointFile/user
[root@hadoop1 user]#mkdir /opt/flume160/conf/jobkb09/dataChannelFile/user

#the agent configuration file
[root@hadoop1 jobkb09]# vi user-flume-hdfs.conf
users.sources=userSource
users.channels=userChannel
users.sinks=userSink

users.sources.userSource.type=spooldir
users.sources.userSource.spoolDir=/opt/flume160/conf/jobkb09/dataSourceFile/user
users.sources.userSource.includePattern=users_[0-9]{4}-[0-9]{2}-[0-9]{2}.csv
users.sources.userSource.deserializer=LINE
users.sources.userSource.deserializer.maxLineLength=10000
#regex filter interceptor
users.sources.userSource.interceptors=head_filter
users.sources.userSource.interceptors.head_filter.type=regex_filter
#match events that start with user_id (the CSV header line)
users.sources.userSource.interceptors.head_filter.regex=^user_id
#true: events matched by the regex are dropped
users.sources.userSource.interceptors.head_filter.excludeEvents=true

users.channels.userChannel.type=file
users.channels.userChannel.checkpointDir=/opt/flume160/conf/jobkb09/checkPointFile/user
#data directories where the channel stores its files
users.channels.userChannel.dataDirs=/opt/flume160/conf/jobkb09/dataChannelFile/user

users.sinks.userSink.type=hdfs
users.sinks.userSink.hdfs.fileType=DataStream
users.sinks.userSink.hdfs.filePrefix=user
users.sinks.userSink.hdfs.fileSuffix=.csv
users.sinks.userSink.hdfs.path=hdfs://192.168.36.100:9000/kb09file/user/users/%Y-%m-%d
users.sinks.userSink.hdfs.useLocalTimeStamp=true
users.sinks.userSink.hdfs.batchSize=640
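#roll a new HDFS file every 20 s; rollCount=0 disables count-based rolling; rollSize is in bytes (~120 MB here)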
users.sinks.userSink.hdfs.rollInterval=20
users.sinks.userSink.hdfs.rollCount=0
users.sinks.userSink.hdfs.rollSize=120000000

users.sources.userSource.channels=userChannel
users.sinks.userSink.channel=userChannel

#start the agent
[root@hadoop1 flume160]#./bin/flume-ng agent --name users --conf ./conf/ --conf-file ./conf/jobkb09/user-flume-hdfs.conf -Dflume.root.logger=INFO,console

#drop a file whose name matches the include pattern into the spooling directory
[root@hadoop1 tmp]# cp users.csv /opt/flume160/conf/jobkb09/dataSourceFile/user/users_2020-12-01.csv
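A quick way to confirm that the copied filename matches the includePattern above (plain Java; illustration only):

public class IncludePatternCheck {
    public static void main(String[] args) {
        String pattern = "users_[0-9]{4}-[0-9]{2}-[0-9]{2}.csv";
        System.out.println("users_2020-12-01.csv".matches(pattern)); // true
        System.out.println("users.csv".matches(pattern));            // false
    }
}

Note that the unescaped dot matches any character, so a name like users_2020-12-01_csv would also slip through; users_[0-9]{4}-[0-9]{2}-[0-9]{2}\.csv is stricter.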

II. Writing a custom interceptor

Requirement: route events to different directories based on their content:
events starting with oldhu go to the oldhu directory;
everything else goes to the hu directory.

1. Create a Maven project

Add the following dependency:

 <dependency>
      <groupId>org.apache.flume</groupId>
      <artifactId>flume-ng-core</artifactId>
      <version>1.6.0</version>
 </dependency>

2. Write the custom interceptor in IDEA

package nj.zb.kb09;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * @author Oldhu
 * Custom interceptor: tags each event with a "type" header based on its body,
 * so a multiplexing channel selector can route it.
 */
public class InterceptorDmo implements Interceptor {
    private List<Event> addHeaderEvents;

    @Override
    public void initialize() {
        addHeaderEvents = new ArrayList<>();
    }

    @Override
    public Event intercept(Event event) {
        byte[] body = event.getBody();
        Map<String, String> headers = event.getHeaders();
        String bodyStr = new String(body);
        // events whose body starts with "oldhu" go to the oldhu directory,
        // everything else goes to the hu directory
        if (bodyStr.startsWith("oldhu")) {
            headers.put("type", "oldhu");
        } else {
            headers.put("type", "hu");
        }
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        // reuse one list across batches: tag every event, then hand the batch back
        addHeaderEvents.clear();
        for (Event event : events) {
            addHeaderEvents.add(intercept(event));
        }
        return addHeaderEvents;
    }

    @Override
    public void close() {
    }

    // Flume instantiates the interceptor through this nested Builder class
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new InterceptorDmo();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
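A quick local sanity check for the interceptor (a hypothetical test class, not part of the deployed JAR; it assumes flume-ng-core is on the classpath and uses Flume's org.apache.flume.event.EventBuilder):

package nj.zb.kb09;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;

public class InterceptorDmoCheck {
    public static void main(String[] args) {
        Interceptor interceptor = new InterceptorDmo.Builder().build();
        interceptor.initialize();

        Event e1 = EventBuilder.withBody("oldhu 123", StandardCharsets.UTF_8);
        Event e2 = EventBuilder.withBody("hu hahahh", StandardCharsets.UTF_8);

        System.out.println(interceptor.intercept(e1).getHeaders()); // {type=oldhu}
        System.out.println(interceptor.intercept(e2).getHeaders()); // {type=hu}
        interceptor.close();
    }
}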

3. Package the JAR and copy it to the $FLUME_HOME/lib directory

4. Write the agent configuration file

Two channels, each feeding its own sink:

[root@hadoop1 jobkb09]# vi netcat-flume-interceptor-hdfs.conf
#name the agent's components
ictdemo.sources=ictSource
ictdemo.channels=ictChannel1 ictChannel2
ictdemo.sinks=ictSink1 ictSink2

#netcat source settings
ictdemo.sources.ictSource.type=netcat
ictdemo.sources.ictSource.bind=localhost
ictdemo.sources.ictSource.port=7777

#register the custom interceptor; its type is the fully qualified name of the nested Builder class
ictdemo.sources.ictSource.interceptors=interceptor1
ictdemo.sources.ictSource.interceptors.interceptor1.type=nj.zb.kb09.InterceptorDmo$Builder

#use a multiplexing channel selector
ictdemo.sources.ictSource.selector.type=multiplexing
#the header key to match on
ictdemo.sources.ictSource.selector.header=type
ictdemo.sources.ictSource.selector.mapping.oldhu=ictChannel1
ictdemo.sources.ictSource.selector.mapping.hu=ictChannel2

#memory channel
ictdemo.channels.ictChannel1.type=memory
ictdemo.channels.ictChannel1.capacity=1000
ictdemo.channels.ictChannel1.transactionCapacity=1000

ictdemo.channels.ictChannel2.type=memory
ictdemo.channels.ictChannel2.capacity=1000
ictdemo.channels.ictChannel2.transactionCapacity=1000

#hdfs sinks
ictdemo.sinks.ictSink1.type=hdfs
ictdemo.sinks.ictSink1.hdfs.fileType=DataStream
ictdemo.sinks.ictSink1.hdfs.filePrefix=oldhu
ictdemo.sinks.ictSink1.hdfs.fileSuffix=.csv
ictdemo.sinks.ictSink1.hdfs.path=hdfs://192.168.36.100:9000/kb09file/user/oldhu/%Y-%m-%d
ictdemo.sinks.ictSink1.hdfs.useLocalTimeStamp=true
ictdemo.sinks.ictSink1.hdfs.batchSize=640
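#rollSize is in bytes: 1000 B with a 3 s rollInterval yields frequent small files, which is fine for a demo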
ictdemo.sinks.ictSink1.hdfs.rollCount=0
ictdemo.sinks.ictSink1.hdfs.rollSize=1000
ictdemo.sinks.ictSink1.hdfs.rollInterval=3

ictdemo.sinks.ictSink2.type=hdfs
ictdemo.sinks.ictSink2.hdfs.fileType=DataStream
ictdemo.sinks.ictSink2.hdfs.filePrefix=hu
ictdemo.sinks.ictSink2.hdfs.fileSuffix=.csv
ictdemo.sinks.ictSink2.hdfs.path=hdfs://192.168.36.100:9000/kb09file/user/hu/%Y-%m-%d
ictdemo.sinks.ictSink2.hdfs.useLocalTimeStamp=true
ictdemo.sinks.ictSink2.hdfs.batchSize=640
ictdemo.sinks.ictSink2.hdfs.rollCount=0
ictdemo.sinks.ictSink2.hdfs.rollSize=1000
ictdemo.sinks.ictSink2.hdfs.rollInterval=3

ictdemo.sources.ictSource.channels=ictChannel1 ictChannel2
ictdemo.sinks.ictSink1.channel=ictChannel1
ictdemo.sinks.ictSink2.channel=ictChannel2
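Conceptually, the multiplexing selector is just a lookup on the header that the interceptor set; a minimal sketch in plain Java (illustration only, not Flume's implementation):

import java.util.HashMap;
import java.util.Map;

public class MultiplexingSketch {
    public static void main(String[] args) {
        // the selector.mapping.* entries from the config above
        Map<String, String> mapping = new HashMap<>();
        mapping.put("oldhu", "ictChannel1");
        mapping.put("hu", "ictChannel2");

        // the "type" header that InterceptorDmo attached to the event
        String headerValue = "oldhu";
        System.out.println(mapping.get(headerValue)); // ictChannel1
    }
}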

#start the agent
[root@hadoop1 flume160]#./bin/flume-ng agent --name ictdemo --conf ./conf/ --conf-file ./conf/jobkb09/netcat-flume-interceptor-hdfs.conf -Dflume.root.logger=INFO,console

#start netcat
#once connected, each line typed below is routed to the matching directory
[root@hadoop1 ~]# telnet localhost 7777
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
5555
OK
oldhu
OK
hu hahahh
OK
#i.e. 5555 and "hu hahahh" end up in /kb09file/user/hu/2020-12-01,
#while "oldhu" ends up in /kb09file/user/oldhu/2020-12-01

5. Results

[Image 1: execution result]

[Image 2: execution result]
