2. Flume Deployment and Usage

4.1 Configuration Files

Check JAVA_HOME: echo $JAVA_HOME

It should print something like /opt/module/jdk1.8.0_144

Install Flume:

[itstar@bigdata113 software]$ tar -zxvf apache-flume-1.8.0-bin.tar.gz -C /opt/module/

Rename the environment template:

[itstar@bigdata113 conf]$ mv flume-env.sh.template flume-env.sh

Item to modify in flume-env.sh:

export JAVA_HOME=/opt/module/jdk1.8.0_144


4.2 Examples

4.2.1 Case 1: Monitoring Port Data

Goal: Flume listens on a port on one console; another console sends messages to that port, and the Flume side displays them in real time.

Step-by-step:

1) Install the telnet tool

(with network access) yum -y install telnet

(installation complete)

2) Create the Flume agent configuration file flume-telnet.conf

# Define the agent

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Define the source

a1.sources.r1.type = netcat

a1.sources.r1.bind = bigdata113

a1.sources.r1.port = 44445

# Define the sink

a1.sinks.k1.type = logger

# Define the memory channel

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

3) Check whether port 44445 is already in use

$ netstat -tunlp | grep 44445

4) Start Flume with this configuration file

/opt/module/flume-1.8.0/bin/flume-ng agent \
--conf /opt/module/flume-1.8.0/conf/ \
--name a1 \
--conf-file /opt/module/flume-1.8.0/jobconf/flume-telnet.conf \
-Dflume.root.logger=INFO,console

flume-ng: the startup command

--conf: directory containing the Flume configuration

--name: name of the agent

--conf-file: path to the job configuration file

-Dflume.root.logger=INFO,console: print logs to the console

5) Use telnet to send data to port 44445 on this host

$ telnet bigdata113 44445

4.2.2 Case 2: Streaming a Local File to HDFS in Real Time

1) Create the flume-hdfs.conf file

# 1 agent

a2.sources = r2

a2.sinks = k2

a2.channels = c2

# 2 source

a2.sources.r2.type = exec

a2.sources.r2.command = tail -F /opt/Andy

a2.sources.r2.shell = /bin/bash -c

# 3 sink

a2.sinks.k2.type = hdfs

a2.sinks.k2.hdfs.path = hdfs://bigdata111:9000/flume/%Y%m%d/%H

# Prefix for uploaded files

a2.sinks.k2.hdfs.filePrefix = logs-

# Whether to roll folders based on time

a2.sinks.k2.hdfs.round = true

# How many time units before creating a new folder

a2.sinks.k2.hdfs.roundValue = 1

# Redefine the time unit

a2.sinks.k2.hdfs.roundUnit = hour

# Whether to use the local timestamp

a2.sinks.k2.hdfs.useLocalTimeStamp = true

# Number of events to accumulate before flushing to HDFS

a2.sinks.k2.hdfs.batchSize = 1000

# File type; compression is supported

a2.sinks.k2.hdfs.fileType = DataStream

# How often (in seconds) to roll a new file

a2.sinks.k2.hdfs.rollInterval = 600

# Roll size of each file (bytes)

a2.sinks.k2.hdfs.rollSize = 134217700

# File rolling is independent of the number of events

a2.sinks.k2.hdfs.rollCount = 0

# Minimum number of block replicas

a2.sinks.k2.hdfs.minBlockReplicas = 1

# Define the memory channel

a2.channels.c2.type = memory

a2.channels.c2.capacity = 1000

a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel

a2.sources.r2.channels = c2

a2.sinks.k2.channel = c2

2) Run the agent with the monitoring configuration

/opt/module/flume-1.8.0/bin/flume-ng agent \
--conf /opt/module/flume-1.8.0/conf/ \
--name a2 \
--conf-file /opt/module/flume-1.8.0/jobconf/flume-hdfs.conf
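To verify the job, you can append a line to the monitored file and then list the target directory on HDFS. A minimal check, assuming the paths from the configuration above and a working HDFS client:

echo "hello flume" >> /opt/Andy
hdfs dfs -ls /flume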

4.2.3 Case 3: Streaming Files from a Directory to HDFS in Real Time

Goal: use Flume to monitor all files in a directory.

Step-by-step:

1) Create the configuration file flume-dir.conf

#1 Agent

a3.sources = r3

a3.sinks = k3

a3.channels = c3

#2 source

a3.sources.r3.type = spooldir

a3.sources.r3.spoolDir = /opt/module/flume1.8.0/upload

a3.sources.r3.fileSuffix = .COMPLETED

a3.sources.r3.fileHeader = true

# Ignore all files ending in .tmp; do not upload them

a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

# 3 sink

a3.sinks.k3.type = hdfs

a3.sinks.k3.hdfs.path = hdfs://bigdata111:9000/flume/%H

# Prefix for uploaded files

a3.sinks.k3.hdfs.filePrefix = upload-

# Whether to roll folders based on time

a3.sinks.k3.hdfs.round = true

# How many time units before creating a new folder

a3.sinks.k3.hdfs.roundValue = 1

# Redefine the time unit

a3.sinks.k3.hdfs.roundUnit = hour

# Whether to use the local timestamp

a3.sinks.k3.hdfs.useLocalTimeStamp = true

# Number of events to accumulate before flushing to HDFS

a3.sinks.k3.hdfs.batchSize = 100

# File type; compression is supported

a3.sinks.k3.hdfs.fileType = DataStream

# How often (in seconds) to roll a new file

a3.sinks.k3.hdfs.rollInterval = 600

# Roll size of each file, roughly 128 MB

a3.sinks.k3.hdfs.rollSize = 134217700

# File rolling is independent of the number of events

a3.sinks.k3.hdfs.rollCount = 0

# Minimum number of block replicas

a3.sinks.k3.hdfs.minBlockReplicas = 1

# Use a channel which buffers events in memory

a3.channels.c3.type = memory

a3.channels.c3.capacity = 1000

a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel

a3.sources.r3.channels = c3

a3.sinks.k3.channel = c3

2) Run the test: start the agent with the command below, then try adding files to the upload directory (see the sketch after the tips below).

/opt/module/flume1.8.0/bin/flume-ng agent \
--conf /opt/module/flume1.8.0/conf/ \
--name a3 \
--conf-file /opt/module/flume1.8.0/jobconf/flume-dir.conf

Note: when using the Spooling Directory Source:

1) Do not create files in the monitored directory and keep modifying them

2) Files that have been ingested are renamed with the .COMPLETED suffix

3) The monitored directory is scanned for changes every 500 milliseconds
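A minimal way to exercise this job, assuming the upload directory and HDFS path from the configuration above:

echo "spooldir test" > /opt/module/flume1.8.0/upload/test.log
hdfs dfs -ls /flume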

4.2.4 Case 4: Passing Data Between Flume Agents: One Flume with Multiple Channels and Sinks

Goal: flume1 monitors a file for changes and passes the changes to flume2, which stores them in HDFS; at the same time flume1 passes the changes to flume3, which writes them to the local filesystem.

Step-by-step:

1) Create flume1.conf, which monitors the file and uses two channels and two sinks to send the data to flume2 and flume3:

# 1.agent

a1.sources = r1

a1.sinks = k1 k2

a1.channels = c1 c2

# Replicate the data flow to multiple channels

a1.sources.r1.selector.type = replicating

# 2.source

a1.sources.r1.type = exec

a1.sources.r1.command = tail -F /opt/Andy

a1.sources.r1.shell = /bin/bash -c

# 3.sink1

a1.sinks.k1.type = avro

a1.sinks.k1.hostname = bigdata111

a1.sinks.k1.port = 4141

# sink2

a1.sinks.k2.type = avro

a1.sinks.k2.hostname = bigdata111

a1.sinks.k2.port = 4142

# 4.channel—1

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# 4.channel—2

a1.channels.c2.type = memory

a1.channels.c2.capacity = 1000

a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel

a1.sources.r1.channels = c1 c2

a1.sinks.k1.channel = c1

a1.sinks.k2.channel = c2

2) Create flume2.conf, which receives events from flume1 and uses one channel and one sink to deliver the data to HDFS:

# 1 agent

a2.sources = r1

a2.sinks = k1

a2.channels = c1

# 2 source

a2.sources.r1.type = avro

a2.sources.r1.bind = bigdata111

a2.sources.r1.port = 4141

# 3 sink

a2.sinks.k1.type = hdfs

a2.sinks.k1.hdfs.path = hdfs://bigdata111:9000/flume2/%H

# Prefix for uploaded files

a2.sinks.k1.hdfs.filePrefix = flume2-

# Whether to roll folders based on time

a2.sinks.k1.hdfs.round = true

# How many time units before creating a new folder

a2.sinks.k1.hdfs.roundValue = 1

# Redefine the time unit

a2.sinks.k1.hdfs.roundUnit = hour

# Whether to use the local timestamp

a2.sinks.k1.hdfs.useLocalTimeStamp = true

# Number of events to accumulate before flushing to HDFS

a2.sinks.k1.hdfs.batchSize = 100

# File type; compression is supported

a2.sinks.k1.hdfs.fileType = DataStream

# How often (in seconds) to roll a new file

a2.sinks.k1.hdfs.rollInterval = 600

# Roll size of each file, roughly 128 MB

a2.sinks.k1.hdfs.rollSize = 134217700

# File rolling is independent of the number of events

a2.sinks.k1.hdfs.rollCount = 0

# Minimum number of block replicas

a2.sinks.k1.hdfs.minBlockReplicas = 1

# 4 channel

a2.channels.c1.type = memory

a2.channels.c1.capacity = 1000

a2.channels.c1.transactionCapacity = 100

#5 Bind

a2.sources.r1.channels = c1

a2.sinks.k1.channel = c1

3) Create flume3.conf, which receives events from flume1 and uses one channel and one sink to deliver the data to a local directory:

#1 agent

a3.sources = r1

a3.sinks = k1

a3.channels = c1

# 2 source

a3.sources.r1.type = avro

a3.sources.r1.bind = bigdata111

a3.sources.r1.port = 4142

#3 sink

a3.sinks.k1.type = file_roll

# Note: this directory must be created in advance

a3.sinks.k1.sink.directory = /opt/flume3

# 4 channel

a3.channels.c1.type = memory

a3.channels.c1.capacity = 1000

a3.channels.c1.transactionCapacity = 100

# 5 Bind

a3.sources.r1.channels = c1

a3.sinks.k1.channel = c1

Note: the local output directory must already exist; if it does not, file_roll will not create it for you.
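For example, it can be created up front (path taken from the flume3.conf sink above):

mkdir -p /opt/flume3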

4) Run the test: start the three Flume jobs (starting the downstream agents flume2 and flume3 before flume1 avoids avro connection errors), generate changes to the monitored file, and observe the results:

$ bin/flume-ng agent --conf conf/ --name a1 --conf-file jobconf/flume1.conf

$ bin/flume-ng agent --conf conf/ --name a2 --conf-file jobconf/flume2.conf

$ bin/flume-ng agent --conf conf/ --name a3 --conf-file jobconf/flume3.conf
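One way to generate traffic and check both outputs, assuming the paths from the configurations above:

echo "hello replication" >> /opt/Andy
hdfs dfs -ls /flume2
ls /opt/flume3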

4.2.5 Case 5: Passing Data Between Flume Agents: Aggregating Data from Multiple Flume Agents into One

Goal: flume11 monitors a local file (the example configuration tails /opt/Andy), flume22 monitors the data stream on a port; both send their data to flume33, which writes the merged data to HDFS.

Step-by-step:

1) Create flume11.conf, which monitors the file and sinks the data to flume33:

# 1 agent

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# 2 source

a1.sources.r1.type = exec

a1.sources.r1.command = tail -F /opt/Andy

a1.sources.r1.shell = /bin/bash -c

# 3 sink

a1.sinks.k1.type = avro

a1.sinks.k1.hostname = bigdata111

a1.sinks.k1.port = 4141

# 4 channel

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# 5. Bind

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

2) Create flume22.conf, which monitors the data stream on port 44444 and sinks it to flume33:

# 1 agent

a2.sources = r1

a2.sinks = k1

a2.channels = c1

# 2 source

a2.sources.r1.type = netcat

a2.sources.r1.bind = bigdata111

a2.sources.r1.port = 44444

#3 sink

a2.sinks.k1.type = avro

a2.sinks.k1.hostname = bigdata111

a2.sinks.k1.port = 4141

# 4 channel

a2.channels.c1.type = memory

a2.channels.c1.capacity = 1000

a2.channels.c1.transactionCapacity = 100

# 5 Bind

a2.sources.r1.channels = c1

a2.sinks.k1.channel = c1

3) Create flume33.conf, which receives the data streams sent by flume11 and flume22 and sinks the merged data to HDFS:

# 1 agent

a3.sources = r1

a3.sinks = k1

a3.channels = c1

# 2 source

a3.sources.r1.type = avro

a3.sources.r1.bind = bigdata111

a3.sources.r1.port = 4141

# 3 sink

a3.sinks.k1.type = hdfs

a3.sinks.k1.hdfs.path = hdfs://bigdata111:9000/flume3/%H

# Prefix for uploaded files

a3.sinks.k1.hdfs.filePrefix = flume3-

# Whether to roll folders based on time

a3.sinks.k1.hdfs.round = true

# How many time units before creating a new folder

a3.sinks.k1.hdfs.roundValue = 1

# Redefine the time unit

a3.sinks.k1.hdfs.roundUnit = hour

# Whether to use the local timestamp

a3.sinks.k1.hdfs.useLocalTimeStamp = true

# Number of events to accumulate before flushing to HDFS

a3.sinks.k1.hdfs.batchSize = 100

# File type; compression is supported

a3.sinks.k1.hdfs.fileType = DataStream

# How often (in seconds) to roll a new file

a3.sinks.k1.hdfs.rollInterval = 600

# Roll size of each file, roughly 128 MB

a3.sinks.k1.hdfs.rollSize = 134217700

# File rolling is independent of the number of events

a3.sinks.k1.hdfs.rollCount = 0

# Minimum number of block replicas

a3.sinks.k1.hdfs.minBlockReplicas = 1

# 4 channel

a3.channels.c1.type = memory

a3.channels.c1.capacity = 1000

a3.channels.c1.transactionCapacity = 100

# 5 Bind

a3.sources.r1.channels = c1

a3.sinks.k1.channel = c1

4) Run the test: start the corresponding Flume jobs (flume33, then flume22, then flume11), then generate data and observe the results:

$ bin/flume-ng agent --conf conf/ --name a3 --conf-file jobconf/flume33.conf

$ bin/flume-ng agent --conf conf/ --name a2 --conf-file jobconf/flume22.conf

$ bin/flume-ng agent --conf conf/ --name a1 --conf-file jobconf/flume11.conf

Send data:

1) telnet bigdata111 44444, then send 5555555 once connected

2) Append 666666 to /opt/Andy
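For example, step 2 plus a quick HDFS check (paths taken from the configurations above, assuming a working HDFS client):

echo "666666" >> /opt/Andy
hdfs dfs -ls /flume3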

4.2.6 Case 6: Flume Interceptors

Timestamp interceptor

Timestamp.conf

# 1. Define the agent name and the names of the source, channel, and sink

a4.sources = r1

a4.channels = c1

a4.sinks = k1

# 2. Define the source

a4.sources.r1.type = spooldir

a4.sources.r1.spoolDir = /opt/module/flume-1.8.0/upload

# Define the interceptor; it adds a timestamp header to each event

a4.sources.r1.interceptors = i1

a4.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder

# Define the channel

a4.channels.c1.type = memory

a4.channels.c1.capacity = 10000

a4.channels.c1.transactionCapacity = 100

# Define the sink

a4.sinks.k1.type = hdfs

a4.sinks.k1.hdfs.path = hdfs://bigdata111:9000/flume-interceptors/%H

a4.sinks.k1.hdfs.filePrefix = events-

a4.sinks.k1.hdfs.fileType = DataStream

# Do not roll files based on event count

a4.sinks.k1.hdfs.rollCount = 0

# Roll a new file on HDFS when it reaches 128 MB

a4.sinks.k1.hdfs.rollSize = 134217728

# Roll a new file on HDFS every 60 seconds

a4.sinks.k1.hdfs.rollInterval = 60

# Wire up the source, channel, and sink

a4.sources.r1.channels = c1

a4.sinks.k1.channel = c1

Start command:

/opt/module/flume-1.8.0/bin/flume-ng agent -n a4 \
-f /opt/module/flume-1.8.0/jobconf/Timestamp.conf \
-c /opt/module/flume-1.8.0/conf \
-Dflume.root.logger=INFO,console


Hostname interceptor

Host.conf

# 1. Define the agent

a1.sources= r1

a1.sinks = k1

a1.channels = c1

# 2. Define the source

a1.sources.r1.type = exec

a1.sources.r1.channels = c1

a1.sources.r1.command = tail -F /opt/Andy

# Interceptor

a1.sources.r1.interceptors = i1

a1.sources.r1.interceptors.i1.type = host

# When true, the host IP (e.g. 192.168.1.111) is used; when false, the hostname is used. Default is true

a1.sources.r1.interceptors.i1.useIP = false

a1.sources.r1.interceptors.i1.hostHeader = agentHost

# 3. Define the sinks

a1.sinks.k1.type=hdfs

a1.sinks.k1.channel = c1

a1.sinks.k1.hdfs.path = hdfs://bigdata111:9000/flumehost/%H

a1.sinks.k1.hdfs.filePrefix = Andy_%{agentHost}

# Add the .log suffix to generated files

a1.sinks.k1.hdfs.fileSuffix = .log

a1.sinks.k1.hdfs.fileType = DataStream

a1.sinks.k1.hdfs.writeFormat = Text

a1.sinks.k1.hdfs.rollInterval = 10

a1.sinks.k1.hdfs.useLocalTimeStamp = true

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

Start command:

bin/flume-ng agent -c conf/ -f jobconf/Host.conf -n a1 -Dflume.root.logger=INFO,console

UUID interceptor

uuid.conf

a1.sources = r1

a1.sinks = k1

a1.channels = c1

a1.sources.r1.type = exec

a1.sources.r1.channels = c1

a1.sources.r1.command = tail -F /opt/Andy

a1.sources.r1.interceptors = i1

# The type cannot be written simply as 'uuid'; the fully qualified class name is required, otherwise the class will not be found

a1.sources.r1.interceptors.i1.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder

# If a UUID header already exists, it should be preserved

a1.sources.r1.interceptors.i1.preserveExisting = true

a1.sources.r1.interceptors.i1.prefix = UUID_

# If the sink type is changed to HDFS, the header information will not appear in the text written to HDFS

a1.sinks.k1.type = logger

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

# bin/flume-ng agent -c conf/ -f jobconf/uuid.conf -n a1 -Dflume.root.logger=INFO,console

Search-and-replace interceptor

search.conf

#1 agent

a1.sources = r1

a1.sinks = k1

a1.channels = c1

#2 source

a1.sources.r1.type = exec

a1.sources.r1.channels = c1

a1.sources.r1.command = tail -F /opt/Andy

a1.sources.r1.interceptors = i1

a1.sources.r1.interceptors.i1.type = search_replace

# Replace any digit sequence with 'itstar'; for example, A123 becomes Aitstar

a1.sources.r1.interceptors.i1.searchPattern = [0-9]+

a1.sources.r1.interceptors.i1.replaceString = itstar

a1.sources.r1.interceptors.i1.charset = UTF-8

#3 sink

a1.sinks.k1.type = logger

#4 Chanel

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

#5 bind

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

# bin/flume-ng agent -c conf/ -f jobconf/search.conf -n a1 -Dflume.root.logger=INFO,console

Regex filtering interceptor

filter.conf

#1 agent

a1.sources = r1

a1.sinks = k1

a1.channels = c1

#2 source

a1.sources.r1.type = exec

a1.sources.r1.channels = c1

a1.sources.r1.command = tail -F /opt/Andy

a1.sources.r1.interceptors = i1

a1.sources.r1.interceptors.i1.type = regex_filter

a1.sources.r1.interceptors.i1.regex = ^A.*

# If excludeEvents is false, events that do not start with A are filtered out; if excludeEvents is true, events that start with A are filtered out

a1.sources.r1.interceptors.i1.excludeEvents = true

a1.sinks.k1.type = logger

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

# bin/flume-ng agent -c conf/ -f jobconf/filter.conf -n a1 -Dflume.root.logger=INFO,console

Regex extraction interceptor

extractor.conf

#1 agent

a1.sources = r1

a1.sinks = k1

a1.channels = c1

#2 source

a1.sources.r1.type = exec

a1.sources.r1.channels = c1

a1.sources.r1.command = tail -F /opt/Andy

a1.sources.r1.interceptors = i1

a1.sources.r1.interceptors.i1.type = regex_extractor

a1.sources.r1.interceptors.i1.regex = hostname is (.*?) ip is (.*)

a1.sources.r1.interceptors.i1.serializers = s1 s2

a1.sources.r1.interceptors.i1.serializers.s1.name = hostname

a1.sources.r1.interceptors.i1.serializers.s2.name = ip

a1.sinks.k1.type = logger

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

# bin/flume-ng agent -c conf/ -f jobconf/extractor.conf -n a1 -Dflume.root.logger=INFO,console

Note: the headers created by the regex extraction interceptor do not appear in the file name or the file contents.
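If you want to make use of the extracted headers, one option is to reference them in an HDFS sink path through the %{headerName} escape. A minimal sketch, assuming the hostname header configured above and an HDFS sink in place of the logger sink:

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://bigdata111:9000/extractor/%{hostname}
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true

An input line such as "hostname is bigdata111 ip is 192.168.1.111" would then be written under /extractor/bigdata111.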


4.2.7 Case 7: Custom Flume Interceptor

Convert lowercase letters to uppercase

1. pom.xml

<dependencies>
    <dependency>
        <groupId>org.apache.flume</groupId>
        <artifactId>flume-ng-core</artifactId>
        <version>1.8.0</version>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <version>2.4</version>
            <configuration>
                <archive>
                    <manifest>
                        <addClasspath>true</addClasspath>
                        <classpathPrefix>lib/</classpathPrefix>
                    </manifest>
                </archive>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>utf-8</encoding>
            </configuration>
        </plugin>
    </plugins>
</build>

2. Implement the custom interceptor

import org.apache.flume.Context;

import org.apache.flume.Event;

import org.apache.flume.interceptor.Interceptor;

import java.util.ArrayList;

import java.util.List;

public class MyInterceptor implements Interceptor {

    @Override

    public void initialize() {

    }

    @Override

    public void close() {

    }

    /**

     * Intercept each event sent from the source to the channel

     *

     * @param event the event to be processed

     * @return the event after business-specific processing

     */

    @Override

    public Event intercept(Event event) {

        // Get the byte payload of the event

        byte[] arr = event.getBody();

        // Convert the payload to uppercase

        event.setBody(new String(arr).toUpperCase().getBytes());

        // Return the modified event

        return event;

    }

    // Process a batch of intercepted events

    @Override

    public List<Event> intercept(List<Event> events) {

        List<Event> list = new ArrayList<>();

        for (Event event : events) {

            list.add(intercept(event));

        }

        return list;

    }

    public static class Builder implements Interceptor.Builder {

        // Read properties from the configuration file

        @Override

        public Interceptor build() {

            return new MyInterceptor();

        }

        @Override

        public void configure(Context context) {

        }

    }

}

Build the jar with Maven, create a jar directory under the Flume home (mkdir jar), and upload the jar there.

3. Flume configuration file

ToUpCase.conf

#1.agent

a1.sources = r1

a1.sinks =k1

a1.channels = c1

# Describe/configure the source

a1.sources.r1.type = exec

a1.sources.r1.command = tail -F /opt/Andy

a1.sources.r1.interceptors = i1

# Fully qualified class name followed by $Builder

a1.sources.r1.interceptors.i1.type = ToUpCase.MyInterceptor$Builder

# Describe the sink

a1.sinks.k1.type = hdfs

a1.sinks.k1.hdfs.path = /ToUpCase1

a1.sinks.k1.hdfs.filePrefix = events-

a1.sinks.k1.hdfs.round = true

a1.sinks.k1.hdfs.roundValue = 10

a1.sinks.k1.hdfs.roundUnit = minute

a1.sinks.k1.hdfs.rollInterval = 3

a1.sinks.k1.hdfs.rollSize = 20

a1.sinks.k1.hdfs.rollCount = 5

a1.sinks.k1.hdfs.batchSize = 1

a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Generated file type; the default is SequenceFile, DataStream writes plain text

a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

Run command:

bin/flume-ng agent -c conf/ -n a1 -f jar/ToUpCase.conf -C jar/Flume-1.0-SNAPSHOT.jar -Dflume.root.logger=DEBUG,console

4.2.8 Case 8: Custom Flume Source

1. Code: a custom source that records its offset so it can resume from where it left off

import org.apache.commons.io.FileUtils;

import org.apache.flume.Context;

import org.apache.flume.EventDrivenSource;

import org.apache.flume.channel.ChannelProcessor;

import org.apache.flume.conf.Configurable;

import org.apache.flume.event.EventBuilder;

import org.apache.flume.source.AbstractSource;

import org.apache.flume.source.ExecSource;

import org.slf4j.Logger;

import org.slf4j.LoggerFactory;

import java.io.File;

import java.io.IOException;

import java.io.RandomAccessFile;

import java.util.concurrent.ExecutorService;

import java.util.concurrent.Executors;

import java.util.concurrent.TimeUnit;

/**

 *

 *  Custom source that records the read offset so it can resume where it left off.

 *  Flume lifecycle: constructor first, then configure() -> start() -> processor.process

 *  Configuration items: which file to read, the charset, which file the offset is written to, and how often to check for new content

 *

 */

public class TailFileSource extends AbstractSource implements EventDrivenSource, Configurable {

    private static final Logger logger = LoggerFactory.getLogger(ExecSource.class);

    private String filePath;

    private String charset;

    private String positionFile;

    private long interval;

    private ExecutorService executor;

    private FileRunnable fileRunnable;

    /**

     * Read the configuration (the job configuration file used when Flume runs this job)

     * (if a property is not set in the job configuration, the defaults below are used)

     *

     * @param context

     */

    @Override

    public void configure(Context context) {

        // Which file to read

        filePath = context.getString("filePath");

        // Default charset is UTF-8

        charset = context.getString("charset", "UTF-8");

        // Where to write the offset

        positionFile = context.getString("positionFile");

        // By default, check for new content every 1000 ms

        interval = context.getLong("interval", 1000L);

    }

    /**

     * Create a thread that watches the file

     */

    @Override

    public synchronized void start() {

        // Create a single-threaded executor

        executor = Executors.newSingleThreadExecutor();

        // Get the ChannelProcessor

        final ChannelProcessor channelProcessor = getChannelProcessor();

        fileRunnable = new FileRunnable(filePath, charset, positionFile, interval, channelProcessor);

        // Submit the task to the executor

        executor.submit(fileRunnable);

        // Call the parent class method

        super.start();

    }

    @Override

    public synchronized void stop() {

        // Stop the worker

        fileRunnable.setFlag(false);

        // Shut down the executor

        executor.shutdown();

        while (!executor.isTerminated()) {

            logger.debug("Waiting for filer exec executor service to stop");

            try {

                // Wait up to 500 milliseconds for termination

                executor.awaitTermination(500, TimeUnit.MILLISECONDS);

            } catch (InterruptedException e) {

                logger.debug("InterutedExecption while waiting for exec executor service" +

                        " to stop . Just exiting");

                e.printStackTrace();

            }

        }

        super.stop();

    }

    private static class FileRunnable implements Runnable {

        private String charset;

        private long interval;

        private long offset = 0L;

        private ChannelProcessor channelProcessor;

        private RandomAccessFile raf;

        private boolean flag = true;

        private File posFile;

        /*

        Runs before run(); the constructor executes only once.

        Check whether an offset has been saved: if so, resume from it; otherwise read from the beginning.

         */

        public FileRunnable(String filePath, String charset, String positionFile, long interval, ChannelProcessor channelProcessor) {

            this.charset = charset;

            this.interval = interval;

            this.channelProcessor = channelProcessor;

            // Read the offset from the position file

            posFile = new File(positionFile);

            if (!posFile.exists()) {

                // Create the file if it does not exist

                try {

                    posFile.createNewFile();

                } catch (IOException e) {

                    e.printStackTrace();

                    logger.error("创建保存偏移量的文件失败:", e);

                }

            }

            try {

                // Read the saved offset

                String offsetString = FileUtils.readFileToString(posFile);

                // An offset was saved previously

                if (offsetString != null && !offsetString.isEmpty()) {

                    // Convert the offset to a long

                    offset = Long.parseLong(offsetString);

                }

                // Open the file to read from the saved offset

                raf = new RandomAccessFile(filePath, "r");

                // Seek to the saved offset

                raf.seek(offset);

            } catch (IOException e) {

                logger.error("读取保存偏移量文件时发生错误", e);

                e.printStackTrace();

            }

        }

        @Override

        public void run() {

            while (flag) {

                //  Read new data from the file

                try {

                    String line = raf.readLine();

                    if (line != null) {

                        // Re-decode the line with the configured charset to avoid garbled characters

                        line = new String(line.getBytes("iso8859-1"), charset);

                        channelProcessor.processEvent(EventBuilder.withBody(line.getBytes()));

                        // Get and update the offset

                        offset = raf.getFilePointer();

                        // Write the offset to the position file

                        FileUtils.writeStringToFile(posFile, offset + "");

                    } else {

                        // Nothing new; sleep for a while

                        Thread.sleep(interval);

                    }

                    // send to the channel

                    // update the offset

                    // read once per interval

                } catch (InterruptedException e) {

                    e.printStackTrace();

                    logger.error("read filethread Interrupted", e);

                } catch (IOException e) {

                    logger.error("read log file error", e);

                }

            }

        }

        public void setFlag(boolean flag) {

            this.flag = flag;

        }

    }

}  

2. Configuration file

# Define the agent name and the names of the source, channel, and sink

a1.sources = r1

a1.channels = c1

a1.sinks = k1

# Define the source; the type here is the fully qualified class name of the custom source

a1.sources.r1.type = customSource.TailFileSource

# The property names here match the properties read by the custom class

# Which file to read

a1.sources.r1.filePath = /opt/Andy

# File in which the offset is saved

a1.sources.r1.positionFile = /opt/Cndy

# Polling interval (how often to read), in milliseconds

a1.sources.r1.interval = 2000

# Charset

a1.sources.r1.charset = UTF-8

# Define the channel

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Define the sink

a1.sinks.k1.type = file_roll

a1.sinks.k1.sink.directory = /opt/Bndy

# Wire up the source, channel, and sink

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

Start command:

bin/flume-ng agent -n a1 -f jar/ConsumSource.conf -c conf/ -C jar/ConsumSource.jar -Dflume.root.logger=INFO,console
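To check the resume-from-offset behaviour, append some lines, stop the agent, append more lines, and start it again; only the new lines should be delivered. A minimal sketch, assuming the paths from the configuration above:

echo "line 1" >> /opt/Andy
cat /opt/Cndy    # the saved offset
ls /opt/Bndy     # the rolled output files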


4.2.9 Case 9: Flume to Kafka

Configure Flume (flume-kafka.conf):

# define

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# source

a1.sources.r1.type = exec

a1.sources.r1.command = tail -F -c +0 /opt/jars/calllog.csv

a1.sources.r1.shell = /bin/bash -c

# sink

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink

a1.sinks.k1.brokerList = bigdata111:9092,bigdata112:9092,bigdata113:9092

a1.sinks.k1.topic = calllog

a1.sinks.k1.batchSize = 20

a1.sinks.k1.requiredAcks = 1

# channel

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# bind

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

From the Flume root directory, start Flume:

/opt/module/flume-1.8.0/bin/flume-ng agent --conf /opt/module/flume-1.8.0/conf/ --name a1 --conf-file /opt/jars/flume2kafka.conf
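To confirm that events are reaching Kafka, you can attach a console consumer to the calllog topic. A sketch, assuming a Kafka installation on the brokers listed above (the script path depends on your install):

bin/kafka-console-consumer.sh --bootstrap-server bigdata111:9092 --topic calllog --from-beginning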

4.2.10 Case 10: Kafka to Flume

kafka2flume.conf

agent.sources = kafkaSource

agent.channels = memoryChannel

agent.sinks = hdfsSink

# The channel can be defined as follows.

agent.sources.kafkaSource.channels = memoryChannel

agent.sources.kafkaSource.type=org.apache.flume.source.kafka.KafkaSource

agent.sources.kafkaSource.zookeeperConnect=bigdata111:2181,bigdata112:2181,bigdata113:2181

agent.sources.kafkaSource.topic=calllog

#agent.sources.kafkaSource.groupId=flume

agent.sources.kafkaSource.kafka.consumer.timeout.ms=100

agent.channels.memoryChannel.type=memory

agent.channels.memoryChannel.capacity=10000

agent.channels.memoryChannel.transactionCapacity=1000

# the sink of hdfs

agent.sinks.hdfsSink.type=hdfs

agent.sinks.hdfsSink.channel = memoryChannel

agent.sinks.hdfsSink.hdfs.path=hdfs://bigdata111:9000/kafka2flume

agent.sinks.hdfsSink.hdfs.writeFormat=Text

agent.sinks.hdfsSink.hdfs.fileType=DataStream

# Without these two settings, many small files will be created

agent.sinks.hdfsSink.hdfs.rollSize=0

agent.sinks.hdfsSink.hdfs.rollCount=0
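Note that the source above uses the older ZooKeeper-based property names. On Flume 1.7+ the Kafka source is usually pointed at the brokers instead; a hedged alternative sketch, with the broker list assumed from the cluster used earlier:

agent.sources.kafkaSource.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.kafkaSource.kafka.bootstrap.servers = bigdata111:9092,bigdata112:9092,bigdata113:9092
agent.sources.kafkaSource.kafka.topics = calllog
agent.sources.kafkaSource.kafka.consumer.group.id = flume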

Start command:

bin/flume-ng agent --conf conf --conf-file jobconf/kafka2flume.conf --name agent -Dflume.root.logger=INFO,console

Note: this configuration pulls data from Kafka, but you need to write new data into the Kafka topic before anything is delivered to HDFS.

4.3 Flume's Transaction Mechanism

1. Flume's transaction mechanism

For example, the spooling directory source creates an event for every line of a file; only once all events in a transaction have been delivered to the channel and committed successfully does the source mark the file as completed.

Transactions handle the channel-to-sink leg in the same way: if an event cannot be recorded for some reason, the transaction is rolled back, and all of its events remain in the channel waiting to be redelivered.

2. Flume's at-least-once delivery

Overall, Flume's transaction mechanism guarantees that every event produced by a source is delivered to a sink. It is worth noting, however, that as a high-throughput parallel collection system Flume uses at-least-once delivery (traditional enterprise systems tend to aim for exactly-once), so every event produced by a source reaches the sink at least once; in other words, the same event may arrive more than once. This looks like a shortcoming, but it is an acceptable trade-off for moving events reliably from source through channel to sink. As with the spooldir usage above, Flume marks data that has already been processed.

3. Flume's batching mechanism

For efficiency, Flume processes events in transaction-sized batches rather than one by one. For example, the spooling directory source reads 100 lines of text as a batch (configured via the batchSize property, similar to batch mode in a database). Batching especially benefits the file channel, because a whole transaction then needs only a single write to local disk, or a single fsync call, which is much faster.
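A minimal sketch of how batching and a file channel might be combined (the property names are standard Flume ones; the directories and sizes here are illustrative assumptions, not taken from the examples above):

a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /opt/module/flume-1.8.0/upload
a1.sources.r1.batchSize = 100

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume-1.8.0/checkpoint
a1.channels.c1.dataDirs = /opt/module/flume-1.8.0/data

a1.sinks.k1.hdfs.batchSize = 100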

Stream processing semantics:

- At most once: each record is processed at most once, which implies that data may be lost (left unprocessed).

- At least once: each record is processed at least once. This is stronger than the previous guarantee in that no data is lost; the drawback is that a record may be processed more than once.

- Exactly once: each record is processed exactly once. No data is lost and nothing is processed twice. This is the strongest of the three semantics.
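Because Flume only guarantees at-least-once delivery, consumers that need effectively exactly-once behaviour usually deduplicate downstream, for example by the UUID header added by the UUID interceptor earlier. A minimal, hypothetical Java sketch (the class is illustrative and not part of Flume):

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Remembers the IDs of recently processed events and reports whether an ID is new.
public class SeenEventFilter {
    private final Set<String> seen;

    public SeenEventFilter(final int maxTracked) {
        // An insertion-ordered LinkedHashMap with removeEldestEntry acts as a bounded FIFO set.
        this.seen = Collections.newSetFromMap(new LinkedHashMap<String, Boolean>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxTracked;
            }
        });
    }

    // Returns true the first time an event ID is seen; duplicates return false and can be skipped.
    public synchronized boolean firstTime(String eventId) {
        return seen.add(eventId);
    }
}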
