4.1 Installation and Configuration
Check JAVA_HOME: echo $JAVA_HOME
Output: /opt/module/jdk1.8.0_144
Install Flume:
[itstar@bigdata113 software]$ tar -zxvf apache-flume-1.8.0-bin.tar.gz -C /opt/module/
Rename the configuration template (in the conf directory):
[itstar@bigdata113 conf]$ mv flume-env.sh.template flume-env.sh
Edit flume-env.sh and set:
export JAVA_HOME=/opt/module/jdk1.8.0_144
4.2 Examples
4.2.1 Example 1: Monitoring Port Data
Goal: Flume listens on a port on one console while another console sends messages to that port with telnet; the monitored console displays the messages in real time.
Steps:
1) Install the telnet tool (requires network access):
yum -y install telnet
2) Create the Flume agent configuration file flume-telnet.conf:
#定义Agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
#定义source
a1.sources.r1.type = netcat
a1.sources.r1.bind = bigdata113
a1.sources.r1.port = 44445
# 定义sink
a1.sinks.k1.type = logger
# 定义memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
3) Check whether port 44445 is already in use:
$ netstat -tunlp | grep 44445
4) Start the Flume agent with this configuration:
/opt/module/flume-1.8.0/bin/flume-ng agent \
--conf /opt/module/flume-1.8.0/conf/ \
--name a1 \
--conf-file /opt/module/flume-1.8.0/jobconf/flume-telnet.conf \
-Dflume.root.logger=INFO,console
flume-ng agent: the startup command
--conf: directory containing Flume's configuration files
--name: name of the agent
--conf-file: path to the agent's configuration file
-Dflume.root.logger=INFO,console: print the logs to the console
5) Use telnet to send data to port 44445 on this host:
$ telnet bigdata113 44445
4.2.2 Example 2: Reading a Local File into HDFS in Real Time
1) Create the flume-hdfs.conf file:
# 1 agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2
# 2 source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /opt/Andy
a2.sources.r2.shell = /bin/bash -c
# 3 sink
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path = hdfs://bigdata111:9000/flume/%Y%m%d/%H
#上传文件的前缀
a2.sinks.k2.hdfs.filePrefix = logs-
#是否按照时间滚动文件夹
a2.sinks.k2.hdfs.round = true
#多少时间单位创建一个新的文件夹
a2.sinks.k2.hdfs.roundValue = 1
#重新定义时间单位
a2.sinks.k2.hdfs.roundUnit = hour
#是否使用本地时间戳
a2.sinks.k2.hdfs.useLocalTimeStamp = true
#积攒多少个Event才flush到HDFS一次
a2.sinks.k2.hdfs.batchSize = 1000
#设置文件类型,可支持压缩
a2.sinks.k2.hdfs.fileType = DataStream
#多久生成一个新的文件
a2.sinks.k2.hdfs.rollInterval = 600
#设置每个文件的滚动大小
a2.sinks.k2.hdfs.rollSize = 134217700
#文件的滚动与Event数量无关
a2.sinks.k2.hdfs.rollCount = 0
#最小副本数
a2.sinks.k2.hdfs.minBlockReplicas = 1
# 定义 memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
2) Run the agent with this configuration:
/opt/module/flume-1.8.0/bin/flume-ng agent \
--conf /opt/module/flume-1.8.0/conf/ \
--name a2 \
--conf-file /opt/module/flume-1.8.0/jobconf/flume-hdfs.conf
4.2.3 Example 3: Reading Files from a Directory into HDFS in Real Time
Goal: use Flume to monitor all files in a directory.
Steps:
1) Create the configuration file flume-dir.conf:
#1 Agent
a3.sources = r3
a3.sinks = k3
a3.channels = c3
#2 source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/module/flume1.8.0/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
#忽略所有以.tmp结尾的文件,不上传
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)
# 3 sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://bigdata111:9000/flume/%H
#上传文件的前缀
a3.sinks.k3.hdfs.filePrefix = upload-
#是否按照时间滚动文件夹
a3.sinks.k3.hdfs.round = true
#多少时间单位创建一个新的文件夹
a3.sinks.k3.hdfs.roundValue = 1
#重新定义时间单位
a3.sinks.k3.hdfs.roundUnit = hour
#是否使用本地时间戳
a3.sinks.k3.hdfs.useLocalTimeStamp = true
#积攒多少个Event才flush到HDFS一次
a3.sinks.k3.hdfs.batchSize = 100
#设置文件类型,可支持压缩
a3.sinks.k3.hdfs.fileType = DataStream
#多久生成一个新的文件
a3.sinks.k3.hdfs.rollInterval = 600
#设置每个文件的滚动大小大概是128M
a3.sinks.k3.hdfs.rollSize = 134217700
#文件的滚动与Event数量无关
a3.sinks.k3.hdfs.rollCount = 0
#最小副本数
a3.sinks.k3.hdfs.minBlockReplicas = 1
# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
2) Run the test: start the agent with the command below, then add files to the upload directory and observe the result.
/opt/module/flume1.8.0/bin/flume-ng agent \
--conf /opt/module/flume1.8.0/conf/ \
--name a3 \
--conf-file /opt/module/flume1.8.0/jobconf/flume-dir.conf
Note: when using the Spooling Directory Source:
1) Do not create a file directly in the monitored directory and keep modifying it; write it elsewhere (or with an ignored suffix such as .tmp) and then move it into the directory.
2) Files that have been fully ingested are renamed with the .COMPLETED suffix.
3) The monitored directory is scanned for changes every 500 milliseconds.
4.2.4 Example 4: Flume-to-Flume Data Transfer: One Flume Agent with Multiple Channels and Sinks
Goal: flume-1 monitors changes to a file and forwards the new content to flume-2, which stores it in HDFS; at the same time flume-1 forwards the same content to flume-3, which writes it to the local file system.
Steps:
1) Create flume1.conf, which monitors a file and uses two channels and two sinks to deliver the data to flume-2 and flume-3 respectively:
# 1.agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# 将数据流复制给多个channel
a1.sources.r1.selector.type = replicating
# 2.source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.shell = /bin/bash -c
# 3.sink1
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = bigdata111
a1.sinks.k1.port = 4141
# sink2
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = bigdata111
a1.sinks.k2.port = 4142
# 4.channel—1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# 4.channel—2
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
2) Create flume2.conf, which receives events from flume-1 and uses one channel and one sink to deliver the data to HDFS:
# 1 agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# 2 source
a2.sources.r1.type = avro
a2.sources.r1.bind = bigdata111
a2.sources.r1.port = 4141
# 3 sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://bigdata111:9000/flume2/%H
#上传文件的前缀
a2.sinks.k1.hdfs.filePrefix = flume2-
#是否按照时间滚动文件夹
a2.sinks.k1.hdfs.round = true
#多少时间单位创建一个新的文件夹
a2.sinks.k1.hdfs.roundValue = 1
#重新定义时间单位
a2.sinks.k1.hdfs.roundUnit = hour
#是否使用本地时间戳
a2.sinks.k1.hdfs.useLocalTimeStamp = true
#积攒多少个Event才flush到HDFS一次
a2.sinks.k1.hdfs.batchSize = 100
#设置文件类型,可支持压缩
a2.sinks.k1.hdfs.fileType = DataStream
#多久生成一个新的文件
a2.sinks.k1.hdfs.rollInterval = 600
#设置每个文件的滚动大小大概是128M
a2.sinks.k1.hdfs.rollSize = 134217700
#文件的滚动与Event数量无关
a2.sinks.k1.hdfs.rollCount = 0
#最小副本数
a2.sinks.k1.hdfs.minBlockReplicas = 1
# 4 channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
#5 Bind
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
3) Create flume3.conf, which receives events from flume-1 and uses one channel and one sink to write the data to a local directory:
#1 agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1
# 2 source
a3.sources.r1.type = avro
a3.sources.r1.bind = bigdata111
a3.sources.r1.port = 4142
#3 sink
a3.sinks.k1.type = file_roll
#备注:此处的文件夹需要先创建好
a3.sinks.k1.sink.directory = /opt/flume3
# 4 channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
# 5 Bind
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
Note: the local output directory must already exist; the file_roll sink will not create it for you.
4) Run the test: start the agents (start flume-2 and flume-3 first so that their Avro sources are listening, then flume-1), then modify the monitored file and observe the results:
$ bin/flume-ng agent --conf conf/ --name a1 --conf-file jobconf/flume1.conf
$ bin/flume-ng agent --conf conf/ --name a2 --conf-file jobconf/flume2.conf
$ bin/flume-ng agent --conf conf/ --name a3 --conf-file jobconf/flume3.conf
4.2.5 Example 5: Flume-to-Flume Data Transfer: Multiple Flume Agents Aggregating into One
Goal: flume-11 monitors a local file and flume-22 monitors a network port; both send their data to flume-33, which writes the merged stream to HDFS.
Steps:
1) Create flume11.conf, which tails the local file /opt/Andy and sinks the data to flume-33:
# 1 agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# 2 source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.shell = /bin/bash -c
# 3 sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = bigdata111
a1.sinks.k1.port = 4141
# 4 channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# 5. Bind
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
2) Create flume22.conf, which listens on port 44444 and sinks the data to flume-33:
# 1 agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# 2 source
a2.sources.r1.type = netcat
a2.sources.r1.bind = bigdata111
a2.sources.r1.port = 44444
#3 sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = bigdata111
a2.sinks.k1.port = 4141
# 4 channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# 5 Bind
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
3) Create flume33.conf, which receives the streams from flume-11 and flume-22 and sinks the merged data to HDFS:
# 1 agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1
# 2 source
a3.sources.r1.type = avro
a3.sources.r1.bind = bigdata111
a3.sources.r1.port = 4141
# 3 sink
a3.sinks.k1.type = hdfs
a3.sinks.k1.hdfs.path = hdfs://bigdata111:9000/flume3/%H
#上传文件的前缀
a3.sinks.k1.hdfs.filePrefix = flume3-
#是否按照时间滚动文件夹
a3.sinks.k1.hdfs.round = true
#多少时间单位创建一个新的文件夹
a3.sinks.k1.hdfs.roundValue = 1
#重新定义时间单位
a3.sinks.k1.hdfs.roundUnit = hour
#是否使用本地时间戳
a3.sinks.k1.hdfs.useLocalTimeStamp = true
#积攒多少个Event才flush到HDFS一次
a3.sinks.k1.hdfs.batchSize = 100
#设置文件类型,可支持压缩
a3.sinks.k1.hdfs.fileType = DataStream
#多久生成一个新的文件
a3.sinks.k1.hdfs.rollInterval = 600
#设置每个文件的滚动大小大概是128M
a3.sinks.k1.hdfs.rollSize = 134217700
#文件的滚动与Event数量无关
a3.sinks.k1.hdfs.rollCount = 0
#最小冗余数
a3.sinks.k1.hdfs.minBlockReplicas = 1
# 4 channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
# 5 Bind
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
4) Run the test: start the agents in order (flume-33 first, then flume-22 and flume-11), then generate data and observe the results:
$ bin/flume-ng agent --conf conf/ --name a3 --conf-file jobconf/flume33.conf
$ bin/flume-ng agent --conf conf/ --name a2 --conf-file jobconf/flume22.conf
$ bin/flume-ng agent --conf conf/ --name a1 --conf-file jobconf/flume11.conf
Send data:
1) Open telnet bigdata111 44444 and send 5555555
2) Append 666666 to /opt/Andy
4.2.6 Example 6: Flume Interceptors
Timestamp interceptor
Timestamp.conf:
#1.定义agent名, source、channel、sink的名称
a4.sources = r1
a4.channels = c1
a4.sinks = k1
#2.具体定义source
a4.sources.r1.type = spooldir
a4.sources.r1.spoolDir = /opt/module/flume-1.8.0/upload
# define the interceptor, which adds a timestamp header to each event
a4.sources.r1.interceptors = i1
a4.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
#具体定义channel
a4.channels.c1.type = memory
a4.channels.c1.capacity = 10000
a4.channels.c1.transactionCapacity = 100
#具体定义sink
a4.sinks.k1.type = hdfs
a4.sinks.k1.hdfs.path = hdfs://bigdata111:9000/flume-interceptors/%H
a4.sinks.k1.hdfs.filePrefix = events-
a4.sinks.k1.hdfs.fileType = DataStream
#不按照条数生成文件
a4.sinks.k1.hdfs.rollCount = 0
#HDFS上的文件达到128M时生成一个文件
a4.sinks.k1.hdfs.rollSize = 134217728
#HDFS上的文件达到60秒生成一个文件
a4.sinks.k1.hdfs.rollInterval = 60
#组装source、channel、sink
a4.sources.r1.channels = c1
a4.sinks.k1.channel = c1
Start command:
/opt/module/flume-1.8.0/bin/flume-ng agent -n a4 \
-f /opt/module/flume-1.8.0/jobconf/Timestamp.conf \
-c /opt/module/flume-1.8.0/conf \
-Dflume.root.logger=INFO,console
Host interceptor
Host.conf:
#1.定义agent
a1.sources= r1
a1.sinks = k1
a1.channels = c1
#2.定义source
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /opt/Andy
#拦截器
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
# if useIP is true the header value is the IP address (e.g. 192.168.1.111); if false it is the host name; the default is true
a1.sources.r1.interceptors.i1.useIP = false
a1.sources.r1.interceptors.i1.hostHeader = agentHost
#3.定义sinks
a1.sinks.k1.type=hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://bigdata111:9000/flumehost/%H
a1.sinks.k1.hdfs.filePrefix = Andy_%{agentHost}
#往生成的文件加后缀名.log
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start command:
bin/flume-ng agent -c conf/ -f jobconf/Host.conf -n a1 -Dflume.root.logger=INFO,console
UUID interceptor
uuid.conf:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.interceptors = i1
#type的参数不能写成uuid,得写具体,否则找不到类
a1.sources.r1.interceptors.i1.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
#如果UUID头已经存在,它应该保存
a1.sources.r1.interceptors.i1.preserveExisting = true
a1.sources.r1.interceptors.i1.prefix = UUID_
#如果sink类型改为HDFS,那么在HDFS的文本中没有headers的信息数据
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# bin/flume-ng agent -c conf/ -f jobconf/uuid.conf -n a1 -Dflume.root.logger=INFO,console
Search-and-replace interceptor
search.conf:
#1 agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
#2 source
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = search_replace
#遇到数字改成itstar,A123会替换为Aitstar
a1.sources.r1.interceptors.i1.searchPattern = [0-9]+
a1.sources.r1.interceptors.i1.replaceString = itstar
a1.sources.r1.interceptors.i1.charset = UTF-8
#3 sink
a1.sinks.k1.type = logger
#4 Chanel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#5 bind
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# bin/flume-ng agent -c conf/ -f jobconf/search.conf -n a1 -Dflume.root.logger=INFO,console
Regex filtering interceptor
filter.conf:
#1 agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
#2 source
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
a1.sources.r1.interceptors.i1.regex = ^A.*
#如果excludeEvents设为false,表示过滤掉不是以A开头的events。如果excludeEvents设为true,则表示过滤掉以A开头的events。
a1.sources.r1.interceptors.i1.excludeEvents = true
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# bin/flume-ng agent -c conf/ -f jobconf/filter.conf -n a1 -Dflume.root.logger=INFO,console
Regex extractor interceptor
extractor.conf:
#1 agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
#2 source
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
a1.sources.r1.interceptors.i1.regex = hostname is (.*?) ip is (.*)
a1.sources.r1.interceptors.i1.serializers = s1 s2
a1.sources.r1.interceptors.i1.serializers.s1.name = hostname
a1.sources.r1.interceptors.i1.serializers.s2.name = ip
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# bin/flume-ng agent -c conf/ -f jobconf/extractor.conf -n a1 -Dflume.root.logger=INFO,console
Note: the headers created by the regex extractor interceptor do not appear in the HDFS file name or file content; they are only attached to the events as headers.
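For example, assuming the tailed file receives a line like the one below (an illustrative input, not from the original notes), the interceptor adds two headers while leaving the event body unchanged:
Input line: hostname is bigdata111 ip is 192.168.1.111
Resulting event headers: {hostname=bigdata111, ip=192.168.1.111}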
4.2.7 Example 7: Custom Flume Interceptor
Goal: convert lowercase letters in the event body to uppercase.
1. pom.xml: the project needs the Flume core dependency (org.apache.flume:flume-ng-core, in a version matching your Flume installation, e.g. 1.8.0) so that the interceptor API is available at compile time.
2. Implement the custom interceptor:
package ToUpCase;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import java.util.ArrayList;
import java.util.List;
public class MyInterceptor implements Interceptor {
@Override
public void initialize() {
}
@Override
public void close() {
}
/**
* 拦截source发送到通道channel中的消息
*
* @param event 接收过滤的event
* @return event 根据业务处理后的event
*/
@Override
public Event intercept(Event event) {
// 获取事件对象中的字节数据
byte[] arr = event.getBody();
// 将获取的数据转换成大写
event.setBody(new String(arr).toUpperCase().getBytes());
// 返回到消息中
return event;
}
// 接收被过滤事件集合
@Override
public List<Event> intercept(List<Event> events) {
List<Event> list = new ArrayList<>(events.size());
for (Event event : events) {
list.add(intercept(event));
}
return list;
}
public static class Builder implements Interceptor.Builder {
// 获取配置文件的属性
@Override
public Interceptor build() {
return new MyInterceptor();
}
@Override
public void configure(Context context) {
}
}
}
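Before packaging, the interceptor can be sanity-checked with a small standalone main method. This is a local test sketch: the class name MyInterceptorTest and the sample body text are illustrative, and it assumes the interceptor sits in the ToUpCase package referenced by the configuration below.
package ToUpCase;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class MyInterceptorTest {
    public static void main(String[] args) {
        MyInterceptor interceptor = new MyInterceptor();
        interceptor.initialize();
        // Build an event with a lowercase body and run it through the interceptor
        Event event = EventBuilder.withBody("hello itstar".getBytes());
        Event result = interceptor.intercept(event);
        // Expected output: HELLO ITSTAR
        System.out.println(new String(result.getBody()));
        interceptor.close();
    }
}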
Package the class into a jar with Maven, create a jar directory under the Flume home directory (mkdir jar), and upload the jar there.
3. Flume configuration file:
ToUpCase.conf
#1.agent
a1.sources = r1
a1.sinks =k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/Andy
a1.sources.r1.interceptors = i1
#全类名$Builder
a1.sources.r1.interceptors.i1.type = ToUpCase.MyInterceptor$Builder
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /ToUpCase1
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k1.hdfs.rollSize = 20
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
#生成的文件类型,默认是 Sequencefile,可用 DataStream,则为普通文本
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Run command:
bin/flume-ng agent -c conf/ -n a1 -f jar/ToUpCase.conf -C jar/Flume-1.0-SNAPSHOT.jar -Dflume.root.logger=DEBUG,console
4.2.8 Example 8: Custom Flume Source
1. Code: a custom source that records the read offset so that it can resume from where it left off after a restart:
package customSource;

import org.apache.commons.io.FileUtils;
import org.apache.flume.Context;
import org.apache.flume.EventDrivenSource;
import org.apache.flume.channel.ChannelProcessor;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;
import org.apache.flume.source.ExecSource;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
/**
*
* Custom source that records the read offset so tailing can resume after a restart.
* Flume lifecycle: constructor first, then configure() -> start() -> channelProcessor.processEvent().
* Read from the job configuration: which file to tail, the charset to use, where to persist the offset, and how often to check the file for new content.
*
*/
public class TailFileSource extends AbstractSource implements EventDrivenSource, Configurable {
private static final Logger logger = LoggerFactory.getLogger(ExecSource.class);
private String filePath;
private String charset;
private String positionFile;
private long interval;
private ExecutorService executor;
private FileRunnable fileRunnable;
/**
* 读取配置文件(flume在执行一次job时定义的配置文件)
* (如果在flume的job的配置文件中不修改,就是用这些默认的配置)
*
* @param context
*/
@Override
public void configure(Context context) {
//读取哪个文件
filePath = context.getString("filePath");
//默认使用utf-8
charset = context.getString("charset", "UTF-8");
//把偏移量写到哪
positionFile = context.getString("positionFile");
//by default, check for new content every second (1000 ms)
interval = context.getLong("interval", 1000L);
}
/**
* 创建一个线程来监听一个文件
*/
@Override
public synchronized void start() {
//创建一个单线程的线程池
executor = Executors.newSingleThreadExecutor();
//获取一个ChannelProcessor
final ChannelProcessor channelProcessor = getChannelProcessor();
fileRunnable = new FileRunnable(filePath, charset, positionFile, interval, channelProcessor);
//提交到线程池中
executor.submit(fileRunnable);
//调用父类的方法
super.start();
}
@Override
public synchronized void stop() {
//停止
fileRunnable.setFlag(false);
//停止线程池
executor.shutdown();
while (!executor.isTerminated()) {
logger.debug("Waiting for filer exec executor service to stop");
try {
//wait up to 500 ms, then check again
executor.awaitTermination(500, TimeUnit.MILLISECONDS);
} catch (InterruptedException e) {
logger.debug("InterruptedException while waiting for exec executor service" +
" to stop. Just exiting.");
e.printStackTrace();
}
}
super.stop();
}
private static class FileRunnable implements Runnable {
private String charset;
private long interval;
private long offset = 0L;
private ChannelProcessor channelProcessor;
private RandomAccessFile raf;
private boolean flag = true;
private File posFile;
/*
先于run方法执行,构造器只执行一次
先看看有没有偏移量,如果有就接着读,如果没有就从头开始读
*/
public FileRunnable(String filePath, String charset, String positionFile, long interval, ChannelProcessor channelProcessor) {
this.charset = charset;
this.interval = interval;
this.channelProcessor = channelProcessor;
//read the saved offset from the position file
posFile = new File(positionFile);
if (!posFile.exists()) {
//如果不存在就创建一个文件
try {
posFile.createNewFile();
} catch (IOException e) {
e.printStackTrace();
logger.error("创建保存偏移量的文件失败:", e);
}
}
try {
//读取文件的偏移量
String offsetString = FileUtils.readFileToString(posFile);
//以前读取过
if (offsetString != null && !offsetString.trim().isEmpty()) {
//convert the saved offset string to a long
offset = Long.parseLong(offsetString);
}
//按照指定的偏移量读取数据
raf = new RandomAccessFile(filePath, "r");
//按照指定的偏移量读取
raf.seek(offset);
} catch (IOException e) {
logger.error("读取保存偏移量文件时发生错误", e);
e.printStackTrace();
}
}
@Override
public void run() {
while (flag) {
//读取文件中的新数据
try {
String line = raf.readLine();
if (line != null) {
//有数据进行处理,避免出现乱码
line = new String(line.getBytes("iso8859-1"), charset);
channelProcessor.processEvent(EventBuilder.withBody(line.getBytes()));
//获取偏移量,更新偏移量
offset = raf.getFilePointer();
//将偏移量写入到位置文件中
FileUtils.writeStringToFile(posFile, offset + "");
} else {
//没读到睡一会儿
Thread.sleep(interval);
}
//each loop iteration: deliver to the channel, update the offset, then wait for the configured interval
} catch (InterruptedException e) {
e.printStackTrace();
logger.error("read file thread interrupted", e);
} catch (IOException e) {
logger.error("read log file error", e);
}
}
}
public void setFlag(boolean flag) {
this.flag = flag;
}
}
}
2. Configuration file:
#定义agent名, source、channel、sink的名称
a1.sources = r1
a1.channels = c1
a1.sinks = k1
#具体定义source,这里的type是自定义的source的类的全路径
a1.sources.r1.type = customSource.TailFileSource
# the property names here must match the names read in the custom source's configure() method
#读取哪个文件
a1.sources.r1.filePath = /opt/Andy
#偏移量保存的文件
a1.sources.r1.positionFile = /opt/Cndy
#时间间隔,每隔多久读取一次
a1.sources.r1.interval = 2000
#编码
a1.sources.r1.charset = UTF-8
#具体定义channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#具体定义sink
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /opt/Bndy
#组装source、channel、sink
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Start command:
bin/flume-ng agent -n a1 -f jar/ConsumSource.conf -c conf/ -C jar/ConsumSource.jar -Dflume.root.logger=INFO,console
4.2.9 Example 9: Flume to Kafka
Configure Flume (/opt/jars/flume2kafka.conf):
# define
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F -c +0 /opt/jars/calllog.csv
a1.sources.r1.shell = /bin/bash -c
# sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.brokerList = bigdata111:9092,bigdata112:9092,bigdata113:9092
a1.sinks.k1.topic = calllog
a1.sinks.k1.batchSize = 20
a1.sinks.k1.requiredAcks = 1
# channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# bind
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
From the Flume home directory, start the agent:
/opt/module/flume-1.8.0/bin/flume-ng agent --conf /opt/module/flume-1.8.0/conf/ --name a1 --conf-file /opt/jars/flume2kafka.conf
4.2.10 Example 10: Kafka to Flume
kafka2flume.conf:
agent.sources = kafkaSource
agent.channels = memoryChannel
agent.sinks = hdfsSink
# The channel can be defined as follows.
agent.sources.kafkaSource.channels = memoryChannel
agent.sources.kafkaSource.type=org.apache.flume.source.kafka.KafkaSource
agent.sources.kafkaSource.zookeeperConnect=bigdata111:2181,bigdata112:2181,bigdata113:2181
agent.sources.kafkaSource.topic=calllog
#agent.sources.kafkaSource.groupId=flume
agent.sources.kafkaSource.kafka.consumer.timeout.ms=100
agent.channels.memoryChannel.type=memory
agent.channels.memoryChannel.capacity=10000
agent.channels.memoryChannel.transactionCapacity=1000
# the sink of hdfs
agent.sinks.hdfsSink.type=hdfs
agent.sinks.hdfsSink.channel = memoryChannel
agent.sinks.hdfsSink.hdfs.path=hdfs://bigdata111:9000/kafka2flume
agent.sinks.hdfsSink.hdfs.writeFormat=Text
agent.sinks.hdfsSink.hdfs.fileType=DataStream
#这两个不配置,会产生大量的小文件
agent.sinks.hdfsSink.hdfs.rollSize=0
agent.sinks.hdfsSink.hdfs.rollCount=0
Start command:
bin/flume-ng agent --conf conf --conf-file jobconf/kafka2flume.conf --name agent -Dflume.root.logger=INFO,console
Note: this configuration pulls data from Kafka, but new data must be written to the Kafka topic before anything is delivered to HDFS.
4.3 Flume Transaction Mechanism
1. Flume's transaction mechanism
Flume wraps each hop an event makes in a transaction. For example, the spooling directory source creates one event per line of a file; only after every event in the transaction has been delivered to the channel and the transaction has committed does the source mark the file as completed.
Likewise, a transaction governs delivery from the channel to the sink: if the events cannot be written for any reason, the transaction is rolled back and all of its events remain in the channel, waiting to be redelivered.
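The same put/commit/rollback contract is exposed programmatically through Flume's Channel and Transaction interfaces. The following minimal, self-contained sketch (assuming flume-ng-core 1.8.0 on the classpath; the class name TransactionDemo and the sample body text are illustrative) shows a source-side put and a sink-side take, each wrapped in its own transaction:
import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.Transaction;
import org.apache.flume.channel.MemoryChannel;
import org.apache.flume.conf.Configurables;
import org.apache.flume.event.EventBuilder;

public class TransactionDemo {
    public static void main(String[] args) {
        // An in-memory channel configured with default capacity settings
        Channel channel = new MemoryChannel();
        Configurables.configure(channel, new Context());
        channel.start();

        // Source side: put events inside a transaction and commit
        Transaction putTx = channel.getTransaction();
        putTx.begin();
        try {
            channel.put(EventBuilder.withBody("hello flume".getBytes()));
            putTx.commit();      // events become visible to the sink side
        } catch (Exception e) {
            putTx.rollback();    // nothing is left half-written
        } finally {
            putTx.close();
        }

        // Sink side: take events inside another transaction
        Transaction takeTx = channel.getTransaction();
        takeTx.begin();
        try {
            Event event = channel.take();
            System.out.println(new String(event.getBody()));
            takeTx.commit();     // events are removed from the channel
        } catch (Exception e) {
            takeTx.rollback();   // events stay in the channel for redelivery
        } finally {
            takeTx.close();
        }
        channel.stop();
    }
}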
2. Flume's at-least-once delivery
Taken together, Flume's transaction mechanism guarantees that every event produced by a source is delivered to a sink. Note, however, that as a high-throughput parallel collection system Flume provides at-least-once semantics (traditional enterprise systems often aim for exactly-once): every event produced by a source reaches the sink at least once, which means the same event may arrive more than once. This looks like a weakness, but it is an acceptable trade-off for moving events reliably from source through channel to sink. As in the spooldir example above, Flume marks data that has already been fully processed.
3. Flume's batching mechanism
For efficiency, Flume processes events in transaction-sized batches rather than one at a time whenever possible. The spooling directory source mentioned above, for example, reads 100 lines of text as one batch (configurable through the batchSize property, similar to batch mode in a database). Batching is especially beneficial for the file channel: a whole transaction then needs only one write to local disk, or one fsync call, which is much faster.
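In a custom source such as the TailFileSource shown earlier, the same idea can be applied by accumulating several events and handing them to the channel processor in one call, so the whole group is committed in a single channel transaction. A minimal sketch follows; the helper class name and the caller-supplied list of lines are illustrative assumptions.
import java.util.ArrayList;
import java.util.List;
import org.apache.flume.Event;
import org.apache.flume.channel.ChannelProcessor;
import org.apache.flume.event.EventBuilder;

public class BatchDeliveryHelper {
    // Turn a group of text lines into events and deliver them as one batch;
    // processEventBatch wraps the whole batch in a single channel transaction.
    public static void deliverBatch(ChannelProcessor processor, List<String> lines) {
        List<Event> batch = new ArrayList<>(lines.size());
        for (String line : lines) {
            batch.add(EventBuilder.withBody(line.getBytes()));
        }
        if (!batch.isEmpty()) {
            processor.processEventBatch(batch);
        }
    }
}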
Stream processing semantics
- At most once: each record is processed at most once, which implies that data may be lost (left unprocessed).
- At least once: each record is processed at least once. This is stronger than the previous guarantee in that no data is lost; the drawback is that a record may be processed more than once.
- Exactly once: each record is processed exactly once, with no data loss and no duplicate processing. This is the strongest of the three guarantees.