Table of Contents
1. Flume Overview
1.1 What Is Flume
1.2 Flume Basic Architecture
1.2.1 Agent
1.2.2 Source
1.2.3 Sink
1.2.4 Channel
1.2.5 Event
2. Getting Started with Flume
2.1 Official Example: Monitoring Port Data
2.1.1 Configure the flume-netcat-logger.conf file
2.2 Monitoring a Single Appended File in Real Time
2.2.1 Configure the flume-file-hdfs.conf file (exec cannot resume from a checkpoint)
2.3 Monitoring Multiple New Files in a Directory in Real Time
2.3.1 Configure the flume-dir-hdfs.conf file (spooldir cannot monitor files that are still being modified)
2.4 Monitoring Multiple Appended Files in a Directory in Real Time (Taildir supports both checkpoint resume and dynamic monitoring)
2.4.1 Configure the flume-taildir-hdfs.conf file
3. Advanced Flume
3.1 Flume Transactions
3.2 How a Flume Agent Works Internally
3.3 Flume Topologies
3.3.1 Simple Chaining
3.3.2 Replicating and Multiplexing
3.3.3 Load Balancing and Failover
3.3.4 Aggregation
3.4 Flume Enterprise Development Cases
3.4.1 Replicating and Multiplexing
3.4.2 Load Balancing and Failover
3.4.3 Aggregation
3.5 Custom Interceptor
3.6 Flume Data Flow Monitoring
3.6.1 Installing and Deploying Ganglia
3.6.2 Running Flume to Test Monitoring
4. Real-World Interview Questions
4.1 How do you monitor Flume data transfer?
4.2 What are the roles of Flume's Source, Sink, and Channel? What Source type do you use?
4.3 Flume Channel Selectors
4.4 Flume Parameter Tuning
4.5 Flume's Transaction Mechanism
4.6 Can Flume lose data during collection?
Flume is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating, and transporting massive amounts of log data. Flume supports plugging custom data senders into a logging system to collect data; it can also perform simple processing on the data and write it to a variety of (customizable) data receivers. Flume is based on a streaming architecture and is flexible and simple.
Demo:
1) First, start the Flume agent listening on the port
[hsw@hadoop102 flume]$ bin/flume-ng agent -n a1 -c conf/ -f job/flume-netcat-logger.conf -Dflume.root.logger=INFO,console
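The job file referenced by this command is not reproduced in this section; a minimal sketch of what flume-netcat-logger.conf typically contains, following the standard netcat-source-to-logger-sink example (component names a1/r1/k1/c1 match the command above):
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Source: listen on a netcat port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Sink: log events to the console
a1.sinks.k1.type = logger
# Channel: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1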
2) Use the netcat tool to send data to port 44444 on this machine
[hsw@hadoop102 ~]$ nc localhost 44444
hello
OK
java
OK
3) Observe the received data in the Flume agent's console output
1) Run Flume
[hsw@hadoop102 flume]$ bin/flume-ng agent -n a2 -c conf/ -f job/flume-file-hdfs.conf
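The flume-file-hdfs.conf referenced above is not reproduced in this section; a minimal sketch of its source side, assuming the usual setup of tailing the Hive log with an exec source (component names are illustrative, and the HDFS sink and memory channel would mirror the a2 configuration shown in section 3.4.1):
# Agent a2: exec source tailing the Hive log (exec cannot resume from a checkpoint)
a2.sources = r1
a2.sinks = k1
a2.channels = c1
a2.sources.r1.type = exec
a2.sources.r1.command = tail -F /opt/module/hive/logs/hive.log
a2.sources.r1.shell = /bin/bash -c
# ... hdfs sink and memory channel settings as in the a2 configuration of section 3.4.1 ...
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1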
2) Start Hadoop and Hive, and run Hive operations so that log entries are produced
[hsw@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh
[hsw@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh
[hsw@hadoop102 hive]$ bin/hive
hive (default)>
3) View the files on HDFS.
1) Start the directory-monitoring agent
[hsw@hadoop102 flume]$ bin/flume-ng agent -n a3 -c conf/ -f job/flume-dir-hdfs.conf
2) Add files to the upload folder
Create the upload folder under /opt/module/flume
[hsw@hadoop102 flume]$ mkdir upload
Add files to the upload folder:
[hsw@hadoop102 upload]$ touch 1.txt
[hsw@hadoop102 upload]$ touch 1.tmp
[hsw@hadoop102 upload]$ touch 1.log
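Whether 1.tmp gets uploaded depends on how the spooldir source filters files; a minimal sketch of the source side of flume-dir-hdfs.conf, assuming the commonly used suffix and ignore settings (values are illustrative):
a3.sources = r1
a3.sources.r1.type = spooldir
# Directory being watched for new files
a3.sources.r1.spoolDir = /opt/module/flume/upload
# Suffix appended to files once they have been fully ingested
a3.sources.r1.fileSuffix = .COMPLETED
a3.sources.r1.fileHeader = true
# Ignore files ending in .tmp, so 1.tmp above is never uploaded
a3.sources.r1.ignorePattern = ([^ ]*\.tmp)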
3) View the data on HDFS
1) Start the file-monitoring agent
[hsw@hadoop102 flume]$ bin/flume-ng agent -n a3 -c conf/ -f job/flume-taildir-hdfs.conf
2) Append content to files in the files folder
Create the files folder under /opt/module/flume
[hsw@hadoop102 flume]$ mkdir files
[hsw@hadoop102 files]$ echo hello >> file1.txt
[hsw@hadoop102 files]$ echo hello world >> file2.txt
{"inode":55420315,"pos":18,"file":"/opt/module/flume/files/file1.txt"}
{"inode":401283,"pos":6,"file":"/opt/module/flume/files2/log.txt"}
Here, inode uniquely identifies the file on Linux, pos is the offset that has already been read, and file is the absolute path; these records are what allow the Taildir source to resume from a checkpoint.
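For reference, a minimal sketch of the Taildir source side of flume-taildir-hdfs.conf, assuming file groups that match the paths recorded in the position file above (component names and paths are illustrative):
a3.sources = r1
a3.sources.r1.type = TAILDIR
# Position file that stores the inode/pos/file records shown above
a3.sources.r1.positionFile = /opt/module/flume/tail_dir.json
a3.sources.r1.filegroups = f1 f2
a3.sources.r1.filegroups.f1 = /opt/module/flume/files/.*file.*
a3.sources.r1.filegroups.f2 = /opt/module/flume/files2/.*log.*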
[hsw@hadoop102 job]$ cd group1/
[hsw@hadoop102 datas]$ mkdir flume3
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Replicate the data flow to all channels
a1.sources.r1.selector.type = replicating
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/hive/logs/hive.log
a1.sources.r1.shell = /bin/bash -c
# Describe the sink
# The avro sink side acts as a data sender
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
# The avro source side acts as a data-receiving service
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141
# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop102:8020/flume2/%Y%m%d/%H
# Prefix for uploaded files
a2.sinks.k1.hdfs.filePrefix = flume2-
# Whether to roll folders based on time
a2.sinks.k1.hdfs.round = true
# How many time units before a new folder is created
a2.sinks.k1.hdfs.roundValue = 1
# The time unit used for rounding
a2.sinks.k1.hdfs.roundUnit = hour
# Whether to use the local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# How many events to accumulate before flushing to HDFS
a2.sinks.k1.hdfs.batchSize = 100
# File type; compression is supported
a2.sinks.k1.hdfs.fileType = DataStream
# How often (in seconds) to roll a new file
a2.sinks.k1.hdfs.rollInterval = 30
# Roll size per file, roughly 128 MB
a2.sinks.k1.hdfs.rollSize = 134217700
# File rolling is independent of the number of events
a2.sinks.k1.hdfs.rollCount = 0
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142
# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/module/data/flume3
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2
(3) Run the configuration files
Start the corresponding Flume processes in turn: flume-flume-dir, flume-flume-hdfs, flume-file-flume
[hsw@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group1/flume-flume-dir.conf
[hsw@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group1/flume-flume-hdfs.conf
[hsw@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group1/flume-file-flume.conf
(4) Start Hadoop and Hive
[hsw@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh
[hsw@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh
[hsw@hadoop102 hive]$ bin/hive
hive (default)>
3) Implementation steps
[hsw@hadoop102 job]$ cd group2/
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
For load balancing, only the sink processor settings above need to change; see the sketch below.
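A minimal sketch of the sink processor properties that would replace the failover settings above, assuming the standard load_balance sink processor (backoff and the random selector are optional choices):
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = random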
Configure the flume-flume-console1.conf file: a Source that receives the output of the upstream Flume, with output going to the local console.
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141
# Describe the sink
a2.sinks.k1.type = logger
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
Configure the flume-flume-console2.conf file: a Source that receives the output of the upstream Flume, with output going to the local console.
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142
# Describe the sink
a3.sinks.k1.type = logger
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2
[hsw@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group2/flume-flume-console2.conf -Dflume.root.logger=INFO,console
[hsw@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group2/flume-flume-console1.conf -Dflume.root.logger=INFO,console
[hsw@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group2/flume-netcat-flume.conf
Use the netcat tool to send data to port 44444 on this machine and watch which downstream console (flume-flume-console1 or flume-flume-console2) prints it:
$ nc localhost 44444
3) Implementation steps
[hsw@hadoop102 module]$ xsync flume
[hsw@hadoop102 job]$ mkdir group3
[hsw@hadoop103 job]$ mkdir group3
[hsw@hadoop104 job]$ mkdir group3
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/group.log
a1.sources.r1.shell = /bin/bash -c
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop104
a1.sinks.k1.port = 4141
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Configure the flume2-netcat-flume.conf file: a Source that monitors the data stream on port 44444, and a Sink that sends the data to the next-level Flume.
# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = netcat
a2.sources.r1.bind = hadoop103
a2.sources.r1.port = 44444
# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = hadoop104
a2.sinks.k1.port = 4141
# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
Configure the flume3-flume-logger.conf file: a source that receives the data streams sent by flume1 and flume2, merging them and sinking the result to the console.
# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop104
a3.sources.r1.port = 4141
# Describe the sink
a3.sinks.k1.type = logger
# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1
(3) Run the configuration files
[hsw@hadoop104 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group3/flume3-flume-logger.conf -Dflume.root.logger=INFO,console
[hsw@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group3/flume1-logger-flume.conf
[hsw@hadoop103 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group3/flume2-netcat-flume.conf
(4) Append content to group.log under /opt/module on hadoop102
[hsw@hadoop102 module]$ echo 'hello' >> group.log
(5) Send data to port 44444 on hadoop103
[hsw@hadoop102 flume]$ telnet hadoop103 44444
Add the flume-ng-core dependency to the project's pom.xml:
<dependency>
    <groupId>org.apache.flume</groupId>
    <artifactId>flume-ng-core</artifactId>
    <version>1.9.0</version>
</dependency>
package com.atguigu.interceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TypeInterceptor implements Interceptor {

    // Collection holding the events processed by the interceptor
    private List<Event> addHeaderEvents = new ArrayList<>();

    @Override
    public void initialize() {
    }

    // Single-event processing method
    @Override
    public Event intercept(Event event) {
        // 1. Get the header and body
        Map<String, String> headers = event.getHeaders();
        String body = new String(event.getBody());
        // 2. Add a different header depending on whether the body contains "atguigu"
        if (body.contains("atguigu")) {
            headers.put("type", "atguigu");
        } else {
            headers.put("type", "other");
        }
        return event;
    }

    // Batch-event processing method
    @Override
    public List<Event> intercept(List<Event> events) {
        // 1. Clear the collection
        addHeaderEvents.clear();
        // 2. Iterate over the events
        for (Event event : events) {
            addHeaderEvents.add(intercept(event));
        }
        // 3. Return the processed events
        return addHeaderEvents;
    }

    @Override
    public void close() {
    }

    public static class Builder implements Interceptor.Builder {

        @Override
        public Interceptor build() {
            return new TypeInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
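With the class written, package it and place the jar on Flume's classpath before wiring it into the configuration below; a sketch of that step, assuming a Maven project (the project directory and jar name shown are illustrative):
[hsw@hadoop102 interceptor]$ mvn clean package
[hsw@hadoop102 interceptor]$ cp target/flume-interceptor-1.0-SNAPSHOT.jar /opt/module/flume/lib/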
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.atguigu.interceptor.TypeInterceptor$Builder
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = type
a1.sources.r1.selector.mapping.atguigu = c1
a1.sources.r1.selector.mapping.other = c2
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4141
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 4242
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Use a channel which buffers events in memory
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
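# Agent on hadoop103: the avro source on port 4141 receives the events tagged type=atguigu and logs them to the console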
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop103
a1.sources.r1.port = 4141
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1
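# Agent on hadoop104: the avro source on port 4242 receives the events tagged type=other and logs them to the console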
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop104
a1.sources.r1.port = 4242
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1
[hsw@hadoop102 flume]$ sudo yum -y install epel-release
[hsw@hadoop102 flume]$ sudo yum -y install ganglia-gmetad
[hsw@hadoop102 flume]$ sudo yum -y install ganglia-web
[hsw@hadoop102 flume]$ sudo yum -y install ganglia-gmond
[hsw@hadoop102 flume]$ sudo vim /etc/httpd/conf.d/ganglia.conf
3) On hadoop102, modify the configuration file /etc/ganglia/gmetad.conf
[hsw@hadoop102 flume]$ sudo vim /etc/ganglia/gmetad.conf
4) On hadoop102, modify the configuration file /etc/ganglia/gmond.conf
[hsw@hadoop102 flume]$ sudo vim /etc/ganglia/gmond.conf
cluster {
name = "my cluster" #改这里
owner = "unspecified"
latlong = "unspecified"
url = "unspecified"
}
udp_send_channel {
#bind_hostname = yes # Highly recommended, soon to be default.
# This option tells gmond to use a source address
# that resolves to the machine's hostname. Without
# this, the metrics may appear to come from any
# interface and the DNS names associated with
# those IPs will be used to create the RRDs.
# mcast_join = 239.2.11.71
# Send data to hadoop102
host = hadoop102 # change this line
port = 8649
ttl = 1
}
udp_recv_channel {
# mcast_join = 239.2.11.71
port = 8649
# Accept data from any connection
bind = 0.0.0.0 # change this line
retry_bind = true
# Size of the UDP buffer. If you are handling lots of metrics you really
# should bump it up to e.g. 10MB or even higher.
# buffer = 10485760
}
[hsw@hadoop102 flume]$ sudo vim /etc/selinux/config
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
# enforcing - SELinux security policy is enforced.
# permissive - SELinux prints warnings instead of enforcing.
# disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of these two values:
# targeted - Targeted processes are protected,
# mls - Multi Level Security protection.
SELINUXTYPE=targeted
Note that a change to /etc/selinux/config only takes effect after a reboot; to apply it temporarily without rebooting:
[hsw@hadoop102 flume]$ sudo setenforce 0
[hsw@hadoop102 flume]$ sudo systemctl start gmond
[hsw@hadoop102 flume]$ sudo systemctl start httpd
[hsw@hadoop102 flume]$ sudo systemctl start gmetad
[hsw@hadoop102 flume]$ sudo chmod -R 777 /var/lib/ganglia
[hsw@hadoop102 flume]$ bin/flume-ng agent \
-c conf/ \
-n a1 \
-f job/flume-netcat-logger.conf \
-Dflume.root.logger=INFO,console \
-Dflume.monitoring.type=ganglia \
-Dflume.monitoring.hosts=hadoop102:8649
[hsw@hadoop102 flume]$ nc localhost 44444
Use the third-party framework Ganglia to monitor Flume in real time.
1) Roles
(1) The Source component is dedicated to collecting data and can handle log data of various types and formats, including avro, thrift, exec, jms, spooling directory, netcat, sequence generator, syslog, http, and legacy.
(2) The Channel component buffers the collected data, which can be stored in memory or in files.
(3) The Sink component sends data to its destination; destinations include HDFS, logger, avro, thrift, ipc, file, HBase, Solr, and custom sinks.
2) The Source types our company uses are:
(1) Monitoring backend logs: exec
(2) Monitoring the port where the backend produces logs: netcat