Common Flume Use Cases in Practice

Case 1: Collecting Nginx access logs to HDFS in real time

Nginx acts as the log server. An exec source tails the Nginx log file, a memory channel is used as the transport channel, and an HDFS sink stores the data on HDFS.

source: exec (tail -F)

channel: MemoryChannel

sink: HDFS

1. Configure a1.conf

agent.sources = r1
agent.sinks = k1
agent.channels = c1


## common
agent.sources.r1.channels = c1
agent.sinks.k1.channel = c1


## sources config
agent.sources.r1.type = exec
agent.sources.r1.command = tail -F /home/hadoop/access.log


## channels config
agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000
agent.channels.c1.transactionCapacity = 1000
agent.channels.c1.byteCapacityBufferPercentage = 20
agent.channels.c1.byteCapacity = 1000000
agent.channels.c1.keep-alive = 60



## sinks config
agent.sinks.k1.type = hdfs
agent.sinks.k1.channel = c1
agent.sinks.k1.hdfs.path = hdfs://hadoop.senior02:8020/logs/%m/%d
agent.sinks.k1.hdfs.fileType = DataStream
agent.sinks.k1.hdfs.filePrefix = BF-%H
agent.sinks.k1.hdfs.fileSuffix=.log
agent.sinks.k1.hdfs.minBlockReplicas=1
agent.sinks.k1.hdfs.rollInterval=3600
agent.sinks.k1.hdfs.rollSize=132692539
agent.sinks.k1.hdfs.idleTimeout=10
agent.sinks.k1.hdfs.batchSize = 1
agent.sinks.k1.hdfs.rollCount=0
agent.sinks.k1.hdfs.round = true
agent.sinks.k1.hdfs.roundValue = 2
agent.sinks.k1.hdfs.roundUnit = minute
agent.sinks.k1.hdfs.useLocalTimeStamp = true

2. Configure Nginx

user nginx;
worker_processes 1;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;



events {
   worker_connections 1024;
}



http {

   include /etc/nginx/mime.types;
   default_type application/octet-stream;
   log_format main '$remote_addr - $remote_user [$time_local] "$request" '
   '$status $body_bytes_sent "$http_referer" ''"$http_user_agent" "$http_x_forwarded_for"';

   log_format log_format '$remote_addr^A$msec^A$http_host^A$request_uri';
   sendfile on;
   keepalive_timeout 65;
   #include /etc/nginx/conf.d/*.conf;



server {

  listen 80;
  server_name hh 0.0.0.0;



   location ~ .*(BfImg)\.(gif)$ {
 
     default_type image/gif;
     access_log /home/hadoop/access.log log_format;
     root /etc/nginx/www/source;
 
   }

 }

}
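Before starting (or after editing) Nginx, the configuration can be validated and a running instance reloaded with the standard Nginx options (a quick sanity check; adjust the binary path to your installation):

$ /usr/local/nginx/sbin/nginx -t
$ /usr/local/nginx/sbin/nginx -s reload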

3. Start the Flume agent with a1.conf

$ bin/flume-ng agent --conf conf --conf-file conf/a1.conf --name agent -Dflume.root.logger=INFO,console

(If -Dflume.root.logger=INFO,console is omitted, the log output goes to $FLUME_HOME/logs/flume.log. The --name value must match the agent name used in the configuration file, here "agent"; otherwise the components are not wired up and nothing is transferred.)

4. Start Nginx

/usr/local/nginx/sbin/nginx

or, if Nginx is installed as a system service:

service nginx start
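With Nginx running, any request whose URI ends in BfImg.gif matches the location block above and appends a line to /home/hadoop/access.log. A hypothetical test request from the same host:

$ curl -s "http://localhost/BfImg.gif" > /dev/null
$ tail -n 1 /home/hadoop/access.log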

5. Start Hadoop

$ sbin/hadoop-daemon.sh start namenode

$ sbin/hadoop-daemon.sh start datanode

$ sbin/yarn-daemon.sh start resourcemanager

$ sbin/yarn-daemon.sh start nodemanager

or

$ sbin/start-all.sh
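To confirm the daemons are running and the agent is writing into HDFS (using the /logs path from a1.conf), something like:

$ jps
$ hdfs dfs -ls -R /logs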

 

6. Issue: if the Flume agent dies unexpectedly, a temporary (.tmp) file is left in HDFS, and a MapReduce job reading that file may throw an exception.

Workaround:

1. First, copy the file to another directory:

hdfs dfs -cp /logs/11/17/BF-09.1447769674855.log.tmp /BF-09.1447769674855.log.tmp

2. Delete the original .tmp file:

hdfs dfs -rm /logs/11/17/BF-09.1447769674855.log.tmp

3. Copy the file back, dropping the .tmp suffix:

hdfs dfs -cp /BF-09.1447769674855.log.tmp /logs/11/17/BF-09.1447769674855.log
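If several .tmp files were left behind, the same three steps can be scripted. A minimal sketch, assuming the /logs/11/17 layout above and a hypothetical staging directory /tmp_fix on HDFS:

hdfs dfs -mkdir -p /tmp_fix
for f in $(hdfs dfs -ls /logs/11/17 | awk '/\.log\.tmp$/ {print $8}'); do
    name=$(basename "$f" .tmp)                            # e.g. BF-09.1447769674855.log
    hdfs dfs -cp "$f" "/tmp_fix/$name"                     # 1. copy elsewhere
    hdfs dfs -rm "$f"                                      # 2. delete the .tmp file
    hdfs dfs -cp "/tmp_fix/$name" "/logs/11/17/$name"      # 3. copy back without .tmp
done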

Case 2: Collecting HTTP access logs to HDFS in real time

1. Configure a2.conf

source: exec (tail -F)

channel: MemoryChannel

sink: HDFS

#a2:agent name
a2.sources = r2
a2.channels = c2
a2.sinks = k2

# define sources
# Actively pull the log by running a command
a2.sources.r2.type = exec
# Command used to read the log (make sure Flume has read permission)
a2.sources.r2.command = tail -F /var/log/httpd/access_log
# Shell used to run the command above
a2.sources.r2.shell = /bin/bash -c

# define channels
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# define sinks
# Destination: upload to HDFS
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path=hdfs://aliyun.lzh:8020/flume/%Y%m%d/%H
a2.sinks.k2.hdfs.filePrefix = accesslog
# Enable time-based rounding of the directory path
a2.sinks.k2.hdfs.round=true
# roundValue: 1, round unit: hour
a2.sinks.k2.hdfs.roundValue=1
a2.sinks.k2.hdfs.roundUnit=hour
# Use the local timestamp (required here, otherwise the sink fails because events carry no timestamp header)
a2.sinks.k2.hdfs.useLocalTimeStamp=true
# Number of events flushed to HDFS per batch
a2.sinks.k2.hdfs.batchSize=1000
# File format: the default is SequenceFile (key/value pairs); DataStream is an uncompressed plain data stream
a2.sinks.k2.hdfs.fileType=DataStream
# Serialization format: Text
a2.sinks.k2.hdfs.writeFormat=Text

# Roll settings to avoid producing too many small files
# Roll a new file every 60 seconds
a2.sinks.k2.hdfs.rollInterval=60
# Roll a new file once it reaches 128000000 bytes
# In production, when rolling around a 128 MB block size, this is usually set slightly below 128 MB (roughly 127 MB) so a rolled file fits in one block
a2.sinks.k2.hdfs.rollSize=128000000
# 0 disables rolling based on event count
a2.sinks.k2.hdfs.rollCount=0
# Set to 1, otherwise HDFS block replication can trigger extra rolls and the three roll settings above take no effect
a2.sinks.k2.hdfs.minBlockReplicas=1

# bind the sources and sinks to the channels
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

2. Install the Apache HTTP server to generate web access logs

2.1 Install Apache HTTP Server

# yum -y install httpd

2.2 Start the httpd service

# service httpd start

2.3 Create a static HTML page

# vi /var/www/html/index.html

this is a test html

2.4 Open the page in a browser using the host name

aliyun.lzh

2.5 Watch the httpd log in real time

# chmod -R 777 /var/log/httpd

$ tail -f /var/log/httpd/access_log
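In another terminal, requests can also be generated from the command line instead of the browser, which makes it easy to produce a steady stream of log lines (host name as configured above):

$ for i in $(seq 1 20); do curl -s http://aliyun.lzh/index.html > /dev/null; sleep 1; done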

3. Start Hadoop

$ sbin/start-dfs.sh

4. Start Flume agent a2

$ bin/flume-ng agent --conf conf --conf-file conf/a2.conf --name a2 -Dflume.root.logger=INFO,console

5. Refresh the static page and check whether the expected directory and files appear in HDFS
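For example, listing what the sink has written for the current day (a sketch using the path layout from a2.conf):

$ hdfs dfs -ls -R /flume/$(date +%Y%m%d)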

 

Case 3: Monitoring a directory and ingesting new data in real time

Monitor a directory: whenever a file that meets the criteria appears in it, Flume ingests it into HDFS. The directory may contain several kinds of files; for example, a file ending in .log.tmp is still being written. A size limit is set for the .log.tmp file: once it reaches that size it is renamed to end in .log, meaning it is a complete file (this state is often short-lived) and Flume can ingest its data. A file ending in .log.completed means Flume has finished ingesting it and it can be deleted.
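In other words, a file in the monitored directory moves through three names (illustration only, using the suffixes described above):

access.log.tmp        # still being written; not yet ready for Flume
access.log            # size limit reached, file is complete; Flume can ingest it
access.log.completed  # Flume has finished ingesting it; safe to delete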

1. Requirements analysis

From the requirement above, we need to monitor a log directory, so the Flume agent's source is the Spooling Directory Source. This source watches the spooling directory for new files; when a new file appears, it parses it into events and ships the data to the destination. Once a file has been fully read into the channel, it is renamed to mark it as done, or deleted.

In this case the Flume agent does not use the MemoryChannel from the earlier cases but a FileChannel, which buffers the data received from the source on the local file system and is safer than a MemoryChannel.

source: spooldir

channel: FileChannel

sink: HDFS

 

2. Define the directories

To prepare for the rest of this case, set up the following (they can be created ahead of time, as sketched after the list):

  1. /opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/logs — the directory data is ingested from
  2. hdfs: /flume2/%Y%m%d/%H — where the ingested data is stored on HDFS
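The HDFS target directory, and the FileChannel checkpoint and data directories used in a3.conf below, can be created ahead of time, for example:

$ hdfs dfs -mkdir -p /flume2
$ mkdir -p /opt/modules/cdh/flume-1.5.0-cdh5.3.6/checkpoint
$ mkdir -p /opt/modules/cdh/flume-1.5.0-cdh5.3.6/bufferdata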

3. Configure a3.conf

a3.sources = r3
a3.sinks = k3
a3.channels = c3



# Describe/configure the source
# The source is a directory, so use spooldir

a3.sources.r3.type = spooldir
# Directory to ingest files from
a3.sources.r3.spoolDir = /opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/logs
# Only pick up files in this directory whose names end in .log
a3.sources.r3.includePattern = ^.*\.log$



# Use a channel which buffers events in a file
# Channel type: file
a3.channels.c3.type = file
# Checkpoint directory: records which files have been consumed, plus other metadata
a3.channels.c3.checkpointDir = /opt/modules/cdh/flume-1.5.0-cdh5.3.6/checkpoint
# Directory where the buffered event data is stored
a3.channels.c3.dataDirs = /opt/modules/cdh/flume-1.5.0-cdh5.3.6/bufferdata



# Describe the sink
a3.sinks.k3.type = hdfs
# Use a multi-level directory layout: one level per day (%Y%m%d) and one per hour (%H), so a new folder is created every hour
a3.sinks.k3.hdfs.path = hdfs://aliyun.lzh:8020/flume2/%Y%m%d/%H
# Prefix for the files created on HDFS
a3.sinks.k3.hdfs.filePrefix = accesslog



# Enable time-based rounding of the directory path
a3.sinks.k3.hdfs.round = true
# roundValue: 1, round unit: hour
a3.sinks.k3.hdfs.roundValue = 1
a3.sinks.k3.hdfs.roundUnit = hour
# Use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true



# Write 100 events to HDFS per batch
a3.sinks.k3.hdfs.batchSize = 100
# HDFS file type
a3.sinks.k3.hdfs.fileType = DataStream
# HDFS write format
a3.sinks.k3.hdfs.writeFormat = Text



# Roll settings to avoid producing too many small files
# Roll a new file every 60 seconds
a3.sinks.k3.hdfs.rollInterval = 60
# Roll a new file once it reaches 128000000 bytes
# In production, when rolling around a 128 MB block size, this is usually set slightly below 128 MB (roughly 127 MB)
a3.sinks.k3.hdfs.rollSize = 128000000
# 0 disables rolling based on event count; rolling depends only on time and size
a3.sinks.k3.hdfs.rollCount = 0
# Set to 1, otherwise HDFS block replication can trigger extra rolls and the three roll settings above take no effect
a3.sinks.k3.hdfs.minBlockReplicas = 1



# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

 

4. Start Flume agent a3

$ bin/flume-ng agent --conf conf --conf-file conf/a3.conf --name a3 -Dflume.root.logger=INFO,console
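To see the agent at work, drop a file ending in .log into the monitored directory; by default the Spooling Directory Source renames a fully ingested file by appending a .COMPLETED suffix. A sketch using a hypothetical test file:

$ echo "spooldir test event" > /tmp/test.log
$ cp /tmp/test.log /opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/logs/
$ ls /opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/logs/test.log*
$ hdfs dfs -ls -R /flume2/$(date +%Y%m%d)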

Case 4: Chained multi-agent topology

[Figure 1: chained multi-agent topology]

Client agent:

source: exec (tail -F)

channel: MemoryChannel

sink: avro

Server agent:

source: avro

channel: MemoryChannel

sink: HDFS

1. Client agent configuration: donkey_mother1.conf

# exec source + memory channel + avro sink
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/access.log
a1.sources.r1.channels = c1


# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = aliyun.lzh
a1.sinks.k1.port = 4545

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

2. Server agent configuration: donkey_mother2.conf

# avro source + memory channel + hdfs sink 

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = aliyun.lzh
a1.sources.r1.port = 4545

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100


# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = bf-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.useLocalTimeStamp=true
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 5120 
a1.sinks.k1.hdfs.rollCount = 0

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

3. Start the Flume agents

In this aggregation (chained) setup, start the server-side agent first:

bin/flume-ng agent --conf conf --conf-file conf/donkey_mother2.conf --name a1 -Dflume.root.logger=INFO,console

then start the client-side agent:

bin/flume-ng agent --conf conf --conf-file conf/donkey_mother1.conf --name a1 -Dflume.root.logger=INFO,console
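Once both agents are running, the avro link can also be exercised with the avro-client tool that ships with Flume, which sends the contents of a file to the server agent's avro source (the test file here is hypothetical):

bin/flume-ng avro-client --conf conf -H aliyun.lzh -p 4545 -F /tmp/avro_test.txt
hdfs dfs -ls /flume/events/$(date +%Y%m%d)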
