Nginx acts as the log server. Flume tails the nginx log file with an exec source, uses a memory channel as the transfer channel, and stores the data on HDFS with an HDFS sink.
source: exec (tail -f)
channel: MemoryChannel
sink: HDFS
agent.sources = r1
agent.sinks = k1
agent.channels = c1
## common
agent.sources.r1.channels = c1
agent.sinks.k1.channel = c1
## sources config
agent.sources.r1.type = exec
agent.sources.r1.command = tail -F /home/hadoop/access.log
## channels config
agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000
agent.channels.c1.transactionCapacity = 1000
agent.channels.c1.byteCapacityBufferPercentage = 20
agent.channels.c1.byteCapacity = 1000000
agent.channels.c1.keep-alive = 60
## sinks config
agent.sinks.k1.type = hdfs
agent.sinks.k1.channel = c1
agent.sinks.k1.hdfs.path = hdfs://hadoop.senior02:8020/logs/%m/%d
agent.sinks.k1.hdfs.fileType = DataStream
agent.sinks.k1.hdfs.filePrefix = BF-%H
agent.sinks.k1.hdfs.fileSuffix=.log
agent.sinks.k1.hdfs.minBlockReplicas=1
agent.sinks.k1.hdfs.rollInterval=3600
agent.sinks.k1.hdfs.rollSize=132692539
agent.sinks.k1.hdfs.idleTimeout=10
agent.sinks.k1.hdfs.batchSize = 1
agent.sinks.k1.hdfs.rollCount=0
agent.sinks.k1.hdfs.round = true
agent.sinks.k1.hdfs.roundValue = 2
agent.sinks.k1.hdfs.roundUnit = minute
agent.sinks.k1.hdfs.useLocalTimeStamp = true
user nginx;
worker_processes 1;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
events {
    worker_connections 1024;
}
http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';
    log_format log_format '$remote_addr^A$msec^A$http_host^A$request_uri';
    sendfile on;
    keepalive_timeout 65;
    #include /etc/nginx/conf.d/*.conf;
    server {
        listen 80;
        server_name hh 0.0.0.0;
        location ~ .*(BfImg)\.(gif)$ {
            default_type image/gif;
            access_log /home/hadoop/access.log log_format;
            root /etc/nginx/www/source;
        }
    }
}
$ bin/flume-ng agent --conf conf --conf-file conf/a1.conf --name agent -Dflume.root.logger=INFO,console
(If -Dflume.root.logger=INFO,console is dropped, the log output goes to $FLUME_HOME/logs/flume.log instead of the console. The --name value must match the agent name defined in the configuration file, otherwise the agent loads no components and no data is delivered.)
/usr/local/nginx/sbin/nginx
service nginx start
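Once nginx is up, a quick way to confirm that hits on the tracking image are being logged (a sketch only; the host name hh and the image path are taken from the server block above, adjust to your environment):
for i in 1 2 3 4 5; do
  curl -s "http://hh/BfImg.gif?uid=$i" > /dev/null   # request the tracking gif
done
tail -n 5 /home/hadoop/access.log                    # the hits appear in the log file that the exec source tails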
$ sbin/hadoop-daemon.sh start namenode
$ sbin/hadoop-daemon.sh start datanode
$ sbin/yarn-daemon.sh start resourcemanager
$ sbin/yarn-daemon.sh start nodemanager
or
$ sbin/start-all.sh
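A quick sanity check that the daemons are running and that the HDFS sink is writing where expected; this is only a sketch based on the hdfs.path (/logs/%m/%d) and filePrefix (BF-%H) settings above:
$ jps
# NameNode, DataNode, ResourceManager and NodeManager should appear in the list
$ hdfs dfs -ls /logs/$(date +%m)/$(date +%d)
# files named BF-<hour>.<timestamp>.log(.tmp) appear here once events are flushed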
Solution (for a .tmp file left behind on HDFS):
1. First copy the file to another directory:
hdfs dfs -cp /logs/11/17/BF-09.1447769674855.log.tmp /BF-09.1447769674855.log.tmp
2. Delete the existing .tmp file:
hdfs dfs -rm /logs/11/17/BF-09.1447769674855.log.tmp
3. Copy the file back, dropping the .tmp suffix:
hdfs dfs -cp /BF-09.1447769674855.log.tmp /logs/11/17/BF-09.1447769674855.log
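The three steps can also be run as one short sequence (a sketch; the file name is the example one above, substitute your own). Alternatively, hdfs dfs -mv can rename the file in place without the intermediate copy.
f=/logs/11/17/BF-09.1447769674855.log.tmp
hdfs dfs -cp "$f" /BF-09.1447769674855.log.tmp          # 1. copy the file out
hdfs dfs -rm "$f"                                       # 2. remove the stuck .tmp file
hdfs dfs -cp /BF-09.1447769674855.log.tmp "${f%.tmp}"   # 3. copy it back without the .tmp suffix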
source: exec (tail -f)
channel: MemoryChannel
sink: HDFS
#a2:agent name
a2.sources = r2
a2.channels = c2
a2.sinks = k2
# define sources
# Pull the log actively (exec source)
a2.sources.r2.type = exec
# Command used to read the log (make sure the user has read permission)
a2.sources.r2.command = tail -F /var/log/httpd/access_log
# Shell used to run the command above
a2.sources.r2.shell = /bin/bash -c
# define channels
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100
# define sinks
# Destination: upload to HDFS
a2.sinks.k2.type = hdfs
a2.sinks.k2.hdfs.path=hdfs://aliyun.lzh:8020/flume/%Y%m%d/%H
a2.sinks.k2.hdfs.filePrefix = accesslog
# Enable time-based (rounded) directory creation
a2.sinks.k2.hdfs.round=true
# Set roundValue to 1 and the round unit to hour
a2.sinks.k2.hdfs.roundValue=1
a2.sinks.k2.hdfs.roundUnit=hour
# Use the local timestamp (this must be set here, otherwise an error is thrown)
a2.sinks.k2.hdfs.useLocalTimeStamp=true
# Number of events flushed to HDFS at a time
a2.sinks.k2.hdfs.batchSize=1000
# File format: the default is SequenceFile (key/value pairs); DataStream is a plain, uncompressed stream
a2.sinks.k2.hdfs.fileType=DataStream
# Serialization format: Text
a2.sinks.k2.hdfs.writeFormat=Text
# The following settings address the problem of too many small files
# Roll to a new file every 60 seconds
a2.sinks.k2.hdfs.rollInterval=60
# Roll to a new file when roughly 128 MB has been written (value in bytes)
# In practice, to stay just under a 128 MB block size, this is usually set to about 127 MB (127*1024*1024 bytes)
a2.sinks.k2.hdfs.rollSize=128000000
# Do not roll files based on the number of events
a2.sinks.k2.hdfs.rollCount=0
# Set to 1; otherwise, when block replicas are being written, new files keep being created and the three roll settings above have no effect
a2.sinks.k2.hdfs.minBlockReplicas=1
# bind the sources and sinks to the channels
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
2.1 Install the Apache HTTP server
# yum -y install httpd
2.2 Start the httpd service
# service httpd start
2.3 Create a static HTML page
# vi /var/www/html/index.html
this is a test html
2.4 Open the page in a browser using the host name
aliyun.lzh
2.5 Monitor the httpd log in real time
# chmod -R 777 /var/log/httpd
$ tail -f /var/log/httpd/access_log
$ sbin/start-dfs.sh
$ bin/flume-ng agent --conf conf --conf-file conf/a2.conf --name a2 -Dflume.root.logger=INFO,console
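With the a2 agent running, a few requests against the httpd page will produce events; the check below is a sketch that assumes the hdfs.path layout configured above (/flume/%Y%m%d/%H).
$ curl -s http://aliyun.lzh/index.html > /dev/null   # generate a hit (host from step 2.4)
$ hdfs dfs -ls /flume/$(date +%Y%m%d)                # hourly directories with accesslog* files appear after a flush/roll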
Monitor a directory: whenever a file matching the configured pattern appears in it, Flume ingests it into HDFS. The directory may contain several kinds of files. A file ending in .log.tmp is still being written; a size threshold is set for it, and once that size is reached it is rolled into a complete file ending in .log (such complete files often exist only briefly), from which Flume can extract the data. A file ending in .log.completed has already been fully ingested by Flume and can be deleted.
From this requirement it is clear that we need to monitor a log directory, so the Flume agent's source is the Spooling Directory source. This source watches the spooling directory for new files, parses each new file into events and ships the data to the destination. Once a file has been completely read into the channel, it is renamed to mark it as finished, or deleted.
In this case the Flume agent no longer uses the MemoryChannel from the previous cases but a FileChannel, which buffers the data taken in by the source on the local file system and is therefore safer than a MemoryChannel.
source: spooldir
channel: FileChannel
sink: HDFS
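To exercise the lifecycle described above by hand, you can play the producer yourself: write a file under a temporary .log.tmp name and rename it to .log once it is complete, at which point the spooldir source picks it up. A sketch only; part-0001 is just an example file name, and the directory is the spoolDir configured below.
echo "test line $(date)" > /opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/logs/part-0001.log.tmp   # still being written
mv /opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/logs/part-0001.log.tmp \
   /opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/logs/part-0001.log                               # complete, ready for pickup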
Some preparation is needed for this case; the agent is configured as follows:
a3.sources = r3
a3.sinks = k3
a3.channels = c3
# Describe/configure the source
# The source is a directory, so use spooldir
a3.sources.r3.type = spooldir
# Directory to ingest files from
a3.sources.r3.spoolDir = /opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/logs
# Only pick up files in this directory whose names end in .log
a3.sources.r3.includePattern = ^.*\.log$
# Use a channel which buffers events in file
# Channel type: file
a3.channels.c3.type = file
# Checkpoint directory: records which files have already been consumed, plus other metadata
a3.channels.c3.checkpointDir = /opt/modules/cdh/flume-1.5.0-cdh5.3.6/checkpoint
# Directory where the buffered event data is stored
a3.channels.c3.dataDirs = /opt/modules/cdh/flume-1.5.0-cdh5.3.6/bufferdata
# Describe the sink
a3.sinks.k3.type = hdfs
# Use a time-based multi-level directory layout: two levels, year-month-day / hour, so a new directory is created every hour
a3.sinks.k3.hdfs.path = hdfs://aliyun.lzh:8020/flume2/%Y%m%d/%H
# Prefix for the files created on HDFS
a3.sinks.k3.hdfs.filePrefix = accesslog
# Enable time-based (rounded) directory creation
a3.sinks.k3.hdfs.round = true
# Set the round unit to hour
a3.sinks.k3.hdfs.roundValue = 1
a3.sinks.k3.hdfs.roundUnit = hour
# Use the local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Write 100 events to HDFS per batch
a3.sinks.k3.hdfs.batchSize = 100
# File type written to HDFS
a3.sinks.k3.hdfs.fileType = DataStream
# File format written to HDFS
a3.sinks.k3.hdfs.writeFormat = Text
# The following settings address the problem of too many small files
# Roll to a new file every 60 seconds
a3.sinks.k3.hdfs.rollInterval = 60
# Roll to a new file when roughly 128 MB has been written (value in bytes)
# In practice, to stay just under a 128 MB block size, this is usually set to about 127 MB (127*1024*1024 bytes)
a3.sinks.k3.hdfs.rollSize = 128000000
# Do not roll files based on the number of events; rolling is driven by time and size only
a3.sinks.k3.hdfs.rollCount = 0
# Set to 1; otherwise, when block replicas are being written, new files keep being created and the three roll settings above have no effect
a3.sinks.k3.hdfs.minBlockReplicas = 1
# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
$ bin/flume-ng agent --conf conf --conf-file conf/a3.conf --name a3 -Dflume.root.logger=INFO,console
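Once a file has been fully consumed, the spooldir source renames it with the .COMPLETED suffix by default. A quick check (a sketch; paths follow the configuration above):
$ ls /opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/logs | grep COMPLETED
$ hdfs dfs -ls /flume2/$(date +%Y%m%d)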
client flume
source: exec (tail -f)
channel: MemoryChannel
sink: avro
server flume
source: avro
channel: MemoryChannel
sink: HDFS
# exec source + memory channel + avro sink
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/access.log
a1.sources.r1.channels = c1
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = aliyun.lzh
a1.sinks.k1.port = 4545
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# avro source + memory channel + hdfs sink
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = aliyun.lzh
a1.sources.r1.port = 4545
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = bf-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.useLocalTimeStamp=true
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 5120
a1.sinks.k1.hdfs.rollCount = 0
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Flume aggregation mode
Start the server-side Flume agent first:
$ bin/flume-ng agent --conf conf --conf-file conf/donkey_mother2.conf --name a1 -Dflume.root.logger=INFO,console
Then start the client-side Flume agent:
$ bin/flume-ng agent --conf conf --conf-file conf/donkey_mother1.conf --name a1 -Dflume.root.logger=INFO,console
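To verify the two-hop pipeline, append a line to the tailed file on the client side and then check HDFS on the server side; a sketch, using the paths configured above.
echo "test event $(date)" >> /home/hadoop/access.log   # on the client host: produce an event
hdfs dfs -ls /flume/events/$(date +%Y%m%d)             # on the server host: bf-* files appear under the dated directory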