Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
Reliability: when a node fails, logs can be delivered to other nodes without being lost. Flume provides three levels of reliability guarantee, from strongest to weakest: end-to-end (the collecting agent first writes the event to disk, deletes it once delivery succeeds, and can resend if delivery fails); Store on failure (the strategy Scribe also uses: when the receiver crashes, data is written locally and sending resumes after the receiver recovers); Best effort (no acknowledgment is performed after data is sent to the receiver).
1. Download Flume and extract it to /opt/
2. Add Flume to the system environment variables in /etc/profile:
export FLUME_HOME=/opt/apache-flume-1.6.0-cdh5.7.0-bin
export PATH=$FLUME_HOME/bin:$PATH
3. Run source /etc/profile to make the configuration take effect
4. Configure flume-env.sh: export JAVA_HOME=/opt/jdk1.8.0_144
5. Verify: flume-ng version
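Taken together, steps 1-5 look roughly like the following shell session (a sketch: the tarball name is inferred from the FLUME_HOME path above, and flume-env.sh is created from the template shipped with the distribution):
# 1. Extract the downloaded tarball to /opt/
tar -xzf apache-flume-1.6.0-cdh5.7.0-bin.tar.gz -C /opt/
# 2. Register the environment variables
echo 'export FLUME_HOME=/opt/apache-flume-1.6.0-cdh5.7.0-bin' >> /etc/profile
echo 'export PATH=$FLUME_HOME/bin:$PATH' >> /etc/profile
# 3. Reload the profile in the current shell
source /etc/profile
# 4. Create flume-env.sh from its template and point it at the JDK
cp $FLUME_HOME/conf/flume-env.sh.template $FLUME_HOME/conf/flume-env.sh
echo 'export JAVA_HOME=/opt/jdk1.8.0_144' >> $FLUME_HOME/conf/flume-env.sh
# 5. Verify: prints the Flume version if everything is wired up correctly
flume-ng version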
Hands-on 1: monitor a directory, pick up files newly added to it in real time, and write their contents to the console.
This source lets you ingest data by placing files to be ingested into a “spooling” directory on disk. This source will watch the specified directory for new files, and will parse events out of new files as they appear. The event parsing logic is pluggable. After a given file has been fully read into the channel, it is renamed to indicate completion (or optionally deleted).
Unlike the Exec source, this source is reliable and will not miss data, even if Flume is restarted or killed. In exchange for this reliability, only immutable, uniquely-named files must be dropped into the spooling directory. Flume tries to detect these problem conditions and will fail loudly if they are violated.
Property Name | Default | Description
--- | --- | ---
channels | - | 
type | - | The component type name, needs to be spooldir
spoolDir | - | The directory from which to read files from.
fileSuffix | .COMPLETED | Suffix to append to completely ingested files
spooling source + memory channel + logger sink
# Name the components on this agent
a1.sources = spooling1
a1.sinks = target1
a1.channels = channel1
# Describe/configure the source
a1.sources.spooling1.type = spooldir
a1.sources.spooling1.spoolDir = /var/log/apache/flumeSpool
a1.sources.spooling1.fileSuffix = .test
# Describe the sink
a1.sinks.target1.type = logger
# Use a channel which buffers events in memory
a1.channels.channel1.type = memory
# Bind the source and sink to the channel
a1.sources.spooling1.channels = channel1
a1.sinks.target1.channel = channel1
flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/spooling.conf \
-Dflume.root.logger=INFO,console
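With the agent running, drop a uniquely-named file into the spool directory and watch the console (paths and suffix follow the config above; the file name and content here are arbitrary):
mkdir -p /var/log/apache/flumeSpool
echo "hello flume" > /var/log/apache/flumeSpool/demo-$(date +%s).log
# The logger sink prints the event to the console, and the fully ingested
# file is renamed with the configured suffix, e.g. demo-<timestamp>.log.test
ls /var/log/apache/flumeSpool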
The spooling source can monitor all files under a directory and forward them through the channel to the sink. However, the spooling source only supports writing each file in the directory once; it does not support writing to the same file multiple times.
Problem: suppose spoolDir=/.../OutputData/, which contains a number of csv files: 201801010001.csv, 201801010002.csv, .... A program keeps creating csv files in this directory, writing each minute's data into its own file; for minute 5, say, the data lands in xxxxxxxxx05.csv. How can the newly produced data be picked up in real time?
There are several ways to solve this problem:
Option 1: set ignorePattern = ^(.)*\\.csv$ so that the source ignores files still ending in .csv, and have a separate program watch the directory: whenever a new csv file appears, rename the previous minute's '201801010001.csv' to '201801010001.spool'. The renamed file no longer matches the ignore pattern and gets ingested, so no file in the spooling directory is ever written to more than once.
Option 2: set spoolDir=/.../Spooling/, i.e. do not point spoolDir at the OutputData directory. A separate program or shell script watches OutputData: whenever a new csv file appears there, it moves or copies the previous minute's '201801010001.csv' into the spoolDir folder. This likewise avoids writing to any file in the spooling directory more than once; a shell sketch follows below.
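A minimal sketch of Option 2 (the directory names are hypothetical, as are the polling interval and the assumption that only the newest csv is still being written):
#!/bin/bash
# Poll OutputData; every csv file except the newest is assumed complete
# and is moved into the Flume spooling directory.
OUT=/data/OutputData     # hypothetical path to the writer's output
SPOOL=/data/Spooling     # hypothetical path configured as spoolDir
while true; do
  latest=$(ls -1t "$OUT"/*.csv 2>/dev/null | head -n 1)
  for f in "$OUT"/*.csv; do
    [ -e "$f" ] && [ "$f" != "$latest" ] && mv "$f" "$SPOOL"/
  done
  sleep 10
done
Option 1 is the same loop with the mv replaced by a rename from .csv to .spool inside OutputData itself.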
The Taildir Source offers another way to consume files that are appended to over time. This source is reliable and will not miss data even when the tailed files rotate. It periodically writes the last read position of each file to the given position file in JSON format. If Flume is stopped or goes down for some reason, it can restart tailing from the position recorded in the existing position file.
This source can also start tailing from an arbitrary position for each file by supplying a prepared position file. When there is no position file at the specified path, it will start tailing from the first line of each file by default.
Files will be consumed in order of their modification time. The file with the oldest modification time will be consumed first.
This source does not rename, delete, or otherwise modify the file being tailed. Currently this source does not support tailing binary files. It reads text files line by line.
Property Name | Default | Description
--- | --- | ---
channels | - | 
type | - | The component type name, needs to be TAILDIR.
filegroups | - | Space-separated list of file groups. Each file group indicates a set of files to be tailed
filegroups.<filegroupName> | - | Absolute path of the file group. Regular expression (and not file system patterns) can be used for filename only.
positionFile | ~/.flume/taildir_position.json | File in JSON format to record the inode, the absolute path and the last position of each tailing file.
headers.<filegroupName>.<headerKey> | - | Header value which is the set with header key. Multiple headers can be specified for one file group.
Example:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/test1/example.log
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /var/log/test2/.*log.*
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.headers.f2.headerKey2 = value2-2
a1.sources.r1.fileHeader = true
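The agent is launched the same way as the spooling example, assuming the configuration above is saved as taildir.conf (the file name is mine):
flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/taildir.conf \
-Dflume.root.logger=INFO,console
While the agent runs, positionFile accumulates one JSON entry per tailed file (inode, absolute path, last read position), which is what lets tailing resume where it left off after a restart.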