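Apache Flume is a distributed, reliable, and highly available system for efficiently collecting, aggregating, and moving large volumes of log data from many different sources to a centralized data store. It lets you plug custom data senders into a logging system to collect data, and it can apply simple processing to that data before writing it to a variety of receivers (for example text files, HDFS, and HBase).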
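As the figure shows, the basic unit of data that Flume transfers is the event. For a text file an event is usually one line, and the event is also the basic unit of a transaction. The smallest independently running unit in Flume is the agent. Each agent is one JVM process and is made up of three components: Source, Channel, and Sink.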
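(1) Source: receives incoming data. There are many source types.
The main types are shown in the figure below:
(2) Channel: a temporary holding area that buffers the data received by the Source until the Sink has consumed it.
The main types are shown in the figure below:
(3) Sink: takes data from the Channel and writes it to centralized storage (HDFS/HBase).
The main types are shown in the figure below:
(4) Flume architecture: besides the single-agent layout, more complex multi-agent topologies are also possible.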
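The Source, Channel, and Sink roles described above can be pictured with a small plain-Java model. This is only a sketch of the data flow under simplifying assumptions: MiniEvent, AgentFlowSketch, and the bounded queue standing in for the channel are hypothetical names and are not Flume classes, and there is no transaction handling.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative model only: these types are hypothetical and are NOT Flume classes.
class AgentFlowSketch {

    // An event carries optional headers plus a byte[] body (for example one line of text).
    static final class MiniEvent {
        final Map<String, String> headers = new HashMap<>();
        final byte[] body;

        MiniEvent(String line) {
            this.body = line.getBytes(StandardCharsets.UTF_8);
        }
    }

    // The channel is a bounded buffer that sits between the source and the sink.
    private final BlockingQueue<MiniEvent> channel = new ArrayBlockingQueue<>(100);

    // Source side: turn each input line into an event and put it on the channel.
    void source(List<String> lines) {
        for (String line : lines) {
            if (!channel.offer(new MiniEvent(line))) {
                throw new IllegalStateException("channel is full");
            }
        }
    }

    // Sink side: drain the channel and hand the event bodies to the destination.
    List<String> sink() {
        List<String> delivered = new ArrayList<>();
        MiniEvent event;
        while ((event = channel.poll()) != null) {
            delivered.add(new String(event.body, StandardCharsets.UTF_8));
        }
        return delivered;
    }

    public static void main(String[] args) {
        AgentFlowSketch agent = new AgentFlowSketch();
        agent.source(Arrays.asList("first line", "second line"));
        System.out.println(agent.sink());
    }
}
```

The point of the buffer is that the source and the sink never talk to each other directly, which is why a slow sink does not immediately block the source.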
(1) Under flume/conf/, first create a file named hello.conf
# Declare the three components
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Define the source
a1.sources.r1.type=netcat
a1.sources.r1.bind=localhost
a1.sources.r1.port=8888
# Define the sink
a1.sinks.k1.type=logger
# Define the channel
a1.channels.c1.type=memory
# Bind the source and the sink to the channel
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1
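In this configuration file the source type is the NetCat source, the channel type is memory, and the sink type is logger.
Run a test:
a) Start the Flume agent (from the Flume home directory)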
bin/flume-ng agent -c conf -f conf/hello.conf -n a1 -Dflume.root.logger=INFO,console
b) Start the nc client
nc localhost 8888
Finally, type hello in the nc client.
c) The Flume terminal prints the received event, whose body is hello.
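If nc is not installed, the same test can be driven from Java. The sketch below opens a TCP connection and sends one newline-terminated line, which is what typing into nc does. NetcatLineSender and sendLine are hypothetical names, and the host and port are taken from the configuration above.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: a stand-in for the nc client used in the test above.
class NetcatLineSender {

    // Connects to host:port and sends a single newline-terminated line.
    static void sendLine(String host, int port, String line) {
        try (Socket socket = new Socket(host, port);
             OutputStream out = socket.getOutputStream()) {
            out.write((line + "\n").getBytes(StandardCharsets.UTF_8));
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Matches a1.sources.r1.bind and a1.sources.r1.port in the configuration.
        sendLine("localhost", 8888, "hello");
    }
}
```

The agent has to be running before this is started, otherwise the connection is refused.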
(2) Real-time log collection. A file named test.txt must exist under /home/txp/
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type=exec
a1.sources.r1.command=tail -F /home/txp/test.txt
a1.sinks.k1.type=logger
a1.channels.c1.type=memory
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1
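To see this agent react, something has to keep appending to the file being tailed. A minimal sketch of such a test writer follows. TailTestWriter and appendLine are hypothetical names, and the one-second pause is an arbitrary choice.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical helper that generates test data for the exec source (tail -F).
class TailTestWriter {

    // Appends one line to the file, creating the file if it does not exist yet.
    static void appendLine(Path file, String line) {
        try {
            Files.write(file, (line + "\n").getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Same path as a1.sources.r1.command in the configuration.
        Path file = Paths.get("/home/txp/test.txt");
        for (int i = 1; i <= 5; i++) {
            appendLine(file, "log line " + i);
            Thread.sleep(1000); // arbitrary pause so the events arrive one by one
        }
    }
}
```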
(3) Directory monitoring
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type=spooldir
a1.sources.r1.spoolDir=/home/txp/spool
a1.sources.r1.fileHeader=true
a1.sinks.k1.type=logger
a1.channels.c1.type=memory
a1.sources.r1.channels=c1
a1.sinks.k1.channel=c1
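Here spool is a directory. Create a file outside spool and then move it into spool; the source picks it up and the logger sink prints its contents. In other words, the agent watches the spool directory for new files.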
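The reason for creating the file outside the directory is that the spooling directory source expects files to be complete and unchanged once they appear. One way to guarantee that is to write the file in a staging directory and then move it in a single step. The sketch below assumes both directories are on the same filesystem (required for an atomic move). SpoolDirDrop, drop, and the /home/txp/staging path are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: writes a file elsewhere, then moves it into the spool directory.
class SpoolDirDrop {

    // Writes content to stagingDir/fileName and atomically moves it to spoolDir/fileName.
    static Path drop(Path stagingDir, Path spoolDir, String fileName, String content) {
        try {
            Path staged = stagingDir.resolve(fileName);
            Files.write(staged, content.getBytes(StandardCharsets.UTF_8));
            // ATOMIC_MOVE means the watcher never sees a half-written file.
            return Files.move(staged, spoolDir.resolve(fileName), StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // /home/txp/spool matches a1.sources.r1.spoolDir; the staging path is an assumption.
        drop(Paths.get("/home/txp/staging"), Paths.get("/home/txp/spool"),
                "app-" + System.currentTimeMillis() + ".log", "hello from the spool test\n");
    }
}
```

The timestamp in the file name is there because file names should not be reused within the spool directory.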
(4) HDFS: store the logs in HDFS
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 8888
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H/%M/%S
# File name prefix: events-
a1.sinks.k1.hdfs.filePrefix = events-
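# Round down the timestamp used in the path; this controls how often a new directory is created.
# With the settings below the timestamp is rounded down to a multiple of 10 seconds,
# so a new %S directory appears at most once every 10 seconds.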
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = second
# Use the local time rather than a timestamp header on the event
a1.sinks.k1.hdfs.useLocalTimeStamp=true
# File rolling: start a new file after 10 seconds, 10 bytes, or 3 events, whichever comes first
a1.sinks.k1.hdfs.rollInterval=10
a1.sinks.k1.hdfs.rollSize=10
a1.sinks.k1.hdfs.rollCount=3
a1.channels.c1.type=memory
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
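To make the effect of hdfs.round, hdfs.roundValue, and hdfs.roundUnit on hdfs.path concrete, the sketch below reproduces the calculation in plain Java. It is a model of the behaviour as described in the Flume user guide, not Flume's own code. HdfsPathSketch and its methods are hypothetical names, and only the escapes used in this configuration are handled.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical model of how the HDFS sink path above is resolved; not Flume code.
class HdfsPathSketch {

    // Rounds the seconds field down to a multiple of roundValue (roundUnit = second).
    static LocalDateTime roundDownSeconds(LocalDateTime time, int roundValue) {
        int rounded = (time.getSecond() / roundValue) * roundValue;
        return time.withSecond(rounded).withNano(0);
    }

    // Expands /flume/events/%y-%m-%d/%H/%M/%S for the given (already rounded) time.
    static String resolve(LocalDateTime time) {
        return "/flume/events/" + time.format(DateTimeFormatter.ofPattern("yy-MM-dd/HH/mm/ss"));
    }

    public static void main(String[] args) {
        LocalDateTime eventTime = LocalDateTime.of(2017, 12, 12, 12, 34, 56);
        // prints /flume/events/17-12-12/12/34/50
        System.out.println(resolve(roundDownSeconds(eventTime, 10)));
    }
}
```

Every event whose time falls between 12:34:50 and 12:34:59 lands in the same directory, which is how rounding limits the number of directories created.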
(5) HBase
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 8888
a1.sinks.k1.type = hbase
a1.sinks.k1.table = ns1:t12
a1.sinks.k1.columnFamily = f1
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
a1.channels.c1.type=memory
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
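The regex serializer named in this configuration maps capture groups of a regular expression to columns in the column family. Since no serializer.regex or serializer.colNames is set here, the defaults should apply, which as far as I know are the pattern (.*) and a single column named payload, so the whole event body ends up in f1:payload. The sketch below models that mapping in plain Java. RegexColumnSketch and toColumns are hypothetical names, and the two-column pattern in main is an invented example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical model of a regex-based body-to-columns mapping; not the Flume serializer itself.
class RegexColumnSketch {

    // Returns column name -> value, or an empty map when the body does not match
    // or the number of groups differs from the number of column names.
    static Map<String, String> toColumns(String body, String regex, List<String> colNames) {
        Map<String, String> columns = new LinkedHashMap<>();
        Matcher matcher = Pattern.compile(regex).matcher(body);
        if (!matcher.matches() || matcher.groupCount() != colNames.size()) {
            return columns;
        }
        for (int i = 0; i < colNames.size(); i++) {
            columns.put(colNames.get(i), matcher.group(i + 1));
        }
        return columns;
    }

    public static void main(String[] args) {
        // Assumed defaults: whole body into one column.
        System.out.println(toColumns("hello", "(.*)", Collections.singletonList("payload")));
        // Invented example: split a comma-separated body into two columns.
        System.out.println(toColumns("u1,login", "([^,]*),([^,]*)", Arrays.asList("user", "action")));
    }
}
```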
(6) Multi-hop agents using an Avro Source and an Avro Sink
#agent----a1
a1.sources = r1
a1.sinks= k1
a1.channels = c1
a1.sources.r1.type=netcat
a1.sources.r1.bind=localhost
a1.sources.r1.port=8888
a1.sinks.k1.type = avro
a1.sinks.k1.hostname=localhost
a1.sinks.k1.port=9999
a1.channels.c1.type=memory
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
#agent----a2
a2.sources = r2
a2.sinks= k2
a2.channels = c2
a2.sources.r2.type=avro
a2.sources.r2.bind=localhost
a2.sources.r2.port=9999
a2.sinks.k2.type = logger
a2.channels.c2.type=memory
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2
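Start a2 first and then a1, because a2's Avro source must already be listening on port 9999 when a1's Avro sink connects to it.
Start a2: flume-ng agent -f /soft/flume/conf/avro_hop.conf -n a2 -Dflume.root.logger=INFO,console
Start a1: flume-ng agent -f /soft/flume/conf/avro_hop.conf -n a1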
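When the two agents are started from a script, the start order can be enforced by waiting until the downstream port accepts connections. The following is a minimal sketch of such a check. PortWaiter and waitForPort are hypothetical names, and the timeout and retry interval are arbitrary.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: waits until a TCP port accepts connections.
class PortWaiter {

    // Returns true as soon as host:port accepts a connection, false once the timeout expires.
    static boolean waitForPort(String host, int port, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 500);
                return true;
            } catch (IOException notListeningYet) {
                try {
                    Thread.sleep(200); // arbitrary retry interval
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // 9999 is the port of a2's Avro source in the configuration above.
        boolean ready = waitForPort("localhost", 9999, 30_000);
        System.out.println(ready ? "a2 is listening, a1 can be started" : "a2 did not come up in time");
    }
}
```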
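Flume interceptors are applied to the events a Source reads, before they are passed on toward the Sink. They can add useful information to the event headers or filter events by content, which gives a first pass of data cleaning. The interceptors already included in the source code are: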
Timestamp Interceptor;
Host Interceptor;
Static Interceptor;
UUID Interceptor;
Morphline Interceptor;
Search and Replace Interceptor;
Regex Filtering Interceptor;
Regex Extractor Interceptor;
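import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// JSONObject.parseObject(String, Class) matches fastjson's API, so that library is assumed here.
import com.alibaba.fastjson.JSONObject;

/**
 * Custom Flume interceptor: reads the creation time from the JSON body and stores it
 * in the "timestamp" header, then adds a "logType" header.
 * AppBaseLog is the application's own log bean and is not shown here.
 */
public class LogCollInterceptor implements Interceptor {

    private final boolean preserveExisting;

    private LogCollInterceptor(boolean preserveExisting) {
        this.preserveExisting = preserveExisting;
    }

    @Override
    public void initialize() {
    }

    /**
     * Modifies events in-place.
     */
    @Override
    public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();

        // Timestamp header
        String jsonStr = new String(event.getBody(), StandardCharsets.UTF_8);
        save(jsonStr);
        AppBaseLog log = JSONObject.parseObject(jsonStr, AppBaseLog.class);
        long time = log.getCreatedAtMs();
        // Keep an existing timestamp header only when preserveExisting is set.
        if (!(preserveExisting && headers.containsKey(Constants.TIMESTAMP))) {
            headers.put(Constants.TIMESTAMP, Long.toString(time));
        }
        save(Long.toString(time));

        // logType header, chosen by the first marker field found in the JSON
        String logType = "";
        if (jsonStr.contains("pageId")) {
            logType = "page";       // page log
        } else if (jsonStr.contains("eventId")) {
            logType = "event";      // event log
        } else if (jsonStr.contains("singleUseDurationSecs")) {
            logType = "usage";      // usage log
        } else if (jsonStr.contains("errorBrief")) {
            logType = "error";      // error log
        } else if (jsonStr.contains("network")) {
            logType = "startup";    // startup log
        }
        headers.put("logType", logType);
        save(logType);
        return event;
    }

    /**
     * Delegates to {@link #intercept(Event)} in a loop.
     *
     * @param events the batch of events to process
     * @return the same list, with every event modified in place
     */
    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() {
    }

    /**
     * Builder that Flume uses to create the interceptor from the agent configuration.
     */
    public static class Builder implements Interceptor.Builder {

        private boolean preserveExisting = Constants.PRESERVE_DFLT;

        @Override
        public Interceptor build() {
            return new LogCollInterceptor(preserveExisting);
        }

        @Override
        public void configure(Context context) {
            preserveExisting = context.getBoolean(Constants.PRESERVE, Constants.PRESERVE_DFLT);
        }
    }

    /**
     * Appends a debug line to a local file.
     */
    private void save(String log) {
        try (FileWriter fw = new FileWriter("/home/centos/l.log", true)) {
            fw.append(log).append("\r\n");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static class Constants {
        public static final String TIMESTAMP = "timestamp";
        public static final String PRESERVE = "preserveExisting";
        public static final boolean PRESERVE_DFLT = false;
    }
}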
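The log type detection in the interceptor depends on the order of the checks: the first marker field found decides the type. The sketch below isolates that rule in a dependency-free class so it can be tried without a running agent. LogTypeSketch and classify are hypothetical names; the marker strings and types are the ones used in the interceptor.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical standalone version of the interceptor's log type detection.
class LogTypeSketch {

    // Marker field -> log type, checked in insertion order.
    private static final Map<String, String> MARKERS = new LinkedHashMap<>();

    static {
        MARKERS.put("pageId", "page");
        MARKERS.put("eventId", "event");
        MARKERS.put("singleUseDurationSecs", "usage");
        MARKERS.put("errorBrief", "error");
        MARKERS.put("network", "startup");
    }

    // Returns the type of the first marker contained in the JSON text, or "" if none matches.
    static String classify(String json) {
        for (Map.Entry<String, String> marker : MARKERS.entrySet()) {
            if (json.contains(marker.getKey())) {
                return marker.getValue();
            }
        }
        return "";
    }

    public static void main(String[] args) {
        System.out.println(classify("{\"pageId\":\"main\",\"network\":\"wifi\"}")); // prints page
        System.out.println(classify("{\"network\":\"wifi\"}"));                      // prints startup
    }
}
```

Because this is a substring test on the raw text, a marker name that appears inside a value also counts as a match; parsing the JSON and checking for keys would be stricter.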
Flume references
Official website: http://flume.apache.org/
User guide: http://flume.apache.org/FlumeUserGuide.html
Developer guide: http://flume.apache.org/FlumeDeveloperGuide.html
Reference articles:
https://www.cnblogs.com/ximengchj/p/6423689.html
https://www.cnblogs.com/zhangyinhua/p/7803486.html
https://blog.csdn.net/yuan_xw/article/details/51143698