The log4j.properties configuration:
log4j.rootLogger=INFO
log4j.category.com.besttone=INFO,flume
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = localhost
log4j.appender.flume.Port = 44444
log4j.appender.flume.UnsafeMode = true
You need to add /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/flume-ng/tools/flume-ng-log4jappender-1.4.0-cdh5.0.0-jar-with-dependencies.jar to the classpath.
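The Log4jAppender ships events over Avro to the Hostname and Port configured above, so the receiving agent needs an avro source listening on port 44444. A minimal flume.conf sketch for this test, assuming the tier1 agent name used later in this article, with the logger sink mentioned below:
- tier1.sources=source1
- tier1.channels=channel1
- tier1.sinks=sink1
- 
- # avro source matching the Log4jAppender's Hostname/Port settings
- tier1.sources.source1.type=avro
- tier1.sources.source1.bind=0.0.0.0
- tier1.sources.source1.port=44444
- tier1.sources.source1.channels=channel1
- 
- tier1.channels.channel1.type=memory
- 
- # logger sink: received events are written to the agent's own log file
- tier1.sinks.sink1.type=logger
- tier1.sinks.sink1.channel=channel1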
Then we can write a simple test class to try it out:
- package com.besttone.flume;
-
- import java.util.Date;
-
- import org.apache.commons.logging.Log;
- import org.apache.commons.logging.LogFactory;
-
- public class WriteLog {
- protected static final Log logger = LogFactory.getLog(WriteLog.class);
- 
- public static void main(String[] args) throws InterruptedException {
-
- while (true) {
-
- logger.info(new Date().getTime());
- Thread.sleep(2000);
- }
- }
- }
Then write a run.sh script to run this class:
- #!/bin/bash
- jarlist=`ls ./lib/*.jar`
- CLASSPATH='./bin/'
- for jar in ${jarlist}
- do
- CLASSPATH=${CLASSPATH}:${jar}
- done
- echo ${CLASSPATH}
-
- java -classpath "$CLASSPATH" com.besttone.flume.WriteLog &
Run run.sh with the sink set to logger, then look in Flume's own log file; you can see that the log4j output has been delivered to Flume:
2014-07-16 14:23:54,193 INFO org.apache.flume.sink.LoggerSink: Event: { headers:{flume.client.log4j.log.level=20000, flume.client.log4j.message.encoding=UTF8, flume.client.log4j.logger.name=com.besttone.flume.WriteLog, flume.client.log4j.timestamp=1405491834189} body: 31 34 30 35 34 39 31 38 33 34 31 38 39 1405491834189 }
My understanding of Flume interceptors: they sit between the app (the application producing the logs) and the source, intercepting and processing the app's log events. In other words, before a log event enters the source, an interceptor can wrap it, clean and filter it, and so on.
The interceptors provided out of the box are:
Timestamp Interceptor
Host Interceptor
Static Interceptor
Regex Filtering Interceptor
Regex Extractor Interceptor
Like the interceptors in many Java open-source projects such as Spring MVC, Flume's interceptors form a chain: you can attach multiple interceptors to one source, and they process each event in order (a configuration sketch follows the descriptions below).
Timestamp Interceptor: adds a key named timestamp to the event header, with the current timestamp as its value. This interceptor is very useful when the sink is HDFS, as a later example shows.
Host Interceptor: adds a key named host to the event header, with the current machine's hostname or IP as its value.
Static Interceptor: adds a custom key and value of your choosing to the event header.
Regex Filtering Interceptor: uses a regular expression to exclude or include matching events.
Regex Extractor Interceptor: uses a regular expression to add specified keys to the header, with values taken from the matched groups.
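As a quick illustration of chaining, here is a minimal sketch, assuming the tier1/source1 naming used elsewhere in this article: a host interceptor runs first, then a static interceptor, so every event ends up with both a host header and a fixed datacenter header:
- # two interceptors, applied in the order listed
- tier1.sources.source1.interceptors=i1 i2
- tier1.sources.source1.interceptors.i1.type=host
- tier1.sources.source1.interceptors.i2.type=static
- tier1.sources.source1.interceptors.i2.key=datacenter
- tier1.sources.source1.interceptors.i2.value=dc1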
The following examples show these interceptors in action. First, let's adjust the WriteLog class from the first article:
- public class WriteLog {
- protected static final Log logger = LogFactory.getLog(WriteLog.class);
- 
- public static void main(String[] args) throws InterruptedException {
-
- while (true) {
- logger.info(new Date().getTime());
- logger.info("{\"requestTime\":"
- + System.currentTimeMillis()
- + ",\"requestParams\":{\"timestamp\":1405499314238,\"phone\":\"02038824941\",\"cardName\":\"测试商家名称\",\"provinceCode\":\"440000\",\"cityCode\":\"440106\"},\"requestUrl\":\"/reporter-api/reporter/reporter12/init.do\"}");
- Thread.sleep(2000);
-
- }
- }
- }
The class now emits one extra log line: each loop iteration outputs two lines, the first a bare timestamp and the second a JSON-formatted string.
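Concretely, the two lines from one iteration look roughly like this (the millisecond values will differ from run to run):
- 1405499314234
- {"requestTime":1405499314238,"requestParams":{"timestamp":1405499314238,"phone":"02038824941","cardName":"测试商家名称","provinceCode":"440000","cityCode":"440106"},"requestUrl":"/reporter-api/reporter/reporter12/init.do"}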
Next we use two interceptors, regex_filter and timestamp, to implement the following:
1. Filter out the bare-timestamp line that log4j emits first, and collect only the JSON-formatted log lines.
2. Save the collected logs to HDFS, with each day's logs in a directory named for that day; for example, logs from 2014-07-25 go under /flume/events/14-07-25.
The modified flume.conf looks like this:
- tier1.sources=source1
- tier1.channels=channel1
- tier1.sinks=sink1
-
- tier1.sources.source1.type=avro
- tier1.sources.source1.bind=0.0.0.0
- tier1.sources.source1.port=44444
- tier1.sources.source1.channels=channel1
-
- tier1.sources.source1.interceptors=i1 i2
- tier1.sources.source1.interceptors.i1.type=regex_filter
- tier1.sources.source1.interceptors.i1.regex=\\{.*\\}
- tier1.sources.source1.interceptors.i2.type=timestamp
-
- tier1.channels.channel1.type=memory
- tier1.channels.channel1.capacity=10000
- tier1.channels.channel1.transactionCapacity=1000
- tier1.channels.channel1.keep-alive=30
-
- tier1.sinks.sink1.type=hdfs
- tier1.sinks.sink1.channel=channel1
- tier1.sinks.sink1.hdfs.path=hdfs://master68:8020/flume/events/%y-%m-%d
- tier1.sinks.sink1.hdfs.fileType=DataStream
- tier1.sinks.sink1.hdfs.writeFormat=Text
- tier1.sinks.sink1.hdfs.rollInterval=0
- tier1.sinks.sink1.hdfs.rollSize=10240
- tier1.sinks.sink1.hdfs.rollCount=0
- tier1.sinks.sink1.hdfs.idleTimeout=60
We attached two interceptors, i1 and i2, to source1. i1 is a regex_filter whose regex is \\{.*\\}; only the JSON lines are wrapped in braces, so only they match. Note the escape characters in the regex: without them, source1 fails to start and reports an error.
i2 is a timestamp interceptor, which adds a timestamp key to the event header. We also appended /%y-%m-%d to sink1.hdfs.path. These escape sequences require the event header to contain a timestamp key, which is exactly why we added the timestamp interceptor; without it, the placeholders cannot be resolved and the sink reports an error. Many other placeholders are available; see the official documentation.
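For example, to further bucket the files by hour, the path could add the %H escape as well; a sketch based on this article's setup:
- # daily directories with hourly subdirectories, both resolved from the timestamp header
- tier1.sinks.sink1.hdfs.path=hdfs://master68:8020/flume/events/%y-%m-%d/%H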
Now run WriteLog and inspect the files under the corresponding HDFS directory: they contain only the JSON log lines, exactly the behavior we described.
First, recall that the spooldir source can write the file name into the event header under the key basename. Imagine an interceptor that could take such an event, extract the value of that header key, split it into three parts, and put each part back into the header: that would satisfy the requirement (splitting the file name into path components, as in part eight).
Unfortunately, Flume does not provide an interceptor that extracts from the header. It does provide one that extracts from the body, RegexExtractorInterceptor, which looks quite powerful. Here is an example from the official documentation:
If the Flume event body contained 1:2:3.4foobar5 and the following configuration was used
- a1.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)
- a1.sources.r1.interceptors.i1.serializers = s1 s2 s3
- a1.sources.r1.interceptors.i1.serializers.s1.name = one
- a1.sources.r1.interceptors.i1.serializers.s2.name = two
- a1.sources.r1.interceptors.i1.serializers.s3.name = three
The extracted event will contain the same body but the following headers will have been added one=>1, two=>2, three=>3
In short, with this configuration, if an event body contains 1:2:3.4foobar5, the regex extracts the matched groups and sets them into the header.
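Note that the excerpt omits the interceptor declaration itself; to actually run it, the configuration would also need something like the following (regex_extractor is the built-in alias for this interceptor):
- a1.sources.r1.interceptors = i1
- a1.sources.r1.interceptors.i1.type = regex_extractor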
So I set my sights on this interceptor: a small change, switching it from extracting the body to extracting a specific header key, and it would do the job. Opening the source code, it turned out to be very tidy, so the change was easy. Below is my new interceptor, RegexExtractorExtInterceptor:
- package com.besttone.flume;
-
- import java.util.List;
- import java.util.Map;
- import java.util.regex.Matcher;
- import java.util.regex.Pattern;
-
- import org.apache.commons.lang.StringUtils;
- import org.apache.flume.Context;
- import org.apache.flume.Event;
- import org.apache.flume.interceptor.Interceptor;
- import org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer;
- import org.apache.flume.interceptor.RegexExtractorInterceptorSerializer;
- import org.slf4j.Logger;
- import org.slf4j.LoggerFactory;
-
- import com.google.common.base.Charsets;
- import com.google.common.base.Preconditions;
- import com.google.common.base.Throwables;
- import com.google.common.collect.Lists;
- 
- public class RegexExtractorExtInterceptor implements Interceptor {
-
- static final String REGEX = "regex";
- static final String SERIALIZERS = "serializers";
- 
- // New configuration keys for header-based extraction
- static final String EXTRACTOR_HEADER = "extractorHeader";
- static final boolean DEFAULT_EXTRACTOR_HEADER = false;
- static final String EXTRACTOR_HEADER_KEY = "extractorHeaderKey";
- 
- private static final Logger logger = LoggerFactory
- .getLogger(RegexExtractorExtInterceptor.class);
-
- private final Pattern regex;
- private final List<NameAndSerializer> serializers;
- 
- // New fields: whether to extract from a header, and which header key to read
- private final boolean extractorHeader;
- private final String extractorHeaderKey;
- 
- private RegexExtractorExtInterceptor(Pattern regex,
- List<NameAndSerializer> serializers, boolean extractorHeader,
- String extractorHeaderKey) {
- this.regex = regex;
- this.serializers = serializers;
- this.extractorHeader = extractorHeader;
- this.extractorHeaderKey = extractorHeaderKey;
- }
-
- @Override
- public void initialize() {
-
- }
-
- @Override
- public void close() {
-
- }
-
- @Override
- public Event intercept(Event event) {
- // Choose the text to run the regex against: a header value in header mode,
- // or the event body (the original interceptor's behavior)
- String tmpStr;
- if (extractorHeader) {
- tmpStr = event.getHeaders().get(extractorHeaderKey);
- } else {
- tmpStr = new String(event.getBody(), Charsets.UTF_8);
- }
- 
- Matcher matcher = regex.matcher(tmpStr);
- Map<String, String> headers = event.getHeaders();
- if (matcher.find()) {
- for (int group = 0, count = matcher.groupCount(); group < count; group++) {
- int groupIndex = group + 1;
- if (groupIndex > serializers.size()) {
- if (logger.isDebugEnabled()) {
- logger.debug(
- "Skipping group {} to {} due to missing serializer",
- group, count);
- }
- break;
- }
- NameAndSerializer serializer = serializers.get(group);
- if (logger.isDebugEnabled()) {
- logger.debug("Serializing {} using {}",
- serializer.headerName, serializer.serializer);
- }
- headers.put(serializer.headerName, serializer.serializer
- .serialize(matcher.group(groupIndex)));
- }
- }
- return event;
- }
-
- @Override
- public List<Event> intercept(List<Event> events) {
- List<Event> intercepted = Lists.newArrayListWithCapacity(events.size());
- for (Event event : events) {
- Event interceptedEvent = intercept(event);
- if (interceptedEvent != null) {
- intercepted.add(interceptedEvent);
- }
- }
- return intercepted;
- }
-
- public static class Builder implements Interceptor.Builder {
-
- private Pattern regex;
- private List<NameAndSerializer> serializerList;
- 
- // Builder-side copies of the new configuration parameters
- private boolean extractorHeader;
- private String extractorHeaderKey;
- 
- private final RegexExtractorInterceptorSerializer defaultSerializer = new RegexExtractorInterceptorPassThroughSerializer();
-
- @Override
- public void configure(Context context) {
- String regexString = context.getString(REGEX);
- Preconditions.checkArgument(!StringUtils.isEmpty(regexString),
- "Must supply a valid regex string");
-
- regex = Pattern.compile(regexString);
- regex.pattern();
- regex.matcher("").groupCount();
- configureSerializers(context);
- 
- // Read the new parameters; the header key is required in header mode
- extractorHeader = context.getBoolean(EXTRACTOR_HEADER,
- DEFAULT_EXTRACTOR_HEADER);
- 
- if (extractorHeader) {
- extractorHeaderKey = context.getString(EXTRACTOR_HEADER_KEY);
- Preconditions.checkArgument(
- !StringUtils.isEmpty(extractorHeaderKey),
- "Must supply the header key to extract from");
- }
-
- }
-
- private void configureSerializers(Context context) {
- String serializerListStr = context.getString(SERIALIZERS);
- Preconditions.checkArgument(
- !StringUtils.isEmpty(serializerListStr),
- "Must supply at least one name and serializer");
-
- String[] serializerNames = serializerListStr.split("\\s+");
-
- Context serializerContexts = new Context(
- context.getSubProperties(SERIALIZERS + "."));
-
- serializerList = Lists
- .newArrayListWithCapacity(serializerNames.length);
- for (String serializerName : serializerNames) {
- Context serializerContext = new Context(
- serializerContexts.getSubProperties(serializerName
- + "."));
- String type = serializerContext.getString("type", "DEFAULT");
- String name = serializerContext.getString("name");
- Preconditions.checkArgument(!StringUtils.isEmpty(name),
- "Supplied name cannot be empty.");
-
- if ("DEFAULT".equals(type)) {
- serializerList.add(new NameAndSerializer(name,
- defaultSerializer));
- } else {
- serializerList.add(new NameAndSerializer(name,
- getCustomSerializer(type, serializerContext)));
- }
- }
- }
-
- private RegexExtractorInterceptorSerializer getCustomSerializer(
- String clazzName, Context context) {
- try {
- RegexExtractorInterceptorSerializer serializer = (RegexExtractorInterceptorSerializer) Class
- .forName(clazzName).newInstance();
- serializer.configure(context);
- return serializer;
- } catch (Exception e) {
- logger.error("Could not instantiate event serializer.", e);
- Throwables.propagate(e);
- }
- return defaultSerializer;
- }
-
- @Override
- public Interceptor build() {
- Preconditions.checkArgument(regex != null,
- "Regex pattern was misconfigured");
- Preconditions.checkArgument(serializerList.size() > 0,
- "Must supply a valid group match id list");
- return new RegexExtractorExtInterceptor(regex, serializerList,
- extractorHeader, extractorHeaderKey);
- }
- }
-
- static class NameAndSerializer {
- private final String headerName;
- private final RegexExtractorInterceptorSerializer serializer;
-
- public NameAndSerializer(String headerName,
- RegexExtractorInterceptorSerializer serializer) {
- this.headerName = headerName;
- this.serializer = serializer;
- }
- }
- }
A brief rundown of the changes:
Two configuration parameters were added:
extractorHeader: whether to extract from the header. Defaults to false, which keeps the original interceptor's behavior of extracting from the event body.
extractorHeaderKey: the header key whose value is extracted. This parameter is required when extractorHeader is true.
Following the approach from part eight, we packaged the class into a jar, dropped it as a Flume plugin into /var/lib/flume-ng/plugins.d/RegexExtractorExtInterceptor/lib, and restarted Flume, which loads the interceptor onto the classpath.
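For reference, Flume scans the plugins.d directory on startup expecting roughly this layout (the jar file name here is just an assumption):
- /var/lib/flume-ng/plugins.d/RegexExtractorExtInterceptor/
- ├── lib/regex-extractor-ext-interceptor.jar   # the custom interceptor class
- ├── libext/                                   # optional: dependency jars
- └── native/                                   # optional: native libraries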
The final flume.conf:
- tier1.sources=source1
- tier1.channels=channel1
- tier1.sinks=sink1
- tier1.sources.source1.type=spooldir
- tier1.sources.source1.spoolDir=/opt/logs
- tier1.sources.source1.fileHeader=true
- tier1.sources.source1.basenameHeader=true
- tier1.sources.source1.interceptors=i1
- tier1.sources.source1.interceptors.i1.type=com.besttone.flume.RegexExtractorExtInterceptor$Builder
- tier1.sources.source1.interceptors.i1.regex=(.*)\\.(.*)\\.(.*)
- tier1.sources.source1.interceptors.i1.extractorHeader=true
- tier1.sources.source1.interceptors.i1.extractorHeaderKey=basename
- tier1.sources.source1.interceptors.i1.serializers=s1 s2 s3
- tier1.sources.source1.interceptors.i1.serializers.s1.name=one
- tier1.sources.source1.interceptors.i1.serializers.s2.name=two
- tier1.sources.source1.interceptors.i1.serializers.s3.name=three
- tier1.sources.source1.channels=channel1
- tier1.sinks.sink1.type=hdfs
- tier1.sinks.sink1.channel=channel1
- tier1.sinks.sink1.hdfs.path=hdfs://master68:8020/flume/events/%{one}/%{three}
- tier1.sinks.sink1.hdfs.round=true
- tier1.sinks.sink1.hdfs.roundValue=10
- tier1.sinks.sink1.hdfs.roundUnit=minute
- tier1.sinks.sink1.hdfs.fileType=DataStream
- tier1.sinks.sink1.hdfs.writeFormat=Text
- tier1.sinks.sink1.hdfs.rollInterval=0
- tier1.sinks.sink1.hdfs.rollSize=10240
- tier1.sinks.sink1.hdfs.rollCount=0
- tier1.sinks.sink1.hdfs.idleTimeout=60
- tier1.channels.channel1.type=memory
- tier1.channels.channel1.capacity=10000
- tier1.channels.channel1.transactionCapacity=1000
- tier1.channels.channel1.keep-alive=30
I switched the source type back to the built-in spooldir instead of the custom source from the previous article, then added an interceptor i1 whose type is our custom interceptor, com.besttone.flume.RegexExtractorExtInterceptor$Builder. The regex splits on "." and extracts three parts into the header keys one, two, and three; for a file named a.log.2014-07-31, after the interceptor the header contains one=a, two=log, three=2014-07-31. We then set tier1.sinks.sink1.hdfs.path=hdfs://master68:8020/flume/events/%{one}/%{three}.
This implements exactly the same requirement as part eight.
It also shows that the cost of customizing an interceptor is very small, much smaller than a custom source: we added just one class and the feature was done.