The core idea of this approach is to use a selector to route each log level to its own channel, and then have a separate sink consume each channel's events and send them to the designated Kafka partition. The configuration is shown below (multiline_regex_extractor is a custom multi-line interceptor):
# Name the components on this agent
a1.sources = r1
a1.sinks = info_sink debug_sink warn_sink error_sink
a1.channels = info_channel debug_channel warn_channel error_channel
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = 172.17.11.176
a1.sources.r1.port = 41414
a1.sources.r1.channels= info_channel debug_channel warn_channel error_channel
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = multiline_regex_extractor
a1.sources.r1.interceptors.i1.regex = \\d{4}-\\d{2}-\\d{2}\\s\\d{2}:\\d{2}:\\d{2}.*\\s(INFO|DEBUG|WARN|ERROR).*
a1.sources.r1.interceptors.i1.serializers = s1
a1.sources.r1.interceptors.i1.serializers.s1.name = key
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header= key
a1.sources.r1.selector.mapping.INFO = info_channel
a1.sources.r1.selector.mapping.DEBUG = debug_channel
a1.sources.r1.selector.mapping.WARN = warn_channel
a1.sources.r1.selector.mapping.ERROR = error_channel
# Use a channel which buffers events in memory
a1.channels.debug_channel.type = memory
a1.channels.debug_channel.capacity = 1000
a1.channels.debug_channel.transactionCapacity = 100
# Describe the sink
a1.sinks.debug_sink.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.debug_sink.channel = debug_channel
a1.sinks.debug_sink.kafka.topic = LOG_LEVEL_CLASSIFY
a1.sinks.debug_sink.kafka.bootstrap.servers = 172.17.11.163:9092,172.17.11.176:9092,172.17.11.174:9092
a1.sinks.debug_sink.defaultPartitionId = 0
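The configuration above only spells out the debug channel and sink; the other three levels follow the same pattern. As a minimal sketch, the info pair might look like the following (warn and error are analogous, changing only the names and the defaultPartitionId; the 0/1/2/3 numbering for DEBUG/INFO/WARN/ERROR matches the mapping used later in this article):
a1.channels.info_channel.type = memory
a1.channels.info_channel.capacity = 1000
a1.channels.info_channel.transactionCapacity = 100
a1.sinks.info_sink.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.info_sink.channel = info_channel
a1.sinks.info_sink.kafka.topic = LOG_LEVEL_CLASSIFY
a1.sinks.info_sink.kafka.bootstrap.servers = 172.17.11.163:9092,172.17.11.176:9092,172.17.11.174:9092
a1.sinks.info_sink.defaultPartitionId = 1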
One thing to pay attention to is the key of the field extracted by the interceptor, namely the value of the s1.name parameter shown below:
a1.sources.r1.interceptors.i1.serializers = s1
a1.sources.r1.interceptors.i1.serializers.s1.name = key
If this name is not set to key, you will find that the key of the messages on the Kafka side is null. The reason is that KafkaSink uses the value of the event header whose key is "key" as the Kafka message key. The source code is as follows:
public static final String KEY_HEADER = "key";
……
eventKey = headers.get(KEY_HEADER);
……
if (partitionId != null) {
    record = new ProducerRecord<String, byte[]>(eventTopic, partitionId, eventKey,
            serializeEvent(event, useAvroEventFormat));
} else {
    record = new ProducerRecord<String, byte[]>(eventTopic, eventKey,
            serializeEvent(event, useAvroEventFormat));
}
When writing a custom partitioner class, if the Kafka version under Flume's lib directory does not match the API version you compiled your partitioner against, you will get a class-not-found exception. So how do you write a partitioner for the matching version?
The simplest way is to locate the jar such as kafka_2.10-0.10.2.1.jar under the lib directory of your installed Kafka and add it to your project as a library; for a Maven project, just add the dependency for that version. Taking the latest Kafka release at the time of writing (0.10.2.1) as an example, the Maven dependency to add is:
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.10</artifactId>
    <version>0.10.2.1</version>
</dependency>
Then implement the Partitioner interface from that jar (extending DefaultPartitioner is recommended, so that you do not miss any methods):
import java.util.Map;

import org.apache.kafka.clients.producer.internals.DefaultPartitioner;
import org.apache.kafka.common.Cluster;

public class SimplePartitioner extends DefaultPartitioner {

    public SimplePartitioner() {
    }

    @Override
    public void configure(Map<String, ?> configs) {
    }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        // Assumes the key is a dot-separated string (e.g. an IP address)
        // and uses its fourth segment to pick one of two partitions.
        String stringKey = String.valueOf(key);
        String[] keys = stringKey.split("\\.");
        int partition = Integer.valueOf(keys[3]) % 2;
        System.out.println(partition);
        return partition;
    }

    @Override
    public void close() {
    }
}
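After packaging this class and putting the jar on Flume's classpath, the Kafka producer still has to be told to use it. A minimal sketch, assuming Flume 1.7's KafkaSink (which passes any property prefixed with kafka.producer. straight through to the Kafka producer) and a hypothetical package name com.example:
a1.sinks.debug_sink.kafka.producer.partitioner.class = com.example.SimplePartitioner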
The Kafka versions bundled with Flume 1.6 and Flume 1.7 are fairly old, so your custom partitioner may instead look like the following; that is fine too, as long as the versions match:
import kafka.producer.DefaultPartitioner;
import kafka.utils.VerifiableProperties;

public class TestPartitioner extends DefaultPartitioner {

    public TestPartitioner(VerifiableProperties props) {
        super(props);
    }

    @Override
    public int partition(Object key, int numPartitions) {
        // Delegate to the default (hash-based) partitioning.
        return super.partition(key, numPartitions);
    }
}
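The wiring for this old-API partitioner is similar. A sketch, assuming Flume 1.6's KafkaSink passes properties prefixed with kafka. through to the old producer (again with a hypothetical package name):
a1.sinks.k1.kafka.partitioner.class = com.example.TestPartitioner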
Then put the jar containing your custom partitioner, together with the jars of the Kafka version it depends on (the number of jars varies by version; the latest release needs two: kafka-clients-0.10.2.1.jar and kafka_2.10-0.10.2.1.jar), into the lib folder of the Flume installation directory, and remove the jars of any other Kafka version.
A final reminder: keep the versions consistent. Whatever Kafka version you run, put the jars of that same version into the directory, otherwise the classes may not be found.
Alternatively, you can modify the KafkaSink class in the Flume source and specify the partition directly via the ProducerRecord constructor. Reading the source shows that the partition number is determined jointly by the partitionIdHeader and defaultPartitionId parameters; if neither is set, no partition number is passed in.
if (staticPartitionId != null) {
    partitionId = staticPartitionId;
}
// Allow a specified header to override a static ID
if (partitionHeader != null) {
    String headerVal = event.getHeaders().get(partitionHeader);
    if (headerVal != null) {
        partitionId = Integer.parseInt(headerVal);
    }
}
……
if (partitionId != null) {
    record = new ProducerRecord<String, byte[]>(eventTopic, partitionId,
            key == null ? eventKey : key,
            serializeEvent(event, useAvroEventFormat));
} else {
    record = new ProducerRecord<String, byte[]>(eventTopic,
            key == null ? eventKey : key,
            serializeEvent(event, useAvroEventFormat));
}
So all we need to do is add our own logic for determining the partition number to complete the partitioning:
public static final String KEY_REGEX = "keyRegex";
……
if (context.getString(KEY_REGEX) != null) {
    keyRegex = Pattern.compile(context.getString(KEY_REGEX));
} else {
    keyRegex = Pattern.compile(".*");
}
……
String key = null;
Matcher matcher = keyRegex.matcher(new String(event.getBody(), Charsets.UTF_8));
if (matcher.find() && matcher.groupCount() > 0) {
    key = matcher.group(1);
    if (key.equals("DEBUG")) {
        partitionId = 0;
    } else if (key.equals("INFO")) {
        partitionId = 1;
    } else if (key.equals("WARN")) {
        partitionId = 2;
    } else if (key.equals("ERROR")) {
        partitionId = 3;
    }
} else {
    if (staticPartitionId != null) {
        partitionId = staticPartitionId;
    }
    // Allow a specified header to override a static ID
    if (partitionHeader != null) {
        String headerVal = event.getHeaders().get(partitionHeader);
        if (headerVal != null) {
            partitionId = Integer.parseInt(headerVal);
        }
    }
}
Here keyRegex is the name of the newly added parameter; the log level is extracted according to the regular expression it provides, and the partition is set accordingly. It is used like this:
a1.sinks.k1.kafka.keyRegex = \\d{4}-\\d{2}-\\d{2}\\s\\d{2}:\\d{2}:\\d{2}.*\\s(INFO|DEBUG|WARN|ERROR).*
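Since the patched sink derives the partition from the event body itself, a multiplexing selector is no longer needed just for partitioning; a minimal sketch of such a sink configuration (c1 is a placeholder channel name, the broker list and topic are reused from above) might be:
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.channel = c1
a1.sinks.k1.kafka.topic = LOG_LEVEL_CLASSIFY
a1.sinks.k1.kafka.bootstrap.servers = 172.17.11.163:9092,172.17.11.176:9092,172.17.11.174:9092
a1.sinks.k1.kafka.keyRegex = \\d{4}-\\d{2}-\\d{2}\\s\\d{2}:\\d{2}:\\d{2}.*\\s(INFO|DEBUG|WARN|ERROR).*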