Comparing three ways to send different categories of logs to different Kafka partitions with Flume + Kafka

Method 1: Flume configuration only, without modifying the Flume source code

The core idea of this approach is to use a multiplexing selector to route log events of each level to a corresponding channel, and then attach a separate sink to each channel that delivers its events to a designated Kafka partition. The configuration is shown below (multiline_regex_extractor is a custom multi-line interceptor):

# Name the components on this agent
a1.sources = r1
a1.sinks = info_sink debug_sink warn_sink error_sink
a1.channels = info_channel debug_channel warn_channel error_channel

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = 172.17.11.176
a1.sources.r1.port = 41414

a1.sources.r1.channels= info_channel debug_channel warn_channel error_channel

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = multiline_regex_extractor
a1.sources.r1.interceptors.i1.regex = \\d{4}-\\d{2}-\\d{2}\\s\\d{2}:\\d{2}:\\d{2}.*\\s(INFO|DEBUG|WARN|ERROR).*
a1.sources.r1.interceptors.i1.serializers = s1
a1.sources.r1.interceptors.i1.serializers.s1.name = key

a1.sources.r1.selector.type = multiplexing  
a1.sources.r1.selector.header= key
a1.sources.r1.selector.mapping.INFO = info_channel
a1.sources.r1.selector.mapping.DEBUG = debug_channel
a1.sources.r1.selector.mapping.WARN = warn_channel
a1.sources.r1.selector.mapping.ERROR = error_channel


# Use a channel which buffers events in memory
a1.channels.debug_channel.type = memory
a1.channels.debug_channel.capacity = 1000
a1.channels.debug_channel.transactionCapacity = 100

# Describe the sink
a1.sinks.debug_sink.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.debug_sink.kafka.topic = LOG_LEVEL_CLASSIFY
a1.sinks.debug_sink.kafka.bootstrap.servers = 172.17.11.163:9092,172.17.11.176:9092,172.17.11.174:9092
a1.sinks.debug_sink.defaultPartitionId = 0
a1.sinks.debug_sink.channel = debug_channel
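
The configuration above only spells out the debug channel and sink; info_channel, warn_channel and error_channel are configured the same way, and each additional sink differs only in the channel it drains and its defaultPartitionId. A sketch of the info pair (the partition ids 1, 2 and 3 for info, warn and error are just an illustrative choice):

a1.channels.info_channel.type = memory
a1.channels.info_channel.capacity = 1000
a1.channels.info_channel.transactionCapacity = 100

a1.sinks.info_sink.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.info_sink.kafka.topic = LOG_LEVEL_CLASSIFY
a1.sinks.info_sink.kafka.bootstrap.servers = 172.17.11.163:9092,172.17.11.176:9092,172.17.11.174:9092
a1.sinks.info_sink.defaultPartitionId = 1
a1.sinks.info_sink.channel = info_channel

# warn_sink and error_sink are identical except for the channel binding
# and the defaultPartitionId (2 and 3 respectively)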

One thing to note is the name given to the field that the interceptor extracts, i.e. the value of the following serializer parameter:
a1.sources.r1.interceptors.i1.serializers = s1
a1.sources.r1.interceptors.i1.serializers.s1.name = key
If this name is not set to key, you will find that the messages on the Kafka side have a null key. The reason is that KafkaSink takes the value of the header named "key" from the event's headers and uses it as the key of the Kafka message. The relevant source code:

public static final String KEY_HEADER = "key";
……
eventKey = headers.get(KEY_HEADER);
……
if (partitionId != null) {
    record = new ProducerRecord<String, byte[]>(eventTopic, partitionId, eventKey,
             serializeEvent(event, useAvroEventFormat));
} else {
    record = new ProducerRecord<String, byte[]>(eventTopic, eventKey,
             serializeEvent(event, useAvroEventFormat));
}
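
To check that events really land where expected, a quick way is to run a consumer that prints the partition and key of every record. The following is a minimal sketch assuming the broker list and topic from the configuration above (the group id partition-check is made up):

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PartitionCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        // broker list and topic are taken from the sink configuration above
        props.put("bootstrap.servers", "172.17.11.163:9092,172.17.11.176:9092,172.17.11.174:9092");
        props.put("group.id", "partition-check");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<String, byte[]>(props);
        consumer.subscribe(Collections.singletonList("LOG_LEVEL_CLASSIFY"));
        while (true) {
            ConsumerRecords<String, byte[]> records = consumer.poll(1000);
            for (ConsumerRecord<String, byte[]> record : records) {
                // key() should be the log level extracted by the interceptor,
                // partition() the partition chosen by the sink
                System.out.println("partition=" + record.partition() + ", key=" + record.key());
            }
        }
    }
}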

Method 2: add a custom partitioner class

When writing a custom partitioner class, if the Kafka version under the lib directory does not match the API version your partitioner was compiled against, you will get a class-not-found exception at runtime. So how do you write a partitioner for the matching version?
The simplest way is to locate the jar named something like kafka_2.10-0.10.2.1.jar under the lib directory of your Kafka installation and add it to your project as a library; in a Maven project you just declare the dependency for that version. Taking the latest Kafka release at the time of writing (0.10.2.1) as an example, the dependency to add is:

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.10</artifactId>
    <version>0.10.2.1</version>
</dependency>
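
In fact the new Partitioner interface, DefaultPartitioner and Cluster all live in the kafka-clients artifact (which kafka_2.10 pulls in transitively), so for a new-API partitioner declaring the client dependency alone is also enough:

<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.10.2.1</version>
</dependency>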

After that, implement the Partitioner interface from that jar (extending DefaultPartitioner is recommended, so you do not accidentally leave some methods unimplemented):

import java.util.Map;

import org.apache.kafka.clients.producer.internals.DefaultPartitioner;
import org.apache.kafka.common.Cluster;

public class SimplePartitioner extends DefaultPartitioner {

    public SimplePartitioner() {
    }

    @Override
    public void configure(Map<String, ?> configs) {
    }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes, Object value,
                         byte[] valueBytes, Cluster cluster) {
        // Example logic: the key is assumed to be a dot-separated string with at
        // least four segments (e.g. an IP address); the last segment modulo 2
        // picks the partition.
        String stringKey = String.valueOf(key);
        String[] keys = stringKey.split("\\.");
        System.out.println(Integer.valueOf(keys[3]) % 2);
        return Integer.valueOf(keys[3]) % 2;
    }

    @Override
    public void close() {
    }
}

The Kafka versions bundled with Flume 1.6 and Flume 1.7 are both fairly old, so your custom partitioner may instead look like the following; that is fine too, as long as the versions are consistent:

import kafka.producer.DefaultPartitioner;
import kafka.utils.VerifiableProperties;

public class TestPartitioner extends DefaultPartitioner {

    public TestPartitioner(VerifiableProperties props) {
        super(props);
    }

    @Override
    public int partition(Object key, int numPartitions) {
        // Replace this with your own logic; here we simply delegate to the
        // default behaviour (hash of the key modulo the partition count).
        return super.partition(key, numPartitions);
    }
}

Then put the jar containing your custom partitioner, together with the jars of the Kafka version it was built against, into the lib directory of your Flume installation, and remove the jars of any other Kafka version. Different versions may require a different number of jars; the latest release needs two: kafka-clients-0.10.2.1.jar and kafka_2.10-0.10.2.1.jar.
Finally, a reminder: keep the versions consistent. Whichever Kafka version you run, the jars of that same version must be placed in the directory above, otherwise classes may fail to load.
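
Note that putting the jars in place does not by itself tell the producer to use the new class; the partitioner still has to be registered through the sink configuration. Flume's KafkaSink forwards unrecognized producer properties to Kafka, so the registration would look roughly like the line below (assuming Flume 1.7's kafka.producer. prefix; Flume 1.6 passed producer properties with the bare kafka. prefix, and the class name here is just a placeholder for your own):

a1.sinks.k1.kafka.producer.partitioner.class = com.example.SimplePartitioner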

Method 3: modify the source code of the KafkaSink class

Another option is to modify the KafkaSink class in the Flume source and pass the partition id straight into the ProducerRecord constructor. Looking at the source, the partition number is determined jointly by the partitionIdHeader and defaultPartitionId parameters; if neither is set, no partition id is passed to the record:

if (staticPartitionId != null) {
    partitionId = staticPartitionId;
}
//Allow a specified header to override a static ID
if (partitionHeader != null) {
    String headerVal = event.getHeaders().get(partitionHeader);
    if (headerVal != null) {
        partitionId = Integer.parseInt(headerVal);
    }
}
……
if (partitionId != null) {
    record = new ProducerRecord<String, byte[]>(eventTopic, partitionId,
        key == null ? eventKey : key,
        serializeEvent(event, useAvroEventFormat));
} else {
    record = new ProducerRecord<String, byte[]>(eventTopic,
        key == null ? eventKey : key,
        serializeEvent(event, useAvroEventFormat));
}

So all we have to do is add our own logic for choosing the partition id:

public static final String KEY_REGEX = "keyRegex";
……
if (context.getString(KEY_REGEX) != null) {
    keyRegex = Pattern.compile(context.getString(KEY_REGEX));
} else {
    keyRegex = Pattern.compile(".*");
}
……
String key = null;

Matcher matcher = keyRegex.matcher(new String(event.getBody(), Charsets.UTF_8));
if (matcher.find() && matcher.groupCount() > 0) {
    key = matcher.group(1);
    if (key.equals("DEBUG")) {
        partitionId = 0;
    } else if (key.equals("INFO")) {
        partitionId = 1;
    } else if (key.equals("WARN")) {
        partitionId = 2;
    } else if (key.equals("ERROR")) {
        partitionId = 3;
    }

} else {
    if (staticPartitionId != null) {
        partitionId = staticPartitionId;
    }
    //Allow a specified header to override a static ID
    if (partitionHeader != null) {
        String headerVal = event.getHeaders().get(partitionHeader);
        if (headerVal != null) {
            partitionId = Integer.parseInt(headerVal);
        }
    }
}

Here keyRegex is the name of the newly added parameter; the regular expression it supplies is used to extract the log level from the event body and set the partition accordingly. It is used like this:

a1.sinks.k1.kafka.keyRegex = \\d{4}-\\d{2}-\\d{2}\\s\\d{2}:\\d{2}:\\d{2}.*\\s(INFO|DEBUG|WARN|ERROR).*
