Goal: collect data with Flume and, using a single source, route different log files to different Kafka topics.
For example, there are two log files: error.log and info.log
error.log should be collected into the Kafka topic kafka_channel
info.log should be collected into the Kafka topic kafka_channel2
We use the TAILDIR source together with the KafkaChannel.
The idea:
Use a0.sources.r1.headers.f1.headerKey = error and a0.sources.r1.headers.f2.headerKey = info to stamp a header onto every event, with a different value per file group so the events can be told apart. The name headerKey is arbitrary; it is just a key in the event's header map. Then locate the kafka-channel in the Flume source code and, in its doPut() method, read each event's headers (an event's headers are a Map<String, String>). headers.get("headerKey") returns the marker we set: if it is error, the Kafka topic becomes kafka_channel; if it is info, the topic becomes kafka_channel2. That is exactly the logic below:
String logType = headers.get("headerKey");
if ("info".equals(logType)) {
  topicStr = "kafka_channel2";
} else if ("error".equals(logType)) {
  topicStr = "kafka_channel";
}
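To make the mechanism concrete, here is a tiny self-contained sketch (plain Java, not Flume code; the method name resolveTopic and the default topic string are invented for illustration) that imitates what happens: the TAILDIR source puts headerKey=error or headerKey=info into the event's header map, and the channel then derives the topic from that entry.

import java.util.HashMap;
import java.util.Map;

public class HeaderRoutingDemo {

  // imitates the routing added to doPut(): header value -> topic, with a default fallback
  static String resolveTopic(Map<String, String> headers, String defaultTopic) {
    String logType = headers.get("headerKey");
    if ("info".equals(logType)) {
      return "kafka_channel2";
    } else if ("error".equals(logType)) {
      return "kafka_channel";
    }
    return defaultTopic;
  }

  public static void main(String[] args) {
    // what the TAILDIR source attaches to an event read from error.log
    Map<String, String> errorHeaders = new HashMap<>();
    errorHeaders.put("headerKey", "error");
    System.out.println(resolveTopic(errorHeaders, "default-topic")); // kafka_channel

    // and to an event read from info.log
    Map<String, String> infoHeaders = new HashMap<>();
    infoHeaders.put("headerKey", "info");
    System.out.println(resolveTopic(infoHeaders, "default-topic")); // kafka_channel2
  }
}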
Before the change:
protected void doPut(Event event) throws InterruptedException {
  type = TransactionType.PUT;
  if (!producerRecords.isPresent()) {
    producerRecords = Optional.of(new LinkedList<ProducerRecord<String, byte[]>>());
  }
  String key = event.getHeaders().get(KEY_HEADER);
  Integer partitionId = null;
  try {
    if (staticPartitionId != null) {
      partitionId = staticPartitionId;
    }
    // a partition id in the event headers overrides the static one
    if (partitionHeader != null) {
      String headerVal = event.getHeaders().get(partitionHeader);
      if (headerVal != null) {
        partitionId = Integer.parseInt(headerVal);
      }
    }
    if (partitionId != null) {
      producerRecords.get().add(
          new ProducerRecord<String, byte[]>(topic.get(), partitionId, key,
              serializeValue(event, parseAsFlumeEvent)));
    } else {
      producerRecords.get().add(
          new ProducerRecord<String, byte[]>(topic.get(), key,
              serializeValue(event, parseAsFlumeEvent)));
    }
  } catch (NumberFormatException e) {
    throw new ChannelException("Non integer partition id specified", e);
  } catch (Exception e) {
    throw new ChannelException("Error while serializing event", e);
  }
}
After the change:
protected void doPut(Event event) throws InterruptedException {
  type = TransactionType.PUT;
  if (!producerRecords.isPresent()) {
    producerRecords = Optional.of(new LinkedList<ProducerRecord<String, byte[]>>());
  }
  String key = event.getHeaders().get(KEY_HEADER);
  // read the event headers stamped by the TAILDIR source
  Map<String, String> headers = event.getHeaders();
  String topicStr = null;
  Integer partitionId = null;
  /**
   * The logic here can also be changed to send the data to a specific Kafka partition.
   */
  try {
    if (staticPartitionId != null) {
      partitionId = staticPartitionId;
    }
    if (partitionHeader != null) {
      String headerVal = event.getHeaders().get(partitionHeader);
      if (headerVal != null) {
        partitionId = Integer.parseInt(headerVal);
      }
    }
    /**
     * Added logic: choose the topic from the header value.
     */
    String logType = headers.get("headerKey");
    if ("info".equals(logType)) {
      topicStr = "kafka_channel2";
    } else if ("error".equals(logType)) {
      topicStr = "kafka_channel";
    } else {
      // events that carry neither marker fall back to the configured topic
      topicStr = topic.get();
    }
    if (partitionId != null) {
      producerRecords.get().add(
          new ProducerRecord<String, byte[]>(topicStr, partitionId, key,
              serializeValue(event, parseAsFlumeEvent)));
    } else {
      producerRecords.get().add(
          new ProducerRecord<String, byte[]>(topicStr, key,
              serializeValue(event, parseAsFlumeEvent)));
    }
  } catch (NumberFormatException e) {
    throw new ChannelException("Non integer partition id specified", e);
  } catch (Exception e) {
    throw new ChannelException("Error while serializing event", e);
  }
}
Note:
After changing the source code you no longer need to specify a Kafka topic in the configuration file for these logs; configuring one does no harm, but with the routing above it only takes effect as the fallback for events that carry neither marker, because the topic is now chosen in the code. If you have the time, you can also move the per-log Kafka topics into a properties file to make the program more flexible. Following the same idea you can get even finer-grained and pick both the topic and the partition, changing topicStr and partitionId based on conditions. Finally, if you want KafkaSink to do the same thing, the approach to modifying its source code is identical.
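As an illustration of the properties-file idea, here is a minimal, hypothetical sketch of a helper that loads the header-value-to-topic (and optional partition) mapping from a properties file on the classpath. The file name topic-routing.properties, the entry format topic[:partition] and the class name TopicRouting are all invented for this example and are not part of Flume; inside doPut() the hard-coded if/else would be replaced by calls to topicFor(...) and partitionFor(...).

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

/**
 * Hypothetical helper: maps a header value (e.g. "error", "info") to an entry
 * of the form "topic" or "topic:partition" loaded from a properties file, e.g.
 *   error=kafka_channel:0
 *   info=kafka_channel2
 */
public class TopicRouting {

  private final Properties routes = new Properties();

  public TopicRouting(String resourceName) throws IOException {
    try (InputStream in = TopicRouting.class.getClassLoader()
        .getResourceAsStream(resourceName)) {
      if (in != null) {
        routes.load(in);
      }
    }
  }

  /** Returns the topic for the given header value, or the default topic if unmapped. */
  public String topicFor(String headerValue, String defaultTopic) {
    String entry = headerValue == null ? null : routes.getProperty(headerValue);
    if (entry == null) {
      return defaultTopic;
    }
    int sep = entry.indexOf(':');
    return sep < 0 ? entry : entry.substring(0, sep);
  }

  /** Returns the partition for the given header value, or null if none is configured. */
  public Integer partitionFor(String headerValue) {
    String entry = headerValue == null ? null : routes.getProperty(headerValue);
    if (entry == null || entry.indexOf(':') < 0) {
      return null;
    }
    return Integer.parseInt(entry.substring(entry.indexOf(':') + 1));
  }
}

With a helper like this, adding another log file only means adding one more filegroup and header in the Flume configuration and one more line such as access=kafka_channel3 in the properties file, rather than touching the channel code again.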
a0.sources = r1
a0.channels = c1
a0.sources.r1.type = TAILDIR
#the read offset of each file is stored as JSON so tailing does not restart from the beginning
a0.sources.r1.positionFile = /data/server/flume-1.8.0/conf/taildir_position.json
a0.sources.r1.filegroups = f1 f2
#settings for file group f1
a0.sources.r1.headers.f1.headerKey = error
a0.sources.r1.filegroups.f1 = /data/access/error.log
#settings for file group f2
a0.sources.r1.headers.f2.headerKey = info
a0.sources.r1.filegroups.f2 = /data/access/info.log
#whether to add a header that stores the file's absolute path
#a0.sources.r1.fileHeader = true
#interceptor that adds the server's hostname to the event headers
a0.sources.r1.interceptors = i1 i2 i3
#a0.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a0.sources.r1.interceptors.i1.type = org.apache.flume.host.MyHostInterceptor$Builder
a0.sources.r1.interceptors.i1.preserveExisting = false
#a0.sources.r1.interceptors.i1.useIP = false
a0.sources.r1.interceptors.i1.HeaderName= agentHost
#static interceptor adds a fixed marker to every event
a0.sources.r1.interceptors.i2.type = org.apache.flume.interceptor.StaticInterceptor$Builder
a0.sources.r1.interceptors.i2.key = logType
a0.sources.r1.interceptors.i2.value= kafka_data
a0.sources.r1.interceptors.i2.preserveExisting = false
#add a timestamp header
a0.sources.r1.interceptors.i3.type = timestamp
#define the channel
a0.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a0.channels.c1.kafka.bootstrap.servers = ip1:9092,ip2:9092,ip3:9092
a0.channels.c1.parseAsFlumeEvent = false
#a0.channels.c1.kafka.producer.compression.type = lz4
a0.sources.r1.channels = c1
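Note that no sink is defined: events written into a KafkaChannel already land in the Kafka topic, so a TAILDIR source plus the Kafka channel is enough for this use case. With the modified KafkaChannel class rebuilt and placed on the agent's classpath, the agent is started as usual; the configuration file name below is just an example:

bin/flume-ng agent --conf conf --conf-file conf/taildir-kafka-channel.conf --name a0 -Dflume.root.logger=INFO,console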