Flume's Kafka source

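A production system often needs to collect statistics over large volumes of data. Writing that data straight to local files (for example with log4j) can slow the online system down. A better approach is to send the data to a message server through a message queue (for example, Kafka) and then attach Flume behind the message middleware to handle statistics and similar needs.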

Next, let's look at Flume's Kafka source.

I. Theory:

 

#-------- kafkaSource configuration -----------------
# Source type
agent.sources.kafkaSource.type = org.apache.flume.source.kafka.KafkaSource
# Address of the ZooKeeper used by Kafka
agent.sources.kafkaSource.zookeeperConnect = 10.45.9.139:2181
# Kafka topic to consume
agent.sources.kafkaSource.topic = my-topic
# Consumer group id
agent.sources.kafkaSource.groupId = flume
# Consumer timeout. Any other Kafka consumer option can be set the same way: everything after the "kafka." prefix is the consumer property name
agent.sources.kafkaSource.kafka.consumer.timeout.ms = 100
Property Name | Default Value | Description
type | - | Must be set to org.apache.flume.source.kafka.KafkaSource.
zookeeperConnect | - | The URI of the ZooKeeper server or quorum used by Kafka. This can be a single node (for example, zk01.example.com:2181) or a comma-separated list of nodes in a ZooKeeper quorum (for example, zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181).
topic | - | The Kafka topic the source reads messages from. Flume supports only one topic per source.
groupId | flume | The unique identifier of the Kafka consumer group. Set the same groupId in all sources to indicate that they belong to the same consumer group.
batchSize | 1000 | Maximum number of messages written to the channel in one batch.
batchDurationMillis | 1000 | Maximum time (in milliseconds) before a batch is written to the channel.
Other Kafka consumer properties | - | Used to configure the Kafka consumer through the Kafka source. Any property supported by the consumer can be used. Prepend the consumer property name with the prefix kafka. (for example, kafka.fetch.min.bytes). See the Kafka documentation for the full list of Kafka consumer properties.
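The prefix rule in the last row can be illustrated with a small sketch. It is not Flume's actual implementation; the class and method names (KafkaPrefixSketch, toConsumerProperties) are hypothetical, and it only assumes that every key starting with "kafka." is handed to the consumer with the prefix removed.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical illustration of the "kafka." prefix convention, not Flume code.
class KafkaPrefixSketch {
    static final String PREFIX = "kafka.";

    /** Copies every "kafka."-prefixed entry into consumer properties, with the prefix removed. */
    static Properties toConsumerProperties(Map<String, String> sourceConfig) {
        Properties consumerProps = new Properties();
        for (Map.Entry<String, String> entry : sourceConfig.entrySet()) {
            if (entry.getKey().startsWith(PREFIX)) {
                String consumerKey = entry.getKey().substring(PREFIX.length());
                consumerProps.setProperty(consumerKey, entry.getValue());
            }
        }
        return consumerProps;
    }

    public static void main(String[] args) {
        Map<String, String> sourceConfig = new LinkedHashMap<>();
        sourceConfig.put("topic", "my-topic");                 // source-level setting, not forwarded
        sourceConfig.put("groupId", "flume");                  // source-level setting, not forwarded
        sourceConfig.put("kafka.consumer.timeout.ms", "100");  // forwarded as consumer.timeout.ms
        sourceConfig.put("kafka.fetch.min.bytes", "1");        // forwarded as fetch.min.bytes
        System.out.println(toConsumerProperties(sourceConfig));
    }
}
```

Only the two prefixed entries end up in the consumer properties; topic and groupId stay source-level settings.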

The Kafka source overrides two Kafka consumer properties:

 

  1. auto.commit.enable is set to false by the source, and every batch is committed. For better performance it can be set to true through kafka.auto.commit.enable, but this can lose data if the source goes down before committing.
  2. consumer.timeout.ms is set to 10, so when Flume polls Kafka for new data, it waits no more than 10 ms for the data to be available. Setting this to a higher value can reduce CPU utilization due to less frequent polling, but introduces latency in writing batches to the channel.
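To make the interplay of batchSize, batchDurationMillis and the per-batch commit concrete, here is a minimal simulation. It is a sketch under assumptions, not the source's real code: an in-memory queue stands in for Kafka, committedOffset is a hypothetical stand-in for the offset stored on the Kafka side, and the batch ends early when the queue is drained, whereas a real source would keep polling until the deadline.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Hypothetical simulation of "fill a batch, write it, then commit".
class BatchCommitSketch {
    /** Collects messages until the batch is full, the time budget is spent, or nothing is left. */
    static List<String> nextBatch(Queue<String> incoming, int batchSize, long batchDurationMillis) {
        List<String> batch = new ArrayList<>();
        long deadline = System.currentTimeMillis() + batchDurationMillis;
        while (batch.size() < batchSize && System.currentTimeMillis() < deadline) {
            String message = incoming.poll();
            if (message == null) {
                break; // simplification: stop as soon as the queue is empty
            }
            batch.add(message);
        }
        return batch;
    }

    public static void main(String[] args) {
        Queue<String> incoming = new ArrayDeque<>(Arrays.asList("m1", "m2", "m3", "m4", "m5"));
        int committedOffset = 0;
        while (!incoming.isEmpty()) {
            List<String> batch = nextBatch(incoming, 2, 1000);
            // Step 1: hand the batch to the channel (omitted here).
            // Step 2: commit only after the write succeeded, so a crash between
            // the two steps replays the batch instead of losing it.
            committedOffset += batch.size();
            System.out.println("wrote " + batch + ", committed offset " + committedOffset);
        }
    }
}
```

With auto commit enabled, the commit would instead happen on a timer, independent of step 1, which is where the possible data loss comes from.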

 

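Here Flume plays the role of a Kafka consumer. If there are several consumers, make sure they are configured in the same consumer group. Otherwise each group receives its own full copy of the topic and the same messages are processed more than once.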
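The effect of sharing a consumer group can be shown with a toy model. This is only an illustration: the class name, the member names and the round-robin rule are assumptions made for simplicity, and Kafka's real partition assignment strategies differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of partition ownership inside one consumer group (not Kafka's algorithm).
class ConsumerGroupSketch {
    /** Partition p is owned by member (p mod number of members) of the group. */
    static Map<String, List<Integer>> assign(List<String> members, int partitions) {
        Map<String, List<Integer>> owned = new LinkedHashMap<>();
        for (String member : members) {
            owned.put(member, new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            owned.get(members.get(p % members.size())).add(p);
        }
        return owned;
    }

    public static void main(String[] args) {
        // Two sources in the same group split the four partitions between them.
        System.out.println("same group: " + assign(Arrays.asList("source-a", "source-b"), 4));
        // Two sources in different groups: each group is assigned on its own,
        // so both sources read all four partitions and every message is seen twice.
        System.out.println("group 1: " + assign(Collections.singletonList("source-a"), 4));
        System.out.println("group 2: " + assign(Collections.singletonList("source-b"), 4));
    }
}
```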

 

II. Example:

1. Install and configure Flume:

1) Download Flume 1.7, unpack it, and set up the JDK and related settings.

2) Configure Flume:

 

agent1.sources = logsource
agent1.channels = mc1
agent1.sinks = avro-sink

agent1.sources.logsource.channels = mc1
agent1.sinks.avro-sink.channel = mc1

#source
agent1.sources.logsource.type = org.apache.flume.source.kafka.KafkaSource
agent1.sources.logsource.zookeeperConnect = ttAlgorithm-kafka-online021-jylt.abc.virtual:2181,ttAlgorithm-kafka-online022-jylt.abc.virtual:2181,ttAlgorithm-kafka-online023-jylt.abc.virtual:2181,ttAlgorithm-kafka-online024-jylt.abc.virtual:2181,ttAlgorithm-kafka-online025-jylt.abc.virtual:2181
agent1.sources.logsource.topic = page_visits
agent1.sources.logsource.groupId = flume
agent1.sources.logsource.kafka.consumer.timeout.ms = 100

#channel1
agent1.channels.mc1.type = memory
agent1.channels.mc1.capacity = 1000
agent1.channels.mc1.keep-alive = 60


#sink1
agent1.sinks.avro-sink.type = file_roll
agent1.sinks.avro-sink.sink.directory = /data/mysink
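A configuration like this is easy to mis-wire, for example by pointing a sink at a channel name that was never declared. The following sketch checks the wiring with java.util.Properties. It is a hypothetical helper, not part of Flume, and it assumes component lists are separated by whitespace as in the file above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

// Hypothetical sanity check for the source -> channel -> sink wiring of one agent.
class AgentWiringCheck {
    /** Returns true when every channel referenced by a source or sink is declared by the agent. */
    static boolean isWired(Properties conf, String agent) {
        Set<String> channels = names(conf.getProperty(agent + ".channels"));
        for (String source : names(conf.getProperty(agent + ".sources"))) {
            Set<String> used = names(conf.getProperty(agent + ".sources." + source + ".channels"));
            if (used.isEmpty() || !channels.containsAll(used)) {
                return false;
            }
        }
        for (String sink : names(conf.getProperty(agent + ".sinks"))) {
            String used = conf.getProperty(agent + ".sinks." + sink + ".channel");
            if (used == null || !channels.contains(used.trim())) {
                return false;
            }
        }
        return true;
    }

    /** Splits a whitespace-separated list of component names. */
    private static Set<String> names(String value) {
        Set<String> result = new HashSet<>();
        if (value != null) {
            for (String name : value.trim().split("\\s+")) {
                if (!name.isEmpty()) {
                    result.add(name);
                }
            }
        }
        return result;
    }

    /** Parses configuration text in the key = value format used by Flume. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String conf = "agent1.sources = logsource\n"
                + "agent1.channels = mc1\n"
                + "agent1.sinks = avro-sink\n"
                + "agent1.sources.logsource.channels = mc1\n"
                + "agent1.sinks.avro-sink.channel = mc1\n";
        System.out.println("wired correctly: " + isWired(parse(conf), "agent1"));
    }
}
```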

 

3) Start Flume:

flume-ng agent -c /usr/local/apache-flume-1.7.0-bin/conf -f /usr/local/flume/conf/kafka_hdfs.conf  -n agent1

4) After startup, Flume reports an error:

 

[10-16 18:30:44] [ERROR] [org.apache.flume.node.PollingPropertiesFileConfigurationProvider:146] Failed to start agent because dependencies were not found in classpath. Error follows.

java.lang.NoClassDefFoundError: org/apache/zookeeper/Watcher

Solution:

 

Copy zookeeper-3.4.5.jar into Flume's lib directory, then restart Flume.
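A NoClassDefFoundError like the one above means the class is missing from the classpath at run time. One way to confirm this before and after copying the jar is a small probe such as the sketch below. The class name ClasspathCheck is hypothetical, and the probe only tells you something if it is run with the same classpath as the Flume agent.

```java
// Hypothetical probe: reports whether a class can be found on the current classpath.
class ClasspathCheck {
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate and load the class, do not run static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "org.apache.zookeeper.Watcher";
        System.out.println(name + " on classpath: " + isOnClasspath(name));
    }
}
```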

5) After Flume is started again, the log keeps printing the following message:

 

[10-16 18:37:12] [INFO] [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:529] Marking the coordinator 2147483642 dead.

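This problem is particularly odd: there is no error message, yet no data arrives from Kafka. A careful comparison showed that the cause is the Flume version. The production Kafka is version 0.8.2, the Kafka source bundled with flume-1.7 uses Kafka 0.9, and the Kafka source bundled with flume-1.6 uses Kafka 0.8.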

 

[Screenshots: the lib directory of flume-1.7 and the lib directory of flume-1.6]

Downloading the Flume source shows that flume-kafka-source in 1.7 uses newer Kafka features such as rebalancing. The fix is therefore to switch Flume to version 1.6.
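The comparison of the two lib directories can also be done programmatically. The sketch below extracts the Kafka version from jar file names. The naming patterns it accepts (kafka_<scala version>-<version>.jar and kafka-clients-<version>.jar) and the default directory are assumptions; adjust them to what your lib directory actually contains.

```java
import java.io.File;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reads the Kafka version out of jar file names in a lib directory.
class KafkaJarVersion {
    // Assumed naming: kafka_2.10-0.8.1.1.jar or kafka-clients-0.9.0.1.jar
    private static final Pattern KAFKA_JAR =
            Pattern.compile("kafka(?:_\\d+\\.\\d+(?:\\.\\d+)?|-clients)-(\\d+(?:\\.\\d+)+)\\.jar");

    static Optional<String> versionOf(String jarFileName) {
        Matcher m = KAFKA_JAR.matcher(jarFileName);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // The default path is only an example.
        File lib = new File(args.length > 0 ? args[0] : "/usr/local/apache-flume-1.7.0-bin/lib");
        String[] names = lib.list();
        if (names == null) {
            System.out.println("not a readable directory: " + lib);
            return;
        }
        for (String name : names) {
            versionOf(name).ifPresent(v -> System.out.println(name + " -> Kafka " + v));
        }
    }
}
```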

6) After switching to flume-1.6, redoing the configuration, and starting Flume, everything works. (Download: http://archive.apache.org/dist/flume/1.6.0/)

2. Write a Kafka producer that simulates sending messages, and let Flume receive them:

1) Kafka producer code:

 

package cn.edu.nuc.MyTestSimple.kafka;

import java.util.Date;
import java.util.Properties;
import java.util.Random;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class TestProducer {
    public static void main(String[] args) {
        sendStringMsg();
    }

    // Sends 1000 simulated page-visit records to the page_visits topic
    // using the Kafka 0.8 producer API.
    public static void sendStringMsg() {
        Producer<String, String> producer = null;
        try {
            long events = 1000;
            Random rnd = new Random();

            Properties props = new Properties();
            // Brokers used to fetch the topic metadata
            props.put("metadata.broker.list", "ttAlgorithm-kafka-online001-jyltqbs.abc.virtual:9092,"
                    + "ttAlgorithm-kafka-online002-jyltqbs.abc.virtual:9092,"
                    + "ttAlgorithm-kafka-online003-jyltqbs.abc.virtual:9092,"
                    + "ttAlgorithm-kafka-online004-jyltqbs.abc.virtual:9092,"
                    + "ttAlgorithm-kafka-online005-jyltqbs.abc.virtual:9092");
            // Encode keys and messages as strings
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            // Wait for the partition leader to acknowledge each message
            props.put("request.required.acks", "1");

            ProducerConfig config = new ProducerConfig(props);
            producer = new Producer<String, String>(config);

            for (long nEvents = 0; nEvents < events; nEvents++) {
                long runtime = new Date().getTime();
                String ip = "192.168.2." + rnd.nextInt(255);
                // Message format: timestamp,site,ip
                String msg = runtime + ",www.abc.com," + ip;
                // The IP is used as the message key
                KeyedMessage<String, String> data =
                        new KeyedMessage<String, String>("page_visits", ip, msg);
                producer.send(data);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the producer even if sending failed
            if (producer != null) {
                producer.close();
            }
        }
    }
}
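Once these records reach the file_roll sink, each line has the form timestamp,site,ip, which is enough for simple statistics. As an illustration of the kind of downstream processing the introduction mentions, the sketch below counts visits per IP. The class name VisitCounter and the choice to skip malformed lines are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical downstream step: count visits per IP from "timestamp,site,ip" lines.
class VisitCounter {
    static Map<String, Integer> countByIp(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            if (fields.length != 3) {
                continue; // skip lines that do not match the expected format
            }
            counts.merge(fields[2].trim(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1508150000000,www.abc.com,192.168.2.10",
                "1508150000001,www.abc.com,192.168.2.11",
                "1508150000002,www.abc.com,192.168.2.10");
        System.out.println(countByIp(lines));
    }
}
```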


2) The Flume log then shows the following error:

 

 

17/10/16 18:51:35 ERROR kafka.KafkaSource: KafkaSource EXCEPTION, {}
org.apache.flume.ChannelException: Unable to put batch on required channel: org.apache.flume.channel.MemoryChannel{name: mc1}
	at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:200)
	at org.apache.flume.source.kafka.KafkaSource.process(KafkaSource.java:130)
	at org.apache.flume.source.PollableSourceRunner$PollingRunner.run(PollableSourceRunner.java:139)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flume.ChannelException: Put queue for MemoryTransaction of capacity 100 full, consider committing more frequently, increasing capacity or increasing thread count
	at org.apache.flume.channel.MemoryChannel$MemoryTransaction.doPut(MemoryChannel.java:84)
	at org.apache.flume.channel.BasicTransactionSemantics.put(BasicTransactionSemantics.java:93)
	at org.apache.flume.channel.BasicChannelSemantics.put(BasicChannelSemantics.java:80)
	at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:189)
	... 3 more


Judging from this error, the problem lies in Flume's channel component. The next article covers it in detail; here is just a quick fix:

 

Edit the Flume configuration file:

agent1.channels.mc1.type = memory

agent1.channels.mc1.capacity = 10000

agent1.channels.mc1.transactionCapacity = 10000

agent1.channels.mc1.keep-alive = 60

 

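Both capacity and transactionCapacity are increased. The error message reports a transaction put queue of capacity 100, while the Kafka source's default batchSize is 1000, so a single batch could not fit into one channel transaction.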
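The sizing relationship behind this fix can be written down as a simple check. This is a sketch under assumptions: the helper name fits is hypothetical, and the rule it encodes (a batch must fit into one transaction, and a transaction must not be larger than the channel) is inferred from the error above rather than taken from Flume's code.

```java
// Hypothetical check: does a source batch fit into the memory channel's transaction?
class ChannelSizingCheck {
    static boolean fits(int batchSize, int transactionCapacity, int capacity) {
        // A transaction cannot hold more events than the channel itself,
        // and one source batch is put within a single transaction.
        return transactionCapacity <= capacity && batchSize <= transactionCapacity;
    }

    public static void main(String[] args) {
        // Original setup: batchSize 1000, transaction queue of 100 (from the error), capacity 1000
        System.out.println("before the fix: " + fits(1000, 100, 1000));
        // After the fix: capacity and transactionCapacity both set to 10000
        System.out.println("after the fix:  " + fits(1000, 10000, 10000));
    }
}
```

An alternative that follows from the same rule would be to lower the source's batchSize to at most the transaction capacity instead of enlarging the channel.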

3) Restart Flume (a restart is optional, because Flume reloads the configuration file automatically). The data from Kafka now arrives.

 

That covers the configuration of flume-kafka-source and all the pitfalls encountered along the way.


 

 

 

 

 

 

 
