An online business system sometimes needs to run statistics over large volumes of data. Writing that data straight to local files (for example, via log4j) can slow the live system down. A better approach is to send the data through a messaging system (for example, Kafka) to a message server, and hook Flume up behind the message middleware to handle the statistics work.
Next, let's look at Flume's Kafka source.
I. Theory:
#-------- kafkaSource configuration -----------------
# Source type
agent.sources.kafkaSource.type = org.apache.flume.source.kafka.KafkaSource
# Address of the ZooKeeper ensemble used by Kafka
agent.sources.kafkaSource.zookeeperConnect = 10.45.9.139:2181
# Kafka topic to consume
agent.sources.kafkaSource.topic = my-topic
# Consumer group id
agent.sources.kafkaSource.groupId = flume
# Consumer timeout. Following this pattern you can set any other Kafka consumer option; note that properties prefixed with kafka. are passed through to the consumer
agent.sources.kafkaSource.kafka.consumer.timeout.ms = 100
Property Name | Default Value | Description
---|---|---
type | | Must be set to org.apache.flume.source.kafka.KafkaSource.
zookeeperConnect | | The URI of the ZooKeeper server or quorum used by Kafka. This can be a single node (for example, zk01.example.com:2181) or a comma-separated list of nodes in a ZooKeeper quorum (for example, zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181).
topic | | The Kafka topic the source reads messages from. Flume supports only one topic per source.
groupID | flume | The unique identifier of the Kafka consumer group. Set the same groupID in all sources to indicate that they belong to the same consumer group.
batchSize | 1000 | The maximum number of messages written to the channel in one batch.
batchDurationMillis | 1000 | The maximum time (in milliseconds) before a batch is written to the channel.
Other properties supported by the Kafka consumer | | Used to configure the Kafka consumer employed by the Kafka source. Any consumer property can be used. Prepend the consumer property name with the prefix kafka. (for example, kafka.fetch.min.bytes). See the Kafka documentation for the full list of Kafka consumer properties.
The Kafka source overrides two Kafka consumer properties: auto.commit.enable is forced to false (the source commits offsets itself after each batch), and consumer.timeout.ms defaults to 10.
Here, Flume plays the role of the Kafka consumer. If you run multiple consumers, be sure to configure them in the same consumer group, or you will run into problems.
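As an illustration of the consumer-group point, a second agent reading the same topic only needs the same groupId to join the group. A minimal sketch (agent2 is hypothetical):
# hypothetical second Flume agent joining the same consumer group
agent2.sources.kafkaSource.type = org.apache.flume.source.kafka.KafkaSource
agent2.sources.kafkaSource.zookeeperConnect = 10.45.9.139:2181
agent2.sources.kafkaSource.topic = my-topic
# same groupId as above, so Kafka splits the topic's partitions between the two consumers
agent2.sources.kafkaSource.groupId = flume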
II. Example:
1. Install and configure Flume:
1) Download Flume 1.7, extract it, and configure the JDK and related settings, as sketched below;
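A minimal sketch of this step, assuming an install under /usr/local and a JDK at /usr/local/jdk1.8.0 (both paths are assumptions, adjust to your environment):
# download and extract Flume 1.7
wget http://archive.apache.org/dist/flume/1.7.0/apache-flume-1.7.0-bin.tar.gz
tar -xzf apache-flume-1.7.0-bin.tar.gz -C /usr/local/
# point Flume at the JDK via flume-env.sh
cd /usr/local/apache-flume-1.7.0-bin/conf
cp flume-env.sh.template flume-env.sh
echo 'export JAVA_HOME=/usr/local/jdk1.8.0' >> flume-env.sh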
2) Configure Flume:
agent1.sources = logsource
agent1.channels = mc1
agent1.sinks = avro-sink
agent1.sources.logsource.channels = mc1
agent1.sinks.avro-sink.channel = mc1
#source
agent1.sources.logsource.type = org.apache.flume.source.kafka.KafkaSource
agent1.sources.logsource.zookeeperConnect = ttAlgorithm-kafka-online021-jylt.abc.virtual:2181,ttAlgorithm-kafka-online022-jylt.abc.virtual:2181,ttAlgorithm-kafka-online023-jylt.abc.virtual:2181,ttAlgorithm-kafka-online024-jylt.abc.virtual:2181,ttAlgorithm-kafka-online025-jylt.abc.virtual:2181
agent1.sources.logsource.topic = page_visits
agent1.sources.logsource.groupId = flume
agent1.sources.logsource.kafka.consumer.timeout.ms = 100
#channel1
agent1.channels.mc1.type = memory
agent1.channels.mc1.capacity = 1000
agent1.channels.mc1.keep-alive = 60
#sink1
agent1.sinks.avro-sink.type = file_roll
agent1.sinks.avro-sink.sink.directory = /data/mysink
3) Start Flume:
flume-ng agent -c /usr/local/apache-flume-1.7.0-bin/conf -f /usr/local/flume/conf/kafka_hdfs.conf -n agent1
4) After startup, Flume reports the following error:
[10-16 18:30:44] [ERROR] [org.apache.flume.node.PollingPropertiesFileConfigurationProvider:146] Failed to start agent because dependencies were not found in classpath. Error follows.
java.lang.NoClassDefFoundError: org/apache/zookeeper/Watcher
Solution:
Copy zookeeper-3.4.5.jar into Flume's lib directory, then restart Flume.
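For example (assuming ZooKeeper is installed under /usr/local/zookeeper-3.4.5 and Flume under /usr/local/apache-flume-1.7.0-bin; both paths are assumptions):
# copy the ZooKeeper client jar onto Flume's classpath
cp /usr/local/zookeeper-3.4.5/zookeeper-3.4.5.jar /usr/local/apache-flume-1.7.0-bin/lib/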
5) After restarting Flume, the log keeps printing the following message:
[10-16 18:37:12] [INFO] [org.apache.kafka.clients.consumer.internals.AbstractCoordinator:529] Marking the coordinator 2147483642 dead.
This problem is particularly odd: there is no error message, yet no data arrives from Kafka. Careful comparison shows it is a version mismatch: the online Kafka cluster runs 0.8.2, while the Kafka source built into Flume 1.7 uses the 0.9 client, and the one built into Flume 1.6 uses the 0.8 client. Comparing the Kafka jars in the lib directories of the Flume 1.7 and Flume 1.6 releases confirms this (see the check below). Reading the Flume source code shows that the 1.7 flume-kafka-source relies on newer Kafka features such as consumer rebalancing, so the fix for this problem is to switch to Flume 1.6.
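A quick way to confirm which client a release bundles is to list the Kafka jars in its lib directory (install path as assumed above):
ls /usr/local/apache-flume-1.7.0-bin/lib | grep -i kafka
# Flume 1.7 should show 0.9.x Kafka jars, while Flume 1.6 shows 0.8.x ones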
6) After switching to Flume 1.6 and redoing the configuration, Flume starts up and everything works. (Download: http://archive.apache.org/dist/flume/1.6.0/)
2. Write a Kafka producer to simulate sending messages, with Flume receiving them:
1) Kafka producer code:
package cn.edu.nuc.MyTestSimple.kafka;

import java.util.Date;
import java.util.Properties;
import java.util.Random;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class TestProducer {
    public static void main(String[] args) {
        sendStringMsg();
    }

    public static void sendStringMsg() {
        try {
            long events = 1000;
            Random rnd = new Random();
            Properties props = new Properties();
            // Broker list of the online 0.8.x Kafka cluster (legacy producer API)
            props.put("metadata.broker.list", "ttAlgorithm-kafka-online001-jyltqbs.abc.virtual:9092,"
                    + "ttAlgorithm-kafka-online002-jyltqbs.abc.virtual:9092,"
                    + "ttAlgorithm-kafka-online003-jyltqbs.abc.virtual:9092,"
                    + "ttAlgorithm-kafka-online004-jyltqbs.abc.virtual:9092,"
                    + "ttAlgorithm-kafka-online005-jyltqbs.abc.virtual:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            props.put("request.required.acks", "1");
            ProducerConfig config = new ProducerConfig(props);
            Producer<String, String> producer = new Producer<String, String>(config);
            for (long nEvents = 0; nEvents < events; nEvents++) {
                long runtime = new Date().getTime();
                String ip = "192.168.2." + rnd.nextInt(255);
                String msg = runtime + ",www.abc.com," + ip;
                // Key by IP so events from the same host land in the same partition
                KeyedMessage<String, String> data = new KeyedMessage<String, String>("page_visits", ip, msg);
                producer.send(data);
            }
            producer.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
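For reference, if the cluster were later upgraded to Kafka 0.9+, the same messages could be sent through the new producer client instead of the legacy API. A minimal sketch (the broker hostname reuses one of the hosts above):
package cn.edu.nuc.MyTestSimple.kafka;

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TestNewApiProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // bootstrap.servers replaces metadata.broker.list in the new API
        props.put("bootstrap.servers", "ttAlgorithm-kafka-online001-jyltqbs.abc.virtual:9092");
        props.put("acks", "1");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<String, String>(props);
        producer.send(new ProducerRecord<String, String>("page_visits", "key", "hello from the new API"));
        producer.close();
    }
}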
2) Flume's log then reports the following error:
17/10/16 18:51:35 ERROR kafka.KafkaSource: KafkaSource EXCEPTION, {}
org.apache.flume.ChannelException: Unable to put batch on required channel: org.apache.flume.channel.MemoryChannel{name: mc1}
at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:200)
at org.apache.flume.source.kafka.KafkaSource.process(KafkaSource.java:130)
at org.apache.flume.source.PollableSourceRunner$PollingRunner.run(PollableSourceRunner.java:139)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flume.ChannelException: Put queue for MemoryTransaction of capacity 100 full, consider committing more frequently, increasing capacity or increasing thread count
at org.apache.flume.channel.MemoryChannel$MemoryTransaction.doPut(MemoryChannel.java:84)
at org.apache.flume.channel.BasicTransactionSemantics.put(BasicTransactionSemantics.java:93)
at org.apache.flume.channel.BasicChannelSemantics.put(BasicChannelSemantics.java:80)
at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:189)
... 3 more
Judging from this error, it is a problem with Flume's channel component: the memory channel's default transactionCapacity is 100, while the Kafka source writes batches of up to batchSize = 1000 events, so a single batch cannot fit into one transaction. The next article will cover this in detail; here is a simple fix:
Modify the Flume configuration file:
agent1.channels.mc1.type = memory
agent1.channels.mc1.capacity = 10000
agent1.channels.mc1.transactionCapacity = 10000
agent1.channels.mc1.keep-alive = 60
This increases both capacity and transactionCapacity, so a full source batch fits in one transaction.
3) Restart Flume (strictly, a restart is not needed: Flume automatically reloads its configuration file), and the data from Kafka now comes through. A quick way to verify is shown below.
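For example, watch the file_roll sink's output directory (the /data/mysink directory configured above):
ls -l /data/mysink
tail -f /data/mysink/*   # the produced messages should appear in the rolled files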
At this point, the configuration of the Flume Kafka source, and its pitfalls, have all been covered.