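I use pykafka from Python to produce and consume data, and ran into two problems that I'd like to share.
Problem 1
When sending data through pykafka, messages went out only once every 5 s. With several million records to send, throughput suffered badly.
The producer is created only through the get_producer method, and I was passing just two parameters: ack_timeout_ms=1000, linger_ms=5000.
Checking the documentation: https://pykafka.readthedocs.io/en/latest/api/producer.html
linger_ms defaults to 5000, which is exactly 5 s. Setting it to 0 sends each message as soon as it arrives.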
linger_ms (int) – This setting gives the upper bound on the delay for batching: once the producer gets min_queued_messages worth of messages for a broker, it will be sent immediately regardless of this setting. However, if we have fewer than this many messages accumulated for this partition we will ‘linger’ for the specified time waiting for more records to show up. linger_ms=0 indicates no lingering - messages are sent as fast as possible after they are `produce()`d.
Fix: set linger_ms=0.
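The batching rule in the quoted documentation can be mimicked with a toy model. This is only an illustrative sketch, not pykafka's implementation: should_flush and its arguments are hypothetical names, the threshold of 1000 is an arbitrary example value, and details such as per-broker queues are ignored.

```python
def should_flush(queued, min_queued_messages, waited_ms, linger_ms):
    """Toy model of the batching rule described in the linger_ms docs.

    Hypothetical helper for illustration only; not part of pykafka.
    """
    if queued == 0:
        return False  # nothing to send
    if queued >= min_queued_messages:
        return True  # batch is full: send regardless of linger_ms
    return waited_ms >= linger_ms  # otherwise wait out the linger time


# A lone message with linger_ms=5000 is still held back after 100 ms ...
print(should_flush(queued=1, min_queued_messages=1000, waited_ms=100, linger_ms=5000))  # False
# ... while linger_ms=0 lets it go out right away.
print(should_flush(queued=1, min_queued_messages=1000, waited_ms=0, linger_ms=0))  # True
```

Under this model, a slow trickle of messages never fills a batch, so every send waits for the full linger time, which matches the 5 s delay described above.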
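Problem 2
The pykafka consumer was consuming messages normally, but the three Kafka VMs, with 30 GB of disk space, filled up within 3 days. Inspecting __consumer_offsets showed that it was producing a large volume of log data.
server.properties had a deletion policy configured, but it did not seem to take effect.
The log files under __consumer_offsets were being written to constantly.
Checking the documentation:
https://pykafka.readthedocs.io/en/latest/api/managedbalancedconsumer.html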
auto_commit_interval_ms (int) – The frequency (in milliseconds) at which the consumer’s offsets are committed to kafka. This setting is ignored if auto_commit_enable is False.
Increasing auto_commit_interval_ms to 60000 solved the problem.
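To get a feel for why the commit interval matters, the number of auto-commits can be estimated with simple arithmetic. The sketch below assumes one commit per consumer per interval and uses a hypothetical 1000 ms as the "before" value, since the original interval is not stated here. How many bytes each commit adds to __consumer_offsets depends on the setup and is not modeled.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def commits_per_day(auto_commit_interval_ms, consumers=1):
    """Rough count of offset commits per day (hypothetical helper)."""
    return consumers * MS_PER_DAY // auto_commit_interval_ms


print(commits_per_day(1000))   # 86400 commits/day at a 1 s interval (assumed value)
print(commits_per_day(60000))  # 1440 commits/day at the 60 s interval used above
```

Going from the assumed 1 s interval to 60 s cuts the commit count by a factor of 60, which is the kind of reduction that would slow the growth of __consumer_offsets.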
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# author :
import json
import logging

from pykafka import KafkaClient

logging.basicConfig(
    level=logging.INFO,
    format="[%(asctime)s] %(name)s:%(levelname)s: %(message)s"
)

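class MyKafkaProducer():
    """
    Kafka producer module
    """
    def __init__(self, kafka_servers, topic=None):
        self.kafkaServers = kafka_servers
        self.client = KafkaClient(hosts=kafka_servers)
        self.kafkaTopic = topic
        self.producer = None  # created below if a topic is given, else in send_json
        if topic:
            topicdocu = self.client.topics[topic]
            # linger_ms=0: send as soon as possible instead of waiting for a batch
            self.producer = topicdocu.get_producer(ack_timeout_ms=5000, linger_ms=0)

    def send_json(self, params, topic=None):
        # Create a new producer only when the topic changes, not on every call
        if topic and (self.producer is None or topic != self.kafkaTopic):
            if self.producer is not None:
                self.producer.stop()
            self.kafkaTopic = topic
            topicdocu = self.client.topics[topic]
            self.producer = topicdocu.get_producer(ack_timeout_ms=1000, linger_ms=0)
        try:
            param_message = json.dumps(params, ensure_ascii=False)
            value = param_message.encode('utf-8')
            self.producer.produce(value)
            logging.info('kafka send succeeded')
            logging.info(value)
        except Exception as e:
            logging.error('kafka send failed')
            logging.error(str(e))
            # log params rather than value: value is unbound if json.dumps() raised
            logging.error(params)

    def close(self):
        if self.producer is not None:
            self.producer.stop()
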
from pykafka.simpleconsumer import OffsetType


class MyKafkaConsumer():
    """
    Consumer module: consume the messages of a topic under different group ids
    """
    def __init__(self, zookeeper_hosts, kafka_servers, topic, group_id):
        self.kafkaTopic = topic
        self.group_id = group_id
        self.client = KafkaClient(hosts=kafka_servers)
        if topic:
            topicdocu = self.client.topics[topic]
            self.consumer = topicdocu.get_balanced_consumer(
                consumer_group=group_id,
                zookeeper_connect=zookeeper_hosts,
                # consumer_timeout_ms=5000,
                auto_commit_enable=True,
                # commit offsets once a minute to limit writes to __consumer_offsets
                auto_commit_interval_ms=60000)

    def consume_data(self):
        try:
            for message in self.consumer:
                if message is not None:
                    yield message
        except KeyboardInterrupt:
            logging.info('consumer interrupted')

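if __name__ == '__main__':
    KAFKA_SERVERS = '192.168.142.134:9092,192.168.142.135:9092,192.168.142.136:9092'
    ZOOKEEPER_HOSTS = '192.168.142.134:2181,192.168.142.135:2181,192.168.142.136:2181'
    # Keyword arguments keep each value matched to the right constructor parameter
    consumer = MyKafkaConsumer(zookeeper_hosts=ZOOKEEPER_HOSTS,
                               kafka_servers=KAFKA_SERVERS,
                               topic='img_download',
                               group_id='rd015downloadgroup')
    for message in consumer.consume_data():
        print(message.value)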
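
One caveat with the producer in the listing above: pykafka's produce() is asynchronous by default, so the success log line only means the message was queued, not that a broker acknowledged it. A minimal sketch of checking real delivery results follows. It assumes the producer was created with delivery_reports=True. The helper name drain_delivery_reports is hypothetical, not part of pykafka.

```python
import queue


def drain_delivery_reports(producer):
    """Collect all delivery reports currently available, without blocking.

    `producer` is assumed to be a pykafka producer created with
    delivery_reports=True. Returns (delivered, failed), where failed is a
    list of (message, exception) pairs.
    """
    delivered, failed = 0, []
    while True:
        try:
            msg, exc = producer.get_delivery_report(block=False)
        except queue.Empty:
            break  # no more reports available right now
        if exc is None:
            delivered += 1
        else:
            failed.append((msg, exc))
    return delivered, failed
```

With a producer obtained from get_producer(delivery_reports=True, linger_ms=0), calling drain_delivery_reports(producer) every few thousand messages would surface failed sends that the try/except around produce() cannot see.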