A First Look at the World of Kafka

1: Foreword

     I've been extremely busy lately. I've been learning Kafka and wanted to write down notes, but production work left me no time. Over the past few days I put in some extra hours to write this up and record what I've learned.

2: Kafka overview
   Kafka is a distributed publish-subscribe messaging system written in Scala that later became part of the Apache project. Kafka is a distributed, partitioned, multi-subscriber, replicated, durable log service.
3: Kafka's key characteristics
 1: Distributed
   Kafka's producers, consumers, and brokers are all distributed and can be scaled horizontally without downtime.
 2: Durable
   Kafka persists its log to disk; combined with its replication mechanism, this keeps data from being lost. A topic usually has several partitions, which are spread across different servers, and each partition can have multiple replicas. This improves both parallelism and message reliability, since more partitions means the cluster can serve more consumer requests in parallel.
 3: High throughput
   Kafka reads and writes the log on disk sequentially rather than randomly, which greatly speeds up read and write requests.
4: Kafka's core concepts
        Producer: specifically, the producer of messages.
  Consumer: specifically, the consumer of messages.
  Consumer Group: a group of consumers that consume the partitions of a topic in parallel.
  Broker: a single server in a Kafka cluster is called a broker.
  Topic: the category of messages Kafka handles; each message category is a topic.
  Partition: a topic can be divided into multiple partitions, and each partition is an ordered queue. Messages within a partition are appended in the order they arrive, and each message is assigned a sequential id called its offset.
  Producers: the message and data producers; publishing messages to a Kafka topic is what producers do.
  Consumers: the message and data consumers; subscribing to topics and processing the published messages is what consumers do.
  4.1 Kafka producers
        In Kafka, sending produced messages to a topic is the job of producers. A producer can decide which partition of the topic a message goes to, either via configuration or explicitly through the API. Producers also support asynchronous batch sending, which can noticeably improve sending efficiency: messages are buffered first and then sent to the topic in batches, reducing network I/O. A minimal sketch with the Java client is shown below.
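  The sketch below uses the Kafka Java producer client to illustrate explicit partition selection and batching. The broker addresses, topic name, and batching values are placeholders (the addresses match the pseudo-distributed setup described later in this post); adjust them for your environment.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker list; matches the pseudo-distributed ports used later in this post
        props.put("bootstrap.servers", "192.168.8.88:9092,192.168.8.88:9093,192.168.8.88:9094");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Batching: messages are buffered and sent together, reducing network I/O
        props.put("batch.size", 16384);
        props.put("linger.ms", 5);

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // Explicitly target partition 0 of the topic; pass null instead to let the partitioner decide
        producer.send(new ProducerRecord<>("first_topic", 0, "key-1", "hello kafka"));
        producer.close();
    }
}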
  4.2 Kafka consumers
         In Kafka, a given partition's messages are consumed by only one consumer within a consumer group, but a single consumer can consume messages from multiple partitions. If several consumers belong to the same consumer group, Kafka spreads the partitions across them so that the consumption load is balanced among the consumers. If the consumers belong to different consumer groups, the behavior is more like the publish-subscribe model seen in ZooKeeper: consumers in different groups can consume the same partition's messages at the same time.
  Note that Kafka only guarantees that messages within a single partition are consumed in order; it does not guarantee any ordering across partitions. A minimal consumer sketch follows.
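  Here is a minimal consumer sketch using the Kafka Java client (the new consumer that ships with the kafka_2.11-0.9.0.1 build used later in this post). The broker address, group id, and topic name are placeholders.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "192.168.8.88:9092");
        // Consumers sharing the same group.id split the topic's partitions among themselves
        props.put("group.id", "demo-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("first_topic"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(100);
            for (ConsumerRecord<String, String> record : records)
                // Ordering is only guaranteed within a single partition
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
        }
    }
}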
  4.3 Kafka brokers
        You can think of one machine as one broker. The message logs we send are appended in the broker, and the broker buffers messages temporarily: depending on configuration, once the size or number of buffered messages reaches the configured threshold, the broker flushes them to disk in one pass, effectively reducing the number of disk I/O calls per message.
  Brokers in Kafka are stateless, so consumption progress is recorded on the consumer side rather than on the broker; if a broker goes down, the messages held only on that broker are lost (the replication described above is what guards against this). Messages that have already been consumed are automatically deleted after a configurable retention period, 7 days by default.
  4.4 Kafka messages
      In Kafka, a message consists of several parts: offset, the message's offset; MessageSize, the size of the message; and data, the message body itself. While recording messages, Kafka also builds an index entry every fixed number of bytes; when a consumer needs to fetch a specific message, Kafka binary-searches the index to locate the message to consume. A conceptual sketch of this lookup follows.
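  The sketch below is purely illustrative and is not Kafka's actual implementation: it shows the idea of a sparse index mapping message offsets to byte positions in a segment file, where a binary search finds the largest indexed offset no greater than the target, after which the log is scanned forward from that position. The sample offsets and positions are made up.

import java.util.Arrays;

public class SparseIndexLookup {
    // Hypothetical sparse index: every Nth message gets an entry (offset -> byte position)
    static long[] indexedOffsets = {0, 100, 200, 300};
    static long[] filePositions  = {0, 4096, 8192, 12288};

    static long positionFor(long targetOffset) {
        int i = Arrays.binarySearch(indexedOffsets, targetOffset);
        if (i < 0) i = -i - 2;                 // largest indexed offset <= target
        return filePositions[Math.max(i, 0)];  // start scanning the log file from here
    }

    public static void main(String[] args) {
        // Offset 250 is not indexed, so we start scanning from offset 200's position (8192)
        System.out.println(positionFor(250));
    }
}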
5: Setting up a pseudo-distributed environment

  [root@hadoop config]# cat server.properties

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# see kafka.server.KafkaConfig for additional details and defaults

############################# Server Basics #############################

# The id of the broker. This must be set to a unique integer for each broker.
#brokerid: each Kafka machine is one instance (broker); the id must not be duplicated
broker.id=0

############################# Socket Server Settings #############################
listeners=PLAINTEXT://:9092
# The port the socket server listens on
port=9092
# Hostname the broker will bind to. If not set, the server will bind to all interfaces
#For a pseudo-distributed setup, the IP must not be written as localhost
host.name=ip
# Hostname the broker will advertise to producers and consumers. If not set, it uses the
# value for "host.name" if configured.  Otherwise, it will use the value returned from
# java.net.InetAddress.getCanonicalHostName().
advertised.host.name=ip

# The port to publish to ZooKeeper for clients to use. If this is not set,
# it will publish the same port that the broker binds to.
#advertised.port=<port accessible by clients>

# The number of threads handling network requests
#Maximum number of threads the broker uses to handle network requests; usually the number of CPU cores
num.network.threads=3

# The number of threads doing disk I/O
#Number of threads the broker uses for disk I/O
num.io.threads=8

# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=102400

# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=102400

# The maximum size of a request that the socket server will accept (protection against OOM)
socket.request.max.bytes=104857600

############################# Log Basics #############################

# A comma seperated list of directories under which to store log files
log.dirs=/kafka_2.11-0.9.0.1/kafka-logs-1

# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
#Default number of partitions per topic
num.partitions=2

# The number of threads per data directory to be used for log recovery at startup and flushing at shutdown.
# This value is recommended to be increased for installations with data dirs located in RAID array.
num.recovery.threads.per.data.dir=1

############################# Log Flush Policy #############################

# Messages are immediately written to the filesystem but by default we only fsync() to sync
# the OS cache lazily. The following configurations control the flush of data to disk.
# There are a few important trade-offs here:
#    1. Durability: Unflushed data may be lost if you are not using replication.
#    2. Latency: Very large flush intervals may lead to latency spikes when the flush does occur as there will be a lot of data to flush.
#    3. Throughput: The flush is generally the most expensive operation, and a small flush interval may lead to exceessive seeks.
# The settings below allow one to configure the flush policy to flush data after a period of time or
# every N messages (or both). This can be done globally and overridden on a per-topic basis.

# The number of messages to accept before forcing a flush of data to disk
#log.flush.interval.messages=10000

# The maximum amount of time a message can sit in a log before we force a flush
#log.flush.interval.ms=1000

############################# Log Retention Policy #############################

# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.

# The minimum age of a log file to be eligible for deletion
#Default retention time for the Kafka server's log segments
log.retention.hours=24
log.cleaner.enable=true
log.cleanup.policy=delete

# A size-based retention policy for logs. Segments are pruned from the log as long as the remaining
# segments don't drop below log.retention.bytes.
#Maximum total log size per partition of a topic; data beyond this is deleted
log.retention.bytes=1073741824

# The maximum size of a log segment file. When this size is reached a new log segment will be created.
#A topic's partition is stored as a set of segment files; this controls the size of each segment
log.segment.bytes=1073741824

# The interval at which log segments are checked to see if they can be deleted according
# to the retention policies
log.retention.check.interval.ms=300000

############################# Zookeeper #############################

# Zookeeper connection string (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=ip:2181,ip:2182,ip:2183

# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000
For a pseudo-distributed setup you only need to cp the server.properties file a few times (e.g. server-1.properties and server-2.properties) and start each copy separately. Note that broker.id and the listening port must not be duplicated across the configuration files, and each copy should use its own log.dirs.

  After configuring Kafka, start ZooKeeper first:

 bin/zkServer.sh start zoo.cfg
 bin/zkServer.sh start zoo2.cfg
 bin/zkServer.sh start zoo3.cfg 

  Then start Kafka:

bin/kafka-server-start.sh config/server.properties &
bin/kafka-server-start.sh config/server-1.properties &
bin/kafka-server-start.sh config/server-2.properties &

  Check whether everything started successfully:

[root@hadoop zookeeper-3.4.7]# jps
2717 Kafka
2285 QuorumPeerMain
2679 Kafka
2696 Kafka
2158 QuorumPeerMain
2622 ZooKeeperMain
2199 QuorumPeerMain
2908 Jps

  OK! Your local pseudo-distributed Kafka environment is now set up. Below we use Kafka's command-line tools to give the cluster a quick test.
  1: Create a topic (with 3 partitions and a replication factor of 3)

 bin/kafka-topics.sh --create --zookeeper 192.168.8.88:2181 --replication-factor 3 --partitions 3 --topic first_topic

  2: Start a consumer

bin/kafka-console-consumer.sh --zookeeper 192.168.8.88:2181 --topic first_topic --from-beginning 

  3: Start a producer

bin/kafka-console-producer.sh --broker-list 192.168.8.88:9092,192.168.8.88:9093,192.168.8.88:9094 --topic first_topic

  After running the commands above, you will see that messages sent by the producer are consumed immediately by the consumer.

6: Kafka's node structure in ZooKeeper
   ZooKeeper plays a pivotal role in Kafka: information about Kafka's brokers, consumers, and so on is stored in ZooKeeper nodes, and ZooKeeper also provides mechanisms such as dynamic load balancing for Kafka. We go through them one by one below.
6.1: Broker registration
  In Kafka, whenever a broker starts it stores its information under a ZooKeeper node. This is an ephemeral node (if you are not familiar with ZooKeeper's node types, see its official documentation), so when a broker goes down, its corresponding ephemeral node disappears.

[zk: ip:2181(CONNECTED) 0] ls /brokers
[seqid, topics, ids]
[zk: ip:2181(CONNECTED) 1] ls /brokers/topics
[testKafka, __consumer_offsets, test, consumer_offsets, myfirstTopic, first_topic, kafka_first_topic]
[zk: ip:2181(CONNECTED) 2] ls /brokers/ids
[2, 1, 0]

     As you can see, the brokers node records which brokers currently exist and which topics they host; the details of each broker can be fetched with get:

[zk: ip:2181(CONNECTED) 0] get /brokers/ids/2
{"jmx_port":-1,"timestamp":"1461828196728","endpoints":["PLAINTEXT://ip:9094"],"host":"ip","version":2,"port":9094}
cZxid = 0x490000006b
ctime = Thu Apr 28 19:23:16 GMT+12:00 2016
mZxid = 0x490000006b
mtime = Thu Apr 28 19:23:16 GMT+12:00 2016
pZxid = 0x490000006b
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x3545bbe83450003
dataLength = 135
numChildren = 0
[zk: ip:2181(CONNECTED) 1] 

  This contains each broker's IP, port, version, and other details. Each topic's partitions are distributed across different brokers:

[zk: ip:2181(CONNECTED) 1] get /brokers/topics/kafka_first_topic
{"version":1,"partitions":{"2":[0,1,2],"1":[2,0,1],"0":[1,2,0]}}
cZxid = 0x4100000236
ctime = Sat Apr 16 04:48:10 GMT+12:00 2016
mZxid = 0x4100000236
mtime = Sat Apr 16 04:48:10 GMT+12:00 2016
pZxid = 0x410000023a
cversion = 1
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 64
numChildren = 1
[zk: 192.168.8.88:2181(CONNECTED) 2]  

     This means that the topic kafka_first_topic has 3 partitions, each partition has 3 replicas, and the replicas are placed on the brokers with ids 0, 1, and 2.
6.2 Consumer registration
 As mentioned earlier, consumers belong to consumer groups. Kafka assigns a unique ID to each consumer group and also assigns an ID to each consumer:

[zk: 192.168.8.88:2181(CONNECTED) 40] ls /consumers/console-consumer-85351/ids
[console-consumer-85351_hadoop-1461831147033-3265fdb3] 

    This means that the consumer group console-consumer-85351 contains one consumer, with the ID shown above.
 Which consumer in a consumer group is consuming which partition is recorded in ZooKeeper like this:

[zk: ip:2181(CONNECTED) 4] get /consumers/console-consumer-85351/owners/kafka_first_topic/2
console-consumer-85351_hadoop-1461831147033-3265fdb3-0
cZxid = 0x490000010f
ctime = Thu Apr 28 20:12:28 GMT+12:00 2016
mZxid = 0x490000010f
mtime = Thu Apr 28 20:12:28 GMT+12:00 2016
pZxid = 0x490000010f
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x1545bbe6fb80015
dataLength = 54
numChildren = 0

  This shows that partition 2 of topic kafka_first_topic, under consumer group console-consumer-85351, is being consumed by the consumer with id console-consumer-85351_hadoop-1461831147033-3265fdb3-0. Each topic has several partitions, and for each partition ZooKeeper stores the consumed offset, so that when a consumer restarts it can resume consuming from the value recorded at that node:

[zk: ip:2181(CONNECTED) 37] get /consumers/console-consumer-85351/offsets/kafka_first_topic/0
38
cZxid = 0x4900000117
ctime = Thu Apr 28 20:13:27 GMT+12:00 2016
mZxid = 0x4900000117
mtime = Thu Apr 28 20:13:27 GMT+12:00 2016
pZxid = 0x4900000117
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 2
numChildren = 0

  This shows that consumption of partition 0 has progressed to offset 38. Note that while this offsets node is persistent, the consumer's registration node under ids is ephemeral: if I stop the consumer thread, you will find that the consumer is no longer listed under the /consumers group. A small sketch of reading this offset programmatically follows.
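  As a minimal sketch (assuming the same connection string and znode path as the zkCli session above), the stored offset can also be read with ZooKeeper's Java client:

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ConsumerOffsetReader {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Connection string and path are placeholders matching the zkCli session above
        ZooKeeper zk = new ZooKeeper("192.168.8.88:2181", 6000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) connected.countDown();
        });
        connected.await(); // wait until the ZooKeeper session is established
        byte[] data = zk.getData(
                "/consumers/console-consumer-85351/offsets/kafka_first_topic/0",
                false, null);
        System.out.println("last recorded offset: " + new String(data)); // e.g. 38
        zk.close();
    }
}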
6.3 controller
        The ZooKeeper node controller mainly stores information about the machine hosting the central controller (which you can think of as the cluster leader). The output below shows that the current controller of this cluster is the machine with broker id 0:

[zk: ip:2181(CONNECTED) 33] get /controller
{"version":1,"brokerid":0,"timestamp":"1461828122648"}
cZxid = 0x4900000007
ctime = Thu Apr 28 19:22:02 GMT+12:00 2016
mZxid = 0x4900000007
mtime = Thu Apr 28 19:22:02 GMT+12:00 2016
pZxid = 0x4900000007
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x2545bbe6fda0000
dataLength = 54
numChildren = 0

6.4: controller_epoch
      The controller_epoch node in ZooKeeper stores the number of controller (leader) elections that have taken place: each time a new controller is elected (for example, after the machine acting as controller goes down), the value is incremented by 1, and so on:

[zk: ip:2181(CONNECTED) 2] get /controller_epoch
72
cZxid = 0x3700000014
ctime = Tue Apr 05 19:09:47 GMT+12:00 2016
mZxid = 0x4900000008
mtime = Thu Apr 28 19:22:02 GMT+12:00 2016
pZxid = 0x3700000014
cversion = 0
dataVersion = 71
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 2
numChildren = 0 

6.5 Dynamic load balancing
 6.5.1 Consumer load balancing
   When a consumer registers, it uses ZooKeeper's watcher mechanism to register listeners on the other consumers in its consumer group as well as on the broker nodes.
  As soon as a consumer in the group or a broker goes down, the remaining consumers receive the watch event and trigger a consumer rebalance as needed.
 6.5.2 Producer load balancing
      When sending messages to brokers, a producer registers watchers to listen for changes to broker nodes and for the addition or removal of topics, so that it can route messages to the appropriate brokers.
