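Because of my work I deal with Kafka regularly and often read through Kafka material. Yet most blog posts about Kafka, on CSDN and plenty of other sites, do not go very deep, and misleading articles are everywhere. So I am starting this small column to record configurations, concepts, and usages that are core but not especially well known, along with some of the core source code.
Note that this column does not introduce Kafka from scratch; readers are expected to have some hands-on Kafka experience.
If I have gotten something wrong, please leave a comment and let's discuss!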
kafka.Network.UnderReplicatedPartitions
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
Number of under-replicated partitions (| ISR | < | current replicas |). Replicas that are added as part of a reassignment will not count toward this value. Alert if value is greater than 0.
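In other words: this metric reflects the number of under-replicated partitions. Each broker reports it for the partitions it leads, so the cluster-wide figure is the sum across brokers. Kafka followers pull data from the leader, but once currentTimeMs - lastCaughtUpTimeMs > replica.lag.time.max.ms, the follower is removed from the ISR (it has "fallen behind").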
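To make the |ISR| < |current replicas| condition concrete, here is a minimal sketch in Scala. It is a toy model, not Kafka's internal code: `PartitionState`, `UnderReplicated`, and `countLedBy` are hypothetical names introduced only for illustration, and the model deliberately ignores the rule that replicas added by a reassignment are not counted.

```scala
// Hypothetical, simplified view of a partition; Kafka's real classes differ.
final case class PartitionState(
  topic: String,
  partition: Int,
  leader: Int,        // broker id of the current leader
  replicas: Set[Int], // assigned replica broker ids
  isr: Set[Int]       // in-sync replica broker ids
)

object UnderReplicated {
  // A partition is under-replicated when its ISR is smaller than its replica set.
  def isUnderReplicated(p: PartitionState): Boolean =
    p.isr.size < p.replicas.size

  // Roughly what a single broker would report: only partitions it leads are counted.
  def countLedBy(brokerId: Int, partitions: Seq[PartitionState]): Int =
    partitions.count(p => p.leader == brokerId && isUnderReplicated(p))

  // Cluster-wide view: sum of the per-broker values.
  def clusterTotal(brokerIds: Seq[Int], partitions: Seq[PartitionState]): Int =
    brokerIds.map(countLedBy(_, partitions)).sum
}
```

With three partitions of which two have a shrunken ISR, `clusterTotal` returns 2, while a broker that leads only healthy partitions reports 0.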
Now look at the code. Its comment points out the two common scenarios that cause a follower to fall behind: a stuck follower and a slow follower.
/**
 * If the follower already has the same leo as the leader, it will not be
 * considered as out-of-sync, otherwise there are two cases that will be
 * handled here -
 * 1. Stuck followers: If the leo of the replica hasn't been updated for
 *    maxLagMs ms, the follower is stuck and should be removed from the ISR
 * 2. Slow followers: If the replica has not read up to the leo within the last
 *    maxLagMs ms, then the follower is lagging and should be removed from the ISR
 * Both these cases are handled by checking the lastCaughtUpTimeMs which
 * represents the last time when the replica was fully caught up.
 * If either of the above conditions is violated, that replica is considered to
 * be out of sync.
 *
 * If an ISR update is in-flight, we will return an empty set here.
 */
def getOutOfSyncReplicas(maxLagMs: Long): Set[Int] = {
  // body omitted
}
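The check described in that comment can be sketched roughly as follows. This is a simplified model under stated assumptions, not Kafka's actual implementation: `FollowerState`, `IsrCheck`, and their fields are hypothetical names, and the "ISR update is in-flight" case is not modeled at all.

```scala
// Hypothetical follower snapshot; Kafka's real replica state carries more fields.
final case class FollowerState(
  brokerId: Int,
  logEndOffset: Long,       // the follower's LEO
  lastCaughtUpTimeMs: Long  // last time the follower was fully caught up
)

object IsrCheck {
  // A follower with the same LEO as the leader is never out of sync.
  // Otherwise, both "stuck" and "slow" followers are caught by one test:
  // too much time has passed since the follower was last fully caught up.
  def isOutOfSync(f: FollowerState, leaderEndOffset: Long, nowMs: Long, maxLagMs: Long): Boolean =
    f.logEndOffset != leaderEndOffset && (nowMs - f.lastCaughtUpTimeMs) > maxLagMs

  // Broker ids of the followers that would be dropped from the ISR.
  def outOfSyncReplicas(
    followers: Seq[FollowerState],
    leaderEndOffset: Long,
    nowMs: Long,
    maxLagMs: Long
  ): Set[Int] =
    followers.filter(isOutOfSync(_, leaderEndOffset, nowMs, maxLagMs)).map(_.brokerId).toSet
}
```

Here `maxLagMs` plays the role of replica.lag.time.max.ms. A follower whose LEO matches the leader's stays in sync even if its lastCaughtUpTimeMs is old, which mirrors the first sentence of the comment.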
kafka.controller:type=KafkaController,name=OfflinePartitionsCount
Number of partitions that don’t have an active leader and are hence not writable or readable. Alert if value is greater than 0.
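This metric represents the number of partitions without a leader, as recorded by the current controller. If it is greater than 0, something abnormal is probably going on, such as a leader election in progress or a broker outage.
One thing to watch out for: the value is tracked by the controller, so on monitoring systems such as falcon the metric stays at 0 on every non-controller node.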
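Because only the controller carries a meaningful value, an alert rule has to pick the controller's reading rather than look at any single broker. The sketch below shows one way to do that. It rests on an assumption not made in the text above: that each broker's kafka.controller:type=KafkaController,name=ActiveControllerCount gauge (1 on the active controller, 0 elsewhere) is collected alongside OfflinePartitionsCount. `BrokerReading` and `OfflinePartitionsAlert` are hypothetical names, and how the readings are collected is left out.

```scala
// Hypothetical per-broker sample of the two controller gauges.
final case class BrokerReading(
  brokerId: Int,
  activeControllerCount: Int,  // assumed: 1 on the active controller, 0 elsewhere
  offlinePartitionsCount: Int
)

object OfflinePartitionsAlert {
  // The cluster-level value is whatever the active controller reports.
  // None means no broker currently claims to be the controller.
  def clusterOfflinePartitions(readings: Seq[BrokerReading]): Option[Int] =
    readings.find(_.activeControllerCount == 1).map(_.offlinePartitionsCount)

  def shouldAlert(readings: Seq[BrokerReading]): Boolean =
    clusterOfflinePartitions(readings) match {
      case Some(offline) => offline > 0
      case None          => true // no visible controller: treated as alert-worthy here
    }
}
```

Treating "no controller found" as an alert is a design choice of this sketch; a separate rule on ActiveControllerCount would work just as well.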
kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica