We were about to do a rack migration. The ES cluster had three nodes, with allocation awareness enabled on the attribute 机房 (rack); all nodes in the cluster sat in the same rack (a historical leftover; in theory, with only one rack the setting is a no-op). The migration was carried out by adding three new nodes and then excluding the original three, at which point we found that some shards could not be migrated.
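For the record, the exclusion was applied via cluster-level shard allocation filtering; a sketch of the request, using the three old-node IPs that appear in the allocation explain output further below (whether it was set as persistent or transient does not matter here):

```json
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.exclude._ip": "10.22.170.18,10.22.170.21,10.22.170.20"
  }
}
```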
We spot-checked one index whose shard could not migrate and looked at its shard distribution.
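One easy way to pull such a distribution is the `_cat/shards` API; a sketch, with the index name taken from the explain output below:

```
GET _cat/shards/kibana1_copy.20231203?v
```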
The question: why can the replica copy of shard 0 on node gh-data-rt0728 not be migrated to node hlsc-data-rt-es0997?
The `_cluster/allocation/explain` API shows the reason (the response below keeps only the entries for the current node gh-data-rt0728 and the candidate node hlsc-data-rt-es0997):
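A sketch of the corresponding request; the parameters match the response shown next:

```json
GET _cluster/allocation/explain
{
  "index": "kibana1_copy.20231203",
  "shard": 0,
  "primary": false
}
```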
```json
{
"index": "kibana1_copy.20231203",
"shard": 0,
"primary": false,
"current_state": "started",
"current_node": {
"id": "4OKIXAODQS6TZ15Q1Q_r9Q",
"name": "gh-data-rt0728",
"transport_address": "10.22.170.21:8412",
"attributes": {
"机房": "gh"
}
},
"can_remain_on_current_node": "no",
"can_remain_decisions": [
{
"decider": "filter",
"decision": "NO",
"explanation": "node matches cluster setting [cluster.routing.allocation.exclude] filters [_ip:\"10.22.170.18 OR 10.22.170.21 OR 10.22.170.20\"]"
}
],
"node_allocation_decisions": [
{
"node_id": "bFxwzd5VRk6y0Pp_CV0g-g",
"node_name": "hlsc-data-rt-es0997",
"transport_address": "10.253.66.154:8412",
"node_attributes": {
"机房": "hlsc"
},
"node_decision": "no",
"weight_ranking": 5,
"deciders": [
{
"decider": "awareness",
"decision": "NO",
"explanation": "there are too many copies of the shard allocated to nodes with attribute [机房], there are [3] total configured shard copies for this shard id and [2] total attribute values, expected the allocated shard count per attribute [3] to be less than or equal to the upper bound of the required number of shards per attribute [2]"
}
]
}
]
}
```
The response shows that it is the awareness decider that returns NO for the allocation. Per the explanation, if this copy were allocated to hlsc-data-rt-es0997, nodes with that rack value would hold 3 copies of shard 0 in total, which exceeds the upper bound of 2. That raises a question: if the shard cannot be allocated now, why could all copies be allocated before the 3 new nodes were added? The original 3 nodes were all in a single rack then, and the 3 new nodes are all in a single rack as well.
Let's dig into the ES source to find out, focusing on how the **3** and the **2** above are computed. The logic lives in the awareness decider (`AwarenessAllocationDecider`); below is the relevant code with irrelevant parts elided. `currentNodeCount` corresponds to the **3** and `maximumNodeCount` to the **2**, and if `currentNodeCount` is greater than `maximumNodeCount` the shard cannot be allocated:
```java
// attribute value -> number of copies of this shard on nodes carrying that value
ObjectIntHashMap<String> shardPerAttribute = new ObjectIntHashMap<>();
// ... elided: shardPerAttribute is filled in from the currently assigned copies,
// including the copy being evaluated, counted against the candidate node's value ...

// total number of copies of the shard: replicas + 1 (the primary)
int shardCount = indexMetadata.getNumberOfReplicas() + 1;
// number of distinct values seen for this awareness attribute
int numberOfAttributes = nodesPerAttribute.size();

final int currentNodeCount = shardPerAttribute.get(node.node().getAttributes().get(awarenessAttribute));
final int maximumNodeCount = (shardCount + numberOfAttributes - 1) / numberOfAttributes;
if (currentNodeCount > maximumNodeCount) {
    return allocation.decision(Decision.NO, NAME,
        "there are too many copies of the shard allocated to nodes with attribute [%s], there are [%d] total configured " +
            "shard copies for this shard id and [%d] total attribute values, expected the allocated shard count per " +
            "attribute [%d] to be less than or equal to the upper bound of the required number of shards per attribute [%d]",
        awarenessAttribute,
        shardCount,
        numberOfAttributes,
        currentNodeCount,
        maximumNodeCount);
}
```
Before the rack migration, the variables were:

```
shardCount         = 3   // the index has 2 replicas
currentNodeCount   = 3   // all 3 copies sit on nodes with the same rack value
numberOfAttributes = 1   // every node is in the same rack
maximumNodeCount   = (3 + 1 - 1) / 1 = 3
```

Since currentNodeCount == maximumNodeCount (not greater), the copies were allocated normally.
After the migration began, the variables become:

```
shardCount         = 3   // the index has 2 replicas
currentNodeCount   = 3   // presumably the other two copies had already moved to hlsc nodes;
                         // placing this copy there as well would make 3 on that rack value
numberOfAttributes = 2   // old and new nodes report different rack values, gh and hlsc
maximumNodeCount   = (3 + 2 - 1) / 2 = 2
```

Since currentNodeCount > maximumNodeCount, the shard cannot migrate.
Rewriting the formula (the expression is integer ceiling division, i.e. maximumNodeCount = ceil(shardCount / numberOfAttributes)):

```
maximumNodeCount = (shardCount + numberOfAttributes - 1) / numberOfAttributes
                 = 1 + (shardCount - 1) / numberOfAttributes
                 = 1 + 2 / numberOfAttributes          // shardCount = 3
```

With numberOfAttributes = 2 the bound is 2: at most two copies may sit in the new rack, so the third copy can never leave the excluded rack and the migration stalls. To get unstuck we would need numberOfAttributes greater than 2, but then maximumNodeCount = 1, meaning every copy must sit in a different rack; with shardCount = 3 that requires available (non-excluded) nodes in at least 3 racks.
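As a sanity check on this arithmetic, here is a minimal, self-contained Java sketch; the class and method names are mine, only the bound expression is taken from the ES source:

```java
public class AwarenessBoundCheck {

    // Mirrors the decider's upper bound: integer ceiling of shardCount / numberOfAttributes.
    static int maximumNodeCount(int shardCount, int numberOfAttributes) {
        return (shardCount + numberOfAttributes - 1) / numberOfAttributes;
    }

    public static void main(String[] args) {
        int shardCount = 3; // 1 primary + 2 replicas
        for (int numberOfAttributes = 1; numberOfAttributes <= 4; numberOfAttributes++) {
            System.out.printf("numberOfAttributes=%d -> maximumNodeCount=%d%n",
                numberOfAttributes, maximumNodeCount(shardCount, numberOfAttributes));
        }
        // Output: 1 -> 3, 2 -> 2, 3 -> 1, 4 -> 1
        // Before the migration: currentNodeCount = 3 <= 3, allocation allowed.
        // After the migration:  currentNodeCount = 3 >  2, allocation rejected.
    }
}
```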