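When an Elasticsearch (ES) cluster turns yellow, cerebro shows the reason directly. With other tools, you can check the cluster status through the health API, GET /_cluster/health.
Calling the health API GET /_cluster/health returns the following: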
{
"cluster_name" : "troll*",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : ***,
"number_of_data_nodes" : ***,
"active_primary_shards" : ***,
"active_shards" : ***,
"relocating_shards" : ***,
"initializing_shards" : ***,
"unassigned_shards" : ***, // note this field
"delayed_unassigned_shards" : ***,
"number_of_pending_tasks" : ***,
"number_of_in_flight_fetch" : ***,
"task_max_waiting_in_queue_millis" : ***,
"active_shards_percent_as_number" : ***
}
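As a rough illustration of how to read this response, the sketch below turns a parsed health response into a one-line summary. The function name `summarize_health` and the sample values are assumptions made for this example, not part of the original walkthrough; only the `status` and `unassigned_shards` fields from the response above are used.

```python
import json


def summarize_health(health: dict) -> str:
    """Return a one-line summary of a parsed /_cluster/health response."""
    status = health["status"]
    unassigned = health["unassigned_shards"]
    if status == "green":
        return "green: all primary and replica shards are allocated"
    if status == "yellow":
        # Yellow means every primary is allocated, so the unassigned shards are replicas.
        return f"yellow: {unassigned} replica shard(s) unassigned"
    # Any other status (red) means at least one primary shard is unassigned.
    return f"{status}: {unassigned} shard(s) unassigned, including at least one primary"


# Made-up sample values, for illustration only.
sample = json.loads('{"status": "yellow", "unassigned_shards": 5}')
print(summarize_health(sample))
```

In practice the dictionary would come from whatever HTTP client you already use to call the cluster; that part is left out here on purpose.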
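The response of /_cluster/health is explained as follows:
# Check the cluster health status
GET /_cluster/health
Check the state of the cluster's shards, paying particular attention to unassigned_shards, the number of shards (replicas, when the status is yellow) that have not been allocated.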
{
"cluster_name" : "*******",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : *******,
"number_of_data_nodes" : *******,
"active_primary_shards" : *******,
"active_shards" : *******,
"relocating_shards" : *******,
"initializing_shards" : *******,
"unassigned_shards" : *******,
"delayed_unassigned_shards" : *******,
"number_of_pending_tasks" : *******,
"number_of_in_flight_fetch" : *******,
"task_max_waiting_in_queue_millis" : *******,
"active_shards_percent_as_number" : *******
}
# Check the indices
GET _cat/indices
Find the abnormal index in the output:
yellow open index_name ***** ***** ***** ***** ***** ***** *****
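# Check the cause
GET /_cluster/allocation/explain
This shows why a shard is unassigned. Here the reported details are: unassigned, node_left, "the shard cannot be allocated to the same node on which a copy of the shard already exists", and "cannot allocate because allocation is not permitted to any of the nodes". In this case a node left the cluster and every remaining node already holds a copy of the shard, so the replica cannot be allocated.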
{
"index" : "",
"shard" : "",
"primary" : "",
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "NODE_LEFT",
"at" : "2020-05-15T06:12:23.967Z",
"details" : "node_left [KyZROB7BSASwY0i3r7q3nw]",
"last_allocation_status" : "no_attempt"
},
"can_allocate" : "no",
"allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
"node_allocation_decisions" : [
{
"node_id" : "FkwTKuMISlG88uNtelHQbQ",
"node_name" : "es7_01",
"transport_address" : "172.21.0.6:9300",
"node_attributes" : {
"ml.machine_memory" : "12566077440",
"ml.max_open_jobs" : "20",
"xpack.installed" : "true"
},
"node_decision" : "no",
"deciders" : [
{
"decider" : "same_shard",
"decision" : "NO",
"explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[]][0], node[FkwTKuMISlG88uNtelHQbQ], [P], s[STARTED], a[id=l_k948LiTcSqjhp8PRKqVQ]]"
}
]
},
{
"node_id" : "mjNvBmkASwq0Dx6W5028Uw",
"node_name" : "es7_03",
"transport_address" : "172.21.。:9300",
"node_attributes" : {
"ml.machine_memory" : "12566077440",
"ml.max_open_jobs" : "20",
"xpack.installed" : "true"
},
"node_decision" : "no",
"deciders" : [
{
"decider" : "same_shard",
"decision" : "NO",
"explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[******]][0], node[mjNvBmkASwq0Dx6W5028Uw], [R], s[STARTED], a[id=lS8fqbDoRA-ju6QW5psnjA]]"
}
]
}
]
}
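To make the explain output easier to scan, the sketch below pulls out every decider that answered NO, per node. The helper name `blocking_deciders` is hypothetical, and the sample dictionary is a trimmed copy of the response above with the explanation text shortened; it only assumes the `node_allocation_decisions`, `deciders`, `decider`, and `decision` keys shown there.

```python
def blocking_deciders(explain: dict) -> list:
    """Collect (node_name, decider) pairs for every NO decision in an
    already-parsed /_cluster/allocation/explain response."""
    blocked = []
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                blocked.append((node.get("node_name"), decider.get("decider")))
    return blocked


# Trimmed copy of the response above; explanation text shortened for brevity.
sample = {
    "current_state": "unassigned",
    "node_allocation_decisions": [
        {"node_name": "es7_01",
         "deciders": [{"decider": "same_shard", "decision": "NO",
                       "explanation": "a copy of the shard already exists"}]},
        {"node_name": "es7_03",
         "deciders": [{"decider": "same_shard", "decision": "NO",
                       "explanation": "a copy of the shard already exists"}]},
    ],
}

for node_name, decider in blocking_deciders(sample):
    print(node_name, decider)
```

When every node is blocked by `same_shard`, as here, the cluster simply has fewer data nodes than shard copies to place.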
# List index information to find the abnormal index
GET /_cat/indices?v
Response:
# health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
# green open ** D90ToWRGTpyeJAIy2ZVCvw *** *** *** *** *** ***
# yellow open ** hXI3lFOlSVi6gnqREZzEwQ *** *** *** *** *** ***
# green open .kibana_task_manager_1 akJZg3QkRta-oGH8BEfhXA *** *** *** *** *** ***
# green open .apm-agent-configuration f5ftL0VISRm36KXnN3QtPQ *** *** *** *** *** ***
# green open .kibana_1 d5k_3pOkRSe95Cf-dMo0SQ *** *** *** *** *** ***
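The output above shows that the second index is abnormal: its health is yellow. In Elasticsearch, a yellow status means that all primary shards are allocated but one or more replica shards are not. If a node in the cluster fails, some data may be unavailable until that node is repaired. In that case, resetting the number of replicas is enough.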
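Picking out the non-green rows by eye gets tedious on larger clusters. The following is a minimal sketch that parses the text output of `_cat/indices?v`. The function name `non_green_indices` and the sample index names are made up for illustration, and it assumes every listed index is open, so that each row has a value in every column.

```python
def non_green_indices(cat_output: str) -> list:
    """Return (index, health) pairs for rows whose health is not green.

    Expects the text returned by GET /_cat/indices?v, header line included.
    """
    lines = [ln.split() for ln in cat_output.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    health_col = header.index("health")
    index_col = header.index("index")
    return [(row[index_col], row[health_col])
            for row in rows if row[health_col] != "green"]


# Made-up sample output, for illustration only.
sample = """
health status index  uuid pri rep docs.count docs.deleted store.size pri.store.size
green  open   logs-a aaaa 1   1   10         0            1kb        1kb
yellow open   logs-b bbbb 5   3   10         0            1kb        1kb
"""

print(non_green_indices(sample))
```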
Check the health of the ES cluster with GET /_cluster/health
The response is as follows:
{
"cluster_name" : "troll*",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : ***,
"number_of_data_nodes" : ***,
"active_primary_shards" : ***,
"active_shards" : ***,
"relocating_shards" : ***,
"initializing_shards" : ***,
"unassigned_shards" : ***, // note this field
"delayed_unassigned_shards" : ***,
"number_of_pending_tasks" : ***,
"number_of_in_flight_fetch" : ***,
"task_max_waiting_in_queue_millis" : ***,
"active_shards_percent_as_number" : ***
}
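Comparing the response against the field descriptions in the official documentation (as introduced above) shows that some replica shards have not been allocated.
Check the settings of the yellow index in the ES cluster
# View the index settings
GET /***/_settings
The response is as follows: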
{
"***" : {
"settings" : {
"index" : {
"creation_date" : "***",
"number_of_shards" : "***",
"number_of_replicas" : "***", // note the replica count here
"uuid" : "hXI3lFOlSVi6gnqREZzEwQ",
"version" : {
"created" : "***"
},
"provided_name" : "***"
}
}
}
}
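Suppose number_of_replicas is 3 here, which means each primary shard expects 3 replica copies. How many of them remain unassigned depends on the situation, so we need to analyze case by case: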
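# Reindex
POST _reindex
{
"source": {
"index": "old_index_name"
},
"dest": {
"index": "new_index_name"
}
}
# View the settings of the new index
GET /new_index_name
# Delete the old index
DELETE /old_index_name
# Create an alias so the old name points to the new index
POST /_aliases
{
"actions": [
{
"add": {
"index": "new_index_name",
"alias": "old_index_name"
}
}
]
}
# Reset the index replica count
PUT index_name/_settings
{
"number_of_replicas" : **
}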
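On a production cluster that uses a delayed replica allocation policy, replica allocation has to be started manually to avoid the risk of data loss.
During reindexing or heavy writes, the yellow status is temporary. Observe the cluster for a while before deciding whether something is actually wrong.
When the configured replica count is greater than or equal to the number of data nodes, each shard can have at most (number of nodes - 1) replicas, because a replica is never placed on a node that already holds a copy of the same shard. The number of replicas that cannot be allocated is then primary shards * (replicas - (nodes - 1)). For example, with 3 nodes, 5 primary shards, and 3 replicas, the number of replicas that cannot be allocated is 5 * (3 - (3 - 1)) = 5. In that case, resetting the index's replica count is all that is needed: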
# Reset the index replica count
PUT index_name/_settings
{
"number_of_replicas" : 2
}
The result is:
{
"acknowledged" : true
}
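The arithmetic behind choosing 2 replicas for a 3-node cluster can be sketched as a small function. The name `unassignable_replicas` and its parameters are introduced here only for illustration, and the sketch assumes every data node is eligible to hold any shard (no allocation filtering or disk watermarks in play).

```python
def unassignable_replicas(data_nodes: int, primaries: int, replicas: int) -> int:
    """Number of replica shards that can never be allocated.

    A node holds at most one copy of a given shard, so each primary
    can have at most (data_nodes - 1) allocated replicas.
    """
    max_replicas_per_shard = data_nodes - 1
    excess_per_shard = max(0, replicas - max_replicas_per_shard)
    return primaries * excess_per_shard


# 3 nodes, 5 primaries, 3 replicas: 5 * (3 - (3 - 1)) = 5 replicas stay unassigned.
print(unassignable_replicas(data_nodes=3, primaries=5, replicas=3))
# With the replica count lowered to 2, every replica fits.
print(unassignable_replicas(data_nodes=3, primaries=5, replicas=2))
```

A result of 0 means the replica count fits the cluster; any positive value predicts a permanently yellow index until nodes are added or the replica count is lowered.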
View the modified index settings
# View the index settings
GET /***/_settings
After resetting the replica count, the latest settings are as follows:
{
"***" : {
"settings" : {
"index" : {
"creation_date" : "***",
"number_of_shards" : "***",
"number_of_replicas" : "2", // the replica count has now changed
"uuid" : "hXI3lFOlSVi6gnqREZzEwQ",
"version" : {
"created" : "***"
},
"provided_name" : "***"
}
}
}
}
Now check the cluster health again:
{
"cluster_name" : "troll*",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : ***,
"number_of_data_nodes" : ***,
"active_primary_shards" : ***,
"active_shards" : ***,
"relocating_shards" : ***,
"initializing_shards" : ***,
"unassigned_shards" : ***, // note this field
"delayed_unassigned_shards" : ***,
"number_of_pending_tasks" : ***,
"number_of_in_flight_fetch" : ***,
"task_max_waiting_in_queue_millis" : ***,
"active_shards_percent_as_number" : ***
}