Elasticsearch indexing error: GC overhead causes data nodes to leave the cluster

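 Recently, while bulk-loading a large volume of data into ES, I ran into a problem: with small data volumes indexing works normally, but with large volumes some data nodes drop out of the cluster after indexing has been running for a while. The logs show: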

[2018-04-09T21:08:48,481][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch_25] [gc][young][4312117][1523595] duration [11.4s], collections [1]/[1.6s], total [11.4s]/[12.9m], memory [27.7gb]->[15.3gb]/[31.8gb], all_pools {[young] [17.4gb]->[13.4mb]/[1.4gb]}{[survivor] [46mb]->[191.3mb]/[191.3mb]}{[old] [10gb]->[14.6gb]/[30.1gb]}             

[2018-04-09T21:08:48,481][WARN ][o.e.m.j.JvmGcMonitorService] [elasticsearch_25] [gc][4312117] overhead, spent [11.4s] collecting in the last [12s]
[2018-04-09T21:08:54,654][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch_25] [gc][4312123] overhead, spent [412ms] collecting in the last [1.1s]
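The overhead lines above can be scanned for automatically. Here is a minimal Python sketch that pulls the time spent in GC and the observation window out of such a line and computes their ratio. The regular expression is inferred from the two log lines shown here, and the function name `gc_overhead_ratio` is made up for illustration, so treat both as assumptions rather than a complete description of the log format.

```python
import re

# Pattern inferred from the JvmGcMonitorService "overhead" lines above, e.g.
# "[gc][4312117] overhead, spent [11.4s] collecting in the last [12s]"
OVERHEAD_RE = re.compile(
    r"\[gc\]\[\d+\] overhead, spent \[([\d.]+)(ms|s|m)\] "
    r"collecting in the last \[([\d.]+)(ms|s|m)\]"
)

# Conversion factors for the time units that appear in these log lines.
UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0}


def gc_overhead_ratio(line):
    """Return (spent_seconds, window_seconds, ratio), or None if the line
    is not a GC overhead message."""
    m = OVERHEAD_RE.search(line)
    if m is None:
        return None
    spent = float(m.group(1)) * UNIT_SECONDS[m.group(2)]
    window = float(m.group(3)) * UNIT_SECONDS[m.group(4)]
    return spent, window, spent / window


if __name__ == "__main__":
    lines = [
        "[2018-04-09T21:08:48,481][WARN ][o.e.m.j.JvmGcMonitorService] "
        "[elasticsearch_25] [gc][4312117] overhead, spent [11.4s] "
        "collecting in the last [12s]",
        "[2018-04-09T21:08:54,654][INFO ][o.e.m.j.JvmGcMonitorService] "
        "[elasticsearch_25] [gc][4312123] overhead, spent [412ms] "
        "collecting in the last [1.1s]",
    ]
    for line in lines:
        result = gc_overhead_ratio(line)
        if result is not None:
            spent, window, ratio = result
            print(f"spent {spent:.3f}s of {window:.3f}s in GC ({ratio:.0%})")
```

For the WARN line, the node spent 11.4s out of a 12s window collecting garbage, which means it was doing almost no useful work during that period.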

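 Clearly, the JVM's GC pauses are far too long (the log above shows a single collection taking 11.4s), which has a lot to do with the heap size being set to 32GB: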

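    ES memory usage and GC metrics: by default, the master node checks the state of the other nodes every 30 seconds; if garbage collection on any node takes longer than 30 seconds (Garbage collection duration), the master node will consider that node to have left the cluster.

    Setting the heap too large leads to long GC times, and these long stop-the-world pauses can make the cluster wrongly conclude that the node has left.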

However, if the heap size is set too small, GC runs too frequently, which hurts ES indexing and search performance.

The official documentation https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html and the blog post http://www.cnblogs.com/bonelee/p/8063915.html describe the following fault-detection settings:

Setting          Description
ping_interval    How often a node gets pinged. Defaults to 1s.
ping_timeout     How long to wait for a ping response, defaults to 30s.
ping_retries     How many ping failures / timeouts cause a node to be considered failed. Defaults to 3.

 So, increasing ping_timeout and the number of ping_retries keeps nodes from being wrongly dropped from the cluster and gives them enough time to finish a full GC.

discovery.zen.fd.ping_timeout: 1000s
discovery.zen.fd.ping_retries: 10
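To get a feel for what these values imply, here is a rough Python estimate of how long an unresponsive node can stay in the cluster before it is marked as failed. It assumes that every retry waits the full ping_timeout and that the retries follow one another back to back, ignoring ping_interval; the helper name `worst_case_detection_seconds` is hypothetical, and the real fault-detection timing may differ.

```python
def worst_case_detection_seconds(ping_timeout_s, ping_retries):
    """Rough upper bound on failure detection time.

    Assumption: each of the ping_retries attempts waits the full
    ping_timeout before it counts as a failure.
    """
    return ping_timeout_s * ping_retries


# Defaults quoted from the settings table: 30s timeout, 3 retries.
default_bound = worst_case_detection_seconds(30, 3)

# Values used in the configuration above: 1000s timeout, 10 retries.
tuned_bound = worst_case_detection_seconds(1000, 10)

print(f"defaults: about {default_bound}s")
print(f"tuned:    about {tuned_bound}s ({tuned_bound / 3600:.1f} hours)")
```

Under these assumptions the tuned settings tolerate very long GC pauses, but a node that has really crashed could also take a few hours to be detected, so somewhat smaller values may be a better compromise.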

 
