Elasticsearch node loss causes extremely slow real-time data import

One node crashed and could not restart on its own. Data was being imported through Logstash, and because the current day's index was configured with zero replicas, the node loss left some shards missing and kept the cluster in a red state. After the node was lost, the import speed of that index dropped sharply. Testing showed that Logstash itself was the cause: its input stage runs on one thread, the filter and output stages share another thread, and a synchronous queue buffers events between them. When something goes wrong during output, the failed events are put back onto the synchronous queue without any limit; the queued events are then routed to shards and bulk-indexed again, the ones routed to the lost shards fail again and are re-queued once more. The data therefore keeps cycling between the synchronous queue and Elasticsearch bulk requests, which slows down indexing for the entire index.
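To make that feedback loop concrete, here is a minimal single-threaded model in Python (a sketch, not Logstash's actual code; the shard routing, bulk size and timeout values are all stand-ins): bulk actions routed to the unassigned primaries fail with a retryable error, get pushed back onto the internal queue, and every bulk round that contains them sits through the shard timeout, so overall throughput collapses.

from collections import deque
import time

SHARDS = 8
LOST = {1, 5}            # the two primaries that end up UNASSIGNED in the test below
BULK_SIZE = 100
SHARD_TIMEOUT = 0.05     # stand-in for the real 1m "primary shard is not active" timeout

pending = deque(range(5000))   # pending events; routing approximated by id % SHARDS
indexed = 0
deadline = time.time() + 2.0   # fixed time budget for the simulation

while pending and time.time() < deadline:
    batch = [pending.popleft() for _ in range(min(BULK_SIZE, len(pending)))]
    failed = [e for e in batch if e % SHARDS in LOST]
    indexed += len(batch) - len(failed)
    if failed:
        time.sleep(SHARD_TIMEOUT)   # the whole bulk request waits out the timeout
        pending.extend(failed)      # retryable failures are re-queued without limit

print(f"indexed={indexed}, still circulating={len(pending)}")

With LOST empty the loop drains all 5000 events almost instantly; with two lost shards, the failed portion of each batch keeps coming back and the same time budget indexes far fewer events, which matches the slowdown observed in the test below.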

The results reproduced on a test machine are as follows:
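The shard listings in steps 1 through 4 have the shape of _cat/shards output; a call like the following (the host address is a placeholder and the requests library is assumed) reproduces that view for the test index:

import requests

ES = "http://localhost:9200"   # placeholder: point this at the actual cluster
resp = requests.get(ES + "/_cat/shards/xxx-20170925", params={"v": "true"})
print(resp.text)               # columns: index, shard, prirep, state, docs, store, ip, node

1. Normal data import: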

xxx-20170925              1     p      STARTED   24713  24.7mb xxx.7.67   node-xxx.7.67-performance_test
xxx-20170925              5     p      STARTED   24256  33.7mb xxx.7.67   node-xxx.7.67-performance_test
xxx-20170925              2     p      STARTED   24702  24.2mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925              3     p      STARTED   24626  24.2mb xxx.7.81   node-xxx.7.81-performance_test
xxx-20170925              7     p      STARTED   24916  34.2mb xxx.7.81   node-xxx.7.81-performance_test
xxx-20170925              4     p      STARTED   23970  38.2mb xxx.6.105  node-xxx.6.105-performance_test
xxx-20170925              6     p      STARTED   24786    24mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925              0     p      STARTED   24824  34.4mb xxx.6.105  node-xxx.6.105-performance_test

2. After shutting down one node

xxx-20170925              6     p      STARTED     128179 110.8mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925              1     p      UNASSIGNED                                
xxx-20170925              4     p      STARTED     128263 108.1mb xxx.6.105  node-xxx.6.105-performance_test
xxx-20170925              7     p      STARTED     128593 109.3mb xxx.7.81   node-xxx.7.81-performance_test
xxx-20170925              2     p      STARTED     128613 112.8mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925              5     p      UNASSIGNED                                
xxx-20170925              3     p      STARTED     127969 115.6mb xxx.7.81   node-xxx.7.81-performance_test
xxx-20170925              0     p      STARTED     128322 110.3mb xxx.6.105  node-xxx.6.105-performance_test
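With primaries 1 and 5 unassigned, the cluster stays red for this index, as described above; a quick health check (same placeholder host as before) confirms it:

import requests

ES = "http://localhost:9200"   # placeholder, as above
health = requests.get(ES + "/_cluster/health/xxx-20170925").json()
print(health["status"], health["unassigned_shards"])   # expected: red 2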

3. Checking the shards again after a while: the remaining shards are growing extremely slowly

xxx-20170925              6     p      STARTED     128436 111.1mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925              5     p      UNASSIGNED                                
xxx-20170925              3     p      STARTED     128231 110.9mb xxx.7.81   node-xxx.7.81-performance_test
xxx-20170925              7     p      STARTED     128814 109.6mb xxx.7.81   node-xxx.7.81-performance_test
xxx-20170925              1     p      UNASSIGNED                                
xxx-20170925              2     p      STARTED     128871 182.6mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925              4     p      STARTED     128502 108.5mb xxx.6.105  node-xxx.6.105-performance_test
xxx-20170925              0     p      STARTED     128568 109.1mb xxx.6.105  node-xxx.6.105-performance_test

The Logstash log looks like this: the bulk actions routed to the unassigned shards 1 and 5 keep failing with 503 unavailable_shards_exception and are retried:

[2017-11-21T11:04:26,780][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[xxx-20170925][5] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [xxx-20170925] containing [19] requests]"})
[2017-11-21T11:04:26,780][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[xxx-20170925][5] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [xxx-20170925] containing [19] requests]"})
[2017-11-21T11:04:26,780][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[xxx-20170925][1] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [xxx-20170925] containing [15] requests]"})
[2017-11-21T11:04:26,780][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[xxx-20170925][5] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [xxx-20170925] containing [19] requests]"})
[2017-11-21T11:04:26,784][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 503 ({"type"=>"unavailable_shards_exception", "reason"=>"[xxx-20170925][5] primary shard is not active Timeout: [1m], request: [BulkShardRequest to [xxx-20170925] containing [19] requests]"})
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Retrying individual actions
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Action
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Action
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Action
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Action
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Action
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Action
[2017-11-21T11:04:26,784][ERROR][logstash.outputs.elasticsearch] Action

4. After the data was recovered

xxx-20170925              4     p      STARTED     154764 125.3mb xxx.6.105  node-xxx.6.105-performance_test
xxx-20170925              5     p      STARTED     157936 126.4mb xxx.7.67   node-xxx.7.67-performance_test
xxx-20170925              2     p      STARTED     154945 138.9mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925              7     p      STARTED     155224 156.8mb xxx.7.81   node-xxx.7.81-performance_test
xxx-20170925              1     p      STARTED     158080 124.8mb xxx.7.67   node-xxx.7.67-performance_test
xxx-20170925              3     p      STARTED     154243 153.8mb xxx.7.81   node-xxx.7.81-performance_test
xxx-20170925              6     p      STARTED     154909 146.9mb xxx.11.131 node-xxx.11.131-performance_test
xxx-20170925              0     p      STARTED     154681   127mb xxx.6.105  node-xxx.6.105-performance_test
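The shards were only lost in the first place because the day's index had zero replicas, as noted at the top. As a sketch (not part of the original test, same placeholder host), giving the live index at least one replica keeps a primary available when a single node dies:

import requests

ES = "http://localhost:9200"   # placeholder, as above
requests.put(ES + "/xxx-20170925/_settings",
             json={"index": {"number_of_replicas": 1}})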
