Fixing Elasticsearch shards that fail to start because of TranslogCorruptedException

While testing Elasticsearch I ran into a corrupted-translog exception. This post records how I fixed it.

1. Problem

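A single node held 800 million+ documents in one index with 20+ fields. Data was being written continuously through bulk requests with a bulk size of 50,000 (bulk.size=5W) when the machine lost power unexpectedly and went down.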

After the machine was repaired and ES was restarted, a TranslogCorruptedException appeared:

[2015-01-06 16:12:34,061][WARN ][indices.cluster          ] [node_141] [ips][4] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [ips][4] failed to recover shard
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:287)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
    at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:70)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:257)
    ... 4 more
Caused by: java.io.EOFException
    at org.elasticsearch.common.io.stream.InputStreamStreamInput.readBytes(InputStreamStreamInput.java:53)
    at org.elasticsearch.index.translog.BufferedChecksumStreamInput.readBytes(BufferedChecksumStreamInput.java:55)
    at org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:86)
    at org.elasticsearch.common.io.stream.StreamInput.readBytesReference(StreamInput.java:74)
    at org.elasticsearch.index.translog.Translog$Create.readFrom(Translog.java:353)
    at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
    ... 5 more
[2015-01-06 16:12:34,062][DEBUG][index.service            ] [node_141] [ips] [4] closing... (reason: [recovery failure [IndexShardGatewayRecoveryException[[ips][4] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: EOFException; ]])
[2015-01-06 16:12:34,062][DEBUG][index.shard.service      ] [node_141] [ips][4] state: [RECOVERING]->[CLOSED], reason [recovery failure [IndexShardGatewayRecoveryException[[ips][4] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: EOFException; ]]

The log reported that four shards failed to start, and bulk writes to the index failed:

Primary shard is not active or isn't assigned is a known node. Timeout: [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@372c22f5]]

2. Solution

I looked at several ways to repair this, including Lucene's CheckIndex repair tool.

The official description of CheckIndex: "Basic tool and API to check the health of an index and write a new segments file that removes reference to problematic segments."

Using it means the data in the damaged segments is lost.


I wanted a fix that loses as little data as possible, and found a similar problem on the Elasticsearch Google Group: "ES failed to recover after crash".

The solution Motov gave:

- shut down elasticsearch cluster 
- find all shards that cannot recover by searching log file
- for each shard move its non-zero length translog file into a temporary directory (see explanation below)
- start elasticsearch cluster
- if you see messages for other shards - repeat

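In other words:
shut down the cluster --> find the shards that cannot start --> move those shards' translog files out of the way (keep a backup) --> restart the ES cluster
If some shards still fail, repeat the process.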

I tried clearing the translog of the problem shards, and sure enough every ES shard then started successfully.
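
The "find the shards that cannot start" step can be scripted rather than done by eye. Below is a minimal Python sketch that scans log lines for the "failed to start shard" message shown in the log above and collects the affected index/shard pairs. The regular expression and the function name find_failed_shards are my own assumptions based on that log format, not part of Elasticsearch, so adjust them if your log layout differs.

```python
import re

# Matches log lines such as:
# [2015-01-06 16:12:34,061][WARN ][indices.cluster ] [node_141] [ips][4] failed to start shard
# The pattern is an assumption derived from the log format shown in this post.
FAILED_SHARD = re.compile(
    r"\[(?P<node>[^\]]+)\]\s*"
    r"\[\s*(?P<index>[^\]]+?)\s*\]"
    r"\[\s*(?P<shard>\d+)\s*\]\s*failed to start shard"
)


def find_failed_shards(log_lines):
    """Return a sorted list of unique (index, shard) pairs that failed to start."""
    failed = set()
    for line in log_lines:
        match = FAILED_SHARD.search(line)
        if match:
            failed.add((match.group("index"), int(match.group("shard"))))
    return sorted(failed)


if __name__ == "__main__":
    sample = [
        "[2015-01-06 16:12:34,061][WARN ][indices.cluster          ] "
        "[node_141] [ips][4] failed to start shard",
        "org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: "
        "[ips][4] failed to recover shard",
    ]
    print(find_failed_shards(sample))  # prints [('ips', 4)]
```

In practice you would pass an open log file instead of the sample list, since iterating over a file object yields its lines.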

3. Analysis and summary

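The ES translog records every change made to ES and is a key component for data durability and recovery.
If the machine goes down while the translog is being written, the write never finishes normally and the translog file does not end cleanly,
which leads to the EOFException during recovery.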
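
To make the failure mode concrete, here is a small Python sketch of how a log that is cut off mid-write produces an end-of-file error on replay. The record layout used here (a 4-byte length followed by the payload) is a simplified assumption for illustration only and is not the real translog file format; encode and read_records are hypothetical names.

```python
import io
import struct


def encode(payload):
    """Encode one record as a 4-byte big-endian length plus the payload bytes."""
    return struct.pack(">I", len(payload)) + payload


def read_records(stream):
    """Yield payloads from a stream of length-prefixed records.

    Illustrative format only (NOT the real translog layout).
    Raises EOFError when the stream ends in the middle of a record.
    """
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of file: the last record was complete
        if len(header) < 4:
            raise EOFError("truncated record header")
        (length,) = struct.unpack(">I", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise EOFError(f"expected {length} payload bytes, got {len(payload)}")
        yield payload


if __name__ == "__main__":
    complete = encode(b"doc-1") + encode(b"doc-2")
    print(list(read_records(io.BytesIO(complete))))

    # Simulate a power loss: the last 3 bytes never reached the disk.
    truncated = complete[:-3]
    try:
        list(read_records(io.BytesIO(truncated)))
    except EOFError as err:
        print("replay failed:", err)
```

The first record in the truncated stream still reads correctly. Only the final, partially written record is unreadable, which matches the observation that moving the translog aside loses just the most recent operations.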


Addendum: Motov's full answer:

In nel's case it was corrupted transaction log. When you run out of disk space sometimes the last transaction cannot be fully written into transaction log and then it fails on recovery. If you see exactly the same error messages, you can try the following:

- shut down elasticsearch cluster
- find all shards that cannot recover by searching log file
- for each shard move its non-zero length translog file into a temporary directory (see explanation below)
- start elasticsearch cluster
- if you see messages for other shards - repeat

If you see message like this:

[2012-06-22 17:36:17,165][WARN ][indices.cluster          ] [Cat-Man] [myindex][1] failed to start shard

It means that it cannot recover shard 1 of the index myindex on the node Cat-Man. If you take a look at data/elasticsearch/nodes/0/indices/myindex/1/translog directory, you will find files like this: translog-123456677899 or translog-123456677899.recovering. One of them will have non-zero length. Move it to a temporary directory and try starting the server.

The transaction log files that you will be moving out contain your most recently updated and indexed documents. So, these updates will be lost as a result of this operations, but you should be able to recover the rest of your data.
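
The "move the non-zero length translog file" step can be sketched in Python as follows. The directory layout follows the path in Motov's answer (nodes/0/indices/<index>/<shard>/translog under the data directory); the function name move_translogs, the backup location, and the fixed node ordinal 0 are assumptions for illustration. Run something like this only while Elasticsearch is shut down, and keep the moved files until you are sure you no longer need them.

```python
import shutil
from pathlib import Path


def move_translogs(data_root, index, shard, backup_root):
    """Move the non-empty translog files of one shard into backup_root.

    data_root is the cluster data directory, e.g. data/elasticsearch.
    The nodes/0/indices/... layout is taken from the quoted answer and
    may differ between Elasticsearch versions (assumption).
    Returns the list of destination paths of the files that were moved.
    """
    translog_dir = (
        Path(data_root) / "nodes" / "0" / "indices" / index / str(shard) / "translog"
    )
    target_dir = Path(backup_root) / index / str(shard)
    target_dir.mkdir(parents=True, exist_ok=True)

    moved = []
    for path in sorted(translog_dir.glob("translog-*")):
        # Zero-length files are left in place, as the answer suggests.
        if path.is_file() and path.stat().st_size > 0:
            destination = target_dir / path.name
            shutil.move(str(path), str(destination))
            moved.append(destination)
    return moved
```

For the shard in this post the call would look like move_translogs("data/elasticsearch", "ips", 4, "/tmp/translog-backup"), where both paths are placeholders for your own data and backup directories.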
