Handling an ES Node Restart Failure Caused by Empty Files

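Background

Every node in our ES cluster is covered by process monitoring: if the Elasticsearch process disappears, the monitor starts it again automatically, and the restart normally succeeds. Today, however, one node kept raising a "no process" alert, so the only option was to log in to the node and read its logs.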

Investigation

The logs showed that the ES node failed on restart with the following error:

4) Error injecting constructor, ElasticsearchException[java.io.IOException: failed to read [id:2, legacy:false, file:/data2/search/data/nodes/0/indices/QNpDowX_TwiIiqZlB9e92g/_state/state-2.st]]; nested: IOException[failed to read [id:2, legacy:false, file:/data2/search/data/nodes/0/indices/QNpDowX_TwiIiqZlB9e92g/_state/state-2.st]]; nested: IllegalStateException[class org.apache.lucene.store.BufferedChecksumIndexInput cannot seek backwards (pos=-16 getFilePointer()=0)];
  at org.elasticsearch.gateway.GatewayMetaState.(Unknown Source)
  while locating org.elasticsearch.gateway.GatewayMetaState
Caused by: ElasticsearchException[java.io.IOException: failed to read [id:2, legacy:false, file:/data2/search/data/nodes/0/indices/QNpDowX_TwiIiqZlB9e92g/_state/state-2.st]]; nested: IOException[failed to read [id:2, legacy:false, file:/data2/search/data/nodes/0/indices/QNpDowX_TwiIiqZlB9e92g/_state/state-2.st]]; nested: IllegalStateException[class org.apache.lucene.store.BufferedChecksumIndexInput cannot seek backwards (pos=-16 getFilePointer()=0)];

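The error makes it clear that, while restarting, the ES node hit an exception reading /data2/search/data/nodes/0/indices/QNpDowX_TwiIiqZlB9e92g/_state/state-2.st. Checking the file showed that it was empty. The reader tried to seek to pos=-16, presumably 16 bytes back from the end of the file where Lucene expects its checksum footer, which is impossible in a zero-length file. The read therefore failed, and the process exited during startup.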

ll /data2/search/data/nodes/0/indices/QNpDowX_TwiIiqZlB9e92g/_state/state-2.st
-rw-rw-r-- 1 search search 0 Oct  1 22:53 /data2/search/data/nodes/0/indices/QNpDowX_TwiIiqZlB9e92g/_state/state-2.st

Resolution

Find all the empty state files, delete them, and restart the ES node. After that the node came up normally.

find /data*/search/data/nodes/0/indices/ -type f -name 'state-*.st' -size 0 -exec ls -l {} +
-rw-rw-r-- 1 search search    0 Oct  1 22:53 /data2/search/data/nodes/0/indices/QNpDowX_TwiIiqZlB9e92g/_state/state-2.st
-rw-rw-r-- 1 search search    0 Oct  1 22:53 /data3/search/data/nodes/0/indices/QNpDowX_TwiIiqZlB9e92g/_state/state-2.st
-rw-rw-r-- 1 search search    0 Oct  1 22:53 /data4/search/data/nodes/0/indices/QNpDowX_TwiIiqZlB9e92g/_state/state-1.st
-rw-rw-r-- 1 search search    0 Oct  1 22:53 /data4/search/data/nodes/0/indices/QNpDowX_TwiIiqZlB9e92g/_state/state-2.st
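
Deleting metadata files is hard to undo, so a more cautious variant is to move the empty files aside before restarting. The following is only a minimal sketch, not the exact procedure used in this incident: the function name `quarantine_empty_state_files` and the backup directory are hypothetical, and it assumes the ES process on the node is stopped while it runs.

```shell
#!/bin/sh
# Sketch: move zero-length state-*.st files into a backup directory
# instead of deleting them outright.
# Usage: quarantine_empty_state_files ROOT_DIR BACKUP_DIR
# (the function name and both arguments are illustrative, not part of ES)
quarantine_empty_state_files() {
    mkdir -p "$2" || return 1
    # -size 0 matches only empty files; -exec ... {} + passes the names
    # safely even if they contain spaces.
    find "$1" -type f -name 'state-*.st' -size 0 -exec sh -c '
        backup="$1"; shift
        for f in "$@"; do
            # Flatten the full path into the file name so that files from
            # different data directories do not overwrite each other.
            dest="$backup/$(printf "%s" "$f" | tr "/" "_")"
            mv -- "$f" "$dest" && printf "moved %s\n" "$f"
        done
    ' sh "$2" {} +
}
```

For example, `quarantine_empty_state_files /data2/search/data/nodes/0/indices /tmp/empty-st-backup` would move the empty files found under that one data path (the backup location here is an arbitrary choice). Once the node has restarted cleanly, the backup directory can be removed.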

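Root Cause

The logs showed no anomalies on the ES node before the restart. It later turned out that a hardware fault on the machine had left the .st files empty, which in turn caused the restart to fail.
There is no really good fix for this kind of hardware fault. The only protection is to keep additional copies of the ES data, so that losing a machine does not mean losing data.
If you no longer trust the machine, the only option is to remove it from the ES cluster.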
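
Both precautions can be applied through the standard Elasticsearch settings APIs. The sketch below is an illustration under assumptions rather than what was run in this incident: the endpoint `http://localhost:9200`, the helper function names, and any index name or IP address passed to them are placeholders, and the `Content-Type` header is only mandatory on newer ES versions.

```shell
#!/bin/sh
# Assumption: the cluster's HTTP endpoint; override ES_URL for your environment.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the JSON body that sets the replica count of an index.
replica_body() {
    printf '{"index":{"number_of_replicas":%d}}' "$1"
}

# Build the JSON body that asks the allocator to move shards off one IP.
exclude_ip_body() {
    printf '{"transient":{"cluster.routing.allocation.exclude._ip":"%s"}}' "$1"
}

# set_replicas INDEX COUNT: keep COUNT extra copies of every shard in INDEX.
set_replicas() {
    curl -s -X PUT "$ES_URL/$1/_settings" \
        -H 'Content-Type: application/json' -d "$(replica_body "$2")"
}

# exclude_node_ip IP: drain shards from a suspect machine before removing it.
exclude_node_ip() {
    curl -s -X PUT "$ES_URL/_cluster/settings" \
        -H 'Content-Type: application/json' -d "$(exclude_ip_body "$1")"
}
```

A call such as `set_replicas my-index 2` followed by `exclude_node_ip 10.0.0.12` (both values hypothetical) would first add redundancy and then let the cluster relocate shards away from the suspect machine, after which the node can be shut down without losing data.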
