ES:一次分片设计问题导致的故障

### 现象:

1. 单节点CPU持续高

ES:一次分片设计问题导致的故障_第1张图片

 2.写入骤降

ES:一次分片设计问题导致的故障_第2张图片

3.线程池队列积压,但没有reject

ES:一次分片设计问题导致的故障_第3张图片

 4.使用方没有记录日志

### 排查

1.ES监控

只能看到相应的结果指标,无法反应出原因。

2.ES日志:大量日志打印相关异常(routate等调用栈)

core.appender.OutputStreamManager.writeToDestination(OutputStreamManager.java:263)
at org.apache.logging.log4j.core.appender.FileManager.writeToDestination

3.查询CPU的使用,GET _nodes/hot_threads

35.3% (176.7ms out of 500ms) cpu usage by thread 'elasticsearch[xxxxx-es-hot2-13][write][T#10]'
     10/10 snapshots sharing following 179 elements
       app//org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.runWithPrimaryShardReference(TransportReplicationAction.java:433)
       app//org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$doRun$0(TransportReplicationAction.java:374)
       app//org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction$$Lambda$3657/0x0000000800d2f440.accept(Unknown Source)
       app//org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61)
       app//org.elasticsearch.index.shard.IndexShard.lambda$wrapPrimaryOperationPermitListener$14(IndexShard.java:2588)
       app//org.elasticsearch.index.shard.IndexShard$$Lambda$3659/0x0000000800d2fc40.accept(Unknown Source)
       app//org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61)
       app//org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:273)
       app//org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:240)
       app//org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2563)
       app//org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:996)
       app//org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:370)





....



35.0% (174.7ms out of 500ms) cpu usage by thread 'elasticsearch[xxxxxx-es-hot2-13][write][T#5]'
     5/10 snapshots sharing following 216 elements
       app//org.apache.logging.log4j.core.layout.TextEncoderHelper.encodeChunkedText(TextEncoderHelper.java:146)
       app//org.apache.logging.log4j.core.layout.TextEncoderHelper.encodeText(TextEncoderHelper.java:58)
       app//org.apache.logging.log4j.core.layout.StringBuilderEncoder.encode(StringBuilderEncoder.java:68)
       app//org.apache.logging.log4j.core.layout.StringBuilderEncoder.encode(StringBuilderEncoder.java:32)
       app//org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:220)
       app//org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:58)
       app//org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.directEncodeEvent(AbstractOutputStreamAppender.java:177)
       app//org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.tryAppend(AbstractOutputStreamAppender.java:170)
       app//org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputStreamAppender.java:161)
       app//org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156)
       app//org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129)
       app//org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120)
       app//org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)

“CPU高” 和写入、日志打印相关,无法获取更详细的信息,且由于瞬时抓取,也并不非常精准。

4.火焰图

ES:一次分片设计问题导致的故障_第4张图片

大致确认和日志相关。

5. 根据以往经验,可能和单分片doc数量限制相关

6.继续搜索日志,确认是单分片超过限制

2023-08-21 02:31:10,215 elasticsearch[xxxx-es-hot2-13][write][T#1] ERROR Recovering from StringBuilderEncoder.encode('[2023-08-21T02:31:10,201][DEBUG][o.e.a.b.TransportShardBulkAction] [xxxxx-es-hot2-13][cp0001001_2023_08][0] failed to execute bulk item (index) index {[xxxxx001_2023_08][event_xxx][xxxxxxxxx], source[{"id":"9f61ef55-0334-4363-9bcf-xxxx","rowkey":"xxxxxxd83ce110","column01":"1007922682","datachangelasttime":1692584511322,"column19":"xxx","column20":"80,295",xxx.......}]}
2023-08-21T02:31:10.237858677Z java.lang.IllegalArgumentException: number of documents in the index cannot exceed 2147483519

### 处理

删除索引重建,并设计好分片

你可能感兴趣的:(elasticsearch,大数据,搜索引擎)