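[ElasticSearch] Troubleshooting a StackOverflowError caused by excessively deep mapping nesting in indexed data

Symptoms

All data nodes in the cluster kept crashing with a StackOverflowError, and crashed again after every restart. The stack trace of the StackOverflowError is shown below.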

[2023-12-22T16:03:44,057][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [xr-data-hdp-dn-rtyarn0725] fatal error in thread [elasticsearch[xr-data-hdp-dn-rtyarn0725][write][T#6]], exiting
java.lang.StackOverflowError: null
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseProperties(ObjectMapper.java:283) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseObjectOrDocumentTypeProperties(ObjectMapper.java:237) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parse(ObjectMapper.java:210) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseProperties(ObjectMapper.java:319) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseObjectOrDocumentTypeProperties(ObjectMapper.java:237) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parse(ObjectMapper.java:210) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseProperties(ObjectMapper.java:319) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseObjectOrDocumentTypeProperties(ObjectMapper.java:237) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parse(ObjectMapper.java:210) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseProperties(ObjectMapper.java:319) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseObjectOrDocumentTypeProperties(ObjectMapper.java:237) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parse(ObjectMapper.java:210) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseProperties(ObjectMapper.java:319) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseObjectOrDocumentTypeProperties(ObjectMapper.java:237) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parse(ObjectMapper.java:210) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseProperties(ObjectMapper.java:319) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseObjectOrDocumentTypeProperties(ObjectMapper.java:237) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parse(ObjectMapper.java:210) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseProperties(ObjectMapper.java:319) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseObjectOrDocumentTypeProperties(ObjectMapper.java:237) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parse(ObjectMapper.java:210) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseProperties(ObjectMapper.java:319) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseObjectOrDocumentTypeProperties(ObjectMapper.java:237) ~[elasticsearch-7.9.1.jar:7.9.1]
        ...
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parse(ObjectMapper.java:210) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseProperties(ObjectMapper.java:319) ~[elasticsearch-7.9.1.jar:7.9.1]
        at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseObjectOrDocumentTypeProperties(ObjectMapper.java:237) ~[elasticsearch-7.9.1.jar:7.9.1]

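Handling

The stack trace shows that the overflow happened in a thread of the [write] thread pool, most likely while a mapping was being parsed, and the ObjectMapper class suggests that it was triggered by writing Object-type data. We therefore pulled the mappings of every index in the cluster and looked for an index with Object-type fields, but found none.
In the end, because this cluster has only a few indices, we fell back on brute force: we stopped the write jobs half at a time, binary-search style, and watched the cluster state until the problem index was isolated.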
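The bisection over write jobs can be sketched roughly as follows. This is only an illustration, not the procedure we scripted: `JobBisection`, `findCulprit` and the `crashesWith` probe are hypothetical names, the probe stands in for "run only these jobs and watch whether the data nodes crash", and the sketch assumes exactly one job is responsible.

```java
import java.util.List;
import java.util.function.Predicate;

final class JobBisection {
    /**
     * Narrows a list of candidate jobs down to the one that triggers the crash.
     * Assumes exactly one culprit; crashesWith is a hypothetical probe that
     * reports whether the cluster crashes while only the given jobs are running.
     */
    static <T> T findCulprit(List<T> jobs, Predicate<List<T>> crashesWith) {
        if (jobs.isEmpty()) {
            throw new IllegalArgumentException("no candidate jobs");
        }
        List<T> candidates = jobs;
        while (candidates.size() > 1) {
            int mid = candidates.size() / 2;
            List<T> firstHalf = candidates.subList(0, mid);
            // Keep whichever half still reproduces the crash.
            candidates = crashesWith.test(firstHalf)
                ? firstHalf
                : candidates.subList(mid, candidates.size());
        }
        return candidates.get(0);
    }

    public static void main(String[] args) {
        List<String> jobs = List.of("job-a", "job-b", "job-c", "job-d", "job-e");
        // Simulated probe: pretend "job-d" writes the problematic data.
        String culprit = findCulprit(jobs, running -> running.contains("job-d"));
        System.out.println("culprit: " + culprit);
    }
}
```

Each round halves the candidate set, so the number of stop-and-observe cycles grows with the logarithm of the job count, which is why this is only practical when each cycle is cheap or the cluster is small.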

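Root cause analysis

Question 1

Why did the stack overflow happen?

The overflowing stack is on the ES server side, in the code that handles a client's write request. With dynamic mapping enabled, if the incoming data contains new fields, ES has to parse the field configuration for them, and it does so by recursively parsing the JSON that describes that configuration. When a field has a nested structure (Object/nested), the recursion depth is determined by how many levels of nesting the user's data has. The data written to the problem index was nested too deeply, so the recursion went too deep and the stack overflowed.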
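To make the link between nesting depth and recursion depth concrete, here is a minimal sketch. It is not Elasticsearch code: `RecursiveMappingWalk`, `nested` and `walk` are hypothetical names, and the walk only imitates the shape of a recursive-descent parser, with one stack frame per object level.

```java
import java.util.HashMap;
import java.util.Map;

final class RecursiveMappingWalk {
    /** Builds a map with the given number of object levels around one string leaf. */
    static Map<String, Object> nested(int levels) {
        Map<String, Object> current = new HashMap<>();
        current.put("leaf", "ddd");
        // Built iteratively so that constructing the data cannot itself overflow.
        for (int i = 1; i < levels; i++) {
            Map<String, Object> parent = new HashMap<>();
            parent.put("obj", current);
            current = parent;
        }
        return current;
    }

    /** Recursive walk: uses one stack frame per object level and returns the depth. */
    static int walk(Map<String, Object> node) {
        int deepestChild = 0;
        for (Object value : node.values()) {
            if (value instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> child = (Map<String, Object>) value;
                deepestChild = Math.max(deepestChild, walk(child));
            }
        }
        return 1 + deepestChild;
    }

    public static void main(String[] args) {
        System.out.println("depth of a 10-level document: " + walk(nested(10)));
        try {
            // Whether this overflows depends on the JVM's thread stack size (-Xss).
            System.out.println("depth: " + walk(nested(100_000)));
        } catch (StackOverflowError e) {
            System.out.println("stack overflowed while walking the deep document");
        }
    }
}
```

The sketch catches the error only to print a message. In the node log above the error was not caught: it reached ElasticsearchUncaughtExceptionHandler, which logged a fatal error and exited the process.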

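Verification:

We test-wrote a document with several levels of nesting. The code stack observed for this write matches the StackOverflowError stack from the symptoms and shows the same repeated recursion.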

{
    "o1":{
        "a":{
            "b":{
                "c":{
                    "d":{
                        "e":{
                            "f":{
                                "g":{
                                    "h":{
                                        "j":"ddd"
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
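For reproducing the problem with an arbitrary depth, a test document like the one above can be generated instead of typed by hand. The helper below is a sketch with hypothetical names (`NestedDocGenerator`, `nestedJson`, field names `f0`, `f1`, ...); it only builds the JSON string and does not send it anywhere.

```java
final class NestedDocGenerator {
    /** Returns a JSON document with `depth` nested objects around one string leaf. */
    static String nestedJson(int depth) {
        if (depth < 1) {
            throw new IllegalArgumentException("depth must be >= 1");
        }
        StringBuilder sb = new StringBuilder();
        // Open one object per level: {"f0":{"f1":{ ...
        for (int i = 0; i < depth; i++) {
            sb.append("{\"f").append(i).append("\":");
        }
        sb.append("\"ddd\"");
        // Close every object that was opened.
        for (int i = 0; i < depth; i++) {
            sb.append('}');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(nestedJson(3));
    }
}
```

The generated string can be used as the body of an indexing request against an index with dynamic mapping enabled. Since a large enough depth can crash affected versions, this should only be tried on a disposable test cluster.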

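Code stack:
[Figure 1: screenshot of the code stack for the test write (image not available)]
We confirmed that the problem index did have dynamic mapping enabled, and that the raw logs did contain data with a large number of nested structures.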

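Question 2

Why does the mapping of the problem index contain no Object-type fields?
The exception is thrown while a write triggers mapping parsing, and at that point the new mapping has not yet been applied to the index. Because the StackOverflowError during parsing crashed the ES process, the index mapping was never updated, so the mapping of the problem index naturally contains no Object-type fields.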

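Question 3

ES has a depth limit for nested fields (index.mapping.depth.limit). Why did it not reject this document?
That check runs only after the field configuration has been parsed, and the stack overflowed during the parsing itself. See the code below.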

private synchronized Map internalMerge(Map mappings, MergeReason reason) {
        // ... unrelated code omitted ...

            try {
                documentMapper =
                    documentParser.parse(type, entry.getValue(), applyDefault ? defaultMappingSourceOrLastStored : null); // parses the mapping for the incoming data
            } catch (Exception e) {
                throw new MapperParsingException("Failed to parse mapping [{}]: {}", e, entry.getKey(), e.getMessage());
            }
        }

        return internalMerge(defaultMapper, defaultMappingSource, documentMapper, reason); // the mapping is checked in here
    }

private synchronized Map internalMerge(@Nullable DocumentMapper defaultMapper,
                                                                   @Nullable String defaultMappingSource, DocumentMapper mapper,
                                                                   MergeReason reason) {
        // ... unrelated code omitted ...
        boolean hasNested = this.hasNested;
        Map fullPathObjectMappers = this.fullPathObjectMappers;

        Map results = new LinkedHashMap<>(2);

        if (defaultMapper != null) {
            if (indexSettings.getIndexVersionCreated().onOrAfter(Version.V_7_0_0)) {
                throw new IllegalArgumentException(DEFAULT_MAPPING_ERROR_MESSAGE);
            } else if (reason == MergeReason.MAPPING_UPDATE) { // only log in case of explicit mapping updates
                deprecationLogger.deprecatedAndMaybeLog("default_mapping_not_allowed", DEFAULT_MAPPING_ERROR_MESSAGE);
            }
            assert defaultMapper.type().equals(DEFAULT_MAPPING);
            results.put(DEFAULT_MAPPING, defaultMapper);
        }

            for (ObjectMapper objectMapper : objectMappers) {
            if (reason != MergeReason.MAPPING_RECOVERY) {
                checkTotalFieldsLimit(objectMappers.size() + fieldMappers.size() - metadataMappers.length
                    + fieldAliasMappers.size());
                checkFieldNameSoftLimit(objectMappers, fieldMappers, fieldAliasMappers);
                checkNestedFieldsLimit(fullPathObjectMappers);
                checkDepthLimit(fullPathObjectMappers.keySet()); // checks whether the mapping's maximum depth exceeds the limit; if so, throws IllegalArgumentException
            }

            results.put(newMapper.type(), newMapper);
        }


        return results;
    }
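The ordering problem can be illustrated with a depth check that does not recurse and therefore can safely run before any recursive parsing. This is a sketch under assumptions, not the upstream fix: `DepthGuard`, `maxDepth` and `checkDepth` are hypothetical names, and the way depth is counted here (the root object counts as level 1) is chosen for the example and is not necessarily how checkDepthLimit counts.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class DepthGuard {
    /** A map still to be visited, together with its depth (root = 1). */
    private static final class Frame {
        final Map<?, ?> node;
        final int depth;

        Frame(Map<?, ?> node, int depth) {
            this.node = node;
            this.depth = depth;
        }
    }

    /** Maximum object depth, computed with an explicit stack instead of recursion. */
    static int maxDepth(Map<String, Object> root) {
        Deque<Frame> pending = new ArrayDeque<>();
        pending.push(new Frame(root, 1));
        int max = 0;
        while (!pending.isEmpty()) {
            Frame frame = pending.pop();
            max = Math.max(max, frame.depth);
            for (Object value : frame.node.values()) {
                if (value instanceof Map) {
                    pending.push(new Frame((Map<?, ?>) value, frame.depth + 1));
                }
            }
        }
        return max;
    }

    /** Rejects the document up front, before any recursive parsing would start. */
    static void checkDepth(Map<String, Object> root, int limit) {
        int depth = maxDepth(root);
        if (depth > limit) {
            throw new IllegalArgumentException(
                "object depth [" + depth + "] exceeds limit [" + limit + "]");
        }
    }

    public static void main(String[] args) {
        Map<String, Object> inner = new HashMap<>();
        inner.put("leaf", "ddd");
        Map<String, Object> root = new HashMap<>();
        root.put("a", inner);
        checkDepth(root, 20); // passes: this document has only two object levels
        System.out.println("max depth: " + maxDepth(root));
    }
}
```

Because the traversal keeps its pending work on the heap, the depth of the input affects memory use but not the thread stack, so an over-deep document ends in an IllegalArgumentException instead of a StackOverflowError.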

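Solution

The upstream community fixed this issue in v8.6, see https://github.com/elastic/elasticsearch/issues/52098. We are running ES 7, so we need to either upgrade or apply a patch to resolve it.

Production recommendations

  1. It is best not to enable dynamic mapping. It hurts performance, and on older versions it can also lead to the problem described in this article.
  2. During incident handling, consider adding temporary logging to help with the investigation. In this case, if the mapping-parsing code had logged the index name or field information, we could have found the problem index directly and the outage would have been much shorter.
  3. Keep up with community releases where possible. Many problems have already been fixed upstream.
  4. For this kind of investigation, also consider enabling the Transport tracer to log write requests, so that you can see which index data was being written just before the stack overflow.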
