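A while ago I was helping a customer troubleshoot Elasticsearch (ES) problems. The customer's environment was ingesting far more data than originally planned, so based on machine resource usage we decided to scale out the ES cluster: from 2 data nodes to 4 data nodes, plus a dedicated master node. The cluster already held terabytes of data, and to keep the later cluster operations lightweight we decided to temporarily move the stored index data (the indices directory under each data node's storage path) to a temporary location, Dest, beforehand.

After the expansion was done, as a test we first moved a small portion of the index data from Dest back onto the data nodes of the cluster and restarted the cluster. Because of a mistake in the container volume mapping configuration, the analysis (tokenizer) plugin on the data nodes failed, so none of the loaded indices were assigned. We moved the index data back out and fixed the volume mapping. Then I happened to notice, via the _cat/indices API, that all the indices were unassigned. I figured they were unassigned anyway and the data had already been copied out, so I casually ran DELETE * on them. (My mental model at the time was that an index's data and its metaData were all stored in the index files, and that a data node would read them in when loading the data and report them to the master node, which would then update the global cluster state. So I did not expect DELETE * to cause any trouble, especially since it was only deleting indices that had never been assigned properly.)

Next I copied the same portion of index data back onto the data nodes and restarted the whole cluster. After the restart the serious problem appeared. The _cluster/health API showed no index recovery percentage, which felt odd. I immediately called _cat/indices, which returned no index information at all. Finally I checked each data node's storage path and found that the index directories I had copied in no longer existed. At this point I started to panic: part of the data was lost, and the accident had gone beyond my understanding of ES. I carefully got the customer environment back in order afterwards, but the problem deserved a proper in-depth investigation. This document records my study of the ES internals behind it.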

  • ES 5.6.16
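  • 1 master + 1 data (each ES instance run from source in IntelliJ IDEA)
    At first I had no clear idea which ES module or class to start from, so I decided to set up an ES environment, reproduce the problem, and look for an entry point from there. I built a two-node cluster (1 master + 1 data), set the log level to debug on both nodes, replayed the whole sequence of operations that led to the data being deleted, and dug through the debug logs for useful information: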
[2020-10-09T13:48:48,538][DEBUG][o.e.i.c.IndicesClusterStateService] [master] [[twitter/4fHvcKLSRBuXK4mGTVI9Bg]] cleaning index, no longer part of the metadata


/**
 * Deletes indices (with shard data).
 *
 * @param event cluster change event
 */
private void deleteIndices(final ClusterChangedEvent event) {
    final ClusterState previousState = event.previousState();
    final ClusterState state = event.state();
    final String localNodeId = state.nodes().getLocalNodeId();
    assert localNodeId != null;

    for (Index index : event.indicesDeleted()) {
        if (logger.isDebugEnabled()) {
            logger.debug("[{}] cleaning index, no longer part of the metadata", index);
        }
        AllocatedIndex<? extends Shard> indexService = indicesService.indexService(index);
        final IndexSettings indexSettings;
        if (indexService != null) {
            indexSettings = indexService.getIndexSettings();
            indicesService.removeIndex(index, DELETED, "index no longer part of the metadata");
        } else if (previousState.metaData().hasIndex(index.getName())) {
            // The deleted index was part of the previous cluster state, but not loaded on the local node
            final IndexMetaData metaData = previousState.metaData().index(index);
            indexSettings = new IndexSettings(metaData, settings);
            indicesService.deleteUnassignedIndex("deleted index was not assigned to local node", metaData, state);
        } else {
            // The previous cluster state's metadata also does not contain the index,
            // which is what happens on node startup when an index was deleted while the
            // node was not part of the cluster.  In this case, try reading the index
            // metadata from disk.  If its not there, there is nothing to delete.
            // First, though, verify the precondition for applying this case by
            // asserting that the previous cluster state is not initialized/recovered.
            assert previousState.blocks().hasGlobalBlock(GatewayService.STATE_NOT_RECOVERED_BLOCK);
            final IndexMetaData metaData = indicesService.verifyIndexIsDeleted(index, event.state());
            if (metaData != null) {
                indexSettings = new IndexSettings(metaData, settings);
            } else {
                indexSettings = null;
            }
        }
        if (indexSettings != null) {
            threadPool.generic().execute(new AbstractRunnable() {
                @Override
                public void onFailure(Exception e) {
                    logger.warn(
                        (Supplier<?>) () -> new ParameterizedMessage("[{}] failed to complete pending deletion for index", index), e);
                }

                @Override
                protected void doRun() throws Exception {
                    try {
                        // we are waiting until we can lock the index / all shards on the node and then we ack the delete of the store
                        // to the master. If we can't acquire the locks here immediately there might be a shard of this index still
                        // holding on to the lock due to a "currently canceled recovery" or so. The shard will delete itself BEFORE the
                        // lock is released so it's guaranteed to be deleted by the time we get the lock
                        indicesService.processPendingDeletes(index, indexSettings, new TimeValue(30, TimeUnit.MINUTES));
                    } catch (LockObtainFailedException exc) {
                        logger.warn("[{}] failed to lock all shards for index - timed out after 30 seconds", index);
                    } catch (InterruptedException e) {
                        logger.warn("[{}] failed to lock all shards for index - interrupted", index);
                    }
                }
            });
        }
    }
}


/**
 * Returns the indices deleted in this event
 */
public List<Index> indicesDeleted() {
    if (previousState.blocks().hasGlobalBlock(GatewayService.STATE_NOT_RECOVERED_BLOCK)) {
        // working off of a non-initialized previous state, so use the tombstones for index deletions
        return indicesDeletedFromTombstones();
    } else {
        // examine the diffs in index metadata between the previous and new cluster states to get the deleted indices
        return indicesDeletedFromClusterState();
    }
}

private List<Index> indicesDeletedFromTombstones() {
    // We look at the full tombstones list to see which indices need to be deleted.  In the case of
    // a valid previous cluster state, indicesDeletedFromClusterState() will be used to get the deleted
    // list, so a diff doesn't make sense here.  When a node (re)joins the cluster, its possible for it
    // to re-process the same deletes or process deletes about indices it never knew about.  This is not
    // an issue because there are safeguards in place in the delete store operation in case the index
    // folder doesn't exist on the file system.
    List<IndexGraveyard.Tombstone> tombstones = state.metaData().indexGraveyard().getTombstones();
    return tombstones.stream().map(IndexGraveyard.Tombstone::getIndex).collect(Collectors.toList());
}

private List<Index> indicesDeletedFromClusterState() {
    // If the new cluster state has a new cluster UUID, the likely scenario is that a node was elected
    // master that has had its data directory wiped out, in which case we don't want to delete the indices and lose data;
    // rather we want to import them as dangling indices instead.  So we check here if the cluster UUID differs from the previous
    // cluster UUID, in which case, we don't want to delete indices that the master erroneously believes shouldn't exist.
    // See test DiscoveryWithServiceDisruptionsIT.testIndicesDeleted()
    // See discussion on https://github.com/elastic/elasticsearch/pull/9952 and
    // https://github.com/elastic/elasticsearch/issues/11665
    if (metaDataChanged() == false || isNewCluster()) {
        return Collections.emptyList();
    }
    List<Index> deleted = null;
    for (ObjectCursor<IndexMetaData> cursor : previousState.metaData().indices().values()) {
        IndexMetaData index = cursor.value;
        IndexMetaData current = state.metaData().index(index.getIndex());
        if (current == null) {
            if (deleted == null) {
                deleted = new ArrayList<>();
            }
            deleted.add(index.getIndex());
        }
    }
    return deleted == null ? Collections.emptyList() : deleted;
}
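
The branch taken by indicesDeleted() is what matters for the incident, so here is a minimal, self-contained sketch of that decision. It is a toy model, not ES code: IndicesDeletedSketch, State and the notRecovered flag are hypothetical names standing in for ClusterChangedEvent, ClusterState and STATE_NOT_RECOVERED_BLOCK, indices are plain strings, and the metaDataChanged()/isNewCluster() guard is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the indicesDeleted() decision; every name here is hypothetical. */
class IndicesDeletedSketch {

    /** Stand-in for a cluster state: live indices, tombstones, and the "state not recovered" block. */
    static final class State {
        final Set<String> indices;
        final List<String> tombstones;
        final boolean notRecovered;

        State(List<String> indices, List<String> tombstones, boolean notRecovered) {
            this.indices = new LinkedHashSet<>(indices);
            this.tombstones = new ArrayList<>(tombstones);
            this.notRecovered = notRecovered;
        }
    }

    static List<String> indicesDeleted(State previous, State current) {
        if (previous.notRecovered) {
            // Node just started: there is no usable previous state, so every tombstone is replayed.
            return new ArrayList<>(current.tombstones);
        }
        // Normal operation: deleted = present in the previous state, absent from the current one.
        List<String> deleted = new ArrayList<>();
        for (String index : previous.indices) {
            if (!current.indices.contains(index)) {
                deleted.add(index);
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();
        List<String> twitter = Arrays.asList("twitter");

        // Restart after DELETE: the previous state is not recovered, the tombstone is still in metaData.
        State beforeRecovery = new State(none, none, true);
        State recovered = new State(none, twitter, false);
        System.out.println(indicesDeleted(beforeRecovery, recovered));

        // DELETE on a running cluster: found by diffing the two states.
        State running = new State(twitter, none, false);
        State afterDelete = new State(none, twitter, false);
        System.out.println(indicesDeleted(running, afterDelete));
    }
}
```

Both calls report twitter as deleted, but for different reasons. The first one is the restart case from the incident: the index is not in either state, and it is still scheduled for deletion purely because its tombstone survived the restart.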


"metadata": {
    "cluster_uuid": "kURWiZwNQ0-jmDqNIQOa9g",
    "templates": {},
    "indices": {},
    "index-graveyard": {
      "tombstones": [
        {
          "index": {
            "index_name": "twitter",
            "index_uuid": "IR5DYQLLTJKKBGxgal63nQ"
          },
          "delete_date_in_millis": 1602208073269
        }
      ]
    }
}


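  • IndexGraveyard: the class that represents the set of deleted indices in the cluster metaData
  • tombstone: the record of one deleted index (index name, index UUID and deletion time)
  • tombstones: the collection of tombstone entries; its maximum size can be set through cluster.indices.tombstones.size and defaults to 500
  • dangling indices: indices whose state is still on disk but which are not present in the cluster metaData (the indices copied back in the incident above fall into this category)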
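
To make the tombstone bookkeeping concrete, here is a minimal sketch of a size-bounded tombstone list. It assumes that once the limit is exceeded the oldest entries are dropped first; GraveyardSketch and its methods are hypothetical names for illustration and are not the ES IndexGraveyard API.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy bounded tombstone list; hypothetical names, not the ES implementation. */
class GraveyardSketch {

    /** One deleted index: the same three fields that appear in the cluster state JSON. */
    static final class Tombstone {
        final String indexName;
        final String indexUuid;
        final long deleteDateInMillis;

        Tombstone(String indexName, String indexUuid, long deleteDateInMillis) {
            this.indexName = indexName;
            this.indexUuid = indexUuid;
            this.deleteDateInMillis = deleteDateInMillis;
        }
    }

    private final int maxTombstones;
    private final Deque<Tombstone> tombstones = new ArrayDeque<>();

    GraveyardSketch(int maxTombstones) {
        this.maxTombstones = maxTombstones;
    }

    /** Records a deletion and evicts the oldest entries beyond the limit (assumed policy). */
    void addTombstone(String indexName, String indexUuid, long deleteDateInMillis) {
        tombstones.addLast(new Tombstone(indexName, indexUuid, deleteDateInMillis));
        while (tombstones.size() > maxTombstones) {
            tombstones.removeFirst();
        }
    }

    /** True if an index with this UUID is still remembered as deleted. */
    boolean containsUuid(String indexUuid) {
        for (Tombstone t : tombstones) {
            if (t.indexUuid.equals(indexUuid)) {
                return true;
            }
        }
        return false;
    }

    int size() {
        return tombstones.size();
    }

    public static void main(String[] args) {
        GraveyardSketch graveyard = new GraveyardSketch(500); // 500 mirrors the default mentioned above
        graveyard.addTombstone("twitter", "IR5DYQLLTJKKBGxgal63nQ", 1602208073269L);
        System.out.println(graveyard.containsUuid("IR5DYQLLTJKKBGxgal63nQ"));
    }
}
```

The point of the model is that a tombstone is keyed by index UUID and outlives the index itself, so files carrying that UUID are still treated as deleted when they reappear later.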

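With this groundwork in place, I went on to study the persistent storage of the ES master node. Under the master's storage path there are two important files: one records the cluster metaData (global-x.st) and one records information about the master node itself (node-x.st). Opening both in vim in hexadecimal mode gives: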

# global-1.st
00000000: 3fd7 6c17 0573 7461 7465 0000 0001 0000  ?.l..state......
00000010: 0001 3a29 0a05 fa88 6d65 7461 2d64 6174  ..:)....meta-dat
00000020: 61fa 8676 6572 7369 6f6e d08b 636c 7573  a..version..clus
00000030: 7465 725f 7575 6964 5542 7563 7a51 3365  ter_uuidUBuczQ3e
00000040: 6353 6757 6e61 7378 7465 476b 7636 6788  cSgWnasxteGkv6g.
00000050: 7465 6d70 6c61 7465 73fa fb8e 696e 6465  templates...inde
00000060: 782d 6772 6176 6579 6172 64fa 8974 6f6d  x-graveyard..tom
00000070: 6273 746f 6e65 73f8 fa84 696e 6465 78fa  bstones...index.
00000080: 8969 6e64 6578 5f6e 616d 6544 6e61 6d65  .index_nameDname
00000090: 7389 696e 6465 785f 7575 6964 5564 5857  s.index_uuidUdXW
000000a0: 4957 4878 7352 6d57 3575 5441 6274 3969  IWHxsRmW5uTAbt9i
000000b0: 6b65 77fb 9464 656c 6574 655f 6461 7465  kew..delete_date
000000c0: 5f69 6e5f 6d69 6c6c 6973 2501 3a43 0e1b  _in_millis%.:C..
000000d0: 1dae fbf9 fbfb fbc0 2893 e800 0000 0000  ........(.......
000000e0: 0000 0028 e8a7 b60a                      ...(....

# node-0.st
00000000: 3fd7 6c17 0573 7461 7465 0000 0001 0000  ?.l..state......
00000010: 0001 3a29 0a05 fa86 6e6f 6465 5f69 6455  ..:)....node_idU
00000020: 6857 5147 786b 3637 5342 2d4d 5575 3874  hWQGxk67SB-MUu8t
00000030: 6548 7173 4c51 fbc0 2893 e800 0000 0000  eHqsLQ..(.......
00000040: 0000 001e e8fb f70a 
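
The first bytes of both dumps can be decoded by hand. The sketch below is my reading of them, assuming the files start with a Lucene codec header (a 4-byte magic, a length-prefixed codec name, a 4-byte version), followed by a 4-byte content-format index and then a SMILE-encoded body that begins with ":)\n". StateFileHeaderSketch and its methods are hypothetical helper names, and the interpretation of the format index as "SMILE" is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Decodes the leading bytes of a *.st file as shown in the hex dumps; hypothetical helper. */
class StateFileHeaderSketch {

    /** Magic number at the start of a Lucene codec header (bytes 3f d7 6c 17 in the dumps). */
    static final int CODEC_MAGIC = 0x3fd76c17;

    static String describe(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes); // ByteBuffer reads big-endian by default
        int magic = buf.getInt();
        if (magic != CODEC_MAGIC) {
            throw new IllegalArgumentException("unexpected magic: " + Integer.toHexString(magic));
        }
        int nameLength = buf.get() & 0xff; // a single length byte is enough for a short codec name
        byte[] name = new byte[nameLength];
        buf.get(name);
        String codec = new String(name, StandardCharsets.UTF_8);
        int version = buf.getInt();
        int formatIndex = buf.getInt(); // assumed to identify the body encoding (1 read here as SMILE)
        boolean smile = buf.get() == ':' && buf.get() == ')' && buf.get() == '\n';
        return "codec=" + codec + " version=" + version + " format=" + formatIndex + " smile=" + smile;
    }

    /** Turns a space-separated hex string (as printed in the dump) into bytes. */
    static byte[] fromHex(String hex) {
        String clean = hex.replace(" ", "");
        byte[] out = new byte[clean.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(clean.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        // First 22 bytes, identical in global-1.st and node-0.st above.
        byte[] header = fromHex("3fd7 6c17 0573 7461 7465 0000 0001 0000 0001 3a29 0a05");
        System.out.println(describe(header)); // codec=state version=1 format=1 smile=true
    }
}
```

Read this way, the readable fragments in the global-1.st dump (cluster_uuid, index-graveyard, tombstones, index_name, index_uuid, delete_date_in_millis) are the field names of the serialized metaData, which shows that the tombstone list is written to disk on the master.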



/**
 * Verify that the contents on disk for the given index is deleted; if not, delete the contents.
 * This method assumes that an index is already deleted in the cluster state and/or explicitly
 * through index tombstones.
 * @param index {@code Index} to make sure its deleted from disk
 * @param clusterState {@code ClusterState} to ensure the index is not part of it
 * @return IndexMetaData for the index loaded from disk
 */
public IndexMetaData verifyIndexIsDeleted(final Index index, final ClusterState clusterState) {
    // this method should only be called when we know the index (name + uuid) is not part of the cluster state
    if (clusterState.metaData().index(index) != null) {
        throw new IllegalStateException("Cannot delete index [" + index + "], it is still part of the cluster state.");
    }
    if (nodeEnv.hasNodeFile() && FileSystemUtils.exists(nodeEnv.indexPaths(index))) {
        final IndexMetaData metaData;
        try {
            metaData = metaStateService.loadIndexState(index);
        } catch (Exception e) {
            logger.warn((Supplier<?>) () -> new ParameterizedMessage("[{}] failed to load state file from a stale deleted index, folders will be left on disk", index), e);
            return null;
        }
        final IndexSettings indexSettings = buildIndexSettings(metaData);
        try {
            deleteIndexStoreIfDeletionAllowed("stale deleted index", index, indexSettings, ALWAYS_TRUE);
        } catch (Exception e) {
            // we just warn about the exception here because if deleteIndexStoreIfDeletionAllowed
            // throws an exception, it gets added to the list of pending deletes to be tried again
            logger.warn((Supplier<?>) () -> new ParameterizedMessage("[{}] failed to delete index on disk", metaData.getIndex()), e);
        }
        return metaData;
    }
    return null;
}


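  • Executing DELETE on an index writes the deleted index into the tombstones collection in the cluster metaData, and that metaData is persisted in a local file on the master node (global-x.st)
  • When the master node starts, it reads that file from its local path and loads the cluster information into metaData
  • While the cluster state is being synchronized from the master node, ES verifies that every index listed in tombstones has really been deleted (that is, that its local index storage directory is gone)
  • If the files of a tombstoned index still exist, they are deleted during this process
  • The data loss described above happened because DELETE was executed first, so the deleted indices were already recorded in metaData; the index files copied back onto the data nodes afterwards were therefore deleted by ES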
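
Given these findings, one precaution before copying index directories back is to compare them against the tombstones currently in the cluster metaData. The sketch below assumes that ES 5.x names each directory under indices after the index UUID, and that the tombstone UUIDs have already been collected by hand (for example from the index-graveyard section of GET _cluster/state/metadata). TombstonePrecheckSketch and its methods are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Flags index directories whose UUID is still tombstoned; hypothetical helper, not an ES tool. */
class TombstonePrecheckSketch {

    /** Returns the directory names that match a tombstoned index UUID, sorted for stable output. */
    static List<String> doomedDirectories(Collection<String> directoryNames, Set<String> tombstoneUuids) {
        List<String> doomed = new ArrayList<>();
        for (String name : directoryNames) {
            if (tombstoneUuids.contains(name)) {
                doomed.add(name);
            }
        }
        Collections.sort(doomed);
        return doomed;
    }

    /** Lists the sub-directory names of a backup location such as Dest/indices. */
    static List<String> listDirectoryNames(Path indicesDir) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(indicesDir)) {
            for (Path child : stream) {
                if (Files.isDirectory(child)) {
                    names.add(child.getFileName().toString());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Demo on a temporary directory; in practice point this at the backup copy of indices.
        Path backup = Files.createTempDirectory("indices-backup");
        Files.createDirectory(backup.resolve("IR5DYQLLTJKKBGxgal63nQ"));
        Files.createDirectory(backup.resolve("exampleLiveIndexUuid"));

        Set<String> tombstoneUuids = new HashSet<>(Arrays.asList("IR5DYQLLTJKKBGxgal63nQ"));
        System.out.println(doomedDirectories(listDirectoryNames(backup), tombstoneUuids));
    }
}
```

Any directory reported here would be removed again after the next restart, so it should stay in the backup location until the tombstone issue has been dealt with.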



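  • Make a complete backup before operating on data (use cp rather than mv)
  • Make sure you understand the mechanism behind a feature well enough before taking further action with it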



