Elasticsearch Source Code Analysis: Cluster State

Our current project modifies Elasticsearch at the source level. We recently hit a problem where the cluster went into a half-dead, unresponsive state, which is what prompted this deep dive into how cluster state synchronization works in the source.

Overview

  First, note the contrast with Solr, which uses ZooKeeper to synchronize cluster state: a third-party component maintains it. Elasticsearch does not use ZooKeeper or etcd to manage master election and node roles.
  So how does cluster state synchronization actually get done?
  I recommend the article ELASTICSEARCH 机制和架构 (Elasticsearch mechanisms and architecture); that site has a lot of Elasticsearch analysis and was a real inspiration for me. I am only elaborating a little on top of that article.

Node types

  Without getting too complicated, we'll focus on just two node types.

Master node

  First, a node can only be elected master if node.master: true is set in elasticsearch.yml.

If you are doing your own source analysis, it is best to separate the master and data nodes, and, if possible, add extra log statements of your own, or enable debug logging, so you can roughly trace the flow. When debugging a single node, many of these steps run asynchronously, so the stages are hard to tell apart.
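A minimal pair of elasticsearch.yml sketches for such a setup (the role settings are the 5.x names; the logger lines are optional and assume the 5.x support for setting log levels in elasticsearch.yml):

# elasticsearch.yml -- dedicated master node
node.master: true
node.data: false

# elasticsearch.yml -- dedicated data node
node.master: false
node.data: true

# optional on either node: verbose logs for tracing the publish flow
logger.org.elasticsearch.cluster.service: DEBUG
logger.org.elasticsearch.discovery.zen: DEBUG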

  Second, the master's main job is publishing the cluster state; focus on the ClusterService class.
  Here is the entry point for state updates (how execution reaches this entry point can be worked out step by step):

void runTasks(TaskInputs taskInputs) {
        ...
        TaskOutputs taskOutputs = calculateTaskOutputs(taskInputs, previousClusterState, startTimeNS);  // Key point #1: the master computes the new cluster state, e.g. the metadata after an index creation.
        taskOutputs.notifyFailedTasks();

        if (taskOutputs.clusterStateUnchanged()) {
            taskOutputs.notifySuccessfulTasksOnUnchangedClusterState();
            TimeValue executionTime = TimeValue.timeValueMillis(Math.max(0, TimeValue.nsecToMSec(currentTimeInNanos() - startTimeNS)));
            logger.debug("processing [{}]: took [{}] no change in cluster_state", taskInputs.summary, executionTime);
            warnAboutSlowTaskIfNeeded(executionTime, taskInputs.summary);
        } else {
            ClusterState newClusterState = taskOutputs.newClusterState;
            if (logger.isTraceEnabled()) {
                logger.trace("cluster state updated, source [{}]\n{}", taskInputs.summary, newClusterState);
            } else if (logger.isDebugEnabled()) {
                logger.debug("cluster state updated, version [{}], source [{}]", newClusterState.version(), taskInputs.summary);
            }
            try {
                publishAndApplyChanges(taskInputs, taskOutputs); // The name says it all: publish the cluster state and apply it.
                TimeValue executionTime = TimeValue.timeValueMillis(Math.max(0, TimeValue.nsecToMSec(currentTimeInNanos() - startTimeNS)));
                logger.debug("processing [{}]: took [{}] done applying updated cluster_state (version: {}, uuid: {})", taskInputs.summary,
                    executionTime, newClusterState.version(), newClusterState.stateUUID());
                warnAboutSlowTaskIfNeeded(executionTime, taskInputs.summary);
            } catch (Exception e) {
                TimeValue executionTime = TimeValue.timeValueMillis(Math.max(0, TimeValue.nsecToMSec(currentTimeInNanos() - startTimeNS)));
                final long version = newClusterState.version();
                final String stateUUID = newClusterState.stateUUID();
                final String fullState = newClusterState.toString();
                logger.warn(
                    (Supplier<?>) () -> new ParameterizedMessage(
                        "failed to apply updated cluster state in [{}]:\nversion [{}], uuid [{}], source [{}]\n{}",
                        executionTime,
                        version,
                        stateUUID,
                        taskInputs.summary,
                        fullState),
                    e);
                // TODO: do we want to call updateTask.onFailure here?
            }
        }
    }

  Next, let's look at how the state is published. This flow is also fairly long, so I'll keep filling it in over time.
  Every state update carries a version, and the version determines which update is the newest (a sketch of the comparison rule follows the method below).

    private void publishAndApplyChanges(TaskInputs taskInputs, TaskOutputs taskOutputs) {
        ClusterState previousClusterState = taskOutputs.previousClusterState;
        ClusterState newClusterState = taskOutputs.newClusterState;

        ClusterChangedEvent clusterChangedEvent = new ClusterChangedEvent(taskInputs.summary, newClusterState, previousClusterState);
        // new cluster state, notify all listeners
        final DiscoveryNodes.Delta nodesDelta = clusterChangedEvent.nodesDelta();
        if (nodesDelta.hasChanges() && logger.isInfoEnabled()) {
            String summary = nodesDelta.shortSummary();
            if (summary.length() > 0) {
                logger.info("{}, reason: {}", summary, taskInputs.summary);
            }
        }

        final Discovery.AckListener ackListener = newClusterState.nodes().isLocalNodeElectedMaster() ?
            taskOutputs.createAckListener(threadPool, newClusterState) :
            null;

        nodeConnectionsService.connectToNodes(newClusterState.nodes());

        // if we are the master, publish the new state to all nodes
        // we publish here before we send a notification to all the listeners, since if it fails
        // we don't want to notify
        // This is where the master fans the new state out to the other nodes
        if (newClusterState.nodes().isLocalNodeElectedMaster()) {
            logger.debug("publishing cluster state version [{}]", newClusterState.version());
            try { // Functional style again. Tracing it through, the default implementation is ZenDiscovery's publish method; the flow is explained in detail below.
                clusterStatePublisher.accept(clusterChangedEvent, ackListener);
            } catch (Discovery.FailedToCommitClusterStateException t) {
                final long version = newClusterState.version();
                logger.warn(
                    (Supplier<?>) () -> new ParameterizedMessage(
                        "failing [{}]: failed to commit cluster state version [{}]", taskInputs.summary, version),
                    t);
                // ensure that list of connected nodes in NodeConnectionsService is in-sync with the nodes of the current cluster state
                nodeConnectionsService.connectToNodes(previousClusterState.nodes());
                nodeConnectionsService.disconnectFromNodesExcept(previousClusterState.nodes());
                taskOutputs.publishingFailed(t);
                return;
            }
        }

        logger.debug("applying cluster state version {}", newClusterState.version());
        try {
            // nothing to do until we actually recover from the gateway or any other block indicates we need to disable persistency
            if (clusterChangedEvent.state().blocks().disableStatePersistence() == false && clusterChangedEvent.metaDataChanged()) {
                final Settings incomingSettings = clusterChangedEvent.state().metaData().settings();
                clusterSettings.applySettings(incomingSettings);
            }
        } catch (Exception ex) {
            logger.warn("failed to apply cluster settings", ex);
        }

        logger.debug("set local cluster state to version {}", newClusterState.version());
        // Note: the master first sends the state to the other nodes and waits for their responses (30s timeout by default),
        // and only then applies the state locally. This includes the local data-node role, which is another reason why
        // separating master and data nodes makes the source easier to analyze.
        // This raises a question: suppose an index has three shards spread across three nodes and each shard deletion
        // takes 1s. Since this call is synchronous, the deletions are serialized and the total time adds up accordingly.
        callClusterStateAppliers(newClusterState, clusterChangedEvent);

        nodeConnectionsService.disconnectFromNodesExcept(newClusterState.nodes());

        updateState(css -> newClusterState);

        Stream.concat(clusterStateListeners.stream(), timeoutClusterStateListeners.stream()).forEach(listener -> {
            try {
                logger.trace("calling [{}] with change to version [{}]", listener, newClusterState.version());
                listener.clusterChanged(clusterChangedEvent);
            } catch (Exception ex) {
                logger.warn("failed to notify ClusterStateListener", ex);
            }
        });

        //manual ack only from the master at the end of the publish
        if (newClusterState.nodes().isLocalNodeElectedMaster()) {
            try {
                ackListener.onNodeAck(newClusterState.nodes().getLocalNode(), null);
            } catch (Exception e) {
                final DiscoveryNode localNode = newClusterState.nodes().getLocalNode();
                logger.debug(
                    (Supplier<?>) () -> new ParameterizedMessage("error while processing ack for master node [{}]", localNode),
                    e);
            }
        }

        taskOutputs.processedDifferentClusterState(previousClusterState, newClusterState);

        if (newClusterState.nodes().isLocalNodeElectedMaster()) {
            try {
                taskOutputs.clusterStatePublished(clusterChangedEvent);
            } catch (Exception e) {
                logger.error(
                    (Supplier<?>) () -> new ParameterizedMessage(
                        "exception thrown while notifying executor of new cluster state publication [{}]",
                        taskInputs.summary),
                    e);
            }
        }
    }
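  On the version comparison mentioned above: ClusterState itself encodes the rule. This is a paraphrase of the 5.x method, not a verbatim copy; check your version's source:

// ClusterState.supersedes (ES 5.x, paraphrased): a state replaces another only if it
// comes from the same elected master and carries a strictly higher version.
public boolean supersedes(ClusterState other) {
    return this.nodes().getMasterNodeId() != null
        && this.nodes().getMasterNodeId().equals(other.nodes().getMasterNodeId())
        && this.version() > other.version();
}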

State publishing

  Publishing actually happens in two phases, called send and commit, in order to keep the cluster state consistent. The master first issues the send request; once enough nodes have responded, the master issues the commit request, and only then do the other nodes apply the state. A few details:
  1. After a send request, the receiving node stores the state in a queue.
  2. When the commit request arrives, the corresponding state in the queue is marked as committed and then processed.
  3. The send request: SEND_ACTION_NAME = "internal:discovery/zen/publish/send";
  4. The commit request: COMMIT_ACTION_NAME = "internal:discovery/zen/publish/commit"
  Following these action names leads you to where each request is sent and handled; many parts of Elasticsearch dispatch and handle requests this way (a registration sketch follows this list).
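  As a rough sketch of how the two actions are wired up, here is the registration in PublishClusterStateAction's constructor as it looks in 5.x (paraphrased; the handler class names and exact signatures vary a little across versions):

// PublishClusterStateAction constructor (ES 5.x, paraphrased): the two phases are just
// two transport actions, each with its own handler.
transportService.registerRequestHandler(SEND_ACTION_NAME, BytesTransportRequest::new, ThreadPool.Names.SAME,
        new SendClusterStateRequestHandler());
transportService.registerRequestHandler(COMMIT_ACTION_NAME, CommitClusterStateRequest::new, ThreadPool.Names.SAME,
        new CommitClusterStateRequestHandler());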

Processing logic

  Follow the trail far enough and you'll see that index creation and deletion are executed in IndicesClusterStateService. It is invoked when ClusterService applies the state locally, via the callClusterStateAppliers(newClusterState, clusterChangedEvent) call shown above.

@Override
    public synchronized void applyClusterState(final ClusterChangedEvent event) {
        if (!lifecycle.started()) {
            return;
        }

        final ClusterState state = event.state();

        // we need to clean the shards and indices we have on this node, since we
        // are going to recover them again once state persistence is disabled (no master / not recovered)
        // TODO: feels hacky, a block disables state persistence, and then we clean the allocated shards, maybe another flag in blocks?
        if (state.blocks().disableStatePersistence()) {
            for (AllocatedIndex<? extends Shard> indexService : indicesService) {
                indicesService.removeIndex(indexService.index(), NO_LONGER_ASSIGNED,
                    "cleaning index (disabled block persistence)"); // also cleans shards
            }
            return;
        }

        updateFailedShardsCache(state);

        deleteIndices(event); // also deletes shards of deleted indices

        removeUnallocatedIndices(event); // also removes shards of removed indices

        failMissingShards(state);

        removeShards(state);   // removes any local shards that doesn't match what the master expects

        updateIndices(event); // can also fail shards, but these are then guaranteed to be in failedShardsCache

        createIndices(state);

        createOrUpdateShards(state);
    }

Points to note:

  1. This method is synchronized: until the previous state finishes applying, the next state cannot enter.
  2. Doesn't that mean a slow creation or deletion would block everything? In fact, everything in this method is a metadata update; deletions and the expensive data recovery flows run on background threads (see the sketch after this list), so logically the applier thread is not blocked. The recovery flow has enough complexity of its own that I'll cover it in a dedicated post.
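  To illustrate point 2, here is a simplified paraphrase of the pattern IndicesClusterStateService uses for pending index deletions in 5.x (trimmed, not verbatim): the applier thread only schedules the work, and the blocking disk I/O runs on the generic thread pool.

// Paraphrased from IndicesClusterStateService.deleteIndices (ES 5.x, simplified):
// the expensive on-disk deletion is handed off to a background thread.
threadPool.generic().execute(new AbstractRunnable() {
    @Override
    public void onFailure(Exception e) {
        logger.warn("[" + index + "] failed to complete pending deletion for index", e);
    }

    @Override
    protected void doRun() throws Exception {
        // blocks until the shard locks are free, then deletes the store;
        // all of this happens off the cluster state applier thread
        indicesService.processPendingDeletes(index, indexSettings, new TimeValue(30, TimeUnit.MINUTES));
    }
});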

     After this rather involved flow, the cluster state has been updated.

Data node

  Data nodes are mainly responsible for writing data; node.data defaults to true.
  The main question here is how the cluster state gets applied on a data node.

The send request

  As mentioned above, the send request uses the action name SEND_ACTION_NAME; searching for it leads to the handler logic:

protected void handleIncomingClusterStateRequest(BytesTransportRequest request, TransportChannel channel) throws IOException {
        Compressor compressor = CompressorFactory.compressor(request.bytes());
        StreamInput in = request.bytes().streamInput();
        try {
            if (compressor != null) {
                in = compressor.streamInput(in);
            }
            in = new NamedWriteableAwareStreamInput(in, namedWriteableRegistry);
            in.setVersion(request.version());
            synchronized (lastSeenClusterStateMutex) {
                final ClusterState incomingState;
                // If true we received full cluster state - otherwise diffs
                if (in.readBoolean()) {
                    incomingState = ClusterState.readFrom(in, clusterStateSupplier.get().nodes().getLocalNode());
                    logger.debug("received full cluster state version [{}] with size [{}]", incomingState.version(),
                        request.bytes().length());
                } else if (lastSeenClusterState != null) {
                    Diff<ClusterState> diff = ClusterState.readDiffFrom(in, lastSeenClusterState.nodes().getLocalNode());
                    incomingState = diff.apply(lastSeenClusterState);
                    logger.debug("received diff cluster state version [{}] with uuid [{}], diff size [{}]",
                        incomingState.version(), incomingState.stateUUID(), request.bytes().length());
                } else {
                    logger.debug("received diff for but don't have any local cluster state - requesting full state");
                    throw new IncompatibleClusterStateVersionException("have no local cluster state");
                }
                // sanity check incoming state
                validateIncomingState(incomingState, lastSeenClusterState);

                pendingStatesQueue.addPending(incomingState); // Key point: the state is only added to the pending queue here
                lastSeenClusterState = incomingState;
            }
        } finally {
            IOUtils.close(in);
        }
        channel.sendResponse(TransportResponse.Empty.INSTANCE);
    }

  So the send phase only ensures that the data node has received the state; nothing is processed yet, and the state just sits in pendingStatesQueue. The response tells the master that this data node can receive messages; the master will follow up with a commit request (how the master decides when to commit is sketched below).
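  How does the master decide when to move from send to commit? The decision is quorum-based: the commit is only issued once enough master-eligible nodes (see discovery.zen.minimum_master_nodes) have acknowledged the send phase. A condensed paraphrase of the idea in PublishClusterStateAction.SendingController (ES 5.x; the real code spreads this over several methods, and onMasterNodeSendAck here is my own condensed name):

// SendingController (ES 5.x, condensed paraphrase): count send-phase acks from
// master-eligible nodes; once the quorum is reached, mark the state committed and
// send commit requests to everyone that already acked.
synchronized void onMasterNodeSendAck(DiscoveryNode node) {
    neededMastersToCommit--;
    if (neededMastersToCommit == 0 && markAsCommitted()) {
        for (DiscoveryNode nodeToCommit : sendAckedBeforeCommit) {
            sendCommitToNode(nodeToCommit, clusterState, this);
        }
        sendAckedBeforeCommit.clear();
    }
}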

The commit request

  Search for COMMIT_ACTION_NAME the same way (Ctrl+H, or your IDE's global search) and you can see how this action is registered and what its handler does:

protected void handleCommitRequest(CommitClusterStateRequest request, final TransportChannel channel) {
        final ClusterState state = pendingStatesQueue.markAsCommitted(request.stateUUID,
            new PendingClusterStatesQueue.StateProcessedListener() {
            @Override
            public void onNewClusterStateProcessed() {  // A pattern you'll see everywhere in async frameworks: sendResponse is called once processing completes
                try {
                    // send a response to the master to indicate that this cluster state has been processed post committing it.
                    channel.sendResponse(TransportResponse.Empty.INSTANCE);
                } catch (Exception e) {
                    logger.debug("failed to send response on cluster state processed", e);
                    onNewClusterStateFailed(e);
                }
            }

            @Override
            public void onNewClusterStateFailed(Exception e) {
                try {
                    channel.sendResponse(e);
                } catch (Exception inner) {
                    inner.addSuppressed(e);
                    logger.debug("failed to send response on cluster state processed", inner);
                }
            }
        });
        if (state != null) {
            newPendingClusterStatelistener.onNewClusterState("master " + state.nodes().getMasterNode() +
                " committed version [" + state.version() + "]");  // 具体处理逻辑
        }
    }

  The flow then lands back in ZenDiscovery's handling logic:

private class NewPendingClusterStateListener implements PublishClusterStateAction.NewPendingClusterStateListener {

        @Override
        public void onNewClusterState(String reason) {
            processNextPendingClusterState(reason);
        }
    }

  processNextPendingClusterState eventually submits a BatchedTask, and the actual processing returns to ClusterService, which matches up with the flow described above. But note one thing here.
  Pay close attention to the following!
  1. threadExecutor: trace its initialization and you'll see the pool size is 1, a single thread (see the executor sketch after the submitTasks code below). Being single-threaded means that if the currently running task never finishes and is never removed, the pool buffers every subsequent request in its work queue.
  2. Remember that all the state updates described earlier must first go through pendingStatesQueue, so if this thread pool stays stuck, pendingStatesQueue keeps accumulating. The queue is bounded at 25 entries; on overflow, the oldest pending state update is dropped (a sketch of this eviction follows this list). Our project modifies Elasticsearch with added C++ logic, and we hit exactly this trap: the C++ backend deadlocked, the single update thread stayed stuck forever, and later requests could never get in, so pendingStatesQueue kept evicting states while nothing was ever processed.
  3. curl 127.0.0.1:9200/_cat/tasks?v shows the tasks currently executing in the background.
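  The eviction behavior in point 2 comes from PendingClusterStatesQueue; the 25 is the default of the discovery.zen.publish.max_pending_cluster_states setting. A paraphrase of the 5.x logic:

// PendingClusterStatesQueue.addPending (ES 5.x, paraphrased): when the queue is full,
// the oldest pending state is dropped, which is exactly what we observed while the
// applier thread was stuck.
public synchronized void addPending(ClusterState state) {
    pendingStates.add(new ClusterStateContext(state));
    if (pendingStates.size() > maxQueueSize) {
        ClusterStateContext oldest = pendingStates.remove(0);
        logger.warn("dropping pending state [{}]. more than [{}] pending states.", oldest, maxQueueSize);
    }
}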
  

public void submitTasks(List<? extends BatchedTask> tasks, @Nullable TimeValue timeout) throws EsRejectedExecutionException {
        if (tasks.isEmpty()) {
            return;
        }
        final BatchedTask firstTask = tasks.get(0);
        assert tasks.stream().allMatch(t -> t.batchingKey == firstTask.batchingKey) :
            "tasks submitted in a batch should share the same batching key: " + tasks;
        // convert to an identity map to check for dups based on task identity
        final Map<Object, BatchedTask> tasksIdentity = tasks.stream().collect(Collectors.toMap(
            BatchedTask::getTask,
            Function.identity(),
            (a, b) -> { throw new IllegalStateException("cannot add duplicate task: " + a); },
            IdentityHashMap::new));

        synchronized (tasksPerBatchingKey) {
            LinkedHashSet<BatchedTask> existingTasks = tasksPerBatchingKey.computeIfAbsent(firstTask.batchingKey,
                k -> new LinkedHashSet<>(tasks.size()));
            for (BatchedTask existing : existingTasks) {
                // check that there won't be two tasks with the same identity for the same batching key
                BatchedTask duplicateTask = tasksIdentity.get(existing.getTask());
                if (duplicateTask != null) {
                    throw new IllegalStateException("task [" + duplicateTask.describeTasks(
                        Collections.singletonList(existing)) + "] with source [" + duplicateTask.source + "] is already queued");
                }
            }
            existingTasks.addAll(tasks);
        }

        if (timeout != null) {
            threadExecutor.execute(firstTask, timeout, () -> onTimeoutInternal(tasks, timeout));
        } else {
            threadExecutor.execute(firstTask);
        }
    }
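  And for point 1 above: the executor behind threadExecutor is created in ClusterService as a single prioritized update thread. A 5.x paraphrase (the exact argument list varies by version), which is why one stuck task stalls every later cluster state update:

// ClusterService.doStart (ES 5.x, paraphrased): a single-threaded prioritized executor.
// There is no parallelism to fall back on when one task hangs.
threadPoolExecutor = EsExecutors.newSinglePrioritizing(UPDATE_THREAD_NAME,
        daemonThreadFactory(settings, UPDATE_THREAD_NAME),
        threadPool.getThreadContext(), threadPool.scheduler());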
