The Elasticsearch Cluster Join Process

Introduction

One of the most important things an es node does at startup is join the cluster. In this post we walk through the source code of that join process.

Key background

Elasticsearch basics

Elasticsearch is a distributed search engine built on Lucene; the basics are not repeated here, so please look them up if needed.

The Bully algorithm

When it comes to distributed election algorithms, Paxos is the one everybody knows, but it is fairly complex. Elasticsearch instead uses the much simpler Bully algorithm, which works roughly as follows (a minimal sketch follows the list):

  1. A process broadcasts to every process it can reach and collects their IDs.
  2. If it finds that its own ID is the highest, it declares itself the master node and broadcasts that announcement.
  3. If it sees a process with a higher ID than its own, it concludes that it cannot become master and waits for the master's announcement.
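
To make the idea concrete, here is a minimal, self-contained sketch of a Bully-style election; the Peer type, the ids and the names are hypothetical illustrations, not Elasticsearch code:

import java.util.Collection;
import java.util.List;

// Minimal Bully-style election sketch: each node knows the ids of the peers it can
// reach; the reachable node with the highest id should become master.
class BullyElectionSketch {

    // Hypothetical peer descriptor; in a real system the id would come from discovery.
    record Peer(long id, String name) {}

    // Returns the peer this node believes should be master: the one with the highest id
    // among itself and every reachable peer. If that is ourselves we would broadcast
    // "I am master"; otherwise we wait for the winner's announcement.
    static Peer electMaster(Peer self, Collection<Peer> reachablePeers) {
        Peer best = self;
        for (Peer peer : reachablePeers) {
            if (peer.id() > best.id()) {
                best = peer; // a higher id outranks us
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Peer self = new Peer(2, "node-2");
        List<Peer> peers = List.of(new Peer(1, "node-1"), new Peer(3, "node-3"));
        // node-3 has the highest id, so node-2 waits for node-3's master announcement.
        System.out.println("expected master: " + electMaster(self, peers).name());
    }
}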

The whole election flow is very simple; now let's dig into the es code to see how it is implemented.

Code analysis

Node.start() does a lot of work; here we only care about the lines below. The default discovery implementation is ZenDiscovery.

public Node start() throws NodeValidationException {
    ...
    // start Discovery, the module responsible for finding the other nodes
    discovery.start();
    transportService.acceptIncomingRequests();
    // start joining the cluster
    discovery.startInitialJoin();
    ...
}

ZenDiscovery is the default node discovery service in Elasticsearch; the code above calls its startInitialJoin() method:

@Override
public void startInitialJoin() {
    // start the join thread from a cluster state update. See {@link JoinThreadControl} for details.
    clusterService.submitStateUpdateTask("initial_join", new LocalClusterUpdateTask() {

        @Override
        public ClusterTasksResult<LocalClusterUpdateTask> execute(ClusterState currentState) throws Exception {
            // do the join on a different thread, the DiscoveryService waits for 30s anyhow till it is discovered
            joinThreadControl.startNewThreadIfNotRunning();
            return unchanged();
        }

        @Override
        public void onFailure(String source, @org.elasticsearch.common.Nullable Exception e) {
            logger.warn("failed to start initial join process", e);
        }
    });
}

Here a task is submitted directly to clusterService; in its execute method, joinThreadControl spawns a thread to perform the join. JoinThreadControl is an inner class of ZenDiscovery that manages the join thread: it guarantees that only one thread runs the join task at a time and handles the bookkeeping after a successful join. Handing the work off like this lets startInitialJoin() return quickly without blocking the node's start() method.

Let's look at startNewThreadIfNotRunning():

public void startNewThreadIfNotRunning() {
    ClusterService.assertClusterStateThread();
    if (joinThreadActive()) {
        return;
    }
    threadPool.generic().execute(new Runnable() {
        @Override
        public void run() {
            Thread currentThread = Thread.currentThread();
            if (!currentJoinThread.compareAndSet(null, currentThread)) {
                return;
            }
            while (running.get() && joinThreadActive(currentThread)) {
                try {
                    innerJoinCluster();
                    return;
                } catch (Exception e) {
                    logger.error("unexpected error while joining cluster, trying again", e);
                    // Because we catch any exception here, we want to know in
                    // tests if an uncaught exception got to this point and the test infra uncaught exception
                    // leak detection can catch this. In practise no uncaught exception should leak
                    assert ExceptionsHelper.reThrowIfNotNull(e);
                }
            }
            // cleaning the current thread from currentJoinThread is done by explicit calls.
        }
    });
}

This method is also straightforward: it simply calls innerJoinCluster(). Note the while loop: as long as joinThreadControl is still running and the current thread is still the active join thread, an exception causes another attempt, so the loop keeps retrying until the join succeeds.

Next, the innerJoinCluster() method:

private void innerJoinCluster() {
    //1--------------------------------
    DiscoveryNode masterNode = null;
    final Thread currentThread = Thread.currentThread();
    nodeJoinController.startElectionContext();
    while (masterNode == null && joinThreadControl.joinThreadActive(currentThread)) {
        masterNode = findMaster();
    }

    if (!joinThreadControl.joinThreadActive(currentThread)) {
        logger.trace("thread is no longer in currentJoinThread. Stopping.");
        return;
    }

    // 2-----------------------------
    if (clusterService.localNode().equals(masterNode)) {
        final int requiredJoins = Math.max(0, electMaster.minimumMasterNodes() - 1); // we count as one
        logger.debug("elected as master, waiting for incoming joins ([{}] needed)", requiredJoins);
        nodeJoinController.waitToBeElectedAsMaster(requiredJoins, masterElectionWaitForJoinsTimeout,
            new NodeJoinController.ElectionCallback() {
                    @Override
                    public void onElectedAsMaster(ClusterState state) {
                        joinThreadControl.markThreadAsDone(currentThread);
                        // we only starts nodesFD if we are master (it may be that we received a cluster state while pinging)
                        nodesFD.updateNodesAndPing(state); // start the nodes FD
                    }

                    @Override
                    public void onFailure(Throwable t) {
                        logger.trace("failed while waiting for nodes to join, rejoining", t);
                        joinThreadControl.markThreadAsDoneAndStartNew(currentThread);
                    }
                }

        );
        
    // 3------------------------------------
    } else {
        // process any incoming joins (they will fail because we are not the master)
        nodeJoinController.stopElectionContext(masterNode + " elected");

        // send join request
        final boolean success = joinElectedMaster(masterNode);

        // finalize join through the cluster state update thread
        final DiscoveryNode finalMasterNode = masterNode;
        clusterService.submitStateUpdateTask("finalize_join (" + masterNode + ")", new LocalClusterUpdateTask() {
            @Override
            public ClusterTasksResult<LocalClusterUpdateTask> execute(ClusterState currentState) throws Exception {
                if (!success) {
                    // failed to join. Try again...
                    joinThreadControl.markThreadAsDoneAndStartNew(currentThread);
                    return unchanged();
                }

                if (currentState.getNodes().getMasterNode() == null) {
                    // Post 1.3.0, the master should publish a new cluster state before acking our join request. we now should have
                    // a valid master.
                    logger.debug("no master node is set, despite of join request completing. retrying pings.");
                    joinThreadControl.markThreadAsDoneAndStartNew(currentThread);
                    return unchanged();
                }

                if (!currentState.getNodes().getMasterNode().equals(finalMasterNode)) {
                    return joinThreadControl.stopRunningThreadAndRejoin(currentState, "master_switched_while_finalizing_join");
                }

                // Note: we do not have to start master fault detection here because it's set at {@link #processNextPendingClusterState }
                // when the first cluster state arrives.
                joinThreadControl.markThreadAsDone(currentThread);
                return unchanged();
            }

            @Override
            public void onFailure(String source, @Nullable Exception e) {
                logger.error("unexpected error while trying to finalize cluster join", e);
                joinThreadControl.markThreadAsDoneAndStartNew(currentThread);
            }
        });
    }
}

This method is a bit longer, so let's split it into three parts and walk through them one by one:

  1. As the name suggests, this step obtains a master node, looping until one is found. Note that the node returned here is not necessarily the real master, only the node most likely to become master; it still has to be verified below.
  2. If the chosen master is the local node itself, it first has to collect enough supporting joins. If enough votes arrive within the configured timeout, it confirms itself as master and starts nodesFD to monitor the other nodes. Otherwise it calls markThreadAsDoneAndStartNew(), which effectively reruns the election: because the callback is asynchronous, the previous thread is first marked as done and only then is a new task submitted to the thread pool.
  3. If the chosen master is another node, the local node joins that master and syncs the master's clusterState. In the callback it inspects the clusterState received from the master and performs a series of checks; if the node it picked turns out not to be the real master, it starts a new join round (markThreadAsDoneAndStartNew()).

Now let's look at the findMaster() method:

private DiscoveryNode findMaster() {
    logger.trace("starting to ping");
    List<ZenPing.PingResponse> fullPingResponses = pingAndWait(pingTimeout).toList();
    if (fullPingResponses == null) {
        logger.trace("No full ping responses");
        return null;
    }
    if (logger.isTraceEnabled()) {
        StringBuilder sb = new StringBuilder();
        if (fullPingResponses.size() == 0) {
            sb.append(" {none}");
        } else {
            for (ZenPing.PingResponse pingResponse : fullPingResponses) {
                sb.append("\n\t--> ").append(pingResponse);
            }
        }
        logger.trace("full ping responses:{}", sb);
    }

    final DiscoveryNode localNode = clusterService.localNode();

    // add our selves
    assert fullPingResponses.stream().map(ZenPing.PingResponse::node)
        .filter(n -> n.equals(localNode)).findAny().isPresent() == false;

    fullPingResponses.add(new ZenPing.PingResponse(localNode, null, clusterService.state()));

    // filter responses
    final List<ZenPing.PingResponse> pingResponses = filterPingResponses(fullPingResponses, masterElectionIgnoreNonMasters, logger);

    List<DiscoveryNode> activeMasters = new ArrayList<>();
    for (ZenPing.PingResponse pingResponse : pingResponses) {
        // We can't include the local node in pingMasters list, otherwise we may up electing ourselves without
        // any check / verifications from other nodes in ZenDiscover#innerJoinCluster()
        if (pingResponse.master() != null && !localNode.equals(pingResponse.master())) {
            activeMasters.add(pingResponse.master());
        }
    }

    // nodes discovered during pinging
    List<ElectMasterService.MasterCandidate> masterCandidates = new ArrayList<>();
    for (ZenPing.PingResponse pingResponse : pingResponses) {
        if (pingResponse.node().isMasterNode()) {
            masterCandidates.add(new ElectMasterService.MasterCandidate(pingResponse.node(), pingResponse.getClusterStateVersion()));
        }
    }

    if (activeMasters.isEmpty()) {
        if (electMaster.hasEnoughCandidates(masterCandidates)) {
            final ElectMasterService.MasterCandidate winner = electMaster.electMaster(masterCandidates);
            logger.trace("candidate {} won election", winner);
            return winner.getNode();
        } else {
            // if we don't have enough master nodes, we bail, because there are not enough master to elect from
            logger.warn("not enough master nodes discovered during pinging (found [{}], but needed [{}]), pinging again",
                        masterCandidates, electMaster.minimumMasterNodes());
            return null;
        }
    } else {
        assert !activeMasters.contains(localNode) : "local node should never be elected as master when other nodes indicate an active master";
        // lets tie break between discovered nodes
        return electMaster.tieBreakActiveMasters(activeMasters);
    }
}

Roughly speaking, it does the following and finally returns a master node:

  1. Synchronously ping all known nodes and wait for their responses.
  2. Add the local node to the ping responses and, depending on configuration (masterElectionIgnoreNonMasters, false by default), decide whether to filter out nodes that are not master-eligible.
  3. Put every master reported by the ping responses (excluding the local node) into the activeMasters list, and every master-eligible node into the masterCandidates list.
  4. If activeMasters is empty, no node currently reports a master, so run a local election: provided there are enough master-eligible candidates, pick the one with the highest clusterStateVersion, breaking ties by a stable node ordering (see the sketch after this list); otherwise return null and ping again.
  5. If activeMasters is not empty, pick one of the masters already reported by other nodes via electMaster.tieBreakActiveMasters().
  6. Return the chosen master node.
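
Step 4 boils down to ordering the candidates by how new their cluster state is, with a stable tie-break. Below is a hedged sketch of that ordering; the Candidate type is a hypothetical stand-in for ElectMasterService.MasterCandidate, not the real class:

import java.util.Comparator;
import java.util.List;

// Sketch of the election ordering from step 4: prefer the candidate that has seen
// the newest cluster state, and break ties with a stable per-node ordering.
class ElectionOrderingSketch {

    // Hypothetical stand-in for ElectMasterService.MasterCandidate.
    record Candidate(String nodeId, long clusterStateVersion) {}

    // "Best first": highest clusterStateVersion, then lowest nodeId.
    static final Comparator<Candidate> BEST_FIRST =
        Comparator.comparingLong(Candidate::clusterStateVersion).reversed()
                  .thenComparing(Candidate::nodeId);

    static Candidate electMaster(List<Candidate> candidates) {
        return candidates.stream().min(BEST_FIRST).orElseThrow();
    }

    public static void main(String[] args) {
        List<Candidate> candidates = List.of(
            new Candidate("node-b", 12),
            new Candidate("node-a", 12),
            new Candidate("node-c", 9));
        // node-a and node-b share the newest version 12; the tie-break picks node-a.
        System.out.println(electMaster(candidates));
    }
}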

Now let's revisit the handling when the node believes it is the master itself:

    if (clusterService.localNode().equals(masterNode)) {
        final int requiredJoins = Math.max(0, electMaster.minimumMasterNodes() - 1); // we count as one
        logger.debug("elected as master, waiting for incoming joins ([{}] needed)", requiredJoins);
        nodeJoinController.waitToBeElectedAsMaster(requiredJoins, masterElectionWaitForJoinsTimeout,
            new NodeJoinController.ElectionCallback() {
                    @Override
                    public void onElectedAsMaster(ClusterState state) {
                        joinThreadControl.markThreadAsDone(currentThread);
                        // we only starts nodesFD if we are master (it may be that we received a cluster state while pinging)
                        nodesFD.updateNodesAndPing(state); // start the nodes FD
                    }

                    @Override
                    public void onFailure(Throwable t) {
                        logger.trace("failed while waiting for nodes to join, rejoining", t);
                        joinThreadControl.markThreadAsDoneAndStartNew(currentThread);
                    }
                }

        );
    }

What happens here is clear: according to the configuration, the master candidate needs support from at least electMaster.minimumMasterNodes() nodes (its own vote included), so it waits synchronously for a while. If enough votes arrive in time, the callback calls updateNodesAndPing to start monitoring all nodes; otherwise failContextIfNeeded() is called to close the election context and stop the election, and a new join round is started.
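
For example, with discovery.zen.minimum_master_nodes set to 3, the elected candidate counts its own vote and then needs two more incoming joins before the timeout. The sketch below shows only the counting/timeout semantics of that wait, assuming a simple latch; it is an illustration, not the real NodeJoinController:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch of the "wait to be elected" idea: block for up to a timeout until enough
// other master-eligible nodes have sent us a join request.
class WaitForJoinsSketch {

    private final CountDownLatch pendingJoins;

    WaitForJoinsSketch(int minimumMasterNodes) {
        // We count as one vote ourselves, so we only need (minimumMasterNodes - 1) joins.
        this.pendingJoins = new CountDownLatch(Math.max(0, minimumMasterNodes - 1));
    }

    // Called whenever another node's join request arrives.
    void onJoinRequest() {
        pendingJoins.countDown();
    }

    // Returns true if enough joins arrived in time (become master and start nodes fault
    // detection); false means this round failed and a new join attempt should start.
    boolean waitToBeElected(long timeout, TimeUnit unit) throws InterruptedException {
        return pendingJoins.await(timeout, unit);
    }
}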

Finally, let's look at how the node handles the case where the master it picked is not itself:

    } else {
        // process any incoming joins (they will fail because we are not the master)
        nodeJoinController.stopElectionContext(masterNode + " elected");

        // send join request
        final boolean success = joinElectedMaster(masterNode);

        // finalize join through the cluster state update thread
        final DiscoveryNode finalMasterNode = masterNode;
        clusterService.submitStateUpdateTask("finalize_join (" + masterNode + ")", new LocalClusterUpdateTask() {
            @Override
            public ClusterTasksResult<LocalClusterUpdateTask> execute(ClusterState currentState) throws Exception {
                if (!success) {
                    // failed to join. Try again...
                    joinThreadControl.markThreadAsDoneAndStartNew(currentThread);
                    return unchanged();
                }

                if (currentState.getNodes().getMasterNode() == null) {
                    // Post 1.3.0, the master should publish a new cluster state before acking our join request. we now should have
                    // a valid master.
                    logger.debug("no master node is set, despite of join request completing. retrying pings.");
                    joinThreadControl.markThreadAsDoneAndStartNew(currentThread);
                    return unchanged();
                }

                if (!currentState.getNodes().getMasterNode().equals(finalMasterNode)) {
                    return joinThreadControl.stopRunningThreadAndRejoin(currentState, "master_switched_while_finalizing_join");
                }

                // Note: we do not have to start master fault detection here because it's set at {@link #processNextPendingClusterState }
                // when the first cluster state arrives.
                joinThreadControl.markThreadAsDone(currentThread);
                return unchanged();
            }

            @Override
            public void onFailure(String source, @Nullable Exception e) {
                logger.error("unexpected error while trying to finalize cluster join", e);
                joinThreadControl.markThreadAsDoneAndStartNew(currentThread);
            }
        });
    }

First the node sends a join request to the node it believes is the master. Since that choice was made locally, it may not be the real master, so the clusterState returned by the master is used to verify the choice: if the join fails, if the returned state still has no master node, or if the master in that state is not the node we picked, a new join round has to be started.
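
The decision logic in that execute() callback can be condensed into a tiny function; this is a paraphrase of the checks above, with String ids standing in for DiscoveryNode, not the actual Elasticsearch code:

// Condensed view of the finalize_join checks above.
class FinalizeJoinSketch {

    enum Outcome { DONE, RETRY_JOIN, REJOIN_MASTER_SWITCHED }

    static Outcome decide(boolean joinSucceeded, String masterInClusterState, String chosenMaster) {
        if (!joinSucceeded) {
            return Outcome.RETRY_JOIN;               // join request failed -> start a new round
        }
        if (masterInClusterState == null) {
            return Outcome.RETRY_JOIN;               // join acked but no master published yet -> ping again
        }
        if (!masterInClusterState.equals(chosenMaster)) {
            return Outcome.REJOIN_MASTER_SWITCHED;   // another master took over while finalizing
        }
        return Outcome.DONE;                         // joined the expected master; its cluster state will follow
    }
}

That is the whole join flow: find the node most likely to be master, either wait to be elected or join the elected master, and verify the outcome against the published cluster state.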
