Our current project modifies Elasticsearch at the source level. A while back, a bug left one of our clusters in a hung, seemingly dead state, which is what pushed me to dig into how cluster state synchronization actually works in the source code.
First, some context: Solr, for example, uses ZooKeeper to synchronize cluster state, i.e. it delegates cluster state maintenance to a third-party component. Elasticsearch does not; it uses neither ZooKeeper nor etcd to manage master election and node coordination.
So how does Elasticsearch synchronize cluster state?
I recommend the article ELASTICSEARCH 机制和架构 (Elasticsearch mechanisms and architecture); that site has a lot of Elasticsearch analysis and was a big inspiration for me. This post only builds a little on that author's groundwork.
Rather than covering everything, let's keep it simple and focus on two node types.
First, a node is only eligible to be elected master if node.master: true is set in its elasticsearch.yml.
If you want to do your own source analysis, it is best to run dedicated master and data nodes and add extra log statements where you can, or enable debug logging and trace the flow that way. On a single node, many of these steps are asynchronous, so the master-side and data-side flows are hard to tell apart.
Second, the master node is mainly responsible for publishing cluster state; the class to focus on is ClusterService.
Here is the entry point for state updates (how execution gets here is something to trace separately):
void runTasks(TaskInputs taskInputs) {
...
TaskOutputs taskOutputs = calculateTaskOutputs(taskInputs, previousClusterState, startTimeNS); // Key point #1: the master computes the new cluster state, e.g. the metadata after an index has been created.
taskOutputs.notifyFailedTasks();
if (taskOutputs.clusterStateUnchanged()) {
taskOutputs.notifySuccessfulTasksOnUnchangedClusterState();
TimeValue executionTime = TimeValue.timeValueMillis(Math.max(0, TimeValue.nsecToMSec(currentTimeInNanos() - startTimeNS)));
logger.debug("processing [{}]: took [{}] no change in cluster_state", taskInputs.summary, executionTime);
warnAboutSlowTaskIfNeeded(executionTime, taskInputs.summary);
} else {
ClusterState newClusterState = taskOutputs.newClusterState;
if (logger.isTraceEnabled()) {
logger.trace("cluster state updated, source [{}]\n{}", taskInputs.summary, newClusterState);
} else if (logger.isDebugEnabled()) {
logger.debug("cluster state updated, version [{}], source [{}]", newClusterState.version(), taskInputs.summary);
}
try {
publishAndApplyChanges(taskInputs, taskOutputs); // Exactly what the name says: publish the cluster state to the other nodes and apply it.
TimeValue executionTime = TimeValue.timeValueMillis(Math.max(0, TimeValue.nsecToMSec(currentTimeInNanos() - startTimeNS)));
logger.debug("processing [{}]: took [{}] done applying updated cluster_state (version: {}, uuid: {})", taskInputs.summary,
executionTime, newClusterState.version(), newClusterState.stateUUID());
warnAboutSlowTaskIfNeeded(executionTime, taskInputs.summary);
} catch (Exception e) {
TimeValue executionTime = TimeValue.timeValueMillis(Math.max(0, TimeValue.nsecToMSec(currentTimeInNanos() - startTimeNS)));
final long version = newClusterState.version();
final String stateUUID = newClusterState.stateUUID();
final String fullState = newClusterState.toString();
logger.warn(
(Supplier<?>) () -> new ParameterizedMessage(
"failed to apply updated cluster state in [{}]:\nversion [{}], uuid [{}], source [{}]\n{}",
executionTime,
version,
stateUUID,
taskInputs.summary,
fullState),
e);
// TODO: do we want to call updateTask.onFailure here?
}
}
}
Next, let's look at how the state gets published. This flow is fairly long as well, so I'll work through it step by step.
Every state update carries a version; by comparing versions, a node can tell which update is the most recent.
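As a minimal, self-contained illustration of that idea (ClusterStateStub and supersedes are invented names for this sketch, not the real Elasticsearch classes):
// Hypothetical sketch: version-based ordering between two cluster states.
// The real class is ClusterState; only version and stateUUID matter for this example.
final class ClusterStateStub {
    final long version;     // incremented by the master on every update
    final String stateUUID; // unique id of this particular state

    ClusterStateStub(long version, String stateUUID) {
        this.version = version;
        this.stateUUID = stateUUID;
    }

    // An incoming state supersedes the last-seen one if its version is strictly higher.
    static boolean supersedes(ClusterStateStub incoming, ClusterStateStub lastSeen) {
        return lastSeen == null || incoming.version > lastSeen.version;
    }

    public static void main(String[] args) {
        ClusterStateStub lastSeen = new ClusterStateStub(41, "uuid-41");
        ClusterStateStub incoming = new ClusterStateStub(42, "uuid-42");
        System.out.println(supersedes(incoming, lastSeen)); // prints: true
    }
}
With that in mind, here is publishAndApplyChanges on the master: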
private void publishAndApplyChanges(TaskInputs taskInputs, TaskOutputs taskOutputs) {
ClusterState previousClusterState = taskOutputs.previousClusterState;
ClusterState newClusterState = taskOutputs.newClusterState;
ClusterChangedEvent clusterChangedEvent = new ClusterChangedEvent(taskInputs.summary, newClusterState, previousClusterState);
// new cluster state, notify all listeners
final DiscoveryNodes.Delta nodesDelta = clusterChangedEvent.nodesDelta();
if (nodesDelta.hasChanges() && logger.isInfoEnabled()) {
String summary = nodesDelta.shortSummary();
if (summary.length() > 0) {
logger.info("{}, reason: {}", summary, taskInputs.summary);
}
}
final Discovery.AckListener ackListener = newClusterState.nodes().isLocalNodeElectedMaster() ?
taskOutputs.createAckListener(threadPool, newClusterState) :
null;
nodeConnectionsService.connectToNodes(newClusterState.nodes());
// if we are the master, publish the new state to all nodes
// we publish here before we send a notification to all the listeners, since if it fails
// we don't want to notify
// This is the master-side publish logic.
if (newClusterState.nodes().isLocalNodeElectedMaster()) {
logger.debug("publishing cluster state version [{}]", newClusterState.version());
try { // Functional style again; after tracing it through, what runs by default is ZenDiscovery's publish method. That flow is explained in detail below.
clusterStatePublisher.accept(clusterChangedEvent, ackListener);
} catch (Discovery.FailedToCommitClusterStateException t) {
final long version = newClusterState.version();
logger.warn(
(Supplier<?>) () -> new ParameterizedMessage(
"failing [{}]: failed to commit cluster state version [{}]", taskInputs.summary, version),
t);
// ensure that list of connected nodes in NodeConnectionsService is in-sync with the nodes of the current cluster state
nodeConnectionsService.connectToNodes(previousClusterState.nodes());
nodeConnectionsService.disconnectFromNodesExcept(previousClusterState.nodes());
taskOutputs.publishingFailed(t);
return;
}
}
logger.debug("applying cluster state version {}", newClusterState.version());
try {
// nothing to do until we actually recover from the gateway or any other block indicates we need to disable persistency
if (clusterChangedEvent.state().blocks().disableStatePersistence() == false && clusterChangedEvent.metaDataChanged()) {
final Settings incomingSettings = clusterChangedEvent.state().metaData().settings();
clusterSettings.applySettings(incomingSettings);
}
} catch (Exception ex) {
logger.warn("failed to apply cluster settings", ex);
}
logger.debug("set local cluster state to version {}", newClusterState.version());
// Note the ordering here: the master first publishes to the other nodes (nodes that do not respond hit a 30s default timeout) and only after that applies the state update locally. "Locally" means the local data-node role, which is another reason to separate master and data nodes when reading the source.
// This raises a question: suppose an index has three shards spread over three nodes and deleting each shard takes 1s. Since this part runs synchronously (remote nodes first, then the local apply), the total deletion time comes to about 2s.
callClusterStateAppliers(newClusterState, clusterChangedEvent);
nodeConnectionsService.disconnectFromNodesExcept(newClusterState.nodes());
updateState(css -> newClusterState);
Stream.concat(clusterStateListeners.stream(), timeoutClusterStateListeners.stream()).forEach(listener -> {
try {
logger.trace("calling [{}] with change to version [{}]", listener, newClusterState.version());
listener.clusterChanged(clusterChangedEvent);
} catch (Exception ex) {
logger.warn("failed to notify ClusterStateListener", ex);
}
});
//manual ack only from the master at the end of the publish
if (newClusterState.nodes().isLocalNodeElectedMaster()) {
try {
ackListener.onNodeAck(newClusterState.nodes().getLocalNode(), null);
} catch (Exception e) {
final DiscoveryNode localNode = newClusterState.nodes().getLocalNode();
logger.debug(
(Supplier<?>) () -> new ParameterizedMessage("error while processing ack for master node [{}]", localNode),
e);
}
}
taskOutputs.processedDifferentClusterState(previousClusterState, newClusterState);
if (newClusterState.nodes().isLocalNodeElectedMaster()) {
try {
taskOutputs.clusterStatePublished(clusterChangedEvent);
} catch (Exception e) {
logger.error(
(Supplier<?>) () -> new ParameterizedMessage(
"exception thrown while notifying executor of new cluster state publication [{}]",
taskInputs.summary),
e);
}
}
}
The actual distribution of the state happens in two phases, send and commit, in order to keep the cluster state consistent. The master first sends a send request; once enough nodes have responded, it sends a commit request, and only then do the other nodes start applying the state. A few things follow from this:
1. When a node receives the send request, it stores the state in a queue.
2. When the commit request arrives, the queued state is marked as committed and then processed.
3. The send request uses SEND_ACTION_NAME = "internal:discovery/zen/publish/send";
4. The commit request uses COMMIT_ACTION_NAME = "internal:discovery/zen/publish/commit".
Following these action names leads you to both the sending code and the handler code; a lot of request sending and handling in Elasticsearch is wired up exactly this way. A simplified sketch of the two-phase idea follows below.
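To make the two-phase shape concrete, here is a small, self-contained sketch of a send-then-commit coordinator. It is only an illustration of the protocol, not the real PublishClusterStateAction code; NodeStub, publish and the minimum-ack rule are all assumptions made for this example.
import java.util.ArrayList;
import java.util.List;

// Hypothetical two-phase publish: send to everyone, commit only if enough nodes acknowledged.
final class TwoPhasePublishSketch {

    interface NodeStub {
        boolean send(long stateVersion); // phase 1: ship the state; the node only queues it
        void commit(long stateVersion);  // phase 2: the node applies the queued state
    }

    // Returns true if the state was committed, false if too few nodes acknowledged the send.
    static boolean publish(List<NodeStub> nodes, long stateVersion, int minAcksToCommit) {
        List<NodeStub> acked = new ArrayList<>();
        for (NodeStub node : nodes) {
            if (node.send(stateVersion)) { // in the real code this is asynchronous and has a timeout
                acked.add(node);
            }
        }
        if (acked.size() < minAcksToCommit) {
            return false; // roughly what surfaces as a failed commit in the real flow
        }
        for (NodeStub node : acked) {
            node.commit(stateVersion); // only now do the nodes start applying the state
        }
        return true;
    }

    public static void main(String[] args) {
        NodeStub node = new NodeStub() {
            public boolean send(long v) { return true; }
            public void commit(long v) { System.out.println("applying version " + v); }
        };
        System.out.println(publish(List.of(node, node), 42L, 2)); // applies on both, prints true
    }
}
In the real implementation the send phase is asynchronous, the master waits (up to a timeout) for enough acknowledgements before issuing the commit, and a failed publish surfaces as the Discovery.FailedToCommitClusterStateException caught in publishAndApplyChanges above.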
If you keep following the trail, you will see that index creation and deletion are actually carried out in IndicesClusterStateService, which is invoked from the local-apply step in ClusterService above, i.e. from callClusterStateAppliers(newClusterState, clusterChangedEvent):
@Override
public synchronized void applyClusterState(final ClusterChangedEvent event) {
if (!lifecycle.started()) {
return;
}
final ClusterState state = event.state();
// we need to clean the shards and indices we have on this node, since we
// are going to recover them again once state persistence is disabled (no master / not recovered)
// TODO: feels hacky, a block disables state persistence, and then we clean the allocated shards, maybe another flag in blocks?
if (state.blocks().disableStatePersistence()) {
for (AllocatedIndex<? extends Shard> indexService : indicesService) {
indicesService.removeIndex(indexService.index(), NO_LONGER_ASSIGNED,
"cleaning index (disabled block persistence)"); // also cleans shards
}
return;
}
updateFailedShardsCache(state);
deleteIndices(event); // also deletes shards of deleted indices
removeUnallocatedIndices(event); // also removes shards of removed indices
failMissingShards(state);
removeShards(state); // removes any local shards that doesn't match what the master expects
updateIndices(event); // can also fail shards, but these are then guaranteed to be in failedShardsCache
createIndices(state);
createOrUpdateShards(state);
}
A point worth noting:
An obvious question: if creating or deleting an index takes a long time, doesn't that block this path? In fact this method only updates metadata; deletions and the expensive recovery work run on background threads, so logically the applier thread is not stuck. The recovery flow has enough complexity of its own that I will cover it in a separate post. A sketch of the hand-off pattern follows below.
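Here is a minimal sketch of that hand-off pattern (plain Java, not the actual Elasticsearch recovery code; the executor and method names are invented for illustration): the applier does its cheap bookkeeping inline and pushes the long-running work onto a background pool.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of keeping the cluster-state applier fast:
// metadata bookkeeping happens inline, slow work is handed to a background pool.
final class ApplierOffloadSketch {
    private final ExecutorService background = Executors.newFixedThreadPool(2);

    void applyClusterState(String newIndex) {
        // 1. cheap, synchronous bookkeeping on the applier thread
        System.out.println("registered index metadata for " + newIndex);

        // 2. expensive work (e.g. shard recovery) is submitted and runs asynchronously;
        //    the applier returns immediately and does not wait for it to finish
        background.submit(() -> System.out.println("recovering shards of " + newIndex + " in the background"));
    }

    public static void main(String[] args) throws InterruptedException {
        ApplierOffloadSketch sketch = new ApplierOffloadSketch();
        sketch.applyClusterState("test-index");
        sketch.background.shutdown();
        sketch.background.awaitTermination(5, TimeUnit.SECONDS);
    }
}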
After this rather involved flow, the cluster state has been updated.
The other node type is the data node, which is mainly responsible for writing data; node.data defaults to true.
The main question here is how the cluster state gets applied on a data node.
As mentioned above, the send request uses the action name SEND_ACTION_NAME, which leads straight to its handler:
protected void handleIncomingClusterStateRequest(BytesTransportRequest request, TransportChannel channel) throws IOException {
Compressor compressor = CompressorFactory.compressor(request.bytes());
StreamInput in = request.bytes().streamInput();
try {
if (compressor != null) {
in = compressor.streamInput(in);
}
in = new NamedWriteableAwareStreamInput(in, namedWriteableRegistry);
in.setVersion(request.version());
synchronized (lastSeenClusterStateMutex) {
final ClusterState incomingState;
// If true we received full cluster state - otherwise diffs
if (in.readBoolean()) {
incomingState = ClusterState.readFrom(in, clusterStateSupplier.get().nodes().getLocalNode());
logger.debug("received full cluster state version [{}] with size [{}]", incomingState.version(),
request.bytes().length());
} else if (lastSeenClusterState != null) {
Diff<ClusterState> diff = ClusterState.readDiffFrom(in, lastSeenClusterState.nodes().getLocalNode());
incomingState = diff.apply(lastSeenClusterState);
logger.debug("received diff cluster state version [{}] with uuid [{}], diff size [{}]",
incomingState.version(), incomingState.stateUUID(), request.bytes().length());
} else {
logger.debug("received diff for but don't have any local cluster state - requesting full state");
throw new IncompatibleClusterStateVersionException("have no local cluster state");
}
// sanity check incoming state
validateIncomingState(incomingState, lastSeenClusterState);
pendingStatesQueue.addPending(incomingState); // Key point: the state is only added to the pending queue here.
lastSeenClusterState = incomingState;
}
} finally {
IOUtils.close(in);
}
channel.sendResponse(TransportResponse.Empty.INSTANCE);
}
So the send phase only makes sure the data node has received the state; the node does not process it yet, it just parks it in pendingStatesQueue and sends back a response, which tells the master that this data node is reachable. The master will then follow up with a commit request.
COMMIT_ACTION_NAME works the same way: search for it (Ctrl+H) and you can see where the action is registered and what its handler does:
protected void handleCommitRequest(CommitClusterStateRequest request, final TransportChannel channel) {
final ClusterState state = pendingStatesQueue.markAsCommitted(request.stateUUID,
new PendingClusterStatesQueue.StateProcessedListener() {
@Override
public void onNewClusterStateProcessed() { // A pattern you will see a lot in async frameworks: sendResponse is only called once processing has finished.
try {
// send a response to the master to indicate that this cluster state has been processed post committing it.
channel.sendResponse(TransportResponse.Empty.INSTANCE);
} catch (Exception e) {
logger.debug("failed to send response on cluster state processed", e);
onNewClusterStateFailed(e);
}
}
@Override
public void onNewClusterStateFailed(Exception e) {
try {
channel.sendResponse(e);
} catch (Exception inner) {
inner.addSuppressed(e);
logger.debug("failed to send response on cluster state processed", inner);
}
}
});
if (state != null) {
newPendingClusterStatelistener.onNewClusterState("master " + state.nodes().getMasterNode() +
" committed version [" + state.version() + "]"); // 具体处理逻辑
}
}
The follow-up processing still ends up in ZenDiscovery:
private class NewPendingClusterStateListener implements PublishClusterStateAction.NewPendingClusterStateListener {
@Override
public void onNewClusterState(String reason) {
processNextPendingClusterState(reason);
}
}
processNextPendingClusterState eventually submits a BatchedTask, and the concrete processing goes back through ClusterService, which closes the loop with the flow above. But there is one thing to watch out for here,
and it deserves special attention!
1. threadExecutor: if you follow its initialization you will see its size is 1, i.e. a single update thread. That means that if the task currently being executed is never finished and removed, every subsequent request just gets buffered in the executor's work queue.
2. Remember that all of these state updates first have to pass through pendingStatesQueue, so if that single thread is stuck, pendingStatesQueue keeps filling up. The queue is capped at 25 entries; once it grows past that, the oldest pending state update is dropped. Our project adds C++ logic to Elasticsearch, and this is exactly where we hit a nasty problem: a bug deadlocked the C++ side, the update thread was stuck there forever, no later request could get in, and pendingStatesQueue kept evicting states without ever processing anything. A sketch of this eviction behaviour follows after this list.
3. curl 127.0.0.1:9200/_cat/tasks?v shows the tasks currently running in the background.
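To illustrate the eviction behaviour from point 2, here is a small, self-contained sketch of a capacity-bounded pending queue that drops its oldest entry when a new state arrives while it is full. It only approximates PendingClusterStatesQueue: the class and field names are invented, and the real queue additionally tracks which states have been committed.
import java.util.ArrayDeque;

// Hypothetical sketch of a bounded pending-states queue that evicts the oldest entry when full.
final class PendingQueueSketch {
    private final ArrayDeque<Long> pendingVersions = new ArrayDeque<>(); // versions stand in for full states
    private final int maxSize;

    PendingQueueSketch(int maxSize) {
        this.maxSize = maxSize; // the limit discussed above defaults to 25 in Elasticsearch
    }

    void addPending(long version) {
        if (pendingVersions.size() >= maxSize) {
            Long dropped = pendingVersions.pollFirst(); // drop the oldest, never-processed state
            System.out.println("dropping pending state version " + dropped);
        }
        pendingVersions.addLast(version);
    }

    public static void main(String[] args) {
        PendingQueueSketch queue = new PendingQueueSketch(25);
        // if the consumer (the single cluster-state update thread) is stuck,
        // producers keep adding and the oldest states are silently evicted
        for (long version = 1; version <= 30; version++) {
            queue.addPending(version);
        }
        System.out.println("still pending: " + queue.pendingVersions.size() + " states"); // 25
    }
}
This is exactly the failure mode we hit: with the update thread wedged, the queue keeps churning while the cluster state never advances, which from the outside looks like a hung cluster. For reference, here is the submitTasks method that the BatchedTask goes through: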
public void submitTasks(List<? extends BatchedTask> tasks, @Nullable TimeValue timeout) throws EsRejectedExecutionException {
if (tasks.isEmpty()) {
return;
}
final BatchedTask firstTask = tasks.get(0);
assert tasks.stream().allMatch(t -> t.batchingKey == firstTask.batchingKey) :
"tasks submitted in a batch should share the same batching key: " + tasks;
// convert to an identity map to check for dups based on task identity
final Map