最近项目引入Elastic Job
实现定时任务的分布式调度。引入的版本2.1.5
,加入相关的job配置后启动项目,主线程假死
,不进行后续逻辑处理和日志输出。
输出的日志如下:
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.049] [] [StdSchedulerFactory] [Using default implementation for ThreadExecutor]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.130] [] [SchedulerSignalerImpl] [Initialized Scheduler Signaller of type: class org.quartz.core.SchedulerSignalerImpl]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.131] [] [QuartzScheduler] [Quartz Scheduler v.2.2.1 created.]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.135] [] [JobShutdownHookPlugin] [Registering Quartz shutdown hook.]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.136] [] [RAMJobStore] [RAMJobStore initialized.]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.139] [] [QuartzScheduler] [Scheduler meta-data: Quartz Scheduler (v2.2.1) 'dailyScanMercReratingJob' with instanceId 'NON_CLUSTERED'
Scheduler class: 'org.quartz.core.QuartzScheduler' - running locally.
NOT STARTED.
Currently in standby mode.
Number of jobs executed: 0
Using thread pool 'org.quartz.simpl.SimpleThreadPool' - with 1 threads.
Using job-store 'org.quartz.simpl.RAMJobStore' - which does not support persistence. and is not clustered.
]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.139] [] [StdSchedulerFactory] [Quartz scheduler 'dailyScanMercReratingJob' initialized from an externally provided properties instance.]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.139] [] [StdSchedulerFactory] [Quartz scheduler version: 2.2.1]
直接上解决方案:将项目的curator的框架版本全部调整为2.10.0
,包括
在项目的Spring
框架的以下位置打断点追踪项目启动过程:
发现代码在AbstractApplicationContext.refresh()
方法里,执行: finishBeanFactoryInitialization(beanFactory)
时陷入等待一直无法跳出继续执行。
根据Spring框架的启动机制,finishBeanFactoryInitialization
是完成单例bean的初始化的方法,这个方法会去真正操作elastic-job
对于job
的操作代码。
从日志中发现代码最后一行输出为:
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 15:53:27.139] [] [StdSchedulerFactory] [Quartz scheduler version: 2.2.1]
在StdSchedulerFactory
类中找到关于Quartz scheduler version
相关的日志打断点继续跟踪:
//…………………… 省略之前的代码
jrsf.initialize(scheduler);
qs.initialize();
getLog().info(
"Quartz scheduler '" + scheduler.getSchedulerName()
+ "' initialized from " + propSrc);
//这里断点
getLog().info("Quartz scheduler version: " + qs.getVersion());
// prevents the repository from being garbage collected
qs.addNoGCObject(schedRep);
// prevents the db manager from being garbage collected
if (dbMgr != null) {
qs.addNoGCObject(dbMgr);
}
schedRep.bind(scheduler);
return scheduler;
在上面的位置断点后发现,elastic job
继续执行,持续跟踪最终跟踪到SchedulerFacade
类的registerStartUpInfo
方法:
/**
* 注册作业启动信息.
*
* @param enabled 作业是否启用
*/
public void registerStartUpInfo(final boolean enabled) {
listenerManager.startAllListeners();
leaderService.electLeader();
serverService.persistOnline(enabled);
instanceService.persistOnline();
shardingService.setReshardingFlag();
monitorService.listen();
if (!reconcileService.isRunning()) {
reconcileService.startAsync();
}
}
代码在leaderService.electLeader();
陷入等待。
根据以上最终可用得出结论:
根据对LeaderService
的代码的研究,elastic job
选主使用的是 curator
框架的 LeaderLatch
类完成的。
具体时线程wait
的操作在:JobNodeStorage
的executeInLeader
方法中:
/**
* 在主节点执行操作.
*
* @param latchNode 分布式锁使用的作业节点名称
* @param callback 执行操作的回调
*/
public void executeInLeader(final String latchNode, final LeaderExecutionCallback callback) {
try (LeaderLatch latch = new LeaderLatch(getClient(), jobNodePath.getFullPath(latchNode))) {
latch.start();
latch.await();
callback.execute();
//CHECKSTYLE:OFF
} catch (final Exception ex) {
//CHECKSTYLE:ON
handleException(ex);
}
}
上面的方法调用 latch.await();
来等待获取 leadership
。由于无法获取主节点,导致线程一致wait。
LeaderLatch
大概的机制为:所有客户端向zk的同一个path
竞争的写入数据,谁先写入成功谁就获取了leadership
。
LeaderLatch
的await
方法如下:
public void await() throws InterruptedException, EOFException
{
synchronized(this)
{
while ( (state.get() == State.STARTED) && !hasLeadership.get() )
{
wait();
}
}
if ( state.get() != State.STARTED )
{
throw new EOFException();
}
}
如果LeaderLatch
无法获取leadership
那么就当前的Thread
就会一直陷入wait
。
定位到问题的发生点,解决问题的思路就要看为什么无法获取到leadership
。
登录到ZK上查询节点信息,发现正常项目启动后,elastic job
会向zk的写入如下格式的节点内容:
/{job-namespace}/{job-id}/leader/election/latch
但是异常的项目是没有这个节点的,所以应该是ZK的操作发生了问题。具体哪里发生了问题这里还没有发现。
继续将项目日志调整为DEBUG
级别会发下有如下的日志输出:
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 17:51:47.687] [] [RAMJobStore] [RAMJobStore initialized.]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 17:51:47.689] [] [QuartzScheduler] [Scheduler meta-data: Quartz Scheduler (v2.2.1) 'dailyScanMercReratingJob' with instanceId 'NON_CLUSTERED'
Scheduler class: 'org.quartz.core.QuartzScheduler' - running locally.
NOT STARTED.
Currently in standby mode.
Number of jobs executed: 0
Using thread pool 'org.quartz.simpl.SimpleThreadPool' - with 1 threads.
Using job-store 'org.quartz.simpl.RAMJobStore' - which does not support persistence. and is not clustered.
]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 17:51:47.689] [] [StdSchedulerFactory] [Quartz scheduler 'dailyScanMercReratingJob' initialized from an externally provided properties instance.]
[INFO] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 17:51:47.689] [] [StdSchedulerFactory] [Quartz scheduler version: 2.2.1]
[DEBUG] [Timer-0] [2018-10-10 17:51:49.553] [] [UpdateChecker] [Checking for available updated version of Quartz...]
[DEBUG] [RMI TCP Connection(2)-127.0.0.1] [2018-10-10 17:51:49.586] [] [LeaderService] [Elect a new leader now.]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.724] [] [RegExceptionHandler] [Elastic job: ignored exception for: KeeperErrorCode = NoNode for /payrisk-job/dailyScanMercReratingJob/leader/election/instance]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.738] [] [RegExceptionHandler] [Elastic job: ignored exception for: KeeperErrorCode = NoNode for /payrisk-job/dailyScanMercReratingJob/leader/election/instance]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.759] [] [RegExceptionHandler] [Elastic job: ignored exception for: KeeperErrorCode = NoNode for /payrisk-job/dailyScanMercReratingJob/leader/election/instance]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.769] [] [RegExceptionHandler] [Elastic job: ignored exception for: KeeperErrorCode = NoNode for /payrisk-job/dailyScanMercReratingJob/leader/election/instance]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.791] [] [RegExceptionHandler] [Elastic job: ignored exception for: KeeperErrorCode = NoNode for /payrisk-job/dailyScanMercReratingJob/leader/election/instance]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.803] [] [RegExceptionHandler] [Elastic job: ignored exception for: KeeperErrorCode = NoNode for /payrisk-job/dailyScanMercReratingJob/leader/election/instance]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.813] [] [RegExceptionHandler] [Elastic job: ignored exception for: KeeperErrorCode = NoNode for /payrisk-job/dailyScanMercReratingJob/leader/election/instance]
[DEBUG] [Curator-TreeCache-0] [2018-10-10 17:51:50.818] [] [LeaderService] [Elect a new leader now.]
[DEBUG] [Timer-0] [2018-10-10 17:51:51.261] [] [UpdateChecker] [Quartz version update check failed: Server returned HTTP response code: 403 for URL: http://www.terracotta.org/kit/reflector?kitID=quartz&pageID=update.properties&id=2130706433&os-name=Mac+OS+X&jvm-name=Java+HotSpot%28TM%29+64-Bit+Server+VM&jvm-version=1.8.0_112&platform=x86_64&tc-version=2.2.1&tc-product=Quartz&source=Quartz&uptime-secs=1&patch=UNKNOWN]
这行日志的输出代码位于:elastic-job
的RegExceptionHandler.handleException()
方法:
/**
* 处理异常.
*
* 处理掉中断和连接失效异常并继续抛注册中心.
*
* @param cause 待处理异常.
*/
public static void handleException(final Exception cause) {
if (null == cause) {
return;
}
if (isIgnoredException(cause) || null != cause.getCause() && isIgnoredException(cause.getCause())) {
log.debug("Elastic job: ignored exception for: {}", cause.getMessage());
} else if (cause instanceof InterruptedException) {
Thread.currentThread().interrupt();
} else {
throw new RegException(cause);
}
}
这里 elastic job
ignore了zk的操作异常,导致选主失败但是并没有做兼容处理,主线程陷入 wait()
。
根据上面的查询,无法选主是因为curator
框架抛出了NoNodeException
,通过google很容找到解决这个问题的方法:统一curator的版本。
关于为什么会抛出这个问题,需要深入研究下,留待考察和研究。大概的研究思路:根据异常的堆栈找到抛出异常的位置,对比下curator
在2.7.0
和2.10.0
两个版本下调用的代码差异和问题点。出现NoNodeException
的其中一种猜测:2.10.0
版本的代码在选主的时候底层API会自己创建Node,而2.7.0
的代码Node可能需要选主的代码来创建。(只是一种猜测)
将项目中curator
中的jar包版本全部统一为2.10.0
问题解决。
NOTE:注意jar包是否完全升级了要去打包后的项目的lib下面观察下,看所有的jar是否全部都是2.10.0并且没有其他版本的jar。
调试此类问题很耗时,经验:
github
官网的issue
中查看是否有同类问题;此外写框架的时候: