quartz答疑

这篇笔记记录我在阅读quartz源码的时候是如何分析的，如何去查找问题的.

1. 任务的状态

可以参考https://segmentfault.com/a/1190000015492260 写的很好，分析很详细，这里我盗张图，quartz状态转化如下:

image

2. 如何查询任务

上节写过，生产者线程会查所有要执行的触发器，在QuartzSchedulerThread的run方法中

 triggers = qsRsrcs.getJobStore().acquireNextTriggers(
                                now + idleWaitTime, Math.min(availThreadCount, qsRsrcs.getMaxBatchSize()), qsRsrcs.getBatchTimeWindow());

idleWaitTime:默认30s，可通过配置属性org.quartz.scheduler.idleWaitTime设置。
availThreadCount`：获取可用（空闲）的工作线程数量，总会大于1，因为该方法会一直阻塞，直到有工作线程空闲下来。
maxBatchSize：一次拉取trigger的最大数量，默认是1，可通过org.quartz.scheduler.batchTriggerAcquisitionMaxCount改写
batchTimeWindow：时间窗口调节参数，默认是0，可通过org.quartz.scheduler.batchTriggerAcquisitionFireAheadTimeWindow改写
misfireThreshold：超过这个时间还未触发的trigger,被认为发生了misfire,默认60s，可通过org.quartz.jobStore.misfireThreshold设置。

我们使用的是数据库存储，所以acquireNextTriggers调用的是org.quartz.impl.jdbcjobstore.JobStoreSupport#acquireNextTriggers。

public List acquireNextTriggers(final long noLaterThan, final int maxCount, final long timeWindow)
        throws JobPersistenceException {
        //1. 判断锁，获取锁
        String lockName;
        if(isAcquireTriggersWithinLock() || maxCount > 1) { 
            lockName = LOCK_TRIGGER_ACCESS;
        } else {
            lockName = null;
        }
        return executeInNonManagedTXLock(lockName, 
                new TransactionCallback>() {
                    public List execute(Connection conn) throws JobPersistenceException {
                      //这个方法是获取触发器的
                        return acquireNextTrigger(conn, noLaterThan, maxCount, timeWindow);
                    }
                },
                new TransactionValidator>() {
                    public Boolean validate(Connection conn, List result) throws JobPersistenceException {
                        try {
                            List acquired = getDelegate().selectInstancesFiredTriggerRecords(conn, getInstanceId());
                            Set fireInstanceIds = new HashSet();
                            for (FiredTriggerRecord ft : acquired) {
                                fireInstanceIds.add(ft.getFireInstanceId());
                            }
                            for (OperableTrigger tr : result) {
                                if (fireInstanceIds.contains(tr.getFireInstanceId())) {
                                    return true;
                                }
                            }
                            return false;
                        } catch (SQLException e) {
                            throw new JobPersistenceException("error validating trigger acquisition", e);
                        }
                    }
                });
    }

acquireNextTriggers方法先获取锁，然后回调调用acquireNextTrigger获取触发器，查询触发器的sql。

SELECT TRIGGER_NAME, TRIGGER_GROUP, NEXT_FIRE_TIME, PRIORITY FROM {0}TRIGGERS WHERE SCHED_NAME = {1} AND TRIGGER_STATE = ? AND 
NEXT_FIRE_TIME <= ? AND (MISFIRE_INSTR = -1 OR (MISFIRE_INSTR != -1 AND NEXT_FIRE_TIME >= ?)) ORDER BY NEXT_FIRE_TIME ASC, PRIORITY DESC

由sql可以得知，quartz可以查询过去60s将来30s的触发器。查询出来后会把触发器保存到QRTZ_FIRED_TRIGGERS表中，作用在第三节会讲。

3. 如何保证任务不丢失

任务在什么情况下会丢失。没有多余的消费者线程可以消费，服务器重启导致任务丢失。针对这两种情况，quartz是如何做的

3.1 线程堵塞导致任务丢失

如何模拟线程堵塞，这里我把消费者线程池大小设置为1

spring:
  quartz:
    job-store-type: jdbc
    overwriteExistingJobs: true
    properties:
      org.quartz.threadPool.threadCount: 1

任务每5秒执行一次，但是执行的时候睡眠10s中，这样就会导致任务丢失。

public class TestTaskJob1 extends QuartzJobBean {
    private static final Logger logger = LoggerFactory.getLogger(TestTaskJob1.class);

    @Override
    protected void executeInternal(JobExecutionContext jobExecutionContext) throws JobExecutionException {
        String localDateTime = LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
        logger.info("TestTaskJob1-->" + localDateTime);

        try {
            Thread.sleep(10000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

在第二节中，在查询触发器的sql中有一句(MISFIRE_INSTR = -1 OR (MISFIRE_INSTR != -1 AND NEXT_FIRE_TIME >= ?)这样的条件，MISFIRE_INSTR的值如果是-1，则会一直查询出来，那么这个值代表什么意思呢？可以查看https://www.jianshu.com/p/634d2a6fae7b，通过设置misfire的值，可以设置保证任务不丢失。不过执行时间有延迟而已。

3.2 服务器重启导致

misfire可以保证阻塞状态下，任务不丢失，但是如果正在执行过程中，服务器挂了，如何保证不丢失？
在第二节中讲到，查询触发器的时候，会把查询的触发器插入到QRTZ_FIRED_TRIGGERS表中，当服务器重启的时候，会去读取这张表，将任务恢复执行。

4. 如何保证分布式一致性

为了保证任务的可靠用，我们基本都会部署多台服务器，但是部署多台服务器就会出现任务在多台服务器中被执行，这种情况该如何处理。

在第二节中获取触发器的时候，获取触发器是通过executeInNonManagedTXLock方法回调的，看下executeInNonManagedTXLock的实现逻辑.

protected  T executeInNonManagedTXLock(
            String lockName, 
            TransactionCallback txCallback, final TransactionValidator txValidator) throws JobPersistenceException {
        boolean transOwner = false;
        Connection conn = null;
        try {
           
           if (lockName != null) {
                // If we aren't using db locks, then delay getting DB connection 
                // until after acquiring the lock since it isn't needed.
                //1. 获取数据库连接
                    if (getLockHandler().requiresConnection()) {
                    conn = getNonManagedTXConnection();
                }
                //2.获取锁
                transOwner = getLockHandler().obtainLock(conn, lockName);
            }
            
            if (conn == null) {
                conn = getNonManagedTXConnection();
            }
            //3. 回调查询结果
            final T result = txCallback.execute(conn);
            try {
                commitConnection(conn);
            } catch (JobPersistenceException e) {
                rollbackConnection(conn);
                if (txValidator == null || !retryExecuteInNonManagedTXLock(lockName, new TransactionCallback() {
                    @Override
                    public Boolean execute(Connection conn) throws JobPersistenceException {
                        return txValidator.validate(conn, result);
                    }
                })) {
                    throw e;
                }
            }

            Long sigTime = clearAndGetSignalSchedulingChangeOnTxCompletion();
            if(sigTime != null && sigTime >= 0) {
                signalSchedulingChangeImmediately(sigTime);
            }
            
            return result;
        } catch (JobPersistenceException e) {
            rollbackConnection(conn);
            throw e;
        } catch (RuntimeException e) {
            rollbackConnection(conn);
            throw new JobPersistenceException("Unexpected runtime exception: "
                    + e.getMessage(), e);
        } finally {
            try {
                releaseLock(lockName, transOwner);
            } finally {
                cleanupConnection(conn);
            }
        }
    }

主要分析getLockHandler().obtainLock(conn, lockName);这段逻辑，getLockHandler是获取锁处理对象，因为使用的是数据库模式，所以是DBSemaphore#obtainLock。

public boolean obtainLock(Connection conn, String lockName)
        throws LockException {

        if(log.isDebugEnabled()) {
            log.debug(
                "Lock '" + lockName + "' is desired by: "
                        + Thread.currentThread().getName());
        }
            //1. 判断是不是自己获取了锁，锁可重入
        if (!isLockOwner(lockName)) {

          //2. 执行sql获取锁
            executeSQL(conn, lockName, expandedSQL, expandedInsertSQL);
            
            if(log.isDebugEnabled()) {
                log.debug(
                    "Lock '" + lockName + "' given to: "
                            + Thread.currentThread().getName());
            }
           //3. 如果取到锁，把锁放到threadLocal中
            getThreadLocks().add(lockName);
            //getThreadLocksObtainer().put(lockName, new
            // Exception("Obtainer..."));
        } else if(log.isDebugEnabled()) {
            log.debug(
                "Lock '" + lockName + "' Is already owned by: "
                        + Thread.currentThread().getName());
        }

        return true;
    }

获取锁的逻辑是：

判断锁是不是已经被自己获取了，判断的逻辑是threadLocal中是不是有值，这个目的是锁可以重入。

执行sql获取锁，

SELECT * FROM QRTZ_LOCKS WHERE SCHED_NAME = 'quartzScheduler' AND LOCK_NAME = ? FOR UPDATE

根据锁的名字，使用for update获取行锁。