Reading the RQ Source Code

Data Dictionary

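rq:worker:[worker name]

Purpose: records information about each worker
Data type: Hash
Fields:
- birth: time the worker started working
- queues: names of the queues the worker serves, comma-separated when there are several
- death: time of the worker's "death"
- shutdown_requested_date: time a shutdown of the worker was requested
- state: current state
- current_job: ID of the job in progress
- started_at: time the current job started executing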

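rq:workers

Purpose: records the names of the workers that are currently working
Data type: Set

rq:wip:[queue name]

Purpose: records the jobs currently being executed for each queue
Data type: SortedSet
Note: the score of each member is the job's expiration time (Unix timestamp)

rq:deferred:[queue name]

TBD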

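rq:job:[job ID]:dependents

Purpose:
Data type:

rq:job:[job ID]

Purpose: records the information of each job
Data type: Hash
Fields:
- created_at: creation time
- data:
- origin: name of the queue
- description: description of the job, mainly used to tell jobs apart in log output
- enqueued_at: time the job was enqueued (Unix timestamp)
- started_at: time execution started (Unix timestamp)
- ended_at: time execution ended (Unix timestamp)
- result: return value of the job
- exc_info: information about the exception that was raised
- timeout: maximum execution time of the job
- result_ttl: expiry time of the job result
- status: current status of the job
- dependency_id: ID of the job this job depends on
- meta: free-form metadata set by the job producer; originally a dict, stored in Redis after pickle.dumps
- ttl: maximum time the job may stay in the queue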

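rq:queues

Purpose: records the names of the queues that currently exist
Data type: Set

rq:queue:[queue name]

Purpose: records the IDs of the jobs waiting in this queue
Data type: List

rq:finished:[queue name]

Purpose: records the IDs of the jobs that have finished
Data type: SortedSet
Note: the score of each member is the expiry time of the job's result (Unix timestamp)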
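The key layout above can be summed up as a handful of string-building helpers. This is only a sketch: the function names are hypothetical and not part of rq, and the key patterns are copied from the dictionary above rather than checked against a particular rq release.

```python
# Hypothetical helpers (not part of rq) that rebuild the key names
# listed in the data dictionary above.
PREFIX = 'rq:'


def worker_key(worker_name):
    """Hash holding the state of one worker."""
    return '{0}worker:{1}'.format(PREFIX, worker_name)


def job_key(job_id):
    """Hash holding the fields of one job."""
    return '{0}job:{1}'.format(PREFIX, job_id)


def queue_key(queue_name):
    """List of job IDs waiting in one queue."""
    return '{0}queue:{1}'.format(PREFIX, queue_name)


def wip_key(queue_name):
    """SortedSet of jobs currently being executed for one queue."""
    return '{0}wip:{1}'.format(PREFIX, queue_name)


def finished_key(queue_name):
    """SortedSet of finished jobs for one queue."""
    return '{0}finished:{1}'.format(PREFIX, queue_name)


print(job_key('42'))        # rq:job:42
print(wip_key('default'))   # rq:wip:default
```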

The Worker Life Cycle

To get the overall picture first, here is the description of the worker life cycle from the official documentation:

The life-cycle of a worker consists of a few phases:

1. Boot. Loading the Python environment.
2. Birth registration. The worker registers itself to the system so it knows of this worker.
3. Start listening. A job is popped from any of the given Redis queues. If all queues are empty and the worker is running in burst mode, quit now. Else, wait until jobs arrive.
4. Prepare job execution. The worker tells the system that it will begin work by setting its status to busy and registers job in the StartedJobRegistry.
5. Fork a child process. A child process (the "work horse") is forked off to do the actual work in a fail-safe context.
6. Process work. This performs the actual job work in the work horse.
7. Cleanup job execution. The worker sets its status to idle and sets both the job and its result to expire based on result_ttl. Job is also removed from StartedJobRegistry and added to FinishedJobRegistry in the case of successful execution, or FailedQueue in the case of failure.
8. Loop. Repeat from step 3.

Job States

def enum(name, *sequential, **named):
    """
    Builds an enum-like class dynamically by calling the metaclass type().
    """
    values = dict(zip(sequential, range(len(sequential))), **named)   
    
    # NOTE: Yes, we *really* want to cast using str() here.
    # On Python 2 type() requires a byte string (which is str() on Python 2).
    # On Python 3 it does not matter, so we'll use str(), which acts as
    # a no-op.    
    return type(str(name), (), values)

# These are the states a job can have in rq
JobStatus = enum(
    'JobStatus',
    QUEUED='queued',
    FINISHED='finished',
    FAILED='failed',
    STARTED='started',
    DEFERRED='deferred'
)
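A short usage sketch of the enum helper above. The helper and JobStatus are repeated so the block runs on its own; the Color enum is a hypothetical addition, included only to show how positional names are numbered.

```python
def enum(name, *sequential, **named):
    # Positional names are numbered from 0; keyword names keep their value.
    values = dict(zip(sequential, range(len(sequential))), **named)
    return type(str(name), (), values)


JobStatus = enum(
    'JobStatus',
    QUEUED='queued',
    FINISHED='finished',
    FAILED='failed',
    STARTED='started',
    DEFERRED='deferred'
)

# Hypothetical enum built from positional names only.
Color = enum('Color', 'RED', 'GREEN')

print(JobStatus.QUEUED)     # queued
print(JobStatus.__name__)   # JobStatus
print(Color.GREEN)          # 1
```

Because the values are plain class attributes, a status read back from Redis as a string can be compared directly, for example status == JobStatus.FAILED.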

Exception Handling

What happens if an exception is raised while a job is executing?

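# worker.py
import sys
import traceback

def perform_job(self, job, queue):
    # Code unrelated to exception handling is omitted
    # (started_job_registry is set up in the omitted part)
    try:
        job.perform()
    except Exception:
        # Catch every exception raised around the function that actually
        # runs the job (a timeout also ends up here).
        #
        # handle_job_failure handles the failure of an executing job by:
        #   1. Setting the job status to failed
        #   2. Removing the job from the started_job_registry
        #   3. Setting the worker's current job to None
        self.handle_job_failure(
            job=job,
            started_job_registry=started_job_registry
        )

        # Note the use of sys.exc_info() here
        self.handle_exception(job, *sys.exc_info())
        return False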

# Now look at the self.handle_exception() method
def handle_exception(self, job, *exc_info):
    """Walks the exception handler stack to delegate exception handling."""
    exc_string = ''.join(traceback.format_exception_only(*exc_info[:2]) +
                         traceback.format_exception(*exc_info))
    self.log.error(exc_string, exc_info=True, extra={
        'func': job.func_name,
        'arguments': job.args,
        'kwargs': job.kwargs,
        'queue': job.origin,
    })

    for handler in reversed(self._exc_handlers):    
        self.log.debug('Invoking exception handler {0}'.format(handler))
        fallthrough = handler(job, *exc_info)   
      
        # Only handlers with explicit return values should disable further
        # exc handling, so interpret a None return value as True.    
        if fallthrough is None:
            fallthrough = True
        
        if not fallthrough:
            break

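As handle_exception shows, a series of exception handlers can be registered in the worker's _exc_handlers. When the worker catches an exception it delegates to them in reverse registration order: the loop iterates over reversed(self._exc_handlers), so the most recently registered handler runs first. A handler that returns False stops the chain, while a return value of None counts as True and lets the next handler run. This makes it easy to customize what happens when a job fails. The worker also provides a method for registering exc_handlers, which is not covered here.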
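The chaining rule in handle_exception can be tried out without a worker. The following is a minimal sketch, not rq code: run_handlers copies only the loop from the method above, and the three handler functions are hypothetical examples.

```python
def run_handlers(handlers, job, exc_type, exc_value, tb):
    """Walk the handler stack like handle_exception and report who ran."""
    called = []
    for handler in reversed(handlers):
        called.append(handler.__name__)
        fallthrough = handler(job, exc_type, exc_value, tb)
        # Only an explicit falsy return value stops the chain.
        if fallthrough is None:
            fallthrough = True
        if not fallthrough:
            break
    return called


def move_to_failed(job, exc_type, exc_value, tb):
    # Hypothetical default handler, registered first, so it runs last.
    return None


def retry_once(job, exc_type, exc_value, tb):
    # Hypothetical handler that takes over and stops the chain.
    return False


def log_only(job, exc_type, exc_value, tb):
    # Hypothetical handler that returns None, so the chain continues.
    return None


# Registration order: move_to_failed, retry_once, log_only.
handlers = [move_to_failed, retry_once, log_only]
print(run_handlers(handlers, 'job-1', ValueError, ValueError('boom'), None))
# ['log_only', 'retry_once']
```

The handler registered last runs first, and move_to_failed is never reached because retry_once returned False.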

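Data Related to Job Execution Efficiency

Several fields in each job's hash reflect how efficiently the job was executed:

- created_at: recorded when the job is created at enqueue time
- started_at: recorded when execution starts
- ended_at: recorded when execution succeeds

These three values are enough to compute statistics on how efficiently the job queue is working.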
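As a sketch of such a statistic, the waiting time and running time of one job can be derived from the three fields. The function name job_timings is hypothetical, and the sketch assumes the values have already been read from Redis and parsed into datetime objects.

```python
from datetime import datetime


def job_timings(created_at, started_at, ended_at):
    """Return waiting, running and total time of a job in seconds."""
    waiting = (started_at - created_at).total_seconds()
    running = (ended_at - started_at).total_seconds()
    return {'waiting': waiting, 'running': running, 'total': waiting + running}


timings = job_timings(
    created_at=datetime(2017, 1, 1, 12, 0, 0),
    started_at=datetime(2017, 1, 1, 12, 0, 5),
    ended_at=datetime(2017, 1, 1, 12, 0, 7),
)
print(timings)  # {'waiting': 5.0, 'running': 2.0, 'total': 7.0}
```

A long waiting time points at too few workers for the queue, while a long running time points at the job function itself.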

Job Priority

First, each worker can serve several queues at once. Jobs are fetched from the queues as follows:

# queue.py
@classmethod
def dequeue_any(cls, queues, timeout, connection=None):
    """Class method returning the job_class instance at the front of the given
    set of Queues, where the order of the queues is important.    

    When all of the Queues are empty, depending on the `timeout` argument, 
    either blocks execution of this function for the duration of the
    timeout or until new messages arrive on any of the queues, or returns    
    None.    

    See the documentation of cls.lpop for the interpretation of timeout.
    """    
    while True:        
        queue_keys = [q.key for q in queues]        
        result = cls.lpop(queue_keys, timeout, connection=connection) 
        # the rest of the code is omitted

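The lpop method of the queue class is a wrapper built mainly on the Redis LPOP and BLPOP commands. Whether jobs are fetched in blocking or non-blocking mode, when a worker pulls from several queues the priority follows the order in which the queues were passed to the worker.

For example, if the worker is started like this:

$ rq worker high normal low

then as long as the high queue still holds jobs, this worker will never start a job from normal.

Within a single queue, jobs live in a Redis List. By default a job is pushed at one end and popped from the other, so under normal circumstances the order is FIFO. rq also provides the at_front option, which inserts a job at the head of the queue instead.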
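The two ordering rules above (queue order first, FIFO within a queue, at_front as the exception) can be simulated in memory. This is a sketch only: it uses collections.deque in place of Redis lists, and the enqueue and dequeue_any functions here are simplified stand-ins, not the rq implementations.

```python
from collections import OrderedDict, deque


def enqueue(queue, job_id, at_front=False):
    """Append at the tail by default, or at the head with at_front."""
    if at_front:
        queue.appendleft(job_id)
    else:
        queue.append(job_id)


def dequeue_any(queues):
    """Pop from the head of the first non-empty queue, in the given order."""
    for name, jobs in queues.items():
        if jobs:
            return name, jobs.popleft()
    return None


# Same order as `rq worker high normal low`.
queues = OrderedDict([('high', deque()), ('normal', deque()), ('low', deque())])
enqueue(queues['normal'], 'n1')
enqueue(queues['high'], 'h1')
enqueue(queues['high'], 'h2')
enqueue(queues['high'], 'h0', at_front=True)

order = []
while True:
    item = dequeue_any(queues)
    if item is None:
        break
    order.append(item)

print(order)
# [('high', 'h0'), ('high', 'h1'), ('high', 'h2'), ('normal', 'n1')]
```

n1 was enqueued first but runs last, because every job in high is served before normal is looked at.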

Job Lifetimes

The official documentation says:

A job has two TTLs, one for the job result and one for the job itself. This means that if you have job that shouldn't be executed after a certain amount of time, you can define a TTL as such:

# When creating the job:
job = Job.create(func=say_hello, ttl=43)
# or when queueing a new job:
job = q.enqueue(count_words_at_url, 'http://nvie.com', ttl=43)

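Each job has three limits in total:

- timeout: the execution timeout; a job that has not finished timeout seconds after it started is considered "lost"
- result_ttl: how long the result is kept
- ttl: how long the job may survive in the queue

TIMEOUT

The whole mechanism in this part can be lifted and reused elsewhere. The source first: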

with self.death_penalty_class(job.timeout or self.queue_class.DEFAULT_TIMEOUT):    
    rv = job.perform()

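The source of death_penalty_class lives in the timeouts.py module.

The death_penalty_class here relies on the SIGALRM signal: when the timeout elapses, the signal handler raises an exception, and that is how the execution time of a job is capped: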

class UnixSignalDeathPenalty(BaseDeathPenalty):    
    def handle_death_penalty(self, signum, frame):  
        raise JobTimeoutException('Job exceeded maximum timeout '                                  
                                  'value ({0} seconds)'.format(self._timeout))    
    
    def setup_death_penalty(self):      
        """Sets up an alarm signal and a signal handler that raises   
        a JobTimeoutException after the timeout amount (expressed in        
        seconds).        
        """        
        signal.signal(signal.SIGALRM, self.handle_death_penalty)
        signal.alarm(self._timeout)    

    def cancel_death_penalty(self):       
        """Removes the death penalty alarm and puts back the system into        
        default signal handling.        
        """       
        signal.alarm(0)       
        signal.signal(signal.SIGALRM, signal.SIG_DFL)

Once the job finishes and execution leaves the context manager, cancel_death_penalty is called.
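To reuse the idea outside rq, the same pattern fits in a small context manager. This is a sketch under a few assumptions: it runs only on Unix and only in the main thread, the names death_penalty, JobTimeout and run_with_timeout are hypothetical, and signal.setitimer is swapped in for signal.alarm so that sub-second timeouts work.

```python
import signal
import time
from contextlib import contextmanager


class JobTimeout(Exception):
    """Raised by the SIGALRM handler when the time limit is exceeded."""


@contextmanager
def death_penalty(seconds):
    def handler(signum, frame):
        raise JobTimeout('exceeded {0} seconds'.format(seconds))

    # Set up: install the handler, then start the timer.
    signal.signal(signal.SIGALRM, handler)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Cancel: stop the timer and restore default signal handling.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, signal.SIG_DFL)


def run_with_timeout(func, seconds):
    """Run func under the time limit; report a timeout instead of raising."""
    try:
        with death_penalty(seconds):
            return func()
    except JobTimeout:
        return 'timed out'


print(run_with_timeout(lambda: 'done', 1))            # done
print(run_with_timeout(lambda: time.sleep(2), 0.1))   # timed out
```

The finally clause plays the role of cancel_death_penalty: it runs whether the body finished normally or was interrupted by the timeout.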

RESULT_TTL

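result_ttl is the easiest one to understand: result is the return value of a finished job, and result_ttl is how long that result is kept in Redis.

If result_ttl is 0, all information about the job (that is, the Redis key rq:job:[job ID]) is removed immediately. If it is None, the job information is kept forever. If it is greater than 0, an expiry is set on the rq:job:[job ID] key and Redis removes the key automatically once it expires. The return value of the job is first serialized with Python's pickle.dumps and then stored in the result field.

There is also something called FinishedJobRegistry, which uses the SortedSet named rq:finished:[queue name] from the data dictionary above. It holds the finished jobs of each queue and is presumably meant mainly for monitoring.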
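The three cases can be written down as a small decision function. This is a sketch of the rule as described in the paragraph above, not rq's code: the function name and the returned tuples are hypothetical, and negative values are rejected only because the description does not cover them.

```python
def result_cleanup_action(result_ttl):
    """Map a result_ttl value to what happens to the rq:job:[job ID] key."""
    if result_ttl is None:
        # Keep the job hash forever.
        return ('persist', None)
    if result_ttl == 0:
        # Remove the job hash right away.
        return ('delete', None)
    if result_ttl > 0:
        # Let Redis expire the job hash after result_ttl seconds.
        return ('expire', result_ttl)
    # Assumption: negative values are not covered by the description.
    raise ValueError('unsupported result_ttl: {0!r}'.format(result_ttl))


print(result_cleanup_action(0))      # ('delete', None)
print(result_cleanup_action(None))   # ('persist', None)
print(result_cleanup_action(500))    # ('expire', 500)
```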

TTL

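A job's lifetime in the queue is enforced by setting an expiry on the rq:job:[job ID] Redis key. Once the expiry is reached the key is deleted, so the job can no longer be fetched. If ttl is None, no expiry is set.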

Graceful Worker Shutdown

To shut down gracefully, the rq worker registers handlers for two signals:

# worker.py
def _install_signal_handlers(self):    
    """Installs signal handlers for handling SIGINT and SIGTERM
    gracefully.   
    """    
    signal.signal(signal.SIGINT, self.request_stop)
    signal.signal(signal.SIGTERM, self.request_stop)

So both SIGINT and SIGTERM shut the worker down gracefully. Now the request_stop method:

# worker.py
def request_stop(self, signum, frame):    
    """Stops the current worker loop but waits for child processes to
    end gracefully (warm shutdown).
    """
    self.log.debug('Got signal {0}'.format(signal_name(signum)))
    
    signal.signal(signal.SIGINT, self.request_force_stop)
    signal.signal(signal.SIGTERM, self.request_force_stop)
  
    self.handle_warm_shutdown_request()   
   
    # If shutdown is requested in the middle of a job, wait until   
    # finish before shutting down and save the request in redis
    if self.get_state() == 'busy':  
        self._stop_requested = True
        self.set_shutdown_requested_date() 
        self.log.debug('Stopping after current horse is finished. '
                       'Press Ctrl+C again for a cold shutdown.')
    else:     
        raise StopRequested()

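A few points are worth noting:

- The handlers for SIGINT and SIGTERM are replaced, so sending either signal a second time triggers a cold shutdown.
- The busy and non-busy states are handled differently: a worker that is not busy stops right away, while a busy worker only sets its _stop_requested attribute and stops itself after the current job is done.
- The worker is stopped by raising an exception: StopRequested is caught in the worker's main loop, which then exits the loop and ends the worker process.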
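The warm-then-cold behaviour listed above is a small state machine, and it can be sketched without real signals. Everything here is a simplified stand-in: MiniWorker, deliver_signal, ForcedStop and send are hypothetical names, and a single handler attribute replaces the two signal.signal registrations.

```python
class StopRequested(Exception):
    """Raised to leave the main loop when the worker is idle."""


class ForcedStop(Exception):
    """Hypothetical stand-in for the cold shutdown path."""


class MiniWorker(object):
    def __init__(self):
        self.state = 'idle'
        self.stop_requested = False
        # Stand-in for the handler installed for SIGINT and SIGTERM.
        self.handler = self.request_stop

    def deliver_signal(self):
        self.handler()

    def request_stop(self):
        # From now on a second signal means cold shutdown.
        self.handler = self.request_force_stop
        if self.state == 'busy':
            self.stop_requested = True
        else:
            raise StopRequested()

    def request_force_stop(self):
        raise ForcedStop()


def send(worker):
    """Deliver one signal and describe what the worker does with it."""
    try:
        worker.deliver_signal()
    except StopRequested:
        return 'warm stop now'
    except ForcedStop:
        return 'cold stop'
    return 'stop after current job'


busy = MiniWorker()
busy.state = 'busy'
print(send(busy))          # stop after current job
print(send(busy))          # cold stop
print(send(MiniWorker()))  # warm stop now
```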

Logging

Only the worker module writes logs. It starts with a common convention, naming the logger after the module:

# worker.py
logger = logging.getLogger(__name__)

Then, when the worker starts working, it calls the setup_loghandlers function to configure the logger's handler:

# logutils.py
def setup_loghandlers(level):    
    logger = logging.getLogger('rq.worker')   
    if not _has_effective_handler(logger):   
        logger.setLevel(level)       
        # This statement doesn't set level properly in Python-2.6      
        # Following is an additional check to see if level has been set to       
        # appropriate(int) value       
        if logger.getEffectiveLevel() == level:       
            # Python-2.6. Set again by using logging.INFO etc.
            level_int = getattr(logging, level)   
            logger.setLevel(level_int)    
        formatter = logging.Formatter(fmt='%(asctime)s %(message)s',
                                      datefmt='%H:%M:%S')
        handler = ColorizingStreamHandler()
        handler.setFormatter(formatter)
        logger.addHandler(handler)

def _has_effective_handler(logger):    
    """ 
    Checks if a logger has a handler that will catch its messages in its logger hierarchy.   
    :param `logging.Logger` logger: The logger to be checked.
    :return: True if a handler is found for the logger, False otherwise.    
    :rtype: bool    
    """    
    while True:  
        if logger.handlers:          
            return True     
        if not logger.parent:        
            return False    
        logger = logger.parent

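The code above makes two main points:

- If the logger named rq.worker has no handler, setup_loghandlers installs one. _has_effective_handler walks up the parent chain, which reflects the hierarchical nature of loggers.
- It uses a custom Handler named ColorizingStreamHandler, which adds color to log output written to the terminal. The code is not reproduced here; refer to the source when needed.

This default configuration has a drawback, though. Throughout the worker's run only the rq.worker logger is configured, so only rq.worker and the loggers that descend from it pick up the configuration made in setup_loghandlers. In this respect rq's logging setup is not very flexible, and a custom worker is probably the only way to get fully flexible logging configuration.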
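The hierarchy check in _has_effective_handler can be observed in isolation with the standard logging module. The sketch below copies that function under the hypothetical name has_effective_handler and attaches a plain StreamHandler to the parent logger rq. Judging from the logutils code above, setup_loghandlers would then skip its own configuration, including setLevel, so the level has to be set by hand. This is an observation about the excerpt shown here, not a documented rq feature.

```python
import logging


def has_effective_handler(logger):
    """Same walk up the parent chain as _has_effective_handler above."""
    while True:
        if logger.handlers:
            return True
        if not logger.parent:
            return False
        logger = logger.parent


# Attach a handler to the parent logger 'rq' instead of 'rq.worker'.
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter('%(asctime)s %(name)s %(levelname)s %(message)s'))
rq_logger = logging.getLogger('rq')
rq_logger.addHandler(handler)
rq_logger.setLevel(logging.INFO)

worker_logger = logging.getLogger('rq.worker')
print(worker_logger.parent is rq_logger)       # True
print(has_effective_handler(worker_logger))    # True
```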