1. Introduction

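Workqueues show up all over kernel drivers. Engineers familiar with the kernel or with drivers will know the mechanism well: an interrupt handler commonly queues work so that the follow-up processing runs after the interrupt has been serviced. As the kernel evolved and drivers met more and more scenarios, the workqueue evolved with them, and the implementation in common use today is the concurrency managed workqueue (cmwq). This article gives a brief introduction to the workqueue and explains how to use it.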

2. Workqueue overview

2.1 What a workqueue is

Workqueues are used widely in kernel drivers, and the kernel describes them in Documentation/core-api/workqueue.rst.

That document describes the workqueue as follows:

There are many cases where an asynchronous process execution context is needed and the workqueue (wq) API is the most commonly used mechanism for such cases.
When such an asynchronous execution context is needed, a work item describing which function to execute is put on a queue.
An independent thread serves as the asynchronous execution context.
The queue is called workqueue and the thread is called worker.
While there are work items on the workqueue the worker executes the functions associated with the work items one after the other.
When there is no work item left on the workqueue the worker becomes idle.
When a new work item gets queued, the worker begins executing again.

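In short, a workqueue provides a mechanism for executing tasks asynchronously. It involves the following concepts:

work: a task that needs to be executed asynchronously
worker: the asynchronous execution context that processes work items, usually a kernel thread
workqueue: work items linked together as list nodes form a workqueue, from which they are dispatched to workers for execution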
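The three concepts above can be sketched as a small user-space model. This is only an illustration: the `model_*` names are hypothetical stand-ins and not the kernel's definitions, and the "worker" here is a plain function call rather than a thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; not the kernel's definitions. */
struct model_work {
    void (*func)(struct model_work *work); /* function to run asynchronously */
    struct model_work *next;               /* link to the next queued item */
};

struct model_workqueue {
    struct model_work *head; /* oldest pending work item */
    struct model_work *tail; /* newest pending work item */
};

/* Append a work item to the tail of the queue. */
static void model_queue_work(struct model_workqueue *wq, struct model_work *work)
{
    work->next = NULL;
    if (wq->tail)
        wq->tail->next = work;
    else
        wq->head = work;
    wq->tail = work;
}

/*
 * The "worker": execute pending items one after the other, then go idle.
 * Returns the number of items executed.
 */
static int model_worker_run(struct model_workqueue *wq)
{
    int executed = 0;

    while (wq->head) {
        struct model_work *work = wq->head;

        wq->head = work->next;
        if (!wq->head)
            wq->tail = NULL;
        work->func(work);
        executed++;
    }
    return executed;
}

/* Example work functions that record the order in which they ran. */
static int model_trace[4];
static int model_trace_len;

static void model_work_a(struct model_work *work)
{
    (void)work;
    model_trace[model_trace_len++] = 1;
}

static void model_work_b(struct model_work *work)
{
    (void)work;
    model_trace[model_trace_len++] = 2;
}
```

Queueing `model_work_a` and then `model_work_b` and calling `model_worker_run()` executes them in that order, after which the worker has nothing left to do and a second call returns 0.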

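2.2 Why use a workqueue

Interrupt bottom-half mechanisms such as tasklets run in interrupt context (softirq), where sleeping is generally not allowed. If a particular bottom half needs to sleep, a tasklet cannot be used.

A workqueue runs in process context and can therefore sleep. This is the essential difference from the other bottom-half mechanisms, and it makes interrupt handling in drivers much more convenient.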

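2.3 The legacy workqueue and the Concurrency Managed Workqueue

Drivers commonly use a workqueue to obtain an asynchronous execution environment: a work item is defined and linked into a workqueue, and a worker keeps processing work items. When nothing is left, the worker goes to sleep.

The flow looks simple, but several details deserve attention in order to understand the shortcomings of the legacy workqueue.

The kernel documentation puts it this way: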

In the original wq implementation, a multi threaded (MT) wq had one
worker thread per CPU and a single threaded (ST) wq had one worker
thread system-wide. A single MT wq needed to keep around the same
number of workers as the number of CPUs. The kernel grew a lot of MT
wq users over the years and with the number of CPU cores continuously
rising, some systems saturated the default 32k PID space just booting
up.
Although MT wq wasted a lot of resource, the level of concurrency
provided was unsatisfactory. The limitation was common to both ST and
MT wq albeit less severe on MT. Each wq maintained its own separate
worker pool. An MT wq could provide only one execution context per CPU
while an ST wq one for the whole system. Work items had to compete for
those very limited execution contexts leading to various problems
including proneness to deadlocks around the single execution context.

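Too many kernel threads. When the workqueue mechanism is misused, the PID space is easily exhausted, and enlarging the PID space means more tasks in the system, which hurts performance.
Insufficient concurrency. A single threaded workqueue has no concurrency at all: every work item runs in queue order, and those not yet reached simply wait. A per-CPU multi threaded workqueue on a multi-core system does create a thread pool (several workers), but the number is fixed, and each online CPU runs exactly one worker that is strictly bound to it. Each thread can only run on its bound CPU, so work cannot be processed concurrently. For example, if the worker on cpu0 blocks, every other work item queued to that worker is blocked too and cannot move to another CPU.
Deadlocks. Suppose a driver has work A and work B, and work A depends on the result of work B. If both are dispatched to the same worker on the same CPU, that worker cannot run work A and work B concurrently, so the driver deadlocks.
A binary thread pool model. A workqueue creates either 1 thread or one thread per CPU. The count cannot be chosen freely, which is inflexible.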

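Concurrency Managed Workqueue (cmwq) addresses these problems with the following goals:

Stay compatible with the legacy workqueue API
Let all workqueues share worker pools, providing concurrency on demand while saving resources
Regulate the concurrency level automatically, so that users need not care about the details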

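3. Concurrency managed workqueue (cmwq)

3.1 Differences from the legacy workqueue

From the user's side, working with a workqueue involves two front-end operations:

Creating a workqueue
Adding a given work item to the workqueue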

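In the legacy workqueue, worker threads and workqueues were tightly coupled.
A single threaded workqueue created one system-wide worker thread, while a multi threaded workqueue created per-CPU worker threads, one for each CPU. Everything was fixed.

A workqueue is meant to provide an asynchronous execution environment, and tying workqueue creation to worker thread creation severely restricts how resources can be used. Users do not care how the back end processes work or whether it uses multiple threads.
cmwq breaks the binding between workqueues and worker threads by introducing the worker pool (think of it as a thread pool). The system holds several worker pools of different types that are not tied to any particular workqueue, and every workqueue can use whichever worker pool fits its needs.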

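When creating a workqueue, the user passes flags that constrain how the work items on it are processed. Based on those flags, the workqueue hands its work to the matching worker pool in the system.
Because all workqueues share the system's worker pools, fewer resources are wasted on creating worker threads, and because a worker pool can add or remove worker threads according to its own load, both flexibility and concurrency improve.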

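3.2 How cmwq solves these problems

As noted above, the legacy workqueue had several problems. Here is how cmwq addresses them.

3.2.1 How is the excessive thread count addressed?

In cmwq, users still create workqueues as needed, but creating one normally does not create worker threads, and whether a new worker is needed is decided by the worker pool itself. In other words, creating a workqueue is decoupled from the thread pool operations at the back end.

The system has two broad kinds of worker pool: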

Pools bound to a specific CPU, also called per-CPU worker pools. They come in two flavors:
- normal thread pool: manages normal-priority worker threads and processes normal-priority work
- high priority thread pool: manages high-priority worker threads and processes high-priority work
The number of these pools is fixed and depends on the number of CPUs: with N CPUs there are 2N per-CPU worker pools, that is, every CPU has two bound worker pools, a normal thread pool and a high priority thread pool.
Unbound pools: worker threads in these pools may run on any CPU. Unbound worker pools are created dynamically. If a worker pool with the same attributes already exists it is reused, otherwise a new one is created. The attributes of a worker pool include the priority of its worker threads and the CPU mask they are allowed to run on.

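Given the above, what policy does a worker pool follow when creating worker threads, and can it lead to too many of them?

By default, a worker pool starts with a single worker thread to process work, and then creates further worker threads dynamically according to how work is submitted and how it executes.
cmwq uses a fairly simple policy:

When the number of running worker threads in a pool is 0 and work is pending, the pool brings in a new worker thread.
An idle worker thread is not destroyed immediately but kept around for a while, so that when another worker is needed an idle one can simply be woken up.
If there is still nothing to do after that period, the idle worker thread is destroyed.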
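The policy above can be expressed as a few lines of bookkeeping. The following is a user-space sketch under stated assumptions: the `model_*` names are hypothetical, real pools use locking and timers that are left out here, and whether the kernel always keeps some idle workers in reserve is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of a worker pool's bookkeeping. */
struct model_pool {
    int nr_pending;  /* work items waiting in the pool */
    int nr_running;  /* workers currently running (not blocked, not idle) */
    int nr_idle;     /* workers parked with nothing to do */
    int nr_workers;  /* total number of workers, idle ones included */
};

enum model_action { MODEL_NOTHING, MODEL_WAKE_IDLE, MODEL_CREATE_WORKER };

/* More concurrency is needed when work is pending and no worker is running. */
static bool model_need_more_worker(const struct model_pool *pool)
{
    return pool->nr_pending > 0 && pool->nr_running == 0;
}

/* Decide how to provide another execution context, preferring an idle worker. */
static enum model_action model_manage(struct model_pool *pool)
{
    if (!model_need_more_worker(pool))
        return MODEL_NOTHING;

    if (pool->nr_idle > 0) {
        pool->nr_idle--;
        pool->nr_running++;
        return MODEL_WAKE_IDLE;
    }

    pool->nr_workers++;
    pool->nr_running++;
    return MODEL_CREATE_WORKER;
}

/* The idle period expired: destroy one idle worker if there is one. */
static bool model_idle_timeout(struct model_pool *pool)
{
    if (pool->nr_idle == 0)
        return false;
    pool->nr_idle--;
    pool->nr_workers--;
    return true;
}
```

The point of preferring `MODEL_WAKE_IDLE` over `MODEL_CREATE_WORKER` is that keeping idle workers for a while makes bursts of work cheap, while the timeout keeps the total thread count from growing without bound.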

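3.2.2 How is the concurrency problem addressed?

Suppose four work items A, B, C and D are queued on one CPU. By default the worker pool creates a single worker to process them. In the legacy workqueue the four items run serially on that CPU: if work B blocks, work C and work D cannot proceed until work B unblocks and finishes.

In cmwq, when work B blocks, the worker pool notices this by tracking the execution state of its worker threads, and it brings in another worker thread to process work C and work D, which solves the concurrency problem. Solving the concurrency problem also removes the deadlocks that came from competing for a single process context.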

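3.3 cmwq data structures

3.3.1 The data structures and their relationships

The main data structures of the workqueue mechanism were touched on earlier. Here they are again as a whole:

work: a unit of work
workqueue: a collection of work items. The relationship between workqueue and work is one-to-many.
worker: the one doing the work. In the code, each worker corresponds to one worker_thread kernel thread.
worker_pool: a collection of workers. The relationship between worker_pool and worker is one-to-many.
pool_workqueue: the bridge between a workqueue and a worker pool, establishing the link between the two. The relationship between workqueue and pool_workqueue is one-to-many. Each pool_workqueue points to exactly one worker_pool, while a shared worker_pool can be referenced by the pool_workqueues of several workqueues. Abbreviated as pwq below.

The main members of each structure in the code are shown below.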

struct workqueue_struct {
    /* pwqs: list of all pwqs attached to this workqueue */
    struct list_head    pwqs;
    /* list: node that links this workqueue into the global list of workqueues */
    struct list_head    list;
    /* rescuer thread: can run under memory pressure, which avoids deadlock during memory reclaim */
    struct worker       *rescuer;
    /* flags: how the work items on this workqueue are to be processed */
    unsigned int        flags ____cacheline_aligned;
    /* cpu_pwqs: per-CPU pwqs of this workqueue, through which the per-CPU worker pools are reached */
    struct pool_workqueue __percpu *cpu_pwqs;
    /* numa_pwq_tbl: unbound pwqs of this workqueue indexed by node, through which the worker pool matching the flags is reached */
    struct pool_workqueue __rcu *numa_pwq_tbl[];
    ......
};

struct pool_workqueue {
    /* the worker pool this pwq is connected to */
    struct worker_pool  *pool;
    /* the workqueue this pwq belongs to */
    struct workqueue_struct *wq;
    ......
};

struct worker_pool {
    /* CPU this pool is bound to (per-CPU pools); -1 for an unbound pool */
    int         cpu;
    /* ID of the worker pool */
    int         id;
    /* worker pool flags, controlling how the pool operates */
    unsigned int        flags;
    /* work items waiting to be processed */
    struct list_head    worklist;   /* L: list of pending works */
    /* total number of workers in this pool */
    int         nr_workers; /* L: total number of workers */
    /* number of workers currently idle */
    int         nr_idle;    /* L: currently idle ones */
    /* list of idle workers */
    struct list_head    idle_list;  /* X: list of idle workers */
    /* hash table of busy workers */
    DECLARE_HASHTABLE(busy_hash, BUSY_WORKER_HASH_ORDER);
    /* list of all workers attached to the pool */
    struct list_head    workers;    /* A: attached workers */
    /*
      Counter of running workers, used to decide when workers are created or destroyed.
      It may be accessed from several CPUs at once, so it gets its own cache line
      to avoid cache line bouncing between cores.
    */
    atomic_t        nr_running ____cacheline_aligned_in_smp;
} ____cacheline_aligned_in_smp;

struct worker {
    /* list linkage: entry links the worker into the idle list, hentry into the busy hash table */
    union {
        struct list_head    entry;  /* L: while idle */
        struct hlist_node   hentry; /* L: while busy */
    };
    /* the work item currently being executed */
    struct work_struct  *current_work;
    /* callback function of the work item currently being executed */
    work_func_t     current_func;
    /* the pwq of the work item currently being executed */
    struct pool_workqueue   *current_pwq; /* L: current_work's pwq */
    /* the task structure used by this worker */
    struct task_struct  *task;
    /* the worker pool this worker belongs to */
    struct worker_pool  *pool;
    /* node that links the worker into its worker pool */
    struct list_head    node;
};

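struct work_struct {
    /*
      The low bits hold the flags of the work item.
      The remaining bits usually hold either the ID of the worker_pool that last ran it
      or a pointer to the pool_workqueue.
      The WORK_STRUCT_PWQ flag tells which of the two is stored.
    */
    atomic_long_t data;
    /* links the work item into a queue */
    struct list_head entry;
    /* handler function of the work item */
    work_func_t func;
};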
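The way `data` shares one word between flag bits and either a pool ID or a pwq pointer can be illustrated in user space. This is a sketch only: the `MODEL_*` constants and bit positions are made up for the example and are not the kernel's values, and the pointer trick relies on the pwq being aligned so that its low bits are zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t model_data_t;

/* Illustrative layout, not the kernel's: the low 8 bits are flags. */
#define MODEL_FLAG_BITS    8
#define MODEL_FLAG_MASK    (((model_data_t)1 << MODEL_FLAG_BITS) - 1)
#define MODEL_FLAG_PENDING ((model_data_t)1 << 0)
#define MODEL_FLAG_PWQ     ((model_data_t)1 << 1) /* upper bits hold a pwq pointer */

/* Aligned so that the low MODEL_FLAG_BITS bits of its address are zero. */
struct model_pwq {
    _Alignas(1 << MODEL_FLAG_BITS) int id;
};

/* Store a pwq pointer together with flag bits. */
static model_data_t model_pack_pwq(struct model_pwq *pwq, model_data_t flags)
{
    assert(((model_data_t)pwq & MODEL_FLAG_MASK) == 0);
    return (model_data_t)pwq | MODEL_FLAG_PWQ | (flags & MODEL_FLAG_MASK);
}

/* Store the ID of the pool that last ran the work; the PWQ flag stays clear. */
static model_data_t model_pack_pool_id(int pool_id, model_data_t flags)
{
    return ((model_data_t)pool_id << MODEL_FLAG_BITS) |
           (flags & MODEL_FLAG_MASK & ~MODEL_FLAG_PWQ);
}

/* Returns the pwq pointer, or NULL when the word holds a pool ID instead. */
static struct model_pwq *model_get_pwq(model_data_t data)
{
    if (!(data & MODEL_FLAG_PWQ))
        return NULL;
    return (struct model_pwq *)(data & ~MODEL_FLAG_MASK);
}

/* Returns the pool ID, or -1 when the word holds a pwq pointer instead. */
static int model_get_pool_id(model_data_t data)
{
    if (data & MODEL_FLAG_PWQ)
        return -1;
    return (int)(data >> MODEL_FLAG_BITS);
}
```

Packing both cases into a single word is what lets the whole state be read or updated atomically, which is presumably why the kernel member is an `atomic_long_t`.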

The relationships among these data structures are sketched in the following figure:

[Figure: workqueue data structures]

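3.3.2 Initialization of the cmwq data structures

As mentioned earlier, worker pools are created when the workqueue mechanism is initialized at boot. The code below shows how these data structures are set up, starting with the call graph.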

workqueue_init_early
  ->init_worker_pool
  ->alloc_workqueue_attrs
  ->alloc_workqueue
    ->__alloc_workqueue_key
      ->alloc_and_link_pwqs
        ->apply_workqueue_attrs
          ->apply_workqueue_attrs_locked
            ->apply_wqattrs_prepare
              ->alloc_unbound_pwq
                ->get_unbound_pool
                  ->init_worker_pool
                  ->hash_add
workqueue_init
  ->create_worker(per CPU)
  ->create_worker(unbound pool)

PS: the call chain is deep, so only the key points are covered here.

/* first stage of workqueue initialization (early init) */
int __init workqueue_init_early(void)
{
    int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
    int i, cpu;

    ......
    /* initialize CPU pools */
    for_each_possible_cpu(cpu) {
        struct worker_pool *pool;

        i = 0;
        for_each_cpu_worker_pool(pool, cpu) {
            /* initialize each CPU's worker pools, i.e. the per-CPU worker pools */
            BUG_ON(init_worker_pool(pool));
            pool->cpu = cpu;
            cpumask_copy(pool->attrs->cpumask, cpumask_of(cpu));
            pool->attrs->nice = std_nice[i++];
            pool->node = cpu_to_node(cpu);

            /* alloc pool ID */
            mutex_lock(&wq_pool_mutex);
            BUG_ON(worker_pool_assign_id(pool));
            mutex_unlock(&wq_pool_mutex);
        }
    }

    /* create default unbound and ordered wq attrs */
    /* NR_STD_WORKER_POOLS is the number of standard worker pools per CPU (normal and highpri) */
    for (i = 0; i < NR_STD_WORKER_POOLS; i++) {
        struct workqueue_attrs *attrs;
        /* create the attrs for the standard unbound workqueues */
        BUG_ON(!(attrs = alloc_workqueue_attrs(GFP_KERNEL)));
        attrs->nice = std_nice[i];
        /* note: unbound_std_wq_attrs initialized here is used later on */
        unbound_std_wq_attrs[i] = attrs;

        /*
         * An ordered wq should have only one pwq as ordering is
         * guaranteed by max_active which is enforced by pwqs.
         * Turn off NUMA so that dfl_pwq is used for all nodes.
         */
        /* create the attrs for ordered workqueues */
        BUG_ON(!(attrs = alloc_workqueue_attrs(GFP_KERNEL)));
        attrs->nice = std_nice[i];
        attrs->no_numa = true;
        /* note: ordered_wq_attrs initialized here is used later on */
        ordered_wq_attrs[i] = attrs;
    }

    /* everything below allocates the standard system workqueues */
    system_wq = alloc_workqueue("events", 0, 0);
    system_highpri_wq = alloc_workqueue("events_highpri", WQ_HIGHPRI, 0);
    system_long_wq = alloc_workqueue("events_long", 0, 0);
    system_unbound_wq = alloc_workqueue("events_unbound", WQ_UNBOUND,
                        WQ_UNBOUND_MAX_ACTIVE);
    system_freezable_wq = alloc_workqueue("events_freezable",
                          WQ_FREEZABLE, 0);
    system_power_efficient_wq = alloc_workqueue("events_power_efficient",
                          WQ_POWER_EFFICIENT, 0);
    system_freezable_power_efficient_wq = alloc_workqueue("events_freezable_power_efficient",
                          WQ_FREEZABLE | WQ_POWER_EFFICIENT,
                          0);
    BUG_ON(!system_wq || !system_highpri_wq || !system_long_wq ||
           !system_unbound_wq || !system_freezable_wq ||
           !system_power_efficient_wq ||
           !system_freezable_power_efficient_wq);

    return 0;
}


struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
                           unsigned int flags,
                           int max_active,
                           struct lock_class_key *key,
                           const char *lock_name, ...)
{
    size_t tbl_size = 0;
    va_list args;
    struct workqueue_struct *wq;
    struct pool_workqueue *pwq;

    /*
     * Unbound && max_active == 1 used to imply ordered, which is no
     * longer the case on NUMA machines due to per-node pools.  While
     * alloc_ordered_workqueue() is the right way to create an ordered
     * workqueue, keep the previous behavior to avoid subtle breakages
     * on NUMA.
     */
    /* adjust the flags depending on which ones were requested */
    if ((flags & WQ_UNBOUND) && max_active == 1)
        flags |= __WQ_ORDERED;

    /* see the comment above the definition of WQ_POWER_EFFICIENT */
    if ((flags & WQ_POWER_EFFICIENT) && wq_power_efficient)
        flags |= WQ_UNBOUND;

    /* allocate wq and format name */
    if (flags & WQ_UNBOUND)
        tbl_size = nr_node_ids * sizeof(wq->numa_pwq_tbl[0]);

    /* allocate the workqueue */
    wq = kzalloc(sizeof(*wq) + tbl_size, GFP_KERNEL);
    if (!wq)
        return NULL;

    /* allocate space for the workqueue attributes */
    if (flags & WQ_UNBOUND) {
        wq->unbound_attrs = alloc_workqueue_attrs(GFP_KERNEL);
        if (!wq->unbound_attrs)
            goto err_free_wq;
    }

    va_start(args, lock_name);
    vsnprintf(wq->name, sizeof(wq->name), fmt, args);
    va_end(args);

    max_active = max_active ?: WQ_DFL_ACTIVE;
    max_active = wq_clamp_max_active(max_active, flags, wq->name);

    /* init wq */
    wq->flags = flags;
    wq->saved_max_active = max_active;
    mutex_init(&wq->mutex);
    atomic_set(&wq->nr_pwqs_to_flush, 0);
    INIT_LIST_HEAD(&wq->pwqs);
    INIT_LIST_HEAD(&wq->flusher_queue);
    INIT_LIST_HEAD(&wq->flusher_overflow);
    INIT_LIST_HEAD(&wq->maydays);

    lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
    INIT_LIST_HEAD(&wq->list);

    /*
      Allocate the pool_workqueue structures for the workqueue.
      A pool_workqueue links the workqueue to a worker_pool.
    */
    if (alloc_and_link_pwqs(wq) < 0)
        goto err_free_wq;

    /*
     * Workqueues which may be used during memory reclaim should
     * have a rescuer to guarantee forward progress.
     */
    /* create the rescuer thread */
    if (flags & WQ_MEM_RECLAIM) {
        struct worker *rescuer;

        rescuer = alloc_worker(NUMA_NO_NODE);
        if (!rescuer)
            goto err_destroy;

        rescuer->rescue_wq = wq;
        rescuer->task = kthread_create(rescuer_thread, rescuer, "%s",
                           wq->name);
        if (IS_ERR(rescuer->task)) {
            kfree(rescuer);
            goto err_destroy;
        }

        wq->rescuer = rescuer;
        kthread_bind_mask(rescuer->task, cpu_possible_mask);
        wake_up_process(rescuer->task);
    }

    if ((wq->flags & WQ_SYSFS) && workqueue_sysfs_register(wq))
        goto err_destroy;

    /*
     * wq_pool_mutex protects global freeze state and workqueues list.
     * Grab it, adjust max_active and add the new @wq to workqueues
     * list.
     */
    mutex_lock(&wq_pool_mutex);

    mutex_lock(&wq->mutex);
    for_each_pwq(pwq, wq)
        pwq_adjust_max_active(pwq);
    mutex_unlock(&wq->mutex);

    list_add_tail_rcu(&wq->list, &workqueues);

    mutex_unlock(&wq_pool_mutex);

    return wq;

err_free_wq:
    free_workqueue_attrs(wq->unbound_attrs);
    kfree(wq);
    return NULL;
err_destroy:
    destroy_workqueue(wq);
    return NULL;
}

static int alloc_and_link_pwqs(struct workqueue_struct *wq)
{
    bool highpri = wq->flags & WQ_HIGHPRI;
    int cpu, ret;

    /* create and initialize the per-CPU pool_workqueues of the workqueue */
    if (!(wq->flags & WQ_UNBOUND)) {
        wq->cpu_pwqs = alloc_percpu(struct pool_workqueue);
        if (!wq->cpu_pwqs)
            return -ENOMEM;

        for_each_possible_cpu(cpu) {
            struct pool_workqueue *pwq =
                per_cpu_ptr(wq->cpu_pwqs, cpu);
            struct worker_pool *cpu_pools =
                per_cpu(cpu_worker_pools, cpu);

            /* attach the per-CPU worker_pool to the pool_workqueue */
            init_pwq(pwq, wq, &cpu_pools[highpri]);

            mutex_lock(&wq->mutex);
            /* link the pool_workqueue, which now ties the workqueue to the per-CPU worker pool */
            link_pwq(pwq);
            mutex_unlock(&wq->mutex);
        }
        return 0;
    } else if (wq->flags & __WQ_ORDERED) {
        ret = apply_workqueue_attrs(wq, ordered_wq_attrs[highpri]);
        /* there should only be single pwq for ordering guarantee */
        WARN(!ret && (wq->pwqs.next != &wq->dfl_pwq->pwqs_node ||
                  wq->pwqs.prev != &wq->dfl_pwq->pwqs_node),
             "ordering guarantee broken for workqueue %s\n", wq->name);
        return ret;
    } else {
        /*
          Apply the unbound_std_wq_attrs prepared earlier to the workqueue.
          This is where the workqueue creates and links the matching worker pool.
        */
        return apply_workqueue_attrs(wq, unbound_std_wq_attrs[highpri]);
    }
}

static int apply_workqueue_attrs_locked(struct workqueue_struct *wq,
                    const struct workqueue_attrs *attrs)
{
    struct apply_wqattrs_ctx *ctx;

    /* only unbound workqueues can change attributes */
    if (WARN_ON(!(wq->flags & WQ_UNBOUND)))
        return -EINVAL;

    /* creating multiple pwqs breaks ordering guarantee */
    if (!list_empty(&wq->pwqs)) {
        if (WARN_ON(wq->flags & __WQ_ORDERED_EXPLICIT))
            return -EINVAL;

        wq->flags &= ~__WQ_ORDERED;
    }

    /* apply the attributes to the workqueue */
    ctx = apply_wqattrs_prepare(wq, attrs);
    if (!ctx)
        return -ENOMEM;

    /* the ctx has been prepared successfully, let's commit it */
    apply_wqattrs_commit(ctx);
    apply_wqattrs_cleanup(ctx);

    return 0;
}

static struct apply_wqattrs_ctx *
apply_wqattrs_prepare(struct workqueue_struct *wq,
              const struct workqueue_attrs *attrs)
{
    struct apply_wqattrs_ctx *ctx;
    struct workqueue_attrs *new_attrs, *tmp_attrs;
    int node;

    lockdep_assert_held(&wq_pool_mutex);

    ctx = kzalloc(sizeof(*ctx) + nr_node_ids * sizeof(ctx->pwq_tbl[0]),
              GFP_KERNEL);

    /* allocate new space for the attrs */
    new_attrs = alloc_workqueue_attrs(GFP_KERNEL);
    tmp_attrs = alloc_workqueue_attrs(GFP_KERNEL);
    if (!ctx || !new_attrs || !tmp_attrs)
        goto out_free;

    /*
     * Calculate the attrs of the default pwq.
     * If the user configured cpumask doesn't overlap with the
     * wq_unbound_cpumask, we fallback to the wq_unbound_cpumask.
     */
    /* copy the attributes into new_attrs */
    copy_workqueue_attrs(new_attrs, attrs);
    cpumask_and(new_attrs->cpumask, new_attrs->cpumask, wq_unbound_cpumask);
    if (unlikely(cpumask_empty(new_attrs->cpumask)))
        cpumask_copy(new_attrs->cpumask, wq_unbound_cpumask);

    /*
     * We may create multiple pwqs with differing cpumasks.  Make a
     * copy of @new_attrs which will be modified and used to obtain
     * pools.
     */
    copy_workqueue_attrs(tmp_attrs, new_attrs);

    /*
     * If something goes wrong during CPU up/down, we'll fall back to
     * the default pwq covering whole @attrs->cpumask.  Always create
     * it even if we don't use it immediately.
     */
    /* create a pool_workqueue from the attrs and link it to the worker pool with matching attributes */
    ctx->dfl_pwq = alloc_unbound_pwq(wq, new_attrs);
    if (!ctx->dfl_pwq)
        goto out_free;

    for_each_node(node) {
        if (wq_calc_node_cpumask(new_attrs, node, -1, tmp_attrs->cpumask)) {
            ctx->pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs);
            if (!ctx->pwq_tbl[node])
                goto out_free;
        } else {
            ctx->dfl_pwq->refcnt++;
            ctx->pwq_tbl[node] = ctx->dfl_pwq;
        }
    }

    /* save the user configured attrs and sanitize it. */
    copy_workqueue_attrs(new_attrs, attrs);
    cpumask_and(new_attrs->cpumask, new_attrs->cpumask, cpu_possible_mask);
    ctx->attrs = new_attrs;

    ctx->wq = wq;
    free_workqueue_attrs(tmp_attrs);
    return ctx;

out_free:
    free_workqueue_attrs(tmp_attrs);
    free_workqueue_attrs(new_attrs);
    apply_wqattrs_cleanup(ctx);
    return NULL;
}

static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
                    const struct workqueue_attrs *attrs)
{
    struct worker_pool *pool;
    struct pool_workqueue *pwq;

    lockdep_assert_held(&wq_pool_mutex);
    /* get the worker pool that matches the attributes */
    pool = get_unbound_pool(attrs);
    if (!pool)
        return NULL;

    /* allocate the pool_workqueue structure */
    pwq = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL, pool->node);
    if (!pwq) {
        put_unbound_pool(pool);
        return NULL;
    }

    /* link the workqueue to the worker pool */
    init_pwq(pwq, wq, pool);
    return pwq;
}

static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
{
    u32 hash = wqattrs_hash(attrs);
    struct worker_pool *pool;
    int node;
    int target_node = NUMA_NO_NODE;

    lockdep_assert_held(&wq_pool_mutex);

    /* do we already have a matching pool? */
    /* if a worker pool with these attributes already exists, return it directly */
    hash_for_each_possible(unbound_pool_hash, pool, hash_node, hash) {
        if (wqattrs_equal(pool->attrs, attrs)) {
            pool->refcnt++;
            return pool;
        }
    }

    /* if cpumask is contained inside a NUMA node, we belong to that node */
    if (wq_numa_enabled) {
        for_each_node(node) {
            if (cpumask_subset(attrs->cpumask,
                       wq_numa_possible_cpumask[node])) {
                target_node = node;
                break;
            }
        }
    }

    /* nope, create a new one */
    /* otherwise create a new worker pool */
    pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, target_node);
    if (!pool || init_worker_pool(pool) < 0)
        goto fail;

    lockdep_set_subclass(&pool->lock, 1);   /* see put_pwq() */
    /* fill in the attributes of the worker pool */
    copy_workqueue_attrs(pool->attrs, attrs);
    pool->node = target_node;

    /*
     * no_numa isn't a worker_pool attribute, always clear it.  See
     * 'struct workqueue_attrs' comments for detail.
     */
    pool->attrs->no_numa = false;

    if (worker_pool_assign_id(pool) < 0)
        goto fail;

    /* create and start the initial worker */
    if (wq_online && !create_worker(pool))
        goto fail;

    /* install */
    /* add the worker pool to the hash table so that it can be found and iterated later */
    hash_add(unbound_pool_hash, &pool->hash_node, hash);

    return pool;
fail:
    if (pool)
        put_unbound_pool(pool);
    return NULL;
}

/* second stage of initialization */
int __init workqueue_init(void)
{
    ......

    /* create the initial workers */
    for_each_online_cpu(cpu) {
        for_each_cpu_worker_pool(pool, cpu) {
            pool->flags &= ~POOL_DISASSOCIATED;
            /* this is where the workers of the per-CPU worker pools are created */
            BUG_ON(!create_worker(pool));
        }
    }

    /*
      As described above, the standard unbound worker pools are added to this hash table.
      Walk the table and create a worker for each worker pool.
    */
    hash_for_each(unbound_pool_hash, bkt, pool, hash_node)
        BUG_ON(!create_worker(pool));
    ......
    return 0;
}

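In summary, the default workqueues and worker pools are created during system initialization. When a workqueue is created, the matching pool_workqueues are created according to its flags, and each pool_workqueue is linked to a worker pool with the corresponding attributes. For a bound workqueue, the per-CPU pool_workqueues point directly at the predefined per-CPU worker pools.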
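The "reuse a pool with equal attributes, otherwise create one" step performed by get_unbound_pool() can be modeled in user space as follows. This is a sketch with assumptions: the `model_*` names are hypothetical, the attributes are reduced to a nice value and a CPU bitmask, and a linear scan over a small array stands in for the kernel's hash table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced attribute set of an unbound pool. */
struct model_attrs {
    int nice;              /* priority of the pool's worker threads */
    unsigned long cpumask; /* CPUs the worker threads may run on */
};

struct model_upool {
    struct model_attrs attrs;
    int refcnt;            /* 0 means the slot is unused */
};

#define MODEL_MAX_POOLS 8
static struct model_upool model_pools[MODEL_MAX_POOLS];

static bool model_attrs_equal(const struct model_attrs *a,
                              const struct model_attrs *b)
{
    return a->nice == b->nice && a->cpumask == b->cpumask;
}

/*
 * Return a pool whose attributes equal @attrs, taking a reference.
 * Create one in a free slot if none exists; return NULL when full.
 */
static struct model_upool *model_get_unbound_pool(const struct model_attrs *attrs)
{
    struct model_upool *free_slot = NULL;
    size_t i;

    for (i = 0; i < MODEL_MAX_POOLS; i++) {
        struct model_upool *pool = &model_pools[i];

        if (pool->refcnt > 0) {
            if (model_attrs_equal(&pool->attrs, attrs)) {
                pool->refcnt++;     /* share the existing pool */
                return pool;
            }
        } else if (!free_slot) {
            free_slot = pool;       /* remember the first unused slot */
        }
    }

    if (!free_slot)
        return NULL;

    free_slot->attrs = *attrs;      /* no match found: create a new pool */
    free_slot->refcnt = 1;
    return free_slot;
}
```

Two workqueues that request the same attributes end up holding the same pool with a reference count of 2, while a request with a different nice value gets a pool of its own.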

3.4 cmwq API

3.4.1 API usage flow

Using cmwq is fairly simple and usually follows these steps:
1. If a dedicated workqueue is needed, create one with alloc_workqueue()
2. Define a struct work_struct and initialize it with INIT_WORK
3. Call queue_work_on to put the work on the workqueue

3.4.2 API description

3.4.2.1 Creating a workqueue

#define alloc_workqueue(fmt, flags, max_active, args...)        \
    __alloc_workqueue_key((fmt), (flags), (max_active),     \
                  NULL, NULL, ##args)
struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
                           unsigned int flags,
                           int max_active,
                           struct lock_class_key *key,
                           const char *lock_name, ...)

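fmt: the name of the workqueue
flags: determines the execution mode of the workqueue and how its work is processed (that is, which worker pools the workqueue is attached to). The main flags are:
- WQ_MEM_RECLAIM: must be set on any workqueue that may be used in memory reclaim paths, which includes cases where its work items depend on one another to make progress. With this flag, a rescuer kernel thread named after the workqueue is created together with the workqueue and kept in reserve.
  Example: suppose there are work items A, B, C and D. When work B blocks while being processed, the worker pool creates a new worker thread to process the others. Under memory pressure, however, creating a worker thread may fail. If work B depends on the result of work C or D, the system then deadlocks, purely because no new worker thread could be created. For every workqueue with WQ_MEM_RECLAIM set, the system creates a rescuer thread. When this situation occurs, the rescuer thread takes over work C or D and the deadlock is resolved.
- WQ_FREEZABLE: a power management related flag. On system suspend, user space processes and the kernel threads marked freezable (worker threads included) have to be frozen. A workqueue with this flag takes part in that freezing process. When its worker threads are being frozen, they finish all current work. Once frozen, no new work is started until the processes are thawed. By default, kernel threads are not freezable.
- WQ_UNBOUND: work on this workqueue does not need to run on a particular CPU, so the workqueue is associated with an unbound worker pool. If a pool matching the workqueue's attributes already exists, it is used. Otherwise a new worker pool is created to process the work.
- WQ_HIGHPRI: work queued to this workqueue is high priority and needs high-priority worker threads (workers with a lower nice value).
- WQ_CPU_INTENSIVE: work queued to this workqueue is particularly CPU hungry. Such work items do not count toward the concurrency level, so while runnable they do not prevent other work items in the same worker pool from starting, and their execution is regulated by the system scheduler. The start of a CPU intensive work item is still subject to concurrency management, so runnable non-CPU-intensive work items can delay it.
- WQ_POWER_EFFICIENT: because of caching, per-CPU thread pools tend to perform better, at some cost in power. A workqueue generally has to balance performance against power. For the best performance, let the worker thread on the same CPU process the work, since the cache hit rate is higher. From a power point of view, the best strategy is to leave idle CPUs idle instead of cycling through idle, busy, idle.
  Example: at time T1, a work item is scheduled on CPU A. At time T2 it finishes and CPU A goes idle. At time T3, a new work item needs processing. If it is scheduled on CPU A, the cache contents from the previous work are probably still there and the work runs quickly, but CPU A has to be woken from idle. If CPU B, which has been running all along, is chosen instead, no CPU has to be woken from idle, which saves power.
max_active: the maximum number of execution contexts per CPU that can be assigned to the work items of a workqueue. For example, with max_active set to 16, at most 16 work items of that workqueue can execute at the same time on one CPU.
For a bound workqueue, the upper limit is 512, and specifying 0 selects the default of 256.
For an unbound workqueue, the upper limit is the larger of 512 and 4 * num_possible_cpus(), and specifying 0 likewise selects the default of 256 (WQ_DFL_ACTIVE).
The number of active work items of a workqueue is usually regulated by its users, more precisely by how many work items they queue at the same time. Unless there is a specific need to throttle the number of active work items, the default of 0 is recommended.
Some users depend on the strict execution ordering of a single threaded workqueue. Setting max_active to 1 together with WQ_UNBOUND gives this behavior: work on such a workqueue always goes to an unbound worker pool, and only one work item is active at any given time, which yields the same ordering as a single threaded workqueue. Note that the comment in __alloc_workqueue_key quoted earlier names alloc_ordered_workqueue() as the right way to create an ordered workqueue.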
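How a requested max_active turns into the value actually used can be sketched as below. The model follows the `max_active ?: WQ_DFL_ACTIVE` line and the wq_clamp_max_active() call in the __alloc_workqueue_key listing above. The `model_*` names are hypothetical, and the unbound limit of max(512, 4 * number of possible CPUs) is taken from the kernel's workqueue documentation rather than from the listing.

```c
#include <assert.h>

#define MODEL_WQ_MAX_ACTIVE 512                       /* limit for bound workqueues */
#define MODEL_WQ_DFL_ACTIVE (MODEL_WQ_MAX_ACTIVE / 2) /* default when 0 is passed */

/*
 * Returns the max_active value that would take effect.
 * @max_active:        value passed by the caller, 0 meaning "default"
 * @unbound:           non-zero when the workqueue is unbound
 * @nr_possible_cpus:  number of possible CPUs in the system
 */
static int model_effective_max_active(int max_active, int unbound,
                                      int nr_possible_cpus)
{
    int limit = MODEL_WQ_MAX_ACTIVE;

    /* unbound workqueues get the larger of 512 and 4 per possible CPU */
    if (unbound && 4 * nr_possible_cpus > limit)
        limit = 4 * nr_possible_cpus;

    if (max_active == 0)
        max_active = MODEL_WQ_DFL_ACTIVE;

    /* clamp into [1, limit] */
    if (max_active < 1)
        return 1;
    if (max_active > limit)
        return limit;
    return max_active;
}
```

With 8 possible CPUs, passing 0 yields 256 for both bound and unbound workqueues, and a request of 1000 is clamped to 512. Only on large machines (more than 128 possible CPUs in this model) does the unbound limit rise above 512.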

3.4.2.2 Initializing a work item

#define INIT_WORK(_work, _func)                     \
    __INIT_WORK((_work), (_func), 0)

#define __INIT_WORK(_work, _func, _onstack)             \
    do {                                \
        __init_work((_work), _onstack);             \
        (_work)->data = (atomic_long_t) WORK_DATA_INIT();   \
        INIT_LIST_HEAD(&(_work)->entry);            \
        (_work)->func = (_func);                \
    } while (0)

_work: the struct work_struct to initialize
_func: the function the work item will execute

3.4.2.3 Queueing a work item

bool queue_work_on(int cpu, struct workqueue_struct *wq,
           struct work_struct *work)

cpu: the CPU on which the work should run
wq: the workqueue on which to queue the work
work: the work item to execute

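Besides alloc_workqueue, there is alloc_ordered_workqueue, which creates a workqueue that executes its work items strictly serially. Such a workqueue is of the unbound type.

The create_* interfaces exist only for compatibility with the old API and are not covered here.

With the basics in place, let us revisit the per-CPU thread pool and the unbound thread pool. When a workqueue receives a work item:

If it is an unbound workqueue, the work is handled by an unbound thread pool and the scheduler decides which CPU it runs on. The scheduler takes CPU idle state into account and tries to leave idle CPUs idle, which saves power. Processing work queued to an unbound workqueue is therefore power-aware.
If it is a bound workqueue, that is, WQ_UNBOUND is not set, the workqueue is per-CPU. The CPU on which the work runs is then not chosen by the scheduler, which can indirectly affect power consumption.

The following example shows how cmwq behaves under different configurations. Work items w0, w1 and w2 are queued to a bound workqueue on the same CPU. w0 burns CPU for 5ms, sleeps for 10ms, then burns CPU for another 5ms before finishing. w1 and w2 each burn CPU for 5ms and then sleep for 10ms. Other tasks and overhead are ignored, and FIFO scheduling is assumed.

With max_active = 1, all work items execute serially, as shown below.

TIME IN MSECS	EVENT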
0	w0 starts and burns CPU
5	w0 sleeps
15	w0 wakes up and burns CPU
20	w0 finishes
20	w1 starts and burns CPU
25	w1 sleeps
35	w1 wakes up and finishes
35	w2 starts and burns CPU
40	w2 sleeps
50	w2 wakes up and finishes

With max_active >= 3, other work items run while some are sleeping, as shown below.

TIME IN MSECS	EVENT
0	w0 starts and burns CPU
5	w0 sleeps
5	w1 starts and burns CPU
10	w1 sleeps
10	w2 starts and burns CPU
15	w2 sleeps
15	w0 wakes up and burns CPU
20	w0 finishes
20	w1 wakes up and finishes
25	w2 wakes up and finishes

With max_active = 2, w2 has to wait and runs last, as shown below.

TIME IN MSECS	EVENT
0	w0 starts and burns CPU
5	w0 sleeps
5	w1 starts and burns CPU
10	w1 sleeps
15	w0 wakes up and burns CPU
20	w0 finishes
20	w1 wakes up and finishes
20	w2 starts and burns CPU
25	w2 sleeps
35	w2 wakes up and finishes
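The three timelines above can be reproduced with a small millisecond-stepped simulation. It is a sketch under the same assumptions as the example (one CPU, FIFO order, no other load) and nothing more: the `sim_*` names are hypothetical, and the rule "start the next work item only when no active one is runnable and fewer than max_active are active" is a simplification of cmwq's concurrency management.

```c
#include <assert.h>

enum sim_state { SIM_QUEUED, SIM_RUNNABLE, SIM_SLEEPING, SIM_DONE };

struct sim_work {
    int burn_before;  /* ms of CPU before sleeping */
    int sleep_ms;     /* ms asleep */
    int burn_after;   /* ms of CPU after waking, 0 = finish on wakeup */
    enum sim_state state;
    int remaining;    /* CPU ms left in the current burn phase */
    int wake_at;      /* time at which the sleep ends */
    int second_phase; /* non-zero once the item has woken up */
};

/* Returns the time in ms at which the last of w0, w1, w2 finishes. */
static int sim_finish_time(int max_active)
{
    struct sim_work w[3] = {
        { 5, 10, 5, SIM_QUEUED, 0, 0, 0 }, /* w0 */
        { 5, 10, 0, SIM_QUEUED, 0, 0, 0 }, /* w1 */
        { 5, 10, 0, SIM_QUEUED, 0, 0, 0 }, /* w2 */
    };
    const int n = 3;
    int t = 0, done = 0, next = 0, active = 0, finish = 0;

    if (max_active < 1)
        max_active = 1;

    while (done < n) {
        int i, run = -1;

        /* 1. wake sleepers whose sleep has ended */
        for (i = 0; i < n; i++) {
            if (w[i].state != SIM_SLEEPING || w[i].wake_at > t)
                continue;
            if (w[i].burn_after == 0) {
                w[i].state = SIM_DONE;
                done++;
                active--;
                finish = t;
            } else {
                w[i].state = SIM_RUNNABLE;
                w[i].remaining = w[i].burn_after;
                w[i].second_phase = 1;
            }
        }

        /* 2. pick the first runnable item, or start the next queued one */
        for (i = 0; i < n; i++) {
            if (w[i].state == SIM_RUNNABLE) {
                run = i;
                break;
            }
        }
        if (run < 0 && next < n && active < max_active) {
            run = next++;
            w[run].state = SIM_RUNNABLE;
            w[run].remaining = w[run].burn_before;
            active++;
        }

        /* 3. let the chosen item burn the CPU for 1 ms */
        if (run >= 0 && --w[run].remaining == 0) {
            if (!w[run].second_phase) {
                w[run].state = SIM_SLEEPING;
                w[run].wake_at = t + 1 + w[run].sleep_ms;
            } else {
                w[run].state = SIM_DONE;
                done++;
                active--;
                finish = t + 1;
            }
        }
        t++;
    }
    return finish;
}
```

`sim_finish_time(1)`, `sim_finish_time(3)` and `sim_finish_time(2)` return 50, 25 and 35, matching the last row of each of the three tables.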

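Now suppose w1 and w2 are queued to a workqueue that has WQ_CPU_INTENSIVE set (in the kernel documentation's example, a separate workqueue from the one holding w0). The result is as follows.

TIME IN MSECS	EVENT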
0	w0 starts and burns CPU
5	w0 sleeps
5	w1 and w2 start and burn CPU
10	w1 sleeps
15	w2 sleeps
15	w0 wakes up and burns CPU
20	w0 finishes
20	w1 wakes up and finishes
25	w2 wakes up and finishes
