Linux内核学习014——进程调度(三)

Linux调度算法

在Linux中，调度器是以模块方式提供的，这样可以允许不同类型的进程有针对性地选择调度算法。这种模块化结构称为调度器类，其允许多种不同的可动态添加的调度算法并存，调度属于自己范畴的进程。每个调度器有一个优先级，基础调度器会依优先级遍历调度类，拥有最高优先级的调度器类选择将要执行的程序。

CFS就是一个调度器类，在Linux中称为SCHED_NORMAL，其具体定义在Linux2.6.34/kernel/sched_fair.c中。

CFS的理念很简单：进程调度的效果应如同系统具备一个理想中的完美多任务处理器。即每个进程可获得1/n的处理器时间(n为可运行的进程数量)。但是，由于调度时进程抢占的开销，比如：进程换入、进程换出、缓存影响等，理想模型是无法实现的。因此，CFS充分考虑了额外开销，确保了系统性能不受影响。CFS的做法是：允许每个进程运行一段时间，循环轮转，选择运行最少的进程作为下一个运行进程，而不是分配每个进程以时间片。nice值在CFS中被作为进程获得处理器运行比得权重(高nice值，低优先级，低处理器使用权重)。当可运行进程数量过多时，它们各自所获得的处理器使用比和时间片都趋于0，CFS为此设立了最小粒度，默认值为1ms。

总结：任何进程所获得的处理器时间是由它自己和其他所有可运行进程nice值的相对比例决定的。

Linux调度的实现

CFS具体实现位于kernel/schde_fair.c，其中需要重点关注的是四个组成部分：

时间记账
进程选择
调度器入口
睡眠和唤醒

时间记账

显然，所有的调度器都必须记录进程的运行时间。CFS中使用一个结构体sched_entity来记录进程运行信息，其定义在Linux2.6.34/include/linux/sched.h#L1090：

/*
 * CFS stats for a schedulable entity (task, task-group etc)
 *
 * Current field usage histogram:
 *
 *     4 se->block_start
 *     4 se->run_node
 *     4 se->sleep_start
 *     6 se->load.weight
 */
struct sched_entity {
    struct load_weight  load;       /* for load-balancing */
    struct rb_node      run_node;
    struct list_head    group_node;
    unsigned int        on_rq;

    u64         exec_start;
    u64         sum_exec_runtime;
    u64         vruntime;
    u64         prev_sum_exec_runtime;

    u64         last_wakeup;
    u64         avg_overlap;

    u64         nr_migrations;

    u64         start_runtime;
    u64         avg_wakeup;

#ifdef CONFIG_SCHEDSTATS
    u64         wait_start;
    u64         wait_max;
    u64         wait_count;
    u64         wait_sum;
    u64         iowait_count;
    u64         iowait_sum;

    u64         sleep_start;
    u64         sleep_max;
    s64         sum_sleep_runtime;

    u64         block_start;
    u64         block_max;
    u64         exec_max;
    u64         slice_max;

    u64         nr_migrations_cold;
    u64         nr_failed_migrations_affine;
    u64         nr_failed_migrations_running;
    u64         nr_failed_migrations_hot;
    u64         nr_forced_migrations;

    u64         nr_wakeups;
    u64         nr_wakeups_sync;
    u64         nr_wakeups_migrate;
    u64         nr_wakeups_local;
    u64         nr_wakeups_remote;
    u64         nr_wakeups_affine;
    u64         nr_wakeups_affine_attempts;
    u64         nr_wakeups_passive;
    u64         nr_wakeups_idle;
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
    struct sched_entity *parent;
    /* rq on which this entity is (to be) queued: */
    struct cfs_rq       *cfs_rq;
    /* rq "owned" by this entity/group: */
    struct cfs_rq       *my_q;
#endif
};

下半部分的成员变量需要设置了CONFIG_SCHEDSTATS和CONFIG_FAIR_GROUP_SCHED时才会启用，重点关注前半部分即可。

调度器实体结构是作为一个名为se的成员变量，嵌入在进程描述符struct task_struct中的。

2019-02-10_203619.png

虚拟实时

sched_entity中的u64 vruntime成员记录进程的虚拟运行时间，单位为ns。CFS使用vruntime变量来记录一个程序运行了多久以及还能运行多久，定义在Linux2.6.34/kernel/sched_fair.c#L518的函数update_curr实现了该记账功能：

static void update_curr(struct cfs_rq *cfs_rq)
{
    struct sched_entity *curr = cfs_rq->curr;
    u64 now = rq_of(cfs_rq)->clock;
    unsigned long delta_exec;

    if (unlikely(!curr))
        return;

    /*
     * Get the amount of time the current task was running
     * since the last time we changed load (this cannot
     * overflow on 32 bits):
     */
    delta_exec = (unsigned long)(now - curr->exec_start);
    if (!delta_exec)
        return;

    __update_curr(cfs_rq, curr, delta_exec);
    curr->exec_start = now;

    if (entity_is_task(curr)) {
        struct task_struct *curtask = task_of(curr);

        trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
        cpuacct_charge(curtask, delta_exec);
        account_group_exec_runtime(curtask, delta_exec);
    }
}

update_curr()计算了当前进程的执行时间，并将其存放在变量delta_exec总，然后将与运行时间传递给__update_curr()，由其根据当前可运行进程总数对运行时间进行加权计算，最终将权重值与当前进程的vruntime相加。

注：__update_curr()函数定义在update_curr()函数上方

/*
 * Update the current task's runtime statistics. Skip current tasks that
 * are not in our scheduling class.
 */
static inline void
__update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
          unsigned long delta_exec)
{
    unsigned long delta_exec_weighted;

    schedstat_set(curr->exec_max, max((u64)delta_exec, curr->exec_max));

    curr->sum_exec_runtime += delta_exec;
    schedstat_add(cfs_rq, exec_clock, delta_exec);
    delta_exec_weighted = calc_delta_fair(delta_exec, curr);

    curr->vruntime += delta_exec_weighted;
    update_min_vruntime(cfs_rq);
}

update_curr()是由系统定时器周期性调用的，因此vruntime可以准确地测量给定进程的运行时间，且可以知道下一额运行的进程。

进程选择

CFS需要选择下一个进程运行时，它会挑选一个具有最小vruntime的进程。CFS使用了红黑树来组织可运行进程队列，关于红黑树可以参考这篇文章。红黑树是一个自平衡二叉搜索树，以树节点形式存储数据，这些数据对应一个键值，可通过键值快速检索节点上的数据。

假设存在一个红黑树存储了系统中所有的可运行进程，其中的节点键值为可运行进程的虚拟运行时间。CFS选择vruntime最小的，即树中最左侧的叶子节点，可以从树的根节点一直沿左侧子节点向下找直到叶子节点。实现该过程的函数为__pick_next_entity()，其定义在Linux2.6.34/source/kernel/sched_fair.c#L377。

static struct sched_entity *__pick_next_entity(struct cfs_rq *cfs_rq)
{
    struct rb_node *left = cfs_rq->rb_leftmost;

    if (!left)
        return NULL;

    return rb_entry(left, struct sched_entity, run_node);
}

注：其实__pick_next_entity()函数本身并不会遍历树找到最左子节点，因为该值已经缓存在rb_leftmost中了。

加入进程

CFS先rbtree(红黑树)中加入可执行进程发生在进程变为可执行状态或者是通过fork()调用第一次创建进程是，enqueue_entity()函数实现了该过程：

static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
    /*
     * Update the normalized vruntime before updating min_vruntime
     * through callig update_curr().
     */
    if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE))
        se->vruntime += cfs_rq->min_vruntime;

    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);
    account_entity_enqueue(cfs_rq, se);

    if (flags & ENQUEUE_WAKEUP) {
        place_entity(cfs_rq, se, 0);
        enqueue_sleeper(cfs_rq, se);
    }

    update_stats_enqueue(cfs_rq, se);
    check_spread(cfs_rq, se);
    if (se != cfs_rq->curr)
        __enqueue_entity(cfs_rq, se);
}

该函数更新运行时间和其他一些统计数据，然后调用__enqueue_entity()进行插入操作：

/*
 * Enqueue an entity into the rb-tree:
 */
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
    struct rb_node *parent = NULL;
    struct sched_entity *entry;
    s64 key = entity_key(cfs_rq, se);
    int leftmost = 1;

    /*
     * Find the right place in the rbtree:
     */
    while (*link) {
        parent = *link;
        entry = rb_entry(parent, struct sched_entity, run_node);
        /*
         * We dont care about collisions. Nodes with
         * the same key stay together.
         */
        if (key < entity_key(cfs_rq, entry)) {
            link = &parent->rb_left;
        } else {
            link = &parent->rb_right;
            leftmost = 0;
        }
    }

    /*
     * Maintain a cache of leftmost tree entries (it is frequently
     * used):
     */
    if (leftmost)
        cfs_rq->rb_leftmost = &se->run_node;

    rb_link_node(&se->run_node, parent, link);
    rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
}

删除进程

删除进程发生在进程堵塞或者终止时：

static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)
{
    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);

    update_stats_dequeue(cfs_rq, se);
    if (sleep) {
#ifdef CONFIG_SCHEDSTATS
        if (entity_is_task(se)) {
            struct task_struct *tsk = task_of(se);

            if (tsk->state & TASK_INTERRUPTIBLE)
                se->sleep_start = rq_of(cfs_rq)->clock;
            if (tsk->state & TASK_UNINTERRUPTIBLE)
                se->block_start = rq_of(cfs_rq)->clock;
        }
#endif
    }

    clear_buddies(cfs_rq, se);

    if (se != cfs_rq->curr)
        __dequeue_entity(cfs_rq, se);
    account_entity_dequeue(cfs_rq, se);
    update_min_vruntime(cfs_rq);

    /*
     * Normalize the entity after updating the min_vruntime because the
     * update can refer to the ->curr item and we need to reflect this
     * movement in our normalized position.
     */
    if (!sleep)
        se->vruntime -= cfs_rq->min_vruntime;
}

实际删除工作由__dequeue_entity()完成。

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    if (cfs_rq->rb_leftmost == &se->run_node) {
        struct rb_node *next_node;

        next_node = rb_next(&se->run_node);
        cfs_rq->rb_leftmost = next_node;
    }

    rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
}

调度器入口

调度器入口时函数schedule()，该函数定义在Linux2.6.34/kernel/sched.c#L3698中。schedule()时内恶化其他部分调用进程调度器的入口，其通常和一个具体的调度类相关联。

/*
 * schedule() is the main scheduler function.
 */
asmlinkage void __sched schedule(void)
{
    struct task_struct *prev, *next;
    unsigned long *switch_count;
    struct rq *rq;
    int cpu;

need_resched:
    preempt_disable();
    cpu = smp_processor_id();
    rq = cpu_rq(cpu);
    rcu_sched_qs(cpu);
    prev = rq->curr;
    switch_count = &prev->nivcsw;

    release_kernel_lock(prev);
need_resched_nonpreemptible:

    schedule_debug(prev);

    if (sched_feat(HRTICK))
        hrtick_clear(rq);

    raw_spin_lock_irq(&rq->lock);
    update_rq_clock(rq);
    clear_tsk_need_resched(prev);

    if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
        if (unlikely(signal_pending_state(prev->state, prev)))
            prev->state = TASK_RUNNING;
        else
            deactivate_task(rq, prev, 1);
        switch_count = &prev->nvcsw;
    }

    pre_schedule(rq, prev);

    if (unlikely(!rq->nr_running))
        idle_balance(cpu, rq);

    put_prev_task(rq, prev);
    next = pick_next_task(rq);

    if (likely(prev != next)) {
        sched_info_switch(prev, next);
        perf_event_task_sched_out(prev, next);

        rq->nr_switches++;
        rq->curr = next;
        ++*switch_count;

        context_switch(rq, prev, next); /* unlocks the rq */
        /*
         * the context switch might have flipped the stack from under
         * us, hence refresh the local variables.
         */
        cpu = smp_processor_id();
        rq = cpu_rq(cpu);
    } else
        raw_spin_unlock_irq(&rq->lock);

    post_schedule(rq);

    if (unlikely(reacquire_kernel_lock(current) < 0)) {
        prev = rq->curr;
        switch_count = &prev->nivcsw;
        goto need_resched_nonpreemptible;
    }

    preempt_enable_no_resched();
    if (need_resched())
        goto need_resched;
}

睡眠和唤醒

睡眠(被阻塞)的进程无法执行，原因是其在等待一些时间发生，比如：文件I/O，网络请求等。休眠中的进程有两个相关的进程状态：TASK_INTERRUPTIBLE和TASK_UNINTERRUPTIBLE。区别在于处于TASK_UNINTERRUPTIBLE的进程会忽略信号，而处于TASK_INTERRUPTIBLE的进程会被信号唤醒并响应信号。此外，两种状态的进程都位于一个等待队列上等待事件发生，不能运行。

唤醒

唤醒通过wake_up()函数完成，它会唤醒指定的等待队列上的所有进程。

Linux内核学习014——进程调度(三)

Linux内核学习014——进程调度(三)

Linux调度算法

Linux调度的实现

时间记账

进程选择

调度器入口

睡眠和唤醒

你可能感兴趣的:(Linux内核学习014——进程调度(三))