进程的调度与切换直接影响着进程子系统的执行效率.Linux摒弃了i386 硬件提供的进程切换方法.手动保存进程上下文.在调度策略上,近几个版本对其都有很大的改动.特别是在2.6.23版本与以前发布的2.6.0更是相差甚远.在调度方面.我们以2.6.9在代码作为基准作为分析.
* one hypercall.
//赋active_mm为空.
/* Here we just switch the register state and the stack. */
* frame will be invalid.
//prev == next .那当前cpu中的active_mm就是prev. 也即next
* tlb flush IPI delivery. We must reload %cr3.
Next->tread.esp à esp(将next->tread.esp加载esp寄存器.这一步实际上是加载next进程的内核栈。运行到这里。理论上prev已经被切换到next了。因为因此如果用curret获得当前进程应该是next)
将标号为1的地址 à prev->thread.eip (movl $1f,%1.虽然理论是进程是切换完成了,但是需要保存prev的执行环境。这里是指明prev进程的下一条运行指令。这样如果切换进prev进程的话,就会从标号1的地址开始运行)
这里要注意的是__switch_to函数是经过regparm(3)来修饰的。这个是GCC的一个扩展语法。即从eax.ebx.ecx寄存器取函数的参数。这样:__switch_to()函数的参数指是从寄存器里取的。并不是像普通函数那样从当前堆栈中取得.在调用函数__switch_to()之前。将next->thread.eip压栈了。这样从函数返回之后,它的下一条运行指令就是next->thread.eip了
对于新创建的进程。我们设置了 p->thread.eip = (unsigned long) ret_from_fork。这样子进程被切换进来之后,就会通过ret_from_fork返回到用户空间了。(详情请参照本站的相关文档)
对于已经运行的进程。我们在这里可以看到,在进程被切换出去的时候,prev->thread.eip会被设置成标号1的地址。即是从标号1的地址开始运行的。
Esi.edi寄存器:有人认为在输出部中有这两句:"=S" (esi),"=D" (edi) 认为会将esi edi压栈。这际上这是错误的。压栈的只是esi edi变量(在嵌入代码之前定义的).而不是esi edi寄存器。在这里。它只是在嵌入代码运行之后。将esi edi寄存器的内容分别放到esi edi变量中而已。在我用gcc –S 测试结果看来也确实如此。
在switch_to()中执行环境的切换过程。在prev被切换回来之后。是从switch_to中标号1的地址继续执行的。这时候已经被切换回了进程prev的内核栈。假设此时没有保存其它的寄存器(flags ,ebp是必须要保存的哦 ^_^ ).它的后续操作是barrier()(GCC的一段嵌入代码。声明内存已经变动.它实际上没有任何操作,这跟寄存器没关系).调用函数finish_task_switch().再从context_switch()函数中返回。再从schedule()中返回。再继续执行prev进程流。
我们知道,C函数的调用机制中。在调用之前先会执行进程环境的保存,再推入参数。最后推入下一条执令地址。然后跳转到函数地址。从函数中返回之后。会将环境恢复。继续执行。
这样我们知道。在switch_to()后的finish_task_switch()的函数调用是不受寄存器影响的(因为这是个函数。在进入函数后相当是一个全新的“世界”).然后从context_switch()返回之后就会恢复schedule()中的执行环境了。到了schedule()之后,这时候的环境跟之前切换出去的环境是完全一样的了.
当然,这样做的前提时:在switch_to()之后的代码对寄存器没有依赖。也就是全为函数。假设在switch_to()之后,有对ecx .edi esi寄存器的读取操作就会产生错误了。
在switch_to中会调用__switch_to()。它的代码如下所示:
struct task_struct fastcall * __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
struct thread_struct *prev = &prev_p->thread,
*next = &next_p->thread;
int cpu = smp_processor_id();
struct tss_struct *tss = &per_cpu(init_tss, cpu);
/* never put a printk in __switch_to... printk() calls wake_up*() indirectly */
//MMX,FPU,XMM寄存器的相关操作
//我们在前面已经详细分析过这个函数。详细请参阅本站相关文档
__unlazy_fpu(prev_p);
/* we're going to use this soon, after a few expensive things */
if (next_p->fpu_counter > 5)
prefetch(&next->i387.fxsave);
/*
* Reload esp0.
*/
//将next->thread.esp0 --> tss
//从用户态切换到内核态的时候,将此值加载esp寄存器.即为进程的内核栈顶位置
load_esp0(tss, next);
/*
* Save away %gs. No need to save %fs, as it was saved on the
* stack on entry. No need to save %es and %ds, as those are
* always kernel segments while inside the kernel. Doing this
* before setting the new TLS descriptors avoids the situation
* where we temporarily have non-reloadable segments in %fs
* and %gs. This could be an issue if the NMI handler ever
* used %fs or %gs (it does not today), or if the kernel is
* running inside of a hypervisor layer.
*/
//为 prev保存gs
savesegment(gs, prev->gs);
/*
* Load the per-thread Thread-Local Storage descriptor.
*/
//从next的tls_array 缓存中加载线程的Thread-Local Storage描述符
load_TLS(next, cpu);
/*
* Restore IOPL if needed. In normal use, the flags restore
* in the switch assembly will handle this. But if the kernel
* is running virtualized at a non-zero CPL, the popf will
* not restore flags, so it must be done in a separate step.
*/
//如果当前特权级别是0并且prev->iopl != next->iopl则恢复IOPL设置set_iopl_mask(next->iopl)。
if (get_kernel_rpl() && unlikely(prev->iopl != next->iopl))
set_iopl_mask(next->iopl);
/*
* Now maybe handle debug registers and/or IO bitmaps
*/
// 根据thread_info的TIF标志_TIF_WORK_CTXSW和TIF_IO_BITMAP判断是否需要处理debug寄存器和IO位图
if (unlikely(task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV ||
task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT))
__switch_to_xtra(prev_p, next_p, tss);
/*
* Leave lazy mode, flushing any hypercalls made here.
* This must be done before restoring TLS segments so
* the GDT and LDT are properly updated, and must be
* done before math_state_restore, so the TS bit is up
* to date.
*/
//arch_leave_lazy_cpu_mode()设置CPU的lazy模式
arch_leave_lazy_cpu_mode();
/* If the task has used fpu the last 5 timeslices, just do a full
* restore of the math state immediately to avoid the trap; the
* chances of needing FPU soon are obviously high now
*/
// 如果next_p->fpu_counter > 5则恢复next_p的FPU寄存器内容
if (next_p->fpu_counter > 5)
math_state_restore();
/*
* Restore %gs if needed (which is common)
*/
//如果需要,恢复gs寄存器
if (prev->gs | next->gs)
loadsegment(gs, next->gs);
//把当前的task写入current_task(一个per-cpu变量)
x86_write_percpu(current_task, next_p);
return prev_p;
}
这个函数涉及到很多硬件相关的东西。需要参阅x86的架构体系方面的资料才能完全理解。
进程切换完了之后,还得要进行一些清理工作,这是由finish_task_switch()完成的。它的代码如下:
static inline void finish_task_switch(struct rq *rq, struct task_struct *prev)
__releases(rq->lock)
{
struct mm_struct *mm = rq->prev_mm;
long prev_state;
//清空运行队列的prev_mm .
rq->prev_mm = NULL;
/*
* A task struct has one reference for the use as "current".
* If a task dies, then it sets TASK_DEAD in tsk->state and calls
* schedule one last time. The schedule call will never return, and
* the scheduled task must drop that reference.
* The test for TASK_DEAD must occur while the runqueue locks are
* still held, otherwise prev could be scheduled on another cpu, die
* there before we look at prev->state, and then the reference would
* be dropped twice.
*/
prev_state = prev->state;
finish_arch_switch(prev);
finish_lock_switch(rq, prev);
fire_sched_in_preempt_notifiers(current);
//mmdrop:减少mm的引用计数,如果引用计数为零,释放之
if (mm)
mmdrop(mm);
//如果过时程已经终止
if (unlikely(prev_state == TASK_DEAD)) {
/*
* Remove function-return probe instances associated with this
* task and put them back on the free list.
*/
//释放该进程的task
kprobe_flush_task(prev);
put_task_struct(prev);
}
}
这个函数比较简单。不需要做详细分析了。
到这里,进程切换已经分析完了。那进程什么时候切换呢?切换的依据又是什么呢?这就是我们接下来要分析的进程调度方面的内容了。
二:进程调度
进程调度在近几个版本中都进行了重要的修改。我们以2.6.9版为例过行分析。在进行具体的代码分析之前。我们先学习一下关于进程调度的原理。
1:进程类型:
在linux调度算法中,将进程分为两种类型。即:I/O消耗型和CPU消耗型。例如文本处理程序与正在执行的Make的程序。文本处理程序大部份时间都在等待I/O设备的输入,而make程序大部份时间都在CPU的处理上。因此为了提高响应速度,I/O消耗程序应该有较高的优先级,才能提高它的交互性。相反的,Make程序相比之下就不那么重要了。只要它能处理完就行了。因此,基于这样的原理,linux有一套交互程序的判断机制。
在task_struct结构中新增了一个成员:sleep_avg.此值初始值为100。进程在CPU上执行时,此值减少。当进程在等待时,此值增加。最后,在调度的时候。根据sleep_avg的值重新计算优先级。
2:进程优先级
正如我们在上面所说的:交互性强的需要高优先级,交互性弱的需要低优先级。在linux系统中,有两种优先级:普通优先级和实时优先级。我们在这里主要分析的是普通优先级,实时优先级部份可自行了解。
3:运行时间片
进程的时间片是指进程在抢占前可以持续运行的时间。在linux中,时间片长短可根据优先级来调度。进程不一定要一次运行完所有的时间片。可以在运时的中途被切换出去。
4:进程抢占
当一个进程被设为TASK_RUNING状态时。它会判断它的优先级是否高于正在运行的进程。如果是,则设置调度标志位,调用schedule()执行进程的调度。当一个进程的时间片为0时,也会执行进程抢占.
重要的数据结构:
在2。6。9中,每个CPU都有一个运行队列,这样就避免了竞争所带来的额外开销。运行队列的结构如下所示:
struct runqueue {
spinlock_t lock;
/*
* nr_running and cpu_load should be in the same cacheline because
* remote CPUs use both these fields when doing load calculation.
*/
//运行队列中进程数目
unsigned long nr_running;
#ifdef CONFIG_SMP
unsigned long cpu_load;
#endif
//进程切换的总数
unsigned long long nr_switches;
//当新一轮的时间片递减开始后,
//expired_timestamp变量记录着最早发生的进程耗完时间片事件的时间
//nr_uninterruptible: 记录本 CPU 尚处于 TASK_UNINTERRUPTIBLE 状态的进程数,和负载信息有关
unsigned long expired_timestamp, nr_uninterruptible;
//本就绪队列最近一次发生调度事件的时间
unsigned long long timestamp_last_tick;
//当前CPU上运行的进程和指定的空闲进程
task_t *curr, *idle;
//上一次调度时的MM结构
struct mm_struct *prev_mm;
//两个队列:active与expired
//active记录的是活动的队列
//expired:记录的是时间片运行完了的队列
prio_array_t *active, *expired, arrays[2];
//记录 expired 就绪进程组中的最高优先级(数值最小)
// 该变量在进程进入 expired 队列的时候保存
int best_expired_prio;
//记录本 CPU 因等待 IO 而处于休眠状态的进程数
atomic_t nr_iowait;
……
……
}
prio_array_t的定义如下:
struct prio_array {
//该队列的进程数
unsigned int nr_active;
//位图.每一位表示一个队列,如果该队列有可以运行的进程,置该位为1
unsigned long bitmap[BITMAP_SIZE];
//运行队列
struct list_head queue[MAX_PRIO];
};
进程调度的实现
对调度有了一般的了解之后,我们来看下进程调度到底是怎么样实现的:
进程调度由schedule()完成,它的代码如下:
asmlinkage void __sched schedule(void)
{
long *switch_count;
task_t *prev, *next;
runqueue_t *rq;
prio_array_t *array;
struct list_head *queue;
unsigned long long now;
unsigned long run_time;
int cpu, idx;
//WARN_ON(system_state == SYSTEM_BOOTING);
/*
* Test if we are atomic. Since do_exit() needs to call into
* schedule() atomically, we ignore that path for now.
* Otherwise, whine if we are scheduling when we should not be.
*/
//in_atomic:判断是否在一个中断上下文或者是原子上下文
if (likely(!(current->state & (TASK_DEAD | TASK_ZOMBIE)))) {
if (unlikely(in_atomic())) {
printk(KERN_ERR "bad: scheduling while atomic!\n");
dump_stack();
}
}
need_resched:
//禁止抢占
preempt_disable();
prev = current;
//得到当前CPU上的运行队列
rq = this_rq();
/*
* The idle thread is not allowed to schedule!
* Remove this check after it has been exercised a bit.
*/
//如果当前进程是空闲进程而且不处于运行状态.
//BUG:
if (unlikely(current == rq->idle) && current->state != TASK_RUNNING) {
printk(KERN_ERR "bad: scheduling from the idle thread!\n");
dump_stack();
}
release_kernel_lock(prev);
schedstat_inc(rq, sched_cnt);
now = sched_clock();
//timestamp: 1:切换上去的时间戳
// 2:切换下来的时间戳
// 3:被唤醒的时间恰戳
//prev是当前正在运行的进程.这个timestamp应该是表示被切换上去的时间戳
//当前运行的时间.如果超过了NS_MAX_SLEEP_AVG. 取运行时间为NS_MAX_SLEEP_AVG
if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG))
run_time = now - prev->timestamp;
else
run_time = NS_MAX_SLEEP_AVG;
/*
* Tasks with interactive credits get charged less run_time
* at high sleep_avg to delay them losing their interactive
* status
*/
//HIGH_CREDIT()判断是否是一个交互式进程
//interactive_credit 超过了阀值
//如果是交互式进程,则它参与优先级计算的运行时间会比实际运行时间小
//以此获得较高的优先级
if (HIGH_CREDIT(prev))
run_time /= (CURRENT_BONUS(prev) ? : 1);
spin_lock_irq(&rq->lock);
/*
* if entering off of a kernel preemption go straight
* to picking the next task.
*/
switch_count = &prev->nivcsw;
//state: -1:不可运行 0:可运行的 >0 暂停的
if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
switch_count = &prev->nvcsw;
//可中断且有末处理的信号.激活之.
if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
unlikely(signal_pending(prev))))
prev->state = TASK_RUNNING;
else
//从运行队列中移除
deactivate_task(prev, rq);
}
//当前CPU
cpu = smp_processor_id();
//运行队列中没有可供运行的进程了
if (unlikely(!rq->nr_running)) {
go_idle:
//如果没有可运行的进程了.从其它CPU上"搬"进程过来
idle_balance(cpu, rq);
//如果还是没有可运行进程.则将Next赋值为空闲进程.
//再跳转到switch_tasks.将空闲进程切换进去
if (!rq->nr_running) {
next = rq->idle;
rq->expired_timestamp = 0;
wake_sleeping_dependent(cpu, rq);
/*
* wake_sleeping_dependent() might have released
* the runqueue, so break out if we got new
* tasks meanwhile:
*/
if (!rq->nr_running)
goto switch_tasks;
}
} else {
if (dependent_sleeper(cpu, rq)) {
schedstat_inc(rq, sched_goidle);
next = rq->idle;
goto switch_tasks;
}
/*
* dependent_sleeper() releases and reacquires the runqueue
* lock, hence go into the idle loop if the rq went
* empty meanwhile:
*/
if (unlikely(!rq->nr_running))
goto go_idle;
}
//活动队列
array = rq->active;
//如果活动队列中进程数为0.
if (unlikely(!array->nr_active)) {
/*
* Switch the active and expired arrays.
*/
//对换expired 与active
schedstat_inc(rq, sched_switch);
rq->active = rq->expired;
rq->expired = array;
array = rq->active;
rq->expired_timestamp = 0;
rq->best_expired_prio = MAX_PRIO;
} else
schedstat_inc(rq, sched_noswitch);
//在活动队列组中取第一个不为空的队列
idx = sched_find_first_bit(array->bitmap);
queue = array->queue + idx;
next = list_entry(queue->next, task_t, run_list);
//next进程即是将要被切换进的进程
//rt_task():判断是否是一个实时进程
if (!rt_task(next) && next->activated > 0) {
unsigned long long delta = now - next->timestamp;
//task->activated:表示进程因什么原因进入就绪态,这一原因会影响到调度优先级的计算
//-1,进程从 TASK_UNINTERRUPTIBLE 状态被唤醒
//0,缺省值,进程原本就处于就绪态
// 1,进程从 TASK_INTERRUPTIBLE 状态被唤醒,且不在中断上下文中
// 2,进程从 TASK_INTERRUPTIBLE 状态被唤醒,且在中断上下文中
//如果是中断服务程序调用的 activate_task(),也就是说进程由中断激活,则该进程最有可能是交互式的,因此,置 activated=2;否则置activated=1。
//如果进程是从 TASK_UNINTERRUPTIBLE 状态中被唤醒的,则 activated=-1(在try_to_wake_up()函数中)。
if (next->activated == 1)
delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
array = next->array;
//将进程移出队列
dequeue_task(next, array);
//重新计算优先级
recalc_task_prio(next, next->timestamp + delta);
//重新插入运行队列
enqueue_task(next, array);
}
next->activated = 0;
switch_tasks:
prefetch(next);
//清除调度标志
clear_tsk_need_resched(prev);
rcu_qsctr_inc(task_cpu(prev));
//更新prev->sleep_avg
prev->sleep_avg -= run_time;
//如果sleep_avg小于零
if ((long)prev->sleep_avg <= 0) {
prev->sleep_avg = 0;
if (!(HIGH_CREDIT(prev) || LOW_CREDIT(prev)))
prev->interactive_credit--;
}
prev->timestamp = prev->last_ran = now;
sched_info_switch(prev, next);
//如果prev与next不相等,则执行切换过程
if (likely(prev != next)) {
next->timestamp = now;
rq->nr_switches++;
rq->curr = next;
++*switch_count;
//执行进程的切换过程
prepare_arch_switch(rq, next);
prev = context_switch(rq, prev, next);
barrier();
finish_task_switch(prev);
} else
spin_unlock_irq(&rq->lock);
reacquire_kernel_lock(current);
preempt_enable_no_resched();
//如果还有调度标志.那就重新执行调度过程.
//这个调度标志可能是在别处被设置的
if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
goto need_resched;
}
在上面的代码中涉及到了SMP的负载平衡,动态优先级的知识。
所谓SMP负载平衡是指,当一个CPU的运行进程过多,但是另一个CPU的进程过少。需要在两者之间平衡运行的进程数。显而易见,这样可以提高系统的运行效率。
动态优先级中涉及到很多的公式,在这里不加详细分析,请自行了解。
我们接着看,如果一个进程的时间片完了,会进行怎么的处理。这个处理过程是在时间中断中完成的。详细请参阅本站的相关分析。
在时间中断处理程序中,会调用sheduler_tick()对当前进程运行的时间片做特定的处理。它的代码如下:
void scheduler_tick(int user_ticks, int sys_ticks)
{
//当前CPU
int cpu = smp_processor_id();
struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
//取得当前CPU的运行队列
runqueue_t *rq = this_rq();
task_t *p = current;
//更新timestamp_last_tick为当前时间
rq->timestamp_last_tick = sched_clock();
if (rcu_pending(cpu))
rcu_check_callbacks(cpu, user_ticks);
/* note: this timer irq context must be accounted for as well */
if (hardirq_count() - HARDIRQ_OFFSET) {
cpustat->irq += sys_ticks;
sys_ticks = 0;
} else if (softirq_count()) {
cpustat->softirq += sys_ticks;
sys_ticks = 0;
}
//如果当前进程为空闲进程
if (p == rq->idle) {
if (atomic_read(&rq->nr_iowait) > 0)
cpustat->iowait += sys_ticks;
else
cpustat->idle += sys_ticks;
if (wake_priority_sleeper(rq))
goto out;
rebalance_tick(cpu, rq, IDLE);
return;
}
if (TASK_NICE(p) > 0)
cpustat->nice += user_ticks;
else
cpustat->user += user_ticks;
cpustat->system += sys_ticks;
/* Task might have expired already, but not scheduled off yet */
//如果当前进程不是在活动队列中
//说明P是过期但末移除的进程
if (p->array != rq->active) {
//设置调度标志位
set_tsk_need_resched(p);
goto out;
}
spin_lock(&rq->lock);
/*
* The task was running during this tick - update the
* time slice counter. Note: we do not update a thread's
* priority until it either goes to sleep or uses up its
* timeslice. This makes it possible for interactive tasks
* to use up their timeslices at their highest priority levels.
*/
//实时进程的处理
if (rt_task(p)) {
/*
* RR tasks need a special form of timeslice management.
* FIFO tasks have no timeslices.
*/
if ((p->policy == SCHED_RR) && !--p->time_slice) {
p->time_slice = task_timeslice(p);
p->first_time_slice = 0;
set_tsk_need_resched(p);
/* put it at the end of the queue: */
dequeue_task(p, rq->active);
enqueue_task(p, rq->active);
}
goto out_unlock;
}
//如果进程的时间片用完
if (!--p->time_slice) {
//移出活动队列
dequeue_task(p, rq->active);
//设置调度标志
set_tsk_need_resched(p);
//重新计算优先级与时间片
p->prio = effective_prio(p);
p->time_slice = task_timeslice(p);
p->first_time_slice = 0;
//如果rq->expired_timestamp 为空,则设为当前时间
//expired_timestamp:表示运行队列中最先时间片耗完的时间
if (!rq->expired_timestamp)
rq->expired_timestamp = jiffies;
//在这里要注意了:
//如果是一个交互性很强的进程.如果它时间版运行完了.将它移至expired队列的
//话,那就必须要等active队列的进程运行完之后才能被调度.这样会影响用户的交互性
//判断是否是一个交互性很强的进程.如果不是,将其加至expired队列
//如果是,将其加至active队列
if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
//加入expired队列
enqueue_task(p, rq->expired);
if (p->static_prio < rq->best_expired_prio)
rq->best_expired_prio = p->static_prio;
} else
//移至active队列
enqueue_task(p, rq->active);
} else {
/*
* Prevent a too long timeslice allowing a task to monopolize
* the CPU. We do this by splitting up the timeslice into
* smaller pieces.
*
* Note: this does not mean the task's timeslices expire or
* get lost in any way, they just might be preempted by
* another task of equal priority. (one with higher
* priority would have preempted this task already.) We
* requeue this task to the end of the list on this priority
* level, which is in essence a round-robin of tasks with
* equal priority.
*
* This only applies to tasks in the interactive
* delta range with at least TIMESLICE_GRANULARITY to requeue.
*/
//当前进程的时间片没有运行完...
//如果进程的剩余时间超长.
if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
(p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
(p->array == rq->active)) {
//从活动队列中移除
dequeue_task(p, rq->active);
//设置调度标志
set_tsk_need_resched(p);
//重新计算优先级
p->prio = effective_prio(p);
//重新插入活动队列
enqueue_task(p, rq->active);
}
}
out_unlock:
spin_unlock(&rq->lock);
out:
rebalance_tick(cpu, rq, NOT_IDLE);
}
总结:
操作系统用的进程调度算法是衡量一个操作系统的重要标志。在2.4之前,调度部份算法变动很少。但是在2.6版中的近几个版本中发生了很大的变化。其中重要的变化是支持内核抢占。内核抢占导致了内核的复杂性。很多代码都必须要保持重入。