A kernel patch I submitted: CFS child-runs-first

Today I submitted a kernel patch. It mainly concerns letting the child process run before its parent at fork time. The email body follows:
The CFS scheduler has been the main scheduler since 2.6.23. Everything is fair: no starvation, no complexity. A new task is not simply queued at the head so that it quickly preempts current. According to the code of kernel 2.6.28, if you clear the START_DEBIT bit with sysctl -w kernel.sched_features=orig_value&~START_DEBIT_bit, the child task will not always preempt its parent, and the problem is easier to reproduce if the parent task has a lower nice value. My test file is:
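/*******child_first.c**********/
#define _GNU_SOURCE     /* needed for cpu_set_t, CPU_ZERO, CPU_SET and sched_setaffinity */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main(int argc,char *argv[])
{
        cpu_set_t mask;
        int v, i = 90000;
        if (argc < 2)
                return 1;       /* usage: child_first <nice increment> */
        CPU_ZERO( &mask );
        CPU_SET(0, &mask );     /* pin parent and child to CPU 0 */
        sched_setaffinity( 0, sizeof(mask), &mask );
        v = atoi(argv[1]);
        nice(v);
        while(i-->0)            /* burn some CPU time before forking */
        {
                v++;
        }
        if(fork() == 0)
        {
                printf("sub\n");
                exit(0);
        }
        printf("main,%d\n",v);
        return 0;
}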
Just compile it to child_first and do the following:
[root@zhaoya ~]#sysctl -w kernel.sched_features=0
[root@zhaoya ~]#./child_first -20
[root@zhaoya ~]#./child_first -xx
...
[root@zhaoya ~]#./child_first 10000...
After all this, believe your eyes.
  The reason is that the code deciding whether the child should preempt the parent is very LOOSE! If the parent's nice value is very low and nr_running is very small, cfs_rq->min_vruntime is always equal to the parent's vruntime, so {curr->vruntime <= se->vruntime}. If the nice value is high, cfs_rq->min_vruntime is always less than the parent's vruntime, so {cfs_rq->min_vruntime <= curr->vruntime}.
Signed-off-by: Ya Zhao < [email protected]>
---
--- linux-2.6.28.1/kernel/sched_fair.c.orig    2009-04-28 22:26:00.000000000 +0800
+++ linux-2.6.28.1/kernel/sched_fair.c    2009-04-28 22:34:49.000000000 +0800
@@ -1628,12 +1628,13 @@ static void task_new_fair(struct rq *rq,
     /* 'curr' will be NULL if the child belongs to a different group */
     if (sysctl_sched_child_runs_first && this_cpu == task_cpu(p) &&
-            curr && curr->vruntime < se->vruntime) {
+            curr && (curr->vruntime <= se->vruntime||cfs_rq->min_vruntime <= curr->vruntime)) {
         /*
          * Upon rescheduling, sched_class::put_prev_task() will place
          * 'current' within the tree based on its new key value.
          */
-        swap(curr->vruntime, se->vruntime);
+        if( curr->vruntime < se->vruntime )
+            swap(curr->vruntime, se->vruntime);
         resched_task(rq->curr);
     }
--

Replies:

I said: But if the child runs last, there may be more copy-on-write. The user can disable child-runs-first if he can confirm that the child will not exec or the like. Now that the kernel provides the policy, why implement it only halfway?

Someone replied: Sure, I just wanted to raise the issue. child-runs-first doesn't really work reliably on SMP, and since even embedded is moving to SMP, the value of keeping it around seems to be less each day.

I said: You are right, but I don't think a child that wakes up on another CPU must run first. The kernel does its best for users. In my opinion everything in the kernel is a compromise: if one thing must be done perfectly, something else loses. So if the child wakes up on the same CPU as its parent and the user has chosen the child-runs-first policy, we guarantee child-runs-first; if not, let god make the war continue.

I think child-runs-first does not matter on SMP. If we had to guarantee it there, the two CPUs would spend much time on synchronization, and god knows when the parent could run.

But as long as we do have it, I agree that your patch is wanted.

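I did quite a lot of work for this patch. Read the CFS code and picture how it works, and the problem becomes clear. The premise is that the START_DEBIT feature is not set. If you run the test above with a high weight, that is, a negative nice value, the child never preempts the parent; with a positive nice value, preemption becomes more likely. Why? A high-weight process is allotted a lot of time in each scheduling period, so its virtual time advances slowly. Looking at update_min_vruntime, the vruntime of a high-weight process may therefore be exactly the min_vruntime of the cfs_rq. In that case the parent is not preempted even when sysctl_sched_child_runs_first is 1, because place_entity sets the new process's vruntime directly to the cfs_rq's min_vruntime, so the two are equal and the strict comparison fails.

Isn't there a check_preempt_curr at the end of wake_up_new_task that also decides on preemption? Yes, but first, we should not delegate something as specific as child-runs-first to a more general mechanism; second, even if we did, check_preempt_curr still could not preempt the parent. In CFS, check_preempt_curr contains the following logic: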
s64 gran, vdiff = curr->vruntime - se->vruntime;
if (vdiff <= 0)
    return -1;
gran = wakeup_gran(curr);
if (vdiff > gran)
    return 1;
return 0;
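vdiff is clearly 0, so the function returns -1 and there is no preemption. By this theory, will preemption happen when testing with a positive nice value? Yes, but not always; only the probability goes up. Moreover, preemption with a positive nice value is not the work of the sysctl_sched_child_runs_first if statement but of check_preempt_curr. The virtual time of a low-weight process advances quickly, because it is overdrawn most of the time, so curr's vruntime is very likely greater than the cfs_rq's min_vruntime and vdiff is positive. But then the preemption granularity becomes the problem: below a certain difference there is no preemption. Child-runs-first should not be left to check_preempt_curr at all; it should be settled inside the sysctl_sched_child_runs_first branch.

Does reality match the theoretical analysis above? No. Testing showed that preemption is still rare even with a positive nice value. To find out why, I added a counter to each of the branches in update_min_vruntime and printed the values once jiffies reached a certain amount. I also began to doubt my assumption about the scheduling period, namely that a process runs only once per period, so in set_next_entity I added a measurement of the interval between two consecutive runs of the same process, with two counters: one for intervals shorter than one period and one for all intervals, printed at the same place. I found I was wrong: the share of cases where a process was scheduled more than once within a period was far too high, and very often the leftmost node of the red-black tree checked in update_min_vruntime was NULL. What was going on? I looked at /proc/sched_debug, and to my surprise nr_running was 1 or 2, never more than 3. Then I understood that the 100 processes reported by ps -e|wc -l were all I/O-bound processes, not CPU-bound ones. So I wrote a CPU-bound process: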
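#include <stdlib.h>
#include <unistd.h>
int main(int argc,char *argv[])
{
    /* volatile keeps the loop from being optimized away; unsigned wraps without undefined behaviour */
    volatile unsigned int a = 1,b = 0;
    if( argc > 1 )
        nice( atoi(argv[1]) );
    while( 1 )
    {
        a++;
        b += a;
    }
}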
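That is CPU-bound enough. I started 40 instances of this program with nice increments from -19 to 20 and looked at the counters again. The result was exactly what I expected: now CFS really got on track. A process was still sometimes scheduled twice within one scheduling period, but far less often; the ratio quickly went from the original 1:1 to 1:15, and left_most was no longer NULL. Now everything was clear. With only a few active processes, left_most is easily NULL, so even for a process with a high nice value, as long as it is the one running, min_vruntime is its own vruntime; after all, it is alone on the queue.

Now it is time to tidy up the patch. 2.6.25 did not have this problem, because resched_task was called whenever a new process was created, no matter what. In the CFS red-black tree an entity whose key equals an existing one is inserted to the right of it, and since resched_task has been called, schedule calls put_prev_entity, which puts the parent back into the tree after the child. So even if parent and child have equal vruntime, the child preempts the parent. If the parent's vruntime is larger, resched_task ends the parent's run; if it is smaller, the two values are swapped. The 2.6.25 design was quite good, at least as far as creating a child is concerned.

However, the unconditional resched_task call is not elegant. When copy-on-write is known to be unavoidable, resched_task is unnecessary and only adds a pointless cache flush. So, to make child-runs-first a policy, resched_task was moved inside that policy. Unexpectedly, because the code was not rigorous, the policy became a lie, and my patch is meant to mend this true lie. The patch as submitted above is not ideal in places: it would be better to remove (curr->vruntime <= se->vruntime||cfs_rq->min_vruntime <=curr->vruntime) altogether. In effect that moves the curr->vruntime < se->vruntime test inside the if body, which makes the behaviour the same as in 2.6.25. I will resubmit soon. The core of the patch is this: since a variable is said to control whether the child runs first, why not let it always take effect?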
