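Translator's note: CFS no longer has a notion of timeslices, so do not rely on "Understanding the Linux Kernel" (3rd edition) for this topic; that book's description is out of date.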
=============
CFS Scheduler
=============
1. OVERVIEW
CFS stands for "Completely Fair Scheduler," and is the new "desktop" process
scheduler implemented by Ingo Molnar and merged in Linux 2.6.23. It is the
replacement for the previous vanilla scheduler's SCHED_OTHER interactivity
code.
80% of CFS's design can be summed up in a single sentence: CFS basically models
an "ideal, precise multi-tasking CPU" on real hardware.
"Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100% physical
power and which can run each task at precise equal speed, in parallel, each at
1/nr_running speed. For example: if there are 2 tasks running, then it runs
each at 50% physical power --- i.e., actually in parallel.
On real hardware, we can run only a single task at once, so we have to
introduce the concept of "virtual runtime." The virtual runtime of a task
specifies when its next timeslice would start execution on the ideal
multi-tasking CPU described above. In practice, the virtual runtime of a task
is its actual runtime normalized to the total number of running tasks.
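The gap between the ideal CPU and real hardware can be put into numbers. The sketch below is only an illustration: the function name imbalance() and the task names are made up, and all tasks are assumed to have equal weight. It compares what each task actually ran on one real CPU with the 1/nr_running share the ideal CPU would have given it over the same interval.

```python
def imbalance(actual_ns):
    """Map each task to (actual runtime - ideal runtime), in nanoseconds.

    actual_ns maps a task name to the CPU time it really received.
    On a single real CPU these runtimes add up to the elapsed wall time.
    """
    wall_ns = sum(actual_ns.values())
    ideal_ns = wall_ns / len(actual_ns)      # 1/nr_running of the interval
    return {task: ran - ideal_ns for task, ran in actual_ns.items()}


# Task A ran for 4 ms and task B for 2 ms; the ideal CPU would have
# given each of them 3 ms over the same 6 ms.
print(imbalance({"A": 4_000_000, "B": 2_000_000}))
# {'A': 1000000.0, 'B': -1000000.0}
```

A positive value means the task is ahead of its ideal share, a negative value means it is owed CPU time.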
2. FEW IMPLEMENTATION DETAILS
In CFS the virtual runtime is expressed and tracked via the per-task
p->se.vruntime (nanosec-unit) value. This way, it's possible to accurately
timestamp and measure the "expected CPU time" a task should have gotten.
[ small detail: on "ideal" hardware, at any time all tasks would have the same
p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
would ever get "out of balance" from the "ideal" share of CPU time. ]
CFS's task picking logic is based on this p->se.vruntime value and it is thus
very simple: it always tries to run the task with the smallest p->se.vruntime
value (i.e., the task which executed least so far). CFS always tries to split
up CPU time between runnable tasks as close to "ideal multitasking hardware" as
possible.
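The picking rule can be sketched in a few lines. This is a toy model, not kernel code: the Task class, pick_next() and simulate() are hypothetical names, every task has the same weight, and a plain list stands in for the real data structure.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    vruntime: int = 0   # nanoseconds, plays the role of p->se.vruntime


def pick_next(runnable):
    """Return the task that has executed least so far (ties: first in list)."""
    return min(runnable, key=lambda t: t.vruntime)


def simulate(tasks, slice_ns, steps):
    """Run `steps` slices, always choosing the smallest vruntime."""
    order = []
    for _ in range(steps):
        task = pick_next(tasks)
        task.vruntime += slice_ns    # account the time just used
        order.append(task.name)
    return order


# B starts 2 ms ahead, so A and C run until they have caught up with it.
print(simulate([Task("A"), Task("B", 2_000_000), Task("C")], 1_000_000, 6))
# ['A', 'C', 'A', 'C', 'A', 'B']
```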
Most of the rest of CFS's design just falls out of this really simple concept,
with a few add-on embellishments like nice levels, multiprocessing and various
algorithm variants to recognize sleepers.
3. THE RBTREE
CFS's design is quite radical: it does not use the old data structures for the
runqueues, but it uses a time-ordered rbtree to build a "timeline" of future
task execution, and thus has no "array switch" artifacts (by which both the
previous vanilla scheduler and RSDL/SD are affected).
CFS also maintains the rq->cfs.min_vruntime value, which is a monotonically
increasing value tracking the smallest vruntime among all tasks in the
runqueue. The total amount of work done by the system is tracked using
min_vruntime; that value is used to place newly activated entities on the left
side of the tree as much as possible.
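A minimal sketch of how such a value could be maintained follows. The class CfsRq and its methods are hypothetical, and the placement rule (a new entity simply starts at min_vruntime) is a deliberate simplification of whatever the kernel really does.

```python
class CfsRq:
    """Toy runqueue that only tracks vruntimes and min_vruntime."""

    def __init__(self):
        self.min_vruntime = 0
        self.vruntimes = {}          # task name -> vruntime in ns

    def _update_min_vruntime(self):
        if self.vruntimes:
            smallest = min(self.vruntimes.values())
            # max() keeps the value monotonic: it never moves backwards.
            self.min_vruntime = max(self.min_vruntime, smallest)

    def place_new(self, name):
        # ASSUMPTION: a newly activated entity starts at min_vruntime,
        # which puts it at the left side of the timeline without
        # letting it claim CPU time from before it existed.
        self.vruntimes[name] = self.min_vruntime
        self._update_min_vruntime()

    def account(self, name, delta_ns):
        self.vruntimes[name] += delta_ns
        self._update_min_vruntime()


rq = CfsRq()
rq.place_new("A")
rq.place_new("B")
rq.account("A", 5_000_000)
rq.account("B", 3_000_000)
rq.place_new("C")                        # arrives late
print(rq.min_vruntime, rq.vruntimes["C"])   # 3000000 3000000
```

Without the min_vruntime floor, task C would start at 0 and monopolize the CPU until it had caught up with A and B.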
The total number of running tasks in the runqueue is accounted through the
rq->cfs.load value, which is the sum of the weights of the tasks queued on the
runqueue.
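As a rough illustration of what a load made of weights buys, the sketch below sums per-task weights and derives each task's fraction of the CPU from it. The weight values and task names are invented for the example; the mapping from nice levels to weights is not described in this document and is not modelled here.

```python
def cfs_load(weights):
    """Total load of a runqueue: the sum of the queued tasks' weights."""
    return sum(weights.values())


def cpu_share(weights):
    """Fraction of CPU time each task is entitled to: weight / load."""
    load = cfs_load(weights)
    return {task: weight / load for task, weight in weights.items()}


# Hypothetical weights: the encoder counts twice as much as the others.
weights = {"editor": 1024, "compiler": 1024, "encoder": 2048}
print(cfs_load(weights))    # 4096
print(cpu_share(weights))   # {'editor': 0.25, 'compiler': 0.25, 'encoder': 0.5}
```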
CFS maintains a time-ordered rbtree, where all runnable tasks are sorted by the
p->se.vruntime key. CFS picks the "leftmost" task from this tree and sticks to it.
As the system progresses forwards, the executed tasks are put into the tree
more and more to the right --- slowly but surely giving a chance for every task
to become the "leftmost task" and thus get on the CPU within a deterministic
amount of time.
Summing up, CFS works like this: it runs a task a bit, and when the task
schedules (or a scheduler tick happens) the task's CPU usage is "accounted
for": the (small) time it just spent using the physical CPU is added to
p->se.vruntime. Once p->se.vruntime gets high enough so that another task
becomes the "leftmost task" of the time-ordered rbtree it maintains (plus a
small amount of "granularity" distance relative to the leftmost task so that we
do not over-schedule tasks and thrash the cache), then the new leftmost task is
picked and the current task is preempted.
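The loop summarized above can be modelled in a few lines. This is a sketch under several assumptions: a sorted Python list stands in for the rbtree, the running task is kept outside the list while it runs, all tasks have equal weight, and the function name run() and its parameters are hypothetical.

```python
import bisect


def run(names, tick_ns, granularity_ns, ticks):
    """Return which task was on the CPU during each scheduler tick."""
    # Sorted list of (vruntime, name); index 0 is the "leftmost" entry.
    timeline = sorted((0, name) for name in names)
    curr_v, curr = timeline.pop(0)
    trace = []
    for _ in range(ticks):
        trace.append(curr)
        curr_v += tick_ns                      # account the time just used
        # Preempt only once the current task is ahead of the leftmost
        # task by more than the granularity distance.
        if timeline and curr_v > timeline[0][0] + granularity_ns:
            bisect.insort(timeline, (curr_v, curr))   # goes to the right
            curr_v, curr = timeline.pop(0)            # new leftmost runs
    return trace


print("".join(run(["A", "B"], 1_000_000, 2_000_000, 12)))
# AAABBBBBBAAA
```

A larger granularity_ns gives longer uninterrupted runs (better batching), a smaller one gives more frequent switches (lower latency).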
4. SOME FEATURES OF CFS
CFS uses nanosecond granularity accounting and does not rely on any jiffies or
other HZ detail. Thus the CFS scheduler has no notion of "timeslices" in the
way the previous scheduler had, and has no heuristics whatsoever. There is
only one central tunable (you have to switch on CONFIG_SCHED_DEBUG):
/proc/sys/kernel/sched_min_granularity_ns
which can be used to tune the scheduler from "desktop" (i.e., low latencies) to
"server" (i.e., good batching) workloads. It defaults to a setting suitable
for desktop workloads. SCHED_BATCH is handled by the CFS scheduler module too.
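For readers who want to inspect the tunable from a script, a small hedged helper is sketched below. It only reads the path named above and returns None when the file is missing, which is the case on kernels built without CONFIG_SCHED_DEBUG and may be the case on kernels newer than the one this document describes. The function name is made up for the example.

```python
from pathlib import Path

SYSCTL = Path("/proc/sys/kernel/sched_min_granularity_ns")


def read_min_granularity_ns(path=SYSCTL):
    """Return the tunable as an int, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # File absent, unreadable, or not a plain integer.
        return None


value = read_min_granularity_ns()
print("sched_min_granularity_ns:", "not available" if value is None else value)
```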
Due to its design, the CFS scheduler is not prone to any of the "attacks" that
exist today against the heuristics of the stock scheduler: fiftyp.c, thud.c,
chew.c, ring-test.c, massive_intr.c all work fine and do not impact
interactivity and produce the expected behavior.
The CFS scheduler has a much stronger handling of nice levels and SCHED_BATCH
than the previous vanilla scheduler: both types of workloads are isolated much
more aggressively.
SMP load-balancing has been reworked/sanitized: the runqueue-walking
assumptions are gone from the load-balancing code now, and iterators of the
scheduling modules are used. The balancing code got quite a bit simpler as a
result.
5. SCHEDULING POLICIES
CFS implements three scheduling policies:
- SCHED_NORMAL (traditionally called SCHED_OTHER): The scheduling
  policy that is used for regular tasks.
- SCHED_BATCH: Does not preempt nearly as often as regular tasks
  would, thereby allowing tasks to run longer and make better use of
  caches but at the cost of interactivity. This is well suited for
  batch jobs.
- SCHED_IDLE: This is even weaker than nice 19, but it is not a true
  idle timer scheduler, in order to avoid priority inversion problems
  that would deadlock the machine.
SCHED_FIFO/_RR are implemented in sched/rt.c and are as specified by
POSIX.
The command chrt from util-linux-ng 2.13.1.1 can set all of these except
SCHED_IDLE.
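The policy of a running task can also be queried from a program. The sketch below uses Python's os module, which exposes the policy constants and os.sched_getscheduler() on Linux only; the helper name policy_name() is hypothetical. It deliberately only reads the policy, because changing it may need privileges.

```python
import os

# The kernel may OR this flag into the returned policy value.
RESET_ON_FORK = getattr(os, "SCHED_RESET_ON_FORK", 0)

POLICY_NAMES = {
    os.SCHED_OTHER: "SCHED_NORMAL (SCHED_OTHER)",
    os.SCHED_BATCH: "SCHED_BATCH",
    os.SCHED_IDLE: "SCHED_IDLE",
    os.SCHED_FIFO: "SCHED_FIFO",
    os.SCHED_RR: "SCHED_RR",
}


def policy_name(pid=0):
    """Name of the scheduling policy of `pid` (0 means the caller)."""
    policy = os.sched_getscheduler(pid) & ~RESET_ON_FORK
    return POLICY_NAMES.get(policy, "unknown (%d)" % policy)


print(policy_name())
```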
6. SCHEDULING CLASSES
The new CFS scheduler has been designed in such a way as to introduce
"Scheduling Classes," an extensible hierarchy of scheduler modules. These
modules encapsulate scheduling policy details and are handled by the scheduler
core without the core code assuming too much about them.
sched/fair.c implements the CFS scheduler described above.
sched/rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler way than
the previous vanilla scheduler did. It uses 100 runqueues (for all 100 RT
priority levels, instead of 140 in the previous scheduler) and it needs no
expired array.
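One queue per priority level, with no expired array, can be pictured as follows. This is a toy sketch, not the layout used in sched/rt.c: the class RtRunqueue and its methods are hypothetical, and it assumes that a larger priority number wins, as in the POSIX sched_priority convention.

```python
from collections import deque

NR_RT_PRIOS = 100   # one FIFO queue per RT priority level


class RtRunqueue:
    def __init__(self):
        self.queues = [deque() for _ in range(NR_RT_PRIOS)]

    def enqueue(self, task, prio):
        self.queues[prio].append(task)

    def pick_next(self):
        # ASSUMPTION: scan from the highest priority number downwards.
        for prio in range(NR_RT_PRIOS - 1, -1, -1):
            if self.queues[prio]:
                return self.queues[prio][0]
        return None

    def round_robin(self, prio):
        # SCHED_RR: when the head task's quantum expires it moves to the
        # tail of the same queue, so no second "expired" array is needed.
        self.queues[prio].rotate(-1)


rt_rq = RtRunqueue()
rt_rq.enqueue("logger", 10)
rt_rq.enqueue("audio", 50)
rt_rq.enqueue("video", 50)
print(rt_rq.pick_next())    # audio
rt_rq.round_robin(50)
print(rt_rq.pick_next())    # video
```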
Scheduling classes are implemented through the sched_class structure, which
contains hooks to functions that must be called whenever an interesting event
occurs.
This is the (partial) list of the hooks:
- enqueue_task(...)
  Called when a task enters a runnable state.
  It puts the scheduling entity (task) into the red-black tree and
  increments the nr_running variable.
- dequeue_task(...)
  When a task is no longer runnable, this function is called to keep the
  corresponding scheduling entity out of the red-black tree. It decrements
  the nr_running variable.
- yield_task(...)
  This function is basically just a dequeue followed by an enqueue, unless the
  compat_yield sysctl is turned on; in that case, it places the scheduling
  entity at the right-most end of the red-black tree.
- check_preempt_curr(...)
  This function checks if a task that entered the runnable state should
  preempt the currently running task.
- pick_next_task(...)
  This function chooses the most appropriate task eligible to run next.
- set_curr_task(...)
  This function is called when a task changes its scheduling class or changes
  its task group.
- task_tick(...)
  This function is mostly called from time tick functions; it might lead to
  a process switch. This drives the running preemption.
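The way a core can drive such hooks without knowing policy details is sketched below. Only three of the hooks listed above are modelled. Everything else is an assumption made for the example: the Python class layout, the use of a plain list instead of the red-black tree, tasks represented as dicts, and the rule that the core asks the classes in a fixed order and takes the first task offered.

```python
class SchedClass:
    """Common bookkeeping; subclasses supply the picking policy."""

    def __init__(self):
        self.tasks = []          # stand-in for the per-class queue or tree
        self.nr_running = 0

    def enqueue_task(self, task):
        self.tasks.append(task)
        self.nr_running += 1

    def dequeue_task(self, task):
        self.tasks.remove(task)
        self.nr_running -= 1

    def pick_next_task(self):
        raise NotImplementedError


class FairClass(SchedClass):
    def pick_next_task(self):
        # Smallest vruntime first; None when nothing is runnable.
        return min(self.tasks, key=lambda t: t["vruntime"], default=None)


class RtClass(SchedClass):
    def pick_next_task(self):
        # Simplification: plain FIFO order, priority levels ignored.
        return self.tasks[0] if self.tasks else None


def core_pick_next(classes):
    """Core logic: ask each class in order, take the first task offered."""
    for sched_class in classes:
        task = sched_class.pick_next_task()
        if task is not None:
            return task
    return None


rt, fair = RtClass(), FairClass()
fair.enqueue_task({"name": "editor", "vruntime": 7})
fair.enqueue_task({"name": "shell", "vruntime": 3})
print(core_pick_next([rt, fair])["name"])   # shell
rt.enqueue_task({"name": "audio", "vruntime": 0})
print(core_pick_next([rt, fair])["name"])   # audio
```

The core function never looks inside a task or a queue; adding another policy only means adding another class to the list.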
7. GROUP SCHEDULER EXTENSIONS TO CFS
Normally, the scheduler operates on individual tasks and strives to provide
fair CPU time to each task. Sometimes, it may be desirable to group tasks and
provide fair CPU time to each such task group. For example, it may be
desirable to first provide fair CPU time to each user on the system and then to
each task belonging to a user.
CONFIG_CGROUP_SCHED strives to achieve exactly that. It lets tasks be
grouped and divides CPU time fairly among such groups.
CONFIG_RT_GROUP_SCHED permits grouping of real-time (i.e., SCHED_FIFO and
SCHED_RR) tasks.
CONFIG_FAIR_GROUP_SCHED permits grouping of CFS (i.e., SCHED_NORMAL and
SCHED_BATCH) tasks.
These options need CONFIG_CGROUPS to be defined, and let the administrator
create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See
Documentation/cgroups/cgroups.txt for more information about this filesystem.
When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each
group created using the pseudo filesystem. See example steps below to create
task groups and modify their CPU share using the "cgroups" pseudo filesystem.
    # mount -t tmpfs cgroup_root /sys/fs/cgroup
    # mkdir /sys/fs/cgroup/cpu
    # mount -t cgroup -ocpu none /sys/fs/cgroup/cpu
    # cd /sys/fs/cgroup/cpu
    # mkdir multimedia    # create "multimedia" group of tasks
    # mkdir browser       # create "browser" group of tasks
    # # Configure the multimedia group to receive twice the CPU bandwidth
    # # of the browser group
    # echo 2048 > multimedia/cpu.shares
    # echo 1024 > browser/cpu.shares
    # firefox &           # Launch firefox and move it to "browser" group
    # echo <firefox_pid> > browser/tasks
    # # Launch gmplayer (or your favourite movie player)
    # echo <movie_player_pid> > multimedia/tasks
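The effect of these share values can be worked out by hand or with a short script. The sketch below assumes that every group is busy at the same time, that all tasks inside a group have equal weight, and that CPU time is split first between groups and then between the tasks of each group. The function name and the task lists are invented for the example; "firefox-helper" is a hypothetical second task.

```python
def group_fractions(groups):
    """Map each task to its fraction of total CPU time.

    groups maps a group name to (cpu_shares, [task names]).
    """
    total_shares = sum(shares for shares, _ in groups.values())
    result = {}
    for shares, tasks in groups.values():
        group_fraction = shares / total_shares     # first level: groups
        for task in tasks:                         # second level: tasks
            result[task] = group_fraction / len(tasks)
    return result


fractions = group_fractions({
    "multimedia": (2048, ["gmplayer"]),
    "browser": (1024, ["firefox", "firefox-helper"]),
})
for task, fraction in fractions.items():
    print("%-15s %.3f" % (task, fraction))
```

With these numbers the multimedia group is entitled to two thirds of the CPU and the browser group to one third, which its two tasks then split evenly.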