The company asked me to give a CFS-related training, so I put together this slide deck. The images are all from the web; my apologies to the original authors for not crediting each one individually.
The Outline
Basic concepts about Linux process & thread
Basic concepts about SMP
Linux bootup with BP and how to boot AP
Completely Fair Scheduler (CFS) and RT Sched
How to load balance
How to debug SMP issues with oprofile and other tools
0. Basic Concepts
Program
A program is a combination of instructions and data, which are put together to perform a task.
Process
A process is an abstraction created to embody the state of a program during its execution. Therefore, a process can also be viewed as an instance of a program, or called “a running program”.
API: fork(clone->do_fork->copy_process->copy_flag(CLONE_*: FILES, VM, FS, SIGHAND, NEWNS, NEWPID, NEWNET, NEWUTS(NEW* for namespace, LXC)...))
exec, wait (to free a zombie child the parent must wait; it is better to handle SIGCHLD and reap the child there)
Thread
LWP. A process can have multiple execution contexts that work together to accomplish its goal. These different execution contexts are called "threads". These threads share the same virtual address space as the process.
API: pthread_create(fork), pthread_attr_init, pthread_attr_setschedpolicy, pthread_attr_setschedparam, pthread_attr_getschedparam, pthread_exit, pthread_cancel, pthread_join(like wait, frees the thread's resources), pthread_detach(the thread frees its own resources on exit)
Kernel Thread
A thread without user-mode virtual address space. All instructions and data of a kernel thread are in Kernel VA Space, and in Linux they are usually linked together with kernel as a part of kernel image, such as kswapd, kflushd and ksoftirqd.
API: kthread_run(create+wakeup), kthread_create, kthread_stop
task_struct{
state, pid, mm, active_mm, prio, se, cpus_allowed, children, sibling, fs, files, signal, cgroups}
((struct thread_info *)(sp & ~(THREAD_SIZE - 1)))->task
THREAD_SIZE=8K
init process 0
#define INIT_TASK(tsk) { \
.state = 0, \
.thread_info = &init_thread_info, \
……
.mm = NULL, \ /* NULL for a kernel thread */
.active_mm = &init_mm, \
……
}
In init_mm, pgd= swapper_pg_dir(0xC0004000)
The first 768 PGD entries map the User VA Space
All user processes share the same kernel PGD entries.
init process 0 becomes “idle” process ( “idle” kernel thread).
1. Scheduler History
2.4 O(n)
One simple runqueue
nice and counter
2.6 O(1)
Kernel preemptible
Bitmap
runqueue(active/expire) for per-cpu and per-priority
static_prio and time slice
2. CFS Introduction
From kernel version 2.6.23
3.1 Priority
Normal task priority
3.2 SCHED_FIFO and SCHED_RR
pick_next_rt_task: find the highest-rt_priority task list via the bitmap; pick the first task of that list; after it runs, put it at the tail of the list
SCHED_FIFO: runs until it schedules voluntarily or is preempted by a higher-priority rt task
SCHED_RR: runs until it schedules voluntarily, is preempted by a higher-priority rt task, or its timeslice expires.
RT task throttling: sched_rt_period_us (default 1000000us), sched_rt_runtime_us (default 950000us)
4.1 RB Tree for Normal Task Organization
4.2 Node sequence of RB Tree
RB tree, O(log n)
The RB node order is determined by vruntime (the leftmost node has the lowest vruntime value)
How fast vruntime is consumed is determined by prio
Constant arrays mapping static_prio to weight:
prio_to_weight[ ]/prio_to_wmult[ ] (one nice level ≈ 10% weight)
vruntime += delta_exec * NICE_0_LOAD / weight
4.3 task_struct to RB tree in CFS
4.4 From rq to task
rq is per cpu variable
4.5 Sched_class for Sched Policy
fair_sched_class/rt_sched_class
struct sched_class
{
const struct sched_class *next;...... }
4.6 Sched domain
4.7 Group and Domain
5. How to schedule
Load Balance Rules
1) In cpu_domain level, all cpus share cache and cpu_power (power in cpu group); free to load balance
2) In core_domain level, all cpus share L2 cache; load balance when the core domain is imbalanced
3) In phys_domain level, load balance must flush all caches
4) In numa_domain level, load balance costs the most
5.1 Two Schedule Entry
When there is no runnable task in the current rq, on a voluntary call, on sleep, or on return from kernel space to user space, schedule() is called
When the periodic tick (HZ) fires, scheduler_tick() is called
5.1.1 Schedule()
1) Disable preemption
……
11) Check if need_resched is set; if so, go to 1)
5.1.2 How to pick and put task for fair Sched
pick_next_task(fair scheduler) on rq
1) Find the rb_leftmost node on the cfs_rq
2) get sched entity from rb node with rb_entry()
3) if my_q of the se is NULL (the se is a task)
3.1) set it as next and dequeue the entity from the rq
4) else (the se is a group), descend into its my_q and go to 1)
put_prev_task
6.1) if it belongs to a sched group (parent != NULL),
it is put back on every iterated entity's rq, from itself up to the parent
6.2) else (parent == NULL),
put it on the current rq
pick_next_task_rt:
Find the highest rt_priority task list by bitmap; pick the first task of the highest rt_priority task list.
put_prev_task_rt:
//if the task is runnable && allowed to run on more than one cpu
if (p->se.on_rq && p->rt.nr_cpus_allowed > 1)
1) del it from current node of list
2) add it to tail of list
5.1.4 Scheduler_tick()
1) update rq clock and load
5.1.5 How to load balance
1) find busiest group in the domain
5.2 How to find busiest group and rq
Condition:
group avg_load > previous max_load, then pick the rq with the max rq->load.weight in that group
6.1 Oprofile
Sampling: event based and time based
Two parts:
Kernel module oprofile.ko: saves sampling data in memory
reads the performance counters, or uses register_timer_hook for timer-based sampling
User daemon oprofiled: gets the sampling data, saves it to file, and parses it.
6.2 config and compile
Kernel:
1) menuconfig:
enable Oprofile in profiling menu
enable Local APIC and IO-APIC in Processor type and features menu
2) .config: set CONFIG_PROFILING=y and CONFIG_OPROFILE=y
oprofile toolkit compile:
./configure --with-kernel-support
make
make install
6.3 oprofile toolkit
oprofiled
opcontrol: user interface
opannotate: comments source code for sampling data
opreport: binary and symbol map
ophelp: list supported events
opgprof: generate gprof-format data (for the gprof program analyzer)
opstack: generate call stack, with call-graph patch for kernel
oparchive: archive raw sampling data
op_import: change data format
6.4 How to use oprofile
# opcontrol --setup --ctr0-event=CPU_CLK_UNHALTED
--ctr0-count=600000 --vmlinux=/usr/src/linux-*/vmlinux
# opcontrol --start
# opcontrol --stop/--shutdown/--dump (/var/lib/oprofile/samples/oprofiled.log)
# opcontrol --status
# opcontrol --list-events
# opcontrol --event=L2_CACHE_MISS:500 --event=L1_DTLB_MISS_AND_L2_DTLB_HIT:500
# opreport -l ./testbinary