We call sleep() to put a task to sleep whenever some condition cannot yet be met, and when writing network programs we constantly call poll and select. Ever wondered how the kernel implements scheduling based on precise time? Let's dig in...
1 Introduction
There are two kinds of deferred execution:
The first needs no precise timing control, e.g. the softirq and tasklet mechanisms, which run at the end of each asynchronous interrupt or in the ksoftirqd kernel thread.
The second needs precise timing control. Structures that involve putting a process to sleep and waking it later, such as work queues (run by the keventd kernel thread), wait queues (woken by a producer), and completions, all rely on the timer mechanism: after some precise time interval, the kernel carries out a deferred operation.
2 Timer Types
Low-resolution timers: resolution on the order of the tick period (typically 1-10 ms), driven by the PIT (programmable interval timer, the 8253 chip).
High-resolution timers: resolution down to the nanosecond level; e.g. a sound-card driver may need to send data to the card at very short periodic intervals.
Because the periodic tick raised by the timer is active for the kernel's entire lifetime, the system can never stay in a power-saving mode for long; this motivated dynamic ticks.
So there are low-resolution and high-resolution timers, each with either periodic or dynamic ticks, and all four combinations are valid in the kernel.
3 Implementing Low-Resolution Timers
On IA-32 systems, the HPET or the PIT is normally chosen as the periodic clock source for the tick interrupt, firing roughly 100 times per second (HZ=100). A high HZ suits desktop and multimedia systems with heavy interactive use; a low HZ suits servers and batch-processing machines.
Implementation overview:
3-1 Clock event device initialization
We can now walk through the initialization of the kernel's timekeeping subsystem systematically. At power-on, the kernel registers the IRQ0 tick interrupt, initializes the clock source devices, clock event devices, and tick devices, and selects a suitable operating mode. Since nothing time-critical is running yet, the default is low-resolution + periodic-tick mode; later the kernel switches modes according to the hardware configuration (e.g. whether a high-resolution timer such as the HPET is present) and the software configuration (e.g. whether high-resolution timers were enabled via command-line parameters or kernel config options). On a system that supports hrtimer high-resolution mode and has dynamic ticks enabled, the first time the IRQ0 softirq runs, hrtimers switch from low resolution to high resolution, and then further into NOHZ mode. IRQ0 is the system tick interrupt, handled through the global clock event device (global_clock_event), defined as follows:
static struct irqaction irq0 = {
    .handler = timer_interrupt,
    .flags   = IRQF_DISABLED | IRQF_NOBALANCING | IRQF_IRQPOLL | IRQF_TIMER,
    .name    = "timer"
};
A simplified version of its interrupt handler timer_interrupt:
static irqreturn_t timer_interrupt(int irq, void *dev_id)
{
    ....
    global_clock_event->event_handler(global_clock_event);
    ....
    return IRQ_HANDLED;
}
In global_clock_event->event_handler, besides per-CPU work such as accounting the running process's time and profiling, the more important job is the global work of updating jiffies and the like. Depending on the environment, this global clock event device's event_handler is tick_handle_periodic / tick_handle_periodic_broadcast in low-resolution mode, or hrtimer_interrupt in high-resolution mode. Currently only the HPET or the PIT can serve as the global_clock_event. Its initialization flow is shown below:
void __init time_init(void)
{
    late_time_init = x86_late_time_init;
}

static __init void x86_late_time_init(void)
{
    x86_init.timers.timer_init();
    tsc_init();
}

/* x86_init.timers.timer_init is a callback pointer to hpet_time_init */
void __init hpet_time_init(void)
{
    if (!hpet_enable())
        setup_pit_timer();
    setup_default_timer_irq();
}
As the code above shows, the system prefers the HPET as the global_clock_event; only when the HPET is not enabled does the PIT get the chance. While the HPET is being enabled, it is registered both as a clock source device and as a clock event device:
hpet_enable
  -> hpet_clocksource_register
  -> hpet_legacy_clockevent_register
       -> clockevents_register_device(&hpet_clockevent);
clockevents_register_device raises the CLOCK_EVT_NOTIFY_ADD event, i.e. it requests that a corresponding tick_device be created; the tick_notify event handler then adds the new tick_device:
clockevents_register_device
  -> raises CLOCK_EVT_NOTIFY_ADD
tick_notify (receives CLOCK_EVT_NOTIFY_ADD)
  -> tick_check_new_device
       -> tick_setup_device
While the tick_device is being set up, its event_handler is chosen according to whether the new clock event device uses broadcast. The tick device handlers are summarized in the table below:
|               | low-resolution mode  | high-resolution mode |
|---------------|----------------------|----------------------|
| periodic tick | tick_handle_periodic | hrtimer_interrupt    |
| dynamic tick  | tick_nohz_handler    | hrtimer_interrupt    |
So with low-resolution, periodic ticks, tick_handle_periodic serves as the tick interrupt handler; it runs once per tick interrupt and calls the functions below:
tick_handle_periodic(struct clock_event_device *dev)        // kernel/time/tick-common.c
  -> tick_periodic(cpu)
       -> do_timer(1)                                       // 1 = one jiffy
       -> update_process_times(user_mode(get_irq_regs()))   // the saved registers tell us user vs. kernel mode
       -> profile_tick(CPU_PROFILING)
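For reference, tick_periodic in the 2.6.32-era kernel/time/tick-common.c looks roughly like this (lightly trimmed); note that only the CPU designated as tick_do_timer_cpu performs the global do_timer() work, while every CPU does its own process-time accounting:

static void tick_periodic(int cpu)
{
    if (tick_do_timer_cpu == cpu) {
        write_seqlock(&xtime_lock);
        /* keep track of the next tick event */
        tick_next_period = ktime_add(tick_next_period, tick_period);
        do_timer(1);
        write_sequnlock(&xtime_lock);
    }
    update_process_times(user_mode(get_irq_regs()));
    profile_tick(CPU_PROFILING);
}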
3-2 The event handler performs these two steps:
do_timer() -- run by one (the main) CPU: updates jiffies, the system wall-clock time, and the system load average (run-queue length)
update_process_times() -- run on every CPU: accounts the current process's utime/stime, raises the softirq that runs this CPU's timer callbacks, drives the scheduler tick, and runs the currently registered POSIX CPU timers
3-3 do_timer()
/*
* The 64-bit jiffies value is not atomic - you MUST NOT read it
* without sampling the sequence number in xtime_lock.
* jiffies is defined in the linker script...
*/
If dynamic ticks are in use, ticks may be greater than 1, which helps save power:
void do_timer(unsigned long ticks)
{
jiffies_64 += ticks;
update_wall_time(); //to be explored later
calc_global_load();
/* System load:
1 The load average is the average number of processes on the run queue
  (the run-queue length) over a given interval. As a rule of thumb, up to
  about 3 active tasks per CPU means the system is performing well; more
  than 5 per CPU indicates a serious performance problem.
2 Computation (fixed point, where 2048 represents 1.0, i.e. the precision):
      load(t) = (load(t-1) * e + n * (2048 - e)) >> 11
  where n is the number of currently active tasks (runnable plus
  TASK_UNINTERRUPTIBLE), sampled roughly every 5 s, and e is the decay
  factor for the m-minute average, e = 2048 * exp(-5/(60*m)):
  e = 1884 for the 1-minute average, e = 2014 for the 5-minute average.
*/
}
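The fixed-point arithmetic is easy to check outside the kernel. A minimal user-space sketch of the update rule above, with the constants from the 2.6.32-era include/linux/sched.h (the kernel feeds it the active-task count already scaled by FIXED_1):

#include <stdio.h>

#define FSHIFT  11
#define FIXED_1 (1 << FSHIFT)  /* 1.0 in fixed point = 2048 */
#define EXP_1   1884           /* 2048 * exp(-5/60):  1-minute decay */
#define EXP_5   2014           /* 2048 * exp(-5/300): 5-minute decay */

static unsigned long calc_load(unsigned long load, unsigned long exp,
                               unsigned long active)
{
    load *= exp;
    load += active * (FIXED_1 - exp);
    return load >> FSHIFT;
}

int main(void)
{
    unsigned long load = 0;
    int i;

    /* simulate 3 runnable tasks for 60 five-second samples (~5 minutes) */
    for (i = 0; i < 60; i++)
        load = calc_load(load, EXP_1, 3 * FIXED_1);

    /* print as the familiar two-decimal load average */
    printf("1-min load ~ %lu.%02lu\n", load >> FSHIFT,
           ((load & (FIXED_1 - 1)) * 100) >> FSHIFT);
    return 0;
}

After ~5 minutes of samples the printed value is close to 3.00: the 1-minute average converges toward the true run-queue length.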
3-4 update_process_times(int user_tick): updates this CPU's statistics; note that we are still inside the tick interrupt handler at this point
/*
* Called from the timer interrupt handler to charge one tick to the current
* process. user_tick is 1 if the tick is user time, 0 for system.
*/
void update_process_times(int user_tick)
{
struct task_struct *p = current;
int cpu = smp_processor_id();
/* Note: this timer irq context must be accounted for as well. */
account_process_tick(p, user_tick);
run_local_timers();
rcu_check_callbacks(cpu, user_tick);
printk_tick();
scheduler_tick();
run_posix_cpu_timers(p);
}
3-4-1 Updating CPU times:
On every tick interrupt (with dynamic ticks, several skipped ticks may be accounted by a single interrupt, so this describes each interrupt that actually fires), the current process's utime is increased by one tick's worth of time, 1000000000/HZ ns (10^7 ns at HZ=100), the thread group's run time is updated, and the per-CPU kernel statistics are updated:
/*
* 'kernel_stat.h' contains the definitions needed for doing
* some kernel statistics (CPU usage, context switches ...),
* used by rstatd/perfmeter
*/
struct cpu_usage_stat {
cputime64_t user; //cpustat->user = cputime64_add(cpustat->user, tmp);
cputime64_t nice;
cputime64_t system;
cputime64_t softirq;
cputime64_t irq;
cputime64_t idle;
cputime64_t iowait;
cputime64_t steal;
cputime64_t guest;
};
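These counters are what /proc/stat reports, in USER_HZ ticks. For illustration, a small user-space reader that parses the aggregate "cpu" line (whose field order is user, nice, system, idle, iowait, irq, softirq, steal, guest):

#include <stdio.h>

int main(void)
{
    unsigned long long user, nice, system, idle, iowait, irq, softirq;
    FILE *f = fopen("/proc/stat", "r");

    if (!f)
        return 1;
    /* first line: "cpu  user nice system idle iowait irq softirq ..." */
    if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu",
               &user, &nice, &system, &idle, &iowait, &irq, &softirq) == 7)
        printf("user=%llu system=%llu idle=%llu (USER_HZ ticks)\n",
               user, system, idle);
    fclose(f);
    return 0;
}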
/*
* Account a single tick of cpu time.
* @p: the process that the cpu time gets accounted to
* @user_tick: indicates if the tick is a user or a system tick
*/
void account_process_tick(struct task_struct *p, int user_tick)
{
cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
struct rq *rq = this_rq();
if (user_tick)
account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
account_system_time(p, HARDIRQ_OFFSET, cputime_one_jiffy,
one_jiffy_scaled);
else
account_idle_time(cputime_one_jiffy);
}
3-4-2 Raising the softirq that runs this CPU's timer callbacks
/*
* Called by the local, per-CPU timer interrupt on SMP.
*/
void run_local_timers(void)
{
hrtimer_run_queues();
raise_softirq(TIMER_SOFTIRQ);
//marks TIMER_SOFTIRQ pending for this CPU and, outside interrupt context,
//wakes the ksoftirqd kernel thread; the softirq handler is analyzed in detail below
softlockup_tick();
}
3-4-3 Scheduler tick
/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
*
* It also gets called by the fork code, when changing the parent's
* timeslices.
*/
void scheduler_tick(void)
{
int cpu = smp_processor_id();
struct rq *rq = cpu_rq(cpu);
struct task_struct *curr = rq->curr;
sched_clock_tick();
spin_lock(&rq->lock);
update_rq_clock(rq);
update_cpu_load(rq);
curr->sched_class->task_tick(rq, curr, 0);
spin_unlock(&rq->lock);
perf_event_task_tick(curr, cpu);
#ifdef CONFIG_SMP
rq->idle_at_tick = idle_cpu(cpu);
trigger_load_balance(rq, cpu); //may raise SCHED_SOFTIRQ to balance load across CPUs
#endif
}
4 Handling Timers in the Softirq
1 Comparing times
Get the jiffies value accumulated since boot (the trivial form below is the BITS_PER_LONG >= 64 case; the 32-bit variant follows it):
static inline u64 get_jiffies_64(void)
{
return (u64)jiffies;
}
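The simple cast only works where reading a 64-bit value is atomic. On 32-bit systems the kernel instead retries the read under the xtime_lock seqlock, which is exactly what the comment quoted in section 3-3 warns about; it looks roughly like this (cf. kernel/time/jiffies.c in this era):

#if (BITS_PER_LONG < 64)
u64 get_jiffies_64(void)
{
    unsigned long seq;
    u64 ret;

    do {
        seq = read_seqbegin(&xtime_lock);
        ret = jiffies_64;
    } while (read_seqretry(&xtime_lock, seq));
    return ret;
}
#endif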
2 The standard time comparison macros; all arguments are jiffies values
#define time_after(a,b) \
(typecheck(unsigned long, a) && \
typecheck(unsigned long, b) && \
((long)(b) - (long)(a) < 0))
#define time_before(a,b) time_after(b,a)
#define time_after_eq(a,b) \
(typecheck(unsigned long, a) && \
typecheck(unsigned long, b) && \
((long)(a) - (long)(b) >= 0))
#define time_before_eq(a,b) time_after_eq(b,a)
/*
* Calculate whether a is in the range of [b, c].
*/
#define time_in_range(a,b,c) \
(time_after_eq(a,b) && \
time_before_eq(a,c))
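The (long)(b) - (long)(a) trick keeps these comparisons correct across jiffies wraparound. A hypothetical user-space demonstration, with the macro re-declared locally (minus the typecheck):

#include <stdio.h>
#include <limits.h>

/* same trick as the kernel macro, without typecheck() */
#define time_after(a, b)  ((long)(b) - (long)(a) < 0)

int main(void)
{
    unsigned long just_before_wrap = ULONG_MAX - 1; /* jiffies about to overflow */
    unsigned long just_after_wrap  = 2;             /* jiffies after wrapping */

    /* a naive unsigned comparison gets the order wrong ... */
    printf("naive:      %d\n", just_after_wrap > just_before_wrap);          /* 0 */
    /* ... the kernel macro handles the wrap correctly */
    printf("time_after: %d\n", time_after(just_after_wrap, just_before_wrap)); /* 1 */
    return 0;
}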
3 Converting time units
unsigned int inline jiffies_to_msecs(const unsigned long j)
{
#if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
return (MSEC_PER_SEC / HZ) * j;
//e.g. jiffies=1000 (1000 ticks since boot) with HZ=100 (100 ticks per second)
//gives (1000/100)*1000 = 10000 ms = 10 s
#elif HZ > MSEC_PER_SEC && !(HZ % MSEC_PER_SEC)
return (j + (HZ / MSEC_PER_SEC) - 1)/(HZ / MSEC_PER_SEC);
#else
# if BITS_PER_LONG == 32
return (HZ_TO_MSEC_MUL32 * j) >> HZ_TO_MSEC_SHR32;
# else
return (j * HZ_TO_MSEC_NUM) / HZ_TO_MSEC_DEN;
# endif
#endif
}
unsigned int inline jiffies_to_usecs(const unsigned long j)
unsigned long msecs_to_jiffies(const unsigned int m)
unsigned long usecs_to_jiffies(const unsigned int u)
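For the common configurations where HZ divides MSEC_PER_SEC evenly (e.g. HZ=100 or HZ=250), these conversions reduce to simple integer arithmetic. A user-space sketch of those branches, assuming HZ=100:

#include <stdio.h>

#define HZ           100
#define MSEC_PER_SEC 1000

/* first branch of jiffies_to_msecs: HZ divides MSEC_PER_SEC evenly */
static unsigned int jiffies_to_msecs(unsigned long j)
{
    return (MSEC_PER_SEC / HZ) * j;
}

/* corresponding branch of msecs_to_jiffies: round up to whole ticks */
static unsigned long msecs_to_jiffies(unsigned int m)
{
    return (m + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ);
}

int main(void)
{
    printf("%u ms\n", jiffies_to_msecs(1000));     /* 1000 ticks at HZ=100 = 10000 ms */
    printf("%lu jiffies\n", msecs_to_jiffies(15)); /* 15 ms rounds up to 2 ticks */
    return 0;
}

The round-up matters: a timeout must never fire early, so a partial tick always costs a whole one.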
4 Converting between jiffies, timeval, and timespec
Inside the kernel, time is expressed as jiffies (relative offsets or absolute values), but users defining timed operations think in seconds rather than in HZ, so the kernel provides conversions between the two.
The programmer writes sleep(2) --> converted to a structure:
struct timeval  { time_t tv_sec; suseconds_t tv_usec; /* microseconds */ };
struct timespec { time_t tv_sec; long tv_nsec;        /* nanoseconds  */ };
--> converted to relative jiffies:
/* Same for "timeval"
*
* Well, almost. The problem here is that the real system resolution is
* in nanoseconds and the value being converted is in micro seconds.
* Also for some machines (those that use HZ = 1024, in-particular),
* there is a LARGE error in the tick size in microseconds.
* The solution we use is to do the rounding AFTER we convert the
* microsecond part. Thus the USEC_ROUND, the bits to be shifted off.
* Instruction wise, this should cost only an additional add with carry
* instruction above the way it was done above.
*/
unsigned long
timeval_to_jiffies(const struct timeval *value)
{
unsigned long sec = value->tv_sec;
long usec = value->tv_usec;
if (sec >= MAX_SEC_IN_JIFFIES){
sec = MAX_SEC_IN_JIFFIES;
usec = 0;
}
return (((u64)sec * SEC_CONVERSION) +
(((u64)usec * USEC_CONVERSION + USEC_ROUND) >>
(USEC_JIFFIE_SC - SEC_JIFFIE_SC))) >> SEC_JIFFIE_SC;
}
unsigned long timespec_to_jiffies(const struct timespec *value)
void jiffies_to_timeval(const unsigned long jiffies, struct timeval *value)
void jiffies_to_timespec(const unsigned long jiffies, struct timespec *value)
---> converted to absolute jiffies:
m = jiffies + n;   /* jiffies is the kernel's current jiffies value */
5 How Dynamic Timers Work
5-1 Data structures
The timer structure:
struct timer_list {
struct list_head entry;
unsigned long expires; //absolute jiffies value at which the timer fires
void (*function)(unsigned long); //expiry callback, typically one that wakes a process
unsigned long data; //callback argument, often the process's task_struct
struct tvec_base *base; //the per-CPU timer wheel this timer belongs to
#ifdef CONFIG_TIMER_STATS
void *start_site;
char start_comm[16];
int start_pid;
#endif
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
#endif
};
The timer wheel (timer vector) structure:
struct tvec_base {
spinlock_t lock; //protects this CPU's wheel against the timer interrupt
struct timer_list *running_timer; //the timer currently being executed on this CPU
unsigned long timer_jiffies; //all timers expiring before this value have already run
unsigned long next_timer; //expiry of the next pending timer
struct tvec_root tv1; //lowest 8 bits: timers due within the next 256 jiffies
                      //[2.56 s at HZ=100], spread over 256 lists
struct tvec tv2; //expiries 2^8 .. 2^14-1 ticks ahead, in 64 lists;
                 //each list covers a span of 256 ticks
struct tvec tv3; //expiries 2^14 .. 2^20-1 ticks ahead, in 64 lists
struct tvec tv4; //expiries 2^20 .. 2^26-1 ticks ahead, in 64 lists
struct tvec tv5; //expiries 2^26 .. 2^32-1 ticks ahead, in 64 lists
} ____cacheline_aligned;
struct tvec_base boot_tvec_bases;
static DEFINE_PER_CPU(struct tvec_base *, tvec_bases) = &boot_tvec_bases;
Each CPU gets its own timer wheel; the per-CPU pointers are initialized to boot_tvec_bases.
#define TVN_BITS (CONFIG_BASE_SMALL ? 4 : 6)
#define TVR_BITS (CONFIG_BASE_SMALL ? 6 : 8)
#define TVN_SIZE (1 << TVN_BITS) /* 64 */
#define TVR_SIZE (1 << TVR_BITS) /* 256 */
#define TVN_MASK (TVN_SIZE - 1)
#define TVR_MASK (TVR_SIZE - 1)
struct tvec {
struct list_head vec[TVN_SIZE]; //64 lists
};
struct tvec_root {
struct list_head vec[TVR_SIZE]; //256 lists
};
5-2 How the dynamic timer wheel works
The timer wheel is divided into five groups. The first group (tv1) has 256 slots, each a doubly linked list holding the timers due within the next 0-255 ticks. Each remaining group has 64 slots, again doubly linked lists; in the second group, each slot covers a span of 2^8 = 256 ticks, and in the third group each slot covers a span of 2^14 ticks. The kernel mainly works on the first group. Each group keeps, in effect, a counter recording its current slot position. On every timer tick the kernel scans the first group, runs all the timer callbacks in the current slot, and advances the counter by 1; when the counter wraps from 255 back to 0, the kernel redistributes the current slot of the second group down into the first group and advances the second group's counter, and likewise the third group refills the second when the second wraps, and so on. Because timers move between groups purely by pointer manipulation, this is very efficient.
So how is each group's current position represented? Also through base->timer_jiffies: this value records a point in time before which every due timer has already been run; timer_jiffies is normally equal to, or slightly behind, jiffies. Each group's index is computed as:
#define INDEX(N) ((base->timer_jiffies >> (TVR_BITS + (N) * TVN_BITS)) & TVN_MASK)
Note: group 1's index is  base->timer_jiffies & TVR_MASK
group 2 uses N=0:  INDEX(0) = (base->timer_jiffies >> TVR_BITS) & TVN_MASK
group 3 uses N=1:  INDEX(1) = (base->timer_jiffies >> (TVR_BITS + 1*TVN_BITS)) & TVN_MASK
...
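A quick user-space sketch of this index arithmetic (constants for the non-CONFIG_BASE_SMALL case) shows where the wheel stands for timer_jiffies = 254, the value used in the worked example below:

#include <stdio.h>

#define TVR_BITS 8
#define TVN_BITS 6
#define TVR_MASK ((1 << TVR_BITS) - 1) /* 255 */
#define TVN_MASK ((1 << TVN_BITS) - 1) /* 63 */

/* current slot of group N+2 (tv2..tv5) for a given timer_jiffies */
#define INDEX(jif, N) (((jif) >> (TVR_BITS + (N) * TVN_BITS)) & TVN_MASK)

int main(void)
{
    unsigned long timer_jiffies = 254;

    printf("tv1 slot: %lu\n", timer_jiffies & TVR_MASK);  /* 254 */
    printf("tv2 slot: %lu\n", INDEX(timer_jiffies, 0));   /* (254 >> 8)  & 63 = 0 */
    printf("tv3 slot: %lu\n", INDEX(timer_jiffies, 1));   /* (254 >> 14) & 63 = 0 */
    return 0;
}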
5-3 Inserting a timer into the per-CPU timer wheel
1 Defining a timer
Static (compile-time) definition:
#define DEFINE_TIMER(_name, _function, _expires, _data) \
struct timer_list _name = \
TIMER_INITIALIZER(_function, _expires, _data)
#define TIMER_INITIALIZER(_function, _expires, _data) { \
.entry = { .prev = TIMER_ENTRY_STATIC }, \
.function = (_function), \
.expires = (_expires), \
.data = (_data), \
.base = &boot_tvec_bases, \
__TIMER_LOCKDEP_MAP_INITIALIZER( \
__FILE__ ":" __stringify(__LINE__)) \
}
Dynamic (runtime) definition (a module sketch using this pattern follows below):
struct timer_list my_timer;
init_timer(&my_timer);
my_timer.expires  = jiffies + 1*HZ;      /* one second from now */
my_timer.data     = (unsigned long)&my_timer;
my_timer.function = my_function;         /* void my_function(unsigned long data); */
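Putting the pieces together, a minimal kernel-module sketch against this 2.6.32-era API (init_timer / add_timer / del_timer_sync; later kernels replaced this interface with timer_setup). The module name and messages are illustrative only:

#include <linux/module.h>
#include <linux/timer.h>
#include <linux/jiffies.h>

static struct timer_list my_timer;

static void my_function(unsigned long data)
{
    pr_info("timer fired at jiffies=%lu\n", jiffies);
    /* re-arm one second later if a periodic timer is wanted:
     * mod_timer(&my_timer, jiffies + HZ); */
}

static int __init timer_demo_init(void)
{
    init_timer(&my_timer);
    my_timer.function = my_function;
    my_timer.data     = 0;
    my_timer.expires  = jiffies + HZ;   /* one second from now */
    add_timer(&my_timer);
    return 0;
}

static void __exit timer_demo_exit(void)
{
    del_timer_sync(&my_timer);  /* wait for a running handler to finish */
}

module_init(timer_demo_init);
module_exit(timer_demo_exit);
MODULE_LICENSE("GPL");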
2 Activating a timer: adding it to the current CPU's timer wheel
2-1 add_timer
/**
* add_timer - start a timer
* @timer: the timer to be added
*
* The kernel will do timer ->function(->data) callback from the
* timer interrupt at the ->expires point in the future. The
* current time is 'jiffies'.
*
* The timer's ->expires, ->function (and if the handler uses it, ->data)
* fields must be set prior calling this function.
*
* Timers with an ->expires field in the past will be executed in the next
* timer tick.
*/
void add_timer(struct timer_list *timer)
{
BUG_ON(timer_pending(timer)); //the timer must not already be queued, i.e. timer->entry.next == NULL
mod_timer(timer, timer->expires); //expires here is an absolute jiffies value
}
2-2 mod_timer: updates a timer's expires value, queueing the timer if it is not already queued
/**
* mod_timer - modify a timer's timeout
* @timer: the timer to be modified
* @expires: new timeout in jiffies
*
* mod_timer() is a more efficient way to update the expire field of an
* active timer (if the timer is inactive it will be activated)
*
* mod_timer(timer, expires) is equivalent to:
*
* del_timer(timer); timer->expires = expires; add_timer(timer);
*
* Note that if there are multiple unserialized concurrent users of the
* same timer, then mod_timer() is the only safe way to modify the timeout,
* since add_timer() cannot modify an already running timer.
*
* The function returns whether it has modified a pending timer or not.
* (ie. mod_timer() of an inactive timer returns 0, mod_timer() of an
* active timer returns 1.)
*/
int mod_timer(struct timer_list *timer, unsigned long expires) //expires is an absolute jiffies value
{
/*
* This is a common optimization triggered by the
* networking code - if the timer is re-modified
* to be the same thing then just return:
*/
if (timer_pending(timer) && timer->expires == expires)
return 1;
return __mod_timer(timer, expires, false, TIMER_NOT_PINNED);
}
2-3 __mod_timer
static inline int
__mod_timer(struct timer_list *timer, unsigned long expires,
bool pending_only, int pinned)
{
struct tvec_base *base, *new_base;
unsigned long flags;
int ret = 0, cpu;
timer_stats_timer_set_start_info(timer); //records which process armed this timer;
    //when the timer later fires, these statistics identify the arming process
/*
void __timer_stats_timer_set_start_info(struct timer_list *timer, void *addr)
{
if (timer->start_site)
return;
timer->start_site = addr;
memcpy(timer->start_comm, current->comm, TASK_COMM_LEN);
timer->start_pid = current->pid;
}
*/
BUG_ON(!timer->function);
base = lock_timer_base(timer, &flags);
// lock the wheel that currently owns this timer: per_cpu(tvec_bases)->lock
if (timer_pending(timer)) { //the timer is already queued on a wheel
detach_timer(timer, 0);
if (timer->expires == base->next_timer &&
!tbase_get_deferrable(timer->base))
base->next_timer = base->timer_jiffies;
ret = 1;
} else {
if (pending_only)
goto out_unlock;
}
debug_activate(timer, expires);
new_base = __get_cpu_var(tvec_bases);
cpu = smp_processor_id();
#if defined(CONFIG_NO_HZ) && defined(CONFIG_SMP)
if (!pinned && get_sysctl_timer_migration() && idle_cpu(cpu)) {
int preferred_cpu = get_nohz_load_balancer();
if (preferred_cpu >= 0)
cpu = preferred_cpu;
}
#endif
new_base = per_cpu(tvec_bases, cpu);
if (base != new_base) {
/*
* We are trying to schedule the timer on the local CPU.
* However we can't change timer's base while it is running,
* otherwise del_timer_sync() can't detect that the timer's
* handler yet has not finished. This also guarantees that
* the timer is serialized wrt itself.
*/
if (likely(base->running_timer != timer)) {
/* See the comment in lock_timer_base() */
timer_set_base(timer, NULL);
spin_unlock(&base->lock);
base = new_base;
spin_lock(&base->lock);
timer_set_base(timer, base);
}
}
timer->expires = expires;
if (time_before(timer->expires, base->next_timer) &&
!tbase_get_deferrable(timer->base))
base->next_timer = timer->expires;
internal_add_timer(base, timer);
out_unlock:
spin_unlock_irqrestore(&base->lock, flags);
return ret;
}
2-4 internal_add_timer: the function that actually queues the timer
static void internal_add_timer(struct tvec_base *base, struct timer_list *timer)
{
unsigned long expires = timer->expires; //normally > jiffies; base->timer_jiffies is usually jiffies or jiffies-1
unsigned long idx = expires - base->timer_jiffies; //expiry relative to timer_jiffies
struct list_head *vec;
if (idx < TVR_SIZE) { //due within the next 255 ticks
int i = expires & TVR_MASK;
/*
 * Pick the slot in group 1. E.g. with base->timer_jiffies = 254 and
 * timer->expires = 259, idx = 5: the timer fires 5 ticks from now. Group 1's
 * current position is timer_jiffies = 254, so the timer belongs 5 slots
 * ahead, at slot 259 mod 256; indeed i = 259 & 255 = 3. Correct.
 */
vec = base->tv1.vec + i;
}
else if (idx < 1 << (TVR_BITS + TVN_BITS)) { //due 2^8 .. 2^14-1 ticks from now
int i = (expires >> TVR_BITS) & TVN_MASK;
vec = base->tv2.vec + i;
/*
 * Key idea: timer->expires (which selects the slot where a future timer is
 * placed) > jiffies > timer_jiffies (which selects each group's current
 * index). Since timer->expires is always ahead of timer_jiffies, the chosen
 * slot is always ahead of the wheel's current position.
 * Again assume base->timer_jiffies = 254 and timer->expires = 512, so
 * idx = 258: the timer fires 258 ticks from now.
 * i = (512 >> 8) & 63 = 2, so it is placed in slot 2 of group 2.
 * Group 1's current position is timer_jiffies = 254 and group 2's is
 * INDEX(0) = 0. When timer_jiffies reaches 256, group 1's index wraps to 0,
 * group 2's current slot is cascaded down into group 1, and group 2's index
 * advances. When timer_jiffies reaches 512, slot 2 of group 2 (which holds
 * our timer) is cascaded down; since 512 & 255 = 0 the timer lands in slot 0
 * of group 1, which is exactly the slot processed on that tick, so its
 * expired callback runs on time.
 */
} else if (idx < 1 << (TVR_BITS + 2 * TVN_BITS)) {
int i = (expires >> (TVR_BITS + TVN_BITS)) & TVN_MASK;
vec = base->tv3.vec + i;
} else if (idx < 1 << (TVR_BITS + 3 * TVN_BITS)) {
int i = (expires >> (TVR_BITS + 2 * TVN_BITS)) & TVN_MASK;
vec = base->tv4.vec + i;
} else if ((signed long) idx < 0) {
/*
* Can happen if you add a timer with expires == jiffies,
* or you set a timer to go off in the past
*/
vec = base->tv1.vec + (base->timer_jiffies & TVR_MASK);
}
else
{
int i;
/* If the timeout is larger than 0xffffffff on 64-bit
* architectures then we use the maximum timeout:
*/
if (idx > 0xffffffffUL) {
idx = 0xffffffffUL;
expires = idx + base->timer_jiffies;
}
i = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
vec = base->tv5.vec + i;
}
/*
* Timers are FIFO:
*/
list_add_tail(&timer->entry, vec);
}
3 The timer softirq: running this CPU's timers
As shown above, after update_process_times has accounted the running process's time, it calls run_local_timers.
3-1 Raising the timer softirq
/*
* Called by the local, per-CPU timer interrupt on SMP.
*/
void run_local_timers(void)
{
hrtimer_run_queues();
raise_softirq(TIMER_SOFTIRQ);
softlockup_tick();
}
void raise_softirq(unsigned int nr)
{
unsigned long flags;
local_irq_save(flags);
raise_softirq_irqoff(nr);
local_irq_restore(flags);
}
inline void raise_softirq_irqoff(unsigned int nr)
{
__raise_softirq_irqoff(nr); //set bit nr in this CPU's softirq_pending mask
    //(32 entries), marking softirq nr as having work to do
/*
* If we're in an interrupt or softirq, we're done
* (this also catches softirq-disabled code). We will
* actually run the softirq once we return from
* the irq or softirq.
*
* Otherwise we wake up ksoftirqd to make sure we
* schedule the softirq soon.
*/
if (!in_interrupt())
wakeup_softirqd(); //wake the ksoftirqd kernel thread; otherwise the softirq
    //runs at the end of the interrupt handling path
}
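__raise_softirq_irqoff itself is only a bit-set on that per-CPU pending mask; in this era it is roughly (include/linux/interrupt.h):

/* mark softirq nr pending on this CPU; the irq-exit path and
 * ksoftirqd both test this mask via local_softirq_pending() */
#define __raise_softirq_irqoff(nr) \
    do { or_softirq_pending(1UL << (nr)); } while (0)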
3-2 Executing the softirq handler; the action associated with TIMER_SOFTIRQ is:
/*
* This function runs timers and the timer-tq in bottom half context.
*/
static void run_timer_softirq(struct softirq_action *h)
{
struct tvec_base *base = __get_cpu_var(tvec_bases);
perf_event_do_pending();
hrtimer_run_pending();
if (time_after_eq(jiffies, base->timer_jiffies))
__run_timers(base);
}
3-3 Running the expired timer callbacks
/**
* __run_timers - run all expired timers (if any) on this CPU.
* @base: the timer vector to be processed.
*
* This function cascades all vectors and executes all expired timer
* vectors.
*/
static inline void __run_timers(struct tvec_base *base)
{
struct timer_list *timer;
spin_lock_irq(&base->lock);
while (time_after_eq(jiffies, base->timer_jiffies)) {
//jiffies may be ahead of the timer_jiffies of the last run; each iteration
//processes one tick's worth of timers and advances timer_jiffies by 1 until
//it catches up with jiffies
struct list_head work_list;
struct list_head *head = &work_list;
int index = base->timer_jiffies & TVR_MASK; //group-1 slot for this iteration's timer_jiffies
/*
* Cascade timers: mainly group 1 is processed; whenever group 1's index wraps
* to 0, i.e. timer_jiffies % 256 == 0, one slot of the next group up is
* redistributed down into this group, and so on; INDEX(n) is the current
* slot of group n+2
*/
if (!index &&
(!cascade(base, &base->tv2, INDEX(0))) &&
(!cascade(base, &base->tv3, INDEX(1))) &&
!cascade(base, &base->tv4, INDEX(2)))
cascade(base, &base->tv5, INDEX(3));
++base->timer_jiffies; //this tick's timers are handled; advance by 1
list_replace_init(base->tv1.vec + index, &work_list);
//detach the slot's entire list (often refilled from higher groups by the
//cascade) onto work_list, leaving the slot empty
while (!list_empty(head)) {
void (*fn)(unsigned long);
unsigned long data;
timer = list_first_entry(head, struct timer_list, entry); //handle each timer in turn
fn = timer->function;
data = timer->data;
timer_stats_account_timer(timer); //statistics: record the process that armed this timer
set_running_timer(base, timer);
detach_timer(timer, 1); //unlink the timer from its doubly linked list
spin_unlock_irq(&base->lock);
{
int preempt_count = preempt_count();
#ifdef CONFIG_LOCKDEP
/*
* It is permissible to free the timer from
* inside the function that is called from
* it, this we need to take into account for
* lockdep too. To avoid bogus "held lock
* freed" warnings as well as problems when
* looking into timer->lockdep_map, make a
* copy and use that here.
*/
struct lockdep_map lockdep_map =
timer->lockdep_map;
#endif
/*
* Couple the lock chain with the lock chain at
* del_timer_sync() by acquiring the lock_map
* around the fn() call here and in
* del_timer_sync().
*/
lock_map_acquire(&lockdep_map);
trace_timer_expire_entry(timer);
fn(data); //invoke the timer callback
trace_timer_expire_exit(timer);
lock_map_release(&lockdep_map);
if (preempt_count != preempt_count()) {
printk(KERN_ERR "huh, entered %p "
"with preempt_count %08x, exited"
" with %08x?\n",
fn, preempt_count,
preempt_count());
BUG();
}
}
spin_lock_irq(&base->lock);
}
}
set_running_timer(base, NULL);
spin_unlock_irq(&base->lock);
}
/*
 * Purpose: re-insert every timer in slot `index` of this group back into the
 * wheel; since base->timer_jiffies has advanced, each timer now lands in a
 * lower group (or in the current slot of tv1).
 * base  - the per-CPU timer wheel
 * tv    - the group being cascaded
 * index - the group's current slot (derived from timer_jiffies); may be 0
 * Returns index; a return value of 0 means this group has wrapped too, so
 * the next group up must be cascaded as well.
 */
static int cascade(struct tvec_base *base, struct tvec *tv, int index)
{
/* cascade all the timers from tv up one level */
struct timer_list *timer, *tmp;
struct list_head tv_list;
list_replace_init(tv->vec + index, &tv_list);
/*
* We are removing _all_ timers from the list, so we
* don't have to detach them individually.
*/
list_for_each_entry_safe(timer, tmp, &tv_list, entry) {
BUG_ON(tbase_get_base(timer->base) != base);
internal_add_timer(base, timer);
}
return index;
}
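To tie sections 5-2 and 5-3 together, here is a hypothetical user-space simulation of just the placement arithmetic for the worked example (a timer with expires = 512 queued while timer_jiffies = 254): it first lands in tv2, and after the cascade at timer_jiffies = 512 it lands in tv1 slot 0, the very slot processed on that tick.

#include <stdio.h>

#define TVR_BITS 8
#define TVN_BITS 6
#define TVR_SIZE (1 << TVR_BITS)
#define TVR_MASK (TVR_SIZE - 1)
#define TVN_MASK ((1 << TVN_BITS) - 1)

/* which vector and slot would internal_add_timer pick? */
static void place(unsigned long timer_jiffies, unsigned long expires)
{
    unsigned long idx = expires - timer_jiffies;

    if (idx < TVR_SIZE)
        printf("tv1 slot %lu\n", expires & TVR_MASK);
    else if (idx < 1UL << (TVR_BITS + TVN_BITS))
        printf("tv2 slot %lu\n", (expires >> TVR_BITS) & TVN_MASK);
    else
        printf("higher vector\n");
}

int main(void)
{
    place(254, 512);  /* idx=258 -> tv2 slot (512>>8)&63 = 2 */
    /* when the cascade fires at timer_jiffies = 512, the timer is re-added
     * via internal_add_timer and now lands in tv1: */
    place(512, 512);  /* idx=0 -> tv1 slot 512&255 = 0, processed this tick */
    return 0;
}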