Contents
1 Manual
2 KVM emulation
2.1 APIC Timer modes
2.2 Timer implementations
2.3 Interrupt injection
Let's first look at how Intel SDM Volume 3 describes the APIC Timer.
See 10.5.4 APIC Timer,
The local APIC unit contains a 32-bit programmable timer that is available to software to time events or operations. This timer is set up by programming four registers:
Next come the APIC Timer modes, namely oneshot mode and periodic mode,
From the above, we get the following information:
See 10.5.4.1 TSC-Deadline Mode,
A write to the LVT Timer Register that changes the timer mode disarms the local APIC timer. The supported timer modes are given in Table 10-2. The three modes of the local APIC timer are mutually exclusive.
TSC-deadline mode allows software to use the local APIC timer to signal an interrupt at an absolute time. In TSC-deadline mode, writes to the initial-count register are ignored; and current-count register always reads 0. Instead, timer behavior is controlled using the IA32_TSC_DEADLINE MSR.
In TSC-deadline mode, writing 0 to the IA32_TSC_DEADLINE MSR disarms the local-APIC timer. Transitioning between TSC-deadline mode and other timer modes also disarms the timer
From the above, we get the following information:
There are three modes, periodic/oneshot/tscdeadline. The relevant code is as follows:
kvm_lapic_reg_write()
---
case APIC_LVTT:
val &= (apic_lvt_mask[0] | apic->lapic_timer.timer_mode_mask);
kvm_lapic_set_reg(apic, APIC_LVTT, val);
apic_update_lvtt(apic);
break;
case APIC_TMICT:
if (apic_lvtt_tscdeadline(apic))
break;
cancel_apic_timer(apic);
kvm_lapic_set_reg(apic, APIC_TMICT, val);
start_apic_timer(apic);
break;
---
apic_update_lvtt()
---
u32 timer_mode = kvm_lapic_get_reg(apic, APIC_LVTT) &
apic->lapic_timer.timer_mode_mask;
if (apic->lapic_timer.timer_mode != timer_mode) {
if (apic_lvtt_tscdeadline(apic) != (timer_mode ==
APIC_LVT_TIMER_TSCDEADLINE)) {
cancel_apic_timer(apic);
kvm_lapic_set_reg(apic, APIC_TMICT, 0);
apic->lapic_timer.period = 0;
apic->lapic_timer.tscdeadline = 0;
}
apic->lapic_timer.timer_mode = timer_mode;
limit_periodic_timer_frequency(apic);
}
---
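Looking at apic_update_lvtt(), we can see that when switching between TSC-deadline mode and the other modes, the previously armed timer is cancelled, which is consistent with the manual.
APIC_TMICT is the Initial Count Register; once its value has been set, the timer is restarted.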
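The mode-switch rule in apic_update_lvtt() can be sketched in isolation. The following is a minimal user-space model, not KVM code: the `toy_` names and the struct are hypothetical, and it assumes the guest has TSC-deadline support so that the mode field covers bits 18:17 of the LVT Timer Register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Timer-mode field of the LVT Timer Register, bits 18:17. */
#define LVT_TIMER_MODE_SHIFT   17
#define LVT_TIMER_ONESHOT      (0u << LVT_TIMER_MODE_SHIFT)
#define LVT_TIMER_PERIODIC     (1u << LVT_TIMER_MODE_SHIFT)
#define LVT_TIMER_TSCDEADLINE  (2u << LVT_TIMER_MODE_SHIFT)
#define LVT_TIMER_MODE_MASK    (3u << LVT_TIMER_MODE_SHIFT)

/* Hypothetical, simplified stand-in for the timer state KVM keeps. */
struct toy_lapic_timer {
    uint32_t timer_mode;   /* one of the LVT_TIMER_* values */
    uint32_t tmict;        /* initial-count register */
    uint64_t period_ns;
    uint64_t tscdeadline;
    bool     armed;
};

/* Apply a guest write to LVTT; returns true if the timer was disarmed. */
static bool toy_update_lvtt(struct toy_lapic_timer *t, uint32_t lvtt)
{
    uint32_t new_mode = lvtt & LVT_TIMER_MODE_MASK;
    bool was_deadline, is_deadline, disarmed = false;

    assert(t);
    if (t->timer_mode == new_mode)
        return false;

    was_deadline = t->timer_mode == LVT_TIMER_TSCDEADLINE;
    is_deadline  = new_mode == LVT_TIMER_TSCDEADLINE;

    /* Only crossing the TSC-deadline boundary clears the timer state. */
    if (was_deadline != is_deadline) {
        t->armed = false;
        t->tmict = 0;
        t->period_ns = 0;
        t->tscdeadline = 0;
        disarmed = true;
    }
    t->timer_mode = new_mode;
    return disarmed;
}
```

In this model a periodic to oneshot switch leaves the state untouched, while a switch into or out of TSC-deadline mode clears it, mirroring the quoted excerpt.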
TSC-deadline mode is programmed through an MSR; see the code:
kvm_set_msr_common()/handle_fastpath_set_tscdeadline()
-> kvm_set_lapic_tscdeadline_msr()
---
hrtimer_cancel(&apic->lapic_timer.timer);
apic->lapic_timer.tscdeadline = data;
kvm_pv_update_tscdeadline(vcpu, data);
start_apic_timer(apic);
---
The value is stored in lapic_timer.tscdeadline.
For how the expiry time is computed in oneshot/periodic mode, see the function:
__start_apic_timer()
-> set_target_expiration() // oneshot or period mode
---
u64 tscl = rdtsc();
s64 deadline;
now = ktime_get();
apic->lapic_timer.period = tmict_to_ns(apic, kvm_lapic_get_reg(apic, APIC_TMICT));
deadline = apic->lapic_timer.period;
...
apic->lapic_timer.tscdeadline = kvm_read_l1_tsc(apic->vcpu, tscl) +
nsec_to_cycles(apic->vcpu, deadline);
apic->lapic_timer.target_expiration = ktime_add_ns(now, deadline);
---
-> restart_apic_timer()
Note that lapic_timer.tscdeadline is set here as well.
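The arithmetic in set_target_expiration() can be illustrated with a small stand-alone sketch. It assumes that the period in nanoseconds is the initial count multiplied by the bus cycle time and the divide value, and that nanoseconds convert to TSC cycles by scaling with the TSC frequency in kHz; the `toy_` helpers and their parameters are hypothetical and ignore the overflow handling a real implementation needs.

```c
#include <assert.h>
#include <stdint.h>

/* Initial count -> nanoseconds: count * bus cycle time * divider. */
static uint64_t toy_tmict_to_ns(uint32_t tmict, uint32_t bus_cycle_ns,
                                uint32_t divide_count)
{
    assert(divide_count != 0);
    return (uint64_t)tmict * bus_cycle_ns * divide_count;
}

/* Nanoseconds -> TSC cycles for a TSC running at tsc_khz kHz. */
static uint64_t toy_ns_to_cycles(uint64_t ns, uint32_t tsc_khz)
{
    /* Simplification: the product can overflow for very large ns. */
    return ns * tsc_khz / 1000000ULL;
}

struct toy_expiration {
    uint64_t tscdeadline;        /* guest TSC value at expiry */
    uint64_t target_expiration;  /* host time at expiry, in ns */
};

/* Compute both expiry representations from one initial-count write. */
static struct toy_expiration
toy_set_target_expiration(uint64_t now_ns, uint64_t guest_tsc,
                          uint32_t tmict, uint32_t bus_cycle_ns,
                          uint32_t divide_count, uint32_t tsc_khz)
{
    struct toy_expiration e;
    uint64_t period_ns = toy_tmict_to_ns(tmict, bus_cycle_ns, divide_count);

    e.tscdeadline = guest_tsc + toy_ns_to_cycles(period_ns, tsc_khz);
    e.target_expiration = now_ns + period_ns;
    return e;
}
```

With an initial count of 1,000,000, a 1 ns bus cycle, a divider of 16 and a 2 GHz TSC, the period is 16 ms, which corresponds to 32,000,000 TSC cycles.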
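The timer can be implemented in two ways: the hv timer, backed by the VMX preemption timer, and the sw timer, backed by a host hrtimer.
For details on the preemption timer, see the part on Clock Event virtualization in my earlier blog post,
KVM CPU虚拟化_cpu vmx_jianchwa的博客-CSDN博客
For how the hv timer is programmed and how it fires, see the following code: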
start_hv_timer()
-> static_call(kvm_x86_set_hv_timer)(vcpu, ktimer->tscdeadline, &expired)
vmx_set_hv_timer()
---
vmx->hv_deadline_tsc = tscl + delta_tsc;
---
vmx_vcpu_run()
-> vmx_update_hv_timer(vcpu);
---
if (vmx->hv_deadline_tsc != -1) {
tscl = rdtsc();
if (vmx->hv_deadline_tsc > tscl)
delta_tsc = (u32)((vmx->hv_deadline_tsc - tscl) >> cpu_preemption_timer_multi);
else
delta_tsc = 0;
vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, delta_tsc);
}
---
vmx_vcpu_run()
-> vmx_exit_handlers_fastpath()
-> handle_fastpath_preemption_timer()
-> kvm_lapic_expired_hv_timer()
-> apic_timer_expired(apic, false);
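The timer value comes from lapic_timer.tscdeadline. After a series of conversions it is written into VMX_PREEMPTION_TIMER_VALUE; when it expires, a VM-exit is triggered and apic_timer_expired() is eventually called to handle the expiry.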
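The conversion from an absolute TSC deadline to the 32-bit preemption timer field can be sketched as follows. This is a stand-alone model with hypothetical names: `rate_shift` plays the role of cpu_preemption_timer_multi, and the explicit range check is an assumption of this sketch rather than something shown in the excerpt above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Convert an absolute host-TSC deadline into a 32-bit countdown value.
 * The timer is assumed to tick once every 2^rate_shift TSC cycles.
 * Returns false if the distance does not fit in the 32-bit field.
 */
static bool toy_preemption_timer_value(uint64_t deadline_tsc, uint64_t now_tsc,
                                       unsigned rate_shift, uint32_t *value)
{
    uint64_t ticks;

    assert(value);
    assert(rate_shift < 32);

    if (deadline_tsc <= now_tsc) {
        *value = 0;          /* already due: expire right after VM-entry */
        return true;
    }

    ticks = (deadline_tsc - now_tsc) >> rate_shift;
    if (ticks > UINT32_MAX)
        return false;        /* too far away for the hardware field */

    *value = (uint32_t)ticks;
    return true;
}
```

A deadline 320 cycles in the future with a shift of 5 yields a countdown of 10; a deadline in the past yields 0, which makes the timer fire as soon as the guest is entered.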
The code that programs and fires the sw timer is as follows:
start_sw_period()
---
hrtimer_start(&apic->lapic_timer.timer,
apic->lapic_timer.target_expiration,
HRTIMER_MODE_ABS_HARD);
---
apic_timer_fn()
-> apic_timer_expired(apic, true);
sw tscdeadline also uses this hrtimer, but through a different code path:
start_sw_tscdeadline()
---
u64 guest_tsc, tscdeadline = ktimer->tscdeadline;
...
now = ktime_get();
guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
ns = (tscdeadline - guest_tsc) * 1000000ULL;
do_div(ns, this_tsc_khz);
if (likely(tscdeadline > guest_tsc) &&
likely(ns > apic->lapic_timer.timer_advance_ns)) {
expire = ktime_add_ns(now, ns);
expire = ktime_sub_ns(expire, ktimer->timer_advance_ns);
hrtimer_start(&ktimer->timer, expire, HRTIMER_MODE_ABS_HARD);
} else
apic_timer_expired(apic, false);
---
The hrtimer lapic_timer.timer is used not only by oneshot and periodic modes but also by sw tscdeadline mode; the only difference between them is how the expiry time is computed.
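The expiry computation for the sw tscdeadline path can be modelled in a few lines. This is a sketch under stated assumptions: the `toy_` function and its parameters are hypothetical, time is a plain nanosecond counter instead of ktime_t, and the deadline check is done before the multiplication rather than after it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true and sets *expire_ns when an hrtimer should be armed.
 * Returns false when the deadline is already due (or closer than the
 * advance window), in which case the caller fires the timer directly.
 */
static bool toy_sw_tscdeadline(uint64_t now_ns, uint64_t guest_tsc,
                               uint64_t tscdeadline, uint32_t tsc_khz,
                               uint64_t advance_ns, uint64_t *expire_ns)
{
    uint64_t ns;

    assert(expire_ns);
    assert(tsc_khz != 0);

    if (tscdeadline <= guest_tsc)
        return false;

    /* cycles * 10^6 / kHz = nanoseconds until the deadline */
    ns = (tscdeadline - guest_tsc) * 1000000ULL / tsc_khz;
    if (ns <= advance_ns)
        return false;

    /* Fire slightly early to hide the injection latency. */
    *expire_ns = now_ns + ns - advance_ns;
    return true;
}
```

For a 2 GHz guest TSC, a deadline 2,000,000 cycles away is 1 ms in the future; with a 1000 ns advance the hrtimer is armed for 999,000 ns from now.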
Whether the hv timer or the sw timer is used depends on the following function:
start_hv_timer()
---
if (!kvm_can_use_hv_timer(vcpu))
return false;
if (!ktimer->tscdeadline)
return false;
if (static_call(kvm_x86_set_hv_timer)(vcpu, ktimer->tscdeadline, &expired))
return false;
---
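In kvm_can_use_hv_timer(), there are two main variables:
So do oneshot/periodic modes use the hv timer as well?
Yes. In set_target_expiration(), apic->lapic_timer.tscdeadline is computed for them too.
The expiry values used by the two modes are stored in:
For periodic mode, the timer must be re-armed after it expires; both the sw timer and the hw timer implement this: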
In sw timer mode,
apic_timer_fn()
---
if (lapic_is_periodic(apic)) {
advance_periodic_target_expiration(apic);
hrtimer_add_expires_ns(&ktimer->timer, ktimer->period);
return HRTIMER_RESTART;
}
---
In hw timer mode
kvm_lapic_expired_hv_timer()
---
if (apic_lvtt_period(apic) && apic->lapic_timer.period) {
advance_periodic_target_expiration(apic);
restart_apic_timer(apic);
}
---
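The re-arm step can be pictured with a small model. It assumes that advancing the periodic target means adding one period to the previous target (not to the current time), and that the TSC deadline is then recomputed from the remaining distance; the struct and function names are hypothetical and the real advance_periodic_target_expiration() may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified periodic timer state. */
struct toy_periodic {
    uint64_t target_expiration_ns;  /* host time of the next expiry */
    uint64_t period_ns;
    uint64_t tscdeadline;           /* guest TSC value of the next expiry */
};

/* Move the target forward by one period and refresh the TSC deadline. */
static void toy_advance_periodic(struct toy_periodic *t, uint64_t now_ns,
                                 uint64_t guest_tsc, uint32_t tsc_khz)
{
    uint64_t delta_ns;

    assert(t && t->period_ns != 0);

    /* Relative to the old target, so handler latency does not accumulate. */
    t->target_expiration_ns += t->period_ns;

    delta_ns = t->target_expiration_ns > now_ns
             ? t->target_expiration_ns - now_ns : 0;
    t->tscdeadline = guest_tsc + delta_ns * tsc_khz / 1000000ULL;
}
```

Because the new target is derived from the old one, a handler that runs 300 ns late still schedules the next expiry exactly one period after the previous target.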
One point deserves special mention: when a vCPU enters the blocked state, the timer is switched from hv mode to sw mode; see the code:
static int vmx_pre_block(struct kvm_vcpu *vcpu)
{
if (pi_pre_block(vcpu))
return 1;
if (kvm_lapic_hv_timer_in_use(vcpu))
kvm_lapic_switch_to_sw_timer(vcpu);
return 0;
}
static void vmx_post_block(struct kvm_vcpu *vcpu)
{
if (kvm_x86_ops.set_hv_timer)
kvm_lapic_switch_to_hv_timer(vcpu);
pi_post_block(vcpu);
}
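Normally, a timer can wake a CPU from the idle state. In hv mode, the preemption timer is obviously useless for a vCPU that is not running, so sw mode has to be used, with an hrtimer on the host waking the vCPU up.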
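The switch around blocking can be summarized as a tiny state machine. This is only an illustrative model with hypothetical names; it leaves out posted-interrupt handling and the case where arming the hv timer fails.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_backend { TOY_TIMER_NONE, TOY_TIMER_HV, TOY_TIMER_SW };

/* Hypothetical per-vCPU timer bookkeeping. */
struct toy_vcpu_timer {
    enum toy_backend backend;
    bool hv_supported;   /* preemption timer available on this host */
    bool blocked;
};

/* Before blocking: a preemption timer cannot wake a vCPU that is not running. */
static void toy_pre_block(struct toy_vcpu_timer *t)
{
    assert(t && !t->blocked);
    if (t->backend == TOY_TIMER_HV)
        t->backend = TOY_TIMER_SW;
    t->blocked = true;
}

/* After wake-up: go back to the cheaper hv timer when it is available. */
static void toy_post_block(struct toy_vcpu_timer *t)
{
    assert(t && t->blocked);
    t->blocked = false;
    if (t->hv_supported && t->backend == TOY_TIMER_SW)
        t->backend = TOY_TIMER_HV;
}
```

A vCPU whose timer is on the hv backend therefore runs on the sw backend for exactly the duration of the block and returns to the hv backend afterwards.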
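Once the timer has expired, how is the corresponding vector injected into the guest?
There are two ways, corresponding to the hv timer and the sw timer; they differ in the context from which apic_timer_expired() is called: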
not from timer fn, namely, the preemption_timer vm-exit case
apic_timer_expired()
---
if (!from_timer_fn && vcpu->arch.apicv_active) {
kvm_apic_inject_pending_timer_irqs(apic);
-> kvm_apic_local_deliver(apic, APIC_LVTT)
}
---
from the timer fn, namely, software emulated timer
---
atomic_inc(&apic->lapic_timer.pending);
kvm_make_request(KVM_REQ_UNBLOCK, vcpu);
if (from_timer_fn)
kvm_vcpu_kick(vcpu);
---
vcpu_run()
---
for (;;) {
...
if (kvm_vcpu_running(vcpu)) {
r = vcpu_enter_guest(vcpu);
} else {
r = vcpu_block(kvm, vcpu);
}
if (r <= 0)
break;
kvm_clear_request(KVM_REQ_UNBLOCK, vcpu);
if (kvm_cpu_has_pending_timer(vcpu))
kvm_inject_pending_timer_irqs(vcpu);
...
}
---
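In this case, after lapic_timer.pending has been set, the expiry is handled later in VM-exit context. If the timer fires on the pCPU where the vCPU is running, kvm_vcpu_kick() does nothing; otherwise, it sends an IPI to the pCPU the vCPU is on.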
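The deferred path can be modelled with an atomic counter. The sketch below is a user-space illustration with hypothetical names; it assumes that several expirations recorded before the vCPU loop runs are coalesced into a single injection, and it counts kicks instead of sending IPIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical vCPU model: only the fields this sketch needs. */
struct toy_vcpu {
    atomic_int pending;   /* expirations not yet turned into an interrupt */
    int kicks;            /* how often the vCPU was kicked */
    int injected;         /* timer interrupts delivered to the guest */
};

static void toy_vcpu_init(struct toy_vcpu *v)
{
    assert(v);
    atomic_init(&v->pending, 0);
    v->kicks = 0;
    v->injected = 0;
}

/* Timer side: record the expiry, kick only from the hrtimer callback. */
static void toy_timer_expired(struct toy_vcpu *v, bool from_timer_fn)
{
    atomic_fetch_add(&v->pending, 1);
    if (from_timer_fn)
        v->kicks++;
}

/* vCPU loop side: one injection clears everything that is pending. */
static void toy_inject_pending(struct toy_vcpu *v)
{
    if (atomic_load(&v->pending) > 0) {
        v->injected++;
        atomic_store(&v->pending, 0);
    }
}
```

Two expirations followed by one pass through the loop result in a single injected interrupt, and a second pass with nothing pending injects nothing.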
apic_timer_expired() also contains the following interrupt injection path:
apic_timer_expired()
---
if (kvm_use_posted_timer_interrupt(apic->vcpu)) {
if (vcpu->arch.apic->lapic_timer.expired_tscdeadline &&
vcpu->arch.apic->lapic_timer.timer_advance_ns)
__kvm_wait_lapic_expire(vcpu);
kvm_apic_inject_pending_timer_irqs(apic);
return;
}
---
See the commit that originally introduced it,
commit 0c5f81dad46c90792e6c3c4797131323c9e96dcd
Author: Wanpeng Li
Date: Sat Jul 6 09:26:51 2019 +0800
KVM: LAPIC: Inject timer interrupt via posted interrupt
Dedicated instances are currently disturbed by unnecessary jitter due
to the emulated lapic timers firing on the same pCPUs where the
vCPUs reside. There is no hardware virtual timer on Intel for guest
like ARM, so both programming timer in guest and the emulated timer fires
incur vmexits. This patch tries to avoid vmexit when the emulated timer
fires, at least in dedicated instance scenario when nohz_full is enabled.
In that case, the emulated timers can be offload to the nearest busy
housekeeping cpus since APICv has been found for several years in server
processors. The guest timer interrupt can then be injected via posted interrupts,
which are delivered by the housekeeping cpu once the emulated timer fires.
The host should tuned so that vCPUs are placed on isolated physical
processors, and with several pCPUs surplus for busy housekeeping.
If disabled mwait/hlt/pause vmexits keep the vCPUs in non-root mode,
~3% redis performance benefit can be observed on Skylake server, and the
number of external interrupt vmexits drops substantially. Without patch
VM-EXIT Samples Samples% Time% Min Time Max Time Avg time
EXTERNAL_INTERRUPT 42916 49.43% 39.30% 0.47us 106.09us 0.71us ( +- 1.09% )
While with patch:
VM-EXIT Samples Samples% Time% Min Time Max Time Avg time
EXTERNAL_INTERRUPT 6871 9.29% 2.96% 0.44us 57.88us 0.72us ( +- 4.02% )
Cc: Paolo Bonzini
Cc: Radim Krčmář
Cc: Marcelo Tosatti
Signed-off-by: Wanpeng Li
Signed-off-by: Paolo Bonzini
This feature goes together with CPU isolation; for details, see the SUSE blog,
CPU Isolation – Housekeeping and tradeoffs – by SUSE Labs (part 4) | SUSE Communities
https://www.suse.com/c/cpu-isolation-housekeeping-and-tradeoffs-part-4/
An excerpt from it:
On normal configurations, every CPU get its housekeeping duty share. On the opposite, nohz_full configurations implicitly move away all the housekeeping work outside the nohz_full set. This means that if you have 8 CPUs and you isolate CPUs 1,2,3,4,5,6,7:
nohz_full=1-7
Then CPU 0 will handle the housekeeping workload alone. These duties involve:
Using posted interrupt delivery avoids the extra VM-exit that kvm_vcpu_kick() would otherwise cause.