[Knowing What but Not Why - 8] A Detailed Explanation of the Linux cpufreq sysfs Files

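When learning a new kernel feature, there are two shortcuts.

The first is to read the earliest git log entries for the feature and see what motivated it. The earliest implementation is usually compact, so it is much easier to read.

The second is to look at the sysfs files related to the feature: change their values, watch how the module's behavior changes, and work out the implementation from there.

This article takes the second approach to analyze the cpufreq module.

Here are the sysfs files related to cpufreq: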

/# ls /sys/devices/system/cpu/cpu0/cpufreq/
affected_cpus     cpuinfo_max_freq            freqdomain_cpus                scaling_available_governors  scaling_governor  scaling_setspeed
bios_limit        cpuinfo_min_freq            related_cpus                   scaling_cur_freq             scaling_max_freq  stats
cpuinfo_cur_freq  cpuinfo_transition_latency  scaling_available_frequencies  scaling_driver               scaling_min_freq
Let the code speak. Each file is explained in turn below:

1. scaling_driver

/# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
acpi-cpufreq
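As shown, the current cpufreq driver is acpi-cpufreq.

Because of the author's line of work, this article mainly discusses cpufreq drivers on x86.

There are two typical drivers on x86: acpi_cpufreq and intel_pstate.

The former has been in the kernel for many years; the latter was introduced by Intel specifically to optimize for its own products.

intel_pstate carries so many optimizations that it does not fit the conventional cpufreq driver model well. So that the analysis carries over to other cpufreq drivers, this article focuses on acpi_cpufreq and covers intel_pstate alongside it.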


2. cpuinfo_max_freq, cpuinfo_min_freq

The source is:

#define show_one(file_name, object)			\
static ssize_t show_##file_name				\
(struct cpufreq_policy *policy, char *buf)		\
{							\
	return sprintf(buf, "%u\n", policy->object);	\
}

show_one(cpuinfo_min_freq, cpuinfo.min_freq);
show_one(cpuinfo_max_freq, cpuinfo.max_freq);
So what is read is policy->cpuinfo.min_freq and policy->cpuinfo.max_freq.

When are these policy fields assigned? In acpi_cpufreq_cpu_init, the init callback of acpi_cpufreq_driver:

static int acpi_cpufreq_cpu_init(struct cpufreq_policy *policy)
{
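//1. Use the CPU's _PSS to get its frequency in each P-state and store it in core_frequency of the perf->states array
acpi_processor_register_performance(data->acpi_data, cpu);

//2. Copy the result of step 1 into the local cpufreq table. _PSS is sorted by descending frequency, so the loop keeps only entries in strictly descending order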
	for (i = 0; i < perf->state_count; i++) {
		if (i > 0 && perf->states[i].core_frequency >=
		    data->freq_table[valid_states-1].frequency / 1000)
			continue;

		data->freq_table[valid_states].driver_data = i;
		data->freq_table[valid_states].frequency =
		    perf->states[i].core_frequency * 1000;
		valid_states++;
	}
//3. From the table built in step 2, find the maximum and minimum frequency
	cpufreq_for_each_valid_entry(pos, table) {
		freq = pos->frequency;

		if (!cpufreq_boost_enabled()
		    && (pos->flags & CPUFREQ_BOOST_FREQ))
			continue;

		pr_debug("table entry %u: %u kHz\n", (int)(pos - table), freq);
		if (freq < min_freq)
			min_freq = freq;
		if (freq > max_freq)
			max_freq = freq;
	}

	policy->min = policy->cpuinfo.min_freq = min_freq;
	policy->max = policy->cpuinfo.max_freq = max_freq;
}

These are the maximum and minimum frequencies we see.
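Steps 2 and 3 can be sketched in user space roughly as follows. This is a simplified illustration, not the driver code: the names build_freq_table and struct freq_range are made up here, boost handling is left out, and any input frequencies are the caller's sample data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: min/max of the filtered table, in kHz. */
struct freq_range {
	unsigned int min_khz;
	unsigned int max_khz;
	size_t valid_states;
};

/*
 * Copy _PSS-style core frequencies (MHz) into a kHz table, keeping only
 * entries strictly lower than the previously kept entry, then report
 * the min and max of what was kept.
 */
static struct freq_range build_freq_table(const unsigned int *core_mhz,
					  size_t count,
					  unsigned int *table_khz)
{
	struct freq_range r = { ~0u, 0, 0 };
	size_t i;

	assert(core_mhz != NULL && table_khz != NULL);

	for (i = 0; i < count; i++) {
		/* Skip entries that break the strictly descending order. */
		if (r.valid_states > 0 &&
		    core_mhz[i] >= table_khz[r.valid_states - 1] / 1000)
			continue;
		table_khz[r.valid_states++] = core_mhz[i] * 1000;
	}

	for (i = 0; i < r.valid_states; i++) {
		if (table_khz[i] < r.min_khz)
			r.min_khz = table_khz[i];
		if (table_khz[i] > r.max_khz)
			r.max_khz = table_khz[i];
	}
	return r;
}
```

Feeding it a list with a repeated entry, say 2501, 2500, 2500, 2400, 800 MHz, drops the duplicate and leaves four valid states.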


Finally, policy->freq_table is pointed at this new cpufreq table:

policy->freq_table = table;

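policy->freq_table is used later to print which frequencies the current cpu governor supports; see

scaling_available_frequencies

This raises a question. A policy is tied to a governor, so if several CPUs share one governor's policy but have different cpufreq tables (that is, different _PSS), what happens?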


Now look at how intel_pstate assigns these two values:

static int intel_pstate_cpu_init(struct cpufreq_policy *policy)
{
	cpu->pstate.min_pstate = pstate_funcs.get_min();
	cpu->pstate.max_pstate = pstate_funcs.get_max();
	cpu->pstate.turbo_pstate = pstate_funcs.get_turbo();
	cpu->pstate.scaling = pstate_funcs.get_scaling();
	policy->min = cpu->pstate.min_pstate * cpu->pstate.scaling;
	policy->max = cpu->pstate.turbo_pstate * cpu->pstate.scaling;
	policy->cpuinfo.min_freq = cpu->pstate.min_pstate * cpu->pstate.scaling;
	policy->cpuinfo.max_freq =
		cpu->pstate.turbo_pstate * cpu->pstate.scaling;
}
The concrete pstate_funcs callbacks are architecture specific and are chosen by cpuid:

#define ICPU(model, policy) \
	{ X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF,\
			(unsigned long)&policy }

static const struct x86_cpu_id intel_pstate_cpu_ids[] = {
	ICPU(0x2a, core_params),
	ICPU(0x2d, core_params),
	ICPU(0x37, byt_params),
	ICPU(0x3a, core_params),
	ICPU(0x3c, core_params),
	ICPU(0x3d, core_params),
	ICPU(0x3e, core_params),
	ICPU(0x3f, core_params),
	ICPU(0x45, core_params),
	ICPU(0x46, core_params),
	ICPU(0x47, core_params),
	ICPU(0x4c, byt_params),
	ICPU(0x4e, core_params),
	ICPU(0x4f, core_params),
	ICPU(0x56, core_params),
	ICPU(0x57, knl_params),
	{}
};
This simply compares against the CPU features stored in boot_cpu_data to see which CPU model it is:

const struct x86_cpu_id *id;
id = x86_match_cpu(intel_pstate_cpu_ids);

const struct x86_cpu_id *x86_match_cpu(const struct x86_cpu_id *match)
{
	const struct x86_cpu_id *m;
	struct cpuinfo_x86 *c = &boot_cpu_data;

	for (m = match; m->vendor | m->family | m->model | m->feature; m++) {
		if (m->vendor != X86_VENDOR_ANY && c->x86_vendor != m->vendor)
			continue;
		if (m->family != X86_FAMILY_ANY && c->x86 != m->family)
			continue;
		if (m->model != X86_MODEL_ANY && c->x86_model != m->model)
			continue;
		if (m->feature != X86_FEATURE_ANY && !cpu_has(c, m->feature))
			continue;
		return m;
	}
	return NULL;
}
The matching algorithm means: unless a field is ANY (which matches everything), it must equal the corresponding vendor, family, model, or feature in boot_cpu_data, which is a struct cpuinfo_x86.

boot_cpu_data is initialized in identify_boot_cpu by

identify_cpu(&boot_cpu_data);
identify_cpu->generic_identify
void cpu_detect(struct cpuinfo_x86 *c)
{
/* Get vendor name */
	cpuid(0x00000000, (unsigned int *)&c->cpuid_level,
	      (unsigned int *)&c->x86_vendor_id[0],
	      (unsigned int *)&c->x86_vendor_id[8],
	      (unsigned int *)&c->x86_vendor_id[4]);
	if (c->cpuid_level >= 0x00000001) {
		u32 junk, tfms, cap0, misc;

		cpuid(0x00000001, &tfms, &misc, &junk, &cap0);
		c->x86 = (tfms >> 8) & 0xf;
		c->x86_model = (tfms >> 4) & 0xf;
		c->x86_mask = tfms & 0xf;

		if (c->x86 == 0xf)
			c->x86 += (tfms >> 20) & 0xff;
		if (c->x86 >= 0x6)
			c->x86_model += ((tfms >> 16) & 0xf) << 4;
}

}
So the key is the result of cpuid(0x1), which is used to set the x86_model field.
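To make the bit manipulation concrete, here is a small stand-alone sketch that mirrors the decoding in cpu_detect. The struct cpu_sig type and the function name are made up for illustration, and the EAX value 0x000406E3 mentioned afterwards is an assumed example, chosen because it decodes to the model 78, stepping 3 seen in the /proc/cpuinfo output later in this article.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the decoded signature. */
struct cpu_sig {
	unsigned int family;
	unsigned int model;
	unsigned int stepping;
};

/* Decode the EAX value returned by CPUID leaf 1, as cpu_detect() does. */
static struct cpu_sig decode_cpuid1_eax(uint32_t tfms)
{
	struct cpu_sig s;

	s.family = (tfms >> 8) & 0xf;
	s.model = (tfms >> 4) & 0xf;
	s.stepping = tfms & 0xf;

	/* Family 0xf adds the extended family field (bits 27:20). */
	if (s.family == 0xf)
		s.family += (tfms >> 20) & 0xff;
	/* Family >= 6 prepends the extended model field (bits 19:16). */
	if (s.family >= 0x6)
		s.model += ((tfms >> 16) & 0xf) << 4;

	assert(s.stepping <= 0xf);
	return s;
}
```

With tfms = 0x000406E3 the base model nibble is 0xE and the extended model nibble is 0x4, so the combined model is 0x4e, i.e. 78 in decimal.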

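The real question is whether this model number can be read from an existing file, because adding support for a new architecture to intel_pstate means adding a new model. It can, although the file lives in procfs rather than sysfs: it is the model field in /proc/cpuinfo.

For example, the model on my Surface Pro 3 is 78, i.e. 0x4e, so the matching pstate entry should be: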

	ICPU(0x4e, core_params),

cpu->pstate.min_pstate :

	rdmsrl(MSR_PLATFORM_INFO, value);
	return (value >> 40) & 0xFF;
cpu->pstate.max_pstate:

	rdmsrl(MSR_PLATFORM_INFO, value);
	return (value >> 8) & 0xFF;

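The min and max fields live in MSR_PLATFORM_INFO (MSR 0xCE): bits 47:40 hold the minimum ratio and bits 15:8 hold the maximum non-turbo ratio, which matches the shifts in the code above.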
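The two field extractions can be tried out on a plain 64-bit value without touching any MSR. The sketch below is only an illustration: the helper names are made up, and the 100000 kHz scaling factor used afterwards is an assumption (it is consistent with cpuinfo_min_freq reading 800000 for a ratio of 8 on the machine used later in this article).

```c
#include <assert.h>
#include <stdint.h>

/* Bits 47:40 of the MSR_PLATFORM_INFO value: lowest ratio (min_pstate). */
static unsigned int platform_info_min_ratio(uint64_t value)
{
	return (unsigned int)((value >> 40) & 0xFF);
}

/* Bits 15:8: maximum non-turbo ratio (max_pstate). */
static unsigned int platform_info_max_ratio(uint64_t value)
{
	return (unsigned int)((value >> 8) & 0xFF);
}

/* Convert a ratio into kHz given the per-ratio scaling factor. */
static unsigned int ratio_to_khz(unsigned int ratio, unsigned int scaling_khz)
{
	assert(ratio <= 0xFF);
	return ratio * scaling_khz;
}
```

For instance, a value with 8 in bits 47:40 and 25 in bits 15:8 yields a minimum of 800000 kHz and a maximum non-turbo frequency of 2500000 kHz under the assumed scaling.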



cpu->pstate.turbo_pstate (MSR 0x1ad):

rdmsrl(MSR_NHM_TURBO_RATIO_LIMIT, turbo_state);
ret = (turbo_state) & 255;

This takes bits 7:0, which means the value is the theoretical maximum, reached only when a single core is active.



Given

	policy->min = cpu->pstate.min_pstate * cpu->pstate.scaling;
	policy->max = cpu->pstate.turbo_pstate * cpu->pstate.scaling;
the cpuinfo_min_freq and cpuinfo_max_freq shown in sysfs span the range from the minimum pstate to the maximum turbo pstate.

3. cpuinfo_cur_freq, scaling_cur_freq

/# cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
400000
/# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
400000
With the acpi-cpufreq driver:

static ssize_t show_cpuinfo_cur_freq(struct cpufreq_policy *policy,
					char *buf)
{
	ret_freq = cpufreq_driver->get(policy->cpu);
}

static unsigned int get_cur_freq_on_cpu(unsigned int cpu)
{
msr = read_msr(MSR_IA32_PERF_CTL);
cpufreq_for_each_entry(pos, data->freq_table)
		if (msr == perf->states[pos->driver_data].status)
			return pos->frequency;
}

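The code above is rather obscure. It ties back to the cpufreq table from section 2:

each entry of the cpufreq table has an index (driver_data) that points into the performance state array obtained from _PSS,

recording which array element the entry was generated from. The ACPI specification describes _PSS as follows: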

Table 8-248 PState Package Values
Element (Object Type): Description
Core Frequency (Integer, DWORD): Indicates the core CPU operating frequency (in MHz).
Power (Integer, DWORD): Indicates the performance state’s maximum power dissipation (in milliwatts).
Latency (Integer, DWORD): Indicates the worst-case latency in microseconds that the CPU is unavailable during a transition from any performance state to this performance state.
Bus Master Latency (Integer, DWORD): Indicates the worst-case latency in microseconds that Bus Masters are prevented from accessing memory during a transition from any performance state to this performance state.
Control (Integer, DWORD): Indicates the value to be written to the Performance Control Register (PERF_CTRL) in order to initiate a transition to the performance state.
Status (Integer, DWORD): Indicates the value that OSPM will compare to a value read from the Performance Status Register (PERF_STATUS) to ensure that the transition to the performance state was successful. OSPM may always place the CPU in the lowest power state, but additional states are only available when indicated by the _PPC method.

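According to the last two fields of the _PSS package, Control and Status, when the CPU is to enter a given Core Frequency,

the PERF_CTRL register is first set to the specified Control value, then PERF_STATUS is read and compared with the specified Status value;

if they are equal, the P-state transition succeeded.

To find the CPU's current P-state frequency, the only option is this matching scheme: which _PSS state is the CPU in?

So the code walks every entry of the cpufreq table, finds the one whose status equals the CPU's current status,

and returns that entry's frequency as the current running frequency.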
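The lookup can be sketched as a simple table search. The layout below is a simplification made up for illustration (the real driver keeps the status in the _PSS array and reaches it through driver_data); the fallback to entry 0 when nothing matches follows the default mentioned later in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One row of a simplified frequency table (hypothetical layout). */
struct pstate_entry {
	uint32_t status;	/* _PSS Status field */
	unsigned int freq_khz;	/* Core Frequency converted to kHz */
};

/*
 * Return the frequency of the entry whose status matches the register
 * value; fall back to entry 0 when nothing matches.
 */
static unsigned int freq_from_status(const struct pstate_entry *table,
				     size_t count, uint32_t reg)
{
	size_t i;

	assert(table != NULL && count > 0);

	for (i = 0; i < count; i++)
		if (table[i].status == reg)
			return table[i].freq_khz;
	return table[0].freq_khz;
}
```

Using the two entries from the _PSS dump shown in the freqdomain_cpus section (status 0x1D00 at 2501 MHz and 0x1900 at 2500 MHz), a register value of 0x1900 maps to 2500000 kHz, while an unknown value falls back to 2501000 kHz.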

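Note, however, what I just said: we compare the CPU's current PERF_STATUS value against the cpufreq table. What does the code actually do?

It searches with the value of PERF_CTRL. There is a commit that changed PERF_STATUS to PERF_CTRL: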

commit 8673b83bf2f013379453b4779047bf3c6ae387e4
Author: Ross Lagerwall 
Date:   Fri May 31 20:45:17 2013 +0100

    acpi-cpufreq: set current frequency based on target P-State

    Commit 4b31e774 (Always set P-state on initialization) fixed bug
    #4634 and caused the driver to always set the target P-State at
    least once since the initial P-State may not be the desired one.
    Commit 5a1c0228 (cpufreq: Avoid calling cpufreq driver's target()
    routine if target_freq == policy->cur) caused a regression in
    this behavior.

    This fixes the regression by setting policy->cur based on the CPU's
    target frequency rather than the CPU's current reported frequency
    (which may be different).  This means that the P-State will be set
    initially if the CPU's target frequency is different from the
    governor's target frequency.

    This fixes an issue where setting the default governor to
    performance wouldn't correctly enable turbo mode on all cores.

    Signed-off-by: Ross Lagerwall 
    Reviewed-by: Len Brown 
    Acked-by: Viresh Kumar 
    Cc: 3.8+ 
    Signed-off-by: Rafael J. Wysocki 

diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
index 11b8b4b..edc089e 100644
--- a/drivers/cpufreq/acpi-cpufreq.c
+++ b/drivers/cpufreq/acpi-cpufreq.c
@@ -347,11 +347,11 @@ static u32 get_cur_val(const struct cpumask *mask)
        switch (per_cpu(acfreq_data, cpumask_first(mask))->cpu_feature) {
        case SYSTEM_INTEL_MSR_CAPABLE:
                cmd.type = SYSTEM_INTEL_MSR_CAPABLE;
-               cmd.addr.msr.reg = MSR_IA32_PERF_STATUS;
+               cmd.addr.msr.reg = MSR_IA32_PERF_CTL;
                break;
        case SYSTEM_AMD_MSR_CAPABLE:
                cmd.type = SYSTEM_AMD_MSR_CAPABLE;
-               cmd.addr.msr.reg = MSR_AMD_PERF_STATUS;
+               cmd.addr.msr.reg = MSR_AMD_PERF_CTL;
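Why is that? By this reasoning nothing would match (at least whenever a Status value in _PSS differs from the Control value read back), and the CPU frequency returned each time would be the default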

data->freq_table[0].frequency;

Here is what the SDM says about these two registers:

Reads of IA32_PERF_CTL determine the last targeted operating point. 
The current operating point can be read from
IA32_PERF_STATUS. IA32_PERF_STATUS is updated dynamically.

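CTL holds the last P-state that was requested; STATUS holds the P-state the CPU is in right now. We want the current CPU frequency, so why

compute the last requested one? So I asked the author of the commit: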

On Tue, Oct 27, 2015 at 04:47:05PM +0000, Chen, Yu C wrote:
> Hi, Ross,(Viresh)
> Sorry to disturb you guys,
> when I was using acpi-cpufreq driver, I found that, 
> the code used the MSR_IA32_PERF_CTL rather than
> MSR_IA32_PERF_STATUS to determine the state current
> cpu is running on. And I searched the git log, found that
> there was a commit converting the MSR_IA32_PERF_STATUS
> to MSR_IA32_PERF_CTL in commit 8673b83bf2f013379453b4779047bf3c6ae387e4
> " acpi-cpufreq: set current frequency based on target P-State"
>
> Although I studied the git log for this commit, I still can not quite understand 
> why we  change MSR_IA32_PERF_STATUS to MSR_IA32_PERF_CTL, because in
> ACPI spec, we should check _PSS.package.status with MSR_IA32_PERF_STATUS
> (not MSR_IA32_PERF_CTL )to find if the cpu is really in that pstate.
> Can you please explain a little more why we changed it, or is there a Bugzilla thread describing
> the history? Thank you !
>

I think the commit explains it. From what I recall (and I'm no expert),
the problem was that the function get_cur_val() needs to get the _target
frequency_ but it was using MSR_IA32_PERF_STATUS which returns the
_current frequency_ (which may be different). MSR_IA32_PERF_CTL returns
the target frequency so I used it.

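His point is that get_cur_val wants the target frequency, whereas STATUS gives

the current frequency, and the two may differ. My later understanding is this: the target frequency is an integer, the ideal frequency

that was set at the time, while the actual current frequency varies continuously and may not be an integer. So we have to take the last

target_frequency written (presumably treated as the current frequency) and match it against the status values in the table.

Here is the code from the version his patch was based on: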

static unsigned int get_cur_freq_on_cpu(unsigned int cpu)
{
        struct acpi_cpufreq_data *data = per_cpu(acfreq_data, cpu);
        unsigned int freq;
        unsigned int cached_freq;

        pr_debug("get_cur_freq_on_cpu (%d)\n", cpu);

        if (unlikely(data == NULL ||
                     data->acpi_data == NULL || data->freq_table == NULL)) {
                return 0;
        }

        cached_freq = data->freq_table[data->acpi_data->state].frequency;
        freq = extract_freq(get_cur_val(cpumask_of(cpu)), data);
        if (freq != cached_freq) {
                /*
                 * The dreaded BIOS frequency change behind our back.
                 * Force set the frequency on next target call.
                 */
                data->resume = 1;
        }

        pr_debug("cur freq = %u\n", freq);

        return freq;
}

With intel_pstate, the frequency comes from:

static unsigned int intel_pstate_get(unsigned int cpu_num)
{
	struct sample *sample;
	struct cpudata *cpu;

	cpu = all_cpu_data[cpu_num];
	if (!cpu)
		return 0;
	sample = &cpu->sample;
	return sample->freq;
}

The frequency comes from the per-CPU sample->freq. Where is this value computed?

In a timer:

cpu->timer.expires = jiffies + HZ/100;

	if (!hwp_active)
		cpu->timer.function = intel_pstate_timer_func;
	else
		cpu->timer.function = intel_hwp_timer_func;
Look first at intel_pstate_timer_func, the timer implementation for the non-HWP case.

The timer period is 10 ms. The sampling function first computes the CPU utilization:

static inline void intel_pstate_sample(struct cpudata *cpu)
{
	//read the current APERF register
	rdmsrl(MSR_IA32_APERF, aperf);
	//read the current MPERF register
	rdmsrl(MSR_IA32_MPERF, mperf);
	//read the current TSC
	tsc = native_read_tsc();
	//compute the deltas between the previous sample and this one
	cpu->sample.aperf = aperf - cpu->prev_aperf;
	cpu->sample.mperf = mperf - cpu->prev_mperf;
	cpu->sample.tsc = tsc - cpu->prev_tsc;
	//ratio of the APERF delta to the MPERF delta (as a percentage)
	core_pct = div64_u64(int_tofp(sample->aperf) * int_tofp(100),
					int_tofp(sample->mperf));
	//this ratio is the actual frequency relative to the fixed P-state frequency:
	//think of APERF as an actual-frequency counter and MPERF as a TSC-frequency counter
	//so the frequency below is computed relative to max_pstate
	sample->freq = fp_toint(
		mul_fp(int_tofp(
			cpu->pstate.max_pstate * cpu->pstate.scaling / 100),
			core_pct));
			
	cpu->prev_aperf = aperf;
	cpu->prev_mperf = mperf;
	cpu->prev_tsc = tsc;
}
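The frequency formula can be checked against the trace output shown later in this article. The sketch below reuses the fixed-point macros in user space; max_pstate = 25 and scaling = 100000 are assumptions inferred from that trace (they reproduce its freq values), not numbers stated by the driver source.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 8
#define int_tofp(X) ((int64_t)(X) << FRAC_BITS)
#define fp_toint(X) ((X) >> FRAC_BITS)

/* Fixed-point multiply; operands here are non-negative. */
static int64_t mul_fp(int64_t x, int64_t y)
{
	return (x * y) >> FRAC_BITS;
}

/*
 * Estimate the average frequency in kHz over a sampling window from
 * the APERF and MPERF deltas, relative to max_pstate * scaling_khz.
 */
static int64_t sample_freq_khz(uint64_t aperf_delta, uint64_t mperf_delta,
			       int max_pstate, int scaling_khz)
{
	int64_t core_pct;

	assert(mperf_delta != 0);

	/* aperf/mperf as a percentage with 8 fractional bits */
	core_pct = int_tofp(aperf_delta) * int_tofp(100) /
		   int_tofp(mperf_delta);

	return fp_toint(mul_fp(int_tofp(max_pstate * scaling_khz / 100),
			       core_pct));
}
```

With aperf=132038 and mperf=393968 this gives 837792, and with aperf=2814556 and mperf=8795229 it gives 800000, the same freq values as the corresponding trace lines.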

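Once the utilization is computed, the CPU uses it to estimate the frequency to set next.

Setting the frequency first requires knowing how busy the CPU is. How would you compute that? Take 30 seconds to think about it.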












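Intuitively, utilization should be the time the CPU spends in C0 divided by the total elapsed time,

but the intel_pstate implementation is more subtle. According to its comment: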

core_busy is the ratio of actual performance to max
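First, it uses the ratio computed in the previous step, aperf/mperf: the actual running frequency relative to the fixed frequency

while the CPU is in C0. In other words, it uses the CPU's history in its current P-state to decide whether the next

setting should be pstate+m or pstate-n.

Only on top of that does it account for the effect of idle time on utilization. The timer fires at a fixed period,

and only while the CPU is in C0, so if the interval between two samples turns out to be longer than 3 timer periods,

the CPU has probably been idle for a while, and the core_busy computed so far is scaled down proportionally. That is

where this code comes from: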

	core_busy = cpu->sample.core_pct_busy;
	max_pstate = int_tofp(cpu->pstate.max_pstate);
	current_pstate = int_tofp(cpu->pstate.current_pstate);
	core_busy = mul_fp(core_busy, div_fp(max_pstate, current_pstate));
	//up to here, core_busy is the busy percentage expressed relative to current_pstate
	sample_time = pid_params.sample_rate_ms  * USEC_PER_MSEC;
	duration_us = ktime_us_delta(cpu->sample.time,
				     cpu->last_sample_time);
	if (duration_us > sample_time * 3) {
		sample_ratio = div_fp(int_tofp(sample_time),
				      int_tofp(duration_us));
		core_busy = mul_fp(core_busy, sample_ratio);
	}

The next step is to estimate the next pstate frequency from this core_busy.

The utilization code uses several helpers: int_tofp, fp_toint, mul_fp, div_fp.

What do they do? Here are their definitions:

#define FRAC_BITS 8
#define int_tofp(X) ((int64_t)(X) << FRAC_BITS)
#define fp_toint(X) ((X) >> FRAC_BITS)


static inline int32_t mul_fp(int32_t x, int32_t y)
{
	return ((int64_t)x * (int64_t)y) >> FRAC_BITS;
}

static inline int32_t div_fp(s64 x, s64 y)
{
	return div64_s64((int64_t)x << FRAC_BITS, y);
}
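So each number is simply multiplied by 256 before the computation. Why do this? To reduce

the loss of precision. Consider a simple piece of arithmetic:

A=30, B=20, C=10,

Find:

(A/B)×C.

With pure integer arithmetic, A/B truncates to 1 and the final result is 10. But

if the fractional part of the division is kept, A/B is 1.5 and the final result should be 15.

Hence the trick: multiply A by 256 first, divide by B, multiply by C, and finally divide by 256:

(30×256 / 20 × 10) / 256 = 15, which is the value we want, because scaling A up before dividing by B preserves the precision.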
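The arithmetic is easy to reproduce in a few lines of C. This is only a sketch of the idea with made-up function names, using the same FRAC_BITS of 8 as the kernel macros.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 8

/* (a / b) * c with plain integer division: the fraction of a/b is lost. */
static int64_t ratio_times_int(int64_t a, int64_t b, int64_t c)
{
	assert(b != 0);
	return (a / b) * c;
}

/* Same computation, but a is scaled by 2^FRAC_BITS before dividing. */
static int64_t ratio_times_fp(int64_t a, int64_t b, int64_t c)
{
	int64_t q;

	assert(b != 0);
	q = (a << FRAC_BITS) / b;	/* a/b with 8 fractional bits */
	return (q * c) >> FRAC_BITS;	/* drop the fractional bits at the end */
}
```

For A=30, B=20, C=10 the first function returns 10 and the second returns 15.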


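Next, look at pid_calc, which turns the busy level into the amount by which the pstate is adjusted.

pid_calc is quite hard to follow, so to make it easier, first turn on tracing of the CPU frequency adjustments:

/sys/kernel/debug/tracing# echo 1 > events/power/pstate_sample/enable
This gives the following log when idle: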

root@acpi-Surface-Pro-3:/sys/kernel/debug/tracing# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 28/28   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
          -0     [003] ..s.  1006.461715: pstate_sample: core_busy=33 scaled=20 from=8 to=8 mperf=393968 aperf=132038 tsc=129459507 freq=837792
          -0     [002] ..s.  1006.473728: pstate_sample: core_busy=33 scaled=20 from=8 to=8 mperf=2315411 aperf=774904 tsc=129407775 freq=836621
          compiz-1915  [001] ..s.  1006.729803: pstate_sample: core_busy=36 scaled=3 from=8 to=8 mperf=2950901 aperf=1062331 tsc=677911411 freq=900000
          -0     [000] .Ns.  1006.734043: pstate_sample: core_busy=35 scaled=3 from=9 to=8 mperf=1834252 aperf=658086 tsc=708510673 freq=896875
          -0     [002] ..s.  1006.834113: pstate_sample: core_busy=35 scaled=3 from=8 to=8 mperf=470223 aperf=166567 tsc=897893538 freq=885546
          -0     [001] ..s.  1007.234221: pstate_sample: core_busy=32 scaled=1 from=8 to=8 mperf=1768226 aperf=576339 tsc=1256743136 freq=814843
          -0     [003] ..s.  1007.410598: pstate_sample: core_busy=33 scaled=0 from=8 to=8 mperf=653057 aperf=216885 tsc=2364126280 freq=830175
          -0     [001] ..s.  1007.438399: pstate_sample: core_busy=32 scaled=4 from=8 to=8 mperf=224712 aperf=71919 tsc=508707663 freq=800097
          -0     [000] ..s.  1007.558411: pstate_sample: core_busy=32 scaled=1 from=8 to=8 mperf=2288240 aperf=735900 tsc=2053971390 freq=803906
          -0     [002] ..s.  1007.734633: pstate_sample: core_busy=32 scaled=0 from=8 to=8 mperf=2512577 aperf=804343 tsc=2243630862 freq=800292
          -0     [003] ..s.  1008.235028: pstate_sample: core_busy=32 scaled=1 from=8 to=8 mperf=374626 aperf=119932 tsc=2054051738 freq=800292
          -0     [003] ..s.  1008.410963: pstate_sample: core_busy=32 scaled=5 from=8 to=8 mperf=220113 aperf=70457 tsc=438341412 freq=800195
          -0     [002] ..s.  1008.423235: pstate_sample: core_busy=32 scaled=1 from=8 to=8 mperf=1141878 aperf=365573 tsc=1715641350 freq=800292
          -0     [002] ..s.  1008.546972: pstate_sample: core_busy=32 scaled=7 from=8 to=8 mperf=2003189 aperf=641141 tsc=308289563 freq=800097
          -0     [000] ..s.  1008.735075: pstate_sample: core_busy=32 scaled=0 from=8 to=8 mperf=664312 aperf=212608 tsc=2931640462 freq=800097
          -0     [002] ..s.  1008.891180: pstate_sample: core_busy=32 scaled=2 from=8 to=8 mperf=2338401 aperf=748481 tsc=857589037 freq=800195
          -0     [002] ..s.  1009.079292: pstate_sample: core_busy=32 scaled=5 from=8 to=8 mperf=1411451 aperf=451765 tsc=468676075 freq=800097
          -0     [001] ..s.  1009.235418: pstate_sample: core_busy=32 scaled=0 from=8 to=8 mperf=8795229 aperf=2814556 tsc=4477245287 freq=800000
          -0     [002] ..s.  1009.351460: pstate_sample: core_busy=32 scaled=3 from=8 to=8 mperf=2070464 aperf=662743 tsc=678103050 freq=800195
          -0     [003] ..s.  1009.411830: pstate_sample: core_busy=32 scaled=0 from=8 to=8 mperf=432350 aperf=138410 tsc=2493642800 freq=800292
          -0     [002] ..s.  1009.423517: pstate_sample: core_busy=32 scaled=13 from=8 to=8 mperf=353063 aperf=113031 tsc=179528450 freq=800292
          -0     [000] ..s.  1009.563727: pstate_sample: core_busy=32 scaled=1 from=8 to=8 mperf=1020485 aperf=326586 tsc=2064572363 freq=800000
          -0     [002] ..s.  1009.659634: pstate_sample: core_busy=32 scaled=3 from=8 to=8 mperf=915339 aperf=293031 tsc=588282200 freq=800292
          -0     [002] ..s.  1009.735708: pstate_sample: core_busy=32 scaled=12 from=8 to=8 mperf=203325 aperf=65090 tsc=189535625 freq=800292
          -0     [003] ..s.  1010.236226: pstate_sample: core_busy=32 scaled=1 from=8 to=8 mperf=365864 aperf=117128 tsc=2053970088 freq=800292
          -0     [003] ..s.  1010.412155: pstate_sample: core_busy=32 scaled=5 from=8 to=8 mperf=527138 aperf=168738 tsc=438323712 freq=800195
          -0     [002] ..s.  1010.440104: pstate_sample: core_busy=32 scaled=1 from=8 to=8 mperf=1380190 aperf=441872 tsc=1754992250 freq=800292
          -0     [001] ..s.  1010.440106: pstate_sample: core_busy=32 scaled=0 from=8 to=8 mperf=5236747 aperf=1675815 tsc=3001464150 freq=800000

The key part for the adjustment is from=8 to=8, meaning the pstate goes from 800 MHz to 800 MHz.

First, why is from 800 MHz? It follows from the min/max range introduced earlier:

cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
800000
cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
2900000
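Next, why is to also 800 MHz? pid_calc computes a ctl value and passes it to

intel_pstate_set_pstate, which would like to set the pstate to 8-ctl, but the pstate cannot go below the minimum of 8,

so this time it is set to 8. The key is still how ctl is computed.

Three values are involved.

The first is core_busy, the P-state ratio while the CPU is in C0, computed from aperf/mperf.

It does not account for idle time, because mperf and aperf only increment while the CPU is in C0.

The second is scaled. It weights the CPU's running state by idle time and is also normalized

by the ratio between current_pstate and max_pstate, which gives the utilization relative to the previous pstate frequency.

In the trace log above, scaled takes values such as 0, 1, 3, 4, 5, 7, 13, and 20.

The third is ctrl, which is computed from scaled. The rough formula is: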

idle = setpoint - busy;
pterm = (idle * p_gain)/256;
integral += idle;
dterm = ((idle-last_idle) * d_gain)/256;
result = pterm + (integral * i_gain)/256 + dterm;
result = result + 128;
ctrl = result/256;
The parameter values differ between x86 models. On the Surface Pro 3 they are:

grep . /sys/kernel/debug/pstate_snb/*
/sys/kernel/debug/pstate_snb/deadband:0
/sys/kernel/debug/pstate_snb/d_gain_pct:0
/sys/kernel/debug/pstate_snb/i_gain_pct:0
/sys/kernel/debug/pstate_snb/p_gain_pct:20
/sys/kernel/debug/pstate_snb/sample_rate_ms:10
/sys/kernel/debug/pstate_snb/setpoint:97

Since i_gain and d_gain are both 0, the result comes down to pterm. Substituting:

		  fp_error = 97*256 - busy
		  p_gain = 20 * 256 * 256 /(100*256) = 20*256/100 = 51
		  pterm = mul_fp(pid->p_gain, fp_error) = 
					
result = pterm+128
result = result/256


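Now turn tracing on and run a stress test to collect data on pstate changes. We

extract the data for a single CPU and pick out the turning points where the frequency goes up and down: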

core_busy=80 scaled=252 from=8 to=29 mperf=29846497 aperf=24111378 tsc=29913941 freq=2019531 ctl=-31
core_busy=103 scaled=26 from=27 to=13 mperf=1459168 aperf=1517308 tsc=89947125 freq=2599609 ctl=14
First verify the frequency increase:

fp_err = 97*256 - 252*256 = -39680

p_gain = 51

pterm = 51*( -39680)/256 = -7905

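result = (-7905 + 128) >> 8 = -7777 >> 8 = -31 (fp_toint is an arithmetic right shift, which rounds toward negative infinity; truncating division would give -30, but the trace shows ctl=-31)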


Now verify the frequency decrease:

fp_err = 97 * 256 - 26*256 = 18176

p_gain = 51

pterm = 51*18176 /256 = 3621

result = (3621 + 128)/256 = 14
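The two hand calculations can be replayed with a small user-space sketch of the proportional-only case. It is an illustration under assumptions: the function names are made up, the integral and derivative terms are dropped because both gains are 0 here, the scaled value from the trace is treated as a whole number, and the pstate bounds of 8 and 29 are taken from cpuinfo_min_freq and cpuinfo_max_freq above. Right shifts of negative values are written as an explicit floor division so the result does not depend on the compiler.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 8

/* Floor division by 2^FRAC_BITS, i.e. what an arithmetic right shift does. */
static int64_t fp_floor(int64_t x)
{
	int64_t q = x / (1 << FRAC_BITS);

	if (x % (1 << FRAC_BITS) < 0)
		q--;
	return q;
}

/* Proportional-only PID step: returns the ctl value. */
static int pid_ctl(int setpoint, int scaled_busy, int p_gain_pct)
{
	/* 20 percent becomes 20 * 256 / 100 = 51 in fixed point */
	int64_t p_gain = ((int64_t)p_gain_pct << FRAC_BITS) / 100;
	int64_t fp_error = (int64_t)(setpoint - scaled_busy) * (1 << FRAC_BITS);
	int64_t pterm = fp_floor(p_gain * fp_error);

	/* add one half (128) for rounding, then drop the fraction */
	return (int)fp_floor(pterm + (1 << (FRAC_BITS - 1)));
}

/* New pstate is from - ctl, clamped to [min_pstate, max_pstate]. */
static int next_pstate(int from, int ctl, int min_pstate, int max_pstate)
{
	int to = from - ctl;

	assert(min_pstate <= max_pstate);
	if (to < min_pstate)
		to = min_pstate;
	if (to > max_pstate)
		to = max_pstate;
	return to;
}
```

With setpoint 97 and p_gain_pct 20, scaled=252 gives ctl=-31 and takes pstate 8 to the cap of 29, while scaled=26 gives ctl=14 and takes pstate 27 to 13, matching both trace lines.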

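What a mess of an algorithm. Whenever busy is greater than 97 the frequency goes up,

and whenever it is less than 97 the frequency goes down. So what exactly is this 97 (the setpoint)...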


4. Since we are on the subject of CPU frequency in sysfs, let us also cover the frequency-related fields in /proc/cpuinfo

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 78
model name	: Intel(R) Core(TM) m5-6Y57 CPU @ 1.10GHz
stepping	: 3
microcode	: 0x37
cpu MHz		: 400.000
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 1
cpu cores	: 2
apicid		: 3
initial apicid	: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm hwp hwp_notify hwp_act_window hwp_epp intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1
bugs		:
bogomips	: 3023.86
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

Two fields are of interest: the 1.10GHz in model name and the 400.000 in cpu MHz. Start with the former.

model name comes from the x86_model_id string of struct cpuinfo_x86, printed in show_cpuinfo,

and that string is filled in by get_model_name in arch/x86/kernel/cpu/common.c:

v = (unsigned int *)c->x86_model_id;
        cpuid(0x80000002, &v[0], &v[1], &v[2], &v[3]);
        cpuid(0x80000003, &v[4], &v[5], &v[6], &v[7]);
        cpuid(0x80000004, &v[8], &v[9], &v[10], &v[11]);
        c->x86_model_id[48] = 0;
More cpuid leaves. Look at the last one, 0x80000004, and search the SDM's description of CPUID:

How Brand Strings Work
To use the brand string method, execute CPUID with EAX input of 8000002H through 80000004H. For each input
value, CPUID returns 16 ASCII characters using EAX, EBX, ECX, and EDX. The returned string will be NULL-terminated.

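In other words, these strings are burned in when the CPU is manufactured, and they include the CPU's base frequency. The SDM also gives a diagram of the extraction algorithm:

scan the returned string in reverse to find the base frequency, here 1.1 GHz.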
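A rough user-space version of that extraction is sketched below. It is not the SDM's exact algorithm: the function name is made up, it locates the first "GHz" or "MHz" token with strstr and then walks backwards over the number, and it ignores the "THz" case.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return the base frequency in MHz parsed from a brand string such as
 * "... CPU @ 1.10GHz", or 0 if no usable "MHz"/"GHz" suffix is found.
 */
static unsigned int brand_string_base_mhz(const char *brand)
{
	const char *unit, *p;
	double value;
	int is_ghz;

	assert(brand != NULL);

	unit = strstr(brand, "GHz");
	is_ghz = (unit != NULL);
	if (!unit)
		unit = strstr(brand, "MHz");
	if (!unit || unit == brand)
		return 0;

	/* Walk backwards over the digits and the decimal point. */
	p = unit;
	while (p > brand && (isdigit((unsigned char)p[-1]) || p[-1] == '.'))
		p--;
	if (p == unit)
		return 0;

	value = strtod(p, NULL);
	/* Round to the nearest MHz. */
	return (unsigned int)(value * (is_ghz ? 1000.0 : 1.0) + 0.5);
}
```

Applied to the model name string in the /proc/cpuinfo output above, it returns 1100.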

Now look at cpu MHz:

if (cpu_has(c, X86_FEATURE_TSC)) {
                unsigned int freq = cpufreq_quick_get(cpu);

                if (!freq)
                        freq = cpu_khz;
                seq_printf(m, "cpu MHz\t\t: %u.%03u\n",
                           freq / 1000, (freq % 1000));
        }

The comment on cpufreq_quick_get says it all: this value is the same as scaling_cur_freq in sysfs.

/**
 * cpufreq_quick_get - get the CPU frequency (in kHz) from policy->cur
 * @cpu: CPU number
 *
 * This is the last known freq, without actually getting it from the driver.
 * Return value will be same as what is shown in scaling_cur_freq in sysfs.
 */
unsigned int cpufreq_quick_get(unsigned int cpu)
{
        struct cpufreq_policy *policy;
        unsigned int ret_freq = 0;

        if (cpufreq_driver && cpufreq_driver->setpolicy && cpufreq_driver->get)
                return cpufreq_driver->get(cpu);

        policy = cpufreq_cpu_get(cpu);
        if (policy) {
                ret_freq = policy->cur;
                cpufreq_cpu_put(policy);
        }

        return ret_freq;
}
As the code shows, unless the driver implements both setpolicy and get, the value returned is policy->cur. When is that assigned?

Each time cpufreq switches frequency; see:

static int __target_index(struct cpufreq_policy *policy,
                          struct cpufreq_frequency_table *freq_table, int index)
{
    freqs.new = freq_table[index].frequency;
    for_each_cpu(freqs->cpu, policy->cpus)
        if (likely(policy) && likely(policy->cpu == freqs->cpu))
                        policy->cur = freqs->new;
}

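First, every CPU has a policy (governor), although in most cases all CPUs use the same kind of policy (it is also possible for some to use

performance and others ondemand, and so on). policy->cpu is the number of the 'managing' CPU. The policy's

current frequency is then set to the frequency being switched to, and that is the cpu MHz value in /proc/cpuinfo. Since it records the last requested target rather than a measurement, this value

is not accurate at all.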


5. Next: scaling_min_freq and scaling_max_freq

show_one(scaling_min_freq, min);
show_one(scaling_max_freq, max);
Given

#define show_one(file_name, object)			\
static ssize_t show_##file_name				\
(struct cpufreq_policy *policy, char *buf)		\
{							\
	return sprintf(buf, "%u\n", policy->object);	\
}
these two values are policy->min and policy->max.

According to this code in the acpi_cpufreq driver:

acpi_cpufreq_cpu_init
	policy->min = policy->cpuinfo.min_freq = min_freq;
	policy->max = policy->cpuinfo.max_freq = max_freq;

under acpi_cpufreq these two values are initially the same as cpuinfo_max_freq and cpuinfo_min_freq.

6. bios_limit

static ssize_t show_bios_limit(struct cpufreq_policy *policy, char *buf)
{
	unsigned int limit;
	int ret;
	if (cpufreq_driver->bios_limit) {
		ret = cpufreq_driver->bios_limit(policy->cpu, &limit);
		if (!ret)
			return sprintf(buf, "%u\n", limit);
	}
	return sprintf(buf, "%u\n", policy->cpuinfo.max_freq);
}

For acpi_cpufreq, the bios_limit callback is:

int acpi_processor_get_bios_limit(int cpu, unsigned int *limit)
{
	struct acpi_processor *pr;

	pr = per_cpu(processors, cpu);
	if (!pr || !pr->performance || !pr->performance->state_count)
		return -ENODEV;
	*limit = pr->performance->states[pr->performance_platform_limit].
		core_frequency * 1000;
	return 0;
}
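So it returns the frequency that corresponds to the current CPU's performance_platform_limit.

performance_platform_limit is an index that comes from the return value of _PPC. This method is used

together with _PSS and indicates from which _PSS entry onward the P-states may be used, for example: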

Return Value:
An Integer containing the range of states supported
0 – States 0 through nth state are available (all states available)
1 – States 1 through nth state are available
2 – States 2 through nth state are available
…
n – State n is available only
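If performance_platform_limit is 0, the resulting limit is _PSS.states[0].

Since _PSS is sorted by descending frequency, _PSS.states[0] is the highest frequency.

If it is 1, the usable range runs from the second-highest frequency down to the lowest.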
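The relationship between the _PPC index and the limit can be sketched as a plain array lookup. The function name is made up, and the range check on the index is an addition of this sketch (the kernel function shown above does not perform it); any frequency values passed in are the caller's sample data.

```c
#include <assert.h>
#include <stddef.h>

/*
 * states_mhz: _PSS core frequencies in descending order (MHz).
 * ppc: value returned by _PPC, i.e. index of the highest state allowed.
 * Returns the limit in kHz, or 0 if the index is out of range.
 */
static unsigned int bios_limit_khz(const unsigned int *states_mhz,
				   size_t state_count, size_t ppc)
{
	assert(states_mhz != NULL);

	if (state_count == 0 || ppc >= state_count)
		return 0;
	return states_mhz[ppc] * 1000;
}
```

With states of 2501, 2500, 2400 and 800 MHz, a _PPC value of 0 yields 2501000 and a value of 1 yields 2500000.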

9. scaling_available_frequencies

/sys/devices/system/cpu/cpu0/cpufreq$ cat scaling_available_frequencies
2501000 2500000 2400000 2200000 2100000 1900000 1800000 1700000 1600000 1500000 1300000 1200000 1100000 1000000 800000 775000
drivers/cpufreq/freq_table.c:
static ssize_t scaling_available_frequencies_show(struct cpufreq_policy *policy,
						  char *buf)
{
	return show_available_freqs(policy, buf, false);
}
cpufreq_attr_available_freq(scaling_available);


This simply walks the cpufreq table supported by the policy that the CPU belongs to:

struct cpufreq_frequency_table *pos, *table = policy->freq_table;
cpufreq_for_each_valid_entry(pos, table) {
count += sprintf(&buf[count], "%d ", pos->frequency);
}
10. affected_cpus

/sys/devices/system/cpu/cpu0/cpufreq$ cat affected_cpus
0

According to the code:
static ssize_t show_affected_cpus(struct cpufreq_policy *policy, char *buf)
{
	return cpufreq_show_cpus(policy->cpus, buf);
}

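This prints which CPUs the policy that this CPU belongs to can manage. Here the policy of cpu0 manages only

cpu0 itself. By the logic of the code, if cpu0, cpu1, cpu2, and cpu3 all use the same governor type, shouldn't they

also share one policy? Why do these 4 CPUs use different policies?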



11. scaling_setspeed

This file exists specifically for the userspace governor; the user writes the desired frequency into it.

/sys/devices/system/cpu/cpu0/cpufreq# echo userspace > scaling_governor
root@acpi-Surface-Pro-3:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_setspeed
775000
The implementation sets the frequency through the driver's target:

__cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_L);
However, the value still has to lie between policy->min and policy->max:

	/* Make sure that target_freq is within supported range */
	if (target_freq > policy->max)
		target_freq = policy->max;
	if (target_freq < policy->min)
		target_freq = policy->min;


12. freqdomain_cpus

Only acpi_cpufreq exposes this file:

static ssize_t show_freqdomain_cpus(struct cpufreq_policy *policy, char *buf)
{
	struct acpi_cpufreq_data *data = per_cpu(acfreq_data, policy->cpu);

	return cpufreq_show_cpus(data->freqdomain_cpus, buf);
}
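This introduces the concept of a cpufreq domain: if cpu0, cpu1, cpu2, and cpu3 all belong to domain0,

then whenever any one of them changes frequency, the frequency of the others has to change with it. This is indicated by the _PSD method

in the ACPI specification: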

 Scope (\_PR.CPU0)
{

 Name (_PSS, Package (0x10)  // _PSS: Performance Supported States
        {
            Package (0x06)
            {
                0x000009C5,
                0x00003A98,
                0x0000000A,
                0x0000000A,
                0x00001D00,
                0x00001D00
            },

            Package (0x06)
            {
                0x000009C4,
                0x00003A98,
                0x0000000A,
                0x0000000A,
                0x00001900,
                0x00001900
            },
        Method (_PSD, 0, NotSerialized)  // _PSD: Power State Dependencies
        {
            If (LNot (PSDF))
            {
                Store (TCNT, Index (DerefOf (Index (HPSD, Zero)), 0x04))
                Store (TCNT, Index (DerefOf (Index (SPSD, Zero)), 0x04))
                Store (Ones, PSDF)
            }

            If (And (PDC0, 0x0800))
            {
                Return (HPSD)
            }

            Return (SPSD)
        }

}

Next, the members of the HPSD package:

        Name (HPSD, Package (0x01)
        {
            Package (0x05)
            {
                0x05,
                Zero,
                Zero,
                0xFE,
                0x80
            }
        })
According to the ACPI spec:

Package {
NumEntries // Integer
Revision // Integer (BYTE)
Domain // Integer (DWORD)
CoordType // Integer (DWORD)
NumProcessors // Integer (DWORD)
}
In the end, the driver finds which Domain the CPU belongs to and puts the CPUs of the same domain

into data->freqdomain_cpus. That is where this file comes from.
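The grouping itself is simple and can be sketched as follows. This is a simplification with made-up names: it assumes the Domain number of each CPU's _PSD package has already been read into an array, and it uses a plain bitmask where the kernel uses a cpumask.

```c
#include <assert.h>
#include <stddef.h>

/*
 * domain[i] is the _PSD Domain number reported for CPU i.
 * Returns a bitmask of all CPUs in the same domain as `cpu`.
 */
static unsigned long freqdomain_mask(const unsigned int *domain,
				     size_t ncpus, size_t cpu)
{
	unsigned long mask = 0;
	size_t i;

	assert(domain != NULL && cpu < ncpus);
	assert(ncpus <= sizeof(mask) * 8);

	for (i = 0; i < ncpus; i++)
		if (domain[i] == domain[cpu])
			mask |= 1UL << i;
	return mask;
}
```

If all four CPUs report domain 0, every CPU gets the mask 0xF; if the domains are 0, 0, 1, 1, then cpu0 gets 0x3 and cpu3 gets 0xC.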



13. turbostat

You cannot talk about frequency without mentioning turbostat. The test machine is still the Surface Pro 3.

root@acpi-Surface-Pro-3:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_min_freq
800000
root@acpi-Surface-Pro-3:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_max_freq
2900000

root@acpi-Surface-Pro-3:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_governor
powersave
root@acpi-Surface-Pro-3:/sys/devices/system/cpu/cpu0/cpufreq# cat cpuinfo_max_freq
2900000
root@acpi-Surface-Pro-3:/sys/devices/system/cpu/cpu0/cpufreq# cat cpuinfo_min_freq
800000
processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 69
model name      : Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz
stepping        : 1
microcode       : 0x1c
cpu MHz         : 800.195

First, sample at idle with the powersave governor:

     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz
       -       1    0.07     813    2494
       0       1    0.11     815    2494
       2       1    0.11     815    2494
       1       0    0.03     809    2494
       3       0    0.05     806    2494

The cores essentially run at scaling_min_freq.

Now look at the turbostat data under load. For comparison, first pin a stress program to cpu1:

# taskset -c 1 stress -c 1 -t 120
     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz
       -     779   27.45    2839    2494
       0       1    0.05    2552    2494
       2     253    9.76    2593    2494
       1    2863   99.96    2864    2494
       3       1    0.02    2705    2494
The busy frequency of cpu1 is essentially the value of scaling_max_freq.

Now load all four CPUs:

root@acpi-Surface-Pro-3:/sys/devices/system/cpu/cpu0/cpufreq# stress -c 4 -t 120

     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz
       -    2593  100.00    2593    2494
       0    2593  100.00    2593    2493
       2    2593  100.00    2593    2493
       1    2594  100.00    2594    2494
       3    2594  100.00    2594    2494

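As you can see, reaching scaling_max_freq is not guaranteed: the clock ratio has to be shared, and the maximum turbo ratio typically drops as more cores become active.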


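Next, let's look at how the most important column, Bzy_MHz, is computed.

From here on the example is no longer the Surface Pro 3 but Skylake, which is riddled with bugs.

First, a few frequencies need to be clarified.

The first is the TSC frequency. On CPUs before Skylake, the TSC frequency is computed as n * 100 MHz. Taking Sandy Bridge as an example, the register MSR_PLATFORM_INFO (0xCE) says what n is: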

15:8 Package Maximum Non-Turbo Ratio (R/O)
The is the ratio of the frequency that invariant TSC runs at.
Frequency = ratio * 100 MHz.
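On Skylake, the 100 MHz has to be replaced by 24 MHz, because the TSC moved from the BCLK domain to the crystal clock domain. The actual TSC frequency is now m * 24 MHz, where m is derived from the values returned by CPUID(0x15):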

"TSC frequency" = "core crystal clock frequency" * EBX/EAX.
Here the core crystal clock frequency is 24 MHz.

The crystal clock domain is the clock domain that the Skylake TSC counts in, as the following turbostat commit explains:

git commit a2b7b74945dbfe5d734eafe8aa52f9f1f8bc6931
tools/power turbostat: SKL: Adjust for TSC difference from base frequency
On a Skylake with 1500MHz base frequency,
the TSC runs at 1512MHz.

This is because the TSC is no longer in the n*100 MHz BCLK domain,
but is now in the m*24MHz crystal clock domain. (24 MHz * 63 = 1512 MHz)

This adds error to several calculations in turbostat,
unless the TSC sample sizes are adjusted for this difference.


First, the basic sysfs information on the Skylake machine:

root@chenyu-Broadwell-Client-platform:/sys/devices/system/cpu/cpu0/cpufreq# grep . *
affected_cpus:0
cpuinfo_cur_freq:500039
cpuinfo_max_freq:2800000
cpuinfo_min_freq:400000
cpuinfo_transition_latency:4294967295
related_cpus:0
scaling_available_governors:performance powersave
scaling_cur_freq:500039
scaling_driver:intel_pstate
scaling_governor:powersave
scaling_max_freq:2800000
scaling_min_freq:400000
scaling_setspeed:


Now let's look at the base frequency and the actual TSC frequency:

 turbostat -i 10 --debug
turbostat version 4.8 26-Sep, 2015 - Len Brown 
CPUID(0): GenuineIntel 22 CPUID levels; family:model:stepping 0x6:4e:3 (6:78:3)
CPUID(6): APERF, DTS, PTM, EPB
CPUID(0x15): eax_crystal: 2 ebx_tsc: 126 ecx_crystal_hz: 0
TSC: 1512 MHz (24000000 Hz * 126 / 2 / 1000000)
RAPL: 58254 sec. Joule Counter Range, at 4 Watts
cpu2: MSR_NHM_PLATFORM_INFO: 0x4043df9010f00
4 * 100 = 400 MHz max efficiency frequency
15 * 100 = 1500 MHz base frequency

As shown, the TSC frequency is 1512 MHz, computed as 24 MHz * EBX / EAX. Plugging in the CPUID(0x15) result (EBX = 126, EAX = 2) gives 24 MHz * 126 / 2 = 1512 MHz:

	crystal_hz = 24000000;	/* 24 MHz */
	if (crystal_hz) {
		tsc_hz = (unsigned long long) crystal_hz * ebx_tsc / eax_crystal;
		if (debug)
			fprintf(stderr, "TSC: %lld MHz (%d Hz * %d / %d / 1000000)\n",
				tsc_hz / 1000000, crystal_hz, ebx_tsc, eax_crystal);
	}
Then comes the base frequency, shown as 100 MHz * 15 = 1500 MHz:

	ratio = (msr >> 8) & 0xFF;
	fprintf(stderr, "%d * %.0f = %.0f MHz base frequency\n",
		ratio, bclk, ratio * bclk);

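The ratio between the base frequency and the actual TSC frequency is used to compute the real Bzy_MHz.

The ratio is computed in calculate_tsc_tweak():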

static void
calculate_tsc_tweak()
{
	unsigned long long msr;
	unsigned int base_ratio;

	get_msr(base_cpu, MSR_NHM_PLATFORM_INFO, &msr);
	base_ratio = (msr >> 8) & 0xFF;
	base_hz = base_ratio * bclk * 1000000;
	tsc_tweak = base_hz / tsc_hz;
}
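As shown, this value is 1500 / 1512 ≈ 0.992 (tsc_tweak is a floating-point value in turbostat, so the division does not truncate to 0). Bzy_MHz is then computed as: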

	/* Bzy_MHz */
	if (has_aperf)
		outp += sprintf(outp, "%8.0f",
			1.0 * t->tsc * tsc_tweak / units * t->aperf / t->mperf / interval_float);
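In the expression above, tsc, aperf and mperf are all deltas between two samples.
As mentioned earlier, mperf increments at the base frequency, while aperf increments at the speed the CPU is actually running at. See the SDM descriptions: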

IA32_MPERF: Increments at fixed interval (relative to TSC freq.) when the logical processor is in C0. Cleared upon overflow / wrap-around of IA32_APERF.

IA32_APERF: Accumulates core clock counts at the coordinated clock frequency, when the logical processor is in C0. Cleared upon overflow / wrap-around of IA32_MPERF.

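Therefore, since tsc * tsc_tweak / units / interval_float is just the base frequency in MHz, multiplying it by aperf / mperf scales it to the speed the core actually ran at.

The final result is the actual running frequency of the CPU while it was in the C0 running state during the sampling period. Let's look at the result at idle: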

     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz
       -       0    0.09     501    1512
       0       0    0.07     500    1512
       2       1    0.12     503    1512
       1       0    0.05     500    1512
       3       1    0.11     500    1512
Then fully load cpu1 and look again:

     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz
       -     523   25.75    2030    1512
       0       1    0.05    2068    1512
       2       1    0.03    2032    1512
       1    2026   99.77    2031    1512
       3      63    3.15    2002    1512

As shown, cpu1 peaks at 2031 MHz, yet the scaling_max_freq reported under intel_pstate is actually 2.8 GHz. What is going on? Let's watch cpuinfo_cur_freq in sysfs:

root@chenyu-Broadwell-Client-platform:/sys/devices/system/cpu# grep . cpu*/cpufreq/cpuinfo_cur_freq
cpu0/cpufreq/cpuinfo_cur_freq:2399648
cpu1/cpufreq/cpuinfo_cur_freq:2799960
cpu2/cpufreq/cpuinfo_cur_freq:2400000
cpu3/cpufreq/cpuinfo_cur_freq:2800605
root@chenyu-Broadwell-Client-platform:/sys/devices/system/cpu# grep . cpu*/cpufreq/cpuinfo_cur_freq
cpu0/cpufreq/cpuinfo_cur_freq:2400000
cpu1/cpufreq/cpuinfo_cur_freq:2792695
cpu2/cpufreq/cpuinfo_cur_freq:2400058
cpu3/cpufreq/cpuinfo_cur_freq:2799843
root@chenyu-Broadwell-Client-platform:/sys/devices/system/cpu# grep . cpu*/cpufreq/cpuinfo_cur_freq
cpu0/cpufreq/cpuinfo_cur_freq:2400644
cpu1/cpufreq/cpuinfo_cur_freq:2799960
cpu2/cpufreq/cpuinfo_cur_freq:2400527
cpu3/cpufreq/cpuinfo_cur_freq:2526093
root@chenyu-Broadwell-Client-platform:/sys/devices/system/cpu# grep . cpu*/cpufreq/cpuinfo_cur_freq
cpu0/cpufreq/cpuinfo_cur_freq:2400000
cpu1/cpufreq/cpuinfo_cur_freq:2799960
cpu2/cpufreq/cpuinfo_cur_freq:2400234
cpu3/cpufreq/cpuinfo_cur_freq:2800371
root@chenyu-Broadwell-Client-platform:/sys/devices/system/cpu# grep . cpu*/cpufreq/cpuinfo_cur_freq
cpu0/cpufreq/cpuinfo_cur_freq:2400000
cpu1/cpufreq/cpuinfo_cur_freq:2400000
cpu2/cpufreq/cpuinfo_cur_freq:2400292
cpu3/cpufreq/cpuinfo_cur_freq:2462109
root@chenyu-Broadwell-Client-platform:/sys/devices/system/cpu# grep . cpu*/cpufreq/cpuinfo_cur_freq
cpu0/cpufreq/cpuinfo_cur_freq:1999980
cpu1/cpufreq/cpuinfo_cur_freq:1999980
cpu2/cpufreq/cpuinfo_cur_freq:2000097
cpu3/cpufreq/cpuinfo_cur_freq:1999570
root@chenyu-Broadwell-Client-platform:/sys/devices/system/cpu# grep . cpu*/cpufreq/cpuinfo_cur_freq
cpu0/cpufreq/cpuinfo_cur_freq:1999980
cpu1/cpufreq/cpuinfo_cur_freq:1999980
cpu2/cpufreq/cpuinfo_cur_freq:2000039
cpu3/cpufreq/cpuinfo_cur_freq:2000214

As shown, the frequency is correct at first, about 2.79 GHz, but then it keeps dropping and settles at 2 GHz. This must be a bug.