First of all, under the cpufreq framework all CPUs share a single driver, but each CPU may run a different governor. There is a restriction, though: all CPUs attached to the same governor share the same adjustable frequency range. In other words, if cpu1 and cpu2 both use ondemand, then cpu1's min equals cpu2's min and cpu1's max equals cpu2's max; this was really designed that way to keep the code simple. As a consequence, when every CPU is forced onto one and the same governor (which is effectively the intel_pstate situation we want to fix), switching the governor on any one CPU drags all CPUs onto that governor, and that is not reasonable.
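For reference, the fields of struct cpufreq_policy that matter in this discussion look roughly like this (abridged from include/linux/cpufreq.h; the exact layout varies between kernel versions):

struct cpufreq_policy {
	cpumask_var_t		cpus;		/* online CPUs managed by this policy */
	cpumask_var_t		related_cpus;	/* online + offline CPUs of the domain */
	unsigned int		cpu;		/* the CPU managing this policy */
	unsigned int		min;		/* lowest allowed frequency, kHz */
	unsigned int		max;		/* highest allowed frequency, kHz */
	unsigned int		cur;		/* current frequency, kHz */
	struct cpufreq_governor	*governor;	/* governor attached to this policy */
	/* ... */
};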
To make the explanation easier, let's first look at how the classic acpi-cpufreq driver handles different governors, and use its behavior as the reference for correcting intel_pstate's logic (intel_pstate is switched off here by adding intel_pstate=disable to the kernel boot parameters).
1. Before changing any governor:
grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor:ondemand
/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor:ondemand
/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor:ondemand
/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor:ondemand
2. Enable dynamic debug for cpufreq.c, change cpu1's governor to powersave, and watch what gets logged while the CPU switches governors:
[ 7429.811709] cpufreq: setting new policy for CPU 1: 775000 - 1900000 kHz
[ 7429.811721] cpufreq: new min and max freqs are 775000 - 1900000 kHz
[ 7429.811724] cpufreq: governor switch
[ 7429.811728] cpufreq: __cpufreq_governor: for CPU 1, event 2
[ 7429.811736] cpufreq: __cpufreq_governor: for CPU 1, event 5
[ 7429.811740] cpufreq: __cpufreq_governor: for CPU 1, event 4
[ 7429.811743] cpufreq: __cpufreq_governor: for CPU 1, event 1
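To decode the event numbers in the log, these are the governor event constants used by kernels of that era (quoted from include/linux/cpufreq.h as far as I recall; recent kernels have removed this event-based interface):

#define CPUFREQ_GOV_START	1
#define CPUFREQ_GOV_STOP	2
#define CPUFREQ_GOV_LIMITS	3
#define CPUFREQ_GOV_POLICY_INIT	4
#define CPUFREQ_GOV_POLICY_EXIT	5

So the sequence 2, 5, 4, 1 means: stop and tear down the old governor (ondemand), then initialize and start the new one (powersave).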
From the log above we can see that the new (powersave) policy keeps the same min and max as before, because the code simply copies old_policy into new_policy; only the governor of the new policy is changed to powersave. According to the implementation of cpufreq_governor_powersave, the governor always takes policy->min as the target P-state/frequency to jump to next.
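cpufreq_governor_powersave is tiny; in the kernels this test ran on it looks roughly like the following (paraphrased from drivers/cpufreq/cpufreq_powersave.c, debug printout dropped):

static int cpufreq_governor_powersave(struct cpufreq_policy *policy,
					unsigned int event)
{
	switch (event) {
	case CPUFREQ_GOV_START:
	case CPUFREQ_GOV_LIMITS:
		/* always request the lowest frequency this policy allows */
		__cpufreq_driver_target(policy, policy->min,
					CPUFREQ_RELATION_L);
		break;
	default:
		break;
	}
	return 0;
}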
The test results confirm this behavior: apart from cpu1, which was switched to powersave, the policies of the other CPUs did not change. With stress pinned to individual CPUs, cpu1 always asks for the lowest P-state, while the other CPUs jump straight to the highest P-state as soon as the workload ramps up (that is how ondemand works: it raises the frequency rather aggressively). This also shows that the policy used by cpu1 is not the same object as the policy used by the other CPUs.
So here is the question: does every CPU in the system have its own policy? The answer is: not necessarily; cpu0, cpu2 and cpu3 may well share a single policy.
Let's first look at what happens when a string is written to the scaling_governor entry in sysfs:
static ssize_t store_scaling_governor(struct cpufreq_policy *policy,
const char *buf, size_t count)
{
int ret;
char str_governor[16];
struct cpufreq_policy new_policy;
memcpy(&new_policy, policy, sizeof(*policy));
ret = sscanf(buf, "%15s", str_governor);
if (ret != 1)
return -EINVAL;
if (cpufreq_parse_governor(str_governor, &new_policy.policy,
&new_policy.governor))
return -EINVAL;
ret = cpufreq_set_policy(policy, &new_policy);
return ret ? ret : count;
}
The copied policy gets its governor changed to powersave and is then written back over the current policy, so apart from the governor field, the other fields of the policy stay essentially unchanged.
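cpufreq_set_policy then applies the copied limits and, if the governor really changed, performs the switch. A rough sketch of that part (paraphrased from the cpufreq.c of that era, error handling dropped) also explains the 2, 5, 4, 1 sequence in the log:

	/* inside cpufreq_set_policy(), after the new limits were validated */
	policy->min = new_policy->min;
	policy->max = new_policy->max;

	if (new_policy->governor == policy->governor)
		goto out;			/* same governor: just re-apply limits */

	/* stop and tear down the old governor */
	old_gov = policy->governor;
	if (old_gov) {
		__cpufreq_governor(policy, CPUFREQ_GOV_STOP);		/* event 2 */
		__cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);	/* event 5 */
	}

	/* initialize and start the new governor */
	policy->governor = new_policy->governor;
	__cpufreq_governor(policy, CPUFREQ_GOV_POLICY_INIT);		/* event 4 */
	__cpufreq_governor(policy, CPUFREQ_GOV_START);			/* event 1 */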
With this intuitive picture of a policy in mind, let's analyze carefully how the CPUs are grouped: some CPUs belong to the same group, and once the frequency of any CPU in the group is adjusted, every CPU in the group has to be brought to the same frequency.
First, the policies are created during system boot. The earliest initialization happens at
late_initcall(acpi_cpufreq_init);
static int __init acpi_cpufreq_init(void)
{
acpi_cpufreq_early_init();
return cpufreq_register_driver(&acpi_cpufreq_driver);
}
The function most closely tied to the CPU grouping we are discussing is acpi_cpufreq_early_init, which groups the CPUs according to the _PSD information defined in the ACPI spec. Before going further into that function, we need to explain what _PSD means. Here are the _PSD objects of cpu0 and cpu1 obtained with acpidump:
Scope (\_PR.CPU0)
{
Method (_PSD, 0, NotSerialized)
{
Package (0x05)
{
0x05,
Zero,
Zero,
0xFC,
0x80
}
}
}
Scope (\_PR.CPU1)
{
Method (_PSD, 0, NotSerialized)
{
Package (0x05)
{
0x05,
Zero,
Zero,
0xFC,
0x80
}
}
}
According to the definition of _PSD in the ACPI spec:
Package {
NumEntries // Integer
Revision // Integer (BYTE)
Domain // Integer (DWORD)
CoordType // Integer (DWORD)
NumProcessors // Integer (DWORD)
}
CPU0 and CPU1 both belong to Domain 0, and Domain 0 declares 0x80 = 128 CPUs.
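The kernel parses this package into struct acpi_psd_package, which (abridged, from include/acpi/processor.h) looks like this; the comments show how the dump above decodes:

struct acpi_psd_package {
	u64 num_entries;	/* 0x05 */
	u64 revision;		/* 0 */
	u64 domain;		/* 0    -> cpu0 and cpu1 are both in domain 0 */
	u64 coord_type;		/* 0xFC -> software-all (SW_ALL) coordination */
	u64 num_processors;	/* 0x80 -> the domain declares 128 CPUs */
} __attribute__((packed));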
What acpi_cpufreq_early_init has to do is record, for each CPU, all the CPUs belonging to the same domain into that CPU's shared_cpu_map (which later feeds the cpus mask of its policy). The algorithm is explained step by step in the comments below:
static int __init acpi_processor_preregister_performance(void)
{
//1. allocate memory for per_cpu.shared_cpu_map
for_each_possible_cpu(i) {
zalloc_cpumask_var_node(
&per_cpu_ptr(acpi_perf_data, i)->shared_cpu_map,
GFP_KERNEL, cpu_to_node(i));
}
//2. set i to cpu[i].cpumask ,get _PSD for cpu i
for_each_possible_cpu(i) {
pr = per_cpu(processors, i);
pr->performance = per_cpu_ptr(performance, i);
cpumask_set_cpu(i, pr->performance->shared_cpu_map);
acpi_processor_get_psd(pr);
}
//3. deal with cpufreq domain
for_each_possible_cpu(i) {
pr = per_cpu(processors, i);
//3.1 deal with cpu[i]'s domain
pdomain = &(pr->performance->domain_info);
cpumask_set_cpu(i, pr->performance->shared_cpu_map);
//3.1.1 get the cpu number that this domain contains
count_target = pdomain->num_processors;
//3.1.2 find all the cpu[j] have the same domain as cpu[i]
for_each_possible_cpu(j) {
if (i == j)
continue;
match_pr = per_cpu(processors, j);
match_pdomain = &(match_pr->performance->domain_info);
//3.1.2.1 must be of the same domain id
if (match_pdomain->domain != pdomain->domain)
continue;
//3.1.2.2 must be of the same domain cpu numbers
if (match_pdomain->num_processors != count_target) {
goto err_ret;
}
//3.1.2.3 must be of the same domain type
if (pdomain->coord_type != match_pdomain->coord_type) {
goto err_ret;
}
//3.1.2.4 match for cpu[i] found, set j to cpu[i].cpumask
cpumask_set_cpu(j, pr->performance->shared_cpu_map);
} //cpu[i].cpumask done
//3.2 deal with cpu[j]'s domain which are in the same domain as cpu[i]
//from this point, cpu[i]'s domain(cpumask) has finished setting all
//the related cpu ids, we just need to copy it back to all related
//cpu j's cpumask
for_each_possible_cpu(j) {
if (i == j)
continue;
match_pr = per_cpu(processors, j);
match_pdomain = &(match_pr->performance->domain_info);
if (match_pdomain->domain != pdomain->domain)
continue;
match_pr->performance->shared_type =
pr->performance->shared_type;
cpumask_copy(match_pr->performance->shared_cpu_map,
pr->performance->shared_cpu_map);
}
}
After the shared_cpu_map of every CPU has been updated, we go back to acpi_cpufreq_init, where cpufreq_register_driver tries to register acpi_cpufreq_driver with the system. The core of that function is to first point the global cpufreq_driver at acpi_cpufreq_driver and then register the cpufreq_interface with the cpu subsystem:
static struct subsys_interface cpufreq_interface = {
.name = "cpufreq",
.subsys = &cpu_subsys,
.add_dev = cpufreq_add_dev,
.remove_dev = cpufreq_remove_dev,
};
A subsys_interface such as cpufreq_interface is the set of add/remove callbacks applied to every device on the cpu_subsys bus. Registration is done by subsys_interface_register:
subsys_interface_register(&cpufreq_interface);
While registering this set of callbacks, the kernel walks all devices on the subsys bus the interface belongs to, i.e. our cpu devices, and adds them to the interface one by one (these devices were added to cpu_subsys early on, at cpufreq_core_init time).
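For reference, subsys_interface_register looks roughly like this (paraphrased from drivers/base/bus.c of that era; newer kernels moved these fields around):

int subsys_interface_register(struct subsys_interface *sif)
{
	struct bus_type *subsys;
	struct subsys_dev_iter iter;
	struct device *dev;

	subsys = bus_get(sif->subsys);
	if (!subsys)
		return -EINVAL;

	mutex_lock(&subsys->p->mutex);
	list_add_tail(&sif->node, &subsys->p->interfaces);
	if (sif->add_dev) {
		/* walk every device already on this bus (each cpu device on
		 * cpu_subsys) and call the add callback, i.e. cpufreq_add_dev */
		subsys_dev_iter_init(&iter, subsys, NULL, NULL);
		while ((dev = subsys_dev_iter_next(&iter)))
			sif->add_dev(dev, sif);
		subsys_dev_iter_exit(&iter);
	}
	mutex_unlock(&subsys->p->mutex);

	return 0;
}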
The add_dev callback is cpufreq_add_dev, a function written precisely to invoke the cpufreq_driver's series of initialization callbacks (in recent kernels the actual work lives in cpufreq_online, shown below in simplified form). Inside it, a policy is created for each CPU. If cpu0 comes in first, no policy exists for it yet, so a new policy has to be allocated dynamically and the governor on that policy started:
static int cpufreq_online(unsigned int cpu)
{
policy = per_cpu(cpufreq_cpu_data, cpu);
if (policy) {
return 0;
} else {
new_policy = true;
policy = cpufreq_policy_alloc(cpu);
}
if (new_policy) {
cpufreq_driver->init(policy);
/* related_cpus should at least include policy->cpus. */
cpumask_copy(policy->related_cpus, policy->cpus);
if (new_policy) {
for_each_cpu(j, policy->related_cpus)
per_cpu(cpufreq_cpu_data, j) = policy;
}
cpufreq_init_policy(policy);
}
As you can see, the grouping criterion is the value of policy->cpus, and that value is expected to be filled in by the cpufreq_driver's init callback, i.e. acpi_cpufreq_cpu_init. In that function, each policy's cpus bitmap is set up from the shared_cpu_map computed earlier:
#define DOMAIN_COORD_TYPE_SW_ALL 0xfc
#define DOMAIN_COORD_TYPE_SW_ANY 0xfd
#define DOMAIN_COORD_TYPE_HW_ALL 0xfe
if (pdomain->coord_type == DOMAIN_COORD_TYPE_SW_ALL)
pr->performance->shared_type = CPUFREQ_SHARED_TYPE_ALL;
else if (pdomain->coord_type == DOMAIN_COORD_TYPE_HW_ALL)
pr->performance->shared_type = CPUFREQ_SHARED_TYPE_HW;
else if (pdomain->coord_type == DOMAIN_COORD_TYPE_SW_ANY)
pr->performance->shared_type = CPUFREQ_SHARED_TYPE_ANY;
if (policy->shared_type == CPUFREQ_SHARED_TYPE_ALL ||
policy->shared_type == CPUFREQ_SHARED_TYPE_ANY) {
cpumask_copy(policy->cpus, perf->shared_cpu_map);
}
cpumask_copy(data->freqdomain_cpus, perf->shared_cpu_map);
Whether CPUs are grouped is decided by the coordination attribute of the _PSD domain: only domains of type 0xfc or 0xfd are grouped in software. In that case the shared_cpu_map obtained in the previous step is copied into the cpus bitmap of this CPU's policy, which means this policy also has to manage the other CPUs listed in policy->cpus. In our earlier example the domain of cpu0 and cpu1 has type 0xfc, so grouping should take place, and the policy->cpus of both cpu0 and cpu1 should contain both CPUs. After boot, then, at least cpu0 and cpu1 should be using the same policy; in theory cpu0, cpu1, cpu2 and cpu3 all use one policy. But what happens in practice? According to our earliest test, if only cpu1's governor is changed to powersave, then once a 100% workload kicks in, cpu1 runs at the lowest frequency while the other CPUs run at the highest. That shows cpu1 does not share a policy with the other CPUs. And, looking from another angle, the data visible in sysfs also suggests that each CPU has its own policy:
ls /sys/devices/system/cpu/cpufreq/
boost ondemand policy0 policy1 policy2 policy3
Let's pick one of them, policy1, and take a look:
grep . /sys/devices/system/cpu/cpufreq/policy1/*
/sys/devices/system/cpu/cpufreq/policy1/affected_cpus:1
/sys/devices/system/cpu/cpufreq/policy1/bios_limit:1900000
/sys/devices/system/cpu/cpufreq/policy1/cpuinfo_cur_freq:800000
/sys/devices/system/cpu/cpufreq/policy1/cpuinfo_max_freq:2501000
/sys/devices/system/cpu/cpufreq/policy1/cpuinfo_min_freq:775000
/sys/devices/system/cpu/cpufreq/policy1/cpuinfo_transition_latency:10000
/sys/devices/system/cpu/cpufreq/policy1/freqdomain_cpus:0 1 2 3
/sys/devices/system/cpu/cpufreq/policy1/related_cpus:1
/sys/devices/system/cpu/cpufreq/policy1/scaling_available_frequencies:2501000 2500000 2400000 2200000 2100000 1900000 1800000 1700000 1600000 1500000 1300000 1200000 1100000 1000000 800000 775000
/sys/devices/system/cpu/cpufreq/policy1/scaling_available_governors:conservative ondemand userspace powersave performance
/sys/devices/system/cpu/cpufreq/policy1/scaling_cur_freq:800000
/sys/devices/system/cpu/cpufreq/policy1/scaling_driver:acpi-cpufreq
/sys/devices/system/cpu/cpufreq/policy1/scaling_governor:ondemand
/sys/devices/system/cpu/cpufreq/policy1/scaling_max_freq:1900000
/sys/devices/system/cpu/cpufreq/policy1/scaling_min_freq:775000
/sys/devices/system/cpu/cpufreq/policy1/scaling_setspeed:
grep: /sys/devices/system/cpu/cpufreq/policy1/stats: Is a directory
The freqdomain_cpus entry shows that cpu0,1,2,3 are in one domain. Now switch cpu1 to powersave and look again:
grep . /sys/devices/system/cpu/cpufreq/policy1/*
/sys/devices/system/cpu/cpufreq/policy1/affected_cpus:1
/sys/devices/system/cpu/cpufreq/policy1/bios_limit:1900000
/sys/devices/system/cpu/cpufreq/policy1/cpuinfo_cur_freq:2501000
/sys/devices/system/cpu/cpufreq/policy1/cpuinfo_max_freq:2501000
/sys/devices/system/cpu/cpufreq/policy1/cpuinfo_min_freq:775000
/sys/devices/system/cpu/cpufreq/policy1/cpuinfo_transition_latency:10000
/sys/devices/system/cpu/cpufreq/policy1/freqdomain_cpus:0 1 2 3
/sys/devices/system/cpu/cpufreq/policy1/related_cpus:1
/sys/devices/system/cpu/cpufreq/policy1/scaling_available_frequencies:2501000 2500000 2400000 2200000 2100000 1900000 1800000 1700000 1600000 1500000 1300000 1200000 1100000 1000000 800000 775000
/sys/devices/system/cpu/cpufreq/policy1/scaling_available_governors:conservative ondemand userspace powersave performance
/sys/devices/system/cpu/cpufreq/policy1/scaling_cur_freq:775000
/sys/devices/system/cpu/cpufreq/policy1/scaling_driver:acpi-cpufreq
/sys/devices/system/cpu/cpufreq/policy1/scaling_governor:ondemand
/sys/devices/system/cpu/cpufreq/policy1/scaling_max_freq:1900000
/sys/devices/system/cpu/cpufreq/policy1/scaling_min_freq:775000
/sys/devices/system/cpu/cpufreq/policy1/scaling_setspeed:
grep: /sys/devices/system/cpu/cpufreq/policy1/stats: Is a directory
freqdomain_cpus is still 0 1 2 3, unchanged, while affected_cpus still lists only cpu1. affected_cpus is read directly from policy->cpus:
static ssize_t show_affected_cpus(struct cpufreq_policy *policy, char *buf)
{
return cpufreq_show_cpus(policy->cpus, buf);
}
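freqdomain_cpus, by contrast, is an acpi-cpufreq specific attribute backed by the driver's private copy of shared_cpu_map. Its show routine looks roughly like this (paraphrased from drivers/cpufreq/acpi-cpufreq.c; how data is looked up differs a bit between versions):

static ssize_t show_freqdomain_cpus(struct cpufreq_policy *policy, char *buf)
{
	struct acpi_cpufreq_data *data = policy->driver_data;

	if (unlikely(!data))
		return -ENODEV;

	/* prints the cpumask copied from perf->shared_cpu_map at init time */
	return cpufreq_show_cpus(data->freqdomain_cpus, buf);
}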
Wait, comrades, what is going on here? According to the analysis above, shouldn't policy->cpus be the set cpu0,1,2,3? Where did things go wrong? We already know that cpu1's policy->cpus contains only 1, i.e. just its own CPU number, whereas the freqdomain_cpus exported for cpu1 (a copy of perf->shared_cpu_map) is 0,1,2,3. So let's follow the trail back to where these two values are assigned:
if (policy->shared_type == CPUFREQ_SHARED_TYPE_ALL ||
policy->shared_type == CPUFREQ_SHARED_TYPE_ANY) {
cpumask_copy(policy->cpus, perf->shared_cpu_map);
}
cpumask_copy(data->freqdomain_cpus, perf->shared_cpu_map);
The only possible conclusion is that cpu1's domain is definitely not of type SHARED_TYPE_ANY or SHARED_TYPE_ALL. So we go back and examine the acpidump output more carefully and find that there are two _PSD definitions for cpu1. What we looked at earlier was the first one, called SPSD, i.e. the software _PSD. In reality the hardware supports automatic coordination, so the firmware enables HPSD, the hardware _PSD, which is the second definition:
Name (HPSD, Package (0x01)
{
Package (0x05)
{
0x05,
Zero,
Zero,
0xFE,
0x80
}
})
So the coordination type of this cpu domain 0 is actually HW (0xFE). In that case policy->cpus naturally contains only cpu1 itself, while freqdomain_cpus, being copied from shared_cpu_map unconditionally, reads 0,1,2,3 no matter what. This fully explains the design of how a cpufreq policy manages its CPUs.
Alright, after this long detour, let's come back to the design of the intel_pstate driver. As we mentioned before, all CPUs in intel_pstate share one global limits variable (the upper and lower bounds), so switching the policy of any one CPU ends up switching every CPU to that policy, which is unreasonable.
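To make the problem concrete, here is a deliberately simplified, hypothetical sketch (not actual intel_pstate source) contrasting a single driver-global limits object with limits kept per policy:

/* Hypothetical illustration: one global limit shared by every CPU versus
 * per-policy limits. intel_pstate of that era follows the first pattern,
 * which is why tuning any one CPU's policy is visible on all of them. */
struct limits {
	unsigned int min_pct;
	unsigned int max_pct;
};

/* pattern 1: one global object -- a write made on behalf of any CPU's
 * policy changes the bounds every other CPU will use as well */
static struct limits global_limits = { .min_pct = 0, .max_pct = 100 };

static void set_limits_global(unsigned int min_pct, unsigned int max_pct)
{
	global_limits.min_pct = min_pct;	/* visible to all CPUs */
	global_limits.max_pct = max_pct;
}

/* pattern 2: per-policy limits -- each policy carries its own bounds, so
 * tuning one policy leaves the other policies untouched */
struct my_policy {
	struct limits lim;
};

static void set_limits_per_policy(struct my_policy *p,
				  unsigned int min_pct, unsigned int max_pct)
{
	p->lim.min_pct = min_pct;	/* only this policy is affected */
	p->lim.max_pct = max_pct;
}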