What are per-CPU variables?
Per-CPU variables are mainly used on multiprocessor systems: for every CPU in the system a separate copy of the variable is created, so each processor works on its own independent replica. They come in two flavors: statically allocated variables, set up while the kernel is compiled, and dynamically allocated variables, obtained at run time from the per-CPU memory allocator.
Linux manages per-CPU allocation with a chunk data structure. When a per-CPU variable of a given size is allocated, every CPU's copy is carved out of the same chunk; once a chunk fills up, a new chunk is added.
To make allocation and lookup fast, Linux links chunks into different lists keyed by their free size; an allocation is served from the chunk list that satisfies the requested size with the least free space. (There are usually many small percpu allocations many of them being as small as 4 bytes. The allocator organizes chunks into lists according to free size and tries to allocate from the fullest one.)
Now to the source code.
Reference kernel: Linux 3.10.
The per-CPU implementation lives mainly in include/linux/percpu.h and mm/percpu.c.
struct pcpu_chunk is defined as follows:
struct pcpu_chunk {
        struct list_head list;          /* linked to pcpu_slot lists */
        int free_size;                  /* free bytes in the chunk */
        int contig_hint;                /* max contiguous size hint */
        void *base_addr;                /* base address of this chunk */
        int map_used;                   /* # of map entries used */
        int map_alloc;                  /* # of map entries allocated */
        int *map;                       /* allocation map */
        void *data;                     /* chunk data */
        bool immutable;                 /* no [de]population allowed */
        unsigned long populated[];      /* populated bitmap */
};
free_size: amount of free space in this chunk, in bytes.
contig_hint: largest contiguous free size (a hint).
base_addr: start (virtual) address of the memory this chunk manages.
map_used: number of map[] entries currently in use.
map_alloc: number of entries allocated for map[].
map: the array used to carve out allocations.
A chunk tracks allocations with the map array. Each entry is an int recording either a free size or an allocated size (positive means free, negative means allocated). The map array is initially PCPU_DFL_MAP_ALLOC entries long and map_used starts at 1, so only map[0] is valid after initialization: it holds the chunk's entire allocatable size as a positive number, meaning the chunk is empty. On each new allocation request, the chunk scans map for a free region large enough, records the allocated size in the map, subtracts it from the free region, and merges adjacent free regions where necessary. As allocations of various sizes accumulate, map_used keeps growing, and the map array itself is enlarged when its initial size no longer suffices. (Allocation state in each chunk is kept using an array of integers on chunk->map. A positive value in the map represents a free region and negative allocated. Allocation inside a chunk is done by scanning this map sequentially and serving the first matching entry.)
data: pointer to the vms structure allocated for this chunk. The memory a chunk manages is ultimately allocated in pages (the first chunk excepted).
immutable: a boolean; when set, no further page allocation and mapping ([de]population) is allowed.
populated[]: an unsigned long bitmap recording which pages have been successfully mapped.
As noted above, Linux manages per-CPU allocation through chunks. With Nr CPUs in the system, every external call into the per-cpu allocator hands out Nr copies at once, one per CPU, all from the same chunk. Reality is a bit more involved: Linux also partitions the Nr CPUs into groups, which we discuss below along with the code.
Linux uses a set of global variables to record CPU information and the allocator's sizing parameters:
static int pcpu_unit_pages __read_mostly;
Per chunk, the amount of memory from which a single CPU's per-cpu variables are allocated, in pages. (This is obviously the same for every CPU, since per-cpu variables are allocated for all CPUs at once.)
static int pcpu_unit_size __read_mostly;
The same quantity in bytes. The per-CPU portion of a chunk is called a unit.
static int pcpu_nr_units __read_mostly;
Number of units per chunk, i.e. the number of CPUs in the system.
static int pcpu_atom_size __read_mostly;
Allocation granularity used for alignment.
static struct list_head *pcpu_slot __read_mostly;
pcpu_slot is an array of list_head; chunks are linked into its different lists according to their free size.
static int pcpu_nr_slots __read_mostly;
Size of the pcpu_slot array.
static size_t pcpu_chunk_struct_size __read_mostly;
Size of the chunk structure, used when allocating a new chunk.
void *pcpu_base_addr __read_mostly;
EXPORT_SYMBOL_GPL(pcpu_base_addr);
Base address of the memory managed by the first chunk (the address of the first chunk, which starts with the kernel static area).
We saw that the global pcpu_unit_size is the amount of memory allocatable per CPU, and that all CPUs' copies of a per-CPU variable are allocated from the same chunk. Each chunk must therefore manage Nr x pcpu_unit_size bytes of memory.
Per-cpu variables come in static (compile-time) and dynamic (run-time) flavors. Static allocation first:
A static per-cpu variable is defined by placing it in a special data section (include/linux/percpu-defs.h):
#define DECLARE_PER_CPU(type, name) \
        DECLARE_PER_CPU_SECTION(type, name, "")

#define DEFINE_PER_CPU(type, name) \
        DEFINE_PER_CPU_SECTION(type, name, "")
These two macros statically declare and define a per-cpu variable of the given type. DECLARE_PER_CPU_SECTION and DEFINE_PER_CPU_SECTION are in turn defined as:
#define DECLARE_PER_CPU_SECTION(type, name, sec) \
        extern __PCPU_ATTRS(sec) __typeof__(type) name

#define DEFINE_PER_CPU_SECTION(type, name, sec) \
        __PCPU_ATTRS(sec) PER_CPU_DEF_ATTRIBUTES \
        __typeof__(type) name
where __PCPU_ATTRS is defined as:
#define __PCPU_ATTRS(sec) \
        __percpu __attribute__((section(PER_CPU_BASE_SECTION sec))) \
        PER_CPU_ATTRIBUTES
__percpu is a compiler-checking annotation; for ordinary builds it expands to nothing in include/linux/compiler.h. The sec argument passed in is empty, PER_CPU_ATTRIBUTES is empty, and so is PER_CPU_DEF_ATTRIBUTES from before, so DEFINE_PER_CPU(type, name) expands to:
__attribute__((section(PER_CPU_BASE_SECTION sec)))
__typeof__(type) name
PER_CPU_BASE_SECTION is defined in include/asm-generic/percpu.h:
#define PER_CPU_BASE_SECTION ".data..percpu"
so DEFINE_PER_CPU(type, name) finally expands to:
__attribute__((section(".data..percpu"))) __typeof__(type) name
Likewise, DECLARE_PER_CPU(type, name) expands to:
extern __attribute__((section(".data..percpu"))) __typeof__(type) name
So a static per-cpu definition is simply this macro defining a variable in the .data..percpu section, which gets linked into the kernel image at build time. The section starts at __per_cpu_start and ends at __per_cpu_end, two symbols defined in the kernel linker script (arch/arm/kernel/vmlinux.lds.S).
But recall that each CPU is supposed to get its own copy of every per-cpu variable, while only a single variable was defined in .data..percpu. How does that work?
During boot, start_kernel calls setup_per_cpu_areas to initialize the system's first chunk; that function copies the variable data in .data..percpu into each CPU's corresponding memory inside the chunk.
setup_per_cpu_areas is implemented in mm/percpu.c:
void __init setup_per_cpu_areas(void)
{
        unsigned long delta;
        unsigned int cpu;
        int rc;

        /*
         * Always reserve area for module percpu variables.  That's
         * what the legacy allocator did.
         */
        rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
                                    PERCPU_DYNAMIC_RESERVE, PAGE_SIZE, NULL,
                                    pcpu_dfl_fc_alloc, pcpu_dfl_fc_free);
        if (rc < 0)
                panic("Failed to initialize percpu areas.");

        delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
        for_each_possible_cpu(cpu)
                __per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
}
The function first calls pcpu_embed_first_chunk to create the system's first chunk, then fills in the __per_cpu_offset[] array: delta is the distance from the static section (__per_cpu_start) to the first chunk's base, and each CPU's entry adds its unit offset within the chunk. Let's look at pcpu_embed_first_chunk first; the function is long, so we take it in pieces:
int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
                                  size_t atom_size,
                                  pcpu_fc_cpu_distance_fn_t cpu_distance_fn,
                                  pcpu_fc_alloc_fn_t alloc_fn,
                                  pcpu_fc_free_fn_t free_fn)
{
        void *base = (void *)ULONG_MAX;
        void **areas = NULL;
        struct pcpu_alloc_info *ai;
        size_t size_sum, areas_size, max_distance;
        int group, i, rc;

        ai = pcpu_build_alloc_info(reserved_size, dyn_size, atom_size,
                                   cpu_distance_fn);
        if (IS_ERR(ai))
                return PTR_ERR(ai);
        ... ...
}
The parameters reserved_size and dyn_size are the amounts of space in this chunk set aside for reserved allocations and for dynamic allocations, respectively; both are passed straight through from setup_per_cpu_areas. atom_size is the alignment granularity; PAGE_SIZE is passed here, i.e. page alignment. cpu_distance_fn is optional and computes the distance between CPUs; NULL is passed here (cpu_distance_fn: callback to determine distance between cpus, optional). The last two parameters are also function pointers, used to allocate and free memory: pcpu_dfl_fc_alloc and pcpu_dfl_fc_free, implemented as follows:
static void * __init pcpu_dfl_fc_alloc(unsigned int cpu, size_t size,
                                       size_t align)
{
        return __alloc_bootmem_nopanic(size, align, __pa(MAX_DMA_ADDRESS));
}

static void __init pcpu_dfl_fc_free(void *ptr, size_t size)
{
        free_bootmem(__pa(ptr), size);
}
These two functions allocate and free memory during early system initialization, via the bootmem allocator.
pcpu_embed_first_chunk first calls pcpu_build_alloc_info to gather allocation info. That function is also long, so again in pieces:
static struct pcpu_alloc_info * __init pcpu_build_alloc_info(
                                size_t reserved_size, size_t dyn_size,
                                size_t atom_size,
                                pcpu_fc_cpu_distance_fn_t cpu_distance_fn)
{
        static int group_map[NR_CPUS] __initdata;
        static int group_cnt[NR_CPUS] __initdata;
        const size_t static_size = __per_cpu_end - __per_cpu_start;
        int nr_groups = 1, nr_units = 0;
        size_t size_sum, min_unit_size, alloc_size;
        int upa, max_upa, uninitialized_var(best_upa);  /* units_per_alloc */
        int last_allocs, group, unit;
        unsigned int cpu, tcpu;
        struct pcpu_alloc_info *ai;
        unsigned int *cpu_map;

        /* this function may be called multiple times */
        memset(group_map, 0, sizeof(group_map));
        memset(group_cnt, 0, sizeof(group_cnt));

        /* calculate size_sum and ensure dyn_size is enough for early alloc */
        size_sum = PFN_ALIGN(static_size + reserved_size +
                            max_t(size_t, dyn_size, PERCPU_DYNAMIC_EARLY_SIZE));
        dyn_size = size_sum - static_size - reserved_size;

        /*
         * Determine min_unit_size, alloc_size and max_upa such that
         * alloc_size is multiple of atom_size and is the smallest
         * which can accommodate 4k aligned segments which are equal to
         * or larger than min_unit_size.
         */
        min_unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);
        alloc_size = roundup(min_unit_size, atom_size);
... ...
With size_sum = PFN_ALIGN(static_size + reserved_size + max_t(size_t, dyn_size, PERCPU_DYNAMIC_EARLY_SIZE)), the function first computes the per-CPU space the chunk needs (static + reserved + dynamic), rounded up to a page. min_unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE) then takes the larger of that result and PCPU_MIN_UNIT_SIZE, and finally alloc_size is min_unit_size rounded up to the atom_size parameter.
(Determine min_unit_size, alloc_size and max_upa such that alloc_size is multiple of atom_size and is the smallest which can accommodate 4k aligned segments which are equal to or larger than min_unit_size.) alloc_size is the final amount of space that will be allocated per CPU.
#define PCPU_MIN_UNIT_SIZE PFN_ALIGN(32 << 10)
PCPU_MIN_UNIT_SIZE is 32KB, i.e. 8 pages with 4KB pages.
...
        for_each_possible_cpu(cpu) {
                group = 0;
        next_group:
                for_each_possible_cpu(tcpu) {
                        if (cpu == tcpu)
                                break;
                        if (group_map[tcpu] == group && cpu_distance_fn &&
                            (cpu_distance_fn(cpu, tcpu) > LOCAL_DISTANCE ||
                             cpu_distance_fn(tcpu, cpu) > LOCAL_DISTANCE)) {
                                group++;
                                nr_groups = max(nr_groups, group + 1);
                                goto next_group;
                        }
                }
                group_map[cpu] = group;
                group_cnt[group]++;
        }
... ...
...
        /* allocate and fill alloc_info */
        for (group = 0; group < nr_groups; group++)
                nr_units += roundup(group_cnt[group], upa);

        ai = pcpu_alloc_alloc_info(nr_groups, nr_units);
        if (!ai)
                return ERR_PTR(-ENOMEM);
... ...
Next the total number of units is computed (each group's CPU count rounded up to upa) and recorded in nr_units, and pcpu_alloc_alloc_info is called to allocate a pcpu_alloc_info structure.
First, the pcpu_alloc_info structure definitions:
struct pcpu_group_info {
        int nr_units;                   /* aligned # of units */
        unsigned long base_offset;      /* base address offset */
        unsigned int *cpu_map;          /* unit->cpu map, empty
                                         * entries contain NR_CPUS */
};

struct pcpu_alloc_info {
        size_t static_size;
        size_t reserved_size;
        size_t dyn_size;
        size_t unit_size;
        size_t atom_size;
        size_t alloc_size;
        size_t __ai_size;               /* internal, don't use */
        int nr_groups;                  /* 0 if grouping unnecessary */
        struct pcpu_group_info groups[];
};
The CPU grouping information lives in the groups[] array; each pcpu_group_info describes one group: nr_units is the number of units in the group, base_offset is the offset from the start of the group's memory to the start of the memory the whole chunk manages, and cpu_map records which CPUs belong to the group.
Now pcpu_alloc_alloc_info itself:
struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups,
                                                      int nr_units)
{
        struct pcpu_alloc_info *ai;
        size_t base_size, ai_size;
        void *ptr;
        int unit;

        base_size = ALIGN(sizeof(*ai) + nr_groups * sizeof(ai->groups[0]),
                          __alignof__(ai->groups[0].cpu_map[0]));
        ai_size = base_size + nr_units * sizeof(ai->groups[0].cpu_map[0]);

        ptr = alloc_bootmem_nopanic(PFN_ALIGN(ai_size));
        if (!ptr)
                return NULL;
        ai = ptr;
        ptr += base_size;

        ai->groups[0].cpu_map = ptr;

        for (unit = 0; unit < nr_units; unit++)
                ai->groups[0].cpu_map[unit] = NR_CPUS;

        ai->nr_groups = nr_groups;
        ai->__ai_size = PFN_ALIGN(ai_size);

        return ai;
}
base_size is the size without the cpu_map array; ai_size is the total size including it (all groups share a single cpu_map array, each group's cpu_map pointer simply pointing at a different offset within it).
The function initializes group[0]'s cpu_map pointer to the start of the shared cpu_map array; the caller is responsible for pointing the other groups' cpu_map pointers into it. Every cpu_map entry is initialized to NR_CPUS, and finally the pcpu_alloc_info's nr_groups and __ai_size are set.
Back in pcpu_build_alloc_info:
...
        cpu_map = ai->groups[0].cpu_map;

        for (group = 0; group < nr_groups; group++) {
                ai->groups[group].cpu_map = cpu_map;
                cpu_map += roundup(group_cnt[group], upa);
        }

        ai->static_size = static_size;
        ai->reserved_size = reserved_size;
        ai->dyn_size = dyn_size;
        ai->unit_size = alloc_size / upa;
        ai->atom_size = atom_size;
        ai->alloc_size = alloc_size;
... ...
...
        for (group = 0, unit = 0; group_cnt[group]; group++) {
                struct pcpu_group_info *gi = &ai->groups[group];

                /*
                 * Initialize base_offset as if all groups are located
                 * back-to-back.  The caller should update this to
                 * reflect actual allocation.
                 */
                gi->base_offset = unit * ai->unit_size;

                for_each_possible_cpu(cpu)
                        if (group_map[cpu] == group)
                                gi->cpu_map[gi->nr_units++] = cpu;
                gi->nr_units = roundup(gi->nr_units, upa);
                unit += gi->nr_units;
        }
        BUG_ON(unit != nr_units);

        return ai;
}
This for loop initializes each group's base_offset. As described earlier, base_offset is the offset from the start of the group's memory to the start of the memory the whole chunk manages; ai->unit_size is the space allocated per CPU, and unit counts how many units precede the current group. Notice that this initialization temporarily assumes the groups' areas are laid out back-to-back; the caller will recompute base_offset later. Within each group, the space allocated for its CPUs is contiguous.
Back in pcpu_embed_first_chunk:
...
        size_sum = ai->static_size + ai->reserved_size + ai->dyn_size;
        areas_size = PFN_ALIGN(ai->nr_groups * sizeof(void *));

        areas = alloc_bootmem_nopanic(areas_size);
        if (!areas) {
                rc = -ENOMEM;
                goto out_free;
        }

        /* allocate, copy and determine base address */
        for (group = 0; group < ai->nr_groups; group++) {
                struct pcpu_group_info *gi = &ai->groups[group];
                unsigned int cpu = NR_CPUS;
                void *ptr;

                for (i = 0; i < gi->nr_units && cpu == NR_CPUS; i++)
                        cpu = gi->cpu_map[i];
                BUG_ON(cpu == NR_CPUS);

                /* allocate space for the whole group */
                ptr = alloc_fn(cpu, gi->nr_units * ai->unit_size, atom_size);
                if (!ptr) {
                        rc = -ENOMEM;
                        goto out_free_areas;
                }
                /* kmemleak tracks the percpu allocations separately */
                kmemleak_free(ptr);
                areas[group] = ptr;

                base = min(ptr, base);
        }
... ...
areas is an array of pointers to each group's memory. The for loop allocates space for every group and fills areas[] with each area's start address; along the way it also tracks the smallest base address seen, kept in base.
...
        /*
         * Copy data and free unused parts.  This should happen after all
         * allocations are complete; otherwise, we may end up with
         * overlapping groups.
         */
        for (group = 0; group < ai->nr_groups; group++) {
                struct pcpu_group_info *gi = &ai->groups[group];
                void *ptr = areas[group];

                for (i = 0; i < gi->nr_units; i++, ptr += ai->unit_size) {
                        if (gi->cpu_map[i] == NR_CPUS) {
                                /* unused unit, free whole */
                                free_fn(ptr, ai->unit_size);
                                continue;
                        }
                        /* copy and return the unused part */
                        memcpy(ptr, __per_cpu_load, ai->static_size);
                        free_fn(ptr + size_sum, ai->unit_size - size_sum);
                }
        }
... ...
This code copies the static per-cpu variables into every CPU's corresponding space. As explained earlier, statically defined per-cpu variables live in the .data..percpu section, and the __per_cpu_load symbol is that section's load address; the unused tail of each unit beyond size_sum is handed back to the boot allocator.
...
        /* base address is now known, determine group base offsets */
        max_distance = 0;
        for (group = 0; group < ai->nr_groups; group++) {
                ai->groups[group].base_offset = areas[group] - base;
                max_distance = max_t(size_t, max_distance,
                                     ai->groups[group].base_offset);
        }
        max_distance += ai->unit_size;

        /* warn if maximum distance is further than 75% of vmalloc space */
        if (max_distance > (VMALLOC_END - VMALLOC_START) * 3 / 4) {
                pr_warning("PERCPU: max_distance=0x%zx too large for vmalloc "
                           "space 0x%lx\n", max_distance,
                           (unsigned long)(VMALLOC_END - VMALLOC_START));
#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
                /* and fail if we have fallback */
                rc = -EINVAL;
                goto out_free;
#endif
        }

        pr_info("PERCPU: Embedded %zu pages/cpu @%p s%zu r%zu d%zu u%zu\n",
                PFN_DOWN(size_sum), base, ai->static_size, ai->reserved_size,
                ai->dyn_size, ai->unit_size);

        rc = pcpu_setup_first_chunk(ai, base);
        goto out_free;

out_free_areas:
        for (group = 0; group < ai->nr_groups; group++)
                free_fn(areas[group],
                        ai->groups[group].nr_units * ai->unit_size);
out_free:
        pcpu_free_alloc_info(ai);
        if (areas)
                free_bootmem(__pa(areas), areas_size);
        return rc;
}
int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
                                  void *base_addr)
{
        static char cpus_buf[4096] __initdata;
        static int smap[PERCPU_DYNAMIC_EARLY_SLOTS] __initdata;
        static int dmap[PERCPU_DYNAMIC_EARLY_SLOTS] __initdata;
        size_t dyn_size = ai->dyn_size;
        size_t size_sum = ai->static_size + ai->reserved_size + dyn_size;
        struct pcpu_chunk *schunk, *dchunk = NULL;
        unsigned long *group_offsets;
        size_t *group_sizes;
        unsigned long *unit_off;
        unsigned int cpu;
        int *unit_map;
        int group, unit, i;

        cpumask_scnprintf(cpus_buf, sizeof(cpus_buf), cpu_possible_mask);

#define PCPU_SETUP_BUG_ON(cond) do {                                    \
        if (unlikely(cond)) {                                           \
                pr_emerg("PERCPU: failed to initialize, %s", #cond);    \
                pr_emerg("PERCPU: cpu_possible_mask=%s\n", cpus_buf);   \
                pcpu_dump_alloc_info(KERN_EMERG, ai);                   \
                BUG();                                                  \
        }                                                               \
} while (0)

        /* sanity checks */
        PCPU_SETUP_BUG_ON(ai->nr_groups <= 0);
#ifdef CONFIG_SMP
        PCPU_SETUP_BUG_ON(!ai->static_size);
        PCPU_SETUP_BUG_ON((unsigned long)__per_cpu_start & ~PAGE_MASK);
#endif
        PCPU_SETUP_BUG_ON(!base_addr);
        PCPU_SETUP_BUG_ON((unsigned long)base_addr & ~PAGE_MASK);
        PCPU_SETUP_BUG_ON(ai->unit_size < size_sum);
        PCPU_SETUP_BUG_ON(ai->unit_size & ~PAGE_MASK);
        PCPU_SETUP_BUG_ON(ai->unit_size < PCPU_MIN_UNIT_SIZE);
        PCPU_SETUP_BUG_ON(ai->dyn_size < PERCPU_DYNAMIC_EARLY_SIZE);
        PCPU_SETUP_BUG_ON(pcpu_verify_alloc_info(ai) < 0);

        /* process group information and build config tables accordingly */
        group_offsets = alloc_bootmem(ai->nr_groups * sizeof(group_offsets[0]));
        group_sizes = alloc_bootmem(ai->nr_groups * sizeof(group_sizes[0]));
        unit_map = alloc_bootmem(nr_cpu_ids * sizeof(unit_map[0]));
        unit_off = alloc_bootmem(nr_cpu_ids * sizeof(unit_off[0]));

        for (cpu = 0; cpu < nr_cpu_ids; cpu++)
                unit_map[cpu] = UINT_MAX;

        pcpu_low_unit_cpu = NR_CPUS;
        pcpu_high_unit_cpu = NR_CPUS;

        for (group = 0, unit = 0; group < ai->nr_groups; group++, unit += i) {
                const struct pcpu_group_info *gi = &ai->groups[group];

                group_offsets[group] = gi->base_offset;
                group_sizes[group] = gi->nr_units * ai->unit_size;

                for (i = 0; i < gi->nr_units; i++) {
                        cpu = gi->cpu_map[i];
                        if (cpu == NR_CPUS)
                                continue;

                        PCPU_SETUP_BUG_ON(cpu > nr_cpu_ids);
                        PCPU_SETUP_BUG_ON(!cpu_possible(cpu));
                        PCPU_SETUP_BUG_ON(unit_map[cpu] != UINT_MAX);

                        unit_map[cpu] = unit + i;
                        unit_off[cpu] = gi->base_offset + i * ai->unit_size;

                        /* determine low/high unit_cpu */
                        if (pcpu_low_unit_cpu == NR_CPUS ||
                            unit_off[cpu] < unit_off[pcpu_low_unit_cpu])
                                pcpu_low_unit_cpu = cpu;
                        if (pcpu_high_unit_cpu == NR_CPUS ||
                            unit_off[cpu] > unit_off[pcpu_high_unit_cpu])
                                pcpu_high_unit_cpu = cpu;
                }
        }
        pcpu_nr_units = unit;

        for_each_possible_cpu(cpu)
                PCPU_SETUP_BUG_ON(unit_map[cpu] == UINT_MAX);
... ...
base_addr is the smallest group base address computed earlier. After some sanity checks, space is allocated for several global tables. group_offsets[] holds each group's base_offset; group_sizes[] holds each group's total size. unit_map[] maps a CPU to its unit number, where units are numbered by the position of each CPU's space within the chunk, from low offset to high; unit_off[] records the offset of each CPU's space from the chunk's base address. The for loop walks every group to fill these in. Finally the total number of units is saved in the global pcpu_nr_units; pcpu_low_unit_cpu is the CPU with the smallest unit_off and pcpu_high_unit_cpu the one with the largest.
...
        /* we're done parsing the input, undefine BUG macro and dump config */
#undef PCPU_SETUP_BUG_ON
        pcpu_dump_alloc_info(KERN_DEBUG, ai);

        pcpu_nr_groups = ai->nr_groups;
        pcpu_group_offsets = group_offsets;
        pcpu_group_sizes = group_sizes;
        pcpu_unit_map = unit_map;
        pcpu_unit_offsets = unit_off;

        /* determine basic parameters */
        pcpu_unit_pages = ai->unit_size >> PAGE_SHIFT;
        pcpu_unit_size = pcpu_unit_pages << PAGE_SHIFT;
        pcpu_atom_size = ai->atom_size;
        pcpu_chunk_struct_size = sizeof(struct pcpu_chunk) +
                BITS_TO_LONGS(pcpu_unit_pages) * sizeof(unsigned long);

        /*
         * Allocate chunk slots.  The additional last slot is for
         * empty chunks.
         */
        pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 2;
        pcpu_slot = alloc_bootmem(pcpu_nr_slots * sizeof(pcpu_slot[0]));
        for (i = 0; i < pcpu_nr_slots; i++)
                INIT_LIST_HEAD(&pcpu_slot[i]);
... ...
The tables built above were held in temporary pointers; they are now assigned to the real globals: pcpu_nr_groups, pcpu_group_offsets, pcpu_group_sizes, pcpu_unit_map and pcpu_unit_offsets. Then further globals are filled in. pcpu_unit_pages is the unit size in pages; pcpu_unit_size and pcpu_atom_size were described earlier. pcpu_chunk_struct_size is the size needed to allocate one chunk structure: recall that struct pcpu_chunk ends with an unsigned long populated[] bitmap recording which pages have been mapped, so on top of the structure itself we must add the size of that per-page bitmap. Next the chunk slots are allocated and assigned to the global pcpu_slot pointer. As noted earlier, pcpu_slot is an array of list_head, and Linux links every chunk into the list that matches its free size. Two helper functions first:
static int __pcpu_size_to_slot(int size)
{
        int highbit = fls(size);        /* size is in bytes */
        return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1);
}

static int pcpu_size_to_slot(int size)
{
        if (size == pcpu_unit_size)
                return pcpu_nr_slots - 1;
        return __pcpu_size_to_slot(size);
}

static int pcpu_chunk_slot(const struct pcpu_chunk *chunk)
{
        if (chunk->free_size < sizeof(int) || chunk->contig_hint < sizeof(int))
                return 0;

        return pcpu_size_to_slot(chunk->free_size);
}
fls is defined in include/asm-generic/bitops/fls.h and means "find last (most-significant) bit set", e.g. fls(0) = 0, fls(1) = 1, fls(0x80000000) = 32.
#define PCPU_SLOT_BASE_SHIFT 5
pcpu_chunk_slot decides which pcpu_slot list a chunk belongs on, based on its free space. The array has two special list_heads. One is pcpu_slot[0]: as the function shows, if a chunk's free size, or its largest contiguous free run, is smaller than sizeof(int), the chunk goes on pcpu_slot[0]. Otherwise pcpu_size_to_slot is called, which further checks whether the free size equals pcpu_unit_size (the chunk is entirely free); if so, the chunk goes on pcpu_slot[pcpu_nr_slots - 1], the last list_head in the array, and if not, __pcpu_size_to_slot computes which list_head the chunk should be linked into.
This is why pcpu_setup_first_chunk allocates two extra slots (the "+ 2"): one for pcpu_slot[0] and one for the final, fully-free slot. Finally every list_head in the pcpu_slot array is initialized.
Continuing with pcpu_setup_first_chunk:
        /*
         * Initialize static chunk.  If reserved_size is zero, the
         * static chunk covers static area + dynamic allocation area
         * in the first chunk.  If reserved_size is not zero, it
         * covers static area + reserved area (mostly used for module
         * static percpu allocation).
         */
        schunk = alloc_bootmem(pcpu_chunk_struct_size);
        INIT_LIST_HEAD(&schunk->list);
        schunk->base_addr = base_addr;
        schunk->map = smap;
        schunk->map_alloc = ARRAY_SIZE(smap);
        schunk->immutable = true;
        bitmap_fill(schunk->populated, pcpu_unit_pages);

        if (ai->reserved_size) {
                schunk->free_size = ai->reserved_size;
                pcpu_reserved_chunk = schunk;
                pcpu_reserved_chunk_limit = ai->static_size + ai->reserved_size;
        } else {
                schunk->free_size = dyn_size;
                dyn_size = 0;                   /* dynamic area covered */
        }
        schunk->contig_hint = schunk->free_size;

        schunk->map[schunk->map_used++] = -ai->static_size;
        if (schunk->free_size)
                schunk->map[schunk->map_used++] = schunk->free_size;

        /* init dynamic chunk if necessary */
        if (dyn_size) {
                dchunk = alloc_bootmem(pcpu_chunk_struct_size);
                INIT_LIST_HEAD(&dchunk->list);
                dchunk->base_addr = base_addr;
                dchunk->map = dmap;
                dchunk->map_alloc = ARRAY_SIZE(dmap);
                dchunk->immutable = true;
                bitmap_fill(dchunk->populated, pcpu_unit_pages);

                dchunk->contig_hint = dchunk->free_size = dyn_size;
                dchunk->map[dchunk->map_used++] = -pcpu_reserved_chunk_limit;
                dchunk->map[dchunk->map_used++] = dchunk->free_size;
        }

        /* link the first chunk in */
        pcpu_first_chunk = dchunk ?: schunk;
        pcpu_chunk_relocate(pcpu_first_chunk, -1);

        /* we're done */
        pcpu_base_addr = base_addr;
        return 0;
}
Here the static chunk is initialized. Its base address is base_addr, the smallest group base computed earlier. bitmap_fill sets the entire populated bitmap, meaning all of the chunk's pages are already mapped (the static chunk's memory was allocated moments ago in pcpu_embed_first_chunk). On this path ai->reserved_size is non-zero, so schunk->free_size is set to ai->reserved_size and the pcpu_reserved_chunk pointer is set to schunk: future reserved allocations will be served from this chunk. The chunk's map array is then initialized: the used space is ai->static_size (recorded as a negative entry) and the allocatable space is ai->reserved_size. Since dyn_size is non-zero, a second chunk, dchunk, is initialized for dynamic allocation: its used space is ai->static_size + ai->reserved_size and its allocatable space is dyn_size. Finally pcpu_first_chunk is set to dchunk (or to schunk if ai->reserved_size were 0). So when there is no reserved size, dynamic allocation uses the static chunk directly; when there is one, reserved allocations use the static chunk and dynamic allocations use dchunk. pcpu_chunk_relocate then links pcpu_first_chunk into the matching pcpu_slot[] list, and pcpu_base_addr is set to base_addr.