Linux per-CPU Variable Allocation and Management: Source Code Analysis (unfinished)

What is a per-CPU variable?

per-CPU variables are used mainly on multiprocessor systems to give every CPU in the system its own copy of a variable; each processor's copy is independent of the others. They come in two flavors: statically allocated variables, set up when the kernel is compiled, and dynamically allocated ones, obtained at run time from the per-CPU memory allocator.


Linux manages per-CPU allocations with the chunk data structure. When a per-CPU variable of some size is allocated, every CPU's copy is carved out of the same chunk; once a chunk fills up, a new chunk is added.

To speed up allocation and lookup, Linux links each chunk into one of several lists keyed by the chunk's free size; an allocation is served from the chunks on the list that satisfies the requested size with the least free space. (There are usually many small percpu allocations many of them being as small as 4 bytes.  The allocator organizes chunks into lists according to free size and tries to allocate from the fullest one.)



The source code analysis follows.

Reference source: Linux 3.10


The per-CPU module is implemented mainly in include/linux/percpu.h and mm/percpu.c.

struct pcpu_chunk is defined as follows:

struct pcpu_chunk {
	struct list_head	list;		/* linked to pcpu_slot lists */
	int			free_size;	/* free bytes in the chunk */
	int			contig_hint;	/* max contiguous size hint */
	void			*base_addr;	/* base address of this chunk */
	int			map_used;	/* # of map entries used */
	int			map_alloc;	/* # of map entries allocated */
	int			*map;		/* allocation map */
	void			*data;		/* chunk data */
	bool			immutable;	/* no [de]population allowed */
	unsigned long		populated[];	/* populated bitmap */
};

list: links the chunk into the pcpu_slot lists. pcpu_slot is an array of list_head; Linux places every chunk on the list corresponding to the size of the chunk's free space.

free_size: amount of free space in this chunk, in bytes.

contig_hint: size of the largest contiguous free area.

base_addr: starting (virtual) address of the memory managed by this chunk.

map_used: number of map entries currently in use.

map_alloc: number of map entries allocated, i.e. the map array's capacity.

map: the array used to carve the chunk into allocated and free regions.

The chunk performs size allocation through the map array. Each entry is an int recording either an allocated size or a free size (a positive value is a free region, a negative value an allocated one). The array's capacity is initialized to PCPU_DFL_MAP_ALLOC and map_used starts at 1, so after initialization only map[0] is valid; it holds the whole allocatable size of the chunk as a positive number, meaning the chunk is empty. On each new allocation request the chunk scans the map for a free region large enough, records the allocated size in the map, subtracts it from the free region, and merges adjacent free regions when necessary. As dynamic allocations of various sizes accumulate, map_used keeps growing, and the map array itself is enlarged once its initial capacity no longer suffices. (Allocation state in each chunk is kept using an array of integers on chunk->map.  A positive value in the map represents a free region and negative allocated.  Allocation inside a chunk is done by scanning this map sequentially and serving the first matching entry. )
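The sign convention can be sketched with a small user-space simulation (this is only an illustration, not the kernel's allocation path; all names and sizes here are invented):

```c
#include <assert.h>

/* Minimal model of a chunk map: a positive entry is a free region,
 * a negative entry an allocated one.  First-fit scan; a free entry
 * larger than the request is split in two. */
static int map[16] = { 1024 };  /* one free entry covering the whole chunk */
static int map_used = 1;

static int map_alloc_region(int size)
{
    for (int i = 0; i < map_used; i++) {
        if (map[i] < size)          /* allocated (negative) or too small */
            continue;
        if (map[i] == size) {       /* exact fit: just flip the sign */
            map[i] = -size;
            return 0;
        }
        /* split: shift the tail right, then record the allocation */
        for (int j = map_used; j > i; j--)
            map[j] = map[j - 1];
        map[i + 1] = map[i] - size; /* remainder stays free */
        map[i] = -size;             /* this region is now allocated */
        map_used++;
        return 0;
    }
    return -1;                      /* chunk cannot satisfy the request */
}
```

After map_alloc_region(64) on a fresh 1024-byte chunk, the map holds {-64, 960}: 64 bytes allocated, 960 still free. The real allocator additionally merges neighboring free entries on free and grows the map array when it fills up.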

data: pointer to the vms structures allocated for the chunk. Apart from the first chunk, the memory a chunk manages is ultimately allocated in pages.

immutable: a boolean; when true, no further page allocation or mapping ([de]population) is allowed.

populated[]: an unsigned long bitmap recording which pages have been successfully mapped.


As noted above, Linux manages per-CPU allocation with chunks. If the system has Nr CPUs, then a single call into the per-CPU allocator hands out Nr copies of the variable at once, all from one chunk. Reality is a bit more involved: Linux also partitions the Nr CPUs into groups, which is discussed later along with the code.


Linux records CPU information and the allocator's sizing parameters in a set of global variables:

static int pcpu_unit_pages __read_mostly;

Size, in pages, of the region within a chunk from which a single CPU's per-CPU variables are allocated. (This value is obviously the same for every CPU, since per-CPU variables are allocated for all CPUs at once.)

static int pcpu_unit_size __read_mostly;
The same size in bytes. The region belonging to a single CPU is called a unit.

static int pcpu_nr_units __read_mostly;
Number of units per chunk, i.e. the number of CPUs in the system.

static int pcpu_atom_size __read_mostly;

Allocation alignment size.

static struct list_head *pcpu_slot __read_mostly;
pcpu_slot is the list_head array; each chunk is linked into one of its lists according to the chunk's free size.

static int pcpu_nr_slots __read_mostly;
Number of entries in the pcpu_slot array.

static size_t pcpu_chunk_struct_size __read_mostly;
Size of the chunk structure, used when allocating a new chunk.

void *pcpu_base_addr __read_mostly;
EXPORT_SYMBOL_GPL(pcpu_base_addr);

Base address of the memory managed by the first chunk (the address of the first chunk which starts with the kernel static area).

We saw that the global pcpu_unit_size is the amount of memory a single CPU can allocate from, and that every CPU's copy of a per-CPU variable is allocated from the same chunk. Each chunk must therefore manage Nr x pcpu_unit_size bytes of memory.


per-CPU variables are either statically allocated (at compile time) or dynamically allocated (at run time). Static allocation first:

Static allocation works by placing the variable in a special data section (include/linux/percpu-defs.h):

#define DECLARE_PER_CPU(type, name)					\
	DECLARE_PER_CPU_SECTION(type, name, "")

#define DEFINE_PER_CPU(type, name)					\
	DEFINE_PER_CPU_SECTION(type, name, "")
These two macros statically declare and define, respectively, a per-CPU variable of the given type. DECLARE_PER_CPU_SECTION and DEFINE_PER_CPU_SECTION are in turn defined as:
#define DECLARE_PER_CPU_SECTION(type, name, sec)			\
	extern __PCPU_ATTRS(sec) __typeof__(type) name

#define DEFINE_PER_CPU_SECTION(type, name, sec)				\
	__PCPU_ATTRS(sec) PER_CPU_DEF_ATTRIBUTES			\
	__typeof__(type) name
where __PCPU_ATTRS is defined as:

#define __PCPU_ATTRS(sec)						\
	__percpu __attribute__((section(PER_CPU_BASE_SECTION sec)))	\
	PER_CPU_ATTRIBUTES

__percpu is a compiler annotation that expands to nothing (include/linux/compiler.h). The sec argument passed in is empty, and PER_CPU_ATTRIBUTES and PER_CPU_DEF_ATTRIBUTES are empty as well, so DEFINE_PER_CPU(type, name) expands to:

__attribute__((section(PER_CPU_BASE_SECTION)))

__typeof__(type) name

PER_CPU_BASE_SECTION is defined in include/asm-generic/percpu.h:

#define PER_CPU_BASE_SECTION ".data..percpu"

so DEFINE_PER_CPU(type, name) finally expands to:

__attribute__((section(".data..percpu")))

__typeof__(type) name

and DECLARE_PER_CPU(type, name) likewise expands to:

extern __attribute__((section(".data..percpu"))) __typeof__(type) name

So a static per-CPU definition is simply this macro placing one variable in the .data..percpu section, which is then compiled into the kernel image. The section starts at __per_cpu_start and ends at __per_cpu_end; both symbols are defined in the kernel linker script (arch/arm/kernel/vmlinux.lds).

But recall that each CPU is supposed to get its own copy, and only a single variable was defined in .data..percpu. How does that work?

During boot, start_kernel calls setup_per_cpu_areas to initialize the system's first chunk; that function copies the variable data in .data..percpu into each CPU's corresponding memory within the chunk.

setup_per_cpu_areas is implemented in mm/percpu.c:

void __init setup_per_cpu_areas(void)
{
	unsigned long delta;
	unsigned int cpu;
	int rc;

	/*
	 * Always reserve area for module percpu variables.  That's
	 * what the legacy allocator did.
	 */
	rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
				    PERCPU_DYNAMIC_RESERVE, PAGE_SIZE, NULL,
				    pcpu_dfl_fc_alloc, pcpu_dfl_fc_free);
	if (rc < 0)
		panic("Failed to initialize percpu areas.");

	delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
	for_each_possible_cpu(cpu)
		__per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
}
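The two lines after the pcpu_embed_first_chunk call build __per_cpu_offset[]: adding that offset to a per-CPU variable's link-time address (which lies inside .data..percpu) yields the address of the given CPU's private copy inside the first chunk. A user-space sketch with invented addresses and a single CPU group:

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS   4
#define UNIT_SIZE 0x8000UL                  /* stand-in for pcpu_unit_size */

static const uintptr_t per_cpu_start = 0xc0100000; /* fake __per_cpu_start */
static const uintptr_t base_addr     = 0xc0900000; /* fake pcpu_base_addr  */

/* __per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu]; with one
 * group, unit offsets are simply cpu * unit_size */
static uintptr_t per_cpu_offset(int cpu)
{
    uintptr_t delta = base_addr - per_cpu_start;
    return delta + (uintptr_t)cpu * UNIT_SIZE;
}

/* address of cpu's copy of the variable whose link-time address is addr */
static uintptr_t per_cpu_addr(uintptr_t addr, int cpu)
{
    return addr + per_cpu_offset(cpu);
}
```

For a variable 0x10 bytes into .data..percpu, CPU 2's copy lands two units into the chunk, at the same 0x10 offset within its unit.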
The function first calls pcpu_embed_first_chunk to create the system's first chunk, then initializes the __per_cpu_offset[] array. pcpu_embed_first_chunk is fairly long, so it is examined in pieces:

int __init pcpu_embed_first_chunk(size_t reserved_size, size_t dyn_size,
				  size_t atom_size,
				  pcpu_fc_cpu_distance_fn_t cpu_distance_fn,
				  pcpu_fc_alloc_fn_t alloc_fn,
				  pcpu_fc_free_fn_t free_fn)
{
	void *base = (void *)ULONG_MAX;
	void **areas = NULL;
	struct pcpu_alloc_info *ai;
	size_t size_sum, areas_size, max_distance;
	int group, i, rc;

	ai = pcpu_build_alloc_info(reserved_size, dyn_size, atom_size,
				   cpu_distance_fn);
	if (IS_ERR(ai))
		return PTR_ERR(ai);
            ... ...
}
The parameters reserved_size and dyn_size are the amounts of space in this chunk set aside for reserved allocations and for dynamic allocations; both are passed straight through from setup_per_cpu_areas. atom_size is the allocation alignment, here PAGE_SIZE, i.e. page alignment. cpu_distance_fn is optional and used to compute the distance between CPUs; NULL is passed here. (cpu_distance_fn: callback to determine distance between cpus, optional.) The last two parameters are also function pointers, used to allocate and free memory: pcpu_dfl_fc_alloc and pcpu_dfl_fc_free, implemented as follows:

static void * __init pcpu_dfl_fc_alloc(unsigned int cpu, size_t size,
				       size_t align)
{
	return __alloc_bootmem_nopanic(size, align, __pa(MAX_DMA_ADDRESS));
}

static void __init pcpu_dfl_fc_free(void *ptr, size_t size)
{
	free_bootmem(__pa(ptr), size);
}
These two functions allocate and free memory from the bootmem allocator, which is what is available this early in system initialization.

pcpu_embed_first_chunk first calls pcpu_build_alloc_info to gather allocation information. This function is also fairly long; in pieces:

static struct pcpu_alloc_info * __init pcpu_build_alloc_info(
				size_t reserved_size, size_t dyn_size,
				size_t atom_size,
				pcpu_fc_cpu_distance_fn_t cpu_distance_fn)
{
	static int group_map[NR_CPUS] __initdata;
	static int group_cnt[NR_CPUS] __initdata;
	const size_t static_size = __per_cpu_end - __per_cpu_start;
	int nr_groups = 1, nr_units = 0;
	size_t size_sum, min_unit_size, alloc_size;
	int upa, max_upa, uninitialized_var(best_upa);	/* units_per_alloc */
	int last_allocs, group, unit;
	unsigned int cpu, tcpu;
	struct pcpu_alloc_info *ai;
	unsigned int *cpu_map;

	/* this function may be called multiple times */
	memset(group_map, 0, sizeof(group_map));
	memset(group_cnt, 0, sizeof(group_cnt));

	/* calculate size_sum and ensure dyn_size is enough for early alloc */
	size_sum = PFN_ALIGN(static_size + reserved_size +
			    max_t(size_t, dyn_size, PERCPU_DYNAMIC_EARLY_SIZE));
	dyn_size = size_sum - static_size - reserved_size;

	/*
	 * Determine min_unit_size, alloc_size and max_upa such that
	 * alloc_size is multiple of atom_size and is the smallest
	 * which can accommodate 4k aligned segments which are equal to
	 * or larger than min_unit_size.
	 */
	min_unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE);

	alloc_size = roundup(min_unit_size, atom_size);
            ... ...

The static arrays group_map[NR_CPUS] and group_cnt[NR_CPUS] are used to compute the CPU grouping: group_map[] maps each CPU to its group, and group_cnt[] counts the CPUs in each group.

size_sum = PFN_ALIGN(static_size + reserved_size + max_t(size_t, dyn_size, PERCPU_DYNAMIC_EARLY_SIZE)): the function first computes the space each CPU needs in the chunk (static + reserved + dynamic) and rounds it up to a page boundary. min_unit_size = max_t(size_t, size_sum, PCPU_MIN_UNIT_SIZE) takes the larger of that result and PCPU_MIN_UNIT_SIZE, and finally alloc_size rounds min_unit_size up to a multiple of atom_size.

(Determine min_unit_size, alloc_size and max_upa such that alloc_size is multiple of atom_size and is the smallest which can accommodate 4k aligned segments which are equal to or larger than min_unit_size.) alloc_size is the amount of space that will ultimately be allocated per CPU.

#define PCPU_MIN_UNIT_SIZE		PFN_ALIGN(32 << 10)

PCPU_MIN_UNIT_SIZE is 32 KiB, i.e. 8 pages with 4 KiB pages.
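The three sizing steps can be mirrored in a few lines of user-space C (PERCPU_DYNAMIC_EARLY_SIZE is passed in as dyn_floor; the concrete numbers used in testing are invented):

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE          4096UL
#define PFN_ALIGN(x)       (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))
#define PCPU_MIN_UNIT_SIZE PFN_ALIGN(32UL << 10)    /* 32 KiB */

static size_t max_sz(size_t a, size_t b)      { return a > b ? a : b; }
static size_t roundup_sz(size_t x, size_t to) { return (x + to - 1) / to * to; }

/* Mirrors the size_sum / min_unit_size / alloc_size computation in
 * pcpu_build_alloc_info */
static size_t unit_alloc_size(size_t static_size, size_t reserved_size,
                              size_t dyn_size, size_t dyn_floor,
                              size_t atom_size)
{
    size_t size_sum = PFN_ALIGN(static_size + reserved_size +
                                max_sz(dyn_size, dyn_floor));
    size_t min_unit_size = max_sz(size_sum, PCPU_MIN_UNIT_SIZE);
    return roundup_sz(min_unit_size, atom_size);  /* per-CPU unit size */
}
```

With tiny inputs the PCPU_MIN_UNIT_SIZE floor dominates; otherwise the page-aligned sum does.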

...
	for_each_possible_cpu(cpu) {
		group = 0;
	next_group:
		for_each_possible_cpu(tcpu) {
			if (cpu == tcpu)
				break;
			if (group_map[tcpu] == group && cpu_distance_fn &&
			    (cpu_distance_fn(cpu, tcpu) > LOCAL_DISTANCE ||
			     cpu_distance_fn(tcpu, cpu) > LOCAL_DISTANCE)) {
				group++;
				nr_groups = max(nr_groups, group + 1);
				goto next_group;
			}
		}
		group_map[cpu] = group;
		group_cnt[group]++;
	}
        ... ...

This fragment assigns CPUs to groups. Since cpu_distance_fn was passed in as NULL, every CPU actually ends up in group 0.
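The grouping loop can be reproduced almost verbatim in user space; with a NULL distance callback every CPU lands in group 0, as stated above (the NR_CPUS and LOCAL_DISTANCE values here are invented):

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS        4
#define LOCAL_DISTANCE 10

static int group_map[NR_CPUS];
static int group_cnt[NR_CPUS];

/* dist may be NULL, mirroring the cpu_distance_fn == NULL case */
static int build_groups(int (*dist)(int, int))
{
    int nr_groups = 1;

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        int group = 0;
    next_group:
        for (int tcpu = 0; tcpu < NR_CPUS; tcpu++) {
            if (cpu == tcpu)
                break;
            /* cpu must not share a group with any tcpu it is "far" from */
            if (group_map[tcpu] == group && dist &&
                (dist(cpu, tcpu) > LOCAL_DISTANCE ||
                 dist(tcpu, cpu) > LOCAL_DISTANCE)) {
                group++;
                if (group + 1 > nr_groups)
                    nr_groups = group + 1;
                goto next_group;
            }
        }
        group_map[cpu] = group;
        group_cnt[group]++;
    }
    return nr_groups;
}
```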

    ...
	/* allocate and fill alloc_info */
	for (group = 0; group < nr_groups; group++)
		nr_units += roundup(group_cnt[group], upa);

	ai = pcpu_alloc_alloc_info(nr_groups, nr_units);
	if (!ai)
		return ERR_PTR(-ENOMEM);
        ... ...

Next the total number of units (each group's CPU count rounded up to upa) is accumulated into nr_units, and pcpu_alloc_alloc_info is called to allocate a pcpu_alloc_info structure.

The pcpu_alloc_info structure is defined as follows:

struct pcpu_group_info {
	int			nr_units;	/* aligned # of units */
	unsigned long		base_offset;	/* base address offset */
	unsigned int		*cpu_map;	/* unit->cpu map, empty
						 * entries contain NR_CPUS */
};

struct pcpu_alloc_info {
	size_t			static_size;
	size_t			reserved_size;
	size_t			dyn_size;
	size_t			unit_size;
	size_t			atom_size;
	size_t			alloc_size;
	size_t			__ai_size;	/* internal, don't use */
	int			nr_groups;	/* 0 if grouping unnecessary */
	struct pcpu_group_info	groups[];
};

The CPU grouping is kept in the groups[] array; each pcpu_group_info describes one group: nr_units is the number of CPUs in the group, base_offset the offset of the group's memory from the start of the whole memory region managed by the chunk, and cpu_map records which CPUs belong to the group.

Now pcpu_alloc_alloc_info itself:

struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups,
						      int nr_units)
{
	struct pcpu_alloc_info *ai;
	size_t base_size, ai_size;
	void *ptr;
	int unit;

	base_size = ALIGN(sizeof(*ai) + nr_groups * sizeof(ai->groups[0]),
			  __alignof__(ai->groups[0].cpu_map[0]));
	ai_size = base_size + nr_units * sizeof(ai->groups[0].cpu_map[0]);

	ptr = alloc_bootmem_nopanic(PFN_ALIGN(ai_size));
	if (!ptr)
		return NULL;
	ai = ptr;
	ptr += base_size;

	ai->groups[0].cpu_map = ptr;

	for (unit = 0; unit < nr_units; unit++)
		ai->groups[0].cpu_map[unit] = NR_CPUS;

	ai->nr_groups = nr_groups;
	ai->__ai_size = PFN_ALIGN(ai_size);

	return ai;
}
base_size is the size without the cpu_map array; ai_size is the total size including it (all groups share a single cpu_map array; each group's cpu_map pointer simply points at a different offset inside it).

The function initializes group[0]'s cpu_map pointer to the start of the cpu_map array; the other groups' cpu_map pointers are left for the caller to initialize. All cpu_map entries are set to NR_CPUS, and finally the nr_groups and __ai_size fields of the pcpu_alloc_info structure are filled in.

Back in pcpu_build_alloc_info:

...
	cpu_map = ai->groups[0].cpu_map;

	for (group = 0; group < nr_groups; group++) {
		ai->groups[group].cpu_map = cpu_map;
		cpu_map += roundup(group_cnt[group], upa);
	}

	ai->static_size = static_size;
	ai->reserved_size = reserved_size;
	ai->dyn_size = dyn_size;
	ai->unit_size = alloc_size / upa;
	ai->atom_size = atom_size;
	ai->alloc_size = alloc_size;
        ... ...

The for loop initializes every group's cpu_map pointer; as noted, each group's pointer is an offset into the shared cpu_map array. Then static_size, reserved_size, dyn_size, unit_size, atom_size, and alloc_size of the pcpu_alloc_info are filled in. Here upa = 1, so unit_size equals alloc_size.

...
	for (group = 0, unit = 0; group_cnt[group]; group++) {
		struct pcpu_group_info *gi = &ai->groups[group];

		/*
		 * Initialize base_offset as if all groups are located
		 * back-to-back.  The caller should update this to
		 * reflect actual allocation.
		 */
		gi->base_offset = unit * ai->unit_size;

		for_each_possible_cpu(cpu)
			if (group_map[cpu] == group)
				gi->cpu_map[gi->nr_units++] = cpu;
		gi->nr_units = roundup(gi->nr_units, upa);
		unit += gi->nr_units;
	}
	BUG_ON(unit != nr_units);

	return ai;
}
This next for loop initializes each group's base_offset. As described earlier, base_offset is the offset of the group's memory from the start of the memory managed by the chunk; ai->unit_size is the space needed per CPU, and unit counts how many CPUs precede the current group. Note that base_offset is initialized as if the groups' regions were laid out back to back; the caller later updates it to reflect the actual allocations. Within each group, the per-CPU regions are contiguous.
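The back-to-back base_offset layout condenses to a few lines (group sizes and unit size below are invented):

```c
#include <assert.h>

/* Each group's base_offset is (number of units in preceding groups)
 * times unit_size, i.e. groups laid out back to back */
static void init_base_offsets(unsigned long *base_offset, const int *nr_units,
                              int nr_groups, unsigned long unit_size)
{
    int unit = 0;
    for (int g = 0; g < nr_groups; g++) {
        base_offset[g] = (unsigned long)unit * unit_size;
        unit += nr_units[g];
    }
}
```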

Then back in pcpu_embed_first_chunk:

...
	size_sum = ai->static_size + ai->reserved_size + ai->dyn_size;
	areas_size = PFN_ALIGN(ai->nr_groups * sizeof(void *));

	areas = alloc_bootmem_nopanic(areas_size);
	if (!areas) {
		rc = -ENOMEM;
		goto out_free;
	}

	/* allocate, copy and determine base address */
	for (group = 0; group < ai->nr_groups; group++) {
		struct pcpu_group_info *gi = &ai->groups[group];
		unsigned int cpu = NR_CPUS;
		void *ptr;

		for (i = 0; i < gi->nr_units && cpu == NR_CPUS; i++)
			cpu = gi->cpu_map[i];
		BUG_ON(cpu == NR_CPUS);

		/* allocate space for the whole group */
		ptr = alloc_fn(cpu, gi->nr_units * ai->unit_size, atom_size);
		if (!ptr) {
			rc = -ENOMEM;
			goto out_free_areas;
		}
		/* kmemleak tracks the percpu allocations separately */
		kmemleak_free(ptr);
		areas[group] = ptr;

		base = min(ptr, base);
	}
        ... ...
areas is an array of pointers to each group's memory. The following for loop allocates the space for every group and stores each group's start address in areas[]; it also tracks the smallest base address in base.
...
	/*
	 * Copy data and free unused parts.  This should happen after all
	 * allocations are complete; otherwise, we may end up with
	 * overlapping groups.
	 */
	for (group = 0; group < ai->nr_groups; group++) {
		struct pcpu_group_info *gi = &ai->groups[group];
		void *ptr = areas[group];

		for (i = 0; i < gi->nr_units; i++, ptr += ai->unit_size) {
			if (gi->cpu_map[i] == NR_CPUS) {
				/* unused unit, free whole */
				free_fn(ptr, ai->unit_size);
				continue;
			}
			/* copy and return the unused part */
			memcpy(ptr, __per_cpu_load, ai->static_size);
			free_fn(ptr + size_sum, ai->unit_size - size_sum);
		}
	}
         ... ...

This fragment copies the static per-CPU variables into every CPU's corresponding region. As described earlier, a static per-CPU definition places the variable in the .data..percpu section, and the __per_cpu_load symbol is the load address of that section.

...
	/* base address is now known, determine group base offsets */
	max_distance = 0;
	for (group = 0; group < ai->nr_groups; group++) {
		ai->groups[group].base_offset = areas[group] - base;
		max_distance = max_t(size_t, max_distance,
				     ai->groups[group].base_offset);
	}
	max_distance += ai->unit_size;

	/* warn if maximum distance is further than 75% of vmalloc space */
	if (max_distance > (VMALLOC_END - VMALLOC_START) * 3 / 4) {
		pr_warning("PERCPU: max_distance=0x%zx too large for vmalloc "
			   "space 0x%lx\n", max_distance,
			   (unsigned long)(VMALLOC_END - VMALLOC_START));
#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
		/* and fail if we have fallback */
		rc = -EINVAL;
		goto out_free;
#endif
	}

	pr_info("PERCPU: Embedded %zu pages/cpu @%p s%zu r%zu d%zu u%zu\n",
		PFN_DOWN(size_sum), base, ai->static_size, ai->reserved_size,
		ai->dyn_size, ai->unit_size);

	rc = pcpu_setup_first_chunk(ai, base);
	goto out_free;

out_free_areas:
	for (group = 0; group < ai->nr_groups; group++)
		free_fn(areas[group],
			ai->groups[group].nr_units * ai->unit_size);
out_free:
	pcpu_free_alloc_info(ai);
	if (areas)
		free_bootmem(__pa(areas), areas_size);
	return rc;
}

Each group's base_offset is recomputed relative to base, and max_distance tracks the largest offset (plus one unit_size). Finally the function calls pcpu_setup_first_chunk to set up the first chunk; this function is also long, so in pieces:

int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,
				  void *base_addr)
{
	static char cpus_buf[4096] __initdata;
	static int smap[PERCPU_DYNAMIC_EARLY_SLOTS] __initdata;
	static int dmap[PERCPU_DYNAMIC_EARLY_SLOTS] __initdata;
	size_t dyn_size = ai->dyn_size;
	size_t size_sum = ai->static_size + ai->reserved_size + dyn_size;
	struct pcpu_chunk *schunk, *dchunk = NULL;
	unsigned long *group_offsets;
	size_t *group_sizes;
	unsigned long *unit_off;
	unsigned int cpu;
	int *unit_map;
	int group, unit, i;

	cpumask_scnprintf(cpus_buf, sizeof(cpus_buf), cpu_possible_mask);

#define PCPU_SETUP_BUG_ON(cond)	do {					\
	if (unlikely(cond)) {						\
		pr_emerg("PERCPU: failed to initialize, %s", #cond);	\
		pr_emerg("PERCPU: cpu_possible_mask=%s\n", cpus_buf);	\
		pcpu_dump_alloc_info(KERN_EMERG, ai);			\
		BUG();							\
	}								\
} while (0)

	/* sanity checks */
	PCPU_SETUP_BUG_ON(ai->nr_groups <= 0);
#ifdef CONFIG_SMP
	PCPU_SETUP_BUG_ON(!ai->static_size);
	PCPU_SETUP_BUG_ON((unsigned long)__per_cpu_start & ~PAGE_MASK);
#endif
	PCPU_SETUP_BUG_ON(!base_addr);
	PCPU_SETUP_BUG_ON((unsigned long)base_addr & ~PAGE_MASK);
	PCPU_SETUP_BUG_ON(ai->unit_size < size_sum);
	PCPU_SETUP_BUG_ON(ai->unit_size & ~PAGE_MASK);
	PCPU_SETUP_BUG_ON(ai->unit_size < PCPU_MIN_UNIT_SIZE);
	PCPU_SETUP_BUG_ON(ai->dyn_size < PERCPU_DYNAMIC_EARLY_SIZE);
	PCPU_SETUP_BUG_ON(pcpu_verify_alloc_info(ai) < 0);

	/* process group information and build config tables accordingly */
	group_offsets = alloc_bootmem(ai->nr_groups * sizeof(group_offsets[0]));
	group_sizes = alloc_bootmem(ai->nr_groups * sizeof(group_sizes[0]));
	unit_map = alloc_bootmem(nr_cpu_ids * sizeof(unit_map[0]));
	unit_off = alloc_bootmem(nr_cpu_ids * sizeof(unit_off[0]));

	for (cpu = 0; cpu < nr_cpu_ids; cpu++)
		unit_map[cpu] = UINT_MAX;

	pcpu_low_unit_cpu = NR_CPUS;
	pcpu_high_unit_cpu = NR_CPUS;

	for (group = 0, unit = 0; group < ai->nr_groups; group++, unit += i) {
		const struct pcpu_group_info *gi = &ai->groups[group];

		group_offsets[group] = gi->base_offset;
		group_sizes[group] = gi->nr_units * ai->unit_size;

		for (i = 0; i < gi->nr_units; i++) {
			cpu = gi->cpu_map[i];
			if (cpu == NR_CPUS)
				continue;

			PCPU_SETUP_BUG_ON(cpu > nr_cpu_ids);
			PCPU_SETUP_BUG_ON(!cpu_possible(cpu));
			PCPU_SETUP_BUG_ON(unit_map[cpu] != UINT_MAX);

			unit_map[cpu] = unit + i;
			unit_off[cpu] = gi->base_offset + i * ai->unit_size;

			/* determine low/high unit_cpu */
			if (pcpu_low_unit_cpu == NR_CPUS ||
			    unit_off[cpu] < unit_off[pcpu_low_unit_cpu])
				pcpu_low_unit_cpu = cpu;
			if (pcpu_high_unit_cpu == NR_CPUS ||
			    unit_off[cpu] > unit_off[pcpu_high_unit_cpu])
				pcpu_high_unit_cpu = cpu;
		}
	}
	pcpu_nr_units = unit;

	for_each_possible_cpu(cpu)
		PCPU_SETUP_BUG_ON(unit_map[cpu] == UINT_MAX);
         ... ...
base_addr is the smallest group base address computed earlier. After some sanity checks, space is allocated for several global tables: group_offsets[] holds each group's base_offset; group_sizes[] the total space of each group; unit_map[] maps a CPU to its unit number, where units are numbered by the position of the CPU's region within the chunk, smallest offset first; unit_off[] records each CPU's offset from the chunk's base address. The for loop walks every group to initialize these tables. The total number of units (CPUs) is stored in the global pcpu_nr_units; pcpu_low_unit_cpu is the CPU with the smallest unit_off, pcpu_high_unit_cpu the one with the largest.

...
	/* we're done parsing the input, undefine BUG macro and dump config */
#undef PCPU_SETUP_BUG_ON
	pcpu_dump_alloc_info(KERN_DEBUG, ai);

	pcpu_nr_groups = ai->nr_groups;
	pcpu_group_offsets = group_offsets;
	pcpu_group_sizes = group_sizes;
	pcpu_unit_map = unit_map;
	pcpu_unit_offsets = unit_off;

	/* determine basic parameters */
	pcpu_unit_pages = ai->unit_size >> PAGE_SHIFT;
	pcpu_unit_size = pcpu_unit_pages << PAGE_SHIFT;
	pcpu_atom_size = ai->atom_size;
	pcpu_chunk_struct_size = sizeof(struct pcpu_chunk) +
		BITS_TO_LONGS(pcpu_unit_pages) * sizeof(unsigned long);

	/*
	 * Allocate chunk slots.  The additional last slot is for
	 * empty chunks.
	 */
	pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 2;
	pcpu_slot = alloc_bootmem(pcpu_nr_slots * sizeof(pcpu_slot[0]));
	for (i = 0; i < pcpu_nr_slots; i++)
		INIT_LIST_HEAD(&pcpu_slot[i]);
         ... ...

The tables built above were held in local pointers; they are now stored in the real globals: pcpu_nr_groups, pcpu_group_offsets, pcpu_group_sizes, pcpu_unit_map, pcpu_unit_offsets. More globals follow: pcpu_unit_pages is unit_size in pages; pcpu_unit_size and pcpu_atom_size were described earlier; pcpu_chunk_struct_size is the size needed when allocating a chunk structure, and since struct pcpu_chunk ends with the flexible unsigned long populated[] bitmap of mapped pages, the size of that bitmap must be added to the structure's own size. Then the chunk slots are allocated and stored in the global pcpu_slot; as noted earlier, pcpu_slot is a list_head array into which all chunks are linked according to their free size. Two helper functions first:

static int __pcpu_size_to_slot(int size)
{
	int highbit = fls(size);	/* size is in bytes */
	return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1);
}

static int pcpu_size_to_slot(int size)
{
	if (size == pcpu_unit_size)
		return pcpu_nr_slots - 1;
	return __pcpu_size_to_slot(size);
}

static int pcpu_chunk_slot(const struct pcpu_chunk *chunk)
{
	if (chunk->free_size < sizeof(int) || chunk->contig_hint < sizeof(int))
		return 0;

	return pcpu_size_to_slot(chunk->free_size);
}
fls is defined in include/asm-generic/bitops/fls.h: find last (most-significant) bit set, e.g. fls(0) = 0, fls(1) = 1, fls(0x80000000) = 32.

#define PCPU_SLOT_BASE_SHIFT		5	

pcpu_chunk_slot picks the pcpu_slot list a chunk belongs on according to its free space. Two list heads in the array are special. One is pcpu_slot[0]: as pcpu_chunk_slot shows, if the chunk's free space, or its largest contiguous free area, is smaller than sizeof(int), the chunk goes on pcpu_slot[0]. Otherwise pcpu_size_to_slot is called: if the free space equals pcpu_unit_size (i.e. the chunk is completely empty), the chunk goes on the last list, pcpu_slot[pcpu_nr_slots - 1]; otherwise __pcpu_size_to_slot decides which list the chunk is linked into.

That is why pcpu_setup_first_chunk allocates two extra slots: one is pcpu_slot[0] and the other is the final slot. All the list heads in the pcpu_slot array are then initialized.
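A user-space rendering of the slot computation (fls is reimplemented here; the kernel uses its own bitops version):

```c
#include <assert.h>

#define PCPU_SLOT_BASE_SHIFT 5

/* fls(): 1-based index of the most-significant set bit; fls(0) == 0 */
static int fls_sim(unsigned int x)
{
    int r = 0;
    while (x) { r++; x >>= 1; }
    return r;
}

/* mirror of __pcpu_size_to_slot: every doubling of free size past
 * 32 bytes moves the chunk one slot up, never below slot 1 */
static int size_to_slot(int size)
{
    int slot = fls_sim((unsigned int)size) - PCPU_SLOT_BASE_SHIFT + 2;
    return slot > 1 ? slot : 1;
}
```

So a chunk with 8 free bytes sits in slot 1, one with 16 in slot 2, and one with 1024 in slot 8.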

Continuing in pcpu_setup_first_chunk:

	/*
	 * Initialize static chunk.  If reserved_size is zero, the
	 * static chunk covers static area + dynamic allocation area
	 * in the first chunk.  If reserved_size is not zero, it
	 * covers static area + reserved area (mostly used for module
	 * static percpu allocation).
	 */
	schunk = alloc_bootmem(pcpu_chunk_struct_size);
	INIT_LIST_HEAD(&schunk->list);
	schunk->base_addr = base_addr;
	schunk->map = smap;
	schunk->map_alloc = ARRAY_SIZE(smap);
	schunk->immutable = true;
	bitmap_fill(schunk->populated, pcpu_unit_pages);

	if (ai->reserved_size) {
		schunk->free_size = ai->reserved_size;
		pcpu_reserved_chunk = schunk;
		pcpu_reserved_chunk_limit = ai->static_size + ai->reserved_size;
	} else {
		schunk->free_size = dyn_size;
		dyn_size = 0;			/* dynamic area covered */
	}
	schunk->contig_hint = schunk->free_size;

	schunk->map[schunk->map_used++] = -ai->static_size;
	if (schunk->free_size)
		schunk->map[schunk->map_used++] = schunk->free_size;

	/* init dynamic chunk if necessary */
	if (dyn_size) {
		dchunk = alloc_bootmem(pcpu_chunk_struct_size);
		INIT_LIST_HEAD(&dchunk->list);
		dchunk->base_addr = base_addr;
		dchunk->map = dmap;
		dchunk->map_alloc = ARRAY_SIZE(dmap);
		dchunk->immutable = true;
		bitmap_fill(dchunk->populated, pcpu_unit_pages);

		dchunk->contig_hint = dchunk->free_size = dyn_size;
		dchunk->map[dchunk->map_used++] = -pcpu_reserved_chunk_limit;
		dchunk->map[dchunk->map_used++] = dchunk->free_size;
	}

	/* link the first chunk in */
	pcpu_first_chunk = dchunk ?: schunk;
	pcpu_chunk_relocate(pcpu_first_chunk, -1);

	/* we're done */
	pcpu_base_addr = base_addr;
	return 0;
}

Here the static chunk is initialized. The chunk's base address is the smallest group base address found earlier. bitmap_fill sets the whole populated bitmap, meaning every corresponding page is already mapped (the static chunk's memory was just allocated in pcpu_embed_first_chunk). Since ai->reserved_size is non-zero here, schunk->free_size is set to ai->reserved_size and pcpu_reserved_chunk points at schunk, so future reserved allocations are served from this chunk. The chunk's map is then seeded: ai->static_size is recorded as used space, ai->reserved_size as allocatable space. Because dyn_size is non-zero, a second chunk, dchunk, is initialized for dynamic allocation: its used space is ai->static_size + ai->reserved_size and its allocatable space is dyn_size. Finally pcpu_first_chunk is set to dchunk (or to schunk when ai->reserved_size is zero). In other words, with no reserved size dynamic allocation uses the static chunk directly; with a reserved size, reserved allocations use the static chunk and dynamic allocations use dchunk. pcpu_first_chunk is then linked into the matching pcpu_slot list, and pcpu_base_addr is set to base_addr.
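The map seeding for schunk and dchunk follows the same two-entry pattern, which can be sketched as (sizes invented):

```c
#include <assert.h>

struct mini_chunk { int map[4]; int map_used; int free_size; };

/* First entry: the already-occupied leading region, stored negative.
 * Second entry (if any space remains): the free area, stored positive. */
static void seed_first_chunk(struct mini_chunk *c, int used, int free_size)
{
    c->map_used = 0;
    c->free_size = free_size;
    c->map[c->map_used++] = -used;          /* static (+ reserved) area */
    if (free_size)
        c->map[c->map_used++] = free_size;  /* still-free dynamic area */
}
```

For schunk, used is ai->static_size and free_size is ai->reserved_size; for dchunk, used is pcpu_reserved_chunk_limit and free_size is dyn_size.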
