Allocating memory to Kernel Mode processes ==== >
Three ways for a kernel function to get dynamic memory:
1. __get_free_pages() or alloc_pages() to get pages from the zoned page frame allocator
2. kmem_cache_alloc() or kmalloc() to use the slab allocator
3. vmalloc() or vmalloc_32() to get a noncontiguous memory area
The kernel trusts itself. All kernel functions are assumed to be error-free, so the kernel does not need to insert any protection against programming errors.
Allocating memory to User Mode processes ==== >
1. The kernel tries to defer allocating dynamic memory to User Mode processes.
2. The kernel must be prepared to catch all addressing errors caused by User Mode processes.
Key concepts: Memory Region, address space management, Page Fault exception handler
The address space consists of all linear addresses that the process is allowed to use. Because each process sees a different set of linear addresses, the address used by one process bears no relation to the address used by another. (This implies that not all linear addresses can be used by a process: a linear address can be used only after it has been added to the process's address space.)
The kernel represents intervals of linear addresses by means of memory regions, each characterized by an initial linear address, a length, and a set of access rights.
Memory region creation and deletion
- Process creation and destruction
- Exec functions
- Memory mapping
- Expansion of the User Mode stack
- Expansion of the heap
- IPC shared memory
Even if some interval of linear addresses has been added to the process's address space, the page frames corresponding to those linear addresses may not have been allocated yet. The programmer or the user usually does not need to care about this: accessing such a linear address causes a Page Fault, and the Page Fault handler performs demand paging for us. This kind of page fault is different from those caused by programming errors.
Type: mm_struct
The memory descriptor, referenced by the mm field of the process descriptor, contains all information related to the process's address space.
struct mm_struct {
	struct vm_area_struct *mmap;		/* list of VMAs */
	struct rb_root mm_rb;
	struct vm_area_struct *mmap_cache;	/* last find_vma result */
#ifdef CONFIG_MMU
	unsigned long (*get_unmapped_area) (struct file *filp,
				unsigned long addr, unsigned long len,
				unsigned long pgoff, unsigned long flags);
	void (*unmap_area) (struct mm_struct *mm, unsigned long addr);
#endif
	unsigned long mmap_base;		/* base of mmap area */
	unsigned long task_size;		/* size of task vm space */
	unsigned long cached_hole_size;		/* if non-zero, the largest hole below free_area_cache */
	unsigned long free_area_cache;		/* first hole of size cached_hole_size or larger */
	pgd_t *pgd;
	atomic_t mm_users;			/* How many users with user space? */
	atomic_t mm_count;			/* How many references to "struct mm_struct" (users count as 1) */
	int map_count;				/* number of VMAs */
	struct rw_semaphore mmap_sem;
	spinlock_t page_table_lock;		/* Protects page tables and some counters */
	struct list_head mmlist;		/* List of maybe swapped mm's. These are globally strung
						 * together off init_mm.mmlist, and are protected
						 * by mmlist_lock
						 */
	unsigned long hiwater_rss;		/* High-watermark of RSS usage */
	unsigned long hiwater_vm;		/* High-water virtual memory usage */
	unsigned long total_vm, locked_vm, shared_vm, exec_vm;
	unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
	unsigned long start_code, end_code, start_data, end_data;
	unsigned long start_brk, brk, start_stack;
	unsigned long arg_start, arg_end, env_start, env_end;
	unsigned long saved_auxv[AT_VECTOR_SIZE];	/* for /proc/PID/auxv */
	/*
	 * Special counters, in some configurations protected by the
	 * page_table_lock, in other configurations by being atomic.
	 */
	struct mm_rss_stat rss_stat;
	struct linux_binfmt *binfmt;
	cpumask_t cpu_vm_mask;
	/* Architecture-specific MM context */
	mm_context_t context;
	/* Swap token stuff */
	/*
	 * Last value of global fault stamp as seen by this process.
	 * In other words, this value gives an indication of how long
	 * it has been since this task got the token.
	 * Look at mm/thrash.c
	 */
	unsigned int faultstamp;
	unsigned int token_priority;
	unsigned int last_interval;
	unsigned long flags;			/* Must use atomic bitops to access the bits */
	struct core_state *core_state;		/* coredumping support */
#ifdef CONFIG_AIO
	spinlock_t ioctx_lock;
	struct hlist_head ioctx_list;
#endif
#ifdef CONFIG_MM_OWNER
	/*
	 * "owner" points to a task that is regarded as the canonical
	 * user/owner of this mm. All of the following must be true in
	 * order for it to be changed:
	 *
	 * current == mm->owner
	 * current->mm != mm
	 * new_owner->mm == mm
	 * new_owner->alloc_lock is held
	 */
	struct task_struct *owner;
#endif
#ifdef CONFIG_PROC_FS
	/* store ref to file /proc/<pid>/exe symlink points to */
	struct file *exe_file;
	unsigned long num_exe_file_vmas;
#endif
#ifdef CONFIG_MMU_NOTIFIER
	struct mmu_notifier_mm *mmu_notifier_mm;
#endif
};
Note the difference between the two reference counters: mm_users counts the lightweight processes that share this address space, while mm_count counts references to the mm_struct itself, where all the mm_users together contribute a single unit (kernel threads that temporarily borrow the address space increase only mm_count). When mm_users drops to zero the address space can be torn down; when mm_count drops to zero the memory descriptor itself is freed.
Type: vm_area_struct
Two adjacent memory regions are merged if their access rights match.
Red-black tree
On tree structures
A tree is, in essence, a structure in which a node's children (and sometimes its parent or a sibling) can be reached in O(1) time.
A child can be reached in O(1) time either through an array index or through child pointers stored in the node. The latter is more common, but the former is also powerful: for example, an array can implement a heap (a binary tree), and such a heap can in turn implement heapsort or a simple priority queue.
In practice, if instances of some data structure A must be linked to one another, but A is fairly large, or the number of instances grows and shrinks dynamically over a wide range, then an array cannot be used to store them. The usual solution is to string all instances of A together on a list. The problem is that when the list holds few elements, the cost of search, add, and delete is small and acceptable; but when it holds many, say hundreds or thousands, those operations cost O(n) time (for a sorted list). In that case, organizing the members of A into a tree in addition to the list reduces the cost of search, add, and delete dramatically, to O(log n) time. This solution is used widely in the Linux kernel; for example, the memory region descriptors of a memory descriptor are organized with exactly this strategy. It trades space for time: a more complex data structure in exchange for more efficient algorithms.
Page Fault exception handler for the 80x86 architecture: do_page_fault()
From the diagram, a SIGSEGV error has two possible causes:
1. The accessed address is not within the process address space.
2. The accessed address is within the process address space, but the access rights do not match.
clone() -> copy_mm()
If the CLONE_VM flag is set, copy_mm() gives the clone (tsk) the address space of its parent (current).
If the CLONE_VM flag is not set, copy_mm() creates a new address space, although no page frames are allocated for it until an address is actually requested. The function allocates a new memory descriptor, copies the contents of current->mm into tsk->mm, and then changes a few fields.
exit_mm() -> mm_release()
Heap: a specific memory region, extending from start_brk to brk, used to satisfy a process's dynamic memory requests.
malloc()
calloc()
realloc()
free()
brk()
sbrk()