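Reading the Linux kernel mailing list today, I came across a very creative patch called ksm, the kernel shared memory driver. After reading it I found it remarkably inventive, though I am not sure how useful it is in practice. The rough idea is to scan pages (in the regions registered with it), merge pages with identical content into one, mark that page read-only, and rely on copy-on-write afterwards. If the system holds many pages with identical content, the patch can obviously save a lot of memory. Two questions remain, though. First, how likely is it that such identical pages exist? Second, memory is cheap nowadays, so is trading time for space worth it? Either way the idea deserves appreciation, especially some of its data structures and algorithms. Let's start with the main data structures.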
struct ksm_memory_region { // describes one region that this patch may manage and scan
__u32 npages; // number of pages in the region
__u32 pad;
__u64 addr; // starting virtual address
__u64 reserved_bits;
};
This structure is analogous to a vma: it is only meaningful relative to a process.
struct ksm_mem_slot {
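struct list_head link; // links every slot in the system into one list
struct list_head sma_link; // also links the slots that belong to one sma
struct mm_struct *mm; // the mm (address space) this slot belongs to
unsigned long addr; // starting address of the slot
unsigned npages; // number of pages in the slot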
};
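Note the similarities and differences between these two structures. They largely overlap, but they live at different levels: ksm_memory_region is what a user process passes in to register a region; when the ksm kernel code receives it, it initializes a slot and links the slot into the ksm system. Note also that the address range in a slot is really a range inside a vma, which is why the slot carries an mm naming the address space in use. The mm field matters a lot, as the code analysis below will show.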
struct ksm_sma {
struct list_head sma_slots;
};
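This structure could hardly be simpler. Its list_head is the head of a list of slots. Which slots? The structure exists once per process, so the list holds that process's slots: every process has one ksm_sma, all of its slots are linked on the sma_slots list of that ksm_sma, and the list nodes are the ksm_mem_slot structures above.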
struct ksm_scan {
struct ksm_mem_slot *slot_index; // slot currently being scanned; it always carries an mm context
unsigned long page_index; // current page within that slot; a slot may hold npages pages, not necessarily 1
};
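Why does a slot need an mm context? Merging pages requires locating the pages to merge. All we have are virtual addresses, and getting from a virtual address to the page means getting to a physical address, which is done the way the MMU does it, walking page directory -> page table. That walk is per address space, and mm is exactly the kernel's representation of an address space; it holds the pgd.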
struct tree_item {
struct rb_node node;
struct rmap_item *rmap_item;
};
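This structure is a tree node. Why a tree? To scan at all, the patch has to find the pages to compare against efficiently, so it needs an efficient store for the pages that are scanned periodically, and a red-black tree is a good choice. What is the key of this tree? The content of the page itself. Lacking anything better, comparing two pages with memcmp is one way to do it. I wondered whether a strcmp-like comparison could be cheaper, since strcmp stops at the first \0, but plain strcmp cannot be used: if two pages begin with zero bytes and differ only later, strcmp reports them as equal. It would have to be modified so that it stops early without skipping any byte. (A typical memcmp already stops at the first differing byte, so only identical pages cost a full-page comparison.)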
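To make the "page content is the key" idea concrete, here is a minimal userspace sketch of a three-way page comparison. The names page_content_cmp and SKETCH_PAGE_SIZE are made up for this illustration and do not come from the patch, which does the equivalent work in memcmp_pages on kernel-mapped pages.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096 /* assumed page size, for illustration only */

/*
 * Three-way comparison of two page-sized buffers.
 * The sign of the result decides whether a tree search goes left or right;
 * zero means the two pages have identical content.
 */
static int page_content_cmp(const unsigned char *a, const unsigned char *b)
{
    assert(a && b);
    int r = memcmp(a, b, SKETCH_PAGE_SIZE);
    return (r > 0) - (r < 0); /* normalise to -1, 0 or 1 */
}
```

Two pages that start with zero bytes and differ only at a later offset still compare as different here, which is exactly the case where plain strcmp would give the wrong answer.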
struct rmap_item { // a reverse mapping: besides going from a slot to a tree node we also need the opposite direction
struct hlist_node link;
struct mm_struct *mm;
unsigned long address;
unsigned int oldchecksum; // checksum from the previous scan
unsigned char stable_tree; // whether the item is in the stable tree
struct tree_item *tree_item;
struct rmap_item *next;
struct rmap_item *prev;
};
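The rmap is an efficient structure. Linux virtual memory has its own reverse mapping: the forward mapping goes from a page table entry to exactly one page, while the reverse mapping goes from one page to possibly many page table entries, a one-to-many relation. The rmap_item here works the same way: one tree node may correspond to many virtual pages, because the whole point of the patch is to merge many identical physical pages into one, so many addresses inside slots end up attached to a single tree node. Now for the exciting part, the algorithm.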
static inline int PageKsm(struct page *page)
{
return !PageAnon(page);
}
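This function is interesting. The patch has no dedicated flag that marks a page as shared. Instead it relies on the pages it scans being anonymous: a page found in a scanned region that is not anonymous is taken to be a shared page. In general, being non-anonymous is only a necessary condition for being a shared page, not a sufficient one.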
static inline u32 calc_checksum(struct page *page)
{
u32 checksum;
void *addr = kmap_atomic(page, KM_USER0); // temporary mapping in the KM_USER0 slot; must not sleep until it is unmapped
checksum = jhash(addr, PAGE_SIZE, 17); // hash the content of the page
kunmap_atomic(addr, KM_USER0);
return checksum; // return the hash value
}
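The function above computes a hash of a page's content. The value ends up in the oldchecksum field of rmap_item; if a later computation yields a different value, the page has been written to in the meantime.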
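The checksum test can be sketched in userspace as follows. This is only an illustration under stated assumptions: FNV-1a stands in for the kernel's jhash, and sketch_checksum, sketch_item and page_changed are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a over the buffer; used here only as a stand-in for jhash(). */
static uint32_t sketch_checksum(const unsigned char *page, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= page[i];
        h *= 16777619u;
    }
    return h;
}

/* Hypothetical per-page bookkeeping: just the checksum of the last scan. */
struct sketch_item {
    uint32_t oldchecksum;
};

/*
 * Returns 1 if the page content differs from what the previous scan saw
 * (and records the new checksum), 0 if the checksum is unchanged.
 */
static int page_changed(struct sketch_item *item,
                        const unsigned char *page, size_t len)
{
    assert(item && page);
    uint32_t sum = sketch_checksum(page, len);
    if (item->oldchecksum != sum) {
        item->oldchecksum = sum;
        return 1;
    }
    return 0;
}
```

A page that keeps reporting a change is a poor merge candidate, which is why cmp_and_merge_page skips the unstable tree for that round when the checksum differs.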
static struct rmap_item *get_rmap_item(struct mm_struct *mm, unsigned long addr)
{
struct rmap_item *rmap_item;
struct hlist_head *bucket;
struct hlist_node *node;
bucket = &rmap_hash[addr % nrmaps_hash]; // pick the hash bucket for this address
hlist_for_each_entry(rmap_item, node, bucket, link) {
if (mm == rmap_item->mm && rmap_item->address == addr) {
return rmap_item;
}
}
return NULL;
}
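The function above returns a reverse-mapping item: given an address space and a virtual address, it returns the matching rmap_item. Because several virtual pages can map to one shared page, there may be many such items, and all of them are kept in one hash table whose bucket is chosen by address, with mm and address together identifying an entry. In other words, each node of the red-black tree stands for one shared physical page, and every rmap_item, including those that share that page, can be found through its hash bucket.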
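The lookup by (mm, addr) can be mimicked in userspace with a plain chained hash table. This is a sketch, not the patch's code: sketch_rmap, sketch_add, sketch_get and the bucket count are assumptions made for the example, and a singly linked chain replaces the kernel's hlist.

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 8 /* arbitrary bucket count for the sketch */

/* Hypothetical stand-in for rmap_item: identity is the (mm, address) pair. */
struct sketch_rmap {
    const void *mm;
    unsigned long address;
    struct sketch_rmap *next_in_bucket;
};

static struct sketch_rmap *buckets[NBUCKETS];

/* Link an item at the head of the bucket chosen by its address. */
static void sketch_add(struct sketch_rmap *item)
{
    assert(item);
    struct sketch_rmap **head = &buckets[item->address % NBUCKETS];
    item->next_in_bucket = *head;
    *head = item;
}

/* Walk one bucket and return the item matching both mm and address. */
static struct sketch_rmap *sketch_get(const void *mm, unsigned long addr)
{
    struct sketch_rmap *it;

    for (it = buckets[addr % NBUCKETS]; it; it = it->next_in_bucket)
        if (it->mm == mm && it->address == addr)
            return it;
    return NULL;
}
```

The same virtual address in two different address spaces lands in the same bucket, so the mm comparison is what keeps the two entries apart.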
static struct rmap_item *create_new_rmap_item(struct mm_struct *mm, unsigned long addr, unsigned int checksum)
{
struct rmap_item *rmap_item;
struct hlist_head *bucket;
rmap_item = alloc_rmap_item();
if (!rmap_item)
return NULL;
rmap_item->mm = mm;
rmap_item->address = addr;
rmap_item->oldchecksum = checksum;
rmap_item->stable_tree = 0;
rmap_item->tree_item = NULL;
bucket = &rmap_hash[addr % nrmaps_hash]; // compute the hash bucket
hlist_add_head(&rmap_item->link, bucket); // add the item to the bucket's list
return rmap_item;
}
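The function above creates a new reverse-mapping item. The item serves two purposes. First, it is hashed and linked into the matching bucket of the hash table. Second, it is linked into a red-black tree; if an equal node already exists in the tree, the item is appended to that node's chain of reverse mappings instead.

There are two trees. The stable tree holds pages that are already shared; the unstable tree holds pages that are not shared yet but might be. Every scanned page is first looked up in the stable tree. If a node with identical content is found, the two pages are merged: the resulting page is write-protected so that later writes trigger copy-on-write, and since a matching node already exists the item is simply linked into that node's chain rather than inserted into the tree. If nothing is found, the page is searched for in the unstable tree. On a match a merge is attempted, and if it succeeds the node leaves the unstable tree and enters the stable tree. Otherwise the page is at least inserted into the unstable tree, and if its content is unchanged at the next scan it may then make it into the stable tree. Note that hash lookups are keyed by mm and addr. All of this happens in the following function: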
static int cmp_and_merge_page(struct ksm_scan *ksm_scan, struct page *page)
{
struct page *page2[1];
struct ksm_mem_slot *slot;
struct tree_item *tree_item;
struct rmap_item *rmap_item;
struct rmap_item *tree_rmap_item;
unsigned int checksum;
unsigned long addr;
int wait = 0;
int ret;
slot = ksm_scan->slot_index;
addr = slot->addr + ksm_scan->page_index * PAGE_SIZE;
rmap_item = get_rmap_item(slot->mm, addr);
if (rmap_item) {
if (update_tree(rmap_item, &wait))
rmap_item = NULL;
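}
// Search the stable tree first for an item whose page matches this page; the search fills in page2.
// A hit means page2 is already shared, so page can be merged with it, and the rmap_item
// only needs to be linked into the tree node's chain instead of being inserted into the tree.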
tree_rmap_item = stable_tree_search(page, page2, rmap_item);
if (tree_rmap_item) {
ret = try_to_merge_two_pages(slot->mm, page, tree_rmap_item->mm, page2[0], addr, tree_rmap_item->address);
put_page(page2[0]);
if (!ret) {
if (!rmap_item)
rmap_item = create_new_rmap_item(slot->mm, addr, 0); // no rmap_item yet: create one and link it into its hash bucket
...
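// The tree_item is already in the stable tree, so nothing is inserted: the item is linked into the node's
// chain of reverse mappings. This is the same tree-plus-chain idea as the priority-tree reverse mapping
// of file cache pages: every node carries a chain.
rmap_item->next = tree_rmap_item->next;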
rmap_item->prev = tree_rmap_item;
if (tree_rmap_item->next)
tree_rmap_item->next->prev = rmap_item;
tree_rmap_item->next = rmap_item;
rmap_item->stable_tree = 1;
rmap_item->tree_item = tree_rmap_item->tree_item;
}
ret = !ret;
goto out;
}
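if (rmap_item) { // an rmap_item already exists, so the previous attempt to reach the stable tree failed; if this one fails too, give up again
checksum = calc_checksum(page); // a changed checksum means the page is being written: stop here, since even a merged page would likely be split off again by copy-on-write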
if (rmap_item->oldchecksum != checksum) {
rmap_item->oldchecksum = checksum;
goto out;
}
}
tree_item = unstable_tree_search_insert(page, page2, rmap_item);
if (tree_item) {
rmap_item = tree_item->rmap_item;
ret = try_to_merge_two_pages(slot->mm, page, rmap_item->mm, page2[0], addr, rmap_item->address);
if (!ret) { // merge succeeded: insert the item into the stable tree; this is where an item gets its chance to enter it
rb_erase(&tree_item->node, &root_unstable_tree);
stable_tree_insert(page2[0], tree_item, rmap_item);
}
put_page(page2[0]);
ret = !ret;
goto out;
}
...
}
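The stable-then-unstable order used by cmp_and_merge_page can be shown with a much smaller userspace model. This is a simplified sketch under loud assumptions: linear arrays replace the two red-black trees, every merge is assumed to succeed, the checksum gate is left out, and all names (page_set, scan_one, the outcome values) are invented for the example.

```c
#include <assert.h>
#include <string.h>

#define PG   16 /* tiny "page" size, for illustration only */
#define MAXN 32 /* capacity of each set in this sketch */

enum outcome { MERGED_WITH_STABLE, PROMOTED_TO_STABLE, QUEUED_UNSTABLE };

/* A linear array stands in for a red-black tree keyed by page content. */
struct page_set {
    const unsigned char *pages[MAXN];
    int n;
};

static struct page_set stable_set, unstable_set;

/* Return the index of a page with identical content, or -1. */
static int find_page(const struct page_set *s, const unsigned char *page)
{
    for (int i = 0; i < s->n; i++)
        if (memcmp(s->pages[i], page, PG) == 0)
            return i;
    return -1;
}

/* Process one scanned page in the same order as the patch: stable first. */
static enum outcome scan_one(const unsigned char *page)
{
    assert(stable_set.n < MAXN && unstable_set.n < MAXN);

    if (find_page(&stable_set, page) >= 0)
        return MERGED_WITH_STABLE;      /* joins an already shared page */

    int i = find_page(&unstable_set, page);
    if (i >= 0) {
        /* Candidate matched: move it from the unstable to the stable set. */
        stable_set.pages[stable_set.n++] = unstable_set.pages[i];
        unstable_set.pages[i] = unstable_set.pages[--unstable_set.n];
        return PROMOTED_TO_STABLE;
    }

    unstable_set.pages[unstable_set.n++] = page; /* candidate for later */
    return QUEUED_UNSTABLE;
}
```

Feeding three identical pages through scan_one shows the tiers at work: the first is queued as a candidate, the second promotes the pair to the stable set, and the third merges directly with the stable entry.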
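Finally, let's look at the scanning process, which is quite simple. The nodes of the slots list are scanned one by one. For every page-aligned address in a slot, get_user_pages is called with the slot's mm field to obtain the page, and that page is then compared against the stable tree first and the unstable tree second. If an identical page is found, a merge is attempted. If during the comparison some page turns out to have been swapped out, or to have changed because of COW, it is removed from the tree. At regular intervals the system scans all slots looking for pages that can be merged. The whole process rests on nothing more than two trees, yet I find the idea excellent, even if its practical value is debatable: the stable tree maintains a steady set of reverse mappings for addresses that have already been merged, while the unstable tree supplies candidates. This forms a tiered structure, a plain expression of selection priority. The last piece is the user interface: the patch exposes it through the VFS and is controlled with ioctl.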
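The cursor movement behind this scan (pages within a slot, then the next slot, then wrap around) can be sketched in userspace like this. An array of slots replaces the kernel's linked list, and sketch_slot, sketch_scan and sketch_next are hypothetical names for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for ksm_mem_slot and ksm_scan. */
struct sketch_slot {
    unsigned long addr;
    unsigned npages;
};

struct sketch_scan {
    size_t slot_index;        /* index of the slot being scanned */
    unsigned long page_index; /* page within that slot */
};

/*
 * Advance the cursor by one page.
 * Returns 1 when the cursor wraps around to the first slot, which marks
 * the start of a new scan round, and 0 otherwise.
 */
static int sketch_next(struct sketch_scan *scan,
                       const struct sketch_slot *slots, size_t nslots)
{
    assert(nslots > 0);

    /* Next page of the current slot. */
    if (scan->page_index + 1 < slots[scan->slot_index].npages) {
        scan->page_index++;
        return 0;
    }

    /* First page of the next slot. */
    scan->page_index = 0;
    if (scan->slot_index + 1 < nslots) {
        scan->slot_index++;
        return 0;
    }

    /* Past the last slot: start over. */
    scan->slot_index = 0;
    return 1;
}
```

In the patch's scan_get_next_index the wrap-around is also the point where root_unstable_tree is reset to RB_ROOT, so every round rebuilds its candidate set from scratch.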
The patch relies on one helper function, replace_page. In its author's words: "this function is needed in cases you want to change the userspace virtual mapping into different physical page, this function is working by removing the oldpage from the rmap and calling put_page on it, and by setting the virtual address pte to point into the new page."
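The ksm patch cleverly uses the rmap_item structure as the handle for locating elements in the red-black tree. Here is the code that walks the chain hanging off a tree node: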
while (found_rmap_item) {
if (!rmap_item ||
!(found_rmap_item->mm == rmap_item->mm &&
found_rmap_item->address == rmap_item->address)) {
if (!is_zapped_item(found_rmap_item, page2))
break;
remove_rmap_item_from_tree(found_rmap_item);
}
found_rmap_item = found_rmap_item->next;
}
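The logic here is that what the red-black tree stores is not the page itself but an entity that stands for a merged page, namely the reverse-mapping structure. A single tree element may have many such structures hanging off it, which I call a chain. Why not compare pages directly? Because pages from different process address spaces may be merged into one, and we cannot prevent one of those processes from exiting, swapping the page out, or doing something else such as leaving the ksm mechanism. There has to be a way to keep the original address space and virtual address, and that is what the reverse mapping and the chain are for. In every scan round the state of each element of the stable tree has to be refreshed, which is what the loop above does: it searches and refreshes in a single pass, which is rather neat.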
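Removing a stale entry from such a chain is ordinary doubly linked list surgery plus one extra rule: when the last entry goes, the tree node itself has to go. The sketch below shows that rule in userspace; chain_item, chain_head and chain_unlink are names invented for this example and only loosely mirror what remove_rmap_item_from_tree does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: an entry on the chain and the tree node's head pointer. */
struct chain_item {
    struct chain_item *next, *prev;
};

struct chain_head {
    struct chain_item *first;
};

/*
 * Unlink item from the chain hanging off head.
 * Returns 1 if the chain is now empty, meaning the owning tree node
 * should be erased from the tree, and 0 otherwise.
 */
static int chain_unlink(struct chain_head *head, struct chain_item *item)
{
    assert(head && item);

    if (item->prev)
        item->prev->next = item->next;
    else
        head->first = item->next; /* item was the first entry */

    if (item->next)
        item->next->prev = item->prev;

    item->next = item->prev = NULL;
    return head->first == NULL;
}
```

Keeping the head pointer valid after every removal is what lets the search loop keep walking the chain while it discards entries that no longer map a shared page.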
The remaining code of the ksm patch that was not covered above is listed below for reference:
static unsigned long addr_in_vma(struct vm_area_struct *vma, struct page *page)
{
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
unsigned long addr;
addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
if (unlikely(addr < vma->vm_start || addr >= vma->vm_end))
return -EFAULT;
return addr;
}
static pte_t *get_pte(struct mm_struct *mm, unsigned long addr)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *ptep = NULL;
pgd = pgd_offset(mm, addr);
if (!pgd_present(*pgd))
goto out;
pud = pud_offset(pgd, addr);
if (!pud_present(*pud))
goto out;
pmd = pmd_offset(pud, addr);
if (!pmd_present(*pmd))
goto out;
ptep = pte_offset_map(pmd, addr);
out:
return ptep;
}
// check whether the page at address addr in this mm is present in memory
static int is_present_pte(struct mm_struct *mm, unsigned long addr)
{
pte_t *ptep;
int r;
ptep = get_pte(mm, addr);
if (!ptep)
return 0;
r = pte_present(*ptep);
pte_unmap(ptep);
return r;
}
static int is_dirty_pte(struct mm_struct *mm, unsigned long addr)
{
pte_t *ptep;
int r;
ptep = get_pte(mm, addr);
if (!ptep)
return 0;
r = pte_dirty(*ptep);
pte_unmap(ptep);
return r;
}
static int rmap_hash_init(void)
{
if (!rmap_hash_size) {
struct sysinfo sinfo;
si_meminfo(&sinfo);
rmap_hash_size = sinfo.totalram / 10;
}
nrmaps_hash = rmap_hash_size;
rmap_hash = vmalloc(nrmaps_hash * sizeof(struct hlist_head));
if (!rmap_hash)
return -ENOMEM;
memset(rmap_hash, 0, nrmaps_hash * sizeof(struct hlist_head));
return 0;
}
static void rmap_hash_free(void)
{
int i;
struct hlist_head *bucket;
struct hlist_node *node, *n;
struct rmap_item *rmap_item;
for (i = 0; i < nrmaps_hash; ++i) {
bucket = &rmap_hash[i];
hlist_for_each_entry_safe(rmap_item, node, n, bucket, link) {
hlist_del(&rmap_item->link);
free_rmap_item(rmap_item);
}
}
vfree(rmap_hash);
}
static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
{
struct tree_item *tree_item;
tree_item = rmap_item->tree_item;
rmap_item->tree_item = NULL;
if (rmap_item->stable_tree) {
ksm_pages_shared--;
if (rmap_item->prev) {
BUG_ON(rmap_item->prev->next != rmap_item);
rmap_item->prev->next = rmap_item->next;
}
if (rmap_item->next) {
BUG_ON(rmap_item->next->prev != rmap_item);
rmap_item->next->prev = rmap_item->prev;
}
} else if (rmap_item->kpage_outside_tree) {
ksm_pages_shared--;
nkpage_out_tree--;
}
if (tree_item) {
if (rmap_item->stable_tree) {
if (!rmap_item->next && !rmap_item->prev) {
rb_erase(&tree_item->node, &root_stable_tree);
free_tree_item(tree_item);
nnodes_stable_tree--;
} else if (!rmap_item->prev) {
tree_item->rmap_item = rmap_item->next;
} else {
tree_item->rmap_item = rmap_item->prev;
}
} else {
free_tree_item(tree_item);
}
}
hlist_del(&rmap_item->link);
free_rmap_item(rmap_item);
}
static void break_cow(struct mm_struct *mm, unsigned long addr)
{
struct page *page[1];
down_read(&mm->mmap_sem);
if (get_user_pages(current, mm, addr, 1, 1, 0, page, NULL)) {
put_page(page[0]);
}
up_read(&mm->mmap_sem);
}
// remove a page from the tree
static void remove_page_from_tree(struct mm_struct *mm, unsigned long addr)
{
struct rmap_item *rmap_item;
rmap_item = get_rmap_item(mm, addr);
if (!rmap_item)
return;
if (rmap_item->stable_tree) {
/* We are breaking all the KsmPages of area that is removed */
break_cow(mm, addr);
} else {
if (rmap_item->kpage_outside_tree)
break_cow(mm, addr);
}
remove_rmap_item_from_tree(rmap_item);
}
// register a region that should be scanned
static int ksm_sma_ioctl_register_memory_region(struct ksm_sma *ksm_sma,struct ksm_memory_region *mem)
{
struct ksm_mem_slot *slot;
int ret = -EPERM;
slot = kzalloc(sizeof(struct ksm_mem_slot), GFP_KERNEL);
if (!slot) {
ret = -ENOMEM;
goto out;
}
slot->mm = get_task_mm(current);
if (!slot->mm)
goto out_free;
slot->addr = mem->addr;
slot->npages = mem->npages;
down_write(&slots_lock);
list_add_tail(&slot->link, &slots);
list_add_tail(&slot->sma_link, &ksm_sma->sma_slots);
up_write(&slots_lock);
return 0;
out_free:
kfree(slot);
out:
return ret;
}
// compare the contents of two pages
static int memcmp_pages(struct page *page1, struct page *page2)
{
char *addr1, *addr2;
int r;
addr1 = kmap_atomic(page1, KM_USER0);
addr2 = kmap_atomic(page2, KM_USER1);
r = memcmp(addr1, addr2, PAGE_SIZE);
kunmap_atomic(addr1, KM_USER0);
kunmap_atomic(addr2, KM_USER1);
return r;
}
// are the two pages identical?
static inline int pages_identical(struct page *page1, struct page *page2)
{
return !memcmp_pages(page1, page2);
}
// perform the merge for a single page
static int try_to_merge_one_page(struct mm_struct *mm,
struct vm_area_struct *vma,
struct page *oldpage,
struct page *newpage,
pgprot_t newprot)
{
int ret = 1;
int odirect_sync;
unsigned long page_addr_in_vma;
pte_t orig_pte, *orig_ptep;
get_page(newpage);
get_page(oldpage);
down_read(&mm->mmap_sem);
page_addr_in_vma = addr_in_vma(vma, oldpage);
if (page_addr_in_vma == -EFAULT)
goto out_unlock;
orig_ptep = get_pte(mm, page_addr_in_vma);
if (!orig_ptep)
goto out_unlock;
orig_pte = *orig_ptep;
pte_unmap(orig_ptep);
if (!pte_present(orig_pte))
goto out_unlock;
if (page_to_pfn(oldpage) != pte_pfn(orig_pte))
goto out_unlock;
if (!trylock_page(oldpage))
goto out_unlock;
if (!page_wrprotect(oldpage, &odirect_sync, 2)) {
unlock_page(oldpage);
goto out_unlock;
}
unlock_page(oldpage);
if (!odirect_sync)
goto out_unlock;
orig_pte = pte_wrprotect(orig_pte);
if (pages_identical(oldpage, newpage))
ret = replace_page(vma, oldpage, newpage, orig_pte, newprot);
out_unlock:
up_read(&mm->mmap_sem);
put_page(oldpage);
put_page(newpage);
return ret;
}
// merge two pages
static int try_to_merge_two_pages(struct mm_struct *mm1, struct page *page1,
struct mm_struct *mm2, struct page *page2,
unsigned long addr1, unsigned long addr2)
{
struct vm_area_struct *vma;
pgprot_t prot;
int ret = 1;
if (PageKsm(page2, mm2, addr2)) {
down_read(&mm1->mmap_sem);
vma = find_vma(mm1, addr1);
up_read(&mm1->mmap_sem);
if (!vma)
return ret;
prot = vma->vm_page_prot;
pgprot_val(prot) &= ~_PAGE_RW;
ret = try_to_merge_one_page(mm1, vma, page1, page2, prot);
if (!ret)
ksm_pages_shared++;
} else {
struct page *kpage;
if (kthread_max_kernel_pages &&
(nnodes_stable_tree + nkpage_out_tree) >=
kthread_max_kernel_pages)
return ret;
kpage = alloc_page(GFP_HIGHUSER);
if (!kpage)
return ret;
down_read(&mm1->mmap_sem);
vma = find_vma(mm1, addr1);
up_read(&mm1->mmap_sem);
if (!vma) {
put_page(kpage);
return ret;
}
prot = vma->vm_page_prot;
pgprot_val(prot) &= ~_PAGE_RW;
copy_user_highpage(kpage, page1, addr1, vma);
ret = try_to_merge_one_page(mm1, vma, page1, kpage, prot);
if (!ret) {
down_read(&mm2->mmap_sem);
vma = find_vma(mm2, addr2);
up_read(&mm2->mmap_sem);
if (!vma) {
put_page(kpage);
break_cow(mm1, addr1);
ret = 1;
return ret;
}
prot = vma->vm_page_prot;
pgprot_val(prot) &= ~_PAGE_RW;
ret = try_to_merge_one_page(mm2, vma, page2, kpage, prot);
if (ret) {
break_cow(mm1, addr1);
} else {
ksm_pages_shared += 2;
}
}
put_page(kpage);
}
return ret;
}
// check whether the item is no longer valid, where invalid means invalid for the ksm mechanism
static int is_zapped_item(struct rmap_item *rmap_item, struct page **page)
{
int ret = 0;
cond_resched();
if (is_present_pte(rmap_item->mm, rmap_item->address)) {
down_read(&rmap_item->mm->mmap_sem);
ret = get_user_pages(current, rmap_item->mm, rmap_item->address,
1, 0, 0, page, NULL);
up_read(&rmap_item->mm->mmap_sem);
}
if (!ret)
return 1;
if (unlikely(!PageKsm(page[0], rmap_item->mm, rmap_item->address))) {
put_page(page[0]);
return 1;
}
return 0;
}
// search the stable tree using page and rmap_item as the key
static struct rmap_item *stable_tree_search(struct page *page,
struct page **page2,
struct rmap_item *rmap_item)
{
struct rb_node *node = root_stable_tree.rb_node;
struct tree_item *tree_item;
struct rmap_item *found_rmap_item;
while (node) {
int ret;
tree_item = rb_entry(node, struct tree_item, node);
found_rmap_item = tree_item->rmap_item;
while (found_rmap_item) {
BUG_ON(!found_rmap_item->stable_tree);
BUG_ON(!found_rmap_item->tree_item);
if (!rmap_item || // with no rmap_item any entry may match, so validate it
!(found_rmap_item->mm == rmap_item->mm &&
found_rmap_item->address == rmap_item->address)) {
if (!is_zapped_item(found_rmap_item, page2)) // a possible match
break;
remove_rmap_item_from_tree(found_rmap_item); // this entry can no longer match the current stable-tree node: remove it
}
found_rmap_item = found_rmap_item->next;
}
if (!found_rmap_item)
goto out_didnt_find;
ret = memcmp_pages(page, page2[0]);
if (ret < 0) {
put_page(page2[0]);
node = node->rb_left;
} else if (ret > 0) {
put_page(page2[0]);
node = node->rb_right;
} else {
goto out_found;
}
}
out_didnt_find:
found_rmap_item = NULL;
out_found:
return found_rmap_item;
}
// insert into the stable tree; the pages in the stable tree have already been merged
static int stable_tree_insert(struct page *page,
struct tree_item *new_tree_item,
struct rmap_item *rmap_item)
{
struct rb_node **new = &(root_stable_tree.rb_node);
struct rb_node *parent = NULL;
struct tree_item *tree_item;
struct page *page2[1];
while (*new) {
int ret;
struct rmap_item *insert_rmap_item;
tree_item = rb_entry(*new, struct tree_item, node);
insert_rmap_item = tree_item->rmap_item;
while (insert_rmap_item) {
BUG_ON(!insert_rmap_item->stable_tree);
BUG_ON(!insert_rmap_item->tree_item);
if (!(insert_rmap_item->mm == rmap_item->mm &&
insert_rmap_item->address == rmap_item->address)) {
if (!is_zapped_item(insert_rmap_item, page2))
break;
remove_rmap_item_from_tree(insert_rmap_item);
}
insert_rmap_item = insert_rmap_item->next;
}
if (!insert_rmap_item)
return 1;
ret = memcmp_pages(page, page2[0]);
parent = *new;
if (ret < 0) {
put_page(page2[0]);
new = &((*new)->rb_left);
} else if (ret > 0) {
put_page(page2[0]);
new = &((*new)->rb_right);
} else {
return 1;
}
}
rb_link_node(&new_tree_item->node, parent, new);
rb_insert_color(&new_tree_item->node, &root_stable_tree);
nnodes_stable_tree++;
rmap_item->stable_tree = 1;
rmap_item->tree_item = new_tree_item;
return 0;
}
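// Search-and-insert for the unstable tree, which holds fallback candidates for comparison.
// The function searches the unstable tree. If a match is found it is returned, and the caller then tries
// to merge it with the page being scanned; once the merge succeeds the item is inserted into the stable tree,
// which is the only way into the stable tree. If nothing is found, the page being scanned is inserted into
// the unstable tree as a candidate for next time. A nice idea, worth praising once more.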
static struct tree_item *unstable_tree_search_insert(struct page *page, struct page **page2, struct rmap_item *page_rmap_item)
{
struct rb_node **new = &(root_unstable_tree.rb_node);
struct rb_node *parent = NULL;
struct tree_item *tree_item;
struct tree_item *new_tree_item;
struct rmap_item *rmap_item;
while (*new) {
int ret;
tree_item = rb_entry(*new, struct tree_item, node);
rmap_item = tree_item->rmap_item;
/* if the page is not in physical memory, return NULL right away */
if (!is_present_pte(rmap_item->mm, rmap_item->address))
return NULL;
down_read(&rmap_item->mm->mmap_sem);
ret = get_user_pages(current, rmap_item->mm, rmap_item->address,
1, 0, 0, page2, NULL);
up_read(&rmap_item->mm->mmap_sem);
if (!ret)
return NULL;
ret = memcmp_pages(page, page2[0]);
parent = *new;
if (ret < 0) {
put_page(page2[0]);
new = &((*new)->rb_left);
} else if (ret > 0) {
put_page(page2[0]);
new = &((*new)->rb_right);
} else {
return tree_item;
}
}
if (!page_rmap_item)
return NULL;
new_tree_item = alloc_tree_item();
if (!new_tree_item)
return NULL;
page_rmap_item->tree_item = new_tree_item;
page_rmap_item->stable_tree = 0;
new_tree_item->rmap_item = page_rmap_item;
rb_link_node(&new_tree_item->node, parent, new);
rb_insert_color(&new_tree_item->node, &root_unstable_tree);
return NULL;
}
// set up a clean state for the comparison
int update_tree(struct rmap_item *rmap_item)
{
if (!rmap_item->stable_tree) {
if (unlikely(rmap_item->kpage_outside_tree)) {
remove_rmap_item_from_tree(rmap_item);
return 1;
}
if (rmap_item->tree_item) {
free_tree_item(rmap_item->tree_item);
rmap_item->tree_item = NULL;
return 0;
}
return 0;
}
remove_rmap_item_from_tree(rmap_item);
return 1;
}
// Essentially a page-by-page walk with two levels: pages within one slot, and moving from slot to slot
static int scan_get_next_index(struct ksm_scan *ksm_scan)
{
struct ksm_mem_slot *slot;
if (list_empty(&slots))
return -EAGAIN;
slot = ksm_scan->slot_index;
/* next page of the current slot */
if ((slot->npages - ksm_scan->page_index - 1) > 0) {
ksm_scan->page_index++;
return 0;
}
/* next slot */
list_for_each_entry_from(slot, &slots, link) {
if (slot == ksm_scan->slot_index)
continue;
ksm_scan->page_index = 0;
ksm_scan->slot_index = slot;
return 0;
}
/* reaching this point means a new scan round begins */
root_unstable_tree = RB_ROOT;
ksm_scan->page_index = 0;
ksm_scan->slot_index = list_first_entry(&slots,struct ksm_mem_slot, link);
return 0;
}
// Scan loop: visit every page of every slot reachable from ksm_scan and compare each page against the two trees
static int ksm_scan_start(struct ksm_scan *ksm_scan, unsigned int scan_npages)
{
struct ksm_mem_slot *slot;
struct page *page[1];
int val;
int ret = 0;
down_read(&slots_lock);
scan_update_old_index(ksm_scan);
while (scan_npages > 0) {
ret = scan_get_next_index(ksm_scan);
if (ret)
goto out;
slot = ksm_scan->slot_index;
cond_resched(); // scan only pages that are present in physical memory, since the patch exists to save physical memory
if (is_present_pte(slot->mm, slot->addr + ksm_scan->page_index * PAGE_SIZE)) {
down_read(&slot->mm->mmap_sem);
val = get_user_pages(current, slot->mm, slot->addr + ksm_scan->page_index * PAGE_SIZE , 1, 0, 0, page, NULL);
up_read(&slot->mm->mmap_sem);
if (val == 1) {
if (!PageKsm(page[0], slot->mm, slot->addr + ksm_scan->page_index *PAGE_SIZE))
cmp_and_merge_page(ksm_scan, page[0]);
put_page(page[0]);
}
}
scan_npages--;
}
scan_get_next_index(ksm_scan);
out:
up_read(&slots_lock);
return ret;
}