快乐虾
http://blog.csdn.net/lights_joy/
This article applies to:
ADI BF561 DSP
优视 BF561EVB development board
uclinux-2008r1-rc8 (ported to VDSP5)
Visual DSP++ 5.0
Reposting is welcome, but please retain the author information.
Linux supports the Non-Uniform Memory Access (NUMA) model, in which the time a given CPU needs to access different memory units may vary. The system's physical memory is divided into several nodes; within a single node, the time any given CPU needs to access any page is the same, but that time may differ from CPU to CPU. For each CPU, the kernel tries to minimize the number of accesses to costly nodes, which requires carefully choosing where the kernel data structures most frequently referenced by each CPU are placed. The physical memory within each node is further divided into several zones.
On the x86 architecture, Linux divides memory into three zones:
ZONE_DMA: pages below 16M, because the DMA controllers on the ISA bus are severely limited and can only address the first 16M of RAM.
ZONE_NORMAL: memory above 16M and below 896M.
ZONE_HIGHMEM: memory above 896M.
On the BF561, DMA can reach the entire memory range. However, because of anomaly 05000263, usable memory is limited to 60M when the ICACHE is enabled. The kernel therefore actually uses only a single zone, ZONE_DMA, which covers the whole memory range.
Below, the kernel's data representation is analyzed through the initialization of the memory zones.
The kernel manages memory in units of pages. Each page is 4K in size and is described by a page structure, and these descriptors are laid out as an array right after the kernel code. Three quantities are therefore involved: the address of the actual 4K page, i.e. the physical address, which the kernel (the BF561 having no MMU) refers to as the virtual address; the pointer to the page structure describing that page; and the index of that page structure within the whole page array, which the kernel calls the pfn (page frame number). Six macros convert between these three representations.
The page structure is defined in include/linux/mm_types.h:
/*
* Each physical page in the system has a struct page associated with
* it to keep track of whatever it is we are using the page for at the
* moment. Note that we have no way to track which tasks are using
* a page, though if it is a pagecache page, rmap structures can tell us
* who is mapping it.
*/
struct page {
    unsigned long flags;        /* Atomic flags, some possibly
                                 * updated asynchronously */
    atomic_t _count;            /* Usage count, see below. */
    union {
        atomic_t _mapcount;     /* Count of ptes mapped in mms,
                                 * to show when page is mapped
                                 * & limit reverse map searches.
                                 */
        struct {                /* SLUB uses */
            short unsigned int inuse;
            short unsigned int offset;
        };
    };
    union {
        struct {
            unsigned long private;          /* Mapping-private opaque data:
                                             * usually used for buffer_heads
                                             * if PagePrivate set; used for
                                             * swp_entry_t if PageSwapCache;
                                             * indicates order in the buddy
                                             * system if PG_buddy is set.
                                             */
            struct address_space *mapping;  /* If low bit clear, points to
                                             * inode address_space, or NULL.
                                             * If page mapped as anonymous
                                             * memory, low bit is set, and
                                             * it points to anon_vma object:
                                             * see PAGE_MAPPING_ANON below.
                                             */
        };
        spinlock_t ptl;
        struct {                /* SLUB uses */
            void **lockless_freelist;
            struct kmem_cache *slab;        /* Pointer to slab */
        };
        struct {
            struct page *first_page;        /* Compound pages */
        };
    };
    union {
        pgoff_t index;          /* Our offset within mapping. */
        void *freelist;         /* SLUB: freelist req. slab lock */
    };
    struct list_head lru;       /* Pageout list, eg. active_list
                                 * protected by zone->lru_lock !
                                 */
};
- lru
This field means different things depending on how the page is used. When the page is free, or is used by the page cache, it is the list_head that links pages into a doubly linked list. When the page is used by the SLAB allocator, its next pointer points to the cache the slab belongs to (a kmem_cache structure), and its prev pointer points to the slab itself. Both pointers are set in slab_map_pages().
- flags
The meaning of each bit of the flags member is defined in include/linux/page-flags.h:
/*
* Various page->flags bits:
*
* PG_reserved is set for special pages, which can never be swapped out. Some
* of them might not even exist (eg empty_bad_page)...
*
* The PG_private bitflag is set on pagecache pages if they contain filesystem
* specific data (which is normally at page->private). It can be used by
* private allocations for its own usage.
*
* During initiation of disk I/O, PG_locked is set. This bit is set before I/O
* and cleared when writeback _starts_ or when read _completes_. PG_writeback
* is set before writeback starts and cleared when it finishes.
*
* PG_locked also pins a page in pagecache, and blocks truncation of the file
* while it is held.
*
* page_waitqueue(page) is a wait queue of all tasks waiting for the page
* to become unlocked.
*
* PG_uptodate tells whether the page's contents is valid. When a read
* completes, the page becomes uptodate, unless a disk I/O error happened.
*
* PG_referenced, PG_reclaim are used for page reclaim for anonymous and
* file-backed pagecache (see mm/vmscan.c).
*
* PG_error is set to indicate that an I/O error occurred on this page.
*
* PG_arch_1 is an architecture specific page state bit. The generic code
* guarantees that this bit is cleared for a page when it first is entered into
* the page cache.
*
* PG_highmem pages are not permanently mapped into the kernel virtual address
* space, they need to be kmapped separately for doing IO on the pages. The
* struct page (these bits with information) are always mapped into kernel
* address space...
*
* PG_buddy is set to indicate that the page is free and in the buddy system
* (see mm/page_alloc.c).
*
*/
/*
* Don't use the *_dontuse flags. Use the macros. Otherwise you'll break
* locked- and dirty-page accounting.
*
* The page flags field is split into two parts, the main flags area
* which extends from the low bits upwards, and the fields area which
* extends from the high bits downwards.
*
* | FIELD | ... | FLAGS |
* N-1 ^ 0
* (N-FLAGS_RESERVED)
*
* The fields area is reserved for fields mapping zone, node and SPARSEMEM
* section. The boundry between these two areas is defined by
* FLAGS_RESERVED which defines the width of the fields section
* (see linux/mmzone.h). New flags must _not_ overlap with this area.
*/
#define PG_locked 0 /* Page is locked. Don't touch. */
#define PG_error 1
#define PG_referenced 2
#define PG_uptodate 3
#define PG_dirty 4
#define PG_lru 5
#define PG_active 6
#define PG_slab 7 /* slab debug (Suparna wants this) */
#define PG_owner_priv_1 8 /* Owner use. If pagecache, fs may use*/
#define PG_arch_1 9
#define PG_reserved 10
#define PG_private 11 /* If pagecache, has fs-private data */
#define PG_writeback 12 /* Page is under writeback */
#define PG_compound 14 /* Part of a compound page */
#define PG_swapcache 15 /* Swap page: swp_entry_t in private */
#define PG_mappedtodisk 16 /* Has blocks allocated on-disk */
#define PG_reclaim 17 /* To be reclaimed asap */
#define PG_buddy 19 /* Page is free, on buddy lists */
/* PG_owner_priv_1 users should have descriptive aliases */
#define PG_checked PG_owner_priv_1 /* Used by some filesystems */
At initialization, flags is set to PG_reserved.
The initialization of each page can be seen in memmap_init_zone, in mm/page_alloc.c:
/*
* Initially all pages are reserved - free ones are freed
* up by free_all_bootmem() once the early boot process is
* done. Non-atomic initialization, single-pass.
*/
void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
        unsigned long start_pfn, enum memmap_context context)
{
    struct page *page;
    unsigned long end_pfn = start_pfn + size;
    unsigned long pfn;

    for (pfn = start_pfn; pfn < end_pfn; pfn++) {
        /*
         * There can be holes in boot-time mem_map[]s
         * handed to this function. They do not
         * exist on hotplugged memory.
         */
        if (context == MEMMAP_EARLY) {
            if (!early_pfn_valid(pfn))
                continue;
            if (!early_pfn_in_nid(pfn, nid))
                continue;
        }
        page = pfn_to_page(pfn);
        set_page_links(page, zone, nid, pfn);
        init_page_count(page);
        reset_page_mapcount(page);
        SetPageReserved(page);
        INIT_LIST_HEAD(&page->lru);
    }
}
In the function above, since the zone argument is 0, the only call that effectively changes flags is
SetPageReserved(page);
which sets the flags member to 0x400 (PG_reserved).
Below are several macros that operate on pages.
The first is virt_to_pfn, located in include/asm/page.h:
#define virt_to_pfn(kaddr) (__pa(kaddr) >> PAGE_SHIFT)
#define __pa(vaddr) virt_to_phys((void *)(vaddr))
#define virt_to_phys(vaddr) ((unsigned long) (vaddr))
From these three definitions, virt_to_pfn takes a physical address as its argument (on this MMU-less port, physical and virtual addresses are identical) and computes the index of that physical page within the page array.
Next is page_to_virt, located in include/asm/page.h:
#define page_to_virt(page) ((((page) - mem_map) << PAGE_SHIFT) + PAGE_OFFSET)
Here mem_map points to the first element of the page array; that memory is allocated with bootmem in the alloc_node_mem_map function.
Also,
#define PAGE_SHIFT 12
and PAGE_OFFSET is 0.
From the above, this macro takes a page structure pointer as its argument and computes the address of the physical memory page that the page structure represents.
page_to_pfn is likewise located in include/asm/page.h:
#define page_to_pfn(page) virt_to_pfn(page_to_virt(page))
From the definitions of virt_to_pfn and page_to_virt, page_to_pfn takes a page structure pointer as its argument and yields that page's index within the page array. In fact,
#define page_to_pfn(page) ((page) - mem_map)
would achieve the same result.
pfn_to_virt is located in include/asm/page.h:
#define pfn_to_virt(pfn) __va((pfn) << PAGE_SHIFT)
#define __va(paddr) phys_to_virt((unsigned long)(paddr))
#define phys_to_virt(vaddr) ((void *) (vaddr))
From these three definitions, this macro takes a page frame number as its argument and returns the physical address of the page with that index.
The kernel describes every 4K memory page with a page structure, and the virt_to_page macro quickly finds the page structure corresponding to a given physical address. It is located in include/asm/page.h:
#define virt_to_page(addr) (mem_map + (((unsigned long)(addr)-PAGE_OFFSET) >> PAGE_SHIFT))
Here mem_map points to the first element of the page array. That memory is allocated with bootmem in alloc_node_mem_map, but unlike other bootmem allocations it is never reclaimed.
Also,
#define PAGE_SHIFT 12
and PAGE_OFFSET is 0.
From the above, this macro takes a physical address as its argument and computes the pointer to the page structure representing the page that contains this address.
Finally, pfn_to_page is located in include/asm/page.h:
#define pfn_to_page(pfn) virt_to_page(pfn_to_virt(pfn))
From the definitions of pfn_to_virt and virt_to_page, pfn_to_page takes a page frame number as its argument and yields the pointer to the page structure with that index. In fact,
#define pfn_to_page(pfn) (mem_map + (pfn))
would achieve the same result.
The kernel divides the whole memory space into several zones, each represented by a zone structure. Since the BF561 uses only ZONE_DMA, every zone pointer appearing in the kernel can be taken to point to a single, globally unique address.
The zone structure is defined in include/linux/mmzone.h; it effectively manages all the page lists:
struct zone {
    /* Fields commonly accessed by the page allocator */
    unsigned long pages_min, pages_low, pages_high;
    /*
     * We don't know if the memory that we're going to allocate will be freeable
     * or/and it will be released eventually, so to avoid totally wasting several
     * GB of ram we must reserve some of the lower zone memory (otherwise we risk
     * to run OOM on the lower zones despite there's tons of freeable ram
     * on the higher zones). This array is recalculated at runtime if the
     * sysctl_lowmem_reserve_ratio sysctl changes.
     */
    unsigned long lowmem_reserve[MAX_NR_ZONES];

    struct per_cpu_pageset pageset[NR_CPUS];

    /*
     * free areas of different sizes
     */
    spinlock_t lock;
    struct free_area free_area[MAX_ORDER];

    ZONE_PADDING(_pad1_)

    /* Fields commonly accessed by the page reclaim scanner */
    spinlock_t lru_lock;
    struct list_head active_list;
    struct list_head inactive_list;
    unsigned long nr_scan_active;
    unsigned long nr_scan_inactive;
    unsigned long pages_scanned;        /* since last reclaim */
    int all_unreclaimable;              /* All pages pinned */

    /* A count of how many reclaimers are scanning this zone */
    atomic_t reclaim_in_progress;

    /* Zone statistics */
    atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];

    /*
     * prev_priority holds the scanning priority for this zone. It is
     * defined as the scanning priority at which we achieved our reclaim
     * target at the previous try_to_free_pages() or balance_pgdat()
     * invokation.
     *
     * We use prev_priority as a measure of how much stress page reclaim is
     * under - it drives the swappiness decision: whether to unmap mapped
     * pages.
     *
     * Access to both this field is quite racy even on uniprocessor. But
     * it is expected to average out OK.
     */
    int prev_priority;

    ZONE_PADDING(_pad2_)

    /* Rarely used or read-mostly fields */
    /*
     * wait_table -- the array holding the hash table
     * wait_table_hash_nr_entries -- the size of the hash table array
     * wait_table_bits -- wait_table_size == (1 << wait_table_bits)
     *
     * The purpose of all these is to keep track of the people
     * waiting for a page to become available and make them
     * runnable again when possible. The trouble is that this
     * consumes a lot of space, especially when so few things
     * wait on pages at a given time. So instead of using
     * per-page waitqueues, we use a waitqueue hash table.
     *
     * The bucket discipline is to sleep on the same queue when
     * colliding and wake all in that wait queue when removing.
     * When something wakes, it must check to be sure its page is
     * truly available, a la thundering herd. The cost of a
     * collision is great, but given the expected load of the
     * table, they should be so rare as to be outweighed by the
     * benefits from the saved space.
     *
     * __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
     * primary users of these fields, and in mm/page_alloc.c
     * free_area_init_core() performs the initialization of them.
     */
    wait_queue_head_t *wait_table;
    unsigned long wait_table_hash_nr_entries;
    unsigned long wait_table_bits;

    /*
     * Discontig memory support fields.
     */
    struct pglist_data *zone_pgdat;
    /* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
    unsigned long zone_start_pfn;

    /*
     * zone_start_pfn, spanned_pages and present_pages are all
     * protected by span_seqlock. It is a seqlock because it has
     * to be read outside of zone->lock, and it is done in the main
     * allocator path. But, it is written quite infrequently.
     *
     * The lock is declared along with zone->lock because it is
     * frequently read in proximity to zone->lock. It's good to
     * give them a chance of being in the same cacheline.
     */
    unsigned long spanned_pages;        /* total size, including holes */
    unsigned long present_pages;        /* amount of memory (excluding holes) */

    /*
     * rarely used fields:
     */
    const char *name;
} ____cacheline_internodealigned_in_smp;
Every zone the kernel accesses is contig_page_data.node_zones[0]; node_zones[1] has no usable space and can be ignored.
- spanned_pages and present_pages
spanned_pages is the number of pages of usable SDRAM, covering the range from 0 to 60M; present_pages is spanned_pages minus the pages occupied by the page array. For 64M SDRAM with MTD disabled, spanned_pages is 0x3bff and present_pages is 0x3b6a.
- zone_pgdat
Points to the single global pglist_data: contig_page_data.
- prev_priority
Initialized to DEF_PRIORITY:
/*
* The "priority" of VM scanning is how much of the queues we will scan in one
* go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
* queues ("queue_length >> 12") during an aging round.
*/
#define DEF_PRIORITY 12
- wait_table_hash_nr_entries and wait_table_bits
For 64M of memory, wait_table_hash_nr_entries is 0x40 and wait_table_bits is 6. The size of wait_table is wait_table_hash_nr_entries * sizeof(wait_queue_head_t).
- free_area[MAX_ORDER]
struct free_area {
    struct list_head free_list;
    unsigned long nr_free;
};
In the buddy algorithm, free pages are organized into 11 block lists, holding blocks of 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 contiguous pages respectively; this member represents those lists.
The zonelist structure is defined in include/linux/mmzone.h:
/*
* One allocation request operates on a zonelist. A zonelist
* is a list of zones, the first one is the 'goal' of the
* allocation, the other zones are fallback zones, in decreasing
* priority.
*
* If zlcache_ptr is not NULL, then it is just the address of zlcache,
* as explained above. If zlcache_ptr is NULL, there is no zlcache.
*/
struct zonelist {
    struct zonelist_cache *zlcache_ptr;                 // NULL or &zlcache
    struct zone *zones[MAX_ZONES_PER_ZONELIST + 1];     // NULL delimited
};
Here,
/* Maximum number of zones on a zonelist */
#define MAX_ZONES_PER_ZONELIST (MAX_NUMNODES * MAX_NR_ZONES)
whose value is 2.
- zones
This structure is used only inside pglist_data. After initialization, zones effectively has a single element, pointing to ZONE_DMA, i.e. contig_page_data.node_zones[0].
- zlcache_ptr
Not actually used; it is NULL.
The pglist_data structure represents a NUMA node. Since a BF561 system has only one node, every pglist_data pointer appearing in the kernel can be taken to point to the same place: the global variable contig_page_data.
pglist_data is defined in include/linux/mmzone.h:
/*
* The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
* (mostly NUMA machines?) to denote a higher-level memory zone than the
* zone denotes.
*
* On NUMA machines, each NUMA node would have a pg_data_t to describe
* it's memory layout.
*
* Memory statistics and page replacement data structures are maintained on a
* per-zone basis.
*/
struct bootmem_data;
typedef struct pglist_data {
    struct zone node_zones[MAX_NR_ZONES];
    struct zonelist node_zonelists[MAX_NR_ZONES];
    int nr_zones;
    struct page *node_mem_map;
    struct bootmem_data *bdata;
    unsigned long node_start_pfn;
    unsigned long node_present_pages;   /* total number of physical pages */
    unsigned long node_spanned_pages;   /* total size of physical page
                                           range, including holes */
    int node_id;
    wait_queue_head_t kswapd_wait;
    struct task_struct *kswapd;
    int kswapd_max_order;
} pg_data_t;
This structure describes the state of the available memory.
- bdata
static bootmem_data_t contig_bootmem_data;
struct pglist_data contig_page_data = { .bdata = &contig_bootmem_data };
From this definition it can also be seen that bdata points to a fixed location, contig_bootmem_data, and that after mem_init() is called this member is no longer used.
- node_zones
Of the zones in this structure, the kernel actually uses only ZONE_DMA (index 0), which spans from the end of the kernel code to the end of physical memory.
- node_id
Since the whole kernel uses only one node, node_id is 0.
- node_start_pfn
Also 0.
- node_spanned_pages and node_present_pages
Both members are initialized in calculate_node_totalpages; their value is the number of SDRAM pages, including unused regions and the kernel code, and the two are equal. For 64M of memory (actually limited to 60M), the value is 0x3bff.
- node_mem_map
The kernel pairs every 4K page with a struct page; this member points to the start of that page array, whose space is allocated at initialization by alloc_node_mem_map (using bootmem).
- nr_zones
The highest index of a usable zone, plus one. Since the BF561 uses only ZONE_DMA, this value is 1.
CPLB in uClinux 2.6 (bf561) (2008/2/19)
Analysis of bootmem in uClinux 2.6 (bf561) (1): conjectures (2008/5/9)
Analysis of bootmem in uClinux 2.6 (bf561) (2): the parameters before the call (2008/5/9)
Analysis of bootmem in uClinux 2.6 (bf561) (3): init_bootmem_node (2008/5/9)
Analysis of bootmem in uClinux 2.6 (bf561) (4): alloc_bootmem_pages (2008/5/9)
paging_init in the uClinux 2.6 (bf561) kernel (2008/5/12)
icache support in the uclinux-2008r1 (bf561) kernel (1): register configuration and initialization (2008/5/16)
icache support in the uclinux-2008r1 (bf561) kernel (2): generation of icplb_table (2008/5/16)
icache support in the uclinux-2008r1 (bf561) kernel (3): __fill_code_cplbtab (2008/5/16)
icache support in the uclinux-2008r1 (bf561) kernel (4): the page-replacement problem (2008/5/16)
Rereading bootmem in the uclinux-2008r1 (bf561) kernel (2008/6/3)
Several global variables related to memory management in the uclinux-2008r1 (bf561) kernel (2008/6/4)
A first look at the memory regions of the uclinux-2008r1 (bf561) kernel (2008/6/4)
zonelist initialization in the uclinux-2008r1 (bf561) kernel (2008/6/5)
Several memory-management-related structures in the uclinux-2008r1 (bf561) kernel (2008/6/5)
Rereading kernel memory management (1): related global variables (2008/6/17)
Rereading kernel memory management (2): related data structures (2008/6/17)
Rereading kernel memory management (3): bootmem allocation strategy (2008/6/17)
Rereading kernel memory management (4): memory zone management (2008/6/17)
Rereading kernel memory management (5): the buddy algorithm (2008/6/17)
Rereading kernel memory management (6): use of caches (2008/6/17)
Rereading kernel memory management (7): icache support (2008/6/17)
Rereading kernel memory management (8): using on-chip SRAM (2008/6/17)
A first read of SLAB (2008/6/26)
A third read of bootmem (2008/7/24)