Revisiting the uclinux-2008r1 (bf561) kernel's memory zone management (1): related data structures

 

快乐虾

http://blog.csdn.net/lights_joy/

[email protected]

   

This article applies to:

ADI bf561 DSP

优视 BF561EVB development board

uclinux-2008r1-rc8 (ported to VDSP5)

Visual DSP++ 5.0

   

Reposting is welcome, but please retain the author information.

 

Linux supports the Non-Uniform Memory Access (NUMA) model, in which the time a given CPU needs to access different memory units may vary. The system's physical memory is divided into several nodes; within a single node, any given CPU takes the same time to access any page, but that time may differ from CPU to CPU. For each CPU, the kernel tries to minimize the number of accesses to slow nodes, which requires carefully choosing where the kernel data structures most frequently referenced by that CPU are placed. The physical memory within each node is in turn divided into several zones.

Under the x86 architecture, Linux divides memory into three zones:

ZONE_DMA: pages below 16MB, because DMA controllers on the ISA bus are severely limited and can only address the first 16MB of RAM.

ZONE_NORMAL: memory above 16MB and below 896MB.

ZONE_HIGHMEM: memory above 896MB.

On the BF561, DMA can reach the entire memory range. However, because of anomaly 05000263, usable memory is limited to 60MB when the ICACHE is enabled. The kernel therefore actually uses only one zone, ZONE_DMA, which covers the whole memory range.

Below, we analyze the kernel's data representation by walking through the initialization of the memory zones.

1.1.1   Related data structures

1.1.1.1             page

The kernel manages memory in units of pages. Each page is 4KB and is described by a struct page; these descriptors are laid out as an array placed right after the kernel code. Three quantities are therefore involved: the address of the actual 4KB page, i.e. the physical address, which the kernel refers to as the virtual address (the two coincide on this no-MMU part); the pointer to the struct page describing the page; and the index of that struct page within the whole page array, which the kernel calls the pfn. Six macros convert among these three representations.
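As a minimal sketch of how the three quantities relate (kernel-context code using the macros quoted in the subsections below; the address 0x300000 is just an illustrative value):

	unsigned long vaddr = 0x300000;           /* physical address ("virtual" on this no-MMU part) */
	unsigned long pfn   = virt_to_pfn(vaddr); /* 0x300000 >> 12 == 0x300 */
	struct page *pg     = pfn_to_page(pfn);   /* &mem_map[0x300] */

	/* Round trip: all three forms identify the same 4KB page. */
	BUG_ON(page_to_pfn(pg)  != pfn);
	BUG_ON(page_to_virt(pg) != vaddr);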

struct page is defined in include/linux/mm_types.h:

/*
 * Each physical page in the system has a struct page associated with
 * it to keep track of whatever it is we are using the page for at the
 * moment. Note that we have no way to track which tasks are using
 * a page, though if it is a pagecache page, rmap structures can tell us
 * who is mapping it.
 */
struct page {
	unsigned long flags;		/* Atomic flags, some possibly
					 * updated asynchronously */
	atomic_t _count;		/* Usage count, see below. */
	union {
		atomic_t _mapcount;	/* Count of ptes mapped in mms,
					 * to show when page is mapped
					 * & limit reverse map searches.
					 */
		struct {	/* SLUB uses */
			short unsigned int inuse;
			short unsigned int offset;
		};
	};
	union {
		struct {
			unsigned long private;		/* Mapping-private opaque data:
							 * usually used for buffer_heads
							 * if PagePrivate set; used for
							 * swp_entry_t if PageSwapCache;
							 * indicates order in the buddy
							 * system if PG_buddy is set.
							 */
			struct address_space *mapping;	/* If low bit clear, points to
							 * inode address_space, or NULL.
							 * If page mapped as anonymous
							 * memory, low bit is set, and
							 * it points to anon_vma object:
							 * see PAGE_MAPPING_ANON below.
							 */
		};
		spinlock_t ptl;
		struct {	/* SLUB uses */
			void **lockless_freelist;
			struct kmem_cache *slab;	/* Pointer to slab */
		};
		struct {
			struct page *first_page;	/* Compound pages */
		};
	};
	union {
		pgoff_t index;		/* Our offset within mapping. */
		void *freelist;		/* SLUB: freelist req. slab lock */
	};
	struct list_head lru;		/* Pageout list, eg. active_list
					 * protected by zone->lru_lock !
					 */
};

- lru

The meaning of this member depends on how the page is being used. When the page is free, or when it is used by the page cache, this list_head links the page into a doubly linked list. When the page is used by the SLAB allocator, its next pointer points to the cache that the slab belongs to (a struct kmem_cache), and its prev pointer points to the slab itself. Both pointers are set in slab_map_pages.

- flags

The meaning of each bit of the flags member is defined in include/linux/page-flags.h:

/*
 * Various page->flags bits:
 *
 * PG_reserved is set for special pages, which can never be swapped out. Some
 * of them might not even exist (eg empty_bad_page)...
 *
 * The PG_private bitflag is set on pagecache pages if they contain filesystem
 * specific data (which is normally at page->private). It can be used by
 * private allocations for its own usage.
 *
 * During initiation of disk I/O, PG_locked is set. This bit is set before I/O
 * and cleared when writeback _starts_ or when read _completes_. PG_writeback
 * is set before writeback starts and cleared when it finishes.
 *
 * PG_locked also pins a page in pagecache, and blocks truncation of the file
 * while it is held.
 *
 * page_waitqueue(page) is a wait queue of all tasks waiting for the page
 * to become unlocked.
 *
 * PG_uptodate tells whether the page's contents is valid.  When a read
 * completes, the page becomes uptodate, unless a disk I/O error happened.
 *
 * PG_referenced, PG_reclaim are used for page reclaim for anonymous and
 * file-backed pagecache (see mm/vmscan.c).
 *
 * PG_error is set to indicate that an I/O error occurred on this page.
 *
 * PG_arch_1 is an architecture specific page state bit.  The generic code
 * guarantees that this bit is cleared for a page when it first is entered into
 * the page cache.
 *
 * PG_highmem pages are not permanently mapped into the kernel virtual address
 * space, they need to be kmapped separately for doing IO on the pages.  The
 * struct page (these bits with information) are always mapped into kernel
 * address space...
 *
 * PG_buddy is set to indicate that the page is free and in the buddy system
 * (see mm/page_alloc.c).
 *
 */

/*
 * Don't use the *_dontuse flags.  Use the macros.  Otherwise you'll break
 * locked- and dirty-page accounting.
 *
 * The page flags field is split into two parts, the main flags area
 * which extends from the low bits upwards, and the fields area which
 * extends from the high bits downwards.
 *
 *  | FIELD | ... | FLAGS |
 *  N-1     ^             0
 *          (N-FLAGS_RESERVED)
 *
 * The fields area is reserved for fields mapping zone, node and SPARSEMEM
 * section.  The boundry between these two areas is defined by
 * FLAGS_RESERVED which defines the width of the fields section
 * (see linux/mmzone.h).  New flags must _not_ overlap with this area.
 */
#define PG_locked		 0	/* Page is locked. Don't touch. */
#define PG_error		 1
#define PG_referenced		 2
#define PG_uptodate		 3

#define PG_dirty		 4
#define PG_lru			 5
#define PG_active		 6
#define PG_slab			 7	/* slab debug (Suparna wants this) */

#define PG_owner_priv_1		 8	/* Owner use. If pagecache, fs may use*/
#define PG_arch_1		 9
#define PG_reserved		10
#define PG_private		11	/* If pagecache, has fs-private data */

#define PG_writeback		12	/* Page is under writeback */
#define PG_compound		14	/* Part of a compound page */
#define PG_swapcache		15	/* Swap page: swp_entry_t in private */

#define PG_mappedtodisk		16	/* Has blocks allocated on-disk */
#define PG_reclaim		17	/* To be reclaimed asap */
#define PG_buddy		19	/* Page is free, on buddy lists */

/* PG_owner_priv_1 users should have descriptive aliases */
#define PG_checked		PG_owner_priv_1 /* Used by some filesystems */

 

At initialization, flags is set to PG_reserved for every page.

The page initialization can be seen in memmap_init_zone, in mm/page_alloc.c:

/*
 * Initially all pages are reserved - free ones are freed
 * up by free_all_bootmem() once the early boot process is
 * done. Non-atomic initialization, single-pass.
 */
void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
		unsigned long start_pfn, enum memmap_context context)
{
	struct page *page;
	unsigned long end_pfn = start_pfn + size;
	unsigned long pfn;

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		/*
		 * There can be holes in boot-time mem_map[]s
		 * handed to this function.  They do not
		 * exist on hotplugged memory.
		 */
		if (context == MEMMAP_EARLY) {
			if (!early_pfn_valid(pfn))
				continue;
			if (!early_pfn_in_nid(pfn, nid))
				continue;
		}
		page = pfn_to_page(pfn);
		set_page_links(page, zone, nid, pfn);
		init_page_count(page);
		reset_page_mapcount(page);
		SetPageReserved(page);
		INIT_LIST_HEAD(&page->lru);
	}
}

In this function, since the zone parameter is 0, only the line

SetPageReserved(page)

actually has a visible effect, setting the flags member to 0x400 (PG_reserved).
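SetPageReserved is one of the standard page-flag accessors in include/linux/page-flags.h; it expands to an atomic bit set, which is where the 0x400 comes from (PG_reserved is bit 10, and 1 << 10 == 0x400):

#define SetPageReserved(page)	set_bit(PG_reserved, &(page)->flags)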

Below are the macros that operate on pages:

1.1.1.1.1       virt_to_pfn

This macro is defined in include/asm/page.h:

#define virt_to_pfn(kaddr)	(__pa(kaddr) >> PAGE_SHIFT)
#define __pa(vaddr)		virt_to_phys((void *)(vaddr))
#define virt_to_phys(vaddr)	((unsigned long) (vaddr))

These three definitions show that virt_to_pfn takes a physical address as its argument and computes the index, within the whole page table, of the page containing that address; for example, the address 0x300000 yields pfn 0x300000 >> 12 = 0x300.

1.1.1.1.2       page_to_virt

This macro is defined in include/asm/page.h:

#define page_to_virt(page)  ((((page) - mem_map) << PAGE_SHIFT) + PAGE_OFFSET)

Here mem_map points to the first element of the page array; this memory is allocated with bootmem in alloc_node_mem_map.

Here,

#define PAGE_SHIFT 12

and PAGE_OFFSET is 0.

From the above, this macro takes a struct page pointer as its argument and computes the physical address of the memory page that the struct page describes.

 

1.1.1.1.3       page_to_pfn

This macro is defined in include/asm/page.h:

#define page_to_pfn(page)   virt_to_pfn(page_to_virt(page))

From the definitions of virt_to_pfn and page_to_virt, page_to_pfn takes a struct page pointer as its argument and yields the index of that page within the whole page table.

In fact, since PAGE_OFFSET is 0, composing the two macros gives (((page) - mem_map) << PAGE_SHIFT) >> PAGE_SHIFT = (page) - mem_map, so

#define page_to_pfn(page)   ((page) - mem_map)

would achieve the same result.

1.1.1.1.4       pfn_to_virt

This macro is defined in include/asm/page.h:

#define pfn_to_virt(pfn)	__va((pfn) << PAGE_SHIFT)
#define __va(paddr)		phys_to_virt((unsigned long)(paddr))
#define phys_to_virt(vaddr)	((void *) (vaddr))

These three definitions show that this macro takes a page frame number as its argument and returns the physical address of the page with that index.

1.1.1.1.5       virt_to_page

The kernel describes every 4KB memory page with a struct page; the virt_to_page macro quickly finds the struct page corresponding to a given physical address.

This macro is defined in include/asm/page.h:

#define virt_to_page(addr)  (mem_map + (((unsigned long)(addr)-PAGE_OFFSET) >> PAGE_SHIFT))

Here mem_map points to the first element of the page array. This memory is allocated with bootmem in alloc_node_mem_map, but unlike other memory allocated with bootmem, it is never reclaimed.

As before,

#define PAGE_SHIFT 12

and PAGE_OFFSET is 0.

From the above, this macro takes a physical address as its argument and computes the pointer to the struct page representing the page that contains the address.

1.1.1.1.6       pfn_to_page

This macro is defined in include/asm/page.h:

#define pfn_to_page(pfn)    virt_to_page(pfn_to_virt(pfn))

From the definitions of pfn_to_virt and virt_to_page, pfn_to_page takes a page frame number as its argument and yields the pointer to the struct page with that index.

In fact,

#define pfn_to_page(pfn)    (mem_map + (pfn))

would achieve the same result.

1.1.1.2             zone

The kernel divides the whole memory space into several zones, each represented by a struct zone. Since the BF561 uses only the single ZONE_DMA zone, every zone pointer that appears in the kernel can be assumed to point to one globally unique address.

The zone structure is defined in include/linux/mmzone.h; it manages all the lists of memory pages:

struct zone {
	/* Fields commonly accessed by the page allocator */
	unsigned long		pages_min, pages_low, pages_high;
	/*
	 * We don't know if the memory that we're going to allocate will be freeable
	 * or/and it will be released eventually, so to avoid totally wasting several
	 * GB of ram we must reserve some of the lower zone memory (otherwise we risk
	 * to run OOM on the lower zones despite there's tons of freeable ram
	 * on the higher zones). This array is recalculated at runtime if the
	 * sysctl_lowmem_reserve_ratio sysctl changes.
	 */
	unsigned long		lowmem_reserve[MAX_NR_ZONES];

	struct per_cpu_pageset	pageset[NR_CPUS];
	/*
	 * free areas of different sizes
	 */
	spinlock_t		lock;
	struct free_area	free_area[MAX_ORDER];

	ZONE_PADDING(_pad1_)

	/* Fields commonly accessed by the page reclaim scanner */
	spinlock_t		lru_lock;
	struct list_head	active_list;
	struct list_head	inactive_list;
	unsigned long		nr_scan_active;
	unsigned long		nr_scan_inactive;
	unsigned long		pages_scanned;	   /* since last reclaim */
	int			all_unreclaimable; /* All pages pinned */

	/* A count of how many reclaimers are scanning this zone */
	atomic_t		reclaim_in_progress;

	/* Zone statistics */
	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];

	/*
	 * prev_priority holds the scanning priority for this zone.  It is
	 * defined as the scanning priority at which we achieved our reclaim
	 * target at the previous try_to_free_pages() or balance_pgdat()
	 * invokation.
	 *
	 * We use prev_priority as a measure of how much stress page reclaim is
	 * under - it drives the swappiness decision: whether to unmap mapped
	 * pages.
	 *
	 * Access to both this field is quite racy even on uniprocessor.  But
	 * it is expected to average out OK.
	 */
	int prev_priority;

	ZONE_PADDING(_pad2_)
	/* Rarely used or read-mostly fields */

	/*
	 * wait_table		-- the array holding the hash table
	 * wait_table_hash_nr_entries	-- the size of the hash table array
	 * wait_table_bits	-- wait_table_size == (1 << wait_table_bits)
	 *
	 * The purpose of all these is to keep track of the people
	 * waiting for a page to become available and make them
	 * runnable again when possible. The trouble is that this
	 * consumes a lot of space, especially when so few things
	 * wait on pages at a given time. So instead of using
	 * per-page waitqueues, we use a waitqueue hash table.
	 *
	 * The bucket discipline is to sleep on the same queue when
	 * colliding and wake all in that wait queue when removing.
	 * When something wakes, it must check to be sure its page is
	 * truly available, a la thundering herd. The cost of a
	 * collision is great, but given the expected load of the
	 * table, they should be so rare as to be outweighed by the
	 * benefits from the saved space.
	 *
	 * __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
	 * primary users of these fields, and in mm/page_alloc.c
	 * free_area_init_core() performs the initialization of them.
	 */
	wait_queue_head_t	* wait_table;
	unsigned long		wait_table_hash_nr_entries;
	unsigned long		wait_table_bits;

	/*
	 * Discontig memory support fields.
	 */
	struct pglist_data	*zone_pgdat;
	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
	unsigned long		zone_start_pfn;

	/*
	 * zone_start_pfn, spanned_pages and present_pages are all
	 * protected by span_seqlock.  It is a seqlock because it has
	 * to be read outside of zone->lock, and it is done in the main
	 * allocator path.  But, it is written quite infrequently.
	 *
	 * The lock is declared along with zone->lock because it is
	 * frequently read in proximity to zone->lock.  It's good to
	 * give them a chance of being in the same cacheline.
	 */
	unsigned long		spanned_pages;	/* total size, including holes */
	unsigned long		present_pages;	/* amount of memory (excluding holes) */

	/*
	 * rarely used fields:
	 */
	const char		*name;
} ____cacheline_internodealigned_in_smp;

Every zone accessed in the kernel is in fact &contig_page_data.node_zones[0]; node_zones[1] has no usable space and can be ignored.
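This follows from the flat-memory configuration: when CONFIG_NEED_MULTIPLE_NODES is not set, the generic node accessor in include/linux/mmzone.h collapses to the single global instance, so every path that looks up a zone through NODE_DATA(nid)->node_zones ends up at the same object:

extern struct pglist_data contig_page_data;
#define NODE_DATA(nid)		(&contig_page_data)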

- spanned_pages and present_pages

spanned_pages is the number of available SDRAM pages; the memory range runs from 0 to 60MB. present_pages is spanned_pages minus the number of pages occupied by the page array. For 64MB of SDRAM with MTD disabled, spanned_pages is 0x3bff and present_pages is 0x3b6a.

- zone_pgdat

Points to the single global pglist_data, contig_page_data.

- prev_priority

Initialized to DEF_PRIORITY:

/*
 * The "priority" of VM scanning is how much of the queues we will scan in one
 * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
 * queues ("queue_length >> 12") during an aging round.
 */
#define DEF_PRIORITY 12

- wait_table_hash_nr_entries and wait_table_bits

For 64MB of memory, wait_table_hash_nr_entries is 0x40 and wait_table_bits is 6; the size of wait_table is wait_table_hash_nr_entries * sizeof(wait_queue_head_t).
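These values come from the waitqueue-table sizing helper in mm/page_alloc.c, which allocates one wait queue per 256 pages and rounds up to a power of two: with 0x3bff pages that gives 0x3bff / 256 = 59, rounded up to 64 (0x40) entries, hence wait_table_bits = 6. A sketch of the helper:

#define PAGES_PER_WAITQUEUE	256

static inline unsigned long wait_table_hash_nr_entries(unsigned long pages)
{
	unsigned long size = 1;

	pages /= PAGES_PER_WAITQUEUE;	/* one wait queue per 256 pages */
	while (size < pages)		/* round up to a power of two */
		size <<= 1;
	size = min(size, 4096UL);	/* clamp the table size */
	return max(size, 4UL);
}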

- free_area[MAX_ORDER]

struct free_area {
	struct list_head	free_list;
	unsigned long		nr_free;
};

In the buddy algorithm, free pages are organized into 11 block lists, holding blocks of 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 contiguous pages respectively. This member represents those lists.
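A minimal sketch of how free_area is indexed (MAX_ORDER is 11, so orders 0 through 10 hold blocks of 1 << order pages; zone_free_pages below is a hypothetical helper, not kernel code):

/* Count the free pages in a zone by walking the buddy lists:
 * free_area[order].nr_free blocks, each of (1 << order) pages. */
static unsigned long zone_free_pages(struct zone *z)
{
	unsigned long order, nr = 0;

	for (order = 0; order < MAX_ORDER; order++)
		nr += z->free_area[order].nr_free << order;
	return nr;
}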

1.1.1.3             zonelist

This structure is defined in include/linux/mmzone.h:

/*
 * One allocation request operates on a zonelist. A zonelist
 * is a list of zones, the first one is the 'goal' of the
 * allocation, the other zones are fallback zones, in decreasing
 * priority.
 *
 * If zlcache_ptr is not NULL, then it is just the address of zlcache,
 * as explained above.  If zlcache_ptr is NULL, there is no zlcache.
 */
struct zonelist {
	struct zonelist_cache *zlcache_ptr;		     // NULL or &zlcache
	struct zone *zones[MAX_ZONES_PER_ZONELIST + 1];      // NULL delimited
};

Here,

/* Maximum number of zones on a zonelist */
#define MAX_ZONES_PER_ZONELIST (MAX_NUMNODES * MAX_NR_ZONES)

which evaluates to 2 (with one node, MAX_NUMNODES is 1 and MAX_NR_ZONES is 2 in this configuration).

- zones

This structure is used only within pglist_data. After initialization, zones effectively holds a single element, pointing to the ZONE_DMA zone contig_page_data.node_zones[0].

- zlcache_ptr

This field is not actually used and is NULL.

1.1.1.4             pglist_data

This structure represents a NUMA node. Since a BF561 system has only one node, every pglist_data pointer appearing in the kernel can be assumed to point to the same place, namely the global variable contig_page_data.

The pglist_data structure is defined in include/linux/mmzone.h:

 

/*
 * The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
 * (mostly NUMA machines?) to denote a higher-level memory zone than the
 * zone denotes.
 *
 * On NUMA machines, each NUMA node would have a pg_data_t to describe
 * it's memory layout.
 *
 * Memory statistics and page replacement data structures are maintained on a
 * per-zone basis.
 */
struct bootmem_data;
typedef struct pglist_data {
	struct zone node_zones[MAX_NR_ZONES];
	struct zonelist node_zonelists[MAX_NR_ZONES];
	int nr_zones;
	struct page *node_mem_map;
	struct bootmem_data *bdata;
	unsigned long node_start_pfn;
	unsigned long node_present_pages; /* total number of physical pages */
	unsigned long node_spanned_pages; /* total size of physical page
					     range, including holes */
	int node_id;
	wait_queue_head_t kswapd_wait;
	struct task_struct *kswapd;
	int kswapd_max_order;
} pg_data_t;

This structure describes the available memory space.

- bdata

static bootmem_data_t contig_bootmem_data;
struct pglist_data contig_page_data = { .bdata = &contig_bootmem_data };

This definition shows that bdata always points to a fixed location, contig_bootmem_data, and that after mem_init has been called this member is no longer used.

- node_zones

Of the zones in this structure, the kernel actually uses only ZONE_DMA (index 0); its range extends from the end of the kernel code to the end of physical memory.

- node_id

Since the whole kernel uses only one node, node_id is 0.

- node_start_pfn

It is 0.

- node_spanned_pages and node_present_pages

Both members are initialized in calculate_node_totalpages. Their value is the number of SDRAM pages, including unused regions, the kernel code, and so on, and the two are equal. For 64MB of memory (actually limited to 60MB), the value is 0x3bff.

- node_mem_map

Every 4KB page in the kernel has a struct page corresponding to it; this member points to the first element of that page array. The space is allocated (with bootmem) by alloc_node_mem_map during initialization.
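The core of alloc_node_mem_map in this kernel generation looks roughly like this (a trimmed sketch of mm/page_alloc.c; the real function also aligns the span to MAX_ORDER boundaries before sizing the array):

static void __init alloc_node_mem_map(struct pglist_data *pgdat)
{
	if (!pgdat->node_mem_map) {
		/* one struct page per page frame the node spans */
		unsigned long size = pgdat->node_spanned_pages * sizeof(struct page);

		pgdat->node_mem_map = alloc_bootmem_node(pgdat, size);
	}
	/* with a single node, the global mem_map aliases node 0's map */
	if (pgdat == NODE_DATA(0))
		mem_map = NODE_DATA(0)->node_mem_map;
}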

- nr_zones

This is the highest index of a usable zone, plus one. Since the BF561 uses only ZONE_DMA, its value is 1.

 


 

 

 

 
