Chapter 8. Memory Management
8.1. Page Frame Management
8.1.1. Page Descriptors
State information of a page frame is kept in a page descriptor of type page.
All page descriptors are stored in the mem_map array.
virt_to_page(addr)
pfn_to_page(pfn)
8.1.2. Non-Uniform Memory Access (NUMA)
The physical memory inside each node can be split into several zones, as we will see in the next
section. Each node has a descriptor of type pg_data_t.
8.1.3. Memory Zones
Linux 2.6 partitions the physical memory of every memory node
into three zones. In the 80x86 UMA architecture the zones are:
ZONE_DMA
Contains page frames of memory below 16 MB
ZONE_NORMAL
Contains page frames of memory at and above 16 MB and below 896 MB
ZONE_HIGHMEM
Contains page frames of memory at and above 896 MB
The ZONE_DMA and ZONE_NORMAL zones include the "normal" page frames that can be directly accessed
by the kernel through the linear mapping in the fourth gigabyte of the linear address space (see the
section "Kernel Page Tables" in Chapter 2). Conversely, the ZONE_HIGHMEM zone includes page frames
that cannot be directly accessed by the kernel through the linear mapping in the fourth gigabyte of
linear address space (see the section "Kernel Mappings of High-Memory Page Frames" later in this
chapter). The ZONE_HIGHMEM zone is always empty on 64-bit architectures.
Each memory zone has its own descriptor of type zone. Its fields are shown in Table 8-4.
8.1.4. The Pool of Reserved Page Frames
The amount of reserved memory (in kilobytes) is stored in the min_free_kbytes variable.
Initially, min_free_kbytes cannot be lower than 128 or greater than 65,536.
The pages_min field of the zone descriptor stores the number of reserved page frames inside the
zone. As we'll see in Chapter 17, this field plays also a role for the page frame reclaiming algorithm,
together with the pages_low and pages_high fields. The pages_low field is always set to 5/4 of the
value of pages_min, and pages_high is always set to 3/2 of the value of pages_min.
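The ratios above can be sketched in a few lines of user-space C. The struct zone_watermarks type and the derive_watermarks( ) helper are hypothetical names made up for this illustration; only the three field names and the 5/4 and 3/2 ratios come from the text.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for the three watermark fields of a zone descriptor. */
struct zone_watermarks {
    unsigned long pages_min;
    unsigned long pages_low;
    unsigned long pages_high;
};

/* Derive pages_low and pages_high from pages_min using the ratios in the text. */
struct zone_watermarks derive_watermarks(unsigned long pages_min)
{
    struct zone_watermarks w;

    assert(pages_min <= ULONG_MAX / 5);   /* keep pages_min * 5 from overflowing */
    w.pages_min  = pages_min;
    w.pages_low  = pages_min * 5 / 4;     /* 5/4 of pages_min */
    w.pages_high = pages_min * 3 / 2;     /* 3/2 of pages_min */
    return w;
}
```

For instance, a zone with 128 reserved page frames would get pages_low equal to 160 and pages_high equal to 192 under these ratios.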
8.1.5. The Zoned Page Frame Allocator
8.1.5.1. Requesting and releasing page frames
alloc_pages(gfp_mask, order)
alloc_page(gfp_mask)
alloc_pages( ) requests 2^order contiguous page frames, and alloc_page( ) a single one. Each returns
the address of the descriptor of the first allocated page frame, or NULL if the allocation failed.
__get_free_pages(gfp_mask, order)
__get_free_page(gfp_mask)
get_zeroed_page(gfp_mask)
__get_dma_pages(gfp_mask, order)
These are similar to alloc_pages( ) and alloc_page( ), but they return the linear address of the first allocated page.
__free_pages(page, order)
__free_page(page)
This function checks the page descriptor pointed to by page; if the page frame is not reserved
(i.e., if the PG_reserved flag is equal to 0), it decreases the count field of the descriptor. If
count becomes 0, it assumes that 2^order contiguous page frames starting from the one
corresponding to page are no longer used. In this case, the function releases the page frames
as explained in a later section.
free_pages(addr, order)
free_page(addr)
Similar to __free_pages( ) and __free_page( ), but they receive as an argument the linear address addr of the first page frame to be released.
8.1.6. Kernel Mappings of High-Memory Page Frames
The kernel uses three different mechanisms to map page frames in high memory; they are called
permanent kernel mapping, temporary kernel mapping, and noncontiguous memory allocation. In
this section, we'll cover the first two techniques; the third one is discussed in the section
"Noncontiguous Memory Area Management" later in this chapter.
8.1.6.1. Permanent kernel mappings
page_address( )
The page_address( ) function returns the linear address associated with the page frame, or NULL if the page frame is in high memory and is not mapped.
kmap_high()
The kmap_high( ) function is invoked if the page frame really belongs to high memory.
kunmap( )
The kunmap( ) function destroys a permanent kernel mapping established previously by kmap( ).
8.1.6.2. Temporary kernel mappings
kmap_atomic( )
8.1.7. The Buddy System Algorithm
The technique adopted by Linux to solve the external fragmentation problem is based on the well-known
buddy system algorithm. All free page frames are grouped into 11 lists of blocks that contain
groups of 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 contiguous page frames, respectively. The
largest request of 1024 page frames corresponds to a chunk of 4 MB of contiguous RAM. The
physical address of the first page frame of a block is a multiple of the group size; for example, the
initial address of a 16-page-frame block is a multiple of 16 x 2^12 (2^12 = 4,096, which is the regular
page size).
8.1.7.1. Data structures
1: zone->zone_mem_map: pointer to the first page descriptor of the zone.
2: An array consisting of eleven elements of type free_area, one element for each group size.
The array is stored in the free_area field of the zone descriptor.
zone->free_area[k]
8.1.7.2. Allocating a block
The __rmqueue( ) function is used to find a free block in a zone.
8.1.7.3. Freeing a block
The __free_pages_bulk( ) / __free_one_page( ) function implements the buddy system strategy for
freeing page frames.
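The freeing strategy can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the free_order array, NPAGES, buddy_init( ), buddy_index( ), and buddy_free( ) are hypothetical names, and the array replaces the per-order free lists of the real zone descriptor. The key idea it shows is that the buddy of a block is found by flipping one bit of its page index, and that two free buddies merge into a block of the next order.

```c
#include <assert.h>

#define MAX_ORDER 11      /* block sizes 2^0 .. 2^10, as in the text */
#define NPAGES    1024    /* hypothetical zone size: one maximal block */

/* free_order[i] is the order of the free block starting at page i, or -1. */
static int free_order[NPAGES];

void buddy_init(void)
{
    for (int i = 0; i < NPAGES; i++)
        free_order[i] = -1;
}

/* Index of the buddy of the block of the given order starting at page_idx. */
unsigned long buddy_index(unsigned long page_idx, unsigned int order)
{
    return page_idx ^ (1UL << order);
}

/* Free a block and merge it with free buddies; returns the final order. */
unsigned int buddy_free(unsigned long page_idx, unsigned int order)
{
    /* A block must be aligned to its own size and fit in the zone. */
    assert(order < MAX_ORDER);
    assert((page_idx & ((1UL << order) - 1)) == 0);
    assert(page_idx + (1UL << order) <= NPAGES);

    while (order < MAX_ORDER - 1) {
        unsigned long b = buddy_index(page_idx, order);

        if (b >= NPAGES || free_order[b] != (int)order)
            break;               /* buddy is busy or has a different size */
        free_order[b] = -1;      /* take the buddy off its "free list" */
        page_idx &= b;           /* merged block starts at the lower index */
        order++;
    }
    free_order[page_idx] = (int)order;
    return order;
}
```

Freeing pages 0 and 1 one after the other yields a single order-1 block at index 0; freeing the order-1 block at index 2 afterwards merges everything into an order-2 block.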
8.1.8. The Per-CPU Page Frame Cache
The main data structure implementing the per-CPU page frame cache is an array of per_cpu_pageset
data structures stored in the pageset field of the memory zone descriptor. The array includes one
element for each CPU; this element, in turn, consists of two per_cpu_pages descriptors, one for the
hot cache and the other for the cold cache. The fields of the per_cpu_pages descriptor are listed in
Table 8-7.
Table 8-7. The fields of the per_cpu_pages descriptor
Type Name Description
int count Number of page frames in the cache
int low Low watermark for cache replenishing
int high High watermark for cache depletion
int batch Number of page frames to be added or subtracted from the cache
struct list_head list List of descriptors of the page frames included in the cache
8.1.8.1. Allocating page frames through the per-CPU page frame caches
buffered_rmqueue( )
8.1.8.2. Releasing page frames to the per-CPU page frame caches
free_hot_cold_page( )
8.1.9. The Zone Allocator
__alloc_pages( )-->zone_watermark_ok( )
__free_pages( )-->__free_one_page( )
8.2. Memory Area Management
8.2.1. The Slab Allocator
Figure 8-3. The slab allocator components
8.2.2. Cache Descriptor
1: Each cache is described by a structure of type kmem_cache_t (e.g., kmem_cache).
Table 8-8. The fields of the kmem_cache_t descriptor
Type Name Description
struct array_cache *[] array Per-CPU array of pointers to local caches of free objects (see the section "Local Caches of Free Slab Objects" later in this chapter).
unsigned int batchcount Number of objects to be transferred in bulk to or from the local caches.
unsigned int limit Maximum number of free objects in the local caches. This is tunable.
struct kmem_list3 lists See next table.
unsigned int objsize Size of the objects included in the cache
unsigned int flags Set of flags that describes permanent properties of the cache.
unsigned int num Number of objects packed into a single slab. (All slabs of the cache
have the same size.)
unsigned int free_limit Upper limit of free objects in the whole slab cache
spinlock_t spinlock Cache spin lock.
unsigned int gfporder Logarithm of the number of contiguous page frames included in a single slab.
unsigned int gfpflags Set of flags passed to the buddy system function when allocating page frames.
size_t colour Number of colors for the slabs (see the section "Slab Coloring" later
in this chapter).
unsigned int colour_off Basic alignment offset in the slabs.
unsigned int colour_next Color to use for the next allocated slab.
kmem_cache_t* slabp_cache Pointer to the general slab cache containing the slab descriptors
(NULL if internal slab descriptors are used; see next section).
unsigned int slab_size The size of a single slab
unsigned int dflags Set of flags that describe dynamic properties of the cache
void * ctor Pointer to constructor method associated with the cache
void * dtor Pointer to destructor method associated with the cache
const char * name Character array storing the name of the cache
struct list_head next Pointers for the doubly linked list of cache descriptors.
The CFLGS_OFF_SLAB flag in the flags field of the cache descriptor is set to one if the slab descriptor is stored outside the slab; it is set to zero otherwise.
2: The lists field of the kmem_cache_t descriptor
8.2.3. Slab Descriptor
kmem_cache->flags :
The CFLGS_OFF_SLAB flag in the flags field of the cache descriptor
is set to one if the slab descriptor is stored outside the slab; it is set to zero otherwise.
External slab descriptor
Internal slab descriptor
Figure 8-4. Relationship between cache and slab descriptors
8.2.4. General and Specific Caches
General caches are:
1: A first cache called kmem_cache whose objects are the cache descriptors of the remaining
caches used by the kernel. The cache_cache variable contains the descriptor of this special cache.
2: Several additional caches contain general purpose memory areas. The range of the memory
area sizes typically includes 13 geometrically distributed sizes. A table called malloc_sizes
(whose elements are of type cache_sizes) points to 26 cache descriptors associated with
memory areas of size 32, 64, 128, 256, 512, 1,024, 2,048, 4,096, 8,192, 16,384, 32,768,
65,536, and 131,072 bytes. For each size, there are two caches: one suitable for ISA DMA
allocations and the other for normal allocations.
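The size classes listed above suggest a simple lookup: a request is served from the smallest general cache whose objects are large enough. The sketch below models only that rounding step in user space; general_sizes and general_cache_size( ) are hypothetical names, and the real malloc_sizes table also holds the cache descriptor pointers, which are omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* The 13 general cache sizes from the text, 32 bytes up to 131,072 bytes. */
static const size_t general_sizes[] = {
    32, 64, 128, 256, 512, 1024, 2048, 4096,
    8192, 16384, 32768, 65536, 131072
};
#define N_SIZES (sizeof(general_sizes) / sizeof(general_sizes[0]))

/* Return the object size of the smallest general cache that fits, or 0. */
size_t general_cache_size(size_t request)
{
    assert(request > 0);
    for (size_t i = 0; i < N_SIZES; i++)
        if (request <= general_sizes[i])
            return general_sizes[i];
    return 0;   /* larger than 131,072 bytes: no general cache fits */
}
```

A 100-byte request would thus be served from the 128-byte cache, wasting 28 bytes to internal fragmentation.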
Specific caches
Specific caches are created by the kmem_cache_create( ) function. Depending on the parameters,
the function first determines the best way to handle the new cache (for instance, whether to include
the slab descriptor inside or outside of the slab). It then allocates a cache descriptor for the new
cache from the cache_cache general cache and inserts the descriptor in the cache_chain list of cache
descriptors (the insertion is done after having acquired the cache_chain_sem semaphore that
protects the list from concurrent accesses).
It is also possible to destroy a cache and remove it from the cache_chain list by invoking
kmem_cache_destroy( ). This function is mostly useful to modules that create their own caches when
loaded and destroy them when unloaded. To avoid wasting memory space, the kernel must destroy
all slabs before destroying the cache itself. The kmem_cache_shrink( ) function destroys all the slabs
in a cache by invoking slab_destroy( ) iteratively (see the later section "Releasing a Slab from a
Cache").
The names of all general and specific caches can be obtained at runtime by reading /proc/slabinfo;
this file also specifies the number of free objects and the number of allocated objects in each cache.
8.2.5. Interfacing the Slab Allocator with the Zoned Page Frame Allocator
kmem_getpages( )
kmem_freepages( )
8.2.6. Allocating a Slab to a Cache
cache_grow( )
8.2.7. Releasing a Slab from a Cache
slab_destroy( )
8.2.8. Object Descriptor
Internal object descriptors
External object descriptors
The first object descriptor in the array describes the first object in the slab, and so on. An object
descriptor is simply an unsigned short integer, which is meaningful only when the object is free. It
contains the index of the next free object in the slab, thus implementing a simple list of free objects
inside the slab. The object descriptor of the last element in the free object list is marked by the
conventional value BUFCTL_END (0xffff).
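The free-object list described above can be modeled as an array of unsigned short indices. In this sketch, struct slab_sketch, NUM_OBJS, and the three helper functions are hypothetical names chosen for illustration; only the idea of index-linked object descriptors and the BUFCTL_END marker come from the text.

```c
#include <assert.h>

#define BUFCTL_END 0xffffu   /* end-of-list marker from the text */
#define NUM_OBJS   4         /* hypothetical number of objects per slab */

struct slab_sketch {
    unsigned short bufctl[NUM_OBJS]; /* object descriptors: next free index */
    unsigned short free;             /* index of the first free object */
    unsigned int   inuse;            /* number of allocated objects */
};

/* Initially every object is free and points to its successor. */
void slab_init(struct slab_sketch *s)
{
    for (unsigned short i = 0; i < NUM_OBJS; i++)
        s->bufctl[i] = (i + 1 < NUM_OBJS) ? (unsigned short)(i + 1)
                                          : (unsigned short)BUFCTL_END;
    s->free = 0;
    s->inuse = 0;
}

/* Take the first free object; returns its index, or BUFCTL_END if none. */
unsigned short slab_get_obj(struct slab_sketch *s)
{
    unsigned short idx = s->free;

    if (idx == BUFCTL_END)
        return (unsigned short)BUFCTL_END;
    s->free = s->bufctl[idx];   /* the descriptor names the next free object */
    s->inuse++;
    return idx;
}

/* Return an object: it becomes the new head of the free list. */
void slab_put_obj(struct slab_sketch *s, unsigned short idx)
{
    assert(idx < NUM_OBJS && s->inuse > 0);
    s->bufctl[idx] = s->free;
    s->free = idx;
    s->inuse--;
}
```

Because a freed object goes to the head of the list, the most recently released object is the first one handed out again.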
Figure 8-5. Relationships between slab and object descriptors
8.2.9. Aligning Objects in Memory
8.2.10. Slab Coloring
Objects that have the same offset within different slabs are likely to end up mapped in the same cache line.
The cache hardware might therefore waste memory cycles transferring two objects from the same cache line back and forth to different RAM
locations, while other cache lines go underutilized.
The slab allocator tries to reduce this effect with a policy called slab coloring: different arbitrary values called colors are assigned to the slabs.
Figure 8-6. Slab with color col and alignment aln
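A minimal sketch of how colors could be handed out to successive slabs, assuming the colour, colour_off, and colour_next fields behave as described in Table 8-8. The struct cache_colour type and next_slab_offset( ) are hypothetical names; the sketch computes only the col x aln displacement and ignores any space taken by slab or object descriptors.

```c
#include <assert.h>

/* Hypothetical subset of the cache descriptor fields listed in Table 8-8. */
struct cache_colour {
    unsigned int colour;        /* number of available colors */
    unsigned int colour_off;    /* basic alignment offset (aln) */
    unsigned int colour_next;   /* color to use for the next slab */
};

/* Return the byte displacement for the next slab and advance the color. */
unsigned int next_slab_offset(struct cache_colour *c)
{
    unsigned int offset;

    assert(c->colour > 0 && c->colour_next < c->colour);
    offset = c->colour_next * c->colour_off;   /* col x aln */
    c->colour_next++;
    if (c->colour_next >= c->colour)
        c->colour_next = 0;                    /* wrap around cyclically */
    return offset;
}
```

With three colors and a 32-byte alignment, consecutive slabs would start their objects at displacements 0, 32, 64, and then 0 again.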
8.2.11. Local Caches of Free Slab Objects
Each cache of the slab allocator includes a per-CPU data structure consisting of a small array of pointers to freed objects called the slab local cache. The slab data structures get involved only when the local cache underflows or overflows.
kmem_cache->array
Table 8-11. The fields of the array_cache structure
Type Name Description
unsigned int avail Number of pointers to available objects in the local cache. The field also
acts as the index of the first free slot in the cache.
unsigned int limit Size of the local cache, that is, the maximum number of pointers in the
local cache.
unsigned int batchcount Chunk size for local cache refill or emptying
unsigned int touched Flag set to 1 if the local cache has been recently used
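The role of avail as both a counter and the index of the first free slot can be shown with a small stack-like model. The struct array_cache_sketch type, LIMIT, ac_get( ), and ac_put( ) are hypothetical names for this illustration; refilling from and flushing to the slabs is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

#define LIMIT 4   /* hypothetical local cache size */

struct array_cache_sketch {
    unsigned int avail;    /* pointers available; also the first free slot */
    unsigned int limit;    /* maximum number of pointers */
    void *entry[LIMIT];    /* pointers to freed objects */
};

/* Pop the most recently freed object; NULL means underflow (refill needed). */
void *ac_get(struct array_cache_sketch *ac)
{
    if (ac->avail == 0)
        return NULL;
    return ac->entry[--ac->avail];
}

/* Push a freed object; -1 means overflow (the cache must be flushed first). */
int ac_put(struct array_cache_sketch *ac, void *obj)
{
    assert(obj != NULL);
    if (ac->avail >= ac->limit)
        return -1;
    ac->entry[ac->avail++] = obj;
    return 0;
}
```

The last-in, first-out order means the object most likely to still be in the hardware cache is reused first.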
8.2.12. Allocating a Slab Object
kmem_cache_alloc( )
-->cache_alloc_refill( )
8.2.13. Freeing a Slab Object
kmem_cache_free( )
-->cache_flusharray( )
8.2.14. General Purpose Objects
kmalloc( )
kfree()
8.2.15. Memory Pools
A memory pool should not be confused with the reserved page frames described in the earlier section
"The Pool of Reserved Page Frames": those page frames can be used only to satisfy atomic memory allocation requests issued by interrupt handlers or inside critical regions.
A memory pool, instead, is a reserve of dynamic memory that can be used only by a specific kernel component, namely the "owner" of the pool.
A memory pool is described by a mempool_t object.
Table 8-12. The fields of the mempool_t object
Type Name Description
spinlock_t lock Spin lock protecting the object fields
int min_nr Minimum number of elements in the memory pool
int curr_nr Current number of elements in the memory pool
void ** elements Pointer to an array of pointers to the reserved elements
void * pool_data Private data available to the pool's owner
mempool_alloc_t * alloc Method to allocate an element
mempool_free_t * free Method to free an element
wait_queue_head_t wait Wait queue used when the memory pool is empty
mempool_create( )
mempool_destroy( )
mempool_alloc( )
mempool_free( )
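The interplay of min_nr, curr_nr, and elements can be sketched in user space with malloc( ) standing in for the pool's alloc method. All names below (struct mempool_sketch and the pool_* functions) are hypothetical, and the simulate_failure parameter exists only so the fallback path can be exercised; the sketch also returns NULL where the kernel could sleep on the wait queue.

```c
#include <assert.h>
#include <stdlib.h>

struct mempool_sketch {
    int     min_nr;      /* number of elements kept in reserve */
    int     curr_nr;     /* elements currently in the reserve */
    void  **elements;    /* array of pointers to reserved elements */
    size_t  elem_size;   /* size of each element (sketch-only field) */
};

/* Release the reserved elements and the pointer array. */
void pool_destroy(struct mempool_sketch *p)
{
    while (p->curr_nr > 0)
        free(p->elements[--p->curr_nr]);
    free(p->elements);
    p->elements = NULL;
}

/* Create the pool and preallocate min_nr elements; 0 on success. */
int pool_create(struct mempool_sketch *p, int min_nr, size_t elem_size)
{
    assert(min_nr > 0 && elem_size > 0);
    p->min_nr = min_nr;
    p->curr_nr = 0;
    p->elem_size = elem_size;
    p->elements = malloc((size_t)min_nr * sizeof(void *));
    if (p->elements == NULL)
        return -1;
    while (p->curr_nr < min_nr) {
        void *e = malloc(elem_size);
        if (e == NULL) {
            pool_destroy(p);
            return -1;
        }
        p->elements[p->curr_nr++] = e;
    }
    return 0;
}

/* Try the normal allocator first; fall back to the reserve if it fails. */
void *pool_alloc(struct mempool_sketch *p, int simulate_failure)
{
    void *e = simulate_failure ? NULL : malloc(p->elem_size);

    if (e != NULL)
        return e;
    if (p->curr_nr > 0)
        return p->elements[--p->curr_nr];
    return NULL;   /* reserve exhausted */
}

/* Refill the reserve first; only surplus elements are really freed. */
void pool_free(struct mempool_sketch *p, void *e)
{
    if (p->curr_nr < p->min_nr)
        p->elements[p->curr_nr++] = e;
    else
        free(e);
}
```

The design choice worth noting is that the reserve is touched only when the ordinary allocation fails, so it stays full under normal conditions.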
8.3. Noncontiguous Memory Area Management
It makes sense to consider an allocation scheme based on noncontiguous page frames accessed through contiguous linear
addresses. The main advantage of this scheme is that it avoids external fragmentation.
8.3.1. Linear Addresses of Noncontiguous Memory Areas
Figure 8-7. The linear address interval starting from PAGE_OFFSET
1: The beginning of the area includes the linear addresses that map the first 896 MB of RAM (see
the section "Process Page Tables" in Chapter 2); the linear address that corresponds to the end
of the directly mapped physical memory is stored in the high_memory variable.
2: The end of the area contains the fix-mapped linear addresses (see the section "Fix-Mapped
Linear Addresses" in Chapter 2).
3: The remaining linear addresses can be used for noncontiguous memory areas. A safety interval
of size 8 MB (macro VMALLOC_OFFSET) is inserted between the end of the physical memory
mapping and the first memory area; its purpose is to "capture" out-of-bounds memory
accesses. For the same reason, additional safety intervals of size 4 KB are inserted to separate
noncontiguous memory areas.
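The two safety intervals can be turned into simple address arithmetic. The sketch below is an assumption-laden illustration, not the kernel's definition: vmalloc_start( ) rounds up so that the first area is 8 MB-aligned and at least 8 MB past high_memory, and area_size_with_guard( ) adds one 4 KB page to a page-rounded size. Both function names are hypothetical.

```c
#include <assert.h>

#define PAGE_SIZE      4096UL
#define VMALLOC_OFFSET (8UL * 1024 * 1024)   /* 8 MB safety interval */

/*
 * Assumed rounding rule: the first noncontiguous area starts on an 8 MB
 * boundary that lies at least 8 MB above the end of the direct mapping.
 */
unsigned long vmalloc_start(unsigned long high_memory)
{
    return (high_memory + 2 * VMALLOC_OFFSET - 1) & ~(VMALLOC_OFFSET - 1);
}

/* Linear address range consumed by one area: whole pages plus a guard page. */
unsigned long area_size_with_guard(unsigned long size)
{
    assert(size > 0);
    size = (size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);   /* round up to a page */
    return size + PAGE_SIZE;                            /* 4 KB safety interval */
}
```

With 896 MB of directly mapped RAM starting at 0xc0000000, high_memory would be 0xf8000000 and this rule would place the first area at 0xf8800000.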
8.3.2. Descriptors of Noncontiguous Memory Areas
Each noncontiguous memory area is associated with a descriptor of type vm_struct
Table 8-13. The fields of the vm_struct descriptor
Type Name Description
void * addr Linear address of the first memory cell of the area
unsigned long size Size of the area plus 4,096 (inter-area safety interval)
unsigned long flags Type of memory mapped by the noncontiguous memory area
struct page ** pages Pointer to array of nr_pages pointers to page descriptors
unsigned int nr_pages Number of pages filled by the area
unsigned long phys_addr Set to 0 unless the area has been created to map the I/O shared
memory of a hardware device
struct vm_struct * next Pointer to next vm_struct structure
1: These descriptors are inserted in a simple list by means of the next field; the address of the first element of the list is stored in the vmlist variable.
2: The flags field identifies the type of memory mapped by the area:
VM_ALLOC for pages obtained by means of vmalloc( ),
VM_MAP for already allocated pages mapped by means of vmap() (see the next section), and
VM_IOREMAP for on-board memory of hardware devices mapped by means of ioremap( ) (see Chapter 13).
3: The get_vm_area( ) function looks for a free range of linear addresses between VMALLOC_START and VMALLOC_END.
8.3.3. Allocating a Noncontiguous Memory Area
vmalloc( )
The last crucial step consists of fiddling with the page table entries used by the kernel to
indicate that each page frame allocated to the noncontiguous memory area is now associated with a
linear address included in the interval of contiguous linear addresses yielded by vmalloc( ). This is
what map_vm_area( ) does.
8.3.4. Releasing a Noncontiguous Memory Area
vfree( )
-->remove_vm_area( )
Chapter 9. Process Address Space
9.1. The Process's Address Space
The kernel represents intervals of linear addresses by means of resources called memory regions.
Table 9-1. System calls related to memory region creation and deletion
System call Description
brk( ) Changes the heap size of the process
execve( ) Loads a new executable file, thus changing the process address space
_exit( ) Terminates the current process and destroys its address space
fork( ) Creates a new process, and thus a new address space
mmap( ), mmap2( ) Creates a memory mapping for a file, thus enlarging the process address space
mremap( ) Expands or shrinks a memory region
remap_file_pages() Creates a non-linear mapping for a file (see Chapter 16)
munmap( ) Destroys a memory mapping for a file, thus contracting the process address space
shmat( ) Attaches a shared memory region
shmdt( ) Detaches a shared memory region
9.2. The Memory Descriptor
mm_struct
Table 9-2. The fields of the memory descriptor
Type Field Description
struct vm_area_struct* mmap Pointer to the head of the list of memory region objects
struct rb_root mm_rb Pointer to the root of the red-black tree of memory region objects
struct vm_area_struct* mmap_cache Pointer to the last referenced memory region object
unsigned long(*)( ) get_unmapped_area Method that searches an available linear address interval in
the process address space
void (*)( ) unmap_area Method invoked when releasing a linear address interval
unsigned long mmap_base Identifies the linear address of the first allocated
anonymous memory region or file memory mapping (see
the section "Program Segments and Process Memory
Regions" in Chapter 20)
unsigned long free_area_cache Address from which the kernel will look for a free interval of
linear addresses in the process address space
pgd_t * pgd Pointer to the Page Global Directory
atomic_t mm_users Secondary usage counter
atomic_t mm_count Main usage counter
struct rw_semaphore mmap_sem Memory regions' read/write semaphore
spinlock_t page_table_lock Memory regions' and Page Tables' spin lock
struct list_head mmlist Pointers to adjacent elements in the list of memory descriptors
unsigned long start_code Initial address of executable code
unsigned long end_code Final address of executable code
unsigned long start_data Initial address of initialized data
unsigned long end_data Final address of initialized data
unsigned long start_brk Initial address of the heap
unsigned long brk Current final address of the heap
unsigned long start_stack Initial address of User Mode stack
unsigned long arg_start Initial address of command-line arguments
unsigned long arg_end Final address of command-line arguments
unsigned long env_start Initial address of environment variables
unsigned long env_end Final address of environment variables
unsigned long rss Number of page frames allocated to the process
unsigned long anon_rss Number of page frames assigned to anonymous memory mappings
unsigned long total_vm Size of the process address space (number of pages)
unsigned long locked_vm Number of "locked" pages that cannot be swapped out (see Chapter 17)
unsigned long shared_vm Number of pages in shared file memory mappings
unsigned long exec_vm Number of pages in executable memory mappings
unsigned long stack_vm Number of pages in the User Mode stack
unsigned long reserved_vm Number of pages in reserved or special memory regions
unsigned long def_flags Default access flags of the memory regions
unsigned long nr_ptes Number of Page Tables of this process
unsigned long[] saved_auxv Used when starting the execution of an ELF program (see Chapter 20)
unsigned int dumpable Flag that specifies whether the process can produce a core dump of the memory
cpumask_t cpu_vm_mask Bit mask for lazy TLB switches (see Chapter 2)
mm_context_t context Pointer to table for architecture-specific information (e.g., LDT's address in 80x86 platforms)
unsigned long swap_token_time When this process will become eligible for having the swap
token (see the section "The Swap Token" in Chapter 17)
char recent_pagein Flag set if a major Page Fault has recently occurred
int core_waiters Number of lightweight processes that are dumping the
contents of the process address space to a core file (see the
section "Deleting a Process Address Space" later in this
chapter)
struct completion * core_startup_done Pointer to a completion used when creating a core file (see
the section "Completions" in Chapter 5)
struct completion core_done Completion used when creating a core file
rwlock_t ioctx_list_lock Lock used to protect the list of asynchronous I/O contexts
(see Chapter 16)
struct kioctx * ioctx_list List of asynchronous I/O contexts (see Chapter 16)
struct kioctx default_kioctx Default asynchronous I/O context (see Chapter 16)
unsigned long hiwater_rss Maximum number of page frames ever owned by the process
unsigned long hiwater_vm Maximum number of pages ever included in the memory regions of the process
The mm_alloc( ) function is invoked to get a new memory descriptor.
The mmput( ) function decreases the mm_users field of a memory descriptor.
9.2.1. Memory Descriptor of Kernel Threads
9.3. Memory Regions
Linux implements a memory region by means of an object of type vm_area_struct.
Table 9-3. The fields of the memory region object
Type Field Description
struct mm_struct * vm_mm Pointer to the memory descriptor that owns the region
unsigned long vm_start First linear address inside the region
unsigned long vm_end First linear address after the region
struct vm_area_struct * vm_next Next region in the process list.
pgprot_t vm_page_prot Access permissions for the page frames of the region
unsigned long vm_flags Flags of the region
struct rb_node vm_rb Data for the red-black tree (see later in this chapter).
union shared Links to the data structures used for reverse mapping
(see the section "Reverse Mapping for Mapped Pages" in Chapter 17).
struct list_head anon_vma_node Pointers for the list of anonymous memory regions
struct anon_vma * anon_vma Pointer to the anon_vma data structure
struct vm_operations_struct* vm_ops Pointer to the methods of the memory region
unsigned long vm_pgoff Offset in mapped file (see Chapter 16). For anonymous pages,
it is either zero or equal to vm_start/PAGE_SIZE (see Chapter 17).
struct file * vm_file Pointer to the file object of the mapped file, if any.
void * vm_private_data Pointer to private data of the memory region.
unsigned long vm_truncate_count Used when releasing a linear address interval in a non-linear file memory mapping.
9.3.1. Memory Region Data Structures
Figure 9-2. Descriptors related to the address space of a process
9.3.2. Memory Region Access Rights
vm_area_struct->vm_flags
9.3.3. Memory Region Handling
9.3.3.1. Finding the closest region to a given address: find_vma( )
The find_vma( ) function acts on two parameters: the address mm of a process memory descriptor and a linear address addr.
It locates the first memory region whose vm_end field is greater than addr and returns the address of its descriptor.
9.3.3.2. Finding a region that overlaps a given interval: find_vma_intersection( )
The find_vma_intersection( ) function finds the first memory region that overlaps a given linear
address interval; the mm parameter points to the memory descriptor of the process, while the
start_addr and end_addr linear addresses specify the interval.
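Both lookups can be sketched over a plain sorted list of regions. This is a simplified user-space model: struct vma_sketch and the two *_sketch functions are hypothetical names, and a linear list walk is assumed for clarity, with no attempt to model the red-black tree or the mmap_cache field of the memory descriptor.

```c
#include <assert.h>
#include <stddef.h>

struct vma_sketch {
    unsigned long vm_start;        /* first linear address inside the region */
    unsigned long vm_end;          /* first linear address after the region */
    struct vma_sketch *vm_next;    /* next region, in increasing address order */
};

/* First region whose vm_end is greater than addr, or NULL if there is none. */
struct vma_sketch *find_vma_sketch(struct vma_sketch *head, unsigned long addr)
{
    for (struct vma_sketch *v = head; v != NULL; v = v->vm_next)
        if (v->vm_end > addr)
            return v;
    return NULL;
}

/* First region overlapping [start_addr, end_addr), or NULL if there is none. */
struct vma_sketch *find_vma_intersection_sketch(struct vma_sketch *head,
                                                unsigned long start_addr,
                                                unsigned long end_addr)
{
    struct vma_sketch *v;

    assert(start_addr < end_addr);
    v = find_vma_sketch(head, start_addr);
    if (v != NULL && end_addr <= v->vm_start)
        v = NULL;   /* the candidate starts after the interval ends */
    return v;
}
```

Note that the region returned by the first function does not necessarily contain addr: the address may fall in a hole that precedes the region, so callers still have to compare it with vm_start.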
9.3.3.3. Finding a free interval: get_unmapped_area( )
The get_unmapped_area( ) function searches the process address space to find an available linear address interval.
The function invokes one of two methods, depending on whether the linear address interval should be used for a file memory
mapping or for an anonymous memory mapping.
In the former case, the function executes the get_unmapped_area file operation; this is discussed in Chapter 16.
In the latter case, the function executes the get_unmapped_area method of the memory descriptor.
In turn, this method is implemented by either the arch_get_unmapped_area( ) function, or the
arch_get_unmapped_area_topdown( ) function, according to the memory region layout of the process.
9.3.3.4. Inserting a region in the memory descriptor list: insert_vm_struct( )
insert_vm_struct( ) inserts a vm_area_struct structure in the memory region object list and red-black
tree of a memory descriptor.
9.3.4. Allocating a Linear Address Interval
The do_mmap( ) function creates and initializes a new memory region for the current process.
9.3.5. Releasing a Linear Address Interval
When the kernel must delete a linear address interval from the address space of the current process,
it uses the do_munmap( ) function.
9.3.5.1. The do_munmap( ) function
split_vma( )--->detach_vmas_to_be_unmapped( )--->unmap_region( )
9.3.5.2. The split_vma( ) function
9.3.5.3. The unmap_region( ) function
9.4. Page Fault Exception Handler //key topic
Figure 9-4. Overall scheme for the Page Fault handler
9.4.5. Handling Noncontiguous Memory Area Accesses
9.5. Creating and Deleting a Process Address Space
9.5.1. Creating a Process Address Space
9.5.2. Deleting a Process Address Space
9.6. Managing the Heap
Chapter 10. System Calls
10.1. POSIX APIs and System Calls
An application programmer interface (API) is a function definition that specifies how to obtain a given service.
A system call is an explicit request to the kernel made via a software interrupt.
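The difference between the two notions can be seen from user space. The sketch below assumes a Linux system with the GNU C library: getpid( ) is the API, while syscall(SYS_getpid) asks the library to issue the underlying system call directly. The helper names pid_via_api( ) and pid_via_syscall( ) are made up for this example, and the instruction actually used to enter the kernel depends on the platform and library.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* API route: call the wrapper function defined by the C library. */
pid_t pid_via_api(void)
{
    return getpid();
}

/* System call route: pass the system call number to the kernel entry path. */
pid_t pid_via_syscall(void)
{
    long ret = syscall(SYS_getpid);

    assert(ret > 0);   /* getpid cannot fail, so a valid PID is expected */
    return (pid_t)ret;
}
```

Both routes should report the same process ID, since the API is here just a thin wrapper around the system call.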
10.2. System Call Handler and Service Routines
Figure 10-1. Invoking a system call
10.3. Entering and Exiting a System Call
Applications can invoke a system call in two different ways:
By executing the int $0x80 assembly language instruction; in older versions of the Linux
kernel, this was the only way to switch from User Mode to Kernel Mode.
By executing the sysenter assembly language instruction, introduced in the Intel Pentium II
microprocessors; this instruction is now supported by the Linux 2.6 kernel.
Similarly, the kernel can exit from a system call, thus switching the CPU back to User Mode, in two ways:
By executing the iret assembly language instruction.
By executing the sysexit assembly language instruction, which was introduced in the Intel
Pentium II microprocessors together with the sysenter instruction.
10.3.1. Issuing a System Call via the int $0x80 Instruction
10.3.1.1. The system_call( ) function
10.3.1.2. Exiting from the system call
10.3.2. Issuing a System Call via the sysenter Instruction
10.3.2.1. The sysenter instruction
10.3.2.2. The vsyscall page
10.3.2.3. Entering the system call
10.3.2.4. Exiting from the system call
10.3.2.5. The sysexit instruction
10.4. Parameter Passing
10.4.1. Verifying the Parameters
access_ok( )
verify_area( )
10.4.2. Accessing the Process Address Space
get_user( )/put_user( ).
copy_from_user/copy_to_user
10.4.3. Dynamic Address Checking: The Fix-up Code
1. The kernel attempts to address a page belonging to the process address space, but either the
corresponding page frame does not exist or the kernel tries to write a read-only page. In these
cases, the handler must allocate and initialize a new page frame (see the sections "Demand
Paging" and "Copy On Write" in Chapter 9).
2. The kernel addresses a page belonging to its address space, but the corresponding Page Table
entry has not yet been initialized (see the section "Handling Noncontiguous Memory Area
Accesses" in Chapter 9). In this case, the kernel must properly set up some entries in the Page
Tables of the current process.
3. Some kernel functions include a programming bug that causes the exception to be raised when
that program is executed; alternatively, the exception might be caused by a transient hardware
error. When this occurs, the handler must perform a kernel oops (see the section "Handling a
Faulty Address Inside the Address Space" in Chapter 9).
4. The case introduced in this chapter: a system call service routine attempts to read or write into
a memory area whose address has been passed as a system call parameter, but that address
does not belong to the process address space.
10.4.4. The Exception Tables
exception_table_entry->insn
The linear address of an instruction that accesses the process address space
exception_table_entry->fixup
The address of the assembly language code to be invoked when a Page Fault exception
triggered by the instruction located at insn occurs
search_exception_tables( )
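The lookup from a faulting instruction address to its fixup code can be sketched as a binary search. The assumptions here are that the table is sorted by insn and that a fixup address of 0 can stand for "not found"; struct exception_table_entry_sketch and search_fixup( ) are hypothetical names, and searching several tables (for instance, those of loaded modules) is not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the two fields described in the text. */
struct exception_table_entry_sketch {
    unsigned long insn;    /* address of an instruction that may fault */
    unsigned long fixup;   /* address of the code to run if it does */
};

/* Binary search in a table sorted by insn; returns the fixup address or 0. */
unsigned long search_fixup(const struct exception_table_entry_sketch *tbl,
                           size_t n, unsigned long addr)
{
    size_t lo = 0, hi = n;

    assert(tbl != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tbl[mid].insn == addr)
            return tbl[mid].fixup;
        if (tbl[mid].insn < addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;   /* no entry: the fault was not caused by a listed instruction */
}
```

A miss corresponds to the situation where the faulting instruction is not one of the known user-space accesses, so the fault has to be treated as a kernel bug rather than as a bad system call parameter.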
10.4.5. Generating the Exception Tables and the Fixup Code
10.5. Kernel Wrapper Routines
Chapter 11. Signals
11.1. The Role of Signals
Table 11-1. The first 31 signals in Linux/i386
# Signal name Default action Comment POSIX
1 SIGHUP Terminate Hang up controlling terminal or process Yes
2 SIGINT Terminate Interrupt from keyboard Yes
3 SIGQUIT Dump Quit from keyboard Yes
4 SIGILL Dump Illegal instruction Yes
5 SIGTRAP Dump Breakpoint for debugging No
6 SIGABRT Dump Abnormal termination Yes
6 SIGIOT Dump Equivalent to SIGABRT No
7 SIGBUS Dump Bus error No
8 SIGFPE Dump Floating-point exception Yes
9 SIGKILL Terminate Forced-process termination Yes
10 SIGUSR1 Terminate Available to processes Yes
11 SIGSEGV Dump Invalid memory reference Yes
12 SIGUSR2 Terminate Available to processes Yes
13 SIGPIPE Terminate Write to pipe with no readers Yes
14 SIGALRM Terminate Real-timer clock Yes
15 SIGTERM Terminate Process termination Yes
16 SIGSTKFLT Terminate Coprocessor stack error No
17 SIGCHLD Ignore Child process stopped or terminated, or got signal if traced Yes
18 SIGCONT Continue Resume execution, if stopped Yes
19 SIGSTOP Stop Stop process execution Yes
20 SIGTSTP Stop Stop process issued from tty Yes
21 SIGTTIN Stop Background process requires input Yes
22 SIGTTOU Stop Background process requires output Yes
23 SIGURG Ignore Urgent condition on socket No
24 SIGXCPU Dump CPU time limit exceeded No
25 SIGXFSZ Dump File size limit exceeded No
26 SIGVTALRM Terminate Virtual timer clock No
27 SIGPROF Terminate Profile timer clock No
28 SIGWINCH Ignore Window resizing No
29 SIGIO Terminate I/O now possible No
29 SIGPOLL Terminate Equivalent to SIGIO No
30 SIGPWR Terminate Power supply failure No
31 SIGSYS Dump Bad system call No
31 SIGUNUSED Dump Equivalent to SIGSYS No
Table 11-2. The most significant system calls related to signals
System call Description
kill( ) Send a signal to a thread group
tkill( ) Send a signal to a process
tgkill( ) Send a signal to a process in a specific thread group
sigaction( ) Change the action associated with a signal
signal( ) Similar to sigaction( )
sigpending( ) Check whether there are pending signals
sigprocmask( ) Modify the set of blocked signals
sigsuspend( ) Wait for a signal
rt_sigaction( ) Change the action associated with a real-time signal
rt_sigpending( ) Check whether there are pending real-time signals
rt_sigprocmask( ) Modify the set of blocked real-time signals
rt_sigqueueinfo( ) Send a real-time signal to a thread group
rt_sigsuspend( ) Wait for a real-time signal
rt_sigtimedwait( ) Similar to rt_sigsuspend( )
11.1.1. Actions Performed upon Delivering a Signal
There are three ways in which a process can respond to a signal:
1. Explicitly ignore the signal.
2. Execute the default action associated with the signal (see Table 11-1). This action, which is
predefined by the kernel, depends on the signal type and may be any one of the following:
Terminate
The process is terminated (killed).
Dump
The process is terminated (killed) and a core file containing its execution context is
created, if possible; this file may be used for debugging purposes.
Ignore
The signal is ignored.
Stop
The process is stopped, i.e., put in the TASK_STOPPED state (see the section "Process State" in Chapter 3).
Continue
If the process was stopped (TASK_STOPPED), it is put into the TASK_RUNNING state.
3. Catch the signal by invoking a corresponding signal-handler function.
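The three responses can be tried from User Mode with the standard signal( ) and raise( ) library calls. The following is a minimal sketch, not kernel code; the names on_usr1 and demo_three_dispositions are made up for illustration, and SIGUSR1 is an arbitrary choice.

```c
#include <assert.h>
#include <signal.h>

/* Set by the handler; sig_atomic_t is safe to write from a handler. */
static volatile sig_atomic_t caught = 0;

static void on_usr1(int signo)
{
    caught = signo;
}

/* Returns the signal number seen by the handler (0 if it never ran). */
int demo_three_dispositions(void)
{
    int seen;

    /* 1. Explicitly ignore the signal: raising it has no visible effect. */
    signal(SIGUSR1, SIG_IGN);
    raise(SIGUSR1);
    assert(caught == 0);

    /* 3. Catch the signal: the handler runs before raise( ) returns. */
    signal(SIGUSR1, on_usr1);
    raise(SIGUSR1);
    seen = caught;

    /* 2. Restore the default action (Terminate for SIGUSR1).
     *    The signal is not raised again, or the process would be killed. */
    signal(SIGUSR1, SIG_DFL);
    return seen;
}
```

Calling demo_three_dispositions( ) from main( ) should return SIGUSR1, because only the second raise( ) reaches the handler.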
11.1.2. POSIX Signals and Multithreaded Applications
11.1.3. Data Structures Associated with Signals
The fields of the process descriptor related to signal handling are listed in Table 11-3.
Table 11-3. Process descriptor fields related to signal handling
Type Name Description
struct signal_struct * signal Pointer to the process's signal descriptor
struct sighand_struct * sighand Pointer to the process's signal handler descriptor
sigset_t blocked Mask of blocked signals
sigset_t real_blocked Temporary mask of blocked signals (used by the rt_sigtimedwait( ) system call)
struct sigpending pending Data structure storing the private pending signals
unsigned long sas_ss_sp Address of alternative signal handler stack
size_t sas_ss_size Size of alternative signal handler stack
int (*) (void *) notifier Pointer to a function used by a device driver to block some signals of the process
void * notifier_data Pointer to data that might be used by the notifier function (previous field of table)
sigset_t * notifier_mask Bit mask of signals blocked by a device driver through a notifier function
11.1.3.1. The signal descriptor and the signal handler descriptor
Table 11-4. The fields of the signal descriptor related to signal handling
Type Name Description
atomic_t count Usage counter of the signal descriptor
atomic_t live Number of live processes in the thread group
wait_queue_head_t wait_chldexit Wait queue for the processes sleeping in a wait4( ) system call
struct task_struct * curr_target Descriptor of the last process in the thread group that received a signal
struct sigpending shared_pending Data structure storing the shared pending signals
int group_exit_code Process termination code for the thread group
struct task_struct * group_exit_task Used when killing a whole thread group
int notify_count Used when killing a whole thread group
int group_stop_count Used when stopping a whole thread group
unsigned int flags Flags used when delivering signals that modify the status of the process
Table 11-5. The fields of the signal handler descriptor
Type Name Description
atomic_t count Usage counter of the signal handler descriptor
struct k_sigaction [64] action Array of structures specifying the actions to be performed upon delivering the signals
spinlock_t siglock Spin lock protecting both the signal descriptor and the signal handler descriptor
11.1.3.2. The sigaction data structure
The sigaction structure includes the following fields:
sa_handler
This field specifies the type of action to be performed; its value can be a pointer to the signal
handler, SIG_DFL (that is, the value 0) to specify that the default action is performed, or
SIG_IGN (that is, the value 1) to specify that the signal is ignored.
sa_flags
This set of flags specifies how the signal must be handled; some of them are listed in Table 11-6.
sa_mask
This sigset_t variable specifies the signals to be masked when running the signal handler.
Table 11-6. Flags specifying how to handle a signal
Flag Name Description
SA_NOCLDSTOP Applies only to SIGCHLD; do not send SIGCHLD to the parent when the process is stopped
SA_NOCLDWAIT Applies only to SIGCHLD; do not create a zombie when the process terminates
SA_SIGINFO Provide additional information to the signal handler (see the later section "Changing a Signal Action")
SA_ONSTACK Use an alternative stack for the signal handler (see the later section "Catching the Signal")
SA_RESTART Interrupted system calls are automatically restarted (see the later section "Reexecution of System Calls")
SA_NODEFER/SA_NOMASK Do not mask the signal while executing the signal handler
SA_RESETHAND/SA_ONESHOT Reset to default action after executing the signal handler
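From User Mode, the three fields are filled in before calling sigaction( ). The sketch below is only an illustration under a few assumptions: on_usr1 and install_and_check are hypothetical names, SIGUSR1 and SIGUSR2 are arbitrary, and the handler queries the current mask just to show the effect of sa_mask.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t usr2_blocked_in_handler = 0;

static void on_usr1(int signo)
{
    sigset_t cur;

    (void)signo;
    /* With a NULL new set, sigprocmask( ) only reports the current mask. */
    sigprocmask(SIG_BLOCK, NULL, &cur);
    usr2_blocked_in_handler = sigismember(&cur, SIGUSR2);
}

/* Returns 1 if SIGUSR2 was masked while the SIGUSR1 handler ran. */
int install_and_check(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;            /* handler address, not SIG_DFL/SIG_IGN */
    sigemptyset(&sa.sa_mask);
    sigaddset(&sa.sa_mask, SIGUSR2);    /* extra signal masked during the handler */
    sa.sa_flags = SA_RESTART;           /* restart interrupted system calls */

    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;
    assert(usr2_blocked_in_handler == 0);

    raise(SIGUSR1);
    return usr2_blocked_in_handler;
}
```

Because SA_NODEFER is not set, SIGUSR1 itself is also masked while the handler runs, in addition to the signals listed in sa_mask.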
11.1.3.3. The pending signal queues
struct sigpending {
    struct list_head list;
    sigset_t signal;
};
The signal field is a bit mask specifying the pending signals, while the list field is the head of a
doubly linked list containing sigqueue data structures; the fields of this structure are shown in Table 11-7.
Table 11-7. The fields of the sigqueue data structure
Type Name Description
struct list_head list Links for the pending signal queue's list
spinlock_t * lock Pointer to the siglock field in the signal handler descriptor corresponding to the pending signal
int flags Flags of the sigqueue data structure
siginfo_t info Describes the event that raised the signal
struct user_struct * user Pointer to the per-user data structure of the process's owner (see the
section "The clone( ), fork( ), and vfork( ) System Calls" in Chapter 3)
The siginfo_t structure includes the following fields:
si_signo
The signal number
si_errno
The error code of the instruction that caused the signal to be raised, or 0 if there was no error
si_code
A code identifying who raised the signal (see Table 11-8)
_sifields
A union storing information depending on the type of signal. For instance, the siginfo_t data
structure relative to an occurrence of the SIGKILL signal records the PID and the UID of the
sender process here; conversely, the data structure relative to an occurrence of the SIGSEGV
signal stores the memory address whose access caused the signal to be raised.
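A User Mode handler installed with the SA_SIGINFO flag receives a pointer to a siginfo_t and can read these fields. The following sketch assumes a single-threaded program and uses SIGUSR1 sent with kill( ) to the process itself (SIGKILL cannot be caught, so it cannot be used here); the function and variable names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static volatile sig_atomic_t got_signo, got_code, got_pid;

/* Three-argument handler form selected by SA_SIGINFO. */
static void on_usr1(int signo, siginfo_t *info, void *ucontext)
{
    (void)signo;
    (void)ucontext;
    got_signo = info->si_signo;              /* the signal number */
    got_code  = info->si_code;               /* who raised the signal */
    got_pid   = (sig_atomic_t)info->si_pid;  /* sender PID, from the union */
}

/* Returns 1 if the handler saw a user-sent SIGUSR1 coming from this process. */
int siginfo_demo(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_usr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    /* kill( ) on ourselves: si_code is expected to be SI_USER. */
    if (kill(getpid(), SIGUSR1) != 0)
        return -1;

    assert(got_signo != 0);
    return got_signo == SIGUSR1 &&
           got_code == SI_USER &&
           got_pid == (sig_atomic_t)getpid();
}
```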
11.1.4. Operations on Signal Data Structures
In the following descriptions, set is a pointer to a sigset_t variable, nsig is the number of a signal, and mask is an unsigned long bit mask.
sigemptyset(set) and sigfillset(set)
Sets the bits in the sigset_t variable to 0 or 1, respectively
sigaddset(set,nsig) and sigdelset(set,nsig)
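Sets the bit of the sigset_t variable corresponding to signal nsig to 1 or 0, respectively. In
practice, on the 80x86 architecture sigaddset( ) reduces to a single bitwise OR that sets
bit (nsig-1) % 32 in the array element at index (nsig-1) / 32 of the set.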
sigaddsetmask(set,mask) and sigdelsetmask(set,mask)
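Sets to 1 or 0, respectively, every bit of the sigset_t variable whose corresponding bit in mask
is on. They can be used only with signals that are between 1 and 32. The corresponding
functions reduce to a bitwise OR with mask, and to a bitwise AND with the complement of mask,
applied to the first element of the set.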
sigismember(set,nsig)
Returns the value of the bit of the sigset_t variable corresponding to the signal nsig
sigmask(nsig)
Yields the bit mask of the signal nsig, that is, an unsigned long in which only the bit
corresponding to nsig is set. In other words, if the kernel needs to set, clear, or test
a bit in an element of sigset_t that corresponds to a particular signal, it can derive the proper
bit through this macro.
sigandsets(d,s1,s2), sigorsets(d,s1,s2), and signandsets(d,s1,s2)
Performs a logical AND, a logical OR, and a logical AND-NOT (s1 AND NOT s2, which the kernel
calls "nand"), respectively, between the sigset_t variables to which s1 and s2 point; the result
is stored in the sigset_t variable to which d points.
sigtestsetmask(set,mask)
Returns the value 1 if any of the bits in the sigset_t variable that correspond to the bits set to
1 in mask is set; it returns 0 otherwise. It can be used only with signals that have a number
between 1 and 32.
siginitset(set,mask)
Initializes the low bits of the sigset_t variable corresponding to signals between 1 and 32
with the bits contained in mask, and clears the bits corresponding to signals between 33 and 63.
siginitsetinv(set,mask)
Initializes the low bits of the sigset_t variable corresponding to signals between 1 and 32
with the complement of the bits contained in mask, and sets the bits corresponding to signals
between 33 and 63.
signal_pending(p)
Returns the value 1 (true) if the process identified by the *p process descriptor has nonblocked
pending signals, and returns the value 0 (false) if it doesn't. The function is implemented as a
simple check on the TIF_SIGPENDING flag of the process.
recalc_sigpending_tsk(t) and recalc_sigpending( )
The first function checks whether there are pending signals either for the process identified by
the process descriptor at *t (by looking at the t->pending.signal field) or for the thread
group to which the process belongs (by looking at the t->signal->shared_pending.signal
field). The function then sets the TIF_SIGPENDING flag in t->thread_info->flags accordingly.
The recalc_sigpending( ) function is equivalent to recalc_sigpending_tsk(current).
rm_from_queue(mask,q)
Removes from the pending signal queue q the pending signals corresponding to the bit mask mask.
flush_sigqueue(q)
Removes from the pending signal queue q all pending signals.
flush_signals(t)
Deletes all signals sent to the process identified by the process descriptor at *t. This is done
by clearing the TIF_SIGPENDING flag in t->thread_info->flags and invoking
flush_sigqueue( ) twice, once on the t->pending queue and once on the t->signal->shared_pending queue.
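The bit manipulations described above can be modeled in User Mode. The sketch below is not the kernel source: it assumes a 64-signal set stored as two 32-bit words (as on the 32-bit 80x86), and every model_ name is made up to avoid clashing with the C library's own sigset functions.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NSIG       64   /* assumed number of signals */
#define MODEL_NSIG_BPW   32   /* assumed bits per word */
#define MODEL_NSIG_WORDS (MODEL_NSIG / MODEL_NSIG_BPW)

typedef struct {
    uint32_t sig[MODEL_NSIG_WORDS];
} model_sigset_t;

void model_sigemptyset(model_sigset_t *set)
{
    for (int i = 0; i < MODEL_NSIG_WORDS; i++)
        set->sig[i] = 0;
}

/* Signal nsig (1-based) lives in bit (nsig-1) % 32 of word (nsig-1) / 32. */
void model_sigaddset(model_sigset_t *set, int nsig)
{
    assert(nsig >= 1 && nsig <= MODEL_NSIG);
    set->sig[(nsig - 1) / MODEL_NSIG_BPW] |=
        UINT32_C(1) << ((nsig - 1) % MODEL_NSIG_BPW);
}

void model_sigdelset(model_sigset_t *set, int nsig)
{
    assert(nsig >= 1 && nsig <= MODEL_NSIG);
    set->sig[(nsig - 1) / MODEL_NSIG_BPW] &=
        ~(UINT32_C(1) << ((nsig - 1) % MODEL_NSIG_BPW));
}

int model_sigismember(const model_sigset_t *set, int nsig)
{
    assert(nsig >= 1 && nsig <= MODEL_NSIG);
    return (set->sig[(nsig - 1) / MODEL_NSIG_BPW] >>
            ((nsig - 1) % MODEL_NSIG_BPW)) & 1;
}

/* Mask with only the bit of nsig set; valid for signals 1..32 only. */
uint32_t model_sigmask(int nsig)
{
    assert(nsig >= 1 && nsig <= MODEL_NSIG_BPW);
    return UINT32_C(1) << (nsig - 1);
}

/* The *setmask helpers touch only the first word (signals 1..32). */
void model_sigaddsetmask(model_sigset_t *set, uint32_t mask)
{
    set->sig[0] |= mask;
}

void model_sigdelsetmask(model_sigset_t *set, uint32_t mask)
{
    set->sig[0] &= ~mask;
}

int model_sigtestsetmask(const model_sigset_t *set, uint32_t mask)
{
    return (set->sig[0] & mask) != 0;
}

/* d = s1 AND NOT s2, word by word. */
void model_signandsets(model_sigset_t *d, const model_sigset_t *s1,
                       const model_sigset_t *s2)
{
    for (int i = 0; i < MODEL_NSIG_WORDS; i++)
        d->sig[i] = s1->sig[i] & ~s2->sig[i];
}
```

For instance, adding signal 33 to an empty set turns on bit 0 of the second word, while model_sigmask(9) yields 0x100 and can only address the first word.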
11.2. Generating a Signal
Table 11-9. Kernel functions that generate a signal for a process
Name Description
send_sig( ) Sends a signal to a single process
send_sig_info( ) Like send_sig( ), with extended information in a siginfo_t structure
force_sig( ) Sends a signal that cannot be explicitly ignored or blocked by the process
force_sig_info( ) Like force_sig( ), with extended information in a siginfo_t structure
force_sig_specific() Like force_sig( ), but optimized for SIGSTOP and SIGKILL signals
sys_tkill( ) System call handler of tkill( ) (see the later section "System Calls Related to Signal Handling")
sys_tgkill( ) System call handler of tgkill( )
Table 11-10. Kernel functions that generate a signal for a thread group
Name Description
send_group_sig_info() Sends a signal to a single thread group identified by the process descriptor of one of its members
kill_pg( ) Sends a signal to all thread groups in a process group (see the section "Process Management" in Chapter 1)
kill_pg_info( ) Like kill_pg( ), with extended information in a siginfo_t structure
kill_proc( ) Sends a signal to a single thread group identified by the PID of one of its members
kill_proc_info( ) Like kill_proc( ), with extended information in a siginfo_t structure
sys_kill( ) System call handler of kill( ) (see the later section "System Calls Related to Signal Handling")
sys_rt_sigqueueinfo( ) System call handler of rt_sigqueueinfo( )
11.2.1. The specific_send_sig_info( ) Function
11.2.2. The send_signal( ) Function
11.2.3. The group_send_sig_info( ) Function
11.3. Delivering a Signal
To handle the nonblocked pending signals, the kernel invokes the do_signal( ) function.
Then do_signal( ) loads the ka local variable with the address of the k_sigaction data structure of
the signal to be handled:
ka = &current->sig->action[signr-1];
Depending on the contents, three kinds of actions may be performed: ignoring the signal, executing
a default action, or executing a signal handler.
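The same three-way decision can be reproduced from User Mode by querying the current action of a signal with sigaction( ) and a NULL new action. This is only a sketch of the classification, not the code of do_signal( ); the enum, noop_handler, and classify_disposition are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

enum disposition {
    DISP_ERROR = -1,
    DISP_IGNORE,    /* sa_handler == SIG_IGN */
    DISP_DEFAULT,   /* sa_handler == SIG_DFL */
    DISP_HANDLER    /* anything else: address of a handler */
};

/* Handler that does nothing, used only to have a handler address to install. */
void noop_handler(int signo)
{
    (void)signo;
}

/* Reads the action currently associated with signr and classifies it. */
enum disposition classify_disposition(int signr)
{
    struct sigaction ka;

    assert(signr >= 1);
    if (sigaction(signr, NULL, &ka) != 0)   /* NULL act: query only */
        return DISP_ERROR;

    if (ka.sa_flags & SA_SIGINFO)           /* three-argument handler installed */
        return DISP_HANDLER;
    if (ka.sa_handler == SIG_IGN)
        return DISP_IGNORE;
    if (ka.sa_handler == SIG_DFL)
        return DISP_DEFAULT;
    return DISP_HANDLER;
}
```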
11.3.1. Executing the Default Action for the Signal
1: SIGSTOP
2: dump // this part deserves further study
The signals whose default action is "dump" may create a core file in the process working directory;
this file lists the complete contents of the process's address space and CPU registers.
11.3.2. Catching the Signal
handle_signal( ):
Figure 11-2. Catching a signal
11.3.2.1. Setting up the frame
11.3.2.2. Evaluating the signal flags
11.3.2.3. Starting the signal handler
11.3.2.4. Terminating the signal handler
11.3.3. Reexecution of System Calls
11.3.3.1. Restarting a system call interrupted by a non-caught signal
11.3.3.2. Restarting a system call for a caught signal
11.4. System Calls Related to Signal Handling
11.4.1. The kill( ) System Call
kill(pid,sig)
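A small User Mode sketch of kill( ) follows, assuming a POSIX system: the parent forks a child that just sleeps, probes it with signal 0, and then sends SIGKILL (chosen because it cannot be ignored or caught). The function name kill_child_demo is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the number of the signal that terminated the child, 0 if the
 * child was not killed by a signal, or -1 on error. */
int kill_child_demo(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        for (;;)
            pause();            /* child: sleep until a signal arrives */
    }

    assert(pid > 0);

    /* Signal 0 sends nothing; it only checks that the target exists
     * and that we are allowed to signal it. */
    if (kill(pid, 0) != 0)
        return -1;

    if (kill(pid, SIGKILL) != 0)
        return -1;

    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```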
11.4.2. The tkill( ) and tgkill( ) System Calls
tkill( ) and tgkill( )
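On Linux, tgkill( ) can be reached through the generic syscall( ) wrapper. The sketch below is Linux-specific and assumes the SYS_gettid and SYS_tgkill constants from <sys/syscall.h>; the thread sends SIGUSR1 to itself, and the names on_usr1 and tgkill_self are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static volatile sig_atomic_t hit = 0;

static void on_usr1(int signo)
{
    hit = signo;
}

/* Returns the signal number seen by the handler, or -1 on error. */
int tgkill_self(void)
{
    struct sigaction sa;
    pid_t tgid, tid;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    tgid = getpid();                      /* thread group ID */
    tid  = (pid_t)syscall(SYS_gettid);    /* ID of the calling thread */
    assert(tid > 0);

    /* Target one specific thread inside one specific thread group. */
    if (syscall(SYS_tgkill, tgid, tid, SIGUSR1) != 0)
        return -1;

    return hit;
}
```

Passing the thread group ID along with the thread ID lets the kernel reject the request if the thread ID has been reused by a thread of another group.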
11.4.3. Changing a Signal Action
sigaction(sig,act,oact)
11.4.4. Examining the Pending Blocked Signals
sigpending( )
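A signal shows up in the set returned by sigpending( ) only while it is raised and blocked. The following sketch, with the hypothetical name pending_demo, blocks SIGUSR1, raises it, inspects the pending set, and then consumes the signal with sigwait( ) so that it is never delivered to a handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Returns 1 if SIGUSR1 was reported as pending and then consumed. */
int pending_demo(void)
{
    sigset_t block, old, pend;
    int was_pending;
    int consumed = 0;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;

    raise(SIGUSR1);                 /* blocked, so it stays pending */

    if (sigpending(&pend) != 0)
        return -1;
    was_pending = sigismember(&pend, SIGUSR1);

    /* Dequeue the pending signal synchronously, without any handler. */
    if (sigwait(&block, &consumed) != 0)
        return -1;
    assert(consumed == SIGUSR1);

    /* Put the original mask back. */
    sigprocmask(SIG_SETMASK, &old, NULL);
    return was_pending == 1 && consumed == SIGUSR1;
}
```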
11.4.5. Modifying the Set of Blocked Signals
sigprocmask( )
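Blocking a signal defers its delivery rather than discarding it. The sketch below, under the assumption of a single-threaded program and with hypothetical names, installs a handler for SIGUSR1, raises the signal while it is blocked, and observes that the handler runs only once the signal is unblocked.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t handled = 0;

static void on_usr1(int signo)
{
    (void)signo;
    handled = 1;
}

/* Returns 1 if the handler ran only after SIGUSR1 was unblocked. */
int deferral_demo(void)
{
    struct sigaction sa;
    sigset_t block;
    int during, after;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, NULL) != 0)   /* add to blocked set */
        return -1;

    raise(SIGUSR1);
    during = handled;               /* still 0: delivery is deferred */
    assert(during == 0);

    /* Removing the signal from the blocked set lets the pending
     * instance be delivered before sigprocmask( ) returns. */
    if (sigprocmask(SIG_UNBLOCK, &block, NULL) != 0)
        return -1;
    after = handled;

    return during == 0 && after == 1;
}
```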
11.4.6. Suspending the Process
sigsuspend( )
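sigsuspend( ) installs a temporary mask and sleeps in one atomic step, which avoids losing a signal that arrives between unblocking and waiting. The following is a minimal sketch with hypothetical names; to keep it deterministic, the signal is made pending before the call, so sigsuspend( ) returns as soon as the handler has run.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t woke = 0;

static void on_usr1(int signo)
{
    (void)signo;
    woke = 1;
}

/* Returns 1 if sigsuspend( ) was interrupted by the SIGUSR1 handler. */
int suspend_demo(void)
{
    struct sigaction sa;
    sigset_t block, old, wait_mask;
    int rc, err;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    /* Block SIGUSR1 so that it cannot be delivered too early. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;

    raise(SIGUSR1);                 /* now pending, not yet delivered */
    assert(woke == 0);

    /* Wait with the original mask minus SIGUSR1. */
    wait_mask = old;
    sigdelset(&wait_mask, SIGUSR1);
    rc = sigsuspend(&wait_mask);    /* always returns -1 once a handler ran */
    err = errno;

    sigprocmask(SIG_SETMASK, &old, NULL);
    return rc == -1 && err == EINTR && woke == 1;
}
```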