The variable "stack" is allocated in arch/x86/setup.c as a global array as follows:

/*
 * Just allocate the kernel stack here. SS:ESP is set up to point here
 * in head.S.
 */
char stack[8192]; /* allocated in kernel bss */

The ESI register is then pushed onto the stack and start_kernel() is called. This is the main function that sets the ball rolling. Since start_kernel() accepts a start_info_t structure as a parameter, the ESI register must be pointing to a start_info_t structure made available to the kernel by the domain creator (xm). The definition of this structure is in $XEN_SRC/xen/include/public/xen.h:
/*
* Start-of-day memory layout:
* 1. The domain is started within a contiguous virtual-memory region.
* 2. The contiguous region ends on an aligned 4MB boundary (in Mini-OS it ends at 4MB).
* 3. This is the order of bootstrap elements in the initial virtual region:
* a. relocated kernel image
* b. initial ram disk [mod_start, mod_len]
* c. list of allocated page frames [mfn_list, nr_pages]
* d. start_info_t structure [register ESI (x86)]
* e. bootstrap page tables [pt_base, CR3 (x86)]
* f. bootstrap stack [register ESP (x86)]
* 4. Bootstrap elements are packed together, but each is 4kB-aligned.
* 5. The initial ram disk may be omitted.
* 6. The list of page frames forms a contiguous 'pseudo-physical' memory
* layout for the domain. In particular, the bootstrap virtual-memory
* region is a 1:1 mapping to the first section of the pseudo-physical map.
* 7. All bootstrap elements are mapped read-writable for the guest OS. The
* only exception is the bootstrap page table, which is mapped read-only.
* 8. There is guaranteed to be at least 512kB padding after the final
* bootstrap element. If necessary, the bootstrap virtual region is
* extended by an extra 4MB to ensure this.
*/
There are 3 different types of addresses at play here: machine addresses (real hardware frames), pseudo-physical addresses (the contiguous "physical" memory the domain sees, cf. item 6 above), and virtual addresses.
The information in this structure is filled in by the domain loader (xm) when creating a new domain. The shared_info structure (whose machine address appears in the shared_info field below) is shared between the guest kernel and the hypervisor, and provides one means of communication between the two:

struct start_info {
/* THE FOLLOWING ARE FILLED IN BOTH ON INITIAL BOOT AND ON RESUME. */
char magic[32]; /* "xen-<version>-<platform>". */
unsigned long nr_pages; /* Total pages allocated to this domain. */
unsigned long shared_info; /* MACHINE address of shared info struct. */
uint32_t flags; /* SIF_xxx flags. */
xen_pfn_t store_mfn; /* MACHINE page number of shared page. */
uint32_t store_evtchn; /* Event channel for store communication. */
union {
struct {
xen_pfn_t mfn; /* MACHINE page number of console page. */
uint32_t evtchn; /* Event channel for console page. */
} domU;
struct {
uint32_t info_off; /* Offset of console_info struct. */
uint32_t info_size; /* Size of console_info struct from start.*/
} dom0;
} console;
/* THE FOLLOWING ARE ONLY FILLED IN ON INITIAL BOOT (NOT RESUME). */
unsigned long pt_base; /* VIRTUAL address of page directory. */
unsigned long nr_pt_frames; /* Number of bootstrap p.t. frames. */
unsigned long mfn_list; /* VIRTUAL address of page-frame list. */
unsigned long mod_start; /* VIRTUAL address of pre-loaded module. */
unsigned long mod_len; /* Size (bytes) of pre-loaded module. */
int8_t cmd_line[MAX_GUEST_CMDLINE];
};
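As a concrete illustration of how a guest consumes this structure, Mini-OS's start_kernel() begins by checking the magic string before trusting the rest of the fields. The following standalone sketch mocks up the idea (struct start_info_mock and started_by_xen are illustrative names, not Xen code):

```c
#include <string.h>

/* Stand-in for the real start_info_t: only the fields used below. */
struct start_info_mock {
    char magic[32];            /* "xen-<version>-<platform>" */
    unsigned long nr_pages;    /* total pages given to this domain */
    unsigned long shared_info; /* MACHINE address of shared_info */
};

/* A guest can sanity-check that it was really started by a Xen
 * domain builder by looking at the magic prefix. */
int started_by_xen(const struct start_info_mock *si)
{
    return strncmp(si->magic, "xen-", 4) == 0;
}
```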
/*
* Xen/kernel shared data -- pointer provided in start_info.
*
* This structure is defined to be both smaller than a page, and the
* only data on the shared page, but may vary in actual size even within
* compatible Xen versions; guests should not rely on the size
* of this structure remaining constant.
*/
struct shared_info {
struct vcpu_info vcpu_info[MAX_VIRT_CPUS];
unsigned long evtchn_pending[sizeof(unsigned long) * 8];
unsigned long evtchn_mask[sizeof(unsigned long) * 8];
/*
* Wallclock time: updated only by control software. Guests should base
* their gettimeofday() syscall on this wallclock-base value.
*/
uint32_t wc_version; /* Version counter: see vcpu_time_info_t. */
uint32_t wc_sec; /* Secs 00:00:00 UTC, Jan 1, 1970. */
uint32_t wc_nsec; /* Nsecs 00:00:00 UTC, Jan 1, 1970. */
struct arch_shared_info arch;
};
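The wc_version counter follows the seqlock convention described for vcpu_time_info_t: it is odd while the control software is mid-update, so a guest rereads until it sees the same even version before and after reading the time fields. A standalone sketch of that read loop (mock struct and illustrative names; memory barriers omitted for brevity):

```c
#include <stdint.h>

/* Stand-in for the wallclock fields of shared_info. */
struct wc_mock {
    volatile uint32_t wc_version;
    volatile uint32_t wc_sec;
    volatile uint32_t wc_nsec;
};

/* Read a consistent wallclock snapshot: retry while an update is in
 * progress (odd version) or the version changed under us. */
void read_wallclock(const struct wc_mock *s, uint32_t *sec, uint32_t *nsec)
{
    uint32_t ver;
    do {
        ver   = s->wc_version;
        *sec  = s->wc_sec;
        *nsec = s->wc_nsec;
    } while ((ver & 1) || ver != s->wc_version);
}
```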
phys_to_machine_mapping = (unsigned long *)start_info.mfn_list;
/* mfn_list is part of the start_info structure:
 *   unsigned long mfn_list;    // VIRTUAL address of page-frame list.
 */
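In other words, the table pointed to by mfn_list maps each pseudo-physical frame number to the machine frame backing it. A tiny standalone mock of the lookup (illustrative names and sample values; the real table is filled in by the domain builder):

```c
/* Stand-in for the pseudo-physical -> machine translation table that
 * start_info.mfn_list points at: entry i holds the machine frame
 * number backing pseudo-physical frame i. */
static unsigned long phys_to_machine_mapping_mock[] = {
    0x1a3, 0x07f, 0x2c9, 0x011
};

unsigned long pfn_to_mfn_mock(unsigned long pfn)
{
    return phys_to_machine_mapping_mock[pfn];
}
```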
Only the machine frame number of the shared_info structure is provided by the domain loader. The DomU kernel must map this page into its own virtual address space before it can access its contents:

HYPERVISOR_update_va_mapping(
(unsigned long)shared_info, __pte(pa | 7), UVMF_INVLPG);
Regardless of which page table update mode is being used, however, there are some occasions (notably handling a demand page fault) where a guest OS will wish to modify exactly one PTE rather than a batch, and where that PTE is mapped into the current address space. This is catered for by the following:
update_va_mapping(unsigned long va, uint64_t val, unsigned long flags)
Update the currently installed PTE that maps virtual address va to new value val. As with mmu_update, Xen checks the modification is safe before applying it. The flags determine which kind of TLB flush, if any, should follow the update.
At start of day, a guest operating system needs to set up the virtual CPU it is executing on. This includes installing vectors for the virtual IDT so that the guest OS can handle interrupts, page faults, etc. However, the very first thing a guest OS must set up is a pair of hypervisor callbacks: these are the entry points which Xen will use when it wishes to notify the guest OS of an occurrence.

HYPERVISOR_set_callbacks(
__KERNEL_CS, (unsigned long)hypervisor_callback,
__KERNEL_CS, (unsigned long)failsafe_callback);
set_callbacks(unsigned long event_selector, unsigned long event_address, unsigned long failsafe_selector, unsigned long failsafe_address)
Register the normal (event) and failsafe callbacks for event processing. In each case the code segment selector and address within that segment are provided. The selectors must have RPL 1; in XenLinux we simply use the kernel's CS for both event_selector and failsafe_selector. The value event_address specifies the address of the guest OS's event handling and dispatch routine; the failsafe_address specifies a separate entry point which is used only if a fault occurs when Xen attempts to use the normal callback.
The virtual IDT is installed by trap_init():

void trap_init(void)
{
    HYPERVISOR_set_trap_table(trap_table);
}

The trap table is defined as follows. The handler entry points (the code for these handlers) are mostly defined in arch/x86/x86_32.S.

/*
* Submit a virtual IDT to the hypervisor. This consists of tuples
* (interrupt vector, privilege ring, CS:EIP of handler).
* The 'privilege ring' field specifies the least-privileged ring that
* can trap to that vector using a software-interrupt instruction (INT).
*/
static trap_info_t trap_table[] = {
{ 0, 0, __KERNEL_CS, (unsigned long)divide_error },
{ 1, 0, __KERNEL_CS, (unsigned long)debug },
{ 3, 3, __KERNEL_CS, (unsigned long)int3 },
{ 4, 3, __KERNEL_CS, (unsigned long)overflow },
{ 5, 3, __KERNEL_CS, (unsigned long)bounds },
{ 6, 0, __KERNEL_CS, (unsigned long)invalid_op },
{ 7, 0, __KERNEL_CS, (unsigned long)device_not_available },
{ 9, 0, __KERNEL_CS, (unsigned long)coprocessor_segment_overrun },
{ 10, 0, __KERNEL_CS, (unsigned long)invalid_TSS },
{ 11, 0, __KERNEL_CS, (unsigned long)segment_not_present },
{ 12, 0, __KERNEL_CS, (unsigned long)stack_segment },
{ 13, 0, __KERNEL_CS, (unsigned long)general_protection },
{ 14, 0, __KERNEL_CS, (unsigned long)page_fault },
{ 15, 0, __KERNEL_CS, (unsigned long)spurious_interrupt_bug },
{ 16, 0, __KERNEL_CS, (unsigned long)coprocessor_error },
{ 17, 0, __KERNEL_CS, (unsigned long)alignment_check },
{ 19, 0, __KERNEL_CS, (unsigned long)simd_coprocessor_error },
{ 0, 0, 0, 0 }
};
init_mm() first calls arch_init_mm(&start_pfn, &max_pfn); both arguments are output parameters filled in by arch_init_mm(). This function does a number of important things.
start_pfn = PFN_UP(to_phys(start_info.pt_base)) +
            start_info.nr_pt_frames + 3;

The PFN_UP macro gives a page frame number after rounding the given address up to the next page-frame boundary; to_phys() converts a virtual address to a physical address by subtracting the address of _text:

#define VIRT_START ((unsigned long)&_text)
#define to_phys(x) ((unsigned long)(x)-VIRT_START)
#define to_virt(x) ((void *)((unsigned long)(x)+VIRT_START))

In the case of Mini-OS, _text starts at 0x0 (see the linker script arch/x86/minios-x86_32.lds), so virtual addresses are the same as (pseudo-)physical addresses; max_pfn is the same as the total number of pages.
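A standalone sketch of the rounding arithmetic behind PFN_UP (mock names; the same shift-and-round logic as the Mini-OS macros, assuming 4kB pages):

```c
#define PAGE_SHIFT_MOCK 12
#define PAGE_SIZE_MOCK  (1UL << PAGE_SHIFT_MOCK)

/* PFN_UP rounds an address up to the next page-frame boundary and
 * returns the frame number; PFN_DOWN truncates to the containing
 * frame. */
#define PFN_UP_MOCK(x)   (((x) + PAGE_SIZE_MOCK - 1) >> PAGE_SHIFT_MOCK)
#define PFN_DOWN_MOCK(x) ((x) >> PAGE_SHIFT_MOCK)
```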
This function determines the range of (pseudo-)physical memory to be mapped, and then creates PDE and PTE entries to map the remaining kernel virtual addresses to these locations. The PDE and PTE entries need to specify real machine page frames, so the phys_to_machine_mapping array is used to convert physical frame numbers to machine frame numbers. Page tables are added/updated by calling HYPERVISOR_mmu_update().
Page table levels: L1 is a page table (it contains physical addresses of page frames); L2, L3, L4, etc. are higher-level page directories. For a 32-bit machine with PAE disabled, only L1 and L2 are used, and a single L2 frame is enough to map the whole physical and virtual address space.
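For the non-PAE 32-bit case, the split of a virtual address into an L2 (page-directory) index and an L1 (page-table) index can be sketched standalone as follows (mock names; the 10+10+12 bit split):

```c
/* For non-PAE x86_32, a 32-bit virtual address splits into a 10-bit
 * page-directory index (L2), a 10-bit page-table index (L1) and a
 * 12-bit offset within the page. */
#define L1_PAGETABLE_SHIFT_MOCK 12
#define L2_PAGETABLE_SHIFT_MOCK 22

unsigned long l1_table_offset_mock(unsigned long va)
{
    return (va >> L1_PAGETABLE_SHIFT_MOCK) & 0x3ff;
}

unsigned long l2_table_offset_mock(unsigned long va)
{
    return va >> L2_PAGETABLE_SHIFT_MOCK;
}
```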
To find where to start mapping memory, the following calculation is used:
pfn_to_map = (start_info.nr_pt_frames - NOT_L1_FRAMES) * L1_PAGETABLE_ENTRIES;

where NOT_L1_FRAMES is the number of bootstrap page-table frames that are not L1 tables (for non-PAE x86_32 this is 1: the single L2 directory). Everything covered by the bootstrap L1 tables is already mapped, so mapping starts at pfn_to_map.
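Plugging in hypothetical sample values (3 bootstrap page-table frames, of which 1 is the L2 directory; 1024 entries per L1 table), the formula can be checked standalone:

```c
/* Illustrative re-statement of the pfn_to_map formula: every bootstrap
 * L1 frame already maps l1_entries pages, so the first unmapped pfn is
 * the count of L1 frames times the entries per frame. */
unsigned long first_unmapped_pfn(unsigned long nr_pt_frames,
                                 unsigned long not_l1_frames,
                                 unsigned long l1_entries)
{
    return (nr_pt_frames - not_l1_frames) * l1_entries;
}
```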
Our start virtual address is pfn_to_virt(pfn_to_map) and we have to map until pfn_to_virt(max_pfn). A loop maps these virtual addresses to physical (machine) frames. Here is the pseudo code:
address = start_address;
pt_pfn = start_pfn; /* page frame number of the first free frame */
while (there are more addresses to be mapped) {
page_directory = start_info.pt_base; /* virtual address of page directory page */
page_directory_machine_frame = pfn_to_mfn(virt_to_pfn(start_info.pt_base));
offset = l2_table_offset(address); /* address of the slot of the page directory where this address belongs */
if ( (address & L1_MASK) == 0x00000000 ) {
/* L1_MASK evaluates to 0x003fffff 0000 0000 0011 1111 1111 1111 1111 1111
* i.e top 10 bits (PD index) are masked off. If this evaluates to zero, we are at a
* 4M boundary and a new L1 table is required (each L1 table maps 1024 frames of 4KB each = 4M bytes of memory)
*/
/* This will create a new L1 frame for this base address.
* notice the parameters:
* pt_pfn -> the page frame number that will contain the new PTE
* page_directory_machine_frame -> this is needed to setup a machine pointer to the new page frame
* offset -> index of the appropriate slot in the L2 directory. Needed to setup ptr to new page frame
* L1_FRAME -> we need an L1 frame
*/
new_pt_frame (&pt_pfn, page_directory_machine_frame, offset, L1_FRAME);
}
page = page_directory[offset]; /* pgdir entry: machine pointer to the L1 table */
mfn = pte_to_mfn(page); /* convert the pgdir entry into an mfn */
offset = l1_table_offset(address); /* get the offset into the L1 table for this address */
/* then HYPERVISOR_mmu_update with ptr = machine address of the PTE, val = pfn_to_mfn(pfn_to_map) */
address += PAGE_SIZE; pfn_to_map++; /* advance to the next page */
}
The function new_pt_frame() initializes a new frame to be a page table. It takes 4 actions (for the simple case where an L1 table is made): it zeroes out the new page, maps it (read-only) into the existing page tables, informs the hypervisor that this new page is going to contain a PAGE TABLE (this is known as pinning the page table), and finally links the new page table into the page directory.

void new_pt_frame(unsigned long *pt_pfn, unsigned long prev_l_mfn,
unsigned long offset, unsigned long level)
{
/* step1: zero out the pte page */
memset((unsigned long*)pfn_to_virt(*pt_pfn), 0, PAGE_SIZE);
/* step2: Map the new page into existing page tables
* below, we are trying to map the virtual address of the page that is to contain the new page table
* point to the exact machine address of the PTE entry
* which is: page_dir[page_dir_offset(pt_page)] + sizeof(pgentry) * pte_offset(pt_page)
*/
/* tab is the page directory (start_info.pt_base) and pt_page is
 * pfn_to_virt(*pt_pfn), the virtual address of the new frame */
mmu_updates[0].ptr = ((pgentry_t)tab[l2_table_offset(pt_page)] & PAGE_MASK) +
sizeof(pgentry_t) * l1_table_offset(pt_page);
mmu_updates[0].val = (pgentry_t)pfn_to_mfn(*pt_pfn) << PAGE_SHIFT |
(prot_e & ~_PAGE_RW);
HYPERVISOR_mmu_update(mmu_updates, 1, NULL, DOMID_SELF);
/*step 3: Issue a PIN request - tell the hypervisor that this page is going to contain a PAGE TABLE */
pin_request.cmd = pincmd; /* pincmd = MMUEXT_PIN_L1_TABLE */
pin_request.arg1.mfn = pfn_to_mfn(*pt_pfn);
HYPERVISOR_mmuext_op(&pin_request, 1, NULL, DOMID_SELF);
/* step 4: Link the newly created page table to its parent PAGE DIRECTORY */
mmu_updates[0].ptr = ((pgentry_t)prev_l_mfn << PAGE_SHIFT) + sizeof(pgentry_t) * offset;
mmu_updates[0].val = (pgentry_t)pfn_to_mfn(*pt_pfn) << PAGE_SHIFT | prot_t;
HYPERVISOR_mmu_update(mmu_updates, 1, NULL, DOMID_SELF);
}
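The 'ptr' fields built in steps 2 and 4 above are machine addresses of page-table slots: the table's machine frame shifted into an address, plus the byte offset of the entry. A standalone sketch of that arithmetic (illustrative names):

```c
#include <stdint.h>

#define PAGE_SHIFT_MOCK 12

/* The 'ptr' of an mmu_update request is the MACHINE address of the
 * page-table entry to modify: machine frame of the containing table,
 * shifted into an address, plus entry_size * entry_offset bytes. */
uint64_t pte_machine_addr(uint64_t table_mfn, unsigned long entry_offset,
                          unsigned long entry_size)
{
    return (table_mfn << PAGE_SHIFT_MOCK) + entry_size * entry_offset;
}
```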
arch_init_mm() returns after building any additional PDE/PTE frames needed to map the entire memory allocated to Mini-OS. The variable start_pfn is updated to account for any newly created PD/PT page frames.
Thanks to the Google code prettify project for their syntax highlighting script.
N.B.: Changes subject to http://www.cs.uic.edu/~spopuri/minios.html