How domain 0 works on xen hypervisor

A tour of the Mini-OS kernel

Satya Popuri
Graduate Student
University of Illinois at Chicago
Chicago, IL 60607
spopur2 [at] uic [dot] edu

Introduction

Mini-OS is a small OS kernel distributed with the Xen hypervisor sources. I have documented some of the basic parts of this kernel for the reference of people trying to port their OSes to Xen (and for people writing new OSes for Xen). This work is not complete yet. The present document covers initialization and page table setup. Watch this space for more on event channels, grant tables, XenBus, etc.

Mini-OS initialization

Mini-OS boots at the symbol _start in arch/x86/x86_32.S (see the arch/x86/minios-x86_32.lds linker script). It begins by loading the SS (stack segment) and ESP (stack pointer) registers from the values stored at stack_start. KERNEL_SS is a default segment descriptor provided in the GDT by Xen. Since the stack grows downwards on x86 processors, ESP initially points to the address stack+8192. Read the documentation on the LSS instruction to see how this works out.

The variable stack is allocated in arch/x86/setup.c as a global array as follows:


/*
* Just allocate the kernel stack here. SS:ESP is set up to point here
* in head.S.
*/

char stack [8192]; /* allocated in kernel bss */
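As a rough illustration of why ESP starts at stack+8192: an x86 push decrements the stack pointer before storing, so the initial pointer sits just past the end of the array and the first push lands inside it. The sketch below models this in ordinary user-space C. The names initial_sp and push32 are hypothetical helpers invented for this illustration; they are not Mini-OS functions.

```c
#include <stdint.h>
#include <string.h>

#define STACK_SIZE 8192

char stack[STACK_SIZE];  /* stands in for the Mini-OS kernel stack */

/* The initial stack pointer is the address just past the end of the array. */
char *initial_sp(void)
{
    return stack + STACK_SIZE;
}

/* Model of a 32-bit PUSH: decrement the pointer first, then store. */
char *push32(char *sp, uint32_t value)
{
    sp -= sizeof value;
    memcpy(sp, &value, sizeof value);
    return sp;
}
```

After one push32() the pointer is stack+8188, which is the last 4-byte slot of the array.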
The ESI register is then pushed on the stack and start_kernel() is called. This is the main function that sets the ball rolling. Since start_kernel() accepts a start_info_t pointer as its parameter, the ESI register must be pointing to a start_info_t structure made available to the kernel by the domain creator (xm). The definition of this structure is in $XEN_SRC/xen/include/public/xen.h

Start of day memory layout (notes from xen.h)

/*
* Start-of-day memory layout:
* 1. The domain is started within contiguous virtual-memory region.
* 2. The contiguous region ends on an aligned 4MB boundary (in Mini-OS it ends at 4MB).
* 3. This the order of bootstrap elements in the initial virtual region:
* a. relocated kernel image
* b. initial ram disk [mod_start, mod_len]
* c. list of allocated page frames [mfn_list, nr_pages]
* d. start_info_t structure [register ESI (x86)]
* e. bootstrap page tables [pt_base, CR3 (x86)]
* f. bootstrap stack [register ESP (x86)]
* 4. Bootstrap elements are packed together, but each is 4kB-aligned.
* 5. The initial ram disk may be omitted.
* 6. The list of page frames forms a contiguous 'pseudo-physical' memory
* layout for the domain. In particular, the bootstrap virtual-memory
* region is a 1:1 mapping to the first section of the pseudo-physical map.
* 7. All bootstrap elements are mapped read-writable for the guest OS. The
* only exception is the bootstrap page table, which is mapped read-only.
* 8. There is guaranteed to be at least 512kB padding after the final
* bootstrap element. If necessary, the bootstrap virtual region is
* extended by an extra 4MB to ensure this.
*/
There are 3 different types of address spaces:
  • Machine page frames - These are the page frames allocated in the physical RAM
  • (Pseudo) Physical page frames - These are contiguous page frames that map to Machine page frames. This is basically virtualized physical RAM for the domU kernel. For example physical page frames 0,1,2,3,4,5,.. map to machine frames 55,23,49,123,32.... (without any specific order).
  • domU kernel virtual address space - When the mini-os kernel is compiled and linked by gcc, all addresses in the .text section are virtual addresses that need translation to (pseudo) physical addresses and then to machine addresses. This translation takes place at run time by initial page tables set up by the domain loader (xm).
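The second translation step (pseudo-physical to machine) can be sketched as a simple table lookup. This is a minimal user-space model, not Mini-OS code: toy_p2m and phys_to_machine are hypothetical names, and the table simply reuses the example frame numbers from the list above. In a real domain the table comes from start_info.mfn_list.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Toy physical-to-machine table: index = pseudo-physical frame number,
 * value = machine frame number (values taken from the example above). */
const unsigned long toy_p2m[] = { 55, 23, 49, 123, 32 };
const size_t toy_p2m_len = sizeof toy_p2m / sizeof toy_p2m[0];

/* Translate a pseudo-physical address to a machine address. */
unsigned long phys_to_machine(unsigned long phys)
{
    unsigned long pfn    = phys >> PAGE_SHIFT;      /* pseudo-physical frame */
    unsigned long offset = phys & (PAGE_SIZE - 1);  /* byte within the page  */

    assert(pfn < toy_p2m_len);                      /* stay inside the toy table */
    return (toy_p2m[pfn] << PAGE_SHIFT) | offset;
}
```

For instance, pseudo-physical address 0x1004 lies in frame 1, which the toy table maps to machine frame 23, so the machine address is 0x17004.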
The start_info structure provides initial boot-time information for the domU kernel:
struct start_info {
    /* THE FOLLOWING ARE FILLED IN BOTH ON INITIAL BOOT AND ON RESUME. */
    char magic[32];             /* "xen-<version>-<platform>". */
    unsigned long nr_pages;     /* Total pages allocated to this domain. */
    unsigned long shared_info;  /* MACHINE address of shared info struct. */
    uint32_t flags;             /* SIF_xxx flags. */
    xen_pfn_t store_mfn;        /* MACHINE page number of shared page. */
    uint32_t store_evtchn;      /* Event channel for store communication. */
    union {
        struct {
            xen_pfn_t mfn;      /* MACHINE page number of console page. */
            uint32_t evtchn;    /* Event channel for console page. */
        } domU;
        struct {
            uint32_t info_off;  /* Offset of console_info struct. */
            uint32_t info_size; /* Size of console_info struct from start. */
        } dom0;
    } console;
    /* THE FOLLOWING ARE ONLY FILLED IN ON INITIAL BOOT (NOT RESUME). */
    unsigned long pt_base;      /* VIRTUAL address of page directory. */
    unsigned long nr_pt_frames; /* Number of bootstrap p.t. frames. */
    unsigned long mfn_list;     /* VIRTUAL address of page-frame list. */
    unsigned long mod_start;    /* VIRTUAL address of pre-loaded module. */
    unsigned long mod_len;      /* Size (bytes) of pre-loaded module. */
    int8_t cmd_line[MAX_GUEST_CMDLINE];
};
The information in this structure is filled in by the domain loader (xm) when creating a new domain. The shared_info structure lives on a page shared between Xen and the guest kernel (note the "Xen/kernel shared data" comment below). It provides one means of communication between the two:
/*
* Xen/kernel shared data -- pointer provided in start_info.
*
* This structure is defined to be both smaller than a page, and the
* only data on the shared page, but may vary in actual size even within
* compatible Xen versions; guests should not rely on the size
* of this structure remaining constant.
*/

struct shared_info {
struct vcpu_info vcpu_info[MAX_VIRT_CPUS];
unsigned long evtchn_pending[sizeof(unsigned long) * 8];
unsigned long evtchn_mask[sizeof(unsigned long) * 8];
/*
* Wallclock time: updated only by control software. Guests should base
* their gettimeofday() syscall on this wallclock-base value.
*/

uint32_t wc_version; /* Version counter: see vcpu_time_info_t. */
uint32_t wc_sec;     /* Secs 00:00:00 UTC, Jan 1, 1970. */
uint32_t wc_nsec;    /* Nsecs 00:00:00 UTC, Jan 1, 1970. */
struct arch_shared_info arch;
};
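The two evtchn arrays are bitmaps with one bit per event channel port. The following is a minimal sketch of how such bitmaps can be indexed, assuming the usual convention that a port needs attention when its pending bit is set and its mask bit is clear. The struct toy_shared_info and all three helper functions are hypothetical stand-ins written for this example; real guest code would use atomic bit operations.

```c
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * 8)

/* Cut-down stand-in for struct shared_info: only the two bitmaps. */
struct toy_shared_info {
    unsigned long evtchn_pending[sizeof(unsigned long) * 8];
    unsigned long evtchn_mask[sizeof(unsigned long) * 8];
};

/* Return true if the given bit is set in the bitmap. */
bool toy_test_bit(const unsigned long *bitmap, unsigned int bit)
{
    return (bitmap[bit / BITS_PER_LONG] >> (bit % BITS_PER_LONG)) & 1UL;
}

/* Set the given bit in the bitmap (non-atomic, illustration only). */
void toy_set_bit(unsigned long *bitmap, unsigned int bit)
{
    bitmap[bit / BITS_PER_LONG] |= 1UL << (bit % BITS_PER_LONG);
}

/* A port needs servicing when it is pending and not masked. */
bool port_needs_service(const struct toy_shared_info *s, unsigned int port)
{
    return toy_test_bit(s->evtchn_pending, port) &&
           !toy_test_bit(s->evtchn_mask, port);
}
```

Because each array has as many words as there are bits in a word, the bitmaps cover 32 * 32 = 1024 ports on a 32-bit build.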

The start_kernel() function

This is where Mini-OS sets the ball rolling. It calls a series of initialization functions and then sets up three kernel threads to run. The non-preemptive scheduler provided with Mini-OS then schedules these threads one after another. The functions called by start_kernel() are documented (roughly) below.

arch_init() [arch/x86/setup.c]

  • copies the start_info structure into a global area within the kernel image (to start_info_union.start_info).
  • stores the physical to machine frame mappings into a global variable 'phys_to_machine_mapping' declared in mm.c

phys_to_machine_mapping = (unsigned long *)start_info.mfn_list;
/* mfn_list is part of the start_info structure:
 *     unsigned long mfn_list;    (VIRTUAL address of page-frame list)
 */
  • Maps the shared_info page at 0x2000 (the start of the third 4 KB page of the kernel virtual address space - this info is from arch/x86/x86_32.S). This is done using a hypercall, HYPERVISOR_update_va_mapping:

    HYPERVISOR_update_va_mapping(
        (unsigned long)shared_info, __pte(pa | 7), UVMF_INVLPG);

    Only the machine address of the shared_info structure is provided by the domain loader. The DomU kernel must map this page into its own virtual address space before it can access its contents.
    Documentation for this call:

    Regardless of which page table update mode is being used, however, there are some occasions (notably handling a demand page fault) where a guest OS will wish to modify exactly one PTE rather than a batch, and where that PTE is mapped into the current address space. This is catered for by the following:

    update_va_mapping(unsigned long va, uint64_t val, unsigned long flags)

    Update the currently installed PTE that maps virtual address va to new value val. As with mmu_update, Xen checks the modification is safe before applying it. The flags determine which kind of TLB flush, if any, should follow the update.
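The value 7 OR-ed into the PTE in the call above is a combination of the three low x86 page flag bits (present, writable, user). A small sketch of how such a PTE value is composed for the non-PAE 32-bit case follows. The macro names and make_pte are chosen for this example and are not the Mini-OS definitions.

```c
#define PAGE_SHIFT   12
#define PAGE_MASK    (~((1UL << PAGE_SHIFT) - 1))

/* Low flag bits of an x86 page table entry (names are local to this sketch). */
#define PTE_PRESENT  0x001UL   /* entry is valid                 */
#define PTE_RW       0x002UL   /* page is writable               */
#define PTE_USER     0x004UL   /* accessible from outside ring 0 */

/* Build a PTE value from a page-aligned machine address and flag bits. */
unsigned long make_pte(unsigned long machine_addr, unsigned long flags)
{
    return (machine_addr & PAGE_MASK) | flags;
}
```

With all three flags set, make_pte(0x12345000, PTE_PRESENT | PTE_RW | PTE_USER) yields 0x12345007, which has the same shape as pa | 7.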

  • registers callback handlers:

    HYPERVISOR_set_callbacks(
        __KERNEL_CS, (unsigned long)hypervisor_callback,
        __KERNEL_CS, (unsigned long)failsafe_callback);

    Documentation for this call:

    At start of day, a guest operating system needs to setup the virtual CPU it is executing on. This includes installing vectors for the virtual IDT so that the guest OS can handle interrupts, page faults, etc. However the very first thing a guest OS must setup is a pair of hypervisor callbacks: these are the entry points which Xen will use when it wishes to notify the guest OS of an occurrence.

    set_callbacks(unsigned long event_selector, unsigned long event_address, unsigned long failsafe_selector, unsigned long failsafe_address)

    Register the normal (event) and failsafe callbacks for event processing. In each case the code segment selector and address within that segment are provided. The selectors must have RPL 1; in XenLinux we simply use the kernel's CS for both event_selector and failsafe_selector. The value event_address specifies the address of the guest OSes event handling and dispatch routine; the failsafe_address specifies a separate entry point which is used only if a fault occurs when Xen attempts to use the normal callback.
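The "RPL 1" requirement refers to the low two bits of an x86 segment selector. The sketch below decodes a selector into its fields. The helper names are invented for this example, and the selector 0xe019 mentioned after the block is only an example value (assumed here to correspond to the flat ring-1 code selector of 32-bit Xen; check the public headers before relying on it).

```c
#include <stdint.h>

/* Requested privilege level: bits 0-1 of the selector. */
unsigned int selector_rpl(uint16_t sel)
{
    return sel & 0x3;
}

/* Table indicator: bit 2 (0 = GDT, 1 = LDT). */
unsigned int selector_ti(uint16_t sel)
{
    return (sel >> 2) & 0x1;
}

/* Descriptor index: bits 3-15. */
unsigned int selector_index(uint16_t sel)
{
    return sel >> 3;
}
```

For the example value 0xe019, the RPL field is 1 and the table indicator is 0, so it names a GDT descriptor with ring-1 privilege.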

    trap_init() [arch/x86/traps.c]

    This function registers a trap handler table with xen using the set_trap_table() hypercall.
    void trap_init(void)
    {
    HYPERVISOR_set_trap_table(trap_table);
    }
    The trap table is defined as follows:
    /*
    * Submit a virtual IDT to the hypervisor. This consists of tuples
    * (interrupt vector, privilege ring, CS:EIP of handler).
    * The 'privilege ring' field specifies the least-privileged ring that
    * can trap to that vector using a software-interrupt instruction (INT).
    */

    static trap_info_t trap_table[] = {
    { 0, 0, __KERNEL_CS, (unsigned long)divide_error },
    { 1, 0, __KERNEL_CS, (unsigned long)debug },
    { 3, 3, __KERNEL_CS, (unsigned long)int3 },
    { 4, 3, __KERNEL_CS, (unsigned long)overflow },
    { 5, 3, __KERNEL_CS, (unsigned long)bounds },
    { 6, 0, __KERNEL_CS, (unsigned long)invalid_op },
    { 7, 0, __KERNEL_CS, (unsigned long)device_not_available },
    { 9, 0, __KERNEL_CS, (unsigned long)coprocessor_segment_overrun },
    { 10, 0, __KERNEL_CS, (unsigned long)invalid_TSS },
    { 11, 0, __KERNEL_CS, (unsigned long)segment_not_present },
    { 12, 0, __KERNEL_CS, (unsigned long)stack_segment },
    { 13, 0, __KERNEL_CS, (unsigned long)general_protection },
    { 14, 0, __KERNEL_CS, (unsigned long)page_fault },
    { 15, 0, __KERNEL_CS, (unsigned long)spurious_interrupt_bug },
    { 16, 0, __KERNEL_CS, (unsigned long)coprocessor_error },
    { 17, 0, __KERNEL_CS, (unsigned long)alignment_check },
    { 19, 0, __KERNEL_CS, (unsigned long)simd_coprocessor_error },
    { 0, 0, 0, 0 }
    };
    These handler entry points (code for these handlers) are mostly defined in arch/x86/x86_32.S
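Note that the table's terminator is the all-zero row, even though vector 0 is itself a valid entry (divide_error). A table walker therefore has to detect the end by the zero handler address, not by the vector number. A minimal sketch of such a walk, using a stand-in struct that is assumed to mirror the trap_info_t layout; toy_trap_info and find_trap are hypothetical names:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for trap_info_t (layout assumed for this illustration). */
struct toy_trap_info {
    uint8_t       vector;   /* exception vector                           */
    uint8_t       flags;    /* least-privileged ring allowed to use INT   */
    uint16_t      cs;       /* code segment selector of the handler       */
    unsigned long address;  /* handler entry point; 0 terminates the table */
};

/* Return the entry for the given vector, or NULL if it is not in the table. */
const struct toy_trap_info *find_trap(const struct toy_trap_info *table,
                                      uint8_t vector)
{
    for (; table->address != 0; table++) {
        if (table->vector == vector)
            return table;
    }
    return NULL;
}
```

Looking up vector 8 in a table shaped like the one above returns NULL, since no handler for it is listed.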

    init_mm() [mm.c]

    This function initializes memory management in mini-os. Provisional start-of-day page tables are already set up by the domain loader. The mini-os kernel's task now is to map the rest of the memory.

    init_mm() first calls arch_init_mm(&start_pfn, &max_pfn); both arguments are output parameters filled in by arch_init_mm(). This function does a number of important things.

    • arch_init_mm() [arch/x86/mm.c]

      The first job of this function is to calculate the starting and max page frame numbers for mapping memory. start_pfn is calculated as follows:
      start_pfn = PFN_UP(to_phys(start_info.pt_base)) +
                  start_info.nr_pt_frames + 3;
      The PFN_UP macro gives a page frame number after rounding the given address up to the next page frame boundary; to_phys() converts a virtual address to a physical address by subtracting the address of _text:
      #define VIRT_START         ((unsigned long)&_text)
      #define to_phys(x) ((unsigned long)(x)-VIRT_START)
      #define to_virt(x) ((void *)((unsigned long)(x)+VIRT_START))
      In the case of mini-os, _text starts at 0x0 (see the linker script arch/x86/minios-x86_32.lds), so virtual addresses are the same as (pseudo) physical addresses; max_pfn is the same as the total number of pages (start_info.nr_pages).
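To make the rounding concrete, here is a small sketch of the start_pfn arithmetic. PFN_UP is written out in the conventional way (add PAGE_SIZE - 1, then shift), which is assumed to match the Mini-OS macro; first_free_pfn is a hypothetical wrapper around the expression shown above, and the input values used afterwards are made up.

```c
#define PAGE_SHIFT  12
#define PAGE_SIZE   (1UL << PAGE_SHIFT)

/* Round an address up / down to a page frame number. */
#define PFN_UP(x)   (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)

/* Same expression as start_pfn above, with the inputs passed explicitly.
 * pt_base_phys is to_phys(start_info.pt_base). */
unsigned long first_free_pfn(unsigned long pt_base_phys,
                             unsigned long nr_pt_frames)
{
    return PFN_UP(pt_base_phys) + nr_pt_frames + 3;
}
```

With a page directory at pseudo-physical address 0x51000 and two bootstrap page table frames, this gives 0x51 + 2 + 3 = 0x56.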
      build_pagetables()
      build_pagetables(start_pfn, max_pfn) is called next to map the rest of the memory allocated to mini-os:

      This function determines the range of (pseudo) physical memory to be mapped, and then creates PDE and PTE entries to map the remaining kernel virtual addresses to these locations. The PDE and PTE entries need to specify the real machine page frames, so the phys_to_machine_mapping array is used to convert physical frame numbers to machine frame numbers. Page tables are added/updated by calling HYPERVISOR_mmu_update().

      Page table levels: L1 is a PT (Page Table; under Xen its entries contain the machine addresses of page frames). L2, L3, L4 etc. are higher-level page directories. For a 32-bit machine with PAE disabled, only L1 and L2 are used. Just one L2 frame is enough to map the whole 4 GB virtual address space (1024 entries, each covering 4 MB).
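For the two-level, non-PAE case, a 32-bit virtual address splits into a 10-bit L2 index, a 10-bit L1 index and a 12-bit page offset. A minimal sketch of that decomposition follows. The function names mirror l2_table_offset and l1_table_offset used later in this document, but the bodies are written for this example, not copied from Mini-OS.

```c
#define PAGE_SHIFT            12
#define L2_PAGETABLE_SHIFT    22
#define L1_PAGETABLE_ENTRIES  1024
#define L2_PAGETABLE_ENTRIES  1024

/* Index into the page directory (top 10 bits of a 32-bit address). */
unsigned long l2_table_offset(unsigned long va)
{
    return (va >> L2_PAGETABLE_SHIFT) & (L2_PAGETABLE_ENTRIES - 1);
}

/* Index into the page table (middle 10 bits). */
unsigned long l1_table_offset(unsigned long va)
{
    return (va >> PAGE_SHIFT) & (L1_PAGETABLE_ENTRIES - 1);
}

/* Byte offset within the 4 KB page (low 12 bits). */
unsigned long page_offset(unsigned long va)
{
    return va & ((1UL << PAGE_SHIFT) - 1);
}
```

For the address 0x00C03004 this yields L2 index 3, L1 index 3 and page offset 4.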

      To find where to start mapping memory, the following calculation is used:

      pfn_to_map = (start_info.nr_pt_frames - NOT_L1_FRAMES) * L1_PAGETABLE_ENTRIES;
      where,
      pfn_to_map => page frame number to start mapping
      start_info.nr_pt_frames => total # of PD+PT frames allocated for startup mapping
      NOT_L1_FRAMES => 1 in the case of a 32-bit machine: there is one L2 frame and the rest of the frames are L1.
      The term in parentheses tells us how many L1 frames are allocated in physical memory. Multiply this by the # of entries in each L1 frame. This puts us at the first page # that the bootstrap L1 frames do not map (assuming they are contiguous in the virtual and physical address spaces).
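A worked instance of this formula, written as a small function. The constants are the non-PAE 32-bit values described above; first_unmapped_pfn is a hypothetical name used only in this sketch.

```c
#include <assert.h>

#define L1_PAGETABLE_ENTRIES  1024  /* PTEs per L1 frame (non-PAE, 32-bit) */
#define NOT_L1_FRAMES         1     /* one L2 frame, everything else is L1 */

/* First page frame number not covered by the bootstrap page tables. */
unsigned long first_unmapped_pfn(unsigned long nr_pt_frames)
{
    assert(nr_pt_frames >= NOT_L1_FRAMES);  /* at least the L2 frame exists */
    return (nr_pt_frames - NOT_L1_FRAMES) * L1_PAGETABLE_ENTRIES;
}
```

With nr_pt_frames = 2 (one L2 and one L1 frame) the result is 1024, i.e. the bootstrap tables cover the first 4 MB and mapping continues from page 1024.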

      Our start virtual address is pfn_to_virt(pfn_to_map) and we have to map until pfn_to_virt(max_pfn). A loop maps these virtual addresses to physical (machine) frames. Here is the pseudo code:

      address = start_address;
      pt_pfn  = start_pfn;   /* next free page frame; used when a new page table is needed */

      while (there are more addresses to be mapped) {

          page_directory = start_info.pt_base;   /* virtual address of page directory page */
          page_directory_machine_frame = pfn_to_mfn(virt_to_pfn(start_info.pt_base));
          offset = l2_table_offset(address);     /* index of the page directory slot this address belongs to */

          if ((address & L1_MASK) == 0x00000000) {

              /* L1_MASK evaluates to 0x003fffff (binary 0000 0000 0011 1111 1111 1111 1111 1111),
               * i.e. the top 10 bits (PD index) are masked off. If this evaluates to zero, we are at a
               * 4M boundary and a new L1 table is required (each L1 table maps 1024 frames of 4KB each = 4M bytes of memory).
               */

              /* This will create a new L1 frame for this base address.
               * Notice the parameters:
               * pt_pfn -> the page frame number that will contain the new page table
               * page_directory_machine_frame -> this is needed to set up a machine pointer to the new page frame
               * offset -> index of the appropriate slot in the L2 directory. Needed to set up the pointer to the new page frame
               * L1_FRAME -> we need an L1 frame
               */
              new_pt_frame(&pt_pfn, page_directory_machine_frame, offset, L1_FRAME);
          }
          page   = page_directory[offset];     /* page directory entry: machine pointer to the L1 table */
          mfn    = pte_to_mfn(page);           /* convert the pgdir entry into an mfn */
          offset = l1_table_offset(address);   /* index of the PTE for this address within the L1 table */

          /* then HYPERVISOR_mmu_update with ptr = machine address of the PTE,
           * val = PTE built from pfn_to_mfn(pfn_to_map) and the protection bits */

          pfn_to_map++;                        /* advance to the next frame ... */
          address += PAGE_SIZE;                /* ... and the next virtual page */
      }
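The boundary test in the loop can be exercised on its own. The sketch below is a user-space model, not the Mini-OS function: it walks the same range of frames and counts how many times the "new L1 table required" branch would be taken. It assumes, as in Mini-OS, that virtual and pseudo-physical addresses coincide; new_l1_frames_needed is a hypothetical name.

```c
#define PAGE_SHIFT  12
#define L1_MASK     0x003fffffUL   /* low 22 bits: offset within a 4 MB region */

/* Count the L1 page tables that must be created to map frames
 * [pfn_to_map, max_pfn), one per 4 MB boundary crossed. */
unsigned long new_l1_frames_needed(unsigned long pfn_to_map,
                                   unsigned long max_pfn)
{
    unsigned long count = 0;

    for (unsigned long pfn = pfn_to_map; pfn < max_pfn; pfn++) {
        unsigned long address = pfn << PAGE_SHIFT;  /* virtual == pseudo-physical */

        if ((address & L1_MASK) == 0)  /* at a 4 MB boundary: new L1 table */
            count++;
    }
    return count;
}
```

For a 32 MB domain (8192 pages) whose bootstrap tables already cover the first 4 MB (pfn_to_map = 1024), seven additional L1 frames are needed, one for each remaining 4 MB region.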
      The function new_pt_frame initializes a new frame to be a page table.
      void new_pt_frame(unsigned long *pt_pfn, unsigned long prev_l_mfn,
                        unsigned long offset, unsigned long level)
      {
          /* The declarations of tab, pt_page, prot_e, prot_t, pincmd,
           * mmu_updates and pin_request are omitted from this excerpt. */

          /* step 1: zero out the new page table page */
          memset((unsigned long*)pfn_to_virt(*pt_pfn), 0, PAGE_SIZE);

          /* step 2: Map the new page into the existing page tables.
           * The virtual address of the page that is to contain the new page table (pt_page)
           * is remapped without write permission. ptr must be the exact machine address
           * of the PTE that maps pt_page, which is:
           * page_dir[page_dir_offset(pt_page)] + sizeof(pgentry) * pte_offset(pt_page)
           */
          mmu_updates[0].ptr = ((pgentry_t)tab[l2_table_offset(pt_page)] & PAGE_MASK) +
                               sizeof(pgentry_t) * l1_table_offset(pt_page);
          mmu_updates[0].val = (pgentry_t)pfn_to_mfn(*pt_pfn) << PAGE_SHIFT |
                               (prot_e & ~_PAGE_RW);
          HYPERVISOR_mmu_update(mmu_updates, 1, NULL, DOMID_SELF);

          /* step 3: Issue a PIN request - tell the hypervisor that this page is going to contain a PAGE TABLE */
          pin_request.cmd = pincmd;   /* pincmd = MMUEXT_PIN_L1_TABLE */
          pin_request.arg1.mfn = pfn_to_mfn(*pt_pfn);
          HYPERVISOR_mmuext_op(&pin_request, 1, NULL, DOMID_SELF);

          /* step 4: Link the newly created page table to its parent PAGE DIRECTORY */
          mmu_updates[0].ptr = ((pgentry_t)prev_l_mfn << PAGE_SHIFT) + sizeof(pgentry_t) * offset;
          mmu_updates[0].val = (pgentry_t)pfn_to_mfn(*pt_pfn) << PAGE_SHIFT | prot_t;
          HYPERVISOR_mmu_update(mmu_updates, 1, NULL, DOMID_SELF);
      }
      new_pt_frame() performs four actions (for the simple case where an L1 table is made): it zeroes out the new page, maps it read-only into the existing page tables, informs the hypervisor that this new page is going to contain a PAGE TABLE (this is known as pinning the page table) and finally links the new page table to the page directory.
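Both mmu_update requests in new_pt_frame() follow the same pattern: ptr is the machine address of the entry to be written and val is the new entry. The sketch below reproduces that arithmetic in user space. The struct toy_mmu_update is a stand-in assumed to mirror the two 64-bit fields of the real request, make_update is a hypothetical helper, and the frame numbers used afterwards are made up.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PTE_RW     0x002ULL      /* writable bit, cleared for page table pages */

typedef uint32_t pgentry_t;      /* non-PAE 32-bit page table entry */

/* Stand-in for the update request passed to HYPERVISOR_mmu_update(). */
struct toy_mmu_update {
    uint64_t ptr;   /* machine address of the entry to write */
    uint64_t val;   /* new contents of that entry            */
};

/* Build a request that writes slot `slot` of the table in machine frame
 * `table_mfn` so that it points at machine frame `target_mfn`. */
struct toy_mmu_update make_update(uint64_t table_mfn, unsigned int slot,
                                  uint64_t target_mfn, uint64_t prot)
{
    struct toy_mmu_update u;

    u.ptr = (table_mfn << PAGE_SHIFT) + sizeof(pgentry_t) * slot;
    u.val = (target_mfn << PAGE_SHIFT) | prot;
    return u;
}
```

For example, writing slot 3 of the table in machine frame 0x55 to point at frame 0x99 with protection 7 & ~PTE_RW gives ptr = 0x5500c and val = 0x99005. Note that nothing here issues a hypercall; the sketch only shows how the two fields are computed.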

      arch_init_mm() returns after building any additional PDE/PTE frames needed to map the entire memory allocated to mini-os. The variable start_pfn is updated to account for any newly created PD/PT page frames.

    • arch_init_p2m() is called next to initialize the HYPERVISOR_shared_info->arch.pfn_to_mfn_frame_list_list variable with a pointer to a page-table-like structure that holds the physical-to-machine (pfn to mfn) mapping of the DomU's page frames.
    • arch_init_demand_mapping_area(), called next, creates an additional PTE to map virtual addresses beyond max_pfn. Any virtual address above max_pfn can now be mapped to a physical address on demand using this PTE. This might be used for process virtual address space.

    Watch this space!

    More documentation coming on Xen grant tables, event channels, process scheduling etc.

    Thanks to the Google code prettify project for their syntax highlighting script.

    N.B.: Changes subject to http://www.cs.uic.edu/~spopuri/minios.html
