上一篇文章解释了计算机如何启动,直到引导加载程序将内核映像填充到内存中之后即将跳入内核入口点。
关于引导的最后一篇文章介绍了内核的内幕,以了解操作系统如何开始运行。
由于我有经验,我将在Linux Cross Reference上大量链接到Linux内核2.6.25.6的资源。
如果您熟悉类似C的语法,那么这些源将非常容易阅读。
即使您错过一些细节,您也可以了解正在发生的事情。
主要障碍是某些代码缺乏上下文,例如何时或为何运行代码或计算机的基本功能。
我希望提供一些背景信息。
由于简洁(哈!),很多有趣的东西(例如中断和内存)暂时只得到了点头。
该帖子以Windows启动的要点结尾。
在英特尔x86引导store中的这一点上,处理器以实模式运行,能够寻址1 MB内存,而对于现代Linux系统,RAM如下所示:
引导加载程序完成后的RAM内容
引导程序已使用BIOS磁盘I / O服务将内核映像加载到内存中。
此图像是硬盘中包含内核的文件的精确副本,例如
/boot/vmlinuz-2.6.22-14-server。
该映像分为两部分:一小部分包含实模式内核代码,加载到640K屏障以下;
在受保护模式下运行的内核的大部分将在第一个兆字节的内存之后加载。
该操作从上图所示的实模式内核头开始。
此内存区域用于在引导加载程序和内核之间实现Linux引导协议。
引导加载程序在执行工作时会读取其中的某些值。
这些功能包括便利性,例如包含内核版本的可读字符串,还包括关键信息,例如实模式内核片段的大小。
引导加载程序还将值写入该区域,例如用户在引导菜单中给定的命令行参数的内存地址。
引导加载程序完成后,它将填写内核头文件所需的所有参数。
然后是时候进入内核入口点了。
下图显示了内核初始化的代码序列,以及源目录,文件和行号:
特定于体系结构的Linux内核初始化
英特尔架构的早期内核启动位于arch / x86 / boot / header.S文件中。
/*
* header.S
*
* Copyright (C) 1991, 1992 Linus Torvalds
*
* Based on bootsect.S and setup.S
* modified by more people than can be counted
*
* Rewritten as a common file by H. Peter Anvin (Apr 2007)
*
* BIG FAT NOTE: We're in real mode using 64k segments. Therefore segment
* addresses must be multiplied by 16 to obtain their respective linear
* addresses. To avoid confusion, linear addresses are written using leading
* hex while segment addresses are written as segment:offset.
*
*/
#include
#include
#include
#include
#include
#include
#include "boot.h"
SETUPSECTS = 4 /* default nr of setup-sectors */
BOOTSEG = 0x07C0 /* original address of boot-sector */
SYSSEG = DEF_SYSSEG /* system loaded at 0x10000 (65536) */
SYSSIZE = DEF_SYSSIZE /* system size: # of 16-byte clicks */
/* to be loaded */
ROOT_DEV = 0 /* ROOT_DEV is now written by "build" */
SWAP_DEV = 0 /* SWAP_DEV is now written by "build" */
#ifndef SVGA_MODE
#define SVGA_MODE ASK_VGA
#endif
#ifndef RAMDISK
#define RAMDISK 0
#endif
#ifndef ROOT_RDONLY
#define ROOT_RDONLY 1
#endif
.code16
.section ".bstext", "ax"
.global bootsect_start
bootsect_start:
# Normalize the start address
ljmp $BOOTSEG, $start2
start2:
movw %cs, %ax
movw %ax, %ds
movw %ax, %es
movw %ax, %ss
xorw %sp, %sp
sti
cld
movw $bugger_off_msg, %si
msg_loop:
lodsb
andb %al, %al
jz bs_die
movb $0xe, %ah
movw $7, %bx
int $0x10
jmp msg_loop
bs_die:
# Allow the user to press a key, then reboot
xorw %ax, %ax
int $0x16
int $0x19
# int 0x19 should never return. In case it does anyway,
# invoke the BIOS reset code...
ljmp $0xf000,$0xfff0
.section ".bsdata", "a"
bugger_off_msg:
.ascii "Direct booting from floppy is no longer supported.\r\n"
.ascii "Please use a boot loader program instead.\r\n"
.ascii "\n"
.ascii "Remove disk and press any key to reboot . . .\r\n"
.byte 0
# Kernel attributes; used by setup. This is part 1 of the
# header, from the old boot sector.
.section ".header", "a"
.globl hdr
hdr:
setup_sects: .byte SETUPSECTS
root_flags: .word ROOT_RDONLY
syssize: .long SYSSIZE
ram_size: .word RAMDISK
vid_mode: .word SVGA_MODE
root_dev: .word ROOT_DEV
boot_flag: .word 0xAA55
# offset 512, entry point
.globl _start
_start:
# Explicitly enter this as bytes, or the assembler
# tries to generate a 3-byte jump here, which causes
# everything else to push off to the wrong offset.
.byte 0xeb # short (2-byte) jump
.byte start_of_setup-1f
1:
# Part 2 of the header, from the old setup.S
.ascii "HdrS" # header signature
.word 0x0207 # header version number (>= 0x0105)
# or else old loadlin-1.5 will fail)
.globl realmode_swtch
realmode_swtch: .word 0, 0 # default_switch, SETUPSEG
start_sys_seg: .word SYSSEG
.word kernel_version-512 # pointing to kernel version string
# above section of header is compatible
# with loadlin-1.5 (header v1.5). Don't
# change it.
type_of_loader: .byte 0 # = 0, old one (LILO, Loadlin,
# Bootlin, SYSLX, bootsect...)
# See Documentation/i386/boot.txt for
# assigned ids
# flags, unused bits must be zero (RFU) bit within loadflags
loadflags:
LOADED_HIGH = 1 # If set, the kernel is loaded high
CAN_USE_HEAP = 0x80 # If set, the loader also has set
# heap_end_ptr to tell how much
# space behind setup.S can be used for
# heap purposes.
# Only the loader knows what is free
#ifndef __BIG_KERNEL__
.byte 0
#else
.byte LOADED_HIGH
#endif
setup_move_size: .word 0x8000 # size to move, when setup is not
# loaded at 0x90000. We will move setup
# to 0x90000 then just before jumping
# into the kernel. However, only the
# loader knows how much data behind
# us also needs to be loaded.
code32_start: # here loaders can put a different
# start address for 32-bit code.
#ifndef __BIG_KERNEL__
.long 0x1000 # 0x1000 = default for zImage
#else
.long 0x100000 # 0x100000 = default for big kernel
#endif
ramdisk_image: .long 0 # address of loaded ramdisk image
# Here the loader puts the 32-bit
# address where it loaded the image.
# This only will be read by the kernel.
ramdisk_size: .long 0 # its size in bytes
bootsect_kludge:
.long 0 # obsolete
heap_end_ptr: .word _end+STACK_SIZE-512
# (Header version 0x0201 or later)
# space from here (exclusive) down to
# end of setup code can be used by setup
# for local heap purposes.
pad1: .word 0
cmd_line_ptr: .long 0 # (Header version 0x0202 or later)
# If nonzero, a 32-bit pointer
# to the kernel command line.
# The command line should be
# located between the start of
# setup and the end of low
# memory (0xa0000), or it may
# get overwritten before it
# gets read. If this field is
# used, there is no longer
# anything magical about the
# 0x90000 segment; the setup
# can be located anywhere in
# low memory 0x10000 or higher.
ramdisk_max: .long 0x7fffffff
# (Header version 0x0203 or later)
# The highest safe address for
# the contents of an initrd
# The current kernel allows up to 4 GB,
# but leave it at 2 GB to avoid
# possible bootloader bugs.
kernel_alignment: .long CONFIG_PHYSICAL_ALIGN #physical addr alignment
#required for protected mode
#kernel
#ifdef CONFIG_RELOCATABLE
relocatable_kernel: .byte 1
#else
relocatable_kernel: .byte 0
#endif
pad2: .byte 0
pad3: .word 0
cmdline_size: .long COMMAND_LINE_SIZE-1 #length of the command line,
#added with boot protocol
#version 2.06
hardware_subarch: .long 0 # subarchitecture, added with 2.07
# default to 0 for normal x86 PC
hardware_subarch_data: .quad 0
# End of setup header #####################################################
.section ".inittext", "ax"
start_of_setup:
#ifdef SAFE_RESET_DISK_CONTROLLER
# Reset the disk controller.
movw $0x0000, %ax # Reset disk controller
movb $0x80, %dl # All disks
int $0x13
#endif
# Force %es = %ds
movw %ds, %ax
movw %ax, %es
cld
# Apparently some ancient versions of LILO invoked the kernel with %ss != %ds,
# which happened to work by accident for the old code. Recalculate the stack
# pointer if %ss is invalid. Otherwise leave it alone, LOADLIN sets up the
# stack behind its own code, so we can't blindly put it directly past the heap.
movw %ss, %dx
cmpw %ax, %dx # %ds == %ss?
movw %sp, %dx
je 2f # -> assume %sp is reasonably set
# Invalid %ss, make up a new stack
movw $_end, %dx
testb $CAN_USE_HEAP, loadflags
jz 1f
movw heap_end_ptr, %dx
1: addw $STACK_SIZE, %dx
jnc 2f
xorw %dx, %dx # Prevent wraparound
2: # Now %dx should point to the end of our stack space
andw $~3, %dx # dword align (might as well...)
jnz 3f
movw $0xfffc, %dx # Make sure we're not zero
3: movw %ax, %ss
movzwl %dx, %esp # Clear upper half of %esp
sti # Now we should have a working stack
# We will have entered with %cs = %ds+0x20, normalize %cs so
# it is on par with the other segments.
pushw %ds
pushw $6f
lretw
6:
# Check signature at end of setup
cmpl $0x5a5aaa55, setup_sig
jne setup_bad
# Zero the bss
movw $__bss_start, %di
movw $_end+3, %cx
xorl %eax, %eax
subw %di, %cx
shrw $2, %cx
rep; stosl
# Jump to C code (should not return)
calll main
# Setup corrupt somehow...
setup_bad:
movl $setup_corrupt, %eax
calll puts
# Fall through...
.globl die
.type die, @function
die:
hlt
jmp die
.size die, .-die
.section ".initdata", "a"
setup_corrupt:
.byte 7
.string "No setup signature found...\n"
它使用汇编语言,这对于整个内核来说是很少见的,但对于启动代码却很常见。
该文件的开头实际上包含引导扇区代码,这是Linux在没有引导加载程序的情况下可以工作的剩余时间。
如今,此引导扇区(如果执行)仅向用户打印“ bugger_off_msg”并重新引导。
现代引导加载程序会忽略此遗留代码。
在引导扇区代码之后,我们有实模式内核头文件的前15个字节。
这两部分加起来最多为512字节,这是Intel硬件上典型磁盘扇区的大小。
在这512个字节之后,在偏移量0x200处,我们找到了作为Linux内核的一部分运行的第一条指令:实模式入口点。
它位于header.S:110中,是直接以机器代码0x3aeb写入的2字节跳转。
您可以通过在内核映像上运行hexdump并查看该偏移量处的字节来验证这一点-只是进行健全性检查,以确保这并非梦dream以求。
引导加载程序完成后会跳到该位置,然后又跳到header.S:229,这里有一个常规的汇编例程,称为start_of_setup。
这个简短的例程设置了堆栈,将实模式内核的bss段(包含静态变量的区域,因此从零值开始)归零,然后跳转至arch / x86 / boot / main上的旧C代码。
c:122。
1/* -*- linux-c -*- ------------------------------------------------------- *
2 *
3 * Copyright (C) 1991, 1992 Linus Torvalds
4 * Copyright 2007 rPath, Inc. - All Rights Reserved
5 *
6 * This file is part of the Linux kernel, and is made available under
7 * the terms of the GNU General Public License version 2.
8 *
9 * ----------------------------------------------------------------------- */
10
11/*
12 * arch/i386/boot/main.c
13 *
14 * Main module for the real-mode kernel code
15 */
16
17#include "boot.h"
18
19struct boot_params boot_params __attribute__((aligned(16)));
20
21char *HEAP = _end;
22char *heap_end = _end; /* Default end of heap = no heap */
23
24/*
25 * Copy the header into the boot parameter block. Since this
26 * screws up the old-style command line protocol, adjust by
27 * filling in the new-style command line pointer instead.
28 */
29
30static void copy_boot_params(void)
31{
32 struct old_cmdline {
33 u16 cl_magic;
34 u16 cl_offset;
35 };
36 const struct old_cmdline * const oldcmd =
37 (const struct old_cmdline *)OLD_CL_ADDRESS;
38
39 BUILD_BUG_ON(sizeof boot_params != 4096);
40 memcpy(&boot_params.hdr, &hdr, sizeof hdr);
41
42 if (!
main()做一些内部工作,例如检测内存布局,设置视频模式等。然后调用go_to_protected_mode()。
1/* -*- linux-c -*- ------------------------------------------------------- *
2 *
3 * Copyright (C) 1991, 1992 Linus Torvalds
4 * Copyright 2007 rPath, Inc. - All Rights Reserved
5 *
6 * This file is part of the Linux kernel, and is made available under
7 * the terms of the GNU General Public License version 2.
8 *
9 * ----------------------------------------------------------------------- */
10
11/*
12 * arch/i386/boot/pm.c
13 *
14 * Prepare the machine for transition to protected mode.
15 */
16
17#include "boot.h"
18#include <asm/segment.h>
19
20/*
21 * Invoke the realmode switch hook if present; otherwise
22 * disable all interrupts.
23 */
24static void realmode_switch_hook(void)
25{
26 if (boot_params.hdr.realmode_swtch) {
27 asm volatile("lcallw *%0"
28 : : "m" (boot_params.hdr.realmode_swtch)
29 : "eax", "ebx", "ecx", "edx");
30 } else {
31 asm volatile("cli");
32 outb(0x80, 0x70); /* Disable NMI */
33 io_delay();
34 }
35}
36
37/*
38 * A zImage kernel is loaded at 0x10000 but wants to run at 0x1000.
39 * A bzImage kernel is loaded and runs at 0x100000.
40 */
41static void move_kernel_around(void)
42{
43 /* Note: rely on the compile-time option here rather than
44 the LOADED_HIGH flag. The Qemu kernel loader unconditionally
45 sets the loadflags to zero. */
46#ifndef __BIG_KERNEL__
47 u16 dst_seg, src_seg;
48 u32 syssize;
49
50 dst_seg = 0x1000 >> 4;
51 src_seg = 0x10000 >> 4;
52 syssize = boot_params.hdr.syssize; /* Size in 16-byte paragraphs */
53
54 while (syssize) {
55 int paras = (syssize >= 0x1000) ? 0x1000 : syssize;
56 int dwords = paras << 2;
57
58 asm volatile("pushw %%es ; "
59 "pushw %%ds ; "
60 "movw %1,%%es ; "
61 "movw %2,%%ds ; "
62 "xorw %%di,%%di ; "
63 "xorw %%si,%%si ; "
64 "rep;movsl ; "
65 "popw %%ds ; "
66 "popw %%es"
67 : "+c" (dwords)
68 : "r" (dst_seg), "r" (src_seg)
69 : "esi", "edi");
70
71 syssize -= paras;
72 dst_seg += paras;
73 src_seg += paras;
74 }
75#endif
76}
77
78/*
79 * Disable all interrupts at the legacy PIC.
80 */
81static void mask_all_interrupts(void)
82{
83 outb(0xff, 0xa1); /* Mask all interrupts on the secondary PIC */
84 io_delay();
85 outb(0xfb, 0x21); /* Mask all but cascade on the primary PIC */
86 io_delay();
87}
88
89/*
90 * Reset IGNNE# if asserted in the FPU.
91 */
92static void reset_coprocessor(void)
93{
94 outb(0, 0xf0);
95 io_delay();
96 outb(0, 0xf1);
97 io_delay();
98}
99
100/*
101 * Set up the GDT
102 */
103#define GDT_ENTRY(flags,base,limit) \
104 (((u64)(base & 0xff000000) << 32) | \
105 ((u64)flags << 40) | \
106 ((u64)(limit & 0x00ff0000) << 32) | \
107 ((u64)(base & 0x00ffffff) << 16) | \
108 ((u64)(limit & 0x0000ffff)))
109
110struct gdt_ptr {
111 u16 len;
112 u32 ptr;
113} __attribute__((packed));
114
115static void setup_gdt(void)
116{
117 /* There are machines which are known to not boot with the GDT
118 being 8-byte unaligned. Intel recommends 16 byte alignment. */
119 static const u64 boot_gdt[] __attribute__((aligned(16))) = {
120 /* CS: code, read/execute, 4 GB, base 0 */
121 [GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff),
122 /* DS: data, read/write, 4 GB, base 0 */
123 [GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff),
124 /* TSS: 32-bit tss, 104 bytes, base 4096 */
125 /* We only have a TSS here to keep Intel VT happy;
126 we don't actually use it for anything. */
127 [GDT_ENTRY_BOOT_TSS] = GDT_ENTRY(0x0089, 4096, 103),
128 };
129 /* Xen HVM incorrectly stores a pointer to the gdt_ptr, instead
130 of the gdt_ptr contents. Thus, make it static so it will
131 stay in memory, at least long enough that we switch to the
132 proper kernel GDT. */
133 static struct gdt_ptr gdt;
134
135 gdt.len = sizeof(boot_gdt)-1;
136 gdt.ptr = (u32)&boot_gdt + (ds() << 4);
137
138 asm volatile("lgdtl %0" : : "m" (gdt));
139}
140
141/*
142 * Set up the IDT
143 */
144static void setup_idt(void)
145{
146 static const struct gdt_ptr null_idt = {0, 0};
147 asm volatile("lidtl %0" : : "m" (null_idt));
148}
149
150/*
151 * Actual invocation sequence
152 */
153void go_to_protected_mode(void)
154{
155 /* Hook before leaving real mode, also disables interrupts */
156 realmode_switch_hook();
157
158 /* Move the kernel/setup to their final resting places */
159 move_kernel_around();
160
161 /* Enable the A20 gate */
162 if (enable_a20()) {
163 puts("A20 gate not responding, unable to boot...\n");
164 die();
165 }
166
167 /* Reset coprocessor (IGNNE#) */
168 reset_coprocessor();
169
170 /* Mask all interrupts in the PIC */
171 mask_all_interrupts();
172
173 /* Actual transition to protected mode... */
174 setup_idt();
175 setup_gdt();
176 protected_mode_jump(boot_params.hdr.code32_start,
177 (u32)&boot_params + (ds() << 4));
178}
179
但是,在将CPU设置为保护模式之前,必须完成一些任务。
主要有两个问题:中断和内存。
在实模式下,处理器的中断向量表(interrupt vector table)始终位于内存地址0,而在保护模式下,中断向量表的位置存储在称为IDTR的CPU寄存器中。
同时,在实模式和保护模式之间,逻辑存储器地址(程序要处理的)到线性存储器地址(从0到存储器顶部的原始数字)的转换是不同的。
保护模式要求将名为GDTR的寄存器装入全局描述符表( Global Descriptor Table )的地址以进行存储。
因此,go_to_protected_mode()调用setup_idt()和setup_gdt()【位于pm.c中】来安装临时中断描述符表和全局描述符表。
现在我们准备进入保护模式,这是由另一个汇编例程(pmjump.S) protected_mode_jump完成的。
1/* ----------------------------------------------------------------------- *
2 *
3 * Copyright (C) 1991, 1992 Linus Torvalds
4 * Copyright 2007 rPath, Inc. - All Rights Reserved
5 *
6 * This file is part of the Linux kernel, and is made available under
7 * the terms of the GNU General Public License version 2.
8 *
9 * ----------------------------------------------------------------------- */
10
11/*
12 * arch/i386/boot/pmjump.S
13 *
14 * The actual transition into protected mode
15 */
16
17#include <asm/boot.h>
18#include <asm/processor-flags.h>
19#include <asm/segment.h>
20
21 .text
22
23 .globl protected_mode_jump
24 .type protected_mode_jump, @function
25
26 .code16
27
28/*
29 * void protected_mode_jump(u32 entrypoint, u32 bootparams);
30 */
31protected_mode_jump:
32 movl %edx, %esi # Pointer to boot_params table
33
34 xorl %ebx, %ebx
35 movw %cs, %bx
36 shll $4, %ebx
37 addl %ebx, 2f
38
39 movw $__BOOT_DS, %cx
40 movw $__BOOT_TSS, %di
41
42 movl %cr0, %edx
43 orb $X86_CR0_PE, %dl # Protected mode
44 movl %edx, %cr0
45 jmp 1f # Short jump to serialize on 386/486
461:
47
48 # Transition to 32-bit mode
49 .byte 0x66, 0xea # ljmpl opcode
502: .long in_pm32 # offset
51 .word __BOOT_CS # segment
52
53 .size protected_mode_jump, .-protected_mode_jump
54
55 .code32
56 .type in_pm32, @function
57in_pm32:
58 # Set up data segments for flat 32-bit mode
59 movl %ecx, %ds
60 movl %ecx, %es
61 movl %ecx, %fs
62 movl %ecx, %gs
63 movl %ecx, %ss
64 # The 32-bit code sets up its own stack, but this way we do have
65 # a valid stack if some debugging hack wants to use it.
66 addl %ebx, %esp
67
68 # Set up TR to make Intel VT happy
69 ltr %di
70
71 # Clear registers to allow for future extensions to the
72 # 32-bit boot protocol
73 xorl %ecx, %ecx
74 xorl %edx, %edx
75 xorl %ebx, %ebx
76 xorl %ebp, %ebp
77 xorl %edi, %edi
78
79 # Set up LDTR to make Intel VT happy
80 lldt %cx
81
82 jmpl *%eax # Jump to the 32-bit entrypoint
83
84 .size in_pm32, .-in_pm32
85
该例程通过将CR0 CPU寄存器中的PE位置1来启用保护模式。
至此,我们正在禁用分页运行;
分页是处理器的一项可选功能,即使是在保护模式下,也不需要它。
重要的是,我们不再局限于640K障碍,现在可以处理多达4GB的RAM。
然后,例程调用32位内核入口点,对于压缩内核,该入口点为startup_32。
1/*
2 * linux/boot/head.S
3 *
4 * Copyright (C) 1991, 1992, 1993 Linus Torvalds
5 */
6
7/*
8 * head.S contains the 32-bit startup code.
9 *
10 * NOTE!!! Startup happens at absolute address 0x00001000, which is also where
11 * the page directory will exist. The startup code will be overwritten by
12 * the page directory. [According to comments etc elsewhere on a compressed
13 * kernel it will end up at 0x1000 + 1Mb I hope so as I assume this. - AC]
14 *
15 * Page 0 is deliberately kept safe, since System Management Mode code in
16 * laptops may need to access the BIOS data stored there. This is also
17 * useful for future device drivers that either access the BIOS via VM86
18 * mode.
19 */
20
21/*
22 * High loaded stuff by Hans Lermen & Werner Almesberger, Feb. 1996
23 */
24.text
25
26#include <linux/linkage.h>
27#include <asm/segment.h>
28#include <asm/page.h>
29#include <asm/boot.h>
30#include <asm/asm-offsets.h>
31
32.section ".text.head","ax",@progbits
33 .globl startup_32
34
35startup_32:
36 cld
37 /* test KEEP_SEGMENTS flag to see if the bootloader is asking
38 * us to not reload segments */
39 testb $(1<<6), BP_loadflags(%esi)
40 jnz 1f
41
42 cli
43 movl $(__BOOT_DS),%eax
44 movl %eax,%ds
45 movl %eax,%es
46 movl %eax,%fs
47 movl %eax,%gs
48 movl %eax,%ss
491:
50
51/* Calculate the delta between where we were compiled to run
52 * at and where we were actually loaded at. This can only be done
53 * with a short local call on x86. Nothing else will tell us what
54 * address we are running at. The reserved chunk of the real-mode
55 * data at 0x1e4 (defined as a scratch field) are used as the stack
56 * for this calculation. Only 4 bytes are needed.
57 */
58 leal (0x1e4+4)(%esi), %esp
59 call 1f
601: popl %ebp
61 subl $1b, %ebp
62
63/* %ebp contains the address we are loaded at by the boot loader and %ebx
64 * contains the address where we should move the kernel image temporarily
65 * for safe in-place decompression.
66 */
67
68#ifdef CONFIG_RELOCATABLE
69 movl %ebp, %ebx
70 addl $(CONFIG_PHYSICAL_ALIGN - 1), %ebx
71 andl $(~(CONFIG_PHYSICAL_ALIGN - 1)), %ebx
72#else
73 movl $LOAD_PHYSICAL_ADDR, %ebx
74#endif
75
76 /* Replace the compressed data size with the uncompressed size */
77 subl input_len(%ebp), %ebx
78 movl output_len(%ebp), %eax
79 addl %eax, %ebx
80 /* Add 8 bytes for every 32K input block */
81 shrl $12, %eax
82 addl %eax, %ebx
83 /* Add 32K + 18 bytes of extra slack */
84 addl $(32768 + 18), %ebx
85 /* Align on a 4K boundary */
86 addl $4095, %ebx
87 andl $~4095, %ebx
88
89/* Copy the compressed kernel to the end of our buffer
90 * where decompression in place becomes safe.
91 */
92 pushl %esi
93 leal _end(%ebp), %esi
94 leal _end(%ebx), %edi
95 movl $(_end - startup_32), %ecx
96 std
97 rep
98 movsb
99 cld
100 popl %esi
101
102/* Compute the kernel start address.
103 */
104#ifdef CONFIG_RELOCATABLE
105 addl $(CONFIG_PHYSICAL_ALIGN - 1), %ebp
106 andl $(~(CONFIG_PHYSICAL_ALIGN - 1)), %ebp
107#else
108 movl $LOAD_PHYSICAL_ADDR, %ebp
109#endif
110
111/*
112 * Jump to the relocated address.
113 */
114 leal relocated(%ebx), %eax
115 jmp *%eax
116.section ".text"
117relocated:
118
119/*
120 * Clear BSS
121 */
122 xorl %eax,%eax
123 leal _edata(%ebx),%edi
124 leal _end(%ebx), %ecx
125 subl %edi,%ecx
126 cld
127 rep
128 stosb
129
130/*
131 * Setup the stack for the decompressor
132 */
133 leal stack_end(%ebx), %esp
134
135/*
136 * Do the decompression, and jump to the new kernel..
137 */
138 movl output_len(%ebx), %eax
139 pushl %eax
140 pushl %ebp # output address
141 movl input_len(%ebx), %eax
142 pushl %eax # input_len
143 leal input_data(%ebx), %eax
144 pushl %eax # input_data
145 leal _end(%ebx), %eax
146 pushl %eax # end of the image as third argument
147 pushl %esi # real mode pointer as second arg
148 call decompress_kernel
149 addl $20, %esp
150 popl %ecx
151
152#if CONFIG_RELOCATABLE
153/* Find the address of the relocations.
154 */
155 movl %ebp, %edi
156 addl %ecx, %edi
157
158/* Calculate the delta between where vmlinux was compiled to run
159 * and where it was actually loaded.
160 */
161 movl %ebp, %ebx
162 subl $LOAD_PHYSICAL_ADDR, %ebx
163 jz 2f /* Nothing to be done if loaded at compiled addr. */
164/*
165 * Process relocations.
166 */
167
1681: subl $4, %edi
169 movl 0(%edi), %ecx
170 testl %ecx, %ecx
171 jz 2f
172 addl %ebx, -__PAGE_OFFSET(%ebx, %ecx)
173 jmp 1b
1742:
175#endif
176
177/*
178 * Jump to the decompressed kernel.
179 */
180 xorl %ebx,%ebx
181 jmp *%ebp
182
183.bss
184.balign 4
185stack:
186 .fill 4096, 1, 0
187stack_end:
188
该例程执行一些基本的寄存器初始化,并调用decompress_kernel(),这是C函数进行实际的解压缩。
368asmlinkage void decompress_kernel(void *rmode, memptr heap,
369 uch *input_data, unsigned long input_len,
370 uch *output)
371{
372 real_mode = rmode;
373
374 if (RM_SCREEN_INFO.orig_video_mode == 7) {
375 vidmem = (char *) 0xb0000;
376 vidport = 0x3b4;
377 } else {
378 vidmem = (char *) 0xb8000;
379 vidport = 0x3d4;
380 }
381
382 lines = RM_SCREEN_INFO.orig_video_lines;
383 cols = RM_SCREEN_INFO.orig_video_cols;
384
385 window = output; /* Output buffer (Normally at 1M) */
386 free_mem_ptr = heap; /* Heap */
387 free_mem_end_ptr = heap + HEAP_SIZE;
388 inbuf = input_data; /* Input buffer */
389 insize = input_len;
390 inptr = 0;
391
392#ifdef CONFIG_X86_64
393 if ((ulg)output & (__KERNEL_ALIGN - 1))
394 error("Destination address not 2M aligned");
395 if ((ulg)output >= 0xffffffffffUL)
396 error("Destination address too large");
397#else
398 if ((u32)output & (CONFIG_PHYSICAL_ALIGN -1))
399 error("Destination address not CONFIG_PHYSICAL_ALIGN aligned");
400 if (heap > ((-__PAGE_OFFSET-(512<<20)-1) & 0x7fffffff))
401 error("Destination address too large");
402#ifndef CONFIG_RELOCATABLE
403 if ((u32)output != LOAD_PHYSICAL_ADDR)
404 error("Wrong destination address");
405#endif
406#endif
407
408 makecrc();
409 putstr("\nDecompressing Linux... ");
410 gunzip();
411 putstr("done.\nBooting the kernel.\n");
412 return;
413}
414
decompress_kernel()打印熟悉的“正在解压缩Linux …”消息。
解压缩就地进行,完成后,未压缩的内核映像将覆盖第一个图中所示的压缩映像。
因此,未压缩的内容也从1MB开始。
然后,decompress_kernel()打印“完成”。
和令人欣慰的“引导内核”。
“启动”是指跳到整个故事的最后一个入口点,由上帝本人在山Halti上为Linus赋予了Linus,这是第二兆RAM(0x100000)开头的保护模式内核入口点。
这个神圣的地方包含一个例程,呃,startup_32。
但是,您看到的是,它在另一个目录中。
1/*
2 * linux/arch/i386/kernel/head.S -- the 32-bit startup code.
3 *
4 * Copyright (C) 1991, 1992 Linus Torvalds
5 *
6 * Enhanced CPU detection and feature setting code by Mike Jagdis
7 * and Martin Mares, November 1997.
8 */
9
10.text
11#include <linux/threads.h>
12#include <linux/init.h>
13#include <linux/linkage.h>
14#include <asm/segment.h>
15#include <asm/page.h>
16#include <asm/pgtable.h>
17#include <asm/desc.h>
18#include <asm/cache.h>
19#include <asm/thread_info.h>
20#include <asm/asm-offsets.h>
21#include <asm/setup.h>
22#include <asm/processor-flags.h>
23
24/* Physical address */
25#define pa(X) ((X) - __PAGE_OFFSET)
26
27/*
28 * References to members of the new_cpu_data structure.
29 */
30
31#define X86 new_cpu_data+CPUINFO_x86
32#define X86_VENDOR new_cpu_data+CPUINFO_x86_vendor
33#define X86_MODEL new_cpu_data+CPUINFO_x86_model
34#define X86_MASK new_cpu_data+CPUINFO_x86_mask
35#define X86_HARD_MATH new_cpu_data+CPUINFO_hard_math
36#define X86_CPUID new_cpu_data+CPUINFO_cpuid_level
37#define X86_CAPABILITY new_cpu_data+CPUINFO_x86_capability
38#define X86_VENDOR_ID new_cpu_data+CPUINFO_x86_vendor_id
39
40/*
41 * This is how much memory *in addition to the memory covered up to
42 * and including _end* we need mapped initially.
43 * We need:
44 * - one bit for each possible page, but only in low memory, which means
45 * 2^32/4096/8 = 128K worst case (4G/4G split.)
46 * - enough space to map all low memory, which means
47 * (2^32/4096) / 1024 pages (worst case, non PAE)
48 * (2^32/4096) / 512 + 4 pages (worst case for PAE)
49 * - a few pages for allocator use before the kernel pagetable has
50 * been set up
51 *
52 * Modulo rounding, each megabyte assigned here requires a kilobyte of
53 * memory, which is currently unreclaimed.
54 *
55 * This should be a multiple of a page.
56 */
57LOW_PAGES = 1<<(32-PAGE_SHIFT_asm)
58
59/*
60 * To preserve the DMA pool in PAGEALLOC kernels, we'll allocate
61 * pagetables from above the 16MB DMA limit, so we'll have to set
62 * up pagetables 16MB more (worst-case):
63 */
64#ifdef CONFIG_DEBUG_PAGEALLOC
65LOW_PAGES = LOW_PAGES + 0x1000000
66#endif
67
68#if PTRS_PER_PMD > 1
69PAGE_TABLE_SIZE = (LOW_PAGES / PTRS_PER_PMD) + PTRS_PER_PGD
70#else
71PAGE_TABLE_SIZE = (LOW_PAGES / PTRS_PER_PGD)
72#endif
73BOOTBITMAP_SIZE = LOW_PAGES / 8
74ALLOCATOR_SLOP = 4
75
76INIT_MAP_BEYOND_END = BOOTBITMAP_SIZE + (PAGE_TABLE_SIZE + ALLOCATOR_SLOP)*PAGE_SIZE_asm
77
78/*
79 * 32-bit kernel entrypoint; only used by the boot CPU. On entry,
80 * %esi points to the real-mode code as a 32-bit pointer.
81 * CS and DS must be 4 GB flat segments, but we don't depend on
82 * any particular GDT layout, because we load our own as soon as we
83 * can.
84 */
85.section .text.head,"ax",@progbits
86ENTRY(startup_32)
87 /* test KEEP_SEGMENTS flag to see if the bootloader is asking
88 us to not reload segments */
89 testb $(1<<6), BP_loadflags(%esi)
90 jnz 2f
91
92/*
93 * Set segments to known values.
94 */
95 lgdt pa(boot_gdt_descr)
96 movl $(__BOOT_DS),%eax
97 movl %eax,%ds
98 movl %eax,%es
99 movl %eax,%fs
100 movl %eax,%gs
1012:
102
103/*
104 * Clear BSS first so that there are no surprises...
105 */
106 cld
107 xorl %eax,%eax
108 movl $pa(__bss_start),%edi
109 movl $pa(__bss_stop),%ecx
110 subl %edi,%ecx
111 shrl $2,%ecx
112 rep ; stosl
113/*
114 * Copy bootup parameters out of the way.
115 * Note: %esi still has the pointer to the real-mode data.
116 * With the kexec as boot loader, parameter segment might be loaded beyond
117 * kernel image and might not even be addressable by early boot page tables.
118 * (kexec on panic case). Hence copy out the parameters before initializing
119 * page tables.
120 */
121 movl $pa(boot_params),%edi
122 movl $(PARAM_SIZE/4),%ecx
123 cld
124 rep
125 movsl
126 movl pa(boot_params) + NEW_CL_POINTER,%esi
127 andl %esi,%esi
128 jz 1f # No comand line
129 movl $pa(boot_command_line),%edi
130 movl $(COMMAND_LINE_SIZE/4),%ecx
131 rep
132 movsl
1331:
134
135#ifdef CONFIG_PARAVIRT
136 /* This is can only trip for a broken bootloader... */
137 cmpw $0x207, pa(boot_params + BP_version)
138 jb default_entry
139
140 /* Paravirt-compatible boot parameters. Look to see what architecture
141 we're booting under. */
142 movl pa(boot_params + BP_hardware_subarch), %eax
143 cmpl $num_subarch_entries, %eax
144 jae bad_subarch
145
146 movl pa(subarch_entries)(,%eax,4), %eax
147 subl $__PAGE_OFFSET, %eax
148 jmp *%eax
149
150bad_subarch:
151WEAK(lguest_entry)
152WEAK(xen_entry)
153 /* Unknown implementation; there's really
154 nothing we can do at this point. */
155 ud2a
156
157 __INITDATA
158
159subarch_entries:
160 .long default_entry /* normal x86/PC */
161 .long lguest_entry /* lguest hypervisor */
162 .long xen_entry /* Xen hypervisor */
163num_subarch_entries = (. - subarch_entries) / 4
164.previous
165#endif /* CONFIG_PARAVIRT */
166
167/*
168 * Initialize page tables. This creates a PDE and a set of page
169 * tables, which are located immediately beyond _end. The variable
170 * init_pg_tables_end is set up to point to the first "safe" location.
171 * Mappings are created both at virtual address 0 (identity mapping)
172 * and PAGE_OFFSET for up to _end+sizeof(page tables)+INIT_MAP_BEYOND_END.
173 *
174 * Note that the stack is not yet set up!
175 */
176#define PTE_ATTR 0x007 /* PRESENT+RW+USER */
177#define PDE_ATTR 0x067 /* PRESENT+RW+USER+DIRTY+ACCESSED */
178#define PGD_ATTR 0x001 /* PRESENT (no other attributes) */
179
180default_entry:
181#ifdef CONFIG_X86_PAE
182
183 /*
184 * In PAE mode swapper_pg_dir is statically defined to contain enough
185 * entries to cover the VMSPLIT option (that is the top 1, 2 or 3
186 * entries). The identity mapping is handled by pointing two PGD
187 * entries to the first kernel PMD.
188 *
189 * Note the upper half of each PMD or PTE are always zero at
190 * this stage.
191 */
192
193#define KPMDS ((0x100000000-__PAGE_OFFSET) >> 30) /* Number of kernel PMDs */
194
195 xorl %ebx,%ebx /* %ebx is kept at zero */
196
197 movl $pa(pg0), %edi
198 movl $pa(swapper_pg_pmd), %edx
199 movl $PTE_ATTR, %eax
20010:
201 leal PDE_ATTR(%edi),%ecx /* Create PMD entry */
202 movl %ecx,(%edx) /* Store PMD entry */
203 /* Upper half already zero */
204 addl $8,%edx
205 movl $512,%ecx
20611:
207 stosl
208 xchgl %eax,%ebx
209 stosl
210 xchgl %eax,%ebx
211 addl $0x1000,%eax
212 loop 11b
213
214 /*
215 * End condition: we must map up to and including INIT_MAP_BEYOND_END
216 * bytes beyond the end of our own page tables.
217 */
218 leal (INIT_MAP_BEYOND_END+PTE_ATTR)(%edi),%ebp
219 cmpl %ebp,%eax
220 jb 10b
2211:
222 movl %edi,pa(init_pg_tables_end)
223
224 /* Do early initialization of the fixmap area */
225 movl $pa(swapper_pg_fixmap)+PDE_ATTR,%eax
226 movl %eax,pa(swapper_pg_pmd+0x1000*KPMDS-8)
227#else /* Not PAE */
228
229page_pde_offset = (__PAGE_OFFSET >> 20);
230
231 movl $pa(pg0), %edi
232 movl $pa(swapper_pg_dir), %edx
233 movl $PTE_ATTR, %eax
23410:
235 leal PDE_ATTR(%edi),%ecx /* Create PDE entry */
236 movl %ecx,(%edx) /* Store identity PDE entry */
237 movl %ecx,page_pde_offset(%edx) /* Store kernel PDE entry */
238 addl $4,%edx
239 movl $1024, %ecx
24011:
241 stosl
242 addl $0x1000,%eax
243 loop 11b
244 /*
245 * End condition: we must map up to and including INIT_MAP_BEYOND_END
246 * bytes beyond the end of our own page tables; the +0x007 is
247 * the attribute bits
248 */
249 leal (INIT_MAP_BEYOND_END+PTE_ATTR)(%edi),%ebp
250 cmpl %ebp,%eax
251 jb 10b
252 movl %edi,pa(init_pg_tables_end)
253
254 /* Do early initialization of the fixmap area */
255 movl $pa(swapper_pg_fixmap)+PDE_ATTR,%eax
256 movl %eax,pa(swapper_pg_dir+0xffc)
257#endif
258 jmp 3f
259/*
260 * Non-boot CPU entry point; entered from trampoline.S
261 * We can't lgdt here, because lgdt itself uses a data segment, but
262 * we know the trampoline has already loaded the boot_gdt for us.
263 *
264 * If cpu hotplug is not supported then this code can go in init section
265 * which will be freed later
266 */
267
268#ifndef CONFIG_HOTPLUG_CPU
269.section .init.text,"ax",@progbits
270#endif
271
272#ifdef CONFIG_SMP
273ENTRY(startup_32_smp)
274 cld
275 movl $(__BOOT_DS),%eax
276 movl %eax,%ds
277 movl %eax,%es
278 movl %eax,%fs
279 movl %eax,%gs
280#endif /* CONFIG_SMP */
2813:
282
283/*
284 * New page tables may be in 4Mbyte page mode and may
285 * be using the global pages.
286 *
287 * NOTE! If we are on a 486 we may have no cr4 at all!
288 * So we do not try to touch it unless we really have
289 * some bits in it to set. This won't work if the BSP
290 * implements cr4 but this AP does not -- very unlikely
291 * but be warned! The same applies to the pse feature
292 * if not equally supported. --macro
293 *
294 * NOTE! We have to correct for the fact that we're
295 * not yet offset PAGE_OFFSET..
296 */
297#define cr4_bits pa(mmu_cr4_features)
298 movl cr4_bits,%edx
299 andl %edx,%edx
300 jz 6f
301 movl %cr4,%eax # Turn on paging options (PSE,PAE,..)
302 orl %edx,%eax
303 movl %eax,%cr4
304
305 btl $5, %eax # check if PAE is enabled
306 jnc 6f
307
308 /* Check if extended functions are implemented */
309 movl $0x80000000, %eax
310 cpuid
311 cmpl $0x80000000, %eax
312 jbe 6f
313 mov $0x80000001, %eax
314 cpuid
315 /* Execute Disable bit supported? */
316 btl $20, %edx
317 jnc 6f
318
319 /* Setup EFER (Extended Feature Enable Register) */
320 movl $0xc0000080, %ecx
321 rdmsr
322
323 btsl $11, %eax
324 /* Make changes effective */
325 wrmsr
326
3276:
328
329/*
330 * Enable paging
331 */
332 movl $pa(swapper_pg_dir),%eax
333 movl %eax,%cr3 /* set the page table pointer.. */
334 movl %cr0,%eax
335 orl $X86_CR0_PG,%eax
336 movl %eax,%cr0 /* ..and set paging (PG) bit */
337 ljmp $__BOOT_CS,$1f /* Clear prefetch and normalize %eip */
3381:
339 /* Set up the stack pointer */
340 lss stack_start,%esp
341
342/*
343 * Initialize eflags. Some BIOS's leave bits like NT set. This would
344 * confuse the debugger if this code is traced.
345 * XXX - best to initialize before switching to protected mode.
346 */
347 pushl $0
348 popfl
349
350#ifdef CONFIG_SMP
351 cmpb $0, ready
352 jz 1f /* Initial CPU cleans BSS */
353 jmp checkCPUtype
3541:
355#endif /* CONFIG_SMP */
356
357/*
358 * start system 32-bit setup. We need to re-do some of the things done
359 * in 16-bit mode for the "real" operations.
360 */
361 call setup_idt
362
363checkCPUtype:
364
365 movl $-1,X86_CPUID # -1 for no CPUID initially
366
367/* check if it is 486 or 386. */
368/*
369 * XXX - this does a lot of unnecessary setup. Alignment checks don't
370 * apply at our cpl of 0 and the stack ought to be aligned already, and
371 * we don't need to preserve eflags.
372 */
373
374 movb $3,X86 # at least 386
375 pushfl # push EFLAGS
376 popl %eax # get EFLAGS
377 movl %eax,%ecx # save original EFLAGS
378 xorl $0x240000,%eax # flip AC and ID bits in EFLAGS
379 pushl %eax # copy to EFLAGS
380 popfl # set EFLAGS
381 pushfl # get new EFLAGS
382 popl %eax # put it in eax
383 xorl %ecx,%eax # change in flags
384 pushl %ecx # restore original EFLAGS
385 popfl
386 testl $0x40000,%eax # check if AC bit changed
387 je is386
388
389 movb $4,X86 # at least 486
390 testl $0x200000,%eax # check if ID bit changed
391 je is486
392
393 /* get vendor info */
394 xorl %eax,%eax # call CPUID with 0 -> return vendor ID
395 cpuid
396 movl %eax,X86_CPUID # save CPUID level
397 movl %ebx,X86_VENDOR_ID # lo 4 chars
398 movl %edx,X86_VENDOR_ID+4 # next 4 chars
399 movl %ecx,X86_VENDOR_ID+8 # last 4 chars
400
401 orl %eax,%eax # do we have processor info as well?
402 je is486
403
404 movl $1,%eax # Use the CPUID instruction to get CPU type
405 cpuid
406 movb %al,%cl # save reg for future use
407 andb $0x0f,%ah # mask processor family
408 movb %ah,X86
409 andb $0xf0,%al # mask model
410 shrb $4,%al
411 movb %al,X86_MODEL
412 andb $0x0f,%cl # mask mask revision
413 movb %cl,X86_MASK
414 movl %edx,X86_CAPABILITY
415
416is486: movl $0x50022,%ecx # set AM, WP, NE and MP
417 jmp 2f
418
419is386: movl $2,%ecx # set MP
4202: movl %cr0,%eax
421 andl $0x80000011,%eax # Save PG,PE,ET
422 orl %ecx,%eax
423 movl %eax,%cr0
424
425 call check_x87
426 lgdt early_gdt_descr
427 lidt idt_descr
428 ljmp $(__KERNEL_CS),$1f
4291: movl $(__KERNEL_DS),%eax # reload all the segment registers
430 movl %eax,%ss # after changing gdt.
431 movl %eax,%fs # gets reset once there's real percpu
432
433 movl $(__USER_DS),%eax # DS/ES contains default USER segment
434 movl %eax,%ds
435 movl %eax,%es
436
437 xorl %eax,%eax # Clear GS and LDT
438 movl %eax,%gs
439 lldt %ax
440
441 cld # gcc2 wants the direction flag cleared at all times
442 pushl $0 # fake return address for unwinder
443#ifdef CONFIG_SMP
444 movb ready, %cl
445 movb $1, ready
446 cmpb $0,%cl # the first CPU calls start_kernel
447 je 1f
448 movl $(__KERNEL_PERCPU), %eax
449 movl %eax,%fs # set this cpu's percpu
450 jmp initialize_secondary # all other CPUs call initialize_secondary
4511:
452#endif /* CONFIG_SMP */
453 jmp start_kernel
454
455/*
456 * We depend on ET to be correct. This checks for 287/387.
457 */
458check_x87:
459 movb $0,X86_HARD_MATH
460 clts
461 fninit
462 fstsw %ax
463 cmpb $0,%al
464 je 1f
465 movl %cr0,%eax /* no coprocessor: have to set bits */
466 xorl $4,%eax /* set EM */
467 movl %eax,%cr0
468 ret
469 ALIGN
4701: movb $1,X86_HARD_MATH
471 .byte 0xDB,0xE4 /* fsetpm for 287, ignored by 387 */
472 ret
473
474/*
475 * setup_idt
476 *
477 * sets up a idt with 256 entries pointing to
478 * ignore_int, interrupt gates. It doesn't actually load
479 * idt - that can be done only after paging has been enabled
480 * and the kernel moved to PAGE_OFFSET. Interrupts
481 * are enabled elsewhere, when we can be relatively
482 * sure everything is ok.
483 *
484 * Warning: %esi is live across this function.
485 */
486setup_idt:
487 lea ignore_int,%edx
488 movl $(__KERNEL_CS << 16),%eax
489 movw %dx,%ax /* selector = 0x0010 = cs */
490 movw $0x8E00,%dx /* interrupt gate - dpl=0, present */
491
492 lea idt_table,%edi
493 mov $256,%ecx
494rp_sidt:
495 movl %eax,(%edi)
496 movl %edx,4(%edi)
497 addl $8,%edi
498 dec %ecx
499 jne rp_sidt
500
501.macro set_early_handler handler,trapno
502 lea \handler,%edx
503 movl $(__KERNEL_CS << 16),%eax
504 movw %dx,%ax
505 movw $0x8E00,%dx /* interrupt gate - dpl=0, present */
506 lea idt_table,%edi
507 movl %eax,8*\trapno(%edi)
508 movl %edx,8*\trapno+4(%edi)
509.endm
510
511 set_early_handler handler=early_divide_err,trapno=0
512 set_early_handler handler=early_illegal_opcode,trapno=6
513 set_early_handler handler=early_protection_fault,trapno=13
514 set_early_handler handler=early_page_fault,trapno=14
515
516 ret
517
518early_divide_err:
519 xor %edx,%edx
520 pushl $0 /* fake errcode */
521 jmp early_fault
522
523early_illegal_opcode:
524 movl $6,%edx
525 pushl $0 /* fake errcode */
526 jmp early_fault
527
528early_protection_fault:
529 movl $13,%edx
530 jmp early_fault
531
532early_page_fault:
533 movl $14,%edx
534 jmp early_fault
535
536early_fault:
537 cld
538#ifdef CONFIG_PRINTK
539 pusha
540 movl $(__KERNEL_DS),%eax
541 movl %eax,%ds
542 movl %eax,%es
543 cmpl $2,early_recursion_flag
544 je hlt_loop
545 incl early_recursion_flag
546 movl %cr2,%eax
547 pushl %eax
548 pushl %edx /* trapno */
549 pushl $fault_msg
550#ifdef CONFIG_EARLY_PRINTK
551 call early_printk
552#else
553 call printk
554#endif
555#endif
556 call dump_stack
557hlt_loop:
558 hlt
559 jmp hlt_loop
560
561/* This is the default interrupt "handler" :-) */
562 ALIGN
563ignore_int:
564 cld
565#ifdef CONFIG_PRINTK
566 pushl %eax
567 pushl %ecx
568 pushl %edx
569 pushl %es
570 pushl %ds
571 movl $(__KERNEL_DS),%eax
572 movl %eax,%ds
573 movl %eax,%es
574 cmpl $2,early_recursion_flag
575 je hlt_loop
576 incl early_recursion_flag
577 pushl 16(%esp)
578 pushl 24(%esp)
579 pushl 32(%esp)
580 pushl 40(%esp)
581 pushl $int_msg
582#ifdef CONFIG_EARLY_PRINTK
583 call early_printk
584#else
585 call printk
586#endif
587 addl $(5*4),%esp
588 popl %ds
589 popl %es
590 popl %edx
591 popl %ecx
592 popl %eax
593#endif
594 iret
595
596.section .text
597/*
598 * Real beginning of normal "text" segment
599 */
600ENTRY(stext)
601ENTRY(_stext)
602
603/*
604 * BSS section
605 */
606.section ".bss.page_aligned","wa"
607 .align PAGE_SIZE_asm
608#ifdef CONFIG_X86_PAE
609swapper_pg_pmd:
610 .fill 1024*KPMDS,4,0
611#else
612ENTRY(swapper_pg_dir)
613 .fill 1024,4,0
614#endif
615swapper_pg_fixmap:
616 .fill 1024,4,0
617ENTRY(empty_zero_page)
618 .fill 4096,1,0
619/*
620 * This starts the data section.
621 */
622#ifdef CONFIG_X86_PAE
623.section ".data.page_aligned","wa"
624 /* Page-aligned for the benefit of paravirt? */
625 .align PAGE_SIZE_asm
626ENTRY(swapper_pg_dir)
627 .long pa(swapper_pg_pmd+PGD_ATTR),0 /* low identity map */
628# if KPMDS == 3
629 .long pa(swapper_pg_pmd+PGD_ATTR),0
630 .long pa(swapper_pg_pmd+PGD_ATTR+0x1000),0
631 .long pa(swapper_pg_pmd+PGD_ATTR+0x2000),0
632# elif KPMDS == 2
633 .long 0,0
634 .long pa(swapper_pg_pmd+PGD_ATTR),0
635 .long pa(swapper_pg_pmd+PGD_ATTR+0x1000),0
636# elif KPMDS == 1
637 .long 0,0
638 .long 0,0
639 .long pa(swapper_pg_pmd+PGD_ATTR),0
640# else
641# error "Kernel PMDs should be 1, 2 or 3"
642# endif
643 .align PAGE_SIZE_asm /* needs to be page-sized too */
644#endif
645
646.data
647ENTRY(stack_start)
648 .long init_thread_union+THREAD_SIZE
649 .long __BOOT_DS
650
651ready: .byte 0
652
653early_recursion_flag:
654 .long 0
655
656int_msg:
657 .asciz "Unknown interrupt or fault at EIP %p %p %p\n"
658
659fault_msg:
660 .asciz \
661/* fault info: */ "BUG: Int %d: CR2 %p\n" \
662/* pusha regs: */ " EDI %p ESI %p EBP %p ESP %p\n" \
663 " EBX %p EDX %p ECX %p EAX %p\n" \
664/* fault frame: */ " err %p EIP %p CS %p flg %p\n" \
665 \
666 "Stack: %p %p %p %p %p %p %p %p\n" \
667 " %p %p %p %p %p %p %p %p\n" \
668 " %p %p %p %p %p %p %p %p\n"
669
670#include "../../x86/xen/xen-head.S"
671
672/*
673 * The IDT and GDT 'descriptors' are a strange 48-bit object
674 * only used by the lidt and lgdt instructions. They are not
675 * like usual segment descriptors - they consist of a 16-bit
676 * segment size, and 32-bit linear address value:
677 */
678
679.globl boot_gdt_descr
680.globl idt_descr
681
682 ALIGN
683# early boot GDT descriptor (must use 1:1 address mapping)
684 .word 0 # 32 bit align gdt_desc.address
685boot_gdt_descr:
686 .word __BOOT_DS+7
687 .long boot_gdt - __PAGE_OFFSET
688
689 .word 0 # 32-bit align idt_desc.address
690idt_descr:
691 .word IDT_ENTRIES*8-1 # idt contains 256 entries
692 .long idt_table
693
694# boot GDT descriptor (later on used by CPU#0):
695 .word 0 # 32 bit align gdt_desc.address
696ENTRY(early_gdt_descr)
697 .word GDT_ENTRIES*8-1
698 .long per_cpu__gdt_page /* Overwritten for secondary CPUs */
699
700/*
701 * The boot_gdt must mirror the equivalent in setup.S and is
702 * used only for booting.
703 */
704 .align L1_CACHE_BYTES
705ENTRY(boot_gdt)
706 .fill GDT_ENTRY_BOOT_CS,8,0
707 .quad 0x00cf9a000000ffff /* kernel 4GB code at 0x00000000 */
708 .quad 0x00cf92000000ffff /* kernel 4GB data at 0x00000000 */
startup_32的第二种形式也是一个汇编例程,但是它包含32位模式的初始化。
它将清除保护模式内核的bss段(这是真正的内核,它将在机器重新启动或关闭之前运行),设置最终的内存全局描述符表,构建页表以便可以打开分页,启用分页,初始化堆栈,创建最终的中断描述符表,最后跳转到与体系结构无关的内核启动start_kernel()。
下图显示了引导最后一站的代码流:
与体系结构无关的Linux内核初始化
1/*
2 * linux/init/main.c
3 *
4 * Copyright (C) 1991, 1992 Linus Torvalds
5 *
6 * GK 2/5/95 - Changed to support mounting root fs via NFS
7 * Added initrd & change_root: Werner Almesberger & Hans Lermen, Feb '96
8 * Moan early if gcc is old, avoiding bogus kernels - Paul Gortmaker, May '96
9 * Simplified starting of init: Michael A. Griffith
10 */
11
12#include <linux/types.h>
13#include <linux/module.h>
14#include <linux/proc_fs.h>
15#include <linux/kernel.h>
16#include <linux/syscalls.h>
17#include <linux/string.h>
18#include <linux/ctype.h>
19#include <linux/delay.h>
20#include <linux/utsname.h>
21#include <linux/ioport.h>
22#include <linux/init.h>
23#include <linux/smp_lock.h>
24#include <linux/initrd.h>
25#include <linux/hdreg.h>
26#include <linux/bootmem.h>
27#include <linux/tty.h>
28#include <linux/gfp.h>
29#include <linux/percpu.h>
30#include <linux/kmod.h>
31#include <linux/kernel_stat.h>
32#include <linux/start_kernel.h>
33#include <linux/security.h>
34#include <linux/workqueue.h>
35#include <linux/profile.h>
36#include <linux/rcupdate.h>
37#include <linux/moduleparam.h>
38#include <linux/kallsyms.h>
39#include <linux/writeback.h>
40#include <linux/cpu.h>
41#include <linux/cpuset.h>
42#include <linux/cgroup.h>
43#include <linux/efi.h>
44#include <linux/tick.h>
45#include <linux/interrupt.h>
46#include <linux/taskstats_kern.h>
47#include <linux/delayacct.h>
48#include <linux/unistd.h>
49#include <linux/rmap.h>
50#include <linux/mempolicy.h>
51#include <linux/key.h>
52#include <linux/unwind.h>
53#include <linux/buffer_head.h>
54#include <linux/debug_locks.h>
55#include <linux/lockdep.h>
56#include <linux/pid_namespace.h>
57#include <linux/device.h>
58#include <linux/kthread.h>
59#include <linux/sched.h>
60#include <linux/signal.h>
61
62#include <asm/io.h>
63#include <asm/bugs.h>
64#include <asm/setup.h>
65#include <asm/sections.h>
66#include <asm/cacheflush.h>
67
68#ifdef CONFIG_X86_LOCAL_APIC
69#include <asm/smp.h>
70#endif
71
72/*
73 * This is one of the first .c files built. Error out early if we have compiler
74 * trouble.
75 */
76
77#if __GNUC__ == 4 && __GNUC_MINOR__ == 1 && __GNUC_PATCHLEVEL__ == 0
78#warning gcc-4.1.0 is known to miscompile the kernel. A different compiler version is recommended.
79#endif
80
81static int kernel_init(void *);
82
83extern void init_IRQ(void);
84extern void fork_init(unsigned long);
85extern void mca_init(void);
86extern void sbus_init(void);
87extern void pidhash_init(void);
88extern void pidmap_init(void);
89extern void prio_tree_init(void);
90extern void radix_tree_init(void);
91extern void free_initmem(void);
92#ifdef CONFIG_ACPI
93extern void acpi_early_init(void);
94#else
95static inline void acpi_early_init(void) { }
96#endif
97#ifndef CONFIG_DEBUG_RODATA
98static inline void mark_rodata_ro(void) { }
99#endif
100
101#ifdef CONFIG_TC
102extern void tc_init(void);
103#endif
104
105enum system_states system_state;
106EXPORT_SYMBOL(system_state);
107
108/*
109 * Boot command-line arguments
110 */
111#define MAX_INIT_ARGS CONFIG_INIT_ENV_ARG_LIMIT
112#define MAX_INIT_ENVS CONFIG_INIT_ENV_ARG_LIMIT
113
114extern void time_init(void);
115/* Default late time init is NULL. archs can override this later. */
116void (*late_time_init)(void);
117extern void softirq_init(void);
118
119/* Untouched command line saved by arch-specific code. */
120char __initdata boot_command_line[COMMAND_LINE_SIZE];
121/* Untouched saved command line (eg. for /proc) */
122char *saved_command_line;
123/* Command line for parameter parsing */
124static char *static_command_line;
125
126static char *execute_command;
127static char *ramdisk_execute_command;
128
129#ifdef CONFIG_SMP
130/* Setup configured maximum number of CPUs to activate */
131unsigned int __initdata setup_max_cpus = NR_CPUS;
132
133/*
134 * Setup routine for controlling SMP activation
135 *
136 * Command-line option of "nosmp" or "maxcpus=0" will disable SMP
137 * activation entirely (the MPS table probe still happens, though).
138 *
139 * Command-line option of "maxcpus=", where is an integer
140 * greater than 0, limits the maximum number of CPUs activated in
141 * SMP mode to .
142 */
143#ifndef CONFIG_X86_IO_APIC
144static inline void disable_ioapic_setup(void) {};
145#endif
146
147static int __init nosmp(char *str)
148{
149 setup_max_cpus = 0;
150 disable_ioapic_setup();
151 return 0;
152}
153
154early_param("nosmp", nosmp);
155
156static int __init maxcpus(char *str)
157{
158 get_option(&str, &setup_max_cpus);
159 if (setup_max_cpus == 0)
160 disable_ioapic_setup();
161
162 return 0;
163}
164
165early_param("maxcpus", maxcpus);
166#else
167#define setup_max_cpus NR_CPUS
168#endif
169
170/*
171 * If set, this is an indication to the drivers that reset the underlying
172 * device before going ahead with the initialization otherwise driver might
173 * rely on the BIOS and skip the reset operation.
174 *
175 * This is useful if kernel is booting in an unreliable environment.
176 * For ex. kdump situaiton where previous kernel has crashed, BIOS has been
177 * skipped and devices will be in unknown state.
178 */
179unsigned int reset_devices;
180EXPORT_SYMBOL(reset_devices);
181
182static int __init set_reset_devices(char *str)
183{
184 reset_devices = 1;
185 return 1;
186}
187
188__setup("reset_devices", set_reset_devices);
189
190static char * argv_init[MAX_INIT_ARGS+2] = { "init", NULL, };
191char * envp_init[MAX_INIT_ENVS+2] = { "HOME=/", "TERM=linux", NULL, };
192static const char *panic_later, *panic_param;
193
194extern struct obs_kernel_param __setup_start[], __setup_end[];
195
196static int __init obsolete_checksetup(char *line)
197{
198 struct obs_kernel_param *p;
199 int had_early_param = 0;
200
201 p = __setup_start;
202 do {
203 int n = strlen(p->str);
204 if (!strncmp(line, p->str, n)) {
205 if (p->early) {
206 /* Already done in parse_early_param?
207 * (Needs exact match on param part).
208 * Keep iterating, as we can have early
209 * params and __setups of same names 8( */
210 if (line[n] == '\0' || line[n] == '=')
211 had_early_param = 1;
212 } else if (!p->setup_func) {
213 printk(KERN_WARNING "Parameter %s is obsolete,"
214 " ignored\n", p->str);
215 return 1;
216 } else if (p->setup_func(line + n))
217 return 1;
218 }
219 p++;
220 } while (p < __setup_end);
221
222 return had_early_param;
223}
224
225/*
226 * This should be approx 2 Bo*oMips to start (note initial shift), and will
227 * still work even if initially too large, it will just take slightly longer
228 */
229unsigned long loops_per_jiffy = (1<<12);
230
231EXPORT_SYMBOL(loops_per_jiffy);
232
233static int __init debug_kernel(char *str)
234{
235 console_loglevel = 10;
236 return 0;
237}
238
239static int __init quiet_kernel(char *str)
240{
241 console_loglevel = 4;
242 return 0;
243}
244
245early_param("debug", debug_kernel);
246early_param("quiet", quiet_kernel);
247
248static int __init loglevel(char *str)
249{
250 get_option(&str, &console_loglevel);
251 return 0;
252}
253
254early_param("loglevel", loglevel);
255
256/*
257 * Unknown boot options get handed to init, unless they look like
258 * failed parameters
259 */
260static int __init unknown_bootoption(char *param, char *val)
261{
262 /* Change NUL term back to "=", to make "param" the whole string. */
263 if (val) {
264 /* param=val or param="val"? */
265 if (val == param+strlen(param)+1)
266 val[-1] = '=';
267 else if (val == param+strlen(param)+2) {
268 val[-2] = '=';
269 memmove(val-1, val, strlen(val)+1);
270 val--;
271 } else
272 BUG();
273 }
274
275 /* Handle obsolete-style parameters */
276 if (obsolete_checksetup(param))
277 return 0;
278
279 /*
280 * Preemptive maintenance for "why didn't my misspelled command
281 * line work?"
282 */
283 if (strchr(param, '.') && (!val || strchr(param, '.') < val)) {
284 printk(KERN_ERR "Unknown boot option `%s': ignoring\n", param);
285 return 0;
286 }
287
288 if (panic_later)
289 return 0;
290
291 if (val) {
292 /* Environment option */
293 unsigned int i;
294 for (i = 0; envp_init[i]; i++) {
295 if (i == MAX_INIT_ENVS) {
296 panic_later = "Too many boot env vars at `%s'";
297 panic_param = param;
298 }
299 if (!strncmp(param, envp_init[i], val - param))
300 break;
301 }
302 envp_init[i] = param;
303 } else {
304 /* Command line option */
305 unsigned int i;
306 for (i = 0; argv_init[i]; i++) {
307 if (i == MAX_INIT_ARGS) {
308 panic_later = "Too many boot init vars at `%s'";
309 panic_param = param;
310 }
311 }
312 argv_init[i] = param;
313 }
314 return 0;
315}
316
317#ifdef CONFIG_DEBUG_PAGEALLOC
318int __read_mostly debug_pagealloc_enabled = 0;
319#endif
320
321static int __init init_setup(char *str)
322{
323 unsigned int i;
324
325 execute_command = str;
326 /*
327 * In case LILO is going to boot us with default command line,
328 * it prepends "auto" before the whole cmdline which makes
329 * the shell think it should execute a script with such name.
330 * So we ignore all arguments entered _before_ init=... [MJ]
331 */
332 for (i = 1; i < MAX_INIT_ARGS; i++)
333 argv_init[i] = NULL;
334 return 1;
335}
336__setup("init=", init_setup);
337
338static int __init rdinit_setup(char *str)
339{
340 unsigned int i;
341
342 ramdisk_execute_command = str;
343 /* See "auto" comment in init_setup */
344 for (i = 1; i < MAX_INIT_ARGS; i++)
345 argv_init[i] = NULL;
346 return 1;
347}
348__setup("rdinit=", rdinit_setup);
349
350#ifndef CONFIG_SMP
351
352#ifdef CONFIG_X86_LOCAL_APIC
353static void __init smp_init(void)
354{
355 APIC_init_uniprocessor();
356}
357#else
358#define smp_init() do { } while (0)
359#endif
360
361static inline void setup_per_cpu_areas(void) { }
362static inline void smp_prepare_cpus(unsigned int maxcpus) { }
363
364#else
365
366#ifndef CONFIG_HAVE_SETUP_PER_CPU_AREA
367unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
368
369EXPORT_SYMBOL(__per_cpu_offset);
370
371static void __init setup_per_cpu_areas(void)
372{
373 unsigned long size, i;
374 char *ptr;
375 unsigned long nr_possible_cpus = num_possible_cpus();
376
377 /* Copy section for each CPU (we discard the original) */
378 size = ALIGN(PERCPU_ENOUGH_ROOM, PAGE_SIZE);
379 ptr = alloc_bootmem_pages(size * nr_possible_cpus);
380
381 for_each_possible_cpu(i) {
382 __per_cpu_offset[i] = ptr - __per_cpu_start;
383 memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
384 ptr += size;
385 }
386}
387#endif /* CONFIG_HAVE_SETUP_PER_CPU_AREA */
388
389/* Called by boot processor to activate the rest. */
390static void __init smp_init(void)
391{
392 unsigned int cpu;
393
394 /* FIXME: This should be done in userspace --RR */
395 for_each_present_cpu(cpu) {
396 if (num_online_cpus() >= setup_max_cpus)
397 break;
398 if (!cpu_online(cpu))
399 cpu_up(cpu);
400 }
401
402 /* Any cleanup work */
403 printk(KERN_INFO "Brought up %ld CPUs\n", (long)num_online_cpus());
404 smp_cpus_done(setup_max_cpus);
405}
406
407#endif
408
409/*
410 * We need to store the untouched command line for future reference.
411 * We also need to store the touched command line since the parameter
412 * parsing is performed in place, and we should allow a component to
413 * store reference of name/value for future reference.
414 */
415static void __init setup_command_line(char *command_line)
416{
417 saved_command_line = alloc_bootmem(strlen (boot_command_line)+1);
418 static_command_line = alloc_bootmem(strlen (command_line)+1);
419 strcpy (saved_command_line, boot_command_line);
420 strcpy (static_command_line, command_line);
421}
422
423/*
424 * We need to finalize in a non-__init function or else race conditions
425 * between the root thread and the init thread may cause start_kernel to
426 * be reaped by free_initmem before the root thread has proceeded to
427 * cpu_idle.
428 *
429 * gcc-3.4 accidentally inlines this function, so use noinline.
430 */
431
432static void noinline __init_refok rest_init(void)
433 __releases(kernel_lock)
434{
435 int pid;
436
437 kernel_thread(kernel_init, NULL, CLONE_FS | CLONE_SIGHAND);
438 numa_default_policy();
439 pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
440 kthreadd_task = find_task_by_pid(pid);
441 unlock_kernel();
442
443 /*
444 * The boot idle thread must execute schedule()
445 * at least once to get things moving:
446 */
447 init_idle_bootup_task(current);
448 preempt_enable_no_resched();
449 schedule();
450 preempt_disable();
451
452 /* Call into cpu_idle with preempt disabled */
453 cpu_idle();
454}
455
456/* Check for early params. */
457static int __init do_early_param(char *param, char *val)
458{
459 struct obs_kernel_param *p;
460
461 for (p = __setup_start; p < __setup_end; p++) {
462 if ((p->early && strcmp(param, p->str) == 0) ||
463 (strcmp(param, "console") == 0 &&
464 strcmp(p->str, "earlycon") == 0)
465 ) {
466 if (p->setup_func(val) != 0)
467 printk(KERN_WARNING
468 "Malformed early option '%s'\n", param);
469 }
470 }
471 /* We accept everything at this stage. */
472 return 0;
473}
474
475/* Arch code calls this early on, or if not, just before other parsing. */
476void __init parse_early_param(void)
477{
478 static __initdata int done = 0;
479 static __initdata char tmp_cmdline[COMMAND_LINE_SIZE];
480
481 if (done)
482 return;
483
484 /* All fall through to do_early_param. */
485 strlcpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
486 parse_args("early options", tmp_cmdline, NULL, 0, do_early_param);
487 done = 1;
488}
489
490/*
491 * Activate the first processor.
492 */
493
494static void __init boot_cpu_init(void)
495{
496 int cpu = smp_processor_id();
497 /* Mark the boot cpu "present", "online" etc for SMP and UP case */
498 cpu_set(cpu, cpu_online_map);
499 cpu_set(cpu, cpu_present_map);
500 cpu_set(cpu, cpu_possible_map);
501}
502
503void __init __attribute__((weak)) smp_setup_processor_id(void)
504{
505}
506
507asmlinkage void __init start_kernel(void)
508{
509 char * command_line;
510 extern struct kernel_param __start___param[], __stop___param[];
511
512 smp_setup_processor_id();
513
514 /*
515 * Need to run as early as possible, to initialize the
516 * lockdep hash:
517 */
518 unwind_init();
519 lockdep_init();
520 cgroup_init_early();
521
522 local_irq_disable();
523 early_boot_irqs_off();
524 early_init_irq_lock_class();
525
526/*
527 * Interrupts are still disabled. Do necessary setups, then
528 * enable them
529 */
530 lock_kernel();
531 tick_init();
532 boot_cpu_init();
533 page_address_init();
534 printk(KERN_NOTICE);
535 printk(linux_banner);
536 setup_arch(&command_line);
537 setup_command_line(command_line);
538 unwind_setup();
539 setup_per_cpu_areas();
540 smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */
541
542 /*
543 * Set up the scheduler prior starting any interrupts (such as the
544 * timer interrupt). Full topology setup happens at smp_init()
545 * time - but meanwhile we still have a functioning scheduler.
546 */
547 sched_init();
548 /*
549 * Disable preemption - early bootup scheduling is extremely
550 * fragile until we cpu_idle() for the first time.
551 */
552 preempt_disable();
553 build_all_zonelists();
554 page_alloc_init();
555 printk(KERN_NOTICE "Kernel command line: %s\n", boot_command_line);
556 parse_early_param();
557 parse_args("Booting kernel", static_command_line, __start___param,
558 __stop___param - __start___param,
559 &unknown_bootoption);
560 if (!irqs_disabled()) {
561 printk(KERN_WARNING "start_kernel(): bug: interrupts were "
562 "enabled *very* early, fixing it\n");
563 local_irq_disable();
564 }
565 sort_main_extable();
566 trap_init();
567 rcu_init();
568 init_IRQ();
569 pidhash_init();
570 init_timers();
571 hrtimers_init();
572 softirq_init();
573 timekeeping_init();
574 time_init();
575 profile_init();
576 if (!irqs_disabled())
577 printk("start_kernel(): bug: interrupts were enabled early\n");
578 early_boot_irqs_on();
579 local_irq_enable();
580
581 /*
582 * HACK ALERT! This is early. We're enabling the console before
583 * we've done PCI setups etc, and console_init() must be aware of
584 * this. But we do want output early, in case something goes wrong.
585 */
586 console_init();
587 if (panic_later)
588 panic(panic_later, panic_param);
589
590 lockdep_info();
591
592 /*
593 * Need to run this when irqs are enabled, because it wants
594 * to self-test [hard/soft]-irqs on/off lock inversion bugs
595 * too:
596 */
597 locking_selftest();
598
599#ifdef CONFIG_BLK_DEV_INITRD
600 if (initrd_start && !initrd_below_start_ok &&
601 initrd_start < min_low_pfn << PAGE_SHIFT) {
602 printk(KERN_CRIT "initrd overwritten (0x%08lx < 0x%08lx) - "
603 "disabling it.\n",initrd_start,min_low_pfn << PAGE_SHIFT);
604 initrd_start = 0;
605 }
606#endif
607 vfs_caches_init_early();
608 cpuset_init_early();
609 mem_init();
610 enable_debug_pagealloc();
611 cpu_hotplug_init();
612 kmem_cache_init();
613 setup_per_cpu_pageset();
614 numa_policy_init();
615 if (late_time_init)
616 late_time_init();
617 calibrate_delay();
618 pidmap_init();
619 pgtable_cache_init();
620 prio_tree_init();
621 anon_vma_init();
622#ifdef CONFIG_X86
623 if (efi_enabled)
624 efi_enter_virtual_mode();
625#endif
626 fork_init(num_physpages);
627 proc_caches_init();
628 buffer_init();
629 unnamed_dev_init();
630 key_init();
631 security_init();
632 vfs_caches_init(num_physpages);
633 radix_tree_init();
634 signals_init();
635 /* rootfs populating might need page-writeback */
636 page_writeback_init();
637#ifdef CONFIG_PROC_FS
638 proc_root_init();
639#endif
640 cgroup_init();
641 cpuset_init();
642 taskstats_init_early();
643 delayacct_init();
644
645 check_bugs();
646
647 acpi_early_init(); /* before LAPIC and SMP init */
648
649 /* Do the rest non-__init'ed, we're now alive */
650 rest_init();
651}
652
653static int __initdata initcall_debug;
654
655static int __init initcall_debug_setup(char *str)
656{
657 initcall_debug = 1;
658 return 1;
659}
660__setup("initcall_debug", initcall_debug_setup);
661
662extern initcall_t __initcall_start[], __initcall_end[];
663
664static void __init do_initcalls(void)
665{
666 initcall_t *call;
667 int count = preempt_count();
668
669 for (call = __initcall_start; call < __initcall_end; call++) {
670 ktime_t t0, t1, delta;
671 char *msg = NULL;
672 char msgbuf[40];
673 int result;
674
675 if (initcall_debug) {
676 printk("Calling initcall 0x%p", *call);
677 print_fn_descriptor_symbol(": %s()",
678 (unsigned long) *call);
679 printk("\n");
680 t0 = ktime_get();
681 }
682
683 result = (*call)();
684
685 if (initcall_debug) {
686 t1 = ktime_get();
687 delta = ktime_sub(t1, t0);
688
689 printk("initcall 0x%p", *call);
690 print_fn_descriptor_symbol(": %s()",
691 (unsigned long) *call);
692 printk(" returned %d.\n", result);
693
694 printk("initcall 0x%p ran for %Ld msecs: ",
695 *call, (unsigned long long)delta.tv64 >> 20);
696 print_fn_descriptor_symbol("%s()\n",
697 (unsigned long) *call);
698 }
699
700 if (result && result != -ENODEV && initcall_debug) {
701 sprintf(msgbuf, "error code %d", result);
702 msg = msgbuf;
703 }
704 if (preempt_count() != count) {
705 msg = "preemption imbalance";
706 preempt_count() = count;
707 }
708 if (irqs_disabled()) {
709 msg = "disabled interrupts";
710 local_irq_enable();
711 }
712 if (msg) {
713 printk(KERN_WARNING "initcall at 0x%p", *call);
714 print_fn_descriptor_symbol(": %s()",
715 (unsigned long) *call);
716 printk(": returned with %s\n", msg);
717 }
718 }
719
720 /* Make sure there is no pending stuff from the initcall sequence */
721 flush_scheduled_work();
722}
723
724/*
725 * Ok, the machine is now initialized. None of the devices
726 * have been touched yet, but the CPU subsystem is up and
727 * running, and memory and process management works.
728 *
729 * Now we can finally start doing some real work..
730 */
731static void __init do_basic_setup(void)
732{
733 /* drivers will send hotplug events */
734 init_workqueues();
735 usermodehelper_init();
736 driver_init();
737 init_irq_proc();
738 do_initcalls();
739}
740
741static int __initdata nosoftlockup;
742
743static int __init nosoftlockup_setup(char *str)
744{
745 nosoftlockup = 1;
746 return 1;
747}
748__setup("nosoftlockup", nosoftlockup_setup);
749
750static void __init do_pre_smp_initcalls(void)
751{
752 extern int spawn_ksoftirqd(void);
753
754 migration_init();
755 spawn_ksoftirqd();
756 if (!nosoftlockup)
757 spawn_softlockup_task();
758}
759
760static void run_init_process(char *init_filename)
761{
762 argv_init[0] = init_filename;
763 kernel_execve(init_filename, argv_init, envp_init);
764}
765
766/* This is a non __init function. Force it to be noinline otherwise gcc
767 * makes it inline to init() and it becomes part of init.text section
768 */
769static int noinline init_post(void)
770{
771 free_initmem();
772 unlock_kernel();
773 mark_rodata_ro();
774 system_state = SYSTEM_RUNNING;
775 numa_default_policy();
776
777 if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
778 printk(KERN_WARNING "Warning: unable to open an initial console.\n");
779
780 (void) sys_dup(0);
781 (void) sys_dup(0);
782
783 if (ramdisk_execute_command) {
784 run_init_process(ramdisk_execute_command);
785 printk(KERN_WARNING "Failed to execute %s\n",
786 ramdisk_execute_command);
787 }
788
789 /*
790 * We try each of these until one succeeds.
791 *
792 * The Bourne shell can be used instead of init if we are
793 * trying to recover a really broken machine.
794 */
795 if (execute_command) {
796 run_init_process(execute_command);
797 printk(KERN_WARNING "Failed to execute %s. Attempting "
798 "defaults...\n", execute_command);
799 }
800 run_init_process("/sbin/init");
801 run_init_process("/etc/init");
802 run_init_process("/bin/init");
803 run_init_process("/bin/sh");
804
805 panic("No init found. Try passing init= option to kernel.");
806}
807
808static int __init kernel_init(void * unused)
809{
810 lock_kernel();
811 /*
812 * init can run on any cpu.
813 */
814 set_cpus_allowed(current, CPU_MASK_ALL);
815 /*
816 * Tell the world that we're going to be the grim
817 * reaper of innocent orphaned children.
818 *
819 * We don't want people to have to make incorrect
820 * assumptions about where in the task array this
821 * can be found.
822 */
823 init_pid_ns.child_reaper = current;
824
825 cad_pid = task_pid(current);
826
827 smp_prepare_cpus(setup_max_cpus);
828
829 do_pre_smp_initcalls();
830
831 smp_init();
832 sched_init_smp();
833
834 cpuset_init_smp();
835
836 do_basic_setup();
837
838 /*
839 * check if there is an early userspace init. If yes, let it do all
840 * the work
841 */
842
843 if (!ramdisk_execute_command)
844 ramdisk_execute_command = "/init";
845
846 if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
847 ramdisk_execute_command = NULL;
848 prepare_namespace();
849 }
850
851 /*
852 * Ok, we have completed the initial bootup, and
853 * we're essentially up and running. Get rid of the
854 * initmem segments and start the user-mode stuff..
855 */
856 init_post();
857 return 0;
858}
859
start_kernel()看起来更像典型的内核代码,它几乎与C和机器无关。
该函数是对各种内核子系统和数据结构的初始化的一长串调用。
这些包括调度程序,内存区域,时间保持等等。
然后,start_kernel()调用rest_init(),此时几乎所有事情都可以进行。
rest_init()创建一个内核线程,该线程将另一个函数kernel_init()作为入口点。
然后rest_init()调用schedule()来启动任务调度,并通过调用cpu_idle()进入睡眠状态,cpu_idle()是Linux内核的空闲线程。
cpu_idle()会永远运行,因此进程零将由它托管。
只要有工作要做-可运行的进程-进程0就会从CPU引导出来,只有在没有可运行的进程可用时才返回。
但是,这是我们的出发点。
这个空闲循环是自启动以来我们followed的长线程的结尾,它是加电后处理器执行的第一个跳转的最终后代。
从复位向量到BIOS再到MBR再到引导加载程序再到实模式内核再到保护模式内核,所有这些混乱都在这里导致,一跳一跳地结束于引导处理器cpu_idle的空闲循环中这真的很酷。但是,这不可能是全部,否则计算机将无法工作。
此时,先前启动的内核线程已准备就绪,可以开始运行,从而替换进程0及其空闲线程。
事情确实如此,因为kernel_init()被指定为线程入口点,所以从这一点开始运行。
kernel_init()负责初始化系统中剩余的,自引导以来已暂停的CPU。
到目前为止,我们已经看到的所有代码都在称为引导处理器的单个CPU中执行。
在启动其他称为应用程序处理器的CPU时,它们以实模式启动,并且还必须运行多次初始化。
您可以在startup_32的代码中看到许多代码路径是通用的,但后来的应用程序处理器会使用一些分支。
最后,kernel_init()调用init_post(),它尝试按以下顺序执行用户模式进程:/ sbin / init,/ etc / init,/ bin / init和/ bin / sh。
如果全部失败,内核将崩溃。
幸运的是,init通常在那里,并以PID 1的身份开始运行。它会检查其配置文件以找出要启动的进程,其中可能包括X11 Windows,在控制台上登录的程序,网络守护程序等。
这样就结束了启动过程,因为另一个Linux盒开始在某处运行。
可能您的正常运行时间长而麻烦。
给定通用的体系结构,Windows的过程在许多方面都相似。
面临许多相同的问题,必须完成类似的初始化。
引导时,最大的区别之一是Windows将所有实模式内核代码以及一些初始的受保护模式代码打包到引导加载程序本身(C:\ NTLDR)中。
因此,Windows使用不同的二进制映像而不是在同一内核映像中包含两个区域。
另外,Linux将引导加载程序和内核完全分开;
这以某种方式自动脱离了开源过程。
下图显示了Windows内核的main bits:
Windows内核初始化
Windows用户模式的启动自然非常不同。
没有/ sbin / init,而是Csrss.exe和Winlogon.exe。
Winlogon生成启动所有Windows Services的Services.exe和本地安全身份验证子系统Lsass.exe。
经典的Windows登录对话框在Winlogon的上下文中运行。
这是本引导系列的结尾。
感谢大家的阅读和反馈。
对不起,有些事情得到了肤浅的对待。
我必须从某个地方开始,并且非常适合博客大小的讨论。
但是,没有什么比第二天更美好了。
我的计划是像其他系列一样,定期刊登本系列的“软件插图”文章。
同时,这里有一些资源:
最好,最重要的资源是用于Linux或BSD之一的真实内核的源代码。
英特尔发布了出色的《软件开发人员手册》,您可以免费下载。
了解Linux内核是一本不错的书,它遍历了许多Linux内核资源。
它已经过时了并且很干,但是我仍然推荐给任何想要使用内核的人。
Linux设备驱动程序更有趣,教学效果很好,但是范围有限。
最后,Patrick Moroney在本文的评论中建议了Robert Love的Linux Kernel Development。
我听说过该书的其他正面评论,因此听起来值得一试。
对于Windows,迄今为止最好的参考是Sysinternals的著名人物David Solomon和Mark Russinovich撰写的Windows Internals。
这是一本很棒的书,写得好而且透彻。
主要缺点是缺少源代码。
[更新:在下面的评论中,Nix在我掩盖的初始根文件系统上涵盖了很多基础。
感谢Marius Barbu在我写“ CR3”而不是GDTR时遇到了一个错误。