An Analysis of the ARM Linux head.S File (Repost)

There is also an article on how zImage is built, which is very good as well:

http://blog.csdn.net/pottichu/archive/2009/06/11/4261150.aspx

 

 

http://blog.csdn.net/arriod/archive/2008/08/21/2808861.aspx

 

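This is the first file that runs in ARM Linux. The code is a fairly self-contained wrapper whose job is to decompress the Linux kernel and then set the PC to the first instruction of the kernel (vmlinux).

The bootloader passes three arguments to Linux (in r0, r1 and r2), of which Linux uses the second and third. The second argument is the architecture ID and the third is the address of the tag list. The architecture ID identifies the ARM machine and must be unique within Linux. The tag list is the parameter list that the bootloader passes to Linux (see "booting arm linux.pdf" for a detailed explanation).

// Program entry point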

              .section ".start", #alloc, #execinstr

/*

  * sort out different calling conventions

  */

              .align

start:

              .type       start,#function

              .rept       8 @ repeat the following instruction 8 times, leaving room for the exception vector table

              mov r0, r0 @ effectively a nop

              .endr

 

              b     1f

              .word       0x016f2818              @ Magic numbers to help the loader

              .word       start               @ absolute load/run zImage address

              .word       _edata                   @ zImage end address

1:            mov r7, r1                  @ save architecture ID

              mov r8, r2                  @ save atags pointer

 

#ifndef __ARM_ARCH_2__

              /*

                * Booting from Angel - need to enter SVC mode and disable

                * FIQs/IRQs (numeric definitions from angel arm.h source).

                * We only do this if we were in user mode on entry.

                */

              mrs r2, cpsr        @ get current mode

              tst   r2, #3                  @ not user?

              bne       not_angel

              mov r0, #0x17            @ angel_SWIreason_EnterSVC

              swi       0x123456              @ angel_SWI_ARM

not_angel:

              mrs r2, cpsr        @ turn off interrupts to

              orr  r2, r2, #0xc0              @ prevent angel from running

              msr       cpsr_c, r2

#else

              teqp       pc, #0x0c000003         @ turn off interrupts

#endif

 

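The code has to be running in SVC mode before it continues. If it was entered in user mode, which is the case when booting under the Angel debug monitor, it issues the Angel SWI (reason code 0x17, EnterSVC) to switch to SVC mode; in any other mode the swi is skipped. IRQ and FIQ are then disabled by setting the I and F bits (0xc0) in the CPSR.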
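The mode test and the interrupt masking can be modelled in C as below. This is only an illustrative sketch: the function names `needs_angel_enter_svc` and `mask_irq_fiq` are hypothetical, and a plain integer stands in for the CPSR.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPSR_F_BIT 0x40u  /* FIQ disable */
#define CPSR_I_BIT 0x80u  /* IRQ disable */

/* Mirrors "tst r2, #3 / bne not_angel": of the standard modes only
 * user mode (0x10) has both low mode bits clear. */
bool needs_angel_enter_svc(uint32_t cpsr)
{
    return (cpsr & 3u) == 0;
}

/* Mirrors "orr r2, r2, #0xc0": set the I and F bits. */
uint32_t mask_irq_fiq(uint32_t cpsr)
{
    return cpsr | CPSR_I_BIT | CPSR_F_BIT;
}
```

For example, a CPSR of 0x13 (SVC mode) skips the SWI and becomes 0xd3 once both interrupt sources are masked.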

 

              /*

                * Note that some cache flushing and other stuff may

                * be needed here - is there an Angel SWI call for this?

                */

 

              /*

                * some architecture specific code can be inserted

                * by the linker here, but it should preserve r7, r8, and r9.

                */

 

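Load the address table. This code can execute at any address, i.e. it is position-independent code (PIC), so an offset has to be added to the link-time addresses. The meaning of each table entry is given further below, at LC0.

The initial values in the GOT are assigned by the linker, which cannot know at which address the code will run. If the current run-time address differs from the addresses in the table, the GOT has to be fixed up as well.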

              .text

              adr  r0, LC0

              ldmia       r0, {r1, r2, r3, r4, r5, r6, ip, sp}

              subs r0, r0, r1             @ calculate the delta offset

 

                                          @ if delta is zero, we are

              beq       not_relocated           @ running at the address we

                                          @ were linked at.

 

              /*

                * We're running at a different address.  We need to fix

                * up various pointers:

                *   r5 - zImage base address

                *   r6 - GOT start

                *   ip - GOT end

                */

              add r5, r5, r0

              add r6, r6, r0

              add  ip, ip, r0

 

              /*

                * If we're running fully PIC === CONFIG_ZBOOT_ROM = n,

                * we need to fix up pointers into the BSS region.

                *   r2 - BSS start

                *   r3 - BSS end

                *   sp - stack pointer

                */

              add r2, r2, r0

              add r3, r3, r0

              add       sp, sp, r0

 

Fix up the GOT (global offset table) according to the current run-time address.

              /*

                * Relocate all entries in the GOT table.

                */

1:            ldr  r1, [r6, #0]          @ relocate entries in the GOT

              add r1, r1, r0             @ table.  This fixes up the

              str  r1, [r6], #4          @ C references.

              cmp r6, ip

              blo       1b
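The delta calculation and the GOT fix-up loop above can be modelled in C roughly as follows. This is a sketch under stated assumptions: `load_delta` and `relocate_got` are hypothetical names, and 32-bit integers stand in for r0, r1, r6 and ip.

```c
#include <stddef.h>
#include <stdint.h>

/* "subs r0, r0, r1": run-time address of LC0 minus its link-time address. */
uint32_t load_delta(uint32_t runtime_lc0, uint32_t linked_lc0)
{
    return runtime_lc0 - linked_lc0;
}

/* Add the delta to every GOT entry. The assembly is a do-while loop and
 * assumes a non-empty GOT; this model uses a plain for loop. */
void relocate_got(uint32_t *got, size_t entries, uint32_t delta)
{
    for (size_t i = 0; i < entries; i++)
        got[i] += delta;
}
```

When the delta is zero the image is running where it was linked, which is exactly the case the `beq not_relocated` branch short-circuits.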

 

Clear the BSS segment. Every ARM program has to do this before it runs C code.

 

not_relocated:       mov r0, #0

1:            str  r0, [r2], #4          @ clear bss

              str  r0, [r2], #4

              str  r0, [r2], #4

              str  r0, [r2], #4

              cmp r2, r3

              blo       1b
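A C model of the clearing loop makes one detail visible: it stores four words per iteration, so the cleared range is rounded up to a multiple of 16 bytes. The function name `clear_bss` is hypothetical and this is only a sketch of the loop shape.

```c
#include <stdint.h>

/* Zero [start, end) four words at a time, like the unrolled str loop.
 * As in the assembly, the body runs before the bounds check, and the
 * last iteration may write up to three words past "end". */
void clear_bss(uint32_t *start, uint32_t *end)
{
    uint32_t *p = start;
    do {
        p[0] = 0;
        p[1] = 0;
        p[2] = 0;
        p[3] = 0;
        p += 4;
    } while (p < end);
}
```

Calling it on a 6-word region therefore zeroes 8 words, so the caller has to make sure those extra words are safe to overwrite.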

 

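As the comment below says, the C environment is now set up. Next the cache and MMU are turned on. Why do that when this is only a decompressor? For speed. And why enable the MMU when all it provides is a flat mapping? Again for speed. Without the MMU only the I-cache can be enabled, because without the MMU there is no way to manage memory attributes per region, and the D-cache must never be enabled over I/O regions.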

 

              /*

                * The C runtime environment should now be setup

                * sufficiently.  Turn the cache on, set up some

                * pointers, and start decompressing.

                */

              bl       cache_on

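Should you step into it? If you only care about the overall flow, it is enough to know that this turns the cache on. Stepping in is a lot of fun, though, which is probably why people keep studying every line of Linux despite its size. On the other hand, a grasp of the kernel as a whole matters more; otherwise it is the blind men and the elephant. Also, anyone who wants to master ARM can read every assembly file in Linux, because the kernel uses a fairly complete range of ARM features.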

 

              mov r1, sp                  @ malloc space above stack

              add r2, sp, #0x10000       @ 64k max

 

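Understanding the addresses below is actually quite tricky, but the document "About TEXTADDR, ZTEXTADDR, PAGE_OFFSET etc..." explains them clearly. The code below makes sure that the decompression destination does not overlap the running program. The two instructions above reserve 64KB as the working buffer used during decompression.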

/*

  * Check to see if we will overwrite ourselves.

  *   r4 = final kernel address // final physical address at which the kernel executes

  *   r5 = start of this image // start address of this program

  *   r2 = end of malloc space (and therefore this image)

  * We basically want:

  *   r4 >= r2 -> OK

  *   r4 + image length <= r5 -> OK

  */

              cmp r4, r2

              bhs       wont_overwrite

              add r0, r4, #4096*1024       @ 4MB largest kernel size

              cmp r0, r5

              bls       wont_overwrite
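The two comparisons above amount to the following predicate. This is a minimal C sketch with a hypothetical function name; the parameters correspond to r4, r5 and r2, and the 4MB figure is the fixed upper bound the assembly assumes for the decompressed kernel.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_KERNEL_SIZE (4096u * 1024u)  /* "4MB largest kernel size" */

/* true when decompressing straight to kernel_dest cannot clobber
 * the running image or its malloc space. */
bool wont_overwrite(uint32_t kernel_dest,   /* r4 */
                    uint32_t image_start,   /* r5 */
                    uint32_t malloc_end)    /* r2 */
{
    if (kernel_dest >= malloc_end)          /* cmp r4, r2 / bhs */
        return true;
    /* add r0, r4, #4MB / cmp r0, r5 / bls */
    return kernel_dest + MAX_KERNEL_SIZE <= image_start;
}
```

With a destination of 0x30008000 and an image loaded at 0x30100000, for instance, the second test fails (0x30408000 is above the image start), so the slower relocating path is taken.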

 

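If there is not enough room, the kernel has to be decompressed to the address just past the buffer. decompress_kernel performs the decompression; it is written in C and is architecture-independent.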

 

              mov r5, r2                  @ decompress after malloc space

              mov r0, r5

              mov r3, r7

              bl       decompress_kernel

 

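After decompression, because there was not enough room, the kernel is not at its proper address and has to be moved there. The move may overwrite the code that is currently running, so any code that may still be executed must first be copied somewhere safe; here that is the area just past the decompressed kernel.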

 

              add r0, r0, #127

              bic  r0, r0, #127         @ align the kernel length

/*

  * r0     = decompressed kernel length

  * r1-r3  = unused

  * r4     = kernel execution address

  * r5     = decompressed kernel start

  * r6     = processor ID

  * r7     = architecture ID

  * r8     = atags pointer

  * r9-r14 = corrupted

  */

              add r1, r5, r0             @ end of decompressed kernel

              adr  r2, reloc_start

              ldr  r3, LC1

              add r3, r2, r3

1:            ldmia       r2!, {r9 - r14}        @ copy relocation code

              stmia       r1!, {r9 - r14}

              ldmia       r2!, {r9 - r14}

              stmia       r1!, {r9 - r14}

              cmp r2, r3

              blo       1b

 

              bl       cache_clean_flush @ code has been moved, so the cache must be cleaned and flushed first

              add       pc, r5, r0        @ call relocation code
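The address arithmetic in this block is compact, so here is a small C sketch of it. The names are hypothetical; `len` stands for the value decompress_kernel returns in r0 and `decompressed_start` for r5.

```c
#include <stdint.h>

/* "add r0, r0, #127 / bic r0, r0, #127": round the length up to a
 * multiple of 128 bytes. */
uint32_t align_up_128(uint32_t len)
{
    return (len + 127u) & ~127u;
}

/* "add r1, r5, r0": the relocation code is copied to the first aligned
 * address past the decompressed kernel, and "add pc, r5, r0" jumps to
 * that same address. */
uint32_t reloc_stub_address(uint32_t decompressed_start, uint32_t len)
{
    return decompressed_start + align_up_128(len);
}
```

Because the copy destination and the jump target are computed from the same two registers, the stub always starts executing at its first copied instruction.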

 

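decompress_kernel takes four arguments: the address to decompress the kernel to, the start of the buffer, the end of the buffer, and the architecture ID (r7, passed in r3). It returns the length of the decompressed kernel.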

 

/*

  * We're not in danger of overwriting ourselves.  Do this the simple way.

  *

  * r4     = kernel execution address

  * r7     = architecture ID

  */

wont_overwrite:       mov r0, r4

              mov r3, r7

              bl       decompress_kernel

              b       call_kernel

 

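The case in which nothing can be overwritten is simple: decompress the kernel directly to its final address and jump to it. The call_kernel routine is analyzed further below.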

 

              .type       LC0, #object

LC0:              .word       LC0               @ r1

              .word       __bss_start              @ r2

              .word       _end                     @ r3

              .word       zreladdr          @ r4

              .word       _start                    @ r5

              .word       _got_start              @ r6

              .word       _got_end        @ ip

              .word       user_stack+4096            @ sp

LC1:              .word       reloc_end - reloc_start

              .size       LC0, . - LC0

 

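This is the address table mentioned earlier; it holds the addresses of several symbols. LC0 is defined right here, zreladdr is defined in the Makefile in the current directory, and the remaining symbols are defined in the linker script (lds).

 

Next comes the cache and MMU code. It shows how the Linux experts identify each ARM processor in assembly so that the same code works across all of them.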

/*

  * Turn on the cache.  We need to setup some page tables so that we

  * can have both the I and D caches on.

  *

  * We place the page tables 16k down from the kernel execution address,

  * and we hope that nothing else is using it.  If we're using it, we

  * will go pop!

  *

  * On entry,

  *  r4 = kernel execution address

  *  r6 = processor ID

  *  r7 = architecture number

  *  r8 = atags pointer

  *  r9 = run-time address of "start"  (???)

  * On exit,

  *  r1, r2, r3, r9, r10, r12 corrupted

  * This routine must preserve:

  *  r4, r5, r6, r7, r8

  */

              .align       5

cache_on:       mov r3, #8                     @ cache_on function

              b       call_cache_fn

 

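This part involves many operations on the MMU, cache, write buffer and TLB, as well as coprocessor programming. I will not go into the programming details; they can be followed line by line with the ARM manual at hand. As for why things are done this way, it is not hard to understand once you know how these units work (the book 《ARM 嵌入式系统开发》 has a fairly good explanation). Because this code does a lot of time-consuming copying and decompression, turning the cache on is worthwhile. Since the data cache is used, the MMU has to be configured. To keep things simple, only a first-level mapping is built, and it is a 1:1 mapping in which virtual addresses equal physical addresses.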

 

__setup_mmu:       sub  r3, r4, #16384           @ Page directory size

              bic  r3, r3, #0xff        @ Align the pointer

              bic  r3, r3, #0x3f00

/*

  * Initialise the page tables, turning on the cacheable and bufferable

  * bits for the RAM area only.

  */

              mov r0, r3

              mov r9, r0, lsr #18

              mov r9, r9, lsl #18              @ start of RAM

              add       r10, r9, #0x10000000       @ a reasonable RAM size

              mov r1, #0x12

              orr  r1, r1, #3 << 10

              add r2, r3, #16384

1:            cmp r1, r9                  @ if virt > start of RAM

              orrhs       r1, r1, #0x0c            @ set cacheable, bufferable

              cmp r1, r10                @ if virt > end of RAM

              bichs       r1, r1, #0x0c            @ clear cacheable, bufferable

              str  r1, [r0], #4          @ 1:1 mapping

              add r1, r1, #1048576

              teq  r0, r2

              bne       1b
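The table-building loop can be modelled in C as follows. This is a sketch, not the kernel's code: `build_flat_table` is a hypothetical name, the table is an ordinary array rather than memory 16KB below the kernel, and the comparisons are done on the descriptor value itself, exactly as the assembly compares r1 against r9 and r10.

```c
#include <stdint.h>

#define SECTION_SIZE 0x100000u  /* each first-level entry maps 1MB */
#define NUM_SECTIONS 4096u      /* 4096 entries cover the 4GB space */
#define DESC_BASE    (0x12u | (3u << 10))  /* section entry, AP = 3 */
#define DESC_CB      0x0cu      /* cacheable + bufferable bits */

/* Fill a first-level table with a flat 1:1 section mapping, marking
 * only [ram_start, ram_start + ram_size) as cacheable/bufferable. */
void build_flat_table(uint32_t table[NUM_SECTIONS],
                      uint32_t ram_start, uint32_t ram_size)
{
    uint32_t ram_end = ram_start + ram_size;
    uint32_t desc = DESC_BASE;

    for (uint32_t i = 0; i < NUM_SECTIONS; i++) {
        if (desc >= ram_start)      /* cmp r1, r9  / orrhs */
            desc |= DESC_CB;
        if (desc >= ram_end)        /* cmp r1, r10 / bichs */
            desc &= ~DESC_CB;
        table[i] = desc;            /* virtual == physical */
        desc += SECTION_SIZE;       /* wraps harmlessly after the last entry */
    }
}
```

With RAM assumed at 0x30000000 and the 256MB window used above, entry 0x300 becomes 0x30000c1e while entry 0x400 falls back to 0x40000c12.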

 

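As the comment below explains, if we are running from flash, another 2MB is mapped. If we are in fact running from RAM this does no harm; it merely repeats work that has already been done.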

 

/*

  * If ever we are running from Flash, then we surely want the cache

  * to be enabled also for our execution instance...  We map 2MB of it

  * so there is no map overlap problem for up to 1 MB compressed kernel.

  * If the execution is in RAM then we would only be duplicating the above.

  */

              mov r1, #0x1e

              orr  r1, r1, #3 << 10

              mov r2, pc, lsr #20

              orr  r1, r1, r2, lsl #20

              add r0, r3, r2, lsl #2

              str  r1, [r0], #4

              add r1, r1, #1048576

              str  r1, [r0]

              mov       pc, lr
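The 2MB mapping of the currently executing code can be sketched in C like this. `map_running_code` is a hypothetical name, `pc` stands for the value read from the pc register, and the sketch assumes pc is not in the last 1MB of the address space (the assembly makes the same assumption).

```c
#include <stdint.h>

/* Mark the 1MB section containing pc, and the one after it, as
 * cacheable/bufferable 1:1 sections (0x1e = 0x12 | C | B, AP = 3). */
void map_running_code(uint32_t *table, uint32_t pc)
{
    uint32_t index = pc >> 20;                           /* mov r2, pc, lsr #20 */
    uint32_t desc = 0x1eu | (3u << 10) | (index << 20);  /* orr r1, r1, r2, lsl #20 */

    table[index] = desc;                   /* str r1, [r0], #4 */
    table[index + 1] = desc + 0x100000u;   /* next 1MB section */
}
```

The table index is the top 12 bits of the address, which is why the assembly scales it with `lsl #2` to get a byte offset into the table.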

 

__armv4_cache_on:

              mov       r12, lr

              bl       __setup_mmu

              mov r0, #0

              mcr       p15, 0, r0, c7, c10, 4       @ drain write buffer

              mcr       p15, 0, r0, c8, c7, 0 @ flush I,D TLBs

              mrc       p15, 0, r0, c1, c0, 0 @ read control reg

              orr  r0, r0, #0x5000           @ I-cache enable, RR cache replacement

              orr  r0, r0, #0x0030

              bl       __common_cache_on

              mov r0, #0

              mcr       p15, 0, r0, c8, c7, 0 @ flush I,D TLBs

              mov       pc, r12

 

__common_cache_on:

#ifndef DEBUG

              orr  r0, r0, #0x000d           @ Write buffer, mmu

#endif

              mov r1, #-1

              mcr       p15, 0, r3, c2, c0, 0 @ load page table pointer

              mcr       p15, 0, r1, c3, c0, 0 @ load domain access control

              mcr       p15, 0, r0, c1, c0, 0 @ load control register

              mov       pc, lr
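The three orr instructions spread across __armv4_cache_on and __common_cache_on build up the control register value. A C sketch with the bits named individually is given below; the macro and function names are hypothetical, and the bit meanings follow the ARMv4 CP15 control register layout as I understand it. Note that 0x000d also sets bit 2 (data cache), which the "Write buffer, mmu" comment does not mention.

```c
#include <stdint.h>

#define CR_M  (1u << 0)   /* MMU enable */
#define CR_C  (1u << 2)   /* data cache enable */
#define CR_W  (1u << 3)   /* write buffer enable */
#define CR_P  (1u << 4)   /* 32-bit program space (assumed meaning) */
#define CR_D  (1u << 5)   /* 32-bit data space (assumed meaning) */
#define CR_I  (1u << 12)  /* instruction cache enable */
#define CR_RR (1u << 14)  /* round-robin cache replacement */

/* Value written back to c1 on the non-DEBUG path. */
uint32_t armv4_cache_on_control(uint32_t ctrl)
{
    ctrl |= CR_I | CR_RR;        /* orr r0, r0, #0x5000 */
    ctrl |= CR_P | CR_D;         /* orr r0, r0, #0x0030 */
    ctrl |= CR_M | CR_C | CR_W;  /* orr r0, r0, #0x000d */
    return ctrl;
}
```

Starting from a control value of zero this yields 0x503d; with DEBUG defined the last step is skipped, so the MMU, D-cache and write buffer stay off.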

 

/*

  * All code following this line is relocatable.  It is relocated by

  * the above code to the end of the decompressed kernel image and

  * executed there.  During this time, we have no stacks.

  *

  * r0     = decompressed kernel length

  * r1-r3  = unused

  * r4     = kernel execution address

  * r5     = decompressed kernel start

  * r6     = processor ID

  * r7     = architecture ID

  * r8     = atags pointer

  * r9-r14 = corrupted

  */

 

The code below is what has to be relocated when there is not enough room for decompression; the reason was explained above.

 

              .align       5

reloc_start:       add  r9, r5, r0

              debug_reloc_start

              mov r1, r4

1:

              .rept 4

              ldmia       r5!, {r0, r2, r3, r10 - r14}       @ relocate kernel

              stmia       r1!, {r0, r2, r3, r10 - r14}

              .endr

 

              cmp r5, r9

              blo       1b

              debug_reloc_end

 

This is the last function. By this point all the real work has been done: turn the cache off and jump to the real kernel entry point.

 

call_kernel:     bl       cache_clean_flush

              bl       cache_off

              mov r0, #0                  @ must be zero

              mov r1, r7                  @ restore architecture number

              mov r2, r8                  @ restore atags pointer

              mov       pc, r4                    @ call kernel

 

/*

  * Here follow the relocatable cache support functions for the

  * various processors.  This is a generic hook for locating an

  * entry and jumping to an instruction at the specified offset

  * from the start of the block.  Please note this is all position

  * independent code.

  *

  *  r1  = corrupted

  *  r2  = corrupted

  *  r3  = block offset

  *  r6  = corrupted

  *  r12 = corrupted

  */

 

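The function below walks the proc_types array to find the current processor type and then jumps to the corresponding routine at the offset given in r3. It reads the CP15 c0 coprocessor register; see the ARM manuals if anything is unclear.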

 

call_cache_fn:       adr   r12, proc_types

              mrc       p15, 0, r6, c0, c0     @ get processor ID

1:            ldr  r1, [r12, #0]        @ get value

              ldr  r2, [r12, #4]        @ get mask

              eor  r1, r1, r6             @ (real ^ match)

              tst   r1, r2                  @       & mask

              addeq       pc, r12, r3              @ call cache function

              add       r12, r12, #4*5

              b       1b
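The lookup can be modelled in C as below. This is an illustrative sketch: the struct, the `find_proc_type` name and the offset enum are hypothetical, the match/mask pairs are copied from the proc_types table in this file, and the flush offset of 16 is inferred from the five-word entry layout rather than shown in the listing.

```c
#include <stddef.h>
#include <stdint.h>

struct proc_type { uint32_t match, mask; const char *name; };

/* Byte offsets of the three method slots inside one 20-byte entry. */
enum { CACHE_ON = 8, CACHE_OFF = 12, CACHE_FLUSH = 16 };

static const struct proc_type proc_types[] = {
    { 0x41560600u, 0xffffffe0u, "ARM6/610" },
    { 0x00000000u, 0x0000f000u, "old ARM ID" },
    { 0x41007000u, 0xfff8fe00u, "ARM7/710" },
    { 0x41807200u, 0xffffff00u, "ARM720T" },
    { 0x00007000u, 0x0000f000u, "ARM7 IDs" },
    { 0x4401a100u, 0xffffffe0u, "sa110 / sa1100" },
    { 0x6901b110u, 0xfffffff0u, "sa1110" },
    { 0x00020000u, 0x000f0000u, "ARMv4T" },
    { 0x00050000u, 0x000f0000u, "ARMv5TE" },
    { 0x00060000u, 0x000f0000u, "ARMv5TEJ" },
    { 0x00070000u, 0x000f0000u, "ARMv6" },
    { 0x00000000u, 0x00000000u, "unrecognised" },  /* matches anything */
};

/* First entry with ((real_id ^ match) & mask) == 0. The final 0/0
 * entry guarantees termination, just as in the assembly loop. */
size_t find_proc_type(uint32_t cpu_id)
{
    size_t i = 0;
    while (((cpu_id ^ proc_types[i].match) & proc_types[i].mask) != 0)
        i++;
    return i;
}
```

Order matters: the specific part-number entries come before the generic architecture entries, so an ID such as 0x6901b118 is caught by the sa1110 entry before any architecture match is tried.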

 

/*

  * Table for cache operations.  This is basically:

  *   - CPU ID match

  *   - CPU ID mask

  *   - 'cache on' method instruction

  *   - 'cache off' method instruction

  *   - 'cache flush' method instruction

  *

  * We match an entry using: ((real_id ^ match) & mask) == 0

  *

  * Writethrough caches generally only need 'on' and 'off'

  * methods.  Writeback caches _must_ have the flush method

  * defined.

  */

              .type       proc_types,#object

proc_types:

              .word       0x41560600            @ ARM6/610

              .word       0xffffffe0

              b       __arm6_cache_off   @ works, but slow

              b       __arm6_cache_off

              mov       pc, lr

@           b       __arm6_cache_on           @ untested

@           b       __arm6_cache_off

@           b       __armv3_cache_flush

 

              .word       0x00000000            @ old ARM ID

              .word       0x0000f000

              mov       pc, lr

              mov       pc, lr

              mov       pc, lr

 

              .word       0x41007000            @ ARM7/710

              .word       0xfff8fe00

              b       __arm7_cache_off

              b       __arm7_cache_off

              mov       pc, lr

 

              .word       0x41807200            @ ARM720T (writethrough)

              .word       0xffffff00

              b       __armv4_cache_on

              b       __armv4_cache_off

              mov       pc, lr

 

              .word       0x00007000            @ ARM7 IDs

              .word       0x0000f000

              mov       pc, lr

              mov       pc, lr

              mov       pc, lr

 

              @ Everything from here on will be the new ID system.

 

              .word       0x4401a100            @ sa110 / sa1100

              .word       0xffffffe0

              b       __armv4_cache_on

              b       __armv4_cache_off

              b       __armv4_cache_flush

 

              .word       0x6901b110            @ sa1110

              .word       0xfffffff0

              b       __armv4_cache_on

              b       __armv4_cache_off

              b       __armv4_cache_flush

 

              @ These match on the architecture ID

 

              .word       0x00020000            @ ARMv4T

              .word       0x000f0000

              b       __armv4_cache_on

              b       __armv4_cache_off

              b       __armv4_cache_flush

 

              .word       0x00050000            @ ARMv5TE

              .word       0x000f0000

              b       __armv4_cache_on

              b       __armv4_cache_off

              b       __armv4_cache_flush

 

              .word       0x00060000            @ ARMv5TEJ

              .word       0x000f0000

              b       __armv4_cache_on

              b       __armv4_cache_off

              b       __armv4_cache_flush

 

              .word       0x00070000            @ ARMv6

              .word       0x000f0000

              b       __armv4_cache_on

              b       __armv4_cache_off

              b       __armv6_cache_flush

 

              .word       0                   @ unrecognised type

              .word       0

              mov       pc, lr

              mov       pc, lr

              mov       pc, lr

 

              .size       proc_types, . - proc_types

 

/*

  * Turn off the Cache and MMU.  ARMv3 does not support

  * reading the control register, but ARMv4 does.

  *

  * On entry,  r6 = processor ID

  * On exit,   r0, r1, r2, r3, r12 corrupted

  * This routine must preserve: r4, r6, r7

  */

              .align       5

cache_off:       mov r3, #12                     @ cache_off function

              b       call_cache_fn

 

// code omitted

 

4KB is reserved here for the stack.

 

reloc_end:

 

              .align

              .section ".stack", "w"

user_stack:       .space       4096
