Let me give the results first.
Sequence | Time in seconds (uncompressed kernel) | Time in seconds (compressed kernel) |
---|---|---|
U-boot load and start | 0.08 | 0.08 |
Kernel load | 0.52 (0.24 crc32 and 0.28 for copy) | 0.35 (0.16 crc32 and 0.19 for copy) |
Kernel start and uncompress | 0.0 | 0.66 (for unzip) |
Kernel Initialization | 0.37 | 0.38 |
Init | 0 (bypassed init=/bin/ash) | 0 (bypassed init=/bin/ash) |
Switching Run level and executing init scripts | 0 (bypassed init=/bin/ash) | 0 (bypassed init=/bin/ash) |
Starting Shell | 0 (init=/bin/ash) | 0 (init=/bin/ash) |
Total | 0.97 | 1.47 |
Message Log with uncompressed kernel (refer Measuring Boot Time)
$ ./tstamp.exe < /dev/ttyS0
column1 is elapsed time since first message
column2 is elapsed time since previous message
column3 is the message
0.000 0.000:
0.000 0.000:
0.000 0.000: U-Boot 1.2.0 (Jun 23 2008 - 14:53:30)
0.000 0.000:
0.000 0.000: I2C: ready
0.000 0.000: DRAM: 256 MB
0.000 0.000: MY AMD Flash: 16 MB
0.060 0.060: In: serial
0.060 0.000: Out: serial
0.060 0.000: Err: serial
0.070 0.010: RM Clock :- 297MHz DDR Clock :- 162MHz
1.071 1.001: Hit any key to stop autoboot: 0
1.351 0.280: # Booting image at 80007fc0 ...
1.351 0.000: Verifying Checksum ...
1.502 0.151: OK
1.502 0.000: OK
1.502 0.000: ## Loading Ramdisk Image at 80900000 ...
1.502 0.000: Verifying Checksum ...
1.592 0.090: OK
1.602 0.010:
1.602 0.000: Starting kernel ...
1.602 0.000:
1.972 0.370:bin/ash: can't access tty; job control turned off
Note: remember to subtract the 1 second bootdelay (1.972 - 1 gives about 0.97 seconds).
Message Log with compressed kernel (refer Measuring Boot Time)
$ ./tstamp.exe < /dev/ttyS0
column1 is elapsed time since first message
column2 is elapsed time since previous message
column3 is the message
0.000 0.000:
0.000 0.000: U-Boot 1.2.0 (Jun 23 2008 - 14:53:30)
0.000 0.000:
0.000 0.000: I2C: ready
0.010 0.010: DRAM: 256 MB
0.010 0.000: MY AMD Flash: 16 MB
0.060 0.050: In: serial
0.060 0.000: Out: serial
0.070 0.010: Err: serial
0.070 0.000: ARM Clock :- 297MHz DDR Clock :- 162MHz
1.071 1.001: Hit any key to stop autoboot: 0
1.261 0.190: # Booting image at 80007fc0 ...
1.261 0.000: Verifying Checksum ...
1.331 0.070: OK
1.331 0.000: OK
1.341 0.010: ## Loading Ramdisk Image at 80900000 ...
1.341 0.000: Verifying Checksum ...
1.432 0.091: OK
1.432 0.000:
1.432 0.000: Starting kernel ...
1.432 0.000:
2.093 0.661: Uncompressing Linux................................................. ......................... done, booting the kernel.
2.473 0.380:/bin/ash: can't access tty; job control turned off
Note: remember to subtract the 1 second bootdelay (2.473 - 1 gives about 1.47 seconds).
Board: DM6446 DVEVM, RS232 port connected to PC
Linux: MontaVista Pro 5.0 installed on Linux box
LSP: REL_LSP_PSP_02_00_00_010
U-Boot, kernel and ramdisk (cramfs) are in NOR flash. The root filesystem is the ramdisk (cramfs).
This gave a compressed kernel of 0x107650 bytes (~1MB) and an uncompressed kernel of 0x238FA0 bytes (~2.2MB). I have used 0x107650 and 0x238FA0 in this article; please replace them with your own kernel sizes.
Make ramdisk
Host# mkdir <tempdir>
Host# cd <tempdir>
Host# cp /opt/montavista/pro/devkit/arm/v5t_le_uclibc/images/ramdisk.gz .
Host# gzip -d ramdisk.gz
loop mount
Host# mkdir disk
Host# mount -o loop -t ext2 ramdisk disk
copy modules (created during kernel size optimization)
Host# cp /opt/montavista/pro/devkit/lsp/ti-davinci/linux-2.6.18_pro500/drivers/net/davinci_emac_driver.ko disk/home
make cramfs
Host# mkcramfs -n ramdisk disk rootfs.cramfs
make it U-Boot compatible (place U-Boot header)
Host# mkimage -A arm -T ramdisk -n 'Ramdisk' -a 0x80900000 -e 0x80900000 -d rootfs.cramfs uCramfsdisk
This gave a 0x164040 byte (~1.4MB) cramfs filesystem. I have used 0x164040 in this article; please replace it with your own filesystem size.
Boot parameters at this stage:
setenv bootargs mem=256M console=ttyS0,115200n8 root=/dev/ram0 ro
setenv bootcmd 'cp.b 0x2300000 0x80900000 <your filesystem size in hex>; bootm 0x2050000 0x80900000'
Boot parameters at this stage:
setenv bootargs mem=16M console=ttyS0,115200n8 root=/dev/ram0 ro quiet lpj=741376 ip=off init=/bin/ash
U-Boot is configured for the slowest possible NOR timing by default, so the copy from NOR is very slow. The board uses an AM29LV256MH connected over a 16-bit bus. The NOR data sheet gives 120 ns as the access time (read setup time + read strobe time), and the read strobe time has to be at least 40 ns. EMIFA runs at 100 MHz (1/6th of the 600 MHz PLL1), so one EMIFA clock is 10 ns and, theoretically, 12 EMIFA clocks are needed to fetch one short (16 bits) over the bus.
Convention for calculating the EMIFA cycles needed: EMIFA cycles = ceil("calculated cycles as per data sheets") + 1 = 12 + 1. The one additional clock (+ 1) accounts for crystal/oscillator accuracy.
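The convention can be written down as integer arithmetic. The sketch below is only an illustration; the function name emifa_cycles and its parameters are mine and are not part of U-Boot.

```c
#include <assert.h>

/* Clocks needed for a timing of `ns` nanoseconds on a bus clocked at
 * `mhz` MHz, following the convention in the text:
 * ceil(ns / clock period) plus one margin clock. */
unsigned emifa_cycles(unsigned ns, unsigned mhz)
{
    assert(mhz > 0);
    /* ns * mhz / 1000 is the exact clock count; round it up. */
    unsigned rounded_up = (ns * mhz + 999u) / 1000u;
    return rounded_up + 1u;   /* +1 for crystal/oscillator accuracy */
}
```

With the 100 MHz EMIFA clock, the 120 ns access time comes out as 13 clocks and the 40 ns minimum read strobe as 5 clocks.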
The EMIFA CS2 setting can be either of the following (refer to the DM6446 EMIF user guide for the EMIFA CS2 register):
EMIFA CS2 0x3FFE058D: read setup of 1 clock (register value of 0) + read strobe of 12 clocks (register value of 11).
EMIFA CS2 0x3FFEC20D: read setup of 7 clocks (register value of 6) + read strobe of 5 clocks (register value of 4).
Read hold time is not needed, but the minimum is 1 clock on DM6446 (register value of 0).
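To double check a candidate register value, the read timing fields can be packed and unpacked in C. This is a sketch under an assumption: the bit positions used here (read setup in bits 16:13, read strobe in bits 12:7, read hold in bits 6:4, each holding clocks minus one) are my reading of the register layout and should be verified against the DM6446 EMIF user guide. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Read timing in EMIFA clocks (not raw register values). */
struct emifa_read_timing {
    unsigned setup_clocks;
    unsigned strobe_clocks;
    unsigned hold_clocks;
};

/* ASSUMED field positions; each field stores (clocks - 1). */
#define R_SETUP_SHIFT   13   /* 4 bits */
#define R_STROBE_SHIFT   7   /* 6 bits */
#define R_HOLD_SHIFT     4   /* 3 bits */
#define R_FIELDS_MASK   0x0001FFF0u   /* bits 16:4 */

/* Extract the read timing from a CS2 configuration word. */
struct emifa_read_timing emifa_decode_read(uint32_t cfg)
{
    struct emifa_read_timing t;
    t.setup_clocks  = ((cfg >> R_SETUP_SHIFT)  & 0xFu)  + 1u;
    t.strobe_clocks = ((cfg >> R_STROBE_SHIFT) & 0x3Fu) + 1u;
    t.hold_clocks   = ((cfg >> R_HOLD_SHIFT)   & 0x7u)  + 1u;
    return t;
}

/* Replace only the read timing fields of cfg, leaving the rest untouched. */
uint32_t emifa_set_read(uint32_t cfg, struct emifa_read_timing t)
{
    assert(t.setup_clocks  >= 1 && t.setup_clocks  <= 16);
    assert(t.strobe_clocks >= 1 && t.strobe_clocks <= 64);
    assert(t.hold_clocks   >= 1 && t.hold_clocks   <= 8);
    cfg &= ~R_FIELDS_MASK;
    cfg |= (uint32_t)(t.setup_clocks  - 1u) << R_SETUP_SHIFT;
    cfg |= (uint32_t)(t.strobe_clocks - 1u) << R_STROBE_SHIFT;
    cfg |= (uint32_t)(t.hold_clocks   - 1u) << R_HOLD_SHIFT;
    return cfg;
}
```

Under that assumed layout, decoding 0x3FFE058D gives setup 1, strobe 12, hold 1, and decoding 0x3FFEC20D gives setup 7, strobe 5, hold 1, which agrees with the two settings listed above.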
Changes go into AEMIF NOR initialization part of board/davinci/lowlevel_init.S
ACFG2:     .word 0x01E00010
ACFG2_VAL: .word 0x3FFE058D
..
LDR R0, ACFG2
LDR R1, ACFG2_VAL
LDR R2, [R0]
AND R1, R2, R1
STR R1, [R0]
...
Reduce the U-Boot Parameter Space from 128kB to 2kB.
Changes go into include/configs/davinci.h
#define CFG_ENV_SIZE 0x800
I2C communication in U-Boot is slow and can be removed or relocated out of the U-Boot init code. Instead of running as part of init, it can run as part of the appropriate command (when U-Boot is in interactive mode). On this board, for example, the MAC address is read over I2C in board/davinci/davinci.c. I moved that read to cmd_net.c.
..
netboot_common(...)
..
if (readset_ethaddr_first) {
    /* do I2C communication to get MAC address here */
    readset_ethaddr_first = 0;
}
....
The U-Boot memory copy code is not optimal: the CPU does a byte by byte copy.
The copy happens in two functions:
memmove in lib_generic/string.c
do_mem_cp in common/cmd_mem.c
Writing an optimized C routine to copy the data (used when source and destination are both aligned on a double, i.e. 8-byte, boundary) gave 0.8MB per 100 milliseconds, i.e. the 2.4MB (kernel + filesystem) copy takes 300 milliseconds.
if ((((uint)dst | (uint)src) & 0x7) == 0) {
    /* both dst and src are aligned on double boundary */
    double *dDbl;
    const double *sDbl;

    loop = len >> 3;                /* number of 8-byte units */
    dDbl = (double *)dst;
    sDbl = (const double *)src;
    for (i = 0; i < loop; i++)
        *dDbl++ = *sDbl++;

    /* copy the remaining 0..7 bytes */
    d = (uchar *)dDbl;
    s = (const uchar *)sDbl;
    if (len & 4) { *d++ = *s++; *d++ = *s++; *d++ = *s++; *d++ = *s++; }
    if (len & 2) { *d++ = *s++; *d++ = *s++; }
    if (len & 1) *d++ = *s++;
}
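For reference, here is a self-contained variant of the same idea that compiles outside U-Boot. It is a sketch and not the code that was measured: it uses uint64_t instead of double for the 8-byte moves, falls back to a byte loop for unaligned pointers, and copy_aligned8 is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy len bytes from src to dst (regions must not overlap).
 * When both pointers are 8-byte aligned, move 64 bits per iteration and
 * finish the 0..7 leftover bytes one at a time; otherwise use a plain
 * byte loop. Returns 1 if the 8-byte path was taken, 0 otherwise. */
int copy_aligned8(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    int fast = ((((uintptr_t)dst | (uintptr_t)src) & 0x7u) == 0);

    assert(len == 0 || (dst != NULL && src != NULL));

    if (fast) {
        uint64_t *d64 = dst;
        const uint64_t *s64 = src;
        size_t words = len >> 3;        /* number of 8-byte units */

        for (size_t i = 0; i < words; i++)
            *d64++ = *s64++;

        d = (unsigned char *)d64;
        s = (const unsigned char *)s64;
        len &= 7u;                      /* bytes still to copy */
    }
    while (len--)
        *d++ = *s++;
    return fast;
}
```

On the target, whether the compiler turns the 64-bit assignment into a single wide load/store pair depends on the toolchain and CPU, so the gain should be measured rather than assumed.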
Moving the copy to EDMA gave 1.26MB per 100 milliseconds, i.e. the 2.4MB (kernel + filesystem) copy takes 190 milliseconds. Theoretical calculations (based on the AM29LV256MH's 120 ns access time per two bytes) give 144 milliseconds for a 2.4MB NOR to DDR copy (1.2 million 16-bit reads x 120 ns).
The following bootcmd has one issue: the kernel is read from NOR twice, the first time for the crc32 check and the second time for kernel relocation.
setenv bootcmd 'cp.b 0x2300000 0x80900000 <your filesystem size in hex>; bootm 0x2050000 0x80900000'
Change bootcmd to
setenv bootcmd 'cp.b 0x2050000 0x80700000 <your kernel size in hex>;cp.b 0x2300000 0x80900000 <your filesystem size in hex>; bootm 0x80700000 0x80900000'
Now NOR is accessed only once, for the copy. The crc32 check and the relocation happen in DDR.
U-Boot relocates the kernel to 0x80008000 using the memmove function. This step happens after the crc32 check passes. The relocation can be avoided by making the first copy smartly: a uImage (kernel image with header) has a 0x40 byte header, so copying it to 0x80007FC0 puts the actual kernel at 0x80008000.
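The address arithmetic follows from the size of the legacy uImage header. The struct below is my recollection of that header (seven 32-bit words, four single-byte fields and a 32-byte name); treat the field names as assumptions and check them against include/image.h in your U-Boot tree. uimage_copy_address is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

#define IH_NMLEN 32   /* length of the image name field */

/* Legacy U-Boot image header as recalled here; 0x40 bytes in total. */
struct uimage_header {
    uint32_t ih_magic;            /* magic number */
    uint32_t ih_hcrc;             /* header CRC */
    uint32_t ih_time;             /* creation timestamp */
    uint32_t ih_size;             /* payload size */
    uint32_t ih_load;             /* load address */
    uint32_t ih_ep;               /* entry point */
    uint32_t ih_dcrc;             /* payload CRC */
    uint8_t  ih_os;               /* operating system */
    uint8_t  ih_arch;             /* CPU architecture */
    uint8_t  ih_type;             /* image type */
    uint8_t  ih_comp;             /* compression type */
    uint8_t  ih_name[IH_NMLEN];   /* image name */
};

/* Address to copy the uImage to so that the payload, which follows the
 * header, lands exactly at load_addr and needs no relocation. */
uint32_t uimage_copy_address(uint32_t load_addr)
{
    assert(load_addr >= sizeof(struct uimage_header));
    return load_addr - (uint32_t)sizeof(struct uimage_header);
}
```

For a load address of 0x80008000 this returns 0x80007FC0, the address used in the bootcmd.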
Change the bootcmd
setenv bootcmd 'cp.b 0x2050000 0x80007FC0 0x107650;cp.b 0x2300000 0x80900000 0x164040; bootm 0x80007FC0 0x80900000'
This is good, but U-Boot still calls memmove to copy the kernel onto itself (it now relocates from 0x80008000 to 0x80008000).
Add a check to the memmove code to "do nothing" if destination and source point to the same address.
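A minimal sketch of such a check, written as a stand-alone function rather than as a patch to lib_generic/string.c. The name memmove_skip_same is hypothetical, and the rest of the routine is an ordinary byte-wise memmove, not U-Boot's exact code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* memmove with an early exit when source and destination coincide. */
void *memmove_skip_same(void *dest, const void *src, size_t count)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    assert(count == 0 || (dest != NULL && src != NULL));

    /* Same address (or nothing to copy): every byte is already in place. */
    if (d == s || count == 0)
        return dest;

    if ((uintptr_t)d < (uintptr_t)s) {
        /* Destination is below the source: copy forwards. */
        while (count--)
            *d++ = *s++;
    } else {
        /* Destination is above the source: copy backwards so that
         * overlapping bytes are read before they are overwritten. */
        d += count;
        s += count;
        while (count--)
            *--d = *--s;
    }
    return dest;
}
```

With the kernel already sitting at its load address, the call returns immediately instead of walking the whole image.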
In the steps above, the kernel is copied from NOR to DDR, so crc32 now works on DDR instead of NOR. The first step of the crc32 optimization is already done.
The crc32 code accesses the input buffer byte by byte. Change it to read 4 bytes (as an integer) each time. It takes 500 milliseconds to do the crc32 check of 2.4MB (kernel + filesystem) at this stage.
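One way to read the input four bytes at a time is sketched below. It assumes the standard reflected CRC-32 (polynomial 0xEDB88320), a little-endian CPU and a 4-byte aligned buffer. The function names are mine, and this is not U-Boot's crc32 code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t crc_table[256];

/* Build the reflected CRC-32 lookup table (polynomial 0xEDB88320). */
void crc32_init_table(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c & 1u) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
        crc_table[n] = c;
    }
}

/* CRC-32 that fetches the input one 32-bit word per memory access.
 * buf must be 4-byte aligned; byte order within the word is assumed
 * to be little-endian. Call crc32_init_table() once beforehand. */
uint32_t crc32_by_words(const void *buf, size_t len)
{
    const uint32_t *w = buf;
    uint32_t crc = 0xFFFFFFFFu;

    assert(((uintptr_t)buf & 3u) == 0);

    for (; len >= 4; len -= 4) {
        uint32_t x = *w++;               /* one load brings in four bytes */
        for (int b = 0; b < 4; b++) {    /* consume them from the register */
            crc = crc_table[(crc ^ x) & 0xFFu] ^ (crc >> 8);
            x >>= 8;
        }
    }

    /* Handle the 0..3 trailing bytes one at a time. */
    const unsigned char *p = (const unsigned char *)w;
    while (len--)
        crc = crc_table[(crc ^ *p++) & 0xFFu] ^ (crc >> 8);

    return crc ^ 0xFFFFFFFFu;
}
```

The table lookups are unchanged; only the number of loads from the input buffer drops to a quarter.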
Place the 1kB crc table (crc_table) in on-chip memory. DM6446 has 8kB of on-chip memory at 0xA000.
#ifdef CRC_TABLE_ONCHIP
unsigned int *pTOnchip = (unsigned int *)0x0000A000;
if (crc_table_onchip_first) {
    for (i = 0; i < 256; i++)   /* copy to onchip */
        pTOnchip[i] = crc_table[i];
    crc_table_onchip_first = 0;
}
pT = pTOnchip;
#endif
It takes ~200 milliseconds to do crc32 check of 2.4MB (Kernel + filesystem) at this stage.
DM6446 has 8kB of on-chip memory at 0xA000. In the step above, 1kB is used for the crc table, which leaves 7kB. Use EDMA to fetch the input data in chunks of 7kB and run crc32 on the on-chip data.
uInt curLen;
uInt *pDOnchip = (unsigned int *)0x0000A400;

crc = crc ^ 0xffffffffL;
for (i = 0; i < len; i += curLen) {
    curLen = ((len - i) < (7 * 1024)) ? (len - i) : (7 * 1024);
    your_memcpy_using_edma(pDOnchip, buf + i, curLen);
    crc = your_crc32_with_no_compliment(crc, pDOnchip, curLen); /* ^0xffffffff is already done */
}
return crc ^ 0xffffffffL;
your_crc32_with_no_compliment is the crc32 function without the initial and final complement (^0xffffffff).
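The chunked loop works because the CRC state can be carried from one chunk to the next, as long as the complement is applied only once at the start and once at the end. The sketch below shows this on a host: memcpy into a static buffer stands in for the EDMA transfer into on-chip memory, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_BYTES (7 * 1024)

static uint32_t crc_tab[256];
static unsigned char staging[CHUNK_BYTES];   /* stand-in for on-chip RAM */

/* Build the reflected CRC-32 lookup table (polynomial 0xEDB88320). */
void crc_tab_init(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c & 1u) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
        crc_tab[n] = c;
    }
}

/* CRC-32 update without the initial and final ^0xffffffff. */
uint32_t crc32_raw(uint32_t crc, const unsigned char *p, size_t len)
{
    while (len--)
        crc = crc_tab[(crc ^ *p++) & 0xFFu] ^ (crc >> 8);
    return crc;
}

/* CRC-32 of buf, computed chunk by chunk through the staging buffer. */
uint32_t crc32_chunked(const unsigned char *buf, size_t len, size_t chunk)
{
    uint32_t crc = 0xFFFFFFFFu;              /* initial complement, once */

    assert(chunk > 0 && chunk <= CHUNK_BYTES);

    for (size_t i = 0; i < len; ) {
        size_t cur = (len - i < chunk) ? (len - i) : chunk;
        memcpy(staging, buf + i, cur);       /* EDMA transfer on the target */
        crc = crc32_raw(crc, staging, cur);  /* state carries over */
        i += cur;
    }
    return crc ^ 0xFFFFFFFFu;                /* final complement, once */
}
```

The result does not depend on the chunk size, so the 7kB figure is purely a consequence of how much on-chip memory is left.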
It takes 160 milliseconds to do the crc32 check of 2.4MB (compressed kernel + filesystem) and 240 milliseconds to do the crc32 check of 3.6MB (uncompressed kernel + filesystem).
Until this point a compressed kernel has been used. Now, with copy and crc32 optimized and the unzip of the kernel taking 0.66 seconds, it is a trade-off between kernel size (flash size) and boot time.
To use an uncompressed kernel (Image) with U-Boot, a header has to be placed on the Image.
mkimage -A arm -O linux -T kernel -C none -a 0x80008000 -e 0x80008000 -n 'Linux-2.6.18_pro500' -d Image uImage
mkimage here places a 0x40 byte header on Image to produce uImage.
Now burn the kernel at 0x02050000 (see Burn any image to NOR flash).
bootdelay=1
bootcmd=cp.b 0x2050000 0x80007FC0 0x238FA0;cp.b 0x2300000 0x80900000 0x164040; bootm 0x80007FC0 0x80900000
bootargs=mem=16M console=ttyS0,115200n8 root=/dev/ram0 ro quiet lpj=741376 ip=off init=/bin/ash
All this gives a 1 second boot time (0.97 seconds, to be precise).
I am not sure I will be able to spend further time on this. If I end up spending more time on it, I would do the following
see also
At the end of last year, to demonstrate my company’s swiftBoot service, I put together a rather impressive demo. Using a Renesas MS7724 development board I was able to achieve a one second cold Linux boot to a Qt application. Here’s the demo…
Many people see a demo like this and assume there are ‘smoke and mirrors’ or that we’ve implemented a suspend to disk solution. This is genuinely a cold boot including UBoot (2009-01), Linux kernel (2.6.31-rc7) and Qt Embedded Open Source 4.6.2. We’ve not applied any specific intellectual property but instead spent time analysing where boot delays are coming from and simply optimising them away. The majority of the modifications we make usually fall into the category of ‘removing things that aren’t required’, ‘optimising things that are required’, or ‘taking a new approach to solving problems’ and are tailored very precisely to the needs of the ‘product’.
If you’re interested in exactly what modifications I made and a little more about the approach taken, you may be interested in these slides, which I presented at ELC-E 2010. I’m also expecting a video of this presentation to appear on Free Electrons in the near future.
You may also remember my last demo based on an OMAP3530 EVM. [© 2011 embedded-bits.co.uk]