ref:http://kmittal82.wordpress.com/2012/02/17/armthumbthumb-2/
A few months ago I gave a presentation titled “Introduction to the ARM architecture”. One of the best-received sections was the part where I explained the differences between the instruction sets that can run on the ARM architecture, i.e. ARM (32-bit), Thumb (16-bit) and Thumb-2 (mixed 16/32-bit). I will try to explain the difference between the three in this post.
Before we begin, let’s look at a very simple test case which we will build for all three instruction sets (thanks to my friend and colleague Stephen Wade for coming up with this test case).
typedef unsigned char uint8;
typedef unsigned short uint16;

/* r0 = x, r1 = a, r2 = b, r3 = c */
uint8 foo(uint8 x, uint8 a, uint16 b, uint16 c)
{
    if (a == 2)
    {
        x += (b >> 8);
    }
    else
    {
        x += (c >> 8);
    }
    return x;
}
According to the ARM procedure call standard, the arguments to the function would be passed in registers r0-r3, with the return value passed in r0.
Each instruction in the ARM instruction set is 32 bits in size. ARM instructions also have access to useful features such as conditional execution and the in-line barrel shifter. Without conditional execution, an if/else has to be implemented with branches, which is more expensive. The in-line barrel shifter lets an instruction shift one of its register operands as part of the instruction itself, which removes the need for a separate shift instruction. Using RVCT, I compiled this for ARM state and dumped the code with fromelf:
0x00000000: e3510002 ..Q. CMP r1,#2
0x00000004: 10800423 #... ADDNE r0,r0,r3,LSR #8
0x00000008: 00800422 "... ADDEQ r0,r0,r2,LSR #8
0x0000000c: e20000ff .... AND r0,r0,#0xff
0x00000010: e12fff1e ../. BX lr
The first thing we notice here is that each instruction is 32 bits (4 bytes) wide, as shown by the instruction encoding (second column from the left). With 5 such instructions, the code size is 20 bytes.
Now, in the first instruction, r1 is compared to the integer value 2, which is the first thing that our function foo() does. Simple so far. The more interesting instructions are the second and third, which demonstrate both conditional execution and the in-line barrel shifter.
The second instruction in effect says “If the result of the previous comparison was NE (not equal), then shift r3 by 8 bits to the right, add it to r0 and store the result back in r0”. This corresponds to the “else” part of the if statement in our function. Not only did we manage to avoid a branch, we were also able to shift the bits and do the addition, all in one simple instruction. The third instruction is exactly the same, but deals with the “if” part and uses r2.
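To see where the condition and the shift actually live in the encodings from the listing, here is a minimal C sketch that pulls the relevant bit fields out of the instruction words. The field positions are those of the classic ARM data-processing format with a register operand shifted by an immediate; the struct and function names are my own and purely illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an ARM data-processing instruction whose second operand is
   a register shifted by an immediate. Names are illustrative only. */
typedef struct {
    unsigned cond;       /* bits 31:28, 0x0 = EQ, 0x1 = NE, 0xE = AL (always) */
    unsigned rn;         /* bits 19:16, first operand register */
    unsigned rd;         /* bits 15:12, destination register */
    unsigned shift_imm;  /* bits 11:7, shift amount */
    unsigned shift_type; /* bits 6:5, 0 = LSL, 1 = LSR, 2 = ASR, 3 = ROR */
    unsigned rm;         /* bits 3:0, register that gets shifted */
} arm_dp_fields;

static arm_dp_fields decode_arm_dp(uint32_t insn)
{
    arm_dp_fields f;
    f.cond       = (insn >> 28) & 0xFu;
    f.rn         = (insn >> 16) & 0xFu;
    f.rd         = (insn >> 12) & 0xFu;
    f.shift_imm  = (insn >> 7)  & 0x1Fu;
    f.shift_type = (insn >> 5)  & 0x3u;
    f.rm         = insn & 0xFu;
    return f;
}

/* Check the two conditional ADDs from the listing. */
static void check_listing(void)
{
    arm_dp_fields ne = decode_arm_dp(0x10800423u); /* ADDNE r0,r0,r3,LSR #8 */
    arm_dp_fields eq = decode_arm_dp(0x00800422u); /* ADDEQ r0,r0,r2,LSR #8 */

    assert(ne.cond == 0x1 && ne.rm == 3 && ne.shift_type == 1 && ne.shift_imm == 8);
    assert(eq.cond == 0x0 && eq.rm == 2 && eq.shift_type == 1 && eq.shift_imm == 8);
}
```

The only differences between the two ADD encodings are the top nibble (the condition) and the bottom nibble (which register is shifted). The shift fields only make sense for the two ADDs; the CMP uses an immediate operand, so for that word only cond and rn are meaningful here.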
The fourth instruction is again interesting. Since the return type of the function is unsigned char, only the bottom 8 bits are significant. So we AND the return value (stored in r0) with #0xff, which zeros the top 24 bits (by ANDing them with 0), and now our return value is ready in r0.
The fifth and final instruction simply returns to the caller.
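The whole five-instruction sequence can be mirrored line by line in C. This is only a sketch of what the instructions do, treating each argument register as a 32-bit value; foo_arm_model and check_model are hypothetical names, not anything produced by the compiler.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned char  uint8;
typedef unsigned short uint16;

/* The original function from the top of the post. */
static uint8 foo(uint8 x, uint8 a, uint16 b, uint16 c)
{
    if (a == 2) { x += (b >> 8); } else { x += (c >> 8); }
    return x;
}

/* Hypothetical model of the five ARM instructions, one line each. */
static uint32_t foo_arm_model(uint32_t r0, uint32_t r1, uint32_t r2, uint32_t r3)
{
    int equal = (r1 == 2);               /* CMP   r1,#2           */
    if (!equal) r0 = r0 + (r3 >> 8);     /* ADDNE r0,r0,r3,LSR #8 */
    if (equal)  r0 = r0 + (r2 >> 8);     /* ADDEQ r0,r0,r2,LSR #8 */
    r0 &= 0xFFu;                         /* AND   r0,r0,#0xff     */
    return r0;                           /* BX    lr              */
}

/* Compare the model with foo() over a spread of inputs. */
static void check_model(void)
{
    static const uint16 samples[] = { 0x0000, 0x00FF, 0x0100, 0x7F00, 0xFFFF };
    unsigned x, a, i, j;

    for (x = 0; x < 256; x++)
        for (a = 0; a < 4; a++)
            for (i = 0; i < 5; i++)
                for (j = 0; j < 5; j++)
                    assert(foo((uint8)x, (uint8)a, samples[i], samples[j]) ==
                           foo_arm_model(x, a, samples[i], samples[j]));
}
```

Both conditional lines are always "fetched", but only one of them changes r0, which is exactly how the ADDNE/ADDEQ pair behaves.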
In Thumb state, each instruction is 16 bits in size, and almost nothing is conditional: only branches can be. There is also no access to the in-line barrel shifter, so separate instructions are needed for shifting bits. What this means in practice is that Thumb code will generally be slower to execute than ARM code (since more Thumb instructions may be needed to do the same job), but it can help save code size. Building our example with RVCT and dumping it with fromelf, we see the following code sequence:
_Z3foohhtt
0x00000000: 2902 .) CMP r1,#2
0x00000002: d101 .. BNE {pc}+0x6 ; 0x8
0x00000004: 0a11 .. LSRS r1,r2,#8
0x00000006: e000 .. B {pc}+0x4 ; 0xa
0x00000008: 0a19 .. LSRS r1,r3,#8
0x0000000a: 1808 .. ADDS r0,r1,r0
0x0000000c: 0600 .. LSLS r0,r0,#24
0x0000000e: 0e00 .. LSRS r0,r0,#24
0x00000010: 4770 pG BX lr
Immediately, the first thing we notice is that all the instructions are 16 bits in size, as indicated by the instruction encodings (second column from the left). This gives us a total code size of 9 x 2 = 18 bytes.
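The branch targets in the listing (0x8 and 0xa) can be recovered from the 16-bit encodings themselves. The sketch below assumes the standard Thumb encodings for the short conditional branch (1101 cond imm8) and the short unconditional branch (11100 imm11), where the target is the instruction's address plus 4 plus twice the sign-extended immediate; the function names are illustrative. Note that fromelf prints the offset relative to the instruction's own address, which is why the listing shows {pc}+0x6 for a branch from 0x2 to 0x8.

```c
#include <assert.h>
#include <stdint.h>

/* Target of a 16-bit Thumb conditional branch: 1101 cond imm8. */
static uint32_t thumb_bcond_target(uint32_t addr, uint16_t insn)
{
    int32_t offset = (int32_t)(insn & 0xFFu);
    if (offset & 0x80)
        offset -= 0x100;                 /* sign-extend the 8-bit field */
    return addr + 4u + (uint32_t)(offset * 2);
}

/* Target of a 16-bit Thumb unconditional branch: 11100 imm11. */
static uint32_t thumb_b_target(uint32_t addr, uint16_t insn)
{
    int32_t offset = (int32_t)(insn & 0x7FFu);
    if (offset & 0x400)
        offset -= 0x800;                 /* sign-extend the 11-bit field */
    return addr + 4u + (uint32_t)(offset * 2);
}

/* Condition field of a conditional branch, bits 11:8 (0x1 = NE). */
static unsigned thumb_bcond_cond(uint16_t insn)
{
    return (insn >> 8) & 0xFu;
}

/* Check both branches from the listing. */
static void check_branches(void)
{
    assert(thumb_bcond_target(0x2u, 0xd101u) == 0x8u);  /* BNE at 0x2 */
    assert(thumb_bcond_cond(0xd101u) == 0x1u);          /* NE */
    assert(thumb_b_target(0x6u, 0xe000u) == 0xau);      /* B at 0x6 */
}
```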
When we start analysing the generated code, the drawbacks of Thumb state are immediately apparent. The only conditional instructions in Thumb are branches, so the if/else has to be implemented with branches. The second instruction branches to address 0x8 if the comparison in the first instruction was not equal (i.e. we enter the “else” part of our C++ code). At address 0x8, we see the second drawback of Thumb state, the lack of an in-line barrel shifter. A separate LSRS instruction shifts the bits in r3 right by 8 and stores the result in r1. Following that, the value in r1 is added to r0. Our return value is nearly ready, but we still need to zero the top 24 bits. Thumb does have an AND instruction, but only in a register-to-register form with no immediate operand, so masking with 0xff would first need the constant loaded into a spare register. Instead, the compiler performs two shifts on r0: first shifting r0 left by 24 bits (which pushes the top 24 bits out of the register), then shifting it right by 24 bits (which brings the original low byte back down with zeros above it). Similar code is generated for the “if” part as well.
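The shift-left-then-shift-right pair gives the same result as masking with 0xff. A minimal sketch in C, treating r0 as a 32-bit unsigned value (the function names are mine, for illustration only):

```c
#include <assert.h>
#include <stdint.h>

/* What the Thumb code does: two shifts. */
static uint32_t zero_extend_byte_shifts(uint32_t r0)
{
    r0 <<= 24;   /* LSLS r0,r0,#24: low byte moves to the top, rest is lost */
    r0 >>= 24;   /* LSRS r0,r0,#24: low byte comes back, zeros above it     */
    return r0;
}

/* What the ARM code does: a single AND with an immediate. */
static uint32_t zero_extend_byte_mask(uint32_t r0)
{
    return r0 & 0xFFu;
}

/* Both forms agree on a spread of values, including ones that overflow a byte. */
static void check_equivalence(void)
{
    static const uint32_t samples[] = {
        0x00000000u, 0x000000FFu, 0x00000104u, 0x12345678u, 0xFFFFFFFFu
    };
    unsigned i;

    for (i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(zero_extend_byte_shifts(samples[i]) ==
               zero_extend_byte_mask(samples[i]));
}
```

For example, if the addition produced 0x104 (260), both versions return 0x04, which is what a uint8 result should wrap to.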
Thumb-2 offers a “best of both worlds” compromise between ARM and Thumb, and aims to deliver the performance of ARM state code with the code density of Thumb state code. Thumb-2 has access to both 16-bit and 32-bit instructions, and even has support for conditional execution, albeit in the form of “If-Then” (IT) constructs. Thumb-2 was first introduced as part of ARMv6T2, and has subsequently been made the default Thumb implementation for ARMv7.
So, let’s build our example for Thumb-2 and analyse it as before. Note that I built this using the “-Otime” optimization in order to generate the IT constructs.
_Z3foohhtt
0x00000000: 2902 .) CMP r1,#2
0x00000002: bf14 .. ITE NE
0x00000004: eb002013 ... ADDNE r0,r0,r3,LSR #8
0x00000008: eb002012 ... ADDEQ r0,r0,r2,LSR #8
0x0000000c: b2c0 .. UXTB r0,r0
0x0000000e: 4770 pG BX lr
The first thing we notice here is that there is a mix of 32-bit and 16-bit instructions, as is apparent from the instruction encodings (second column from the left). Instead of all instructions being the same width, we have 4 instructions which are 16 bits wide and 2 which are 32 bits wide, giving us a code size of 16 bytes.
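For reference, the three code sizes quoted in this post come straight from counting the 16-bit and 32-bit encodings in each listing. A trivial sketch (the helper name is illustrative):

```c
#include <assert.h>
#include <stddef.h>

/* Code size in bytes for a mix of 16-bit and 32-bit instructions. */
static size_t code_size_bytes(size_t n16, size_t n32)
{
    return n16 * 2 + n32 * 4;
}

/* Instruction counts taken from the three listings in the post. */
static void check_sizes(void)
{
    assert(code_size_bytes(0, 5) == 20);  /* ARM:     5 x 32-bit             */
    assert(code_size_bytes(9, 0) == 18);  /* Thumb:   9 x 16-bit             */
    assert(code_size_bytes(4, 2) == 16);  /* Thumb-2: 4 x 16-bit, 2 x 32-bit */
}
```

So for this test case Thumb-2 is the smallest of the three, while using only one instruction more than the ARM build.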
As seen, the second instruction is ITE NE (If-Then-Else). It does no work of its own; it is a heads-up to the processor that the instructions which follow are conditional: the first executes on the NE condition and the second (the “else”) on the opposite condition, EQ. Instructions 3 and 4 are then identical to what we saw in the ARM state code. Finally, the UXTB instruction zero-extends a byte to a word, which in this case zeros out the top 24 bits.
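The ITE NE encoding from the listing (bf14) can be unpacked to show which condition applies to each instruction that follows it. The sketch below is based on my reading of the ARMv7 IT encoding, where the low byte holds a 4-bit first condition followed by a 4-bit mask, and the processor shifts that state along after each instruction in the block. The function names are illustrative, and uxtb_model is simply a C cast standing in for the UXTB instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Expand a 16-bit IT instruction into the condition codes of the
   instructions in its block (0x0 = EQ, 0x1 = NE, ...).
   Returns the number of instructions covered, 1 to 4. */
static int it_block_conditions(uint16_t insn, unsigned conds[4])
{
    unsigned itstate = insn & 0xFFu;      /* firstcond in 7:4, mask in 3:0 */
    int n = 0;

    while ((itstate & 0xFu) != 0 && n < 4) {
        conds[n++] = itstate >> 4;        /* condition for this slot */
        if ((itstate & 0x7u) == 0)
            itstate = 0;                  /* that was the last slot */
        else                              /* shift bits 4:0 left by one */
            itstate = (itstate & 0xE0u) | ((itstate << 1) & 0x1Fu);
    }
    return n;
}

/* UXTB r0,r0 modelled as a cast: keep the low byte, zero the rest. */
static uint32_t uxtb_model(uint32_t r0)
{
    return (uint8_t)r0;
}

/* ITE NE from the listing covers two instructions: NE, then EQ. */
static void check_ite_ne(void)
{
    unsigned conds[4] = { 0, 0, 0, 0 };
    int n = it_block_conditions(0xbf14u, conds);

    assert(n == 2);
    assert(conds[0] == 0x1u);   /* ADDNE */
    assert(conds[1] == 0x0u);   /* ADDEQ */
}
```

Only the lowest bit of the condition changes between the "then" and "else" slots, which is why an IT block can only mix a condition with its exact opposite.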