GCC has an extremely powerful feature that allows inline assembly within C (or C++) code. Other compilers simply insert verbatim assembly constructs into the object code; that assembly code then interfaces with the outside world through the standard ABI. GCC is different. It exposes an interface into its "Register Transfer Language" (RTL). This means that gcc understands the meaning of the inputs and outputs of the fragment of assembly code.
The extra information gcc has allows it to carefully choose the registers (or other operands) that define the interface. The ones chosen can vary depending on the surrounding code. In addition, gcc can be told which registers will be "clobbered" by the assembly code. It will then automatically save and restore them if required. This contrasts strongly with other methods, where inlined assembly code needs to manually do this saving and restoring. (Even when the surrounding code is such that it isn't needed.)
The result is that commonly a piece of gcc inline assembly will compile into a single instruction in the executable or library. (Often you just want access to a single instruction not exposed by C.) However, to do this, you need to understand how to craft the constraints given to the compiler. If they are incorrect, then subtle bugs can result.
A simple function using inline-assembly might look like:
static __attribute__((used)) int var1;
int func1(void)
{
int out;
asm("mov var1, %0" : "=r" (out));
return out;
}
The above shows several features of gcc's interface. Firstly, the asm code is a compile-time C constant string. You can put anything you like within that string. GCC doesn't parse the assembly language itself. What it does do is use escape sequences (i.e. %0 in the above) to reference the interface described by the programmer. In this case %0 corresponds to the zeroth constraint, which in turn is described after the colon.
That constraint "=r" is an output-constraint (due to the use of the '=' symbol), and consists of a general-purpose register (due to the use of the 'r' symbol. The resulting output is then stored into the variable within the parenthesis, 'out'.
The result is a magic bit of code that somehow materializes a value, and then stores it into the variable 'out'. GCC doesn't understand where the value comes from. So in turn, it doesn't know that the variable 'var1' is actually used unless you tell it explicitly by the used attribute. (An unused variable can be elided from the executable object as a simple optimization.)
When the above is put inside a .c file called gcc_asm.c, and then compiled, the result is:
.p2align 4,,15
.globl func1
.type func1, @function
func1:
.LFB7:
.cfi_startproc
#APP
# 8 "gcc_asm.c" 1
mov var1, %eax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE7:
.size func1, .-func1
The standard ABI on 64-bit x86 machines returns integers in the %eax register. GCC picks this as the register to contain the variable 'out'. Thus the resulting function actually consists of only two instructions (the listing above also contains a number of assembler directives describing unwinding and debug information, but those don't appear in the straight-line code):
mov var1, %eax
ret
See how gcc has replaced the '%0' in the asm string with the register it picked for the zeroth constraint. If there were more constraints, we could use '%1', '%2' etc. for them in the asm string. Values up to '%9' are available.
The above describes how to get information out of a fragment of inline assembly code. So what about the reverse, getting information in? An example function that does that looks like:
static __attribute__((used)) int var2;
void func2(int parm)
{
/* Register input - volatile because has no outputs, writes to memory */
asm volatile("mov %0, var2" : : "r" (parm) : "memory");
}
The above looks very similar to the first function. However, it uses two more colon-delimited parts of the asm statement. The first part is again the asm string. The second part, for the outputs, is blank in this case: this function has no outputs. The third section holds an input constraint. Notice that the '=' symbol is missing. (It's an input, not an output.) What remains is the 'r', saying that this asm code wants that input stored in some general-purpose register. Finally, the asm statement ends with a clobber list containing 'memory'. This tells gcc that the code writes to arbitrary memory.
One other difference from the first function is that the asm fragment has an extra 'volatile' keyword. This is necessary because the code has no outputs. GCC needs to know whether it is allowed to elide a seemingly useless asm that doesn't interact with anything else. The 'volatile' tells gcc that it must not be removed. The 'memory' clobber tells gcc that it must not move this asm across other memory references. (Otherwise surrounding reads and writes of 'var2' might be reordered across our store to it.)
It is possible to have output-less inline asm that doesn't have the above annotations. However, be aware that gcc can then optimize your asm away, or move it around. If that happens when you don't expect it, the result will again be subtle bugs.
The above when compiled yields:
.p2align 4,,15
.globl func2
.type func2, @function
func2:
.LFB8:
.cfi_startproc
#APP
# 17 "gcc_asm.c" 1
mov %edi, var2
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE8:
.size func2, .-func2
Which again is as small as possible. GCC picks the %edi register corresponding to the ABI register for the first parameter on x86_64. (If you want to find the exact code generated by the asm fragment, look for the areas surrounded by #APP, #NO_APP comments.)
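As an aside, the combination described above of 'volatile' plus a 'memory' clobber is also the basis of a common idiom: a compiler-level memory barrier. The following minimal sketch (the macro name is just illustrative, it isn't part of the code above) emits no instructions at all, but stops gcc from caching memory values in registers across it or reordering memory accesses around it:
/* Compiler barrier: no instructions emitted, but gcc may not move
   loads or stores across this statement. */
#define barrier() asm volatile ("" : : : "memory")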
It is easy to create an inline asm with both input and output parameters:
int func3(int parm)
{
int out;
asm ("mov %1, %0": "=r" (out) : "r" (parm));
return out;
}
Here, the input parameter is %1, and the output is %0. Note the AT&T syntax used by default, which puts the output on the right of each instruction. Intel format can be used instead, which swaps things around (a sketch of this appears below, after the compiled output). However, most gcc inline asm you will see sticks to AT&T format, so you should get used to seeing it.
The above compiles into:
.p2align 4,,15
.globl func3
.type func3, @function
func3:
.LFB9:
.cfi_startproc
#APP
# 24 "gcc_asm.c" 1
mov %edi, %eax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE9:
.size func3, .-func3
GCC has picked both input and output registers so that again the result is a single instruction.
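As mentioned above, the same thing can be written in Intel syntax. The following is a sketch only: it assumes the whole file is compiled with the -masm=intel flag, so that gcc substitutes operand names in Intel form (no '%' prefix) and the compiler-generated assembly around it is also in Intel syntax. The function name is just for illustration.
int func3_intel(int parm)
{
int out;
/* Intel syntax: destination operand comes first */
asm ("mov %0, %1": "=r" (out) : "r" (parm));
return out;
}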
A slightly more complex example is when you want something to be both an input and an output at the same time. For that, put it in an output position, and use a '+' symbol instead of an '=':
int func4(int parm)
{
asm ("add $0xff, %0": "+r" (parm) : : "cc");
return parm;
}
The above also shows how you should prefix immediates with a dollar symbol in AT&T syntax. It also has the 'cc' clobber. This stands for "condition codes". Since the add instruction will affect the carry flag amongst other things, we need to tell gcc about it. Otherwise it might split a test-and-branch around our code. If it did so, the branch might go the wrong way due to the condition codes being corrupted. Basically, any inline asm that does arithmetic should explicitly clobber the flags like this.
When compiled, we get:
.p2align 4,,15
.globl func4
.type func4, @function
func4:
.LFB10:
.cfi_startproc
movl %edi, %eax
#APP
# 31 "gcc_asm.c" 1
add $0xff, %eax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE10:
.size func4, .-func4
So now the input and outputs are one and the same register, %eax. However, since the parameter passed to the function is in %edi, gcc helpfully copies it into %eax for us. Only when the copying was really needed did gcc insert it.
Looking at a slightly more complex example:
int foo(int);
int func5(int parm)
{
int out;
asm ("mov $0xff, %0\n\t"
"add %1, %0\n\t"
: "=&r" (out) : "r" (parm) : "cc");
return foo(out);
}
/* Broken, because input and output share a register */
int func6(int parm)
{
int out;
asm ("mov $0xff, %0\n\t"
"add %1, %0\n\t"
: "=r" (out) : "r" (parm) : "cc");
return foo(out);
}
Functions 5 and 6 attempt to do something similar to function 4. However, instead of returning a value, they call some other function called foo. This means that the output should end up in the %edi register, the first argument register. However, the input also arrives in that register. The result shows how gcc will assume that output and input registers are allowed to overlap unless you tell it otherwise.
func6() will not work correctly. gcc will pick %edi for both 'out' and 'parm'. This will compile into:
.p2align 4,,15
.globl func6
.type func6, @function
func6:
.LFB12:
.cfi_startproc
#APP
# 51 "gcc_asm.c" 1
mov $0xff, %edi
add %edi, %edi
# 0 "" 2
#NO_APP
jmp foo@PLT
.cfi_endproc
.LFE12:
.size func6, .-func6
Which isn't what we want. The register is corrupted, and then added to itself.
To fix this, use the '=&' constraint. The '&' modifier (an "early clobber") tells gcc that the output register must not overlap an input register. Using that instead gives us function 5:
.p2align 4,,15
.globl func5
.type func5, @function
func5:
.LFB11:
.cfi_startproc
#APP
# 41 "gcc_asm.c" 1
mov $0xff, %eax
add %edi, %eax
# 0 "" 2
#NO_APP
movl %eax, %edi
jmp foo@PLT
.cfi_endproc
.LFE11:
.size func5, .-func5
Which uses two registers, as required. It picks %eax for this, and inserts the extra copy needed.
You may have noticed that the multi-line asm used '\n\t' control codes. This simply makes the result look nice. You need a newline '\n' to go to the next line. The tab character then indents the instruction to line up with the code gcc generates for the rest of the program. (Remember that the inline asm string is basically inserted verbatim into the output sent to the assembler, modulo the operand replacements.)
To have multiple inputs, just separate them with commas:
int func7(int p1, int p2)
{
int out;
asm ("add %2, %1\n\t"
"mov %1, %0\n\t"
: "=r" (out) : "r" (p1), "r" (p2) : "cc");
return out;
}
Which will compile into:
.p2align 4,,15
.globl func7
.type func7, @function
func7:
.LFB13:
.cfi_startproc
#APP
# 61 "gcc_asm.c" 1
add %esi, %edi
mov %edi, %eax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE13:
.size func7, .-func7
Another possibility is that you might want some inputs and outputs to share a register. As described above, one way to do that is to use the '+' constraint. However, there is another way. You can use the number corresponding to another constraint within a second constraint. If you do this, then gcc will know that the two are linked, and must be the same. An example of using this is:
/* Input and output for same variable */
int func8(int p1, int p2)
{
asm ("add %2, %0"
: "=r" (p1) : "0" (p1), "r" (p2) : "cc");
return p1;
}
Which compiles into:
.p2align 4,,15
.globl func8
.type func8, @function
func8:
.LFB14:
.cfi_startproc
movl %edi, %eax
#APP
# 70 "gcc_asm.c" 1
add %esi, %eax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE14:
.size func8, .-func8
This may or may not be a more readable technique than using a '+' constraint. '+' used to be buggy in old versions of gcc, so old code tends to use the matching-digit method. Newer code might want to use the more concise '+' descriptor.
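For comparison, here is a sketch of func8 rewritten with the '+' form (the function name is just illustrative); the two versions should behave identically:
int func8_plus(int p1, int p2)
{
/* '+' marks p1 as both input and output; p2 is now operand %1 */
asm ("add %1, %0"
: "+r" (p1) : "r" (p2) : "cc");
return p1;
}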
In addition to passing information in registers, gcc can understand references to raw memory. This will expand to some more complex addressing mode within the asm string. Note that not all instructions can handle arbitrary memory references. Thus sometimes you need gcc to create a register with the required information. However, if you can get away with it, it is more efficient to use memory directly. Some code that does this looks like:
int var9;
int func9(void)
{
int out;
asm ("mov %0, %1"
: "=r" (out) : "m" (var9));
return out;
}
Which compiles into:
.p2align 4,,15
.globl func9
.type func9, @function
func9:
.LFB15:
.cfi_startproc
#APP
# 80 "gcc_asm.c" 1
mov %eax, var9(%rip)
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE15:
.size func9, .-func9
Notice how in the above, gcc has generated a %rip-relative addressing mode for us.
Sometimes you really want a constraint to be satisfied by a certain register. Fortunately, gcc has specialized constraints for many (but not all) of the general purpose registers used on x86_64.
int func10(int p1)
{
asm ("inc %%eax" : "+a" (p1) :: "cc");
return p1;
}
The above code shows how you can explicitly use the 'a' register (which corresponds to %al, %ax, %eax, or %rax, depending on size). Note how we need to use a double-percent sign within the asm string. This is similar to a normal printf format string, where to print a single percent you need two of them. (This is due to a percent symbol being an escape character.)
Compiling, we get:
.p2align 4,,15
.globl func10
.type func10, @function
func10:
.LFB16:
.cfi_startproc
movl %edi, %eax
#APP
# 89 "gcc_asm.c" 1
inc %eax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE16:
.size func10, .-func10
GCC has copied from %edi into the constraint register defined by 'a', %eax for us. Note that different machines will have differing names, and differing constraint symbols for their registers. You will need to look at the gcc documentation for your particular machine to find out what they are. This article will concentrate on the x86_64 case.
Another commonly used register is the 'd' (%dl, %dx, %edx, %rdx) register:
int func11(int p1, int p2, int p3)
{
asm ("add %1, %0" : "+d" (p1) : "r" (p3) : "cc");
return p1;
}
The above is a little tricky. p3 is passed in within %edx as specified by the function ABI. This means that gcc needs to copy it into another register so that p1 can go there. Fortunately, gcc handles all of the marshalling for us:
.p2align 4,,15
.globl func11
.type func11, @function
func11:
.LFB17:
.cfi_startproc
movl %edx, %ecx
movl %edi, %edx
#APP
# 96 "gcc_asm.c" 1
add %ecx, %edx
# 0 "" 2
#NO_APP
movl %edx, %eax
ret
.cfi_endproc
.LFE17:
.size func11, .-func11
Note the extra moves before the add instruction, and afterwards in order to get things where they need to be. This is the reason why you really shouldn't use explicit named registers if you can avoid them. The only time where they are unavoidable is if you want to match some kind of ABI, or have to interface with an instruction with fixed inputs or outputs.
An example of this on x86 is the mul instruction. That will put its output in the 'a' and 'd' registers, and always takes one of its inputs from the 'a' register. So to describe its use you might do something like:
/* Commuting inputs */
unsigned func12(unsigned p1, unsigned p2)
{
unsigned hi, lo;
asm ("mul %3"
: "=a" (lo), "=d" (hi)
: "%0" (p1), "r" (p2) : "cc");
return hi + lo;
}
The above uses another feature of gcc asm. Sometimes inputs commute, and we don't really care which of them uses a particular register. In this case p1*p2 = p2*p1, and we don't mind which of them goes in %eax. To tell gcc this, we can use the '%' constraint flag, which means that that constraint and the following one commute.
.p2align 4,,15
.globl func12
.type func12, @function
func12:
.LFB18:
.cfi_startproc
movl %edi, %eax
#APP
# 109 "gcc_asm.c" 1
mul %esi
# 0 "" 2
#NO_APP
addl %edx, %eax
ret
.cfi_endproc
.LFE18:
.size func12, .-func12
In this case, gcc decides not to swap the order of the two inputs because it doesn't matter.
We can try something slightly different, where we use the 'D' constraint to force the use of %edi as the multiplicand.
unsigned func13(unsigned p1, unsigned p2)
{
unsigned hi, lo;
asm ("mul %3"
: "=a" (lo), "=d" (hi)
: "%0" (p1), "D" (p2) : "cc");
return hi + lo;
}
This compiles into:
.p2align 4,,15
.globl func13
.type func13, @function
func13:
.LFB19:
.cfi_startproc
movl %edi, %eax
movl %esi, %edi
#APP
# 121 "gcc_asm.c" 1
mul %edi
# 0 "" 2
#NO_APP
addl %edx, %eax
ret
.cfi_endproc
.LFE19:
.size func13, .-func13
Unfortunately, gcc fails to make the swap in this case as well, even though it would be very profitable to do so. It looks like you can't really count on the '%' constraint specifier, which is a shame.
There is another way to get more flexibility within the constraints. You can simply list more than one constraint symbol. GCC will choose the best one. An example of using either a register, or a direct memory reference is:
int var14;
int func14(int p1)
{
asm ("add %1, %0"
:"+r" (p1) : "rm" (var14) : "cc");
return foo(p1);
}
Which will use the better direct-memory operand:
.globl func14
.type func14, @function
func14:
.LFB20:
.cfi_startproc
#APP
# 133 "gcc_asm.c" 1
add var14(%rip), %edi
# 0 "" 2
#NO_APP
jmp foo
.cfi_endproc
.LFE20:
.size func14, .-func14
Another way of gaining flexibility is using a more general constraint. 'g' allows a register, memory, or immediate operand. Using it:
int var15;
int func15(int p1)
{
asm ("add %1, %0"
:"+r" (p1) : "g" (var15) : "cc");
return foo(p1);
}
GCC will again pick the best option, which in this case is a direct memory addressing mode.
.p2align 4,,15
.globl func15
.type func15, @function
func15:
.LFB21:
.cfi_startproc
#APP
# 142 "gcc_asm.c" 1
add var15(%rip), %edi
# 0 "" 2
#NO_APP
jmp foo
.cfi_endproc
.LFE21:
.size func15, .-func15
Of course, if you want an immediate, there is a symbol for that as well, 'i'. The limitation is that an immediate must be a compile- or link-time constant.
int func16(int p1)
{
asm ("add %1, %0"
:"+r" (p1) : "i" (99) : "cc");
return foo(p1);
}
Which compiles into:
.globl func16
.type func16, @function
func16:
.LFB22:
.cfi_startproc
#APP
# 150 "gcc_asm.c" 1
add $99, %edi
# 0 "" 2
#NO_APP
jmp foo
.cfi_endproc
.LFE22:
.size func16, .-func16
Notice how gcc automatically converts into the AT&T syntax for us, with the dollar symbol preceding the constant.
There are other constraint modifiers. One of these is the '#' symbol, which acts like a comment character.
int var17;
int func17(int p1)
{
asm ("add %1, %0"
:"+r" (p1) : "m#hello" (var17) : "cc");
return foo(p1);
}
The above compiles into:
.p2align 4,,15
.globl func17
.type func17, @function
func17:
.LFB23:
.cfi_startproc
#APP
# 159 "gcc_asm.c" 1
add var17(%rip), %edi
# 0 "" 2
#NO_APP
jmp foo
.cfi_endproc
.LFE23:
.size func17, .-func17
Everything after the hash symbol is ignored. Unfortunately, you can't include spaces or punctuation symbols within the comment. The other thing that ends the 'comment' is a comma. This is because you can use commas to allow multiple alternatives in an inline asm. The alternatives are linked together (all first option, all second option, etc.) rather than being unlinked like in the 'rm' case. Some example code is:
int var18, var18a;
int func18(int p1, int p2)
{
/* Uses reg-reg option */
asm ("add %1, %0"
:"+m,r,r" (p1) : "r,m,r" (p2) : "cc");
/* Uses mem-reg option */
asm ("add %1, %0"
:"+m,r,r" (var18) : "r,m,r" (p2) : "cc");
#if 1
/* Uses reg-mem option even with two memory operands (gcc copies for us) */
asm ("add %1, %0"
:"+m,r,r" (var18) : "r,m,r" (var18a) : "cc");
#else
/* Creates invalid instruction with two memory operands */
asm ("add %1, %0"
:"+g" (var18) : "g" (var18a) : "cc");
#endif
/* Uses reg-mem option */
asm ("add %1, %0"
:"+m,r,r" (p1) : "r,m,r" (var18) : "cc");
#if 0
/* Doesn't work "inconsistent operand constraints" */
asm ("add %1, %0"
:"+r,m" (p1) : "m,r" (var18) : "cc");
#endif
return foo(p1);
}
The above shows the power of the technique. In x86 assembly language, there can only be a single reference to memory within an instruction. Thus if we use two 'g' constraints, we can sometimes generate invalid code. One fix for this is to use register-only 'r' constraints. However, they can lead to inefficiency. What we want to do is only ban the invalid option. By using alternative constraints, we select the valid 'm + r', 'r + m', and 'r + r' options.
Note that this feature isn't used very often within inline asm code, so it is a little buggy. The final inline asm in the function above, which is preprocessed out, should work. However, gcc gets confused by it. The fix is to add the 'r + r' option, as in the other cases.
When compiled, the above yields:
.p2align 4,,15
.globl func18
.type func18, @function
func18:
.LFB24:
.cfi_startproc
#APP
# 174 "gcc_asm.c" 1
add %esi, var18(%rip)
# 0 "" 2
# 170 "gcc_asm.c" 1
add %esi, %edi
# 0 "" 2
#NO_APP
movl var18a(%rip), %eax
#APP
# 178 "gcc_asm.c" 1
add %eax, var18(%rip)
# 0 "" 2
# 186 "gcc_asm.c" 1
add var18(%rip), %edi
# 0 "" 2
#NO_APP
jmp foo
.cfi_endproc
.LFE24:
.size func18, .-func18
Another possibility is when you want a constraint, but you don't want the compiler to worry too much about the cost of that constraint. This doesn't really come into play very often. In fact, with orthogonal architectures like x86, it may not happen at all. This is really a case of API leakage, where gcc offers to all machines a feature that is only useful on some of them. The '*' constraint specifier causes the following character to not count in terms of register pressure. The canonical example is the following:
int var19;
int func19(int p1)
{
int out;
/* Picks the "same register" case */
asm ("add %1, %0"
:"=*r,m" (out) : "0,r" (p1) : "cc");
/* Picks the "mem-register operand" case */
asm ("add %1, %0"
:"=*r,m" (var19) : "0,r" (out) : "cc");
return var19 - out;
}
In the above we have an instruction (an add, in this case), which will either take two references to the same register, or a memory-register combination. The same-reg, same-reg case is more strict, and we would like gcc to use the memory-addressing version if possible. The '*' accomplishes this. However, this trick is rather subtle... and probably shouldn't be used with inline asm. The above compiles into:
.p2align 4,,15
.globl func19
.type func19, @function
func19:
.LFB25:
.cfi_startproc
#APP
# 204 "gcc_asm.c" 1
add %edi, %edi
# 0 "" 2
# 208 "gcc_asm.c" 1
add %edi, var19(%rip)
# 0 "" 2
#NO_APP
movl var19(%rip), %eax
subl %edi, %eax
ret
.cfi_endproc
.LFE25:
.size func19, .-func19
Note how the differing form of the instruction is chosen.
A much better technique is to use constraint modifiers that explicitly penalize some alternatives over others. By using the right amount of penalization, you can create patterns that match the machine's costs. GCC will then be able to make intelligent choices about which is best. The simple way to do this is to add a '?' character to the more costly alternative.
/* Alternatives - costs (works with partial matches) */
int func20(int p1, int p2)
{
#if 1
/* Picks using eax */
asm ("add %1, %0"
:"+r?,a" (p1) : "d?,r" (p2) : "cc");
#else
/* Picks using edx */
asm ("add %1, %0"
:"+r,a?" (p1) : "d,r?" (p2) : "cc");
#endif
return foo(p1);
}
The above shows how you can tell the compiler that (for example) %eax can be more or less expensive to use than %edx. It compiles into:
.p2align 4,,15
.globl func20
.type func20, @function
func20:
.LFB26:
.cfi_startproc
movl %edi, %eax
#APP
# 220 "gcc_asm.c" 1
add %esi, %eax
# 0 "" 2
#NO_APP
movl %eax, %edi
jmp foo
.cfi_endproc
.LFE26:
.size func20, .-func20
Of course, a single level of penalization might not be enough. You can add more '?' symbols. Two question marks is even more penalized than one.
/* Can use more than one '?' for larger cost */
int func21(int p1, int p2)
{
#if 1
/* Picks using eax */
asm ("add %1, %0"
:"+r??,a?" (p1) : "d??,r?" (p2) : "cc");
#else
/* Picks using edx */
asm ("add %1, %0"
:"+r?,a??" (p1) : "d?,r??" (p2) : "cc");
#endif
return foo(p1);
}
Giving:
.p2align 4,,15
.globl func21
.type func21, @function
func21:
.LFB27:
.cfi_startproc
movl %edi, %eax
#APP
# 235 "gcc_asm.c" 1
add %esi, %eax
# 0 "" 2
#NO_APP
movl %eax, %edi
jmp foo
.cfi_endproc
.LFE27:
.size func21, .-func21
For even greater penalization, you can use the '!' symbol. It is equivalent to 100 '?' symbols. This should be very rarely needed.
/* Even stronger disparagement of the alternative ! = 100?'s */
int func22(int p1, int p2)
{
#if 1
/* Picks using eax */
asm ("add %1, %0"
:"+r!,a??" (p1) : "d!,r??" (p2) : "cc");
#else
/* Picks using edx */
asm ("add %1, %0"
:"+r??,a!" (p1) : "d??,r!" (p2) : "cc");
#endif
return foo(p1);
}
Giving:
.p2align 4,,15
.globl func22
.type func22, @function
func22:
.LFB28:
.cfi_startproc
movl %edi, %eax
#APP
# 250 "gcc_asm.c" 1
add %esi, %eax
# 0 "" 2
#NO_APP
movl %eax, %edi
jmp foo
.cfi_endproc
.LFE28:
.size func22, .-func22
Up until now, we have only used the clobber part of the asm intrinsic for 'memory', and 'cc' (condition codes). However, there are other things you can put in there. The most often used are names of registers. This tells gcc that that register is somehow used in the asm string. It will not use that register for inputs or outputs, and will helpfully save that register before the asm is called, and then will automatically restore it afterwards.
An example of this where we clobber the %rdx register is:
int func23(int p1)
{
unsigned out = 25;
asm ("mul %1"
: "+a" (out) : "g" (p1) : "%rdx", "cc");
return out;
}
The mul instruction will write to %rax and %rdx. We don't care about the upper part, so it isn't an output. To tell gcc about the register write, the clobber does the job. (Yes, there are other versions of the x86 multiply instruction that don't clobber %rdx unnecessarily, but this is just an example of how clobbers might be useful.) This compiles into:
.p2align 4,,15
.globl func23
.type func23, @function
func23:
.LFB29:
.cfi_startproc
movl $25, %eax
#APP
# 264 "gcc_asm.c" 1
mul %edi
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE29:
.size func23, .-func23
In this case, %rdx is 'dead' because it is a parameter register in the ABI. GCC doesn't need to save or restore it, so doesn't. Without the clobber, we would need to save and restore the register manually. That would be inefficient in cases like the above, where such saves and restores are not needed.
Of course, you can clobber more than one register:
/* More than one clobber */
int func24(int p1)
{
int out;
/* %1 cannot overlap clobber list. Note use of "%%" in asm */
asm ("mov %1, %%edi\n\t"
"call foo\n\t"
: "=a" (out) : "g" (p1) : "%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9", "memory", "cc");
return out;
}
The above is bad coding style. You really shouldn't use control-flow altering instructions inside inline asm. GCC doesn't know about them, and can do optimizations that invalidate what you are trying to do. (If 'foo' is inlined everywhere, it may not even exist to call.) Also, there have been many bugs when the number of clobbered registers gets too large. If gcc can't find a way to save and restore everything it may simply give up and crash.
In the above case, we are lucky, and it compiles without issue. The trick is to notice that the clobbered registers are all dead (except %rdi) due to the x86_64 SYSV ABI.
.p2align 4,,15
.globl func24
.type func24, @function
func24:
.LFB30:
.cfi_startproc
movl %edi, %eax
#APP
# 275 "gcc_asm.c" 1
mov %eax, %edi
call foo
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE30:
.size func24, .-func24
A much better technique is to use explicit temporaries. GCC can then allocate them wherever it has space. It can also move things around for more efficiency, based on the needs of surrounding code. An example of doing this is:
/* Better option, if available. Use temp out registers */
int func25(int p1, int p2)
{
int temp1, temp2;
int out;
/* '=&' so temp's don't overlap with inputs */
asm ("mov %3, %1\n\t"
"mov %4, %2\n\t"
"shr $10, %1\n\t"
"shl $10, %2\n\t"
"add %3, %1\n\t"
"lea (%4, %2, 1), %0\n\t"
"xor %1, %0\n\t"
: "=r" (out), "=&r" (temp1), "=&r" (temp2): "g" (p1), "g" (p2) : "cc");
return out;
}
In the above, we use two temporary registers. Since we don't want them to overlap the other inputs or outputs, they need to be defined by '=&r' constraints. The only thing left on the clobber list is the 'cc' due to the arithmetic and logic instructions altering the condition codes.
.p2align 4,,15
.globl func25
.type func25, @function
func25:
.LFB31:
.cfi_startproc
#APP
# 288 "gcc_asm.c" 1
mov %edi, %edx
mov %esi, %ecx
shr $10, %edx
shl $10, %ecx
add %edi, %edx
lea (%esi, %ecx, 1), %eax
xor %edx, %eax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE31:
.size func25, .-func25
Finally, there is another way to name registers within the asm string itself. Depending on your point of view, the numerical '%0-%9' scheme may be more or less readable than the following:
int func26(int p1)
{
int out = 137;
asm ("sub %[p1], %[out_name]"
: [out_name] "+r" (out) : [p1] "g" (p1) : "cc");
return out;
}
By putting a name within square brackets in the constraints we can then use those names in the asm string. Note that the asm operand name does not have to be the same as the C variable it comes from. However, for readability, it may be better to keep the two the same if possible. The main disadvantage of the technique is that it can make the asm string a little longer, and can make it harder to see what addressing modes are used.
.p2align 4,,15
.globl func26
.type func26, @function
func26:
.LFB32:
.cfi_startproc
movl $137, %eax
#APP
# 304 "gcc_asm.c" 1
sub %edi, %eax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE32:
.size func26, .-func26
There are a few standard constraints beyond those discussed above.
One of these is for "offsetable memory", which is any memory reference to which a constant offset can be added. On the orthogonal x86 architecture, this is anything that 'm' could reference, so this constraint class isn't too useful there. Other machines may be different though. An example of its usage is:
int func27(int p1)
{
static int out[2];
asm ("mov %1, 4+%0"
: "=o" (out) : "r" (p1));
return out[1];
}
Which compiles into:
.p2align 4,,15
.globl func27
.type func27, @function
func27:
.LFB33:
.cfi_startproc
#APP
# 317 "gcc_asm.c" 1
mov %edi, 4+out.2398(%rip)
# 0 "" 2
#NO_APP
movl out.2398+4(%rip), %eax
ret
.cfi_endproc
.LFE33:
.size func27, .-func27
The linker and assembler understand the more complex addressing within "out.2398+4(%rip)", and will generate the appropriate fix-up for us.
Since some machines have offsetable memory as a separate class from normal memory constraints, there is some memory which is not offsetable. If you want to have a constraint that references such memory, you can use the 'V' constraint flag. However, since x86 doesn't have such a beast, we don't provide an example of its use.
Some machines provide memory that automatically increments or decrements things stored within it. Such memory can be described by the '<' and '>' constraints. Again, x86 doesn't have anything like that, so those constraints are not supported, and no example is provided.
Another constraint that isn't so useful on x86 is 'n'. That refers to an integer constant whose numeric value is known at compile time. Some machines have less capable assemblers and linkers, and cannot use the more general 'i' constraint for every operand size; 'i' also accepts symbolic constants that are only resolved at assembly or link time. Since 'n' defines a sub-category of 'i', you can also use it on x86:
int func28(void)
{
int out = 0;
asm ("add %1,%0"
: "+r" (out) : "n" (5));
return out;
}
The above acts just like 'i' would do, and uses the 5 as an immediate:
.p2align 4,,15
.globl func28
.type func28, @function
func28:
.LFB34:
.cfi_startproc
xorl %eax, %eax
#APP
# 332 "gcc_asm.c" 1
add $5,%eax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE34:
.size func28, .-func28
Another integer immediate constraint type is 's'. This describes an integer that is known at link time, but not compile or assembly time. This isn't particularly useful on x86, but on other machines can lead to optimizations.
Not all immediates are integers. Some machines allow immediate floating point numbers. The 'E' constraint is for floating point immediates that are defined on the compiling machine. If the target machine is different, then the bit-values may be incorrect. Thus, this constraint shouldn't be used if you are cross-compiling.
The x86 architecture really doesn't allow floating point immediates. You should get constants into SSE registers and the legacy floating point stack from memory instead. However, there are a couple of special cases that still work:
double func29(void)
{
double out;
unsigned long long temp;
asm ("movabs %2, %1\n\t"
"movq %1, %0\n\t"
: "=x" (out), "=r" (temp) : "E" (2.0));
return out;
}
The above uses the bit-pattern for the double '2.0', and indirectly moves it into an SSE register (selected by the 'x' constraint). It would be more efficient to do a direct memory load, but the above does work:
.p2align 4,,15
.globl func29
.type func29, @function
func29:
.LFB35:
.cfi_startproc
#APP
# 346 "gcc_asm.c" 1
movabs $0x4000000000000000, %rax
movq %rax, %xmm0
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE35:
.size func29, .-func29
The code for float-sized immediates is similar:
float func30(void)
{
float out;
unsigned temp;
asm ("mov %2, %1\n\t"
"movd %1, %0\n\t"
: "=x" (out), "=r" (temp) : "E" (2.0f));
return out;
}
Giving:
.p2align 4,,15
.globl func30
.type func30, @function
func30:
.LFB36:
.cfi_startproc
#APP
# 357 "gcc_asm.c" 1
mov $0x40000000, %eax
movd %eax, %xmm0
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE36:
.size func30, .-func30
In addition to the 'E' constraint there is the 'F' constraint. This is cross-compiling friendly, and should probably be used instead. Otherwise, it has the same meaning as its 'E' cousin.
double func31(void)
{
double out;
unsigned long long temp;
asm ("movabs %2, %1\n\t"
"movq %1, %0\n\t"
: "=x" (out), "=r" (temp) : "F" (2.0));
return out;
}
Which produces identical code to the 'E' version:
.p2align 4,,15
.globl func31
.type func31, @function
func31:
.LFB37:
.cfi_startproc
#APP
# 368 "gcc_asm.c" 1
movabs $0x4000000000000000, %rax
movq %rax, %xmm0
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE37:
.size func31, .-func31
Another rarely used constraint is 'p'. It describes a valid memory address. On x86, it behaves just like 'm' does. You should use the more standard 'm' instead.
void *func32(void)
{
static int mem;
void *out;
asm ("lea (%1), %0"
: "=r" (out) : "p" (&mem));
return out;
}
Which compiles into:
.p2align 4,,15
.globl func32
.type func32, @function
func32:
.LFB38:
.cfi_startproc
#APP
# 381 "gcc_asm.c" 1
lea ($mem.2421), %rax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE38:
.size func32, .-func32
There is one final constraint common to all machines, 'X'. This constraint matches absolutely everything. This catch-all doesn't give gcc any information about how to pass the information to the inline asm, so gcc picks the form most convenient for it. Since the exact output will be highly variable, it is difficult to use in normal asm instructions. However, it may be helpful in asm directives:
const char *func33(int p1)
{
const char *str;
asm (".pushsection .data\n\t"
"1:\n\t"
".asciz \"%1\"\n\t"
".popsection\n\t"
"lea 1b(%%rip),%0\n\t"
: "=r" (str) : "X" (p1));
return str;
}
The above compiles into:
.p2align 4,,15
.globl func33
.type func33, @function
func33:
.LFB39:
.cfi_startproc
#APP
# 390 "gcc_asm.c" 1
.pushsection .data
1:
.asciz "%edi"
.popsection
lea 1b(%rip),%rax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE39:
.size func33, .-func33
This creates a zero-terminated ASCII string containing the operand used by gcc. With a bit of section magic, it obtains a pointer to it, which is then returned in the output.
Most of the previous constraint types will work on all machines. Some have been x86-only though. For example, 'a', which will expand to '%al', '%ax', '%eax' or '%rax', will obviously not work the same way on another architecture. We have seen a few of these x86-only constraints already, but there are many more.
A simple register constraint is 'R'. This selects any legacy register for use. i.e. one of the a,b,c,d,si,di,bp, or sp registers. This may be useful when interfacing with old code unable to use any of the new 64 bit registers. Otherwise, the constraint acts just like 'r' would do:
int func34(int p1, int p2, int p3, int p4, int p5)
{
int out;
/* Copies from r8 into legacy register first */
asm ("mov %1, %0"
: "=r" (out) : "R" (p5));
return foo(out);
}
The above cannot use p5 as is because it is passed in %r8 by the ABI. Thus gcc will insert a move instruction into a legacy register as requested. This copy wouldn't happen if 'r' were used instead:
.p2align 4,,15
.globl func34
.type func34, @function
func34:
.LFB40:
.cfi_startproc
movl %r8d, %eax
#APP
# 405 "gcc_asm.c" 1
mov %eax, %edi
# 0 "" 2
#NO_APP
jmp foo
.cfi_endproc
.LFE40:
.size func34, .-func34
Another constraint that picks a subset of the available registers is 'q'. This picks a register with an addressable lower 8-bit part. The list of available registers differs between 64-bit mode and 32-bit mode. In 32-bit mode, some of the registers don't exist. i.e. you can't access %dil or %sil.
char func35(char p1)
{
char out;
asm ("mov %1, %0"
: "=q" (out) : "q" (p1));
return out;
}
Otherwise the use looks exactly like 'r' would have.
.p2align 4,,15
.globl func35
.type func35, @function
func35:
.LFB41:
.cfi_startproc
#APP
# 414 "gcc_asm.c" 1
mov %dil, %al
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE41:
.size func35, .-func35
A variant of the above is the 'Q' constraint, that picks a register with a 'high' 8-bit sub-register. i.e. any of the a, b, c or d registers:
char func35a(char p1)
{
char out;
asm ("mov %1, %0"
: "=Q" (out) : "Q" (p1));
return out;
}
Which compiles into:
.p2align 4,,15
.globl func35a
.type func35a, @function
func35a:
.LFB42:
.cfi_startproc
movl %edi, %edx
#APP
# 423 "gcc_asm.c" 1
mov %dl, %al
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE42:
.size func35a, .-func35a
Notice how the compiler was not allowed to use the %edi register as the operand any more. Instead, it picked %edx.
As we have seen in the earlier sections, some of the x86 registers have constraints of their very own. We have seen 'a' and 'd'. Similarly, 'b' and 'c' do what you might expect, and refer to the '%bl', '%bx', '%ebx', and '%rbx' registers, and the '%cl', '%cx', '%ecx', and '%rcx' registers respectively. An example of this might be:
int func36(int p1, int p2, int p3, int p4)
{
int out;
asm ("mov %1, %0\n\t"
"add %2, %3\n\t"
"add %4, %0\n\t"
"add %3, %0\n\t"
: "=&r" (out) : "a" (p1), "b" (p2), "c" (p3), "d" (p4) : "cc");
return out;
}
Where every input has had its register manually defined by an explicit constraint. GCC needs to do a little bit of copying to get everything into the right spot:
.p2align 4,,15
.globl func36
.type func36, @function
func36:
.LFB42:
.cfi_startproc
movl %edx, %r8d
pushq %rbx
.cfi_def_cfa_offset 16
.cfi_offset 3, -16
movl %ecx, %edx
movl %edi, %eax
movl %esi, %ebx
movl %r8d, %ecx
#APP
# 423 "gcc_asm.c" 1
mov %eax, %r9d
add %ebx, %ecx
add %edx, %r9d
add %ecx, %r9d
# 0 "" 2
#NO_APP
popq %rbx
.cfi_def_cfa_offset 8
movl %r9d, %eax
ret
.cfi_endproc
.LFE42:
.size func36, .-func36
There are also special constraints for the si and di registers, 'S' and 'D' respectively. (We have used 'D' before in func13().) Something using them looks like:
int func37(int p1, int p2)
{
int out;
asm ("mov %1, %0\n\t"
"add %2, %0\n\t"
: "=&r" (out) : "D" (p1), "S" (p2) : "cc");
return out;
}
Which compiles into:
.p2align 4,,15
.globl func37
.type func37, @function
func37:
.LFB43:
.cfi_startproc
#APP
# 435 "gcc_asm.c" 1
mov %edi, %eax
add %esi, %eax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE43:
.size func37, .-func37
There is one final way to access the general purpose registers, which is via the 'A' constraint. This is the two-register pair defined by the a and d registers. This is useful when you want to deal with 128-bit quantities in 64-bit mode, or 64-bit quantities in 32-bit mode. The low bits are stored in the a register, and the high bits in the d register, just like the multiply and division asm instructions expect. Its use looks like:
__uint128_t func38(unsigned long long p1, unsigned long long p2)
{
__uint128_t out;
asm ("mul %2"
:"=&A" (out) : "%0" (p1), "g" (p2) : "cc");
return out;
}
Which compiles into:
.p2align 4,,15
.globl func38
.type func38, @function
func38:
.LFB44:
.cfi_startproc
movq %rdi, %rax
#APP
# 445 "gcc_asm.c" 1
mul %rsi
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE44:
.size func38, .-func38
Since the ABI requires a function returning a 128-bit integer to do so in %rax and %rdx, the above has no extra register to register copies. (Other than that required to get the multiply instruction initialized.)
The x86 has a strange floating-point coprocessor which uses an internal stack of registers. Dealing with this in gcc is difficult. You need to make sure that the right number of values are pushed onto and popped from this stack. GCC assumes that all output constraints are under its purview, and are popped by it. Input constraints are more complex: they can either be popped by gcc afterwards, or be consumed by the asm code itself.
The least complex method is to tie an input constraint to an output. It is then popped afterwards along with the output that replaces it. You can also clobber an input's register to mark it as implicitly popped. Otherwise, gcc will assume it can use the input later for other calculations, and will handle the popping of that register itself.
One critical detail is that the floating point unit acts on a stack. That means that the registers used (popped or not) must be contiguous. It's not possible for gcc to re-arrange the stack by popping something in the middle. You need to make sure the outputs are first on the stack, followed by all the registers your asm pops, and finally followed by the ones gcc will pop from the stack itself.
The constraint for the top of the floating point stack is 't'. We can add things to the stack without a floating point register input by using memory instead:
long double func39(int p1)
{
long double out;
asm ("fild %1"
: "=t" (out) : "m" (p1));
return out;
}
The above converts an integer into a long double float:
.p2align 4,,15
.globl func39
.type func39, @function
func39:
.LFB45:
.cfi_startproc
movl %edi, -12(%rsp)
#APP
# 454 "gcc_asm.c" 1
fild -12(%rsp)
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE45:
.size func39, .-func39
The ABI mandates that long doubles are returned in st(0), so the above routine doesn't need to alter the stack.
The next-from-top floating point register, st(1), also has a special constraint: 'u'. An example of its use might be:
long double func40(long double p1, long double p2)
{
long double out;
asm ("fadd %2, %0"
: "=&t" (out) : "%0" (p1), "u" (p2));
return out;
}
Note how in the above we link the first input to the output, so it is stored in st(0), and popped by gcc afterwards. The other input is in st(1), and since it is not clobbered, will also be popped by gcc afterwards.
.globl func40
.type func40, @function
func40:
.LFB46:
.cfi_startproc
fldt 8(%rsp)
fldt 24(%rsp)
fxch %st(1)
#APP
# 463 "gcc_asm.c" 1
fadd %st(1), %st
# 0 "" 2
#NO_APP
fstp %st(1)
ret
.cfi_endproc
.LFE46:
.size func40, .-func40
You can see how gcc sets up the floating point stack (in a not particularly efficient way). You can also see how the st(1) input is cleaned up afterwards by the fstp instruction. st(0) is still live at the end of the function, and is used for the long double output.
Finally, you can create an input in an arbitrary floating point slot by using the 'f' constraint. (This doesn't work as an output constraint.) An example of this is:
long double func41(long double p1)
{
long double out = 2.0;
asm ("fadd %1\n\t"
: "+&t" (out) : "f" (p1));
return out;
}
Where just to be different from the previous function, we use an in-out parameter on the top of the stack.
.p2align 4,,15
.globl func41
.type func41, @function
func41:
.LFB47:
.cfi_startproc
flds .LC1(%rip)
fldt 8(%rsp)
fxch %st(1)
#APP
# 472 "gcc_asm.c" 1
fadd %st(1)
# 0 "" 2
#NO_APP
fstp %st(1)
ret
.cfi_endproc
.LFE47:
.size func41, .-func41
Again the generated code has one more fxch than is needed. You really shouldn't use the legacy floating point instructions. Instead, modern code should use SSE instructions for its floating point work.
Another legacy part of the x86 instruction set is the MMX registers. These are aliases of the legacy floating point stack. This means that they are difficult to use, because you need to execute the 'emms' instruction afterwards to avoid floating point exceptions. However, some older vectorized code does use them. The constraint for their use is 'y':
typedef char mmx64 __attribute__ ((vector_size (8)));
mmx64 func42(mmx64 p1, mmx64 p2)
{
asm ("paddb %1, %0"
: "+&y" (p1) : "y" (p2));
return p1;
}
Which compiles into:
.globl func42
.type func42, @function
func42:
.LFB48:
.cfi_startproc
movdq2q %xmm0, %mm0
movdq2q %xmm1, %mm1
#APP
# 485 "gcc_asm.c" 1
paddb %mm1, %mm0
# 0 "" 2
#NO_APP
movq %mm0, -8(%rsp)
movq -8(%rsp), %xmm0
ret
.cfi_endproc
.LFE48:
.size func42, .-func42
The above is obviously very inefficient, as gcc goes through the better SSE registers as mandated by the vector ABI. Another thing missing is the emms instruction. You'll need to use yet another inline asm in order to add it where needed. A better option is to avoid these registers if possible.
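A minimal sketch of adding that emms is shown below (the helper name is purely illustrative):
/* Reset the x87/MMX state after the last MMX instruction, so that
   later floating point code doesn't fault. */
static inline void mmx_finish(void)
{
asm volatile ("emms");
}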
Instead, most modern code should be using the 16-byte SSE registers. The constraint for accessing those is 'x'. (This was also used in func29.) Since the ABI is much more compatible, the overhead is lower:
typedef double xmmd __attribute__ ((vector_size (16)));
xmmd func43(xmmd p1, xmmd p2)
{
asm ("addpd %1, %0"
: "+&x" (p1) : "x" (p2));
return p1;
}
Which when compiled, produces:
.p2align 4,,15
.globl func43
.type func43, @function
func43:
.LFB49:
.cfi_startproc
#APP
# 494 "gcc_asm.c" 1
addpd %xmm1, %xmm0
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE49:
.size func43, .-func43
Many fewer instructions are used in the above, with the bulk of the function just a single SSE instruction.
The final register constraint type is defined by the two-character string 'Yz'. This constrains to the first SSE register, %xmm0. This is useful because that register is often mentioned by the ABI. It is the first floating point or vector parameter passed to a function, and also the register used for floating point or vectorized output. Using it is easy:
xmmd func44(xmmd p1, xmmd p2)
{
asm ("addpd %1, %0"
: "+&Yz" (p2) : "x" (p1));
return p2;
}
Here we deliberately cause gcc to have to swap the SSE registers around in order to get p2 into %xmm0:
.globl func44
.type func44, @function
func44:
.LFB50:
.cfi_startproc
movapd %xmm0, %xmm2
movapd %xmm1, %xmm0
#APP
# 502 "gcc_asm.c" 1
addpd %xmm2, %xmm0
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE50:
.size func44, .-func44
In addition to the machine-specific register constraints, the x86 inline asm in gcc also supports special integer constraints. Most of these are actually not useful for inline asm - being 'leakage' from the RTL pattern-matching used by the optimizer. They still can be used, although this is not recommended as these are not really documented.
The first of these is relatively useful. The 'I' constraint specifies a constant integer in the range 0-31. It is useful for 32-bit shift instructions:
unsigned func45(unsigned p1)
{
asm ("shl %1, %0"
: "+g" (p1) : "I" (20) : "cc");
return p1;
}
This compiles as you might expect:
.p2align 4,,15
.globl func45
.type func45, @function
func45:
.LFB51:
.cfi_startproc
movl %edi, %eax
#APP
# 510 "gcc_asm.c" 1
shl $20, %eax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE51:
.size func45, .-func45
Similarly, there is the 'J' constraint, which specifies a constant integer in the range 0-63 for 64-bit shift instructions:
unsigned long long func46(unsigned long long p1)
{
asm ("shl %1, %0"
: "+g" (p1) : "J" (40) : "cc");
return p1;
}
Which compiles into:
.p2align 4,,15
.globl func46
.type func46, @function
func46:
.LFB52:
.cfi_startproc
movq %rdi, %rax
#APP
# 518 "gcc_asm.c" 1
shl $40, %rax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE52:
.size func46, .-func46
The above two constraints are helpful in that gcc will error out if the constants are the wrong size. This extra error-checking can prevent bugs.
Perhaps less useful is the 'K' constraint. This specifies a signed 8-bit integer constant:
signed char func47(signed char p1)
{
asm ("add %1, %0"
: "+a" (p1) : "K" (-127) : "cc");
return p1;
}
Compiling into:
.p2align 4,,15
.globl func47
.type func47, @function
func47:
.LFB53:
.cfi_startproc
movl %edi, %eax
#APP
# 526 "gcc_asm.c" 1
add $-127, %al
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE53:
.size func47, .-func47
On the other hand, the 'L' constraint is obviously something that has escaped from RTL-land. It only allows the two integers 0xFF and 0xFFFF. It is basically a method of pattern-matching certain zero-extending constructs. Since you can't alter the asm string based on which constant matched, this constraint is barely useful. Of course, it still can be used:
unsigned func48(unsigned p1)
{
unsigned out;
asm ("mov %1, %0\n\t"
"and %2, %0\n\t"
: "=&r" (out) : "L" (0xff), "g" (p1) : "cc");
return out;
}
Where the above shows how the and instruction may be used for zero-extension:
.p2align 4,,15
.globl func48
.type func48, @function
func48:
.LFB54:
.cfi_startproc
#APP
# 539 "gcc_asm.c" 1
mov $255, %eax
and %edi, %eax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE54:
.size func48, .-func48
Another not so useful constraint is 'M'. This specifies integer constants from 0 to 3. It exists for RTL pattern-matching of shifts that may be better done with an lea instruction. Again, the result is something not very useful for inline asm, and you probably shouldn't use it. However, if you do, it may look something like:
unsigned func49(unsigned p1)
{
asm ("shl %1, %0"
: "+&g" (p1) : "M" (3) : "cc");
return p1;
}
Which compiles into:
.p2align 4,,15
.globl func49
.type func49, @function
func49:
.LFB55:
.cfi_startproc
movl %edi, %eax
#APP
# 548 "gcc_asm.c" 1
shl $3, %eax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE55:
.size func49, .-func49
The next integer constraint is 'N'. This one specifies an unsigned 8-bit integer constant. It is useful for the I/O instructions 'in' and 'out':
unsigned func50(void)
{
unsigned out;
asm volatile ("in %1, %0"
: "=a" (out) : "N" (0x80));
return out;
}
Which gives:
.p2align 4,,15
.globl func50
.type func50, @function
func50:
.LFB56:
.cfi_startproc
#APP
# 557 "gcc_asm.c" 1
in $128, %eax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE56:
.size func50, .-func50
The addition of 64-bit support to gcc meant that new constraints were needed. Since most instructions do not support 64-bit immediates, we need something to differentiate from 'i', which will allow such large integers. Instead, you can use 'e', the constraint for a constant 32-bit signed integer:
long long func51(void)
{
long long out;
asm ("mov %1, %0"
: "=a" (out) : "e" (-1));
return out;
}
Which when compiled gives:
.p2align 4,,15
.globl func51
.type func51, @function
func51:
.LFB57:
.cfi_startproc
#APP
# 567 "gcc_asm.c" 1
mov $-1, %rax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE57:
.size func51, .-func51
Similarly, there now is also a constraint for 32-bit unsigned integer constants, 'Z':
unsigned long long func52(void)
{
unsigned long long out;
asm ("mov %1, %0"
: "=a" (out) : "Z" (0xfffffffful) : "cc");
return out;
}
Which we can compile to give:
.p2align 4,,15
.globl func52
.type func52, @function
func52:
.LFB58:
.cfi_startproc
#APP
# 577 "gcc_asm.c" 1
mov $4294967295, %rax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE58:
.size func52, .-func52
Finally, there are two floating point constant constraints that you probably shouldn't use at all. These are used by gcc for optimizations. The first of these, 'G', will match a constant that the i387 can generate with a single instruction. However, since the resulting operand cannot actually be used by floating point instructions, there is very little point in using it in inline asm:
const char *func53(void)
{
const char *out;
asm (".pushsection .data\n"
"1:\n\t"
".asciz \"%1\"\n\t"
".popsection\n\t"
"lea 1b, %0\n\t"
: "=r" (out) : "G" (1.0L));
return out;
}
Where in the above we use the same trick as used with the 'X' constraint, and simply convert the operand into a string. The resulting code after compilation is:
.p2align 4,,15
.globl func53
.type func53, @function
func53:
.LFB59:
.cfi_startproc
#APP
# 592 "gcc_asm.c" 1
.pushsection .data
1:
.asciz "1.0e+0"
.popsection
lea 1b, %rax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE59:
.size func53, .-func53
The other floating point constraint is the equivalent for SSE registers, 'C'. Since there are fewer constants constructible with a single instruction, this is even less useful:
const char *func54(void)
{
const char *out;
asm (".pushsection .data\n"
"1:\n\t"
".asciz \"%1\"\n\t"
".popsection\n\t"
"lea 1b, %0\n\t"
: "=r" (out) : "C" (0));
return out;
}
Which when compiled produces:
.p2align 4,,15
.globl func54
.type func54, @function
func54:
.LFB60:
.cfi_startproc
#APP
# 610 "gcc_asm.c" 1
.pushsection .data
1:
.asciz "$0"
.popsection
lea 1b, %rax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE60:
.size func54, .-func54
The use of constraints doesn't cover all the possible things you might want to do in an inline assembly statement. The problem is that the operand %0 might not be in quite the form you want. For example, you may want to access a sub-register of %0, or use a different addressing mode that requires slightly different formatting than the default. Fortunately, gcc offers operand modifiers that allow these changes.
Operand modifiers work by inserting a symbol between the percent sign and the number for the operand (or its square-bracketed operand name). By using different modifiers, you can get different effects. However, many of the modifiers are really designed for RTL usage, so aren't helpful in inline asm mode.
The simplest modifier is one that just outputs the character 'b' (for byte-sized accesses) if the compiler is in AT&T mode. This helps in writing asm strings that can also be parsed in Intel mode, which requires unadorned instructions. Use the 'B' symbol to do this:
void func55(unsigned char *p1)
{
asm volatile ("mov%B0 $1, (%0)"
: : "r" (p1) : "memory");
}
Which compiles into:
.p2align 4,,15
.globl func55
.type func55, @function
func55:
.LFB61:
.cfi_startproc
#APP
# 627 "gcc_asm.c" 1
movb $1, (%rdi)
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE61:
.size func55, .-func55
Note how 'mov' gets changed into 'movb'. This particular operand modifier doesn't really depend on the operand itself.
There are other versions of this for the 16-bit and 32-bit cases. 'W' will generate a 'w', and 'L' will create an 'l':
void func56(unsigned short *p1)
{
asm volatile ("mov%W0 $1, (%0)"
: : "r" (p1) : "memory");
}
void func57(unsigned *p1)
{
asm volatile ("mov%L0 $1, (%0)"
: : "r" (p1) : "memory");
}
Which compile into:
.p2align 4,,15
.globl func56
.type func56, @function
func56:
.LFB62:
.cfi_startproc
#APP
# 634 "gcc_asm.c" 1
movw $1, (%rdi)
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE62:
.size func56, .-func56
.p2align 4,,15
.globl func57
.type func57, @function
func57:
.LFB63:
.cfi_startproc
#APP
# 641 "gcc_asm.c" 1
movl $1, (%rdi)
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE63:
.size func57, .-func57
Unfortunately, this pattern does not continue into 64 bits. The 'Q' modifier outputs an 'l', rather than the 'q' you might expect. Perhaps this is due to the fact that most instructions cannot take a 64-bit immediate. An example of using it is:
void func58(unsigned long long *p1)
{
asm volatile ("mov%Q0 $1, (%0)"
: : "r" (p1) : "memory");
}
Yielding:
.p2align 4,,15
.globl func58
.type func58, @function
func58:
.LFB64:
.cfi_startproc
#APP
# 648 "gcc_asm.c" 1
movl $1, (%rdi)
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE64:
.size func58, .-func58
Finally, there are two other character-printing modifiers. 'S' creates an 's', and 'T' makes a 't'. These are less useful, corresponding to legacy floating-point use. Of course, since the output is a raw string, you don't actually have to use them for that... and other sillier usages are possible, as is shown below.
void func59(unsigned long long *p1)
{
asm volatile ("bt%S0 $1, (%0)"
: : "r" (p1) : "memory");
}
void func60(unsigned long long *p1)
{
asm volatile ("no%T0q (%0)"
: : "r" (p1) : "memory");
}
Giving when compiled:
.p2align 4,,15
.globl func59
.type func59, @function
func59:
.LFB65:
.cfi_startproc
#APP
# 655 "gcc_asm.c" 1
bts $1, (%rdi)
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE65:
.size func59, .-func59
.p2align 4,,15
.globl func60
.type func60, @function
func60:
.LFB66:
.cfi_startproc
#APP
# 662 "gcc_asm.c" 1
notq (%rdi)
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE66:
.size func60, .-func60
Of course it goes without saying that such tricks should be avoided in real code.
Another operand modifier tells gcc that the operand is a label. This is used in the "asm goto" extension. Labels are listed after the clobber list, and can be referred to inside the asm string. Such asms should not have any outputs; they are designed for control flow usage.
The problem is that there is no real way to get condition code information into and out of an inline asm statement. The asm goto method avoids this problem by letting the user do the branching inside, and thus all condition usage is encapsulated. Other gcc optimizers can then deal with the jump labels and move them around as needed. The result can be very efficient code. An example using it is:
int func61(volatile void *p1, size_t p2)
{
asm goto (
"lock; bts %1, (%0)\n\t"
"jc %l2\n\t"
: : "r" (p1), "r" (p2) : "memory", "cc" : carry);
return 0;
carry:
return 1;
}
Which compiles into:
.p2align 4,,15
.globl func61
.type func61, @function
func61:
.LFB67:
.cfi_startproc
#APP
# 669 "gcc_asm.c" 1
lock; bts %rsi, (%rdi)
jc .L64
# 0 "" 2
#NO_APP
xorl %eax, %eax
ret
.p2align 4,,10
.p2align 3
.L64:
.L63:
movl $1, %eax
ret
.cfi_endproc
.LFE67:
.size func61, .-func61
If this function gets inlined into an if statement, the optimizers will remove the extra statements that set the return value.
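As a hypothetical illustration (the caller and the 'flags61' word are invented for this example), a caller might look like:
static unsigned long flags61;
int func61_caller(void)
{
/* When func61() is inlined here, gcc can branch straight on the asm's
   jump label instead of materializing the 0/1 return value. */
if (func61(&flags61, 3))
return -1;
return 0;
}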
The above modifiers didn't really change how the operands themselves are printed. However, the following ones do. The 'a' and 'A' modifiers deal with addresses. They are helpful when also allowing compilation in Intel syntax mode, since they print the operand so that dereferencing is written in the right syntax for the chosen dialect. An example of their use is:
void func62(unsigned char *p1)
{
asm volatile ("movb $1, %a0"
: : "r" (p1) : "memory");
}
void func63(void *p1)
{
asm volatile ("jmp %A0"
: : "r" (p1) :);
}
.p2align 4,,15
.globl func62
.type func62, @function
func62:
.LFB68:
.cfi_startproc
#APP
# 681 "gcc_asm.c" 1
movb $1, (%rdi)
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE68:
.size func62, .-func62
.p2align 4,,15
.globl func63
.type func63, @function
func63:
.LFB69:
.cfi_startproc
#APP
# 687 "gcc_asm.c" 1
jmp *%rdi
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE69:
.size func63, .-func63
Note how 'a' added parentheses around the register name, and 'A' added an asterisk in front.
The 'p' modifier is similar. It modifies an operand to be a raw symbol name. For constants, it removes the leading dollar symbol. This is useful because in some contexts a dollar symbol is incorrect syntax. For example, in segment-offset addressing:
int func64(void)
{
int out;
asm volatile ("movl %%gs:%p1, %0"
: "=r" (out) : "i" (40) : "memory");
return out;
}
Compiles into:
.p2align 4,,15
.globl func64
.type func64, @function
func64:
.LFB70:
.cfi_startproc
#APP
# 694 "gcc_asm.c" 1
movl %gs:40, %eax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE70:
.size func64, .-func64
Notice how %gs:$40 would be wrong.
The 'P' modifier does a little more work: it removes things like '@PLT'. This is helpful if you are creating something like a dynamic linker, where you need to do inline asm before relocations have been calculated:
unsigned long long func65(void)
{
unsigned long long out;
asm ("leaq (%P1), %0"
: "=r" (out) : "g" (func65));
return out;
}
Which gives:
.p2align 4,,15
.globl func65
.type func65, @function
func65:
.LFB71:
.cfi_startproc
#APP
# 702 "gcc_asm.c" 1
leaq (func65), %rax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE71:
.size func65, .-func65
Notice how the raw unadorned 'func65' is used.
The 'X' modifier is similar to 'P'. It outputs a symbol name with a prefixed dollar symbol. It is useful for symbolic immediates:
unsigned long long func66(void)
{
unsigned long long out;
asm ("movabs %X1, %0"
: "=r" (out) : "g" (func66));
return out;
}
Which compiles into:
.p2align 4,,15
.globl func66
.type func66, @function
func66:
.LFB72:
.cfi_startproc
#APP
# 710 "gcc_asm.c" 1
movabs $func66, %rax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE72:
.size func66, .-func66
Compare with the output from the 'P' modifier. Basically, these symbol modifiers are only useful if you are playing with linker tricks. Usually, the default behavior from the 'm' or 'g' constraint is what you want; the modifiers are only needed when you absolutely require some other form of linkage.
Occasionally, you may want to use a differently sized sub-register of a given register operand. Without operand modifiers there is no way to do this, because the sub-register names don't follow a single textual pattern: compare %rax and %eax with %r8 and %r8d. Fortunately, gcc provides modifiers for accessing all of the sub-registers of a given operand.
The 'b' operand modifier gives you the 8-bit register related to a given register operand. (For those registers that have two 8-bit sub-registers, it picks the low one, i.e. %al rather than %ah for %eax.) Code using it looks like:
unsigned long long func67(unsigned p1)
{
unsigned long long out = 0;
asm volatile ("mov %b1, (%0)"
: : "r" (&out), "r" (p1) : "memory");
return out;
}
The above takes the bottom 8 bits of the 32-bit integer parameter, and stores them into the corresponding bits of the 64-bit output:
.p2align 4,,15
.globl func67
.type func67, @function
func67:
.LFB73:
.cfi_startproc
movq $0, -8(%rsp)
leaq -8(%rsp), %rax
#APP
# 718 "gcc_asm.c" 1
mov %dil, (%rax)
# 0 "" 2
#NO_APP
movq -8(%rsp), %rax
ret
.cfi_endproc
.LFE73:
.size func67, .-func67
There are, of course, other sized sub-registers. The 16-bit operand modifier is 'w':
unsigned long long func68(unsigned p1)
{
unsigned long long out = 0;
asm volatile ("mov %w1, (%0)"
: : "r" (&out), "r" (p1) : "memory");
return out;
}
Which does a similar thing as the previous function, but to the bottom 16 bits:
.p2align 4,,15
.globl func68
.type func68, @function
func68:
.LFB74:
.cfi_startproc
movq $0, -8(%rsp)
leaq -8(%rsp), %rax
#APP
# 726 "gcc_asm.c" 1
mov %di, (%rax)
# 0 "" 2
#NO_APP
movq -8(%rsp), %rax
ret
.cfi_endproc
.LFE74:
.size func68, .-func68
The operand modifier for 32-bits is 'k':
unsigned long long func69(unsigned p1)
{
unsigned long long out;
asm volatile ("mov %1, %k0"
: "=r" (out) : "r" (p1));
return out;
}
The above uses the x86-64 behavior that writing to a 32-bit register clears the upper 32 bits of the full 64-bit register. The asm looks like:
.p2align 4,,15
.globl func69
.type func69, @function
func69:
.LFB75:
.cfi_startproc
#APP
# 734 "gcc_asm.c" 1
mov %edi, %eax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE75:
.size func69, .-func69
Finally, if you want the 64-bit version of a register, use the 'q' modifier:
unsigned long long func70(unsigned p1)
{
unsigned long long out = 0;
asm volatile ("mov %q1, %0"
: "=r" (out) : "r" (p1));
return out;
}
Which compiles into:
.p2align 4,,15
.globl func70
.type func70, @function
func70:
.LFB76:
.cfi_startproc
#APP
# 743 "gcc_asm.c" 1
mov %rdi, %rax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE76:
.size func70, .-func70
Of course, we still may want to access the other 8-bit "high" sub-register. The 'h' operand modifier allows this:
unsigned long long func71(unsigned p1)
{
unsigned long long out = 0;
asm volatile ("mov %h1, (%0)"
: : "r" (&out), "Q" (p1) : "memory");
return out;
}
Note how we had to use the 'Q' constraint to make sure that the high sub-register exists. The resulting code chooses %edx, and thus %dh, for this:
.p2align 4,,15
.globl func71
.type func71, @function
func71:
.LFB78:
.cfi_startproc
movq $0, -8(%rsp)
leaq -8(%rsp), %rax
movl %edi, %edx
#APP
# 759 "gcc_asm.c" 1
mov %dh, (%rax)
# 0 "" 2
#NO_APP
movq -8(%rsp), %rax
ret
.cfi_endproc
.LFE78:
.size func71, .-func71
Somewhat related is the 'H' operand modifier. This allows you to access the high 8-byte half of a 16-byte SSE variable in memory: it adds 8 bytes to the offset in the memory access. This effect can of course be simulated manually, as shown after the example below.
xmmd var72;
xmmd func72(double p1)
{
asm volatile ("movlpd %1, %H0"
: : "m" (var72), "x" (p1) : "memory");
return var72;
}
Which compiles into:
.p2align 4,,15
.globl func72
.type func72, @function
func72:
.LFB79:
.cfi_startproc
#APP
# 767 "gcc_asm.c" 1
movlpd %xmm0, var72+8(%rip)
# 0 "" 2
#NO_APP
movapd var72(%rip), %xmm0
ret
.cfi_endproc
.LFE79:
.size func72, .-func72
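For comparison, here is a sketch of simulating the same thing by hand (the function name is invented for this example): pass the address in a register and write the 8-byte displacement explicitly.
xmmd func72_manual(double p1)
{
/* Same effect as %H0: address the upper 8 bytes with an explicit +8 offset. */
asm volatile ("movlpd %1, 8(%0)"
: : "r" (&var72), "x" (p1) : "memory");
return var72;
}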
There are also operand modifiers that help with instruction-size suffixes. In AT&T syntax you may need to add a suffix to an instruction to tell the assembler what operand size to use, whereas in Intel syntax the suffix should not be there. In addition, flexible code may need to handle several possible operand sizes. The 'z' and 'Z' modifiers help here: they print the correct suffix for the size of a given operand:
int func73(void)
{
int out;
asm ("mov%z0 %1, %0"
: "=r" (out) : "i" (25));
return out;
}
Compiles into:
.p2align 4,,15
.globl func73
.type func73, @function
func73:
.LFB80:
.cfi_startproc
#APP
# 775 "gcc_asm.c" 1
movl $25, %eax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE80:
.size func73, .-func73
Notice the 'l' in the 'movl' instruction has been added for us. The 'Z' variant is similar:
int func74(void)
{
int out;
asm ("mov%Z0 %1, %0"
: "=r" (out) : "i" (25));
return out;
}
And in this case compiles identically.
.p2align 4,,15
.globl func74
.type func74, @function
func74:
.LFB81:
.cfi_startproc
#APP
# 783 "gcc_asm.c" 1
movl $25, %eax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE81:
.size func74, .-func74
The difference between 'z' and 'Z' is that 'Z' is more flexible: it works with floating-point registers as well as the integer ones. Unfortunately, neither modifier will work on operands with constant constraints, only register ones.
Sometimes you may want to write accesses to the top of the legacy floating-point stack slightly differently. The 'y' modifier converts 'st' into 'st(0)':
long double func75(long double p1, long double p2)
{
long double out;
asm ("fadd %2, %y0"
: "=&t" (out) : "%0" (p1), "u" (p2));
return out;
}
Compare the result with the output from func40().
.p2align 4,,15
.globl func75
.type func75, @function
func75:
.LFB82:
.cfi_startproc
fldt 8(%rsp)
fldt 24(%rsp)
fxch %st(1)
#APP
# 791 "gcc_asm.c" 1
fadd %st(1), %st(0)
# 0 "" 2
#NO_APP
fstp %st(1)
ret
.cfi_endproc
.LFE82:
.size func75, .-func75
'n' is a weird operand modifier. It negates the value of an integer constant. It also suppresses the leading dollar sign:
int func76(void)
{
int out;
asm ("movl $%n1, %0"
: "=r" (out) : "i" (25));
return out;
}
Which gives:
.p2align 4,,15
.globl func76
.type func76, @function
func76:
.LFB83:
.cfi_startproc
#APP
# 799 "gcc_asm.c" 1
movl $-25, %eax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE83:
.size func76, .-func76
Another strange one is the 's' modifier. It prints out an integer constant, followed by a comma. It does not suppress the leading dollar sign:
int func77(void)
{
int out;
asm ("movl %s1 %0"
: "=r" (out) : "i" (25));
return out;
}
Which gives:
.p2align 4,,15
.globl func77
.type func77, @function
func77:
.LFB84:
.cfi_startproc
#APP
# 807 "gcc_asm.c" 1
movl $25, %eax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE84:
.size func77, .-func77
The next set of modifiers helps asm using AVX instructions. The 't' modifier converts an SSE register name into its AVX equivalent:
typedef double ymmd __attribute__ ((vector_size (32)));
ymmd func78(xmmd p1, xmmd p2)
{
ymmd out;
asm ("vmovapd %t1, %0"
: "=x" (out) : "x" (p2));
return out;
}
If you compile with the -mavx flag, you get:
.p2align 4,,15
.globl func78
.type func78, @function
func78:
.LFB85:
.cfi_startproc
#APP
# 816 "gcc_asm.c" 1
vmovapd %ymm1, %ymm0
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE85:
.size func78, .-func78
The reverse is implemented by the 'x' modifier, which converts an AVX name into the SSE version:
xmmd func79(ymmd p1, ymmd p2)
{
xmmd out;
asm ("movapd %x1, %x0"
: "=x" (out) : "x" (p2));
return out;
}
Giving:
.p2align 4,,15
.globl func79
.type func79, @function
func79:
.LFB86:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
#APP
# 824 "gcc_asm.c" 1
movapd %xmm1, %xmm0
# 0 "" 2
#NO_APP
movq %rsp, %rbp
.cfi_def_cfa_register 6
popq %rbp
.cfi_def_cfa 7, 8
vzeroupper
ret
.cfi_endproc
.LFE86:
.size func79, .-func79
Also potentially useful for AVX code is the 'd' operand modifier, which is documented to duplicate an operand. Since the fused multiply-add instructions come in three- and four-operand variants, it would be convenient to support both from the same code base, and duplicated operands would help somewhat. Unfortunately, simple usage of 'd' with AVX registers leads to internal compiler errors with the current version of gcc (4.7.1), so this modifier should be avoided for now.
Other modifiers to be avoided are those dealing with condition codes. There is no way for inline asm to input a condition code operand type. (They are generated from RTL, however.) So you shouldn't use the 'c', 'C', 'f', 'F', 'D' and 'Y' modifiers.
The one remaining modifier is 'O'. It isn't particularly useful: it prints nothing unless Sun assembler syntax is enabled (it is off by default). Otherwise it prints 'w', 'l' or 'q', which is helpful for cmov instructions, whose suffixes are slightly different in that asm dialect.
In addition to the escapes that refer to constraint operands, there are a few special ones. The first of these we have seen before: '%%' prints a single percent sign. This is helpful for naming registers explicitly within the asm string. The '%%' behavior is the same as that of the printf() function, so it is easy to remember. func10(), above, shows its use.
The '%*' operand prints an asterisk if you are using AT&T assembly output. Otherwise, nothing is printed. This is helpful for portability:
void func81(void *p1)
{
asm volatile ("jmp %*%0"
: : "r" (p1) :);
}
Which compiles into:
.p2align 4,,15
.globl func81
.type func81, @function
func81:
.LFB87:
.cfi_startproc
#APP
# 839 "gcc_asm.c" 1
jmp *%rdi
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE87:
.size func81, .-func81
Again, you probably shouldn't use control flow instructions like that in inline asm, since gcc will not understand them. However... sometimes you might just need to, and tricks like that often help.
The '%=' operand prints a number that is unique to each instance of the asm statement within the compilation. This is helpful for constructing a unique symbol from within an inline asm, although __LINE__ or local numeric labels may be better choices (see the sketch after the example below). For example:
void func82(void *p1)
{
asm volatile (".L%=something:"
: : :);
}
Which compiles to give:
.p2align 4,,15
.globl func82
.type func82, @function
func82:
.LFB88:
.cfi_startproc
#APP
# 845 "gcc_asm.c" 1
.L820something:
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE88:
.size func82, .-func82
Where in this particular case, it expanded to "820". Note that since you can construct a symbol name with a given pattern, this trick may be helpful for debugging.
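A sketch of the local-label alternative mentioned above (the function name is invented for this example). GNU as numeric labels such as '1:' may be repeated freely, and '1f' or '1b' refer to the nearest definition forwards or backwards, so no unique identifier is needed even when the asm is duplicated by inlining:
void func82b(void)
{
/* '1:' and '2:' are local numeric labels; '2f' means the next '2:' forwards. */
asm volatile ("1:\n\t"
"jmp 2f\n\t"
"nop\n\t"
"2:"
: : :);
}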
The '%@' operand expands to the segment register used for thread-local storage. In 32-bit mode this is %gs; in 64-bit mode it is %fs. If you are writing low-level thread-library code, this may be helpful for portability.
int func83(void)
{
int out;
asm volatile ("movl %@:%p1, %0"
: "=r" (out) : "i" (40) : "memory");
return out;
}
Which compiles to give:
.p2align 4,,15
.globl func83
.type func83, @function
func83:
.LFB89:
.cfi_startproc
#APP
# 852 "gcc_asm.c" 1
movl %fs:40, %eax
# 0 "" 2
#NO_APP
ret
.cfi_endproc
.LFE89:
.size func83, .-func83
The '%~' operand expands to 'i' if AVX2 is available, and to 'f' otherwise. I don't know why this could be useful.
The '%;' operand expands to ':' if gcc was built with workarounds for certain buggy versions of the GNU assembler. Otherwise, it expands to nothing. This apparently is useful for getting segment overrides to work. However, these days binutils is most likely modern enough, so you don't have to worry about this.
Finally, there are two more operands that are not useful from inline asm. The '%+' operand is designed to add branch-prediction prefixes. However, inline asm can't give the information it needs. The '%&' operand expands to the name of a dynamic tls variable used within the function the inline asm is invoked in. This is used internally within gcc to get thread local variables to work correctly. You shouldn't need to use it in inline asm code.
Another interface between gcc and assembly language is register variables. GCC has an extension that lets you specify which particular register a variable should use. An example of this is:
int func84(int p1)
{
register int out asm("r10d") = p1;
return out;
}
Where we would like the input parameter p1 to be stored in %r10, before being copied into %eax for output. Unfortunately, reality isn't so kind:
.p2align 4,,15
.globl func84
.type func84, @function
func84:
.LFB90:
.cfi_startproc
movl %edi, %eax
ret
.cfi_endproc
.LFE90:
.size func84, .-func84
GCC ignores our request, and instead optimizes the extra moves away. You might think you could use a volatile specifier on the variable to make its loads and stores explicit. This doesn't work either; in fact, there is a warning, "-Wvolatile-register-var", for this broken usage. Since asm register variables are held captive to the whims of the optimizer, they should perhaps not be used like this: it is difficult to make sure they will have the behavior you need. The one use that the gcc documentation does guarantee is pinning extended-asm operands to specific registers, as sketched below.
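A minimal sketch of that guaranteed use (the function name is invented for this example; the asm simply copies the value so the register choice is visible in the output):
long func84b(long p1)
{
/* Force this particular asm input to live in %r10. */
register long val asm("r10") = p1;
long out;
asm volatile ("mov %1, %0" : "=r" (out) : "r" (val));
return out;
}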
A final trick is that it is possible to insert arbitrary asm at the top level of a C source file. Normally, you would need to be inside a function to use inline assembly with operands. However, we can use the fact that the section attribute string is inserted verbatim into the output. Since we can embed newlines, we can put anything we like there. The only constraint is that the input must be a constant C string:
int func85(void);
int __attribute__((used, section(".text\n\t"
".globl func85\n\t"
".type func85, @function\n\t"
"func85:\n\t"
"mov $1, %eax\n\t"
"retq\n\t"
".size func85, .-func85\n\t"
".section .data"))) func85a;
The above creates a function called func85() within the section attribute. The 'used' attribute is there to make sure that the variable func85a is not removed. The result is that func85 is inserted into the object code manually:
.globl func85a
.section .text
.globl func85
.type func85, @function
func85:
mov $1, %eax
retq
.size func85, .-func85
.section .data,"aw",@progbits
.align 4
.type func85a, @object
.size func85a, 4
func85a:
.zero 4
A similar version of this trick allows variables to be put into ELF sections that are not '@progbits'. Simply add the section flags you want, and then end them with a '#' comment character. The comment hides the unwanted flags gcc appends as a suffix.
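A sketch of what that might look like (the section name and variable are invented, and the exact flags gcc appends may vary between versions):
/* The trailing '#' comments out the ',"aw",@progbits' suffix that gcc appends,
   leaving the @nobits type we asked for. */
int __attribute__((used, section(".mybss.extra,\"aw\",@nobits #"))) var86;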