AT&T Assembly Language Note

1. Some Tools

Unlike a high-level language environment in which you can purchase a complete development environment, you often have to piece together an assembly language development environment. At a minimum you should have the following:

  • An assembler
  • A linker
  • A debugger
Additionally, to create assembly routines for other high-level language programs, you should also have these tools:
  • A compiler for the high-level language
  • An object code disassembler
  • A profiling tool for optimization

1.1 The GNU Assembler

The assembler produces the instruction codes from source code created by the programmer. There are three components to an assembly language source code program:
  • Opcode mnemonics
  • Data sections
  • Directives
The opcode mnemonics are closely related to processor instruction codes.
Directives help the programmer define specific functions, such as declaring data types, and define memory regions within the program. The directives instruct the assembler how to construct the instruction code program.
An example of converting the assembly language program test.s to the object file test.o would be as follows:
as -o test.o test.s
This creates an object file test.o containing the instruction codes for the assembly language program.

1.2 The GNU Linker

The GNU linker, ld, is used to link object files into either executable program files or library files.
For the simplest case, to create an executable file from an object file generated from the assembler, you would use the following command:
ld -o test test.o
This command creates the executable file test from the object code file test.o.

1.3 The GNU Compiler

Most professional programmers attempt to write as much of the application as possible in a high-level language, such as C or C++, and concentrate on optimizing the trouble spots using assembly language programming. To do this, you must have the proper compiler for your high-level language.
The GNU Compiler Collection (gcc) is the most popular development system for UNIX systems.
Some command-line parameters worth noting:
  • -c Compile or assemble the code, but do not link
  • -S Stop after compiling; do not assemble
  • -E Stop after preprocessing; do not compile
  • -o Specify the output filename to use
  • -g Produce debugging information
  • -pg Produce extra code used by gprof for profiling
  • -I Specify directories for include files
  • -L Specify directories for library files

1.4 The GNU Debugger Program

The GNU debugger command-line program is called gdb.
To use the debugger, the executable file must have been compiled or assembled with the -gstabs option, which includes the information the debugger needs to relate instruction codes to locations in the source code file. Newer GCC releases no longer support the stabs format; with those, use the plain -g option instead.
gcc -gstabs -o test test.c
gdb test

1.5 The GNU Objdump Program

Often it is necessary to view the instruction codes generated by the compiler in the object code files. The objdump program will display not only the assembly language code, but the raw instruction codes generated as well.
For the assembly language programmer, the -d parameter is the most interesting, as it displays the disassembled object code file. For example:
[Figure 1: annotated objdump -d output, with the markers #1 to #3 referenced in the notes below; image not available]

Some notes on the listing above:
  • The C solution for passing input values to functions is to use the stack.
  • The stack consists of memory locations reserved at the end of the memory area allocated to the program. The ESP register points to the top of the stack in memory. Data can be placed only on the top of the stack, and it can be removed only from the top of the stack.
  • #1 C style requires placing parameters on the stack in reverse order from the prototype for the function.
  • #2 When the CALL instruction is executed, it also places the return address for the calling program onto the top of the stack.
  • #3 The stack pointer (ESP) points to the top of the stack, where the return address is located. Any push or pop inside the function moves ESP, which would throw off the offsets used to reach the parameters on the stack. To avoid that, it is common practice to copy the ESP register value to the EBP register when entering the function.
  • To avoid corrupting the original EBP register if it is used in the main program, the EBP register value is placed on the stack before the ESP register value is copied.
  • EBP now holds the base of the function's stack frame, the location where the old EBP register value is stored. The return address is at 4(%ebp), and the first input parameter from the main program is at the indirect addressing location 8(%ebp).
The technique of using the stack to reference input data for the function has created a standard set of instructions that are found in all functions written using the C function style technique. This code snippet demonstrates what instructions are used for the start and end of the function code:
[Figure 2: listing of the standard function start and end instructions; image not available]
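The stack layout described in the notes above can be sketched as a small C model. This is only an illustration, not code from the original listing: the Machine type, the 64-byte stack size, and the function names are all made up for the example, and esp and ebp are kept as byte offsets into an array rather than real addresses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 32-bit stack: an array of 4-byte slots, with
 * esp and ebp stored as byte offsets. The stack grows downward. */
#define STACK_BYTES 64

typedef struct {
    uint32_t slot[STACK_BYTES / 4];
    uint32_t esp;   /* byte offset of the top of the stack */
    uint32_t ebp;   /* byte offset of the frame base */
} Machine;

void push(Machine *m, uint32_t value) {
    assert(m->esp >= 4);          /* no room left would be a stack overflow */
    m->esp -= 4;                  /* PUSH first moves ESP down by 4 bytes */
    m->slot[m->esp / 4] = value;  /* then stores the value at the new top */
}

/* Reads the 32-bit value at offset(%ebp). */
uint32_t at_ebp(const Machine *m, int32_t offset) {
    uint32_t address = m->ebp + (uint32_t)offset;
    assert(address % 4 == 0 && address < STACK_BYTES);
    return m->slot[address / 4];
}

/* Models a caller invoking f(a, b), followed by the first two
 * instructions of f. */
Machine enter_function(uint32_t a, uint32_t b,
                       uint32_t return_addr, uint32_t callers_ebp) {
    Machine m = { {0}, STACK_BYTES, callers_ebp };
    push(&m, b);            /* pushl b : last parameter goes on first */
    push(&m, a);            /* pushl a : first parameter goes on last */
    push(&m, return_addr);  /* call f  : pushes the return address */
    push(&m, m.ebp);        /* pushl %ebp : save the caller's EBP */
    m.ebp = m.esp;          /* movl %esp, %ebp */
    return m;
}

void frame_demo(void) {
    Machine m = enter_function(10, 20, 0x1234, 0x60);
    assert(at_ebp(&m, 0) == 0x60);    /* saved EBP */
    assert(at_ebp(&m, 4) == 0x1234);  /* return address */
    assert(at_ebp(&m, 8) == 10);      /* first parameter */
    assert(at_ebp(&m, 12) == 20);     /* second parameter */
}
```

Because the offsets are taken from EBP, they stay valid even if the function later pushes more data and ESP moves.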

1.6 The GNU Profiler Program

The GNU profiler (gprof) is another program included in the binutils package. This program is used to analyze program execution and determine where the "hot spots" are in the application.
The application hot spots are the functions that require the most processing time as the program runs. Often, they are the most mathematically intensive functions, but that is not always the case. Functions that are I/O intensive can also increase processing time.

Summary

The assembler is used to convert the assembly language code into instruction code for the specific processor used to run the application. The linker is then used to convert the raw instruction code into an executable program by combining any necessary libraries, and resolving any memory references necessary for memory storage.
The debugger enables you to step through your program, watching how each instruction modifies registers and memory locations. The disassembler enables you to view the instruction codes in an object file generated by either an assembly language program or a high-level language program.
If you plan on using high-level languages with your assembly code, you will also need a compiler to build the executable code from the high-level language source code.
One final tool that is useful for programmers is a profiler. The profiler is used to analyze the performance of an application. By examining which functions consume the most processing time, you can determine which ones are worth trying to optimize to increase the performance of the application.

2. Some Knowledge

2.1 Defining the starting point

When the assembly language program is converted to an executable file, the linker must know where the starting point is in your instruction code. To solve this problem, the GNU assembler and linker use a default label, or identifier, for the entry point of the application. The _start label indicates the instruction from which the program should start running.

2.2 Assembling using a compiler

With the assembly language source code program saved as cpuid.s, you can build the executable program using the GNU assembler and GNU linker as follows:
as -o cpuid.o cpuid.s
ld -o cpuid cpuid.o
Because the GNU Compiler Collection (gcc) uses the GNU assembler to compile C code, you can also use it to assemble and link your assembly language program in a single step.
There is one problem when using gcc to assemble your programs. While the GNU linker looks for the _start label to determine the beginning of the program, gcc looks for the main label (you might recognize that from C or C++ programming). You must change both the _start label and the .global directive defining the label in your program to main.

2.3 Debugging the Program

In order to debug the assembly language program, you must first reassemble the source code using the -gstabs parameter:
as -gstabs -o cpuid.o cpuid.s
ld -o cpuid cpuid.o
Because the -gstabs parameter adds additional information to the executable program file, the resulting file becomes larger than it needs to be just to run the application.

2.3 Linking with C library functions

When you use C library functions in your assembly language program, you must link the C library files with the program object code.
In order to link the C function libraries, they must be available on your system. On Linux systems, there are two ways to link C functions to your assembly language program.
The first method is called static linking. Static linking links the function object code directly into your application executable program file. This can create large executable programs, and it wastes memory if multiple instances of the program are run at the same time (each instance has its own copy of the same functions).
The second method is called dynamic linking. Dynamic linking uses libraries that enable programmers to reference the functions in their applications, but not link the function codes in the executable program file. Instead, dynamic libraries are called at the program's runtime by the operating system, and can be shared by multiple programs.
On Linux systems, the standard C dynamic library is located in the file libc.so.x, where x is a value representing the version of the library.
This file is automatically linked to C programs when using gcc. When you link with ld yourself, you must manually link it to your program object code for the C functions to operate. To link the libc.so file, you must use the -l parameter of the GNU linker. When using the -l parameter, you do not need to specify the complete library name. The linker assumes that the library will be in a file:
/lib/libx.so
where x is the library name specified on the command-line parameter, in this case the letter c. Thus the command to assemble and link the program would be as follows:
as -o cpuid.o cpuid.s
ld -dynamic-linker /lib/ld-linux.so.2 -o cpuid -lc cpuid.o
It is also possible to use the gcc compiler to assemble and link the assembly language program and C library functions. In fact, in this case it's a lot easier. The gcc compiler automatically links in the necessary C libraries without you having to do anything special.

2.4 Defining static symbols

Although the data section is intended primarily for defining variable data, you can also declare static data symbols there. The .equ directive is used to set a constant value to a symbol that can be used in the text section, as shown in the following examples:
.equ factor, 3
.equ LINUX_SYS_CALL, 0x80
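Once set, the symbol is a constant as far as the running program is concerned: it names a value, not a memory location, so instructions cannot modify it. The .equ directive can appear anywhere in the data section.
To reference the static data element, you must use a dollar sign before the label name. For example, the instruction
movl $LINUX_SYS_CALL, %eax
moves the value assigned to the LINUX_SYS_CALL symbol into the EAX register.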
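For readers coming from C, a rough analogy (not part of the original notes) is an enum constant: like an .equ symbol, it names a value at build time and reserves no memory of its own. The function names below are hypothetical and exist only for the illustration.

```c
#include <assert.h>

/* Rough C counterparts of:
 *   .equ factor, 3
 *   .equ LINUX_SYS_CALL, 0x80
 * Enum constants have no address, much as an .equ symbol is not a
 * memory location. */
enum { FACTOR = 3, LINUX_SYS_CALL = 0x80 };

/* Comparable to  movl $LINUX_SYS_CALL, %eax : the constant is used
 * as an immediate value. */
int load_linux_sys_call(void) {
    int eax = LINUX_SYS_CALL;
    return eax;
}

/* Uses the other constant as a multiplier. */
int scale_by_factor(int value) {
    return value * FACTOR;
}

void equ_demo(void) {
    assert(load_linux_sys_call() == 128);  /* 0x80 in decimal */
    assert(scale_by_factor(7) == 21);
}
```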

2.5 The bss section

Defining data elements in the bss section is somewhat different from defining them in the data section. Instead of declaring specific data types, you just declare raw segments of memory that are reserved for whatever purpose you need them for.
The format of the directive is
.comm symbol, length
where symbol is a label assigned to the memory area, and length is the number of bytes contained in the memory area.
One benefit to declaring data in the bss section is that the data is not included in the executable program. When data is defined in the data section, it must be included in the executable program, since it must be initialized with a specific value. Because the data areas declared in the bss section are not initialized with program data, the memory areas are reserved at runtime, and do not have to be included in the final program. 
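The same split shows up in C, which gives a convenient way to experiment with it. The sketch below is an illustration under assumptions rather than anything from the original notes: the variable and function names are made up, and where a given compiler places each object can vary, although uninitialized file-scope storage typically lands in the bss section and initialized storage in the data section.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Uninitialized file-scope storage: typically placed in .bss, so the
 * executable records only its size (compare  .comm buffer, 10000 ). */
unsigned char buffer[10000];

/* Explicitly initialized storage: typically placed in .data, so the
 * initial values are stored in the executable file itself. */
int primes[4] = { 2, 3, 5, 7 };

/* Returns 1 if every byte of buffer is zero, otherwise 0. */
int buffer_is_zeroed(void) {
    for (size_t i = 0; i < sizeof buffer; i++) {
        if (buffer[i] != 0) {
            return 0;
        }
    }
    return 1;
}

void bss_demo(void) {
    /* C guarantees that static storage starts out zeroed. */
    assert(buffer_is_zeroed());
    memset(buffer, 0xFF, sizeof buffer);  /* the memory is usable at runtime */
    assert(!buffer_is_zeroed());
    assert(primes[2] == 5);
}
```

Compiling this to an object file and inspecting it with objdump or size is one way to see that the 10000 bytes add to the bss size without adding to the file's stored data.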

2.6 Moving data elements

Because the data elements are located in memory and many of the processor instructions utilize registers, the first step in handling data elements is being able to move them between memory and registers.
The GNU assembler (which uses AT&T style syntax) adds another dimension to the MOV instruction, in that the size of the data element moved must be declared. The size is declared by adding an additional character to the MOV mnemonic. Thus the instruction becomes
movx
where x can be the following:
  • l for a 32-bit long word value
  • w for a 16-bit word value
  • b for an 8-bit byte value
There are very specific rules for using the MOV instruction. Only certain combinations of source and destination are allowed; for example, MOV cannot copy directly from one memory location to another.
An immediate value must be preceded by a dollar sign to indicate that it is an immediate value.
Moving data from one processor register to another is the quickest way to move data with the processor. It is often advisable to keep data in processor registers as much as possible to decrease the amount of time spent trying to access memory locations.
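The effect of the size suffix is easiest to see when the destination is part of a register. The following C sketch models EAX as a 32-bit integer and shows which bits each form replaces in 32-bit mode. The function names are hypothetical and the code is a model of the behavior, not a translation of any listing in these notes.

```c
#include <assert.h>
#include <stdint.h>

/* movl $imm32, %eax : replaces all 32 bits of the register. */
uint32_t movl_to_eax(uint32_t eax, uint32_t value) {
    (void)eax;  /* the old contents are discarded */
    return value;
}

/* movw $imm16, %ax : replaces the low 16 bits, keeps the upper 16. */
uint32_t movw_to_ax(uint32_t eax, uint16_t value) {
    return (eax & 0xFFFF0000u) | value;
}

/* movb $imm8, %al : replaces the low 8 bits, keeps the upper 24. */
uint32_t movb_to_al(uint32_t eax, uint8_t value) {
    return (eax & 0xFFFFFF00u) | value;
}

void mov_size_demo(void) {
    uint32_t eax = movl_to_eax(0, 0x11223344u);
    assert(eax == 0x11223344u);
    eax = movw_to_ax(eax, 0xAABB);
    assert(eax == 0x1122AABBu);
    eax = movb_to_al(eax, 0xCC);
    assert(eax == 0x1122AACCu);
}
```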

2.7 Indexed memory locations

When referencing data in an array, you must use an index system to determine which value you are accessing.
The way this is done is called indexed memory mode. The memory location is determined by the following:
  • A base address
  • An offset address to add to the base address
  • The size of the data element
  • An index to determine which data element to select
The format of the expression is
base_address(offset_address, index, size)
The data value retrieved is located at
base_address + offset_address + index * size
If any of the values are zero, they can be omitted (but the commas are still required as placeholders). The offset_address and index values must be registers, but the size value can be a numerical value (1, 2, 4, or 8).
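The address calculation can be written out in C to check it by hand. This is a minimal sketch: effective_address and load_element are names invented for the example, and the second function assumes an array of 4-byte integers read the way an instruction such as movl values(, %edi, 4), %eax would read it, with the index standing in for EDI and no offset register.

```c
#include <assert.h>
#include <stdint.h>

/* Address selected by  base_address(offset_address, index, size). */
uintptr_t effective_address(uintptr_t base_address, uintptr_t offset_address,
                            uintptr_t index, unsigned size) {
    assert(size == 1 || size == 2 || size == 4 || size == 8);
    return base_address + offset_address + index * size;
}

/* Reads element number `index` of an array of 4-byte integers by
 * computing its address explicitly. */
int32_t load_element(const int32_t *values, uintptr_t index) {
    uintptr_t address = effective_address((uintptr_t)values, 0, index, 4);
    return *(const int32_t *)address;
}

void indexed_demo(void) {
    int32_t values[] = { 10, 15, 20, 25 };
    assert(load_element(values, 0) == 10);
    assert(load_element(values, 2) == 20);
    /* base 1000, offset 8, third 4-byte element: 1000 + 8 + 3 * 4 */
    assert(effective_address(1000, 8, 3, 4) == 1020);
}
```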

2.8 Cleaning out the stack

There is just one more detail to consider when using C style function calling. Before the function is called, the calling program places all of the required input values onto the stack. When the function returns, those values are still on the stack (since the function accessed them without popping them off of the stack). If the main program uses the stack for other things, most likely it will want to remove the old input values from the stack to get the stack back to where it was before the function call.
While you can use the POP instruction to do this, you can also just move the ESP stack pointer back to its location before the function call. To do this, use the ADD instruction to add the total size of the data elements that were pushed onto the stack back to ESP.
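As a quick check of the arithmetic, the sketch below models the stack pointer as a byte offset and walks through a call with 4-byte parameters. Everything here is an assumption made for the illustration: the Stack type, the 64-byte size, the placeholder return address, and the function names are not from the original notes.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_STACK_BYTES 64  /* hypothetical stack size for this model */

typedef struct {
    uint32_t slot[MODEL_STACK_BYTES / 4];
    uint32_t esp;  /* byte offset of the top of the stack */
} Stack;

void push32(Stack *s, uint32_t value) {
    assert(s->esp >= 4);
    s->esp -= 4;
    s->slot[s->esp / 4] = value;
}

uint32_t pop32(Stack *s) {
    assert(s->esp + 4 <= MODEL_STACK_BYTES);
    uint32_t value = s->slot[s->esp / 4];
    s->esp += 4;
    return value;
}

/* Models the caller's side of a C style call with nargs 4-byte
 * parameters and returns ESP after the cleanup. */
uint32_t call_and_clean(Stack *s, const uint32_t *args, unsigned nargs) {
    for (unsigned i = nargs; i > 0; i--) {
        push32(s, args[i - 1]);   /* pushl, last parameter first */
    }
    push32(s, 0x1234);            /* call: pushes a placeholder return address */
    (void)pop32(s);               /* ret: pops the return address */
    s->esp += 4 * nargs;          /* addl $(4 * nargs), %esp */
    return s->esp;
}

void cleanup_demo(void) {
    Stack s = { {0}, MODEL_STACK_BYTES };
    uint32_t args[3] = { 1, 2, 3 };
    uint32_t before = s.esp;
    /* Three 4-byte parameters need 12 bytes added back to ESP. */
    assert(call_and_clean(&s, args, 3) == before);
}
```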

2.9 Using indirect addressing with registers

Besides holding data, registers can also be used to hold memory addresses. When a register holds a memory address, it is referred to as a pointer. Accessing the data stored in the memory location using the pointer is called indirect addressing.
While using a label references the data value contained in the memory location, you can get the memory location address of the data value by placing a dollar sign ($) in front of the label in the instruction.
movl $output, %edi
This instruction moves the memory address of the output label to the EDI register. The dollar sign ($) before the label name instructs the assembler to use the memory address, and not the data value located at the address.
movl %ebx, (%edi)
Without the parentheses around the EDI register, the instruction would just load the value in the EBX register to the EDI register. With the parentheses around the EDI register, the instruction instead moves the value in the EBX register to the memory location contained in the EDI register.
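C programmers will recognize this as taking an address and then storing through a pointer. The sketch below lines the two instructions up with their C counterparts. The variable named output mirrors the label in the text, while the function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

uint32_t output;  /* plays the role of the output label */

/* movl $output, %edi : load the address of output, not its contents. */
uint32_t *address_of_output(void) {
    return &output;
}

/* movl %ebx, (%edi) : store the EBX value at the address held in EDI. */
void store_indirect(uint32_t *edi, uint32_t ebx) {
    assert(edi != NULL);
    *edi = ebx;
}

void indirect_demo(void) {
    uint32_t *edi = address_of_output();
    store_indirect(edi, 42);
    assert(output == 42);  /* the write went to memory, not to the pointer */
}
```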

2.10 FPU (Floating-point Unit)

The FPU is a self-contained unit that handles floating-point operations using a set of registers that are set apart from the standard processor registers. The additional FPU registers include eight 80-bit data registers, and three 16-bit registers called the control, status, and tag registers.
Because the FPU is independent of the main processor, it does not normally use the EFLAGS register to indicate results and determine behavior.