Assembly Language Tips for P4

Most people associate optimizing programs for the Intel® Pentium® 4 processor with using Streaming SIMD Extensions (SSE) and Streaming SIMD Extensions 2 (SSE2) instructions to improve performance. This holds true in most cases: by using SSE2 instructions on the 128-bit XMM registers, application performance often increases dramatically. On the other hand, if a program works with 64-bit data and it is inefficient to pack that data together into 128-bit registers, there is still a way to optimize the code without using SSE or SSE2.

Using SIMD instructions on 128-bit wide registers can vastly improve the performance of a program. If for some reason you can only work with 64-bit data, what can be done about it? Working only with 64-bit data does not mean that we cannot use the 128-bit XMM registers; we can still use them by rearranging the data and packing it into XMM registers.
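
For instance, a minimal sketch of that packing step (the memory labels a and b are hypothetical) might look like this:

          movq   xmm0, qword ptr [a]   // load the first 64-bit value into the low half of xmm0
          movhps xmm0, qword ptr [b]   // load the second 64-bit value into the high half
          // a single SSE2 instruction (paddq, pand, and so on) can now operate on both values at once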

The problem is that the overhead associated with massaging data and inserting it into XMM registers can overwhelm any performance gains. The other thing one must consider is Hyper-Threading Technology; systems with Hyper-Threading Technology enabled are different from multi-processor systems, in the sense that they share physical resources like cache and execution units between the logical processors.

Optimizing for the Pentium 4 Processor with Hyper-Threading Technology

To optimize an application for the Pentium 4 processor, you can use either MMX or SSE/SSE2 instructions. The question is when to use MMX™ instructions and when to use SSE/SSE2. The most common practice is to explore SSE/SSE2 first, since those instructions can operate on the XMM registers, which are 128 bits wide, compared with the 64-bit MMX registers.

Since XMM registers are wider than MMX registers, however, some instructions take much longer to complete when operating on XMM registers, even though they use the same execution units. For example, the instruction paddq takes six clock cycles on XMM registers, but only two clock cycles to complete on MMX registers.

Therefore, if your application involves many calculations on 128-bit data, using SSE/SSE2 makes sense. Otherwise, there is a lot of overhead involved in massaging data to load it into XMM registers, and that overhead may be too large for your application to realize the benefit of using SSE/SSE2 over MMX.

Sometimes you can use the combination of both MMX and SSE/SSE2. You can use the XMM registers as extra storage for data values when the MMX operations require temporary storage of large amounts of data. By using the XMM registers as temporary storage, you do not have to swap data or wait until the MMX registers are free before loading new data. This way, you can save a substantial number of clock cycles loading and reloading.
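
For example, a sketch of this idea (the register choices and the newData label are arbitrary): the SSE2 instructions movq2dq and movdq2q move a 64-bit value between an MMX register and the low half of an XMM register, so an XMM register can serve as a scratch slot without a trip through memory.

          movq2dq xmm3, mm2            // stash the current contents of mm2 in xmm3
          movq    mm2, [newData]       // reuse mm2 for new data right away
          // ... more MMX work ...
          movdq2q mm2, xmm3            // bring the saved value back when it is needed again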

Analyze overall programs or functions to simplify them if you can. Sometimes changing branching conditions can make a big difference. For example, moving a loop-invariant condition out of a loop and testing it once before the loop eliminates one branching statement from a loop body that executes 640 times (the number of pixels per scan line of a 640x480 picture), as sketched below. The improvement is even greater if that function is called many times in a program.
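
A sketch of that restructuring (the labels and the use_alpha flag are invented for illustration):

          // Before: a loop-invariant flag is tested on every one of the 640 iterations
          mov   ecx, 640
  pixel_loop:
          cmp   byte ptr [use_alpha], 0
          je    skip_blend
          // ... blend the pixel ...
  skip_blend:
          // ... write the pixel ...
          dec   ecx
          jnz   pixel_loop

          // After: test the flag once, then branch to a loop specialized for each case
          cmp   byte ptr [use_alpha], 0
          je    opaque_loop            // a copy of the loop with the blending removed
          // ... fall through to a 640-iteration loop that always blends ...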

We all know that moving data from memory to registers or between memory locations can be costly if not done carefully. The common practice is to hoist such moves far in advance of their usage. Another point to consider is to look for equivalent instructions that have shorter latencies. For example, consider using the instruction pshufw (two clock cycles) in place of movq (six clock cycles) when moving data among MMX registers (with the order value of 0xE4).

Note that you have to take into account not only the latency of an instruction but also its throughput. The throughput is defined as the time it takes for an execution unit to serve an instruction before it is ready to receive the next one. The importance of throughput can be seen in the discussion of tips and tricks for creating constants, below.

Finally, it is very important to remember that machines with Hyper-Threading Technology enabled share execution units; therefore, for statements that are next to or close to each other, use instructions that distribute the work across different execution units. This way you can hide latencies and improve performance. The technique is beneficial not only for systems with Hyper-Threading Technology, but for all systems.

Put These Tips and Tricks to Work

The following tips and tricks put some of the techniques described above into practice.

  • Initializing Data:
    - Set a register to zero:

          mov eax, 0

      Faster:

          xor  eax, eax
          pxor mm0, mm0
          pxor xmm0, xmm0

    - Set all bits of MM0 to 1s:

          C declaration: unsigned temp[4] = {0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF};

          asm { movq   mm0, temp
                movdqu xmm1, temp }

      Faster:

          pcmpeqd mm0, mm0
          pcmpeqd xmm1, xmm1

  • Creating Constants:
    - Set mm7 to 0xFF00FF00FF00FF00:

          pcmpeqd mm7, mm7      // 0xFF FF FF FF FF FF FF FF
          psllq   mm7, 8        // 0xFF FF FF FF FF FF FF 00
          pshufw  mm7, mm7, 0x0 // 0xFF 00 FF 00 FF 00 FF 00

      Each instruction takes two clock cycles to complete, so the whole operation finishes in six clock cycles.

      Faster:

          pxor      mm7, mm7    // 0x0000000000000000
          pcmpeqd   mm0, mm0    // 0xFFFFFFFFFFFFFFFF
          punpcklbw mm7, mm0    // 0xFF00FF00FF00FF00

      Now pxor and pcmpeqd are handled by the MMX-ALU execution unit, while punpcklbw is handled by the MMX-SHIFT execution unit. Each instruction still takes two clock cycles to complete, but the MMX-ALU waits only one cycle after pxor before serving pcmpeqd, rather than waiting for pxor to finish. The whole operation therefore completes in five clock cycles instead of six.

    - Set mm7 to 0x00FF00FF00FF00FF:

          pxor      mm0, mm0    // 0x0000000000000000
          pcmpeqd   mm7, mm7    // 0xFFFFFFFFFFFFFFFF
          punpcklbw mm7, mm0    // 0x00FF00FF00FF00FF

      Note: The same technique can be used with XMM registers with some minor modifications, since we can only work on half of an XMM register at a time.
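
As an illustration (a sketch, not taken from the original text), this particular constant happens to need no extra steps in an XMM register: both sources hold uniform values, so unpacking their low halves already fills all 128 bits.

          pxor      xmm0, xmm0     // xmm0 = all zeros
          pcmpeqd   xmm7, xmm7     // xmm7 = all ones
          punpcklbw xmm7, xmm0     // xmm7 = 0x00FF repeated in all eight words

For a pattern that is not uniform, you would need the extra modifications the note mentions.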

  • Loading Data:

          movq mm1, mm2

      Faster:

          pshufw mm1, mm2, 0xE4

Note: The trick lies in the magic number 0xE4; it means do not change the order. (0xE4 is binary 11 10 01 00, so each destination word is taken from the source word in the same position.)

This is a useful way to copy the contents of one register to another. The instruction movq takes six clock cycles to complete, compared with only two for the pshufw instruction. Do not substitute movq with pshufw automatically, however; make sure that the appropriate execution unit is not busy at that time. The movq and pshufw instructions use the FP_MOV and MMX_SHFT execution units, respectively.

  • Swapping Data:
    - Swapping the hi and lo portions of a register:

          pshufw mm0, mm0, 0x4E
          pshufd xmm0, xmm0, 0x4E

      Note: If you reverse the order number from 0x4E to 0xE4, the operation becomes a copy instead of a swap.

    - Creating patterns:
      Load register mm0 with 0xAADDAADDAADDAADD:

          mov    eax, 0xAADD
          movd   mm0, eax
          pshufw mm0, mm0, 0x0

Note: The number 0x0 will copy the first word "AADD" to all four words of mm0.

You can use the same technique with XMM registers: build the pattern in the lower half, shift it left to move it into the upper half, and then issue the same commands again to fill the lower half.
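
One way that XMM sequence might look (a sketch following the description above; pshuflw fills only the low half of the register):

          mov     eax, 0xAADD
          movd    xmm0, eax            // low dword of xmm0 = 0x0000AADD
          pshuflw xmm0, xmm0, 0x0      // low half of xmm0 = 0xAADDAADDAADDAADD
          pslldq  xmm0, 8              // shift the pattern into the upper half
          movd    xmm1, eax
          pshuflw xmm1, xmm1, 0x0      // build the pattern in the lower half again
          por     xmm0, xmm1           // xmm0 = 0xAADD repeated in all eight words

A shorter alternative is to follow the first pshuflw with pshufd xmm0, xmm0, 0x44, which copies the low quadword into the high quadword.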

  • Using lea Instructions:

          mov edx, ecx
          sal edx, 3

      Faster:

          lea edx, [ecx + ecx]
          add edx, edx
          add edx, edx

      Note: The lea instruction followed by two add instructions is fast, but do not go beyond three adds; past that point the accumulated latency defeats the benefit of the lea instruction.

Conclusion

We do not always have to use SSE/SSE2 instructions to increase performance on the Pentium 4 processor. If operations involve only integers on 64-bit data, use MMX instead of SSE/SSE2.

First, look at the overall picture to see what you can do to simplify the functions. If you have a large number of identical operations that need to stay close to each other, try to spread them across different execution units to hide latencies. Things like unused code, placement of branching conditions, looping, and faster instructions are worth considering during the optimization process.

Finally, MMX instructions also perform well on the Pentium 4 processor. Make use of MMX instructions when you can while working with 64-bit data.
