http://neilkemp.us/src/sse_tutorial/sse_tutorial.html
Intel SSE Tutorial : An Introduction to the SSE Instruction Set
Table of Contents
Introduction, Prerequisites, and Summary
I am writing this tutorial on SSE (Streaming SIMD Extensions) for three reasons. First, there is a lack of well-organized tutorials on the subject; second, it's an educational process I'm personally undertaking to better familiarize myself with low-level optimization techniques; and finally, what better way to have fun than to mess around with a bunch of registers!?
There are some things you should already be familiar with in order to benefit from this tutorial. For starters you should have a general understanding of computer organization, the Intel x86 architecture and its assembly language, and a solid understanding of C++. These tutorials are mostly concerned with SSE optimizations of common graphics operations, so a good understanding of 3D math will also be useful.
Being a 3D graphics programmer, I am always interested in ways to make my applications run faster. Common sense holds that the fewer CPU operations executed between frames, the more frames my application can draw per second. 3D math is one area that benefits greatly from SSE optimization, since matrix transformations and vector calculations (e.g., dot and cross products) eat up valuable CPU cycles. The focus of this tutorial will therefore be on building optimized 3D math functions to use with graphics applications, introducing the instruction set by solving common vector operations. So without further ado, let's have a look at the fun-packed world of SSE.
First, what the heck is SSE? Basically, SSE is a collection of 128-bit CPU registers. Each register can be packed with four 32-bit scalars, after which an operation can be performed on all four elements simultaneously. In contrast, it may take four or more instructions in regular x86 assembly to do the same thing. Below you can see two vectors (SSE registers) packed with scalars. Multiplying them with MULPS performs four multiplications and stores the result with a single instruction. The benefits of using SSE are too great to ignore.
Data Movement Instructions:
MOVUPS and MOVAPS move four packed single-precision floats between memory and an XMM register; MOVUPS works on unaligned data, while MOVAPS requires 16-byte-aligned data.
Logical Instructions:
Each logical instruction operates bitwise across the full 128-bit register. The truth table below shows the result of each instruction for a single pair of bits (ANDNPS inverts the destination bit before ANDing it with the source):
Destination | Source | ANDPS | ANDNPS | ORPS | XORPS
     1      |   1    |   1   |   0    |  1   |   0
     1      |   0    |   0   |   0    |  1   |   1
     0      |   1    |   0   |   1    |  1   |   1
     0      |   0    |   0   |   0    |  0   |   0
Other instructions that are not covered here include data conversion between x86 and MMX registers, cache control instructions, and state management instructions. To learn more details about any of these instructions you can follow one of the links provided at the bottom of this page.
Now that we are more familiar with the instruction palette, let's take a look at our first example. In this example we will add two 4-element vectors using a C++ function and inline assembly. I'll start by showing you the source and then explain each step in detail.
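A minimal sketch of such a function follows. The tutorial targets MSVC's `__asm` syntax; this equivalent uses GCC/Clang extended asm so it also builds for x86-64, and the `Vec4` and `AddVectors` names are my own:

```cpp
#include <cassert>

// A plain 4-float vector type; nothing here guarantees 16-byte
// alignment, hence the unaligned MOVUPS instructions below.
struct Vec4 { float x, y, z, w; };

// Adds two Vec4s with SSE. The "a"/"b" constraints pin the vector
// pointers to EAX/EBX (RAX/RBX in 64-bit builds), mirroring the
// steps described in the text.
Vec4 AddVectors(const Vec4 &a, const Vec4 &b)
{
    Vec4 r;
    __asm__ volatile(
        "movups (%1), %%xmm0\n\t"   // unaligned load of a into XMM0
        "movups (%2), %%xmm1\n\t"   // unaligned load of b into XMM1
        "addps  %%xmm1, %%xmm0\n\t" // four single-precision adds at once
        "movups %%xmm0, (%0)\n\t"   // unaligned store of the result
        :
        : "r"(&r), "a"(&a), "b"(&b)
        : "xmm0", "xmm1", "memory");
    return r;
}
```

Calling `AddVectors` on (1, 2, 3, 4) and (5, 6, 7, 8) yields (6, 8, 10, 12).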
Because we pass references to the function rather than copies, we first need to move the vector pointers into the 32-bit registers EAX and EBX. Using MOVUPS, which handles unaligned data, we load the values those pointers refer to into the SIMD registers XMM0 and XMM1. Next we use ADDPS to add the two register operands and store the resulting vector in XMM0. The final step is to move the contents of XMM0 into a vector structure before returning it; once again we use MOVUPS to do so.
So that doesn't look so hard. We have an unaligned 4-element vector to send to and get back from the addition function. I said that the vector is unaligned, but what does that mean? To be more specific, the vector is not guaranteed to be a 16-byte-aligned struct. Huh? The SSE registers are each 128 bits, or 16 bytes, wide. Data is said to be 16-byte aligned when it starts on a 16-byte boundary in memory. In this case we are doing nothing special to guarantee that alignment. In later examples we will see how to manually align data. Some compilers, such as the Intel compiler, will automatically align data during compilation. It should be noted that operating on aligned data is much faster than on unaligned data.
Shuffling is an easy way to reorder the elements of a single register or combine the elements of two separate registers. The SHUFPS instruction takes two SSE registers and an 8-bit immediate value. The first two elements of the destination register are overwritten by any two elements of the destination itself, while the third and fourth elements are overwritten by any two elements of the source register. The immediate value tells the instruction which elements to select: it is split into four 2-bit fields, and the binary values 00, 01, 10, and 11 index the four elements of a register.
Rather than redefine what an intrinsic is, I'll just quote from the MSDN.
"An intrinsic is a function known by the compiler that directly maps to a sequence of one or more assembly language instructions. Intrinsic functions are inherently more efficient than called functions because no calling linkage is required.
Intrinsics make the use of processor-specific enhancements easier because they provide a C/C++ language interface to assembly instructions. In doing so, the compiler manages things that the user would normally have to be concerned with, such as register names, register allocations, and memory locations of data." - MSDN
In short, we no longer need to use inline assembly to use SSE from our C++ applications, and we no longer need to worry about register names. Working with intrinsics sits somewhere between assembly programming and high-level programming. Let's take a look at how to use them with what we already know.
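As a sketch, here is the earlier addition function rewritten with intrinsics; `_mm_loadu_ps` and `_mm_storeu_ps` correspond to MOVUPS and `_mm_add_ps` to ADDPS, and the `Vec4` and `AddVectors` names are again my own:

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>

struct Vec4 { float x, y, z, w; };

// The same vector addition as before, written with intrinsics.
// The compiler handles register allocation; no __asm block needed.
Vec4 AddVectors(const Vec4 &a, const Vec4 &b)
{
    Vec4 r;
    __m128 ma  = _mm_loadu_ps(&a.x);  // MOVUPS: unaligned load
    __m128 mb  = _mm_loadu_ps(&b.x);  // MOVUPS: unaligned load
    __m128 sum = _mm_add_ps(ma, mb);  // ADDPS: packed add
    _mm_storeu_ps(&r.x, sum);         // MOVUPS: unaligned store
    return r;
}
```

Note that this version compiles with both the x86 and x64 compilers, since no inline assembly is involved.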
The source code and HTML for this tutorial can be downloaded here. It demonstrates vector operations using SSE and inline assembly.
Because the examples use inline assembly, the source will not compile with the x64 Visual C++ compiler, which does not support __asm blocks.