JNI Tips
IN THIS DOCUMENT
- JavaVM and JNIEnv
- Threads
- jclass, jmethodID, and jfieldID
- Local and Global References
- UTF-8 and UTF-16 Strings
- Primitive Arrays
- Region Calls
- Exceptions
- Extended Checking
- Native Libraries
- 64-bit Considerations
- Unsupported Features/Backwards Compatibility
- FAQ: Why do I get UnsatisfiedLinkError?
- FAQ: Why didn't FindClass find my class?
- FAQ: How do I share raw data with native code?
JNI is the Java Native Interface. It defines a way for managed code (written in the Java programming language) to interact with native code (written in C/C++). It's vendor-neutral, has support for loading code from dynamic shared libraries, and while cumbersome at times is reasonably efficient.
If you're not already familiar with it, read through the Java Native Interface Specification (http://docs.oracle.com/javase/7/docs/technotes/guides/jni/spec/jniTOC.html) to get a sense for how JNI works and what features are available. Some aspects of the interface aren't immediately obvious on first reading, so you may find the next few sections handy.
JavaVM and JNIEnv
JNI defines two key data structures, "JavaVM" and "JNIEnv". Both of these are essentially pointers to pointers to function tables. (In the C++ version, they're classes with a pointer to a function table and a member function for each JNI function that indirects through the table.) The JavaVM provides the "invocation interface" functions, which allow you to create and destroy a JavaVM. In theory you can have multiple JavaVMs per process, but Android only allows one.
The JNIEnv provides most of the JNI functions. Your native functions all receive a JNIEnv as the first argument.
The JNIEnv is used for thread-local storage. For this reason, you cannot share a JNIEnv between threads. If a piece of code has no other way to get its JNIEnv, you should share the JavaVM, and use GetEnv
to discover the thread's JNIEnv. (Assuming it has one; see AttachCurrentThread
below.)
The C declarations of JNIEnv and JavaVM are different from the C++ declarations. The "jni.h"
include file provides different typedefs depending on whether it's included into C or C++. For this reason it's a bad idea to include JNIEnv arguments in header files included by both languages. (Put another way: if your header file requires #ifdef __cplusplus
, you may have to do some extra work if anything in that header refers to JNIEnv.)
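If you need a JNIEnv on a thread where you didn't receive one as an argument, a common pattern is to cache the JavaVM once and ask it for the current thread's JNIEnv. A minimal sketch follows, assuming a hypothetical global g_vm that was stored during JNI_OnLoad:

#include <jni.h>

static JavaVM* g_vm = nullptr;  // assumed to be cached in JNI_OnLoad

static JNIEnv* getEnvForCurrentThread() {
    JNIEnv* env = nullptr;
    // GetEnv fails (JNI_EDETACHED) if this thread was never attached to the VM.
    if (g_vm->GetEnv(reinterpret_cast<void**>(&env), JNI_VERSION_1_6) != JNI_OK) {
        return nullptr;  // caller must AttachCurrentThread first (see Threads below)
    }
    return env;
}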
Threads
All threads are Linux threads, scheduled by the kernel. They're usually started from managed code (using Thread.start
), but they can also be created elsewhere and then attached to the JavaVM. For example, a thread started with pthread_create
can be attached with the JNI AttachCurrentThread
or AttachCurrentThreadAsDaemon
functions. Until a thread is attached, it has no JNIEnv, and cannot make JNI calls.
Attaching a natively-created thread causes a java.lang.Thread
object to be constructed and added to the "main" ThreadGroup
, making it visible to the debugger. Calling AttachCurrentThread
on an already-attached thread is a no-op.
Android does not suspend threads executing native code. If garbage collection is in progress, or the debugger has issued a suspend request, Android will pause the thread the next time it makes a JNI call.
Threads attached through JNI must call DetachCurrentThread
before they exit. If coding this directly is awkward, in Android 2.0 (Eclair) and higher you can use pthread_key_create
to define a destructor function that will be called before the thread exits, and call DetachCurrentThread
from there. (Use that key with pthread_setspecific
to store the JNIEnv in thread-local-storage; that way it'll be passed into your destructor as the argument.)
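A minimal sketch of that arrangement, assuming the JavaVM was cached in a hypothetical global g_vm and that g_env_key was created once with pthread_key_create(&g_env_key, detachThread):

#include <jni.h>
#include <pthread.h>

static JavaVM* g_vm;             // assumed to be cached in JNI_OnLoad
static pthread_key_t g_env_key;  // created once with pthread_key_create(&g_env_key, detachThread)

// Destructor: runs as the native thread exits, detaching it from the VM.
static void detachThread(void* env) {
    if (env != nullptr) {
        g_vm->DetachCurrentThread();
    }
}

static JNIEnv* attachCurrentThread() {
    JNIEnv* env = nullptr;
    if (g_vm->AttachCurrentThread(&env, nullptr) != JNI_OK) {
        return nullptr;
    }
    // Stash the JNIEnv in thread-local storage so the destructor fires on exit.
    pthread_setspecific(g_env_key, env);
    return env;
}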
jclass, jmethodID, and jfieldID
If you want to access an object's field from native code, you would do the following:
- Get the class object reference for the class with FindClass
- Get the field ID for the field with GetFieldID
- Get the contents of the field with something appropriate, such as GetIntField
Similarly, to call a method, you'd first get a class object reference and then a method ID. The IDs are often just pointers to internal runtime data structures. Looking them up may require several string comparisons, but once you have them the actual call to get the field or invoke the method is very quick.
If performance is important, it's useful to look the values up once and cache the results in your native code. Because there is a limit of one JavaVM per process, it's reasonable to store this data in a static local structure.
The class references, field IDs, and method IDs are guaranteed valid until the class is unloaded. Classes are only unloaded if all classes associated with a ClassLoader can be garbage collected, which is rare but will not be impossible in Android. Note however that the jclass
is a class reference and must be protected with a call to NewGlobalRef
(see the next section).
If you would like to cache the IDs when a class is loaded, and automatically re-cache them if the class is ever unloaded and reloaded, the correct way to initialize the IDs is to add a piece of code that looks like this to the appropriate class:
/*
* We use a class initializer to allow the native code to cache some
* field offsets. This native function looks up and caches interesting
* class/field/method IDs. Throws on failure.
*/
private static native void nativeInit();

static {
    nativeInit();
}
Create a nativeInit
method in your C/C++ code that performs the ID lookups. The code will be executed once, when the class is initialized. If the class is ever unloaded and then reloaded, it will be executed again.
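A minimal sketch of what that native method could look like, with placeholder class, field, and method names ("MyClass", "myField", "myMethod"); the global reference is explained in the next section:

static jclass gMyClass;        // promoted to a global reference below
static jfieldID gMyFieldId;
static jmethodID gMyMethodId;

extern "C" JNIEXPORT void JNICALL
Java_MyClass_nativeInit(JNIEnv* env, jclass clazz) {
    // 'clazz' is the class being initialized; keep it valid beyond this call.
    gMyClass = reinterpret_cast<jclass>(env->NewGlobalRef(clazz));
    gMyFieldId = env->GetFieldID(gMyClass, "myField", "I");
    gMyMethodId = env->GetMethodID(gMyClass, "myMethod", "()V");
    // Each lookup throws (NoSuchFieldError/NoSuchMethodError) on failure, and
    // the pending exception propagates back to the class initializer.
}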
Local and Global References
Every argument passed to a native method, and almost every object returned by a JNI function is a "local reference". This means that it's valid for the duration of the current native method in the current thread. Even if the object itself continues to live on after the native method returns, the reference is not valid.
This applies to all sub-classes of jobject
, including jclass
, jstring
, and jarray
. (The runtime will warn you about most reference mis-uses when extended JNI checks are enabled.)
The only way to get non-local references is via the functions NewGlobalRef
and NewWeakGlobalRef
.
If you want to hold on to a reference for a longer period, you must use a "global" reference. The NewGlobalRef
function takes the local reference as an argument and returns a global one. The global reference is guaranteed to be valid until you call DeleteGlobalRef
.
This pattern is commonly used when caching a jclass returned from FindClass
, e.g.:
jclass localClass = env->FindClass("MyClass");
jclass globalClass = reinterpret_cast<jclass>(env->NewGlobalRef(localClass));
All JNI methods accept both local and global references as arguments. It's possible for references to the same object to have different values. For example, the return values from consecutive calls to NewGlobalRef
on the same object may be different. To see if two references refer to the same object, you must use the IsSameObject
function. Never compare references with ==
in native code.
One consequence of this is that you must not assume object references are constant or unique in native code. The 32-bit value representing an object may be different from one invocation of a method to the next, and it's possible that two different objects could have the same 32-bit value on consecutive calls. Do not use jobject
values as keys.
Programmers are required to "not excessively allocate" local references. In practical terms this means that if you're creating large numbers of local references, perhaps while running through an array of objects, you should free them manually with DeleteLocalRef
instead of letting JNI do it for you. The implementation is only required to reserve slots for 16 local references, so if you need more than that you should either delete as you go or use EnsureLocalCapacity/PushLocalFrame to reserve more.
Note that jfieldID
s and jmethodID
s are opaque types, not object references, and should not be passed to NewGlobalRef
. The raw data pointers returned by functions like GetStringUTFChars
and GetByteArrayElements
are also not objects. (They may be passed between threads, and are valid until the matching Release call.)
One unusual case deserves separate mention. If you attach a native thread with AttachCurrentThread
, the code you are running will never automatically free local references until the thread detaches. Any local references you create will have to be deleted manually. In general, any native code that creates local references in a loop probably needs to do some manual deletion.
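For example, a loop over an object array can delete each local reference as it goes. A minimal sketch:

void processStrings(JNIEnv* env, jobjectArray array) {
    jsize count = env->GetArrayLength(array);
    for (jsize i = 0; i < count; i++) {
        jstring item = static_cast<jstring>(env->GetObjectArrayElement(array, i));
        if (item == nullptr) continue;
        // ... use the string ...
        env->DeleteLocalRef(item);  // free the local-reference slot before the next iteration
    }
}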
UTF-8 and UTF-16 Strings
The Java programming language uses UTF-16. For convenience, JNI provides methods that work with Modified UTF-8 as well. The modified encoding is useful for C code because it encodes \u0000 as 0xc0 0x80 instead of 0x00. The nice thing about this is that you can count on having C-style zero-terminated strings, suitable for use with standard libc string functions. The down side is that you cannot pass arbitrary UTF-8 data to JNI and expect it to work correctly.
If possible, it's usually faster to operate with UTF-16 strings. Android currently does not require a copy in GetStringChars
, whereas GetStringUTFChars
requires an allocation and a conversion to UTF-8. Note that UTF-16 strings are not zero-terminated, and \u0000 is allowed, so you need to hang on to the string length as well as the jchar pointer.
Don't forget to Release
the strings you Get
. The string functions return jchar*
or jbyte*
, which are C-style pointers to primitive data rather than local references. They are guaranteed valid until Release
is called, which means they are not released when the native method returns.
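A minimal sketch of the Get/Release pairing for UTF-16 data, keeping the length alongside the jchar pointer:

void useStringChars(JNIEnv* env, jstring s) {
    const jchar* chars = env->GetStringChars(s, NULL);
    if (chars == NULL) {
        return;  // an exception (such as OutOfMemoryError) is pending
    }
    jsize len = env->GetStringLength(s);  // UTF-16 code units; the data is not zero-terminated
    // ... use 'chars' and 'len' ...
    env->ReleaseStringChars(s, chars);
}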
Data passed to NewStringUTF must be in Modified UTF-8 format. A common mistake is reading character data from a file or network stream and handing it to NewStringUTF
without filtering it. Unless you know the data is 7-bit ASCII, you need to strip out high-ASCII characters or convert them to proper Modified UTF-8 form. If you don't, the UTF-16 conversion will likely not be what you expect. The extended JNI checks will scan strings and warn you about invalid data, but they won't catch everything.
Primitive Arrays
JNI provides functions for accessing the contents of array objects. While arrays of objects must be accessed one entry at a time, arrays of primitives can be read and written directly as if they were declared in C.
To make the interface as efficient as possible without constraining the VM implementation, the Get
family of calls allows the runtime to either return a pointer to the actual elements, or allocate some memory and make a copy. Either way, the raw pointer returned is guaranteed to be valid until the corresponding Release
call is issued (which implies that, if the data wasn't copied, the array object will be pinned down and can't be relocated as part of compacting the heap). You must Release
every array you Get
. Also, if the Get
call fails, you must ensure that your code doesn't try to Release
a NULL pointer later.
You can determine whether or not the data was copied by passing in a non-NULL pointer for the isCopy
argument. This is rarely useful.
The Release
call takes a mode
argument that can have one of three values. The actions performed by the runtime depend upon whether it returned a pointer to the actual data or a copy of it:
0
- Actual: the array object is un-pinned.
- Copy: data is copied back. The buffer with the copy is freed.
JNI_COMMIT
- Actual: does nothing.
- Copy: data is copied back. The buffer with the copy is not freed.
JNI_ABORT
- Actual: the array object is un-pinned. Earlier writes are not aborted.
- Copy: the buffer with the copy is freed; any changes to it are lost.
One reason for checking the isCopy
flag is to know if you need to call Release
with JNI_COMMIT
after making changes to an array — if you're alternating between making changes and executing code that uses the contents of the array, you may be able to skip the no-op commit. Another possible reason for checking the flag is for efficient handling of JNI_ABORT
. For example, you might want to get an array, modify it in place, pass pieces to other functions, and then discard the changes. If you know that JNI is making a new copy for you, there's no need to create another "editable" copy. If JNI is passing you the original, then you do need to make your own copy.
It is a common mistake (repeated in example code) to assume that you can skip the Release
call if *isCopy
is false. This is not the case. If no copy buffer was allocated, then the original memory must be pinned down and can't be moved by the garbage collector.
Also note that the JNI_COMMIT
flag does not release the array, and you will need to call Release
again with a different flag eventually.
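Putting these pieces together, a minimal read-only access might look like this sketch (the function and parameter names are illustrative):

jint sumIntArray(JNIEnv* env, jintArray array) {
    jboolean isCopy;
    jint* data = env->GetIntArrayElements(array, &isCopy);
    if (data == NULL) {
        return 0;  // the Get failed; never Release a NULL pointer
    }
    jsize len = env->GetArrayLength(array);
    jint sum = 0;
    for (jsize i = 0; i < len; i++) {
        sum += data[i];
    }
    // JNI_ABORT: un-pin the array or discard the copy; nothing is written back.
    env->ReleaseIntArrayElements(array, data, JNI_ABORT);
    return sum;
}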
Region Calls
There is an alternative to calls like Get<PrimitiveType>ArrayElements and GetStringChars that may be very helpful when all you want to do is copy data in or out. Consider the following:
jbyte* data = env->GetByteArrayElements(array, NULL);
if (data != NULL) {
    memcpy(buffer, data, len);
    env->ReleaseByteArrayElements(array, data, JNI_ABORT);
}
This grabs the array, copies the first len
byte elements out of it, and then releases the array. Depending upon the implementation, the Get
call will either pin or copy the array contents. The code copies the data (for perhaps a second time), then calls Release
; in this case JNI_ABORT
ensures there's no chance of a third copy.
One can accomplish the same thing more simply:
env->GetByteArrayRegion(array, 0, len, buffer);
This has several advantages:
- Requires one JNI call instead of 2, reducing overhead.
- Doesn't require pinning or extra data copies.
- Reduces the risk of programmer error — no risk of forgetting to call
Release
after something fails.
Similarly, you can use the Set<PrimitiveType>ArrayRegion calls to copy data into an array, and GetStringRegion or GetStringUTFRegion to copy characters out of a String.
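A short sketch of the write direction, assuming the array holds at least as many elements as the native buffer:

void copyIntoArray(JNIEnv* env, jbyteArray array) {
    jbyte buffer[64] = {0};  // hypothetical native data
    // Copies 64 elements from 'buffer' into the array, starting at index 0.
    env->SetByteArrayRegion(array, 0, sizeof(buffer), buffer);
}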
Exceptions
You must not call most JNI functions while an exception is pending. Your code is expected to notice the exception (via the function's return value, ExceptionCheck
, or ExceptionOccurred
) and return, or clear the exception and handle it.
The only JNI functions that you are allowed to call while an exception is pending are:
DeleteGlobalRef
DeleteLocalRef
DeleteWeakGlobalRef
ExceptionCheck
ExceptionClear
ExceptionDescribe
ExceptionOccurred
MonitorExit
PopLocalFrame
PushLocalFrame
Release<PrimitiveType>ArrayElements
ReleasePrimitiveArrayCritical
ReleaseStringChars
ReleaseStringCritical
ReleaseStringUTFChars
Many JNI calls can throw an exception, but often provide a simpler way of checking for failure. For example, if NewString
returns a non-NULL value, you don't need to check for an exception. However, if you call a method (using a function like CallObjectMethod
), you must always check for an exception, because the return value is not going to be valid if an exception was thrown.
Note that exceptions thrown by interpreted code do not unwind native stack frames, and Android does not yet support C++ exceptions. The JNI Throw
and ThrowNew
instructions just set an exception pointer in the current thread. Upon returning to managed from native code, the exception will be noted and handled appropriately.
Native code can "catch" an exception by calling ExceptionCheck
or ExceptionOccurred
, and clear it with ExceptionClear
. As usual, discarding exceptions without handling them can lead to problems.
There are no built-in functions for manipulating the Throwable
object itself, so if you want to (say) get the exception string you will need to find the Throwable
class, look up the method ID for getMessage "()Ljava/lang/String;"
, invoke it, and if the result is non-NULL use GetStringUTFChars
to get something you can hand to printf(3)
or equivalent.
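A minimal sketch of that sequence (error handling trimmed; logging via printf is just for illustration):

#include <cstdio>

void logPendingException(JNIEnv* env) {
    jthrowable exc = env->ExceptionOccurred();
    if (exc == NULL) return;
    env->ExceptionClear();  // clear it so the calls below are allowed

    jclass throwableClass = env->FindClass("java/lang/Throwable");
    jmethodID getMessage =
        env->GetMethodID(throwableClass, "getMessage", "()Ljava/lang/String;");
    jstring msg = static_cast<jstring>(env->CallObjectMethod(exc, getMessage));
    if (msg != NULL && !env->ExceptionCheck()) {
        const char* utf = env->GetStringUTFChars(msg, NULL);
        if (utf != NULL) {
            printf("exception: %s\n", utf);
            env->ReleaseStringUTFChars(msg, utf);
        }
    }
}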
Extended Checking
JNI does very little error checking. Errors usually result in a crash. Android also offers a mode called CheckJNI, where the JavaVM and JNIEnv function table pointers are switched to tables of functions that perform an extended series of checks before calling the standard implementation.
The additional checks include:
- Arrays: attempting to allocate a negative-sized array.
- Bad pointers: passing a bad jarray/jclass/jobject/jstring to a JNI call, or passing a NULL pointer to a JNI call with a non-nullable argument.
- Class names: passing anything but the “java/lang/String” style of class name to a JNI call.
- Critical calls: making a JNI call between a “critical” get and its corresponding release.
- Direct ByteBuffers: passing bad arguments to NewDirectByteBuffer.
- Exceptions: making a JNI call while there’s an exception pending.
- JNIEnv*s: using a JNIEnv* from the wrong thread.
- jfieldIDs: using a NULL jfieldID, or using a jfieldID to set a field to a value of the wrong type (trying to assign a StringBuilder to a String field, say), or using a jfieldID for a static field to set an instance field or vice versa, or using a jfieldID from one class with instances of another class.
- jmethodIDs: using the wrong kind of jmethodID when making a Call*Method JNI call: incorrect return type, static/non-static mismatch, wrong type for ‘this’ (for non-static calls) or wrong class (for static calls).
- References: using DeleteGlobalRef/DeleteLocalRef on the wrong kind of reference.
- Release modes: passing a bad release mode to a release call (something other than 0, JNI_ABORT, or JNI_COMMIT).
- Type safety: returning an incompatible type from your native method (returning a StringBuilder from a method declared to return a String, say).
- UTF-8: passing an invalid Modified UTF-8 byte sequence to a JNI call.
(Accessibility of methods and fields is still not checked: access restrictions don't apply to native code.)
There are several ways to enable CheckJNI.
If you’re using the emulator, CheckJNI is on by default.
If you have a rooted device, you can use the following sequence of commands to restart the runtime with CheckJNI enabled:
adb shell stop
adb shell setprop dalvik.vm.checkjni true
adb shell start
In either of these cases, you’ll see something like this in your logcat output when the runtime starts:
D AndroidRuntime: CheckJNI is ON
If you have a regular device, you can use the following command:
adb shell setprop debug.checkjni 1
This won’t affect already-running apps, but any app launched from that point on will have CheckJNI enabled. (Changing the property to any other value, or simply rebooting, will disable CheckJNI again.) In this case, you’ll see something like this in your logcat output the next time an app starts:
D Late-enabling CheckJNI
Native Libraries
You can load native code from shared libraries with the standard System.loadLibrary
call. The preferred way to get at your native code is:
- Call System.loadLibrary from a static class initializer. (See the earlier example, where one is used to call nativeInit.) The argument is the "undecorated" library name, so to load "libfubar.so" you would pass in "fubar".
- Provide a native function: jint JNI_OnLoad(JavaVM* vm, void* reserved)
- In JNI_OnLoad, register all of your native methods. You should declare the methods "static" so the names don't take up space in the symbol table on the device.
The JNI_OnLoad
function should look something like this if written in C++:
jint JNI_OnLoad(JavaVM* vm, void* reserved)
{
    JNIEnv* env;
    if (vm->GetEnv(reinterpret_cast<void**>(&env), JNI_VERSION_1_6) != JNI_OK) {
        return -1;
    }

    // Get jclass with env->FindClass.
    // Register methods with env->RegisterNatives.

    return JNI_VERSION_1_6;
}
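The registration step itself could look something like this sketch, where the class name "com/example/Foo" and the method table are placeholders:

static jint nativeFoo(JNIEnv* env, jobject thiz) {
    return 42;
}

static const JNINativeMethod gMethods[] = {
    // { Java method name, JNI signature, function pointer }
    { "nativeFoo", "()I", reinterpret_cast<void*>(nativeFoo) },
};

static int registerNatives(JNIEnv* env) {
    jclass clazz = env->FindClass("com/example/Foo");
    if (clazz == NULL) return JNI_ERR;
    return env->RegisterNatives(clazz, gMethods,
                                sizeof(gMethods) / sizeof(gMethods[0]));
}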
You can also call System.load
with the full path name of the shared library. For Android apps, you may find it useful to get the full path to the application's private data storage area from the context object.
This is the recommended approach, but not the only approach. Explicit registration is not required, nor is it necessary that you provide a JNI_OnLoad
function. You can instead use "discovery" of native methods that are named in a specific way (see the JNI spec for details), though this is less desirable because if a method signature is wrong you won't know about it until the first time the method is actually used.
One other note about JNI_OnLoad
: any FindClass
calls you make from there will happen in the context of the class loader that was used to load the shared library. Normally FindClass
uses the loader associated with the method at the top of the interpreted stack, or if there isn't one (because the thread was just attached) it uses the "system" class loader. This makes JNI_OnLoad
a convenient place to look up and cache class object references.
64-bit Considerations
Android is currently expected to run on 32-bit platforms. In theory it could be built for a 64-bit system, but that is not a goal at this time. For the most part this isn't something that you will need to worry about when interacting with native code, but it becomes significant if you plan to store pointers to native structures in integer fields in an object. To support architectures that use 64-bit pointers, you need to stash your native pointers in a long
field rather than an int
.
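For instance, a native struct pointer can be round-tripped through a long field like this (a sketch; the field and struct names are hypothetical):

struct NativeState { int value; };

// 'nativePtrField' is the jfieldID of a Java field declared as 'long mNativePtr'.
void storeHandle(JNIEnv* env, jobject obj, jfieldID nativePtrField) {
    NativeState* state = new NativeState{0};
    env->SetLongField(obj, nativePtrField, reinterpret_cast<jlong>(state));
}

NativeState* loadHandle(JNIEnv* env, jobject obj, jfieldID nativePtrField) {
    return reinterpret_cast<NativeState*>(env->GetLongField(obj, nativePtrField));
}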
Unsupported Features/Backwards Compatibility
All JNI 1.6 features are supported, with the following exception:
DefineClass
is not implemented. Android does not use Java bytecodes or class files, so passing in binary class data doesn't work.
For backward compatibility with older Android releases, you may need to be aware of:
- Dynamic lookup of native functions
Until Android 2.0 (Eclair), the '$' character was not properly converted to "_00024" during searches for method names. Working around this requires using explicit registration or moving the native methods out of inner classes.
- Detaching threads
Until Android 2.0 (Eclair), it was not possible to use a
pthread_key_create
destructor function to avoid the "thread must be detached before exit" check. (The runtime also uses a pthread key destructor function, so it'd be a race to see which gets called first.) - Weak global references
Until Android 2.2 (Froyo), weak global references were not implemented. Older versions will vigorously reject attempts to use them. You can use the Android platform version constants to test for support.
Until Android 4.0 (Ice Cream Sandwich), weak global references could only be passed to NewLocalRef, NewGlobalRef, and DeleteWeakGlobalRef. (The spec strongly encourages programmers to create hard references to weak globals before doing anything with them, so this should not be at all limiting.) From Android 4.0 (Ice Cream Sandwich) on, weak global references can be used like any other JNI references.
- Local references
Until Android 4.0 (Ice Cream Sandwich), local references were actually direct pointers. Ice Cream Sandwich added the indirection necessary to support better garbage collectors, but this means that lots of JNI bugs are undetectable on older releases. See JNI Local Reference Changes in ICS for more details.
- Determining reference type with
GetObjectRefType
Until Android 4.0 (Ice Cream Sandwich), as a consequence of the use of direct pointers (see above), it was impossible to implement GetObjectRefType correctly. Instead we used a heuristic that looked through the weak globals table, the arguments, the locals table, and the globals table in that order. The first time it found your direct pointer, it would report that your reference was of the type it happened to be examining. This meant, for example, that if you called GetObjectRefType on a global jclass that happened to be the same as the jclass passed as an implicit argument to your static native method, you'd get JNILocalRefType rather than JNIGlobalRefType.
FAQ: Why do I get UnsatisfiedLinkError
?
When working on native code it's not uncommon to see a failure like this:
java.lang.UnsatisfiedLinkError: Library foo not found
In some cases it means what it says — the library wasn't found. In other cases the library exists but couldn't be opened by dlopen(3)
, and the details of the failure can be found in the exception's detail message.
Common reasons why you might encounter "library not found" exceptions:
- The library doesn't exist or isn't accessible to the app. Use adb shell ls -l to check its presence and permissions.
- The library wasn't built with the NDK. This can result in dependencies on functions or libraries that don't exist on the device.
Another class of UnsatisfiedLinkError
failures looks like:
java.lang.UnsatisfiedLinkError: myfunc
at Foo.myfunc(NativeMethod)
at Foo.main(Foo.java:10)
In logcat, you'll see:
W/dalvikvm( 880): No implementation found for native LFoo;.myfunc ()V
This means that the runtime tried to find a matching method but was unsuccessful. Some common reasons for this are:
- The library isn't getting loaded. Check the logcat output for messages about library loading.
- The method isn't being found due to a name or signature mismatch. This is commonly caused by:
  - For lazy method lookup, failing to declare C++ functions with extern "C" and appropriate visibility (JNIEXPORT). Note that prior to Ice Cream Sandwich, the JNIEXPORT macro was incorrect, so using a new GCC with an old jni.h won't work. You can use arm-eabi-nm to see the symbols as they appear in the library; if they look mangled (something like _Z15Java_Foo_myfuncP7_JNIEnvP7_jclass rather than Java_Foo_myfunc), or if the symbol type is a lowercase 't' rather than an uppercase 'T', then you need to adjust the declaration.
  - For explicit registration, minor errors when entering the method signature. Make sure that what you're passing to the registration call matches the signature in the log file. Remember that 'B' is byte and 'Z' is boolean. Class name components in signatures start with 'L', end with ';', use '/' to separate package/class names, and use '$' to separate inner-class names (Ljava/util/Map$Entry;, say).
Using javah
to automatically generate JNI headers may help avoid some problems.
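For the lazy-lookup case, a correctly exported function for the hypothetical Foo.myfunc() above would be declared roughly like this:

#include <jni.h>

extern "C" JNIEXPORT void JNICALL
Java_Foo_myfunc(JNIEnv* env, jobject thiz /* use jclass if the Java method is static */) {
    // With extern "C" and JNIEXPORT, arm-eabi-nm should show an unmangled,
    // uppercase-'T' symbol named Java_Foo_myfunc.
}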
FAQ: Why didn't FindClass
find my class?
Make sure that the class name string has the correct format. JNI class names start with the package name and are separated with slashes, such as java/lang/String
. If you're looking up an array class, you need to start with the appropriate number of square brackets and must also wrap the class with 'L' and ';', so a one-dimensional array of String
would be [Ljava/lang/String;
.
If the class name looks right, you could be running into a class loader issue. FindClass
wants to start the class search in the class loader associated with your code. It examines the call stack, which will look something like:
Foo.myfunc(NativeMethod)
Foo.main(Foo.java:10)
dalvik.system.NativeStart.main(NativeMethod)
The topmost method is Foo.myfunc
. FindClass
finds the ClassLoader
object associated with the Foo
class and uses that.
This usually does what you want. You can get into trouble if you create a thread yourself (perhaps by calling pthread_create
and then attaching it with AttachCurrentThread
). Now the stack trace looks like this:
dalvik.system.NativeStart.run(NativeMethod)
The topmost method is NativeStart.run
, which isn't part of your application. If you call FindClass
from this thread, the JavaVM will start in the "system" class loader instead of the one associated with your application, so attempts to find app-specific classes will fail.
There are a few ways to work around this:
- Do your FindClass lookups once, in JNI_OnLoad, and cache the class references for later use. Any FindClass calls made as part of executing JNI_OnLoad will use the class loader associated with the function that called System.loadLibrary (this is a special rule, provided to make library initialization more convenient). If your app code is loading the library, FindClass will use the correct class loader.
- Pass an instance of the class into the functions that need it, by declaring your native method to take a Class argument and then passing Foo.class in.
- Cache a reference to the ClassLoader object somewhere handy, and issue loadClass calls directly. This requires some effort.
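A sketch of that last approach, caching the app's ClassLoader during JNI_OnLoad and looking classes up through it later (the class names here are placeholders; note that loadClass takes dot-separated names such as "com.example.Bar"):

static jobject gClassLoader;        // global ref to the app's ClassLoader
static jmethodID gLoadClassMethod;  // ClassLoader.loadClass(String)

void cacheClassLoader(JNIEnv* env) {  // call this from JNI_OnLoad
    jclass appClass = env->FindClass("com/example/Foo");  // any class from your app
    jclass classClass = env->GetObjectClass(appClass);    // java.lang.Class
    jmethodID getLoader = env->GetMethodID(classClass, "getClassLoader",
                                           "()Ljava/lang/ClassLoader;");
    jobject loader = env->CallObjectMethod(appClass, getLoader);
    gClassLoader = env->NewGlobalRef(loader);
    jclass loaderClass = env->FindClass("java/lang/ClassLoader");
    gLoadClassMethod = env->GetMethodID(loaderClass, "loadClass",
                                        "(Ljava/lang/String;)Ljava/lang/Class;");
}

jclass findAppClass(JNIEnv* env, const char* dottedName) {
    jstring className = env->NewStringUTF(dottedName);
    return static_cast<jclass>(
        env->CallObjectMethod(gClassLoader, gLoadClassMethod, className));
}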
FAQ: How do I share raw data with native code?
You may find yourself in a situation where you need to access a large buffer of raw data from both managed and native code. Common examples include manipulation of bitmaps or sound samples. There are two basic approaches.
You can store the data in a byte[]
. This allows very fast access from managed code. On the native side, however, you're not guaranteed to be able to access the data without having to copy it. In some implementations, GetByteArrayElements
and GetPrimitiveArrayCritical
will return actual pointers to the raw data in the managed heap, but in others it will allocate a buffer on the native heap and copy the data over.
The alternative is to store the data in a direct byte buffer. These can be created with java.nio.ByteBuffer.allocateDirect
, or the JNI NewDirectByteBuffer
function. Unlike regular byte buffers, the storage is not allocated on the managed heap, and can always be accessed directly from native code (get the address with GetDirectBufferAddress
). Depending on how direct byte buffer access is implemented, accessing the data from managed code can be very slow.
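On the native side, wrapping and accessing such a buffer might look like this sketch (the ownership of the malloc'd storage is an assumption you would need to manage yourself):

#include <jni.h>
#include <cstdlib>

jobject wrapNativeBuffer(JNIEnv* env, size_t size) {
    void* storage = malloc(size);  // must remain valid for the ByteBuffer's lifetime
    return env->NewDirectByteBuffer(storage, static_cast<jlong>(size));
}

void useBuffer(JNIEnv* env, jobject byteBuffer) {
    void* addr = env->GetDirectBufferAddress(byteBuffer);
    jlong capacity = env->GetDirectBufferCapacity(byteBuffer);
    // ... read or write up to 'capacity' bytes at 'addr' ...
}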
The choice of which to use depends on two factors:
- Will most of the data accesses happen from code written in Java or in C/C++?
- If the data is eventually being passed to a system API, what form must it be in? (For example, if the data is eventually passed to a function that takes a byte[], doing processing in a direct
ByteBuffer
might be unwise.)
If there's no clear winner, use a direct byte buffer. Support for them is built directly into JNI, and performance should improve in future releases.
SMP Primer for Android
IN THIS DOCUMENT
- Theory
- Memory consistency models
- Processor consistency
- CPU cache behavior
- Observability
- ARM’s weak ordering
- Data memory barriers
- Store/store and load/load
- Load/store and store/load
- Barrier instructions
- Address dependencies and causal consistency
- Memory barrier summary
- Atomic operations
- Atomic essentials
- Atomic + barrier pairing
- Acquire and release
- Memory consistency models
- Practice
- What not to do in C
- C/C++ and “volatile”
- Examples
- What not to do in Java
- “synchronized” and “volatile”
- Examples
- What to do
- General advice
- Synchronization primitive guarantees
- Upcoming changes to C/C++
- What not to do in C
- Closing Notes
- Appendix
- SMP failure example
- Implementing synchronization stores
- Further reading
Android 3.0 and later platform versions are optimized to support multiprocessor architectures. This document introduces issues that can arise when writing code for symmetric multiprocessor systems in C, C++, and the Java programming language (hereafter referred to simply as “Java” for the sake of brevity). It's intended as a primer for Android app developers, not as a complete discussion on the subject. The focus is on the ARM CPU architecture.
If you’re in a hurry, you can skip the Theory section and go directly to Practice for best practices, but this is not recommended.
Introduction
SMP is an acronym for “Symmetric Multi-Processor”. It describes a design in which two or more identical CPU cores share access to main memory. Until a few years ago, all Android devices were UP (Uni-Processor).
Most — if not all — Android devices have always had multiple CPUs, but in the past only one of them was used to run applications while the others managed various bits of device hardware (for example, the radio). The CPUs may have different architectures, and the programs running on them can’t use main memory to communicate with each other.
Most Android devices sold today are built around SMP designs, making things a bit more complicated for software developers. The sorts of race conditions you might encounter in a multi-threaded program are much worse on SMP when two or more of your threads are running simultaneously on different cores. What’s more, SMP on ARM is more challenging to work with than SMP on x86. Code that has been thoroughly tested on x86 may break badly on ARM.
The rest of this document will explain why, and tell you what you need to do to ensure that your code behaves correctly.
Theory
This is a high-speed, glossy overview of a complex subject. Some areas will be incomplete, but none of it should be misleading or wrong.
See Further reading at the end of the document for pointers to more thorough treatments of the subject.
Memory consistency models
Memory consistency models, or often just “memory models”, describe the guarantees the hardware architecture makes about memory accesses. For example, if you write a value to address A, and then write a value to address B, the model might guarantee that every CPU core sees those writes happen in that order.
The model most programmers are accustomed to is sequential consistency, which is described like this (Adve & Gharachorloo):
- All memory operations appear to execute one at a time
- All operations on a single processor appear to execute in the order described by that processor's program.
If you look at a bit of code and see that it does some reads and writes from memory, on a sequentially-consistent CPU architecture you know that the code will do those reads and writes in the expected order. It’s possible that the CPU is actually reordering instructions and delaying reads and writes, but there is no way for code running on the device to tell that the CPU is doing anything other than execute instructions in a straightforward manner. (We’re ignoring memory-mapped device driver I/O for the moment.)
To illustrate these points it’s useful to consider small snippets of code, commonly referred to as litmus tests. These are assumed to execute in program order, that is, the order in which the instructions appear here is the order in which the CPU will execute them. We don’t want to consider instruction reordering performed by compilers just yet.
Here’s a simple example, with code running on two threads:
Thread 1 | Thread 2 |
---|---|
A = 3 | reg0 = B |
B = 5 | reg1 = A |
In this and all future litmus examples, memory locations are represented by capital letters (A, B, C) and CPU registers start with “reg”. All memory is initially zero. Instructions are executed from top to bottom. Here, thread 1 stores the value 3 at location A, and then the value 5 at location B. Thread 2 loads the value from location B into reg0, and then loads the value from location A into reg1. (Note that we’re writing in one order and reading in another.)
Thread 1 and thread 2 are assumed to execute on different CPU cores. You should always make this assumption when thinking about multi-threaded code.
Sequential consistency guarantees that, after both threads have finished executing, the registers will be in one of the following states:
Registers | States |
---|---|
reg0=5, reg1=3 | possible (thread 1 ran first) |
reg0=0, reg1=0 | possible (thread 2 ran first) |
reg0=0, reg1=3 | possible (concurrent execution) |
reg0=5, reg1=0 | never |
To get into a situation where we see B=5 before we see the store to A, either the reads or the writes would have to happen out of order. On a sequentially-consistent machine, that can’t happen.
Most uni-processors, including x86 and ARM, are sequentially consistent. Most SMP systems, including x86 and ARM, are not.
Processor consistency
x86 SMP provides processor consistency, which is slightly weaker than sequential. While the architecture guarantees that loads are not reordered with respect to other loads, and stores are not reordered with respect to other stores, it does not guarantee that a store followed by a load will be observed in the expected order.
Consider the following example, which is a piece of Dekker’s Algorithm for mutual exclusion:
Thread 1 | Thread 2 |
---|---|
A = true | B = true |
reg1 = B | reg2 = A |
if (reg1 == false) critical-section | if (reg2 == false) critical-section |
The idea is that thread 1 uses A to indicate that it’s busy, and thread 2 uses B. Thread 1 sets A and then checks to see if B is set; if not, it can safely assume that it has exclusive access to the critical section. Thread 2 does something similar. (If a thread discovers that both A and B are set, a turn-taking algorithm is used to ensure fairness.)
On a sequentially-consistent machine, this works correctly. On x86 and ARM SMP, the store to A and the load from B in thread 1 can be “observed” in a different order by thread 2. If that happened, we could actually appear to execute this sequence (where blank lines have been inserted to highlight the apparent order of operations):
Thread 1 | Thread 2 |
---|---|
reg1 = B | |
 | B = true |
 | reg2 = A |
A = true | |
This results in both reg1 and reg2 set to “false”, allowing the threads to execute code in the critical section simultaneously. To understand how this can happen, it’s useful to know a little about CPU caches.
CPU cache behavior
This is a substantial topic in and of itself. An extremely brief overview follows. (The motivation for this material is to provide some basis for understanding why SMP systems behave as they do.)
Modern CPUs have one or more caches between the processor and main memory. These are labeled L1, L2, and so on, with the higher numbers being successively “farther” from the CPU. Cache memory adds size and cost to the hardware, and increases power consumption, so the ARM CPUs used in Android devices typically have small L1 caches and little or no L2/L3.
Loading or storing a value into the L1 cache is very fast. Doing the same to main memory can be 10-100x slower. The CPU will therefore try to operate out of the cache as much as possible. The write policy of a cache determines when data written to it is forwarded to main memory. A write-through cache will initiate a write to memory immediately, while a write-back cache will wait until it runs out of space and has to evict some entries. In either case, the CPU will continue executing instructions past the one that did the store, possibly executing dozens of them before the write is visible in main memory. (While the write-through cache has a policy of immediately forwarding the data to main memory, it only initiates the write. It does not have to wait for it to finish.)
The cache behavior becomes relevant to this discussion when each CPU core has its own private cache. In a simple model, the caches have no way to interact with each other directly. The values held by core #1’s cache are not shared with or visible to core #2’s cache except as loads or stores from main memory. The long latencies on memory accesses would make inter-thread interactions sluggish, so it’s useful to define a way for the caches to share data. This sharing is called cache coherency, and the coherency rules are defined by the CPU architecture’s cache consistency model.
With that in mind, let’s return to the Dekker example. When core 1 executes “A = 1”, the value gets stored in core 1’s cache. When core 2 executes “if (A == 0)”, it might read from main memory or it might read from core 2’s cache; either way it won’t see the store performed by core 1. (“A” could be in core 2’s cache because of a previous load from “A”.)
For the memory consistency model to be sequentially consistent, core 1 would have to wait for all other cores to be aware of “A = 1” before it could execute “if (B == 0)” (either through strict cache coherency rules, or by disabling the caches entirely so everything operates out of main memory). This would impose a performance penalty on every store operation. Relaxing the rules for the ordering of stores followed by loads improves performance but imposes a burden on software developers.
The other guarantees made by the processor consistency model are less expensive to make. For example, to ensure that memory writes are not observed out of order, it just needs to ensure that the stores are published to other cores in the same order that they were issued. It doesn’t need to wait for store #1 to finish being published before it can start on store #2, it just needs to ensure that it doesn’t finish publishing #2 before it finishes publishing #1. This avoids a performance bubble.
Relaxing the guarantees even further can provide additional opportunities for CPU optimization, but creates more opportunities for code to behave in ways the programmer didn’t expect.
One additional note: CPU caches don’t operate on individual bytes. Data is read or written as cache lines; for many ARM CPUs these are 32 bytes. If you read data from a location in main memory, you will also be reading some adjacent values. Writing data will cause the cache line to be read from memory and updated. As a result, you can cause a value to be loaded into cache as a side-effect of reading or writing something nearby, adding to the general aura of mystery.
Observability
Before going further, it’s useful to define in a more rigorous fashion what is meant by “observing” a load or store. Suppose core 1 executes “A = 1”. The store is initiated when the CPU executes the instruction. At some point later, possibly through cache coherence activity, the store is observed by core 2. In a write-through cache it doesn’t really complete until the store arrives in main memory, but the memory consistency model doesn’t dictate when something completes, just when it can be observed.
(In a kernel device driver that accesses memory-mapped I/O locations, it may be very important to know when things actually complete. We’re not going to go into that here.)
Observability may be defined as follows:
- "A write to a location in memory is said to be observed by an observer Pn when a subsequent read of the location by Pn would return the value written by the write."
- "A read of a location in memory is said to be observed by an observer Pm when a subsequent write to the location by Pm would have no effect on the value returned by the read." (Reasoning about the ARM weakly consistent memory model)
A less formal way to describe it (where “you” and “I” are CPU cores) would be:
- I have observed your write when I can read what you wrote
- I have observed your read when I can no longer affect the value you read
The notion of observing a write is intuitive; observing a read is a bit less so (don’t worry, it grows on you).
With this in mind, we’re ready to talk about ARM.
ARM's weak ordering
ARM SMP provides weak memory consistency guarantees. It does not guarantee that loads or stores are ordered with respect to each other.
Thread 1 | Thread 2 |
---|---|
A = 41 | loop_until (B == 1) |
B = 1 | reg = A |
Recall that all addresses are initially zero. The “loop_until” instruction reads B repeatedly, looping until we read 1 from B. The idea here is that thread 2 is waiting for thread 1 to update A. Thread 1 sets A, and then sets B to 1 to indicate data availability.
On x86 SMP, this is guaranteed to work. Thread 2 will observe the stores made by thread 1 in program order, and thread 1 will observe thread 2’s loads in program order.
On ARM SMP, the loads and stores can be observed in any order. It is possible, after all the code has executed, for reg to hold 0. It’s also possible for it to hold 41. Unless you explicitly define the ordering, you don’t know how this will come out.
(For those with experience on other systems, ARM’s memory model is equivalent to PowerPC in most respects.)
Data memory barriers
Memory barriers provide a way for your code to tell the CPU that memory access ordering matters. ARM/x86 uniprocessors offer sequential consistency, and thus have no need for them. (The barrier instructions can be executed but aren’t useful; in at least one case they’re hideously expensive, motivating separate builds for SMP targets.)
There are four basic situations to consider:
- store followed by another store
- load followed by another load
- load followed by store
- store followed by load
Store/store and load/load
Recall our earlier example:
Thread 1 | Thread 2 |
---|---|
A = 41 | loop_until (B == 1) |
B = 1 | reg = A |
Thread 1 needs to ensure that the store to A happens before the store to B. This is a “store/store” situation. Similarly, thread 2 needs to ensure that the load of B happens before the load of A; this is a load/load situation. As mentioned earlier, the loads and stores can be observed in any order.
Going back to the cache discussion, assume A and B are on separate cache lines, with minimal cache coherency. If the store to A stays local but the store to B is published, core 2 will see B=1 but won’t see the update to A. On the other side, assume we read A earlier, or it lives on the same cache line as something else we recently read. Core 2 spins until it sees the update to B, then loads A from its local cache, where the value is still zero.
We can fix it like this:
Thread 1 | Thread 2 |
---|---|
A = 41 | loop_until (B == 1) |
store/store barrier | load/load barrier |
B = 1 | reg = A |
The store/store barrier guarantees that all observers will observe the write to A before they observe the write to B. It makes no guarantees about the ordering of loads in thread 1, but we don’t have any of those, so that’s okay. The load/load barrier in thread 2 makes a similar guarantee for the loads there.
Since the store/store barrier guarantees that thread 2 observes the stores in program order, why do we need the load/load barrier in thread 2? Because we also need to guarantee that thread 1 observes the loads in program order.
The store/store barrier could work by flushing all dirty entries out of the local cache, ensuring that other cores see them before they see any future stores. The load/load barrier could purge the local cache completely and wait for any “in-flight” loads to finish, ensuring that future loads are observed after previous loads. What the CPU actually does doesn’t matter, so long as the appropriate guarantees are kept. If we use a barrier in core 1 but not in core 2, core 2 could still be reading A from its local cache.
Because the architectures have different memory models, these barriers are required on ARM SMP but not x86 SMP.
Load/store and store/load
The Dekker’s Algorithm fragment shown earlier illustrated the need for a store/load barrier. Here’s an example where a load/store barrier is required:
Thread 1 | Thread 2 |
---|---|
reg = A | loop_until (B == 1) |
B = 1 | A = 41 |
Thread 2 could observe thread 1’s store of B=1 before it observes thread 1’s load from A, and as a result store A=41 before thread 1 has a chance to read A. Inserting a load/store barrier in each thread solves the problem:
Thread 1 | Thread 2 |
---|---|
reg = A | loop_until (B == 1) |
load/store barrier | load/store barrier |
B = 1 | A = 41 |
A store to local cache may be observed before a load from main memory, because accesses to main memory are so much slower. In this case, assume core 1’s cache has the cache line for B but not A. The load from A is initiated, and while that’s in progress execution continues. The store to B happens in local cache, and by some means becomes available to core 2 while the load from A is still in progress. Thread 2 is able to exit the loop before it has observed thread 1’s load from A.
A thornier question is: do we need a barrier in thread 2? If the CPU doesn’t perform speculative writes, and doesn’t execute instructions out of order, can thread 2 store to A before thread 1’s read if thread 1 guarantees the load/store ordering? (Answer: no.) What if there’s a third core watching A and B? (Answer: now you need one, or you could observe B==0 / A==41 on the third core.) It’s safest to insert barriers in both places and not worry about the details.
As mentioned earlier, store/load barriers are the only kind required on x86 SMP.
Barrier instructions
Different CPUs provide different flavors of barrier instruction. For example:
- Sparc V8 has a “membar” instruction that takes a 4-element bit vector. The four categories of barrier can be specified individually.
- Alpha provides “rmb” (load/load), “wmb” (store/store), and “mb” (full). (Trivia: the linux kernel provides three memory barrier functions with these names and behaviors.)
- x86 has a variety of options; “mfence” (introduced with SSE2) provides a full barrier.
- ARMv7 has “dmb st” (store/store) and “dmb sy” (full).
“Full barrier” means all four categories are included.
It is important to recognize that the only thing guaranteed by barrier instructions is ordering. Do not treat them as cache coherency “sync points” or synchronous “flush” instructions. The ARM “dmb” instruction has no direct effect on other cores. This is important to understand when trying to figure out where barrier instructions need to be issued.
Address dependencies and causal consistency
(This is a slightly more advanced topic and can be skipped.)
The ARM CPU provides one special case where a load/load barrier can be avoided. Consider the following example from earlier, modified slightly:
Thread 1 | Thread 2 |
---|---|
[A+8] = 41 | loop: |
B = 1 | reg0 = B |
 | if (reg0 == 0) goto loop |
 | reg1 = 8 |
 | reg2 = [A+reg1] |
This introduces a new notation. If “A” refers to a memory address, “A+n” refers to a memory address offset by n bytes from A. If A is the base address of an object or array, [A+8] could be a field in the object or an element in the array.
The “loop_until” seen in previous examples has been expanded to show the load of B into reg0. reg1 is assigned the numeric value 8, and reg2 is loaded from the address [A+reg1] (the same location that thread 1 is accessing).
This will not behave correctly because the load from B could be observed after the load from [A+reg1]. We can fix this with a load/load barrier after the loop, but on ARM we can also just do this:
Thread 1 | Thread 2 |
---|---|
[A+8] = 41 | loop: |
B = 1 | reg0 = B |
 | if (reg0 == 0) goto loop |
 | reg1 = 8 + (reg0 & 0) |
 | reg2 = [A+reg1] |
What we’ve done here is change the assignment of reg1 from a constant (8) to a value that depends on what we loaded from B. In this case, we do a bitwise AND of the value with 0, which yields zero, which means reg1 still has the value 8. However, the ARM CPU believes that the load from [A+reg1] depends upon the load from B, and will ensure that the two are observed in program order.
This is called an address dependency. Address dependencies exist when the value returned by a load is used to compute the address of a subsequent load or store. It can let you avoid the need for an explicit barrier in certain situations.
ARM does not provide control dependency guarantees. To illustrate this it’s necessary to dip into ARM code for a moment: (Barrier Litmus Tests and Cookbook).
LDR r1,[r0]
CMP r1,#55
LDRNE r2,[r3]
The loads from r0 and r3 may be observed out of order, even though the load from r3 will not execute at all if [r0] doesn’t hold 55. Inserting AND r1, r1, #0 and replacing the last instruction with LDRNE r2, [r3, r1] would ensure proper ordering without an explicit barrier. (This is a prime example of why you can’t think about consistency issues in terms of instruction execution. Always think in terms of memory accesses.)
While we’re hip-deep, it’s worth noting that ARM does not provide causal consistency:
Thread 1 | Thread 2 | Thread 3 |
---|---|---|
A = 1 | loop_until (A == 1) | loop: |
 | B = 1 | reg0 = B |
 | | if (reg0 == 0) goto loop |
 | | reg1 = reg0 & 0 |
 | | reg2 = [A+reg1] |
Here, thread 1 sets A, signaling thread 2. Thread 2 sees that and sets B to signal thread 3. Thread 3 sees it and loads from A, using an address dependency to ensure that the load of B and the load of A are observed in program order.
It’s possible for reg2 to hold zero at the end of this. The fact that a store in thread 1 causes something to happen in thread 2 which causes something to happen in thread 3 does not mean that thread 3 will observe the stores in that order. (Inserting a load/store barrier in thread 2 fixes this.)
Memory barrier summary
Barriers come in different flavors for different situations. While there can be performance advantages to using exactly the right barrier type, there are code maintenance risks in doing so — unless the person updating the code fully understands it, they might introduce the wrong type of operation and cause a mysterious breakage. Because of this, and because ARM doesn’t provide a wide variety of barrier choices, many atomic primitives use full barrier instructions when a barrier is required.
The key thing to remember about barriers is that they define ordering. Don’t think of them as a “flush” call that causes a series of actions to happen. Instead, think of them as a dividing line in time for operations on the current CPU core.
Atomic operations
Atomic operations guarantee that an operation that requires a series of steps always behaves as if it were a single operation. For example, consider a non-atomic increment (“++A”) executed on the same variable by two threads simultaneously:
Thread 1 | Thread 2 |
---|---|
reg = A | reg = A |
reg = reg + 1 | reg = reg + 1 |
A = reg | A = reg |
If the threads execute concurrently from top to bottom, both threads will load 0 from A, increment it to 1, and store it back, leaving a final result of 1. If we used an atomic increment operation, you would be guaranteed that the final result will be 2.
Atomic essentials
The most fundamental operations — loading and storing 32-bit values — are inherently atomic on ARM so long as the data is aligned on a 32-bit boundary. For example:
Thread 1 | Thread 2 |
---|---|
reg = 0x00000000 | reg = 0xffffffff |
A = reg | A = reg |
The CPU guarantees that A will hold 0x00000000 or 0xffffffff. It will never hold 0x0000ffff or any other partial “mix” of bytes.
The atomicity guarantee is lost if the data isn’t aligned. Misaligned data could straddle a cache line, so other cores could see the halves update independently. Consequently, the ARMv7 documentation declares that it provides “single-copy atomicity” for all byte accesses, halfword accesses to halfword-aligned locations, and word accesses to word-aligned locations. Doubleword (64-bit) accesses are not atomic, unless the location is doubleword-aligned and special load/store instructions are used. This behavior is important to understand when multiple threads are performing unsynchronized updates to packed structures or arrays of primitive types.
There is no need for 32-bit “atomic read” or “atomic write” functions on ARM or x86. Where one is provided for completeness, it just does a trivial load or store.
Operations that perform more complex actions on data in memory are collectively known as read-modify-write (RMW) instructions, because they load data, modify it in some way, and write it back. CPUs vary widely in how these are implemented. ARM uses a technique called “Load Linked / Store Conditional”, or LL/SC.
A linked or locked load reads the data from memory as usual, but also establishes a reservation, tagging the physical memory address. The reservation is cleared when another core tries to write to that address. To perform an LL/SC, the data is read with a reservation, modified, and then a conditional store instruction is used to try to write the data back. If the reservation is still in place, the store succeeds; if not, the store will fail. Atomic functions based on LL/SC usually loop, retrying the entire read-modify-write sequence until it completes without interruption.
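The retry loop usually looks something like the following sketch, shown here with the GCC __sync compare-and-swap builtin standing in for the underlying LDREX/STREX pair (note that the builtin also implies a full barrier, which the bare LL/SC sequence described here does not):

int atomic_increment(volatile int* addr) {
    for (;;) {
        int old_value = *addr;            // conceptually the "load linked" step
        int new_value = old_value + 1;
        // Succeeds only if *addr still holds old_value ("store conditional").
        if (__sync_bool_compare_and_swap(addr, old_value, new_value)) {
            return new_value;
        }
        // Another core wrote to *addr in the meantime; retry the whole sequence.
    }
}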
It’s worth noting that the read-modify-write operations would not work correctly if they operated on stale data. If two cores perform an atomic increment on the same address, and one of them is not able to see what the other did because each core is reading and writing from local cache, the operation won’t actually be atomic. The CPU’s cache coherency rules ensure that the atomic RMW operations remain atomic in an SMP environment.
This should not be construed to mean that atomic RMW operations use a memory barrier. On ARM, atomics have no memory barrier semantics. While a series of atomic RMW operations on a single address will be observed in program order by other cores, there are no guarantees when it comes to the ordering of atomic and non-atomic operations.
It often makes sense to pair barriers and atomic operations together. The next section describes this in more detail.
Atomic + barrier pairing
As usual, it’s useful to illuminate the discussion with an example. We’re going to consider a basic mutual-exclusion primitive called a spin lock. The idea is that a memory address (which we’ll call “lock”) initially holds zero. When a thread wants to execute code in the critical section, it sets the lock to 1, executes the critical code, and then changes it back to zero when done. If another thread has already set the lock to 1, we sit and spin until the lock changes back to zero.
To make this work we use an atomic RMW primitive called compare-and-swap. The function takes three arguments: the memory address, the expected current value, and the new value. If the value currently in memory matches what we expect, it is replaced with the new value, and the old value is returned. If the current value is not what we expect, we don’t change anything. A minor variation on this is called compare-and-set; instead of returning the old value it returns a boolean indicating whether the swap succeeded. For our needs either will work, but compare-and-set is slightly simpler for examples, so we use it and just refer to it as “CAS”.
The acquisition of the spin lock is written like this (using a C-like language):
do {
    success = atomic_cas(&lock, 0, 1)
} while (!success)
full_memory_barrier()
critical-section
If no thread holds the lock, the lock value will be 0, and the CAS operation will set it to 1 to indicate that we now have it. If another thread has it, the lock value will be 1, and the CAS operation will fail because the expected current value does not match the actual current value. We loop and retry. (Note this loop is on top of whatever loop the LL/SC code might be doing inside the atomic_cas function.)
On SMP, a spin lock is a useful way to guard a small critical section. If we know that another thread is going to execute a handful of instructions and then release the lock, we can just burn a few cycles while we wait our turn. However, if the other thread happens to be executing on the same core, we’re just wasting time because the other thread can’t make progress until the OS schedules it again (either by migrating it to a different core or by preempting us). A proper spin lock implementation would optimistically spin a few times and then fall back on an OS primitive (such as a Linux futex) that allows the current thread to sleep while waiting for the other thread to finish up. On a uniprocessor you never want to spin at all. For the sake of brevity we’re ignoring all this.
The memory barrier is necessary to ensure that other threads observe the acquisition of the lock before they observe any loads or stores in the critical section. Without that barrier, the memory accesses could be observed while the lock is not held.
The full_memory_barrier call here actually does two independent operations. First, it issues the CPU’s full barrier instruction. Second, it tells the compiler that it is not allowed to reorder code around the barrier. That way, we know that the atomic_cas call will be executed before anything in the critical section. Without this compiler reorder barrier, the compiler has a great deal of freedom in how it generates code, and the order of instructions in the compiled code might be much different from the order in the source code.
Of course, we also want to make sure that none of the memory accesses performed in the critical section are observed after the lock is released. The full version of the simple spin lock is:
do {
    success = atomic_cas(&lock, 0, 1)   // acquire
} while (!success)
full_memory_barrier()
critical-section
full_memory_barrier()
atomic_store(&lock, 0)                  // release
We perform our second CPU/compiler memory barrier immediately before we release the lock, so that loads and stores in the critical section are observed before the release of the lock.
As mentioned earlier, the atomic_store operation is a simple assignment on ARM and x86. Unlike the atomic RMW operations, we don’t guarantee that other threads will see this value immediately. This isn’t a problem, though, because we only need to keep the other threads out. The other threads will stay out until they observe the store of 0. If it takes a little while for them to observe it, the other threads will spin a little longer, but we will still execute code correctly.
It’s convenient to combine the atomic operation and the barrier call into a single function. It also provides other advantages, which will become clear shortly.
Acquire and release
When acquiring the spinlock, we issue the atomic CAS and then the barrier. When releasing the spinlock, we issue the barrier and then the atomic store. This inspires a particular naming convention: operations followed by a barrier are “acquiring” operations, while operations preceded by a barrier are “releasing” operations. (It would be wise to install the spin lock example firmly in mind, as the names are not otherwise intuitive.)
Rewriting the spin lock example with this in mind:
do {
    success = atomic_acquire_cas(&lock, 0, 1)
} while (!success)
critical-section
atomic_release_store(&lock, 0)
This is a little more succinct and easier to read, but the real motivation for doing this lies in a couple of optimizations we can now perform.
First, consider atomic_release_store. We need to ensure that the store of zero to the lock word is observed after any loads or stores in the critical section above it. In other words, we need a load/store and store/store barrier. In an earlier section we learned that these aren’t necessary on x86 SMP -- only store/load barriers are required. The implementation of atomic_release_store on x86 is therefore just a compiler reorder barrier followed by a simple store. No CPU barrier is required.
The second optimization mostly applies to the compiler (although some CPUs, such as the Itanium, can take advantage of it as well). The basic principle is that code can move across acquire and release barriers, but only in one direction.
Suppose we have a mix of locally-visible and globally-visible memory accesses, with some miscellaneous computation as well:
local1 = arg1 / 41
local2 = threadStruct->field2
threadStruct->field3 = local2
do {
    success = atomic_acquire_cas(&lock, 0, 1)
} while (!success)
local5 = globalStruct->field5
globalStruct->field6 = local5
atomic_release_store(&lock, 0)
Here we see two completely independent sets of operations. The first set operates on a thread-local data structure, so we’re not concerned about clashes with other threads. The second set operates on a global data structure, which must be protected with a lock.
A full compiler reorder barrier in the atomic ops will ensure that the program order matches the source code order at the lock boundaries. However, allowing the compiler to interleave instructions can improve performance. Loads from memory can be slow, but the CPU can continue to execute instructions that don’t require the result of that load while waiting for it to complete. The code might execute more quickly if it were written like this instead:
do {
    success = atomic_acquire_cas(&lock, 0, 1)
} while (!success)
local2 = threadStruct->field2
local5 = globalStruct->field5
local1 = arg1 / 41
threadStruct->field3 = local2
globalStruct->field6 = local5
atomic_release_store(&lock, 0)
We issue both loads, do some unrelated computation, and then execute the instructions that make use of the loads. If the integer division takes less time than one of the loads, we essentially get it for free, since it happens during a period where the CPU would have stalled waiting for a load to complete.
Note that all of the operations are now happening inside the critical section. Since none of the “threadStruct” operations are visible outside the current thread, nothing else can see them until we’re finished here, so it doesn’t matter exactly when they happen.
In general, it is always safe to move operations into a critical section, but never safe to move operations out of a critical section. Put another way, you can migrate code “downward” across an acquire barrier, and “upward” across a release barrier. If the atomic ops used a full barrier, this sort of migration would not be possible.
Returning to an earlier point, we can state that on x86 all loads are acquiring loads, and all stores are releasing stores. As a result:
- Loads may not be reordered with respect to each other. You can’t take a load and move it “upward” across another load’s acquire barrier.
- Stores may not be reordered with respect to each other, because you can’t move a store “downward” across another store’s release barrier.
- A load followed by a store can’t be reordered, because neither instruction will tolerate it.
- A store followed by a load can be reordered, because each instruction can move across the other in that direction.
Hence, you only need store/load barriers on x86 SMP.
Labeling atomic operations with “acquire” or “release” describes not only whether the barrier is executed before or after the atomic operation, but also how the compiler is allowed to reorder code.
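To connect this naming convention to real syntax, here is a hedged C++11 rendering of the acquire/release spin lock above, using std::atomic (the lock_word name is invented for this sketch; a production lock would fall back to a futex or std::mutex rather than spin forever):

#include <atomic>

std::atomic<int> lock_word(0);

void spin_lock() {
    int expected = 0;
    // Acquiring CAS: later loads and stores may not be observed before it.
    while (!lock_word.compare_exchange_weak(expected, 1,
                                            std::memory_order_acquire)) {
        expected = 0;   // compare_exchange overwrites 'expected' on failure
    }
}

void spin_unlock() {
    // Releasing store: earlier loads and stores may not be observed after it.
    lock_word.store(0, std::memory_order_release);
}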
Practice
Debugging memory consistency problems can be very difficult. If a missing memory barrier causes some code to read stale data, you may not be able to figure out why by examining memory dumps with a debugger. By the time you can issue a debugger query, the CPU cores will have all observed the full set of accesses, and the contents of memory and the CPU registers will appear to be in an “impossible” state.
What not to do in C
Here we present some examples of incorrect code, along with simple ways to fix them. Before we do that, we need to discuss the use of a basic language feature.
C/C++ and "volatile"
When writing single-threaded code, declaring a variable “volatile” can be very useful. The compiler will not omit or reorder accesses to volatile locations. Combine that with the sequential consistency provided by the hardware, and you’re guaranteed that the loads and stores will appear to happen in the expected order.
However, accesses to volatile storage may be reordered with non-volatile accesses, so you have to be careful in multi-threaded uniprocessor environments (explicit compiler reorder barriers may be required). There are no atomicity guarantees, and no memory barrier provisions, so “volatile” doesn’t help you at all in multi-threaded SMP environments. The C and C++ language standards are being updated to address this with built-in atomic operations.
If you think you need to declare something “volatile”, that is a strong indicator that you should be using one of the atomic operations instead.
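As a rough C++ illustration of the difference (the names are invented for this sketch): the volatile counter below still has a data race, while the std::atomic counter performs a genuine atomic RMW.

#include <atomic>

volatile int racy_counter = 0;        // no atomicity, no SMP ordering guarantees
std::atomic<int> safe_counter(0);     // atomic, with defined cross-thread ordering

void incrementFromManyThreads() {
    racy_counter++;                                        // separate load, add, store: updates can be lost
    safe_counter.fetch_add(1, std::memory_order_relaxed);  // single atomic RMW
}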
Examples
In most cases you’d be better off with a synchronization primitive (like a pthread mutex) rather than an atomic operation, but we will employ the latter to illustrate how they would be used in a practical situation.
For the sake of brevity we’re ignoring the effects of compiler optimizations here — some of this code is broken even on uniprocessors — so for all of these examples you must assume that the compiler generates straightforward code (for example, compiled with gcc -O0). The fixes presented here do solve both compiler-reordering and memory-access-ordering issues, but we’re only going to discuss the latter.
MyThing* gGlobalThing = NULL;

void initGlobalThing()    // runs in thread 1
{
    MyThing* thing = malloc(sizeof(*thing));
    memset(thing, 0, sizeof(*thing));
    thing->x = 5;
    thing->y = 10;
    /* initialization complete, publish */
    gGlobalThing = thing;
}

void useGlobalThing()    // runs in thread 2
{
    if (gGlobalThing != NULL) {
        int i = gGlobalThing->x;    // could be 5, 0, or uninitialized data
        ...
    }
}
The idea here is that we allocate a structure, initialize its fields, and at the very end we “publish” it by storing it in a global variable. At that point, any other thread can see it, but that’s fine since it’s fully initialized, right? At least, it would be on x86 SMP or a uniprocessor (again, making the erroneous assumption that the compiler outputs code exactly as we have it in the source).
Without a memory barrier, the store to gGlobalThing could be observed before the fields are initialized on ARM. Another thread reading from thing->x could see 5, 0, or even uninitialized data.
This can be fixed by changing the last assignment to:
atomic_release_store(&gGlobalThing, thing);
That ensures that all other threads will observe the writes in the proper order, but what about reads? In this case we should be okay on ARM, because the address dependency rules will ensure that any loads from an offset of gGlobalThing are observed after the load of gGlobalThing. However, it’s unwise to rely on architectural details, since it means your code will be very subtly unportable. The complete fix also requires a barrier after the load:
MyThing* thing = atomic_acquire_load(&gGlobalThing);
int i = thing->x;
Now we know the ordering will be correct. This may seem like an awkward way to write code, and it is, but that’s the price you pay for accessing data structures from multiple threads without using locks. Besides, address dependencies won’t always save us:
MyThing gGlobalThing;

void initGlobalThing()    // runs in thread 1
{
    gGlobalThing.x = 5;
    gGlobalThing.y = 10;
    /* initialization complete */
    gGlobalThing.initialized = true;
}

void useGlobalThing()    // runs in thread 2
{
    if (gGlobalThing.initialized) {
        int i = gGlobalThing.x;    // could be 5 or 0
    }
}
Because there is no relationship between the initialized field and the others, the reads and writes can be observed out of order. (Note global data is initialized to zero by the OS, so it shouldn’t be possible to read “random” uninitialized data.)
We need to replace the store with:
atomic_release_store(&gGlobalThing.initialized, true);
and replace the load with:
int initialized = atomic_acquire_load(&gGlobalThing.initialized);
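For readers using the C++11 atomics discussed later in this document, the same publish/consume pattern might look roughly like the sketch below. This is illustrative only; atomic_release_store and atomic_acquire_load above are the document's pseudocode primitives, not a real API.

#include <atomic>

struct MyThing {
    int x = 0;
    int y = 0;
    std::atomic<bool> initialized{false};
};

MyThing gGlobalThing;

void initGlobalThing() {                     // runs in thread 1
    gGlobalThing.x = 5;
    gGlobalThing.y = 10;
    // Releasing store: the writes to x and y are observed first.
    gGlobalThing.initialized.store(true, std::memory_order_release);
}

void useGlobalThing() {                      // runs in thread 2
    // Acquiring load: pairs with the releasing store above.
    if (gGlobalThing.initialized.load(std::memory_order_acquire)) {
        int i = gGlobalThing.x;              // guaranteed to see 5
        (void) i;
    }
}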
Another example of the same problem occurs when implementing reference-counted data structures. The reference count itself will be consistent so long as atomic increment and decrement operations are used, but you can still run into trouble at the edges, for example:
void RefCounted::release()
{
    int oldCount = atomic_dec(&mRefCount);
    if (oldCount == 1) {    // was decremented to zero
        recycleStorage();
    }
}
void useSharedThing(RefCountedThing sharedThing)
{
int localVar = sharedThing->x;
sharedThing->release();
sharedThing = NULL; // can’t use this pointer any more
doStuff(localVar); // value of localVar might be wrong
}
The release() call decrements the reference count using a barrier-free atomic decrement operation. Because this is an atomic RMW operation, we know that it will work correctly. If the reference count goes to zero, we recycle the storage.
The useSharedThing() function extracts what it needs from sharedThing and then releases its copy. However, because we didn’t use a memory barrier, and atomic and non-atomic operations can be reordered, it’s possible for other threads to observe the read of sharedThing->x after they observe the recycle operation. It’s therefore possible for localVar to hold a value from "recycled" memory, for example a new object created in the same location by another thread after release() is called.
This can be fixed by replacing the call to atomic_dec() with atomic_release_dec(). The barrier ensures that the reads from sharedThing are observed before we recycle the object.
In most cases the above won’t actually fail, because the “recycle” function is likely guarded by functions that themselves employ barriers (libc heap free()/delete(), or an object pool guarded by a mutex). If the recycle function used a lock-free algorithm implemented without barriers, however, the above code could fail on ARM SMP.
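A hedged C++11 sketch of the fixed reference counting follows. The class shape mirrors the pseudocode above; the acquire fence before recycling is the usual companion to the release-decrement (as in common shared-pointer implementations), though the discussion above only covers the release side.

#include <atomic>

class RefCounted {
public:
    void release() {
        // Releasing decrement: our reads of the object are observed
        // before the count is seen to reach zero.
        if (mRefCount.fetch_sub(1, std::memory_order_release) == 1) {
            // Make every other thread's final accesses to the object
            // visible before we touch the storage.
            std::atomic_thread_fence(std::memory_order_acquire);
            recycleStorage();
        }
    }

private:
    void recycleStorage() { /* return the storage to its pool, or delete it */ }
    std::atomic<int> mRefCount{1};
};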
What not to do in Java
We haven’t discussed some relevant Java language features, so we’ll take a quick look at those first.
Java's "synchronized" and "volatile" keywords
The “synchronized” keyword provides the Java language’s in-built locking mechanism. Every object has an associated “monitor” that can be used to provide mutually exclusive access.
The implementation of the “synchronized” block has the same basic structure as the spin lock example: it begins with an acquiring CAS, and ends with a releasing store. This means that compilers and code optimizers are free to migrate code into a “synchronized” block. One practical consequence: you must not conclude that code inside a synchronized block happens after the stuff above it or before the stuff below it in a function. Going further, if a method has two synchronized blocks that lock the same object, and there are no operations in the intervening code that are observable by another thread, the compiler may perform “lock coarsening” and combine them into a single block.
The other relevant keyword is “volatile”. As defined in the specification for Java 1.4 and earlier, a volatile declaration was about as weak as its C counterpart. The spec for Java 1.5 was updated to provide stronger guarantees, almost to the level of monitor synchronization.
The effects of volatile accesses can be illustrated with an example. If thread 1 writes to a volatile field, and thread 2 subsequently reads from that same field, then thread 2 is guaranteed to see that write and all writes previously made by thread 1. More generally, the writes made by any thread up to the point where it writes the field will be visible to thread 2 when it does the read. In effect, writing to a volatile is like a monitor release, and reading from a volatile is like a monitor acquire.
Non-volatile accesses may be reordered with respect to volatile accesses in the usual ways, for example the compiler could move a non-volatile load or store “above” a volatile store, but couldn’t move it “below”. Volatile accesses may not be reordered with respect to each other. The VM takes care of issuing the appropriate memory barriers.
It should be mentioned that, while loads and stores of object references and most primitive types are atomic, long and double fields are not accessed atomically unless they are marked as volatile. Multi-threaded updates to non-volatile 64-bit fields are problematic even on uniprocessors.
Examples
Here’s a simple, incorrect implementation of a monotonic counter (adapted from Java theory and practice: Managing volatility):
class Counter {
    private int mValue;

    public int get() {
        return mValue;
    }
    public void incr() {
        mValue++;
    }
}
Assume get() and incr() are called from multiple threads, and we want to be sure that every thread sees the current count when get() is called. The most glaring problem is that mValue++ is actually three operations:
reg = mValue
reg = reg + 1
mValue = reg
If two threads execute incr() simultaneously, one of the updates could be lost. To make the increment atomic, we need to declare incr() “synchronized”. With this change, the code will run correctly in multi-threaded uniprocessor environments.
It’s still broken on SMP, however. Different threads might see different results from get(), because we’re reading the value with an ordinary load. We can correct the problem by declaring get() to be synchronized. With this change, the code is obviously correct.
Unfortunately, we’ve introduced the possibility of lock contention, which could hamper performance. Instead of declaring get() to be synchronized, we could declare mValue with “volatile”. (Note incr() must still use synchronized.) Now we know that the volatile write to mValue will be visible to any subsequent volatile read of mValue. incr() will be slightly slower, but get() will be faster, so even in the absence of contention this is a win if reads outnumber writes. (See also AtomicInteger.)
Here’s another example, similar in form to the earlier C examples:
class MyGoodies {
    public int x, y;
}
class MyClass {
    static MyGoodies sGoodies;

    void initGoodies() {    // runs in thread 1
        MyGoodies goods = new MyGoodies();
        goods.x = 5;
        goods.y = 10;
        sGoodies = goods;
    }

    void useGoodies() {    // runs in thread 2
        if (sGoodies != null) {
            int i = sGoodies.x;    // could be 5 or 0
            ....
        }
    }
}
This has the same problem as the C code, namely that the assignment sGoodies = goods might be observed before the initialization of the fields in goods. If you declare sGoodies with the volatile keyword, you can think about the loads as if they were atomic_acquire_load() calls, and the stores as if they were atomic_release_store() calls.
(Note that only the sGoodies reference itself is volatile. The accesses to the fields inside it are not. The statement z = sGoodies.x will perform a volatile load of MyClass.sGoodies followed by a non-volatile load of sGoodies.x. If you make a local reference MyGoodies localGoods = sGoodies, then z = localGoods.x will not perform any volatile loads.)
A more common idiom in Java programming is the infamous “double-checked locking”:
class MyClass {
    private Helper helper = null;

    public Helper getHelper() {
        if (helper == null) {
            synchronized (this) {
                if (helper == null) {
                    helper = new Helper();
                }
            }
        }
        return helper;
    }
}
The idea is that we want to have a single instance of a Helper object associated with an instance of MyClass. We must only create it once, so we create and return it through a dedicated getHelper() function. To avoid a race in which two threads create the instance, we need to synchronize the object creation. However, we don’t want to pay the overhead for the “synchronized” block on every call, so we only do that part if helper is currently null.
This doesn’t work correctly on uniprocessor systems, unless you’re using a traditional Java source compiler and an interpreter-only VM. Once you add fancy code optimizers and JIT compilers it breaks down. See the “‘Double Checked Locking is Broken’ Declaration” link in the appendix for more details, or Item 71 (“Use lazy initialization judiciously”) in Josh Bloch’s Effective Java, 2nd Edition.
Running this on an SMP system introduces an additional way to fail. Consider the same code rewritten slightly, as if it were compiled into a C-like language (I’ve added a couple of integer fields to represent Helper’s constructor activity):
if (helper == null) {
    // acquire monitor using spinlock
    while (atomic_acquire_cas(&this.lock, 0, 1) != success)
        ;
    if (helper == null) {
        newHelper = malloc(sizeof(Helper));
        newHelper->x = 5;
        newHelper->y = 10;
        helper = newHelper;
    }
    atomic_release_store(&this.lock, 0);
}
Now the problem should be obvious: the store to helper is happening before the memory barrier, which means another thread could observe the non-null value of helper before the stores to the x/y fields.
You could try to ensure that the store to helper happens after the atomic_release_store() on this.lock by rearranging the code, but that won’t help, because it’s okay to migrate code upward — the compiler could move the assignment back above the atomic_release_store() to its original position.
There are two ways to fix this:
- Do the simple thing and delete the outer check. This ensures that we never examine the value of helper outside a synchronized block.
- Declare helper volatile. With this one small change, the code in Example J-3 will work correctly on Java 1.5 and later. (You may want to take a minute to convince yourself that this is true.)
This next example illustrates two important issues when using volatile:
class MyClass {
    int data1, data2;
    volatile int vol1, vol2;

    void setValues() {    // runs in thread 1
        data1 = 1;
        vol1 = 2;
        data2 = 3;
    }

    void useValues1() {    // runs in thread 2
        if (vol1 == 2) {
            int l1 = data1;    // okay
            int l2 = data2;    // wrong
        }
    }
    void useValues2() {    // runs in thread 2
        int dummy = vol2;
        int l1 = data1;    // wrong
        int l2 = data2;    // wrong
    }
}
Looking at useValues1(), if thread 2 hasn’t yet observed the update to vol1, then it can’t know if data1 or data2 has been set yet. Once it sees the update to vol1, it knows that the change to data1 is also visible, because that was made before vol1 was changed. However, it can’t make any assumptions about data2, because that store was performed after the volatile store.
The code in useValues2() uses a second volatile field, vol2, in an attempt to force the VM to generate a memory barrier. This doesn’t generally work. To establish a proper “happens-before” relationship, both threads need to be interacting with the same volatile field. You’d have to know that vol2 was set after data1/data2 in thread 1. (The fact that this doesn’t work is probably obvious from looking at the code; the caution here is against trying to cleverly “cause” a memory barrier instead of creating an ordered series of accesses.)
What to do
General advice
In C/C++, use the pthread operations, like mutexes and semaphores. These include the proper memory barriers, providing correct and efficient behavior on all Android platform versions. Be sure to use them correctly, for example be wary of signaling a condition variable without holding the corresponding mutex.
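As a reminder of what "use them correctly" means for condition variables, here is a minimal sketch (the names are invented for this illustration): the predicate is updated and signaled while holding the mutex, and the waiter re-checks it in a loop.

#include <pthread.h>

pthread_mutex_t gLock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  gCond = PTHREAD_COND_INITIALIZER;
bool gDataReady = false;

void producer() {
    pthread_mutex_lock(&gLock);
    gDataReady = true;              // publish the state change under the lock
    pthread_cond_signal(&gCond);    // signal while still holding the mutex
    pthread_mutex_unlock(&gLock);
}

void consumer() {
    pthread_mutex_lock(&gLock);
    while (!gDataReady) {           // loop guards against spurious wakeups
        pthread_cond_wait(&gCond, &gLock);
    }
    // ... use the data; the mutex provides the necessary barriers ...
    pthread_mutex_unlock(&gLock);
}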
It's best to avoid using atomic functions directly. Locking and unlocking a pthread mutex require a single atomic operation each if there’s no contention, so you’re not going to save much by replacing mutex calls with atomic ops. If you need a lock-free design, you must fully understand the concepts in this entire document before you begin (or, better yet, find an existing code library that is known to be correct on SMP ARM).
Be extremely circumspect with "volatile” in C/C++. It often indicates a concurrency problem waiting to happen.
In Java, the best answer is usually to use an appropriate utility class from the java.util.concurrent package. The code is well written and well tested on SMP.
Perhaps the safest thing you can do is make your class immutable. Objects from classes like String and Integer hold data that cannot be changed once the object is created, avoiding all synchronization issues. The book Effective Java, 2nd Ed. has specific instructions in “Item 15: Minimize Mutability”. Note in particular the importance of declaring fields “final” (Bloch).
If neither of these options is viable, the Java “synchronized” statement should be used to guard any field that can be accessed by more than one thread. If mutexes won’t work for your situation, you should declare shared fields “volatile”, but you must take great care to understand the interactions between threads. The volatile declaration won’t save you from common concurrent programming mistakes, but it will help you avoid the mysterious failures associated with optimizing compilers and SMP mishaps.
The Java Memory Model guarantees that assignments to final fields are visible to all threads once the constructor has finished — this is what ensures proper synchronization of fields in immutable classes. This guarantee does not hold if a partially-constructed object is allowed to become visible to other threads. It is necessary to follow safe construction practices (see Safe Construction Techniques in Java).
Synchronization primitive guarantees
The pthread library and VM make a couple of useful guarantees: all accesses previously performed by a thread that creates a new thread are observable by that new thread as soon as it starts, and all accesses performed by a thread that is exiting are observable when a join() on that thread returns. This means you don’t need any additional synchronization when preparing data for a new thread or examining the results of a joined thread.
Whether or not these guarantees apply to interactions with pooled threads depends on the thread pool implementation.
In C/C++, the pthread library guarantees that any accesses made by a thread before it unlocks a mutex will be observable by another thread after it locks that same mutex. It also guarantees that any accesses made before calling signal() or broadcast() on a condition variable will be observable by the woken thread.
Java language threads and monitors make similar guarantees for the comparable operations.
Upcoming changes to C/C++
The C and C++ language standards are evolving to include a sophisticated collection of atomic operations. A full matrix of calls for common data types is defined, with selectable memory barrier semantics (choose from relaxed, consume, acquire, release, acq_rel, seq_cst).
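In the C++ draft (and what eventually shipped as C++11), a releasing store and an acquiring load look roughly like the sketch below; treat it as an illustration rather than a tutorial on the final API.

#include <atomic>

std::atomic<int> ready(0);
int payload = 0;                    // ordinary, non-atomic data

void producer() {
    payload = 42;
    ready.store(1, std::memory_order_release);            // releasing store
}

int consumer() {
    while (ready.load(std::memory_order_acquire) == 0) {  // acquiring load
        // spin; a real program would block or yield
    }
    return payload;                 // guaranteed to observe 42
}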
See the Further Reading section for pointers to the specifications.
Closing Notes
While this document does more than merely scratch the surface, it doesn’t manage more than a shallow gouge. This is a very broad and deep topic. Some areas for further exploration:
- Learn the definitions of happens-before, synchronizes-with, and other essential concepts from the Java Memory Model. (It’s hard to understand what “volatile” really means without getting into this.)
- Explore what compilers are and aren’t allowed to do when reordering code. (The JSR-133 spec has some great examples of legal transformations that lead to unexpected results.)
- Find out how to write immutable classes in Java and C++. (There’s more to it than just “don’t change anything after construction”.)
- Internalize the recommendations in the Concurrency section of Effective Java, 2nd Edition. (For example, you should avoid calling methods that are meant to be overridden while inside a synchronized block.)
- Understand what sorts of barriers you can use on x86 and ARM. (And other CPUs for that matter, for example Itanium’s acquire/release instruction modifiers.)
- Read through the java.util.concurrent and java.util.concurrent.atomic APIs to see what's available. Consider using concurrency annotations like @ThreadSafe and @GuardedBy (from net.jcip.annotations).
The Further Reading section in the appendix has links to documents and web sites that will better illuminate these topics.
Appendix
SMP failure example
This document describes a lot of “weird” things that can, in theory, happen. If you’re not convinced that these issues are real, a practical example may be useful.
Bill Pugh’s Java memory model web site has a few test programs on it. One interesting test is ReadAfterWrite.java, which does the following:
Thread 1 | Thread 2
---|---
for (int i = 0; i < ITERATIONS; i++) { | for (int i = 0; i < ITERATIONS; i++) {
&nbsp;&nbsp;&nbsp;&nbsp;a = i; | &nbsp;&nbsp;&nbsp;&nbsp;b = i;
&nbsp;&nbsp;&nbsp;&nbsp;BB[i] = b; | &nbsp;&nbsp;&nbsp;&nbsp;AA[i] = a;
} | }
Here a and b are declared as volatile int fields, and AA and BB are ordinary integer arrays.
This is trying to determine if the VM ensures that, after a value is written to a volatile, the next read from that volatile sees the new value. The test code executes these loops a million or so times, and then runs through afterward and searches the results for inconsistencies.
At the end of execution, AA and BB will be full of gradually-increasing integers. The threads will not run side-by-side in a predictable way, but we can assert a relationship between the array contents. For example, consider this execution fragment:
Thread 1 | Thread 2
---|---
(initially a == 1534) | (initially b == 165)
a = 1535 |
BB[1535] = 165 |
a = 1536 |
BB[1536] = 165 |
&nbsp;| b = 166
&nbsp;| AA[166] = 1536
&nbsp;| b = 167
&nbsp;| AA[167] = 1536
a = 1537 |
BB[1537] = 167 |
(This is written as if the threads were taking turns executing so that it’s more obvious when results from one thread should be visible to the other, but in practice that won’t be the case.)
Look at the assignment of AA[166] in thread 2. We are capturing the fact that, at the point where thread 2 was on iteration 166, it can see that thread 1 was on iteration 1536. If we look one step in the future, at thread 1’s iteration 1537, we expect to see that thread 1 saw that thread 2 was at iteration 166 (or later). BB[1537] holds 167, so it appears things are working.
Now suppose we fail to observe a volatile write to b:
Thread 1 | Thread 2
---|---
(initially a == 1534) | (initially b == 165)
a = 1535 |
BB[1535] = 165 |
a = 1536 |
BB[1536] = 165 |
&nbsp;| b = 166
&nbsp;| AA[166] = 1536
&nbsp;| b = 167
&nbsp;| AA[167] = 1536
a = 1537 |
BB[1537] = 165&nbsp;&nbsp;(missed the writes to b) |
Now, BB[1537] holds 165, a smaller value than we expected, so we know we have a problem. Put succinctly, for i=166, BB[AA[i]+1] < i. (This also catches failures by thread 2 to observe writes to a; for example, if we miss an update and assign AA[166] = 1535, we will get BB[AA[166]+1] == 165.)
If you run the test program under Dalvik (Android 3.0 “Honeycomb” or later) on an SMP ARM device, it will never fail. If you remove the word “volatile” from the declarations of a and b, it will consistently fail. The program is testing to see if the VM is providing sequentially consistent ordering for accesses to a and b, so you will only see correct behavior when the variables are volatile. (It will also succeed if you run the code on a uniprocessor device, or run it while something else is using enough CPU that the kernel doesn’t schedule the test threads on separate cores.)
If you run the modified test a few times you will note that it doesn’t fail in the same place every time. The test fails consistently because it performs the operations a million times, and it only needs to see out-of-order accesses once. In practice, failures will be infrequent and difficult to locate. This test program could very well succeed on a broken VM if things just happen to work out.
Implementing synchronization stores
(This isn’t something most programmers will find themselves implementing, but the discussion is illuminating.)
Consider once again volatile accesses in Java. Earlier we made reference to their similarities with acquiring loads and releasing stores, which works as a starting point but doesn’t tell the full story.
We start with a fragment of Dekker’s algorithm. Initially both flag1 and flag2 are false:
Thread 1 | Thread 2
---|---
flag1 = true | flag2 = true
if (flag2 == false) | if (flag1 == false)
&nbsp;&nbsp;&nbsp;&nbsp;critical-stuff | &nbsp;&nbsp;&nbsp;&nbsp;critical-stuff
flag1 and flag2 are declared as volatile boolean fields. The rules for acquiring loads and releasing stores would allow the accesses in each thread to be reordered, breaking the algorithm. Fortunately, the JMM has a few things to say here. Informally:
- A write to a volatile field happens-before every subsequent read of that same field. (For this example, it means that if one thread updates a flag, and later on the other thread reads that flag, the reader is guaranteed to see the write.)
- Every execution has a total order over all volatile field accesses. The order is consistent with program order.
Taken together, these rules say that the volatile accesses in our example must be observable in program order by all threads. Thus, we will never see these threads executing the “critical-stuff” simultaneously.
Another way to think about this is in terms of data races. A data race occurs if two accesses to the same memory location by different threads are not ordered, at least one of them stores to the memory location, and at least one of them is not a synchronization action (Boehm and McKenney). The memory model declares that a program free of data races must behave as if executed by a sequentially-consistent machine. Because both flag1 and flag2 are volatile, and volatile accesses are considered synchronization actions, there are no data races and this code must execute in a sequentially consistent manner.
As we saw in an earlier section, we need to insert a store/load barrier between the two operations. The code executed in the VM for a volatile access will look something like this:
volatile load | volatile store
---|---
reg = A | store/store barrier
load/load + load/store barrier | A = reg
&nbsp;| store/load barrier
The volatile load is just an acquiring load. The volatile store is similar to a releasing store, but we’ve omitted load/store from the pre-store barrier, and added a store/load barrier afterward.
What we’re really trying to guarantee, though, is that (using thread 1 as an example) the write to flag1 is observed before the read of flag2. We could issue the store/load barrier before the volatile load instead and get the same result, but because loads tend to outnumber stores it’s best to associate it with the store.
On some architectures, it’s possible to implement volatile stores with an atomic operation and skip the explicit store/load barrier. On x86, for example, atomics provide a full barrier. The ARM LL/SC operations don’t include a barrier, so for ARM we must use explicit barriers.
(Much of this is due to Doug Lea and his “JSR-133 Cookbook for Compiler Writers” page.)
Further reading
Web pages and documents that provide greater depth or breadth. The more generally useful articles are nearer the top of the list.
- Shared Memory Consistency Models: A Tutorial
  Written in 1995 by Adve & Gharachorloo, this is a good place to start if you want to dive more deeply into memory consistency models.
  http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf
- Memory Barriers
  Nice little article summarizing the issues.
  http://en.wikipedia.org/wiki/Memory_barrier
- Threads Basics
  An introduction to multi-threaded programming in C++ and Java, by Hans Boehm. Excellent discussion of data races and basic synchronization methods.
  http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/threadsintro.html
- Java Concurrency In Practice
  Published in 2006, this book covers a wide range of topics in great detail. Highly recommended for anyone writing multi-threaded code in Java.
  http://www.javaconcurrencyinpractice.com
- JSR-133 (Java Memory Model) FAQ
  A gentle introduction to the Java memory model, including an explanation of synchronization, volatile variables, and construction of final fields.
  http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html
- Overview of package java.util.concurrent
  The documentation for the java.util.concurrent package. Near the bottom of the page is a section entitled “Memory Consistency Properties” that explains the guarantees made by the various classes.
  See the java.util.concurrent Package Summary.
- Java Theory and Practice: Safe Construction Techniques in Java
  This article examines in detail the perils of references escaping during object construction, and provides guidelines for thread-safe constructors.
  http://www.ibm.com/developerworks/java/library/j-jtp0618.html
- Java Theory and Practice: Managing Volatility
  A nice article describing what you can and can’t accomplish with volatile fields in Java.
  http://www.ibm.com/developerworks/java/library/j-jtp06197.html
- The “Double-Checked Locking is Broken” Declaration
  Bill Pugh’s detailed explanation of the various ways in which double-checked locking is broken. Includes C/C++ and Java.
  http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html
- [ARM] Barrier Litmus Tests and Cookbook
  A discussion of ARM SMP issues, illuminated with short snippets of ARM code. If you found the examples in this document too un-specific, or want to read the formal description of the DMB instruction, read this. Also describes the instructions used for memory barriers on executable code (possibly useful if you’re generating code on the fly).
  http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf
- Linux Kernel Memory Barriers
  Documentation for Linux kernel memory barriers. Includes some useful examples and ASCII art.
  http://www.kernel.org/doc/Documentation/memory-barriers.txt
- ISO/IEC JTC1 SC22 WG21 (C++ standards) 14882 (C++ programming language), chapter 29 (“Atomic operations library”)
  Draft standard for C++ atomic operation features.
  http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3090.pdf (intro: http://www.hpl.hp.com/techreports/2008/HPL-2008-56.pdf)
- ISO/IEC JTC1 SC22 WG14 (C standards) 9899 (C programming language), chapter 7.16 (“Atomics”)
  Draft standard for ISO/IEC 9899-201x C atomic operation features. (See also n1484 for errata.)
  http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1425.pdf
- Dekker’s algorithm
  The “first known correct solution to the mutual exclusion problem in concurrent programming”. The Wikipedia article has the full algorithm, with a discussion about how it would need to be updated to work with modern optimizing compilers and SMP hardware.
  http://en.wikipedia.org/wiki/Dekker's_algorithm
- Comments on ARM vs. Alpha and address dependencies
  An e-mail on the arm-kernel mailing list from Catalin Marinas. Includes a nice summary of address and control dependencies.
  http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-05/msg11811.html
- What Every Programmer Should Know About Memory
  A very long and detailed article about different types of memory, particularly CPU caches, by Ulrich Drepper.
  http://www.akkadia.org/drepper/cpumemory.pdf
- Reasoning about the ARM weakly consistent memory model
  This paper was written by Chong & Ishtiaq of ARM, Ltd. It attempts to describe the ARM SMP memory model in a rigorous but accessible fashion. The definition of “observability” used here comes from this paper.
  http://portal.acm.org/ft_gateway.cfm?id=1353528&type=pdf&coll=&dl=&CFID=96099715&CFTOKEN=57505711
- The JSR-133 Cookbook for Compiler Writers
  Doug Lea wrote this as a companion to the JSR-133 (Java Memory Model) documentation. It goes much deeper into the details than most people will need to worry about, but it provides good fodder for contemplation.
  http://g.oswego.edu/dl/jmm/cookbook.html
- The Semantics of Power and ARM Multiprocessor Machine Code
  If you prefer your explanations in rigorous mathematical form, this is a fine place to go next.
  http://www.cl.cam.ac.uk/~pes20/weakmemory/draft-ppc-arm.pdf