Memory barriers, or fences, are a set of processor instructions used to apply ordering limitations on memory operations. This article explains the impact memory barriers have on the determinism of multi-threaded programs. We'll look at how memory barriers relate to JVM concurrency constructs such as volatile, synchronized and atomic conditionals. It is assumed the reader has a solid understanding of these concepts and the Java memory model. This is not an article about mutual exclusion, parallelism or atomicity per se. Memory barriers are used to achieve an equally important element of concurrent programming called visibility.
Thanks to Brian Goetz and Eric Yew for reviewing this article. I'd also like to thank Christian Thalinger for access to SPARC hardware.
A trip to main memory costs hundreds of clock cycles on commodity hardware. Processors use caching to decrease the costs of memory latency by orders of magnitude. These caches re-order pending memory operations for the sake of performance. In other words, the reads and writes of a program are not necessarily performed in the order in which they are given to the processor. When data is immutable and/or confined to the scope of one thread these optimizations are harmless. Combining these optimizations with symmetric multi-processing and shared mutable state on the other hand can be a nightmare. A program can behave non-deterministically when memory operations on shared mutable state are re-ordered. It is possible for a thread to write values that become visible to another thread in ways that are inconsistent with the order in which they were written. A properly placed memory barrier prevents this problem by forcing the processor to serialize pending memory operations.
Memory barriers are not directly exposed by the JVM; instead they are inserted into the instruction sequence by the JVM in order to uphold the semantics of language level concurrency primitives. We'll look at the source code and assembly instructions of some simple Java programs to see how. Let's begin a crash course in memory barriers with Dekker's algorithm. This algorithm uses three volatile variables to coordinate access to a shared resource between two threads.
Try not to focus on the finer details of this algorithm. Which parts are relevant? Each thread attempts to enter the critical section on the first line of code by signaling intent to do so. If a thread observes a conflict on line three (both threads have signaled intent) the conflict is resolved by turn taking. Only one thread can access the critical section at a given point in time.
      // code run by first thread     // code run by second thread
 1    intentFirst = true;             intentSecond = true;
 2
 3    while (intentSecond)            while (intentFirst)       // volatile read
 4      if (turn != 0) {                if (turn != 1) {        // volatile read
 5        intentFirst = false;            intentSecond = false;
 6        while (turn != 0) {}            while (turn != 1) {}
 7        intentFirst = true;             intentSecond = true;
 8      }                               }
 9
10    criticalSection();              criticalSection();
11
12    turn = 1;                       turn = 0;                 // volatile write
13    intentFirst = false;            intentSecond = false;     // volatile write
Hardware optimizations can break this code without memory barriers, even if the compiler were to emit all memory operations in the order they appear to be in from the programmer's point of view. Consider the two consecutive volatile read operations on lines three and four. Each thread checks to see if the other has signaled an intent to enter the critical section and then checks to see whose turn it is. Consider the two consecutive volatile write operations on lines 12 and 13. Each thread gives the other its "turn" and then withdraws its intent to enter the critical section. A reading thread should never expect to observe the other thread's write to the turn variable after the other thread's withdrawal of intent. This would be a disaster. But without the volatile modifier on these variables this indeed can happen! For example, without the volatile modifier the second thread could observe the first thread's write to intentFirst (last line) before the first thread's write to turn (second to last line). The keyword volatile prevents this problem because it establishes a happens-before relationship between the write to the turn variable and the write to the intentFirst variable. The compiler cannot re-order these write operations and if necessary it must forbid the processor from doing so with a memory barrier. A peek under the hood shows how.
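For readers who want to run the algorithm, the two-column listing can be written out as an ordinary Java class. This is a minimal sketch, not the program used to capture the instruction streams in this article: the class name DekkerDemo, the counter field, the run helper and the Thread.yield() hints are all assumptions added for illustration. The critical section simply increments a plain (non-volatile) counter.

```java
// Sketch only: DekkerDemo, counter and run() are illustrative names, not from the article.
class DekkerDemo {
    // The three volatile variables from the listing.
    static volatile boolean intentFirst;
    static volatile boolean intentSecond;
    static volatile int turn;

    // Plain field, touched only inside the critical section.
    static int counter;

    static void first() {
        intentFirst = true;                        // signal intent (volatile write)
        while (intentSecond) {                     // volatile read
            if (turn != 0) {                       // volatile read
                intentFirst = false;
                while (turn != 0) { Thread.yield(); }
                intentFirst = true;
            }
            Thread.yield();                        // scheduling hint, not part of the algorithm
        }
        counter++;                                 // critical section
        turn = 1;                                  // volatile write
        intentFirst = false;                       // volatile write
    }

    static void second() {
        intentSecond = true;
        while (intentFirst) {
            if (turn != 1) {
                intentSecond = false;
                while (turn != 1) { Thread.yield(); }
                intentSecond = true;
            }
            Thread.yield();
        }
        counter++;
        turn = 0;
        intentSecond = false;
    }

    // Runs both threads to completion and returns the final counter value.
    static int run(final int iterations) {
        intentFirst = false;
        intentSecond = false;
        turn = 0;
        counter = 0;
        Thread a = new Thread(() -> { for (int i = 0; i < iterations; i++) first(); });
        Thread b = new Thread(() -> { for (int i = 0; i < iterations; i++) second(); });
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run(100000));
    }
}
```

If mutual exclusion and visibility both hold, run(n) returns 2 * n. Removing the volatile modifiers takes that guarantee away, although a lost update may be hard to reproduce on any particular machine.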
The PrintAssembly HotSpot option is a diagnostic flag for the JVM that allows us to capture the generated assembly instructions of the JIT compiler. This requires the latest OpenJDK release or a new version of HotSpot, update 14 or above. A disassembler plugin is also required. The Kenai project has plugin binaries for Solaris, Linux and BSD. The hsdis plugin is an alternative that can be built from source for Windows.
The first of the two consecutive read operations on line three is captured in the assembly instructions below. This stream was captured on multi-processing Itanium 2 hardware running JDK 1.6 with update 17. All of the instruction streams in this article are sequenced by line number on the left hand side. Relevant read operations, write operations and memory barrier instructions are in bold. The reader is advised to avoid getting caught up in the semantics of each and every instruction.
 1 0x2000000001de819c: adds r37=597,r36;;  ;...84112554
 2 0x2000000001de81a0: ld1.acq r38=[r37];;  ;...0b30014a a010
 3 0x2000000001de81a6: nop.m 0x0  ;...00000002 00c0
 4 0x2000000001de81ac: sxt1 r38=r38;;  ;...00513004
 5 0x2000000001de81b0: cmp4.eq p0,p6=0,r38  ;...1100004c 8639
 6 0x2000000001de81b6: nop.i 0x0  ;...00000002 0003
 7 0x2000000001de81bc: br.cond.dpnt.many 0x2000000001de8220;;
This short stream of instructions tells a long story. The first volatile read is on line two. The Java memory model guarantees the JVM will deliver this read to the processor before the second read, in "program order" - but this alone would not be enough because the processor is still free to perform these operations out of order. To uphold the consistency guarantees of the Java memory model the JVM annotates the first read operation with a variant of ld.acq, or "load acquire". By using ld.acq the compiler ensures the read operation on line two will complete before the subsequent read operation. Problem solved.
Notice this affects reads, not writes. A memory barrier that enforces ordering limitations on reads or writes is said to be unidirectional. A memory barrier that enforces ordering limitations on reads and writes is said to be bidirectional, or a full fence. Using ld.acq is an example of a unidirectional memory barrier.
Consistency is a two way street. How useful is it for a reading thread to insert a memory barrier between both reads if the other thread does not separate both writes with one as well? In order for threads to communicate they must all obey the protocol; just like nodes on a network, or people on a team. If one thread breaks formation then the efforts of all other threads are rendered useless. We should expect to see a memory barrier in the assembly instructions for the last two lines of Dekker's algorithm, a volatile write followed by a volatile write.
$ java -XX:+UnlockDiagnosticVMOptions -XX:PrintAssemblyOptions=hsdis-print-bytes -XX:CompileCommand=print,WriterReader.write WriterReader
 1 0x2000000001de81c0: adds r37=592,r36;;  ;...0b284149 0421
 2 0x2000000001de81c6: st4.rel [r37]=r39  ;...00389560 2380
 3 0x2000000001de81cc: adds r36=596,r36;;  ;...84112544
 4 0x2000000001de81d0: st1.rel [r36]=r0  ;...09000048 a011
 5 0x2000000001de81d6: mf  ;...00000044 0000
 6 0x2000000001de81dc: nop.i 0x0;;  ;...00040000
 7 0x2000000001de81e0: mov r12=r33  ;...00600042 0021
 8 0x2000000001de81e6: mov.ret b0=r35,0x2000000001de81e0
 9 0x2000000001de81ec: mov.i ar.pfs=r34  ;...00aa0220
10 0x2000000001de81f0: mov r6=r32  ;...09300040 0021
Here we can see the second write operation annotated with an explicit memory barrier on line four. By using a variant of st.rel, or "store release", the compiler ensures the first write operation will be visible before the second write operation. This completes both sides of the protocol because the first write operation happens before the second write operation.
The st.rel barrier is unidirectional - just like ld.acq. On line five however the compiler emits a bidirectional memory barrier. The mf instruction, or "memory fence", is a full fence for the Itanium 2 instruction set. This seems redundant to the author.
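At the source level, the two-sided protocol described above looks like the sketch below: a writer that orders its stores and a reader that orders its loads. This is not the WriterReader class named in the command line, whose source is not shown in this article. PublicationDemo, payload, ready and run are hypothetical names chosen for illustration.

```java
// Sketch only: all names here are illustrative assumptions.
class PublicationDemo {
    static int payload;                     // plain field
    static volatile boolean ready;          // volatile flag that publishes payload

    // Writer side: the plain write may not be re-ordered after the volatile write.
    static void write(int value) {
        payload = value;
        ready = true;
    }

    // Reader side: the plain read may not be re-ordered before the volatile read.
    static int read() {
        while (!ready) { Thread.yield(); }
        return payload;
    }

    // Starts one reader and one writer and returns what the reader observed.
    static int run(int value) {
        payload = 0;
        ready = false;
        final int[] seen = new int[1];
        Thread reader = new Thread(() -> seen[0] = read());
        Thread writer = new Thread(() -> write(value));
        reader.start();
        writer.start();
        try {
            reader.join();
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(run(42));
    }
}
```

Both halves are needed. If ready were not volatile, neither the writer's store ordering nor the reader's load ordering would be guaranteed, and the reader could return a stale payload or spin indefinitely.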
This article does not aim to be a comprehensive overview of all memory barriers. This would be a monumental task. It is important though to appreciate the fact that these instructions vary considerably across different hardware architectures. Below is what the consecutive volatile writes translate to on multi-processing Intel Xeon hardware. All remaining assembly instruction sequences in this article were captured on an Intel Xeon unless specified otherwise.
 1 0x03f8340c: push %ebp  ;...55
 2 0x03f8340d: sub $0x8,%esp  ;...81ec0800 0000
 3 0x03f83413: mov $0x14c,%edi  ;...bf4c0100 00
 4 0x03f83418: movb $0x1,-0x505a72f0(%edi)  ;...c687108d a5af01
 5 0x03f8341f: mfence  ;...0faef0
 6 0x03f83422: mov $0x148,%ebp  ;...bd480100 00
 7 0x03f83427: mov $0x14d,%edx  ;...ba4d0100 00
 8 0x03f8342c: movsbl -0x505a72f0(%edx),%ebx  ;...0fbe9a10 8da5af
 9 0x03f83433: test %ebx,%ebx  ;...85db
10 0x03f83435: jne 0x03f83460  ;...7529
11 0x03f83437: movl $0x1,-0x505a72f0(%ebp)  ;...c785108d a5af01
12 0x03f83441: movb $0x0,-0x505a72f0(%edi)  ;...c687108d a5af00
13 0x03f83448: mfence  ;...0faef0
14 0x03f8344b: add $0x8,%esp  ;...83c408
15 0x03f8344e: pop %ebp  ;...5d
Here we see both volatile writes on lines 11 and 12 on the x86 Xeon. The second write is chased with an mfence instruction, an explicit bidirectional memory barrier.
And now the consecutive volatile writes on SPARC.
 1 0xfb8ecc84: ldub [ %l1 + 0x155 ], %l3  ;...e60c6155
 2 0xfb8ecc88: cmp %l3, 0  ;...80a4e000
 3 0xfb8ecc8c: bne,pn %icc, 0xfb8eccb0  ;...12400009
 4 0xfb8ecc90: nop  ;...01000000
 5 0xfb8ecc94: st %l0, [ %l1 + 0x150 ]  ;...e0246150
 6 0xfb8ecc98: clrb [ %l1 + 0x154 ]  ;...c02c6154
 7 0xfb8ecc9c: membar #StoreLoad  ;...8143e002
 8 0xfb8ecca0: sethi %hi(0xff3fc000), %l0  ;...213fcff0
 9 0xfb8ecca4: ld [ %l0 ], %g0  ;...c0042000
10 0xfb8ecca8: ret  ;...81c7e008
11 0xfb8eccac: restore  ;...81e80000
Here we see both volatile writes on lines five and six. The second write is chased with a membar instruction, an explicit bidirectional memory barrier.
There is one important difference between the instruction streams for x86 and SPARC and the instruction stream for Itanium. The JVM chased the consecutive write operations with a memory barrier on x86 and SPARC, but it did not place a memory barrier between the two write operations. On the other hand the instruction stream for Itanium has a memory barrier between both writes. Why does the JVM behave differently across hardware architectures? Because a hardware architecture has a memory model and each memory model has a set of consistency guarantees. Some memory models, like that of x86 or SPARC, have a very strong set of consistency guarantees. Other memory models, like that of Itanium, PowerPC or Alpha, have a much more relaxed set of guarantees. For example x86 and SPARC do not re-order consecutive write operations - so no memory barrier is needed. Itanium, PowerPC and Alpha will re-order consecutive write operations - so the JVM has to place a memory barrier between them. The JVM uses memory barriers to bridge the gaps between the Java memory model and the memory model of the hardware it runs on.
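Even the stronger memory models leave one gap: a write followed by a read of a different location. The barrier that chases the volatile writes (mfence on x86, membar #StoreLoad on SPARC) keeps a later read from completing before the write is visible. The sketch below shows the Java-level guarantee this supports. StoreLoadDemo, x, y, r1, r2 and countBothZero are hypothetical names, and the round count is arbitrary. Because x and y are volatile, the Java memory model rules out the outcome where both threads read zero.

```java
// Sketch only: a store-then-load test with illustrative names.
class StoreLoadDemo {
    static volatile int x, y;
    static int r1, r2;          // written by the worker threads, read after join()

    // Returns how many rounds ended with both threads having read zero.
    static int countBothZero(int rounds) {
        int bothZero = 0;
        for (int i = 0; i < rounds; i++) {
            x = 0;
            y = 0;
            Thread a = new Thread(() -> { x = 1; r1 = y; });   // write x, then read y
            Thread b = new Thread(() -> { y = 1; r2 = x; });   // write y, then read x
            a.start();
            b.start();
            try {
                a.join();
                b.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            if (r1 == 0 && r2 == 0) {
                bothZero++;
            }
        }
        return bothZero;
    }

    public static void main(String[] args) {
        System.out.println(countBothZero(1000));
    }
}
```

This is the same write-then-read pattern that opens Dekker's algorithm, where each thread signals its own intent and then reads the other's. With plain fields in place of the volatile ones, a result of zero for both reads would be permitted.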
Explicit fence instructions are not the only way to serialize memory operations. Let's switch gears to the Counter class to see an example.
class Counter{

    static int counter = 0;

    public static void main(String[] args){
        for(int i = 0; i < 100000; i++)
            inc();
    }

    static synchronized void inc(){ counter += 1; }

}
The Counter class performs a classic read-modify-write operation. The static counter field is not volatile because volatile alone would not be enough: all three operations (read, modify, write) must be observed atomically. For this reason the inc method of the Counter class is synchronized. The Java memory model guarantees the same visibility semantics for exiting a synchronized region as it does for volatile memory operations, so we should expect to see another memory barrier. We can compile the Counter class and observe the generated assembly instructions for the synchronized inc method with the following command.
$ java -XX:+UnlockDiagnosticVMOptions -XX:PrintAssemblyOptions=hsdis-print-bytes -XX:-UseBiasedLocking -XX:CompileCommand=print,Counter.inc Counter
 1 0x04d5eda7: push %ebp  ;...55
 2 0x04d5eda8: mov %esp,%ebp  ;...8bec
 3 0x04d5edaa: sub $0x28,%esp  ;...83ec28
 4 0x04d5edad: mov $0x95ba5408,%esi  ;...be0854ba 95
 5 0x04d5edb2: lea 0x10(%esp),%edi  ;...8d7c2410
 6 0x04d5edb6: mov %esi,0x4(%edi)  ;...897704
 7 0x04d5edb9: mov (%esi),%eax  ;...8b06
 8 0x04d5edbb: or $0x1,%eax  ;...83c801
 9 0x04d5edbe: mov %eax,(%edi)  ;...8907
10 0x04d5edc0: lock cmpxchg %edi,(%esi)  ;...f00fb13e
11 0x04d5edc4: je 0x04d5edda  ;...0f841000 0000
12 0x04d5edca: sub %esp,%eax  ;...2bc4
13 0x04d5edcc: and $0xfffff003,%eax  ;...81e003f0 ffff
14 0x04d5edd2: mov %eax,(%edi)  ;...8907
15 0x04d5edd4: jne 0x04d5ee11  ;...0f853700 0000
16 0x04d5edda: mov $0x95ba52b8,%eax  ;...b8b852ba 95
17 0x04d5eddf: mov 0x148(%eax),%esi  ;...8bb04801 0000
18 0x04d5ede5: inc %esi  ;...46
19 0x04d5ede6: mov %esi,0x148(%eax)  ;...89b04801 0000
20 0x04d5edec: lea 0x10(%esp),%eax  ;...8d442410
21 0x04d5edf0: mov (%eax),%esi  ;...8b30
22 0x04d5edf2: test %esi,%esi  ;...85f6
23 0x04d5edf4: je 0x04d5ee07  ;...0f840d00 0000
24 0x04d5edfa: mov 0x4(%eax),%edi  ;...8b7804
25 0x04d5edfd: lock cmpxchg %esi,(%edi)  ;...f00fb137
26 0x04d5ee01: jne 0x04d5ee1f  ;...0f851800 0000
27 0x04d5ee07: mov %ebp,%esp  ;...8be5
28 0x04d5ee09: pop %ebp  ;...5d
Unsurprisingly, synchronized generates more instructions than volatile. The increment is found on line 18 but at no point does the JVM insert an explicit memory barrier. Instead, the JVM has killed two birds with one stone using a lock-prefixed cmpxchg instruction on lines 10 and 25. The semantics of cmpxchg are beyond the scope of this article. What's relevant is that 'lock cmpxchg' not only performs the write operation atomically - it also flushes pending read and write operations. The write operation will now become visible before all subsequent memory operations. If we refactor the Counter class to use java.util.concurrent.atomic.AtomicInteger and run it, we can observe this same trick.
import java.util.concurrent.atomic.AtomicInteger;

class Counter{

    static AtomicInteger counter = new AtomicInteger(0);

    public static void main(String[] args){
        for(int i = 0; i < 1000000; i++)
            counter.incrementAndGet();
    }

}
$ java -XX:+UnlockDiagnosticVMOptions -XX:PrintAssemblyOptions=hsdis-print-bytes -XX:CompileCommand=print,*AtomicInteger.incrementAndGet Counter
 1 0x024451f7: push %ebp  ;...55
 2 0x024451f8: mov %esp,%ebp  ;...8bec
 3 0x024451fa: sub $0x38,%esp  ;...83ec38
 4 0x024451fd: jmp 0x0244520a  ;...e9080000 00
 5 0x02445202: xchg %ax,%ax  ;...6690
 6 0x02445204: test %eax,0xb771e100  ;...850500e1 71b7
 7 0x0244520a: mov 0x8(%ecx),%eax  ;...8b4108
 8 0x0244520d: mov %eax,%esi  ;...8bf0
 9 0x0244520f: inc %esi  ;...46
10 0x02445210: mov $0x9a3f03d0,%edi  ;...bfd0033f 9a
11 0x02445215: mov 0x160(%edi),%edi  ;...8bbf6001 0000
12 0x0244521b: mov %ecx,%edi  ;...8bf9
13 0x0244521d: add $0x8,%edi  ;...83c708
14 0x02445220: lock cmpxchg %esi,(%edi)  ;...f00fb137
15 0x02445224: mov $0x1,%eax  ;...b8010000 00
16 0x02445229: je 0x02445234  ;...0f840500 0000
17 0x0244522f: mov $0x0,%eax  ;...b8000000 00
18 0x02445234: cmp $0x0,%eax  ;...83f800
19 0x02445237: je 0x02445204  ;...74cb
20 0x02445239: mov %esi,%eax  ;...8bc6
21 0x0244523b: mov %ebp,%esp  ;...8be5
22 0x0244523d: pop %ebp  ;...5d
Again we see the write operation being combined with a lock prefix on line 14. This ensures the new value of the variable will become visible to other threads before all subsequent memory operations.
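The shape of this instruction stream (read the value on line 7, increment it on line 9, attempt the exchange on line 14, jump back on line 19 if it failed) is that of a compare-and-set retry loop. A minimal Java sketch of that pattern follows. It uses only the public get and compareAndSet methods of AtomicInteger. CasLoopDemo and its methods are hypothetical names, and the code is not claimed to be the library's own source, which may differ between JDK releases.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: a hand-written retry loop with illustrative names.
class CasLoopDemo {
    static int incrementAndGet(AtomicInteger value) {
        for (;;) {
            int current = value.get();                 // volatile read
            int next = current + 1;
            if (value.compareAndSet(current, next)) {  // atomic conditional write
                return next;
            }
            // Another thread changed the value in the meantime; try again.
        }
    }

    // Increments a shared counter from several threads and returns the total.
    static int run(int threads, int perThread) {
        AtomicInteger counter = new AtomicInteger(0);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    incrementAndGet(counter);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 100000));
    }
}
```

No increment is lost, because a failed compareAndSet leaves the value untouched and the loop retries against the fresh value.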
The JVM is very good at eliminating unnecessary memory barriers. Often it gets lucky and the consistency guarantees of the hardware memory model are greater than or equal to those of the Java memory model. When this happens the JVM simply inserts a no op instead of an actual memory barrier. For example, the consistency guarantees of the x86 and SPARC memory models are strong enough to eliminate the need for a memory barrier when reading a volatile variable. Remember the explicit unidirectional memory barrier used to separate both read operations on Itanium? Well, the generated assembly instructions for the consecutive volatile reads in Dekker's algorithm on an x86 have no memory barrier.
A read followed by a read of shared memory on x86
 1 0x03f83422: mov $0x148,%ebp  ;...bd480100 00
 2 0x03f83427: mov $0x14d,%edx  ;...ba4d0100 00
 3 0x03f8342c: movsbl -0x505a72f0(%edx),%ebx  ;...0fbe9a10 8da5af
 4 0x03f83433: test %ebx,%ebx  ;...85db
 5 0x03f83435: jne 0x03f83460  ;...7529
 6 0x03f83437: movl $0x1,-0x505a72f0(%ebp)  ;...c785108d a5af01
 7 0x03f83441: movb $0x0,-0x505a72f0(%edi)  ;...c687108d a5af00
 8 0x03f83448: mfence  ;...0faef0
 9 0x03f8344b: add $0x8,%esp  ;...83c408
10 0x03f8344e: pop %ebp  ;...5d
11 0x03f8344f: test %eax,0xb78ec000  ;...850500c0 8eb7
12 0x03f83455: ret  ;...c3
13 0x03f83456: nopw 0x0(%eax,%eax,1)  ;...66660f1f 840000
14 0x03f83460: mov -0x505a72f0(%ebp),%ebx  ;...8b9d108d a5af
15 0x03f83466: test %edi,0xb78ec000  ;...853d00c0 8eb7
The volatile read operations are found on lines three and fourteen. Neither is paired with a memory barrier. In other words the only performance penalty for a volatile read on an x86 (or on SPARC for that matter) is a minor loss of code motion optimization opportunities - the instruction itself is no different from an ordinary read.
Unidirectional memory barriers are naturally less expensive than bidirectional ones. The JVM will avoid a bidirectional memory barrier when it knows a unidirectional one is sufficient. The first example in this article demonstrated this. We saw the first of two consecutive volatile read operations on Itanium were annotated with a unidirectional memory barrier. If the read operations had been annotated with an explicit bidirectional memory barrier the program would still be correct, but at a greater latency cost.
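One place where a similar trade-off surfaces at the Java level is AtomicInteger.lazySet, which the java.util.concurrent.atomic documentation describes as an ordered store that is weaker than a full volatile write. The sketch below is an assumption-laden illustration, not something taken from the instruction streams in this article: LazySetDemo, payload, flag and run are hypothetical names, and the example does not measure whether the weaker store is actually cheaper on a given platform.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: publishing a value with an ordered store instead of a volatile write.
class LazySetDemo {
    static int payload;                                      // plain field
    static final AtomicInteger flag = new AtomicInteger(0);  // 0 = not ready, 1 = ready

    // Publishes value to a reader thread and returns what the reader observed.
    static int run(int value) {
        payload = 0;
        flag.set(0);
        final int[] seen = new int[1];
        Thread reader = new Thread(() -> {
            while (flag.get() == 0) { Thread.yield(); }      // volatile read
            seen[0] = payload;
        });
        reader.start();
        payload = value;      // ordinary write
        flag.lazySet(1);      // earlier writes stay ordered before this store
        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(run(7));
    }
}
```

The store still keeps earlier writes from moving past it, which is all this publication pattern needs. What it gives up is the guarantee that later reads in the writing thread wait for the store to become visible.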
Everything a static compiler knows at build time is known by a dynamic compiler at runtime, and more. More information means more opportunities to optimize. For example let's look at how the JVM treats memory barriers when running on a single processor. The following instruction stream was captured from a runtime compilation of two consecutive volatile writes in Dekker's algorithm. The program was running in a VMWare workstation image in uni-processor mode on x86 hardware.
 1 0x017b474c: push %ebp  ;...55
 2 0x017b474d: sub $0x8,%esp  ;...81ec0800 0000
 3 0x017b4753: mov $0x14c,%edi  ;...bf4c0100 00
 4 0x017b4758: movb $0x1,-0x507572f0(%edi)  ;...c687108d 8aaf01
 5 0x017b475f: mov $0x148,%ebp  ;...bd480100 00
 6 0x017b4764: mov $0x14d,%edx  ;...ba4d0100 00
 7 0x017b4769: movsbl -0x507572f0(%edx),%ebx  ;...0fbe9a10 8d8aaf
 8 0x017b4770: test %ebx,%ebx  ;...85db
 9 0x017b4772: jne 0x017b4790  ;...751c
10 0x017b4774: movl $0x1,-0x507572f0(%ebp)  ;...c785108d 8aaf01
11 0x017b477e: movb $0x0,-0x507572f0(%edi)  ;...c687108d 8aaf00
12 0x017b4785: add $0x8,%esp  ;...83c408
13 0x017b4788: pop %ebp  ;...5d
On a uni-processor system the JVM inserts a no op for all memory barriers because memory operations are already serialized. Neither write operation (lines 10 and 11) is chased with a barrier. The JVM makes similar optimizations for atomic conditionals. Here is an instruction stream captured from the runtime compilation of AtomicInteger.incrementAndGet on the same VMWare image.
 1 0x036880f7: push %ebp  ;...55
 2 0x036880f8: mov %esp,%ebp  ;...8bec
 3 0x036880fa: sub $0x38,%esp  ;...83ec38
 4 0x036880fd: jmp 0x0368810a  ;...e9080000 00
 5 0x03688102: xchg %ax,%ax  ;...6690
 6 0x03688104: test %eax,0xb78b8100  ;...85050081 8bb7
 7 0x0368810a: mov 0x8(%ecx),%eax  ;...8b4108
 8 0x0368810d: mov %eax,%esi  ;...8bf0
 9 0x0368810f: inc %esi  ;...46
10 0x03688110: mov $0x9a3f03d0,%edi  ;...bfd0033f 9a
11 0x03688115: mov 0x160(%edi),%edi  ;...8bbf6001 0000
12 0x0368811b: mov %ecx,%edi  ;...8bf9
13 0x0368811d: add $0x8,%edi  ;...83c708
14 0x03688120: cmpxchg %esi,(%edi)  ;...0fb137
15 0x03688123: mov $0x1,%eax  ;...b8010000 00
16 0x03688128: je 0x03688133  ;...0f840500 0000
17 0x0368812e: mov $0x0,%eax  ;...b8000000 00
18 0x03688133: cmp $0x0,%eax  ;...83f800
19 0x03688136: je 0x03688104  ;...74cc
20 0x03688138: mov %esi,%eax  ;...8bc6
21 0x0368813a: mov %ebp,%esp  ;...8be5
22 0x0368813c: pop %ebp  ;...5d
Notice the cmpxchg instruction on line 14. Previously we saw the compiler give this instruction to the processor with a lock prefix. In the absence of SMP the JVM has chosen to avoid this cost - something it could not have done with static compilation.
Memory barriers are a necessity for multi-threaded programming. They come in many flavors. Some are explicit, others are implicit. Some are bidirectional, others are unidirectional. The JVM uses this array of choices to efficiently honor the Java memory model across all platforms. I hope this article helps experienced JVM developers become a little more knowledgeable about how their code behaves under the hood.
Dennis Byrne is a senior software engineer for DRW Trading, a proprietary trading firm and liquidity provider. He is a writer, presenter and active member of the open source community.