This is the third and final post in a series detailing a recent improvement in Apache HBase that helps to reduce the frequency of garbage collection pauses. Be sure you’ve read part 1 and part 2 before continuing on to this post.
It’s been a few days since the first two posts, so let’s start with a quick refresher. In the first post we discussed Java garbage collection algorithms in general and explained that the problem of lengthy pauses in HBase has only gotten worse over time as heap sizes have grown. In the second post we ran an experiment showing that write workloads in HBase cause memory fragmentation as all newly inserted data is spread out into several MemStores which are freed at different points in time.
As identified in the previous post, the central issue is that data from different MemStores is all mixed up in the old generation. When we flush one MemStore, we only free up bits and pieces of heap instead of any large chunks. In other words, we're violating one of the assumptions of the Java GC model, namely that objects allocated together in time tend to die together. The allocation pattern of a random write workload guarantees nearly the opposite.
In order to attack this issue, we simply need to ensure that data for each region is allocated from the same area in the heap. In a language with manual memory management, this is typically done using a well-known pattern called arena allocation. In this pattern, every allocation is associated with a larger area of memory called an arena — the arena is simply divided up into smaller pieces as memory is allocated.
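To make the pattern concrete, here is a toy arena allocator in Java. This is not HBase code, and the class and method names are purely illustrative; it just shows that allocating from an arena is nothing more than bumping an offset into a big buffer, and that freeing happens for the whole arena at once.

```java
import java.nio.ByteBuffer;

// A toy arena: one large backing buffer, carved into allocations by bumping
// an offset. Freeing is all-or-nothing: you reset (or discard) the whole
// arena at once rather than freeing individual allocations.
public class Arena {
  private final byte[] backing;
  private int nextFree = 0;

  public Arena(int sizeBytes) {
    this.backing = new byte[sizeBytes];
  }

  /** Returns a slice of the arena, or null if the arena is exhausted. */
  public ByteBuffer allocate(int size) {
    if (nextFree + size > backing.length) {
      return null;
    }
    ByteBuffer slice = ByteBuffer.wrap(backing, nextFree, size);
    nextFree += size;
    return slice;
  }

  /** "Frees" every allocation at once by rewinding the bump pointer. */
  public void reset() {
    nextFree = 0;
  }
}
```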
The most commonly seen application of this concept is the thread-local allocation buffer, or TLAB. In this model, each execution thread has its own memory arena, and all allocations done by that thread come from its own arena. There are several benefits to the use of TLABs: since each thread only ever allocates from its own arena, allocation is a cheap pointer bump that needs no synchronization with other threads, and objects allocated close together in time by the same thread end up next to each other in memory.
In fact, the Sun JVM uses TLABs by default for small allocations. You can learn more about TLABs in the JVM in this excellent blog post.
Unfortunately, the TLABs used in the JVM do not help solve the fragmentation issue experienced by HBase. This is because an individual handler thread in HBase actually handles requests for different regions throughout its lifetime, so even though the allocations come from a single thread-local arena, data for different MemStores is intermixed within the TLAB. When the memory is promoted to the old generation, it remains intermingled.
However, it's not too difficult to borrow the concept and apply the same idea to MemStores, coining the term MemStore-Local Allocation Buffer (MSLAB). Whenever a request thread needs to insert data into a MemStore, it shouldn't allocate the space for that data from the heap at large, but rather from a memory arena dedicated to the target region. The expected benefit is that data for each region then stays together in a small number of chunks in the heap, so flushing a MemStore frees large contiguous areas rather than the scattered bits and pieces that cause fragmentation.
Unfortunately, standard Java does not give programmers the ability to allocate objects from memory arenas. But in the case of HBase, we don't need a general-purpose object arena: each piece of data consists of a single KeyValue object, which is small, and every KeyValue object is exactly the same size, so these objects don't cause significant fragmentation. Rather, it's the byte[] arrays referred to by the KeyValue objects that cause the fragmentation. So, we simply need to ensure that the byte[] arrays are allocated from MSLABs instead of the heap.
It turns out this is not very difficult! The KeyValue class doesn't just contain a byte[], but also an offset field pointing into the byte array. So in order to place the data for different KeyValue objects near each other, we just need to take slices of a larger byte[] representing the MSLAB arena. The implementation looks something like the simplified sketch below.
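The real class in HBase is called MemStoreLAB; the sketch here condenses it to the core idea, and the field and method names are illustrative rather than a copy of the actual source. Statistics, configuration, and the bookkeeping needed to retire old chunks are omitted.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Simplified sketch of a MemStore-Local Allocation Buffer.
public class MemStoreLAB {
  static final int CHUNK_SIZE = 2 * 1024 * 1024; // default chunk size: 2MB
  static final int MAX_ALLOC = 256 * 1024;       // larger allocations bypass the MSLAB

  // The chunk currently being carved up for this MemStore.
  private final AtomicReference<Chunk> curChunk = new AtomicReference<Chunk>();

  /** A slice handed back to the caller: the backing array plus an offset. */
  public static class Allocation {
    public final byte[] data;
    public final int offset;
    Allocation(byte[] data, int offset) {
      this.data = data;
      this.offset = offset;
    }
  }

  /** One arena chunk. Allocation is a lock-free bump of nextFreeOffset. */
  private static class Chunk {
    final byte[] data = new byte[CHUNK_SIZE];
    final AtomicInteger nextFreeOffset = new AtomicInteger(0);

    /** Reserve size bytes; returns the offset, or -1 if the chunk is full. */
    int alloc(int size) {
      while (true) {
        int oldOffset = nextFreeOffset.get();
        if (oldOffset + size > data.length) {
          return -1; // no room left in this chunk
        }
        if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size)) {
          return oldOffset; // we now own [oldOffset, oldOffset + size)
        }
        // Another handler thread won the CAS race; retry with the new offset.
      }
    }
  }

  /** Allocate size bytes from this MemStore's arena. */
  public Allocation allocateBytes(int size) {
    if (size > MAX_ALLOC) {
      // Unusually large values go straight to the general heap.
      return new Allocation(new byte[size], 0);
    }
    while (true) {
      Chunk c = curChunk.get();
      if (c != null) {
        int offset = c.alloc(size);
        if (offset != -1) {
          return new Allocation(c.data, offset);
        }
      }
      // The current chunk is full (or doesn't exist yet): try to swap in a
      // fresh one. If another thread beats us to it, we just use theirs.
      curChunk.compareAndSet(c, new Chunk());
    }
  }
}
```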
In order to keep this efficient, the entire algorithm is implemented lock-free, using atomic compare-and-swap operations on the nextFreeOffset pointer and the curChunk structure.
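On the insert path, the MemStore can then copy each incoming cell's bytes into a slice from the arena and point a new KeyValue at that slice. The glue code below is again just a sketch: the KeyValue(byte[], int, int) constructor and the getBuffer()/getOffset()/getLength() accessors are part of the 0.90-era API, but the helper class itself is hypothetical.

```java
import org.apache.hadoop.hbase.KeyValue;

// Illustrative glue code: rewrite an incoming KeyValue so that its backing
// bytes live inside the MemStore's arena instead of wherever the RPC layer
// happened to allocate them.
public class MSLABCopier {
  public static KeyValue copyIntoMSLAB(KeyValue kv, MemStoreLAB mslab) {
    int len = kv.getLength();
    MemStoreLAB.Allocation alloc = mslab.allocateBytes(len);
    // Copy the cell's bytes into the MSLAB chunk.
    System.arraycopy(kv.getBuffer(), kv.getOffset(), alloc.data, alloc.offset, len);
    return new KeyValue(alloc.data, alloc.offset, len);
  }
}
```

Note that this involves an extra copy of the data into the MSLAB chunk, which is the efficiency cost discussed later in this post.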
After implementing MSLABs, we expected to see significantly less fragmentation. So, to confirm this, I ran the same write-load generator as described in the prior post and graphed the results with the same methodology:
This graph shows the experiment beginning with an entirely empty heap when the Region Server is started, and continuing through about an hour and a half of write load. As before, we see the free_space graph fluctuate back and forth as the concurrent collector runs. The max_chunk graph drops quickly at first as memory is allocated, but then eventually stabilizes. I've also included num_blocks, the total number of separate free chunks in the old generation, in this graph. You can see that this metric also stabilizes after an hour or so of runtime.
After producing the above graph, I let the insert workload continue running overnight and then for several more days. In all of that time, there was not a single GC pause that lasted longer than a second. The fragmentation problem was completely solved for this workload!
The MSLAB allocation scheme is available in Apache HBase 0.90.1, and part of CDH3 Beta 4 released last week. Since it is relatively new, it is not yet enabled by default, but it can be configured using the following flags:
| Configuration | Description |
|---|---|
| hbase.hregion.memstore.mslab.enabled | Set to true to enable this feature |
| hbase.hregion.memstore.mslab.chunksize | The size of the chunks allocated by MSLAB, in bytes (default 2MB) |
| hbase.hregion.memstore.mslab.max.allocation | The maximum size of a byte array that will be allocated from the MSLAB, in bytes (default 256KB) |
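For example, turning the feature on amounts to adding the enabled property to hbase-site.xml on the region servers; the chunk size and max allocation entries below simply restate the defaults, so they only need to be set if you want different values.

```xml
<!-- hbase-site.xml on the region servers -->
<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hbase.hregion.memstore.mslab.chunksize</name>
  <value>2097152</value> <!-- 2MB, the default -->
</property>
<property>
  <name>hbase.hregion.memstore.mslab.max.allocation</name>
  <value>262144</value> <!-- 256KB, the default -->
</property>
```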
As this is a very new feature, there are still a few rough edges to be worked out. In particular, each region now has a minimum of 2MB of memory usage on the region server. This means that a server hosting thousands of regions could have several GB of wasted memory sitting in unused allocation buffers: at 2MB apiece, 2,000 regions account for roughly 4GB. We need to figure out a good heuristic to automatically tune the chunk size and avoid this kind of situation.
There are also a few more efficiency gains to be made. For example, we currently do an extra memory copy of data when moving it into the MSLAB chunk. This can be avoided for a modest CPU improvement.
Lastly, some work needs to be done to re-tune the suggested garbage collection tuning parameters. Given this improvement, we may be able to set -XX:CMSInitiatingOccupancyFraction to a higher value than we recommended for prior versions.
If you've been having problems with garbage collection pauses in Apache HBase, please give this experimental option a try and report back your results. According to our synthetic test workloads, it may significantly reduce or even eliminate the problem!
I had a great time working with HBase and the JVM to understand these memory fragmentation behaviors and then designing and implementing the MSLAB solution. If you found this series of posts interesting, you might be just the kind of engineer we're looking for to join Cloudera. Please be sure to check out our careers page and get in touch!
Ref: http://www.cloudera.com/blog/2011/03/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-3/