Today, rather than discussing new projects or use cases built on top of CDH, I’d like to switch gears a bit and share some details about the engineering that goes into our products. In this post, I’ll explain the MemStore-Local Allocation Buffer, a new component in the guts of Apache HBase which dramatically reduces the frequency of long garbage collection pauses. While you won’t need to understand these details to use Apache HBase, I hope it will provide an interesting view into the kind of work that engineers at Cloudera do.
This post is the first in a three-part series on this project.
Over the last few years, the amount of memory available on inexpensive commodity servers has gone up and up. When the Apache HBase project started in 2007, typical machines running Hadoop had 4-8GB of RAM. Today, most Cloudera customers run with at least 24GB of RAM, and larger amounts like 48GB or even 72GB are becoming increasingly common as costs continue to come down. On the surface, all this new memory capacity seems like a great win for latency-sensitive database software like HBase — with a lot of RAM, more data can fit in cache, avoiding expensive disk seeks on reads, and more data can fit in the MemStore, the memory area that buffers writes before they flush to disk.
In practice, however, as typical heap sizes for HBase have crept up and up, the garbage collection algorithms available in production-quality JDKs have remained largely the same. This has led to one major problem for many users of HBase: lengthy stop-the-world garbage collection pauses which get longer and longer as heap sizes continue to grow. What does this mean in practice?

- During a stop-the-world pause, client requests to the region server stall, causing latency spikes and timeouts.
- If a pause lasts longer than the ZooKeeper session timeout, the HBase master considers the region server dead, and the server must shut itself down once the pause ends.
The above issues will be familiar to most who have done serious load testing of an HBase cluster. On typical hardware, a full garbage collection can pause the process for 8-10 seconds per GB of heap, so an 8GB heap may pause for upwards of a minute. No matter how much tuning one might do, it turns out this problem is completely unavoidable in HBase 0.90.0 or older with today’s production-ready garbage collectors. Since this is such a common issue, and one that is only getting worse, it became a priority for Cloudera at the beginning of the year. In the rest of this series, I’ll describe a solution we developed that largely eliminates the problem.
In order to understand the GC pause issue thoroughly, it’s important to have some background in Java’s garbage collection techniques. Some simplifications will be made, so I highly encourage you to do further research for all the details.
If you’re already an expert in GC, feel free to skip this section.
Java’s garbage collector typically operates in a generational mode, relying on an assumption called the generational hypothesis: we assume that most objects either die young, or stick around for quite a long time. For example, the buffers used in processing an RPC request will only last for some milliseconds, whereas the data in a cache or the data in the HBase MemStore will likely survive for many minutes.
Given that objects have two different lifetime profiles, it’s intuitive that different garbage collection algorithms might do a better job on one profile than another. So, we split up the heap into two generations: the young (a.k.a. new) generation and the old (a.k.a. tenured) generation. When objects are allocated, they start in the young generation, where we prefer an algorithm that operates efficiently when most of the data is short-lived. If an object survives several collections inside the young generation, we tenure it by relocating it into the old generation, where we assume that data is likely to die out much more slowly.
For most latency-sensitive workloads like HBase, we recommend the -XX:+UseParNewGC and -XX:+UseConcMarkSweepGC JVM flags. These enable the Parallel New collector for the young generation and the Concurrent Mark-Sweep collector for the old generation, respectively.
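For reference, here is a minimal sketch of how these flags are typically passed to the region server through the HBASE_OPTS variable in conf/hbase-env.sh; treat it as illustrative rather than a complete tuning recommendation, and keep whatever options your deployment already sets:

```bash
# conf/hbase-env.sh -- illustrative sketch, not a complete tuning recommendation
# ParNew for the young generation, CMS for the old generation
export HBASE_OPTS="$HBASE_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"
```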
The Parallel New collector is a stop-the-world copying collector. Whenever it runs, it first stops the world, suspending all Java threads. Then, it traces object references to determine which objects are live (still referenced by the program). Lastly, it moves the live objects over to a free section of the heap and updates any pointers into those objects to point to the new addresses. There are a few important points here about this collector:

- It stops the world, but not for very long. Because the young generation is relatively small and most of its objects are already dead, only a small amount of live data needs to be copied, so the pause is typically short.
- Because it copies the live objects into a fresh, contiguous area of the heap, the young generation is effectively defragmented every time the collector runs.
Each time the Parallel New collector copies an object, it increments a counter for that object. After an object has been copied around in the young generation several times, the algorithm decides that it must belong to the long-lived class of objects, and moves it to the old generation (tenures it). The number of times an object is copied inside the young generation before being tenured is called the tenuring threshold.
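If you want to watch tenuring happen on your own cluster, HotSpot can log the age distribution of surviving young-generation objects together with the current tenuring threshold. A hedged example, again via HBASE_OPTS and assuming verbose GC logging is already enabled:

```bash
# Illustrative: print the ages of surviving young-generation objects and the
# tenuring threshold at each young-generation collection
export HBASE_OPTS="$HBASE_OPTS -XX:+PrintTenuringDistribution"
```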
Every time the Parallel New collector runs, it will tenure some objects into the old generation. So, of course, the old generation will eventually fill up, and we need a strategy for collecting it as well. The Concurrent Mark-Sweep (CMS) collector is responsible for clearing dead objects in this generation.
The CMS collector operates in a series of phases. Some phases stop the world, and others run concurrently with the Java application. The major phases are:

1. initial-mark (stops the world): marks the set of objects immediately reachable from the program's root references. This pause is short.
2. concurrent-mark: starting from that root set, traces the object graph to find all live objects, while the application continues to run.
3. remark (stops the world): a second short pause to account for objects whose references changed while the concurrent mark was running.
4. concurrent-sweep: walks the heap and frees the space occupied by dead objects, again while the application continues to run.
The important things to note here are:

- The stop-the-world phases (initial-mark and remark) do very little work, so their pauses are short.
- The expensive work of tracing and sweeping the whole heap happens concurrently with the application.
- CMS does not relocate or compact live objects; it only keeps track of the separate regions of free space left behind by the sweep.
As I described it, the CMS collector sounds pretty great – it only pauses for very short times and most of the heavy lifting is done concurrently. So how is it that we see multi-minute pauses when we run HBase under heavy load with large heaps? It turns out that the CMS collector has two failure modes.
The first failure mode, and the one more often discussed, is simple concurrent mode failure. This is best described with an example: suppose that there is an 8GB heap. When the heap is 7GB full, the CMS collector may begin its first phase. It’s happily chugging along with the concurrent-mark phase. Meanwhile, more data is being allocated and tenured into the old generation. If the tenuring rate is too fast, the generation may completely fill up before the collector is done marking. At that point, the program may not proceed because there is no free space to tenure more objects! The collector must abort its concurrent work and fall back to a stop-the-world single-threaded copying collection algorithm. This algorithm relocates all live objects to the beginning of the heap, and frees up all of the dead space. After the long pause, the program may proceed.
It turns out this is fairly easy to avoid with tuning: we simply need to encourage the collector to start its work earlier! Thus, it’s less likely that it will get overrun with new allocations before it’s done with its collection. This is tuned by setting -XX:CMSInitiatingOccupancyFraction=N where N is the percent of heap at which to start the collection process. The HBase region server carefully accounts its memory usage to stay within 60% of the heap, so we usually set this value to around 70.
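As a sketch, the setting might look like this (the value 70 is a starting point rather than a universal recommendation):

```bash
# Illustrative: start a CMS cycle once the old generation is roughly 70% full
export HBASE_OPTS="$HBASE_OPTS -XX:CMSInitiatingOccupancyFraction=70"
```

Some deployments also add -XX:+UseCMSInitiatingOccupancyOnly so that the JVM sticks to this threshold instead of falling back to its own adaptive heuristics.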
The second failure mode is a little bit more complicated. Recall that the CMS collector does not relocate objects, but simply tracks all of the separate areas of free space in the heap. As a thought experiment, imagine that I allocate 1 million objects, each 1KB, for a total usage of 1GB in a heap that is exactly 1GB. Then I free every odd-numbered object, so I have 500MB live. However, the free space will be made up solely of 1KB chunks. If I need to allocate a 2KB object, there is nowhere to put it, even though I ostensibly have 500MB of space free. This is termed memory fragmentation. No matter how early I ask the CMS collector to start, since it does not relocate objects, it cannot solve this problem!
When this problem occurs, the collector again falls back to the copying collector, which is able to compact all the objects and free up space.
Let’s come back up for air and use what we learned about Java GC to think about HBase. We can make two observations about the pauses we see in HBase:

- The long pauses occur even when -XX:CMSInitiatingOccupancyFraction is tuned low, so they are not simply concurrent mode failures caused by starting the collection too late.
- They occur even though the region server carefully bounds its memory usage, keeping the heap well under its configured limits, so they are not caused by a leak that slowly exhausts the heap.
Given these observations, we hypothesize that our problem must be caused by fragmentation, rather than some kind of memory leak or improper tuning.
To confirm this hypothesis, we’ll run an experiment. The first step is to collect some measurements about heap fragmentation. After spelunking in the OpenJDK source code, I discovered the little-known parameter -XX:PrintFLSStatistics=1 which, when combined with other verbose GC logging options, causes the CMS collector to print statistical information about its free space before and after every collection. In particular, the metrics we care about are:

- Free space: the total amount of free memory in the old generation.
- Num chunks: the number of separate, non-contiguous chunks that make up that free space.
- Max chunk size: the size of the largest of those chunks, i.e. the biggest allocation that can be satisfied without a pause.
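For the curious, here is roughly how that parameter might be combined with the usual verbose GC logging options; the log path is a placeholder and your existing options may differ:

```bash
# Illustrative: verbose GC logging plus CMS free-list statistics
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:/tmp/hbase-regionserver-gc.log -XX:PrintFLSStatistics=1"
```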
I enabled this option, started up a cluster, and then ran three separate stress workloads against it using the Yahoo! Cloud Serving Benchmark (YCSB):

1. A write-only workload, inserting rows across the key space.
2. A read-only workload with cache churn: random reads across the whole key space, so most reads miss the LRU cache.
3. A read-only workload without cache churn: repeated reads of a small set of rows that fits entirely in the LRU cache.
Each workload ran for at least an hour, so we could collect good data about the GC behavior under that workload. The goal of this experiment was first to verify our hypothesis that the pauses are caused by fragmentation, and second to determine whether these issues are primarily caused by the read path (including the LRU cache) or the write path (including the MemStores for each region).
The next post in the series will show the results of this experiment and dig into HBase’s internals to understand how the different workloads affect memory layout.
Meanwhile, if you want to learn more about Java’s garbage collectors, I recommend the following links: