This is the second post in a series detailing a recent improvement in Apache HBase that helps to reduce the frequency of garbage collection pauses. Be sure you’ve read part 1 before continuing on to this post.
In last week’s post, we noted that HBase has had problems coping with long garbage collection pauses, and we summarized the different garbage collection algorithms commonly used for HBase on the Sun/Oracle Java 6 JVM. Then, we hypothesized that the long garbage collection pauses are due to memory fragmentation, and devised an experiment to both confirm this hypothesis and investigate which workloads are most prone to this problem.
As described in the previous post, I ran three different workload types against an HBase region server while collecting verbose GC logs with -XX:PrintFLSStatistics=1. I then wrote a short Python script to parse the results and reformat them into a TSV file, and graphed the resulting metrics using my favorite R graphing library, ggplot2.
The top graph shows free_space, the total amount of free space in the heap. The bottom graph shows max_chunk, the size of the largest contiguous chunk of free space. The X axis is time in seconds, and the Y axis is measured in heap words; in this case a word is 8 bytes, since I am running a 64-bit JVM.
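For illustration, here is a minimal sketch of what such a parsing script might look like. It is not the original script, and the regular expressions for the "Total Free Space" and "Max Chunk Size" lines are assumptions about the -XX:PrintFLSStatistics=1 output format; adjust them to match your JVM's actual log lines.

```python
import re
import sys

# Rough sketch of a parser for GC logs produced with -XX:PrintFLSStatistics=1.
# The exact log-line formats matched below are assumptions about the FLS
# statistics output; timestamps are omitted for brevity.
FREE_RE = re.compile(r"Total Free Space:\s+(\d+)")
CHUNK_RE = re.compile(r"Max\s+Chunk Size:\s+(\d+)")

def parse(path):
    """Emit (free_space, max_chunk) pairs as TSV. Values are in heap words
    (8 bytes per word on a 64-bit JVM)."""
    free_space = None
    with open(path) as f:
        for line in f:
            m = FREE_RE.search(line)
            if m:
                free_space = int(m.group(1))
                continue
            m = CHUNK_RE.search(line)
            if m and free_space is not None:
                print("%d\t%d" % (free_space, int(m.group(1))))
                free_space = None

if __name__ == "__main__":
    parse(sys.argv[1])
```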
It was immediately obvious from this overview graph that the three different workloads have very different memory characteristics. We’ll zoom in on each in turn.
Zoomed in on the write-only workload, we can see two interesting patterns: the free_space graph oscillates in a regular sawtooth as garbage collections reclaim memory, while the max_chunk graph drops steadily toward zero before occasionally jumping back up in a sharp vertical spike.
By correlating this graph with the GC logs, I noted that the long full GCs corresponded exactly to the vertical spikes in the max_chunk graph — after each of these full GCs, the heap had been defragmented, so all of the free space was in one large chunk.
So, we’ve learned that the write load does indeed cause heap fragmentation and that the long pauses occur when there are no large free chunks left in the heap.
In the second workload, the clients perform only reads, and the set of records to be read is much larger than the size of the LRU block cache. So, we see a large amount of memory churn as items are pulled into and evicted from the cache.
The free_space graph reflects this – it shows much more frequent collections than the write-only workload. However, we note that the max_chunk graph stays pretty constant around its starting value. This suggests that the read-only workload doesn’t cause heap fragmentation nearly as badly as the write workload, even though the memory churn is much higher.
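A toy model makes this kind of churn concrete (a sketch only; the cache below is a stand-in for the block cache, not HBase's implementation, and the capacities and block size are made up): when the working set is much larger than the cache, nearly every read misses, allocates a new block, and evicts an old one.

```python
from collections import OrderedDict
import random

class ToyLRUBlockCache:
    """Toy LRU cache standing in for the block cache; not HBase's implementation."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.evictions = 0

    def get(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # hit: mark as most recently used
            return self.blocks[block_id]
        block = bytearray(8 * 1024)            # miss: allocate a fresh block
        self.blocks[block_id] = block
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict the least recently used block
            self.evictions += 1
        return block

# A working set far larger than the cache: almost every read is a miss,
# so blocks are constantly allocated and evicted (memory churn).
cache = ToyLRUBlockCache(capacity=1_000)
for _ in range(20_000):
    cache.get(random.randrange(100_000))
print("evictions:", cache.evictions)
```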
The third workload, colored green in the overview graph, turned out to be quite boring. Because there was no cache churn, the only allocations were short-lived objects created to service each RPC request. Hence, they were never promoted to the old generation, and both the free_space and max_chunk time series stayed entirely constant.
To summarize the results of this experiment:
- The write-only workload steadily fragments the old generation; once no large contiguous chunk of free space remains, a long full GC pause is needed to defragment the heap.
- The read-only workload with heavy block cache churn causes much more frequent collections, but very little fragmentation.
- The read-only workload that is served entirely from the cache causes almost no old-generation activity at all.
Now that we know that write workloads cause rapid heap fragmentation, let’s take a step back and think about why. In order to do so, we’ll take a brief digression to give an overview of how HBase’s write path works.
In order to store a very large dataset distributed across many machines, Apache HBase partitions each table into segments called Regions. Each region has a designated “start key” and “stop key”, and contains every row where the key falls between the two. This scheme can be compared to primary key-based range partitions in an RDBMS, though HBase manages the partitions automatically and transparently. Each region is typically less than a gigabyte in size, so every server in an HBase cluster is responsible for several hundred regions. Read and write requests are routed to the server currently hosting the target region.
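As a rough illustration of that routing (a simplified sketch with made-up region boundaries, not HBase's actual lookup code), finding the region responsible for a row key is just a range search over the sorted start keys:

```python
import bisect

# Each region owns the half-open key range [start key, next region's start key).
# The boundaries and names here are made up for illustration.
region_start_keys = ["", "fff", "mmm", "sss"]  # sorted start keys; "" = first region
region_names = ["region-1", "region-2", "region-3", "region-4"]

def region_for(row_key):
    """Return the region whose key range contains row_key."""
    idx = bisect.bisect_right(region_start_keys, row_key) - 1
    return region_names[idx]

print(region_for("apple"))   # -> region-1
print(region_for("orange"))  # -> region-3
```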
Once a write request has reached the correct server, the new data is added to an in-memory structure called a MemStore. This is essentially a sorted map, per region, containing all recently written data. Of course, memory is a finite resource, so the region server carefully accounts for memory usage and triggers a flush on a MemStore when that usage crosses a threshold. The flush writes the data to disk and frees up the memory.
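To make that flush trigger concrete, here is a toy sketch of the per-region accounting described above; the class shape and the 128 MB threshold are illustrative assumptions, not HBase code.

```python
FLUSH_THRESHOLD_BYTES = 128 * 1024 * 1024  # illustrative threshold; the real one is configurable

class ToyMemStore:
    """Toy model of a per-region MemStore: recent writes, flushed when it grows too large."""

    def __init__(self, region_name):
        self.region_name = region_name
        self.kvs = {}          # row key -> value (the real MemStore keeps this sorted)
        self.size_bytes = 0

    def add(self, row_key, value):
        self.kvs[row_key] = value
        self.size_bytes += len(row_key) + len(value)
        if self.size_bytes >= FLUSH_THRESHOLD_BYTES:
            self.flush()

    def flush(self):
        # Write the sorted contents to an on-disk file (elided), then release the memory.
        for row_key in sorted(self.kvs):
            pass  # persist (row_key, self.kvs[row_key])
        self.kvs.clear()
        self.size_bytes = 0
```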
Let’s imagine that a region server is hosting 5 regions — colored pink, blue, green, red, and yellow in the diagram below. It is being subjected to a random write workload where the writes are spread evenly across the regions and arrive in no particular order.
As the writes come in, a new buffer is allocated for each row, and these buffers are promoted into the old generation because they sit in the MemStore for several minutes while waiting to be flushed. Since the writes arrive in no particular order, data from different regions is intermingled in the old generation. When one of the regions is flushed, only its scattered pieces of the old generation are freed, so no large contiguous chunk of memory can be reclaimed, and fragmentation is guaranteed.
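The effect is easy to reproduce with a toy model of the old generation (a sketch with arbitrary slot counts, not a real heap): interleave buffers from several regions into one flat address space, then "flush" a single region and measure the largest contiguous run of free slots.

```python
import random

NUM_REGIONS = 5
SLOTS = 10_000  # toy "old generation": a flat array of allocation slots

# Writes arrive in no particular order, so buffers belonging to different
# regions end up interleaved throughout the old generation.
random.seed(42)
old_gen = [random.randrange(NUM_REGIONS) for _ in range(SLOTS)]

def max_free_chunk(heap):
    """Length of the longest contiguous run of free (None) slots."""
    best = cur = 0
    for slot in heap:
        cur = cur + 1 if slot is None else 0
        best = max(best, cur)
    return best

# Flush region 0: its buffers are freed, but they are scattered across the
# heap, so although roughly a fifth of the slots are now free, the largest
# contiguous free chunk is only a handful of slots long.
after_flush = [None if owner == 0 else owner for owner in old_gen]
print("free slots:", after_flush.count(None))
print("largest contiguous free chunk:", max_free_chunk(after_flush))
```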
This behavior results in exactly what our experiment showed: over time, writes will always cause severe fragmentation in the old generation, leading to a full garbage collection pause.
In this post we reviewed the results of our experiment, and came to understand why writes in HBase cause memory fragmentation. In the next and last post in this series, we’ll look at the design of MemStore-Local Allocation Buffers, which avoid fragmentation and thus avoid full GCs.
Ref: http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-2/