As of today, Hadoop 0.20.2 is the latest stable release of Apache Hadoop that is marked as ready for production (neither 0.21 nor 0.22 are). Unfortunately, the Hadoop 0.20.2 release is not compatible with the latest stable version of HBase: if you run HBase on top of Hadoop 0.20.2, you risk losing data! Hence HBase users must build their own Hadoop 0.20.x version if they want to run HBase on a production Hadoop cluster. In this article, I describe how to build such a production-ready version of Hadoop 0.20.x that is compatible with HBase 0.90.2.

Table of Contents:
  • Before we start
  • Examples use git (not svn)
  • Hadoop 0.20.2 versus 0.20.203.0
  • Hadoop is covered. What about HBase then?
  • Version of Hadoop 0.20-append used in this article
  • Background
  • Hadoop and HBase: Which versions to pick for production clusters?
  • Alternatives to what we are doing here
  • A word of caution and a Thank You
  • Building Hadoop 0.20-append from branch-0.20-append
  • Retrieve the Hadoop 0.20-append sources
  • Hadoop 0.20.2 release vs. Hadoop 0.20-append
  • Run the build process
  • Build commands
  • The build test fails, now what?
  • Locate the build output (Hadoop JAR files)
  • Install your Hadoop 0.20-append build in your Hadoop cluster
  • Rename the build JAR files if you run Hadoop 0.20.2
  • Maintaining your own version of Hadoop 0.20-append
  • Conclusion


Update October 17, 2011: As of version 0.20.205.0 (marked as a beta release), Hadoop now supports HDFS append/hsync/hflush out of the box and is thus compatible with HBase 0.90.x. You can still follow the instructions described in this article to build your own version of Hadoop.

Before we start

Examples use git (not svn)

In the following sections, I will use git as the version control system to work on the Hadoop source code. Why? Because I am much more comfortable with git than svn, so please bear with me.

If you are using Subversion, feel free to adapt the git commands described below. You are invited to write a comment to this article about your SVN experience so that other SVN users can benefit, too!
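If you want to avoid git entirely, a rough Subversion equivalent of the initial checkout might look like the following. This is only a sketch that I have not tested myself; the branch URL is the one that appears in the commit message quoted further below.

# Untested sketch: check out the append branch directly with Subversion
$ svn checkout https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append/ hadoop-branch-0.20-append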

Hadoop 0.20.2 versus 0.20.203.0

Update 2011-06-11: Hadoop 0.20.203.0 and HBase 0.90.3 were released a few weeks after this article was published. While the article talks mostly about Hadoop 0.20.2, the build instructions should also work for Hadoop 0.20.203.0, although I have not had the time to test this myself yet. Feel free to leave a comment at the end of the article if you run into any issues!

Hadoop is covered. What about HBase then?

In this article, I focus solely on building a Hadoop 0.20.x version (see the Background section below) that is compatible with HBase 0.90.2. In a future article, I will describe how to actually install and set up HBase 0.90.2 on the Hadoop 0.20.x version that we build here.

Version of Hadoop 0.20-append used in this article

The instructions below use the latest version of branch-0.20-append. As of this writing, the latest commit to the append branch is git commit df0d79cc aka Subversion rev 1057313. For reference, the corresponding commit message is “HDFS-1554. New semantics for recoverLease. Contributed by Hairong Kuang.” from January 10, 2011.

commit df0d79cc2b09438c079fdf10b913936492117917
Author: Hairong Kuang 
Date:   Mon Jan 10 19:01:36 2011 +0000

    HDFS-1554. New semantics for recoverLease. Contributed by Hairong Kuang.

    git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append@1057313 13f79535-47bb-0310-9956-ffa450edef68

That said, the steps should also work for any later versions of branch-0.20-append.

Background

Hadoop and HBase: Which versions to pick for production clusters?

Hadoop 0.20.2 is the latest stable release of Apache Hadoop that is marked ready for production. Unfortunately, the latest stable release of Apache HBase, i.e. HBase 0.90.2, is not compatible with Hadoop 0.20.2: If you try to run HBase 0.90.2 on an unmodified version of Hadoop 0.20.2 release, you might lose data!

The following lines are taken, slightly modified, from the HBase documentation:

This version of HBase [0.90.2] will only run on Hadoop 0.20.x. It will not run on Hadoop 0.21.x (nor 0.22.x). HBase will lose data unless it is running on an HDFS that has a durable sync. Currently only the branch-0.20-append branch has this attribute. No official releases have been made from this branch up to now so you will have to build your own Hadoop from the tip of this branch.

Here is a quick overview:

Hadoop version             HBase version   Compatible?
-----------------------    -------------   -----------
0.20.2 release             0.90.2          NO
0.20-append                0.90.2          YES
0.21.0 release             0.90.2          NO
0.22.x (in development)    0.90.2          NO

To be honest, it took me quite some time to get up to speed with the various requirements, dependencies, project statuses, etc. for marrying Hadoop 0.20.x and HBase 0.90.2. Hence I want to contribute back to the Hadoop and HBase communities by writing this article.

Alternatives to what we are doing here

Another option for getting HBase up and running on Hadoop, rather than building Hadoop 0.20-append yourself, is to use Cloudera’s CDH3 distribution. CDH3 includes the Hadoop 0.20-append patches needed to add a durable sync, i.e. to make Hadoop 0.20.x compatible with HBase 0.90.2.

A word of caution and a Thank You

First, a warning: while I have taken great care to compile and describe the steps in the following sections, I still cannot give you any guarantees. If in doubt, join our discussions on the HBase mailing list.

Second, I am only stitching together the pieces of the puzzle here. The heavy lifting has been done by others. Hence I would like to thank St.Ack for his great feedback while I was preparing the information for this article, and both him and the rest of the HBase developers for their help on the HBase mailing list. It’s much appreciated!

Building Hadoop 0.20-append from branch-0.20-append

Retrieve the Hadoop 0.20-append sources

Hadoop 0.20.x is not separated into the Common, HDFS and MapReduce components as the versions ≥ 0.21.0 are. Hence you find all the required code in the Hadoop Common repository.

So the first step is to check out the Hadoop Common repository.

$ git clone http://git.apache.org/hadoop-common.git
$ cd hadoop-common

However, the previous git command only retrieved the latest version of Hadoop Common, i.e. the tip (HEAD) of its development. We, however, are only interested in the code tree for Hadoop 0.20-append, i.e. the branch branch-0.20-append. Because git by default only creates a local branch for master (trunk in Subversion terms), we must explicitly check out and track the remote append branch:

# Retrieve the (remote) Hadoop 0.20-append branch as git normally checks out
# only the master tree (trunk in Subversion language).
$ git checkout -t origin/branch-0.20-append
Branch branch-0.20-append set up to track remote branch branch-0.20-append from origin.
Switched to a new branch 'branch-0.20-append'

Hadoop 0.20.2 release vs. Hadoop 0.20-append

Up to now, you might have asked yourself what the difference between the 0.20.2 release of Hadoop and its append branch actually is. Here’s the answer: The Hadoop 0.20-append branch is effectively a superset of Hadoop 0.20.2 release. In other words, there is not a single “real” commit in Hadoop 0.20.2 release that is not also in Hadoop 0.20-append. This means that Hadoop 0.20-append brings all the goodies that Hadoop 0.20.2 release has, great!

Run the following git command to verify this:

$ git show-branch release-0.20.2 branch-0.20-append
! [release-0.20.2] Hadoop 0.20.2 release
 * [branch-0.20-append] HDFS-1554. New semantics for recoverLease. Contributed by Hairong Kuang.
--
 * [branch-0.20-append] HDFS-1554. New semantics for recoverLease. Contributed by Hairong Kuang.
 * [branch-0.20-append^] HDFS-1555. Disallow pipelien recovery if a file is already being lease recovered. Contributed by Hairong Kuang.
 * [branch-0.20-append~2] Revert the change made to HDFS-1555: merge -c -1056483 https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append
 * [branch-0.20-append~3] HDFS-1555. Disallow pipeline recovery if a file is already being lease recovered. Contributed by Hairong Kuang.
[...]
 * [branch-0.20-append~50] JDiff output for release 0.20.2
 * [branch-0.20-append~51] HADOOP-1849. Merge -r 916528:916529 from trunk to branch-0.20.
+  [release-0.20.2] Hadoop 0.20.2 release
+  [release-0.20.2^] Hadoop 0.20.2-rc4
+* [branch-0.20-append~52] Prepare for 0.20.2-rc4

As you can see, there are only two commits in 0.20.2 release that are not in branch-0.20-append, namely the commits “Hadoop 0.20.2 release” and “Hadoop 0.20.2-rc4”. Both of these commits are simple tagging commits, i.e. they are just used for release management but do not introduce any changes to the content of the Hadoop source code.
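If you prefer a more compact check than the show-branch output above, the following command (just a quick sketch) lists exactly those commits that are reachable from release-0.20.2 but not from branch-0.20-append; it should print only the two tagging commits.

# List commits in release-0.20.2 that are not in branch-0.20-append
$ git log --oneline branch-0.20-append..release-0.20.2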

Run the build process

Build commands

First, we have to create the build.properties file (see full instructions).

Here are the contents of mine:

#this is essential
resolvers=internal
#you can increment this number as you see fit
version=0.20-append-for-hbase
project.version=${version}
hadoop.version=${version}
hadoop-core.version=${version}
hadoop-hdfs.version=${version}
hadoop-mapred.version=${version}

The “version” key in build.properties also determines the names of the generated Hadoop JAR files. If, for instance, you set “version” to “0.20-append-for-hbase”, the build process will generate files named hadoop-core-0.20-append-for-hbase.jar etc. Basically, you can use any version identifier you like (though it helps if it makes sense).

The build.properties file should be placed (or available) in the hadoop-common top directory, i.e. hadoop-common/build.properties. You can either place the file there directly or you can follow the recommended approach, where you place the file in a parent directory and create a symlink to it. The latter approach is convenient if you also have checked out the repositories of the Hadoop sub-projects hadoop-hdfs and hadoop-mapreduce and thus want to use the same build.properties file for all three sub-projects.

$ pwd
/your/path/to/hadoop-common

# Create/edit the build.properties file
$ vi ../build.properties

# Create a symlink to it
$ ln -s ../build.properties build.properties

Now we are ready to compile Hadoop from source with ant. I used the command ant mvn-install as described on Git and Hadoop. The build itself should only take a few minutes. Be sure to run ant test as well (or only ant test-core if you’re lazy), but be aware that the tests take much longer than the build (two hours on my three-year-old MacBook Pro, for instance).

# Make sure we are using the branch-0.20-append sources
$ git checkout branch-0.20-append

# Run the build process
$ ant mvn-install

# Optional: run the full test suite or just the core test suite
$ ant test
$ ant test-core

If you want to re-run builds or build tests: By default, ant mvn-install places the build output into $HOME/.m2/repository. If you re-run the compile, you might want to remove the previous build output from $HOME/.m2/repository, e.g. via rm -rf $HOME/.m2/repository. You might also want to run ant clean-cache. For details, see Git and Hadoop.
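For illustration, such a cleanup could look like the following sketch. The org/apache/hadoop path inside the local Maven repository matches the layout shown in the build output section further below; adjust it to your setup, or simply remove the whole repository directory as mentioned above.

# Remove the previous Hadoop build output from the local Maven repository
$ rm -rf $HOME/.m2/repository/org/apache/hadoop

# Clear the caches used by the Hadoop build
$ ant clean-cache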

The build test fails, now what?

Now comes the more delicate part: If you run the build tests via ant test, you will notice that the build test process always fails! One consistent test error is reported by TestFileAppend4 and logged to the file build/test/TEST-org.apache.hadoop.hdfs.TestFileAppend4.txt. Here is a short excerpt of the test’s output (click here for a longer snippet):

2011-04-06 09:40:28,666 INFO  ipc.Server (Server.java:run(970)) - IPC Server handler 5 on 47574, call append(/bbw.test, DFSClient_1066000827) from 127.0.0.1:45323: error: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /bbw.test for DFSClient_1066000827 on client 127.0.0.1, because this file is already being created by DFSClient_-95621936 on 127.0.0.1
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1202)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1054)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1221)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:396)
        [...]
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:955)
2011-04-06 09:40:28,667 INFO  hdfs.TestFileAppend4 (TestFileAppend4.java:recoverFile(161)) - Failed open for append, waiting on lease recovery

[...]
Testcase: testRecoverFinalizedBlock took 5.555 sec
	Caused an ERROR
No lease on /testRecoverFinalized File is not open for writing. Holder DFSClient_1816717192 does not have any open files.
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /testRecoverFinalized File is not open for writing. Holder DFSClient_1816717192 does not have any open files.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1439)
        [...]
	at org.apache.hadoop.hdfs.TestFileAppend4$1.run(TestFileAppend4.java:636)

Fortunately, this error does not mean that the build is not working. From what we know, this is a problem with the unit tests in branch-0.20-append themselves (see also St.Ack’s comment on HBASE-3285). In other words, this test failure is a known issue that we can ignore for the moment. Phew!

Occasionally, you might run into other build failures and/or build errors. On my machine, for instance, I have also seen the following tests fail:

  • org.apache.hadoop.hdfs.server.namenode.TestEditLogRace (see error)
  • org.apache.hadoop.hdfs.TestMultiThreadedSync (see error)

I do not know what might cause these occasional errors — maybe it is a problem with the machine I am running the tests on. I am still working on this.

Frankly, what I wrote above may sound discomforting to you. At least it does to me. Still, the feedback I have received on the HBase mailing list indicates that the Hadoop 0.20-append build as done above is indeed correct.

Locate the build output (Hadoop JAR files)

By default, the build run via ant mvn-install places the generated Hadoop JAR files in $HOME/.m2/repository. You can find the actual JAR files with the following command.

$ find $HOME/.m2/repository -name "hadoop-*.jar"

.../repository/org/apache/hadoop/hadoop-examples/0.20-append-for-hbase/hadoop-examples-0.20-append-for-hbase.jar
.../repository/org/apache/hadoop/hadoop-test/0.20-append-for-hbase/hadoop-test-0.20-append-for-hbase.jar
.../repository/org/apache/hadoop/hadoop-tools/0.20-append-for-hbase/hadoop-tools-0.20-append-for-hbase.jar
.../repository/org/apache/hadoop/hadoop-streaming/0.20-append-for-hbase/hadoop-streaming-0.20-append-for-hbase.jar
.../repository/org/apache/hadoop/hadoop-core/0.20-append-for-hbase/hadoop-core-0.20-append-for-hbase.jar

Install your Hadoop 0.20-append build in your Hadoop cluster

The only thing left to do now is to install the Hadoop 0.20-append build in your cluster. This step is easy: simply replace the Hadoop JAR files of your existing installation of Hadoop 0.20.2 release with the ones you just created above. You will also have to replace the Hadoop core JAR file in your HBase 0.90.2 installation ($HBASE_HOME/lib/hadoop-core-0.20-append-r1056497.jar) with the Hadoop core JAR file you created above (hadoop-core-0.20-append-for-hbase.jar if you followed the instructions above).

Since this is such an important step, I will repeat it again: The Hadoop JAR files used by Hadoop itself and by HBase must match!
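As a rough sketch of what this replacement looks like on the command line: here $HADOOP_HOME and $HBASE_HOME are placeholders for your actual installation directories, and the JAR names assume the version identifier used in the build.properties example above.

# Sketch only -- replace the Hadoop core JAR bundled with HBase 0.90.2
$ rm $HBASE_HOME/lib/hadoop-core-0.20-append-r1056497.jar
$ cp $HOME/.m2/repository/org/apache/hadoop/hadoop-core/0.20-append-for-hbase/hadoop-core-0.20-append-for-hbase.jar \
     $HBASE_HOME/lib/

# The JAR files in $HADOOP_HOME must be replaced on every node of the
# cluster as well -- see the renaming notes in the next section first.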

Rename the build JAR files if you run Hadoop 0.20.2

Update 2011-06-11: The instructions in this section are NOT required if you are using the latest stable release, Hadoop 0.20.203.0.

Hadoop 0.20.2 release names its JAR files in the form hadoop-VERSION-PACKAGE.jar, e.g. hadoop-0.20.2-examples.jar. The build process above uses a different scheme, hadoop-PACKAGE-VERSION.jar, e.g. hadoop-examples-0.20-append-for-hbase.jar. You might therefore want to rename the JAR files you created in the previous section so that they match the naming scheme of Hadoop 0.20.2 release, as shown in the mapping and the small sketch below (otherwise the bin/hadoop script will not be able to add the Hadoop core JAR file to its CLASSPATH, and command examples such as hadoop jar hadoop-*-examples.jar pi 50 1000 in the Hadoop docs will not work as is).

# When you replace the Hadoop JAR files *in your Hadoop installation*,
# you might want to rename your Hadoop 0.20-append JAR files like so.
hadoop-examples-0.20-append-for-hbase.jar  --> hadoop-0.20-append-for-hbase-examples.jar
hadoop-test-0.20-append-for-hbase.jar      --> hadoop-0.20-append-for-hbase-test.jar
hadoop-tools-0.20-append-for-hbase.jar     --> hadoop-0.20-append-for-hbase-tools.jar
hadoop-streaming-0.20-append-for-hbase.jar --> hadoop-0.20-append-for-hbase-streaming.jar
hadoop-core-0.20-append-for-hbase.jar      --> hadoop-0.20-append-for-hbase-core.jar
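A minimal sketch of that renaming, assuming the freshly built JAR files sit in your current directory and you used the version identifier from the build.properties example above:

# Sketch: rename hadoop-PACKAGE-VERSION.jar to hadoop-VERSION-PACKAGE.jar
$ VERSION=0.20-append-for-hbase
$ for pkg in core examples test tools streaming; do mv "hadoop-${pkg}-${VERSION}.jar" "hadoop-${VERSION}-${pkg}.jar"; done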

In contrast, HBase uses the hadoop-PACKAGE-VERSION.jar scheme. So when you replace the Hadoop core JAR file shipped with HBase 0.90.2 in $HBASE_HOME/lib, you can here opt for leaving the name of the newly built Hadoop core JAR file as is.

Note for users running HBase 0.90.0 or 0.90.1: The Hadoop 0.20-append JAR files we created above are based on the tip of branch-0.20-append and thus use an RPC version of 43. This is ok for HBase 0.90.2 but it will cause problems for HBase 0.90.0 and 0.90.1. See HBASE-3520 or St.Ack’s comment for more information.

Maintaining your own version of Hadoop 0.20-append

If you must integrate additional patches into Hadoop 0.20.2 and/or Hadoop 0.20-append (normally in the form of backports of patches for Hadoop 0.21 or 0.22), you can create a local branch based on the Hadoop version you are interested in. Yes, this creates some extra effort on your part, so be sure to weigh the pros and cons of doing so.

Imagine, for instance, that you use Hadoop 0.20-append based on branch-0.20-append because you also want to run the latest stable release of HBase on your Hadoop cluster. While benchmarking and stress testing your cluster, you have unfortunately discovered a problem that you could track down to HDFS-611. A patch is available (you might have to do some tinkering to backport it), but it is not in the version of Hadoop you are running, i.e. it is not in the stock branch-0.20-append.

What you can do is create a local git branch based on your Hadoop version (here: branch-0.20-append) where you can integrate and test any relevant patches you need. Please understand that I will only describe the basic approach here — I do not go into detail on how to stay current with any changes to the Hadoop version you are tracking after you have followed the steps below. There are a lot of splendid git introductions, such as the Git Community Book, that explain this much better and more thoroughly than I am able to.

# Make sure we are in branch-0.20-append before running the next command
$ git checkout branch-0.20-append

# Create your own local branch based on the latest version (HEAD) of the official branch-0.20-append
$ git checkout -b branch-0.20-append-yourbranch

Verify that the two append branches are identical up to now.

# Verify that your local branch and the "official" (remote) branch are identical
$ git show-branch branch-0.20-append branch-0.20-append-yourbranch
! [branch-0.20-append] HDFS-1554. New semantics for recoverLease. Contributed by Hairong Kuang.
 * [branch-0.20-append-yourbranch] HDFS-1554. New semantics for recoverLease. Contributed by Hairong Kuang.
--
+* [branch-0.20-append] HDFS-1554. New semantics for recoverLease. Contributed by Hairong Kuang.

# Yep, they are.

Apply the relevant patch to your branch. In the example below, I apply a backport of the patch for HDFS-611 for branch-0.20-append via the file HDFS-611.branch-0.20-append.v1.patch. Note that this backport is not available on the HDFS-611 page — I created the backport myself based on the HDFS-611 patch for Hadoop 0.20.2 release (HDFS-611.branch-20.v6.patch).

# Apply the patch to your branch
$ patch -p1 < HDFS-611.branch-0.20-append.v1.patch

# Add any modified or newly created files from the patch to git's index
$ git add src/hdfs/org/apache/hadoop/hdfs/protocol/FSConstants.java \
         src/hdfs/org/apache/hadoop/hdfs/server/datanode/FSDataset.java \
         src/hdfs/org/apache/hadoop/hdfs/server/datanode/FSDatasetAsyncDiskService.java \
         src/test/org/apache/hadoop/hdfs/TestDFSRemove.java

# Commit the changes from the index to the repository
$ git commit -m "HDFS-611: Backport of HDFS-611 patch for Hadoop 0.20.2 release"

Verify that your patched branch is one commit ahead of the original (remote) append branch.

# Compare the commit histories of your local append branch and the original (remote) branch
$ git show-branch branch-0.20-append branch-0.20-append-yourbranch
! [branch-0.20-append] HDFS-1554. New semantics for recoverLease. Contributed by Hairong Kuang.
 * [branch-0.20-append-yourbranch] HDFS-611: Backport of HDFS-611 patch for Hadoop 0.20.2 release
--
 * [branch-0.20-append-yourbranch] HDFS-611: Backport of HDFS-611 patch for Hadoop 0.20.2 release
+* [branch-0.20-append] HDFS-1554. New semantics for recoverLease. Contributed by Hairong Kuang.

# Yep, it is exactly one commit ahead.

Voilà!

And by the way, if you want to see the commit differences between Hadoop 0.20.2 release, the official branch-0.20-append and your own, patched branch-0.20-append-yourbranch, run the following git command:

$ git show-branch release-0.20.2 branch-0.20-append branch-0.20-append-yourbranch

Conclusion

I hope this article helps you to build a Hadoop 0.20.x version for running HBase 0.90.2 in a production environment. Your feedback and comments are, as always, appreciated.