Source: https://ccp.cloudera.com/display/CDHDOC/HBase+Installation
Apache HBase provides large-scale tabular storage for Hadoop using the Hadoop Distributed File System (HDFS). Cloudera recommends installing HBase in a standalone mode before you try to run it on a whole cluster.
Upgrading HBase to the Latest CDH3 Release
The instructions that follow assume that you are upgrading HBase as part of an upgrade to the latest CDH3 release, and have already performed the steps under Upgrading CDH3.
To upgrade HBase to the latest CDH3 release, proceed as follows.
Step 1: Perform a Graceful Cluster Shutdown
To shut HBase down gracefully, stop the Thrift server and clients, then stop the cluster.
- Stop the Thrift server and clients
sudo service hadoop-hbase-thrift stop
- Stop the cluster.
- Use the following command on the master node:
sudo service hadoop-hbase-master stop
- Use the following command on each node hosting a region server:
sudo service hadoop-hbase-regionserver stop
This shuts down the master and the region servers gracefully.
Step 2. Stop the ZooKeeper Server
$ sudo service hadoop-zookeeper-server stop
Step 3: Install the new version of HBase
Follow the directions in the next section, Installing HBase.
Installing HBase
To install HBase on Ubuntu and other Debian systems:
$ sudo apt-get install hadoop-hbase
To install HBase On Red Hat-compatible systems:
$ sudo yum install hadoop-hbase
To install HBase on SUSE systems:
$ sudo zypper install hadoop-hbase
To list the installed files on Ubuntu and other Debian systems:
$ dpkg -L hadoop-hbase
To list the installed files on Red Hat and SUSE systems:
$ rpm -ql hadoop-hbase
You can see that the HBase package has been configured to conform to the Linux Filesystem Hierarchy Standard. (To learn more, run man hier.)
HBase wrapper script: /usr/bin/hbase
HBase configuration files: /etc/hbase/conf
HBase JAR and library files: /usr/lib/hbase
HBase log files: /var/log/hbase
HBase service scripts: /etc/init.d/hadoop-hbase-*
You are now ready to enable the server daemons you want to use with Hadoop. Java-based client access is also available by adding the JARs in /usr/lib/hbase/ and /usr/lib/hbase/lib/ to your Java class path.
Host Configuration Settings for HBase
Configuring the REST Port
You can use an init.d script, /etc/init.d/hadoop-hbase-rest, to start the REST server; for example:
/etc/init.d/hadoop-hbase-rest start
The script starts the server by default on port 8080. This is a commonly used port and so may conflict with other applications running on the same host.
If you need to change the port for the REST server, configure it in hbase-site.xml, for example:
<property>
<name>hbase.rest.port</name>
<value>60050</value>
</property>
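As a quick sanity check, the configured value can be pulled back out of the XML with standard shell tools. This is a minimal sketch that parses the property fragment above (inlined here as a here-document; on a real host, point sed at /etc/hbase/conf/hbase-site.xml instead):

```shell
# Minimal sketch: extract the configured REST port from an hbase-site.xml
# fragment. The fragment is inlined for illustration; replace the here-doc
# with /etc/hbase/conf/hbase-site.xml on a real host.
port=$(sed -n 's:.*<value>\([0-9][0-9]*\)</value>.*:\1:p' <<'EOF'
<property>
<name>hbase.rest.port</name>
<value>60050</value>
</property>
EOF
)
echo "REST port: ${port}"
```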
Using DNS with HBase
HBase uses the local hostname to report its IP address. Both forward and reverse DNS resolution should work. If your machine has multiple interfaces, HBase uses the interface that the primary hostname resolves to. If this is insufficient, you can set hbase.regionserver.dns.interface in the hbase-site.xml file to indicate the primary interface. For this setting to work properly, your cluster configuration must be consistent and every host must have the same network interface configuration. As an alternative, you can set hbase.regionserver.dns.nameserver in the hbase-site.xml file to choose a name server other than the system-wide default.
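A quick way to check the forward-resolution requirement is to compare the local hostname against what the resolver returns for it. A minimal sketch (getent is assumed to be available, as it is on most Linux systems):

```shell
# Sketch: verify that the local hostname resolves. On a correctly
# configured host, getent prints the address HBase will report.
host_name=$(hostname)
echo "hostname: ${host_name}"
getent hosts "${host_name}" || echo "warning: forward lookup failed for ${host_name}"
```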
Using the Network Time Protocol (NTP) with HBase
The clocks on cluster members should be in basic alignment. Some skew is tolerable, but excessive skew can cause odd behavior. Run NTP, or an equivalent, on your cluster. If you are having problems querying data or see unusual cluster operations, verify the system time.
Setting User Limits for HBase
Because HBase is a database, it uses a large number of files at the same time. The default ulimit setting of 1024 for the maximum number of open files on Unix systems is insufficient. Any significant amount of loading will result in failures in strange ways and cause the error message java.io.IOException...(Too many open files) to be logged in the HBase or HDFS log files. For more information about this issue, see the Apache HBase Book. You may also notice errors such as:
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6935524980745310745_1391901
Configuring ulimit for HBase
Cloudera recommends increasing the maximum number of file handles to more than 10,000. Note that increasing the file handles for the user who is running the HBase process is an operating system configuration, not an HBase configuration. A common mistake is to increase the file handles for a particular user while HBase is actually running as a different user. HBase prints the ulimit it is using on the first line of its logs; make sure that it is correct.
If you are using ulimit, you must make the following configuration changes:
- In the/etc/security/limits.conffile, add the following lines:
hdfs - nofile 32768
hbase - nofile 32768
- To apply the changes in /etc/security/limits.conf on Ubuntu and other Debian systems, add the following line to the /etc/pam.d/common-session file:
session required pam_limits.so
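After logging in again, you can confirm that the new limit took effect; it should match the value HBase prints on the first line of its logs. A minimal sketch:

```shell
# Check the open-file limit for the current shell. After the
# limits.conf change above (and a fresh login), the hdfs and hbase
# users should see 32768 here.
current_limit=$(ulimit -n)
echo "open files limit: ${current_limit}"
```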
Using dfs.datanode.max.xcievers with HBase
A Hadoop HDFS DataNode has an upper bound on the number of files that it can serve at any one time. The upper bound property is called dfs.datanode.max.xcievers (the property is spelled in the code exactly as shown here). Before loading, make sure you have configured the value for dfs.datanode.max.xcievers in the conf/hdfs-site.xml file to at least 4096, as shown below:
<property>
<name>dfs.datanode.max.xcievers</name>
<value>4096</value>
</property>
Be sure to restart HDFS after changing the value for dfs.datanode.max.xcievers. If you don't change the value as described, strange failures can occur and an error message about exceeding the number of xcievers will be added to the DataNode logs. Other error messages about missing blocks are also logged, such as:
10/12/08 20:10:31 INFO hdfs.DFSClient: Could not obtain block blk_XXXXXXXXXXXXXXXXXXXXXX_YYYYYYYY from any node:
java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
Starting HBase in Standalone Mode
By default, HBase ships configured for standalone mode. In this mode of operation, a single JVM hosts the HBase Master, an HBase Region Server, and a ZooKeeper quorum peer. To run HBase in standalone mode, you must install the HBase Master package:
Installing the HBase Master for Standalone Operation
To install the HBase Master on Ubuntu and other Debian systems:
$ sudo apt-get install hadoop-hbase-master
To install the HBase Master On Red Hat-compatible systems:
$ sudo yum install hadoop-hbase-master
To install the HBase Master on SUSE systems:
$ sudo zypper install hadoop-hbase-master
Starting the HBase Master
On Red Hat and SUSE systems (using .rpm packages), you can now start the HBase Master by using the included service script:
$ sudo /etc/init.d/hadoop-hbase-master start
On Ubuntu systems (using Debian packages) the HBase Master starts when the HBase package is installed.
To verify that the standalone installation is operational, visit http://localhost:60010. The list of Region Servers at the bottom of the page should include one entry for your local machine.
If you see this message when you start the HBase standalone master:
Starting Hadoop HBase master daemon: starting master, logging to /usr/lib/hbase/logs/hbase-hbase-master/cloudera-vm.out
Couldnt start ZK at requested address of 2181, instead got: 2182. Aborting. Why? Because clients (eg shell) wont be able to find this ZK quorum
hbase-master.
you will need to stop the hadoop-zookeeper-server service or uninstall the hadoop-zookeeper-server package.
Accessing HBase by using the HBase Shell
After you have started the standalone installation, you can access the database by using the HBase Shell:
$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version: 0.89.20100621+17, r, Mon Jun 28 10:13:32 PDT 2010
hbase(main):001:0> status 'detailed'
version 0.89.20100621+17
0 regionsInTransition
1 live servers
my-machine:59719 1277750189913
requests=0, regions=2, usedHeap=24, maxHeap=995
.META.,,1
stores=2, storefiles=0, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0
-ROOT-,,0
stores=1, storefiles=1, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0
0 dead servers
Using MapReduce with HBase
To run MapReduce jobs that use HBase, you need to add the HBase and Zookeeper JAR files to the Hadoop Java classpath. You can do this by adding the following statement to each job:
TableMapReduceUtil.addDependencyJars(job);
This distributes the JAR files to the cluster along with your job and adds them to the job's classpath, so that you do not need to edit the MapReduce configuration.
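If you prefer not to call addDependencyJars from the job, the manual alternative is to put the JARs on the Hadoop classpath before submitting. A minimal sketch, assuming the default CDH install location from this guide (the job JAR name in the comment is hypothetical):

```shell
# Sketch: export the HBase JARs (and bundled dependencies such as the
# ZooKeeper JAR) on the Hadoop classpath. /usr/lib/hbase is the CDH
# default location; adjust for your layout.
HBASE_HOME=/usr/lib/hbase
export HADOOP_CLASSPATH="${HBASE_HOME}/*:${HBASE_HOME}/lib/*:${HADOOP_CLASSPATH}"
echo "HADOOP_CLASSPATH=${HADOOP_CLASSPATH}"
# hadoop jar my-hbase-job.jar MyJobClass   # hypothetical job invocation
```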
You can find more information about addDependencyJars in the HBase API documentation.
When getting a Configuration object for an HBase MapReduce job, instantiate it using the HBaseConfiguration.create() method.
Configuring HBase in Pseudo-distributed Mode
Pseudo-distributed mode differs from standalone mode in that each of the component processes runs in a separate JVM.
Modifying the HBase Configuration
To enable pseudo-distributed mode, you must first make some configuration changes. Open /etc/hbase/conf/hbase-site.xml in your editor of choice, and insert the following XML properties between the <configuration> and </configuration> tags. Be sure to replace localhost with the hostname of your HDFS NameNode if it is not running locally.
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost/hbase</value>
</property>
Creating the /hbase Directory in HDFS
Before starting the HBase Master, you need to create the /hbase directory in HDFS. The HBase Master runs as hbase:hbase, so it does not have the required permissions to create a top-level directory.
To create the /hbase directory in HDFS:
$ sudo -u hdfs hadoop fs -mkdir /hbase
$ sudo -u hdfs hadoop fs -chown hbase /hbase
Enabling Servers for Pseudo-distributed Operation
After you have configured HBase, you must enable the various servers that make up a distributed HBase cluster. HBase uses three required types of servers: the ZooKeeper Server, the HBase Master, and HBase Region Servers.
Installing and Starting ZooKeeper Server
HBase uses ZooKeeper Server as a highly available, central location for cluster management. For example, it allows clients to locate the servers, and ensures that only one master is active at a time. For a small cluster, running a ZooKeeper node colocated with the NameNode is recommended. For larger clusters, contact Cloudera Support for configuration help.
Install and start the ZooKeeper Server in standalone mode by running the commands shown in the "Installing the ZooKeeper Server Package on a Single Server" section of ZooKeeper Installation.
Starting the HBase Master
After ZooKeeper is running, you can start the HBase master in standalone mode.
$ sudo /etc/init.d/hadoop-hbase-master start
Starting an HBase Region Server
The Region Server is the part of HBase that actually hosts data and processes requests. The Region Server typically runs on all of the slave nodes in a cluster, but not on the master node.
To enable the HBase Region Server on Ubuntu and other Debian systems:
$ sudo apt-get install hadoop-hbase-regionserver
To enable the HBase Region Server On Red Hat-compatible systems:
$ sudo yum install hadoop-hbase-regionserver
To enable the HBase Region Server on SUSE systems:
$ sudo zypper install hadoop-hbase-regionserver
To start the Region Server:
$ sudo /etc/init.d/hadoop-hbase-regionserver start
Verifying the Pseudo-Distributed Operation
After you have started ZooKeeper, the Master, and a Region Server, the pseudo-distributed cluster should be up and running. You can verify that each of the daemons is running using the jps tool from the Oracle JDK. If you are running a pseudo-distributed HDFS installation and a pseudo-distributed HBase installation on one machine, jps will show the following output:
$ sudo jps
32694 Jps
30674 HRegionServer
29496 HMaster
28781 DataNode
28422 NameNode
30348 QuorumPeerMain
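The check above can also be scripted. This sketch greps captured jps output for the expected daemon names; the sample output is inlined for illustration, and in practice you would capture it with jps_output=$(sudo jps):

```shell
# Sketch: confirm the expected daemons appear in jps output.
# Sample output inlined; on a real host, capture it from `sudo jps`.
jps_output='30674 HRegionServer
29496 HMaster
28781 DataNode
28422 NameNode
30348 QuorumPeerMain'
for daemon in HMaster HRegionServer QuorumPeerMain NameNode DataNode; do
  if echo "${jps_output}" | grep -q "${daemon}\$"; then
    echo "${daemon}: running"
  else
    echo "${daemon}: MISSING"
  fi
done
```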
You should also be able to navigate to http://localhost:60010 and verify that the local region server has registered with the master.
Installing the HBase Thrift Server
The HBase Thrift Server is an alternative gateway for accessing the HBase server. Thrift mirrors most of the HBase client APIs while enabling popular programming languages to interact with HBase. The Thrift Server is multi-platform and performs better than REST in many situations. Thrift can be collocated with the region servers, but should not be collocated with the NameNode or the JobTracker. For more information about Thrift, visit http://incubator.apache.org/thrift/.
To enable the HBase Thrift Server on Ubuntu and other Debian systems:
$ sudo apt-get install hadoop-hbase-thrift
To enable the HBase Thrift Server On Red Hat-compatible systems:
$ sudo yum install hadoop-hbase-thrift
To enable the HBase Thrift Server on SUSE systems:
$ sudo zypper install hadoop-hbase-thrift
Deploying HBase in a Distributed Cluster
After you have HBase running in pseudo-distributed mode, you can extend the same configuration to run on a distributed cluster.
Choosing where to Deploy the Processes
For small clusters, Cloudera recommends designating one node in your cluster as the master node. On this node, you will typically run the HBase Master and a ZooKeeper quorum peer. These master processes may be collocated with the Hadoop NameNode and JobTracker for small clusters.
Designate the remaining nodes as slave nodes. On each node, Cloudera recommends running a Region Server, which may be collocated with a Hadoop TaskTracker and a DataNode. When collocating with TaskTrackers, be sure that the resources of the machine are not oversubscribed; it's safest to start with a small number of MapReduce slots and work up slowly.
Configuring for Distributed Operation
After you have decided which machines will run each process, you can edit the configuration so that the nodes can locate each other. To do so, make sure that the configuration files are synchronized across the cluster. Cloudera strongly recommends using a configuration management system to synchronize the configuration files, though you can use a simpler solution such as rsync to get started quickly.
The only configuration change necessary to move from pseudo-distributed operation to fully-distributed operation is the addition of the ZooKeeper Quorum address inhbase-site.xml. Insert the following XML property to configure the nodes with the address of the node where the ZooKeeper quorum peer is running:
<property>
<name>hbase.zookeeper.quorum</name>
<value>mymasternode</value>
</property>
To start the cluster, start the services in the following order:
- The ZooKeeper Quorum Peer
- The HBase Master
- Each of the HBase Region Servers
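Scripted, the start order looks like the following dry-run sketch. The hostname mymasternode comes from the configuration example above; the slave names are hypothetical, and the echoed lines show the commands you would run over ssh or through your configuration management tool:

```shell
# Dry-run sketch of the cluster start order. Slave hostnames are
# hypothetical; remove the echo wrappers to actually execute over ssh.
MASTER=mymasternode
SLAVES="slave1 slave2 slave3"
echo "ssh ${MASTER} sudo service hadoop-zookeeper-server start"
echo "ssh ${MASTER} sudo service hadoop-hbase-master start"
for slave in ${SLAVES}; do
  echo "ssh ${slave} sudo service hadoop-hbase-regionserver start"
done
```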
After the cluster is fully started, you can view the HBase Master web interface on port 60010 and verify that each of the slave nodes has registered properly with the master.
Troubleshooting
The Cloudera packages of HBase have been configured to place logs in /var/log/hbase. While getting started, Cloudera recommends tailing these logs to note any error messages or failures.
Viewing the HBase Documentation
For additional HBase documentation, see http://archive.cloudera.com/cdh/3/hbase/.