HBase Installation

Source: https://ccp.cloudera.com/display/CDHDOC/HBase+Installation

 

 


Apache HBase provides large-scale tabular storage for Hadoop using the Hadoop Distributed File System (HDFS). Cloudera recommends installing HBase in standalone mode before you try to run it on a whole cluster.

Important
If you have not already done so, install Cloudera's yum, zypper/YaST, or apt repository before using the following commands to install or upgrade HBase. For instructions, see CDH3 Installation.

Upgrading HBase to the Latest CDH3 Release

Note
To see which version of HBase is shipping in the latest CDH3 release, check the Version and Packaging Information. For important information on new and changed components, see the Release Notes.

The instructions that follow assume that you are upgrading HBase as part of an upgrade to the latest CDH3 release, and have already performed the steps under Upgrading CDH3.

To upgrade HBase to the latest CDH3 release, proceed as follows.

Warning
You must shut down the HBase, Thrift, and ZooKeeper processes as shown below. If these processes are running during the upgrade, the new version will not work correctly.

Step 1: Perform a Graceful Cluster Shutdown

To shut HBase down gracefully, stop the Thrift server and clients, then stop the cluster.

  1. Stop the Thrift server and clients
    sudo service hadoop-hbase-thrift stop
  2. Stop the cluster.
    1. Use the following command on the master node:
      sudo service hadoop-hbase-master stop
    2. Use the following command on each node hosting a region server:
      sudo service hadoop-hbase-regionserver stop

This shuts down the master and the region servers gracefully.

Step 2: Stop the ZooKeeper Server

$ sudo service hadoop-zookeeper-server stop

Note
Depending on your platform and release, you may need to use:

$ sudo /sbin/service hadoop-zookeeper-server stop

or

$ sudo /sbin/service hadoop-zookeeper stop

Step 3: Install the New Version of HBase

Note
You may want to take this opportunity to upgrade ZooKeeper, but you do not have to upgrade ZooKeeper before upgrading HBase; the new version of HBase will run with the older version of ZooKeeper. For instructions on upgrading ZooKeeper, see Upgrading ZooKeeper to the Latest CDH3 Release.

Follow the directions in the next section, Installing HBase.

Installing HBase

To install HBase on Ubuntu and other Debian systems:

$ sudo apt-get install hadoop-hbase

To install HBase on Red Hat-compatible systems:

$ sudo yum install hadoop-hbase

To install HBase on SUSE systems:

$ sudo zypper install hadoop-hbase

To list the installed files on Ubuntu and other Debian systems:

$ dpkg -L hadoop-hbase

To list the installed files on Red Hat and SUSE systems:

$ rpm -ql hadoop-hbase

You can see that the HBase package has been configured to conform to the Linux Filesystem Hierarchy Standard. (To learn more, run man hier.)

HBase wrapper script         /usr/bin/hbase
HBase configuration files    /etc/hbase/conf
HBase JAR and library files  /usr/lib/hbase
HBase log files              /var/log/hbase
HBase service scripts        /etc/init.d/hadoop-hbase-*

 

You are now ready to enable the server daemons you want to use with Hadoop. Java-based client access is also available by adding the jars in /usr/lib/hbase/ and /usr/lib/hbase/lib/ to your Java class path.

Host Configuration Settings for HBase

Configuring the REST Port

You can use an init.d script, /etc/init.d/hadoop-hbase-rest, to start the REST server; for example:

/etc/init.d/hadoop-hbase-rest start

The script starts the server by default on port 8080. This is a commonly used port and so may conflict with other applications running on the same host.

If you need to change the port for the REST server, configure it in hbase-site.xml, for example:

<property>
<name>hbase.rest.port</name>
<value>60050</value>
</property>

Note
You can use HBASE_REST_OPTS in hbase-env.sh to pass other settings (such as heap size and GC parameters) to the REST server JVM.

Using DNS with HBase

HBase uses the local hostname to report its IP address. Both forward and reverse DNS resolution should work. If your machine has multiple interfaces, HBase uses the interface that the primary hostname resolves to. If this is insufficient, you can set hbase.regionserver.dns.interface in the hbase-site.xml file to indicate the primary interface. To work properly, this setting requires that your cluster configuration is consistent and that every host has the same network interface configuration. As an alternative, you can set hbase.regionserver.dns.nameserver in the hbase-site.xml file to choose a different name server than the system-wide default.

Using the Network Time Protocol (NTP) with HBase

The clocks on cluster members should be in basic alignment. Some skew is tolerable, but excessive skew can cause odd behavior. Run NTP, or an equivalent, on your cluster. If you have problems querying data or see unusual cluster behavior, verify the system time on each node.

Setting User Limits for HBase

Because HBase is a database, it uses a lot of files at the same time. The default ulimit setting of 1024 for the maximum number of open files on Unix systems is insufficient. Any significant load causes failures that can be hard to diagnose, and the error message java.io.IOException...(Too many open files) is logged in the HBase or HDFS log files. For more information about this issue, see the Apache HBase Book. You may also notice errors such as:

2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6935524980745310745_1391901

Configuring ulimit for HBase

Cloudera recommends increasing the maximum number of file handles to more than 10,000. Note that increasing the file handles for the user who is running the HBase process is an operating system configuration, not an HBase configuration. A common mistake is to increase the number of file handles for one user while HBase actually runs as a different user. HBase prints the ulimit it is using on the first line of its logs; make sure that the value is correct.

If you are using ulimit, you must make the following configuration changes:

  1. In the /etc/security/limits.conf file, add the following lines:
    hdfs - nofile 32768
    hbase - nofile 32768
    Note
    Only the root user can edit this file.
  2. To apply the changes in /etc/security/limits.conf on Ubuntu and other Debian systems, add the following line to the /etc/pam.d/common-session file:
    session required pam_limits.so
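
To confirm that the new limit is actually in effect for the account that runs HBase, you can compare the value reported by ulimit -n with the recommended minimum. The following is a minimal sketch; check_nofile is a hypothetical helper name, and the default threshold of 10000 simply mirrors the recommendation above.

```shell
# check_nofile LIMIT [MINIMUM]
# Succeeds when LIMIT (as printed by `ulimit -n`) is greater than MINIMUM.
# check_nofile is a hypothetical helper; the default minimum of 10000
# follows the recommendation in this section.
check_nofile() {
    nofile_limit=$1
    nofile_minimum=${2:-10000}
    # `ulimit -n` prints the word "unlimited" when no cap is set.
    if [ "$nofile_limit" = "unlimited" ]; then
        echo "ok: open-file limit is unlimited"
        return 0
    fi
    if [ "$nofile_limit" -gt "$nofile_minimum" ]; then
        echo "ok: open-file limit is $nofile_limit"
        return 0
    fi
    echo "too low: open-file limit is $nofile_limit, want more than $nofile_minimum"
    return 1
}
```

For example, running check_nofile "$(ulimit -n)" in a shell owned by the hbase user reports whether the setting from limits.conf has been applied to that user.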

Using dfs.datanode.max.xcievers with HBase

A Hadoop HDFS DataNode has an upper bound on the number of files that it can serve at any one time. The upper bound property is called dfs.datanode.max.xcievers (the property is spelled in the code exactly as shown here). Before loading, make sure you have configured the value for dfs.datanode.max.xcievers in the conf/hdfs-site.xml file to at least 4096 as shown below:

<property>
<name>dfs.datanode.max.xcievers</name>
<value>4096</value>
</property>

Be sure to restart HDFS after changing the value for dfs.datanode.max.xcievers. If you do not change the value as described, strange failures can occur and an error message about exceeding the number of xcievers is added to the DataNode logs. Other error messages about missing blocks are also logged, such as:

10/12/08 20:10:31 INFO hdfs.DFSClient: Could not obtain block blk_XXXXXXXXXXXXXXXXXXXXXX_YYYYYYYY from any node:
java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
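
Before restarting HDFS, it can help to confirm which value the configuration file actually contains. The following is a rough sketch, not a general XML parser: xml_property_value is a hypothetical helper, and it assumes each <name> and <value> element sits on its own line, as in the snippet above.

```shell
# xml_property_value FILE NAME
# Prints the <value> that follows <name>NAME</name> in a Hadoop-style
# configuration file, or nothing if the property is absent.
# Hypothetical helper; assumes one element per line.
xml_property_value() {
    awk -v name="$2" '
        index($0, "<name>" name "</name>") { found = 1; next }
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$1"
}
```

For example, xml_property_value /etc/hadoop/conf/hdfs-site.xml dfs.datanode.max.xcievers should print 4096 or more; the /etc/hadoop/conf path is an assumption about where your hdfs-site.xml lives.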

Starting HBase in Standalone Mode

By default, HBase ships configured for standalone mode. In this mode of operation, a single JVM hosts the HBase Master, an HBase Region Server, and a ZooKeeper quorum peer. In order to run HBase in standalone mode, you must install the HBase Master package.

Installing the HBase Master for Standalone Operation

To install the HBase Master on Ubuntu and other Debian systems:

$ sudo apt-get install hadoop-hbase-master

To install the HBase Master on Red Hat-compatible systems:

$ sudo yum install hadoop-hbase-master

To install the HBase Master on SUSE systems:

$ sudo zypper install hadoop-hbase-master

Starting the HBase Master

On Red Hat and SUSE systems (using .rpm packages), you can now start the HBase Master by using the included service script:

$ sudo /etc/init.d/hadoop-hbase-master start

On Ubuntu systems (using Debian packages) the HBase Master starts when the HBase package is installed.

To verify that the standalone installation is operational, visit http://localhost:60010. The list of Region Servers at the bottom of the page should include one entry for your local machine.

Note
Although you just started the master process, in standalone mode this same process is also internally running a region server and a ZooKeeper peer. In the next section, you will break out these components into separate JVMs.

If you see this message when you start the HBase standalone master:

Starting Hadoop HBase master daemon: starting master, logging to /usr/lib/hbase/logs/hbase-hbase-master/cloudera-vm.out
Couldnt start ZK at requested address of 2181, instead got: 2182. Aborting. Why? Because clients (eg shell) wont be able to find this ZK quorum
hbase-master.

you will need to stop the hadoop-zookeeper-server service or uninstall the hadoop-zookeeper-server package, because a separate ZooKeeper server is already using port 2181.

Accessing HBase by using the HBase Shell

After you have started the standalone installation, you can access the database by using the HBase Shell:

$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version: 0.89.20100621+17, r, Mon Jun 28 10:13:32 PDT 2010
 
hbase(main):001:0> status 'detailed'
version 0.89.20100621+17
0 regionsInTransition
1 live servers
my-machine:59719 1277750189913
requests=0, regions=2, usedHeap=24, maxHeap=995
.META.,,1
stores=2, storefiles=0, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0
-ROOT-,,0
stores=1, storefiles=1, storefileSizeMB=0, memstoreSizeMB=0, storefileIndexSizeMB=0
0 dead servers

Using MapReduce with HBase

To run MapReduce jobs that use HBase, you need to add the HBase and ZooKeeper JAR files to the Hadoop Java classpath. You can do this by adding the following statement to each job:

TableMapReduceUtil.addDependencyJars(job);

This distributes the JAR files to the cluster along with your job and adds them to the job's classpath, so that you do not need to edit the MapReduce configuration.

For more information about addDependencyJars, see the TableMapReduceUtil API documentation.

When you need a Configuration object for an HBase MapReduce job, create it with the HBaseConfiguration.create() method.

Configuring HBase in Pseudo-distributed Mode

Pseudo-distributed mode differs from standalone mode in that each of the component processes runs in a separate JVM.

Note
If the HBase master is already running in standalone mode, stop it by running /etc/init.d/hadoop-hbase-master stop before continuing with pseudo-distributed configuration.

Modifying the HBase Configuration

To enable pseudo-distributed mode, you must first make some configuration changes. Open /etc/hbase/conf/hbase-site.xml in your editor of choice, and insert the following XML properties between the <configuration> and </configuration> tags. Be sure to replace localhost with the host name of your HDFS NameNode if it is not running locally.

<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost/hbase</value>
</property>

Creating the /hbase Directory in HDFS

Before starting the HBase Master, you need to create the /hbase directory in HDFS. The HBase master runs as hbase:hbase, so it does not have the required permissions to create a top-level directory.

To create the /hbase directory in HDFS:

$ sudo -u hdfs hadoop fs -mkdir /hbase
$ sudo -u hdfs hadoop fs -chown hbase /hbase

Enabling Servers for Pseudo-distributed Operation

After you have configured HBase, you must enable the various servers that make up a distributed HBase cluster. HBase requires three types of servers: a ZooKeeper server, the HBase Master, and one or more HBase Region Servers.

Installing and Starting ZooKeeper Server

HBase uses ZooKeeper Server as a highly available, central location for cluster management. For example, it allows clients to locate the servers, and ensures that only one master is active at a time. For a small cluster, running a ZooKeeper node colocated with the NameNode is recommended. For larger clusters, contact Cloudera Support for configuration help.

Install and start the ZooKeeper Server in standalone mode by running the commands shown in the "Installing the ZooKeeper Server Package on a Single Server" section of ZooKeeper Installation.

Starting the HBase Master

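After ZooKeeper is running, you can start the HBase Master. Because hbase.cluster.distributed is now true, the Master no longer hosts the Region Server and ZooKeeper peer itself.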

$ sudo /etc/init.d/hadoop-hbase-master start

Starting an HBase Region Server

The Region Server is the part of HBase that actually hosts data and processes requests. The region server typically runs on all of the slave nodes in a cluster, but not the master node.

To enable the HBase Region Server on Ubuntu and other Debian systems:

$ sudo apt-get install hadoop-hbase-regionserver

To enable the HBase Region Server on Red Hat-compatible systems:

$ sudo yum install hadoop-hbase-regionserver

To enable the HBase Region Server on SUSE systems:

$ sudo zypper install hadoop-hbase-regionserver

To start the Region Server:

$ sudo /etc/init.d/hadoop-hbase-regionserver start

Verifying the Pseudo-Distributed Operation

After you have started ZooKeeper, the Master, and a Region Server, the pseudo-distributed cluster should be up and running. You can verify that each of the daemons is running by using the jps tool, which ships with the Oracle JDK. If you are running a pseudo-distributed HDFS installation and a pseudo-distributed HBase installation on one machine, jps shows output similar to the following (the process IDs will differ):

$ sudo jps
32694 Jps
30674 HRegionServer
29496 HMaster
28781 DataNode
28422 NameNode
30348 QuorumPeerMain
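
If you want to script this check rather than read the listing by eye, the jps output can be scanned for the three HBase-related daemons. This is a minimal sketch; missing_hbase_daemons is a hypothetical helper, and the class names it looks for are the ones shown in the listing above.

```shell
# missing_hbase_daemons reads `jps` output on standard input and prints
# the name of every expected daemon that is absent from it.
# Hypothetical helper; prints nothing when all three daemons are present.
missing_hbase_daemons() {
    jps_output=$(cat)
    for daemon in QuorumPeerMain HMaster HRegionServer; do
        # The second column of each jps line is the main class name.
        printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
            echo "$daemon"
    done
}
```

For example, sudo jps | missing_hbase_daemons prints nothing when ZooKeeper, the Master, and the Region Server are all running.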

You should also be able to navigate to http://localhost:60010 and verify that the local region server has registered with the master.

Installing the HBase Thrift Server

The HBase Thrift Server is an alternative gateway for accessing the HBase server. Thrift mirrors most of the HBase client APIs while enabling popular programming languages to interact with HBase. The Thrift Server is multi-platform and performs better than REST in many situations. Thrift can run on the same nodes as the region servers, but should not be collocated with the NameNode or the JobTracker. For more information about Thrift, visit http://incubator.apache.org/thrift/.

To enable the HBase Thrift Server on Ubuntu and other Debian systems:

$ sudo apt-get install hadoop-hbase-thrift

To enable the HBase Thrift Server on Red Hat-compatible systems:

$ sudo yum install hadoop-hbase-thrift

To enable the HBase Thrift Server on SUSE systems:

$ sudo zypper install hadoop-hbase-thrift

Deploying HBase in a Distributed Cluster

After you have HBase running in pseudo-distributed mode, the same configuration can be extended to running on a distributed cluster.

Choosing Where to Deploy the Processes

For small clusters, Cloudera recommends designating one node in your cluster as the master node. On this node, you will typically run the HBase Master and a ZooKeeper quorum peer. These master processes may be collocated with the Hadoop NameNode and JobTracker for small clusters.

Designate the remaining nodes as slave nodes. On each node, Cloudera recommends running a Region Server, which may be collocated with a Hadoop TaskTracker and a DataNode. When collocating with TaskTrackers, be sure that the resources of the machine are not oversubscribed – it's safest to start with a small number of MapReduce slots and work up slowly.

Configuring for Distributed Operation

After you have decided which machines will run each process, you can edit the configuration so that the nodes may locate each other. In order to do so, you should make sure that the configuration files are synchronized across the cluster. Cloudera strongly recommends the use of a configuration management system to synchronize the configuration files, though you can use a simpler solution such as rsync to get started quickly.

The only configuration change necessary to move from pseudo-distributed operation to fully-distributed operation is the addition of the ZooKeeper Quorum address in hbase-site.xml. Insert the following XML property to configure the nodes with the address of the node where the ZooKeeper quorum peer is running:

<property>
<name>hbase.zookeeper.quorum</name>
<value>mymasternode</value>
</property>

To start the cluster, start the services in the following order:

  1. The ZooKeeper Quorum Peer
  2. The HBase Master
  3. Each of the HBase Region Servers
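
The start order above can be written down as a short plan for a given set of hosts. This is only a sketch under several assumptions: hbase_start_commands is a hypothetical helper, the host names are placeholders, the services are assumed to start with the same service names used elsewhere in this guide, and the function prints the commands instead of running them so that you can review them first.

```shell
# hbase_start_commands ZK_HOST MASTER_HOST REGIONSERVER_HOST...
# Prints one ssh command per service, in the required start order:
# ZooKeeper quorum peer, then HBase Master, then each Region Server.
# Hypothetical helper; it does not execute anything.
hbase_start_commands() {
    start_zk_host=$1
    start_master_host=$2
    shift 2
    echo "ssh $start_zk_host sudo service hadoop-zookeeper-server start"
    echo "ssh $start_master_host sudo service hadoop-hbase-master start"
    for start_rs_host in "$@"; do
        echo "ssh $start_rs_host sudo service hadoop-hbase-regionserver start"
    done
}
```

For example, hbase_start_commands mymasternode mymasternode slave1 slave2 prints four commands, one per line, where slave1 and slave2 stand in for your own region server hosts.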

After the cluster is fully started, you can view the HBase Master web interface on port 60010 and verify that each of the slave nodes has registered properly with the master.

Troubleshooting

The Cloudera packages of HBase have been configured to place logs in /var/log/hbase. While getting started, Cloudera recommends tailing these logs to note any error messages or failures.
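
Besides tailing the logs, a quick count of serious entries per file can show where to look first. The following is a minimal sketch; count_log_problems is a hypothetical helper, and it assumes the log level appears as a separate word, as in the DFSClient messages quoted earlier in this guide.

```shell
# count_log_problems FILE
# Prints the number of lines in FILE logged at ERROR or FATAL level.
# Hypothetical helper; `grep -c` exits non-zero when the count is 0,
# so `|| true` keeps the function's exit status successful.
count_log_problems() {
    grep -c -E ' (ERROR|FATAL) ' "$1" || true
}
```

For example, for f in /var/log/hbase/*.log; do echo "$f: $(count_log_problems "$f")"; done prints one count per file; the *.log naming pattern is an assumption about how the files in that directory are named.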

Viewing the HBase Documentation

For additional HBase documentation, see http://archive.cloudera.com/cdh/3/hbase/.

 
