Some important optimization advice for HBase 0.94.x

The following gives you a list to run through when you encounter problems with your cluster setup.

1. Basic setup checklist

This section provides a checklist of things you should confirm for your cluster, before going into a deeper analysis in case of problems or performance issues.

File handles.

HBase is a database, so it uses a lot of files at the same time. The default ulimit -n of 1024 on most Unix or other Unix-like systems is insufficient. Any significant amount of loading will lead to I/O errors stating the obvious: java.io.IOException: Too many open files. You may also notice errors such as the following:

2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException

2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6935524980745310745_1391901 

The ulimit -n for the DataNode processes and the HBase processes should be set high. To verify the current ulimit setting you can also run the following:

$ cat /proc/<PID of JVM>/limits

You should see that the limit on the number of files is set reasonably high; it is safest to just bump this up to 32000, or even more. “File handles and process limits” on page 49 has the full details on how to configure this value.

TODO: I found that the value reported here can differ from the one set in the OS; the reported limit was always 4096 (i.e. four times the OS's value).
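On Linux the limit is usually raised in /etc/security/limits.conf. A minimal sketch, assuming the HDFS and HBase daemons run as users named hadoop and hbase (adjust the user names and the value to your environment):

hadoop  -  nofile  32768
hbase   -  nofile  32768

After logging in again (or restarting the daemons), ulimit -n and the /proc/<PID>/limits check above should both report the new value.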

DataNode connections (dfs.datanode.max.xcievers)

The DataNodes should be configured with a large number of transceivers: at least 4,096, but potentially more. There is no particular harm in setting it as high as 16,000 or so. See “Datanode handlers” on page 51 for more information.
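For reference, this is set on the DataNodes in hdfs-site.xml; a minimal fragment, using the suggested minimum value:

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>

Note that the property name really is spelled “xcievers”, and the DataNodes have to be restarted for the change to take effect.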

Not having this configuration in place makes for strange-looking failures. Eventually you will see a complaint in the DataNode logs about the xcievers limit being exceeded, but leading up to that, one common manifestation is complaints about missing blocks. For example:

10/12/08 20:10:31 INFO hdfs.DFSClient: Could not obtain block blk_XXXXXXXXXXXXXXXXXXXXXX_YYYYYYYY from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry... 

Compression. Compression should almost always be on, unless you are storing precompressed data. “Compression” on page 424 discusses the details. Make sure that you have verified the installation so that all region servers can load the required compression libraries. If not, you will see errors like this:

hbase(main):007:0> create 'testtable', { NAME => 'colfam1', COMPRESSION => 'LZO' }

ERROR: org.apache.hadoop.hbase.client.NoServerForRegionException: \
No server address listed in .META. for region \
testtable2,,1309713043529.8ec02f811f75d2178ad098dc40b4efcf.

In the logfiles of the servers, you will see the root cause for this problem (abbreviated and line-wrapped to fit the available width):

2011-07-03 19:10:43,725 INFO org.apache.hadoop.hbase.regionserver.HRegion: \ Setting up tabledescriptor config now ...

2011-07-03 19:10:43,725 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: \ Instantiated testtable,,1309713043529.8ec02f811f75d2178ad098dc40b4efcf.

2011-07-03 19:10:43,839 ERROR org.apache.hadoop.hbase.regionserver.handler. \ OpenRegionHandler: Failed open of region=testtable,,1309713043529. \ 8ec02f811f75d2178ad098dc40b4efcf.
java.io.IOException: java.lang.RuntimeException: \

java.lang.ClassNotFoundException: com.hadoop.compression.lzo.LzoCodec
at org.apache.hadoop.hbase.util.CompressionTest.testCompression
at org.apache.hadoop.hbase.regionserver.HRegion.checkCompressionCodecs ...

The missing compression library triggers an error when the region server tries to open the region with the column family configured to use LZO compression.
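One way to verify the setup on every region server is the CompressionTest tool that ships with HBase; a sketch, assuming LZO and an arbitrary scratch path:

$ hbase org.apache.hadoop.hbase.util.CompressionTest file:///tmp/compressiontest.txt lzo

If the LZO libraries are missing, this fails with the same ClassNotFoundException as shown in the log above.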

Garbage collection/memory tuning. We discussed the common Java garbage collector settings in “Garbage Collection Tuning” on page 419. If enough memory is available, you should increase the region server heap up to at least 4 GB, preferably more like 8 GB. The recommended garbage collection settings ought to work for any heap size.
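As a rough sketch only (the heap size, flags, and log path are examples and depend on your JVM and workload), the heap and collector are typically configured in conf/hbase-env.sh along these lines:

export HBASE_REGIONSERVER_OPTS="-Xms8g -Xmx8g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:$HBASE_HOME/logs/gc-hbase.log"

Enabling the GC log, as shown, also makes the pause analysis described later much easier.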

Also, if you are colocating the region server and MapReduce task tracker, be mindful of resource contention on the shared system. Edit the mapred-site.xml file to reduce the number of slots for nodes running with ZooKeeper, so you can allocate a good share of memory to the region server. Do the math on memory allocation, accounting for memory allocated to the task tracker and region server, as well as memory allocated for each child task (from mapred-site.xml and hadoop-env.sh) to make sure you are leaving enough memory for the region server but you’re not oversubscribing the system. Refer to the discussion in “Requirements” on page 34. You might want to consider separating MapReduce and HBase functionality if you are otherwise strapped for resources.
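With the classic (MRv1) task tracker, the slot counts are controlled by the following mapred-site.xml properties; the values here are purely illustrative:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>

Together with the per-child heap set in mapred.child.java.opts, this bounds how much memory MapReduce can claim on a node that also hosts a region server.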

Lastly, HBase is also CPU-intensive. So even if you have enough memory, check your CPU utilization to determine if slots need to be reduced, using a simple Unix command such as top, or the monitoring described in Chapter 10.

2. Stability issues

In rare cases, a region server may shut itself down, or its process may be terminated unexpectedly. You can check the following:

• Double-check that the JVM version is not 1.6.0u18 (which is known to have detrimental effects on running HBase processes).

• Check the last lines of the region server logs; they probably have a message containing the word "aborting" (or "abort"), hopefully with a reason.

The latter is often an issue when the server is losing its ZooKeeper session. If that is the case, you can look into the following:

2.1 ZooKeeper problems. It is vital to ensure that ZooKeeper can perform its tasks as the coordination service for HBase. It is also important for the HBase processes to be able to communicate with ZooKeeper on a regular basis. Here is a checklist you can use to ensure that you do not run into commonly known problems with ZooKeeper:

Check that the region server and ZooKeeper machines do not swap
If machines start swapping, certain resources start to time out and the region servers will lose their ZooKeeper session, causing them to abort themselves. You can use Ganglia, for example, to graph the machines’ swap usage, or execute

$ vmstat 20


(Note: the 20 above is just the vmstat sampling interval in seconds. How eagerly swap is used is governed by the kernel's swappiness value discussed below; roughly speaking, the lower the value, the closer memory has to be to exhaustion before the kernel starts swapping.)
on the server(s) while running load against the cluster (e.g., a MapReduce job): make sure the "si" and "so" columns stay at 0. These columns show the amount of data swapped in or out. Also execute

$ free -m

to make sure that no swap space is used (the swap column should state 0). Also consider tuning the kernel’s swappiness value (/proc/sys/vm/swappiness) down to 5 or 10. This should help if the total memory allocation adds up to less than the box’s available memory, yet swap is happening anyway.
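To lower swappiness without rebooting, and to make the change survive a reboot, something like the following works on most Linux distributions:

$ sudo sysctl -w vm.swappiness=5
$ echo "vm.swappiness = 5" | sudo tee -a /etc/sysctl.conf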

Check network issues
If the network is flaky, region servers will lose their connections to ZooKeeper and abort.

Check ZooKeeper machine deployment
ZooKeeper should never be co-deployed with task trackers or data nodes. It is permissible to deploy ZooKeeper with the name node, secondary name node, and job tracker on small clusters (e.g., fewer than 40 nodes).

It is preferable to deploy just one ZooKeeper peer shared with the name node/job tracker than to deploy three that are collocated with other processes: the other processes will stress the machine and ZooKeeper will start timing out.

Check pauses related to garbage collection
Check the region server’s logfiles for a message containing "slept"; for example, you might see something like "We slept 65000ms instead of 10000ms". If you see this, it is probably due to either garbage collection pauses or heavy swapping. If they are garbage collection pauses, refer to the tuning options mentioned in “Basic setup checklist” on page 471.
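A quick way to scan for such pauses, assuming a packaged install that writes its logs to /var/log/hbase (adjust the path to your setup):

$ grep "slept" /var/log/hbase/*regionserver*.log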

Monitor slow disks
HBase does not degrade well when reading or writing a block on a data node with a slow disk. This problem can affect the entire cluster if the block holds data from the META region, causing compactions to slow and back up. Again, use monitoring to carefully keep these vital metrics under control.

Note: a slow disk is hard to find unless you use disk-checking tools such as hdparm.
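For example, hdparm can give a rough per-disk sequential read benchmark (the device name is only an example):

$ sudo hdparm -t /dev/sda

Comparing the numbers across the data nodes helps to spot a single slow disk.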

ref:

hbase-guide

swappiness

HBase performance tuning (HBase性能调优)
