HBase
Apache HBase™: the Hadoop database, a distributed, scalable, big data store.
Use Apache HBase™ when you need random, real-time read/write access to your Big Data. The goal of the project is to host very large tables (billions of rows by millions of columns) on clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable (the distributed storage system for structured data described by Chang et al.). Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of HDFS.
Features
- Linear and modular scalability
- Strictly consistent reads and writes
- Automatic and configurable sharding of tables
- Automatic failover support between RegionServers
- Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
- Easy-to-use Java API for client access
- Block cache and Bloom filters for real-time queries
- Query predicate push down via server-side filters
- Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options
- Extensible jruby-based (JIRB) shell
- Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX
Prerequisites
https://hbase.apache.org/book.html#basic.prerequisites
[root@hadoop opt]# useradd hbase
[root@hadoop opt]# chown -R hbase:hbase hbase-2.2.5*
[root@hadoop opt]# ll
total 0
drwxr-xr-x. 11 hive hive 221 Oct 22 22:24 apache-hive-3.1.2-bin
drwxr-xr-x. 12 hadoop hadoop 188 Oct 21 06:23 hadoop-3.1.4
drwxr-xr-x. 6 hbase hbase 170 Oct 23 01:14 hbase-2.2.5
drwxr-xr-x. 5 hbase hbase 149 Oct 23 01:14 hbase-2.2.5-client
[root@hadoop opt]# su - hbase
[hbase@hadoop ~]$
Java
Edit the conf/hbase-env.sh file and set JAVA_HOME:
# The java implementation to use. Java 1.8+ required.
# export JAVA_HOME=/usr/java/jdk1.8.0/
export JAVA_HOME=/usr/local/jdk/jdk1.8.0_202
ssh
HBase uses the ssh command and utilities to communicate between cluster nodes. Each server in the cluster must be running ssh so that the Hadoop and HBase daemons can be managed. You must be able to log in to all nodes, including localhost, from the Master and any backup Master, using a shared key rather than a password.
For setup on Linux or Unix systems, see "Procedure: Configure Passwordless SSH Access".
For OS X, see "SSH: Setting up Remote Desktop and Enabling Self-Login".
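As a minimal sketch of passwordless SSH for the hbase user on a single-node setup (assumed here; in a real cluster you would repeat the ssh-copy-id step for each node's hostname):

$ # Generate a key pair with no passphrase (key type and path are illustrative)
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ # Authorize the key for login to localhost
$ ssh-copy-id localhost
$ # Verify that login no longer prompts for a password
$ ssh localhost 'hostname'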
DNS
HBase uses the local hostname as its address.
NTP
The clocks on cluster nodes should be synchronized. A small amount of variation is acceptable, but larger amounts of skew can cause erratic and unexpected behavior. Time synchronization is one of the first things to check if you see unexplained problems in your cluster. It is recommended that you run a Network Time Protocol (NTP) service, or another time-synchronization mechanism, on your cluster, and that all nodes look to the same service for time synchronization. See "Basic NTP Configuration" for setup instructions.
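For instance, on a systemd-based distribution with chrony installed (an assumption; your cluster may use ntpd or another mechanism instead), enabling and checking synchronization could look like this sketch, run on every node:

$ # Enable and start the chrony NTP daemon
$ sudo systemctl enable --now chronyd
$ # Confirm the node is synchronized against its time source
$ chronyc tracking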
Limits on Number of Files and Processes (ulimit)
Apache HBase is a database. It uses a lot of files all at the same time. Many Linux distributions limit the number of files a single user is allowed to open, for example to 1024 (or 256 on older versions of OS X). You can check this limit by running the command ulimit -n when logged in as the user which runs HBase. See the Troubleshooting section for some of the problems you may experience if the limit is too low. You may also notice errors such as the following:
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-04-06 03:04:37,542 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6935524980745310745_1391901
It is recommended to raise the ulimit to at least 10,000, though 10,240 is better, because the value is usually expressed in multiples of 1024. Each ColumnFamily has at least one StoreFile, and possibly more than six StoreFiles if the region is under load. The number of open files required depends on the number of ColumnFamilies and the number of regions. The following is a rough formula for calculating the potential number of open files on a RegionServer:
(StoreFiles per ColumnFamily) x (regions per RegionServer)
For example, assuming that a schema has 3 ColumnFamilies per region, with 3 StoreFiles per ColumnFamily, and 100 regions per RegionServer, the JVM will open 3 * 3 * 100 = 900 file descriptors, not counting open JAR files, configuration files, and others. Opening a file does not take many resources, and the risk of allowing a user to open too many files is minimal.
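To see how many files a running RegionServer actually has open, a rough check like the following can help (a sketch; it assumes lsof is installed and that jps lists an HRegionServer process):

$ # Count file descriptors currently open by the HRegionServer process
$ lsof -p $(jps | awk '/HRegionServer/ {print $1}') | wc -l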
Another related setting is the number of processes a user is allowed to run at once. In Linux and Unix, this is set using the ulimit -u command. It should not be confused with the nproc command, which controls the number of CPUs available to a given user. Under load, a ulimit -u that is too low can cause OutOfMemoryError exceptions.
Configuring the maximum number of file descriptors and processes for the user who is running HBase is an operating-system configuration, not an HBase configuration. It is also important to be sure that the settings are changed for the user who actually runs HBase. To see which user started HBase, and that user's ulimit configuration, look at the first line of the HBase log for that instance.
Example 1. ulimit Settings on Ubuntu
To configure ulimit settings on Ubuntu, edit /etc/security/limits.conf, a space-delimited file with four columns. Refer to the man page for limits.conf for details. In the following example, the first line sets both soft and hard limits for the number of open files (nofile) to 32768 for the user hadoop. The second line sets the number of processes for the same user to 32000.
hadoop - nofile 32768
hadoop - nproc 32000
The settings are only applied if the Pluggable Authentication Module (PAM) environment is directed to use them. To configure PAM to use these limits, be sure the /etc/pam.d/common-session file contains the following line:
session required pam_limits.so
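To confirm the new limits actually take effect, open a fresh session as the HBase user and print the effective values (a sketch; the log path below assumes the user and hostname used elsewhere in this walk-through):

$ # Effective open-file and process limits for the hbase user
$ su - hbase -c 'ulimit -n -u'
$ # The first lines of the HBase log also record which user started it
$ head -n 20 /opt/hbase-2.2.5/logs/hbase-hbase-master-hadoop.log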
Linux Shell
All of the shell scripts that come with HBase rely on the GNU Bash shell.
Windows
Running production systems on Windows machines is not recommended.
HBase has two run modes: standalone and distributed. Out of the box, HBase runs in standalone mode. Whatever your mode, you will need to configure HBase by editing files in the conf directory. At a minimum, you must edit conf/hbase-env.sh to tell HBase which java to use. In this file you set HBase environment variables such as the heapsize and other options for the JVM, the preferred location for log files, and so on. Set JAVA_HOME to point at the root of your java install.
Standalone
This is the default mode. Standalone mode is what is described in the quickstart section. In standalone mode, HBase does not use HDFS (it uses the local filesystem instead), and it runs all HBase daemons and a local ZooKeeper in the same JVM. ZooKeeper binds to a well-known port so that clients can talk to HBase.
standalone.over.hdfs
A sometimes useful variation on standalone HBase runs all daemons within one JVM but persists to HDFS rather than to the local filesystem.
To configure this standalone variant, edit hbase-site.xml, setting hbase.rootdir to point at a directory in your HDFS instance, but then setting hbase.cluster.distributed to false. For example:
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://namenode.example.org:8020/hbase</value>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>false</value>
</property>
https://hbase.apache.org/book.html#distributed
Distributed mode can be subdivided into pseudo-distributed, where all daemons run on a single node, and fully-distributed, where the daemons are spread across all nodes in the cluster. The pseudo-distributed vs. fully-distributed nomenclature comes from Hadoop. Pseudo-distributed mode can run against the local filesystem or against an instance of HDFS. Fully-distributed mode can only run on HDFS. See the Hadoop documentation for how to set up HDFS. A good walk-through for setting up HDFS on Hadoop 2 can be found at http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide.
Pseudo-distributed
https://hbase.apache.org/book.html#pseudo
Pseudo-distributed mode is simply a fully-distributed mode run on a single host. Use this HBase configuration for testing and prototyping purposes only; do not use it for production or for performance evaluation.
Having tried standalone mode above, we now reconfigure HBase to run in pseudo-distributed mode. Pseudo-distributed mode means that HBase still runs entirely on a single host, but each HBase daemon (HMaster, HRegionServer, and ZooKeeper) runs as a separate process: in standalone mode all daemons ran in one JVM process/instance. By default, unless you configure the hbase.rootdir property, your data is stored under /tmp/. Here, we configure the data to be stored in HDFS instead. You can skip the HDFS configuration and keep storing data on the local filesystem.
- Create a directory in HDFS
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -mkdir /apps
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -chmod 777 /apps
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -ls /
Found 2 items
drwxrwxrwx - hadoop supergroup 0 2020-10-27 00:08 /apps
drwxr-xr-x - hadoop supergroup 0 2020-10-26 21:46 /user
- Configure HBase
Edit hbase-site.xml:
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://10.0.31.65:9000/apps/hbase</value>
</property>
<property>
  <name>hbase.unsafe.stream.capability.enforce</name>
  <value>false</value>
</property>
You do not need to create the HBase directory in HDFS; HBase will do it for you. If you create the directory yourself, HBase will treat it as a migration.
Remove the existing hbase.tmp.dir and hbase.unsafe.stream.capability.enforce configuration.
- Start HBase
Start HBase with the bin/start-hbase.sh command. If your system is configured correctly, the jps command should show the HMaster and HRegionServer processes running.
[hbase@hadoop hbase-2.2.5]$ bin/start-hbase.sh
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hadoop-3.1.4/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hbase-2.2.5/lib/client-facing-thirdparty/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hadoop-3.1.4/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hbase-2.2.5/lib/client-facing-thirdparty/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
localhost: running zookeeper, logging to /opt/hbase-2.2.5/bin/../logs/hbase-hbase-zookeeper-hadoop.out
running master, logging to /opt/hbase-2.2.5/bin/../logs/hbase-hbase-master-hadoop.out
: regionserver running as process 6509. Stop it first.
[hbase@hadoop hbase-2.2.5]$ jps
9507 HQuorumPeer
9574 HMaster
6509 HRegionServer
9935 Jps
- Check the HBase directory created in HDFS
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -ls /
Found 2 items
drwxrwxrwx - hadoop supergroup 0 2020-10-27 00:19 /apps
drwxr-xr-x - hadoop supergroup 0 2020-10-26 21:46 /user
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -ls /apps
Found 1 items
drwxr-xr-x - hbase supergroup 0 2020-10-27 01:12 /apps/hbase
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -ls /apps/hbase
Found 12 items
drwxr-xr-x - hbase supergroup 0 2020-10-27 00:19 /apps/hbase/.hbck
drwxr-xr-x - hbase supergroup 0 2020-10-27 01:12 /apps/hbase/.tmp
drwxr-xr-x - hbase supergroup 0 2020-10-27 01:12 /apps/hbase/MasterProcWALs
drwxr-xr-x - hbase supergroup 0 2020-10-27 01:12 /apps/hbase/WALs
drwxr-xr-x - hbase supergroup 0 2020-10-27 00:19 /apps/hbase/archive
drwxr-xr-x - hbase supergroup 0 2020-10-27 00:19 /apps/hbase/corrupt
drwxr-xr-x - hbase supergroup 0 2020-10-27 01:12 /apps/hbase/data
-rw-r--r-- 1 hbase supergroup 42 2020-10-27 00:19 /apps/hbase/hbase.id
-rw-r--r-- 1 hbase supergroup 7 2020-10-27 00:19 /apps/hbase/hbase.version
drwxr-xr-x - hbase supergroup 0 2020-10-27 00:19 /apps/hbase/mobdir
drwxr-xr-x - hbase supergroup 0 2020-10-27 01:12 /apps/hbase/oldWALs
drwx--x--x - hbase supergroup 0 2020-10-27 00:19 /apps/hbase/staging
[hadoop@hadoop hadoop-3.1.4]$
- Create a table and populate it with data
You can use the HBase Shell to create a table, populate it with data, and scan and get values from it, using the same procedure as in the shell exercises.
[hbase@hadoop hbase-2.2.5]$ bin/hbase shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hadoop-3.1.4/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hbase-2.2.5/lib/client-facing-thirdparty/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.2.5, rf76a601273e834267b55c0cda12474590283fd4c, 2020年 05月 21日 星期四 18:34:40 CST
Took 0.0042 seconds
hbase(main):001:0> crate 'test','cf'
NoMethodError: undefined method `crate' for main:Object
Did you mean? create
hbase(main):002:0> create 'test','cf'
Created table test
Took 2.9205 seconds
=> Hbase::Table - test
hbase(main):003:0> list 'test'
TABLE
test
1 row(s)
Took 0.0533 seconds
=> ["test"]
hbase(main):004:0> describe 'test'
Table test is ENABLED
test
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER
=> 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536'}
1 row(s)
QUOTAS
0 row(s)
Took 0.3261 seconds
hbase(main):005:0> put 'test', 'row1', 'cf:a', 'value1'
Took 0.0904 seconds
hbase(main):006:0> put 'test', 'row2', 'cf:b', 'value2'
Took 0.0208 seconds
hbase(main):007:0> put 'test', 'row3', 'cf:c', 'value3'
Took 0.0090 seconds
hbase(main):008:0> scan 'test'
ROW COLUMN+CELL
row1 column=cf:a, timestamp=1603732800927, value=value1
row2 column=cf:b, timestamp=1603732808455, value=value2
row3 column=cf:c, timestamp=1603732815117, value=value3
3 row(s)
Took 0.0355 seconds
hbase(main):009:0> get 'test', 'row1'
COLUMN CELL
cf:a timestamp=1603732800927, value=value1
1 row(s)
Took 0.0210 seconds
hbase(main):010:0> drop 'test'
ERROR: Table test is enabled. Disable it first.
For usage try 'help "drop"'
Took 0.0652 seconds
hbase(main):011:0> disable 'test'
Took 1.3450 seconds
hbase(main):012:0> enable 'test'
Took 0.7684 seconds
hbase(main):013:0> disable 'test'
Took 0.7596 seconds
hbase(main):014:0> drop 'test'
Took 0.4805 seconds
hbase(main):015:0> exit
[hbase@hadoop hbase-2.2.5]$
- Start and stop backup HBase Master (HMaster) servers
Running multiple HMaster instances on the same hardware does not make sense in a production environment, in the same way that running a pseudo-distributed cluster does not make sense for production.
This step is offered for testing and learning purposes only.
The HMaster server controls the HBase cluster. You can start up to 9 backup HMaster servers, which makes 10 total HMasters counting the primary. To start a backup HMaster, use local-master-backup.sh. For each backup master you want to start, add a parameter representing the port offset for that master. Each HMaster uses 2 ports (16000 and 16010 by default). The port offset is added to these ports, so using an offset of 2, the backup HMaster would use ports 16002 and 16012. The following command starts 3 backup servers using ports 16002/16012, 16003/16013, and 16005/16015.
$ ./bin/local-master-backup.sh start 2 3 5
To kill a backup master without killing the entire cluster, you need to find its process ID (PID). The PID is stored in a file with a name like /tmp/hbase-USER-X-master.pid. The only content of the file is the PID. You can use the kill -9 command to kill that PID. The following command kills the master with port offset 1, but leaves the cluster running:
$ cat /tmp/hbase-testuser-1-master.pid |xargs kill -9
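To confirm which HMaster processes are running after starting or killing backup masters, jps is enough (a quick sketch; the primary and each backup all show up as HMaster):

$ # One line per running HMaster (primary plus any backups)
$ jps | grep HMaster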
- Start and stop additional RegionServers
The HRegionServer manages the data in its StoreFiles as directed by the HMaster. Generally, one HRegionServer runs per node in the cluster. Running multiple HRegionServers on the same system can be useful for testing in pseudo-distributed mode. The local-regionservers.sh command allows you to run multiple RegionServers. It works in a similar way to the local-master-backup.sh command, in that each parameter you provide represents the port offset for an instance. Each RegionServer requires two ports, and the default ports are 16020 and 16030. Since HBase version 1.1.0, HMaster no longer uses region server ports, which leaves 10 ports (16020 to 16029 and 16030 to 16039) available for RegionServers. To support additional RegionServers, set the environment variables HBASE_RS_BASE_PORT and HBASE_RS_INFO_BASE_PORT to appropriate values before running local-regionservers.sh. With base-port values of 16200 and 16300, 99 additional RegionServers can be supported per server. The following command starts 4 additional RegionServers, running on sequential ports starting at 16022/16032 (base ports 16020/16030 plus offset 2).
$ ./bin/local-regionservers.sh start 2 3 4 5
To stop a RegionServer manually, use the local-regionservers.sh command with the stop parameter and the port offset of the server to stop.
$ ./bin/local-regionservers.sh stop 3
- Stop HBase
Use the bin/stop-hbase.sh command to stop HBase.
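A sketch of the expected flow (the dotted progress line is what stop-hbase.sh typically prints while waiting; exact output may vary):

$ bin/stop-hbase.sh
stopping hbase................
$ # After shutdown completes, jps should no longer list HMaster or HRegionServer
$ jps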
Fully distributed
https://hbase.apache.org/book.html#fully_dist
By default, HBase runs in stand-alone mode. Both stand-alone mode and pseudo-distributed mode are provided for the purposes of small-scale testing. For a production environment, distributed mode is advised. In distributed mode, multiple instances of HBase daemons run on multiple servers in the cluster.
Just as in pseudo-distributed mode, a fully distributed configuration requires that you set the hbase.cluster.distributed
property to true
. Typically, the hbase.rootdir
is configured to point to a highly-available HDFS filesystem.
In addition, the cluster is configured so that multiple cluster nodes enlist as RegionServers, ZooKeeper QuorumPeers, and backup HMaster servers. These configuration basics are all demonstrated in quickstart-fully-distributed.
Distributed RegionServers
Typically, your cluster will contain multiple RegionServers all running on different servers, as well as primary and backup Master and ZooKeeper daemons. The conf/regionservers file on the master server contains a list of hosts whose RegionServers are associated with this cluster. Each host is on a separate line. All hosts listed in this file will have their RegionServer processes started and stopped when the master server starts or stops.
ZooKeeper and HBase
See the ZooKeeper section for ZooKeeper setup instructions for HBase.
Example Distributed HBase Cluster
This is a bare-bones conf/hbase-site.xml for a distributed HBase cluster. A cluster that is used for real-world work would contain more custom configuration parameters. Most HBase configuration directives have default values, which are used unless the value is overridden in the hbase-site.xml. See "Configuration Files" for more information.
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://namenode.example.org:8020/hbase</value>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>node-a.example.com,node-b.example.com,node-c.example.com</value>
</property>
This is an example conf/regionservers file, which contains a list of nodes that should run a RegionServer in the cluster. These nodes need HBase installed and they need to use the same contents of the conf/ directory as the Master server.
node-a.example.com
node-b.example.com
node-c.example.com
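Since every RegionServer node must see the same conf/ contents as the Master, one common approach is to push the configuration out from the Master, e.g. with rsync (a sketch; the hostnames are the example nodes above and the install path is an assumption for your environment):

$ # Push the Master's HBase configuration to each RegionServer node
$ for host in node-a.example.com node-b.example.com node-c.example.com; do
>   rsync -av /opt/hbase-2.2.5/conf/ ${host}:/opt/hbase-2.2.5/conf/
> done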
This is an example conf/backup-masters file, which contains a list of each node that should run a backup Master instance. The backup Master instances will sit idle unless the main Master becomes unavailable.
node-b.example.com
node-c.example.com
Distributed HBase Quickstart
See quickstart-fully-distributed for a walk-through of a simple three-node cluster configuration with multiple ZooKeeper, backup HMaster, and RegionServer instances.
Procedure: HDFS Client Configuration
Of note, if you have made HDFS client configuration changes on your Hadoop cluster, such as configuration directives for HDFS clients, as opposed to server-side configurations, you must use one of the following methods to enable HBase to see and use these configuration changes:
- Add a pointer to your HADOOP_CONF_DIR to the HBASE_CLASSPATH environment variable in hbase-env.sh.
- Add a copy of hdfs-site.xml (or hadoop-site.xml) or, better, symlinks, under ${HBASE_HOME}/conf, or
- if only a small set of HDFS client configurations, add them to hbase-site.xml.
An example of such an HDFS client configuration is dfs.replication. If, for example, you want to run with a replication factor of 5, HBase will create files with the default of 3 unless you do the above to make the configuration available to HBase.
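As an illustration of the first two methods, you could point HBASE_CLASSPATH at the Hadoop conf directory, or symlink hdfs-site.xml under ${HBASE_HOME}/conf (a sketch using the install paths from this walk-through, which are assumptions for your environment):

$ # Method 1: point HBASE_CLASSPATH at the Hadoop conf directory in hbase-env.sh
$ echo 'export HBASE_CLASSPATH=/opt/hadoop-3.1.4/etc/hadoop' >> /opt/hbase-2.2.5/conf/hbase-env.sh
$ # Method 2: symlink hdfs-site.xml under ${HBASE_HOME}/conf
$ ln -s /opt/hadoop-3.1.4/etc/hadoop/hdfs-site.xml /opt/hbase-2.2.5/conf/hdfs-site.xml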