接下来我们一块儿看一下HBase的几个概念,首先来看第一个概念:Row Key,如下图所示,Row Key顾名思义,就是把一行当做主键,由于HBase建立了索引,所以我们根据行号可以迅速定位的那一行,我们还可以通过Row Key的range来定位数据,也就是查询的时候一次查多行的数据,指定一个范围,同样可以根据索引快速为我们查询出我们想要的结果。当然,也可以通过全表扫描的方式来查询我们想要的数据,这种方式相对来说就慢了。
看完了第一个概念,我们接着来看一下第二个概念:列族,如下图所示。列族是在我们建表的时候就需要声明的,一个表可以指定一个到多个列族,列族当中可以包含多个列,这些列是可以动态增加的,一个列族当中可以有0到多个列。如果表创建好了又想增加列族,那么需要先停止表,然后Alter表增加列族,然后再重新启用表。
看完了第二个概念,我们接着来看一下第三个概念:timestamp,时间戳是用来建立索引的,通过时间戳我们可以快速找到我们想要的版本的数据。
2,下面下载安装hbase
下载地址:http://archive.cloudera.com/cdh5/cdh/5/
我这里下的是1.0的
下载后上传解压,修改目录名字,最后如下:
xiaoye@ubuntu3:~/Downloads$ cd ..
xiaoye@ubuntu3:~$ mv hbase-1.0.0-cdh5.5.1/ hbase
xiaoye@ubuntu3:~$ ls
apache-activemq-5.15.3 hbase Public
classes hive QueryResult.java
derby.log metastore_db SDS.java
Desktop Music sqoop
Documents mysql-connector-java-5.1.32 Templates
Downloads Pictures Videos
examples.desktop product2.java zookeeper
hadoop product.java zookeeper.out
xiaoye@ubuntu3:~$
要想跑起来HBase,我们需要简单配置一下两个文件,分别是hbase-env.sh和hbase-site.xml,首先我们来配置一下hbase-env.sh文件,如下所示,habase-env.sh文件当中的export JAVA_HOME这一行的内容原来配置的是jdk1.6版本的并且是注释掉的,我们现在去掉注释并 将jdk的版本换成我们现在用的版本。改完之后保存退出。
xiaoye@ubuntu3:~/hbase/conf$ vim hbase-env.sh
xiaoye@ubuntu3:~/hbase/conf$
# The java implementation to use. Java 1.7+ required.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
接着我们来配置一下habase-site.xml文件,在这个文件当中我们暂且把文件系统配成本地文件系统。如下所示,注意:
xiaoye@ubuntu3:~/hbase/conf$ vim hbase-site.xml
2.1 下面进入bin目录下启动hbase
xiaoye@ubuntu3:~/hbase/conf$ vim hbase-site.xml
xiaoye@ubuntu3:~/hbase/conf$ cd ../bin
xiaoye@ubuntu3:~/hbase/bin$ ls
draining_servers.rb hbase-daemons.sh rolling-restart.sh
get-active-master.rb hbase-jruby shutdown_regionserver.rb
graceful_stop.sh hirb.rb start-hbase.cmd
hbase local-master-backup.sh start-hbase.sh
hbase-cleanup.sh local-regionservers.sh stop-hbase.cmd
hbase.cmd master-backup.sh stop-hbase.sh
hbase-common.sh region_mover.rb test
hbase-config.cmd regionservers.sh thread-pool.rb
hbase-config.sh region_status.rb zookeepers.sh
hbase-daemon.sh replication
xiaoye@ubuntu3:~/hbase/bin$ ./start-hbase.sh
xiaoye@ubuntu3:~/hbase/bin$ ./start-hbase.sh
starting master, logging to /home/xiaoye/hbase/bin/../logs/hbase-xiaoye-master-ubuntu3.out
xiaoye@ubuntu3:~/hbase/bin$ jps
16483 Jps
1431 QuorumPeerMain
2279 ResourceManager
1503 JournalNode
2196 DataNode
2424 NodeManager
jps并没有发现关于hbase的进程启动,因此可能报错了,到/hbase/logs日志下去看。有如下错:
ERROR [main] master.HMasterCommandLine: Master exiting
java.io.IOException: Could not start ZK at requested port of 2181. ZK was started at port: -1. Aborting as clients (e.g. shell) will not be able to find this ZK quorum.
看样子是hbase默认端口被占用了,百度看到有人说单机启动hbase前不要启动hadoop集群,2181这个端口好像是zookeeper的默认端口,那我们就试着改端口了。
xiaoye@ubuntu3:~/hbase/logs$ cd ../conf/
xiaoye@ubuntu3:~/hbase/conf$ vim hbase-site.xml
注意大小写也要一样。
xiaoye@ubuntu3:~/hbase/conf$ ../bin/start-hbase.sh
starting master, logging to /home/xiaoye/hbase/bin/../logs/hbase-xiaoye-master-ubuntu3.out
xiaoye@ubuntu3:~/hbase/conf$ jps
1431 QuorumPeerMain
2279 ResourceManager
17446 Jps
1503 JournalNode
2196 DataNode
2424 NodeManager
17107 HMaster
这样就启动了。
启动好了HBase,我们像检查安装好mysql那样,打开一个客户端来试试是否安装成功。我们用到的是命令是hbase,这个命令后可以跟很多命令,我们输入./habase一回车它就给我们显示./habase后面都可以跟哪些内容。如下所示。我们检查HBase用到的是shell。
我们来执行一下./hbase shell
xiaoye@ubuntu3:~/hbase/bin$ ./hbase shell
2018-04-06 04:52:06,631 INFO [main] Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/xiaoye/hbase/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/xiaoye/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2018-04-06 04:52:28,528 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
HBase Shell; enter 'help
Type "exit
Version 1.0.0-cdh5.5.1, rUnknown, Wed Dec 2 10:36:13 PST 2015
hbase(main):001:0> help
HBase Shell, version 1.0.0-cdh5.5.1, rUnknown, Wed Dec 2 10:36:13 PST 2015
Type 'help "COMMAND"', (e.g. 'help "get"' -- the quotes are necessary) for help on a specific command.
Commands are grouped. Type 'help "COMMAND_GROUP"', (e.g. 'help "general"') for help on a command group.
COMMAND GROUPS:
Group name: general
Commands: status, table_help, version, whoami
Group name: ddl
Commands: alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, get_table, is_disabled, is_enabled, list, show_filters
Group name: namespace
Commands: alter_namespace, create_namespace, describe_namespace, drop_namespace, list_namespace, list_namespace_tables
Group name: dml
Commands: append, count, delete, deleteall, get, get_counter, incr, put, scan, truncate, truncate_preserve
Group name: tools
Commands: assign, balance_switch, balancer, catalogjanitor_enabled, catalogjanitor_run, catalogjanitor_switch, close_region, compact, compact_mob, compact_rs, flush, major_compact, major_compact_mob, merge_region, move, split, trace, unassign, wal_roll, zk_dump
Group name: replication
Commands: add_peer, append_peer_tableCFs, disable_peer, disable_table_replication, enable_peer, enable_table_replication, list_peers, list_replicated_tables, remove_peer, remove_peer_tableCFs, set_peer_tableCFs, show_peer_tableCFs
Group name: snapshots
Commands: clone_snapshot, delete_all_snapshot, delete_snapshot, list_snapshots, restore_snapshot, snapshot
Group name: configuration
Commands: update_all_config, update_config
Group name: quotas
Commands: list_quotas, set_quota
Group name: security
Commands: grant, revoke, user_permission
Group name: visibility labels
Commands: add_labels, clear_auths, get_auths, list_labels, set_auths, set_visibility
SHELL USAGE:
Quote all names in HBase Shell such as table and column names. Commas delimit
command parameters. Type
Dictionaries of configuration used in the creation and alteration of tables are
Ruby Hashes. They look like this:
{'key1' => 'value1', 'key2' => 'value2', ...}
and are opened and closed with curley-braces. Key/values are delimited by the
'=>' character combination. Usually keys are predefined constants such as
NAME, VERSIONS, COMPRESSION, etc. Constants do not need to be quoted. Type
'Object.constants' to see a (messy) list of all constants in the environment.
If you are using binary keys or values and need to enter them in the shell, use
double-quote'd hexadecimal representation. For example:
hbase> get 't1', "key\x03\x3f\xcd"
hbase> get 't1', "key\003\023\011"
hbase> put 't1', "test\xef\xff", 'f1:', "\x01\x33\x40"
The HBase shell is the (J)Ruby IRB with the above HBase-specific commands added.
For more on the HBase Shell, see http://hbase.apache.org/book.html
hbase(main):002:0>
根据回显信息,看出
命令是按组来分的,有general、ddl、namesapce等等组。我们常用到的组是ddl和dml。
那么ddl和dml代表的意思是什么呢?
DDL(Data Definition Language)数据库定义语言,用于定义数据库的三级结构,包括外模式、概念模式、内模式及其相互之间的映像,定义数据的完整性、安全控制等约束。DDL不需要commit。常用的命令有alter(修改表),create(创建表), describe(表结构的描述信息),drop(删除表),list(查询所有的表),可以发现都是针对表的操作。
DML(Data Manipulation Language)数据操纵语言,用于让用户或程序员使用,实现对数据库中数据的操作。DML分成交互型DML和嵌入型DML两类。依据语言的级别,DML又可分成过程性DML和非过程性DML两种。需要commit。常用的命令有scan(全表扫描,相当于select *),get(取出一条数据),put(向表中插入数据),delete(删除表中数据),等等。可以发现是对数据操作的命令。
下面我们创建一张表,看帮助有建表案例:
hbase(main):002:0> heltpp
NameError: undefined local variable or method `heltpp' for #
这里我的help输入错了,发现backspace键不能回退删除,百度说hbase的shell命令的回退键是ctrl+backspace
hbase(main):003:0> help 'create'
Creates a table. Pass a table name, and a set of column family
specifications (at least one), and, optionally, table configuration.
Column specification can be a simple string (name), or a dictionary
(dictionaries are described below in main help output), necessarily
including NAME attribute.
Examples:
Create a table with namespace=ns1 and table qualifier=t1
hbase> create 'ns1:t1', {NAME => 'f1', VERSIONS => 5}
Create a table with namespace=default and table qualifier=t1
hbase> create 't1', {NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}
hbase> # The above in shorthand would be the following:
hbase> create 't1', 'f1', 'f2', 'f3'
hbase> create 't1', {NAME => 'f1', VERSIONS => 1, TTL => 2592000, BLOCKCACHE => true}
hbase> create 't1', {NAME => 'f1', CONFIGURATION => {'hbase.hstore.blockingStoreFiles' => '10'}}
Table configuration options can be put at the end.
Examples:
hbase> create 'ns1:t1', 'f1', SPLITS => ['10', '20', '30', '40']
hbase> create 't1', 'f1', SPLITS => ['10', '20', '30', '40']
hbase> create 't1', 'f1', SPLITS_FILE => 'splits.txt', OWNER => 'johndoe'
hbase> create 't1', {NAME => 'f1', VERSIONS => 5}, METADATA => { 'mykey' => 'myvalue' }
hbase> # Optionally pre-split the table into NUMREGIONS, using
hbase> # SPLITALGO ("HexStringSplit", "UniformSplit" or classname)
hbase> create 't1', 'f1', {NUMREGIONS => 15, SPLITALGO => 'HexStringSplit'}
hbase> create 't1', 'f1', {NUMREGIONS => 15, SPLITALGO => 'HexStringSplit', REGION_REPLICATION => 2, CONFIGURATION => {'hbase.hregion.scan.loadColumnFamiliesOnDemand' => 'true'}}
You can also keep around a reference to the created table:
hbase> t1 = create 't1', 'f1'
Which gives you a reference to the table named 't1', on which you can then
call methods.
hbase(main):004:0>
那我们就照着案例建一张表:
先介绍命令的意思:create不用多说,就是创建的意思,'student'是表名,{NAME => 'info', VERSIONS =>3}的意思是一个列族,建表的时候我们必须至少建一个列族,也可以建多个,NAME => 'info'是给这个列族起的名字,VERSIONS =>3是指这个列族可以存储三个版本的数据,多于3个的话,最老的版本将被删除(这个后面会说到),同理,{NAME => 'data', VERSIONS =>1}这句的意思是建了另外一个列族,这个列族的名字是'data',存储的版本只有1个。
hbase(main):006:0> create 'student',{NAME => 'info',VERSIONS =>3},{name => 'data',VERSIONS=>1}
NameError: undefined local variable or method `name' for #
hbase(main):007:0> create 'student',{NAME => 'info',VERSIONS =>3},{NAME => 'data',VERSIONS=>1}
0 row(s) in 2.6020 seconds
=> Hbase::Table - student
可以看出hbase 的shell命令区分大小写,
下面插入数据:
来具体说一下这条语句的意思,put的意思是插入,'student'的意思是表名,表示我们是向student表中插入数据,'rk0001'的意思是row key,可以认为是一行的唯一标识符,'info:name'的意思是一个cell(单元格),一个单元格是由列族和列名共同组成的,iinfo是列族,name是列名,'tom'是name的值。其实我们还可以指定timestamp的值,我们这里没有指定,系统会自动帮我们生成一个timestamp。
hbase(main):015:0> put 'student','rk0001' ,'info:name','tom'
0 row(s) in 0.3790 seconds
hbase(main):016:0> scan 'student'
ROW COLUMN+CELL
rk0001 column=info:name, timestamp=1523017557754, value=tom
1 row(s) in 0.0590 seconds
增加另外一个列族
hbase(main):017:0> put 'student', 'rk0002','data:score','99'
0 row(s) in 0.0320 seconds
hbase(main):018:0> scan 'student'
ROW COLUMN+CELL
rk0001 column=info:name, timestamp=1523017557754, value=tom
rk0002 column=data:score, timestamp=1523017667855, value=99
2 row(s) in 0.0430 seconds
增加属性:
hbase(main):019:0> put 'student','rk0001' ,'info:age','22'
0 row(s) in 0.0150 seconds
hbase(main):020:0> scan 'student'
ROW COLUMN+CELL
rk0001 column=info:age, timestamp=1523017710918, value=22
rk0001 column=info:name, timestamp=1523017557754, value=tom
rk0002 column=data:score, timestamp=1523017667855, value=99
2 row(s) in 0.0580 seconds
删除操作:
hbase(main):021:0> delete 'student','rk0002','data:score', 1523017667855
0 row(s) in 0.0960 seconds
hbase(main):022:0> scan 'student'
ROW COLUMN+CELL
rk0001 column=info:age, timestamp=1523017710918, value=22
rk0001 column=info:name, timestamp=1523017557754, value=tom
1 row(s) in 0.0680 seconds
我们现在继续向student表中插入另外一名同学jerry的相关信息。如下所示,我们只添加了info:name和info:gender的信息,并没有添加age属性的值
hbase(main):026:0> put 'student','rk0002','info:name','jerry4'
0 row(s) in 0.0350 seconds
hbase(main):027:0> scan 'student'
ROW COLUMN+CELL
rk0001 column=info:age, timestamp=1523017710918, value=22
rk0001 column=info:name, timestamp=1523017557754, value=tom
rk0002 column=info:name, timestamp=1523018366607, value=jerry4
2 row(s) in 0.0470 seconds
hbase(main):028:0> put 'student','rk0002','info:gender','male'
0 row(s) in 0.0150 seconds
hbase(main):029:0> scan 'student'
ROW COLUMN+CELL
rk0001 column=info:age, timestamp=1523017710918, value=22
rk0001 column=info:name, timestamp=1523017557754, value=tom
rk0002 column=info:gender, timestamp=1523018418851, value=male
rk0002 column=info:name, timestamp=1523018366607, value=jerry4
2 row(s) in 0.0630 seconds
现在我们来验证一下我们在建表时给列族设定的VERSIONS =>3是否有效,我们向rk0001的iinfo:age列继续添加两次数据。info:age的值分别是21和22。
hbase(main):030:0> put 'student','rk0001','info:age','22'
0 row(s) in 0.0210 seconds
hbase(main):031:0> put 'student','rk0001','info:age','21'
0 row(s) in 0.0190 seconds
hbase(main):032:0> scan 'student'
ROW COLUMN+CELL
rk0001 column=info:age, timestamp=1523018602020, value=21
rk0001 column=info:name, timestamp=1523017557754, value=tom
rk0002 column=info:gender, timestamp=1523018418851, value=male
rk0002 column=info:name, timestamp=1523018366607, value=jerry4
2 row(s) in 0.0440 seconds
可以看出只保留最近一次插入的数据。
那么我们会有个疑问,我们前面插入的info:age的值为20和21的数据被删除了吗?其实没有。我们可以通过scan 'student', {COLUMNS => 'info', VERSIONS => 3}来查看,COLUMNS => 'info'指定的是列族,VERSIONS => 3是建这个列族时指定的可以容纳版本的数量,执行结果如下所示,我们发现info:age的所有值我们都查询出来了。
hbase(main):034:0> scan 'student',{COLUMNS=>'info',VERSIONS=>3}
ROW COLUMN+CELL
rk0001 column=info:age, timestamp=1523018602020, value=21
rk0001 column=info:age, timestamp=1523018594743, value=22
rk0001 column=info:age, timestamp=1523017710918, value=22
rk0001 column=info:name, timestamp=1523017557754, value=tom
rk0002 column=info:gender, timestamp=1523018418851, value=male
rk0002 column=info:name, timestamp=1523018366607, value=jerry4
rk0002 column=info:name, timestamp=1523018359429, value=jerry3
rk0002 column=info:name, timestamp=1523018350603, value=jerry2
2 row(s) in 0.0740 seconds
既然名为info的列族设置了版本数量为3的限制,现在已经有3个版本了,那么我们继续向这个列族添加数据的话,看看是什么效果,如下所示,发现添加info:age的值为23的数据后,我们查看到的info:age信息当中只有21、22、23了,没有了最开始的20。其实info:age值为20的数据现在已经被标记为删除了,内存被flush的话就真正删除了。当前内存还没有flush,我们仍然是可以查看到那条被标记为删除的记录的。
hbase(main):035:0> put 'student','rk0001','info:age','23'
0 row(s) in 0.0110 seconds
hbase(main):036:0> scan 'student',{COLUMNS=>'info',VERSIONS=>3}
ROW COLUMN+CELL
rk0001 column=info:age, timestamp=1523019548097, value=23
rk0001 column=info:age, timestamp=1523018602020, value=21
rk0001 column=info:age, timestamp=1523018594743, value=22
rk0001 column=info:name, timestamp=1523017557754, value=tom
rk0002 column=info:gender, timestamp=1523018418851, value=male
rk0002 column=info:name, timestamp=1523018366607, value=jerry4
rk0002 column=info:name, timestamp=1523018359429, value=jerry3
rk0002 column=info:name, timestamp=1523018350603, value=jerry2
2 row(s) in 0.1070 seconds
注意新插入的23的位置,顶替掉最开始插入的22
我们使用scan 'student', {RAW => true, VERSIONS => 10}这条命令来查询包括缓存中已被标记为删除的记录。如下所示。直到缓存中的数据被flush之后才不再显示。
hbase(main):001:0> scan 'student',{RAW=>true,VERSIONS=>10}
ROW COLUMN+CELL
rk0001 column=info:age, timestamp=1523019548097, value=23
rk0001 column=info:age, timestamp=1523018602020, value=21
rk0001 column=info:age, timestamp=1523018594743, value=22
rk0001 column=info:age, timestamp=1523017710918, value=22
rk0001 column=info:name, timestamp=1523017557754, value=tom
rk0002 column=data:score, timestamp=1523017667855, type=DeleteCol
umn
rk0002 column=data:score, timestamp=1523017667855, value=99
rk0002 column=info:gender, timestamp=1523018418851, value=male
rk0002 column=info:name, timestamp=1523018366607, value=jerry4
rk0002 column=info:name, timestamp=1523018359429, value=jerry3
rk0002 column=info:name, timestamp=1523018350603, value=jerry2
rk0002 column=info:name, timestamp=1523018339854, value=jerry
2 row(s) in 0.7550 seconds
3,最后我们直观看看hbase结构和存表的结构图:
HBase数据表分析
我们把我们刚才操作的数据表给画出来,如下图所示,可见,这是一张不规则的表,这也是HBase的特色之处,我们可以灵活的给列族当中添加列,列的名称由我们来定。我们可以从这张图看到有些列是没有值的,那么这些空的值占空间吗?在HBase当中,这些空值是不占空间的,这比我们的关系型数据库明显要有优势(关系型数据库,你只要声明了某列,即使你不给它赋值,它也是占空间的)
上面图可能还不是特别直观,其实如果我们要存储的是比较复杂的json数据nosql形式的存储形式优势就凸显出来了。如下(数据不跟上图对应)
{
"customer":{
"id":1136,
"name":"Z3",
"billingAddress":[{"city":"beijing"}],
"orders":[
"id":17,
"customerId":1136,
"orderItems":[{"productId":27,"price":77.5,"productName":"thinking in java"}],
"shippingAddress":[{"city":"beijing"}],
"orderPayment":[{"cciinfo":"111-222-333","tenid":"asdfadcd334","billingAddress":{"city":"beijing"}}],
}
]
}
}
上面存的可以看出淘宝的一个客户的信息数据,id:“1136”相当于rowkey,也就是这个客户的唯一id,billingAddress是一个地址,可这里是数组形式的,因此这里列簇的版本数就起作用了;此外这里name和地址两个字段是归于info列簇,orders单独归为另一个列簇,这个列簇的列(字段)有id,orderitems等字段。