【Environment】
hive-1.2.1 hbase-1.1.2
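【Background】
Sometimes we need to query user-profile data that already lives in HBase from Hive, that is, to read HBase data directly through Hive, without shuttling it back and forth with tools such as sqoop or DataX. In that case we can create a Hive table mapped onto the HBase table and query the HBase data through it.
【Ways to create the mapped table】
Prerequisite: the table already exists in HBase.
Options: the Hive table can map all column families of the HBase table, a single column family, a single column within one column family, or several columns within one column family.
Suppose the users namespace in HBase contains a table named china_mainland, described as follows: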
hbase(main):001:0> desc "users:china_mainland"
Table users:china_mainland is ENABLED
users:china_mainland, {TABLE_ATTRIBUTES => {METADATA => {'OWNER' => 'hbase'}}
COLUMN FAMILIES DESCRIPTION
{NAME => 'act', BLOOMFILTER => 'ROW', VERSIONS => '3', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'basic', BLOOMFILTER => 'ROW', VERSIONS => '3', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'docs', BLOOMFILTER => 'ROW', VERSIONS => '3', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'pref', BLOOMFILTER => 'ROW', VERSIONS => '3', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'rc', BLOOMFILTER => 'ROW', VERSIONS => '3', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
5 row(s) in 0.2430 seconds
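As shown, the table has 5 column families. That is too many; it should have been kept to 3 or fewer. I will deal with that later; the plan is to split each column family into its own table, otherwise read/write performance will degrade badly as the data grows.
The following shows how to create external tables in Hive. Note: you cannot use LOAD DATA to load data into these tables, because they are created with HBaseStorageHandler. LOAD DATA only works on native Hive tables, that is, tables created without a storage handler.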
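Since LOAD DATA is unavailable for such tables, one common workaround is to load a file into an ordinary Hive table first and then write to the HBase-backed table with INSERT ... SELECT. A minimal sketch, in which the table names staging_kv and hbase_scratch, the HBase table scratch, the column family cf1 and the file path are all hypothetical and not part of the setup described here:

```sql
-- Hypothetical native staging table; LOAD DATA works here
-- because no storage handler is involved.
CREATE TABLE staging_kv (key STRING, value STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Hypothetical local file with tab-separated key/value lines.
LOAD DATA LOCAL INPATH '/tmp/kv.tsv' INTO TABLE staging_kv;

-- Hypothetical HBase-backed table. It is a managed table,
-- so Hive creates the HBase table "scratch" itself.
CREATE TABLE hbase_scratch (key STRING, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "scratch");

-- Rows reach HBase through a query, not through LOAD DATA.
INSERT OVERWRITE TABLE hbase_scratch
SELECT key, value FROM staging_kv;
```

A scratch table is used here on purpose, so that trying this out does not write into users:china_mainland.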
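【Scheme 1】Create a Hive external table mapped to all column families of the HBase table china_mainland (including every column in each family)
The key step is in the CREATE TABLE statement: WITH SERDEPROPERTIES specifies which HBase column families or columns the Hive columns map to.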
hive> CREATE EXTERNAL TABLE china_mainland(
> rowkey string,
> act map<string,string>,
> basic map<string,string>,
> docs map<string,string>,
> pref map<string,string>,
> rc map<string,string>
> ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,act:,basic:,docs:,pref:,rc:")
> TBLPROPERTIES ("hbase.table.name" = "users:china_mainland")
> ;
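【Scheme 2】Map a single column within one column family
The two columns rowkey and act_url of the Hive table china_mainland_acturl map to the row key and to the url column of the act column family of the HBase table users:china_mainland, respectively.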
hive> CREATE EXTERNAL TABLE china_mainland_acturl(
> rowkey string,
> act_url STRING
> ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,act:url")
> TBLPROPERTIES ("hbase.table.name" = "users:china_mainland")
> ;
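【Scheme 3】Map several columns within one column family
In the Hive table china_mainland_kylin_test, rowkey maps to the row key, and the three columns pp_profession, pp_salary and pp_gender map to the columns pp_profession, pp_salary and pp_gender of the act column family of the HBase table users:china_mainland.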
hive> CREATE EXTERNAL TABLE china_mainland_kylin_test(
> rowkey string,
> pp_profession string,
> pp_salary double,
> pp_gender int)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" =":key,act:pp_profession,act:pp_salary,act:pp_gender")
> TBLPROPERTIES ("hbase.table.name" = "users:china_mainland");
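The mapping above declares pp_salary as double and pp_gender as int. By default the storage handler expects the cell values to be stored as strings (for example the text "3.5"). If the values were instead written as raw bytes, for example with HBase's Bytes.toBytes(double), a mapping entry can carry a #b suffix to request binary decoding. How the cells of users:china_mainland are actually encoded is not stated here, so the following is only a sketch under the assumption of binary encoding, and the table name china_mainland_kylin_test_bin is hypothetical:

```sql
-- Same columns as Scheme 3, but the two numeric columns are
-- read as binary-encoded values (#b) instead of strings.
CREATE EXTERNAL TABLE china_mainland_kylin_test_bin(
  rowkey string,
  pp_profession string,
  pp_salary double,
  pp_gender int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,act:pp_profession,act:pp_salary#b,act:pp_gender#b")
TBLPROPERTIES ("hbase.table.name" = "users:china_mainland");
```

If numeric columns come back as NULL with the plain mapping, a mismatch between the declared storage type and the real encoding is one thing worth checking.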
【Scheme 4】
Map all columns of a single column family of the HBase table
hive> CREATE EXTERNAL TABLE china_mainland_pref(
> rowkey STRING,
> pref map<string,string>
> )
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,pref:")
> TBLPROPERTIES ("hbase.table.name" = "users:china_mainland")
> ;
OK
Time taken: 0.164 seconds
hive> DESC FORMATTED china_mainland_pref;
OK
# col_name data_type comment
rowkey string from deserializer
pref map<string,string> from deserializer
# Detailed Table Information
Database: lmy_test
Owner: hive
CreateTime: Thu May 03 15:12:32 CST 2018
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://ks-hdfs/apps/hive/warehouse/tony_test.db/china_mainland_pref
Table Type: EXTERNAL_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"}
EXTERNAL TRUE
hbase.table.name users:china_mainland
numFiles 0
numRows 0
rawDataSize 0
storage_handler org.apache.hadoop.hive.hbase.HBaseStorageHandler
totalSize 0
transient_lastDdlTime 1525331552
# Storage Information
SerDe Library: org.apache.hadoop.hive.hbase.HBaseSerDe
InputFormat: null
OutputFormat: null
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
hbase.columns.mapping :key,pref:
serialization.format 1
Time taken: 0.165 seconds, Fetched: 36 row(s)
Once china_mainland_pref is created, it can be queried right away:
# Look at the first few rows
hive> SELECT * FROM china_mainland_pref LIMIT 5;
OK
P_0 {"preference_game":"2","preference_shopping":"29"}
P_00 {"preference_news":"2","preference_shopping":"3","preference_travel":"2"}
P_0000001876382F627351AA1353507E7E {"preference_news":"1","preference_science":"9","preference_shopping":"2","profession_communication":"1","train_collegeexam":"8","train_postgraduexam":"1"}
P_00000050B563527DE611805C3513BFCF {"preference_news":"2","preference_shopping":"68","preference_sns":"2","title_doc2vec_vector":"[-0.029652609855638477, 0.16829780398795727, 0.0634797563895506, 0.09985980261821016, -0.027206807371619682, 0.04609297546347725, 0.22684986310548136, 0.07509010726482222, -0.35608007793539426, 0.17766480221945294, 0.46286574267486735, 0.2597907394226127, -0.14725957574341025, 0.07262948642247423, 0.07125068438490716, 0.19818145833107004, -0.11506854374877365, 0.22868833573045896, -0.43365914508786096, -0.3630766762536705]"}
P_00000051992B3C6B48DF65CFCD00F570 {"preference_automobile":"2","preference_entertainment":"3","preference_financial":"2","preference_game":"2","preference_house":"1","preference_maternal":"2","preference_medcine":"2","preference_news":"1043","preference_science":"1","preference_shopping":"153","preference_sns":"1","preference_sport":"11","preference_travel":"1","profession_Agriculture":"9","profession_appliance":"1","profession_building":"4","profession_businesstrade":"4","profession_electronic":"1","profession_food":"11","profession_logistics":"14","profession_metallurgy":"4","profession_otherindustry":"2","shopping_frequence":"0:7 11:0.286 14:0.143 15:0.143 19:0.143 20:0.143 21:0.143 25:0.571 29:0.143 30:0.286","title_doc2vec_vector":"[-0.24758818000253893, 0.2338153449865942, -0.05848859099970545, -0.02417380306100351, 0.14515214525269712, 0.30990456699703955, -0.09866324401812501, -0.2331386034898917, -0.33209567698858355, 0.07699443458979872, 0.1471409695212827, 0.45493278257739733, 0.3085450975389932, -0.32623349923656875, -0.13339757411540565, -0.05206131438941852, 0.1003091645774786, 0.24435056196480162, -0.15905020331297162, -0.19182652834852418]","train_postgraduexam":"1"}
Time taken: 0.385 seconds, Fetched: 5 row(s)
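Because pref is a map column, individual HBase columns can be picked out by their qualifier. A small sketch using keys that appear in the sample rows above; the column aliases and the LIMIT values are arbitrary choices for illustration:

```sql
-- Pick single qualifiers out of the map; a key that is
-- absent from a row yields NULL for that row.
SELECT rowkey,
       pref['preference_news']     AS preference_news,
       pref['preference_shopping'] AS preference_shopping,
       size(pref)                  AS pref_column_count
FROM china_mainland_pref
LIMIT 5;

-- Turn each (qualifier, value) pair into its own row.
SELECT rowkey, k AS qualifier, v AS value
FROM china_mainland_pref
LATERAL VIEW explode(pref) kv AS k, v
LIMIT 20;
```

The map values are strings, so a CAST is needed before comparing them numerically.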
# Count the total number of rows in the table
hive> select count(rowkey) from china_mainland_pref;
Query ID = hive_20180503151327_21874623-dd01-4770-97be-65772cd20a6c
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.
Status: Running (Executing on YARN cluster with App id application_1523618137752_0743)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 102 102 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 1072.51 s
--------------------------------------------------------------------------------
OK
851887471
Time taken: 1080.993 seconds, Fetched: 1 row(s)
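The count above has to scan the whole HBase table (about 852 million rows in roughly 18 minutes). When only specific users are needed, filtering on the row key column is usually much cheaper, because the storage handler can turn a predicate on the key into an HBase key lookup or range scan instead of a full scan. Whether and how that pushdown happens depends on the Hive version and settings, so treat this as a sketch to verify with EXPLAIN. The row key is taken from the sample output shown earlier:

```sql
-- Point lookup by row key instead of a full table scan.
SELECT rowkey, pref['preference_news'] AS preference_news
FROM china_mainland_pref
WHERE rowkey = 'P_00000050B563527DE611805C3513BFCF';
```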