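Goal: a table in HBase stores its data as protobuf (pb) binary to improve storage efficiency. An external table has now been created in Hive on top of it, and a UDF is needed to decode the pb binary data.
I. The data stored in HBase is serialized to binary with pb and then Base64-encoded into a string:
1. Create an external table in Hive with the following structure: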
create external table ext_toutiao_feed_incr (f_id string,tagPb string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = ":key,data:tagPb"
)TBLPROPERTIES ("hbase.table.name" = "toutiao_feed_incr");
hive> desc ext_toutiao_feed_incr;
OK
f_id string from deserializer
tagpb string from deserializer
1) View one row in HBase:
hbase(main):003:0> get 'toutiao_feed_incr',10000000570
COLUMN CELL
data:tagPb timestamp=1482862346773, value=CLrMr6AlEg0KBuW5vOWEvxUEc6Q+Eg0KBuexu+WeixUB3qI+EiUKHue
bm+S4lumqhOmYs+iLseaWh+erpeiwo+Wkp+WFqBVf0gg/
2 row(s) in 0.4400 seconds
2) View the same row in Hive:
hive> select * from ext_toutiao_feed_incr where f_id=10000000570;
WARNING: Comparing a bigint and a string may result in a loss of precision.
Total jobs = 1
...
OK
10000000570 CLrMr6AlEg0KBuW5vOWEvxUEc6Q+Eg0KBuexu+WeixUB3qI+EiUKHuebm+S4lumqhOmYs+iLseaWh+erpeiwo+Wkp+WFqBVf0gg/
Time taken: 36.179 seconds, Fetched: 1 row(s)
3) Decode the pb with Java:
fid:10000000570,type:0,channels:[],tags:[{tag=幼儿, score=0.32119}, {tag=类型, score=0.3181}, {tag=盛世骄阳英文童谣大全, score=0.53446}]
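To illustrate what step 3) involves, here is a rough sketch that Base64-decodes the stored value and walks the protobuf wire format by hand, using only the JDK. It is not the author's decoder: the class name PbBase64Peek is hypothetical, and the field layout (field 1 = fid as a varint; field 2 = a repeated nested message whose field 1 is the tag string and field 2 is a 32-bit float score) is only inferred from the sample bytes, because the .proto file is not shown here. Real code would normally use the class generated by protoc.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not the author's DecodePbUdf); field numbers are inferred from the sample.
class PbBase64Peek {
    private final byte[] buf;
    private int pos = 0;

    PbBase64Peek(byte[] buf) { this.buf = buf; }

    private boolean hasMore() { return pos < buf.length; }

    // Base-128 varint: 7 payload bits per byte, least significant group first.
    private long readVarint() {
        long value = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            byte b = buf[pos++];
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return value;
        }
        throw new IllegalArgumentException("varint too long");
    }

    // Length-delimited payload (wire type 2): varint length, then that many bytes.
    private byte[] readBytes() {
        int len = (int) readVarint();
        byte[] out = Arrays.copyOfRange(buf, pos, pos + len);
        pos += len;
        return out;
    }

    // fixed32 (wire type 5) is little-endian on the wire.
    private float readFloat() {
        float f = ByteBuffer.wrap(buf, pos, 4).order(ByteOrder.LITTLE_ENDIAN).getFloat();
        pos += 4;
        return f;
    }

    // Skips a field this sketch does not interpret.
    private void skip(int wireType) {
        switch (wireType) {
            case 0: readVarint(); break;
            case 1: pos += 8; break;
            case 2: readBytes(); break;
            case 5: pos += 4; break;
            default: throw new IllegalArgumentException("unsupported wire type " + wireType);
        }
    }

    // Returns field 1 (assumed to be fid) of the Base64-encoded message.
    static long fidOf(String base64) {
        PbBase64Peek r = new PbBase64Peek(Base64.getDecoder().decode(base64));
        while (r.hasMore()) {
            long key = r.readVarint();
            if (key == 0x08) return r.readVarint(); // 0x08 = field 1, wire type 0
            r.skip((int) (key & 7));
        }
        throw new IllegalArgumentException("field 1 not found");
    }

    // Returns tag -> score for every field-2 submessage, in wire order.
    static Map<String, Float> tagsOf(String base64) {
        PbBase64Peek r = new PbBase64Peek(Base64.getDecoder().decode(base64));
        Map<String, Float> tags = new LinkedHashMap<>();
        while (r.hasMore()) {
            long key = r.readVarint();
            if (key != 0x12) { r.skip((int) (key & 7)); continue; } // 0x12 = field 2, wire type 2
            PbBase64Peek sub = new PbBase64Peek(r.readBytes());
            String tag = "";
            float score = 0f;
            while (sub.hasMore()) {
                long subKey = sub.readVarint();
                if (subKey == 0x0A) tag = new String(sub.readBytes(), StandardCharsets.UTF_8);
                else if (subKey == 0x15) score = sub.readFloat();
                else sub.skip((int) (subKey & 7));
            }
            tags.put(tag, score);
        }
        return tags;
    }

    public static void main(String[] args) {
        String value = "CLrMr6AlEg0KBuW5vOWEvxUEc6Q+Eg0KBuexu+WeixUB3qI+EiUKHuebm+S4lumqhOmYs+iLseaWh+erpeiwo+Wkp+WFqBVf0gg/";
        System.out.println("fid:" + fidOf(value) + ",tags:" + tagsOf(value));
    }
}
```

For the sample value, this yields fid 10000000570 and three tag entries, matching the decoded output shown in step 3).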
2. Result of running the UDF:
add jar /home/qytt/ttbrain-log-manager-jar-with-dependencies.jar;
create temporary function udf_pb_lx as 'com.abc.ttbrain.log.manager.hive.DecodePbUdf';
hive> select *,udf_pb_lx(tagpb) from ext_toutiao_feed_incr where f_id=10000000570;
WARNING: Comparing a bigint and a string may result in a loss of precision.
Total jobs = 1
...
OK
10000000570 CLrMr6AlEg0KBuW5vOWEvxUEc6Q+Eg0KBuexu+WeixUB3qI+EiUKHuebm+S4lumqhOmYs+iLseaWh+erpeiwo+Wkp+WFqBVf0gg/ fid:10000000570,type:0,channels:[],tags:[{tag=幼儿, score=0.32119}, {tag=类型, score=0.3181}, {tag=盛世骄阳英文童谣大全, score=0.53446}]
II. The data stored in HBase is the raw binary produced by pb:
1. Create an external table in Hive with the following structure:
create external table ext_test (f_id string,tagPb BINARY,tag string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = ":key,data:tagPb,data:tagPb"
)TBLPROPERTIES ("hbase.table.name" = "test_liu");
hive> desc ext_test;
OK
f_id string from deserializer
tagpb binary from deserializer
Time taken: 0.164 seconds, Fetched: 2 row(s)
1) Query in HBase:
hbase(main):037:0> scan 'test_liu'
ROW COLUMN+CELL
10000000570 column=data:tagPb, timestamp=1491884382969, value=\x08\xBA\xCC\xAF\xA0%\x12\x0D\x0A\x0
6\xE5\xB9\xBC\xE5\x84\xBF\x15\x04s\xA4>\x12\x0D\x0A\x06\xE7\xB1\xBB\xE5\x9E\x8B\x15\x0
1\xDE\xA2>\x12%\x0A\x1E\xE7\x9B\x9B\xE4\xB8\x96\xE9\xAA\x84\xE9\x98\xB3\xE8\x8B\xB1\xE
6\x96\x87\xE7\xAB\xA5\xE8\xB0\xA3\xE5\xA4\xA7\xE5\x85\xA8\x15_\xD2\x08?
1 row(s) in 0.0080 seconds
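The scan output above prints non-printable bytes as \xNN escapes and printable ASCII bytes literally (for example %, s, > and ?). The sketch below, with the hypothetical class name HBaseShellBytes, converts the first few bytes of that notation back to raw bytes and compares them with the Base64 value from section I, suggesting that both tables hold the same pb payload in different encodings.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper for reading HBase shell output; not part of the author's UDF jar.
class HBaseShellBytes {
    // Converts shell-style text (\xNN escapes plus literal printable ASCII) to raw bytes.
    static byte[] unescape(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\' && i + 3 < s.length() && s.charAt(i + 1) == 'x') {
                // Two hex digits follow the "\x" prefix.
                out.write(Integer.parseInt(s.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                // Printable ASCII is shown as-is by the shell.
                out.write((byte) c);
                i++;
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // First 9 bytes of the scan output and the first 12 Base64 characters from section I.
        byte[] fromShell = unescape("\\x08\\xBA\\xCC\\xAF\\xA0%\\x12\\x0D\\x0A");
        byte[] fromBase64 = Base64.getDecoder().decode("CLrMr6AlEg0K");
        System.out.println(Arrays.equals(fromShell, fromBase64)); // prints true
    }
}
```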
2) View the row in Hive:
hive> select * from ext_test;
OK
10000000570 �̯�%
幼儿s�>
类型ޢ>%
盛世骄阳英文童谣大全_.?�̯�%
幼儿s�>
类型ޢ>%
盛世骄阳英文童谣大全_.?
Time taken: 0.11 seconds, Fetched: 1 row(s)
2. Result of running the UDF:
add jar /home/qytt/ttbrain-log-manager-jar-with-dependencies.jar;
create temporary function udf_pb_kevinliu as 'com.abc.ttbrain.log.manager.hive.DecodePbUdf4Byte';
1) Working call (the binary column plus a second, empty string argument):
hive> select udf_pb_kevinliu(tagPb,'') from ext_test;
Total jobs = 1
...
Total MapReduce CPU Time Spent: 4 seconds 40 msec
OK
fid:10000000570,type:0,channels:[],tags:[{tag=幼儿, score=0.32119}, {tag=类型, score=0.3181}, {tag=盛世骄阳英文童谣大全, score=0.53446}]
2) Error 1 (the string-mapped column tag as the argument):
hive> select udf_pb_kevinliu(tag) from ext_test;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1490153150757_1824274, Tracking URL = http://hadoop-jy-resourcemanager01:8088/proxy/application_1490153150757_1824274/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1490153150757_1824274
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-04-11 15:41:17,541 Stage-1 map = 0%, reduce = 0%
2017-04-11 15:41:29,747 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.51 sec
MapReduce Total cumulative CPU time: 3 seconds 510 msec
Ended Job = job_1490153150757_1824274
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.51 sec HDFS Read: 278 HDFS Write: 21 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 510 msec
OK
3) Error 2 (the binary column tagPb as the only argument):
hive> select udf_pb_kevinliu(tagPb) from ext_test;
FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments 'tagPb': No matching method for class com.abc.ttbrain.log.manager.hive.DecodePbUdf4Byte with (binary). Possible choices: _FUNC_(binary) _FUNC_(binary, string) _FUNC_(string)
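3. Summary:
The pb binary was written directly into HBase. The Hive external table maps the same HBase column twice, once as binary and once as string. Problems found:
1) a string column cannot faithfully represent the pb binary data written to HBase;
2) with the binary column, the UDF has to be called with two arguments. Calling it with one argument fails with a "No matching method" error, even though the error message lists _FUNC_(binary) among the possible choices. This may be a Hive bug.
So, where possible, Base64-encode the binary produced by pb before storing it.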
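One plausible reason a string column cannot carry raw pb bytes is that text handling assumes valid UTF-8, and arbitrary binary usually is not. The sketch below (hypothetical class BinaryVsString, JDK only) round-trips the first six bytes of the sample through a UTF-8 String and through Base64. It is an assumption that Hive's string handling behaves like the plain Java conversion shown here, but the replacement characters it produces look like the garbled Hive output in section II.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical demo class; it only illustrates the encoding issue, not Hive internals.
class BinaryVsString {
    // Round-trips raw bytes through a UTF-8 String; malformed sequences become U+FFFD.
    static byte[] viaUtf8String(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    // Round-trips raw bytes through Base64 text, which uses only ASCII characters.
    static byte[] viaBase64(byte[] raw) {
        String stored = Base64.getEncoder().encodeToString(raw);
        return Base64.getDecoder().decode(stored);
    }

    public static void main(String[] args) {
        // Field key 0x08 followed by the varint for fid 10000000570.
        byte[] raw = {0x08, (byte) 0xBA, (byte) 0xCC, (byte) 0xAF, (byte) 0xA0, 0x25};
        System.out.println(Arrays.equals(raw, viaUtf8String(raw)));   // prints false
        System.out.println(Arrays.equals(raw, viaBase64(raw)));       // prints true
        System.out.println(Base64.getEncoder().encodeToString(raw));  // prints CLrMr6Al
    }
}
```

The Base64 form costs roughly a third more storage, which is the trade-off against the compactness that motivated using pb in the first place.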