UDF development: decoding pb binary data in a Hive external table

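Goal: an HBase table stores its data in pb (Protocol Buffers) binary form to improve storage efficiency. An external table has now been created over it in Hive, and a UDF is needed to decode the pb binary data.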

 

I. The data stored in HBase is first serialized to binary with pb, then Base64-encoded into a string:

1. Create an external table in Hive with the following structure:

create external table ext_toutiao_feed_incr (f_id string,tagPb string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES ( 
"hbase.columns.mapping" = ":key,data:tagPb" 
)TBLPROPERTIES ("hbase.table.name" = "toutiao_feed_incr");

hive> desc ext_toutiao_feed_incr;
OK
f_id                	string              	from deserializer   
tagpb               	string              	from deserializer


1) View one row in HBase:

hbase(main):003:0> get 'toutiao_feed_incr',10000000570
COLUMN                         CELL                                                                                  
 data:tagPb                    timestamp=1482862346773, value=CLrMr6AlEg0KBuW5vOWEvxUEc6Q+Eg0KBuexu+WeixUB3qI+EiUKHue
                               bm+S4lumqhOmYs+iLseaWh+erpeiwo+Wkp+WFqBVf0gg/                                         
2 row(s) in 0.4400 seconds

 

2) View one row in Hive:

hive> select * from ext_toutiao_feed_incr where f_id=10000000570;     
WARNING: Comparing a bigint and a string may result in a loss of precision.
Total jobs = 1
...
OK
10000000570	CLrMr6AlEg0KBuW5vOWEvxUEc6Q+Eg0KBuexu+WeixUB3qI+EiUKHuebm+S4lumqhOmYs+iLseaWh+erpeiwo+Wkp+WFqBVf0gg/
Time taken: 36.179 seconds, Fetched: 1 row(s)

 

3) Decode the pb with Java:

fid:10000000570,type:0,channels:[],tags:[{tag=幼儿, score=0.32119}, {tag=类型, score=0.3181}, {tag=盛世骄阳英文童谣大全, score=0.53446}]
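
The post does not show the Java decoding code or the generated pb classes, so the step above can only be sketched. The following is a minimal sketch that Base64-decodes the stored string and walks the protobuf wire format by hand. It assumes the layout that the sample bytes suggest: field 1 is a varint holding fid, and field 2 is a repeated nested message with a string tag (field 1) and a 32-bit float score (field 2). The class name `PbWireSketch` and its methods are hypothetical; the author's code presumably uses classes generated by protoc instead.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

// Hypothetical helper. Field numbers are inferred from the sample row,
// not taken from the author's .proto file. No validation of malformed input.
final class PbWireSketch {

    /** Reads a base-128 varint at the buffer's current position. */
    static long readVarint(ByteBuffer buf) {
        long value = 0;
        int shift = 0;
        while (true) {
            byte b = buf.get();
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
            shift += 7;
        }
    }

    /** Decodes one nested tag message: field 1 = UTF-8 string, field 2 = fixed32 float. */
    static String decodeTag(ByteBuffer buf) {
        String tag = "";
        float score = 0f;
        while (buf.hasRemaining()) {
            int key = (int) readVarint(buf);
            int field = key >>> 3;
            int wireType = key & 7;
            if (field == 1 && wireType == 2) {
                byte[] utf8 = new byte[(int) readVarint(buf)];
                buf.get(utf8);
                tag = new String(utf8, StandardCharsets.UTF_8);
            } else if (field == 2 && wireType == 5) {
                score = buf.getFloat(); // fixed32 is little-endian on the wire
            } else {
                throw new IllegalArgumentException("unexpected nested field " + field);
            }
        }
        return "{tag=" + tag + ", score=" + score + "}";
    }

    /** Base64-decodes the stored value and returns "fid:...,tags:[...]". */
    static String decode(String base64) {
        ByteBuffer buf = ByteBuffer.wrap(Base64.getDecoder().decode(base64))
                .order(ByteOrder.LITTLE_ENDIAN);
        long fid = 0;
        List<String> tags = new ArrayList<>();
        while (buf.hasRemaining()) {
            int key = (int) readVarint(buf);
            int field = key >>> 3;
            int wireType = key & 7;
            if (field == 1 && wireType == 0) {
                fid = readVarint(buf);
            } else if (field == 2 && wireType == 2) {
                int len = (int) readVarint(buf);
                // View over the next len bytes; slice() resets byte order, so set it again.
                ByteBuffer nested = buf.slice().order(ByteOrder.LITTLE_ENDIAN);
                nested.limit(len);
                buf.position(buf.position() + len);
                tags.add(decodeTag(nested));
            } else {
                throw new IllegalArgumentException("unexpected field " + field);
            }
        }
        return "fid:" + fid + ",tags:" + tags;
    }

    public static void main(String[] args) {
        // Value of data:tagPb for row 10000000570, copied from the Hive output above.
        String sample = "CLrMr6AlEg0KBuW5vOWEvxUEc6Q+Eg0KBuexu+WeixUB3qI+"
                + "EiUKHuebm+S4lumqhOmYs+iLseaWh+erpeiwo+Wkp+WFqBVf0gg/";
        System.out.println(decode(sample));
    }
}
```

On the sample row this reports fid 10000000570 followed by three tag entries. The type and channels fields in the author's output do not occur in these bytes, so they are presumably default values filled in by the generated classes.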

 

 

2. Result of running the UDF:

add jar /home/qytt/ttbrain-log-manager-jar-with-dependencies.jar;
create temporary function udf_pb_lx as 'com.abc.ttbrain.log.manager.hive.DecodePbUdf';

hive> select *,udf_pb_lx(tagpb) from ext_toutiao_feed_incr where f_id=10000000570;                         
WARNING: Comparing a bigint and a string may result in a loss of precision.
Total jobs = 1
...
OK
10000000570	CLrMr6AlEg0KBuW5vOWEvxUEc6Q+Eg0KBuexu+WeixUB3qI+EiUKHuebm+S4lumqhOmYs+iLseaWh+erpeiwo+Wkp+WFqBVf0gg/ fid:10000000570,type:0,channels:[],tags:[{tag=幼儿, score=0.32119}, {tag=类型, score=0.3181}, {tag=盛世骄阳英文童谣大全, score=0.53446}]

 

II. The data stored in HBase is the raw binary generated by pb:

1. Create an external table in Hive with the following structure:

create external table ext_test (f_id string,tagPb BINARY,tag string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
WITH SERDEPROPERTIES ( 
"hbase.columns.mapping" = ":key,data:tagPb,data:tagPb" 
)TBLPROPERTIES ("hbase.table.name" = "test_liu");

hive> desc ext_test;
OK
f_id                	string              	from deserializer   
tagpb               	binary              	from deserializer   
Time taken: 0.164 seconds, Fetched: 2 row(s)


1) Query in HBase:

hbase(main):037:0> scan 'test_liu'
ROW                            COLUMN+CELL                                                                           
 10000000570                   column=data:tagPb, timestamp=1491884382969, value=\x08\xBA\xCC\xAF\xA0%\x12\x0D\x0A\x0
                               6\xE5\xB9\xBC\xE5\x84\xBF\x15\x04s\xA4>\x12\x0D\x0A\x06\xE7\xB1\xBB\xE5\x9E\x8B\x15\x0
                               1\xDE\xA2>\x12%\x0A\x1E\xE7\x9B\x9B\xE4\xB8\x96\xE9\xAA\x84\xE9\x98\xB3\xE8\x8B\xB1\xE
                               6\x96\x87\xE7\xAB\xA5\xE8\xB0\xA3\xE5\xA4\xA7\xE5\x85\xA8\x15_\xD2\x08?               
1 row(s) in 0.0080 seconds
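
The escaped value in this scan is the same payload as the Base64 string in section I, just printed differently. A minimal sketch to check this, assuming the shell prints printable ASCII as-is and every other byte as \xHH; `HBaseEscapeSketch` is a hypothetical helper written for illustration and does not call HBase's own code.

```java
import java.util.Base64;

// Hypothetical helper: renders bytes in an HBase-shell-like escaped form.
// The escaping rule is an assumption that matches the scan output shown above.
final class HBaseEscapeSketch {

    static String toEscaped(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int ch = b & 0xFF;
            if (ch >= ' ' && ch <= '~' && ch != '\\') {
                sb.append((char) ch);                    // printable ASCII stays readable
            } else {
                sb.append(String.format("\\x%02X", ch)); // everything else as \xHH
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First 12 Base64 characters of the section I value, i.e. its first 9 bytes.
        byte[] raw = Base64.getDecoder().decode("CLrMr6AlEg0K");
        System.out.println(toEscaped(raw)); // prints \x08\xBA\xCC\xAF\xA0%\x12\x0D\x0A
    }
}
```

The printed prefix matches the start of the value in the scan output, which suggests both tables hold the same pb bytes for this row.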


2) View one row in Hive:

hive> select * from ext_test;
OK
10000000570    �̯�% 
幼儿s�> 
类型ޢ>%
盛世骄阳英文童谣大全_.?�̯�% 
幼儿s�> 
类型ޢ>%
盛世骄阳英文童谣大全_.?
Time taken: 0.11 seconds, Fetched: 1 row(s)


2. Result of running the UDF:

add jar /home/qytt/ttbrain-log-manager-jar-with-dependencies.jar;

create temporary function udf_pb_kevinliu as 'com.abc.ttbrain.log.manager.hive.DecodePbUdf4Byte';

 

1) Normal:

hive> select udf_pb_kevinliu(tagPb,'') from ext_test;

Total jobs = 1

...

Total MapReduce CPU Time Spent: 4 seconds 40 msec

OK

fid:10000000570,type:0,channels:[],tags:[{tag=幼儿, score=0.32119}, {tag=类型, score=0.3181}, {tag=盛世骄阳英文童谣大全, score=0.53446}]

2) Error 1:

hive> select udf_pb_kevinliu(tag) from ext_test;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1490153150757_1824274, Tracking URL = http://hadoop-jy-resourcemanager01:8088/proxy/application_1490153150757_1824274/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1490153150757_1824274
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2017-04-11 15:41:17,541 Stage-1 map = 0%,  reduce = 0%
2017-04-11 15:41:29,747 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.51 sec
MapReduce Total cumulative CPU time: 3 seconds 510 msec
Ended Job = job_1490153150757_1824274
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 3.51 sec   HDFS Read: 278 HDFS Write: 21 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 510 msec
OK


3) Error 2:

hive> select udf_pb_kevinliu(tagPb) from ext_test;
FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments 'tagPb': No matching method for class com.abc.ttbrain.log.manager.hive.DecodePbUdf4Byte with (binary). Possible choices: _FUNC_(binary)  _FUNC_(binary, string)  _FUNC_(string)  

 

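3. Summary:

Here the pb binary is written directly into HBase. The Hive external table maps the same HBase column twice, once as binary and once as string. Problems found:

1) A string column cannot correctly represent data that was written to HBase as raw pb binary;

2) With a binary column, the UDF has to be called with two arguments; calling it with a single argument fails with a puzzling error, which may be a Hive bug.

So, whenever possible, Base64-encode the binary generated by pb.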
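
The post does not say how Hive converts the column internally, but one likely reason a string column fails is that raw pb bytes are not valid UTF-8 text, so treating them as a string is lossy, while Base64 output is plain ASCII. A minimal sketch of that difference, using the first six bytes of the sample row; the class name `RoundTripSketch` is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical demo: compares a bytes -> text -> bytes round trip
// through UTF-8 with the same round trip through Base64.
final class RoundTripSketch {

    /** True if the bytes come back unchanged after being treated as UTF-8 text. */
    static boolean survivesUtf8(byte[] raw) {
        String asText = new String(raw, StandardCharsets.UTF_8);
        return Arrays.equals(raw, asText.getBytes(StandardCharsets.UTF_8));
    }

    /** True if the bytes come back unchanged after Base64 encoding and decoding. */
    static boolean survivesBase64(byte[] raw) {
        String encoded = Base64.getEncoder().encodeToString(raw);
        return Arrays.equals(raw, Base64.getDecoder().decode(encoded));
    }

    public static void main(String[] args) {
        // 0x08 is the field key, the next five bytes are the varint for fid 10000000570.
        byte[] raw = {0x08, (byte) 0xBA, (byte) 0xCC, (byte) 0xAF, (byte) 0xA0, 0x25};
        System.out.println(survivesUtf8(raw));   // false: invalid bytes are replaced
        System.out.println(survivesBase64(raw)); // true
    }
}
```

The cost of Base64 is roughly one third more storage, which is the trade-off against the storage efficiency that motivated pb in the first place.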
