hive查询hbase

背景

hbase查询的确是不太方便,除了指定rowkey,或者通过指定startkey stopkey进行scan之外,没有更有效的查询方式 如果想通过列值过滤,只能全表扫描了 如果要搞什么group by或者order by(除非你的rowkey做了相应设计) 更是没法弄 在传统的mysql/oracle得心应手的查询在hbase上就是束手束脚

当然可以通过写hadoop job解决问题,但为了查询去写job,代价未免有点高 于是hive出现了

有两个方法可以集成hive和hbase

  1. 使用HBaseStorageHandler,这个会直接操作HBase,可能会对线上产生影响
  2. 将HBase定期导入到HDFS,再通过hive访问HDFS

下面将详述第二种方法

HDFS导入

  1. 使用datax将HBase表导入到HDFS上,比如/group/wireless-arctic/task/arctic_task

  2. hive产生外部表,从而避免导入数据

     CREATE EXTERNAL TABLE task_history (
     biz_type string, 
     cid string,
     content string,
     ctime string,
     gmt_create string,
     hostName string,
     item string,
     mtime string,
     otags string,
     priority string,
     retry string,
     result string,
     srcImages string,
     src_url string,
     status string,
     summary string,
     task_type string,
     title string,
     userId string,
     userNick string,
     utags string,
     writer string
     ) 
     ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
     LOCATION '/group/wireless-arctic/task';

    location是云梯文件的目录

  3. 测试

     select cid,result from task_history limit 10;
     Total MapReduce jobs = 1
     Launching Job 1 out of 1
     Number of reduce tasks is set to 0 since there's no reduce operator
     Selecting distributed mode: Input Size (= 2578823293 = 2 gigabytes 411 megabytes 366 kilobytes 125 bytes) is larger than hive.exec.mode.local.auto.inputbytes.max (= 134217728 = 128 megabytes 0 kilobytes 0 bytes)
     Starting Job = job_201311281255_6734353, Tracking URL = http://hdpjt2.alibaba-inc.com/jobdetails.jsp?jobid=job_201311281255_6734353
     Kill Command = /home/hadoop/hadoop-current/bin/../bin/hadoop job  -Dmapred.job.tracker=hdpjt:9001 -kill job_201311281255_6734353
     Hadoop job information for Stage-1: number of mappers: 10; number of reducers: 0
     2013-12-19 18:53:02,891 Stage-1 map = 0%,  reduce = 0%
     2013-12-19 18:53:11,017 Stage-1 map = 50%,  reduce = 0%
     2013-12-19 18:53:12,033 Stage-1 map = 90%,  reduce = 0%
     2013-12-19 18:53:19,394 Stage-1 map = 100%,  reduce = 100%
     Ended Job = job_201311281255_6734353
     OK
     200011928538    success
     200011928538    success
     200011909281    success
     200011928474    success
     200011909281    success
     200011928474    success
     110010569498    failure:userId:1782836127,contentId:110010569498  ImageFlow,call error and ret:1
     110010523403    success
     110010523921    success
     110010524299    success
     Time taken: 23.137 seconds = 23 seconds 137 milliseconds

添加分区及自动化

  1. 完成了上面的步骤,你就可以查询数据了,但面临一个问题,数据更新怎么办?

    • 一个比较通用的做法就是每天跑一个定时任务将HBase表dump到HDFS,即每天一个快照
    • 每天的快照可以存放在以日期命名的目录中,这样可以保存多份快照,出了问题也好追踪
  2. hive如何利用这每天的快照?

    • 那就是hive分区
    • 分区的本意是数据量大了切分数据,但目前我们并未如此使用,而是利用分区来区分快照
    • 删除之前的表
      drop table task_history;
    • 产生一张分区表
      CREATE EXTERNAL TABLE task_history (
      biz_type string, 
      cid string,
      content string,
      ctime string,
      gmt_create string,
      hostName string,
      item string,
      mtime string,
      otags string,
      priority string,
      retry string,
      result string,
      srcImages string,
      src_url string,
      status string,
      summary string,
      task_type string,
      title string,
      userId string,
      userNick string,
      utags string,
      writer string
      )
      PARTITIONED BY (dt string) 
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
      LOCATION '/group/wireless-arctic/task';
      其实就是在之前的建表语句中加了一行PARTITIONED BY (dt string)
    • 添加分区
      ALTER TABLE task_history ADD PARTITION(dt='20131223') LOCATION '/group/wireless-arctic/task/20131223';
  3. 如何自动化

    • 通过工具比如datax或者其他导出工具将HBase表导出到HDFS,正如前面提到的每天一个目录(以日期命名)
    • 将每天的数据目录挂载到hive分区
      hive -e "ALTER TABLE task_history ADD PARTITION(dt=`date -d yesterday +%Y%m%d`) LOCATION '/group/wireless-arctic/task/`date -d yesterday +%Y%m%d`';"
    • 将前面2个步骤的脚本整合到crontab 中就可以做到自动化了
    • 最后如何通过分区查询
      select * from task_history where dt='20131223' limit 10;
      即加上分区查询条件dt='20131223'

你可能感兴趣的:(hbase,hive,hbase)