Reading Hive ORC Files with the Java API

ORC is a columnar storage file format native to Hive. It offers a very high compression ratio and read efficiency, so it quickly replaced the earlier RCFile and became one of the most commonly used file formats in Hive.

In real business scenarios, you may need to read and write ORC files with the Java API or with MapReduce.

This article first covers reading Hive ORC files with the Java API.

Hive already contains a table named lxw1234 stored in ORC format:

The table has four columns, url, word, freq and weight, all of type string.

It contains only 5 rows:

 

The following code reads the ORC file directly from the HDFS path backing the table lxw1234, using the API:

 
    package com.lxw1234.test;

    import java.util.List;
    import java.util.Properties;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
    import org.apache.hadoop.hive.ql.io.orc.OrcSerde;
    import org.apache.hadoop.hive.serde2.objectinspector.StructField;
    import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.InputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    /**
     * lxw的大数据田地 -- http://lxw1234.com
     * @author lxw.com
     */
    public class TestOrcReader {

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf();
            Path testFilePath = new Path(args[0]);

            // Describe the table schema to the SerDe: four string columns.
            Properties p = new Properties();
            p.setProperty("columns", "url,word,freq,weight");
            p.setProperty("columns.types", "string:string:string:string");
            OrcSerde serde = new OrcSerde();
            serde.initialize(conf, p);
            StructObjectInspector inspector = (StructObjectInspector) serde.getObjectInspector();

            // Compute input splits over the table's HDFS directory.
            InputFormat in = new OrcInputFormat();
            FileInputFormat.setInputPaths(conf, testFilePath.toString());
            InputSplit[] splits = in.getSplits(conf, 1);
            System.out.println("splits.length==" + splits.length);

            // Column projection hint (the output below still shows all four columns).
            conf.set("hive.io.file.readcolumn.ids", "1");

            // Read records from the first split only; sufficient for this small test.
            RecordReader reader = in.getRecordReader(splits[0], conf, Reporter.NULL);
            Object key = reader.createKey();
            Object value = reader.createValue();
            List<? extends StructField> fields = inspector.getAllStructFieldRefs();
            while (reader.next(key, value)) {
                // Extract each column from the deserialized row struct.
                Object url = inspector.getStructFieldData(value, fields.get(0));
                Object word = inspector.getStructFieldData(value, fields.get(1));
                Object freq = inspector.getStructFieldData(value, fields.get(2));
                Object weight = inspector.getStructFieldData(value, fields.get(3));
                System.out.println(url + "|" + word + "|" + freq + "|" + weight);
            }
            reader.close();
        }
    }
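The "columns" property above lists the field names in order, and hive.io.file.readcolumn.ids expects the zero-based positions of the columns you want. If you only need a subset of columns, the ids can be derived from the names. A minimal self-contained sketch (the class and helper name columnIds are hypothetical, not part of the Hive API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnIds {

    // Map requested column names to their zero-based positions in the
    // comma-separated "columns" property value, producing the format
    // expected by hive.io.file.readcolumn.ids (e.g. "1,2").
    static String columnIds(String columnsProp, String... wanted) {
        List<String> all = Arrays.asList(columnsProp.split(","));
        List<String> ids = new ArrayList<>();
        for (String name : wanted) {
            int idx = all.indexOf(name);
            if (idx < 0) {
                throw new IllegalArgumentException("unknown column: " + name);
            }
            ids.add(String.valueOf(idx));
        }
        return String.join(",", ids);
    }

    public static void main(String[] args) {
        // For the lxw1234 table schema, project only word and freq.
        System.out.println(columnIds("url,word,freq,weight", "word", "freq")); // prints 1,2
    }
}
```

In real Hive code this bookkeeping is normally done through configuration helpers rather than by hand, but the mapping itself is just name-to-position as shown.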

Package the code above into orc.jar.

Then run the following commands on a Hadoop client machine:

export HADOOP_CLASSPATH=/usr/local/apache-hive-0.13.1-bin/lib/hive-exec-0.13.1.jar:$HADOOP_CLASSPATH

hadoop jar orc.jar com.lxw1234.test.TestOrcReader /hivedata/warehouse/liuxiaowen.db/lxw1234/

The output is as follows:

 
    [liuxiaowen@dev tmp]$ hadoop jar orc.jar com.lxw1234.test.TestOrcReader /hivedata/warehouse/liuxiaowen.db/lxw1234/
    15/08/18 17:03:37 INFO log.PerfLogger:
    15/08/18 17:03:38 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
    15/08/18 17:03:38 INFO orc.OrcInputFormat: FooterCacheHitRatio: 0/1
    15/08/18 17:03:38 INFO log.PerfLogger:
    splits.length==1
    15/08/18 17:03:38 INFO orc.ReaderImpl: Reading ORC rows from hdfs://cdh5/hivedata/warehouse/liuxiaowen.db/lxw1234/000000_0 with {include: null, offset: 0, length: 712}
    http://cook.mytv365.com/v/20130403/1920802.html|未找到|1|5
    http://cook.mytv365.com/v/20130403/1920802.html|网站|1|3
    http://cook.mytv365.com/v/20130403/1920802.html|广大|1|3
    http://cook.mytv365.com/v/20130403/1920802.html|直播|1|3
    http://cook.mytv365.com/v/20130403/1920802.html|精彩|1|3

The results are correct.

Note: this program is only a feasibility test. For large volumes of ORC data it would need to be improved, or MapReduce used instead.
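One obvious improvement: the program reads only splits[0], while a table directory with several ORC files (or large files) yields several splits. The fragment below is a sketch, not a complete program; it assumes the variables in, conf, inspector and fields have already been initialized exactly as in the listing above, and it cannot run outside a Hadoop environment.

```java
// Sketch: iterate over every split instead of only the first one.
// Assumes in, conf, inspector and fields are set up as in the listing above.
InputSplit[] splits = in.getSplits(conf, 1);
for (InputSplit split : splits) {
    RecordReader reader = in.getRecordReader(split, conf, Reporter.NULL);
    Object key = reader.createKey();
    Object value = reader.createValue();
    while (reader.next(key, value)) {
        // Same per-row extraction as before, shown here for the first column only.
        Object url = inspector.getStructFieldData(value, fields.get(0));
        System.out.println(url);
    }
    reader.close();
}
```

This still processes the splits sequentially in one JVM; for genuinely large tables, MapReduce (covered in a later post) distributes one split per task instead.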

Later posts will cover writing ORC files with the Java API, as well as reading and writing Hive ORC files with MapReduce.
