Writing Local ORC Files in Java (Hive2 API)

Since Hive 2.0, a new API is used to read and write ORC files (https://orc.apache.org).
The code in this article uses a Java program to generate an ORC file locally and then loads it into a Hive table.
The code is as follows:

package com.lxw1234.hive.orc;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class TestORCWriter {

    public static void main(String[] args) throws Exception {
        // Define the ORC schema, i.e. the table structure:
        // CREATE TABLE lxw_orc1 (
        //   field1 STRING,
        //   field2 STRING,
        //   field3 STRING
        // ) STORED AS ORC;
        TypeDescription schema = TypeDescription.createStruct()
                .addField("field1", TypeDescription.createString())
                .addField("field2", TypeDescription.createString())
                .addField("field3", TypeDescription.createString());
        // Local absolute path of the output ORC file
        String lxw_orc1_file = "/tmp/lxw_orc1_file.orc";
        Configuration conf = new Configuration();
        // Write to the local file system
        FileSystem localFs = FileSystem.getLocal(conf);
        Writer writer = OrcFile.createWriter(new Path(lxw_orc1_file),
                OrcFile.writerOptions(conf)
                        .fileSystem(localFs)
                        .setSchema(schema)
                        .stripeSize(67108864)
                        .bufferSize(131072)
                        .blockSize(134217728)
                        .compress(CompressionKind.ZLIB)
                        .version(OrcFile.Version.V_0_12));
        // Rows to write, one comma-separated line per row
        String[] contents = new String[]{"1,a,aa", "2,b,bb", "3,c,cc", "4,d,dd"};
        VectorizedRowBatch batch = schema.createRowBatch();
        for (String content : contents) {
            int rowCount = batch.size++;
            String[] logs = content.split(",", -1);
            for (int i = 0; i < logs.length; i++) {
                ((BytesColumnVector) batch.cols[i]).setVal(rowCount, logs[i].getBytes());
            }
            // Flush the batch once it is full
            if (batch.size == batch.getMaxSize()) {
                writer.addRowBatch(batch);
                batch.reset();
            }
        }
        // Write any remaining rows and close the file
        if (batch.size != 0) {
            writer.addRowBatch(batch);
        }
        writer.close();
    }
}

Package the program into a plain jar, orcwriter.jar (a packaging sketch follows the dependency list below).
In addition, running the Java program locally depends on the following jars:

commons-collections-3.2.1.jar
hadoop-auth-2.3.0-cdh5.0.0.jar
hive-exec-2.1.0.jar
slf4j-log4j12-1.7.7.jar
commons-configuration-1.6.jar
hadoop-common-2.3.0-cdh5.0.0.jar
log4j-1.2.16.jar
commons-logging-1.1.1.jar
hadoop-hdfs-2.3.0-cdh5.0.0.jar
slf4j-api-1.7.7.jar
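
For reference, packaging could look roughly like this. This is only a sketch: the src/, lib/ and classes/ directories are assumptions for illustration and are not part of the original setup; it assumes the jars above have been copied into lib/.

#!/bin/bash
# Sketch only: sources under src/, the dependency jars listed above under lib/
mkdir -p classes
javac -cp "lib/*" -d classes src/com/lxw1234/hive/orc/TestORCWriter.java
jar cf orcwriter.jar -C classes .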

run2.sh wraps the execution command:

#!/bin/bash

PWD=$(dirname ${0})
echo "PWD [${PWD}] .."

# Build the classpath from every jar found under the script's directory
JARS=`find -L "${PWD}" -name '*.jar' -printf '%p:'`
echo "JARS [${JARS}] .."

$JAVA_HOME/bin/java -cp ${JARS} com.lxw1234.hive.orc.TestORCWriter $*

Run ./run2.sh.
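
Before loading the file into Hive, it can also be sanity-checked locally by reading it back with the same ORC API. Below is a minimal sketch; the TestORCReader class and its printing logic are only for illustration and are not part of the original post.

package com.lxw1234.hive.orc;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

// Illustrative reader for the file produced by TestORCWriter (not from the original post)
public class TestORCReader {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Reader reader = OrcFile.createReader(new Path("/tmp/lxw_orc1_file.orc"),
                OrcFile.readerOptions(conf));
        RecordReader rows = reader.rows();
        VectorizedRowBatch batch = reader.getSchema().createRowBatch();
        // Read batch by batch and print each row as a comma-separated line
        while (rows.nextBatch(batch)) {
            for (int r = 0; r < batch.size; r++) {
                StringBuilder line = new StringBuilder();
                for (int c = 0; c < batch.numCols; c++) {
                    BytesColumnVector col = (BytesColumnVector) batch.cols[c];
                    int row = col.isRepeating ? 0 : r;
                    if (c > 0) {
                        line.append(",");
                    }
                    line.append(new String(col.vector[row], col.start[row], col.length[row]));
                }
                System.out.println(line);
            }
        }
        rows.close();
    }
}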

Create the table in Hive and LOAD the data:
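
Using the table definition from the code comments above, the statements look roughly like this (table and file names as in this example):

CREATE TABLE lxw_orc1 (
  field1 STRING,
  field2 STRING,
  field3 STRING
) STORED AS ORC;

LOAD DATA LOCAL INPATH '/tmp/lxw_orc1_file.orc' INTO TABLE lxw_orc1;

-- quick check
SELECT * FROM lxw_orc1;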

As you can see, the generated ORC file works as expected.

In most cases it is still recommended to convert text files to ORC format inside Hive itself; generating ORC files locally with Java, as shown here, is only for special-requirement scenarios.

 

