Hive 2.1.1 fails to read ORC files written by Spark: ORC split generation failed with exception: ArrayIndexOutOfBoundsException: 6

Problem description: a Spark job reads data from Kafka and writes it into a Hive table stored as ORC. The data is written successfully, but querying the table from the Hive client fails with:

Failed with exception java.io.IOException:java.lang.RuntimeException: ORC split generation failed with exception: java.lang.ArrayIndexOutOfBoundsException: 6

CDH version: 6.1.1

Spark version: 2.4.0+cdh6.1.1

Kafka version: 2.0.0+cdh6.1.1

Hive version: 2.1.1+cdh6.1.1

HDFS version: 3.0.0+cdh6.1.1

Hive DDL for the table:

create table test_orc(
name string, 
age int, 
sex string, 
birth string) stored as orc;

After starting the Spark job, files are visible in the table's directory on HDFS.

Querying from the Hive client: fails with the error above.

Inspecting the ORC file with `hive --orcfiledump`: fails with the same error.


Searching Baidu turned up various parameter tweaks, none of which helped. Eventually I stumbled on a bug fix committed to the ORC project; for details, click here.

Before that fix, `from()` returns FUTURE only when the version number read from the file is exactly equal to `FUTURE.id`.

The `WriterVersion` source is as follows:

  /**
   * Records the version of the writer in terms of which bugs have been fixed.
   * For bugs in the writer, but the old readers already read the new data
   * correctly, bump this version instead of the Version.
   */
  public enum WriterVersion {
    ORIGINAL(0),
    HIVE_8732(1), // corrupted stripe/file maximum column statistics
    HIVE_4243(2), // use real column names from Hive tables
    HIVE_12055(3), // vectorized writer
    HIVE_13083(4), // decimal writer updating present stream wrongly

    // Don't use any magic numbers here except for the below:
    FUTURE(Integer.MAX_VALUE); // a version from a future writer

    private final int id;

    public int getId() {
      return id;
    }

    WriterVersion(int id) {
      this.id = id;
    }

    private static final WriterVersion[] values;
    static {
      // Assumes few non-negative values close to zero.
      int max = Integer.MIN_VALUE;
      for (WriterVersion v : WriterVersion.values()) {
        if (v.id < 0) throw new AssertionError();
        if (v.id > max && FUTURE.id != v.id) {
          max = v.id;
        }
      }
      values = new WriterVersion[max + 1];
      for (WriterVersion v : WriterVersion.values()) {
        if (v.id < values.length) {
          values[v.id] = v;
        }
      }
    }

    public static WriterVersion from(int val) {
      if (val == FUTURE.id) return FUTURE; // Special handling for the magic value.
      return values[val];
    }
  }
 
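To see concretely why a writer version of 6 blows up, here is a stripped-down, self-contained replica of the lookup above (the class name `WriterVersionDemo` is hypothetical; the real code lives in Hive's `OrcFile.java`, and the enum ids are copied from the excerpt):

```java
public class WriterVersionDemo {
    enum WriterVersion {
        ORIGINAL(0), HIVE_8732(1), HIVE_4243(2), HIVE_12055(3), HIVE_13083(4),
        FUTURE(Integer.MAX_VALUE);

        final int id;
        WriterVersion(int id) { this.id = id; }

        // The lookup array is sized max(id)+1 = 5, so only indices 0..4 are valid.
        static final WriterVersion[] values = new WriterVersion[5];
        static {
            for (WriterVersion v : WriterVersion.values()) {
                if (v.id < values.length) values[v.id] = v;
            }
        }

        // Unpatched behavior: only the exact FUTURE.id is special-cased,
        // so any other unknown id indexes past the end of the array.
        static WriterVersion from(int val) {
            if (val == FUTURE.id) return FUTURE;
            return values[val]; // val = 6 -> ArrayIndexOutOfBoundsException
        }
    }

    public static void main(String[] args) {
        System.out.println(WriterVersion.from(4)); // highest version Hive 2.1.1 knows
        try {
            WriterVersion.from(6); // what happens with Spark-written files
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("threw ArrayIndexOutOfBoundsException");
        }
    }
}
```

Running this reproduces the shape of the failure: any id above 4 that is not exactly `Integer.MAX_VALUE` falls through to the array access and throws.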

From the stack trace, the error occurs in the `WriterVersion.from` method in OrcFile.java.

I opened the source and started debugging; for setting up the source environment, see the article on compiling and debugging the Hive 2.1.1 source (hive2.1.1源码编译及调试).

After digging through the call chain, the concrete call site turned up in OrcTail.java:

  public OrcFile.WriterVersion getWriterVersion() {
    OrcProto.PostScript ps = fileTail.getPostscript();
    return (ps.hasWriterVersion()
        ? OrcFile.WriterVersion.from(ps.getWriterVersion()) : OrcFile.WriterVersion.ORIGINAL);
  }

Following `OrcProto.PostScript#getWriterVersion()`, the writer version is read from the file's postscript as a protobuf field: `writerVersion_ = input.readUInt32();`, which ultimately calls `readRawVarint32()`:

  /**
   * Read a raw Varint from the stream.  If larger than 32 bits, discard the
   * upper bits.
   */
  public int readRawVarint32() throws IOException {
    byte tmp = readRawByte();
    if (tmp >= 0) {
      return tmp;
    }
    int result = tmp & 0x7f;
    if ((tmp = readRawByte()) >= 0) {
      result |= tmp << 7;
    } else {
      result |= (tmp & 0x7f) << 7;
      if ((tmp = readRawByte()) >= 0) {
        result |= tmp << 14;
      } else {
        result |= (tmp & 0x7f) << 14;
        if ((tmp = readRawByte()) >= 0) {
          result |= tmp << 21;
        } else {
          result |= (tmp & 0x7f) << 21;
          result |= (tmp = readRawByte()) << 28;
          if (tmp < 0) {
            // Discard upper 32 bits.
            for (int i = 0; i < 5; i++) {
              if (readRawByte() >= 0) {
                return result;
              }
            }
            throw InvalidProtocolBufferException.malformedVarint();
          }
        }
      }
    }
    return result;
  }

This method returns 6. The decoding is simpler than it looks for small values: a protobuf varint stores 7 bits per byte with the high bit as a continuation flag, so any value below 128 fits in a single byte and the first `readRawByte()` returns it directly. (For what it's worth, in newer ORC versions 6 is the id of the `ORC_135` writer version, which the native ORC writer bundled with Spark 2.4 likely emits, while Hive 2.1.1's enum stops at `HIVE_13083(4)`.) In any case, we have found where the 6 comes from; we just need to handle values that do not exist in the `WriterVersion` enum.
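The single-byte path can be sketched in isolation (hypothetical class name; same logic as the first branch of the `readRawVarint32` excerpt above):

```java
public class VarintDemo {
    // Decode a protobuf varint whose first byte has the high bit clear:
    // the byte itself is the value. This is why readRawVarint32 returns 6
    // when the writer-version byte in the ORC postscript is 0x06.
    static int decodeSingleByteVarint(byte b) {
        if (b >= 0) { // high bit clear -> the value fits in these 7 bits
            return b;
        }
        throw new IllegalArgumentException("multi-byte varint; not handled in this sketch");
    }

    public static void main(String[] args) {
        System.out.println(decodeSingleByteVarint((byte) 0x06)); // prints 6
    }
}
```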

Apply the same change as the upstream ORC bug fix:

// before
if (val == FUTURE.id) return FUTURE;
// after
if (val >= values.length) { return FUTURE; }

With this change, any version value greater than or equal to the length of the `WriterVersion` lookup array is treated as an unknown future version, instead of only matching the exact `FUTURE.id`. This eliminates the array-index-out-of-bounds exception.
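A sketch of the patched lookup (hypothetical class name, enum ids as in the excerpt earlier) shows that the out-of-range id now degrades gracefully:

```java
public class PatchedLookupDemo {
    enum WriterVersion {
        ORIGINAL(0), HIVE_8732(1), HIVE_4243(2), HIVE_12055(3), HIVE_13083(4),
        FUTURE(Integer.MAX_VALUE);

        final int id;
        WriterVersion(int id) { this.id = id; }

        static final WriterVersion[] values = new WriterVersion[5];
        static {
            for (WriterVersion v : WriterVersion.values()) {
                if (v.id < values.length) values[v.id] = v;
            }
        }

        // Patched: a bounds check replaces the exact comparison with FUTURE.id,
        // so any id the reader does not know maps to FUTURE.
        static WriterVersion from(int val) {
            if (val >= values.length) {
                return FUTURE;
            }
            return values[val];
        }
    }

    public static void main(String[] args) {
        System.out.println(WriterVersion.from(6)); // FUTURE, no exception
        System.out.println(WriterVersion.from(4)); // HIVE_13083, unchanged
    }
}
```

Known versions resolve exactly as before; only the failure mode for newer writers changes.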

That completes the code change.

Next, compile the modified classes and repackage. Both hive-exec and hive-orc are affected, so both modules need to be rebuilt.

Run `mvn clean install -DskipTests` in each module to skip the tests, then replace the corresponding jars under hive/lib on the cluster with the newly built ones.

Querying again now succeeds.

Problem solved.
