Problem description: a Spark job reads data from Kafka and writes it into a Hive table stored as ORC. The data is written successfully, but querying the table from the Hive client fails with:
Failed with exception java.io.IOException:java.lang.RuntimeException: ORC split generation failed with exception: java.lang.ArrayIndexOutOfBoundsException: 6
CDH version: 6.1.1
Spark version: 2.4.0+cdh6.1.1
Kafka version: 2.0.0+cdh6.1.1
Hive version: 2.1.1+cdh6.1.1
HDFS version: 3.0.0+cdh6.1.1
Hive DDL for the table:
create table test_orc(
name string,
age int,
sex string,
birth string) stored as orc;
After starting the Spark job that writes the data, the generated files are visible under the table's directory on HDFS.
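The exact writer job is not shown in this post; for context, a minimal sketch of the writing side might look like the following (the broker address, topic name, comma-separated value format, and checkpoint path are all assumptions):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KafkaToHiveOrc {
  public static void main(String[] args) throws Exception {
    // Hive support so files land in the warehouse layout Hive expects.
    SparkSession spark = SparkSession.builder()
        .appName("kafka-to-hive-orc")
        .enableHiveSupport()
        .getOrCreate();

    // Hypothetical broker and topic; the real source config isn't shown in the post.
    Dataset<Row> kafka = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "test_topic")
        .load();

    // Assume comma-separated values matching test_orc's four columns.
    Dataset<Row> rows = kafka
        .selectExpr("CAST(value AS STRING) AS line")
        .selectExpr(
            "split(line, ',')[0] AS name",
            "CAST(split(line, ',')[1] AS INT) AS age",
            "split(line, ',')[2] AS sex",
            "split(line, ',')[3] AS birth");

    // Write ORC files directly under the table's warehouse directory.
    rows.writeStream()
        .format("orc")
        .option("path", "/user/hive/warehouse/test_orc")
        .option("checkpointLocation", "/tmp/checkpoints/test_orc")
        .start()
        .awaitTermination();
  }
}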
Querying from the Hive client: fails with the error above.
Inspecting an ORC file with hive --orcfiledump: fails with the same error.
I searched Baidu for a solution and tried all sorts of parameter tweaks, none of which helped, until I stumbled upon a bug fix submitted to the ORC project; details are linked here.
The code that fix targets returns FUTURE only when the writer version number read from the file is exactly equal to FUTURE.id.
The WriterVersion source is as follows:
/**
 * Records the version of the writer in terms of which bugs have been fixed.
 * For bugs in the writer, but the old readers already read the new data
 * correctly, bump this version instead of the Version.
 */
public enum WriterVersion {
  ORIGINAL(0),
  HIVE_8732(1),  // corrupted stripe/file maximum column statistics
  HIVE_4243(2),  // use real column names from Hive tables
  HIVE_12055(3), // vectorized writer
  HIVE_13083(4), // decimal writer updating present stream wrongly
  // Don't use any magic numbers here except for the below:
  FUTURE(Integer.MAX_VALUE); // a version from a future writer

  private final int id;

  public int getId() {
    return id;
  }

  WriterVersion(int id) {
    this.id = id;
  }

  private static final WriterVersion[] values;
  static {
    // Assumes few non-negative values close to zero.
    int max = Integer.MIN_VALUE;
    for (WriterVersion v : WriterVersion.values()) {
      if (v.id < 0) throw new AssertionError();
      if (v.id > max && FUTURE.id != v.id) {
        max = v.id;
      }
    }
    values = new WriterVersion[max + 1];
    for (WriterVersion v : WriterVersion.values()) {
      if (v.id < values.length) {
        values[v.id] = v;
      }
    }
  }

  public static WriterVersion from(int val) {
    if (val == FUTURE.id) return FUTURE; // Special handling for the magic value.
    return values[val];
  }
}
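A quick aside on why this enum cannot handle the value 6: the static initializer sizes the lookup array from the largest known non-FUTURE id, which is 4, so values has length 5, and any id of 5 or more runs off the end. The following standalone sketch copies just the lookup logic from the code above and reproduces the exception:

// Standalone reproduction: ids 0..4 are known, so values.length == 5
// and from(6) indexes past the end of the array.
public class WriterVersionDemo {
  enum WriterVersion {
    ORIGINAL(0), HIVE_8732(1), HIVE_4243(2), HIVE_12055(3), HIVE_13083(4),
    FUTURE(Integer.MAX_VALUE);

    private final int id;
    WriterVersion(int id) { this.id = id; }

    private static final WriterVersion[] values;
    static {
      int max = Integer.MIN_VALUE;
      for (WriterVersion v : WriterVersion.values()) {
        if (v.id > max && FUTURE.id != v.id) {
          max = v.id;
        }
      }
      values = new WriterVersion[max + 1]; // length 5: indices 0..4
      for (WriterVersion v : WriterVersion.values()) {
        if (v.id < values.length) {
          values[v.id] = v;
        }
      }
    }

    static WriterVersion from(int val) {
      if (val == FUTURE.id) return FUTURE; // only matches Integer.MAX_VALUE
      return values[val];                  // values[6] is out of bounds
    }
  }

  public static void main(String[] args) {
    System.out.println(WriterVersion.from(4)); // HIVE_13083
    System.out.println(WriterVersion.from(6)); // ArrayIndexOutOfBoundsException: 6
  }
}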
From the error message we know the exception occurs in the WriterVersion.from method in OrcFile.
I opened the source and started debugging; to set up a source-level debugging environment, refer to the article 《hive2.1.1源码编译及调试》 (compiling and debugging the Hive 2.1.1 source).
After digging through layer after layer of calls, I found the concrete call site in OrcTail.java:
public OrcFile.WriterVersion getWriterVersion() {
  OrcProto.PostScript ps = fileTail.getPostscript();
  return (ps.hasWriterVersion()
      ? OrcFile.WriterVersion.from(ps.getWriterVersion())
      : OrcFile.WriterVersion.ORIGINAL);
}
OrcProto.PostScript's getWriterVersion() is simply the generated accessor for the writerVersion_ field of the PostScript message. That field is populated while parsing the file tail via writerVersion_ = input.readUInt32(), and protobuf's readUInt32() delegates to readRawVarint32():
/**
 * Read a raw Varint from the stream. If larger than 32 bits, discard the
 * upper bits.
 */
public int readRawVarint32() throws IOException {
  byte tmp = readRawByte();
  if (tmp >= 0) {
    return tmp;
  }
  int result = tmp & 0x7f;
  if ((tmp = readRawByte()) >= 0) {
    result |= tmp << 7;
  } else {
    result |= (tmp & 0x7f) << 7;
    if ((tmp = readRawByte()) >= 0) {
      result |= tmp << 14;
    } else {
      result |= (tmp & 0x7f) << 14;
      if ((tmp = readRawByte()) >= 0) {
        result |= tmp << 21;
      } else {
        result |= (tmp & 0x7f) << 21;
        result |= (tmp = readRawByte()) << 28;
        if (tmp < 0) {
          // Discard upper 32 bits.
          for (int i = 0; i < 5; i++) {
            if (readRawByte() >= 0) {
              return result;
            }
          }
          throw InvalidProtocolBufferException.malformedVarint();
        }
      }
    }
  }
  return result;
}
For our file this method returns 6. The decoding is actually simple in this case: the writer version is stored as a single-byte varint, 0x06, whose high bit is 0, so the very first if (tmp >= 0) branch returns 6 immediately. So we have found where the 6 comes from: the newer ORC writer bundled with Spark 2.4 records a writer version id of 6 (ORC_135 in newer upstream ORC releases), while Hive 2.1.1's WriterVersion enum only knows ids 0 through 4. All we need to do is handle values that do not exist in the WriterVersion enum instead of letting them blow up.
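To convince yourself of the single-byte decoding described above, here is a hypothetical, self-contained varint-32 decoder over a byte array; it uses the same algorithm as readRawVarint32 but is not the protobuf source itself:

// Minimal varint-32 decoder: 7 payload bits per byte, low bits first;
// a clear high bit marks the last byte. Decoding {0x06} yields 6 at once.
public class VarintDemo {
  static int decodeVarint32(byte[] buf) {
    int result = 0;
    int shift = 0;
    for (byte b : buf) {
      result |= (b & 0x7f) << shift;
      if ((b & 0x80) == 0) return result; // high bit clear: last byte
      shift += 7;
    }
    throw new IllegalArgumentException("truncated varint");
  }

  public static void main(String[] args) {
    System.out.println(decodeVarint32(new byte[] {0x06}));              // 6
    System.out.println(decodeVarint32(new byte[] {(byte) 0x96, 0x01})); // 150
  }
}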
The modification is the same as the one committed in the ORC bug fix:

Before: if (val == FUTURE.id) return FUTURE;
After:  if (val >= values.length) { return FUTURE; }
Instead of comparing the incoming version value against FUTURE's id, the changed code treats any value greater than or equal to the length of the values array as an unknown (future) version, which eliminates the array-out-of-bounds exception.
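Applied to the from() method shown earlier, the patched code reads:

public static WriterVersion from(int val) {
  if (val >= values.length) {
    // Anything beyond the ids this reader knows about came from a newer writer.
    return FUTURE;
  }
  return values[val];
}

With this change, WriterVersion.from(6) returns FUTURE instead of indexing past the end of the array.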
At this point the code changes are complete.
Next, the modified classes need to be compiled and repackaged. Both hive-exec and hive-orc contain this code, so both modules must be rebuilt.
In each module, run mvn clean install -DskipTests to skip the tests, then replace the corresponding jars under hive/lib on the cluster with the newly built ones.
Query from the Hive client again: the query now succeeds.
Problem solved.