parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file: notes on the fix

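  1. How the problem arose:
    The problem appeared while importing data from AWS into the Hive warehouse on my own platform. I do not know how the table is produced on the AWS side, only that it is stored as Parquet. I ran show create table tb_a; there to get the DDL, then used that DDL to create the table in my own warehouse, roughly as follows:

Create the table: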

CREATE EXTERNAL TABLE `s_tb_a`(
  aaa   string,
  bbb   double,
  ccc   string,
  eee   string,
  ddd   string,
  ffff  string,
  hhh   double,
  iiii  string,
  jjjj  decimal(38,4)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS PARQUET;

Copy the data from AWS:
hadoop distcp s3n://xxxxxx/dbName/tb_a/* /user/hive/warehouse/stage.db/s_tb_a/

Queries against the table then failed with: Can not read value at 0 in block -1 in file

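  2. Root cause analysis:
    At first I suspected that my table definition differed from the AWS one and that this was why the data could not be read, but the definition turned out to be fine;
    I also tried changing the decimal column to string or to double, and neither worked.
    Then I found this explanation: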
    Root Cause:

This issue is caused by the different Parquet conventions used in Hive and Spark. In Hive, the decimal datatype is represented as a fixed-length byte array (FIXED_LEN_BYTE_ARRAY). In Spark 1.4 or later, the default convention is to use the standard Parquet representation for the decimal datatype, in which the underlying representation changes with the precision of the column datatype.
e.g. DECIMAL can be used to annotate the following types: int32 for 1 <= precision <= 9; int64 for 1 <= precision <= 18 (precision < 10 will produce a warning).

Hence this issue happens only with datatypes that have different representations in the two Parquet conventions. If a column's datatype happens to have the same representation under both conventions, we won’t face an issue. If you are not aware of the internal representation of the datatypes, it is safest to read with the same convention that was used for writing. With Hive, you do not have the flexibility to choose the Parquet convention. But with Spark, you do.
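To make the difference between the two conventions concrete, here is a minimal Python sketch of how a decimal column's physical Parquet type could be chosen from its precision. It reflects my understanding of the Parquet DECIMAL rules and of Spark's writer, not code taken from either project; the function names min_bytes_for_precision and physical_type are made up for illustration.

```python
def min_bytes_for_precision(precision: int) -> int:
    """Smallest number of bytes whose signed range can hold any value with `precision` digits."""
    num_bytes = 1
    while 2 ** (8 * num_bytes - 1) < 10 ** precision:
        num_bytes += 1
    return num_bytes


def physical_type(precision: int, legacy: bool) -> str:
    """Physical type assumed for DECIMAL(precision, _) under each convention.

    legacy=True  : Hive-style layout, always a fixed-length byte array (assumption).
    legacy=False : standard layout, int32 / int64 for small precisions (assumption).
    """
    if not 1 <= precision <= 38:
        raise ValueError("precision must be between 1 and 38")
    fixed = f"FIXED_LEN_BYTE_ARRAY({min_bytes_for_precision(precision)})"
    if legacy:
        return fixed
    if precision <= 9:
        return "INT32"
    if precision <= 18:
        return "INT64"
    return fixed


if __name__ == "__main__":
    for p in (5, 9, 18):
        print(p, physical_type(p, legacy=False), physical_type(p, legacy=True))
```

Under these assumptions a column with precision 9 or less is stored as INT32 by the standard convention but as a 4-byte fixed-length array by the legacy one, which is the kind of mismatch the quoted explanation describes.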

Solution: The convention used by Spark to write Parquet data is configurable. This is determined by the property spark.sql.parquet.writeLegacyFormat. The default value is false. If set to “true”, Spark will use the same convention as Hive for writing the Parquet data. This will help to solve the issue.

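--conf "spark.sql.parquet.writeLegacyFormat=true"
I forwarded this article to the people who produce the data. They later said they had made some changes (presumably following this setting), told me it was fixed, and asked me to reload. I did not recreate the table; I simply ran distcp on their data again and loaded it into the table, but the error message changed: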
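If the job that writes the data is a PySpark script rather than something launched with command-line flags, the same property can presumably be set on the session builder. The sketch below assumes pyspark is installed on the writing side; the application name and the helper build_session are hypothetical, and only the property name comes from the quoted solution.

```python
# Only the property name is taken from the quoted solution; the rest is illustrative.
LEGACY_PARQUET_CONF = {"spark.sql.parquet.writeLegacyFormat": "true"}


def build_session(app_name="rewrite_tb_a_legacy"):
    """Return a SparkSession configured to write Parquet in the Hive-compatible layout."""
    # Imported lazily so this file can be loaded on machines without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in LEGACY_PARQUET_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

The setting only affects files written after it is applied, so existing Parquet files have to be rewritten by the producer before they are copied again.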

[Screenshot of the new error message; image not preserved]
I then changed every decimal column to double, recreated the table, and ran distcp again. After that, queries worked in both Hive and Impala.
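For a table with many columns, the decimal-to-double change in the DDL can be scripted. The following is a minimal sketch using a regular expression; the helper name decimals_to_double is hypothetical, and it only rewrites the DDL text, it does not touch the data files.

```python
import re

# Matches the type keyword "decimal" with an optional (precision[, scale]) suffix.
# The trailing \b keeps column names such as "decimal_total" untouched.
DECIMAL_TYPE = re.compile(
    r"\bdecimal\b(\s*\(\s*\d+\s*(,\s*\d+\s*)?\))?",
    re.IGNORECASE,
)


def decimals_to_double(ddl: str) -> str:
    """Return the DDL with every decimal column type replaced by double."""
    return DECIMAL_TYPE.sub("double", ddl)


if __name__ == "__main__":
    print(decimals_to_double("jjjj    decimal(38,4)"))
```

Note that double is a binary floating-point type, so values that were exact as decimals may no longer round-trip exactly after this change.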
