ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs:...

0: jdbc:hive2://master01.hadoop.dtmobile.cn:1> select * from cell_random_grid_tmp2 limit 1;
INFO : Compiling command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5): select * from cell_random_grid_tmp2 limit 1
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:grid_row_id, type:int, comment:null), FieldSchema(name:grid_col_id, type:int, comment:null), FieldSchema(name:google_gri, type:int, comment:null), FieldSchema(name:google_gci, type:int, comment:null), FieldSchema(name:user_lon, type:double, comment:null), FieldSchema(name:user_lat, type:double, comment:null), FieldSchema(name:grid_type, type:int, comment:null), FieldSchema(name:grid_height, type:int, comment:null), FieldSchema(name:compute_region_name, type:string, comment:null), FieldSchema(name:antenna_0, type:string, comment:null), FieldSchema(name:antenna_1, type:string, comment:null), FieldSchema(name:antenna_2, type:string, comment:null), FieldSchema(name:antenna_3, type:string, comment:null), FieldSchema(name:antenna_4, type:string, comment:null), FieldSchema(name:antenna_5, type:string, comment:null), FieldSchema(name:antenna_6, type:string, comment:null), FieldSchema(name:scene, type:string, comment:null), FieldSchema(name:base_lon, type:double, comment:null), FieldSchema(name:base_lat, type:double, comment:null), FieldSchema(name:ssb_send_power, type:double, comment:null), FieldSchema(name:base_h_angle, type:double, comment:null), FieldSchema(name:antenna_height, type:double, comment:null), FieldSchema(name:m_vertical_angle, type:double, comment:null), FieldSchema(name:h_beam_precision, type:int, comment:null), FieldSchema(name:v_beam_precision, type:int, comment:null), FieldSchema(name:simu_spectrum, type:decimal(2,1), comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5); Time taken: 0.045 seconds
INFO : Executing command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5): select * from cell_random_grid_tmp2 limit 1
INFO : Completed executing command(queryId=hive_20190904113737_49bb8821-f8a1-4e49-a32e-12e3b45c6af5); Time taken: 0.001 seconds
INFO : OK
Error: java.io.IOException: parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://master01.hadoop.dtmobile.cn:8020/user/hive/warehouse/capacity.db/cell_random_grid_tmp2/part-00000-82a689a5-7c2a-48a0-ab17-8bf04c963ea6-c000.snappy.parquet (state=,code=0)
0: jdbc:hive2://master01.hadoop.dtmobile.cn:1>

 

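The data was written to Hive from Spark 2.3 using Spark SQL's saveAsTable(). When Spark SQL writes to Hive this way, it stores the data as Parquet with Snappy compression by default. After the write completed, querying the table through Hive Beeline failed with the error above, while querying it through Spark worked normally.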

The same problem has been reported on Stack Overflow:

The root cause is as follows:

This issue is caused by the different Parquet conventions used by Hive and Spark. Hive stores the decimal datatype as a fixed-length byte array (FIXED_LEN_BYTE_ARRAY). In Spark versions after 1.4 the default convention is the standard Parquet representation for the decimal datatype, in which the underlying physical type depends on the precision of the column.
e.g. DECIMAL can be used to annotate the following types:
  int32: for 1 <= precision <= 9
  int64: for 1 <= precision <= 18; precision < 10 will produce a warning

Hence this issue happens only with datatypes that have different representations in the two Parquet conventions. If the datatype is a DECIMAL whose precision is above 18, for example DECIMAL(20,3), both conventions store it as a fixed-length byte array, so the issue does not arise. A column such as simu_spectrum decimal(2,1) in the table above, by contrast, is written by Spark as int32, which is consistent with Hive failing to read the file. If you are not aware of the internal representation of the datatypes, it is safest to read with the same convention that was used for writing. With Hive, you do not have the flexibility to choose the Parquet convention. But with Spark, you do.
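The precision rule can be sketched as a small lookup. This is a simplified model, not Spark's actual code: the object and method names are hypothetical, and it assumes that standard mode picks int32 or int64 by precision as listed above, and that legacy mode (the convention Hive reads) always uses a fixed-length byte array.

```scala
object DecimalPhysicalType {
  // Largest decimal precisions that fit in a signed 32-bit and 64-bit integer.
  val MaxIntDigits = 9
  val MaxLongDigits = 18

  // Physical Parquet type assumed for DECIMAL(precision, _) under each convention.
  // Simplified model for illustration only.
  def of(precision: Int, legacy: Boolean): String = {
    require(precision >= 1, "precision must be positive")
    if (legacy) "FIXED_LEN_BYTE_ARRAY"
    else if (precision <= MaxIntDigits) "INT32"
    else if (precision <= MaxLongDigits) "INT64"
    else "FIXED_LEN_BYTE_ARRAY"
  }

  // True when both conventions choose the same physical type.
  def sameInBothConventions(precision: Int): Boolean =
    of(precision, legacy = false) == of(precision, legacy = true)

  def main(args: Array[String]): Unit = {
    // simu_spectrum in the failing table is decimal(2,1).
    println(of(2, legacy = false))      // INT32
    println(of(2, legacy = true))       // FIXED_LEN_BYTE_ARRAY
    println(sameInBothConventions(20))  // true
  }
}
```

Under this model, any decimal column with precision of 18 or less is stored differently by the two conventions, which matches the simu_spectrum column in the table above.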

Solution: The convention Spark uses to write Parquet data is configurable. It is determined by the property spark.sql.parquet.writeLegacyFormat, whose default value is false. If set to "true", Spark uses the same convention as Hive when writing Parquet data, which resolves the issue.

So I tried setting spark.sql.parquet.writeLegacyFormat = true, and the problem was solved.
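A minimal sketch of applying the setting from a Spark 2.x job might look like the following. The application name and the source table capacity.cell_random_grid_src are hypothetical; only the property name and the target table come from this post. Because the property only affects how files are written, files produced before the change have to be rewritten for Hive to read them.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object WriteLegacyParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("write-legacy-parquet") // hypothetical application name
      .enableHiveSupport()
      // Write Parquet in the legacy convention that Hive can read.
      .config("spark.sql.parquet.writeLegacyFormat", "true")
      .getOrCreate()

    // Assumption: the data comes from some other table; this name is hypothetical.
    val df = spark.table("capacity.cell_random_grid_src")

    // Rewrite the target table so its files use the legacy decimal encoding.
    df.write
      .mode(SaveMode.Overwrite)
      .saveAsTable("capacity.cell_random_grid_tmp2")

    spark.stop()
  }
}
```

The same property can also be passed on the command line with --conf when submitting the job, or changed on a running session with spark.conf.set before the write.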

 

Looking up the parameter (spark.sql.parquet.writeLegacyFormat) in the Spark 2.3 source code:

In package org.apache.spark.sql.internal, SQLConf.scala, which holds the default Spark SQL configuration, describes it as follows:

  val PARQUET_WRITE_LEGACY_FORMAT = buildConf("spark.sql.parquet.writeLegacyFormat")
    .doc("Whether to be compatible with the legacy Parquet format adopted by Spark 1.4 and prior " +
      "versions, when converting Parquet schema to Spark SQL schema and vice versa.")
    .booleanConf
    .createWithDefault(false)

As shown, the default value is false.

In package org.apache.spark.sql.execution.datasources.parquet, the doc comment in ParquetWriteSupport.scala reads:

/**
 * A Parquet [[WriteSupport]] implementation that writes Catalyst [[InternalRow]]s as Parquet
 * messages.  This class can write Parquet data in two modes:
 *
 *  - Standard mode: Parquet data are written in standard format defined in parquet-format spec.
 *  - Legacy mode: Parquet data are written in legacy format compatible with Spark 1.4 and prior.
 *
 * This behavior can be controlled by SQL option `spark.sql.parquet.writeLegacyFormat`.  The value
 * of this option is propagated to this class by the `init()` method and its Hadoop configuration
 * argument.
 */

 

Reposted from: https://www.cnblogs.com/dtmobile-ksw/p/11458121.html
