【Hive】CDP集群Hive NULL值排序前后的问题

最近公司在从本地大数据环境迁移到CDP的集群。在跑相同的SQL代码时,两边对比发现数据有问题,然后就开始查找原因:在对数据进行分组排序时,两个环境的NULL值默认排序不同。

排错

  • 1、首先我们执行相同的SQL,对比了两边SQL的执行计划,
select id
     , times
     , row_number() over (partition by id order by times desc) as rn
from (select 1 as id, 1659258809386 as times
      union all
      select 1 as id, null as times
      ) t1;

CDP Hive的执行计划如下

Plan optimized by CBO.

Vertex dependency in root stage
Map 1 <- Union 2 (CONTAINS)
Map 4 <- Union 2 (CONTAINS)
Reducer 3 <- Union 2 (SIMPLE_EDGE)

Stage-0
  Fetch Operator
    limit:-1
    Stage-1
      Reducer 3
      File Output Operator [FS_12]
        Select Operator [SEL_10] (rows=2 width=16)
           Output:["_col0","_col1","_col2"]"
          PTF Operator [PTF_9] (rows=2 width=12)
             Function definitions:[{},{"name:":"windowingtablefunction","order by:":"_col1 DESC NULLS LAST","partition by:":"_col0"}]"
            Select Operator [SEL_8] (rows=2 width=12)
               Output:["_col0","_col1"]"
            <-Union 2 [SIMPLE_EDGE]
              <-Map 1 [CONTAINS]
                Reduce Output Operator [RS_7]
                  PartitionCols:1
                  Select Operator [SEL_6] (rows=2 width=12)
                     Output:["_col1"]"
                    Select Operator [SEL_1] (rows=1 width=12)
                       Output:["_col1"]"
                      TableScan [TS_0] (rows=1 width=1)
              <-Map 4 [CONTAINS]
                Reduce Output Operator [RS_7]
                  PartitionCols:1
                  Select Operator [SEL_6] (rows=2 width=12)
                     Output:["_col1"]
                    Select Operator [SEL_3] (rows=1 width=8)
                       Output:["_col1"]
                      TableScan [TS_2] (rows=1 width=1)

本地的执行计划结果如下:

Plan not optimized by CBO.

Vertex dependency in root stage
Map 1 <- Union 2 (CONTAINS)
Map 4 <- Union 2 (CONTAINS)
Reducer 3 <- Union 2 (SIMPLE_EDGE)

Stage-0
  Fetch Operator
    limit:-1
    Stage-1
      Reducer 3
      File Output Operator [FS_12]
        Select Operator [SEL_10] (rows=2 width=16)
           Output:["_col0","_col1","_col2"]"
          PTF Operator [PTF_9] (rows=2 width=12)
             Function definitions:[{},{"name:":"windowingtablefunction","order by:":"_col1 DESC NULLS LAST","partition by:":"_col0"}]"
            Select Operator [SEL_8] (rows=2 width=12)
               Output:["_col0","_col1"]"
            <-Union 2 [SIMPLE_EDGE]
              <-Map 1 [CONTAINS]
                Reduce Output Operator [RS_7]
                  PartitionCols:1
                  Select Operator [SEL_6] (rows=2 width=12)
                     Output:["_col1"]"
                    Select Operator [SEL_1] (rows=1 width=12)
                       Output:["_col1"]"
                      TableScan [TS_0] (rows=1 width=1)
              <-Map 4 [CONTAINS]
                Reduce Output Operator [RS_7]
                  PartitionCols:1
                  Select Operator [SEL_6] (rows=2 width=12)
                     Output:["_col1"]
                    Select Operator [SEL_3] (rows=1 width=8)
                       Output:["_col1"]
                      TableScan [TS_2] (rows=1 width=1)

发现是对NULL值的排序顺序两边不一致
对比结果如下:

【Hive】CDP集群Hive NULL值排序前后的问题_第1张图片

  • 2、于是去官网确认默认的配置什么
    本地集群用的是Hive的2.7.3版本,在Hive官网发现ISSUE: HIVE-12994

SQL standard does not specify the behavior by default. Currently in Hive, null values sort as if lower than any non-null value; that is, NULLS FIRST is the default for ASC order, and NULLS LAST for DESC order.
即:默认情况下,SQL标准不会指定该行为。目前在Hive中,空值排序似乎低于任何非空值;即ASC顺序为NULLS FIRST, DESC顺序为NULLS LAST
如下图所示:
【Hive】CDP集群Hive NULL值排序前后的问题_第2张图片
但是CDP的Hive默认NULL值排序是NULLS FIRST。

  • 3、解决办法
      1)改动代码在排序语句后加nulls last ,或者使用nvl()函数对空值做处理
      2)修改配置不修改代码:
    CDP中默认·hive.default.nulls.last=true·,如果您需要在集群中永久设置·hive.default.nulls.last=false·
    您可如下设置:
    CM - Hive on tez - 配置 - 搜索hive_hs2_config_safety_valve - 添加hive.default.nulls.last=false - 保存并重启服务。

你可能感兴趣的:(#,Hive,hive,hadoop,数据仓库)