Hive Quantile Functions: Usage Details and Pitfalls

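Table of Contents

  • Pitfalls of the percentile and percentile_approx quantile functions in Hive
    • 1. Verification
      • 1.1. How the equal-frequency median is calculated
    • 2. Second verification
    • Computing several quantiles at once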

Pitfalls of the percentile and percentile_approx quantile functions in Hive

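Conclusion:
- For integer columns, the median computed by the percentile function matches the usual definition: sort all observations and take the one in the middle. With an even number of observations, the median is the average of the two middle values.
- For double columns, the median computed by the percentile_approx function is obtained by an equal-frequency method: each value contributes an equal share of cumulative frequency, and the result is interpolated at the requested quantile. The result can therefore differ from the usual median.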

1. Verification

Test data

CREATE TABLE temp_median_test 
(
id int,
number bigint,
number_b double
);

-- Insert sample data
INSERT overwrite TABLE temp_median_test select 1,1,1.2;
INSERT INTO TABLE temp_median_test select 2,2,2.2;
INSERT INTO TABLE temp_median_test select 3,3,3.3;
INSERT INTO TABLE temp_median_test select 4,4,4.3;
INSERT INTO TABLE temp_median_test select 5,5,5.2;
INSERT INTO TABLE temp_median_test select 6,6,6.7;
INSERT INTO TABLE temp_median_test select 7,7,7.4;

select * from temp_median_test;

Get the median, i.e. the 50th percentile:
SELECT percentile(number,0.5) from temp_median_test;
Result: 4
SELECT percentile_approx(number_b,0.5) from temp_median_test;
Result: 3.8
Question: why is the result 3.8 rather than the middle value 4.3?

Insert one more row so that the number of values becomes even.

INSERT INTO TABLE temp_median_test select 8,8,8.4;

SELECT percentile(number,0.5) from temp_median_test;
Result: 4.5
SELECT percentile_approx(number_b,0.5) from temp_median_test;
Result: 4.3
Why is the result 4.3 rather than the average of the two middle values, 4.75?
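The two percentile results above (4 for seven values, 4.5 for eight) can be re-checked outside Hive. The following is a minimal Python sketch, assuming the exact percentile is taken at the fractional position p * (n - 1) in the sorted data with linear interpolation between the two neighbouring values; the function name exact_percentile is made up for this illustration and is not a Hive API.

```python
import math


def exact_percentile(values, p):
    """Interpolate linearly at position p * (n - 1) in the sorted data."""
    data = sorted(values)
    position = p * (len(data) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    if lower == upper:
        # The position falls exactly on one observation.
        return float(data[lower])
    # Weight each neighbour by its distance from the fractional position.
    return (upper - position) * data[lower] + (position - lower) * data[upper]


seven = [1, 2, 3, 4, 5, 6, 7]
eight = seven + [8]
print(exact_percentile(seven, 0.5))  # 4.0
print(exact_percentile(eight, 0.5))  # 4.5
```

Under this assumption an odd count returns the middle observation and an even count returns the average of the two middle ones, which agrees with the results from the number column.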

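1.1. How the equal-frequency median is calculated

The equal-frequency method computes the median as follows:

  • With an odd number of values

There are 7 values, so each one contributes 1/7 of the cumulative frequency. Sorted in ascending order:
the 3rd value (3.3) has cumulative frequency 3/7 = 0.42857142857
the 4th value (4.3) has cumulative frequency 4/7 = 0.57142857143
The target 0.5 lies between the two, so the median is interpolated linearly: (3.3*(0.57142857143-0.5)+4.3*(0.5-0.42857142857))/(0.57142857143-0.42857142857)=3.8

  • With an even number of values

There are 8 values, so each one contributes 0.125 of the cumulative frequency. Sorted in ascending order, the 4th value has a cumulative frequency of exactly 0.5, so no interpolation is needed and the result is that value, 4.3.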
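The equal-frequency rule described above can be written as a short Python sketch. It assumes every row is its own bucket with weight 1/n, which is the situation in this small test table; the function name equal_frequency_quantile is hypothetical and only illustrates the arithmetic, it is not Hive's implementation.

```python
def equal_frequency_quantile(values, q):
    """Interpolate between the two sorted values whose cumulative
    frequencies (rank / n) surround the target quantile q."""
    data = sorted(values)
    n = len(data)
    for rank, value in enumerate(data, start=1):
        if rank / n >= q:
            if rank == 1:
                return value
            previous = data[rank - 2]
            # How far q lies between the cumulative frequencies
            # (rank - 1) / n and rank / n, as a fraction from 0 to 1.
            fraction = q * n - (rank - 1)
            return previous + fraction * (value - previous)
    return data[-1]


seven_b = [1.2, 2.2, 3.3, 4.3, 5.2, 6.7, 7.4]
eight_b = seven_b + [8.4]
print(round(equal_frequency_quantile(seven_b, 0.5), 4))  # 3.8
print(round(equal_frequency_quantile(eight_b, 0.5), 4))  # 4.3
```

The expression previous + fraction * (value - previous) is algebraically the same as the weighted average in the text, just rearranged so that the division by the gap between the two cumulative frequencies disappears.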

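2. Second verification

Continue verifying the equal-frequency interpolation rule, this time on a table that contains duplicate values.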

CREATE TABLE temp_median_test_yz 
(
id int,
number bigint,
number_b double
);

INSERT overwrite TABLE temp_median_test_yz select 1,1,1.2;
INSERT INTO TABLE temp_median_test_yz select 2,2,2.2;
INSERT INTO TABLE temp_median_test_yz select 3,2,3.3;
INSERT INTO TABLE temp_median_test_yz select 4,3,4.3;
INSERT INTO TABLE temp_median_test_yz select 5,3,5.2;
INSERT INTO TABLE temp_median_test_yz select 6,4,6.7;
INSERT INTO TABLE temp_median_test_yz select 7,4,7.4;
INSERT INTO TABLE temp_median_test_yz select 8,4,8.4;
INSERT INTO TABLE temp_median_test_yz select 9,4,8.4;
INSERT INTO TABLE temp_median_test_yz select 10,4,8.4;

select * from temp_median_test_yz;

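SELECT percentile(number,0.5) from temp_median_test_yz;
Result: 3.5
SELECT percentile_approx(number_b,0.5) from temp_median_test_yz;
Result: 5.2
There are 10 values; sorted in ascending order, the 5th value has a cumulative frequency of exactly 0.5, and that value is 5.2.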

Insert one more row so that the number of values becomes odd.

INSERT INTO TABLE temp_median_test_yz select 11,5,9.4;

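SELECT percentile(number,0.5) from temp_median_test_yz;
Result: 4
SELECT percentile_approx(number_b,0.5) from temp_median_test_yz;
Result: 5.95
There are 11 values. Sorted in ascending order:
the 5th value (5.2) has cumulative frequency 5/11 = 0.4545
the 6th value (6.7) has cumulative frequency 6/11 = 0.5455
(5.2*(0.5455-0.5)+6.7*(0.5-0.4545))/(0.5455-0.4545)=5.95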
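To double-check which two ranks surround the 0.5 mark in the 10-value and 11-value cases, here is a small Python sketch that keeps the cumulative frequencies as exact fractions. It rests on the same assumption as the explanation above (each row carries a weight of 1/n), and the helper name interpolated_quantile is invented for this example.

```python
from fractions import Fraction


def interpolated_quantile(values, q):
    """Return (lower_rank, upper_rank, estimate) for quantile q (a Fraction).

    Ranks are 1-based positions in the sorted data.
    """
    data = sorted(values)
    n = len(data)
    # First rank whose cumulative frequency rank / n reaches q.
    upper = next(rank for rank in range(1, n + 1) if Fraction(rank, n) >= q)
    lower = max(upper - 1, 1)
    if lower == upper:
        return lower, upper, data[0]
    c_lower, c_upper = Fraction(lower, n), Fraction(upper, n)
    low_value, high_value = data[lower - 1], data[upper - 1]
    # Same weighted average as in the text.
    estimate = (
        low_value * float(c_upper - q) + high_value * float(q - c_lower)
    ) / float(c_upper - c_lower)
    return lower, upper, estimate


ten = [1.2, 2.2, 3.3, 4.3, 5.2, 6.7, 7.4, 8.4, 8.4, 8.4]
eleven = ten + [9.4]
half = Fraction(1, 2)
for sample in (ten, eleven):
    lower, upper, estimate = interpolated_quantile(sample, half)
    print(len(sample), lower, upper, round(estimate, 2))
```

For ten values the surrounding ranks are 4 and 5 and the estimate is 5.2; for eleven values they are 5 and 6 and the estimate is 5.95, in line with the two percentile_approx results.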

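Computing several quantiles at once

When several quantiles are needed at once, for example the 10th, 20th, and 30th percentiles, pass an array of probabilities. Both of the following forms return an array:
percentile(BIGINT col, array(p1 [, p2]...))
percentile_approx(DOUBLE col, array(p1 [, p2]...) [, B])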
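As an illustration of the array form, the sketch below computes several exact percentiles in one call and returns them as a list, in the same order as the requested probabilities. It is plain Python, not Hive; the function name percentiles and the sample data are assumptions made for this example, and the interpolation at position p * (n - 1) is the same assumption as for the single-value case.

```python
import math


def percentiles(values, ps):
    """Return one exact percentile per probability in ps, in order."""
    data = sorted(values)
    results = []
    for p in ps:
        position = p * (len(data) - 1)
        lower, upper = math.floor(position), math.ceil(position)
        if lower == upper:
            results.append(float(data[lower]))
        else:
            # Linear interpolation between the two neighbouring values.
            results.append(
                (upper - position) * data[lower]
                + (position - lower) * data[upper]
            )
    return results


numbers = [1, 2, 3, 4, 5, 6, 7, 8]
print(percentiles(numbers, [0.25, 0.5, 0.75]))  # [2.75, 4.5, 6.25]
```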

SQL example

SELECT  percentile_approx(cast(revnu_amount_m_avg_12 AS double),ARRAY(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1))
FROM app.app_wljf_recvbl_risk_predict_detail
WHERE dt = '2022-03-31'
AND rlike(seller_no, '^EBU.*|^ebu.*')

-- Result:
[0.001095,0.004240333333333333,0.01079075,0.02246991666666667,0.04454166666666667,0.08530308333333332,0.17178691666666668,0.39055508333333333,1.2046255833333335,2341.8266860833332]

To turn the array above into one row per value, wrap the call in explode:

SELECT  explode(percentile_approx(cast(revnu_amount_m_avg_12 as double),ARRAY(0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1)))
FROM app.app_wljf_recvbl_risk_predict_detail
WHERE dt = '2022-03-31'
AND rlike(seller_no, '^EBU.*|^ebu.*')

-- Result:
0.001095
0.004240333333333333
0.01079075
0.02246991666666667
0.04454166666666667
0.08530308333333332
0.17178691666666668
0.39055508333333333
1.2046255833333335
2341.8266860833332
