hive窗口函数——ROW_NUMBER()来去重,和rank()、dense_rank()的区别与teradata中row_number()去重

原文:https://www.cnblogs.com/wujin/p/6051768.html

row_number()、rank()、dense_rank()三个函数相当于group by 之后再来个order by

上实例

id name sal
1   a   10
2   a   12
3   b   13
4   b   12
5   a   14
6   a   15
7   a   13
8   b   11
9   a   16
10  b   17
11  a   14

三个函数

select id,
name,
sal,
rank()over(partition by name order by sal desc ) as rp,
dense_rank() over(partition by name order by sal desc ) as drp,
row_number()over(partition by name order by sal desc) as rmp
from 表

结果

ID  NAME  SAL  rank   dense_rank   row_number
10   b    17    1     1    		1
3    b    13    2     2    		2
4    b    12    3     3    		3
8    b    11    4     4    		4
9    a    16    1     1    		1
6    a    15    2     2    		2
11   a    14    3     3    		3
5    a    14    3     3    		4
7    a    13    5     4    		5
2    a    12    6     5    		6
1    a    10    7     6    		7

rank()在sal排序的时候,sal遇到重复的值时,rank值也重复,接下来的值还是按正常顺序,但总数不会变  

dense_rank()在sal排序的时候,sal遇到重复的值时,rank值也重复,接下来的值按下一个顺序来,总数会变  

row_number()会正常排序,没什么说法,一个一个递增

求sal前50%的人

select * from (
select id,
name,
sal,
rank()over(partition by name order by sal desc ) rp,
dense_rank() over(partition by name order by sal desc ) drp,
row_number()over(partition by name order by sal desc) rmp,
count(*)over(partition by name) *0.5 as count
from f_test
) t where t.rp 

感觉虽然可以实现,但是有点复杂,有没有更好的方法实现呢

NTILE
NTILE(n),用于将分组数据按照顺序切分成n片,返回当前切片值
NTILE不支持ROWS BETWEEN,比如 NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
如果切片不均匀,默认增加第一个切片的分布

上面那个例子 正好可以用到这个

select * from (
select id,
name,
sal,
NTILE(2) over(partition by name order by sal desc ) rn
from f_test
) t where t.rn=1

查重语句:(group by 查重)

SELECT user_id,
            count(*) as num

FROM 表名

WHERE 条件  GROUP BY user_id HAVING count(*)>1;

teradata中运用row_number()去重

select * from 表名 where 条件 QUALIFY ROW_NUMBER() OVER(PARTITION BY 字段 ORDER BY 字段 DESC) = 1

qualify前面不用接and,如果没有条件的话可以用1=1来填充

你可能感兴趣的:(hive,数据库)