hive-分组按比例获取样本如何实现?

背景:

有一批样本,比如是1000w,包含了若干个车系,现在我想从这1000w里获取到每个车系40%的数据,为了保证样本的散列,随机获取。

实现方式:

方式一:

select 
series,
brand
from 
(select series,brand,
model,
count(1) over(partition by series) as count,  --获取到各个分组的总数
row_number() over(partition by series order by rand()) rank 
from t1
where  series !='' and series is not null )t
where rank < count*0.4  --获取40%即可
; 

方式二:

 with 
 t_1 as ( 
     select series, brand
     row_number() over(partition by series order by rand()) as rank 
     from a 
 ), 
 t_2 as ( 
     select 
     series, max(rank)/40 as max_rank
     from b
     group by series 
 ) 
 select a.* 
 from t_1 a 
 left join t_2 b 
 on a.series = b.series 
 and a.rn < b.max_rank

你可能感兴趣的:(hive,hive,hadoop,数据仓库)