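I have recently been working on a B2B (toB) project in which data is aggregated offline and then pushed to the online business database, with Hive used for the offline analysis. Below is a summary of common problems and lessons learned.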
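I. Engineering: the statistics work breaks down roughly into four steps: metric computation, batch scripts, data format, and exception handling.
step1. Metric computation: create a table to store the value of each metric, e.g. the Hive table loan_apply_rate stores the application approval rate. The difficulty: there are many metrics, and their definitions may be ambiguous.
step2. Batch scripts: combine the tables created in step1 into Perl scripts that run as a batch. The difficulty: a long-running script delays the business side, so test to find a reasonable script size. A large script can be split vertically (e.g. one script for application metrics and one for withdrawal metrics) or horizontally (e.g. the vintage metric depends on a lot of intermediate logic, so part of that logic can be extracted into an intermediate table on which the final vintage metric then depends).
step3. Data format: create one summary table that stores all metric values, and convert the tables produced in step2 into the format the business side expects (one step2 metric can be converted into several expected formats, so metrics are reused).
step4. Exception handling: covers parent and child tasks in the batch scripts running out of order, rolling back or recomputing when today's statistics are wrong, data deduplication, data backup, and so on.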
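One way to rule out the parent/child ordering problem mentioned in step4 is to derive the execution order from declared dependencies instead of hard-coding it. The following is only a minimal sketch in Python (3.9+ for graphlib); the task names are hypothetical and do not come from the project.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on (its parents).
deps = {
    "vintage_mid_table": set(),
    "apply_metrics": set(),
    "vintage_metric": {"vintage_mid_table"},
    "summary_table": {"apply_metrics", "vintage_metric"},
}

def run_order(dependencies):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dependencies).static_order())

order = run_order(deps)
print(order)  # parents always appear before their children
```

A batch driver could walk this list and stop at the first failing task, so a child never runs on stale parent output.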
II. Hive
1. Common optimizations
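1.1 Rule: if you only ever use rn=1, i.e. you only need the maximum or minimum, row_number() is unnecessary. Example: find the order with the largest credit amount in the application table.
case1: select * from a where dt='2018-12-19' order by loan_amount desc limit 1; (map 70s, reduce 400s) (common but inefficient)
case2: select * from (select *, max(loan_amount) over() la from a where dt='2018-12-19') a where la=loan_amount; (map 70s, reduce 1300s) (common but inefficient)
case3: select * from (select *, row_number() over(sort by loan_amount desc) rn from a where dt='2018-12-19') a where rn=1; (map 70s, reduce 9000s, timeout)
case4: select * from (select * from a where dt='2018-12-19') a join (select max(loan_amount) la from a where dt='2018-12-19') b on a.loan_amount=la; (map 70s, map 70s, reduce 2s) (returns every row that ties for the maximum)
case5: select * from (select max(struct(loan_amount,apply_no)) la from a where dt='2018-12-19') b; (map 70s, reduce 2s) (structs are compared field by field from left to right, so loan_amount must be the first field; the result fields are read as la.col1 and la.col2)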
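The max(struct(...)) trick in case5 depends on field-by-field comparison, so the field order decides which row wins. Python tuples compare the same way, which makes the effect easy to see. This is an illustrative sketch with made-up rows, not Hive code.

```python
# Hypothetical rows: (apply_no, loan_amount)
rows = [("A003", 5000), ("A001", 12000), ("A002", 8000)]

# Amount first: the comparison is decided by loan_amount,
# and apply_no only breaks ties.
best = max((amount, apply_no) for apply_no, amount in rows)

# Key first: this finds the largest apply_no instead,
# which is not the largest loan.
wrong = max(rows)

print(best)   # (12000, 'A001')
print(wrong)  # ('A003', 5000)
```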
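1.2 Rule: replace count(distinct)
case1: select count(distinct(user_jrid)) from user where dt='2018-12-19'; (completion time: 800s) (deduplication is roughly O(n log n), and everything runs in a single reducer) (common but inefficient)
case2: select 1,count(1) from (select user_jrid from a where dt='2018-12-31' group by user_jrid) a; (group by parallelizes the deduplication; completion time: 80s) (still roughly O(n log n), but spread across multiple reducers running in parallel)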
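The two forms agree only when the key has no NULLs: count(distinct x) skips NULL, while group by x produces a NULL group that count(1) then counts. The sketch below checks this with SQLite from the Python standard library as a stand-in for Hive; the table contents are invented and nothing here measures performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table a (user_jrid text, dt text)")
conn.executemany(
    "insert into a values (?, ?)",
    [("u1", "2018-12-31"), ("u2", "2018-12-31"), ("u1", "2018-12-31"),
     (None, "2018-12-31"), ("u3", "2018-12-30")],
)

def scalar(sql):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]

distinct_count = scalar(
    "select count(distinct user_jrid) from a where dt='2018-12-31'")
group_by_count = scalar(
    "select count(1) from (select user_jrid from a "
    "where dt='2018-12-31' group by user_jrid) t")
group_by_not_null = scalar(
    "select count(1) from (select user_jrid from a "
    "where dt='2018-12-31' and user_jrid is not null group by user_jrid) t")

print(distinct_count, group_by_count, group_by_not_null)
```

If the key column can be NULL, adding `user_jrid is not null` to the inner query keeps the rewritten form equivalent to count(distinct).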
1.3 Complexity of each stage:
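2. UDF
Pinning a date to the end of the month:
2.1 case when cast(split(statistics_date,'-')[1] as int) in (1,3,5,7,8,10,12) then concat(statistics_date,'-31')
when cast(split(statistics_date,'-')[1] as int) in (4,6,9,11) then concat(statistics_date,'-30')
when cast(split(statistics_date,'-')[0] as int)%4=0 and cast(split(statistics_date,'-')[1] as int)=2 then concat(statistics_date,'-29')
when cast(split(statistics_date,'-')[0] as int)%4!=0 and cast(split(statistics_date,'-')[1] as int)=2 then concat(statistics_date,'-28') end as new_statistics_date
(This assumes statistics_date has the form yyyy-MM; casting the month to int matches both '1' and '01'. The %4 test ignores the century rule, so it is only correct for the years 1901 through 2099. On Hive 1.1.0 and later, last_day(concat(statistics_date,'-01')) gives the month end directly.)
2.2 Last day of the previous month: date_sub(concat(substr(created_date, 1, 7), '-01'), 1)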
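To cross-check the two expressions above, the same calculations can be written with the Python standard library, which applies the full leap-year rule. This is a sketch; the function names are hypothetical and the inputs are assumed to look like 'yyyy-MM' and 'yyyy-MM-dd...'.

```python
import calendar
from datetime import date, timedelta

def month_end(year_month):
    """'2018-02' or '2018-2' -> '2018-02-28' (last day of that month)."""
    year, month = (int(part) for part in year_month.split("-"))
    last = calendar.monthrange(year, month)[1]  # number of days in the month
    return f"{year:04d}-{month:02d}-{last:02d}"

def prev_month_end(created_date):
    """'2018-12-19' -> '2018-11-30' (first of the month minus one day)."""
    first = date.fromisoformat(created_date[:7] + "-01")
    return (first - timedelta(days=1)).isoformat()

print(month_end("2020-02"))          # 2020-02-29
print(month_end("2100-02"))          # 2100-02-28, where a plain %4 test would say 29
print(prev_month_end("2018-12-19"))  # 2018-11-30
```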
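3. Common functions
3.1 Rows to columns: collect_set/collect_list (both return an array)
case1: with the products in their default order, aggregate the products into a single row.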
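The difference between the two aggregates is only whether duplicates survive: collect_list keeps them, collect_set drops them. A small Python model of that behaviour, using made-up (user, product) rows, might look like this; note that Hive itself does not guarantee the element order inside the resulting array.

```python
from collections import defaultdict

# Hypothetical rows: (user_id, product)
rows = [("u1", "loan"), ("u1", "cash"), ("u1", "loan"), ("u2", "cash")]

def collect(rows, dedupe):
    """Group values by key; dedupe=True models collect_set, False collect_list."""
    grouped = defaultdict(list)
    for key, value in rows:
        if dedupe and value in grouped[key]:
            continue  # collect_set keeps each value once
        grouped[key].append(value)
    return dict(grouped)

as_list = collect(rows, dedupe=False)
as_set = collect(rows, dedupe=True)
print(as_list)  # {'u1': ['loan', 'cash', 'loan'], 'u2': ['cash']}
print(as_set)   # {'u1': ['loan', 'cash'], 'u2': ['cash']}
```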
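3.2 Columns to rows: lateral view explode/posexplode
case1: select v from (select split('1 2 3 4 5 6 7 8 9 0',' ') v1 ) t1 lateral view explode(v1) t2 as v;
case2: select date_sub(from_unixtime(unix_timestamp(),'yyyy-MM-dd'), t.pos + 1) as biz_date from (select split(space(29),' ') as arr) s lateral view posexplode(arr) t as pos, val; This produces one row for each of the past 30 days (splitting 29 spaces yields 30 elements, so pos runs from 0 to 29), against which the daily application and withdrawal metrics can be computed. (If group by is used instead, every selected column has to be wrapped in collect_set, which gets tedious when the statement selects many columns.)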
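The date series that case2 is meant to generate is easy to state outside Hive, which helps when checking boundaries such as whether "past 30 days" includes today. A sketch in Python, with a hypothetical function name and a fixed date standing in for the current date:

```python
from datetime import date, timedelta

def past_days(today, n=30):
    """Return the n dates before today, newest first (pos + 1 days back)."""
    return [(today - timedelta(days=pos + 1)).isoformat() for pos in range(n)]

days = past_days(date(2018, 12, 19))
print(days[0], days[-1], len(days))  # 2018-12-18 2018-11-19 30
```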
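3.3 select * from (select *,row_number() over(partition by cash_id order by modified_date desc) as rn from cash_apply) a where rn=1; The withdrawal table is incremental (each change appends a new row), so this statement returns the most recent record for each cash_id.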
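What the rn=1 filter computes is "for each key, keep the row with the largest ordering value". The sketch below expresses that in plain Python over invented rows; the status column and the values are hypothetical, and ties on modified_date are resolved by keeping the first row seen, whereas row_number() picks arbitrarily among ties.

```python
# Hypothetical incremental rows for the cash_apply table.
rows = [
    {"cash_id": 1, "status": "applied", "modified_date": "2018-12-01"},
    {"cash_id": 1, "status": "paid", "modified_date": "2018-12-03"},
    {"cash_id": 2, "status": "applied", "modified_date": "2018-12-02"},
]

def latest_per_key(rows, key, order):
    """Keep, for each key, the row with the greatest value of `order`."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[order] > latest[k][order]:
            latest[k] = row
    return latest

latest = latest_per_key(rows, "cash_id", "modified_date")
print(latest[1]["status"], latest[2]["status"])  # paid applied
```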
3.4 Others: instr; months_between;
order by, sort by, distribute by, cluster by: see https://blog.csdn.net/zhanglh046/article/details/78572939