Hive Advanced Usage and Analytic Window Functions

Hive advanced usage
1. Complex data types are supported
array, map, struct
Values of these complex types can be traversed and queried

2. Views are supported

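3. Functions
3.1. A rich set of built-in functions
3.2. Custom Java classes are supported: add them to Hive as a jar file, then define a temporary function bound to the class to process data in custom ways
3.3. Parsing and manipulating JSON data with get_json_object and json_tuple
3.4. Calling custom scripts (for example Python) from HQL through Transform
3.5. Analytic window functions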
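a. sum, avg, min, max: aggregation within a window
over (partition by col1 order by col2 rows between {unbounded | n} preceding and {current row | n following})
If ORDER BY is given without ROWS BETWEEN, the frame defaults to the start of the partition through the current row (Hive's default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows that tie with the current row on the ORDER BY value are included);
If ORDER BY is omitted, the aggregate covers every row in the partition;
The key is understanding ROWS BETWEEN, also called the WINDOW clause:
PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: the partition boundary
UNBOUNDED PRECEDING: from the first row of the partition
UNBOUNDED FOLLOWING: to the last row of the partition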
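b. ntile, row_number, rank, dense_rank
NTILE(n): splits the ordered rows of a partition into n slices and returns the slice number of the current row
ROW_NUMBER(): numbers the rows of a partition in order, starting from 1, with no duplicates
RANK(): rank of the row within the partition; ties leave gaps in the ranking, e.g. 3, 3, 5
DENSE_RANK(): rank of the row within the partition; ties leave no gaps in the ranking, e.g. 3, 3, 4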
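c. cume_dist, percent_rank
CUME_DIST: (number of rows with a value less than or equal to the current value) / (total rows in the partition)
PERCENT_RANK: (RANK of the current row in the partition - 1) / (total rows in the partition - 1)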
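d. lag, lead, first_value, last_value
LAG(col,n,DEFAULT): value of col from the row n rows before the current row in the window; DEFAULT is returned when no such row exists
LEAD(col,n,DEFAULT): value of col from the row n rows after the current row in the window; DEFAULT is returned when no such row exists
first_value(col1) over (partition by col2 order by col3): after ordering within the partition, the first value up to the current row
last_value(col1) over (partition by col2 order by col3): after ordering within the partition, the last value up to the current row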
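e. grouping sets, grouping__id, cube, rollup: commonly used for OLAP
grouping sets, grouping__id: aggregate separately by each listed set of GROUP BY columns and combine the results into one result set
cube: aggregates over every combination of the GROUP BY columns
rollup: aggregates over the hierarchical combinations of the GROUP BY columns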
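
Item 3.4 above mentions calling a custom script through Transform. Hive streams each input row to the script on stdin as a tab-separated line and reads tab-separated lines back from stdout. The following is a minimal sketch of such a script, not a definitive implementation; the script name host.py, the two-column input layout (cookieid, url), and the host-extraction logic are all assumptions made for illustration.

```python
import sys

def transform_line(line):
    """Turn one tab-separated input row into one tab-separated output row.

    Assumed (hypothetical) input columns: cookieid, url.
    Output columns: cookieid, host part of the url.
    """
    cookieid, url = line.rstrip("\n").split("\t")
    # Drop an optional scheme ("http://") and everything after the host.
    host = url.split("//", 1)[-1].split("/", 1)[0]
    return cookieid + "\t" + host

def main(stdin=sys.stdin, stdout=sys.stdout):
    # One row per line in, one row per line out, columns separated by tabs.
    for line in stdin:
        if line.strip():
            stdout.write(transform_line(line) + "\n")

# When saved as a script for Transform, end the file with a call to main().
print(repr(transform_line("c1\thttp://a.com/x\n")))  # 'c1\ta.com'
```

On the Hive side the script would typically be registered with add file host.py; and invoked as select transform(cookieid, url) using 'python host.py' as (cookieid, host) from cookie4;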

grouping sets (list of GROUP BY column sets): aggregates by each listed combination of columns
grouping__id: a number identifying which grouping set a result row belongs to
e.g.:
select month,day,count(distinct cookieid) as uv,grouping__id
from cookie5
group by month,day
grouping sets (month, day)  -- optionally add a third set: grouping sets (month, day, (month, day))
order by grouping__id;

cube: with cube aggregates over every combination of the GROUP BY dimensions
e.g.:
select month,day,count(distinct cookieid) as uv,grouping__id
from cookie5
group by month,day
with cube
order by grouping__id; 

rollup: with rollup aggregates level by level, following the order of the GROUP BY dimensions
e.g.:
select month,day,count(distinct cookieid) as uv,grouping__id
from cookie5
group by month,day
with rollup
order by grouping__id; 
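
To make the difference between the three clauses concrete, here is a small Python model (not Hive code) of which column sets cube and rollup expand to, and of how per-set results are merged into one result set. The sample rows are invented for illustration, and the sketch deliberately says nothing about the actual grouping__id values.

```python
from itertools import combinations

def rollup_sets(cols):
    """ROLLUP: prefixes of the column list, from all columns down to none."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

def cube_sets(cols):
    """CUBE: every subset of the column list, including the empty set."""
    return [c for r in range(len(cols), -1, -1) for c in combinations(cols, r)]

def count_distinct(rows, cols, sets, target):
    """count(distinct target) per grouping set; columns outside the set become None."""
    seen = {}
    for s in sets:
        for row in rows:
            key = tuple(row[c] if c in s else None for c in cols)
            seen.setdefault((s, key), set()).add(row[target])
    return {k: len(v) for k, v in seen.items()}

cols = ["month", "day"]
print(rollup_sets(cols))  # [('month', 'day'), ('month',), ()]
print(cube_sets(cols))    # [('month', 'day'), ('month',), ('day',), ()]

# Hypothetical sample rows, not taken from the cookie5 table.
rows = [
    {"month": "2015-03", "day": "2015-03-10", "cookieid": "c1"},
    {"month": "2015-03", "day": "2015-03-10", "cookieid": "c2"},
    {"month": "2015-03", "day": "2015-03-12", "cookieid": "c1"},
]
uv = count_distinct(rows, cols, rollup_sets(cols), "cookieid")
```

grouping sets corresponds to passing an explicit list of sets, such as [("month",), ("day",)], instead of the generated ones.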

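lag(column,n,default): value from the row n rows before the current row in the window (the column shifted down by n rows); default fills rows that have no such predecessor
lead(column,n,default): value from the row n rows after the current row in the window (the column shifted up by n rows); default fills rows that have no such successor
e.g.: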
select cookieid, createtime, url, 
  row_number() over (partition by cookieid order by createtime) as rn, 
  LAG(createtime,1,'1970-01-01 00:00:00') over (partition by cookieid order by createtime) as front_1_time, 
  LEAD(createtime,2,'2018-12-24 00:00:00') over (partition by cookieid order by createtime) as behind_2_time 
from  cookie4;
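
The shifting behaviour of lag and lead can be modelled in a few lines of Python. This is only a sketch of the semantics for one already-ordered partition, not Hive code; the sample times are made up.

```python
def lag(values, n=1, default=None):
    """Value n rows before each row of an ordered partition, else default."""
    return [values[i - n] if i - n >= 0 else default for i in range(len(values))]

def lead(values, n=1, default=None):
    """Value n rows after each row of an ordered partition, else default."""
    return [values[i + n] if i + n < len(values) else default
            for i in range(len(values))]

# Hypothetical createtime values of one cookieid, already ordered.
times = ["10:00", "10:05", "10:20"]
print(lag(times, 1, "1970-01-01 00:00:00"))
# ['1970-01-01 00:00:00', '10:00', '10:05']
print(lead(times, 2, "2018-12-24 00:00:00"))
# ['10:20', '2018-12-24 00:00:00', '2018-12-24 00:00:00']
```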

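first_value(column): the first value in the ordered window (with a descending order this yields the last value)
last_value(column): the last value in the ordered window up to the current row, which is the current row's own value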
select  cookieid, createtime, url, 
  row_number() over (partition by cookieid order by createtime) as rn, 
  first_value(url) over (partition by cookieid order by createtime) as first1,
  first_value(url) over (partition by cookieid order by createtime desc) as last1,
  last_value(url) over (partition by cookieid order by createtime) as last2 
from cookie4;
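
Why last_value returns the current row's own value follows from the default frame, which ends at the current row. The Python sketch below models this for one ordered partition, assuming no ties in the ordering column; the url values are placeholders. It also shows the effect of widening the frame to the whole partition, which in Hive is written as rows between unbounded preceding and unbounded following.

```python
def first_value_running(values):
    # Frame: start of partition .. current row, so the first value never changes.
    return [values[0] for _ in values]

def last_value_running(values):
    # Same frame: the last row of the frame is always the current row itself.
    return [values[i] for i in range(len(values))]

def last_value_full(values):
    # Frame widened to the whole partition: the true last value on every row.
    return [values[-1] for _ in values]

urls = ["url1", "url2", "url3"]   # hypothetical values, ordered by createtime
print(first_value_running(urls))  # ['url1', 'url1', 'url1']
print(last_value_running(urls))   # ['url1', 'url2', 'url3']
print(last_value_full(urls))      # ['url3', 'url3', 'url3']
```

The query above reaches the same result as last_value_full by applying first_value with order by createtime desc.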

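CUME_DIST(): (number of rows with a value less than or equal to the current value) / (total rows in the partition)
PERCENT_RANK(): (RANK of the current row in the partition - 1) / (total rows in the partition - 1)
e.g.: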
select  dept, userid, sal,
  cume_dist() over (order by sal) as rn1,
  cume_dist() over (partition by dept order by sal) as rn2
from cookie3;
select  dept, userid, sal,
  percent_rank() over (order by sal) as rn1, -- percent_rank over the whole table (a single group)
  rank() over (order by sal) as rn11, -- rank within that group
  sum(1) over (partition by null) as rn12, -- total number of rows in that group
  percent_rank() over (partition by dept order by sal) as rn2,
  rank() over (partition by dept order by sal) as rn21,
  sum(1) over (partition by dept) as rn22 
from cookie3;
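
Both formulas can be checked by hand with a small Python model. It treats the list as one partition ordered ascending; the salary figures are invented for illustration and do not come from the cookie3 table.

```python
def cume_dist(values):
    """Rows with value <= current value, divided by total rows."""
    n = len(values)
    return [sum(1 for v in values if v <= x) / n for x in values]

def percent_rank(values):
    """(rank - 1) / (total rows - 1), where rank leaves gaps after ties."""
    n = len(values)
    ranks = [1 + sum(1 for v in values if v < x) for x in values]
    return [(r - 1) / (n - 1) if n > 1 else 0.0 for r in ranks]

sal = [1000, 2000, 3000, 3000, 4000]  # hypothetical salaries
print(cume_dist(sal))     # [0.2, 0.4, 0.8, 0.8, 1.0]
print(percent_rank(sal))  # [0.0, 0.25, 0.5, 0.5, 1.0]
```

The tied salaries share the same result in both columns, and only cume_dist counts the tied rows themselves.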

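ntile(n): splits the rows of the window into n slices and returns the slice number of each row
row_number(): sequence number of each row in the window, starting from 1
rank(): rank of each row in the window, with gaps after ties, e.g. 3, 3, 5
dense_rank(): rank of each row in the window, without gaps after ties, e.g. 3, 3, 4
e.g.: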
select cookieid, createtime, pv,
  ntile(2) over (partition by cookieid order by createtime) as rn1,
  row_number() over (partition by cookieid order by pv desc) as rn2,
  rank() over (partition by cookieid order by pv desc) as rn3,
  dense_rank() over (partition by cookieid order by pv desc) as rn4
from  cookie2 
order by cookieid,createtime;
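
The "3, 3, 5" versus "3, 3, 4" distinction is easy to reproduce with a Python model of one partition that is already sorted in the window's order. This is a sketch of the semantics only; the pv values are made up, and the ntile helper assumes the usual SQL rule that earlier slices receive the leftover rows when the row count is not divisible by n.

```python
def row_number(sorted_values):
    return list(range(1, len(sorted_values) + 1))

def rank(sorted_values):
    # Position of the first row holding the same value, so ties leave gaps.
    return [sorted_values.index(v) + 1 for v in sorted_values]

def dense_rank(sorted_values):
    # Position among the distinct values, so ties leave no gaps.
    distinct = list(dict.fromkeys(sorted_values))
    return [distinct.index(v) + 1 for v in sorted_values]

def ntile(n, count):
    """Slice number for each of `count` ordered rows."""
    size, extra = divmod(count, n)
    out = []
    for bucket in range(1, n + 1):
        out += [bucket] * (size + (1 if bucket <= extra else 0))
    return out

pv = [9, 7, 5, 5, 2]   # hypothetical pv values, ordered by pv desc
print(row_number(pv))  # [1, 2, 3, 4, 5]
print(rank(pv))        # [1, 2, 3, 3, 5]
print(dense_rank(pv))  # [1, 2, 3, 3, 4]
print(ntile(2, 5))     # [1, 1, 1, 2, 2]
```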

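sum|avg|min|max(column) over (partition by col1 order by col2 rows between <start> and <end>): aggregates the rows of the window, with a freely defined frame. <start> is unbounded preceding, n preceding, or current row; <end> is current row, n following, or unbounded following.
e.g.: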
select cookieid,createtime,pv, 
   sum(pv) over (partition by cookieid order by createtime rows between unbounded preceding and current row) as pv1, -- start of partition to current row
   avg(pv) over (partition by cookieid order by createtime) as pv2,                                                  -- default frame: start of partition to current row
   max(pv) over (partition by cookieid) as pv3,                                                                      -- all rows in the partition
   min(pv) over (partition by cookieid order by createtime rows between 3 preceding and current row) as pv4,         -- current row + 3 preceding rows
   sum(pv) over (partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5,         -- current row + 3 preceding rows + 1 following row
   avg(pv) over (partition by cookieid order by createtime rows between current row and unbounded following) as pv6  -- current row + all following rows
from cookie1;
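
The frame arithmetic behind these columns can be sketched in Python. The helper below is a hypothetical model of ROWS BETWEEN over one ordered partition, not Hive code: None stands for UNBOUNDED and 0 for CURRENT ROW, and the pv values are invented.

```python
def windowed(values, func, preceding=None, following=0):
    """Apply func over ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING.

    None means UNBOUNDED; 0 means the frame edge is the current row.
    """
    out = []
    for i in range(len(values)):
        lo = 0 if preceding is None else max(0, i - preceding)
        hi = len(values) if following is None else min(len(values), i + following + 1)
        out.append(func(values[lo:hi]))
    return out

pv = [1, 5, 7, 3, 2]  # hypothetical pv values of one cookieid, ordered by createtime
print(windowed(pv, sum))                              # [1, 6, 13, 16, 18]
print(windowed(pv, min, preceding=3))                 # [1, 1, 1, 1, 2]
print(windowed(pv, sum, preceding=3, following=1))    # [6, 13, 16, 18, 17]
print(windowed(pv, sum, preceding=0, following=None)) # [18, 17, 12, 5, 2]
```

The four calls correspond to the frames used for pv1, pv4, pv5, and pv6 (with sum in place of avg for the last one).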

4. Handling special delimiters: parse with RegexSerDe regular expressions, or use a custom InputFormat

lateral view explode

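lateral view is used together with UDTFs such as explode (often applied to the array returned by split) to turn one row into multiple rows, which can then be aggregated. lateral view first calls the UDTF for each row of the base table; the UDTF expands that row into one or more rows; lateral view then combines the results into a virtual table that supports a table alias.
The lateral view clause acts as a virtual table joined to the source table (explode_lateral_view here): each source row is paired with every row the UDTF produced from it, like a per-row Cartesian product.
explode cannot be nested inside another function.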
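
The row expansion described above can be modelled in Python as follows. This is a sketch of the semantics rather than Hive code; the column names id, tags, and tag and the sample rows are assumptions. As with a plain lateral view (without OUTER), a row whose array is empty produces no output rows.

```python
def lateral_view_explode(rows, column, alias):
    """Pair every source row with each element of its array column."""
    out = []
    for row in rows:
        for item in row[column]:
            new_row = dict(row)    # original columns stay available
            new_row[alias] = item  # exploded element under the alias
            out.append(new_row)
    return out

# Hypothetical source rows with an array column.
rows = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "tags": []},
    {"id": 3, "tags": ["c"]},
]
exploded = lateral_view_explode(rows, "tags", "tag")
print([(r["id"], r["tag"]) for r in exploded])  # [(1, 'a'), (1, 'b'), (3, 'c')]
```

Once the rows are expanded, ordinary aggregation such as counting rows per tag applies to the result.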
