Advanced Hive Usage
1. Support for complex data types
array, map, struct
Values of these complex types can be traversed and queried.
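The three complex types can be illustrated with a minimal sketch. The table employee_profile, its columns, and the chosen delimiters are hypothetical and serve only as an illustration:

```sql
-- Hypothetical table: names and delimiters are assumptions for illustration.
CREATE TABLE employee_profile (
  name    STRING,
  skills  ARRAY<STRING>,
  scores  MAP<STRING, INT>,
  address STRUCT<city:STRING, street:STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':';

SELECT
  name,
  skills[0]        AS first_skill,  -- array elements are indexed from 0
  size(skills)     AS skill_count,  -- number of elements in the array
  scores['math']   AS math_score,   -- map lookup by key; NULL if the key is absent
  map_keys(scores) AS subjects,     -- all keys of the map, as an array
  address.city     AS city          -- struct fields are accessed with dot notation
FROM employee_profile;
```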
2. Support for views
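As a rough sketch, a view can wrap a frequently used aggregation. The view name cookie1_daily_pv is hypothetical, and the query assumes that cookie1.createtime holds a timestamp-like string that to_date can parse:

```sql
-- Hypothetical view over the cookie1 table used in the window examples of these notes.
CREATE VIEW IF NOT EXISTS cookie1_daily_pv AS
SELECT cookieid,
       to_date(createtime) AS dt,   -- assumes createtime is a timestamp-like value
       sum(pv)             AS pv
FROM cookie1
GROUP BY cookieid, to_date(createtime);

-- A view is queried like a table; Hive expands its definition at query time.
SELECT * FROM cookie1_daily_pv WHERE pv > 10;

DROP VIEW IF EXISTS cookie1_daily_pv;
```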
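3. Functions
3.1. A rich set of built-in functions
3.2. Custom Java processing classes (UDFs): package the class in a jar, add the jar to Hive, and define a temporary function bound to the class to apply custom processing to the data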
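The registration steps for a custom Java function might look like the following sketch. The jar path, the class name com.example.hive.udf.MyLower, and the function name my_lower are all hypothetical placeholders:

```sql
-- Hypothetical jar path and class name: replace with your own.
ADD JAR /path/to/my-udfs.jar;

-- Bind a session-scoped function name to the Java class inside the jar.
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';

-- Use it like any built-in function.
SELECT cookieid, my_lower(url) AS url_lower
FROM cookie4;

-- Temporary functions disappear when the session ends; they can also be dropped explicitly.
DROP TEMPORARY FUNCTION IF EXISTS my_lower;
```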
3.3. Parsing and manipulating JSON data: get_json_object, json_tuple
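A minimal sketch of both JSON functions, assuming a hypothetical table json_log with a single STRING column line that holds one JSON object per row, such as {"user":{"id":"u1"},"action":"click","ts":"2018-12-24 10:00:00"}:

```sql
-- get_json_object extracts one value per call using a JSONPath-like expression.
SELECT
  get_json_object(line, '$.user.id') AS user_id,
  get_json_object(line, '$.action')  AS action
FROM json_log;

-- json_tuple is a UDTF that extracts several top-level keys in a single pass,
-- so it is used through LATERAL VIEW.
SELECT t.action, t.ts
FROM json_log l
LATERAL VIEW json_tuple(l.line, 'action', 'ts') t AS action, ts;
```

json_tuple only reads top-level keys, so nested values such as user.id still need get_json_object.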
3.4. Calling custom scripts (for example Python) from HQL through TRANSFORM
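The HQL side of a TRANSFORM call could be sketched as follows. The script to_upper.py and its path are hypothetical; the script is assumed to read tab-separated rows from standard input and write tab-separated rows to standard output:

```sql
-- Hypothetical script: ship it to the cluster so every task can run it.
ADD FILE /path/to/to_upper.py;

-- Each input row is passed to the script as one tab-separated line on stdin;
-- each line the script prints becomes one output row.
SELECT TRANSFORM (cookieid, url)
       USING 'python to_upper.py'
       AS (cookieid STRING, url_upper STRING)
FROM cookie4;
```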
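3.5. Analytic window functions
a. sum, avg, min, max: aggregation within a window
over (partition by col1 order by col2 rows between {unbounded | n} preceding and {current row | n following})
If ROWS BETWEEN is omitted but ORDER BY is present, the window defaults to the start of the partition through the current row;
If ORDER BY is omitted as well, the aggregate covers every row in the partition;
The key is understanding ROWS BETWEEN, also called the WINDOW clause:
PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: the boundary of the partition
UNBOUNDED PRECEDING: starting from the first row of the partition
UNBOUNDED FOLLOWING: ending at the last row of the partition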
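b. ntile, row_number, rank, dense_rank
NTILE(n): splits the ordered rows of a partition into n buckets and returns the bucket number of the current row
ROW_NUMBER(): numbers the rows of a partition in order, starting from 1, with no duplicates
RANK(): ranks rows within a partition; tied rows share a rank and leave a gap after it, e.g. 3, 3, 5
DENSE_RANK(): ranks rows within a partition; tied rows share a rank and leave no gap, e.g. 3, 3, 4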
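c. cume_dist, percent_rank
CUME_DIST: (number of rows whose value is less than or equal to the current value) / (total number of rows in the partition)
PERCENT_RANK: (RANK of the current row within the partition - 1) / (total number of rows in the partition - 1)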
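The ranking and distribution formulas can be checked on a tiny inline data set. This is a sketch only: the four salary values are made up, and the subquery relies on SELECT without FROM, which assumes Hive 0.13 or later:

```sql
-- Four made-up salaries, with one tie at 2000.
SELECT sal,
       rank()         OVER (ORDER BY sal) AS rk,        -- 1, 2, 2, 4
       dense_rank()   OVER (ORDER BY sal) AS dense_rk,  -- 1, 2, 2, 3
       cume_dist()    OVER (ORDER BY sal) AS cd,        -- 1/4, 3/4, 3/4, 4/4
       percent_rank() OVER (ORDER BY sal) AS pr         -- 0/3, 1/3, 1/3, 3/3
FROM (
  SELECT 1000 AS sal
  UNION ALL SELECT 2000
  UNION ALL SELECT 2000
  UNION ALL SELECT 3000
) t;
```

For the tied rows, three of the four rows have a salary of at most 2000, so cume_dist is 3/4, while percent_rank is (2 - 1) / (4 - 1) = 1/3.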
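d. lag, lead, first_value, last_value
LAG(col, n, DEFAULT): returns the value of col from the n-th row before the current row in the window; DEFAULT is returned when no such row exists
LEAD(col, n, DEFAULT): returns the value of col from the n-th row after the current row in the window; DEFAULT is returned when no such row exists
first_value(col1) over (partition by col2 order by col3): the first value in the sorted partition, up to the current row
last_value(col1) over (partition by col2 order by col3): the last value in the sorted partition, up to the current row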
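e. grouping sets, grouping__id, cube, rollup: commonly used for OLAP
grouping sets, grouping__id: aggregate separately over each listed combination of the GROUP BY columns and merge the results into one result set
cube: aggregates over every combination of the GROUP BY columns
rollup: aggregates over the hierarchical combinations of the GROUP BY columns
grouping sets (list of column combinations): each entry is one combination of the GROUP BY columns to aggregate by
grouping__id: a number identifying which grouping set produced each row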
e.g.:
select month, day, count(distinct cookieid) as uv, grouping__id
from cookie5
group by month, day
grouping sets (month, day, (month, day)) -- the combined (month, day) set is optional
order by grouping__id;
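One way to read GROUPING SETS is as a UNION ALL of ordinary GROUP BY queries, one per grouping set. The sketch below covers only the two single-column sets month and day, leaves out grouping__id, and assumes both columns are strings:

```sql
-- Logically equivalent to: GROUP BY month, day GROUPING SETS (month, day)
SELECT month,
       CAST(NULL AS STRING)     AS day,   -- day is not part of this grouping set
       count(DISTINCT cookieid) AS uv
FROM cookie5
GROUP BY month
UNION ALL
SELECT CAST(NULL AS STRING)     AS month, -- month is not part of this grouping set
       day,
       count(DISTINCT cookieid) AS uv
FROM cookie5
GROUP BY day;
```

Adding the combined (month, day) set would correspond to a third branch that groups by both columns.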
cube: WITH CUBE aggregates over every combination of the GROUP BY dimensions
e.g.:
select month,day,count(distinct cookieid) as uv,grouping__id
from cookie5
group by month,day
with cube
order by grouping__id;
rollup: WITH ROLLUP aggregates level by level, following the order of the GROUP BY dimensions
e.g.:
select month,day,count(distinct cookieid) as uv,grouping__id
from cookie5
group by month,day
with rollup
order by grouping__id;
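lag(column, n, default): returns the value from the n-th preceding row in the window, so values appear shifted down by n rows
lead(column, n, default): returns the value from the n-th following row in the window, so values appear shifted up by n rows
e.g.: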
select cookieid, createtime, url,
row_number() over (partition by cookieid order by createtime) as rn,
LAG(createtime,1,'1970-01-01 00:00:00') over (partition by cookieid order by createtime) as front_1_time,
LEAD(createtime,2,'2018-12-24 00:00:00') over (partition by cookieid order by createtime) as behind_2_time
from cookie4;
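first_value(column): the first value in the sorted window (with a descending sort this yields the last value)
last_value(column): the last value in the sorted window up to the current row, which is normally the current row's own value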
select cookieid, createtime, url,
row_number() over (partition by cookieid order by createtime) as rn,
first_value(url) over (partition by cookieid order by createtime) as first1,
first_value(url) over (partition by cookieid order by createtime desc) as last1,
last_value(url) over (partition by cookieid order by createtime) as last2
from cookie4;
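CUME_DIST(): (number of rows whose value is less than or equal to the current value) / (total number of rows in the partition)
PERCENT_RANK(): (RANK of the current row within the partition - 1) / (total number of rows in the partition - 1)
e.g.: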
select dept, userid, sal,
cume_dist() over (order by sal) as rn1,
cume_dist() over (partition by dept order by sal) as rn2
from cookie3;
select dept, userid, sal,
percent_rank() over (order by sal) as rn1, -- within the group (here, the whole table)
rank() over (order by sal) as rn11, -- RANK value within the group
sum(1) over (partition by null) as rn12, -- total number of rows in the group
percent_rank() over (partition by dept order by sal) as rn2,
rank() over (partition by dept order by sal) as rn21,
sum(1) over (partition by dept) as rn22
from cookie3;
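ntile(n): splits the rows of the window into n buckets
row_number(): sequence number of each row within the window, starting from 1
rank(): rank of each row within the window, with gaps after ties (3, 3, 5)
dense_rank(): rank of each row within the window, without gaps after ties (3, 3, 4)
e.g.: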
select cookieid, createtime, pv,
ntile(2) over (partition by cookieid order by createtime) as rn1,
row_number() over (partition by cookieid order by pv desc) as rn2,
rank() over (partition by cookieid order by pv desc) as rn3,
dense_rank() over (partition by cookieid order by pv desc) as rn4
from cookie2
order by cookieid,createtime;
sum|avg|min|max(column) over (partition by col1 order by col2 rows between <start> and <end>): aggregates the rows inside a freely defined window frame, where <start> is typically unbounded preceding, n preceding, or current row, and <end> is typically current row, n following, or unbounded following
e.g.:
select cookieid, createtime, pv,
sum(pv) over (partition by cookieid order by createtime rows between unbounded preceding and current row) as pv1, -- start of the partition through the current row
avg(pv) over (partition by cookieid order by createtime) as pv2, -- no frame given: defaults to start of the partition through the current row
max(pv) over (partition by cookieid) as pv3, -- all rows in the partition
min(pv) over (partition by cookieid order by createtime rows between 3 preceding and current row) as pv4, -- current row + 3 preceding rows
sum(pv) over (partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5, -- current row + 3 preceding rows + 1 following row
avg(pv) over (partition by cookieid order by createtime rows between current row and unbounded following) as pv6 -- current row + all following rows
from cookie1;
4. Handling special delimiters: regular-expression parsing with RegexSerDe, or a custom InputFormat
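As a sketch of the RegexSerDe approach, the table below parses lines whose two fields are separated by the multi-character delimiter ||. The table name, the columns, and the delimiter are assumptions chosen for illustration:

```sql
-- Hypothetical table for lines such as: 1||zhangsan
CREATE TABLE double_pipe_log (
  id   STRING,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- Each capture group maps to one column, in order.
  -- The pipe is escaped because it is a regex metacharacter.
  "input.regex" = "(.*)\\|\\|(.*)"
)
STORED AS TEXTFILE;
```

Declaring every column as STRING is the safe choice here; values can be cast to other types at query time.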
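lateral view explode
A lateral view is used together with UDTFs such as explode (often fed by split) to turn one row into multiple rows, which can then be aggregated. The lateral view first calls the UDTF for each row of the source table; the UDTF expands that row into one or more rows; the lateral view then combines the results into a virtual table that supports a table alias.
The lateral clause acts as a virtual table joined to the source table (explode_lateral_view in the example): each source row is paired with the rows generated from it.
explode cannot be nested inside another function call.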
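The behavior described above can be sketched as follows. The columns id and tags of explode_lateral_view are hypothetical; tags is assumed to be a comma-separated string such as 'a,b,c':

```sql
-- One output row per (source row, tag) pair.
-- split turns the string into an array; explode turns the array into rows.
SELECT s.id, t.tag
FROM explode_lateral_view s
LATERAL VIEW explode(split(s.tags, ',')) t AS tag;

-- The exploded rows can then be aggregated like ordinary rows.
SELECT t.tag, count(*) AS cnt
FROM explode_lateral_view s
LATERAL VIEW explode(split(s.tags, ',')) t AS tag
GROUP BY t.tag;
```

Note that split is nested inside explode, which is allowed; the restriction is on wrapping explode itself in another function.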