The window function over(), also called a windowing function, is a kind of analytic function.
sum(col) over(): running sum of col within the group
count(col) over(): running count of col within the group
min(col) over(): minimum of col within the group
max(col) over(): maximum of col within the group
avg(col) over(): average of col within the group
first_value(col) over(): the first value of col after sorting within a partition
last_value(col) over(): the last value of col after sorting within a partition
lag(col,n,DEFAULT): the value of col n rows before the current row; n is optional and defaults to 1; DEFAULT is returned when the row n back is NULL, and is itself NULL if unspecified
lead(col,n,DEFAULT): the value of col n rows after the current row; n is optional and defaults to 1; DEFAULT is returned when the row n ahead is NULL, and is itself NULL if unspecified
ntile(n): splits the ordered rows of a group into n slices and returns the slice number of the current row. Note: n must be an int.
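For example, ntile is handy for questions like "the top third by salary". A minimal sketch, assuming a hypothetical table emp(name string, salary int):

```sql
-- Hypothetical table emp(name string, salary int).
-- ntile(3) splits the salary-ordered rows into 3 roughly equal buckets;
-- keeping bucket 1 yields roughly the top third of salaries.
select name, salary
from (
  select name, salary,
         ntile(3) over(order by salary desc) as bucket
  from emp
) t
where bucket = 1;
```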
Ranking functions:
row_number() over(): ranking without repeats; suited to generating primary keys or non-tied rankings
rank() over(): ranking with ties; the sequence has gaps, e.g. 1,1,3
dense_rank() over(): ranking with ties; the sequence has no gaps, e.g. 1,1,2
Usage of over()
distribute by + sort by combination
Position: inside the parentheses of over()
Syntax: each can be used alone, or the two combined
e.g.:
over(distribute by colName)
over(sort by colName)
over(distribute by colName sort by colName [asc|desc])
Effect:
distribute by colName: specifies the grouping column; each group gets its own window. Without it, the whole table is one group
sort by colName: sorts the rows; without distribute by, the whole table is one group and is sorted as such, otherwise sorting happens within each group
partition by + order by combination
Position: again inside the parentheses of over()
Syntax: each can be used alone, or the two combined
e.g.:
over(partition by colName)
over(order by colName)
over(partition by colName order by colName [asc|desc])
Effect: identical to the distribute by + sort by combination.
Using a window (frame) clause after order by in over(grouping sorting frame)
Purpose: the window clause controls the frame size at a finer granularity
current row: the current row
preceding: toward preceding rows
following: toward following rows
unbounded preceding: from the partition start
unbounded following: to the partition end
For example:
select name,orderdate,cost,
sum(cost) over() as sample1,-- sum over all rows
sum(cost) over(partition by name) as sample2,-- sum within each name group
sum(cost) over(partition by name order by orderdate) as sample3,-- running sum within each name group
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4 ,-- same as sample3: aggregate from the partition start to the current row
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5, -- current row plus the preceding row
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and 1 FOLLOWING ) as sample6,-- preceding row, current row, and following row
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7 -- current row and all following rows
from t_order;
Data: video table
uid channel min
1 1 23
2 1 12
3 1 12
4 1 32
5 1 342
6 2 13
7 2 34
8 2 13
9 2 134
drop table video;
create table video(
uid int,
channel string,
min int
)
row format delimited
fields terminated by ' '
;
load data local inpath './hivedata/video.txt' into table video;
Answer:
select channel,sum(min) from video group by channel;
Data:
userid,month,visits
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
A,2015-03,16
A,2015-03,22
B,2015-03,23
B,2015-03,10
B,2015-03,1
drop table visits;
create table visits(
userid string,
month string,
visits int
)
row format delimited
fields terminated by ','
;
load data local inpath './hivedata/visits.txt' overwrite into table visits;
Requirement: for each user, the maximum single-month visit count up to and including each month, and the cumulative visit total up to that month. Result format:
+---------+----------+---------+-------------+---------------+--+
| userid | month | visits | max_visits | total_visits |
+---------+----------+---------+-------------+---------------+--+
| A | 2015-01 | 33 | 33 | 33 |
| A | 2015-02 | 10 | 33 | 43 |
| A | 2015-03 | 38 | 38 | 81 |
| B | 2015-01 | 30 | 30 | 30 |
| B | 2015-02 | 15 | 30 | 45 |
| B | 2015-03 | 34 | 34 | 79 |
+---------+----------+---------+-------------+---------------+--+
select t.userid,t.month,t.visits,
max(t.visits) over(distribute by t.userid sort by t.month asc) as max_visits,
sum(t.visits) over(distribute by t.userid sort by t.month asc) as total_visits
from
(select userid,month,sum(visits) as visits from visits group by userid,month) t;
Data: t1 table
Uid dt login_status (1 = login succeeded, 0 = abnormal)
1 2019-07-11 1
1 2019-07-12 1
1 2019-07-13 1
1 2019-07-14 1
1 2019-07-15 1
1 2019-07-16 1
1 2019-07-17 1
1 2019-07-18 1
2 2019-07-11 1
2 2019-07-12 1
2 2019-07-13 0
2 2019-07-14 1
2 2019-07-15 1
2 2019-07-16 0
2 2019-07-17 1
2 2019-07-18 0
2 2019-07-19 1
2 2019-07-20 0
2 2019-07-21 1
2 2019-07-22 0
2 2019-07-23 1
2 2019-07-24 0
3 2019-07-11 1
3 2019-07-12 1
3 2019-07-13 1
3 2019-07-14 1
3 2019-07-15 1
3 2019-07-16 1
3 2019-07-17 1
3 2019-07-18 1
drop table login;
create table login(
Uid int,
dt string,
login_status int
)
row format delimited
fields terminated by ' '
;
load data local inpath './hivedata/login.txt' into table login;
-- 1) Window function: number each user's rows ordered by date
select uid,dt,row_number() over(distribute by uid sort by dt) from login where login_status=1;
-- 2) On top of step 1, use date_sub to subtract each row's rank from its date
select t1.uid,date_sub(t1.dt,t1.num) dt from
(select uid,dt,row_number() over(distribute by uid sort by dt) num
from login where login_status=1) t1;
-- 3) On top of step 2, group by uid and the derived date, and find users whose day count exceeds 7
select uid,dt from
(select t1.uid,date_sub(t1.dt,t1.num) dt from
(select uid,dt,row_number() over(distribute by uid sort by dt) num
from login where login_status=1) t1) t2
group by uid,dt
having count(uid)>7;
1. row_number(): with tied scores, the ranks are consecutive and never repeat
score rank
100 1
99 2
99 3
98 4
2. rank(): tied scores share a rank, and the sequence has gaps
score rank
100 1
99 2
99 2
98 4
3. dense_rank(): tied scores share a rank, and the sequence has no gaps
score rank
100 1
99 2
99 2
98 3
None of these three ranking functions can be used alone; each must be paired with over().
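As a sketch, running all three against the stu table defined just below shows the tie behavior side by side in one result set:

```sql
-- One query, three ranking functions over the same ordering.
-- Ties in score are where the three differ.
select score,
       row_number() over(order by score desc) as rn,   -- never repeats
       rank()       over(order by score desc) as rk,   -- ties repeat, then skips
       dense_rank() over(order by score desc) as drk   -- ties repeat, no skips
from stu;
```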
Data: stu table
Stu_no class score
1 1901 90
2 1901 90
3 1901 83
4 1901 60
5 1902 66
6 1902 23
7 1902 99
8 1902 67
9 1902 87
drop table stu;
create table stu(
Stu_no int,
class string,
score int
)
row format delimited
fields terminated by '\t'
;
load data local inpath './hivedata/stu.txt' into table stu;
Write SQL to produce the following result:
+--------+---------+--------+-----+----------+--+
| class | stu_no | score | rn | rn_diff |
+--------+---------+--------+-----+----------+--+
| 1901 | 2 | 90 | 1 | 90 |
| 1901 | 1 | 90 | 2 | 0 |
| 1901 | 3 | 83 | 3 | -7 |
| 1902 | 7 | 99 | 1 | 99 |
| 1902 | 9 | 87 | 2 | -12 |
| 1902 | 8 | 67 | 3 | -20 |
+--------+---------+--------+-----+----------+--+
-- 1) First rank by score within each class
select class,stu_no,score,row_number() over(distribute by class sort by score desc) rn from stu;
-- 2) Filter the top 3 from the derived table of step 1 with where
select * from
(select class,stu_no,score,row_number() over(distribute by class sort by score desc) rn from stu) t where t.rn<=3;
-- 3) Use lag to fetch the previous row
-- lag(col,n,m): the value of <col> n rows before the current row, defaulting to m when absent; without n and m, it fetches the previous row's value and defaults to null
select t.*,t.score-nvl(lag(score) over(distribute by class sort by rn),0) rn_diff from
(select class,stu_no,score,row_number() over(distribute by class sort by score desc) rn
from stu) t where t.rn<=3;
Rows to columns (pivot):
1. Use case when to produce multiple columns, one per distinct value.
Columns to rows (unpivot):
1. lateral view explode(): the explode function turns one column into multiple rows; it applies to array, map, and similar types.
posexplode, compared with explode, additionally outputs each element's index alongside the value.
lateral view posexplode(array): use it when ordering matters and an index is needed. It expands the array into (index, value) pairs, so two aliases are required after as.
2. Combine case when with concat_ws and collect_set/collect_list: case when on the inside, collect_set/collect_list on the outside to gather values, then concat_ws to join the gathered values into one delimited column.
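A sketch of posexplode, assuming a hypothetical table t(uid int, tags string) with comma-separated tags:

```sql
-- Hypothetical table t(uid int, tags string), tags like '1,2,3'.
-- posexplode returns (index, value) pairs, so two aliases are required.
select uid, idx, tag
from t
lateral view posexplode(split(tags, ',')) v as idx, tag;
-- uid=1, tags='1,2,3' expands to (1,0,'1'), (1,1,'2'), (1,2,'3');
-- the index is 0-based.
```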
id sid subject score
1,001,语文,90
2,001,数学,92
3,001,英语,80
4,002,语文,88
5,002,数学,90
6,002,英语,75.5
7,003,语文,70
8,003,数学,85
9,003,英语,90
10,003,政治,82
Write SQL to produce the following result:
+---------+--------+--------+--------+--------+-----------+--+
| sid |u2.语文 | u2.数学 |u2.英语 | u2.政治 | u2.total |
+---------+--------+--------+--------+--------+-----------+--+
| 001 | 90.0 | 92.0 | 80.0 | 0.0 | 262.0 |
| 002 | 88.0 | 90.0 | 75.5 | 0.0 | 253.5 |
| 003 | 70.0 | 85.0 | 90.0 | 82.0 | 327.0 |
| total | 248.0 | 267.0 | 245.5 | 82.0 | 842.5 |
+---------+--------+--------+--------+--------+-----------+--+
drop table score;
create table score(
id int,
sid string,
subject string,
score double
)
row format delimited
fields terminated by ','
;
load data local inpath './hivedata/score.txt' into table score;
-- 1) First pivot rows to columns
select sid
,sum(case subject when '语文' then score else 0 end) `语文`
,sum(case subject when '数学' then score else 0 end) `数学`
,sum(case subject when '英语' then score else 0 end) `英语`
,sum(case subject when '政治' then score else 0 end) `政治`
from score group by sid;
-- 2) Add a per-row total on top of step 1
select u1.*,u1.`语文`+u1.`数学`+u1.`英语`+u1.`政治` `u1.total` from
(select sid
,sum(case subject when '语文' then score else 0 end) `语文`
,sum(case subject when '数学' then score else 0 end) `数学`
,sum(case subject when '英语' then score else 0 end) `英语`
,sum(case subject when '政治' then score else 0 end) `政治`
from score group by sid) u1;
-- 3) On top of step 2, append a row of per-column totals
-- concat(): joins multiple strings into one
-- Syntax: concat(str1, str2, ...)
-- Returns the concatenated string; if any argument is null, the result is null.
select u2.* from
(select u1.*,u1.`语文`+u1.`数学`+u1.`英语`+u1.`政治` `u1.total` from
(select sid
,sum(case subject when '语文' then score else 0 end) `语文`
,sum(case subject when '数学' then score else 0 end) `数学`
,sum(case subject when '英语' then score else 0 end) `英语`
,sum(case subject when '政治' then score else 0 end) `政治`
from score group by sid) u1) u2
union
select 'total' as sid,
sum(`语文`) `语文`,
sum(`数学`) `数学`,
sum(`英语`) `英语`,
sum(`政治`) `政治`,
sum(`u1.total`) `u2.total` from
(select u1.*,u1.`语文`+u1.`数学`+u1.`英语`+u1.`政治` `u1.total` from
(select sid
,sum(case subject when '语文' then score else 0 end) `语文`
,sum(case subject when '数学' then score else 0 end) `数学`
,sum(case subject when '英语' then score else 0 end) `英语`
,sum(case subject when '政治' then score else 0 end) `政治`
from score group by sid) u1) u2;
Data: t1 table
uid tags
1 1,2,3
2 2,3
3 1,2
Write SQL to produce the following result:
uid tag
1 1
1 2
1 3
2 2
2 3
3 1
3 2
drop table t1;
create table t1(
uid int,
tags string
)
row format delimited
fields terminated by '\t'
;
load data local inpath './hivedata/t1.txt' into table t1;
select uid,tag from t1 lateral view explode(split(tags,",")) A as tag;
Data: T2 table:
Tags
1,2,3
1,2
2,3
T3 table:
id lab
1 A
2 B
3 C
Using the data in tables T2 and T3, write SQL to produce the following result:
+--------+--------+--+
| tags | labs |
+--------+--------+--+
| 1,2 | A,B |
| 1,2,3 | A,B,C |
| 2,3 | B,C |
+--------+--------+--+
drop table t2;
create table t2(
tags string
);
load data local inpath './hivedata/t2.txt' overwrite into table t2;
drop table t3;
create table t3(
id int,
lab string
)
row format delimited
fields terminated by ' '
;
load data local inpath './hivedata/t3.txt' overwrite into table t3;
-- 1) Explode the t2 table
select tags,tag from t2 lateral view explode(split(tags,",")) tags as tag;
-- 2) Join t3 against the derived table of step 1
select A.tags,t3.lab from
(select tags,tag from t2 lateral view explode(split(tags,","))tags as tag) A left join t3 on A.tag=t3.id;
-- 3) Collapse step 2's rows back into a delimited column
-- concat_ws(separator, args...)
select B.tags,concat_ws(',',collect_list(B.lab)) as `labs` from
(select A.tags,t3.lab from
(select tags,tag from t2 lateral view explode(split(tags,","))tags as tag) A left join t3 on A.tag=t3.id) B
group by B.tags;
Data: t4 table:
id tag flag
a b 2
a b 1
a b 3
c d 6
c d 8
c d 8
Write SQL to produce the following result:
id tag flag
a b 2|1|3
c d 6|8
drop table t4;
create table t4(
id string,
tag string,
flag int
)
row format delimited
fields terminated by ' '
;
load data local inpath './hivedata/t4.txt' overwrite into table t4;
-- cast(arg1 as arg2): arg1 is the value to convert, arg2 the target type
select id,tag,concat_ws('|',collect_set(cast(flag as string))) `flag`
from t4
group by id,tag;
Data: t5 table
uid name tags
1 goudan chihuo,huaci
2 mazi sleep
3 laotie paly
Write SQL to produce the following result:
uid name tag
1 goudan chihuo
1 goudan huaci
2 mazi sleep
3 laotie paly
drop table t5;
create table t5(
uid string,
name string,
tags string
)
row format delimited
fields terminated by '\t' ;
load data local inpath './hivedata/t5.txt' overwrite into table t5;
select uid,name,tag from t5 lateral view explode(split(tags,",")) A as tag;
Data: content table:
uid contents
1 i|love|china
2 china|is|good|i|i|like
Produce the counts below; on ties, order by content name:
+----------+------+--+
| content | num |
+----------+------+--+
| i | 3 |
| china | 2 |
| good | 1 |
| is | 1 |
| like | 1 |
| love | 1 |
+----------+------+--+
drop table content;
create table content(
uid int,
contents string
)
row format delimited
fields terminated by '\t'
;
load data local inpath './hivedata/content.txt' overwrite into table content;
-- 1) Explode the column into rows
select uid,content from content lateral view explode(split(contents,"\\|")) t as content;
-- 2) Group, count, and sort on top of step 1
select content,count(content) num from
(select uid,content from content lateral view explode(split(contents,"\\|")) t as content) A
group by content
order by num desc,content;
Data: course table
id course
1,a
1,b
1,c
1,e
2,a
2,c
2,d
2,f
3,a
3,b
3,c
3,e
Write SQL to get the result below (1 = enrolled, 0 = not enrolled):
+-----+----+----+----+----+----+----+--+
| id | a | b | c | d | e | f |
+-----+----+----+----+----+----+----+--+
| 1 | 1 | 1 | 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 1 | 1 | 0 | 1 |
| 3 | 1 | 1 | 1 | 0 | 1 | 0 |
+-----+----+----+----+----+----+----+--+
create table course(
id int,
course string
)
row format delimited
fields terminated by ','
;
load data local inpath './hivedata/course.txt' overwrite into table course;
select id
,sum(case course when 'a' then 1 else 0 end) as `a`
,sum(case course when 'b' then 1 else 0 end) as `b`
,sum(case course when 'c' then 1 else 0 end) as `c`
,sum(case course when 'd' then 1 else 0 end) as `d`
,sum(case course when 'e' then 1 else 0 end) as `e`
,sum(case course when 'f' then 1 else 0 end) as `f`
from course group by id;
from_unixtime(bigint unixtime[, string format]): converts a Unix timestamp to a date string
unix_timestamp([string date]): converts a date in "yyyy-MM-dd HH:mm:ss" format to a Unix timestamp; returns bigint, or 0 on failure. With no argument, returns the current timestamp.
to_date(string timestamp): extracts the date part, e.g. to_date('2011-12-08 10:03:01') returns 2011-12-08
year(string date): extracts the year from a date like 2011-12-08 10:03:01
month(string date): extracts the month from a date like 2011-12-08 10:03:01
hour(string date): extracts the hour from a date like 2011-12-08 10:03:01
day(string date): extracts the day from a date like 2011-12-08 10:03:01
datediff(string enddate, string startdate): date comparison; returns the number of days from startdate to enddate
date_sub(string startdate, int days): returns the date string days days before startdate
date_add(string startdate, int days): returns the date string days days after startdate
last_day(string date): returns the last day of that month as a string; any time part (HH:mm:ss) is ignored
next_day(string date, string x): returns the date of the next weekday x (x is the full English day name or its first two letters, e.g. MONDAY or MO); returns a string
current_date(): returns today's date as a string; takes no arguments
current_timestamp(): returns the current timestamp
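A few sketches of the functions above that the worked examples below don't cover (outputs per Hive's documented behavior):

```sql
select last_day('2019-07-15');            -- '2019-07-31'
select next_day('2019-07-15', 'MONDAY');  -- the next Monday strictly after 2019-07-15
select date_add('2019-07-15', 10);        -- '2019-07-25'
select date_sub('2019-07-15', 10);        -- '2019-07-05'
select datediff('2019-07-31', '2019-07-15'); -- 16
```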
Get the current timestamp:
select unix_timestamp();
Get the timestamp for "2019-07-31 11:57:25":
select unix_timestamp("2019-07-31 11:57:25");
Get the timestamp for "2019-07-31 11:57":
select unix_timestamp("2019-07-31 11:57","yyyy-MM-dd HH:mm");
Get the date and time for timestamp 1564545445:
select from_unixtime(1564545445);
Get the date and hour (yyyy/MM/dd HH) for timestamp 1564545445:
select from_unixtime(1564545445,"yyyy/MM/dd HH");
Data: dt table
20190730
20190731
Write SQL to produce the following result:
2019-07-30
2019-07-31
drop table dt;
create table dt(
dt string
);
load data local inpath './hivedata/dt.txt' overwrite into table dt;
select from_unixtime(unix_timestamp(dt,"yyyyMMdd"),"yyyy-MM-dd") from dt;
sid month money
a,01,150
a,01,200
b,01,1000
b,01,800
c,01,250
c,01,220
b,01,6000
a,02,2000
a,02,3000
b,02,1000
b,02,1500
c,02,350
c,02,280
a,03,350
a,03,250
drop table store;
create table store(
sid string,
month string,
money int
)
row format delimited
fields terminated by ','
;
load data local inpath './hivedata/store.txt' overwrite into table store;
Write Hive HQL to compute each store's sales for each month and its cumulative sales up to that month.
-- 1) Each store's sales for the month
select sid,month,sum(money) as money from store group by sid,month;
-- 2) On top of step 1, the cumulative total up to the month
select sid,month,money,sum(money) over(partition by sid order by month) total
from
(select sid,month,sum(money) as money from store group by sid,month) A;
Data skew: data piles up on a few reducers because keys are unevenly distributed.
The data itself may already be skewed
join statements easily cause it
count(distinct col) very easily causes it
group by can also cause it
Handling:
1. If group by causes it, consider these settings:
-- whether to aggregate on the Map side; defaults to true
hive.map.aggr = true
-- load-balance when data is skewed (defaults to false)
hive.groupby.skewindata = true
How this works:
hive.map.aggr=true turns on map-side aggregation, like a combiner doing pre-aggregation.
hive.groupby.skewindata=true makes the generated query plan run two MR jobs.
In the first MR job, the map output is distributed randomly across the reducers, and each reducer performs a partial aggregation and emits its result. Rows with the same group-by key may therefore land on different reducers, which balances the load.
The second MR job then distributes the pre-aggregated results to reducers by group-by key (this step guarantees rows with the same key reach the same reducer) and completes the final aggregation.
2. For count(distinct) on very large data, SQL such as
select a,count(distinct b) from t group by a;
hits data skew.
Fix: replace it with sum ... group by, e.g.
select a,sum(1) from (select a, b from t group by a,b) group by a;
3. For join-induced skew, find the skewed keys (e.g. any single key with over 100,000 rows) and handle them separately
-- enable this when skew shows up during a join
set hive.optimize.skewjoin = true;
Method 1: pull the skewed keys out, process them separately, then stitch the results back together with union all
Method 2: assign random keys to the null values (the business result is unaffected), then join
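Method 2 can be sketched as follows, assuming hypothetical tables a and b joined on uid, where a has many null keys:

```sql
-- Scatter null join keys with a random suffix so the skewed nulls
-- spread across reducers; the rand()-based keys never match b.uid,
-- so those rows still come out unmatched and the result is unchanged.
-- Tables a and b are hypothetical.
select a.*, b.info
from a
left join b
  on (case when a.uid is null
           then concat('null_', rand())
           else a.uid end) = b.uid;
```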
Table A records user logins for the month; each login adds one row. Table A:
log_time uid
2018-10-01 12:34:11 123
2018-10-02 13:21:08 123
2018-10-02 14:21:08 123
2018-10-02 14:08:09 456
2018-10-04 05:10:22 123
2018-10-04 21:38:38 456
2018-10-04 22:38:38 456
2018-10-05 09:57:32 123
2018-10-06 13:22:56 123
2018-11-01 12:34:11 123
2018-11-02 13:21:08 123
2018-11-02 14:21:08 123
2018-11-02 14:08:09 456
2018-11-04 05:10:22 123
2018-11-04 21:38:38 456
2018-11-05 09:57:32 123
2018-11-06 13:22:56 123
Compute each user's longest streak of consecutive login days in the month. In the sample data, user 123's longest streak is 3 and user 456's is 1.
drop table login_time;
create table login_time(
log_time timestamp,
uid string
)
row format delimited
fields terminated by '\t';
load data local inpath './hivedata/login_time.txt' overwrite into table login_time;
-- Note: the raw data may need cleaning so each user has only one login row per day
-- 1) Format the date with the date-to-string function date_format
select distinct uid,date_format(log_time,"yyyy-MM-dd") as dt from login_time;
-- 2) On top of step 1, use date_sub to subtract each row's rank from its date
select uid,dt,date_sub(dt,row_number() over(partition by uid order by dt))
from
(select distinct uid,date_format(log_time,"yyyy-MM-dd") as dt from login_time) A;
-- 3) On top of step 2, count rows with the same derived date, grouped by month
select uid,odt,date_format(dt,"yyyy-MM"),count(1) cnt from
(select uid,dt,date_sub(dt,row_number() over(partition by uid order by dt)) as odt
from
(select distinct uid,date_format(log_time,"yyyy-MM-dd") as dt from login_time) A) B
group by uid,odt,date_format(dt,"yyyy-MM");
-- 4) On top of step 3, take the max cnt per uid
select uid,max(cnt) from
(select uid,odt,date_format(dt,"yyyy-MM"),count(1) cnt from
(select uid,dt,date_sub(dt,row_number() over(partition by uid order by dt)) as odt
from
(select distinct uid,date_format(log_time,"yyyy-MM-dd") as dt from login_time) A) B
group by uid,odt,date_format(dt,"yyyy-MM")) C
group by uid;
Sample data: t1 table
gender,cookie,ip,timestampe,ua
F,1707041428491566106,111.200.195.186,1208524973899,Dalvik%2F2.1.0%20%28Linux%3B%20U%3B%20Android
…the full data appears in an image (not included here)
Rewrite the awk from the image in SQL, then answer the questions above.
Write SQL for the pv/uv statistics; the other questions can be answered in words.
Data: diff_t1 table:
id name
1 zs
2 ls
diff_t2 table:
id name
1 zs
3 ww
Result:
id name
2 ls
3 ww
drop table diff_t1;
create table diff_t1(
id string,
name string
)
row format delimited
fields terminated by ' '
;
load data local inpath './hivedata/diff_t1.txt' overwrite into table diff_t1;
drop table diff_t2;
create table diff_t2(
id string,
name string
)
row format delimited
fields terminated by ' '
;
load data local inpath './hivedata/diff_t2.txt' overwrite into table diff_t2;
-- 1) Rows in t1 that are not in t2
select t1.id as `id`,t1.name as `name`
from diff_t1 t1
left join diff_t2 t2 on t1.id=t2.id
where t2.id is null;
-- 2) Rows in t2 that are not in t1
select t2.id `id`, t2.name `name`
from diff_t1 t1
right join diff_t2 t2 on t1.id=t2.id
where t1.id is null;
-- 3) Combine with union
select t1.id as `id`,t1.name as `name`
from diff_t1 t1
left join diff_t2 t2 on t1.id=t2.id
where t2.id is null
union
select t2.id `id`, t2.name `name`
from diff_t1 t1
right join diff_t2 t2 on t1.id=t2.id
where t1.id is null;
A website's purchase records have the following fields
orderid,userid,productid,price,timestamp,date
121,张三,3,100,1535945356,2018-08-07
122,张三,3,200,1535945356,2018-08-08
123,李四,3,200,1535945356,2018-08-08
124,王五,1,200,1535945356,2018-08-08
125,张三,3,200,1535945356,2018-08-09
126,张三,2,200,1535945356,2018-08-09
127,李四,3,200,1535945356,2018-08-09
128,李四,3,200,1535945356,2018-08-10
129,李四,3,200,1535945356,2018-08-11
Write SQL to find users who bought product productid = 3 both today and the previous day, along with their spending on the previous day.
drop table product;
create table product(
orderid string,
userid string,
productid int,
price int,
tamp int,
dt date
)
row format delimited
fields terminated by ',';
load data local inpath './hivedata/product.txt' overwrite into table product;
-- 1) Total spending per user per day with over()
select userid,dt,productid,sum(price) over(partition by userid,dt order by dt) total_price
from product order by userid,dt;
-- 2) On top of step 1, look up the previous day's spending, restricted to productid = 3
select userid,dt,
lag(dt,1) over(partition by userid order by dt) `yesterday`,
case when datediff(dt,lag(dt,1) over(partition by userid order by dt))=1
then lag(total_price) over(partition by userid order by dt)
else null end `yesterday_price`
from
(select userid,dt,productid,sum(price) over(partition by userid,dt order by dt) total_price
from product order by userid,dt) A
where productid=3;
-- 3) On top of step 2, drop the null rows
select * from
(select userid,dt,
lag(dt,1) over(partition by userid order by dt) `yesterday`,
case when datediff(dt,lag(dt,1) over(partition by userid order by dt))=1
then lag(total_price) over(partition by userid order by dt)
else null end `yesterday_price`
from
(select userid,dt,productid,sum(price) over(partition by userid,dt order by dt) total_price
from product order by userid,dt) A
where productid=3) B
where `yesterday_price` is not null;
Table user_action_log holds user behavior data
uid time action
1 Time1 Read
3 Time2 Comment
1 Time3 Share
2 Time4 Like
1 Time5 Write
2 Time6 like
3 Time7 Write
2 Time8 Read
Analyze behavior habits to find each user's first action in the table
drop table user_action_log;
create table user_action_log(
uid int,
time string,
action string
)
row format delimited
fields terminated by '\t';
load data local inpath './hivedata/user_action_log.txt' overwrite into table user_action_log;
Implementation:
-- 1) Number each user's actions in time order
select uid,time,action,row_number() over(partition by uid order by time) `第n次行为` from user_action_log;
-- 2) On top of step 1, keep the first action
select * from
(select uid,time,action,row_number() over(partition by uid order by time) `第n次行为` from user_action_log) A
where `第n次行为`=1;
Data: user_login table
uid,dt
1,2019-08-01
1,2019-08-02
1,2019-08-03
2,2019-08-01
2,2019-08-02
3,2019-08-01
3,2019-08-03
4,2019-07-28
4,2019-07-29
4,2019-08-01
4,2019-08-02
4,2019-08-03
Result:
uid cnt_days
1 3
2 2
3 1
4 3
drop table user_login;
create table user_login(
uid int,
dt date
)
row format delimited
fields terminated by ',';
load data local inpath './hivedata/user_login.txt' overwrite into table user_login;
-- 1) Use date_sub with a window function to subtract each row's rank from its date, partitioned by uid and ordered by date
select uid,dt,date_sub(dt,row_number() over(partition by uid order by dt)) sub_dt
from user_login;
-- 2) Group and count on top of step 1
select uid,sub_dt,count(1) cnt from
(select uid,dt,date_sub(dt,row_number() over(partition by uid order by dt)) sub_dt
from user_login) A
group by uid,sub_dt;
-- 3) Take the max per uid on top of step 2
select uid, max(cnt) as `cnt_days` from
(select uid,sub_dt,count(1) cnt from
(select uid,dt,date_sub(dt,row_number() over(partition by uid order by dt)) sub_dt
from user_login) A
group by uid,sub_dt) B
group by uid;
Data:
t1 table
uid dt url
1 2019-08-06 http://www.baidu.com
2 2019-08-06 http://www.baidu.com
3 2019-08-06 http://www.baidu.com
3 2019-08-06 http://www.soho.com
3 2019-08-06 http://www.meituan.com
3 2019-08-06
Result:
dt uv pv
2019-08-06 3 5
drop table user_net_log;
create table user_net_log(
uid int,
dt date,
url string
)
row format delimited
fields terminated by ' ';
load data local inpath './hivedata/user_net_log.txt' overwrite into table user_net_log;
select dt,count(distinct uid) `uv`,count(url) `pv`
from user_net_log
group by dt;
coalesce(T v1, T v2, ...): returns the first non-null element in the list, or NULL if every element is null.
e.g. select coalesce(NULL,null,123,"ABC"); returns 123
nvl(T v1, T v2): null check; returns v1 if it is non-null, otherwise v2; v1 and v2 must be of the same type.
e.g. select nvl(null,1); returns 1
concat_ws(separator, str1, str2, ...): joins strings using the given separator (the first argument). The arguments must be strings.
e.g. select concat_ws("|","1","2","3"); returns 1|2|3
collect_list(T col): gathers a column's values into an array without deduplication. Usually paired with group by, though group by is not required.
e.g. select collect_list(id) from t1; returns the ids gathered into an array.
e.g. for id values 1,2,2 it returns ["1","2","2"]
collect_set(T col): gathers a column's values into an array with deduplication. Usually paired with group by, though group by is not required.
e.g. select collect_set(id) from t1; for id values 1,2,2 it returns ["1","2"]
regexp_replace(source_string, pattern, replace_string): replaces every match of the pattern with replace_string, enabling complex search-and-replace.
e.g. select regexp_replace(img,"\\.png",".jpg") from t2; replaces .png in img with .jpg. For img values 1.png and 2.jsp it returns 1.jpg and 2.jsp
pk_moba table
id names
1 亚索,挖据机,艾瑞莉亚,洛,卡莎
2 亚索,盖伦,奥巴马,牛头,皇子
3 亚索,盖伦,艾瑞莉亚,宝石,琴女
4 亚索,盖伦,赵信,老鼠,锤石
Use Hive SQL to find the top 3 most-picked champions and their pick rate (= games picked / total games)
create table pk_moba(
id int,
names array<string>
)
row format delimited
fields terminated by '\t'
collection items terminated by ',';
load data local inpath './hivedata/pk_moba.txt' overwrite into table pk_moba;
-- 1) Explode names into rows and count with count()
select name,count(name) cnt from pk_moba lateral view explode(names) t1 as name
group by name;
-- 2) Rank by cnt with dense_rank() over() on top of step 1
select name,cnt,dense_rank() over(sort by cnt desc) rk from
(select name,count(name) cnt from pk_moba lateral view explode(names) t1 as name
group by name) A;
-- 3) Pick rate and top 3 on top of step 2
select name,cnt,rk `top`,concat(round(cnt/4*100,0),"%") `pick率` from
(select name,cnt,dense_rank() over(sort by cnt desc) rk from
(select name,count(name) cnt from pk_moba lateral view explode(names) t1 as name
group by name) A) B
where rk<=3;
district: two fields, district id (disid) and district name (disname)
city: two fields, city id (cityid) and district id (disid)
order: four fields, order id (orderid), user id (userid), city id (cityid), and amount
district table:
disid disname
1 华中
2 西南
create table district(
disid int,
disname string
)
row format delimited
fields terminated by ' ';
load data local inpath './hivedata/district.txt' overwrite into table district;
city table:
cityid disid
1 1
2 1
3 2
4 2
5 2
create table city(
cityid int,
disid int
)
row format delimited
fields terminated by ' ';
load data local inpath './hivedata/city.txt' overwrite into table city;
order table:
oid userid cityid amount
1 1 1 1223.9
2 1 1 9999.9
3 2 2 2322
4 2 2 8909
5 2 3 6789
6 2 3 798
7 3 4 56786
8 4 5 78890
create table order_t(
oid int,
userid int,
cityid int,
amount float
)
row format delimited
fields terminated by ' ';
load data local inpath './hivedata/order.txt' overwrite into table order_t;
High spenders are users whose spending exceeds 10,000. Use Hive HQL to produce the report:
district name, number of high spenders, total amount
-- 1) Join the three tables and sum(amount) per user
select disname,userid,sum(amount) `amount` from district d
join city c on c.disid=d.disid
join order_t o on o.cityid=c.cityid
group by disname,userid;
-- 2) Filter and aggregate on top of step 1
select disname `区域名`,count(1) `高消费者人数`,sum(`amount`) `消费总额` from
(select disname,userid,sum(amount) `amount` from district d
join city c on c.disid=d.disid
join order_t o on o.cityid=c.cityid
group by disname,userid) A
where amount>10000
group by disname;
(1) Overall daily UV and PV?
select substr(log_time,1,10) dt,count(distinct user_id) uv,count(1) pv
from access_log
group by substr(log_time,1,10);
(2) Daily UV and PV per user type?
select substr(log_time,1,10) dt,user_type,count(distinct user_id) uv,count(1) pv
from access_log
group by substr(log_time,1,10),user_type;
(3) Earliest and latest access time per type per day?
select substr(log_time,1,10) dt,user_type,min(log_time),max(log_time)
from access_log
group by substr(log_time,1,10),user_type;
(4) Top 10 users by access count per type per day?
-- 1) Count each user's accesses per day, truncating the date with substr
select substr(log_time,1,10) dt,user_type,user_id,count(1) cnt
from access_log
group by substr(log_time,1,10),user_type,user_id;
-- 2) Rank within each day and type on top of step 1
select dt,user_type,user_id,cnt,row_number() over(partition by dt,user_type order by cnt desc) rn
from
(select substr(log_time,1,10) dt,user_type,user_id,count(1) cnt
from access_log
group by substr(log_time,1,10),user_type,user_id) A;
-- 3) Keep the top 10 on top of step 2
select dt,user_type,user_id,cnt,rn
from
(select dt,user_type,user_id,cnt,row_number() over(partition by dt,user_type order by cnt desc) rn
from
(select substr(log_time,1,10) dt,user_type,user_id,count(1) cnt
from access_log
group by substr(log_time,1,10),user_type,user_id) A) B
where rn<=10;
login_a table (logins):
ds user_id
2019-08-06 1
2019-08-06 2
2019-08-06 3
2019-08-06 4
create table login_a(
ds date,
user_id int
)
row format delimited
fields terminated by ' ';
load data local inpath './hivedata/login_a.txt' overwrite into table login_a;
read_b table (reads):
ds user_id read_num
2019-08-06 1 2
2019-08-06 2 3
2019-08-06 3 6
create table read_b(
ds date,
user_id int,
read_num int
)
row format delimited
fields terminated by ' ';
load data local inpath './hivedata/read_b.txt' overwrite into table read_b;
cost_c table (payments):
ds user_id price
2019-08-06 1 55.6
2019-08-06 2 55.8
create table cost_c(
ds date,
user_id int,
price float
)
row format delimited
fields terminated by ' ';
load data local inpath './hivedata/cost_c.txt' overwrite into table cost_c;
Using the three tables above, write Hive HQL for the following:
(1) The number of users who logged in and read that day, and the number of books read
select A.ds,count(distinct A.user_id),sum(B.read_num)
from login_a A join read_b B on B.user_id=A.user_id and B.ds=A.ds
group by A.ds;
-- the login table seems redundant here
select ds,count(distinct user_id),sum(read_num)
from read_b group by ds;
(2) The number of users who logged in and read but did not pay
select A.ds,count(1)
from login_a A join read_b B on B.user_id=A.user_id and B.ds=A.ds
left join cost_c C on C.user_id=B.user_id and C.ds=B.ds
where C.price is null
group by A.ds;
-- the login table seems redundant here
select B.ds,count(1)
from read_b B
left join cost_c C on C.user_id=B.user_id and C.ds=B.ds
where C.price is null
group by B.ds;
(3) Users who logged in and paid: the number of paying users and the total amount
select A.ds,count(distinct C.user_id),sum(C.price)
from login_a A join cost_c C on C.user_id=A.user_id and C.ds=A.ds
group by A.ds;
-- the login table seems redundant here
select C.ds,count(1),sum(price)
from cost_c C
group by C.ds;
In Hive, left join is equivalent to left outer join.
Differences between left semi join and left outer join:
1. left semi join behaves like IN: left rows with no match in the right table are filtered out; a left row matching multiple right rows appears only once; only the left table's columns can be output, never the right table's.
2. left outer join keeps all left rows; a left row matching multiple right rows appears multiple times; columns from both tables can be output.
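A sketch of the difference, assuming hypothetical tables a(id) and b(id):

```sql
-- left semi join behaves like "where a.id in (select id from b)":
-- only a's columns are selectable, and each matching a row appears once.
select a.id
from a
left semi join b on a.id = b.id;

-- left outer join keeps unmatched a rows (b.id comes back null)
-- and repeats an a row once per matching b row.
select a.id, b.id
from a
left outer join b on a.id = b.id;
```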
create table `order`(
order_id bigint,
user_id bigint comment 'user id',
amount double comment 'order amount',
channel string comment 'channel',
time string comment 'order time, yyyy-MM-dd HH:mi:ss'
)
partitioned by (dt string comment 'day, yyyy-MM-dd');
Use Hive HQL to find, for 2019-08-06, each channel's number of ordering users and total order amount.
Result header: channel user_num order_amount
-- to_date(arg) extracts the date portion of a timestamp string
select channel,count(distinct user_id) user_num,sum(amount) order_amount
from `order`
where to_date(time)='2019-08-06'
group by channel;
Design tables to store student basic info, course info, and students' courses and scores,
then write a query for all students whose average score is greater than 85.
create table stu_1(
id string,
name string,
age int,
addr string
)
row format delimited
fields terminated by ','
;
create table course_1(
cid string,
cname string
)
row format delimited
fields terminated by ',' ;
create table course_sc(
id string,
cid string,
score int
)
row format delimited
fields terminated by ','
;
load data local inpath '/hivedata/course_1.txt' overwrite into table course_1;
load data local inpath '/hivedata/stu_1.txt' overwrite into table stu_1;
load data local inpath '/hivedata/course_sc.txt' overwrite into table course_sc;
select st.id,st.name,co.cname,avg(score) `平均成绩`
from stu_1 st join course_sc sc on st.id=sc.id
join course_1 co on co.cid=sc.cid
group by st.id,st.name,co.cname
having avg(score)>85;
Given a user table user(uid,name) and a blacklist BanUser(uid):
1. Use a left join to list all users not on the blacklist
2. Use not exists to list all users not on the blacklist
create table u(
id string,
name string
)
row format delimited
fields terminated by ','
;
create table banuser(
id string
);
load data local inpath '/hivedata/banuser.txt' overwrite into table banuser;
load data local inpath '/hivedata/u.txt' overwrite into table u;
-- left join: all users not on the blacklist
select u.id,u.name from u
left join banuser on u.id=banuser.id
where banuser.id is null;
-- not exists: all users not on the blacklist
select u.id,u.name from u
where not exists (select 1 from banuser where banuser.id=u.id);
Data for the course_score table:
1,zhangsan,数学,80,2015
2,lisi,语文,90,2016
3,lisi,数学,70,2016
4,wangwu,化学,80,2017
5,zhangsan,语文,85,2015
6,zhangsan,化学,90,2015
create table course_score(
id string,
name string,
course string,
score int,
year string
)
row format delimited
fields terminated by ','
;
load data local inpath './hivedata/course_score.txt' overwrite into table course_score;
1. For each term, the record with the highest score per course (all 5 fields)
Approach 1: join against the per-group max
select A.id,A.name,A.course,A.score,A.year
from course_score A
join (select course,year,max(score) ms from course_score group by course,year) B
on A.course=B.course and A.year=B.year and A.score=B.ms;
Approach 2: window function over(), then filter
select id,name,course,score,year from
(select id,name,course,score,year,max(score) over(partition by year,course) ms
from course_score) t
where score=ms;
2. Students who scored 90+ in 语文 in a term: their 数学 score records for that term (all fields)
Approach 1: self join
-- 1 job
select A.id,A.name,A.course,A.score,A.year
from course_score A join course_score B on A.name=B.name and A.year=B.year
where B.course='语文' and B.score>=90 and A.course='数学';
Approach 2: subquery
-- 1 job
select cs.id,cs.name,cs.course,cs.score,cs.year
from course_score cs join
(select id,name,course,score,year from course_score
where score>=90 and course='语文') A
on cs.name=A.name and cs.year=A.year
where cs.course='数学';
t1 table:
name course score
aa English 75
bb math 85
aa math 90
create table t1_1(
name string,
course string,
score int
)
row format delimited
fields terminated by ' ';
load data local inpath './hivedata/t1_1.txt' overwrite into table t1_1;
Use HQL to produce the result below
name English math
aa 75 90
bb 0 85
select name
,sum(case course when 'English' then score else 0 end) as English
,sum(case course when 'math' then score else 0 end) as math
from t1_1
group by name;
t1 table:
user product
A P1
B P1
A P2
B P3
Use HQL to produce the result below:
user P1 P2 P3
A 1 1 0
B 1 0 1
select username
,sum(if(product='P1',1,0)) P1
,sum(if(product='P2',1,0)) P2
,sum(if(product='P3',1,0)) P3
from t1
group by username;
dpt department table
dpt_id dpt_name
1 产品
2 技术
User table
user_id dpt_id
1 1
2 1
3 2
4 2
5 3
Use HQL to produce the result below
user_id dpt_id dpt_name
1 1 产品
2 1 产品
3 2 技术
4 2 技术
5 3 其他部门
select `user`.user_id,`user`.dpt_id,nvl(dpt.dpt_name,'其他部门') as dpt_name
from `user`
left join dpt on `user`.dpt_id=dpt.dpt_id;
t1_order table:
order_id order_type order_time
111 N 10:00
111 A 10:05
111 B 10:10
create table t1_order(
order_id string,
order_type string,
order_time string
)
row format delimited
fields terminated by ' ';
load data local inpath './hivedata/t1_order.txt' overwrite into table t1_order;
Use HQL to get the result below:
order_id order_type_1 order_type_2 order_time_1 order_time_2
111 N A 10:00 10:05
111 A B 10:05 10:10
-- 1) Use lead() over() to fetch the next row's type and time
select order_id,order_type as `order_type_1`
,lead(order_type,1) over(sort by order_time) `order_type_2`
,order_time as `order_time_1`
,lead(order_time,1) over(sort by order_time) `order_time_2`
from t1_order;
-- 2) Filter on top of step 1
select * from
(select order_id,order_type as `order_type_1`
,lead(order_type,1) over(sort by order_time) `order_type_2`
,order_time as `order_time_1`
,lead(order_time,1) over(sort by order_time) `order_time_2`
from t1_order) A
where `order_type_2` is not null;
t1_hobby table
name sex hobby
janson 男 打乒乓球、游泳、看电影
tom 男 打乒乓球、看电影
drop table t1_hobby;
create table t1_hobby(
name string,
sex string,
hobby string
)
row format delimited
fields terminated by ' ';
load data local inpath './hivedata/t1_hobby.txt' overwrite into table t1_hobby;
hobby has at most 3 values; use HQL to produce the result below:
name sex hobby1 hobby2 hobby3
janson 男 打乒乓球 游泳 看电影
tom 男 打乒乓球 看电影
select name,sex
,split(hobby,"、")[0] `hobby1`
,split(hobby,"、")[1] `hobby2`
,nvl(split(hobby,"、")[2],"") `hobby3`
from t1_hobby;