SQL and Hive Scenario Questions with Answers

Older versions of MySQL do not support window functions; support was added in MySQL 8.0. Oracle and Hive both support them. All of the problems below are implemented in Hive.

Question 1:

Which window functions do you know, and what does each do? Give an example of how one is used in a business scenario. Write out window functions and their meanings, then write any SQL statement that uses a window function and explain what it does.

Window function: the usual form is an aggregate or analytic function followed by an over() clause.

-- query all detail rows
select * from t_order;
-- query the total row count
select count(*) from t_order;
-- a window function attaches that total to every detail row:
select *, count(*) over() from t_order;

2. For each channel, find the number of views and the total watch time.

Data: video table

uid channel min 
1 1 23
2 1 12
3 1 12
4 1 32
5 1 342
6 2 13
7 2 34
8 2 13
9 2 134
create table video( 
uid int, 
channel string, 
min int 
)
row format delimited 
fields terminated by ' ' 
;
load data local inpath './data/video' into table video;

Answer:

select channel,count(*) count,sum(min) total from video group by channel;

3. Write SQL to implement the following.

Data:

userid,month,visits 
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
A,2015-03,16
A,2015-03,22
B,2015-03,23
B,2015-03,10
B,2015-03,1

drop table visits;
create table visits(
userid string, 
month string, 
visits int 
)
row format delimited 
fields terminated by ',' 
;
load data local inpath './hivedata/visits' overwrite into table visits; 

Requirement: for each user, find the maximum single-month visit count up to each month, and the cumulative total visits up to that month. The result format is as follows:

+---------+----------+---------+-------------+---------------+--+
| userid  |  month   | visits  | max_visits  | total_visits  |
+---------+----------+---------+-------------+---------------+--+
| A       | 2015-01  | 33      | 33          | 33            |
| A       | 2015-02  | 10      | 33          | 43            |
| A       | 2015-03  | 38      | 38          | 81            |
| B       | 2015-01  | 30      | 30          | 30            |
| B       | 2015-02  | 15      | 30          | 45            |
| B       | 2015-03  | 34      | 34          | 79            |
+---------+----------+---------+-------------+---------------+--+
select userid,month,
visits,
max(visits) over(distribute by userid sort by month) max_visits,
sum(visits) over(distribute by userid sort by month) total_visits
from (
select userid,month,
sum(visits) visits
from visits 
group by userid,month
) t;
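The two window columns can be checked with a small plain-Python sketch (the monthly totals below are taken from the expected output above):

```python
# Running max and running sum per user, ordered by month -- the plain-Python
# analogue of max(visits) over(...) and sum(visits) over(...).
from itertools import groupby

# (userid, month, monthly_visits) -- already aggregated per user and month
rows = [
    ("A", "2015-01", 33), ("A", "2015-02", 10), ("A", "2015-03", 38),
    ("B", "2015-01", 30), ("B", "2015-02", 15), ("B", "2015-03", 34),
]

def running_stats(rows):
    out = []
    for uid, grp in groupby(rows, key=lambda r: r[0]):
        max_v, total = 0, 0
        for _, month, visits in grp:  # rows are sorted by month within each user
            max_v = max(max_v, visits)
            total += visits
            out.append((uid, month, visits, max_v, total))
    return out

for row in running_stats(rows):
    print(row)
```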

4. Count the number of users who logged in successfully for 7 consecutive days.

Data: login table

uid dt login_status (1 = login succeeded, 0 = failure)
1 2019-07-11 1 
1 2019-07-12 1 
1 2019-07-13 1 
1 2019-07-14 1 
1 2019-07-15 1 
1 2019-07-16 1 
1 2019-07-17 1 
1 2019-07-18 1 
2 2019-07-11 1 
2 2019-07-12 1 
2 2019-07-13 0 
2 2019-07-14 1 
2 2019-07-15 1 
2 2019-07-16 0 
2 2019-07-17 1 
2 2019-07-18 0 
2 2019-07-19 1 
2 2019-07-20 0 
2 2019-07-21 1 
2 2019-07-22 0 
2 2019-07-23 1 
2 2019-07-24 0 
3 2019-07-11 1 
3 2019-07-12 1 
3 2019-07-13 1 
3 2019-07-14 1 
3 2019-07-15 1 
3 2019-07-16 1 
3 2019-07-17 1 
3 2019-07-18 1 
drop table login;
create table login( 
Uid int, 
dt string, 
login_status int 
)
row format delimited 
fields terminated by ' ' 
;

load data local inpath './hivedata/login' into table login; 
select count(*)
from
(
select distinct uid
from(
select uid,dt,lag(dt,6) over(partition by uid order by dt) pre_dt,
sum(login_status) over(partition by uid order by dt rows between 6 preceding and current row) total
from login
) t
where date_sub(dt,6)=pre_dt and t.total=7) t1;
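The lag(dt, 6) condition can be sketched in plain Python: a user qualifies if some login day is exactly 6 days after the login 6 rows earlier (the sample data below keeps only the successful logins from the table above):

```python
# Sliding-window check: does any stretch of 7 sorted login dates span exactly
# 6 calendar days? That mirrors date_sub(dt, 6) = lag(dt, 6).
from datetime import date, timedelta

logins = {
    1: ["2019-07-%02d" % d for d in range(11, 19)],           # 8 straight days
    2: ["2019-07-11", "2019-07-12", "2019-07-14", "2019-07-15",
        "2019-07-17", "2019-07-19", "2019-07-21", "2019-07-23"],
    3: ["2019-07-%02d" % d for d in range(11, 19)],
}

def has_7_day_streak(days):
    ds = sorted(date.fromisoformat(d) for d in set(days))
    # compare each day with the day 6 positions earlier, like lag(dt, 6)
    return any(ds[i] - ds[i - 6] == timedelta(days=6) for i in range(6, len(ds)))

total = sum(1 for days in logins.values() if has_7_day_streak(days))
print(total)  # users 1 and 3 qualify
```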

5. Which ranking functions do you know? Explain the differences between them. (A written description is enough.)

  • row_number() numbers the rows of each partition sequentially starting from 1; the values never repeat, and tied sort values are numbered in the order the rows appear in the table

  • rank() ranks rows within each partition; tied rows share a rank and leave gaps in the sequence after them

  • dense_rank() ranks rows within each partition; tied rows share a rank and leave no gaps
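The difference is easiest to see side by side; a small Python sketch over the tied scores 90, 90, 83, 60:

```python
# row_number never repeats; rank repeats on ties and skips the tied slots;
# dense_rank repeats on ties without skipping.
def rankings(scores_desc):
    row_number, rank, dense_rank = [], [], []
    for i, s in enumerate(scores_desc):
        row_number.append(i + 1)
        if i > 0 and s == scores_desc[i - 1]:
            rank.append(rank[-1])            # ties share the rank...
            dense_rank.append(dense_rank[-1])
        else:
            rank.append(i + 1)               # ...and rank skips the tied slots
            dense_rank.append((dense_rank[-1] + 1) if dense_rank else 1)
    return row_number, rank, dense_rank

rn, rk, dr = rankings([90, 90, 83, 60])
print(rn)  # [1, 2, 3, 4]
print(rk)  # [1, 1, 3, 4]
print(dr)  # [1, 1, 2, 3]
```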

6. Write SQL to find the top three scores in each class (equal scores are not tied), and also the score gap between each of the top three and the rank above it.

Data: stu table

stu_no class score
1	1901	90
2	1901	90
3	1901	83
4	1901	60
5	1902	66
6	1902	23
7	1902	99
8	1902	67
9	1902	87
drop table stu;
create table stu( 
Stu_no int, 
class string, 
score int 
)
row format delimited 
fields terminated by '\t' 
;

load data local inpath './hivedata/stu' into table stu; 

Write SQL to produce the following result:

+--------+---------+--------+-----+----------+--+
| class  | stu_no  | score  | rn  | rn_diff  |
+--------+---------+--------+-----+----------+--+
| 1901   | 2       | 90     | 1   | 90       |
| 1901   | 1       | 90     | 2   | 0        |
| 1901   | 3       | 83     | 3   | -7       |
| 1902   | 7       | 99     | 1   | 99       |
| 1902   | 9       | 87     | 2   | -12      |
| 1902   | 8       | 67     | 3   | -20      |
+--------+---------+--------+-----+----------+--+
select class,stu_no,score,rn,rn_diff
from(
select class,stu_no,score,
row_number() over(partition by class order by score desc) rn,
score-nvl(lag(score,1) over(partition by class order by score desc),0) rn_diff
from stu
) t
where t.rn<4;

7. What approaches do you know for converting between rows and columns? Describe each in detail. (A written description is enough.)

collect_set, collect_list, explode, conditional aggregation with if/case when, and so on.

8. Write SQL to pivot rows into columns. The data is as follows:

id sid subject score
1,001,语文,90
2,001,数学,92
3,001,英语,80
4,002,语文,88
5,002,数学,90
6,002,英语,75.5
7,003,语文,70
8,003,数学,85
9,003,英语,90
10,003,政治,82

Write SQL to produce the following result:

+---------+--------+--------+--------+--------+-----------+--+
| sid     | 语文   | 数学   | 英语   | 政治   | total     |
+---------+--------+--------+--------+--------+-----------+--+
| 001     | 90.0   | 92.0   | 80.0   | 0.0    | 262.0     |
| 002     | 88.0   | 90.0   | 75.5   | 0.0    | 253.5     |
| 003     | 70.0   | 85.0   | 90.0   | 82.0   | 327.0     |
| total   | 248.0  | 267.0  | 245.5  | 82.0   | 842.5     |
+---------+--------+--------+--------+--------+-----------+--+
drop table score;
create table score( 
id int, 
sid string, 
subject string, 
score double
)
row format delimited 
fields terminated by ',' 
;

load data local inpath './hivedata/score' into table score; 
select 
sid, 
sum(if(subject="语文",score,0)) as `语文`, 
sum(if(subject="数学",score,0)) as `数学`, 
sum(case when subject="英语" then score else 0 end) as `英语`, 
sum(case when subject="政治" then score else 0 end) as `政治`,
sum(score) total
from score 
group by sid
union
select "total",sum(`语文`),sum(`数学`),sum(`英语`),sum(`政治`),sum(total) from
(
select 
sid, 
sum(if(subject="语文",score,0)) as `语文`, 
sum(if(subject="数学",score,0)) as `数学`, 
sum(case when subject="英语" then score else 0 end) as `英语`, 
sum(case when subject="政治" then score else 0 end) as `政治`,
sum(score) total
from score 
group by sid
)t;
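The conditional-aggregation pivot, including the final total row, can be sketched in plain Python (values taken from the score table above):

```python
# Pivot (sid, subject, score) rows into one row per sid with one column per
# subject, then append a "total" row -- the analogue of sum(if(...)) + union.
from collections import defaultdict

rows = [  # (sid, subject, score)
    ("001", "语文", 90.0), ("001", "数学", 92.0), ("001", "英语", 80.0),
    ("002", "语文", 88.0), ("002", "数学", 90.0), ("002", "英语", 75.5),
    ("003", "语文", 70.0), ("003", "数学", 85.0), ("003", "英语", 90.0),
    ("003", "政治", 82.0),
]
subjects = ["语文", "数学", "英语", "政治"]

def pivot(rows):
    per_sid = defaultdict(lambda: dict.fromkeys(subjects, 0.0))
    for sid, subject, score in rows:
        per_sid[sid][subject] += score           # sum(if(subject=..., score, 0))
    table = [[sid] + [per_sid[sid][s] for s in subjects] + [sum(per_sid[sid].values())]
             for sid in sorted(per_sid)]
    totals = ["total"] + [sum(r[i] for r in table) for i in range(1, 6)]
    return table + [totals]

for row in pivot(rows):
    print(row)
```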

9. Write SQL to implement the following.

Data: t1 table

uid tags
1	1,2,3
2	2,3
3	1,2

Write SQL to produce the following result:

uid tag
1	1 
1	2
1	3
2	2
2	3
3	1
3	2
create table t1( 
uid int, 
tags string 
)
row format delimited 
fields terminated by '\t' 
;

load data local inpath './hivedata/t1' into table t1; 
select uid,tag
from t1 lateral view explode(split(tags,",")) t  as tag;

10. Row-to-column conversion

Data: t2 table:

tags
1,2,3
1,2
2,3

t3 table:

id lab
1 A
2 B
3 C

Using the data in t2 and t3, write SQL to produce the following result:

+--------+--------+--+
|  tags  |  labs  |
+--------+--------+--+
| 1,2    | A,B    |
| 1,2,3  | A,B,C  |
| 2,3    | B,C    |
+--------+--------+--+
create table t2( 
tags string 
);
load data local inpath './hivedata/t2' overwrite into table t2;

create table t3( 
id int, 
lab string 
)
row format delimited 
fields terminated by ' ' 
;

load data local inpath './hivedata/t3' overwrite into table t3; 
select tags,
concat_ws(",",collect_set(lab)) labs
from
(select tags,lab
from 
(select tags,tag
from t2 lateral view explode(split(tags,",")) A as tag) B 
join t3 on B.tag = t3.id) C
group by tags;

11. Row-to-column conversion

Data: t4 table:

id tag flag
a b 2
a b 1
a b 3
c d 6
c d 8
c d 8

Write SQL to produce the following result:

id tag flag 
a b 1|2|3
c d 6|8
create table t4( 
id string, 
tag string, 
flag int 
)
row format delimited 
fields terminated by ' ' 
;

load data local inpath './hivedata/t4' overwrite into table t4;
select id,tag,concat_ws("|",collect_set(cast(flag as string))) from t4 group by id,tag;

12. Column-to-row conversion

Data: t5 table

uid name tags
1	goudan	chihuo,huaci
2	mazi	sleep
3	laotie	paly

Write SQL to produce the following result:

uid name tag
1 goudan chihuo
1 goudan huaci
2 mazi sleep
3 laotie paly
create table t5( 
uid string, 
name string, 
tags string 
)
row format delimited 
fields terminated by '\t' ;

load data local inpath './hivedata/t5' overwrite into table t5;
select 
uid, 
name, 
tag 
from t5 
lateral view explode(split(tags,",")) t1 as tag ; 

13. Word frequency with explode

Data: content table:

uid contents
1	i|love|china
2	china|is|good|i|i|like

The expected statistics are below; when counts tie, order by content name:

+----------+------+--+
| content  | num  |
+----------+------+--+
| i        | 3    |
| china    | 2    |
| good     | 1    |
| is       | 1    |
| like     | 1    |
| love     | 1    |
+----------+------+--+
drop table content;
create table content( 
uid int, 
contents string 
)
row format delimited 
fields terminated by '\t' 
;
load data local inpath './hivedata/content' overwrite into table content; 
select content,count(1) num
from content lateral view explode(split(contents,"\\|")) tmp as content
group by content
order by num desc,content;
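The explode-then-count logic, with the tie-break ordering, can be sketched in plain Python:

```python
# Split each row on "|", count words, and sort by count descending with the
# word itself as the tie-breaker -- the analogue of explode + group by + order by.
from collections import Counter

contents = ["i|love|china", "china|is|good|i|i|like"]

counts = Counter(w for line in contents for w in line.split("|"))
# order by count descending, then by word ascending when counts tie
result = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(result)
```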

14. Column-to-row conversion

Data: course table

id course
1,a
1,b
1,c
1,e
2,a
2,c
2,d
2,f
3,a
3,b
3,c
3,e

Write SQL to produce the following result (1 = the course was taken, 0 = not taken):

+-----+----+----+----+----+----+----+--+
| id  | a  | b  | c  | d  | e  | f  |
+-----+----+----+----+----+----+----+--+
| 1   | 1  | 1  | 1  | 0  | 1  | 0  |
| 2   | 1  | 0  | 1  | 1  | 0  | 1  |
| 3   | 1  | 1  | 1  | 0  | 1  | 0  |
+-----+----+----+----+----+----+----+--+
create table course( 
id int, 
course string 
)
row format delimited 
fields terminated by ',' 
;

load data local inpath './hivedata/course' overwrite into table course; 
select 
id, 
sum(case when c.course="a" then 1 else 0 end) as `a`, 
sum(case when c.course="b" then 1 else 0 end) as `b`, 
sum(case when c.course="c" then 1 else 0 end) as `c`, 
sum(case when c.course="d" then 1 else 0 end) as `d`, 
sum(case when c.course="e" then 1 else 0 end) as `e`, 
sum(case when c.course="f" then 1 else 0 end) as `f` 
from course c 
group by id ; 

15. Timestamp functions: unix_timestamp, from_unixtime

Get the current timestamp:

select unix_timestamp();

Get the timestamp for "2019-07-31 11:57:25":

select unix_timestamp("2019-07-31 11:57:25");

Get the timestamp for "2019-07-31 11:57":

select unix_timestamp("2019-07-31 11:57","yyyy-MM-dd HH:mm");

Get the date and time (down to seconds) for timestamp 1564545445:

select from_unixtime(1564545445);

Get the date and hour (yyyy/MM/dd HH) for timestamp 1564545446:

select from_unixtime(1564545446,"yyyy/MM/dd HH"); 

16. Date format conversion: yyyyMMdd -> yyyy-MM-dd

Data: dt table

20190730
20190731

Write SQL to produce the following result:

2019-07-30
2019-07-31
create table dt( 
dt string 
);

load data local inpath './hivedata/dt' overwrite into table dt; 
select from_unixtime(unix_timestamp(dt,"yyyyMMdd"),"yyyy-MM-dd") from dt;
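The same conversion in Python terms, where strptime plays the role of unix_timestamp (parse with an input pattern) and strftime the role of from_unixtime (render with an output pattern):

```python
# Parse yyyyMMdd, then re-render as yyyy-MM-dd.
from datetime import datetime

def reformat(dt):
    return datetime.strptime(dt, "%Y%m%d").strftime("%Y-%m-%d")

print([reformat(d) for d in ["20190730", "20190731"]])
```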

17. Data: store, month, amount

sid month money
a,01,150
a,01,200
b,01,1000
b,01,800
c,01,250
c,01,220
b,01,6000
a,02,2000
a,02,3000
b,02,1000
b,02,1500
c,02,350
c,02,280
a,03,350
a,03,250
create table store( 
sid string,
month string,
money int
)
row format delimited 
fields terminated by ',' 
;

load data local inpath './hivedata/store' overwrite into table store;

Write Hive HQL to compute each store's sales for each month and the cumulative sales up to that month.

select sid,month,sales,
sum(sales) over(partition by sid order by month) total
from (
select sid,month,sum(money) sales from store group by sid,month
) t;

18. Have you run into data skew in Hive? How did you handle it, and what is the underlying principle?


19. When have you used array and map in Hive, and why?


20. Implement the following requirement in SQL:

Table A records user login times; one record is written for each login during the month. Table A:

log_time	uid
2018-10-01 12:34:11	123
2018-10-02 13:21:08	123
2018-10-02 14:21:08	123
2018-10-02 14:08:09	456
2018-10-04 05:10:22	123
2018-10-04 21:38:38	456
2018-10-04 22:38:38	456
2018-10-05 09:57:32	123
2018-10-06 13:22:56	123
2018-11-01 12:34:11	123
2018-11-02 13:21:08	123
2018-11-02 14:21:08	123
2018-11-02 14:08:09	456
2018-11-04 05:10:22	123
2018-11-04 21:38:38	456
2018-11-05 09:57:32	123
2018-11-06 13:22:56	123

Compute each user's maximum number of consecutive login days within the month. In the sample data, user 123's longest streak is 3 days and user 456's is 1 day.

drop table login_time;
create table login_time(
log_time timestamp,
uid string
)
row format delimited
fields terminated by '\t';

load data local inpath './hivedata/login_time' overwrite into table login_time;

-- note: the raw data may need cleaning so that each user has at most one login record per day
select uid,max(cnt)
from
(select uid,date_format(dt,"yyyy-MM"),count(1) cnt
from
(select uid,dt,date_sub(dt,row_number() over(partition by uid order by dt)) grp_dt
from
(
select distinct uid,date_format(log_time,"yyyy-MM-dd") as dt 
from login_time
)a
)b group by uid,grp_dt,date_format(dt,"yyyy-MM")
)c
group by uid;
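The date_sub(dt, row_number()) trick is the classic gaps-and-islands idea: within one user, every day of a consecutive run maps to the same anchor date, so group sizes equal streak lengths. A plain-Python sketch over user 123's October dates:

```python
# Subtract each date's 1-based position from the date itself; consecutive days
# collapse onto one anchor, and the largest anchor group is the longest streak.
from datetime import date, timedelta
from collections import Counter

days_123 = ["2018-10-01", "2018-10-02", "2018-10-04", "2018-10-05", "2018-10-06"]

def max_streak(days):
    ds = sorted(date.fromisoformat(d) for d in set(days))
    anchors = [d - timedelta(days=i + 1) for i, d in enumerate(ds)]
    return max(Counter(anchors).values())

print(max_streak(days_123))  # longest run is 10-04..10-06
```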

21. Describe the LAG and LEAD functions.

  • lag returns the value from the n-th row before the current row
  • lead returns the value from the n-th row after the current row

Requirement: for each customer, find the time of their next purchase.

select name,orderdate,cost,
       lead(orderdate,1) over(partition by name order by orderdate ) as time1
from t_order;

22. Use Hive to find the rows that appear in only one of two data sets.

Data: diff_t1 table

id name 
1 zs
2 ls

diff_t2 table:

id name 
1 zs
3 ww

Result:

id name 
2 ls 
3 ww 
create table diff_t1( 
id string, 
name string 
)
row format delimited 
fields terminated by ' ' 
;

load data local inpath './hivedata/diff_t1.txt' overwrite into table diff_t1; 
create table diff_t2( 
id string, 
name string 
)
row format delimited 
fields terminated by ' ' 
;

load data local inpath './hivedata/diff_t2.txt' overwrite into table diff_t2; 
select t1.id id, t1.name name 
from diff_t1 t1 
left join diff_t2 t2 on t1.id=t2.id 
where t2.id is null 
union 
select t2.id id, t2.name name 
from diff_t1 t1 
right join diff_t2 t2 on t1.id=t2.id 
where t1.id is null ; 

23. Implement the following requirement in Hive HQL.

A website's purchase records have the following fields:

orderid,userid,productid,price,timestamp,date
121,张三,3,100,1535945356,2018-08-07
122,张三,3,200,1535945356,2018-08-08
123,李四,3,200,1535945356,2018-08-08
124,王五,1,200,1535945356,2018-08-08
125,张三,3,200,1535945356,2018-08-09
126,张三,2,200,1535945356,2018-08-09
127,李四,3,200,1535945356,2018-08-09
128,李四,3,200,1535945356,2018-08-10
129,李四,3,200,1535945356,2018-08-11

Use SQL to find the users who bought product 3 both today and yesterday, together with their spending yesterday.

drop table product;
create table product(
orderid string,
userid string,
productid int,
price int,
tamp int,
dt date
)
row format delimited
fields terminated by ',';

load data local inpath './hivedata/product.txt' overwrite into table product;
select * 
from(
select userid,dt,
lag(dt,1) over(partition by userid order by dt) yesterday,
case when
datediff(dt,lag(dt,1) over(partition by userid order by dt))=1
then lag(total_price,1) over(partition by userid order by dt)
else null end yesterday_price
from
(
select userid,dt,productid,sum(price) over(partition by userid,dt) total_price
from product
)A
where productid = 3
)B where yesterday_price is not null;

24. Implement the following in Hive HQL.

Table user_action_log stores user behavior data:

uid	 time	action
1	Time1	Read
3	Time2	Comment
1	Time3	Share
2	Time4	Like
1	Time5	Write
2	Time6	like
3	Time7	Write
2	Time8	Read

Analyze user behavior to find each user's first action in the table.

drop table user_action_log;
create table user_action_log(
uid int,
time string,
action string
)
row format delimited
fields terminated by '\t';

load data local inpath './hivedata/user_action_log.txt' overwrite into table user_action_log;

Implementation:

select uid,time,action
from (
select uid,time,action,row_number() over(partition by uid order by time) rn
from user_action_log
)A
where rn = 1;

25. What is each user's maximum number of consecutive login days?

Data: user_login table

uid,dt
1,2019-08-01
1,2019-08-02
1,2019-08-03
2,2019-08-01
2,2019-08-02
3,2019-08-01
3,2019-08-03
4,2019-07-28
4,2019-07-29
4,2019-08-01
4,2019-08-02
4,2019-08-03

Result:

uid cnt_days 
1 3 
2 2 
3 1 
4 3 
create table user_login(
uid int,
dt date
)
row format delimited
fields terminated by ',';

load data local inpath './hivedata/user_login.txt' overwrite into table user_login;
select uid,max(cnt)
from
(select uid,dt_sub,count(1) cnt
from
(select uid,dt,
date_sub(dt,row_number() over(partition by uid order by dt)) dt_sub
from user_login
)A 
group by uid, dt_sub
)B group by uid;

26. Use SQL to compute PV and UV.

Data:

user_net_log table

uid dt url
1 2019-08-06 http://www.baidu.com
2 2019-08-06 http://www.baidu.com
3 2019-08-06 http://www.baidu.com
3 2019-08-06 http://www.soho.com
3 2019-08-06 http://www.meituan.com
3 2019-08-06

Result:

dt uv pv 
2019-08-06 3 5 
create table user_net_log(
uid int,
dt date,
url string
)
row format delimited
fields terminated by ' ';

load data local inpath './hivedata/user_net_log.txt' overwrite into table user_net_log;
select dt,count(distinct uid),count(url)
from user_net_log
group by dt;
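A plain-Python sketch of why uv is 3 and pv is 5: count(distinct uid) ignores duplicate visitors, and count(url) skips the row whose url is missing (it loads as NULL, modeled as None here):

```python
# UV = distinct users; PV = rows with a non-null url.
rows = [  # (uid, url) for dt = 2019-08-06
    (1, "http://www.baidu.com"),
    (2, "http://www.baidu.com"),
    (3, "http://www.baidu.com"),
    (3, "http://www.soho.com"),
    (3, "http://www.meituan.com"),
    (3, None),
]

uv = len({uid for uid, _ in rows})
pv = sum(1 for _, url in rows if url is not None)
print(uv, pv)
```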

27. What do these Hive functions do: coalesce(), nvl(), concat_ws(), collect_list(), collect_set(), regexp_replace()?

  • coalesce(v1, v2, ...) returns the first non-null argument
  • nvl(v, default) returns default when v is null, otherwise v
  • concat_ws(sep, ...) joins strings (or an array of strings) with the given separator
  • collect_list(col) aggregates a group's values into an array, keeping duplicates
  • collect_set(col) aggregates a group's values into an array, removing duplicates
  • regexp_replace(str, pattern, replacement) replaces every match of a regular expression

28. The following table records which heroes a classmate picked in each match of a MOBA game.

pk_moba table

id	names
1	亚索,挖据机,艾瑞莉亚,洛,卡莎
2	亚索,盖伦,奥巴马,牛头,皇子
3	亚索,盖伦,艾瑞莉亚,宝石,琴女
4	亚索,盖伦,赵信,老鼠,锤石

Use Hive SQL to compute the 3 most-picked heroes and each one's pick rate (= matches appeared in / total matches).

create table pk_moba(
id int,
names array<string>
)
row format delimited
fields terminated by '\t'
collection items terminated by ',';

load data local inpath './hivedata/pk_moba.txt' overwrite into table pk_moba;
select name,cnt,rn,concat(round(cnt/4*100,0),"%") lv
from 
(
select name,cnt,dense_rank() over(sort by cnt desc) rn
from (
select name,count(name) cnt
from pk_moba lateral view explode(names) tmp as name
group by name
)A
)B where rn<4;
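The counting and ranking logic can be sketched in plain Python (note the HQL hard-codes the 4 total matches in cnt/4; the sketch uses len(matches)):

```python
# Explode the pick lists, count per hero, dense-rank by count, keep rank <= 3,
# and divide each count by the total number of matches for the pick rate.
from collections import Counter

matches = [
    ["亚索", "挖据机", "艾瑞莉亚", "洛", "卡莎"],
    ["亚索", "盖伦", "奥巴马", "牛头", "皇子"],
    ["亚索", "盖伦", "艾瑞莉亚", "宝石", "琴女"],
    ["亚索", "盖伦", "赵信", "老鼠", "锤石"],
]

counts = Counter(h for picks in matches for h in picks)
distinct_counts = sorted(set(counts.values()), reverse=True)  # dense-rank basis
top3 = [(h, c, distinct_counts.index(c) + 1, c / len(matches))
        for h, c in counts.most_common() if distinct_counts.index(c) + 1 <= 3]
print(top3)
```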

29. There are three tables:

district: two fields, district id (disid) and district name (disname)

city: two fields, city id (cityid) and district id (disid)

order: four fields, order id (orderid), user id (userid), city id (cityid), and amount spent (amount)

district table:

disid disname
1 华中
2 西南


create table district(
disid int,
disname string
)
row format delimited
fields terminated by ' ';

load data local inpath './hivedata/district.txt' overwrite into table district;

city table:

cityid disid
1 1
2 1
3 2
4 2
5 2

create table city(
cityid int,
disid int
)
row format delimited
fields terminated by ' ';

load data local inpath './hivedata/city.txt' overwrite into table city;

order table (created as order_t, since order is a reserved word):

oid userid cityid amount 
1 1 1 1223.9
2 1 1 9999.9
3 2 2 2322
4 2 2 8909
5 2 3 6789
6 2 3 798
7 3 4 56786
8 4 5 78890

create table order_t(
oid int,
userid int,
cityid int,
amount float
)
row format delimited
fields terminated by ' ';

load data local inpath './hivedata/order.txt' overwrite into table order_t;

High spenders are users whose total spend exceeds 10,000. Use Hive HQL to generate the following report:

district name | number of high spenders | total spend

select disname,count(1),sum(amount)
from(
select disname,userid,sum(amount) amount from district A join city B on A.disid = B.disid
join order_t C on B.cityid = C.cityid
group by disname,userid
)D
where amount>10000
group by disname;

30. An app's daily access data is stored in table access_log, with a date field ds, a user type field user_type, a user account field user_id, and an access time field log_time. Use Hive HQL to implement the following:

(1) Overall UV and PV per day.

select ds,count(distinct user_id) uv,count(1) pv
from access_log
group by ds;

(2) UV and PV per user type per day.

select ds,user_type,count(distinct user_id) uv,count(1) pv
from access_log
group by ds,user_type;

(3) The earliest and latest access time per user type per day.

select ds,user_type,min(log_time),max(log_time)
from access_log
group by ds,user_type;

(4) The 10 users with the most visits per user type per day.

select ds,user_type,user_id,cnt,rn
from(
select ds,user_type,user_id,cnt,
row_number() over(partition by ds,user_type order by cnt desc) rn
from
(select ds,user_type,user_id,count(1) cnt
from access_log
group by ds,user_type,user_id
)A
)B where rn <= 10;

31. Describe the NTILE function.

NTILE distributes the rows of an ordered partition into the specified number (num) of buckets and assigns each row its bucket number. If the rows cannot be divided evenly, lower-numbered buckets are filled first, and no bucket holds more than one row more than any other.

Example:

select name,orderdate,cost,
       ntile(3) over(partition by name order by cost desc) -- within each name, split the rows into 3 buckets
from t_order;
mart	2018-04-13	94	1
mart	2018-04-11	75	1
mart	2018-04-09	68	2
mart	2018-04-08	62	3
neil	2018-06-12	80	1
neil	2018-05-10	12	2
saml	2018-01-01	10	1
saml	2018-01-08	55	1
saml	2018-04-06	42	2
saml	2018-01-05	46	2
saml	2018-02-03	23	3
tony	2018-01-07	50	1
tony	2018-01-02	15	2
tony	2018-01-04	29	3
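NTILE's bucket sizes can be sketched directly: with n rows and num buckets, the first n mod num buckets each get one extra row.

```python
# Assign bucket numbers the way NTILE does: sizes differ by at most 1,
# with lower-numbered buckets taking the extra rows.
def ntile(n_rows, k):
    base, extra = divmod(n_rows, k)
    buckets = []
    for bucket_no in range(1, k + 1):
        size = base + (1 if bucket_no <= extra else 0)
        buckets.extend([bucket_no] * size)
    return buckets

print(ntile(4, 3))  # mart's 4 rows -> [1, 1, 2, 3]
print(ntile(5, 3))  # saml's 5 rows -> [1, 1, 2, 2, 3]
```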

32. Given the following three tables:

Table login_a (logins):

ds user_id
2019-08-06 1
2019-08-06 2
2019-08-06 3
2019-08-06 4

create table login_a(
ds date,
user_id int
)
row format delimited
fields terminated by ' ';

load data local inpath './hivedata/login_a.txt' overwrite into table login_a;

Table read_b (reading):

ds user_id read_num
2019-08-06 1 2
2019-08-06 2 3
2019-08-06 3 6

create table read_b(
ds date,
user_id int,
read_num int
)
row format delimited
fields terminated by ' ';

load data local inpath './hivedata/read_b.txt' overwrite into table read_b;

Table cost_c (payments):

ds user_id price
2019-08-06 1 55.6
2019-08-06 2 55.8

create table cost_c(
ds date,
user_id int,
price float
)
row format delimited
fields terminated by ' ';

load data local inpath './hivedata/cost_c.txt' overwrite into table cost_c;

Based on the three tables above, use Hive HQL to implement the following:

(1) The number of users who logged in and read on the same day, and the total number of books read.

select A.ds,count(distinct A.user_id),sum(B.read_num)
from login_a A join read_b B on A.user_id = B.user_id and A.ds = B.ds
group by A.ds;

(2) The number of users who logged in and read but did not pay.

select A.ds,count(1)
from login_a A join read_b B on A.user_id = B.user_id and A.ds = B.ds
left join cost_c C on B.user_id = C.user_id and B.ds = C.ds
where C.price is null
group by A.ds;

(3) For users who logged in and paid: the number of paying users and the total amount paid.

select A.ds,count(1),sum(price)
from login_a A join cost_c C on A.user_id = C.user_id and A.ds = C.ds
group by A.ds;

33. In Hive HQL, what is the difference between left outer join and left semi join?

left semi join is a "half-open" join: it can only return columns from the left table, returns each left row at most once, and keeps only rows that have a match in the right table. It is an efficient way to express in/exists-style existence checks, whereas left outer join keeps every left row and also exposes the right table's columns (null when unmatched).

34. There is an order table, structured as follows:

create table `order`(
order_id bigint,
user_id bigint comment 'user id',
amount double comment 'order amount',
channel string comment 'channel',
time string comment 'order time, yyyy-MM-dd HH:mm:ss'
)
partitioned by (dt string comment 'day, yyyy-MM-dd');

Use Hive HQL to find, for 2019-08-06, each channel's number of ordering users and total order amount.

The result header is: channel user_num order_amount

select channel,count(distinct user_id) user_num,sum(amount) order_amount
from `order`
where dt = '2019-08-06'
group by channel;

35. Table design and querying:

Design database tables to store students' basic information, course information, and students' courses and scores, then write a query that finds all students whose average score is greater than 85.

create table stu_1( 
id string, 
name string, 
age int, 
addr string 
)
row format delimited 
fields terminated by ',' 
;

create table course_1( 
cid string, 
cname string 
)
row format delimited 
fields terminated by ',' ;

create table course_sc( 
id string, 
cid string, 
score int 
)
row format delimited 
fields terminated by ',' 
;

load data local inpath '/hivedata/course_1.txt' overwrite into table course_1; 
load data local inpath '/hivedata/stu_1.txt' overwrite into table stu_1; 
load data local inpath '/hivedata/course_sc.txt' overwrite into table course_sc; 
select cs.id, avg(score) avgscore 
from course_sc cs 
group by cs.id 
having avg(score)>85 ; 

36. Requirement:

Given a user table user(uid, name) and a blacklist table BanUser(uid):

  1. Use a left join to select all users not on the blacklist
  2. Use not exists to select all users not on the blacklist
create table u( 
id string, 
name string 
)
row format delimited 
fields terminated by ',' 
;

create table banuser( 
id string 
);

load data local inpath '/hivedata/banuser.txt' overwrite into table banuser; 
load data local inpath '/hivedata/u.txt' overwrite into table u; 
select u.* 
from u left join banuser on u.id = banuser.id
where banuser.id is null;

select u.*
from u 
where not exists (select 1 from banuser where banuser.id = u.id);
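Both queries implement the same anti-join; in Python terms it is just a set-membership filter (the sample rows below are hypothetical, since the text files are not shown):

```python
# Keep the users whose id has no match in the blacklist -- the effect of both
# the left-join-is-null pattern and the not exists subquery.
users = [("1", "zs"), ("2", "ls"), ("3", "ww")]   # hypothetical sample rows
banned = {"2"}

not_banned = [(uid, name) for uid, name in users if uid not in banned]
print(not_banned)
```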

37. Requirement:

course_score table data:

1,zhangsan,数学,80,2015
2,lisi,语文,90,2016
3,lisi,数学,70,2016
4,wangwu,化学,80,2017
5,zhangsan,语文,85,2015
6,zhangsan,化学,90,2015
create table course_score( 
id string, 
name string, 
course string, 
score int, 
year string 
)
row format delimited 
fields terminated by ',' 
;

load data local inpath './hivedata/course_score.txt' overwrite into table course_score; 

(1) Find the highest-score record for each course in each year (all 5 fields).

-- window-function approach: keep the rows whose score equals the per-group max
select id,name,course,score,year
from(
select id,name,course,score,year,
max(score) over(partition by year,course) max_score
from course_score
) t
where score = max_score;

-- join-on-aggregate approach
select s.id, s.name, s.course, s.score, s.year 
from course_score s 
join
(select course, year, max(score) score 
from course_score 
group by course,year) t1 
on s.course=t1.course and s.year=t1.year and s.score=t1.score;

(2) Find the math score records (all fields) of students whose Chinese (语文) score was 90 or above in the same year.

select A.id,A.name,A.course,A.score,A.year
from course_score A join course_score B
on A.name = B.name and A.year = B.year
where B.course='语文' and B.score>=90 and A.course='数学';

-- equivalent form with a filtered subquery
select s.id, s.name, s.course, s.score, s.year 
from course_score s 
join 
(select id, name, course, score, year from course_score 
where score >= 90 and course='语文') t1 
on s.name=t1.name and s.year=t1.year 
where s.course='数学';

38. Requirement:

t1 table:

name course score
aa English 75
bb math 85
aa math 90

create table t1_1(
name string,
course string,
score int
)
row format delimited
fields terminated by ' ';

load data local inpath './hivedata/t1_1.txt' overwrite into table t1_1;

Use HQL to produce the following result:

name English math 
aa 75 90 
bb 0 85
select name,
max(if(course="English",score,0)) English,
max(if(course="math",score,0)) math
from t1_1
group by name;

39. Requirement:

t1 table:

user product
A P1
B P1
A P2
B P3

Use HQL to produce the following result:

user P1 P2 P3
A 1 1 0
B 1 0 1
select username,
max(if(product="P1",1,0)) P1,
max(if(product="P2",1,0)) P2,
max(if(product="P3",1,0)) P3
from t1
group by username;

40. Requirement:

dpt (department) table

dpt_id	dpt_name
1	产品
2	技术

User table

User_id	Dpt_id
1	1
2	1
3	2
4	2
5	3

result table

user_id	dpt_id	dpt_name
1	1	产品
2	1	产品
3	2	技术
4	2	技术
5	3	其他部门

The source tables are user and dpt; use HQL to produce the result table. A left join is needed so that users whose department is missing from dpt still appear, with the default name '其他部门':

select u.user_id, u.dpt_id, nvl(d.dpt_name, '其他部门') dpt_name
from user u
left join dpt d
on u.dpt_id = d.dpt_id;

41. Requirement:

t1_order table:

order_id order_type order_time
111 N 10:00
111 A 10:05
111 B 10:10

create table t1_order(
order_id string,
order_type string,
order_time string
)
row format delimited
fields terminated by ' ';

load data local inpath './hivedata/t1_order.txt' overwrite into table t1_order;

Use HQL to produce the following result:

order_id order_type_1 order_type_2 order_time_1 order_time_2 
111 N A 10:00 10:05
111 A B 10:05 10:10
select order_id,order_type_1,order_type_2,order_time_1,order_time_2
from(
select order_id,order_type order_type_1,
lead(order_type,1) over(partition by order_id order by order_time) order_type_2,
order_time order_time_1,
lead(order_time,1) over(partition by order_id order by order_time) order_time_2
from t1_order
)A
where order_type_2 is not null;
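The lead() pairing is equivalent to zipping each row with its successor; a plain-Python sketch over the sample order:

```python
# Pair each row with the next row of the same order_id, in time order,
# producing (order_id, type_1, type_2, time_1, time_2) transitions.
rows = [("111", "N", "10:00"), ("111", "A", "10:05"), ("111", "B", "10:10")]

pairs = [(a[0], a[1], b[1], a[2], b[2])
         for a, b in zip(rows, rows[1:]) if a[0] == b[0]]
print(pairs)
```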

42. Requirement:

t1_hobby table

name sex hobby
janson 男 打乒乓球、游泳、看电影 
tom 男 打乒乓球、看电影

drop table t1_hobby;
create table t1_hobby(
name string,
sex string,
hobby string
)
row format delimited
fields terminated by ' ';

load data local inpath './hivedata/t1_hobby.txt' overwrite into table t1_hobby;

hobby holds at most 3 values; use HQL to produce the following result:

name sex hobby1 hobby2 hobby3
janson 男 打乒乓球 游泳 看电影 
tom 男 打乒乓球 看电影 
select name,sex,split(hobby,"、")[0] hobby1,
split(hobby,"、")[1] hobby2,
nvl(split(hobby,"、")[2],"") hobby3
from t1_hobby;
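The split-and-pad logic can be sketched in plain Python, with "" standing in for the nvl default:

```python
# Split on the "、" separator and pad to exactly three columns.
def split3(hobby):
    parts = hobby.split("、")
    return (parts + ["", "", ""])[:3]

print(split3("打乒乓球、游泳、看电影"))
print(split3("打乒乓球、看电影"))
```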
