目录
0 问题描述
1 数据分析
2 小结
(1)数据准备
create table log
(
uid string,
ip string,
time string
)row format delimited
fields terminated by '\t';
insert into log values
('a', '124', '2019-08-07 12:0:0'),
('a', '124', '2019-08-07 13:0:0'),
('b', '124', '2019-08-08 12:0:0'),
('c', '124', '2019-08-09 12:0:0'),
('a', '174', '2019-08-10 12:0:0'),
('b', '174', '2019-08-11 12:0:0'),
('a', '194', '2019-08-12 12:0:0'),
('b', '194', '2019-08-13 13:0:0'),
('c', '174', '2019-08-14 12:0:0'),
('c', '194', '2019-08-15 12:0:0')
;
hive> select * from log;
OK
a 124 2019-08-07 12:00:00
a 124 2019-08-07 13:00:00
b 124 2019-08-08 12:00:00
c 124 2019-08-09 12:00:00
a 174 2019-08-10 12:00:00
b 174 2019-08-11 12:00:00
a 194 2019-08-12 12:00:00
b 194 2019-08-13 13:00:00
c 174 2019-08-14 12:00:00
c 194 2019-08-15 12:00:00
Time taken: 0.053 seconds, Fetched: 10 row(s)
(2)共同使用问题,一般涉及到该字眼的都是需要一对多,此类问题解决的核心逻辑是自关联。
1)获取自关联后的结果集
select
t1.uid as uid_1, t1.ip, t2.uid as uid_2
from
(select uid, ip from log group by uid, ip) t1 join
(select uid, ip from log group by uid, ip) t2
on
t1.ip = t2.ip
--------------------------------------------------------------------------------
OK
a 124 a
a 124 c
a 124 b
a 174 a
a 174 c
a 174 b
a 194 a
a 194 c
a 194 b
b 124 a
b 124 c
b 124 b
b 174 a
b 174 c
b 174 b
b 194 a
b 194 c
b 194 b
c 124 a
c 124 c
c 124 b
c 174 a
c 174 c
c 174 b
c 194 a
c 194 c
c 194 b
Time taken: 8.808 seconds, Fetched: 27 row(s)
2)由于数据会两两出现,如a,b和b,a实际上是一样的,过滤掉这部分重复数据,只需要选出t1.uid < t2.uid即可,也就是a,b的数据.hive中不支持不等连接,所以选where语句
select
t1.uid as uid_1, t1.ip, t2.uid as uid_2
from
(select uid, ip from log group by uid, ip) t1 join
(select uid, ip from log group by uid, ip) t2
where
t1.ip = t2.ip and t1.uid < t2.uid
OK
a 124 c
a 124 b
a 174 c
a 174 b
a 194 c
a 194 b
b 124 c
b 174 c
b 194 c
Time taken: 16.193 seconds, Fetched: 9 row(s)
3)按照组合键分组,并过滤出符合条件的用户。最终SQL如下:
select m.uid_1,m.uid_2
from(
select
t1.uid as uid_1, t1.ip, t2.uid as uid_2
from
(select uid, ip from log group by uid, ip) t1 join
(select uid, ip from log group by uid, ip) t2
where
t1.ip = t2.ip and t1.uid < t2.uid
) m
group by m.uid_1,m.uid_2
having count(ip) >= 3
--------------------------------------------------------------------------------
OK
a b
a c
b c
Time taken: 11.073 seconds, Fetched: 3 row(s)
本题的特征为"共同使用",诸如此类两两相遇,共同,相互认识等关键字时候,往往采用自关联解决,这类问题的共同特征往往包含一对多,需要组合出各种情况。相关类问题可参考如下链接:https://blog.csdn.net/godlovedaniel/article/details/119155757