Step 1: use SQL to query the reach of several channels over a date range
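select
  count(userid) as ooc,
  channelname as channelname,
  dt as dt
from tvlog_test.tvlog_tcl
where dt between '2015-09-01' and '2015-09-05'
  -- exact match on the channel name; a substring test with instr would also
  -- accept any name that happens to be contained in the list string
  and channelname in ('CCTV-5体育', 'CCTV-6电影', 'CCTV-10科教')
group by dt, channelname;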
Result:
91885 CCTV-5体育 2015-09-01
155304 CCTV-6电影 2015-09-01
72961 CCTV-10科教 2015-09-02
82379 CCTV-5体育 2015-09-03
22599 CCTV-6电影 2015-09-03
74714 CCTV-10科教 2015-09-04
129171 CCTV-5体育 2015-09-05
191576 CCTV-6电影 2015-09-05
68713 CCTV-10科教 2015-09-01
85925 CCTV-5体育 2015-09-02
166430 CCTV-6电影 2015-09-02
195039 CCTV-10科教 2015-09-03
107881 CCTV-5体育 2015-09-04
163962 CCTV-6电影 2015-09-04
71486 CCTV-10科教 2015-09-05
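Step 2: merge the data by date
Note: collect_set is a UDAF and cannot take count(...) as its argument, because Hive does not allow one aggregate call nested inside another. The counts are therefore computed in a subquery first.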
select
  collect_set(t1.ooc),
  collect_set(t1.channelname),
  t1.dt
from
  -- inner query: one row per (dt, channelname) with its count
  (select
     count(userid) as ooc,
     channelname as channelname,
     dt as dt
   from tvlog_test.tvlog_tcl
   where dt between '2015-09-01' and '2015-09-05'
     and channelname in ('CCTV-5体育', 'CCTV-6电影', 'CCTV-10科教')
   group by dt, channelname) t1
group by t1.dt;
Result:
[68713,91885,155304] ["CCTV-10科教","CCTV-5体育","CCTV-6电影"] 2015-09-01
[195039,82379,22599] ["CCTV-10科教","CCTV-5体育","CCTV-6电影"] 2015-09-03
[71486,129171,191576] ["CCTV-10科教","CCTV-5体育","CCTV-6电影"] 2015-09-05
[72961,85925,166430] ["CCTV-10科教","CCTV-5体育","CCTV-6电影"] 2015-09-02
[74714,107881,163962] ["CCTV-10科教","CCTV-5体育","CCTV-6电影"] 2015-09-04
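The two arrays above are paired only by position, and collect_set drops duplicate values, so two channels with the same count on the same day would shorten the first array and break the pairing. One possible way around this, sketched here, is to collect channel:count pairs into a single map per day. The alias ooc_by_channel is illustrative, and the sketch assumes channel names contain neither ',' nor ':'.

```sql
-- Sketch: one map per day, keyed by channel name.
-- Each "channel:count" string is unique within a day, so collect_set loses nothing here.
select
  t1.dt as dt,
  str_to_map(
    concat_ws(',', collect_set(concat(t1.channelname, ':', cast(t1.ooc as string)))),
    ',',   -- separator between pairs
    ':'    -- separator between key and value
  ) as ooc_by_channel   -- illustrative alias
from
  (select
     count(userid) as ooc,
     channelname as channelname,
     dt as dt
   from tvlog_test.tvlog_tcl
   where dt between '2015-09-01' and '2015-09-05'
     and channelname in ('CCTV-5体育', 'CCTV-6电影', 'CCTV-10科教')
   group by dt, channelname) t1
group by t1.dt;
```

A single value can then be read as ooc_by_channel['CCTV-5体育']. str_to_map returns string values, so cast them if arithmetic is needed.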
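Conclusions:
1. Hive cannot turn an array into multiple columns when the number of columns is not fixed. A UDTF does not solve this, and Hive does not support stored procedures either.
2. Even if the array could be turned into multiple columns, there would be no way to define aliases for them, so there would be no way to tell which name each column corresponds to.
3. If the number of columns is fixed, rows can be pivoted to columns with case when statements, or with a UDTF written for a fixed number of columns. The second option is worse: if 2, 3, 4, and 5 columns are all needed, 4 functions would have to be written.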
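For the fixed-column case, the case when approach might look like the following minimal sketch for the three channels queried above. The column aliases ooc_cctv5, ooc_cctv6, and ooc_cctv10 are illustrative names chosen here. It relies on count(expr) counting only non-NULL values, so each count(case ...) counts userid for one channel only.

```sql
-- Sketch: pivot three known channels into three fixed columns, one row per day.
select
  dt as dt,
  -- a case without else yields NULL for other channels, which count() skips
  count(case when channelname = 'CCTV-5体育' then userid end) as ooc_cctv5,
  count(case when channelname = 'CCTV-6电影' then userid end) as ooc_cctv6,
  count(case when channelname = 'CCTV-10科教' then userid end) as ooc_cctv10
from tvlog_test.tvlog_tcl
where dt between '2015-09-01' and '2015-09-05'
  and channelname in ('CCTV-5体育', 'CCTV-6电影', 'CCTV-10科教')
group by dt;
```

Each column gets an explicit alias, which addresses conclusion 2, but the channel list is hard-coded in the query, so adding a channel means editing the SQL.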