Data:
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
{"movie":"2804","rate":"5","timeStamp":"978300719","uid":"1"}
{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}
{"movie":"919","rate":"4","timeStamp":"978301368","uid":"1"}
{"movie":"595","rate":"5","timeStamp":"978824268","uid":"1"}
{"movie":"938","rate":"4","timeStamp":"978301752","uid":"1"}
{"movie":"2398","rate":"4","timeStamp":"978302281","uid":"1"}
{"movie":"2918","rate":"4","timeStamp":"978302124","uid":"1"}
{"movie":"1035","rate":"5","timeStamp":"978301753","uid":"1"}
{"movie":"2791","rate":"4","timeStamp":"978302188","uid":"1"}
Requirement: find the twenty movies that received the most attention.
Approach: parse the JSON to extract the individual fields, then group by movie. "Attention" here means how many records each group contains, so count the rows per group with count(*). Because we need the top twenty across all movies, the grouped rows must then be brought back into a single partition with an ordering index; row_number() can produce that index, but it needs a shared field to put every row into the same partition. Since no existing field qualifies, add a constant column tag with the value 1 to every group and partition by it.
Answer:
select movie, number
from (
    select movie, number,
           row_number() over (partition by tag order by number desc) as rank
    from (
        select movie, 1 as tag, count(*) as number
        from (
            select json_tuple(json, 'movie', 'rate', 'timeStamp', 'uid')
                as (movie, rate, ts, uid)
            from t_json
            limit 10000
        ) as one
        group by movie
    ) as two
) as three
where rank <= 20;
Because of limited virtual machine resources, only 10,000 rows were queried here. You can also enable local mode: set hive.exec.mode.local.auto=true;
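As an aside, the tag/row_number() construction is not the only way to get a top twenty; a plain group by combined with order by and limit produces the same ranking (a sketch against the same t_json table):

```sql
-- "Attention" = number of rating records per movie
select movie, count(*) as number
from (
    select json_tuple(json, 'movie', 'rate', 'timeStamp', 'uid')
        as (movie, rate, ts, uid)
    from t_json
) as one
group by movie
order by number desc
limit 20;
```

This avoids the artificial tag column entirely, at the cost of a global order by.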
Result:
Summary:
Mistakes beginners easily make:
Error: Error while compiling statement: FAILED: ParseException line 1:301 missing EOF
FAILED: ParseException line 5:47 cannot recognize input near ')' 'as' 'one' in subquery source
The first two errors stumped me for quite a while: in Hive, every subquery must be given an alias! In Hive, every subquery must be given an alias! In Hive, every subquery must be given an alias! Important things get said three times. Whether or not the alias is ever used, every subquery in Hive must have one!
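A minimal illustration of the rule (t is a hypothetical table):

```sql
-- ParseException: the subquery has no alias
-- select movie from (select movie from t);

-- Works: the subquery is aliased, even though the alias is never referenced
select movie from (select movie from t) as sub;
```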
Error while compiling statement: FAILED: SemanticException [Error 10025]: Line 4:8 Expression not in GROUP BY key 'movie'
In Hive, aggregate functions such as sum and avg can be used on their own without a group by clause. But as soon as the query also selects an ordinary column alongside sum or avg, that column must appear in a group by.
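For example (ratings is a hypothetical table with columns movie and rate):

```sql
-- OK: only an aggregate is selected, so no group by is needed;
-- the whole table collapses to a single row
select avg(rate) from ratings;

-- SemanticException [Error 10025]: movie is neither aggregated nor grouped
-- select movie, avg(rate) from ratings;

-- OK: movie is listed in group by, so it is single-valued per output row
select movie, avg(rate) from ratings group by movie;
```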