现有如此三份数据:
1、users.dat 数据格式为: 2::M::56::16::70072
对应字段为:UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
对应字段中文解释:用户id,性别,年龄,职业,邮政编码
2、movies.dat
数据格式为: 2::Jumanji (1995)::Adventure|Children's|Fantasy
对应字段为:MovieID BigInt, Title String, Genres String
对应字段中文解释:电影ID,电影名字,电影类型
3、ratings.dat
数据格式为: 1::1193::5::978300760
对应字段为:UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
对应字段中文解释:用户ID,电影ID,评分,评分时间戳
题目要求:
数据要求:
(1)写shell脚本清洗数据。(hive不支持解析多字节的分隔符,也就是说hive只能解析':', 不支持解析'::',所以用普通方式建表来使用是行不通的,要求对数据做一次简单清洗)
(2)使用Hive能解析的方式进行
Hive不支持多字节的分隔符。(Hive不支持 unicode 编码 > 128的字符 )
补充一个知识点:
1、列分隔符的默认值是
ctrl + a \x01
2、集合的默认分隔符(array类型中的元素值之间的分隔符 和 map类型中 kv和kv之间的分隔符, struct类型各元素之间的分隔符)
ctrl + b
\x02
3、map类型中的k和v之间的分隔符:
ctrl + c
\x03
Hive要求:
(1)正确建表,导入数据(三张表,三份数据),并验证是否正确
drop table if exists users;
create table users(UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s %5$s')
stored as textfile;
load data local inpath '/home/hadoop/users.dat' INTO TABLE users;
select * from users limit 5;
drop table if exists movies;
create table movies(MovieID BigInt, Title String, Genres String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s')
stored as textfile;
load data local inpath '/home/hadoop/movies.dat' INTO TABLE movies;
select * from movies limit 5;
drop table if exists ratings;
create table ratings(UserID BigInt, MovieID BigInt, Rating Double, Timestamped String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s')
stored as textfile;
load data local inpath '/home/hadoop/ratings.dat' INTO TABLE ratings;
select * from ratings limit 5;
(2)求被评分次数最多的10部电影,并给出评分次数(电影名,评分次数)
两个表链接分组注意:
create table result2 as
select r.movieid,m.title,count(*) total
from ratings r join movies m on r.movieid=m.movieid
group by r.movieid,m.title
order by total desc limit 10;
(3)分别求男性,女性当中评分最高的10部电影(性别,电影名,影评分)(Top10)
思路:
where gender = 'M'
group by movieid, title
avg(rating) as avgrate
order by avgrate desc limit 10;
答案:
create table result3 as
select r.movieid,m.title,avg(r.rating) as avgrate
from ratings r join movies m on r.movieid=m.movieid join users u on r.userid=u.userid
where u.gender='M'
group by r.movieid,m.title
order by avgrate desc limit 10;
create table result2_1 as select a.movieid as movieid, b.title as title, avg(rating) as avgrate, count(*) as total from ratings a join movies b on a.movieid = b.movieid join users c on a.userid = c.userid
where c.gender = 'M'
group by a.movieid, b.title
having total > 50
order by avgrate desc limit 10
(4)求movieid = 2116这部电影各年龄段(因为年龄就只有7个,就按这个7个分就好)的平均影评(年龄段,影评分)
注意:如果一个需求中,出现了“每”“各”“分别”,那么就一定要进行分组
select distinct age from users;
\\如果需要划分年龄段: case 。。。 when。。。。
group by age
where movieid = 2116
avg(rating) as avgrate
from ratings a join users b on a.userid = b.userid
select age, avg(rating) as avgrate
答案:
create table result4 as
select u.age,avg(r.rating) avgrate
from users u join ratings r on u.userid=r.userid
where r.movieid=2116
group by u.age;
(5)求最喜欢看电影(影评次数最多)的那位女性评最高分的10部电影的平均影评分(观影者,电影名,影评分)
1、求出最喜欢看电影的女性
select a.userid, count(*) as total
from ratings a join users b on a.userid = b.userid
where b.gender = "F"
group by a.userid
order by total desc limit 1;
结果:
1150 1302
2、求出这个女性评分最高的10部电影
select rr.movieid from
(select movieid,rating from ratings where userid=1150 order by rating desc limit 10) rr;
结果:
951
3671
3307
1230
904
162
3675
1966
3163
2330
3、求出这10部电影的平均影评分
create table result5 as
select r.movieid,avg(r.rating) as avgrate
from ratings r left semi join (select rr.movieid from
(select movieid,rating from ratings where userid=1150 order by rating desc limit 10) rr) a
on r.movieid=a.movieid
group by r.movieid;
select * from result5;
结果: 162 4.063136456211812
904 4.476190476190476
951 4.249370277078086
1230 4.14167916041979
1966 3.6464646464646466
2330 4.163043478260869
3163 3.7039473684210527
3307 4.387453874538745
3671 4.047363717605005
3675 3.8265682656826567
(6)求好片(评分>=4.0)最多的那个年份的最好看的10部电影(movieid, title)
比如:2012年上映了20部电影,结果只有9部好片。 那么最后的结果是输出9部电影还是10部电影?--》10部。
意义:求出电影大年(好片最多的年)中的最好看的10部电影
1、求出好片最多的年份
1、构造出年份字段
2、按照年份(和电影)分组,按照电影的评分排降序,取前1
3、电影的评分
one:求每个电影的评分(按照movieid分组)
// 字段:movieid, year, avgrate
create table result6_1 as
select r.movieid,substr(m.title,-5,4) as year,avg(r.rating) as avgrate
from ratings r join movies m on r.movieid=m.movieid
group by r.movieid,m.title;
two:求每年好电影的个数(按照年份分组),取top1就是好片最多的年份
create table result6_2 as
select year,count(*) as total from result6_1
where avgrate>=4.0
group by year
order by total desc limit 1;
结果: 1998 27
2、求出这一年最好看的10部电影
create table result6_3 as
select movieid,avgrate from result6_1
where year =1998 order by avgrate desc limit 10;
结果:
1830 5.0
2503 4.666666666666667
2309 4.5
2930 4.4
2028 4.337353938937053
2360 4.3076923076923075
2839 4.3
2357 4.283720930232558
3077 4.2272727272727275
2329 4.2265625
*****************************************************************************************************
group by 字段的用法: 不能是select后面字段的别名
order by 字段的用法: 不能是select后面不出现的字段, 可以使用别名
补充一点:
如果使用子查询,一定不要忘记给子查询的结果表取一个别名
******************************************************************************************************
(7)求1997年上映的电影中,评分最高的10部Comedy类电影
重点问题:判断一个电影是不是comedy电影
方式:hive内置函数instr可以用来判断,一个字段串在不在另外一个字符串中
**************
测试例子:
SELECT instr(lcase('Comedy|Horror|Thriller'), 'comedy');
lcase:将字符串全部变为小写。comedy在字段串中时,结果不为0;
分析:
where year = 1997 (result6_1)
where instr(lcase(genres), 'comedy') != 0
order by avgrate desc limit 10;
select movieid, avgrate
from result6_1 a join moveis b on a.movieid = b.movieid
答案:
create table result7 as
select r.movieid,r.avgrate from result6_1 r join movies m on r.movieid=m.movieid
where r.year=1997 and instr(lcase(m.genres),'comedy') !=0
order by r.avgrate desc limit 10;
结果:
2324 4.329861111111111
1827 4.0
1871 4.0
2444 4.0
1784 3.9501404494382024
2618 3.891304347826087
1641 3.872393661384487
1564 3.8333333333333335
1734 3.825870646766169
1500 3.813380281690141
(8)该影评库中各种类型电影中评价最高的5部电影(类型,电影名,平均影评分)(topn)
核心思路: 按照电影类型分组,按照评价排序,取每组前 5
前提:求出每种类型的所有电影
测试例子:
select explode(split("Horror|Sci-Fi|Thriller","\\|"));
内置函数split为:按照指定分隔符,切分字符串
结果:
Horror
Sci-Fi
Thriller
需要转换数据:
3826::Hollow Man (2000)::Horror|Sci-Fi|Thriller
为:
3826::Hollow Man (2000)::Horror
3826::Hollow Man (2000)::Sci-Fi
3826::Hollow Man (2000)::Thriller
答案:
第一步,每种类型的所有电影
create table result8_1 as
select movieid, title, type.movietype
************************************************************
from movies lateral view explode(split(genres,"\\|")) type as movietype;
(
虚拟视图)
create table result8_1 as
select movieid,title, lv_type.mtype
from movies lateral view explode(split(genres,"\\|")) lv_type as mtype;
result8_1 : movieid, title, movietype
result6_1 : movieid, year, avgrate
第二步, 得让result8_1 和 result6_1链接
XXXXX
XXXXXXXXXXXXXXXXXXXX
select a.movietype, a.movieid, b.avgrate
from result8_1 a join result6_1 b
on a.movieid = b.movieid
group by a.movietype, a.movieid; XXXXXXXXX
********************************************************************
得利用新技能实现: 窗口函数: row_number()
具体用法:
row_number() over (distribute by movietype sort by avgrate desc) as row_numer
row_numer的含义: 就是该条记录在每一组当中的排序的序号
**************************************************
a
jfdls
1
a
jhfk
2
a
lsdkfl
3
b
aa
1
b
2304
2
b
jlkfjsl
3
分组排序, distribute by movietype sort by avgrate desc
cluster by
最终的HQL语句:
create table result8_2 as
select * from
带序号的字段
(select a.mtype,a.movieid,b.avgrate,row_number() over(distribute by mtype sort by avgrate desc) as rn
from result8_1 a join result6_1 b
on a.movieid=b.movieid) ab where ab.rn<=5;
扩展:
create table result8_3 as select a.movietype, a.movieid, b.avgrate, row_number() over (distribute by movietype sort by avgrate desc) as row_numer
from result8_1 a join result6_1 b
on a.movieid = b.movieid;
(9)各年评分最高的电影类型(年份,类型,影评分)
result6_1 : movieid, year, avgrate 各个电影的评分
result8_1 : movieid, title, mtype 各种类型的电影
create table result9_1 as
select a.year,b.mtype,avg(a.avgrate) rate
from result6_1 a join result8_1 b on a.movieid = b.movieid
group by year,mtype order by year,rate;
select * from (select *,row_number() over(distribute by year sort by rate desc) as rn from result9_1) ab
where ab.rn=1;
(10)每个地区最高评分的电影名,把结果存入HDFS(地区,电影名,影评分)
每个地区 每个电影 评分
create table result10_1 as
select u.zipcode,m.movieid,avg(r.rating) rate
from users u join ratings r on u.userid=r.userid join movies m on r.movieid=m.movieid
group by zipcode,m.movieid;
select * from (select *,row_number() over(distribute by zipcode sort by rate desc) as rn from result10_1) ab
where ab.rn=1;
............
99203 150 5.0 1
99205 587 5.0 1
99217 2571 5.0 1
99224 2133 5.0 1
99352 1380 5.0 1
99353 2393 5.0 1
99504 3097 5.0 1
99508 1196 5.0 1
99516 2028 5.0 1
99701 3615 5.0 1
99703 1485 5.0 1
99709 2371 5.0 1
99801 3342 5.0 1
99826 2858 5.0 1
99945 3114 5.0 1