We have the following three data files:
1) users.dat, sample record: 2::M::56::16::70072
Fields: UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
Meaning: user ID, gender, age, occupation, zip code
2) movies.dat, sample record: 2::Jumanji (1995)::Adventure|Children's|Fantasy
Fields: MovieID BigInt, Title String, Genres String
Meaning: movie ID, movie title, movie genres
3) ratings.dat, sample record: 1::1193::5::978300760
Fields: UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
Meaning: user ID, movie ID, rating, rating timestamp
1. Create the tables correctly, load the data (three tables, three files), and verify the result
Create the users table
create table if not exists users(UserID BigInt,Gender String,Age Int,Occupation String,Zipcode String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)::(.*)',
'output.format.string'='%1$s %2$s %3$s %4$s %5$s')
stored as textfile location "/user/data/yingping/users";
Load the data
load data local inpath "/home/hadoop/hive_data/users.dat" into table users;
Check the data
select *
from users limit 10;
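To make the role of input.regex concrete, here is a minimal Python sketch of how a pattern with five capture groups separated by "::" splits one users.dat record. It only approximates what RegexSerDe does (Hive uses Java regular expressions); the names USERS_PATTERN and parse_user are hypothetical and exist only for this illustration.

```python
import re

# Five greedy capture groups separated by "::", one group per users column.
USERS_PATTERN = re.compile(r"(.*)::(.*)::(.*)::(.*)::(.*)")


def parse_user(line):
    """Split one users.dat record into its five fields; return None if it does not match."""
    match = USERS_PATTERN.fullmatch(line.rstrip("\n"))
    if match is None:
        return None
    user_id, gender, age, occupation, zipcode = match.groups()
    return {
        "UserID": int(user_id),
        "Gender": gender,
        "Age": int(age),
        "Occupation": occupation,
        "Zipcode": zipcode,
    }


row = parse_user("2::M::56::16::70072")
print(row)
```

A pattern such as (.)::(.) would match only single-character fields, so every group needs the * quantifier for multi-digit IDs and longer strings to be captured.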
Create the movies table
create table if not exists movies(MovieID BigInt, Title String, Genres String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)',
'output.format.string'='%1$s %2$s %3$s')
stored as textfile location "/user/data/yingping/movies";
Load the data
load data local inpath "/home/hadoop/hive_data/movies.dat" into table movies;
Check the data
select *
from movies limit 10;
Create the ratings table
create table if not exists ratings(UserID BigInt, MovieID BigInt, Rating Double, Timestamped String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)',
'output.format.string'='%1$s %2$s %3$s %4$s')
stored as textfile location "/user/data/yingping/ratings";
Load the data
load data local inpath "/home/hadoop/hive_data/ratings.dat" into table ratings;
Check the data
select * from ratings limit 10;
2. Find the 10 most-rated movies and their rating counts (movie title, rating count)
select Title,count(UserID) c
from movies m join ratings r
on m.MovieID=r.MovieID
group by Title order by c desc limit 10;
3. For men and for women separately, find the 10 movies with the highest average rating (gender, movie title, average rating)
select Gender,Title,avg(Rating)avg
from ratings r join users u on r.UserID=u.UserID
join movies m on r.MovieID=m.MovieID
where Gender = 'F'
group by Gender,Title order by avg desc limit 10;
select Gender,Title,avg(Rating)avg
from ratings r join users u on r.UserID=u.UserID
join movies m on r.MovieID=m.MovieID
where Gender = 'M'
group by Gender,Title order by avg desc limit 10;
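4. For the movie with movieid = 2116, find the average rating per age group (the data contains only 7 distinct age values, so group by those) (age group, average rating)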
select Age,avg(Rating)avg
from users u join ratings r on u.UserID=r.UserID
where MovieID=2116 group by Age;
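5. For the woman who likes movies most (the one with the most ratings), take the 10 movies she rated highest and find their average ratings (viewer, movie title, average rating)
Approach: first find the woman with the most ratings, then take the 10 movies she rated highest, and finally compute the average rating of each of those movies over all users
Find the woman who likes movies most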
select u.UserID,count(Rating)count
from users u join ratings r on u.UserID=r.UserID
where Gender='F'
group by u.UserID
order by count desc limit 1;
Result
u.userid count
1150 1302
Compute the average rating of her 10 top-rated movies (UserID 1150 comes from the result above)
select r.MovieID,m.Title,avg(r.Rating)avg
from ratings r join
(select MovieID,Rating from ratings
where UserID=1150 order by Rating desc limit 10
)t on r.MovieID=t.MovieID join movies m on r.MovieID=m.MovieID group by r.MovieID,m.Title;
Result
162 Crumb (1994) 4.063136456211812
904 Rear Window (1954) 4.476190476190476
951 His Girl Friday (1940) 4.249370277078086
1230 Annie Hall (1977) 4.14167916041979
1966 Metropolitan (1990) 3.6464646464646466
2330 Hands on a Hard Body (1996) 4.163043478260869
3163 Topsy-Turvy (1999) 3.7039473684210527
3307 City Lights (1931) 4.387453874538745
3671 Blazing Saddles (1974) 4.047363717605005
3675 White Christmas (1954) 3.8265682656826567
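6. Find the year with the most good movies (average rating >= 4.0), and the 10 best movies of that year
Approach: first extract the year from Title, then find the year with the most good movies, and finally get the 10 best movies of that year
Create a table containing year, MovieID (movie ID), and avg_rate (average rating)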
create table year_movie_avgrate as
select
substr(a.Title,-5,4) year,a.MovieID MovieID,avg(b.Rating) avg_rate
from movies a join ratings b on a.MovieID=b.MovieID
group by a.MovieID,substr(a.Title,-5,4);
Check the data
select *
from year_movie_avgrate limit 10;
Result
year_movie_avgrate.year year_movie_avgrate.movieid year_movie_avgrate.avg_rate
1995 1 4.146846413095811
1995 2 3.20114122681883
1995 3 3.01673640167364
1995 4 2.7294117647058824
1995 5 3.0067567567567566
1995 6 3.8787234042553194
1995 7 3.410480349344978
1995 8 3.014705882352941
1995 9 2.656862745098039
1995 10 3.5405405405405403
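The year column above depends on substr(a.Title,-5,4), which starts 5 characters from the end of the title and takes 4 characters. The Python sketch below mirrors that slice and, as an assumption of mine rather than something in the original queries, adds a stricter regex-based variant that tolerates trailing whitespace. Both function names are hypothetical.

```python
import re


def year_by_position(title):
    """Mirror Hive's substr(Title, -5, 4): 4 characters starting 5 from the end."""
    return title[-5:-1]


def year_by_pattern(title):
    """Stricter variant: the 4 digits inside the last "(YYYY)" group, or None."""
    match = re.search(r"\((\d{4})\)\s*$", title)
    return match.group(1) if match else None


clean_title = "Jumanji (1995)"
padded_title = "Jumanji (1995) "  # hypothetical record with a trailing space

print(year_by_position(clean_title), year_by_pattern(clean_title))
print(year_by_position(padded_title), year_by_pattern(padded_title))
```

The positional version is enough as long as every title ends exactly with "(YYYY)"; the pattern version is only worth the extra cost if the data turns out to be less regular.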
From the table above, find the year with the most movies whose average rating is at least 4.0
select
year,count(*) totalcount
from year_movie_avgrate
where avg_rate >= 4.0
group by year
order by totalcount desc limit 1;
Result
year totalcount
1998 27
Nest the query above as a subquery to get the 10 best movies of 1998
select
a.year year,b.MovieID MovieID,b.avg_rate avg_rate
from
(select
year,count(*) totalcount
from year_movie_avgrate
where avg_rate >= 4.0
group by year
order by totalcount desc limit 1) a
join year_movie_avgrate b on a.year=b.year
order by avg_rate desc limit 10;
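Result (shown with movie titles; the query above returns MovieID, so titles need an extra join with movies on MovieID)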
name rate
Follow the Bitch (1998) 5.0
Apple, The (Sib) (1998) 4.666666666666667
Inheritors, The (Die Siebtelbauern) (1998) 4.5
Return with Honor (1998) 4.4
Saving Private Ryan (1998) 4.337353938937053
Celebration, The (Festen) (1998) 4.3076923076923075
West Beirut (West Beyrouth) (1998) 4.3
Central Station (Central do Brasil) (1998) 4.283720930232558
42 Up (1998) 4.2272727272727275
American History X (1998) 4.2265625
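7. Among movies released in 1997, find the 10 highest-rated Comedy movies
Approach: join the year table built above with the movies table, filter on year and genre, sort by average rating, and take the top 10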
select
a.year year,b.MovieID MovieID,b.Title Title,a.avg_rate avg_rate
from year_movie_avgrate a join movies b
on a.MovieID=b.MovieID
where a.year="1997" and instr(lcase(b.Genres),"comedy")>0
order by avg_rate desc limit 10;
Result
v.id v.name v.rate
2324 Life Is Beautiful (La Vita è bella) (1997) 4.329861111111111
2444 24 7: Twenty Four Seven (1997) 4.0
1827 Big One, The (1997) 4.0
1871 Friend of the Deceased, A (1997) 4.0
1784 As Good As It Gets (1997) 3.9501404494382024
2618 Castle, The (1997) 3.891304347826087
1641 Full Monty, The (1997) 3.872393661384487
1564 Roseanna's Grave (For Roseanna) (1997) 3.8333333333333335
1734 My Life in Pink (Ma vie en rose) (1997) 3.825870646766169
1500 Grosse Pointe Blank (1997) 3.813380281690141
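8. For each genre in the rating data, find the 5 highest-rated movies (genre, movie title, average rating)
Approach: first split each movie into its genres, then join with the ratings table to compute average ratings, and take the top five per genre
A movie can belong to several genres separated by "|", so explode the Genres field into one row per genre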
select
MovieID,Title,tps.type Type
from movies lateral view explode(split(Genres,"\\|")) tps as type;
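As a rough illustration of what lateral view explode(split(...)) produces, the Python sketch below turns one movie row into one row per genre. The function name explode_genres is hypothetical. Note that Hive's split takes a regular expression, which is why the query escapes the pipe as "\\|", whereas Python's str.split takes the separator literally.

```python
def explode_genres(rows):
    """Yield (movie_id, title, genre) once per genre listed in each movie row."""
    for movie_id, title, genres in rows:
        for genre in genres.split("|"):
            yield movie_id, title, genre


# Sample record taken from the movies.dat description at the top of this post.
movies = [(2, "Jumanji (1995)", "Adventure|Children's|Fantasy")]

exploded = list(explode_genres(movies))
for row in exploded:
    print(row)
```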
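Create a per-genre rating table by nesting the query above and joining it with the ratings table; note that genre names are normalized to lowercase here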
create table type_movie_avgrate as
select
lower(a.Type) Type,a.Title Title,a.MovieID MovieID,avg(b.Rating) avg_rate
from
(select
MovieID,Title,tps.type Type
from movies lateral view explode(split(Genres,"\\|")) tps as type) a
join ratings b on a.MovieID=b.MovieID
group by lower(a.Type),a.Title,a.MovieID;
Check the data
select * from type_movie_avgrate limit 10;
Result
action 13th Warrior, The (1999) 2826 3.1586666666666665
action 3 Ninjas: High Noon On Mega Mountain (1998) 1739 1.3617021276595744
action 52 Pick-Up (1986) 2475 3.3
action 7th Voyage of Sinbad, The (1958) 3153 3.616279069767442
action Abyss, The (1989) 1127 3.6839650145772596
action Aces: Iron Eagle III (1992) 2817 1.64
action Action Jackson (1988) 3710 2.254054054054054
action Adrenalin: Fear the Rush (1996) 1383 1.5454545454545454
action Adventures of Robin Hood, The (1938) 940 3.9735449735449735
action African Queen, The (1951) 969 4.251655629139073
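To get the 5 movies with the highest average rating in each genre, use a window function that partitions the rows by genre and orders them by average rating, then keep the first five rows of each partition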
select
*
from
(select
Type,Title,MovieID,avg_rate,
row_number() over(distribute by Type sort by avg_rate desc) no
from type_movie_avgrate) a where a.no<=5;
Partial results:
a.type a.title a.movieid a.avg_rate a.no
action Sanjuro (1962) 2905 4.608695652173913 1
action Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) 2019 4.560509554140127 2
action Godfather, The (1972) 858 4.524966261808367 3
action Raiders of the Lost Ark (1981) 1198 4.477724741447892 4
action Star Wars: Episode IV - A New Hope (1977) 260 4.453694416583082 5
adventure Ulysses (Ulisse) (1954) 3172 5.0 1
adventure Sanjuro (1962) 2905 4.608695652173913 2
adventure Raiders of the Lost Ark (1981) 1198 4.477724741447892 3
adventure Star Wars: Episode IV - A New Hope (1977) 260 4.453694416583082 4
adventure Lawrence of Arabia (1962) 1204 4.401925391095066 5
animation Close Shave, A (1995) 745 4.52054794520548 1
animation Wrong Trousers, The (1993) 1148 4.507936507936508 2
animation Wallace & Gromit: The Best of Aardman Animation (1996) 720 4.426940639269406 3
animation Grand Day Out, A (1992) 1223 4.361522198731501 4
animation Creature Comforts (1990) 3429 4.335766423357664 5
children's Wizard of Oz, The (1939) 919 4.247962747380675 1
children's Toy Story 2 (1999) 3114 4.218927444794953 2
children's Toy Story (1995) 1 4.146846413095811 3
children's Iron Giant, The (1999) 2761 4.0474777448071215 4
children's Winnie the Pooh and the Blustery Day (1968) 1023 3.986425339366516 5
comedy Smashing Time (1967) 3233 5.0 1
comedy Follow the Bitch (1998) 1830 5.0 2
comedy One Little Indian (1973) 3607 5.0 3
comedy Close Shave, A (1995) 745 4.52054794520548 4
comedy Wrong Trousers, The (1993) 1148 4.507936507936508 5
crime Lured (1947) 3656 5.0 1
crime Godfather, The (1972) 858 4.524966261808367 2
crime Usual Suspects, The (1995) 50 4.517106001121705 3
crime Bells, The (1926) 3517 4.5 4
crime Double Indemnity (1944) 3435 4.415607985480944 5
documentary Bittersweet Motel (2000) 3881 5.0 1
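The row_number() filter used above is a general "top N per group" pattern. A minimal Python sketch of the same idea follows, using a few rows copied from the results; the function name top_n_per_group is hypothetical. One caveat: when several movies tie (for example the three comedies rated 5.0), Hive does not guarantee which of them gets which row number, while this sketch keeps ties in input order because Python's sort is stable.

```python
from itertools import groupby
from operator import itemgetter


def top_n_per_group(rows, n):
    """rows are (genre, title, avg_rate); return at most n rows per genre with a 1-based rank."""
    # Sort by genre, then by average rating descending, like the window's ordering.
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    result = []
    for _, group in groupby(ordered, key=itemgetter(0)):
        for rank, row in enumerate(group, start=1):
            if rank > n:
                break
            result.append(row + (rank,))
    return result


rows = [
    ("action", "Godfather, The (1972)", 4.524966261808367),
    ("action", "Sanjuro (1962)", 4.608695652173913),
    ("action", "Raiders of the Lost Ark (1981)", 4.477724741447892),
    ("crime", "Usual Suspects, The (1995)", 4.517106001121705),
    ("crime", "Lured (1947)", 5.0),
]

ranked = top_n_per_group(rows, 2)
for row in ranked:
    print(row)
```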
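9. Find the highest-rated genre of each year (year, genre, average rating)
Approach: we need the average rating of each movie per year and per genre, so join the per-year rating table and the per-genre rating table built earlier on MovieID, and average the movie ratings by year and genre. Then apply a window function that partitions by year and orders by average rating, and keep only the first row of each partition, which is the highest-rated genre of that year.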
select
*
from
(select
c.year year,c.Type Type,c.avg_rate avg_rate,
row_number() over(distribute by c.year sort by c.avg_rate desc) no
from
(
select
a.year year,b.Type Type,avg(a.avg_rate) avg_rate
from year_movie_avgrate a
join type_movie_avgrate b
on a.MovieID=b.MovieID
group by a.year,b.Type
) c ) d where d.no=1;
Partial results
d.year d.type d.avg_rate d.no
1919 comedy 3.6315789473684212 1
1920 comedy 3.6666666666666665 1
1921 action 3.7903225806451615 1
1922 horror 3.991596638655462 1
1923 comedy 3.4444444444444446 1
1925 war 3.97008547008547 1
1926 crime 4.5 1
1927 comedy 4.368932038834951 1
1928 comedy 3.6458333333333335 1
1929 musical 3.1875 1
1930 war 4.1940298507462686 1
1931 drama 4.387453874538745 1
1932 drama 3.7752100840336134 1
1933 war 4.21043771043771 1
1934 mystery 4.239726027397261 1
1935 musical 4.147410358565737 1
1936 drama 4.239130434782608 1
1937 war 4.33939393939394 1
1938 mystery 4.185929648241206 1
1939 musical 4.247962747380675 1
1940 comedy 4.000047333684532 1
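One detail of the query above is worth spelling out: avg(a.avg_rate) averages the per-movie averages, so every movie counts equally no matter how many ratings it received. The Python sketch below uses made-up ratings (not taken from the dataset) to show how that can differ from averaging all individual ratings; the names are hypothetical.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Hypothetical ratings for two movies of the same year and genre.
ratings_by_movie = {
    "movie_a": [5.0],
    "movie_b": [3.0, 3.0, 3.0],
}

# What avg(a.avg_rate) computes: each movie contributes one value.
mean_of_movie_means = mean(mean(r) for r in ratings_by_movie.values())

# Alternative: each individual rating contributes one value.
mean_of_all_ratings = mean(r for rs in ratings_by_movie.values() for r in rs)

print(mean_of_movie_means, mean_of_all_ratings)
```

Either definition can be defended; the point is only that sparsely rated movies weigh more under the first one, which may explain some of the early years in the results.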
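10. Find the highest-rated movie of each region and store the result in HDFS (region, movie title, average rating)
Approach: first compute the average rating of every movie in every region by joining the users, ratings, and movies tables and grouping by region (Zipcode), movie ID, and movie title (group by exactly the fields we need to output). Then apply a window function that partitions by region and orders by average rating, and keep only the first row of each partition. Finally write the result to HDFS. Be aware that insert overwrite directory replaces the existing contents of the target directory, and the tables above are stored under /user/data/yingping, so a dedicated output directory is the safer choice.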
insert overwrite directory "/user/data/"
select
*
from
(select
d.Zipcode Zipcode,d.MovieID MovieID,d.Title Title,d.avg_rate avg_rate,row_number() over(distribute by d.Zipcode sort by d.avg_rate desc) no
from
(
select
a.Zipcode Zipcode,c.MovieID MovieID,c.Title Title,
avg(b.Rating) avg_rate
from users a join ratings b on a.UserID=b.UserID
join movies c on b.MovieID=c.MovieID
group by a.Zipcode,c.MovieID,c.Title) d )e where e.no=1;
Partial results