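We have the following three data files:
1. users.dat, sample record: 2::M::56::16::70072
Fields: UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
Field meanings: user ID, gender, age, occupation, zip code
2. movies.dat, sample record: 2::Jumanji (1995)::Adventure|Children's|Fantasy
Fields: MovieID BigInt, Title String, Genres String
Field meanings: movie ID, movie title, movie genres
3. ratings.dat, sample record: 1::1193::5::978300760
Fields: UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
Field meanings: user ID, movie ID, rating, rating timestamp
Task requirements:
Data requirements:
(1) Write a shell script to clean the data. (Hive's default delimited row format only accepts a single-character field delimiter: it can split on ':' but not on '::'. Creating the tables the ordinary way therefore does not work, and the data needs one simple cleaning pass first.)
(2) Or load the raw files in a form Hive can parse directly, for example with RegexSerDe: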
create table ratings (uid bigint, mid bigint, rating double, timestamped string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties ('input.regex'='(.*)::(.*)::(.*)::(.*)',
'output.format.string'='%1$s %2$s %3$s %4$s')
stored as textfile;
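The two options above can be sketched in Python to see what each one does to a record. This is only an illustration, not the shell script the task asks for: `clean_line` and `parse_rating` are hypothetical helper names, and using a tab as the replacement delimiter is an assumption (it relies on no field containing a tab). The pattern is copied from the 'input.regex' property in the DDL.

```python
import re

# Same pattern as the 'input.regex' SerDe property in the DDL above.
RATINGS_PATTERN = re.compile(r"(.*)::(.*)::(.*)::(.*)")


def clean_line(line: str, sep: str = "\t") -> str:
    """Option (1): replace the two-character '::' with a single character.

    The tab default is an assumption; any character absent from the data works.
    """
    return line.rstrip("\n").replace("::", sep)


def parse_rating(line: str) -> tuple:
    """Option (2): capture the four fields of a ratings.dat line with the regex."""
    match = RATINGS_PATTERN.fullmatch(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"line does not match the ratings layout: {line!r}")
    uid, mid, rating, timestamped = match.groups()
    return int(uid), int(mid), float(rating), timestamped


if __name__ == "__main__":
    print(clean_line("2::Jumanji (1995)::Adventure|Children's|Fantasy"))
    print(parse_rating("1::1193::5::978300760"))
```

A single ':' inside a title (as in the Star Wars titles further down) is left untouched by the cleaning step, because only the two-character sequence is replaced.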
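Hive requirements:
1. Create the tables correctly, load the data (three tables, three files), and verify the result
load data local inpath '/home/hadoop/ratings.dat' into table ratings;
2. Find the 10 most frequently rated movies and their rating counts (movie title, rating count)
(1) Find the IDs of the 10 most frequently rated movies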
select mid ,count(*) n from ratings group by mid order by n desc limit 10;
+-------+-------+
| mid | n |
+-------+-------+
| 2858 | 3428 |
| 260 | 2991 |
| 1196 | 2990 |
| 1210 | 2883 |
| 480 | 2672 |
| 2028 | 2653 |
| 589 | 2649 |
| 2571 | 2590 |
| 1270 | 2583 |
| 593 | 2578 |
+-------+-------+
(2) Look up the movie titles
select b.title,a.n from movies b
join
(select mid ,count(*) n from ratings group by mid
order by n desc limit 10)a
on b.mid = a.mid;
+----------------------------------------------------+-------+
| b.title | a.n |
+----------------------------------------------------+-------+
| American Beauty (1999) | 3428 |
| Star Wars: Episode IV - A New Hope (1977) | 2991 |
| Star Wars: Episode V - The Empire Strikes Back (1980) | 2990 |
| Star Wars: Episode VI - Return of the Jedi (1983) | 2883 |
| Jurassic Park (1993) | 2672 |
| Saving Private Ryan (1998) | 2653 |
| Terminator 2: Judgment Day (1991) | 2649 |
| Matrix, The (1999) | 2590 |
| Back to the Future (1985) | 2583 |
| Silence of the Lambs, The (1991) | 2578 |
+----------------------------------------------------+-------+
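The count-then-join logic of the query above can be sketched in a few lines of Python. The function name `top_rated_counts` and the sample rows are made up for illustration; only the shape of the computation (count per movie ID, keep the top n, then look up the title) follows the HiveQL.

```python
from collections import Counter


def top_rated_counts(ratings, titles, n=10):
    """Return (title, rating_count) for the n most frequently rated movies.

    ratings: iterable of (uid, mid, rating) tuples
    titles:  dict mapping mid -> title
    """
    # GROUP BY mid with count(*).
    counts = Counter(mid for _uid, mid, _rating in ratings)
    # most_common(n) sorts by count descending, like ORDER BY n DESC LIMIT n.
    # The 'if' mirrors the inner join: IDs without a title are dropped.
    return [(titles[mid], count)
            for mid, count in counts.most_common(n)
            if mid in titles]


if __name__ == "__main__":
    # Made-up rows, only to exercise the function.
    sample_ratings = [(1, 10, 5.0), (2, 10, 4.0), (3, 10, 3.0),
                      (1, 20, 4.0), (2, 20, 2.0), (1, 30, 1.0)]
    sample_titles = {10: "Movie A", 20: "Movie B", 30: "Movie C"}
    print(top_rated_counts(sample_ratings, sample_titles, n=2))
```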
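3. Find the 10 highest-rated movies among male users and among female users (gender, movie title, rating)
Approach 1: concatenate the two result sets with UNION ALL
(1) Inner-join the ratings, users, and movies tables and save the result as a view for later use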
create view film_view as
(select r.*,u.sex,u.age,m.title,m.genres
from ratings r
join users u on r.uid = u.uid
join movies m on r.mid = m.mid);
(2) Find the 10 movies rated highest by men, counting only movies with at least 50 ratings as valid
select sex,title,avg(rating)r,count(*)n from film_view where sex='M'
group by title,sex having n >= 50
order by r desc limit 10;
+------+----------------------------------------------------+--------------------+-------+
| sex | title | r | n |
+------+----------------------------------------------------+--------------------+-------+
| M | Sanjuro (1962) | 4.639344262295082 | 61 |
| M | Godfather, The (1972) | 4.583333333333333 | 1740 |
| M | Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) | 4.576628352490421 | 522 |
| M | Shawshank Redemption, The (1994) | 4.560625 | 1600 |
| M | Raiders of the Lost Ark (1981) | 4.520597322348094 | 1942 |
| M | Usual Suspects, The (1995) | 4.518248175182482 | 1370 |
| M | Star Wars: Episode IV - A New Hope (1977) | 4.495307167235495 | 2344 |
| M | Schindler's List (1993) | 4.49141503848431 | 1689 |
| M | Paths of Glory (1957) | 4.485148514851486 | 202 |
| M | Wrong Trousers, The (1993) | 4.478260869565218 | 644 |
+------+----------------------------------------------------+--------------------+-------+
(3) Find the 10 movies rated highest by women, counting only movies with at least 50 ratings as valid
select sex,title,avg(rating)r,count(*)n from film_view where sex='F'
group by title,sex having n >= 50
order by r desc limit 10;
(4) Concatenate the two results
select a.* from
(select sex,title,avg(rating)r,count(*)n from film_view where sex='M'
group by title,sex having n >= 50
order by r desc limit 10)a
union all
select b.* from
(select sex,title,avg(rating)r,count(*)n from film_view where sex='F'
group by title,sex having n >= 50
order by r desc limit 10)b;
+----------+----------------------------------------------------+--------------------+--------+
| _u1.sex | _u1.title | _u1.r | _u1.n |
+----------+----------------------------------------------------+--------------------+--------+
| M | Sanjuro (1962) | 4.639344262295082 | 61 |
| M | Godfather, The (1972) | 4.583333333333333 | 1740 |
| M | Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) | 4.576628352490421 | 522 |
| M | Shawshank Redemption, The (1994) | 4.560625 | 1600 |
| M | Raiders of the Lost Ark (1981) | 4.520597322348094 | 1942 |
| M | Usual Suspects, The (1995) | 4.518248175182482 | 1370 |
| M | Star Wars: Episode IV - A New Hope (1977) | 4.495307167235495 | 2344 |
| M | Schindler's List (1993) | 4.49141503848431 | 1689 |
| M | Paths of Glory (1957) | 4.485148514851486 | 202 |
| M | Wrong Trousers, The (1993) | 4.478260869565218 | 644 |
| F | Close Shave, A (1995) | 4.644444444444445 | 180 |
| F | Wrong Trousers, The (1993) | 4.588235294117647 | 238 |
| F | Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) | 4.572649572649572 | 117 |
| F | Wallace & Gromit: The Best of Aardman Animation (1996) | 4.563106796116505 | 103 |
| F | Schindler's List (1993) | 4.56260162601626 | 615 |
| F | Shawshank Redemption, The (1994) | 4.539074960127592 | 627 |
| F | Grand Day Out, A (1992) | 4.537878787878788 | 132 |
| F | To Kill a Mockingbird (1962) | 4.536666666666667 | 300 |
| F | Creature Comforts (1990) | 4.513888888888889 | 72 |
| F | Usual Suspects, The (1995) | 4.513317191283293 | 413 |
+----------+----------------------------------------------------+--------------------+--------+
Approach 2: use row_number to pick the top 10
(1) Reuse the film_view view created in Approach 1 (the inner join of the ratings, users, and movies tables)
(2) Compute the average rating of every movie for men and for women, counting only movies with at least 50 ratings as valid
create view movie_3_r as
select sex,title,avg(rating)r,count(*)n from film_view
group by title,sex having n >= 50
order by r desc;
(3) Add a row number within each gender group
create view movie_3_r_l as
select m.*,row_number()over(distribute by sex sort by r desc)rn
from movie_3_r m
order by m.sex,m.r desc;
(4) Take the first 10 of each group
select * from movie_3_r_l
where rn < 11
order by sex, r desc;
+------------------+----------------------------------------------------+--------------------+----------------+-----------------+
| movie_3_r_l.sex | movie_3_r_l.title | movie_3_r_l.r | movie_3_r_l.n | movie_3_r_l.rn |
+------------------+----------------------------------------------------+--------------------+----------------+-----------------+
| F | Close Shave, A (1995) | 4.644444444444445 | 180 | 1 |
| F | Wrong Trousers, The (1993) | 4.588235294117647 | 238 | 2 |
| F | Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) | 4.572649572649572 | 117 | 3 |
| F | Wallace & Gromit: The Best of Aardman Animation (1996) | 4.563106796116505 | 103 | 4 |
| F | Schindler's List (1993) | 4.56260162601626 | 615 | 5 |
| F | Shawshank Redemption, The (1994) | 4.539074960127592 | 627 | 6 |
| F | Grand Day Out, A (1992) | 4.537878787878788 | 132 | 7 |
| F | To Kill a Mockingbird (1962) | 4.536666666666667 | 300 | 8 |
| F | Creature Comforts (1990) | 4.513888888888889 | 72 | 9 |
| F | Usual Suspects, The (1995) | 4.513317191283293 | 413 | 10 |
| M | Sanjuro (1962) | 4.639344262295082 | 61 | 1 |
| M | Godfather, The (1972) | 4.583333333333333 | 1740 | 2 |
| M | Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) | 4.576628352490421 | 522 | 3 |
| M | Shawshank Redemption, The (1994) | 4.560625 | 1600 | 4 |
| M | Raiders of the Lost Ark (1981) | 4.520597322348094 | 1942 | 5 |
| M | Usual Suspects, The (1995) | 4.518248175182482 | 1370 | 6 |
| M | Star Wars: Episode IV - A New Hope (1977) | 4.495307167235495 | 2344 | 7 |
| M | Schindler's List (1993) | 4.49141503848431 | 1689 | 8 |
| M | Paths of Glory (1957) | 4.485148514851486 | 202 | 9 |
| M | Wrong Trousers, The (1993) | 4.478260869565218 | 644 | 10 |
+------------------+----------------------------------------------------+--------------------+----------------+-----------------+
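The three steps of Approach 2 (aggregate, filter by count, number the rows inside each group) can be sketched in plain Python as follows. The function name `top_n_per_group`, its parameters, and the sample rows are hypothetical; the comments map each stage back to the HiveQL clause it imitates.

```python
from collections import defaultdict


def top_n_per_group(rows, n=10, min_count=50):
    """rows: iterable of (sex, title, rating) tuples.

    Returns (sex, title, avg_rating, count, rn) tuples with rn <= n.
    """
    # GROUP BY title, sex: accumulate the rating sum and count per pair.
    totals = defaultdict(lambda: [0.0, 0])
    for sex, title, rating in rows:
        acc = totals[(sex, title)]
        acc[0] += rating
        acc[1] += 1

    # HAVING n >= min_count, then bucket the aggregates by sex.
    by_sex = defaultdict(list)
    for (sex, title), (total, count) in totals.items():
        if count >= min_count:
            by_sex[sex].append((title, total / count, count))

    # row_number() over (distribute by sex sort by r desc), keeping rn <= n.
    result = []
    for sex in sorted(by_sex):
        ranked = sorted(by_sex[sex], key=lambda item: item[1], reverse=True)
        for rn, (title, avg, count) in enumerate(ranked[:n], start=1):
            result.append((sex, title, avg, count, rn))
    return result


if __name__ == "__main__":
    # Made-up rows; min_count is lowered so the tiny sample passes the filter.
    sample = [("F", "A", 5.0), ("F", "A", 4.0), ("F", "B", 3.0), ("F", "B", 3.0),
              ("M", "A", 2.0), ("M", "A", 4.0), ("M", "C", 5.0)]
    for row in top_n_per_group(sample, n=1, min_count=2):
        print(row)
```

Compared with the UNION ALL approach, the grouping key is data-driven here, so the same code would handle any number of groups without one subquery per value.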
4. For the movie with movieid = 2116, find the average rating in each age group (there are only 7 distinct age values, so group by those 7) (age group, rating)
select age,avg(rating)avgrate from film_view where mid = 2116
group by age;
+------+---------------------+
| age | avgrate |
+------+---------------------+
| 1 | 3.2941176470588234 |
| 18 | 3.3580246913580245 |
| 25 | 3.436548223350254 |
| 35 | 3.2278481012658227 |
| 45 | 2.8275862068965516 |
| 50 | 3.32 |
| 56 | 3.5 |
+------+---------------------+
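5. For the woman who watches the most movies (the one with the most ratings), find the average rating of the 10 movies she rated highest (viewer, movie title, rating)
(1) Find the woman who watches the most movies (the one with the most ratings)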
select a.uid from
(select uid ,count(*)c from film_view where sex='F' group by uid
order by c desc limit 1)a;
+--------+
| a.uid |
+--------+
| 1150 |
+--------+
(2) Find the 10 movies that she rated highest
select u.uid,r.title,r.rating from film_view r
join
(select a.uid from
(select uid ,count(*)c from film_view where sex='F' group by uid
order by c desc limit 1)a)u
on r.uid = u.uid
order by r.rating desc limit 10;
Rewritten as:
select a.uid,r.title,r.rating from film_view r
join
(select uid ,count(*)c from film_view where sex='F' group by uid
order by c desc limit 1)a
on r.uid = a.uid
order by r.rating desc limit 10;
+--------+----------------------------------------------------+-----------+
| u.uid | r.title | r.rating |
+--------+----------------------------------------------------+-----------+
| 1150 | Close Shave, A (1995) | 5.0 |
| 1150 | Night on Earth (1991) | 5.0 |
| 1150 | Trust (1990) | 5.0 |
| 1150 | Rear Window (1954) | 5.0 |
| 1150 | Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963) | 5.0 |
| 1150 | Being John Malkovich (1999) | 5.0 |
| 1150 | Roger & Me (1989) | 5.0 |
| 1150 | It Happened One Night (1934) | 5.0 |
| 1150 | Crying Game, The (1992) | 5.0 |
| 1150 | Duck Soup (1933) | 5.0 |
+--------+----------------------------------------------------+-----------+
(3) Find the average rating of those 10 movies (viewer, movie title, rating)
--- Large table joined to small table, elapsed time: 188 s
select aa.uid,bb.* from
(select f.title,avg(f.rating)avgrate from film_view f
group by f.title)bb
join
(select u.uid,r.title,r.rating from film_view r
join
(select a.uid from
(select uid ,count(*)c from film_view where sex='F' group by uid
order by c desc limit 1)a)u
on r.uid = u.uid
order by r.rating desc limit 10)aa
on aa.title = bb.title;
+---------+----------------------------------------------------+---------------------+
| aa.uid | bb.title | bb.avgrate |
+---------+----------------------------------------------------+---------------------+
| 1150 | Being John Malkovich (1999) | 4.125390450691656 |
| 1150 | Close Shave, A (1995) | 4.52054794520548 |
| 1150 | Crying Game, The (1992) | 3.7314890154597236 |
| 1150 | Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963) | 4.4498902706656915 |
| 1150 | Duck Soup (1933) | 4.21043771043771 |
| 1150 | It Happened One Night (1934) | 4.280748663101604 |
| 1150 | Night on Earth (1991) | 3.747422680412371 |
| 1150 | Rear Window (1954) | 4.476190476190476 |
| 1150 | Roger & Me (1989) | 4.0739348370927315 |
| 1150 | Trust (1990) | 4.188888888888889 |
+---------+----------------------------------------------------+---------------------+
--- Small table joined to large table, elapsed time: 236 s, same result
select aa.uid,bb.* from
(select u.uid,r.title,r.rating from film_view r
join
(select a.uid from
(select uid ,count(*)c from film_view where sex='F' group by uid
order by c desc limit 1)a)u
on r.uid = u.uid
order by r.rating desc limit 10)aa
join
(select f.title,avg(f.rating)avgrate from film_view f
group by f.title)bb
on aa.title = bb.title;
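6. Find the 10 best movies of the year that has the most good movies (rating >= 4.0)
(1) Extract the release year: every title ends with the year in parentheses, e.g. (1995), so the year is the 4 characters that start 5 characters from the end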
select mid,substring(title,-5,4) from movies limit 5;
+------+-------+
| mid | _c1 |
+------+-------+
| 1 | 1995 |
| 2 | 1995 |
| 3 | 1995 |
| 4 | 1995 |
| 5 | 1995 |
+------+-------+
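The negative-offset call substring(title,-5,4) is easy to misread, so here is a small Python sketch of the same slice. `release_year` mirrors the Hive expression; `release_year_checked` is a hypothetical, stricter variant added only for illustration, which returns None when the title does not end in a parenthesised four-digit year.

```python
import re

# A four-digit year in parentheses at the very end of the title.
_YEAR_AT_END = re.compile(r"\((\d{4})\)$")


def release_year(title: str) -> str:
    """Mimic substring(title, -5, 4): start five characters from the end and
    take four, which skips the closing parenthesis."""
    return title[-5:-1]


def release_year_checked(title: str):
    """Stricter variant: return the year only if the title really ends in '(YYYY)'."""
    match = _YEAR_AT_END.search(title.strip())
    return match.group(1) if match else None


if __name__ == "__main__":
    print(release_year("Jumanji (1995)"))
    print(release_year_checked("Jumanji (1995)"))
    print(release_year_checked("A title without a year"))
```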
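(2) Join the movies and ratings tables
create view moive_6_v as
select r.rating,m.* from ratings r
join
(select mid,title,substring(title,-5,4)year from movies)m
on r.mid = m.mid;
-- preview the same join (the view itself must not carry the limit)
select r.rating,m.* from ratings r
join
(select mid,title,substring(title,-5,4)year from movies)m
on r.mid = m.mid
limit 5;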
+-----------+--------+-----------------------------------------+---------+
| r.rating | m.mid | m.title | m.year |
+-----------+--------+-----------------------------------------+---------+
| 5.0 | 1193 | One Flew Over the Cuckoo's Nest (1975) | 1975 |
| 3.0 | 661 | James and the Giant Peach (1996) | 1996 |
| 3.0 | 914 | My Fair Lady (1964) | 1964 |
| 4.0 | 3408 | Erin Brockovich (2000) | 2000 |
| 5.0 | 2355 | Bug's Life, A (1998) | 1998 |
+-----------+--------+-----------------------------------------+---------+
(3) Find the year with the most movies whose average rating is at least 4
create view moive_6_v_a as
select f.year,f.title,avg(f.rating) avgr from moive_6_v f
group by f.year,f.title;
select m.year,count(*)n from moive_6_v_a m
where m.avgr >= 4
group by m.year
order by n desc
limit 5;
+---------+-----+
| m.year | n |
+---------+-----+
| 1998 | 27 |
| 1995 | 25 |
| 1996 | 24 |
| 1999 | 20 |
| 1994 | 20 |
+---------+-----+
(4) Find the 10 best movies of that year
select rr.title,rr.year,rr.avgrate,rr.cc from
(select mm.title,mm.year,avg(rating)avgrate,count(*)cc
from
(select r.rating,m.* from ratings r
join
(select mid,title,substring(title,-5,4)year from movies)m
on r.mid = m.mid)mm
group by mm.year,mm.title having cc >=50)rr
join
(select m.year,count(*)n from moive_6_v_a m
where m.avgr >= 4
group by m.year
order by n desc
limit 1)yy
on rr.year = yy.year
-- sort in the outer query; a subquery's row order is not guaranteed after the join
order by rr.avgrate desc
limit 10;
+---------------------------------------------+----------+---------------------+--------+
| rr.title | rr.year | rr.avgrate | rr.cc |
+---------------------------------------------+----------+---------------------+--------+
| Saving Private Ryan (1998) | 1998 | 4.337353938937053 | 2653 |
| Celebration, The (Festen) (1998) | 1998 | 4.3076923076923075 | 117 |
| Central Station (Central do Brasil) (1998) | 1998 | 4.283720930232558 | 215 |
| 42 Up (1998) | 1998 | 4.2272727272727275 | 88 |
| American History X (1998) | 1998 | 4.2265625 | 640 |
| Run Lola Run (Lola rennt) (1998) | 1998 | 4.224813432835821 | 1072 |
| Shakespeare in Love (1998) | 1998 | 4.127479949345715 | 2369 |
| After Life (1998) | 1998 | 4.088235294117647 | 102 |
| Get Real (1998) | 1998 | 4.088235294117647 | 68 |
| Elizabeth (1998) | 1998 | 4.029850746268656 | 938 |
+---------------------------------------------+----------+---------------------+--------+
7. Among the movies released in 1997, find the 10 highest-rated Comedy movies
(1) Find the movies released in 1997
select title,rating,genres from film_view
where substring(title,-5,4)=1997
limit 10;
(2) Find the Comedy movies released in 1997
select title,rating,genres from film_view
where substring(title,-5,4)=1997 and
(lcase(genres) like '%comedy%')
limit 10;
+---------------------------------------+---------+------------------------------------------------+
| title | rating | genres |
+---------------------------------------+---------+------------------------------------------------+
| Hercules (1997) | 4.0 | Adventure|Animation|Children's|Comedy|Musical |
| As Good As It Gets (1997) | 5.0 | Comedy|Drama |
| Full Monty, The (1997) | 2.0 | Comedy |
| Beverly Hills Ninja (1997) | 3.0 | Action|Comedy |
| Men in Black (1997) | 3.0 | Action|Adventure|Comedy|Sci-Fi |
| Liar Liar (1997) | 3.0 | Comedy |
| Love and Death on Long Island (1997) | 3.0 | Comedy|Drama |
| Grosse Pointe Blank (1997) | 3.0 | Comedy|Crime |
| Men in Black (1997) | 4.0 | Action|Adventure|Comedy|Sci-Fi |
| Billy's Hollywood Screen Kiss (1997) | 4.0 | Comedy|Romance |
+---------------------------------------+---------+------------------------------------------------+
(3) Take the 10 highest-rated
select mm.* ,f.genres from
(select m.title,avg(m.rating)avgrate,count(*)cc from
(select title,rating,genres from film_view
where substring(title,-5,4)=1997 and
(lcase(genres) like '%comedy%'))m
group by m.title having cc >= 50
order by avgrate desc
limit 10)mm
join movies f
on mm.title = f.title
order by mm.avgrate desc;
+----------------------------------------------------+---------------------+--------+---------------------------------+
| mm.title | mm.avgrate | mm.cc | f.genres |
+----------------------------------------------------+---------------------+--------+---------------------------------+
| Life Is Beautiful (La Vita è bella) (1997) | 4.329861111111111 | 1152 | Comedy|Drama |
| Big One, The (1997) | 4.0 | 102 | Comedy|Documentary |
| As Good As It Gets (1997) | 3.9501404494382024 | 1424 | Comedy|Drama |
| Full Monty, The (1997) | 3.872393661384487 | 1199 | Comedy |
| My Life in Pink (Ma vie en rose) (1997) | 3.825870646766169 | 201 | Comedy|Drama |
| Grosse Pointe Blank (1997) | 3.813380281690141 | 1136 | Comedy|Crime |
| Men in Black (1997) | 3.739952718676123 | 2538 | Action|Adventure|Comedy|Sci-Fi |
| Austin Powers: International Man of Mystery (1997) | 3.7103734439834026 | 1205 | Comedy |
| Billy's Hollywood Screen Kiss (1997) | 3.6710526315789473 | 76 | Comedy|Romance |
| Liar Liar (1997) | 3.5 | 666 | Comedy |
+----------------------------------------------------+---------------------+--------+---------------------------------+
8. For each genre in this rating database, find the 5 highest-rated movies (genre, movie title, average rating)
Difficulty: taking 5 rows for every genre
(1) Explode the movie genres
Create a new movies table:
create table newmovies(mid int, title string,genres array<string>)row format
delimited fields terminated by '\t' collection items terminated by ','stored as textfile;
Insert the data:
insert into table newmovies select mid,title,split(genres,'\\|') from movies;
Explode:
create table nnmovies(mid int, title string, genres string)row
format delimited fields terminated by '\t';
insert into table nnmovies select mid, title, tpf.key from newmovies t
lateral view explode(t.genres) tpf as key;
(Exploding a map column works the same way and yields a key and a value: select id,name, tpf.mykey as key, tpf.myvalue as value
from cdt t lateral view explode(t.piaofang) tpf as mykey, myvalue;)
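What split followed by lateral view explode does to a single movies row can be sketched in Python. The helper name `explode_genres` is made up for illustration; the sample record is the one from movies.dat at the top of these notes.

```python
def explode_genres(mid, title, genres):
    """Turn one (mid, title, 'G1|G2|...') row into one row per genre,
    like split(genres, '\\|') followed by lateral view explode."""
    return [(mid, title, genre) for genre in genres.split("|")]


if __name__ == "__main__":
    for row in explode_genres(2, "Jumanji (1995)", "Adventure|Children's|Fantasy"):
        print(row)
```

Because a movie now appears once per genre, any later join against ratings counts each rating once in every genre the movie belongs to.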
(2) Join the tables into a view
create view film_view3 as
(select r.*,m.title,m.genres
from ratings r
join nnmovies m on r.mid = m.mid);
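(3) The 5 highest-rated movies of each genre (genre, movie title, average rating)
<1> Create a view holding the average rating of each movie within each genre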
create view movie_rate as select a.mid,a.title,a.genres,avg(rating)rate
from film_view3 a group by a.genres,a.mid,a.title;
<2> Use the row_number function to number the rows within each genre
create view movie_rate_order as
select t.*,row_number() over (distribute by genres sort by rate desc) rn
from movie_rate t order by t.genres,t.rate desc;
<3> Use the row number to take the first 5 of each group (only 10 result rows are shown)
select m.* from movie_rate_order m where rn <6
order by m.genres,m.rate desc limit 10;
+--------+----------------------------------------------------+------------+--------------------+-------+
| m.mid | m.title | m.genres | m.rate | m.rn |
+--------+----------------------------------------------------+------------+--------------------+-------+
| 2905 | Sanjuro (1962) | Action | 4.608695652173913 | 1 |
| 2019 | Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) | Action | 4.560509554140127 | 2 |
| 858 | Godfather, The (1972) | Action | 4.524966261808367 | 3 |
| 1198 | Raiders of the Lost Ark (1981) | Action | 4.477724741447892 | 4 |
| 260 | Star Wars: Episode IV - A New Hope (1977) | Action | 4.453694416583082 | 5 |
| 3172 | Ulysses (Ulisse) (1954) | Adventure | 5.0 | 1 |
| 2905 | Sanjuro (1962) | Adventure | 4.608695652173913 | 2 |
| 1198 | Raiders of the Lost Ark (1981) | Adventure | 4.477724741447892 | 3 |
| 260 | Star Wars: Episode IV - A New Hope (1977) | Adventure | 4.453694416583082 | 4 |
| 1204 | Lawrence of Arabia (1962) | Adventure | 4.401925391095066 | 5 |
+--------+----------------------------------------------------+------------+--------------------+-------+
9. The highest-rated movie genre of each year (year, genre, rating)
(1) Create a view that carries the year and the genre
create view movie_y_g as
(select r.*,m.title,m.genres,substring(m.title,-5,4)year
from ratings r
join nnmovies m on r.mid = m.mid);
(2) Create the rating view
create view movie_y_g_r as
select m.year,m.genres,avg(m.rating)rate,count(*)cc from movie_y_g m
group by m.year,m.genres having cc >= 50
order by m.year,rate desc;
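(3) Add row_number to the year and genre combinations. As written, the query distributes by genres, so rn = 1 marks the best year of each genre; to get the best genre of each year, as the task asks, it would need to distribute by year instead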
create view movie_y_g_r_l as
select f.*,row_number() over(distribute by genres sort by rate desc)rn
from movie_y_g_r f order by f.genres,f.rate desc;
(4) Take the first row of each group
select mm.* from movie_y_g_r_l mm
where mm.rn < 2
order by mm.year;
+----------+--------------+---------------------+--------+--------+
| mm.year | mm.genres | mm.rate | mm.cc | mm.rn |
+----------+--------------+---------------------+--------+--------+
| 1927 | Comedy | 4.368932038834951 | 206 | 1 |
| 1931 | Drama | 4.387453874538745 | 271 | 1 |
| 1939 | Children's | 4.182008368200837 | 1912 | 1 |
| 1941 | Film-Noir | 4.395973154362416 | 1043 | 1 |
| 1942 | Romance | 4.412822049131217 | 1669 | 1 |
| 1949 | Mystery | 4.452083333333333 | 480 | 1 |
| 1949 | Thriller | 4.452083333333333 | 480 | 1 |
| 1952 | Musical | 4.2836218375499335 | 751 | 1 |
| 1961 | Western | 4.404651162790698 | 215 | 1 |
| 1962 | Adventure | 4.3997821350762525 | 918 | 1 |
| 1963 | Sci-Fi | 4.334664005322688 | 1503 | 1 |
| 1963 | War | 4.425109064469219 | 2063 | 1 |
| 1972 | Crime | 4.4660907127429805 | 2315 | 1 |
| 1974 | Horror | 4.021985343104597 | 1501 | 1 |
| 1977 | Fantasy | 4.453694416583082 | 2991 | 1 |
| 1977 | Action | 4.303571428571429 | 3584 | 1 |
| 1981 | Documentary | 4.274193548387097 | 62 | 1 |
| 1993 | Animation | 4.0367534456355285 | 1306 | 1 |
+----------+--------------+---------------------+--------+--------+
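The task asks for the top genre within each year, while the row_number call above distributes by genres, which is why the result holds one row per genre. The following Python sketch, on made-up rows, shows how the choice of partition key changes the answer. `best_per_key` is a hypothetical helper, not part of the original solution.

```python
def best_per_key(rows, key_index):
    """rows: (year, genre, rate) tuples.

    Keep the highest-rate row for each distinct value found at key_index,
    i.e. the rn = 1 row of a partition on that column.
    """
    best = {}
    for row in rows:
        key = row[key_index]
        if key not in best or row[2] > best[key][2]:
            best[key] = row
    return sorted(best.values())


if __name__ == "__main__":
    # Made-up aggregates, only to contrast the two partition keys.
    sample = [("1998", "Drama", 4.1), ("1998", "Comedy", 3.9),
              ("1999", "Drama", 4.3), ("1999", "Comedy", 4.0)]
    print("best genre per year:", best_per_key(sample, key_index=0))
    print("best year per genre:", best_per_key(sample, key_index=1))
```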
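10. For each region (zip code), find the title of the highest-rated movie and store the result in HDFS (region, movie title, rating)
(1) Inner-join the ratings, users, and movies tables and save the result as a view for later use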
create view film_view2 as
(select r.*,u.zcode,m.title,m.genres
from ratings r
join users u on r.uid = u.uid
join movies m on r.mid = m.mid);
(2) Compute the average rating by region and movie title
create view movie_z_r as
select m.zcode,m.title,avg(m.rating)rate,count(*)cc
from film_view2 m
group by m.zcode,m.title having cc >= 5
order by m.zcode,rate desc;
(3) Add row numbers
create view movie_z_r_l as
select f.*,row_number() over(distribute by zcode sort by rate desc)rn
from movie_z_r f
order by f.zcode,f.rate desc;
(4) Take the top row of each region
create view movie_z_r_l_m as
select * from movie_z_r_l
where rn < 2
order by zcode;
+----------------------+--------------------------------------------+---------------------+-------------------+-------------------+
| movie_z_r_l_m.zcode | movie_z_r_l_m.title | movie_z_r_l_m.rate | movie_z_r_l_m.cc | movie_z_r_l_m.rn |
+----------------------+--------------------------------------------+---------------------+-------------------+-------------------+
| 01002 | Star Wars: Episode IV - A New Hope (1977) | 4.4 | 5 | 1 |
| 01060 | American Beauty (1999) | 4.8 | 5 | 1 |
| 02115 | Shawshank Redemption, The (1994) | 4.8 | 5 | 1 |
| 02134 | Star Wars: Episode IV - A New Hope (1977) | 4.6 | 5 | 1 |
| 02135 | Princess Bride, The (1987) | 4.6 | 5 | 1 |
+----------------------+--------------------------------------------+---------------------+-------------------+-------------------+
(5) Store the result in HDFS
insert overwrite directory '/hw/hwmovie/' select * from movie_z_r_l_m;