Hive Case Study: Movie Ratings

Three data files are provided:
1. users.dat    Record format: 2::M::56::16::70072
Fields: UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
Field meanings: user ID, gender, age, occupation, zip code

2. movies.dat    Record format: 2::Jumanji (1995)::Adventure|Children's|Fantasy
Fields: MovieID BigInt, Title String, Genres String
Field meanings: movie ID, movie title, genres

3. ratings.dat    Record format: 1::1193::5::978300760
Fields: UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
Field meanings: user ID, movie ID, rating, rating timestamp

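Requirements:

Data requirements:
(1) Write a shell script to clean the data. (Hive's default delimited row format does not support multi-character field delimiters: it can split on ':' but not on '::', so an ordinary delimited table will not work and the data needs a simple cleaning pass first.)
(2) Alternatively, load the raw files in a form Hive can parse directly, for example with RegexSerDe: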
 create table ratings (uid Bigint,mid Bigint,rating double, 
 timestamped string) row format 
 serde 'org.apache.hadoop.hive.serde2.RegexSerDe' with 
 serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)',
 'output.format.string'='%1$s %2$s %3$s %4$s')stored as textfile;
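Requirement (1) asks for a cleaning script, but none is shown. The following is a minimal sketch, assuming that a tab is an acceptable single-character delimiter and that no field contains a tab. `clean_dat` and the paths in the usage comment are hypothetical names, not part of the original exercise.

```shell
#!/bin/sh
# Replace the two-character '::' separator with a single tab so that a table
# declared with "row format delimited fields terminated by '\t'" can read it.
# clean_dat is a hypothetical helper: $1 = input file, $2 = output file.
clean_dat() {
    # FS is more than one character, so awk treats it as the regex '::';
    # a single ':' inside a title (e.g. "Star Wars: Episode IV") is kept.
    awk 'BEGIN { FS = "::"; OFS = "\t" } { $1 = $1; print }' "$1" > "$2"
}

# Possible usage (paths are assumptions):
# for f in users movies ratings; do
#     clean_dat "/home/hadoop/$f.dat" "/home/hadoop/$f.tsv"
# done
```

With files cleaned this way, the tables could be declared with an ordinary delimited row format instead of RegexSerDe.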

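Hive requirements:
1. Create the tables correctly, load the data (three tables, three data files), and verify the result. The queries below refer to the columns as uid, sex, age and zcode in users, mid, title and genres in movies, and uid, mid and rating in ratings.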
load data local inpath '/home/hadoop/ratings.dat' into table ratings;
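One way to support the verification step is to count the records in the raw file, flag malformed lines, and compare the record count with the result of `select count(*)` on the loaded table. This is a sketch only; `check_dat` is a hypothetical helper name.

```shell
#!/bin/sh
# Print "<records> <bad lines>" for a raw '::'-separated file.
# check_dat is a hypothetical helper: $1 = file, $2 = expected field count.
check_dat() {
    awk -F '::' -v want="$2" '
        NF != want { bad++ }                 # line with the wrong number of fields
        END { printf "%d %d\n", NR, bad + 0 }
    ' "$1"
}

# Possible usage (path taken from the load statement above):
# check_dat /home/hadoop/ratings.dat 4
```

If the first number matches the row count reported by Hive and the second number is 0, the load is at least structurally consistent with the source file.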
2. Find the 10 most-rated movies and their rating counts (movie title, rating count)
(1) Find the IDs of the 10 most-rated movies
select mid ,count(*) n from ratings group by mid order by n desc limit 10;
+-------+-------+
|  mid  |   n   |
+-------+-------+
| 2858  | 3428  |
| 260   | 2991  |
| 1196  | 2990  |
| 1210  | 2883  |
| 480   | 2672  |
| 2028  | 2653  |
| 589   | 2649  |
| 2571  | 2590  |
| 1270  | 2583  |
| 593   | 2578  |
+-------+-------+
(2) Look up the movie titles
select b.title,a.n from movies b
join
(select mid ,count(*) n from ratings group by mid 
order by n desc limit 10)a
on b.mid = a.mid;
+----------------------------------------------------+-------+
|                      b.title                       |  a.n  |
+----------------------------------------------------+-------+
| American Beauty (1999)                             | 3428  |
| Star Wars: Episode IV - A New Hope (1977)          | 2991  |
| Star Wars: Episode V - The Empire Strikes Back (1980) | 2990  |
| Star Wars: Episode VI - Return of the Jedi (1983)  | 2883  |
| Jurassic Park (1993)                               | 2672  |
| Saving Private Ryan (1998)                         | 2653  |
| Terminator 2: Judgment Day (1991)                  | 2649  |
| Matrix, The (1999)                                 | 2590  |
| Back to the Future (1985)                          | 2583  |
| Silence of the Lambs, The (1991)                   | 2578  |
+----------------------------------------------------+-------+
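3. Find the 10 highest-rated movies among male viewers and among female viewers (gender, movie title, rating)
Approach 1: compute each gender's top 10 separately and concatenate the two results with UNION ALL
(1) Inner-join the ratings, users and movies tables and create a view for later use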
create view film_view as 
(select r.*,u.sex,u.age,m.title,m.genres 
from ratings r
join users u on r.uid = u.uid
join movies m on r.mid = m.mid);
(2) Find the 10 movies rated highest by men, counting only movies with at least 50 ratings as valid data
select sex,title,avg(rating)r,count(*)n from film_view where sex='M' 
group by title,sex having n >= 50
order by r desc limit 10;
+------+----------------------------------------------------+--------------------+-------+
| sex  |                       title                        |         r          |   n   |
+------+----------------------------------------------------+--------------------+-------+
| M    | Sanjuro (1962)                                     | 4.639344262295082  | 61    |
| M    | Godfather, The (1972)                              | 4.583333333333333  | 1740  |
| M    | Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) | 4.576628352490421  | 522   |
| M    | Shawshank Redemption, The (1994)                   | 4.560625           | 1600  |
| M    | Raiders of the Lost Ark (1981)                     | 4.520597322348094  | 1942  |
| M    | Usual Suspects, The (1995)                         | 4.518248175182482  | 1370  |
| M    | Star Wars: Episode IV - A New Hope (1977)          | 4.495307167235495  | 2344  |
| M    | Schindler's List (1993)                            | 4.49141503848431   | 1689  |
| M    | Paths of Glory (1957)                              | 4.485148514851486  | 202   |
| M    | Wrong Trousers, The (1993)                         | 4.478260869565218  | 644   |
+------+----------------------------------------------------+--------------------+-------+
(3) Find the 10 movies rated highest by women, counting only movies with at least 50 ratings as valid data
select sex,title,avg(rating)r,count(*)n from film_view where sex='F' 
group by title,sex having n >= 50
order by r desc limit 10;

(4) Concatenate the two results

select a.* from
(select sex,title,avg(rating)r,count(*)n from film_view where sex='M' 
group by title,sex having n >= 50
order by r desc limit 10)a
union all
select b.* from
(select sex,title,avg(rating)r,count(*)n from film_view where sex='F' 
group by title,sex having n >= 50
order by r desc limit 10)b; 
+----------+----------------------------------------------------+--------------------+--------+
| _u1.sex  |                     _u1.title                      |       _u1.r        | _u1.n  |
+----------+----------------------------------------------------+--------------------+--------+
| M        | Sanjuro (1962)                                     | 4.639344262295082  | 61     |
| M        | Godfather, The (1972)                              | 4.583333333333333  | 1740   |
| M        | Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) | 4.576628352490421  | 522    |
| M        | Shawshank Redemption, The (1994)                   | 4.560625           | 1600   |
| M        | Raiders of the Lost Ark (1981)                     | 4.520597322348094  | 1942   |
| M        | Usual Suspects, The (1995)                         | 4.518248175182482  | 1370   |
| M        | Star Wars: Episode IV - A New Hope (1977)          | 4.495307167235495  | 2344   |
| M        | Schindler's List (1993)                            | 4.49141503848431   | 1689   |
| M        | Paths of Glory (1957)                              | 4.485148514851486  | 202    |
| M        | Wrong Trousers, The (1993)                         | 4.478260869565218  | 644    |
| F        | Close Shave, A (1995)                              | 4.644444444444445  | 180    |
| F        | Wrong Trousers, The (1993)                         | 4.588235294117647  | 238    |
| F        | Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)      | 4.572649572649572  | 117    |
| F        | Wallace & Gromit: The Best of Aardman Animation (1996) | 4.563106796116505  | 103    |
| F        | Schindler's List (1993)                            | 4.56260162601626   | 615    |
| F        | Shawshank Redemption, The (1994)                   | 4.539074960127592  | 627    |
| F        | Grand Day Out, A (1992)                            | 4.537878787878788  | 132    |
| F        | To Kill a Mockingbird (1962)                       | 4.536666666666667  | 300    |
| F        | Creature Comforts (1990)                           | 4.513888888888889  | 72     |
| F        | Usual Suspects, The (1995)                         | 4.513317191283293  | 413    |
+----------+----------------------------------------------------+--------------------+--------+
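Approach 2: use row_number to take the top 10 of each group
(1) Inner-join the ratings, users and movies tables and create a view for later use (IF NOT EXISTS avoids an error when the view from Approach 1 is already there)
create view if not exists film_view as 
(select r.*,u.sex,u.age,m.title,m.genres 
from ratings r
join users u on r.uid = u.uid
join movies m on r.mid = m.mid);
(2) Compute the average rating per gender and movie, counting only movies with at least 50 ratings as valid data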
create view movie_3_r as
select sex,title,avg(rating)r,count(*)n from film_view
group by title,sex having n >= 50
order by r desc;
(3) Number the rows within each gender group
create view movie_3_r_l as
select m.*,row_number()over(distribute by sex sort by r desc)rn
from movie_3_r m
order by m.sex,m.r desc;
(4) Take the first 10 rows of each group
select * from movie_3_r_l
where rn < 11
order by sex, r desc;
+------------------+----------------------------------------------------+--------------------+----------------+-----------------+
| movie_3_r_l.sex  |                 movie_3_r_l.title                  |   movie_3_r_l.r    | movie_3_r_l.n  | movie_3_r_l.rn  |
+------------------+----------------------------------------------------+--------------------+----------------+-----------------+
| F                | Close Shave, A (1995)                              | 4.644444444444445  | 180            | 1               |
| F                | Wrong Trousers, The (1993)                         | 4.588235294117647  | 238            | 2               |
| F                | Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)      | 4.572649572649572  | 117            | 3               |
| F                | Wallace & Gromit: The Best of Aardman Animation (1996) | 4.563106796116505  | 103            | 4               |
| F                | Schindler's List (1993)                            | 4.56260162601626   | 615            | 5               |
| F                | Shawshank Redemption, The (1994)                   | 4.539074960127592  | 627            | 6               |
| F                | Grand Day Out, A (1992)                            | 4.537878787878788  | 132            | 7               |
| F                | To Kill a Mockingbird (1962)                       | 4.536666666666667  | 300            | 8               |
| F                | Creature Comforts (1990)                           | 4.513888888888889  | 72             | 9               |
| F                | Usual Suspects, The (1995)                         | 4.513317191283293  | 413            | 10              |
| M                | Sanjuro (1962)                                     | 4.639344262295082  | 61             | 1               |
| M                | Godfather, The (1972)                              | 4.583333333333333  | 1740           | 2               |
| M                | Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) | 4.576628352490421  | 522            | 3               |
| M                | Shawshank Redemption, The (1994)                   | 4.560625           | 1600           | 4               |
| M                | Raiders of the Lost Ark (1981)                     | 4.520597322348094  | 1942           | 5               |
| M                | Usual Suspects, The (1995)                         | 4.518248175182482  | 1370           | 6               |
| M                | Star Wars: Episode IV - A New Hope (1977)          | 4.495307167235495  | 2344           | 7               |
| M                | Schindler's List (1993)                            | 4.49141503848431   | 1689           | 8               |
| M                | Paths of Glory (1957)                              | 4.485148514851486  | 202            | 9               |
| M                | Wrong Trousers, The (1993)                         | 4.478260869565218  | 644            | 10              |
+------------------+----------------------------------------------------+--------------------+----------------+-----------------+
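The row_number pattern (number the rows within each group by descending score, then keep rows whose number is at most N) can be checked on a small local sample. The sketch below reproduces the same logic in shell, not in Hive; `top_n_per_group` is a hypothetical helper and the input layout (group, title, score separated by tabs) is an assumption.

```shell
#!/bin/sh
# Keep the N best-scored rows of every group and append the row number.
# stdin: tab-separated "group<TAB>title<TAB>score"; $1 = N.
top_n_per_group() {
    # Sort by group, then by score descending (plays the role of
    # "distribute by sex sort by r desc"), then count rows per group.
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k3,3nr |
        awk -F '\t' -v n="$1" '{ if (++rn[$1] <= n) print $0 "\t" rn[$1] }'
}

# Possible usage on an exported sample (file name is an assumption):
# top_n_per_group 10 < sex_title_avg.tsv
```

Rows with equal scores may come out in either order, which is also true of row_number when the sort key has ties.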

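4. For the movie with movieid = 2116, find the average rating for each age group (the data contains only 7 age codes, so simply group by those 7 values) (age group, rating)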
select age,avg(rating)avgrate from film_view where mid = 2116 
group by age;
+------+---------------------+
| age  |       avgrate       |
+------+---------------------+
| 1    | 3.2941176470588234  |
| 18   | 3.3580246913580245  |
| 25   | 3.436548223350254   |
| 35   | 3.2278481012658227  |
| 45   | 2.8275862068965516  |
| 50   | 3.32                |
| 56   | 3.5                 |
+------+---------------------+
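5. For the woman who watches the most movies (i.e. has rated the most movies), take the 10 movies she rated highest and find their average ratings (viewer, movie title, rating)
(1) Find the woman with the most ratings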
select a.uid from
(select uid ,count(*)c from film_view where sex='F' group by uid 
order by c desc limit 1)a;
+--------+
| a.uid  |
+--------+
| 1150   |
+--------+

(2) Find the 10 movies she rated highest

select u.uid,r.title,r.rating from film_view r
join 
(select a.uid from
(select uid ,count(*)c from film_view where sex='F' group by uid 
order by c desc limit 1)a)u
on r.uid = u.uid
order by r.rating desc limit 10;
This can be simplified to:

select a.uid,r.title,r.rating from film_view r
join 
(select uid ,count(*)c from film_view where sex='F' group by uid 
order by c desc limit 1)a
on r.uid = a.uid
order by r.rating desc limit 10;

+--------+----------------------------------------------------+-----------+
| u.uid  |                      r.title                       | r.rating  |
+--------+----------------------------------------------------+-----------+
| 1150   | Close Shave, A (1995)                              | 5.0       |
| 1150   | Night on Earth (1991)                              | 5.0       |
| 1150   | Trust (1990)                                       | 5.0       |
| 1150   | Rear Window (1954)                                 | 5.0       |
| 1150   | Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963) | 5.0       |
| 1150   | Being John Malkovich (1999)                        | 5.0       |
| 1150   | Roger & Me (1989)                                  | 5.0       |
| 1150   | It Happened One Night (1934)                       | 5.0       |
| 1150   | Crying Game, The (1992)                            | 5.0       |
| 1150   | Duck Soup (1933)                                   | 5.0       |
+--------+----------------------------------------------------+-----------+
(3) Find the average rating of those 10 movies (viewer, movie title, rating)
-- Large table joined to small table: 188 s
select aa.uid,bb.* from
(select f.title,avg(f.rating)avgrate from film_view f
group by f.title)bb
join 
(select u.uid,r.title,r.rating from film_view r
 join 
(select a.uid from
(select uid ,count(*)c from film_view where sex='F' group by uid 
order by c desc limit 1)a)u
on r.uid = u.uid
order by r.rating desc limit 10)aa
on aa.title = bb.title;
+---------+----------------------------------------------------+---------------------+
| aa.uid  |                      bb.title                      |     bb.avgrate      |
+---------+----------------------------------------------------+---------------------+
| 1150    | Being John Malkovich (1999)                        | 4.125390450691656   |
| 1150    | Close Shave, A (1995)                              | 4.52054794520548    |
| 1150    | Crying Game, The (1992)                            | 3.7314890154597236  |
| 1150    | Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963) | 4.4498902706656915  |
| 1150    | Duck Soup (1933)                                   | 4.21043771043771    |
| 1150    | It Happened One Night (1934)                       | 4.280748663101604   |
| 1150    | Night on Earth (1991)                              | 3.747422680412371   |
| 1150    | Rear Window (1954)                                 | 4.476190476190476   |
| 1150    | Roger & Me (1989)                                  | 4.0739348370927315  |
| 1150    | Trust (1990)                                       | 4.188888888888889   |
+---------+----------------------------------------------------+---------------------+
-- Small table joined to large table: 236 s, same result
select aa.uid,bb.* from
(select u.uid,r.title,r.rating from film_view r
 join 
(select a.uid from
(select uid ,count(*)c from film_view where sex='F' group by uid 
order by c desc limit 1)a)u
on r.uid = u.uid
order by r.rating desc limit 10)aa
join 
(select f.title,avg(f.rating)avgrate from film_view f
group by f.title)bb
on aa.title = bb.title;
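6. Find the 10 best movies of the year that has the most good movies (average rating >= 4.0)
(1) Extract the release year. Every title ends with the year in parentheses, e.g. "(1995)", so the four digits start five characters from the end. The sample output below shows only mid and the extracted year.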
select mid,title,substring(title,-5,4)year from movies limit 5;
+------+-------+
| mid  |  _c1  |
+------+-------+
| 1    | 1995  |
| 2    | 1995  |
| 3    | 1995  |
| 4    | 1995  |
| 5    | 1995  |
+------+-------+
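The negative start index in substring(title,-5,4) counts from the end of the string. The same arithmetic can be tried locally as a sketch in shell, assuming one title per line that ends with "(YYYY)"; `extract_year` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the 4 characters that start 5 characters before the end of each line,
# i.e. the same slice as Hive's substring(title, -5, 4).
extract_year() {
    awk '{ print substr($0, length($0) - 4, 4) }'
}

# Possible usage:
# printf 'Jumanji (1995)\n' | extract_year
```

For "Jumanji (1995)" the string has 14 characters, so the slice covers positions 10 to 13, which are the digits between the parentheses.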
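(2) Join movies with ratings and save the result as a view (the view itself must not carry a LIMIT, otherwise it would hold only 5 rows)
create view moive_6_v as
select r.rating,m.* from ratings r
join 
(select mid,title,substring(title,-5,4)year from movies)m
on r.mid = m.mid;
Preview the joined rows:
select r.rating,m.* from ratings r
join 
(select mid,title,substring(title,-5,4)year from movies)m
on r.mid = m.mid
limit 5;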
+-----------+--------+-----------------------------------------+---------+
| r.rating  | m.mid  |                 m.title                 | m.year  |
+-----------+--------+-----------------------------------------+---------+
| 5.0       | 1193   | One Flew Over the Cuckoo's Nest (1975)  | 1975    |
| 3.0       | 661    | James and the Giant Peach (1996)        | 1996    |
| 3.0       | 914    | My Fair Lady (1964)                     | 1964    |
| 4.0       | 3408   | Erin Brockovich (2000)                  | 2000    |
| 5.0       | 2355   | Bug's Life, A (1998)                    | 1998    |
+-----------+--------+-----------------------------------------+---------+
(3) Find the year with the most movies whose average rating is at least 4.0
create view moive_6_v_a as
select f.year,f.title,avg(f.rating) avgr from moive_6_v f
group by f.year,f.title;


select m.year,count(*)n from moive_6_v_a m
where m.avgr >= 4
group by m.year 
order by n desc 
limit 5;


+---------+-----+
| m.year  |  n  |
+---------+-----+
| 1998    | 27  |
| 1995    | 25  |
| 1996    | 24  |
| 1999    | 20  |
| 1994    | 20  |
+---------+-----+
(4) Find the 10 best movies of that year (movies with at least 50 ratings; the sort is applied after the join so that LIMIT keeps the top 10)

select rr.title,rr.year,rr.avgrate,rr.cc from 
(select mm.title,mm.year,avg(rating)avgrate,count(*)cc
from 
(select r.rating,m.* from ratings r
join 
(select mid,title,substring(title,-5,4)year from movies)m
on r.mid = m.mid)mm
group by mm.year,mm.title having cc >=50)rr
join
(select m.year,count(*)n from moive_6_v_a m
where m.avgr >= 4
group by m.year 
order by n desc 
limit 1)yy
on rr.year = yy.year
order by rr.avgrate desc
limit 10;
+---------------------------------------------+----------+---------------------+--------+
|                  rr.title                   | rr.year  |     rr.avgrate      | rr.cc  |
+---------------------------------------------+----------+---------------------+--------+
| Saving Private Ryan (1998)                  | 1998     | 4.337353938937053   | 2653   |
| Celebration, The (Festen) (1998)            | 1998     | 4.3076923076923075  | 117    |
| Central Station (Central do Brasil) (1998)  | 1998     | 4.283720930232558   | 215    |
| 42 Up (1998)                                | 1998     | 4.2272727272727275  | 88     |
| American History X (1998)                   | 1998     | 4.2265625           | 640    |
| Run Lola Run (Lola rennt) (1998)            | 1998     | 4.224813432835821   | 1072   |
| Shakespeare in Love (1998)                  | 1998     | 4.127479949345715   | 2369   |
| After Life (1998)                           | 1998     | 4.088235294117647   | 102    |
| Get Real (1998)                             | 1998     | 4.088235294117647   | 68     |
| Elizabeth (1998)                            | 1998     | 4.029850746268656   | 938    |
+---------------------------------------------+----------+---------------------+--------+
7. Among movies released in 1997, find the 10 highest-rated Comedy movies
(1) Select the ratings of movies released in 1997
select title,rating,genres from film_view
where substring(title,-5,4)=1997
limit 10;
(2) Restrict to 1997 movies in the Comedy genre
select title,rating,genres from film_view
where substring(title,-5,4)=1997 and 
(lcase(genres) like '%comedy%')
limit 10;
+---------------------------------------+---------+------------------------------------------------+
|                 title                 | rating  |                     genres                     |
+---------------------------------------+---------+------------------------------------------------+
| Hercules (1997)                       | 4.0     | Adventure|Animation|Children's|Comedy|Musical  |
| As Good As It Gets (1997)             | 5.0     | Comedy|Drama                                   |
| Full Monty, The (1997)                | 2.0     | Comedy                                         |
| Beverly Hills Ninja (1997)            | 3.0     | Action|Comedy                                  |
| Men in Black (1997)                   | 3.0     | Action|Adventure|Comedy|Sci-Fi                 |
| Liar Liar (1997)                      | 3.0     | Comedy                                         |
| Love and Death on Long Island (1997)  | 3.0     | Comedy|Drama                                   |
| Grosse Pointe Blank (1997)            | 3.0     | Comedy|Crime                                   |
| Men in Black (1997)                   | 4.0     | Action|Adventure|Comedy|Sci-Fi                 |
| Billy's Hollywood Screen Kiss (1997)  | 4.0     | Comedy|Romance                                 |
+---------------------------------------+---------+------------------------------------------------+
(3) Take the 10 with the highest average rating (at least 50 ratings)
select mm.* ,f.genres from
(select m.title,avg(m.rating)avgrate,count(*)cc from 
(select title,rating,genres from film_view
where substring(title,-5,4)=1997 and 
(lcase(genres) like '%comedy%'))m
group by m.title having cc >= 50
order by avgrate desc
limit 10)mm
join movies f
on mm.title = f.title
order by mm.avgrate desc;
+----------------------------------------------------+---------------------+--------+---------------------------------+
|                      mm.title                      |     mm.avgrate      | mm.cc  |            f.genres             |
+----------------------------------------------------+---------------------+--------+---------------------------------+
| Life Is Beautiful (La Vita è bella) (1997)         | 4.329861111111111   | 1152   | Comedy|Drama                    |
| Big One, The (1997)                                | 4.0                 | 102    | Comedy|Documentary              |
| As Good As It Gets (1997)                          | 3.9501404494382024  | 1424   | Comedy|Drama                    |
| Full Monty, The (1997)                             | 3.872393661384487   | 1199   | Comedy                          |
| My Life in Pink (Ma vie en rose) (1997)            | 3.825870646766169   | 201    | Comedy|Drama                    |
| Grosse Pointe Blank (1997)                         | 3.813380281690141   | 1136   | Comedy|Crime                    |
| Men in Black (1997)                                | 3.739952718676123   | 2538   | Action|Adventure|Comedy|Sci-Fi  |
| Austin Powers: International Man of Mystery (1997) | 3.7103734439834026  | 1205   | Comedy                          |
| Billy's Hollywood Screen Kiss (1997)               | 3.6710526315789473  | 76     | Comedy|Romance                  |
| Liar Liar (1997)                                   | 3.5                 | 666    | Comedy                          |
+----------------------------------------------------+---------------------+--------+---------------------------------+

8. Find the 5 highest-rated movies in each genre (genre, movie title, average rating)
Difficulty: taking the top 5 within every genre
(1) Explode the genres column into one row per genre
Create a new movies table whose genres column is an array:
create table newmovies(mid int, title string,genres array<string>)row format 
delimited fields terminated by '\t' collection items terminated by ','stored as textfile;
Insert the data:
 insert into table newmovies select mid,title,split(genres,'\\|') from movies;
Explode:
 create table nnmovies(mid int, title string, genres string)row 
 format delimited fields terminated by '\t';

insert into table nnmovies select mid, title, tpf.key from newmovies t 
lateral view explode(t.genres) tpf as key;
(For a map column, explode yields a key and a value: select id,name, tpf.mykey as key, tpf.myvalue as value 
from cdt t lateral view explode(t.piaofang) tpf as mykey, myvalue;)
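The effect of lateral view explode on the genres column (one output row per genre, with mid and title repeated) can be previewed on a few lines outside Hive. The sketch below is in shell and assumes tab-separated input with the '|'-joined genres in the third field; `explode_genres` is a hypothetical helper name.

```shell
#!/bin/sh
# Turn "mid<TAB>title<TAB>g1|g2|g3" into one row per genre.
explode_genres() {
    awk 'BEGIN { FS = OFS = "\t" }
         {
             # "[|]" is a bracket expression, so the pipe is matched literally
             n = split($3, g, "[|]")
             for (i = 1; i <= n; i++) print $1, $2, g[i]
         }'
}

# Possible usage:
# printf "2\tJumanji (1995)\tAdventure|Children's|Fantasy\n" | explode_genres
```

A movie with three genres therefore contributes three rows, which is why the same title can appear under several genres in the results below.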
(2) Join with ratings to form a view
create view film_view3 as 
(select r.*,m.title,m.genres 
from ratings r
join nnmovies m on r.mid = m.mid);
(3) Find the 5 highest-rated movies in each genre (genre, movie title, average rating)
	<1> Create a view with the average rating of each movie within each genre
	create view movie_rate as select a.mid,a.title,a.genres,avg(rating)rate 
	from film_view3 a group by a.genres,a.mid,a.title;
	<2> Use row_number to number the movies within each genre
	create view movie_rate_order as
	select t.*,row_number() over (distribute by genres sort by rate desc) rn 
	from movie_rate t order by t.genres,t.rate desc;
	<3> Use the per-group number to take the top 5 (only the first 10 result rows are shown)
	select m.* from movie_rate_order m where rn <6 
	order by m.genres,m.rate desc limit 10;
+--------+----------------------------------------------------+------------+--------------------+-------+
| m.mid  |                      m.title                       |  m.genres  |       m.rate       | m.rn  |
+--------+----------------------------------------------------+------------+--------------------+-------+
| 2905   | Sanjuro (1962)                                     | Action     | 4.608695652173913  | 1     |
| 2019   | Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) | Action     | 4.560509554140127  | 2     |
| 858    | Godfather, The (1972)                              | Action     | 4.524966261808367  | 3     |
| 1198   | Raiders of the Lost Ark (1981)                     | Action     | 4.477724741447892  | 4     |
| 260    | Star Wars: Episode IV - A New Hope (1977)          | Action     | 4.453694416583082  | 5     |
| 3172   | Ulysses (Ulisse) (1954)                            | Adventure  | 5.0                | 1     |
| 2905   | Sanjuro (1962)                                     | Adventure  | 4.608695652173913  | 2     |
| 1198   | Raiders of the Lost Ark (1981)                     | Adventure  | 4.477724741447892  | 3     |
| 260    | Star Wars: Episode IV - A New Hope (1977)          | Adventure  | 4.453694416583082  | 4     |
| 1204   | Lawrence of Arabia (1962)                          | Adventure  | 4.401925391095066  | 5     |
+--------+----------------------------------------------------+------------+--------------------+-------+

9. Find the highest-rated genre for each year (year, genre, rating)
(1) Create a view that carries the year and the genre
create view movie_y_g as 
(select r.*,m.title,m.genres,substring(m.title,-5,4)year 
from ratings r
join nnmovies m on r.mid = m.mid);
(2) Create the rating view (average rating per year and genre, at least 50 ratings)
create view movie_y_g_r as
select m.year,m.genres,avg(m.rating)rate,count(*)cc from movie_y_g m
group by m.year,m.genres having cc >= 50
order by m.year,rate desc;
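(3) Add a row_number. As written, the window partitions by genres, so rn = 1 marks the best-rated year within each genre, which is what the output below shows; to get the best genre within each year, as the question asks, partition by year instead (distribute by year).
create view movie_y_g_r_l as
select f.*,row_number() over(distribute by genres sort by rate desc)rn
from movie_y_g_r f order by f.genres,f.rate desc;
(4) Take the first row of each group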
select mm.* from movie_y_g_r_l mm
where mm.rn < 2
order by mm.year;
+----------+--------------+---------------------+--------+--------+
| mm.year  |  mm.genres   |       mm.rate       | mm.cc  | mm.rn  |
+----------+--------------+---------------------+--------+--------+
| 1927     | Comedy       | 4.368932038834951   | 206    | 1      |
| 1931     | Drama        | 4.387453874538745   | 271    | 1      |
| 1939     | Children's   | 4.182008368200837   | 1912   | 1      |
| 1941     | Film-Noir    | 4.395973154362416   | 1043   | 1      |
| 1942     | Romance      | 4.412822049131217   | 1669   | 1      |
| 1949     | Mystery      | 4.452083333333333   | 480    | 1      |
| 1949     | Thriller     | 4.452083333333333   | 480    | 1      |
| 1952     | Musical      | 4.2836218375499335  | 751    | 1      |
| 1961     | Western      | 4.404651162790698   | 215    | 1      |
| 1962     | Adventure    | 4.3997821350762525  | 918    | 1      |
| 1963     | Sci-Fi       | 4.334664005322688   | 1503   | 1      |
| 1963     | War          | 4.425109064469219   | 2063   | 1      |
| 1972     | Crime        | 4.4660907127429805  | 2315   | 1      |
| 1974     | Horror       | 4.021985343104597   | 1501   | 1      |
| 1977     | Fantasy      | 4.453694416583082   | 2991   | 1      |
| 1977     | Action       | 4.303571428571429   | 3584   | 1      |
| 1981     | Documentary  | 4.274193548387097   | 62     | 1      |
| 1993     | Animation    | 4.0367534456355285  | 1306   | 1      |
+----------+--------------+---------------------+--------+--------+

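10. Find the highest-rated movie for each region (zip code) and store the result in HDFS (region, movie title, rating)
(1) Inner-join the ratings, users and movies tables and create a view for later use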
create view film_view2 as 
(select r.*,u.zcode,m.title,m.genres 
from ratings r
join users u on r.uid = u.uid
join movies m on r.mid = m.mid);
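(2) Compute the average rating per region and movie title (keeping only movies with at least 5 ratings in that region)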
create view movie_z_r as
select m.zcode,m.title,avg(m.rating)rate,count(*)cc 
from film_view2 m
group by m.zcode,m.title having cc >= 5
order by m.zcode,rate desc;
(3) Add row numbers
create view movie_z_r_l as
select f.*,row_number() over(distribute by zcode sort by rate desc)rn
from movie_z_r f 
order by f.zcode,f.rate desc;
(4) Take the top row of each region
create view movie_z_r_l_m as
select * from movie_z_r_l
where rn < 2
order by zcode;
+----------------------+--------------------------------------------+---------------------+-------------------+-------------------+
| movie_z_r_l_m.zcode  |            movie_z_r_l_m.title             | movie_z_r_l_m.rate  | movie_z_r_l_m.cc  | movie_z_r_l_m.rn  |
+----------------------+--------------------------------------------+---------------------+-------------------+-------------------+
| 01002                | Star Wars: Episode IV - A New Hope (1977)  | 4.4                 | 5                 | 1                 |
| 01060                | American Beauty (1999)                     | 4.8                 | 5                 | 1                 |
| 02115                | Shawshank Redemption, The (1994)           | 4.8                 | 5                 | 1                 |
| 02134                | Star Wars: Episode IV - A New Hope (1977)  | 4.6                 | 5                 | 1                 |
| 02135                | Princess Bride, The (1987)                 | 4.6                 | 5                 | 1                 |
+----------------------+--------------------------------------------+---------------------+-------------------+-------------------+
(5) Store the result in HDFS
insert overwrite directory '/hw/hwmovie/' select * from  movie_z_r_l_m;







