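Data source
The dataset contains 1,000,209 anonymous ratings of roughly 3,900 movies, submitted by 6,040 users.
Data description
t_movie
Line format: MovieID::Title (Year)::Genre1|Genre2…
Columns: movieid moviename movietype
t_rating
Line format: UserID::MovieID::Rating::Timestamp
Columns: userid movieid rate String
t_user
Line format: UserID::Gender::Age::Occupation::Zip-code
Columns: userid sex age occupation zipcode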
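1. Question: find the highest-rated movie genre of each year (year, genre, rating)
2. t_movie supplies (movie id, year, genre)
3. t_rating supplies (movie id, rating)
4. "Highest-rated": average the ratings within the same genre and the same year
5. "Highest": sort with row_number and keep the first row
First reshape the two tables into rows of this form:
(year, genre, rating)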
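-- Step 1: build (year, type, rate) rows, one row per rating and genre
with tmp as (
    select year, type, rate
    from (
        select b.year, b.movietype, a.rate
        from (select movieid, rate from t_rating) as a
        left join
             (select movieid, moviename, movietype,
                     substr(moviename, -5, 4) as year   -- "Title (1995)" -> "1995"
              from t_movie) as b
        on a.movieid = b.movieid
    ) t1
    -- table alias g added: Hive's LATERAL VIEW syntax expects one before AS
    lateral view explode(split(movietype, "\\|")) g as type
)
-- Step 2: average per (year, type), then keep the top-ranked genre of each year
select year, type, avg_rate
from (
    select year, type, avg_rate,
           row_number() over (partition by year order by avg_rate desc) as rn
    from (
        select year, type, avg(rate) as avg_rate
        from tmp
        group by year, type
    ) t2
) t3
where rn = 1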
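The query above derives the year with substr(moviename, -5, 4), which relies on every title ending in "(YYYY)" with nothing after the closing parenthesis. The sketch below checks that idiom on two illustrative titles and shows a regexp_extract variant that tolerates trailing spaces. The sample rows, the column alias year_regex, and the regex itself are assumptions for illustration, and the escaping assumes Spark's default string-literal parsing.

```sql
-- Illustrative titles in the "Title (Year)" layout of t_movie; not read from the real table.
select moviename,
       -- negative start: begin 5 characters from the end, then take 4 characters
       substr(moviename, -5, 4) as year_substr,
       -- assumed alternative: capture the 4 digits inside the final parentheses
       regexp_extract(moviename, '\\((\\d{4})\\)\\s*$', 1) as year_regex
from values
    ('Toy Story (1995)'),
    ('Titanic (1997)')
    as t(moviename);
-- Toy Story (1995)  ->  year_substr = 1995, year_regex = 1995
-- Titanic (1997)    ->  year_substr = 1997, year_regex = 1997
```

If a title carried a trailing space, substr would return "997)" style fragments while the regex variant would still return the year, so it is worth checking the raw data before choosing one.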
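Key points
Note on the escape character \: split needs two backslashes ("\\|"), because the SQL string parser consumes one and the regular expression engine consumes the other.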
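A small sketch of the escaping rule, assuming Spark's default string-literal parsing. The genre string is an illustrative value, and the character-class form '[|]' is an alternative not used in the original queries.

```sql
-- split takes a regular expression, and | is the regex alternation operator.
select
    -- SQL literal '\\|' becomes the regex \| , which matches a literal pipe
    split('Animation|Comedy|Drama', '\\|') as escaped_pipe,
    -- a character class matches the pipe literally and needs no backslash
    split('Animation|Comedy|Drama', '[|]')  as char_class;
-- both columns: ["Animation","Comedy","Drama"]
```

An unescaped '|' would be read as an alternation of two empty patterns, which matches between every pair of characters, so the string would be cut into single characters instead of genres.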
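1. t_movie supplies (movie id, movie title, genre)
2. t_rating supplies (movie id, rating)
3. Reshape the two tables into (genre, movie title, average rating)
Pick the 5 movies with the highest average rating in each genre. Every genre contributes its own top 5, so number the rows within each genre with row_number and keep those numbered 1 to 5.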
select type, moviename
from (
    select type, moviename,
           -- number the movies inside each genre, best average first
           row_number() over (partition by type order by avg_rate desc) as rn
    from (
        select b.type, b.moviename, avg(a.rate) as avg_rate
        from (select movieid, rate from t_rating) as a
        left join
             (select movieid, moviename, type
              from t_movie
              -- table alias g added: Hive's LATERAL VIEW syntax expects one before AS
              lateral view explode(split(movietype, "\\|")) g as type) as b
        on a.movieid = b.movieid
        group by b.type, b.moviename
    ) t1
) t2
where rn <= 5
Key points
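The top-5 query numbers rows with row_number, which always yields exactly 5 rows per genre, even when movies tie on the average. The sketch below contrasts it with rank and dense_rank on made-up rows (the movie names and averages are placeholders, not values from the dataset).

```sql
select type, moviename, avg_rate,
       row_number() over (partition by type order by avg_rate desc) as rn,
       rank()       over (partition by type order by avg_rate desc) as rk,
       dense_rank() over (partition by type order by avg_rate desc) as drk
from values
    ('Comedy', 'movie_a', 4.5),
    ('Comedy', 'movie_b', 4.5),
    ('Comedy', 'movie_c', 4.0),
    ('Drama',  'movie_d', 3.0)
    as t(type, moviename, avg_rate);
-- movie_a and movie_b tie: rn is 1 and 2 (in either order), rk = 1 for both, drk = 1 for both
-- movie_c: rn = 3, rk = 3 (rank skips 2 after a tie), drk = 2
-- movie_d: rn = 1, rk = 1, drk = 1 (numbering restarts in each partition)
```

Filtering on rn <= 5 cuts ties arbitrarily, while filtering on rk <= 5 or drk <= 5 can return more than 5 rows per genre. Which behaviour is wanted depends on how the question treats ties.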
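1. t_movie supplies (movie id, year, genre)
2. t_rating supplies (movie id, rating)
3. Keep only the Comedy movies from 1997
4. Highest-rated: sort by average rating in descending order and take the top 10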
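select a.moviename, a.movietype, avg(b.rate) as avg_rating
from (select movieid, moviename, movietype,
             substr(moviename, -5, 4) as year
      from t_movie) a
left join
     (select movieid, rate from t_rating) b
on a.movieid = b.movieid
-- filters on the left table belong in WHERE; inside ON a left join would keep every movie
where a.year = "1997"
  -- movietype holds values such as "Comedy|Romance", so match with wildcards
  and a.movietype like "%Comedy%"
group by a.moviename, a.movietype
order by avg_rating desc
limit 10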
Key points
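When movies are left-joined to ratings and also restricted by year, it matters whether the restriction is written in ON or in WHERE. A minimal sketch on made-up rows (the CTE names m and r and their values are placeholders for illustration):

```sql
-- Restriction inside ON: it only decides which ratings match; no movie is dropped.
with m as (select * from values (1, '1997'), (2, '1998') as t(movieid, year)),
     r as (select * from values (1, 5), (2, 3) as t(movieid, rate))
select m.movieid, m.year, r.rate
from m
left join r
  on m.movieid = r.movieid and m.year = '1997';
-- 2 rows: (1, 1997, 5) and (2, 1998, NULL)

-- Restriction inside WHERE: applied after the join, so the 1998 movie is removed.
with m as (select * from values (1, '1997'), (2, '1998') as t(movieid, year)),
     r as (select * from values (1, 5), (2, 3) as t(movieid, rate))
select m.movieid, m.year, r.rate
from m
left join r
  on m.movieid = r.movieid
where m.year = '1997';
-- 1 row: (1, 1997, 5)
```

With the ON form, an aggregate such as avg(rate) is NULL for the unmatched movies. They only stay out of a top-N result if NULLs happen to sort last, which is why the WHERE form is the safer choice.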
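1. Find the year with the most good movies (rating >= 4.0)
2. Find the 10 best movies of that year (top 10 by average rating)
Year with the most good movies (rating >= 4.0): 1999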
-- Step 1: the year with the most ratings of 4.0 or higher
with t1 as (
    select year
    from (
        select a.rate, b.year
        from (select movieid, rate from t_rating) as a
        left join
             (select movieid, moviename, substr(moviename, -5, 4) as year
              from t_movie) as b
        on a.movieid = b.movieid
    ) t
    where rate >= 4.0          -- counts individual ratings, not distinct movies
    group by year
    order by count(*) desc
    limit 1
)
select * from t1
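-- Step 2: top 10 movies of that year by average rating
select moviename, avg(rate) as avg_rate
from (
    select b.moviename, b.year, a.rate
    from (select movieid, rate from t_rating) as a
    left join
         (select movieid, moviename, substr(moviename, -5, 4) as year
          from t_movie) as b
    on a.movieid = b.movieid
) t
where year = "1999"            -- result of step 1
group by moviename
order by avg_rate desc
limit 10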
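The year 1999 is typed by hand into the second query. One possible way to chain the two steps in a single statement is sketched below. The CTE names joined and best_year are hypothetical, and an inner join is used instead of the left join, on the assumption that every rating refers to a movie present in t_movie.

```sql
with joined as (
    -- one row per rating, with the movie title and the year parsed from it
    select b.moviename, b.year, a.rate
    from t_rating a
    join (select movieid, moviename, substr(moviename, -5, 4) as year
          from t_movie) b
      on a.movieid = b.movieid
),
best_year as (
    -- the year that collected the most ratings of 4.0 or higher
    select year
    from joined
    where rate >= 4.0
    group by year
    order by count(*) desc
    limit 1
)
select j.moviename, avg(j.rate) as avg_rate
from joined j
join best_year y
  on j.year = y.year
group by j.moviename
order by avg_rate desc
limit 10;
```

This keeps the result in step with the data: if the ratings change and another year takes the lead, the top 10 follows without editing the query.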
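1. Equi-join (left join)
The SQL statements above were written for Spark. The topics covered in this post are turning column values into rows (lateral view), the rank functions, window functions, and non-equi joins (range matching).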