We have the following three data files:
1. users.dat, record format: 2::M::56::16::70072
Fields: UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
Field meanings: user ID, gender, age, occupation, zip code
2. movies.dat, record format: 2::Jumanji(1995)::Adventure|Children's|Fantasy
Fields: MovieID BigInt, Title String, Genres String
Field meanings: movie ID, movie title, movie genres
3. ratings.dat, record format: 1::1193::5::978300760
Fields: UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
Field meanings: user ID, movie ID, rating, rating timestamp
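Task requirements:
Data requirements:
(1) Write a shell script to clean the data. (Hive's default delimited row format does not parse multi-character delimiters: it can split on ':' but not on '::'. Creating the tables the ordinary way therefore does not work, so the data needs a simple cleaning pass first.)
(2) Load the data in a form that Hive can parse.
Hive requirements:
(1) Create the tables correctly, load the data (three tables, three data files), and verify the result.
(2) Find the 10 movies with the most ratings, with their rating counts (movie title, rating count).
(3) Find the 10 highest-rated movies among men and among women separately (gender, movie title, rating).
(4) For the movie with movieid = 2116, find the average rating per age group (there are only 7 distinct age values, so group by those 7) (age group, rating).
(5) For the woman who watches the most movies (the most ratings), find the average rating of the 10 movies she rated highest (viewer, movie title, rating).
(6) For the year with the most good movies (rating >= 4.0), find that year's 10 best movies.
(7) Among movies released in 1997, find the 10 highest-rated Comedy movies.
(8) For each genre in this rating database, find the 5 highest-rated movies (genre, movie title, average rating).
(9) For each year, find the highest-rated genre (year, genre, rating).
(10) For each region, find the highest-rated movie and store the result in HDFS (region, movie title, rating).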
[hadoop@hadoop02 movierating]$ sed -i 's/::/,/g' users.dat
[hadoop@hadoop02 movierating]$ sed -i 's/::/,/g' movies.dat
[hadoop@hadoop02 movierating]$ sed -i 's/::/,/g' ratings.dat
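The same cleaning step can be sketched in Python. This is only an illustration, not part of the original workflow: `clean_line` is a hypothetical helper, and the optional tab delimiter is an assumption that may be worth considering if a movie title itself contains a comma, since that would shift the columns of a comma-delimited table.

```python
def clean_line(line, old="::", new=","):
    """Replace the multi-character delimiter with a single-character one."""
    return line.rstrip("\n").replace(old, new)


sample = "1::1193::5::978300760"
print(clean_line(sample))            # same substitution as the sed command
print(clean_line(sample, new="\t"))  # alternative delimiter (assumption)
```

If a delimiter other than the comma is used, the `fields terminated by` clause of the tables has to be changed to match.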
Table creation statements:
create table ratings(UserID BigInt, MovieID BigInt, Rating Double, Timestamped String) row format delimited fields terminated by "," location "/hive/movie/ratings";
create table movies(MovieID BigInt, Title String, Genres String) row format delimited fields terminated by "," location "/hive/movie/movies";
create table users(UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String) row format delimited fields terminated by "," location "/hive/movie/user";
create table top10 as select MovieId, count(*) cc from ratings group by MovieId order by cc desc limit 10;
First, create a temporary table holding the 10 most-rated movies.
select t.MovieId, t.cc, m.Title from top10 t, movies m where m.MovieId=t.MovieId;
Then join with the movies table to look up the movie titles.
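The count-then-join logic of these two statements can be sketched in Python on a few made-up rows. The sample data and the helper name `most_rated` are assumptions for illustration only.

```python
from collections import Counter

# Made-up sample rows: (user_id, movie_id, rating). Not taken from the real files.
ratings = [(1, 10, 5.0), (2, 10, 4.0), (3, 20, 3.0),
           (1, 20, 4.0), (2, 30, 2.0), (3, 10, 1.0)]
movies = {10: "Movie A", 20: "Movie B", 30: "Movie C"}  # movie_id -> title


def most_rated(ratings, movies, n=10):
    """Count ratings per movie, keep the n largest counts, then attach titles."""
    counts = Counter(movie_id for _, movie_id, _ in ratings)
    return [(movies[movie_id], cc) for movie_id, cc in counts.most_common(n)]


print(most_rated(ratings, movies, n=2))
```

Aggregating before joining, as the temporary table does, keeps the join small: only the 10 surviving rows are matched against movies.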
Final data layout:
Gender MovieId Rating
M
create table newrating as select * from ratings limit 1000;
Select u.Gender, n.MovieID,avg(n.Rating) avr
from newrating n
left join users u on u.UserId=n.UserId
where u.Gender='M'
group by n.MovieId ,u.Gender
order by avr desc
limit 10;
// Merge the male and female top-10 results into one table
create table MF10 as (select * from m10 union select * from f10);
Then join with the movies table to get the movie titles.
select u.gender,m.title,u.avr from MF10 u,movies m where m.movieid=u.movieid;
Result:
Movieid Age rating
2116 1 4.0
Select avg(r.rating),u.age
From ratings r
Left join users u
On r.UserId=u.UserId
Where r.MovieId=2116
Group by u.age
Order by u.age;
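Requirement (4) asks for the average rating of movieid = 2116 per age group. A minimal Python sketch of that filter, join and group-by follows; the sample rows and the helper name `avg_by_age` are made up for illustration.

```python
from collections import defaultdict

# Made-up sample data, not taken from the real files.
users = {1: 18, 2: 25, 3: 18}  # user_id -> age value
ratings = [(1, 2116, 4.0), (2, 2116, 3.0), (3, 2116, 5.0), (1, 1193, 1.0)]


def avg_by_age(ratings, users, movie_id):
    """Average the ratings of one movie, grouped by the rater's age value."""
    buckets = defaultdict(list)
    for user_id, mid, rating in ratings:
        if mid == movie_id:  # keep only the requested movie
            buckets[users[user_id]].append(rating)
    return {age: sum(vals) / len(vals) for age, vals in sorted(buckets.items())}


print(avg_by_age(ratings, users, 2116))
```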
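5. For the woman who watches the most movies (the most ratings), find the average rating of the 10 movies she rated highest (viewer, movie title, rating)
Select r.UserId, count(*) c
From ratings r
Left join users u
On r.UserId=u.UserId
Where u.Gender='F'
Group by r.UserId
Order by c desc
Limit 1;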
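// First find the woman with the most ratings: UserId 1150, with 1302 ratings
Create table g10m as Select userId, movieId, avg(rating) av
From ratings
Where UserId=1150
Group by movieId, userid
Order by av desc
Limit 10;
// Saved as the temporary table g10m
// This gives the IDs of her 10 highest-rated movies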
Select g.userid, r.movieid, avg(r.rating) avg
from g10m g
left join ratings r
on g.movieid=r.movieid
group by r.movieid, g.userid;
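First, write a Python script that converts the timestamp into a year.
#!/usr/bin/env python
import sys
import time

# Read comma-separated rating rows from stdin and replace the timestamp with its year
for line in sys.stdin:
    line = line.strip()
    UserID, MovieID, Rating, Timestamped = line.split(',')
    # localtime() needs a number, so convert the string field first
    timeArray = time.localtime(int(Timestamped))
    year = time.strftime("%Y", timeArray)
    print(','.join([UserID, MovieID, Rating, str(year)]))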
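Because the script uses local time, a rating made close to midnight on December 31 can land in a different year depending on the server's time zone. The sketch below pins the conversion to UTC instead; that choice is an assumption, as are the helper names `rating_year` and `convert_line`.

```python
from datetime import datetime, timezone


def rating_year(timestamp):
    """Return the calendar year (UTC) of a Unix timestamp given as str or int."""
    return datetime.fromtimestamp(int(timestamp), tz=timezone.utc).year


def convert_line(line):
    """Turn 'user,movie,rating,timestamp' into 'user,movie,rating,year'."""
    user_id, movie_id, rating, ts = line.strip().split(",")
    return ",".join([user_id, movie_id, rating, str(rating_year(ts))])


print(convert_line("1,1193,5,978300760"))
```

The sample timestamp 978300760 falls on 2000-12-31 in UTC, so it is exactly the kind of row whose year changes with the time zone.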
Then create a temporary table to hold the new data.
Tables used:
2. movies.dat, record format: 2::Jumanji(1995)::Adventure|Children's|Fantasy
Fields: MovieID BigInt, Title String, Genres String
Field meanings: movie ID, movie title, movie genres
3. ratings.dat, record format: 1::1193::5::978300760
Fields: UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
Field meanings: user ID, movie ID, rating, rating timestamp
create table t_bi_reg(id string,name string)
Create a table:
Create table movies2(MovieID string, Title String, type string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s');
Important: with this SerDe, every column must be declared as string.
Explode the genres into separate rows:
select movieid,title,type1 from movies2 lateral view explode(split(type,"\\|")) mytable as type1;
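What the explode step does to a single movie row can be sketched in Python: one input row becomes one output row per genre. The helper name `explode_genres` is hypothetical.

```python
def explode_genres(movie_id, title, genres):
    """Split the '|'-separated genre list and emit one row per genre."""
    return [(movie_id, title, genre) for genre in genres.split("|")]


rows = explode_genres("2", "Jumanji(1995)", "Adventure|Children's|Fantasy")
for row in rows:
    print(row)
```

In the Hive query the pipe is written as "\\|" because split takes a regular expression, where an unescaped | means alternation; Python's str.split takes a literal string, so no escaping is needed here.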
Newer Hive versions provide window functions (such as row_number()) that can do per-group top-N ranking.
Join with the ratings table:
userid bigint
movieid bigint
rating double
timestamped string
typemovie
movieid string
title string
type1 string
select a.t1,a.t2,a.avf from (select *,row_number() over(partition by t1 order by avf desc) as ro from toptype) a where ro<=10;
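The per-group ranking performed by row_number() can be sketched in Python as a sort followed by a per-group slice. The column meanings (t1 as genre, t2 as title, avf as average rating) are an assumption, since the definition of toptype is not shown, and the sample rows are made up.

```python
from itertools import groupby
from operator import itemgetter


def top_n_per_group(rows, n):
    """Emulate row_number() over (partition by t1 order by avf desc) <= n.

    rows are (t1, t2, avf) tuples.
    """
    # Sort by partition key, then by score descending within each partition.
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    result = []
    for _, group in groupby(ordered, key=itemgetter(0)):
        result.extend(list(group)[:n])  # keep the first n rows of each partition
    return result


sample = [("Comedy", "A", 4.5), ("Comedy", "B", 3.0), ("Comedy", "C", 4.8),
          ("Drama", "D", 4.0), ("Drama", "E", 4.2)]
print(top_n_per_group(sample, 2))
```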
Approach: tables needed
Ratings
userid bigint
movieid bigint
rating double
timestamped string
movies
movieid bigint
title string
genres string
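First, create a table that stores the release year:
create table yearmovie as select *,substr(title,-5,4) as year from movies limit 20;
The year is extracted by taking a substring of the title.
Data layout: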
12 Dracula: Dead and Loving It (1995) Comedy|Horror 1995
13 Balto (1995) Animation|Children's 1995
14 Nixon (1995) Drama 1995
15 Cutthroat Island(1995) Action|Adventure|Romance 1995
16 Casino (1995) Drama|Thriller 1995
17 Sense and Sensibility (1995) Drama|Romance 1995
18 Four Rooms (1995) Thriller 1995
19 Ace Ventura:When Nature Calls (1995) Comedy 1995
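The substr(title,-5,4) extraction used for the year column can be sketched in Python: start five characters from the end and keep four. The helper name `release_year` is hypothetical, and the sketch assumes, as the sample rows suggest, that every title ends with "(YYYY)" and has no trailing whitespace.

```python
def release_year(title):
    """Mirror substr(title, -5, 4): take the last five characters, keep the first four."""
    return title[-5:][:4]


for title in ["Jumanji(1995)", "Nixon (1995)", "Cutthroat Island(1995)"]:
    print(title, "->", release_year(title))
```

A title that does not end in a closing parenthesis would produce a wrong year, so it may be worth checking that the result is four digits before relying on it.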
Year Genre Rating
Next, explode the genres.