Movie Rating Analysis: A Big Data Project

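[Experiment Data]
This experiment uses movie rating data from Netflix, the US online video provider, covering October 1998 through December 2005. It contains 100,480,507 ratings given by 480,189 users to 17,770 films.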

The data consists of two datasets.

  1. Movie ratings
    The dataset is stored on HDFS at "/data/13/5/rating/rating.csv", with fields separated by tabs. It is also available in Hive as the table "bigdata_cases.movie_rating".

The fields are defined as follows:

Field	Definition
MovieId	Movie identifier
UserId	User identifier
Grade	User rating; one of 1, 2, 3, 4 or 5
RatingTime	Date of the rating

The first 5 rows of the dataset are:

student1@master:~$ hdfs dfs -cat /data/13/5/rating/rating.csv | head -5
1   1488844 3   2005-09-06
1   822109  5   2005-05-13
1   885013  4   2005-10-19
1   30878   4   2005-12-26
1   823519  3   2004-05-03
  2. Movie information
    The dataset is stored on HDFS at "/data/13/5/info/info.csv", with fields separated by tabs. It is also available in Hive as the table "bigdata_cases.movie_info".

The fields are defined as follows:

Field	Definition
MovieId	Movie identifier
Year	Release year
Title	Movie title

The first 5 rows of the dataset are:

student1@master:~$ hdfs dfs -cat /data/13/5/info/info.csv | head -5
1   2003    Dinosaur Planet
2   2004    Isle of Man TT 2004 Review
3   1997    Character
4   1994    Paula Abdul's Get Up & Dance
5   2004    The Rise and Fall of ECW
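
Before running any query, a quick check on a sample can confirm the rating layout described above: four tab-separated fields and a Grade between 1 and 5. The following is a minimal sketch, not part of the original lab. The function name `check_ratings` is hypothetical, and it assumes the file really uses tab separators (the listings above render them as spaces).

```shell
# check_ratings: hypothetical helper, not part of the original lab.
# Reads tab-separated rating rows on stdin and prints "bad line N" for each
# row that does not have 4 fields or whose Grade is not a digit from 1 to 5.
# Prints "ok N rows" when every row passes.
check_ratings() {
  awk -F'\t' '
    NF != 4 || $3 !~ /^[1-5]$/ { print "bad line " NR; bad++ }
    END { if (!bad) print "ok " NR " rows" }
  '
}

# Example with the first two rating rows shown above (tabs written as \t).
printf '1\t1488844\t3\t2005-09-06\n1\t822109\t5\t2005-05-13\n' | check_ratings
```

On the cluster, the same function could presumably be fed from `hdfs dfs -cat /data/13/5/rating/rating.csv | head -5`.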

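[Experiment Steps and Hints]
In the steps below, step 1 uses Hive for data analysis and data preparation, with all code executed on the big data compute cluster; step 2 uses R for data visualization.

  1. Data analysis and preparation with Hive
    a. Compute the average rating and the number of ratings for every film released in 2000 or later
    Run the following HiveQL code: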

     hive -e \
     "select b.Title, a.MovieId, b.Year, avg(a.Grade) as AverageGrade, count(1) as CountGrade
     from bigdata_cases.movie_rating a
     inner join bigdata_cases.movie_info b on a.MovieId=b.MovieId and b.Year>=2000
     group by b.Title, a.MovieId, b.Year
     order by AverageGrade desc, CountGrade desc;" \
     > 1.csv

The first 10 rows of the result:

Title	MovieId	Year	AverageGrade	CountGrade
Lord of the Rings: The Return of the King: Extended Edition	14961	2003	4.723269925683507	73335
The Lord of the Rings: The Fellowship of the Ring: Extended Edition	7230	2001	4.716610825093296	73422
Lord of the Rings: The Two Towers: Extended Edition	7057	2002	4.702611063648014	74912
Lost: Season 1	3456	2004	4.6709891019450955	7249
Battlestar Galactica: Season 1	9864	2004	4.638809387521466	1747
Fullmetal Alchemist	15538	2004	4.605021432945499	1633
Veronica Mars: Season 1	12398	2004	4.592084006462035	1238
Arrested Development: Season 2	7833	2004	4.582389367165081	6621
Inu-Yasha	4238	2000	4.554434413170473	1883
Lord of the Rings: The Return of the King	14240	2003	4.5451207887760265	134284
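
The join, filter and aggregate logic of this step can be tried on a few hand-made rows without a cluster. The sketch below uses awk on two local tab-separated files and prints the same five columns as the result table above. The function name `avg_by_movie`, the file names and the sample ratings are all assumptions made for illustration; it is a local stand-in, not the Hive query itself.

```shell
# avg_by_movie INFO RATING: hypothetical local stand-in for the Hive query.
# INFO   rows: MovieId <tab> Year <tab> Title
# RATING rows: MovieId <tab> UserId <tab> Grade <tab> RatingTime
# Prints Title, MovieId, Year, average grade and rating count for films with
# Year >= 2000, sorted by average grade and then count, both descending.
avg_by_movie() {
  awk -F'\t' '
    NR == FNR {                      # first file: remember films from 2000 on
      if ($2 >= 2000) { year[$1] = $2; title[$1] = $3 }
      next
    }
    ($1 in year) { sum[$1] += $3; n[$1]++ }   # second file: accumulate grades
    END {
      for (m in n)
        printf "%s\t%s\t%s\t%.4f\t%d\n", title[m], m, year[m], sum[m] / n[m], n[m]
    }
  ' "$1" "$2" | sort -t "$(printf '\t')" -k4,4nr -k5,5nr
}

dir=$(mktemp -d)
printf '1\t2003\tDinosaur Planet\n3\t1997\tCharacter\n5\t2004\tThe Rise and Fall of ECW\n' > "$dir/info.tsv"
# The ratings below are made up for illustration.
printf '1\t101\t3\t2005-09-06\n1\t102\t5\t2005-05-13\n3\t101\t4\t2004-05-03\n' > "$dir/rating.tsv"
printf '5\t101\t4\t2005-01-02\n5\t102\t5\t2005-01-03\n5\t103\t5\t2005-01-04\n' >> "$dir/rating.tsv"
avg_by_movie "$dir/info.tsv" "$dir/rating.tsv"
rm -rf "$dir"
```

With this sample, "The Rise and Fall of ECW" (average 4.6667 over 3 ratings) is listed before "Dinosaur Planet" (average 4.0000 over 2 ratings), and the 1997 film is dropped by the year filter.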

b. Retrieve all ratings of the 5 films with the highest average rating
These rows are used later for visualization.

Run the following HiveQL code:

hive -e \
"select b.Title, a.MovieId, a.Grade
from bigdata_cases.movie_rating a
inner join bigdata_cases.movie_info b on a.MovieId=b.MovieId
where a.MovieId in (14961, 7230, 7057, 3456, 9864);" \
> 2.csv

The first 10 rows of the result:

Title	MovieId	Grade
Lost: Season 1	3456	5
Lost: Season 1	3456	4
Lost: Season 1	3456	5
Lost: Season 1	3456	5
Lost: Season 1	3456	5
Lost: Season 1	3456	4
Lost: Season 1	3456	5
Lost: Season 1	3456	5
Lost: Season 1	3456	5
Lost: Season 1	3456	5

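c. Compute the weekly average rating, rating standard deviation and number of ratings for the top 5 films during their first year
Weeks are counted from the date of each film's first rating, and only the first 52 weeks are kept.

Run the following HiveQL code: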

hive -e \
"select b.Title, a.MovieId, round((unix_timestamp(concat(a.RatingTime,' 00:00:00'))-b.RatingStart) / (3600 * 24 * 7)) as Week,
avg(a.Grade) as AverageGrade, stddev(a.Grade) as DevGrade, count(1) as CountGrade
from bigdata_cases.movie_rating a
inner join (
select a1.MovieId, b1.Title, min(unix_timestamp(concat(a1.RatingTime,' 00:00:00'))) as RatingStart
from bigdata_cases.movie_rating a1
inner join bigdata_cases.movie_info b1 on a1.MovieId=b1.MovieId
where a1.MovieId in (14961, 7230, 7057, 3456, 9864)
group by a1.MovieId, b1.Title) b on a.MovieId=b.MovieId
where unix_timestamp(concat(a.RatingTime,' 00:00:00'))-b.RatingStart<=3600*24*7*52
group by b.Title, a.MovieId, round((unix_timestamp(concat(a.RatingTime,' 00:00:00'))-b.RatingStart) / (3600 * 24 * 7));" \
> 3.csv

The first 10 rows of the result:

Title	MovieId	Week	AverageGrade	DevGrade	CountGrade
Battlestar Galactica: Season 1	9864	0.0	5.0	0.0	4
Battlestar Galactica: Season 1	9864	1.0	4.833333333333333	0.37267799624996495	6
Battlestar Galactica: Season 1	9864	2.0	4.333333333333333	1.1055415967851332	12
Battlestar Galactica: Season 1	9864	3.0	4.833333333333333	0.37267799624996495	6
Battlestar Galactica: Season 1	9864	4.0	4.5	0.5	4
Battlestar Galactica: Season 1	9864	5.0	5.0	0.0	5
Battlestar Galactica: Season 1	9864	6.0	4.666666666666667	0.6666666666666666	9
Battlestar Galactica: Season 1	9864	7.0	4.923076923076923	0.2664693550105965	13
Battlestar Galactica: Season 1	9864	8.0	5.0	0.0	3
Battlestar Galactica: Season 1	9864	9.0	5.0	0.0	15
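
The week bucketing in this query can be checked by hand on a tiny sample. The sketch below mirrors the logic under stated assumptions: it works in whole days instead of Unix timestamps, rounds the age in weeks to the nearest integer, keeps ratings up to 364 days after the first one, and uses the population standard deviation, which matches the DevGrade values in the table above. The function name `week_stats` and the sample ratings are hypothetical, and all rows are held in memory, so it only suits small inputs.

```shell
# week_stats: hypothetical local stand-in for the weekly Hive query.
# stdin rows: MovieId <tab> UserId <tab> Grade <tab> RatingTime (YYYY-MM-DD)
# Prints MovieId, week number, mean grade, population std dev, rating count.
week_stats() {
  awk -F'\t' '
    # Day number of a civil date (days since 1970-01-01, Gregorian calendar).
    function days(s,   p, y, m, d, era, yoe, doy, doe) {
      split(s, p, "-"); y = p[1] + 0; m = p[2] + 0; d = p[3] + 0
      if (m <= 2) y--
      era = int(y / 400); yoe = y - era * 400
      doy = int((153 * (m > 2 ? m - 3 : m + 9) + 2) / 5) + d - 1
      doe = yoe * 365 + int(yoe / 4) - int(yoe / 100) + doy
      return era * 146097 + doe - 719468
    }
    {
      id[NR] = $1; g[NR] = $3; t[NR] = days($4)
      if (!($1 in start) || t[NR] < start[$1]) start[$1] = t[NR]
    }
    END {
      for (i = 1; i <= NR; i++) {
        age = t[i] - start[id[i]]          # days since the first rating
        if (age > 7 * 52) continue         # keep the first 52 weeks only
        k = id[i] "\t" int(age / 7 + 0.5)  # round to the nearest week
        n[k]++; s[k] += g[i]; q[k] += g[i] * g[i]
      }
      for (k in n) {
        mean = s[k] / n[k]
        var = q[k] / n[k] - mean * mean    # population variance
        if (var < 0) var = 0
        printf "%s\t%.4f\t%.4f\t%d\n", k, mean, sqrt(var), n[k]
      }
    }
  ' | sort -t "$(printf '\t')" -k1,1n -k2,2n
}

# Made-up ratings: days 0 and 2 fall in week 0, days 4 and 8 in week 1.
printf '9864\t1\t5\t2005-01-01\n9864\t2\t4\t2005-01-03\n9864\t3\t5\t2005-01-05\n9864\t4\t3\t2005-01-09\n' | week_stats
```

Because of the rounding, week 0 covers only days 0 to 3 after the first rating, while each later week covers a full 7 days centred on a multiple of 7.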

d. Compute the average rating and the number of ratings for every user

Run the following HiveQL code:

hive -e \
"select UserId, count(1) as CountGrade, avg(Grade) as AverageGrade
from bigdata_cases.movie_rating
group by UserId
order by CountGrade desc, AverageGrade desc;" \
> 4.csv

The first 10 rows of the result:

UserId	CountGrade	AverageGrade
305344	17652	1.9082256968048947
387418	17435	1.809119587037568
2439493	16565	1.2168427407183822
1664010	15813	4.264339467526718
2118461	14830	4.082333108563722
1461435	9821	1.3728744527033907
1639792	9767	1.327838640319443
1314869	9740	2.9544147843942503
2606799	9064	2.771954986760812
1932594	8880	2.281418918918919
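
The user IDs used in step e are copied by hand from the top rows of this result. As a hedged alternative, the list could be built from the result file itself. The helper name `top_ids` is hypothetical, and the sketch assumes the file is tab-separated, has no header row and is already sorted as in the table above.

```shell
# top_ids FILE N: hypothetical helper, not part of the original lab.
# Prints the first column of the first N rows of a tab-separated file
# as one comma-separated line.
top_ids() {
  head -n "$2" "$1" | cut -f1 | paste -s -d, -
}

# Example with the first three result rows shown above.
tmp=$(mktemp)
printf '305344\t17652\t1.9082256968048947\n' > "$tmp"
printf '387418\t17435\t1.809119587037568\n' >> "$tmp"
printf '2439493\t16565\t1.2168427407183822\n' >> "$tmp"
ids=$(top_ids "$tmp" 2)
echo "where UserId in ($ids)"
rm -f "$tmp"
```

This prints `where UserId in (305344,387418)`; on the cluster, `$ids` could be spliced into the query string passed to `hive -e` in the same way.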

e. Retrieve all ratings of the 5 users with the most ratings
These rows are used later for visualization.

Run the following HiveQL code:

hive -e \
"select UserId, Grade
from bigdata_cases.movie_rating a
where UserId in (305344, 387418, 2439493, 1664010, 2118461);" \
> 5.csv

The first 10 rows of the result:

UserId	Grade
2439493	1
1664010	5
305344	1
2118461	5
387418	1
2439493	1
1664010	4
305344	1
2118461	4
387418	1
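  2. Data visualization with R
    a. Load the required packages
    Copy the result files produced by Hive to a path that R can access, such as "D:\workspace\", then load the required packages.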

     > library(ggplot2)
    

b. Plot histograms of the average rating and the number of ratings for films released in 2000 or later
Plot a histogram of the average rating of these films, with the average rating on the x-axis and the number of films on the y-axis. Film titles can contain apostrophes and quotation marks, so quoting is disabled with quote = "" when reading the file.

> data1 <- read.table("D:/workspace/1.csv", sep = "\t", quote = "", stringsAsFactors = FALSE)
> names(data1) <- c("Title", "MovieId", "Year", "AverageGrade", "CountGrade")
> ggplot(data1, aes(x = AverageGrade)) + geom_histogram(aes(fill = ..count..)) + 
+     scale_fill_gradient("Count", low = "green", high = "red")
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
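
By default, read.table treats both ' and " as quote characters, so a title with an unmatched apostrophe can swallow the following fields and lines and produce warnings such as "EOF within quoted string". The sketch below lists rows whose chosen column contains a quote character, so affected titles can be spotted before loading. The function name `quoted_fields` is hypothetical, and the example rows are taken from the movie information sample in the data section.

```shell
# quoted_fields COL: hypothetical helper, not part of the original lab.
# Reads tab-separated rows on stdin and prints "line: value" for every row
# whose column COL contains an apostrophe (octal 047) or a double quote.
quoted_fields() {
  awk -F'\t' -v col="$1" '
    index($col, "\047") || index($col, "\"") { print NR ": " $col }
  '
}

# Two rows from the movie information sample; the title is column 3.
printf '%s\t%s\t%s\n' 3 1997 "Character" 4 1994 "Paula Abdul's Get Up & Dance" | quoted_fields 3
```

For a file laid out like 1.csv, where the title is the first column, the call would be `quoted_fields 1 < 1.csv`.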

Plot a histogram of the number of ratings of these films, with the number of ratings on the x-axis (log scale) and the number of films on the y-axis.

> ggplot(data1, aes(x = CountGrade)) + geom_histogram(aes(fill = ..count..)) + 
+     scale_fill_gradient("Count", low = "green", high = "red") + scale_x_log10()
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
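
To get a feel for what a log-scaled x-axis does to the rating counts, the values can be grouped by order of magnitude. This is a deliberately coarser stand-in for the histogram, which splits the log-transformed range into 30 bins; the sketch uses one bin per power of ten. The function name `decade_counts` is hypothetical, and the input is the CountGrade column of the 10 result rows listed in step 1a.

```shell
# decade_counts COL: hypothetical helper, not part of the original lab.
# Reads tab-separated rows on stdin and counts how many values in column COL
# fall into each power of ten, using the number of digits of the integer part.
decade_counts() {
  awk -F'\t' -v col="$1" '
    $col >= 1 {
      e = length(int($col)) - 1      # 1747 has 4 digits, so it belongs to 1e3
      n[e]++
      if (e > max) max = e
    }
    END { for (e = 0; e <= max; e++) if (e in n) print "1e" e "\t" n[e] }
  '
}

# CountGrade values of the 10 rows shown in step 1a.
printf '%s\n' 73335 73422 74912 7249 1747 1633 1238 6621 1883 134284 | decade_counts 1
```

For these 10 rows the output is 6 films in the thousands (1e3), 3 in the tens of thousands (1e4) and 1 above one hundred thousand (1e5).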

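c. Plot the rating distribution of the 5 films with the highest average rating as a stacked bar chart
Each bar is one film; the y-axis (log scale) shows the cumulative share of ratings, and the colours denote the different grades.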

> data2 <- read.table("D:/workspace/2.csv", sep = "\t")
> names(data2) <- c("Title", "MovieId", "Grade")
> data2$Grade <- as.factor(data2$Grade)
> ggplot(data2, aes(x = Title, fill = Grade)) + geom_bar(position = "fill") + 
+     scale_y_log10()
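
The bars drawn with position = "fill" show, for each film, the share of ratings at each grade. The sketch below computes those shares directly from rows laid out like 2.csv, which can help when checking the plot against the numbers. The function name `grade_shares` is hypothetical; the example input is the 10 rows of 2.csv listed in step 1b.

```shell
# grade_shares: hypothetical helper, not part of the original lab.
# stdin rows: Title <tab> MovieId <tab> Grade
# Prints Title, Grade and the share of that grade among the title's ratings.
grade_shares() {
  awk -F'\t' '
    { n[$1 "\t" $3]++; total[$1]++ }
    END {
      for (k in n) {
        split(k, p, "\t")            # p[1] is the title, p[2] the grade
        printf "%s\t%.2f\n", k, n[k] / total[p[1]]
      }
    }
  ' | sort
}

# The 10 rows of 2.csv shown in step 1b: eight 5s and two 4s.
printf 'Lost: Season 1\t3456\t%s\n' 5 4 5 5 5 4 5 5 5 5 | grade_shares
```

For these 10 rows the result is a share of 0.20 for grade 4 and 0.80 for grade 5.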

d. Plot the weekly average rating, rating standard deviation and number of ratings of the top 5 films during their first year
The x-axis is the number of weeks since the film received its first rating and the y-axis is the film's average rating. Each film is drawn as a smoothed line of fixed width, and the grey band is the 95% confidence interval of the average rating.

> data3 <- read.table("D:/workspace/3.csv", sep = "\t")
> names(data3) <- c("Title", "MovieId", "Week", "AverageGrade", "DevGrade", "CountGrade")
> ggplot(data3, aes(x = Week, y = AverageGrade, group = Title)) + geom_smooth(aes(ymin = AverageGrade - 
+     1.96 * DevGrade, ymax = AverageGrade + 1.96 * DevGrade, colour = Title), 
+     size = 2) + xlim(0, 45)
geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
Warning: Removed 2 rows containing missing values (stat_smooth).
Warning: Removed 1 rows containing missing values (stat_smooth).
Warning: Removed 1 rows containing missing values (stat_smooth).
Warning: Removed 7 rows containing missing values (stat_smooth).
Warning: Removed 1 rows containing missing values (stat_smooth).
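
The plotting code builds its band from 1.96 * DevGrade, that is, from the standard deviation of the individual ratings. A confidence interval for a mean is usually built from the standard error, sd / sqrt(n), instead. The worked example below prints both for one row of 3.csv (week 2 of Battlestar Galactica: Season 1, with 12 ratings) so the difference is visible. The function name `interval_95` is hypothetical, and the normal approximation with 1.96 is an assumption that is rough for such a small n.

```shell
# interval_95 MEAN SD N: hypothetical helper, not part of the original lab.
# Prints two lines: the band mean +/- 1.96*sd (tagged "sd") and the band
# mean +/- 1.96*sd/sqrt(n) (tagged "se"), each as lower and upper bound.
interval_95() {
  awk -v m="$1" -v sd="$2" -v n="$3" 'BEGIN {
    printf "sd\t%.2f\t%.2f\n", m - 1.96 * sd, m + 1.96 * sd
    se = sd / sqrt(n)
    printf "se\t%.2f\t%.2f\n", m - 1.96 * se, m + 1.96 * se
  }'
}

# Week 2 row from 3.csv: AverageGrade, DevGrade, CountGrade.
interval_95 4.333333333333333 1.1055415967851332 12
```

The "sd" band runs from 2.17 to 6.50, which extends past the maximum grade of 5, while the "se" band runs from 3.71 to 4.96.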

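e. Plot histograms of the average rating and the number of ratings per user
Plot a histogram of the average rating per user, with the average rating on the x-axis and the number of users on the y-axis.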

> data4 <- read.table("D:/workspace/4.csv", sep = "\t", stringsAsFactors = FALSE)
> names(data4) <- c("UserId", "CountGrade", "AverageGrade")
> ggplot(data4, aes(x = AverageGrade)) + geom_histogram(aes(fill = ..count..)) + 
+     scale_fill_gradient("Count", low = "green", high = "red")
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

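Plot a histogram of the number of ratings per user, with the number of ratings on the x-axis (log scale) and the number of users on the y-axis.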

> ggplot(data4, aes(x = CountGrade)) + geom_histogram(aes(fill = ..count..)) + 
+     scale_fill_gradient("Count", low = "green", high = "red") + scale_x_log10()
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

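f. Plot the rating distribution of the 5 users with the most ratings as a stacked bar chart
Each bar is one user; the y-axis shows the cumulative share of ratings, and the colours denote the different grades.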

> data5 <- read.table("D:/workspace/5.csv", sep = "\t")
> names(data5) <- c("UserId", "Grade")
> data5$Grade <- as.factor(data5$Grade)
> ggplot(data5, aes(x = as.factor(UserId), fill = Grade)) + geom_bar(position = "fill")
