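[Experiment Data]
The data used in this experiment are film ratings collected by the US online film provider Netflix between October 1998 and December 2005: 100,480,507 ratings given by 480,189 users to 17,770 films.
The data consist of two datasets. The first holds the ratings (rating.csv).
Its fields are defined as:
Field        Definition
MovieId      Film identifier
UserId       User identifier
Grade        User rating; one of 1, 2, 3, 4 or 5
RatingTime   Date of the rating
The first 5 rows of the dataset are: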
student1@master:~$ hdfs dfs -cat /data/13/5/rating/rating.csv | head -5
1 1488844 3 2005-09-06
1 822109 5 2005-05-13
1 885013 4 2005-10-19
1 30878 4 2005-12-26
1 823519 3 2004-05-03
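To make the field definitions concrete, the following Python sketch parses the five sample rows into typed records. It assumes the fields are separated by whitespace, as they appear in the listing; the actual delimiter of rating.csv is not stated here. The `Rating` class and `parse_rating` function are illustrative names, not part of the lab material.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Rating:
    movie_id: int
    user_id: int
    grade: int
    rating_time: date


def parse_rating(line: str) -> Rating:
    """Parse one rating row (assumed whitespace-separated) into a Rating."""
    movie_id, user_id, grade, rating_time = line.split()
    grade_value = int(grade)
    # Grade can only be 1, 2, 3, 4 or 5 according to the field definition.
    if grade_value not in (1, 2, 3, 4, 5):
        raise ValueError(f"Grade out of range: {grade_value}")
    return Rating(int(movie_id), int(user_id), grade_value,
                  date.fromisoformat(rating_time))


# The five sample rows printed by the hdfs command above.
SAMPLE_ROWS = [
    "1 1488844 3 2005-09-06",
    "1 822109 5 2005-05-13",
    "1 885013 4 2005-10-19",
    "1 30878 4 2005-12-26",
    "1 823519 3 2004-05-03",
]

ratings = [parse_rating(row) for row in SAMPLE_ROWS]
for rating in ratings:
    print(rating)
```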
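The second dataset holds the film information (info.csv). Its fields are defined as:
Field        Definition
MovieId      Film identifier
Year         Release year of the film
Title        Film title
The first 5 rows of the dataset are: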
student1@master:~$ hdfs dfs -cat /data/13/5/info/info.csv | head -5
1 2003 Dinosaur Planet
2 2004 Isle of Man TT 2004 Review
3 1997 Character
4 1994 Paula Abdul's Get Up & Dance
5 2004 The Rise and Fall of ECW
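[Experiment Step Hints]
In the steps below, step 1 uses Hive for data analysis and data preparation, with all code run on the big-data computing cluster; step 2 uses R for data visualization.
1. Data analysis and preparation with Hive
a. Compute the average rating and rating count of every film released in 2000 or later
Compute the average rating and rating count of every film released in 2000 or later, sorted by average rating and then by rating count, both descending. Run the following HiveQL code: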
hive -e \
"select b.Title, a.MovieId, b.Year, avg(a.Grade) as AverageGrade, count(1) as CountGrade
from bigdata_cases.movie_rating a
inner join bigdata_cases.movie_info b on a.MovieId=b.MovieId and b.Year>=2000
group by b.Title, a.MovieId, b.Year
order by AverageGrade desc,CountGrade desc;" \
> 1.csv
The first 10 rows of the result:
Title MovieId Year AverageGrade CountGrade
Lord of the Rings: The Return of the King: Extended Edition 14961 2003 4.723269925683507 73335
The Lord of the Rings: The Fellowship of the Ring: Extended Edition 7230 2001 4.716610825093296 73422
Lord of the Rings: The Two Towers: Extended Edition 7057 2002 4.702611063648014 74912
Lost: Season 1 3456 2004 4.6709891019450955 7249
Battlestar Galactica: Season 1 9864 2004 4.638809387521466 1747
Fullmetal Alchemist 15538 2004 4.605021432945499 1633
Veronica Mars: Season 1 12398 2004 4.592084006462035 1238
Arrested Development: Season 2 7833 2004 4.582389367165081 6621
Inu-Yasha 4238 2000 4.554434413170473 1883
Lord of the Rings: The Return of the King 14240 2003 4.5451207887760265 134284
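The logic of this query (inner join on MovieId, filter on Year, group, then sort) can be sketched in plain Python on a handful of rows. This is only an illustration of what the query computes, not how Hive executes it. The MovieId 1 grades are the five sample rows of rating.csv shown earlier; the single grade for MovieId 3 is made up so that the year filter has something to drop. The function name `average_grades` is hypothetical.

```python
from collections import defaultdict

# (MovieId, Year, Title) rows taken from the info.csv sample.
movie_info = [
    (1, 2003, "Dinosaur Planet"),
    (3, 1997, "Character"),
]

# (MovieId, Grade) pairs. The MovieId 1 grades come from the rating.csv sample;
# the MovieId 3 grade is MADE UP purely to exercise the Year >= 2000 filter.
movie_rating = [(1, 3), (1, 5), (1, 4), (1, 4), (1, 3), (3, 5)]


def average_grades(ratings, info, min_year=2000):
    """Return (Title, MovieId, Year, AverageGrade, CountGrade) rows."""
    # Keep only films released in min_year or later (the b.Year >= 2000 condition).
    recent = {mid: (year, title) for mid, year, title in info if year >= min_year}

    grades = defaultdict(list)
    for mid, grade in ratings:
        if mid in recent:  # inner join: ratings without a matching film are dropped
            grades[mid].append(grade)

    rows = [
        (recent[mid][1], mid, recent[mid][0], sum(g) / len(g), len(g))
        for mid, g in grades.items()
    ]
    # order by AverageGrade desc, CountGrade desc
    rows.sort(key=lambda row: (-row[3], -row[4]))
    return rows


result = average_grades(movie_rating, movie_info)
print(result)
```

On these rows only "Dinosaur Planet" survives the filter, with an average of 3.8 over 5 ratings.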
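b. Extract all ratings of the top 5 films by average rating
Extract every rating of the 5 films with the highest average rating, in preparation for the visualization step.
Run the following HiveQL code: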
hive -e \
"select b.Title, a.MovieId, a.Grade
from bigdata_cases.movie_rating a
inner join bigdata_cases.movie_info b on a.MovieId=b.MovieId
where a.MovieId in (14961, 7230, 7057, 3456, 9864);" \
> 2.csv
The first 10 rows of the result:
Title MovieId Grade
Lost: Season 1 3456 5
Lost: Season 1 3456 4
Lost: Season 1 3456 5
Lost: Season 1 3456 5
Lost: Season 1 3456 5
Lost: Season 1 3456 4
Lost: Season 1 3456 5
Lost: Season 1 3456 5
Lost: Season 1 3456 5
Lost: Season 1 3456 5
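c. Compute the weekly average rating, rating standard deviation and rating count of the top 5 films during their first year
For each of the top 5 films by average rating, compute the average rating, the standard deviation of the ratings and the rating count for each week of the first year (52 weeks), counted from the date of the film's first rating.
Run the following HiveQL code: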
hive -e \
"select b.Title, a.MovieId, round((unix_timestamp(concat(a.RatingTime,' 00:00:00'))-b.RatingStart) / (3600 * 24 *7)) as Week,
avg(a.Grade) as AverageGrade, stddev(a.Grade) as DevGrade, count(1) CountGrade
from bigdata_cases.movie_rating a
inner join (
select a1.MovieId, b1.Title, min(unix_timestamp(concat(a1.RatingTime,' 00:00:00'))) as RatingStart
from bigdata_cases.movie_rating a1
inner join bigdata_cases.movie_info b1 on a1.MovieId=b1.MovieId
where a1.MovieId in (14961, 7230, 7057, 3456, 9864)
group by a1.MovieId, b1.Title) b on a.MovieId=b.MovieId
where unix_timestamp(concat(a.RatingTime,' 00:00:00'))-b.RatingStart<=3600*24*7*52
group by b.Title, a.MovieId, round((unix_timestamp(concat(a.RatingTime,' 00:00:00'))-b.RatingStart) / (3600 * 24 * 7));" \
> 3.csv
The first 10 rows of the result:
Title MovieId Week AverageGrade DevGrade CountGrade
Battlestar Galactica: Season 1 9864 0.0 5.0 0.0 4
Battlestar Galactica: Season 1 9864 1.0 4.833333333333333 0.37267799624996495 6
Battlestar Galactica: Season 1 9864 2.0 4.333333333333333 1.1055415967851332 12
Battlestar Galactica: Season 1 9864 3.0 4.833333333333333 0.37267799624996495 6
Battlestar Galactica: Season 1 9864 4.0 4.5 0.5 4
Battlestar Galactica: Season 1 9864 5.0 5.0 0.0 5
Battlestar Galactica: Season 1 9864 6.0 4.666666666666667 0.6666666666666666 9
Battlestar Galactica: Season 1 9864 7.0 4.923076923076923 0.2664693550105965 13
Battlestar Galactica: Season 1 9864 8.0 5.0 0.0 3
Battlestar Galactica: Season 1 9864 9.0 5.0 0.0 15
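The week index and the DevGrade column can be reproduced by hand. The sketch below mirrors the `round(seconds / (3600 * 24 * 7))` expression using whole-day differences (an assumption: Hive's unix_timestamp works in the session time zone, so daylight-saving shifts could move a timestamp by an hour, which this sketch ignores). It also checks that the DevGrade values above are consistent with a population standard deviation. A week with 6 ratings averaging 4.8333 can only be five 5s and one 4, and a week with 4 ratings averaging 4.5 and a deviation of 0.5 matches two 5s and two 4s. The dates used are hypothetical.

```python
import math
import statistics
from datetime import date

SECONDS_PER_WEEK = 3600 * 24 * 7


def week_index(rating_day: date, first_day: date) -> int:
    """Weeks since the first rating, rounded half up like round() in the query."""
    elapsed_seconds = (rating_day - first_day).days * 24 * 3600
    return math.floor(elapsed_seconds / SECONDS_PER_WEEK + 0.5)


# Hypothetical dates: the first rating arrives on 2005-01-01.
first = date(2005, 1, 1)
print(week_index(date(2005, 1, 4), first))   # 3 days later
print(week_index(date(2005, 1, 5), first))   # 4 days later
print(week_index(date(2005, 1, 15), first))  # 14 days later

# Grades inferred from the week 1.0 row (6 ratings, mean 4.8333...).
week_one = [5, 5, 5, 5, 5, 4]
# Grades consistent with the week 4.0 row (4 ratings, mean 4.5, deviation 0.5).
week_four = [5, 5, 4, 4]

print(statistics.mean(week_one), statistics.pstdev(week_one))
print(statistics.mean(week_four), statistics.pstdev(week_four))
```

Because the index is rounded rather than truncated, week 0 covers only the first three days after the first rating, and each later week is centred on a multiple of seven days.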
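d. Compute the average rating and rating count of every user
Compute the rating count and average rating of every user, sorted by rating count and then by average rating, both descending.
Run the following HiveQL code: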
hive -e \
"select UserId, count(1) as CountGrade, avg(Grade) as AverageGrade
from bigdata_cases.movie_rating
group by UserId
order by CountGrade desc,AverageGrade desc;" \
> 4.csv
The first 10 rows of the result:
UserId CountGrade AverageGrade
305344 17652 1.9082256968048947
387418 17435 1.809119587037568
2439493 16565 1.2168427407183822
1664010 15813 4.264339467526718
2118461 14830 4.082333108563722
1461435 9821 1.3728744527033907
1639792 9767 1.327838640319443
1314869 9740 2.9544147843942503
2606799 9064 2.771954986760812
1932594 8880 2.281418918918919
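e. Extract all ratings of the top 5 users by rating count
Extract every rating given by the 5 users with the most ratings, in preparation for the visualization step.
Run the following HiveQL code: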
hive -e \
"select UserId, Grade
from bigdata_cases.movie_rating a
where UserId in (305344, 387418, 2439493, 1664010, 2118461);" \
> 5.csv
The first 10 rows of the result:
UserId Grade
2439493 1
1664010 5
305344 1
2118461 5
387418 1
2439493 1
1664010 4
305344 1
2118461 4
387418 1
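2. Data visualization with R
a. Load the required package
Load the required package. Copy the result files produced by Hive to a path that R can read, such as "D:\workspace\".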
> library(ggplot2)
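b. Plot histograms of the average rating and rating count of all films released in 2000 or later
Plot a histogram of the average rating of all films released in 2000 or later; the x axis shows the average rating and the y axis shows the number of films.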
> # quote = "" stops apostrophes in titles (e.g. "Paula Abdul's Get Up & Dance") from being
> # read as string delimiters, which otherwise causes "EOF within quoted string" warnings
> # and silently drops rows; comment.char = "" keeps any "#" in a title as data.
> data1 <- read.table("D:/workspace/1.csv", sep = "\t", quote = "", comment.char = "",
+ stringsAsFactors = FALSE)
> names(data1) <- c("Title", "MovieId", "Year", "AverageGrade", "CountGrade")
> ggplot(data1, aes(x = AverageGrade)) + geom_histogram(aes(fill = ..count..)) +
+ scale_fill_gradient("Count", low = "green", high = "red")
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
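Plot a histogram of the rating count of all films released in 2000 or later; the x axis shows the rating count on a log scale and the y axis shows the number of films.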
> ggplot(data1, aes(x = CountGrade)) + geom_histogram(aes(fill = ..count..)) +
+ scale_fill_gradient("Count", low = "green", high = "red") + scale_x_log10()
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
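With a log-scaled x axis, histogram bins have equal width in log10 space rather than in raw rating counts. The Python sketch below illustrates the idea on the CountGrade values of the ten films listed in step 1a. It uses one bin per decade for readability (an assumption made for this illustration; the ggplot2 default reported above is 30 bins across the range), and `log10_bins` is a hypothetical helper name.

```python
import math
from collections import Counter

# CountGrade values of the ten films listed in the result of step 1a.
count_grade = [73335, 73422, 74912, 7249, 1747, 1633, 1238, 6621, 1883, 134284]


def log10_bins(values, bins_per_decade=1):
    """Count values per bin, where bins are equally wide on a log10 axis."""
    counts = Counter(
        math.floor(math.log10(value) * bins_per_decade) for value in values
    )
    return dict(sorted(counts.items()))


decades = log10_bins(count_grade)
for exponent, n_films in decades.items():
    print(f"10^{exponent} to 10^{exponent + 1}: {n_films} films")
```

Six of the ten films fall between 1,000 and 10,000 ratings, three between 10,000 and 100,000, and one above 100,000, a spread that would be squeezed into the left edge of a linear axis.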
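c. Plot a stacked bar chart of the rating distribution of the top 5 films by average rating
Plot a stacked bar chart of the rating distribution of the top 5 films by average rating; the x axis shows the films, the y axis shows the cumulative share of ratings on a log scale, and the fill color shows the grade.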
> data2 <- read.table("D:/workspace/2.csv", sep = "\t")
> names(data2) <- c("Title", "MovieId", "Grade")
> data2$Grade <- as.factor(data2$Grade)
> ggplot(data2, aes(x = Title, fill = Grade)) + geom_bar(position = "fill") +
+ scale_y_log10()
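d. Plot the weekly average rating, rating standard deviation and rating count of the top 5 films during their first year
Plot the weekly average rating of the top 5 films by average rating during their first year; the x axis is the number of weeks since the film received its first rating, the y axis is the average rating, and the gray band is the 95% confidence band around each smoothed curve. Lines are drawn at a fixed width (size = 2), and xlim(0, 45) restricts the plot to the first 45 weeks.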
> data3 <- read.table("D:/workspace/3.csv", sep = "\t")
> names(data3) <- c("Title", "MovieId", "Week", "AverageGrade", "DevGrade", "CountGrade")
> ggplot(data3, aes(x = Week, y = AverageGrade, group = Title)) + geom_smooth(aes(ymin = AverageGrade -
+ 1.96 * DevGrade, ymax = AverageGrade + 1.96 * DevGrade, colour = Title),
+ size = 2) + xlim(0, 45)
geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
Warning: Removed 2 rows containing missing values (stat_smooth).
Warning: Removed 1 rows containing missing values (stat_smooth).
Warning: Removed 1 rows containing missing values (stat_smooth).
Warning: Removed 7 rows containing missing values (stat_smooth).
Warning: Removed 1 rows containing missing values (stat_smooth).
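The plotting call passes AverageGrade plus or minus 1.96 times DevGrade, which describes the spread of individual ratings within a week. If a 95% confidence interval for the weekly mean itself is wanted, the usual normal approximation divides the standard deviation by the square root of the rating count. The following Python sketch applies that formula to the week 2.0 row of 3.csv shown in step 1c; it is an illustration under that approximation, which is rough for counts as small as 12, and `mean_confidence_interval` is a hypothetical helper name.

```python
import math


def mean_confidence_interval(average, std_dev, count, z=1.96):
    """Normal-approximation confidence interval for a mean."""
    half_width = z * std_dev / math.sqrt(count)
    return average - half_width, average + half_width


# Week 2.0 row of Battlestar Galactica: Season 1 from the 3.csv listing:
# AverageGrade, DevGrade, CountGrade.
low, high = mean_confidence_interval(4.333333333333333, 1.1055415967851332, 12)
print(f"95% CI for the weekly mean: [{low:.4f}, {high:.4f}]")

# For comparison, the band built from the raw standard deviation alone.
raw_low = 4.333333333333333 - 1.96 * 1.1055415967851332
raw_high = 4.333333333333333 + 1.96 * 1.1055415967851332
print(f"Mean +/- 1.96 * DevGrade:   [{raw_low:.4f}, {raw_high:.4f}]")
```

The second interval extends well beyond the 1 to 5 rating range, which is one reason to treat it as a measure of rating spread and not as an interval for the mean.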
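e. Plot histograms of the average rating and rating count of all users
Plot a histogram of the average rating of all users; the x axis shows the average rating and the y axis shows the number of users.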
> data4 <- read.table("D:/workspace/4.csv", sep = "\t", stringsAsFactors = FALSE)
> names(data4) <- c("UserId", "CountGrade", "AverageGrade")
> ggplot(data4, aes(x = AverageGrade)) + geom_histogram(aes(fill = ..count..)) +
+ scale_fill_gradient("Count", low = "green", high = "red")
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Plot a histogram of the rating count of all users; the x axis shows the rating count on a log scale and the y axis shows the number of users.
> ggplot(data4, aes(x = CountGrade)) + geom_histogram(aes(fill = ..count..)) +
+ scale_fill_gradient("Count", low = "green", high = "red") + scale_x_log10()
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
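f. Plot a stacked bar chart of the rating distribution of the top 5 users by rating count
Plot a stacked bar chart of the rating distribution of the top 5 users by rating count; the x axis shows the users, the y axis shows the cumulative share of ratings, and the fill color shows the grade.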
> data5 <- read.table("D:/workspace/5.csv", sep = "\t")
> names(data5) <- c("UserId", "Grade")
> data5$Grade <- as.factor(data5$Grade)
> ggplot(data5, aes(x = as.factor(UserId), fill = Grade)) + geom_bar(position = "fill")
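What position = "fill" draws is, for each user, the share of that user's ratings falling on each grade, stacked to a total of 1. The Python sketch below computes those shares for the ten rows of 5.csv listed in step 1e. It covers only those ten rows, so the shares are not the users' real distributions, and `grade_shares` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# (UserId, Grade) pairs: the ten rows of 5.csv listed in step 1e.
rows = [
    (2439493, 1), (1664010, 5), (305344, 1), (2118461, 5), (387418, 1),
    (2439493, 1), (1664010, 4), (305344, 1), (2118461, 4), (387418, 1),
]


def grade_shares(pairs):
    """Return {UserId: {Grade: share of that user's ratings}}."""
    counts = defaultdict(Counter)
    for user_id, grade in pairs:
        counts[user_id][grade] += 1
    return {
        user_id: {
            grade: n / sum(per_grade.values())
            for grade, n in sorted(per_grade.items())
        }
        for user_id, per_grade in counts.items()
    }


shares = grade_shares(rows)
for user_id, per_grade in shares.items():
    print(user_id, per_grade)
```

Each user's shares sum to 1, which is why every bar in the chart has the same height regardless of how many ratings the user gave.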