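[Experimental Data]
This experiment uses Sina Weibo data comprising 12,102,744 posts published from June 1 to June 14, 2013. The dataset is stored on HDFS at “/data/13/3/post/post.csv”, with fields separated by tabs. It is also available in Hive as the table “bigdata_cases.post”.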
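The fields are defined as follows:
Field          Definition
PostId         Post identifier
UserId         User identifier
UtcTime        Unix timestamp (in seconds) at which the post was published
Text           Post body
RepostsCount   Number of reposts of the post
CommentsCount  Number of comments on the post
RepostPostId   Identifier of the original post being reposted; 0 if the post itself is original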
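As an illustration of the record layout above, the following minimal sketch parses one tab-separated row and converts UtcTime to a calendar date. The class name `Post`, its methods, and the placeholder post text are hypothetical and not part of the lab code; the sketch also assumes the date should be read in UTC, whereas Hive's from_unixtime uses the time zone of the system it runs on.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

// Hypothetical holder for one row of post.csv (illustration only).
class Post {
    final long postId;
    final long userId;       // long: some user ids exceed the int range
    final long utcTime;      // seconds since 1970-01-01T00:00:00Z
    final String text;
    final int repostsCount;
    final int commentsCount;
    final long repostPostId; // 0 means the post is original

    private Post(String[] f) {
        postId = Long.parseLong(f[0]);
        userId = Long.parseLong(f[1]);
        utcTime = Long.parseLong(f[2]);
        text = f[3];
        repostsCount = Integer.parseInt(f[4]);
        commentsCount = Integer.parseInt(f[5]);
        repostPostId = Long.parseLong(f[6]);
    }

    // Parses one tab-separated line; rejects rows that do not have 7 fields.
    static Post parse(String line) {
        String[] f = line.split("\t");
        if (f.length != 7) {
            throw new IllegalArgumentException("expected 7 fields, got " + f.length);
        }
        return new Post(f);
    }

    boolean isOriginal() {
        return repostPostId == 0;
    }

    // Assumption: interpret the timestamp in UTC.
    LocalDate utcDate() {
        return Instant.ofEpochSecond(utcTime).atZone(ZoneOffset.UTC).toLocalDate();
    }

    public static void main(String[] args) {
        // Numeric fields follow the layout above; the text is a placeholder.
        Post p = Post.parse("33166399\t1104150515\t1369898443\tsample text\t36\t36\t0");
        System.out.println(p.utcDate() + " original=" + p.isOriginal());
    }
}
```

Storing the identifiers as `long` is a deliberate choice here, since values such as 2686904145 do not fit in a 32-bit `int`.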
The first 5 rows of the dataset are:
student1@master:~$ hdfs dfs -cat /data/13/3/post/post.csv | head -5
48879661 1097414213 1369920667 【试问:强奸小学女生者,势力到底有多大?】@叶海燕宝贝 举牌劝阻校长带小学女生开房,竟然摊上了大事。中午,她在微博中说,先是有人指使房东停止租房给她,然后有11男女上门打她。@迟夙生律师 微博刚才称,叶海燕现在被拘在派出所,电话是:0775-8222414。请大家呼吁一下。 52249 7620 0
36548989 2686904145 1369989873 【妻子勒死女儿后举报副局长丈夫贪污】她是用一根数据线环住女儿脖子,将其勒死的。她的副局长丈夫通过虚开增值税发票赚取巨款,包养情妇,买房生女。在杀死女儿前,她曾多次“带着三十多张存折和大量现金”举报丈夫,却始终无果。甚至在这期间,她的丈夫还被评为了“优秀党员”。http:\\t.cn\zHa60Zx 3653 698 0
33166398 1728892794 1369966040 1943年5月24日,《解放日报》发表《谈延市二流子的改造》:延安市划定二流子110人,其中女二流子39人。二流子的门上和身上被强迫佩带有二流子的徽章标志,只有在真正参加生产之后才被准许摘去。对女二流子,规定她们受家人严格束缚,帮助丈夫整顿家务,如有不改,则丈夫打骂,政府不管,也不准离婚。 18 6 0
33166399 1104150515 1369898443 今天听到了一个广告主说,现在还在包位置的网络广告主就是“傻大黑粗”,媒体游说说自己的用户有多高端,基本都是忽悠人 36 36 0
33166400 2524610164 1369962904 即使生气,也会装作淡定;即使不开心,也会努力微笑;即使悲伤,也只是偷偷的;即使在乎,也不会解释太多,这就是现在的我。 178 4 0
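A version of the dataset in which the Text field (the post body) has been segmented into Chinese words is also stored on HDFS at “/data/13/3/post_segmented/post_segmented.csv”. Its fields are separated by tabs, and the words in the segmented Text field are separated by commas. It is also available in Hive as the table “bigdata_cases.post_segmented”.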
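To make the segmented layout concrete, here is a small sketch that extracts the word list from one row of the segmented file, which is roughly what Hive's `lateral view explode(Text)` later expands into one row per word. The class `SegmentedText`, the method `words`, and the sample row are hypothetical illustrations, not part of the lab code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for reading one row of post_segmented.csv.
class SegmentedText {
    // Returns the words of the segmented Text field (index 3),
    // or an empty list if the row is malformed or has no words.
    static List<String> words(String segmentedLine) {
        String[] fields = segmentedLine.split("\t");
        if (fields.length != 7 || fields[3].isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(fields[3].split(","));
    }

    public static void main(String[] args) {
        // Made-up row: PostId, UserId, UtcTime, words, reposts, comments, RepostPostId.
        String row = "1\t2\t1370044800\tios7,apple\t10\t3\t0";
        for (String w : words(row)) {
            System.out.println(w);
        }
    }
}
```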
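[Hints for the Experimental Steps]
In the steps below, step 1 performs Chinese word segmentation with IKAnalyzer, step 2 uses Hive for data analysis and data preparation (all code is executed on the big data computing cluster), and step 3 uses R for data visualization.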
Chinese word segmentation with IKAnalyzer
The Java code is as follows:
package lab3.module13;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

/** Segments the Text field of each post with IKAnalyzer and drops stop words. */
public class WordTokenize {
    public static void main(String[] args) throws Exception {
        // Load the stop-word list (one word per line) from the working directory.
        Set<String> stopWordSet = new HashSet<String>();
        try (BufferedReader swbr = new BufferedReader(new InputStreamReader(
                new FileInputStream("stopword.dic"), StandardCharsets.UTF_8))) {
            String stopWord;
            while ((stopWord = swbr.readLine()) != null) {
                stopWordSet.add(stopWord);
            }
        }

        // args[0] is the input file, args[1] the output file; both are closed automatically.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                     new FileInputStream(args[0]), StandardCharsets.UTF_8));
             BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream(args[1]), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] fields = line.split("\t");
                if (fields.length != 7) {
                    continue; // skip rows that do not have exactly 7 fields
                }
                // Segment the post body (fields[3]); the second argument enables smart mode.
                IKSegmenter iks = new IKSegmenter(new StringReader(fields[3]), true);
                StringBuilder sb = new StringBuilder();
                Lexeme t;
                while ((t = iks.next()) != null) {
                    String word = t.getLexemeText();
                    if (stopWordSet.contains(word)) {
                        continue;
                    }
                    if (sb.length() > 0) {
                        sb.append(",");
                    }
                    sb.append(word);
                }
                // Write the row back with Text replaced by the comma-separated words.
                bw.write(fields[0] + "\t" + fields[1] + "\t" + fields[2] + "\t" + sb
                        + "\t" + fields[4] + "\t" + fields[5] + "\t" + fields[6] + "\r\n");
            }
        }
    }
}
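Compile the code into wordtokenize.jar. It depends on the IKAnalyzer package IKAnalyzer2012_FF.jar, which is located at “D:\packages\IK-Analyzer-2012FF\dist\”. Then submit the job to the big data computing cluster and run it: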
java -cp IKAnalyzer2012_FF.jar:wordtokenize.jar lab3.module13.WordTokenize data/13/3/post/post.csv data/13/3/post_segmented/post_segmented.csv
Data analysis and data preparation with Hive
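a. Compute the hotness-change metric, repost count, and comment count of each keyword
The hotness-change metric measures how much the number of occurrences of a keyword varies over a period of time. It is defined as
$$\frac{\max(N) - \min(N)}{\max(N) + \min(N)}$$
where N is the number of times the keyword appears per unit of time (here, per day). The larger the metric, the more the keyword's popularity fluctuates, which suggests that the keyword may describe a breaking hot topic of that period.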
hive -e \
"select Word, min(CountWord) as MinWordCount, max(CountWord) as MaxWordCount,
(max(CountWord) - min(CountWord)) / (max(CountWord) + min(CountWord)) as Ratio,
sum(CountRepost) / sum(CountWord) as AverageRepost, sum(CountComment) / sum(CountWord) as AverageComment
from (
select Date, Word, count(1) as CountWord,
sum(RepostsCount) as CountRepost, sum(CommentsCount) as CountComment
from (
select from_unixtime(UtcTime,'yyyy-MM-dd') as Date, Word, RepostsCount, CommentsCount
from bigdata_cases.post_segmented lateral view explode(Text) TextTable as Word
where from_unixtime(UtcTime,'yyyy-MM-dd') between '2013-06-01' and '2013-06-14') a2
group by Date, Word) a1
group by Word
having max(CountWord) > 1000
order by Ratio desc;" \
> 1.csv
The first 10 rows of the result are:
Word MinWordCount MaxWordCount Ratio AverageRepost AverageComment
龙舟竞渡 1 2170 0.9990787655458314 5.863933711295246 4.20409943305713
47人 1 1050 0.9980970504281637 163.6648160999306 55.10548230395559
作文题目 2 2015 0.9980168567178979 264.4995592124596 59.42932706435498
加纳 2 1903 0.9979002624671915 127.36005726556908 62.13266523502744
刘志军 5 4737 0.9978911851539435 108.75326525765851 29.62651389218713
斯诺 2 1825 0.9978106185002736 76.90679859559528 25.65432492818385
初五 3 2564 0.9976626412154266 112.04645476772616 65.81540342298288
寄往 6 4691 0.9974451777730466 6.891304347826087 4.53416149068323
聘礼 2 1509 0.9973527465254798 141.0823565700185 45.99876619370759
ios7 4 2205 0.9963784517881394 43.398148148148145 15.984825102880658
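As a worked check of the metric, the sketch below recomputes the Ratio column from a keyword's daily counts. The class `HotMetric` and the daily counts are hypothetical; only the minimum (1) and maximum (2170) are taken from the first row above, for which the formula gives 2169 / 2171 ≈ 0.99908. Because the query groups by Date and Word, days on which a keyword never appears produce no row, so the minimum is taken over days with at least one occurrence.

```java
// Hypothetical helper that mirrors the Ratio column of the Hive query.
class HotMetric {
    // dailyCounts holds the keyword's occurrence count for each day it appeared.
    static double ratio(long[] dailyCounts) {
        if (dailyCounts.length == 0) {
            throw new IllegalArgumentException("at least one daily count is required");
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long n : dailyCounts) {
            min = Math.min(min, n);
            max = Math.max(max, n);
        }
        // (max - min) / (max + min), computed in floating point.
        return (double) (max - min) / (max + min);
    }

    public static void main(String[] args) {
        // Made-up daily counts whose min and max match the first result row.
        long[] counts = {1, 40, 2170, 315};
        System.out.println(ratio(counts));
    }
}
```

A keyword that appears equally often every day gets a metric of 0, while a keyword that jumps from a handful of mentions to thousands approaches 1.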
b. Compute the daily occurrence count, repost count, and comment count of the five keywords with the largest hotness change and the five with the smallest
hive -e \
"select Date, Word, count(1) as CountWord,
sum(RepostsCount) as CountRepost, sum(CommentsCount) as CountComment
from (
select from_unixtime(UtcTime,'yyyy-MM-dd') as Date, Word, RepostsCount, CommentsCount
from bigdata_cases.post_segmented lateral view explode(Text) TextTable as Word
where from_unixtime(UtcTime,'yyyy-MM-dd') between '2013-06-01' and '2013-06-14') a
where Word in ('龙舟竞渡','47人','作文题目','加纳','刘志军')
group by Date, Word" \
> 2-1.csv
hive -e \
"select Date, Word, count(1) as CountWord,
sum(RepostsCount) as CountRepost, sum(CommentsCount) as CountComment
from (
select from_unixtime(UtcTime,'yyyy-MM-dd') as Date, Word, RepostsCount, CommentsCount
from bigdata_cases.post_segmented lateral view explode(Text) TextTable as Word
where from_unixtime(UtcTime,'yyyy-MM-dd') between '2013-06-01' and '2013-06-14') a
where Word in ('恤','情侣','鞋','广州','宝贝')
group by Date, Word" \
> 2-2.csv
The first 10 rows of the result are:
Date Word CountWord CountRepost CountComment
2013-06-01 作文题目 8 214 80
2013-06-09 47人 170 5747 2051
2013-06-04 47人 9 306 81
2013-06-10 47人 91 7039 5056
2013-06-08 加纳 191 3949 1886
2013-06-14 加纳 21 790 144
2013-06-03 加纳 4 175 43
2013-06-05 刘志军 6 12 5
2013-06-05 龙舟竞渡 1 459 130
2013-06-08 作文题目 635 169564 30690
Data visualization with R
a. Load the required packages
Copy the result files produced by Hive to a path that R can access, such as “D:\workspace\”.
> library(ggplot2)
> library(GGally)
Warning: package 'GGally' was built under R version 3.2.3
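b. Plot histograms of the hotness-change metric, repost count, and comment count
In each histogram, the x-axis shows the hotness-change metric, repost count, or comment count, and the y-axis shows the number of keywords. In the repost and comment histograms, the y-axis uses a square-root scale and the x-axis uses a logarithmic scale.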
> data1 <- read.table("D:/workspace/1.csv", sep = "\t", fileEncoding = "UTF-8")
Warning in scan(file, what, nmax, sep, dec, quote, skip, nlines,
na.strings, : invalid input found on input connection 'D:/workspace/1.csv'
> names(data1) <- c("Word", "MinWordCount", "MaxWordCount", "HotMetric", "AverageRepost",
+ "AverageComment")
> ggplot(data1, aes(x = HotMetric)) + geom_histogram(aes(fill = ..count..)) +
+ scale_fill_gradient("Count", low = "green", high = "red")
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
> ggplot(data1, aes(x = AverageRepost)) + geom_histogram(aes(fill = ..count..)) +
+ scale_fill_gradient("Count", trans = "sqrt", low = "green", high = "red") +
+ scale_x_log10() + scale_y_sqrt()
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
> ggplot(data1, aes(x = AverageComment)) + geom_histogram(aes(fill = ..count..)) +
+ scale_fill_gradient("Count", trans = "sqrt", low = "green", high = "red") +
+ scale_x_log10() + scale_y_sqrt()
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
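c. Plot the scatter-plot matrix and correlations of the hotness-change metric, repost count, and comment count
The repost and comment counts are plotted on a logarithmic scale. The plot shows that the repost count and the comment count are highly correlated with each other, but only weakly correlated with the hotness-change metric.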
> ggpairs(data.frame(HotMetric = data1[[4]], AverageRepost = log10(data1[[5]]),
+ AverageComment = log10(data1[[6]])), columnLabels = c("HotMetric", "Log(Repost)",
+ "Log(Comment)"))
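For readers who want to see what the correlation in such a matrix measures, the following sketch computes a Pearson correlation coefficient on log10-transformed values. It assumes the coefficient reported by the plot is Pearson's r; the class `LogCorrelation`, its methods, and the sample numbers are hypothetical and are not taken from the dataset.

```java
// Hypothetical helper: Pearson correlation of log10-transformed columns.
class LogCorrelation {
    // Applies log10 element-wise; inputs are assumed to be strictly positive.
    static double[] log10(double[] v) {
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = Math.log10(v[i]);
        }
        return out;
    }

    // Pearson's r: covariance divided by the product of standard deviations.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        // Made-up average repost and comment values for four keywords.
        double[] reposts = {5.9, 163.7, 264.5, 43.4};
        double[] comments = {4.2, 55.1, 59.4, 16.0};
        System.out.println(pearson(log10(reposts), log10(comments)));
    }
}
```

Taking logarithms first keeps a few keywords with very large repost or comment averages from dominating the coefficient.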
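d. Plot the daily occurrence count, repost count, and comment count of the five keywords with the largest hotness change and the five with the smallest
The x-axis shows the date, the y-axis shows the keyword's occurrence count, repost count, or comment count, and the line thickness indicates the keyword's occurrence count.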
> data21 <- read.table("D:/workspace/2-1.csv", sep = "\t", fileEncoding = "UTF-8")
> names(data21) <- c("Date", "Word", "CountWord", "CountRepost", "CountComment")
> data21$Date <- as.POSIXct(data21$Date)
> ggplot(data21, aes(x = Date, y = CountWord, group = Word)) + geom_line(aes(colour = Word,
+ size = CountWord))
> ggplot(data21, aes(x = Date, y = CountRepost, group = Word)) + geom_line(aes(colour = Word,
+ size = CountWord))
> ggplot(data21, aes(x = Date, y = CountComment, group = Word)) + geom_line(aes(colour = Word,
+ size = CountWord))
> data22 <- read.table("D:/workspace/2-2.csv", sep = "\t", fileEncoding = "UTF-8")
> names(data22) <- c("Date", "Word", "CountWord", "CountRepost", "CountComment")
> data22$Date <- as.POSIXct(data22$Date)
> ggplot(data22, aes(x = Date, y = CountWord, group = Word)) + geom_line(aes(colour = Word,
+ size = CountWord))
> ggplot(data22, aes(x = Date, y = CountRepost, group = Word)) + geom_line(aes(colour = Word,
+ size = CountWord))
> ggplot(data22, aes(x = Date, y = CountComment, group = Word)) + geom_line(aes(colour = Word,
+ size = CountWord))