1)数据参数
用户的历史微博数据
截止到20131215
压缩后244MB,解压后878MB
2)数据类型
整个数据是json格式
json中字段描述:
beCommentWeiboId 是否评论
beForwardWeiboId 是否是转发微博
catchTime 抓取时间
commentCount 评论次数
content 内容
createTime 创建时间
info1 信息字段1
info2信息字段2
info3信息字段3
mlevel no sure
musicurl 音乐链接
pic_list 照片列表(可以有多个)
praiseCount 点赞人数
reportCount 转发人数
source 数据来源
userId 用户id
videourl 视频链接
weiboId 微博id
weiboUrl 微博网址
1. 组织数据(Hive)
创建Hive表weibo(json STRING),表只有一个字段,导入所有数据,并验证查询前3条数据
1>建表(建库)
①创建数据库:create database weibo;
②切换数据库:use weibo;
③创建外部表:create external tableweibo(json string) row format delimited lines terminated by "\n"stored as textfile location "/exam/weibo";
2>导入数据
①上传数据:
②解压文件:unzip weibo.zip
③上传数据:hdfs dfs -put ~/data/619893/*/exam/weibo/
3>验证查询前三条数据
select json from weibo limit 3;
2. 统计需求(Hive)
(1)统计微博总量和独立用户数
①确认是否有脏数据:通过结果很容易看出没有
select get_json_object(js.json,'$.userId') from (select json from weibo)js where substr(json,1,1)="{";
②正常查询:
select "微博总量:",sum(user.cnt),"独立用户总数",count(user.userId)
from(
select jj.uid as userId ,count(*) as cnt
from (
selectget_json_object(substring(js.json,2),'$.userId')
as uid
from (
select json from weibo
)
as js
) as jj
group by jj.uid
) as user;
(2)统计用户所有微博被转发的总次数,并输出TOP-3用户
①创建一个视图:
create view userRecord
as select
get_json_object(substring(js.json,2),'$.beCommentWeiboId') asbeCommentWeiboId ,
get_json_object(substring(js.json,2),'$.beForwardWeiboId') asbeForwardWeiboId ,
get_json_object(substring(js.json,2),'$.catchTime') as catchTime ,
get_json_object(substring(js.json,2),'$.commentCount') as commentCount ,
get_json_object(substring(js.json,2),'$.content') as content,
get_json_object(substring(js.json,2),'$.createTime') as createTime ,
get_json_object(substring(js.json,2),'$.info1') as info1 ,
get_json_object(substring(js.json,2),'$.info2') as info2,
get_json_object(substring(js.json,2),'$.info3') as info3,
get_json_object(substring(js.json,2),'$.mlevel') as mlevel,
get_json_object(substring(js.json,2),'$.musicurl') as musicurl,
get_json_object(substring(js.json,2),'$.pic_list') as pic_list ,
get_json_object(substring(js.json,2),'$.praiseCount') as praiseCount,
get_json_object(substring(js.json,2),'$.reportCount') as reportCount,
get_json_object(substring(js.json,2),'$.source') as source ,
get_json_object(substring(js.json,2),'$.userId') as userId ,
get_json_object(substring(js.json,2),'$.videourl') as videourl ,
get_json_object(substring(js.json,2),'$.weiboId') as weiboId,
get_json_object(substring(js.json,2),'$.weiboUrl') as weiboUrl
from (select json from weibo) js;
②执行查询:
select userId,sum(reportCount) as cnt from userRecord group by userIdorder by cnt DESC limit 3;
(3)统计微博被转发最多的前3位用户的id
执行查询:
select uu.userId
from (
select userId,count(*)
as cnt
from userRecord
where reportCount>0
group by userId
order by cnt
)
as uu limit 3;;
(4)统计每个用户的发送微博总数,并存储到临时表
创建临时表:
create table tempory_uid_sum(
uid string,
total int
);
查询并插入数据:
insert overwrite table tempory_uid_sum select userId,sum(reportCount)from userRecord group by userId;
(5)统计带图片的微博数
执行查询:
select count(*) from userRecord where length(pic_list) >2;
(6)统计使用iphone发微博的独立用户数
执行查询:
select count(distinct(userId)) from userRecord where source="iPhone客户端";
(7)统计微博中用户ID与数据来源信息,将其放入视图中,然后统计视图中数据来源是“iPad客户端”的用户数目
创建视图:
create view view_uid_source
as
select userId,source
from
userRecord;
执行查询:
select count(distinct(userId)) from view_uid_source wheresource="iPad客户端";
3 特殊需求
①往hive中添加jar:add jar/home/hadoop/data/UDF_11.jar
②创建临时函数:create temporary functionAddTwo as "org.zkpk.func.Add";
③创建临时函数:create temporary function WordCount as"org.zkpk.func.QueryWord";
(1)实现Hive UDF完成下面的需求:
将微博的点赞人数与转发人数相加求和,并将相加之和降序排列,取前10条记录
①代码
import org.apache.hadoop.hive.ql.exec.UDF;
public class Add extends UDF {
public Integer evaluate(Integerval1,Integer val2) throws Exception{
return val1+val2;
}
}
②查询:
select b.*
from (
select createTime,userId AddTwo(cast(praiseCountas int),cast(reportCount as int)) astotal
from userRecord
) as a ,
(
select * from userRecord
) as b
where a.userId=b.userId
and a.catchTime = b.catchTime
order by a.total;
(2) 实现Hive UDF完成下面的需求:
1>微博内容content中的包含某个词的个数,方法返回值是int类型的数值
①代码:
import org.apache.hadoop.hive.ql.exec.UDF;
public classQueryWord extends UDF {
static int counter = 0;
public int stringNumbers(String str,String se) {
if (str.indexOf(se) == -1) {
return 0;
} else if (str.indexOf(se) !=-1) {
counter++;
stringNumbers(str.substring(str.indexOf(se)+ se.length()), se);
return counter;
}
return 0;
}
public Integer evaluate(String val1,String val2) throws Exception {
int sum = stringNumbers(val1,val2);
Integer in = newInteger(sum);
return in;
}
}
2>使用该方法统计微博内容中出现“iphone”次数最多的用户,最终结果输出用户ID和次数
①查询:
selectuserId,sum(WordCount(content,"iphone")) as cnt from userRecord groupby userId order by cnt;
②查询(这个查询完全是为了方便查看结果):
selectuserId,sum(WordCount(content,"iphone")) as cnt from userRecord groupby userId order by cnt DESC limit 5;