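I have just started learning Hive, so as practice I ran a set of HiveQL statements against part of the Sogou Labs search log corpus.
Corpus home page: http://www.sogou.com/labs/resource/q.php
Overview: the search engine query log corpus is a collection of web query logs covering about one month (June 2008) of query requests and user clicks on part of the Sogou search engine. It gives researchers who study the behavior of Chinese search engine users a benchmark corpus to work with.
Record format:
"access time\tuser ID\t[query terms]\trank of the URL in the returned results\tsequence number of the user's click\tURL the user clicked"
The user ID is assigned automatically from the cookie information sent when the user reaches the search engine through a browser, so different queries entered during the same browser session share one user ID.
Also note that the log data set comes in three versions: mini (sample data, 376KB), reduced (one day of data, 63MB), and full (1.9GB). This exercise uses the reduced version, i.e. one day of data.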
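To make the record format concrete, here is a small Python sketch that splits one log line into the six fields listed above. The class name, field names, and the sample line are my own inventions for illustration, not part of the corpus, and I am assuming every line has exactly six tab-separated fields as the format description says.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    visit_time: str
    user_id: str
    keywords: str    # kept in its raw form, still wrapped in [ ]
    url_rank: int    # rank of the URL in the returned results
    click_rank: int  # sequence number of the user's click
    url: str


def parse_record(line: str) -> LogRecord:
    """Split one tab-separated log line into its six fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated fields, got {len(fields)}")
    visit_time, user_id, keywords, url_rank, click_rank, url = fields
    return LogRecord(visit_time, user_id, keywords,
                     int(url_rank), int(click_rank), url)


# Made-up sample line (assumption), not taken from the real data set.
sample = "00:00:01\tuser-0001\t[hive tutorial]\t3\t1\thttp://example.com/hive"
record = parse_record(sample)
print(record.keywords, record.url_rank, record.click_rank)
```

The six fields line up one-to-one with the columns of the Hive table defined next.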
create external table souGouLog(
visitTime String,
userID String,
keyWords String,
urlRank int,
clickRank int,
url String)
comment 'sougou log'
row format delimited fields terminated by '\t';
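load data inpath '/ljl/sougoulog' overwrite into table souGouLog;
Note: the path here is the location of the log files on HDFS; it is a directory.
select count(*) from souGouLog;
Note: the result is 1724264. The data set is small, so opening the file in a text editor and checking the line count gives the same number.
select count(distinct userID) from souGouLog;
Note: the result is 519876 users.
select count(distinct url) from souGouLog;
Note:
select count(distinct keyWords) from souGouLog;
Note: the result is 318833 distinct queries.
select avg(a.visitTimes) from (select count(*) as visitTimes from souGouLog group by userID) as a;
Note: 3.316 queries per user. Once the total record count and the user count are known, dividing the first by the second gives the same value.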
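The shortcut in the note above can be checked with a few lines of Python. This is only a sketch: the function names and the toy list of user IDs are hypothetical, while the two large numbers are the counts reported earlier.

```python
from collections import Counter


def average_per_user(total_records: int, distinct_users: int) -> float:
    """Total number of records divided by the number of distinct users."""
    return total_records / distinct_users


def average_of_group_counts(user_ids) -> float:
    """Mimic avg(count(*)) over `group by userID` on a list of user IDs."""
    counts = Counter(user_ids)
    return sum(counts.values()) / len(counts)


# Toy data (assumption): the two approaches agree because the per-user
# counts always add up to the total number of records.
toy_log = ["u1", "u1", "u2", "u3", "u3", "u3"]
assert average_of_group_counts(toy_log) == average_per_user(
    len(toy_log), len(set(toy_log)))

# Counts reported by the two count queries above.
print(round(average_per_user(1724264, 519876), 4))
```

The division comes out at about 3.3167, in line with the 3.316 returned by the avg query.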
Number of times each search term was queried
select keyWords, count(*) as looktimes from souGouLog group by keyWords;
Top 10 most frequent queries, using order by
select keyWords, looktimes from (select keyWords, count(*) as looktimes from souGouLog group by keyWords) as a order by looktimes desc limit 10;
Top 10 most frequent queries, using sort by first and then order by
select keyWords, looktimes from (select keyWords, looktimes from (select keyWords, count(*) as looktimes from souGouLog group by keyWords) as a sort by looktimes desc limit 10) as b order by looktimes desc limit 10;
Note, the results:
[哄抢救灾物资] 66906
[汶川地震原因] 58766
[封杀莎朗斯通] 12649
[一个暗娼的自述] 9758
[广州军区司令员] 8661
[暗娼李湘] 8584
[成都警方扫黄现场] 5371
[百度] 4958
[尼泊尔地图] 4886
[现役解放军中将名单] 4721
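As I understand it, the point of sorting before ordering is that each reducer only has to keep its own top 10, and the final ordering then works on a handful of candidate rows instead of the whole result. The Python sketch below imitates that idea on made-up counts; the partitions stand in for reducers, and all names and numbers are hypothetical. It assumes each keyword lands in exactly one partition, which holds here because the counts are already aggregated per keyword.

```python
import heapq


def local_top_n(counts: dict, n: int):
    """Top n (keyword, count) pairs within one partition."""
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])


def two_stage_top_n(partitions, n: int):
    """Take a local top n per partition, then a global top n of the survivors."""
    candidates = []
    for part in partitions:
        candidates.extend(local_top_n(part, n))
    return heapq.nlargest(n, candidates, key=lambda kv: kv[1])


# Hypothetical per-reducer keyword counts.
partitions = [
    {"a": 5, "b": 1, "c": 9},
    {"d": 7, "e": 2},
    {"f": 4},
]

# Reference answer: sort everything in one place.
everything = {k: v for part in partitions for k, v in part.items()}
single_stage = heapq.nlargest(3, everything.items(), key=lambda kv: kv[1])

assert two_stage_top_n(partitions, 3) == single_stage
print(two_stage_top_n(partitions, 3))
```

Any row in the global top n must also be in the top n of its own partition, so dropping the rest early loses nothing.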
select distinct visitHour, visitTimes from (select visitHour, count(*) as visitTimes from (select substr(visitTime, 1, 2) as visitHour from souGouLog) a group by a.visitHour) as b;
select distinct visitHour, visitTimes from (select substr(visitTime, 1, 2) as visitHour, count(*) as visitTimes from souGouLog group by substr(visitTime, 1, 2)) a;
Note, the results:
00 51807
01 30498
02 19813
03 13239
04 10131
05 10838
06 16733
07 28936
08 56032
09 86227
10 104872
11 98135
12 88283
13 95095
14 101455
15 109255
16 116679
17 104756
18 91830
19 97247
20 111022
21 115283
22 100122
23 65976
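Both hourly queries take the first two characters of visitTime as the hour and count the rows per hour. Below is a rough Python equivalent; the sample timestamps are invented, and I am assuming they are in HH:MM:SS form, which is what the two-digit hours in the output suggest. Note that Hive's substr counts positions from 1, while a Python slice starts at 0.

```python
from collections import Counter


def visits_per_hour(visit_times):
    """Count records per hour, using the first two characters as the hour."""
    return Counter(t[:2] for t in visit_times)


# Invented timestamps (assumption: HH:MM:SS strings).
sample_times = ["00:00:01", "00:15:30", "09:05:00", "23:59:59"]

for hour, visits in sorted(visits_per_hour(sample_times).items()):
    print(hour, visits)
```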
select * from (select visitHour, keyWords, visitTimes, rank() over (partition by visitHour order by visitTimes desc) as keyRank from (select substr(visitTime, 1, 2) as visitHour, keyWords, count(*) as visitTimes from souGouLog group by substr(visitTime, 1, 2), keyWords) as a) as b where keyRank < 4;
00 [汶川地震原因] 1579 1
00 [哄抢救灾物资] 1397 2
00 [封杀莎朗斯通] 618 3
01 [汶川地震原因] 779 1
01 [哄抢救灾物资] 658 2
01 [封杀莎朗斯通] 300 3
02 [汶川地震原因] 368 1
02 [哄抢救灾物资] 321 2
02 [封杀莎朗斯通] 150 3
03 [汶川地震原因] 246 1
03 [哄抢救灾物资] 169 2
03 [电影] 74 3
04 [汶川地震原因] 169 1
04 [哄抢救灾物资] 164 2
04 [电影] 75 3
05 [哄抢救灾物资] 259 1
05 [汶川地震原因] 173 2
05 [镣铐绳艺视频] 156 3
06 [哄抢救灾物资] 594 1
06 [汶川地震原因] 530 2
06 [封杀莎朗斯通] 132 3
07 [哄抢救灾物资] 1209 1
07 [汶川地震原因] 1090 2
07 [封杀莎朗斯通] 236 3
08 [哄抢救灾物资] 2314 1
08 [汶川地震原因] 2301 2
08 [尼泊尔地图] 470 3
09 [哄抢救灾物资] 3241 1
09 [汶川地震原因] 3233 2
09 [尼泊尔地图] 600 3
10 [哄抢救灾物资] 3851 1
10 [汶川地震原因] 3728 2
10 [暗娼李湘] 861 3
11 [哄抢救灾物资] 3668 1
11 [汶川地震原因] 3154 2
11 [广州军区司令员] 1397 3
12 [哄抢救灾物资] 3418 1
12 [汶川地震原因] 2755 2
12 [广州军区司令员] 1029 3
13 [哄抢救灾物资] 3343 1
13 [汶川地震原因] 2842 2
13 [广州军区司令员] 796 3
14 [哄抢救灾物资] 3334 1
14 [汶川地震原因] 3019 2
14 [广州军区司令员] 829 3
15 [哄抢救灾物资] 4138 1
15 [汶川地震原因] 3370 2
15 [一个暗娼的自述] 821 3
16 [哄抢救灾物资] 4426 1
16 [汶川地震原因] 3612 2
16 [一个暗娼的自述] 882 3
17 [哄抢救灾物资] 4406 1
17 [汶川地震原因] 3461 2
17 [一个暗娼的自述] 751 3
18 [哄抢救灾物资] 4114 1
18 [汶川地震原因] 3282 2
18 [封杀莎朗斯通] 732 3
19 [哄抢救灾物资] 4320 1
19 [汶川地震原因] 3375 2
19 [封杀莎朗斯通] 674 3
20 [哄抢救灾物资] 5144 1
20 [汶川地震原因] 4272 2
20 [封杀莎朗斯通] 856 3
21 [哄抢救灾物资] 5459 1
21 [汶川地震原因] 5030 2
21 [封杀莎朗斯通] 972 3
22 [哄抢救灾物资] 4546 1
22 [汶川地震原因] 4105 2
22 [封杀莎朗斯通] 755 3
23 [哄抢救灾物资] 2413 1
23 [汶川地震原因] 2293 2
23 [封杀莎朗斯通] 466 3
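The last query ranks keywords within each hour and keeps ranks 1 to 3. To show what rank() over (partition by ... order by ... desc) does, here is a minimal Python sketch on invented rows; the function name and the data are hypothetical. Like rank(), it gives tied counts the same rank and then skips ahead, so a tie can return more or fewer than k rows for an hour.

```python
from collections import defaultdict


def top_k_per_hour(rows, k: int):
    """rows: iterable of (hour, keyword, visits).

    Returns (hour, keyword, visits, rank) for every row whose rank within
    its hour is at most k, ranking by visits in descending order.
    """
    by_hour = defaultdict(list)
    for hour, keyword, visits in rows:
        by_hour[hour].append((keyword, visits))

    result = []
    for hour in sorted(by_hour):
        ordered = sorted(by_hour[hour], key=lambda kv: kv[1], reverse=True)
        rank = 0
        previous = None
        for position, (keyword, visits) in enumerate(ordered, start=1):
            if visits != previous:
                # A new count value takes its position as rank; ties keep
                # the rank of the first row with that count.
                rank = position
                previous = visits
            if rank > k:
                break
            result.append((hour, keyword, visits, rank))
    return result


# Invented rows (assumption): "a" and "b" tie in hour 00.
rows = [("00", "a", 5), ("00", "b", 5), ("00", "c", 3),
        ("00", "d", 1), ("01", "x", 2)]
for row in top_k_per_hour(rows, 2):
    print(*row)
```

In the toy data the tie between "a" and "b" pushes "c" to rank 3, so with k set to 2 it is dropped.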