The data set can be downloaded for free below.
[hadoop@hadoop000 hive_data]$ less sogou.500w.utf8
20111230000005 57375476989eea12893c0c3811607bcf 奇艺高清 1 1 http://www.qiyi.com/
20111230000005 66c5bb7774e31d0a22278249b26bc83a 凡人修仙传 3 1 http://www.booksky.org/BookDetail.aspx?BookID=1050804&Level=1
20111230000007 b97920521c78de70ac38e3713f524b50 本本联盟 1 1 http://www.bblianmeng.com/
[hadoop@hadoop000 hive_data]$ wc -l sogou.500w.utf8
5000000 sogou.500w.utf8
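Goal: apply substr to the first column (the timestamp) to split it into four columns (year, month, day, hour) and append them to the end of each row, so that the data can be partitioned later.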
[hadoop@hadoop000 hive_data]$ vi sogou-log-extend.sh
#!/bin/bash
#infile=/sogou_500w.utf8
infile=$1
#outfile=/sogou_500w.utf8.ext
outfile=$2
awk -F '\t' '{print $0 "\t" substr($1,1,4) "\t" substr($1,5,2) "\t" substr($1,7,2) "\t" substr($1,9,2)}' "$infile" > "$outfile"
[hadoop@hadoop000 hive_data]$ bash sogou-log-extend.sh /home/hadoop/data/hive_data/sogou.500w.utf8 /home/hadoop/data/hive_data/sogou.500w.utf8.ext
[hadoop@hadoop000 hive_data]$ less sogou.500w.utf8.ext
20111230000005 57375476989eea12893c0c3811607bcf 奇艺高清 1 1 http://www.qiyi.com/ 2011 12 30 00
20111230000005 66c5bb7774e31d0a22278249b26bc83a 凡人修仙传 3 1 http://www.booksky.org/BookDetail.aspx?BookID=1050804&Level=1 2011 12 30 00
20111230000007 b97920521c78de70ac38e3713f524b50 本本联盟 1 1 http://www.bblianmeng.com/ 2011 12 30 00
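Before loading the file, it may be worth checking that every line has the expected shape. The following is only a sketch, not part of the original workflow: the function name `check_sogou_ext` is hypothetical, and it assumes a tab-separated file with exactly 10 fields whose last four fields should equal the corresponding substrings of the 14-digit timestamp in the first field.

```shell
#!/bin/bash
# Hypothetical helper (not from the original post): prints how many lines of an
# extended file do NOT have 10 tab-separated fields, or have year/month/day/hour
# columns that disagree with the timestamp in column 1.
check_sogou_ext() {
  awk -F '\t' '
    NF != 10 ||
    $7 != substr($1,1,4) || $8 != substr($1,5,2) ||
    $9 != substr($1,7,2) || $10 != substr($1,9,2) { bad++ }
    END { print bad + 0 }
  ' "$1"
}
```

It could be called as `check_sogou_ext sogou.500w.utf8.ext`; a result of 0 would suggest the extension step behaved as intended.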
Load the data into HDFS
hadoop fs -mkdir -p /sogou/20111230
hadoop fs -put /home/hadoop/data/hive_data/sogou.500w.utf8 /sogou/20111230/
hadoop fs -mkdir -p /sogou_ext/20111230
hadoop fs -put /home/hadoop/data/hive_data/sogou.500w.utf8.ext /sogou_ext/20111230/
create database if not exists sogou;
use sogou;
create external table if not exists sogou.sogou_20111230(
ts string,
uid string,
keyword string,
rank int,
`order` int,
url string)
comment 'This is the sogou search data of one day'
row format delimited
fields terminated by '\t'
stored as textfile
location '/sogou/20111230';
create external table if not exists sogou.sogou_ext_20111230(
ts string,
uid string,
keyword string,
rank int,
`order` int,
url string,
year int,
month int,
day int,
hour int)
comment 'This is the sogou search data of extend'
row format delimited
fields terminated by '\t'
stored as textfile
location '/sogou_ext/20111230';
create external table if not exists sogou.sogou_partition(
ts string,
uid string,
keyword string,
rank int,
`order` int,
url string)
comment 'This is the sogou search data by partition'
partitioned by(
year INT, month INT, day INT, hour INT)
row format delimited
fields terminated by '\t'
stored as textfile;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table sogou.sogou_partition partition(year, month, day, hour) select * from sogou.sogou_ext_20111230;
select * from sogou_partition limit 10;
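Because the insert above uses dynamic partitioning, it can be useful to know in advance which (year, month, day, hour) combinations the data contains; Hive limits how many dynamic partitions one statement may create (see `hive.exec.max.dynamic.partitions` and `hive.exec.max.dynamic.partitions.pernode`). The sketch below is an illustration only: `preview_partitions` is a hypothetical name, and it assumes the extended file layout produced earlier, with the four date columns in fields 7 to 10.

```shell
#!/bin/bash
# Hypothetical helper: lists each distinct year/month/day/hour combination in
# the extended file together with its row count, sorted by combination.
preview_partitions() {
  awk -F '\t' '{ n[$7 "/" $8 "/" $9 "/" $10]++ }
       END { for (p in n) print p "\t" n[p] }' "$1" | sort
}
```

Running `preview_partitions sogou.500w.utf8.ext | wc -l` would then give a rough idea of how many partitions to expect.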
*Total number of records*
select count(*) from sogou_ext_20111230;
*Number of non-empty queries*
select count(*) from sogou_ext_20111230 where keyword is not null and keyword != '';
*Number of records without duplicates (by ts, uid, keyword, url)*
select count(*) from (select ts,uid,keyword,url from sogou_ext_20111230 group by ts,uid,keyword,url having count(*)=1) B;
*Number of distinct UIDs*
select count(distinct(uid)) from sogou_ext_20111230;
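The four counts above can be cross-checked on a small local sample without Hive. This is a minimal sketch under stated assumptions: `basic_counts` is a hypothetical name, the input is tab-separated with uid, keyword and url in fields 2, 3 and 6, and the whole sample fits in memory. Like the `having count(*)=1` query, it counts only (ts, uid, keyword, url) combinations that occur exactly once, so duplicated records are excluded entirely rather than counted once.

```shell
#!/bin/bash
# Hypothetical helper: prints total rows, rows with a non-empty keyword,
# (ts, uid, keyword, url) combinations seen exactly once, and distinct UIDs.
basic_counts() {
  awk -F '\t' '
    { total++ }
    $3 != "" { nonempty++ }
    { seen[$1 FS $2 FS $3 FS $6]++; uids[$2] = 1 }
    END {
      for (k in seen) if (seen[k] == 1) once++
      for (u in uids) nuid++
      printf "total=%d nonempty=%d once=%d uids=%d\n", total, nonempty, once, nuid
    }
  ' "$1"
}
```

For example, `head -n 100000 sogou.500w.utf8 > sample.tsv; basic_counts sample.tsv` would print the four numbers for the first 100000 rows.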
*Average keyword length, measured in whitespace-separated terms (a keyword with no whitespace has length 1)*
select avg(a.cnt) from (select size(split(keyword,'\\s+')) as cnt from sogou_ext_20111230) a;
Total MapReduce CPU Time Spent: 19 seconds 870 msec
OK
1.0012018
Time taken: 28.049 seconds, Fetched: 1 row(s)
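The "length" computed here is the number of whitespace-separated terms, not the number of characters, which is why the average is so close to 1. A rough local equivalent in awk might look like the sketch below. `avg_terms` is a hypothetical name; the sketch splits on runs of spaces only (tabs cannot occur inside a tab-delimited field) and treats an empty keyword as one term, which is an assumption about edge cases rather than a statement about Hive's behavior.

```shell
#!/bin/bash
# Hypothetical helper: average number of space-separated terms in field 3.
avg_terms() {
  awk -F '\t' '
    {
      # An empty keyword is counted as a single term (assumption).
      n = ($3 == "") ? 1 : split($3, parts, / +/)
      sum += n; rows++
    }
    END { if (rows) printf "%.4f\n", sum / rows }
  ' "$1"
}
```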
*Query frequency ranking (top 10 keywords)*
select keyword,count(*) as cnt from sogou_ext_20111230 group by keyword order by cnt desc limit 10;
Total MapReduce CPU Time Spent: 43 seconds 440 msec
OK
百度 38441
baidu 18312
人体艺术 14475
4399小游戏 11438
qq空间 10317
优酷 10158
新亮剑 9654
馆陶县县长闫宁的父亲 9127
公安卖萌 8192
百度一下 你就知道 7505
Time taken: 64.503 seconds, Fetched: 10 row(s)
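The same group, count, sort and limit pattern can be tried on a local sample with standard tools. The sketch below is illustrative only: `top_keywords` is a hypothetical name, the keyword is assumed to be field 3 of a tab-separated file, and the order of keywords with equal counts is not guaranteed to match Hive's.

```shell
#!/bin/bash
# Hypothetical helper: prints the N most frequent keywords as "keyword<TAB>count".
# $1 = input file, $2 = N (defaults to 10).
top_keywords() {
  awk -F '\t' '{ c[$3]++ } END { for (k in c) print c[k] "\t" k }' "$1" |
    sort -rn |
    head -n "${2:-10}" |
    awk -F '\t' '{ print $2 "\t" $1 }'
}
```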
*Number of UIDs that searched once, twice, three times, and more than three times*
select sum(if(uids.cnt=1,1,0)),sum(if(uids.cnt=2,1,0)),sum(if(uids.cnt=3,1,0)),sum(if(uids.cnt>3,1,0)) from (select uid,count(*) as cnt from sogou_ext_20111230 group by uid) uids;
Total MapReduce CPU Time Spent: 34 seconds 600 msec
OK
549148 257163 149562 396791
Time taken: 56.256 seconds, Fetched: 1 row(s)
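The query first counts rows per UID and then sums indicator values per bucket. As a sketch of the same two-step logic outside Hive (with `uid_buckets` as a hypothetical name and the UID assumed to be field 2):

```shell
#!/bin/bash
# Hypothetical helper: prints four tab-separated numbers, the count of UIDs
# with exactly 1, exactly 2, exactly 3, and more than 3 queries.
uid_buckets() {
  awk -F '\t' '
    { cnt[$2]++ }                      # step 1: queries per UID
    END {
      for (u in cnt) {                 # step 2: bucket each UID by its count
        if (cnt[u] == 1) one++
        else if (cnt[u] == 2) two++
        else if (cnt[u] == 3) three++
        else more++
      }
      printf "%d\t%d\t%d\t%d\n", one, two, three, more
    }
  ' "$1"
}
```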
*Average number of queries per UID*
select sum(uids.cnt)/count(uids.uid) from (select uid,count(*) as cnt from sogou_ext_20111230 group by uid) uids;
Total MapReduce CPU Time Spent: 28 seconds 400 msec
OK
3.6964094557111005
Time taken: 49.467 seconds, Fetched: 1 row(s)
*Number of users with more than two queries*
select sum(if(uids.cnt>2,1,0)) from (select uid,count(*) as cnt from sogou_ext_20111230 group by uid) uids;
Total MapReduce CPU Time Spent: 31 seconds 520 msec
OK
546353
Time taken: 51.733 seconds, Fetched: 1 row(s)
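*Share of users with more than two queries*
Compute the two counts separately, then divide: the numerator is the 546353 users returned by the previous query, and the denominator is the total number of distinct UIDs.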
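As a worked example of that division, the numbers already shown in this post can be plugged in directly. The denominator is taken here as the sum of the four UID buckets reported earlier; the variable names are only illustrative.

```shell
#!/bin/bash
# Counts copied from the query outputs shown in this post.
more_than_two=546353                                # users with more than two queries
total_uids=$((549148 + 257163 + 149562 + 396791))   # sum of the four UID buckets
share=$(awk -v a="$more_than_two" -v b="$total_uids" 'BEGIN { printf "%.4f", a / b }')
echo "$total_uids $share"   # prints: 1352664 0.4039
```

So roughly 40% of the users issued more than two queries that day.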
*Share of clicks on results ranked in the top 10*
select count(*) as cnt from sogou_ext_20111230 where rank<11;
Total MapReduce CPU Time Spent: 13 seconds 230 msec
OK
4999869
Time taken: 23.677 seconds, Fetched: 1 row(s)
The total is 5 million rows, so the ratio is 4999869/5000000 (about 99.997%). Almost all clicks land on one of the first 10 search results.
*Share of queries that enter a URL directly*
select count(*) from sogou_ext_20111230 where keyword like '%www%';
Total MapReduce CPU Time Spent: 12 seconds 390 msec
OK
73979
Time taken: 24.717 seconds, Fetched: 1 row(s)
*Queries that enter a URL directly where the queried URL is contained in the clicked URL*
select sum(if(instr(url,keyword)>0,1,0)) from (select * from sogou_ext_20111230 where keyword like '%www%') a;
Total MapReduce CPU Time Spent: 12 seconds 600 msec
OK
27561
Time taken: 23.817 seconds, Fetched: 1 row(s)
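Only 27561 of the 73979 URL-style queries (about 37%) led to a click on a URL containing the query, so most users who search for a URL do not end up at the result they were looking for.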
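Both URL-related counts can be reproduced on a local sample in a single pass. This is a sketch under the same assumptions as the Hive queries: a query counts as a "URL query" if the keyword contains the literal string `www`, and as a match if the clicked URL contains the whole keyword. `url_query_stats` is a hypothetical name, and keyword and URL are assumed to be fields 3 and 6.

```shell
#!/bin/bash
# Hypothetical helper: prints "<url_queries><TAB><matched>", where url_queries
# is the number of rows whose keyword contains "www" and matched is how many
# of those have the keyword as a substring of the clicked URL.
url_query_stats() {
  awk -F '\t' '
    index($3, "www") > 0 {
      url_queries++
      if (index($6, $3) > 0) matched++
    }
    END { printf "%d\t%d\n", url_queries, matched }
  ' "$1"
}
```

Dividing the second number by the first gives the match rate for the sample.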
The analysis of individual users' search behavior is omitted here.