Data source:
User query logs from the official Sogou Labs website: http://www.sogou.com/labs/resource/q.php
Column 1: search time
Column 2: user ID
Column 3: query string
Column 4: rank (row number) at which the result appeared on the results page
Column 5: row number on the page that the user actually clicked
Column 6: hyperlink the user clicked
Note that columns 4 and 5 are separated by a space rather than a tab. You can fix this with Notepad's find-and-replace; typing a tab directly into the replace dialog does not work, so copy a tab character from elsewhere in the file instead.
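Editing a large log file in Notepad is slow; as an alternative, here is a minimal command-line sketch. It assumes the rank/click pair is the only run of two space-separated numbers sitting between tabs, so back up the file first:
[root@hadoop01 ~]# # replace the space between the two numeric fields (rank and clicked row) with a tab, in place
[root@hadoop01 ~]# sed -i 's/\t\([0-9]*\) \([0-9]*\)\t/\t\1\t\2\t/' /test/SogouQ.sample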
Data import:
1. Create a database named hive
[root@hadoop01 ~]# hive
hive> show databases;
OK
default
Time taken: 7.305 seconds, Fetched: 1 row(s)
hive> create database hive;
OK
Time taken: 0.53 seconds
hive> show databases;
OK
default
hive
Time taken: 0.01 seconds, Fetched: 2 row(s)
hive> use hive;
OK
Time taken: 0.013 seconds
2. Create the table Sogou
hive> create table Sogou(Time string,ID string,word string,location1 int,location2 int,website string) row format delimited fields terminated by '\t' lines terminated by '\n';
OK
Time taken: 1.277 seconds
3. Load the local data file into the Sogou table
hive> load data local inpath '/test/SogouQ.sample' into table Sogou;
Loading data to table hive.sogou
OK
Time taken: 1.788 seconds
4. Inspect the Sogou table
hive> select * from Sogou;
Known issue: Chinese text comes back garbled, so the queries below filter on English keywords only.
(For the Hive Chinese-encoding problem, see: https://www.cnblogs.com/DreamDrive/p/7469476.html)
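One common cause (stated here as an assumption, not verified on this cluster) is that the Sogou sample file is GBK-encoded while Hive reads UTF-8. If so, converting the file before loading usually fixes the garbling; the output filename SogouQ.utf8 below is hypothetical:
[root@hadoop01 ~]# # convert from GBK to UTF-8 into a new file, then load that file instead
[root@hadoop01 ~]# iconv -f GBK -t UTF-8 /test/SogouQ.sample > /test/SogouQ.utf8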
Extensions:
To load data that already lives in HDFS, drop the local keyword.
Managed (internal) vs. external tables: a managed table's data is stored inside the Hive warehouse directory, while an external table's data stays outside it.
Dropping a managed table removes both the metadata and the data files; dropping an external table removes only the metadata, and the underlying data files remain in place.
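A sketch of both points (the HDFS path and the table name Sogou_ext are hypothetical, not from the original session). Without local, the file is moved from its current HDFS location into the table's warehouse directory; an external table instead keeps its data at the path given by location, so drop table removes only its metadata:
hive> load data inpath '/test/SogouQ.sample' into table Sogou;
hive> create external table Sogou_ext(Time string,ID string,word string,location1 int,location2 int,website string) row format delimited fields terminated by '\t' lines terminated by '\n' location '/test/sogou_ext';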
Analyzing the search data with Hive:
1. count
Count the total number of records; the console output below shows Hive's full MapReduce execution process.
hive> select count(*) from Sogou;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20190527162925_920c8536-f6d8-4246-8d03-89ba851ba58f
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1558939718678_0002, Tracking URL = http://hadoop01:8088/proxy/application_1558939718678_0002/
Kill Command = /export/servers/hadoop/bin/hadoop job -kill job_1558939718678_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-05-27 16:29:43,241 Stage-1 map = 0%, reduce = 0%
2019-05-27 16:29:59,731 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.53 sec
2019-05-27 16:30:24,303 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.91 sec
MapReduce Total cumulative CPU time: 6 seconds 910 msec
Ended Job = job_1558939718678_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.91 sec HDFS Read: 885689 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 910 msec
OK
10000
Time taken: 61.229 seconds, Fetched: 1 row(s)
hive>
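A usage note, assuming default settings (actual behavior depends on hive.fetch.task.conversion): a simple select with a limit is usually answered by a fetch task without launching a MapReduce job, making it a much faster sanity check than count(*):
hive> select * from Sogou limit 5;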
2. Count the records whose query contains the keyword baidu
hive> select count(*) from Sogou where word like '%baidu%';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20190527163553_998bf42e-2eb6-437a-b7c5-4ddeb72a2f90
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1558939718678_0003, Tracking URL = http://hadoop01:8088/proxy/application_1558939718678_0003/
Kill Command = /export/servers/hadoop/bin/hadoop job -kill job_1558939718678_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-05-27 16:36:23,228 Stage-1 map = 0%, reduce = 0%
2019-05-27 16:36:49,593 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.11 sec
2019-05-27 16:37:05,957 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.94 sec
MapReduce Total cumulative CPU time: 6 seconds 940 msec
Ended Job = job_1558939718678_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.94 sec HDFS Read: 886490 HDFS Write: 102 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 940 msec
OK
17
Time taken: 75.007 seconds, Fetched: 1 row(s)
hive>
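As a sketch of a natural follow-up that was not part of the original session, grouping by the query string ranks the most frequent search keywords:
hive> select word, count(*) as cnt from Sogou group by word order by cnt desc limit 10;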
3. Count the records where the query contains baidu and both the result rank and the clicked row are 1
hive> select count(*) from Sogou where word location1=1 and location2=1 and like '%baidu%';
FAILED: ParseException line 1:38 missing EOF at 'location1' near 'word'
The first attempt fails because the like predicate was split from its column; every condition must pair a column with its own operator. Corrected:
hive> select count(*) from Sogou where location1=1 and location2=1 and word like '%baidu%';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20190527164056_0135e26b-58d3-409f-99e5-cbda737139d2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1558939718678_0004, Tracking URL = http://hadoop01:8088/proxy/application_1558939718678_0004/
Kill Command = /export/servers/hadoop/bin/hadoop job -kill job_1558939718678_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-05-27 16:41:18,492 Stage-1 map = 0%, reduce = 0%
2019-05-27 16:41:43,808 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 8.68 sec
2019-05-27 16:42:04,197 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.79 sec
MapReduce Total cumulative CPU time: 11 seconds 790 msec
Ended Job = job_1558939718678_0004
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 11.79 sec HDFS Read: 887049 HDFS Write: 102 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 790 msec
OK
10
Time taken: 70.628 seconds, Fetched: 1 row(s)
hive>
This experiment also shows that Hive table names and database names are case-insensitive: the table created as Sogou is stored as sogou and can be queried under either spelling.
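For instance (a trivial check, assuming the data loaded above is still in place), any casing of the name resolves to the same table and returns the same count as step 1:
hive> select count(*) from SOGOU;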