Hive (3): A Hive Example with the Sogou User Search Logs

Data source:

User query logs from the Sogou Labs official website: http://www.sogou.com/labs/resource/q.php

Column 1: search time

Column 2: user ID

Column 3: search keyword

Column 4: the rank of the result on the results page (which row it appeared in)

Column 5: the row on the page that the user actually clicked

Column 6: the URL the user clicked

Note that columns 4 and 5 are separated by a space rather than a tab. You can fix this with Notepad's find-and-replace; typing a tab directly into the replace box does not work, but you can copy an existing tab character from the file and paste it in.


Importing the data:

1. Create a database named hive

[root@hadoop01 ~]# hive

hive> show databases;

OK

default

Time taken: 7.305 seconds, Fetched: 1 row(s)

hive> create database hive;

OK

Time taken: 0.53 seconds

hive> show databases;

OK

default

hive

Time taken: 0.01 seconds, Fetched: 2 row(s)

hive> use hive;

OK

Time taken: 0.013 seconds

2. Create a table named Sogou

hive> create table Sogou(Time string,ID string,word string,location1 int,location2 int,website string) row format delimited fields terminated by '\t' lines terminated by '\n';

OK

Time taken: 1.277 seconds
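
To double-check the schema before loading any data, you can describe the table (output omitted here):

hive> describe Sogou;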

3. Load the local data into the Sogou table

hive> load data local inpath '/test/SogouQ.sample' into table Sogou;

Loading data to table hive.sogou

OK

Time taken: 1.788 seconds

4. Inspect the Sogou table

hive> select * from Sogou;

There is a catch: Chinese text comes back garbled (an encoding problem), so for now only English keywords can be queried reliably.

(On Hive's Chinese-encoding problem, see: https://www.cnblogs.com/DreamDrive/p/7469476.html)
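
Beyond the metastore settings covered at that link, a likely cause is that the Sogou file is GBK-encoded while Hive decodes text as UTF-8 by default. If so, one possible fix, assuming the table uses the default LazySimpleSerDe (which honors the serialization.encoding property), is:

-- Sketch: have the table's SerDe decode the data file as GBK instead of UTF-8.
hive> alter table Sogou set serdeproperties ('serialization.encoding'='GBK');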


Further notes:

To load data that already sits on HDFS, drop the LOCAL keyword, as in the sketch below.
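
For example, a sketch of loading the same file from HDFS, assuming it has already been uploaded to /test/SogouQ.sample there:

-- No LOCAL keyword: the path is resolved on HDFS, and the file is moved
-- (not copied) into the table's warehouse directory.
hive> load data inpath '/test/SogouQ.sample' into table Sogou;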

Internal vs. external tables: an internal (managed) table stores its data inside the Hive warehouse directory, whereas an external table's data lives outside the warehouse.

Dropping an internal table deletes both the table's metadata and its data files; dropping an external table deletes only the metadata, and the underlying data files remain. A sketch of an external-table definition follows.
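
A minimal sketch of the same schema as an external table; the name Sogou_ext and the LOCATION path are made up for illustration:

-- Dropping this table later removes only its metadata; the files under
-- /test/sogou_ext remain on HDFS.
hive> create external table Sogou_ext(Time string, ID string, word string, location1 int, location2 int, website string) row format delimited fields terminated by '\t' lines terminated by '\n' location '/test/sogou_ext';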


Analyzing the search data with Hive:

1. count

Count the total number of records; the transcript below shows the full MapReduce execution that Hive generates for the query.

hive> select count(*) from Sogou;

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

Query ID = root_20190527162925_920c8536-f6d8-4246-8d03-89ba851ba58f

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapreduce.job.reduces=<number>

Starting Job = job_1558939718678_0002, Tracking URL = http://hadoop01:8088/proxy/application_1558939718678_0002/

Kill Command = /export/servers/hadoop/bin/hadoop job -kill job_1558939718678_0002

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

2019-05-27 16:29:43,241 Stage-1 map = 0%, reduce = 0%

2019-05-27 16:29:59,731 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.53 sec

2019-05-27 16:30:24,303 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.91 sec

MapReduce Total cumulative CPU time: 6 seconds 910 msec

Ended Job = job_1558939718678_0002

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.91 sec HDFS Read: 885689 HDFS Write: 105 SUCCESS

Total MapReduce CPU Time Spent: 6 seconds 910 msec

OK

10000

Time taken: 61.229 seconds, Fetched: 1 row(s)

hive>
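
The WARNING at the top of the transcript notes that Hive-on-MapReduce is deprecated. If Tez (or Spark) is installed on the cluster, which this walkthrough does not cover, the execution engine can be switched per session:

-- Requires a working Tez installation; 'mr' (the default here) and 'spark' are the other options.
hive> set hive.execution.engine=tez;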


2. Count how many records contain the search keyword baidu

hive> select count(*) from Sogou where word like '%baidu%';

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

Query ID = root_20190527163553_998bf42e-2eb6-437a-b7c5-4ddeb72a2f90

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapreduce.job.reduces=<number>

Starting Job = job_1558939718678_0003, Tracking URL = http://hadoop01:8088/proxy/application_1558939718678_0003/

Kill Command = /export/servers/hadoop/bin/hadoop job -kill job_1558939718678_0003

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

2019-05-27 16:36:23,228 Stage-1 map = 0%, reduce = 0%

2019-05-27 16:36:49,593 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.11 sec

2019-05-27 16:37:05,957 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.94 sec

MapReduce Total cumulative CPU time: 6 seconds 940 msec

Ended Job = job_1558939718678_0003

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.94 sec HDFS Read: 886490 HDFS Write: 102 SUCCESS

Total MapReduce CPU Time Spent: 6 seconds 940 msec

OK

17

Time taken: 75.007 seconds, Fetched: 1 row(s)

hive>
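
As a further illustration (not run in the original session), the same table supports aggregate analysis, for example the ten most frequent search keywords:

hive> select word, count(*) as cnt from Sogou group by word order by cnt desc limit 10;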


3. Count how many records contain baidu where the result appeared in the first row and the user also clicked the first row

hive> select count(*) from Sogou where word location1=1 and location2=1 and like '%baidu%';

FAILED: ParseException line 1:38 missing EOF at 'location1' near 'word'

(The first attempt puts location1=1 right after the column name word, so Hive throws a ParseException; the corrected query below reorders the conditions.)

hive> select count(*) from Sogou where location1=1 and location2=1 and word like '%baidu%';

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

Query ID = root_20190527164056_0135e26b-58d3-409f-99e5-cbda737139d2

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapreduce.job.reduces=<number>

Starting Job = job_1558939718678_0004, Tracking URL = http://hadoop01:8088/proxy/application_1558939718678_0004/

Kill Command = /export/servers/hadoop/bin/hadoop job -kill job_1558939718678_0004

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

2019-05-27 16:41:18,492 Stage-1 map = 0%, reduce = 0%

2019-05-27 16:41:43,808 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 8.68 sec

2019-05-27 16:42:04,197 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.79 sec

MapReduce Total cumulative CPU time: 11 seconds 790 msec

Ended Job = job_1558939718678_0004

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 11.79 sec HDFS Read: 887049 HDFS Write: 102 SUCCESS

Total MapReduce CPU Time Spent: 11 seconds 790 msec

OK

10

Time taken: 70.628 seconds, Fetched: 1 row(s)

hive>


The session also demonstrates that Hive table and database names are case-insensitive: the table was created as Sogou, yet Hive reports it as hive.sogou, and queries succeed regardless of capitalization.
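
A quick way to confirm this from the session above: both spellings below resolve to the same table and return the same count.

hive> select count(*) from Sogou;

hive> select count(*) from SOGOU;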
