Requirement: count PV and UV for each of the 24 hourly time slots per day.
Hive does the querying; Sqoop exports the results to MySQL.
PV (Page View): every URL request counts once.
UV (Unique Visitor): each user counts only once.
OS: CentOS 7 virtual machine
Software: CDH builds of Hadoop, Hive, and Sqoop; MySQL
Download: https://pan.baidu.com/s/1lgJkPzJqvzrsCIaLXtuFXg (extraction code: g73u)
Before starting, make sure HDFS, YARN, and MySQL are running.
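If the daemons are not yet up, they can be started roughly as follows (a sketch assuming the CDH Hadoop tarball location is exported as $HADOOP_HOME and MySQL is managed by systemd; adjust the paths and service name to your installation):

# start HDFS and YARN (assumed tarball layout)
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
# start the MapReduce job history server
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
# start MySQL (CentOS 7 systemd service)
sudo systemctl start mysqld

jps should then show the daemons below: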
7473 DataNode
7426 NameNode
7526 SecondaryNameNode
7719 JobHistoryServer
7646 NodeManager
7742 Jps
7599 ResourceManager
[fanl@centos7 hive-1.1.0-cdh5.14.2]$
(1) Create a new Hive database "weblogs"
hive (default)> create database weblogs;
OK
Time taken: 6.628 seconds
hive (default)> use weblogs;
OK
Time taken: 0.034 seconds
hive (weblogs)>
(2) Create the source table logs_src, which stores all the raw log data
create table logs_src(
id string,
url string,
referer string,
keyword string,
type string,
guid string,
pageId string,
moduleId string,
linkId string,
attachedInfo string,
sessionId string,
trackerU string,
trackerType string,
ip string,
trackerSrc string,
cookie string,
orderCode string,
trackTime string,
endUserId string,
firstLink string,
sessionViewNo string,
productId string,
curMerchantId string,
provinceId string,
cityId string,
fee string,
edmActivity string,
edmEmail string,
edmJobId string,
ieVersion string,
platform string,
internalKeyword string,
resultSum string,
currentPage string,
linkPosition string,
buttonPosition string
)row format delimited fields terminated by '\t';
(3) Create the partition table, partitioned by day and hour
create table logs_partition(
id string,
url string,
guid string
)partitioned by (day string,hour string)
row format delimited fields terminated by '\t';
(4) Create a temporary table used for the intermediate calculation
create table logs_temp(
id string,
url string,
guid string,
day string,
hour string
)row format delimited fields terminated by '\t';
(1) Upload the log file to Linux
The virtual machine is accessed over a shell session; rz can upload the file into the user's home directory:
-rw-r--r-- 1 fanl fanl 39425518 Dec 13 2016 2015082818
(2) Load the data into the source table logs_src
hive (weblogs)> load data local inpath '/home/fanl/2015082818' into table logs_src;
Loading data to table weblogs.logs_src
Table weblogs.logs_src stats: [numFiles=1, totalSize=39425518]
OK
Time taken: 1.187 seconds
hive (weblogs)>
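As an optional sanity check (not part of the original steps), confirm the data landed in the table:

-- row count and a few sample records
select count(1) from logs_src;
select id, url, guid, trackTime from logs_src limit 3;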
(3) Load the source data into the temp table, deriving the day and hour
The trackTime field supplies the time split points: day = substring(trackTime,9,2), hour = substring(trackTime,12,2)
hive (weblogs)> insert into table logs_temp select id,url,guid,
> substring(trackTime,9,2)
> day,substring(trackTime,12,2) hour from logs_src;
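A quick way to verify the substring offsets (the sample value in the comments is an assumption; it only illustrates the expected 'yyyy-MM-dd HH:mm:ss' layout of trackTime):

select trackTime,
       substring(trackTime, 9, 2)  day,   -- e.g. '28' from '2015-08-28 18:52:21'
       substring(trackTime, 12, 2) hour   -- e.g. '18'
from logs_src limit 3;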
(4) Load the data for the 28th at 18:00 from the temp table into the partition table
hive (weblogs)> insert into table logs_partition
> partition(day='20150828',hour='18') select id,url,guid from
> logs_temp where day='28' and hour ='18';
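The new partition can then be checked with the following command (expected to list day=20150828/hour=18):

show partitions logs_partition;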
(5) Query PV
hive (weblogs)> select day,hour,count(url) from logs_partition group by day,hour;
day hour _c2
20150828 18 64972
Time taken: 20.776 seconds, Fetched: 1 row(s)
hive (weblogs)>
(6) Query UV
hive (weblogs)> select day,hour,count(distinct guid) uv from logs_partition group by day,hour;
OK
day hour uv
20150828 18 23938
Time taken: 21.869 seconds, Fetched: 1 row(s)
hive (weblogs)>
(7) Collect both results into logs_result
hive (weblogs)> create table logs_result as select day,
> hour,count(url) pv,count(distinct guid) uv
> from logs_partition group by day,hour;
-- execution finished
hive (weblogs)> select * from logs_result;
OK
logs_result.day logs_result.hour logs_result.pv logs_result.uv
20150828 18 64972 23938
Time taken: 0.069 seconds, Fetched: 1 row(s)
(1) Create the result table logs_result in MySQL
mysql> create database sqoop;
Query OK, 1 row affected (0.12 sec)
mysql> use sqoop;
Database changed
mysql> create table logs_result(
-> day varchar(20) not null,
-> hour varchar(20) not null,
-> pv varchar(20) not null,
-> uv varchar(20) not null,
-> primary key(day,hour)
-> );
Query OK, 0 rows affected (0.01 sec)
mysql> select * from logs_result;
Empty set (0.00 sec)
mysql>
(2) Export the results to MySQL with Sqoop
[fanl@centos7 sqoop-1.4.6-cdh5.14.2]$ bin/sqoop export \
--connect jdbc:mysql://127.0.0.1:3306/sqoop \
--username root \
--password 123456 \
--table logs_result \
--export-dir '/user/hive/warehouse/weblogs.db/logs_result' \
--num-mappers 2 \
--input-fields-terminated-by '\001'
## Hive's default field delimiter is \001
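If the export fails, a first check is whether the export directory actually exists in HDFS (the path follows from the default Hive warehouse location used above):

hdfs dfs -ls /user/hive/warehouse/weblogs.db/logs_result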
Query the exported results:
mysql> select * from logs_result;
+----------+------+--------+-------+
| day | hour | pv | uv |
+----------+------+--------+-------+
| 20150828 | 18 | 64972 | 23938 |
+----------+------+--------+-------+
1 row in set (0.00 sec)
mysql>
Follow the steps in order, and remember that Hive's default field delimiter is \001.