Maven Hadoop Log Cleaning Project (Part 2)
Hadoop 2.7.2
Hive 2.1.0
Sqoop 1.4.6
Reference:
http://www.cnblogs.com/edisonchou/p/4464349.html
1. Loading the cleaned files from HDFS into Hive
To run statistical analysis with Hive, we first need to get the cleaned data into Hive, which means creating a table. The reference article uses a table partitioned by date; since we are only processing a single day's output here, a plain external table is enough (a partitioned variant is sketched after the CREATE statement below). The key point is mapping the table to the right HDFS location, which in my case is /user/root/logcleanjob_output, where the cleaned data was written:
File location:
root@py-server:/projects/data# hadoop fs -ls /user/root/logcleanjob_output
Found 2 items
-rw-r--r-- 2 root supergroup 0 2016-08-13 18:46 /user/root/logcleanjob_output/_SUCCESS
-rw-r--r-- 2 root supergroup 50810594 2016-08-13 18:46 /user/root/logcleanjob_output/part-r-00000
hive>create database logtest;
hive>use logtest;
hive>CREATE EXTERNAL TABLE techbbs(ip string, atime string, url string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/user/root/logcleanjob_output';
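For reference, here is a minimal sketch of the date-partitioned variant mentioned above; the table name, partition column, and per-date directory layout are my own assumptions, not something the cleaning job above produces:
hive>CREATE EXTERNAL TABLE techbbs_part(ip string, atime string, url string) PARTITIONED BY (logdate string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/user/root/logcleanjob_part';
hive>ALTER TABLE techbbs_part ADD PARTITION (logdate='2015_04_25') LOCATION '/user/root/logcleanjob_part/2015_04_25';
With this layout, each day's cleaned output lands in its own subdirectory and queries can filter by logdate.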
Verify:
hive> select * from techbbs;
119.127.191.86     20130531235956    forum.php?mod=viewthread&tid=11528&page=60&authorid=53387
157.56.177.164     20130531235957    api.php?mod=js&bid=65
223.240.215.151    20130531235958    source/plugin/pcmgr_url_safeguard/url_api.inc.php
112.64.235.246     20130531235957    home.php?mod=misc&ac=sendmail&rand=1370014195
49.74.113.251      20130531235958    home.php?mod=spacecp&ac=follow&op=checkfeed&rand=1370015996
117.79.176.9       20130531235958    home.php?mod=space&do=notice
...
Time taken: 0.097 seconds, Fetched: 672261 row(s)
2. Key metric #1: page views (PV)
Page views (PV) is the total number of pages viewed by all users: every page a user opens is counted once. Here we simply count the number of records in the log. The HQL is:
hive>CREATE TABLE techbbs_pv_2015_04_25 AS SELECT COUNT(1) AS PV FROM techbbs;
hive> select * from techbbs_pv_2015_04_25;
OK
672261
Time taken: 0.128 seconds, Fetched: 1 row(s)
3. Key metric #2: registered users
The forum's registration page is member.php, and when a user clicks register, the requested URL is member.php?mod=register. So we only need to count the log entries whose URL contains member.php?mod=register. The HQL is:
hive>CREATE TABLE techbbs_reguser_2015_04_25 AS SELECT COUNT(1) AS REGUSER FROM techbbs WHERE INSTR(url,'member.php?mod=register')>0;
hive> select * from techbbs_reguser_2015_04_25;
OK
551
Time taken: 0.087 seconds, Fetched: 1 row(s)
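For what it's worth, the same filter can be written with LIKE instead of INSTR; both forms count the rows whose url field contains the registration string:
hive>SELECT COUNT(1) AS REGUSER FROM techbbs WHERE url LIKE '%member.php?mod=register%';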
4. Key metric #3: unique IPs
This is the number of distinct IP addresses that visited the site within one day; the same IP counts as one unique IP no matter how many pages it viewed. So we just count the distinct IPs in the log, and as in SQL this is done with the DISTINCT keyword:
hive>CREATE TABLE techbbs_ip_2015_04_25 AS SELECT COUNT(DISTINCT ip) AS IP FROM techbbs;
hive> select * from techbbs_ip_2015_04_25;
OK
31535
Time taken: 0.118 seconds, Fetched: 1 row(s)
5. Key metric #4: bounce count
This is the number of visits that viewed only a single page before leaving the site. We group the records by IP: if a group contains exactly one record, that visit is a bounce. Counting those groups gives the bounce count. The HQL is:
hive>CREATE TABLE techbbs_jumper_2015_04_25 AS SELECT COUNT(1) AS jumper FROM (SELECT COUNT(ip) AS times FROM techbbs GROUP BY ip HAVING times=1) e;
hive> show tables;
OK
techbbs
techbbs_ip_2015_04_25
techbbs_jumper_2015_04_25
techbbs_pv_2015_04_25
techbbs_reguser_2015_04_25
Time taken: 0.01 seconds, Fetched: 5 row(s)
hive> select * from techbbs_jumper_2015_04_25;
OK
10622
Time taken: 0.081 seconds, Fetched: 1 row(s)
6. Combining all key metrics into one summary table for export via Sqoop
To export everything to MySQL with a single Sqoop job, we gather the results just computed into one summary table by joining the four single-row tables. The HQL is:
hive>CREATE TABLE techbbs_2015_04_25 AS SELECT '2015_04_25', a.pv, b.reguser, c.ip, d.jumper FROM techbbs_pv_2015_04_25 a JOIN techbbs_reguser_2015_04_25 b ON 1=1 JOIN techbbs_ip_2015_04_25 c ON 1=1 JOIN techbbs_jumper_2015_04_25 d ON 1=1;
hive> select * from techbbs_2015_04_25;
OK
2015_04_25    672261    551    31535    10622
Time taken: 0.104 seconds, Fetched: 1 row(s)
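Because each of the four input tables holds exactly one row, the ON 1=1 joins simply place the columns side by side. The same result can be written with explicit CROSS JOINs, and giving the date literal an alias avoids the auto-generated column name; a sketch (the _v2 table name is made up here to avoid clashing with the table above):
hive>CREATE TABLE techbbs_2015_04_25_v2 AS SELECT '2015_04_25' AS logdate, a.pv, b.reguser, c.ip, d.jumper FROM techbbs_pv_2015_04_25 a CROSS JOIN techbbs_reguser_2015_04_25 b CROSS JOIN techbbs_ip_2015_04_25 c CROSS JOIN techbbs_jumper_2015_04_25 d;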
7. Exporting the results to MySQL with Sqoop
7.1 Preparation: create the summary table in MySQL
(1) Step 1: create a new database: techbbs
mysql> create database techbbs;
Query OK, 1 row affected (0.00 sec)
(2) Step 2: create a new table: techbbs_logs_stat
mysql> create table techbbs_logs_stat(logdate varchar(10) primary key, pv int,reguser int,ip int,jumper int);
Query OK, 0 rows affected (0.24 sec)
7.2 Export: running sqoop export
root@py-server:/projects/data# hadoop fs -ls /user/hive/warehouse/logtest.db
Found 5 items
drwxrwxr-x - root supergroup 0 2016-08-13 21:59 /user/hive/warehouse/logtest.db/techbbs_2015_04_25
drwxrwxr-x - root supergroup 0 2016-08-13 21:54 /user/hive/warehouse/logtest.db/techbbs_ip_2015_04_25
drwxrwxr-x - root supergroup 0 2016-08-13 21:57 /user/hive/warehouse/logtest.db/techbbs_jumper_2015_04_25
drwxrwxr-x - root supergroup 0 2016-08-13 21:48 /user/hive/warehouse/logtest.db/techbbs_pv_2015_04_25
drwxrwxr-x - root supergroup 0 2016-08-13 21:53 /user/hive/warehouse/logtest.db/techbbs_reguser_2015_04_25
mysql> show tables;
+---------------------------+
| Tables_in_hive |
+---------------------------+
| techbbs_logs_stat |
+---------------------------+
(1) Step 1: compose the export command
sqoop export --connect jdbc:mysql://py-server:3306/hive --username root --password fm1106 --table techbbs_logs_stat --fields-terminated-by '\001' --export-dir '/user/hive/warehouse/logtest.db/techbbs_2015_04_25'
Here --export-dir points to the summary table's directory under the Hive warehouse, in my case /user/hive/warehouse/logtest.db/techbbs_2015_04_25, and --fields-terminated-by '\001' matches Hive's default field delimiter (Ctrl-A), which the CREATE TABLE ... AS SELECT above used. Note also that the --connect URL targets the hive database on py-server: despite the techbbs database created in 7.1, the Tables_in_hive header above shows that techbbs_logs_stat actually lives in the hive database, so point the URL at whichever database really holds the table.
16/08/13 22:13:04 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=539988
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=849
HDFS: Number of bytes written=0
HDFS: Number of read operations=19
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=4
Data-local map tasks=4
Total time spent by all maps in occupied slots (ms)=17659
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=17659
Total vcore-milliseconds taken by all map tasks=17659
Total megabyte-milliseconds taken by all map tasks=18082816
Map-Reduce Framework
Map input records=1
Map output records=1
Input split bytes=741
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=463
CPU time spent (ms)=3280
Physical memory (bytes) snapshot=711569408
Virtual memory (bytes) snapshot=7689871360
Total committed heap usage (bytes)=581435392
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
16/08/13 22:13:04 INFO mapreduce.ExportJobBase: Transferred 849 bytes in 17.7487 seconds (47.8344 bytes/sec)
16/08/13 22:13:04 INFO mapreduce.ExportJobBase: Exported 1 records.
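If an export like this fails with parse or number-format errors, the usual cause is a field-delimiter mismatch. You can inspect the raw warehouse file to confirm the Ctrl-A (\001) separators; note that the part-file name 000000_0 is the typical CTAS output name and is an assumption here:
root@py-server:/projects/data# hadoop fs -cat /user/hive/warehouse/logtest.db/techbbs_2015_04_25/000000_0 | cat -A
cat -A renders each \001 as ^A, so the delimiters become visible.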
(2) Step 2: check the exported results
mysql> select * from techbbs_logs_stat;
+------------+--------+---------+-------+--------+
| logdate | pv | reguser | ip | jumper |
+------------+--------+---------+-------+--------+
| 2015_04_25 | 672261 | 551 | 31535 | 10622 |
+------------+--------+---------+-------+--------+
1 row in set (0.00 sec)
Postscript:
For scheduling the job (e.g. via crontab) and parameterizing it by date, see the original article (http://www.cnblogs.com/edisonchou/p/4464349.html); a small sketch of the date substitution follows.
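As a pointer in that direction, parameterizing by date comes down to substituting a variable into the HQL. A minimal sketch using Hive variable substitution (the variable name and the single-statement invocation are my own assumptions, not the original article's script):
root@py-server:/projects/data# hive --hivevar logdate=2015_04_25 -e 'CREATE TABLE techbbs_pv_${hivevar:logdate} AS SELECT COUNT(1) AS PV FROM logtest.techbbs;'
The single quotes keep the shell from expanding ${hivevar:logdate} before Hive sees it; a crontab entry can compute the date and pass it in the same way.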
Thanks to the original author!