大数据基础(十)Maven构建Hadoop日志清洗项目(二)

Maven Hadoop日志清洗项目(二)


Hadoop 2.7.2 
Hive 2.1.0 
Sqoop 1.4.6




参考:
http://www.cnblogs.com/edisonchou/p/4464349.html


1、将HDFS中清洗好的文件入库hive


为了能够借助Hive进行统计分析,首先我们需要将清洗后的数据存入Hive中,那么我们需要先建立一张表。这里我们选择分区表,以日期作为分区的指标,建表语句如下:(这里关键之处就在于确定映射的HDFS位置,我这里是/user/root/logcleanjob_output即清洗后的数据存放的位置)


文件位置:
root@py-server:/projects/data# hadoop fs -ls /user/root/logcleanjob_output
Found 2 items
-rw-r--r--   2 root supergroup          0 2016-08-13 18:46 /user/root/logcleanjob_output/_SUCCESS
-rw-r--r--   2 root supergroup   50810594 2016-08-13 18:46 /user/root/logcleanjob_output/part-r-00000


hive>create database logtest;


hive>CREATE EXTERNAL TABLE techbbs(ip string, atime string, url string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/user/root/logcleanjob_output';


验证:
select * from techbbs;
119.127.191.86 20130531235956 forum.php?mod=viewthread&tid=11528&page=60&authorid=53387
157.56.177.164 20130531235957 api.php?mod=js&bid=65
223.240.215.151 20130531235958 source/plugin/pcmgr_url_safeguard/url_api.inc.php
112.64.235.246 20130531235957 home.php?mod=misc&ac=sendmail&rand=1370014195
49.74.113.251 20130531235958 home.php?mod=spacecp&ac=follow&op=checkfeed&rand=1370015996
117.79.176.9 20130531235958 home.php?mod=space&do=notice
Time taken: 0.097 seconds, Fetched: 672261 row(s)




2、统计PV


页面浏览量即为PV(Page View),是指所有用户浏览页面的总和,一个独立用户每打开一个页面就被记录1 次。这里,我们只需要统计日志中的记录个数即可,HQL代码如下:


hive>CREATE TABLE techbbs_pv_2015_04_25 AS SELECT COUNT(1) AS PV FROM techbbs;


hive> select * from techbbs_pv_2015_04_25;
OK
672261
Time taken: 0.128 seconds, Fetched: 1 row(s)






3、注册用户数


  该论坛的用户注册页面为member.php,而当用户点击注册时请求的又是member.php?mod=register的url。因此,这里我们只需要统计出日志中访问的URL是member.php?mod=register的即可,HQL代码如下:


hive>CREATE TABLE techbbs_reguser_2015_04_25 AS SELECT COUNT(1) AS REGUSER FROM techbbs WHERE INSTR(url,'member.php?mod=register')>0;  


hive> select * from techbbs_reguser_2015_04_25;
OK
551
Time taken: 0.087 seconds, Fetched: 1 row(s)
  
4、关键指标之三:独立IP数


  一天之内,访问网站的不同独立 IP 个数加和。其中同一IP无论访问了几个页面,独立IP 数均为1。因此,这里我们只需要统计日志中处理的独立IP数即可,在SQL中我们可以通过DISTINCT关键字,在HQL中也是通过这个关键字:


hive>CREATE TABLE techbbs_ip_2015_04_25 AS SELECT COUNT(DISTINCT ip) AS IP FROM techbbs;


hive> select * from techbbs_ip_2015_04_25;
OK
31535
Time taken: 0.118 seconds, Fetched: 1 row(s)




5、关键指标之四:跳出用户数


  只浏览了一个页面便离开了网站的访问次数,即只浏览了一个页面便不再访问的访问次数。这里,我们可以通过用户的IP进行分组,如果分组后的记录数只有一条,那么即为跳出用户。将这些用户的数量相加,就得出了跳出用户数,HQL代码如下:


hive>CREATE TABLE techbbs_jumper_2015_04_25 AS SELECT COUNT(1) AS jumper FROM (SELECT COUNT(ip) AS times FROM techbbs GROUP BY ip HAVING times=1) e;


hive> show tables;
OK
techbbs
techbbs_ip_2015_04_25
techbbs_jumper_2015_04_25
techbbs_pv_2015_04_25
techbbs_reguser_2015_04_25
Time taken: 0.01 seconds, Fetched: 5 row(s)
hive> select * from techbbs_jumper_2015_04_25;
OK
10622
Time taken: 0.081 seconds, Fetched: 1 row(s)




6、将所有关键指标放入一张汇总表中以便于通过Sqoop导出到MySQL


  为了方便通过Sqoop统一导出到MySQL,这里我们借助一张汇总表将刚刚统计到的结果整合起来,通过表连接结合,HQL代码如下:


hive>CREATE TABLE techbbs_2015_04_25 AS SELECT '2015_04_25', a.pv, b.reguser, c.ip, d.jumper FROM techbbs_pv_2015_04_25 a JOIN techbbs_reguser_2015_04_25 b ON 1=1 JOIN techbbs_ip_2015_04_25 c ON 1=1 JOIN techbbs_jumper_2015_04_25 d ON 1=1;


hive> select * from techbbs_2015_04_25;
OK
2015_04_25 672261 551 31535 10622
Time taken: 0.104 seconds, Fetched: 1 row(s)




7、使用Sqoop导入到MySQL


7.1  准备工作:在MySQL中创建结果汇总表
  (1)Step1:创建一个新数据库:techbbs


mysql> create database techbbs;
Query OK, 1 row affected (0.00 sec)


  (2)Step2:创建一张新数据表:techbbs_logs_stat


mysql> create table techbbs_logs_stat(logdate varchar(10) primary key, pv int,reguser int,ip int,jumper int);
Query OK, 0 rows affected (0.24 sec)




7.2 导入操作:通过export命令


root@py-server:/projects/data# hadoop fs -ls /user/hive/warehouse/logtest.db
Found 5 items
drwxrwxr-x   - root supergroup          0 2016-08-13 21:59 /user/hive/warehouse/logtest.db/techbbs_2015_04_25
drwxrwxr-x   - root supergroup          0 2016-08-13 21:54 /user/hive/warehouse/logtest.db/techbbs_ip_2015_04_25
drwxrwxr-x   - root supergroup          0 2016-08-13 21:57 /user/hive/warehouse/logtest.db/techbbs_jumper_2015_04_25
drwxrwxr-x   - root supergroup          0 2016-08-13 21:48 /user/hive/warehouse/logtest.db/techbbs_pv_2015_04_25
drwxrwxr-x   - root supergroup          0 2016-08-13 21:53 /user/hive/warehouse/logtest.db/techbbs_reguser_2015_04_25


mysql> show tables;
+---------------------------+
| Tables_in_hive            |
+---------------------------+
| techbbs_logs_stat         |
+---------------------------+




  (1)Step1:编写导出命令


sqoop export --connect jdbc:mysql://py-server:3306/hive --username root --password fm1106 --table techbbs_logs_stat --fields-terminated-by '\001' --export-dir '/user/hive/warehouse/logtest.db/techbbs_2015_04_25'


  这里的--export-dir是指定的hive目录下的汇总表所在位置,我这里是/user/hive/warehouse/logtest.db/techbbs_2015_04_25。


16/08/13 22:13:04 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=539988
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=849
HDFS: Number of bytes written=0
HDFS: Number of read operations=19
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters 
Launched map tasks=4
Data-local map tasks=4
Total time spent by all maps in occupied slots (ms)=17659
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=17659
Total vcore-milliseconds taken by all map tasks=17659
Total megabyte-milliseconds taken by all map tasks=18082816
Map-Reduce Framework
Map input records=1
Map output records=1
Input split bytes=741
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=463
CPU time spent (ms)=3280
Physical memory (bytes) snapshot=711569408
Virtual memory (bytes) snapshot=7689871360
Total committed heap usage (bytes)=581435392
File Input Format Counters 
Bytes Read=0
File Output Format Counters 
Bytes Written=0
16/08/13 22:13:04 INFO mapreduce.ExportJobBase: Transferred 849 bytes in 17.7487 seconds (47.8344 bytes/sec)
16/08/13 22:13:04 INFO mapreduce.ExportJobBase: Exported 1 records.




  (2)Step2:查看导出结果


mysql> select * from techbbs_logs_stat;
+------------+--------+---------+-------+--------+
| logdate    | pv     | reguser | ip    | jumper |
+------------+--------+---------+-------+--------+
| 2015_04_25 | 672261 |     551 | 31535 |  10622 |
+------------+--------+---------+-------+--------+
1 row in set (0.00 sec)


后记:


关于原文(http://www.cnblogs.com/edisonchou/p/4464349.html)的定时任务和按日期的设置,请参考原文。

感谢原作者!

你可能感兴趣的:(大数据开发基础)