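1. Write an automation script that copies the two log files from /home/admin/WebLog/ (the directory where the server writes its logs) to the target directory /opt/modules/weblog.
2. Schedule the log copy with crontab -e.
3. Upload the logs to HDFS, then delete the files from that directory (possibly keeping only the last 7 days of logs).
4. Clean the data.
5. Analyze the data and produce the final result table.
6. Import the result into MySQL (a single row).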
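Steps 1 and 2 can be sketched as a small copy function plus a crontab entry. This is a minimal sketch, not the original script: the function name copy_logs, the dated sub-directory under /opt/modules/weblog (inferred from the upload examples later in these notes), the script path, and the 23:55 schedule are all assumptions.

```shell
#!/bin/bash
# copy_logs SRC_DIR DEST_ROOT DAY  (hypothetical helper)
# Copies every regular file in SRC_DIR into DEST_ROOT/DAY.
copy_logs() {
    local src_dir="$1" dest_root="$2" day="$3" f
    mkdir -p "$dest_root/$day" || return 1
    for f in "$src_dir"/*; do
        # Skip sub-directories; this also covers an empty source directory.
        [ -f "$f" ] || continue
        cp -p "$f" "$dest_root/$day/" || return 1
    done
}

# Intended call, using the paths from step 1:
#   copy_logs /home/admin/WebLog /opt/modules/weblog "$(date +%Y%m%d)"
#
# Example crontab line (added via `crontab -e`) that runs a script containing
# that call at 23:55 every day; the script path is hypothetical:
#   55 23 * * * /bin/bash /opt/modules/cdh/clean/copy_weblog.sh
```

Copying late in the day under that day's date keeps the folder name aligned with the YESTERDAY value that the processing script computes on the following day.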
Documentation:
Requirements analysis
Date      UV     PV        Logged-in users  Visitors  Avg. visit duration  Second-hop rate     Distinct IPs
20150828  38985  131668.0  18548            21902     750.7895179233622    0.5166089521610876  29668
Logic diagram
Data dictionary
Solution:
Predefined variables:
HIVE_DIR=/opt/modules/cdh/hive-0.13.1-cdh5.3.6
HADOOP_DIR=/opt/modules/cdh/hadoop-2.5.0-cdh5.3.6
HQL_DIR=/opt/modules/cdh/clean (where the shell scripts and Hive scripts are stored)
bin/hdfs dfs -put <absolute local path of the log file> <HDFS path>
Example: bin/hdfs dfs -put /opt/modules/weblog/20170725/2015082818 /weblog/
Script:
YESTERDAY=$(date --date="1 day ago" +%Y%m%d)
Create the directory:
$HADOOP_DIR/bin/hdfs dfs -mkdir /weblog/$YESTERDAY (do not run the two steps above inside the loop)
Upload the log files in a loop:
$HADOOP_DIR/bin/hdfs dfs -put /opt/modules/weblog/$YESTERDAY/$i /weblog/$YESTERDAY
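The split between one-time setup and the per-file upload can be sketched as follows. This is only an illustration under stated assumptions: upload_day and the HDFS_CMD override are hypothetical names (the override allows a dry run with echo), and -mkdir -p is used so that a rerun does not fail on an existing directory.

```shell
#!/bin/bash
# HDFS_CMD is a hypothetical override; it defaults to the hdfs binary used in these notes.
HDFS_CMD="${HDFS_CMD:-/opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/bin/hdfs}"

# upload_day LOCAL_ROOT DAY  (hypothetical helper)
upload_day() {
    local local_root="$1" day="$2" f
    # Create the target directory once, outside the loop.
    "$HDFS_CMD" dfs -mkdir -p "/weblog/$day" || return 1
    for f in "$local_root/$day"/*; do
        # Only regular files are uploaded; an empty directory uploads nothing.
        [ -f "$f" ] || continue
        "$HDFS_CMD" dfs -put "$f" "/weblog/$day" || return 1
    done
}

# Intended call:
#   upload_day /opt/modules/weblog "$(date --date="1 day ago" +%Y%m%d)"
```

Setting HDFS_CMD=echo before calling upload_day prints the commands instead of running them, which is a cheap way to check the paths before touching the cluster.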
bin/yarn jar /home/admin/Desktop/cleanlog.jar com.z.etl.CleanLogMapReduce /weblog/20170725/2015082818 \
/user/hive/warehouse/syllabus.db/track_log/date=20150828/hour=18
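*** Tools.java: deletes any output paths that this job might write to. Code:
// conf is assumed to be an org.apache.hadoop.conf.Configuration field defined elsewhere in Tools.java
public static void deleteFileInHDFS(String directory, String exist) throws IOException{
    // Obtain the file system handle
    FileSystem fileSystem = FileSystem.get(URI.create(directory), conf);
    Path dir = new Path(directory);
    // Nothing to delete if the parent directory does not exist yet (first run)
    if(!fileSystem.exists(dir)){
        return;
    }
    FileStatus[] fileList = fileSystem.listStatus(dir);
    for(int i = 0; i < fileList.length; i++){
        FileStatus fileStatus = fileList[i];
        // Delete every entry whose name starts with the given prefix
        if(fileStatus.getPath().getName().startsWith(exist)){
            fileSystem.delete(fileStatus.getPath(), true);
        }
    }
}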
*** Splitting the output path string:
// /user/hive/warehouse/syllabus.db/track_log/date=20150828/hour=18
Tools.deleteFileInHDFS(args[1].substring(0, 57), args[1].substring(57));
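The offset 57 is the length of the parent path including its trailing slash, so it only fits output paths with exactly this layout. As a quick check of what the two substrings contain, the same split can be reproduced in the shell with parameter expansion, which cuts at the last "/" instead of at a fixed offset. The variable names here are illustrative only.

```shell
#!/bin/bash
out=/user/hive/warehouse/syllabus.db/track_log/date=20150828/hour=18

parent="${out%/*}/"   # everything up to and including the last "/"
prefix="${out##*/}"   # the final path component

echo "parent: $parent (${#parent} characters)"
echo "prefix: $prefix"
```

The parent is the directory that gets listed, and the prefix is the name that entries must start with in order to be deleted.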
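Code: see the cleaning code package
Notes:
1. If a MainClass was specified when the JAR was exported, run:
bin/yarn jar xxxxx.jar /input/ /output/
2. If no MainClass was specified when the JAR was exported, run:
bin/yarn jar xxxxx.jar MainClass /input/ /output/
The JAR is located at /home/admin/Desktop/clearlog.jar
Test first:
$ /opt/modules/cdh/hadoop-2.5.0-cdh5.3.6/bin/yarn jar \
/home/admin/Desktop/clearlog.jar \
com.z.hive.etl.LogCleanMapReduce \
/weblog/20170725/2015082818 \
/user/hive/warehouse/syllabus.db/track_log/date=20150828/hour=18
// /weblog/20170725/2015082818 is the input path
// /user/hive/warehouse/syllabus.db/track_log/date=20150828/hour=18 is the output path
// com.z.hive.etl.LogCleanMapReduce is the fully qualified main class: the package name plus LogCleanMapReduce, the class that contains main
// date=20150828 and hour=18 are ordinary directory names; they only look unusual because they follow Hive's key=value partition directory layout
Important: test twice, once when the output directory does not exist and once when it already exists.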
Table creation statements:
create database if not exists syllabus;
create table if not exists syllabus.track_log(
id string,
url string,
referer string,
keyword string,
type string,
guid string,
pageId string,
moduleId string,
linkId string,
attachedInfo string,
sessionId string,
trackerU string,
trackerType string,
ip string,
trackerSrc string,
cookie string,
orderCode string,
trackTime string,
endUserId string,
firstLink string,
sessionViewNo string,
productId string,
curMerchantId string,
provinceId string,
cityId string,
fee string,
edmActivity string,
edmEmail string,
edmJobId string,
ieVersion string,
platform string,
internalKeyword string,
resultSum string,
currentPage string,
linkPosition string,
buttonPosition string
)
partitioned by (date string,hour string)
row format delimited fields terminated by '\t';
Save the statements above in create_table_track_log.hql and test the file on its own with the following command:
$ /opt/modules/cdh/hive-0.13.1-cdh5.3.6/bin/hive -f /opt/modules/cdh/clean/create_table_track_log.hql
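The following statements run into a bug in this Hive version; it is fixed in Hive 0.14 and later.
Important: as a workaround, first `use` the database that contains the table and then run `alter`; in the alter statement the table name must not be prefixed with the database name.
alter table track_log add partition(date='20150828', hour='18') location "/user/hive/warehouse/syllabus.db/track_log/date=20150828/hour=18";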
load data inpath "/user/hive/warehouse/syllabus.db/track_log/date=20150828/hour=18/part-r-00000" into table syllabus.track_log partition(date='20150828', hour='18');
alter statement that links the partition:
// Run `use syllabus` before the alter statement and the error no longer occurs
use syllabus;alter table track_log add partition(date='${hiveconf:DATE_NEW}', hour='${hiveconf:HOUR_NEW}') location "${hiveconf:LOCATION_NEW}";
The corresponding shell script for this step:
echo "====== Linking cleaned data ======"
$HIVE_DIR/bin/hive \
--hiveconf LOCATION_NEW=/user/hive/warehouse/syllabus.db/track_log/date=$DATE/hour=$HOUR \
--hiveconf DATE_NEW=$DATE \
--hiveconf HOUR_NEW=$HOUR \
-f $HQL_DIR/alter_table_track_log.hql
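The fragment above relies on DATE and HOUR already being set. In the complete script they are cut from the log file name (for example 2015082818). A minimal sketch of that step with a sanity check is shown here; parse_log_name is a hypothetical helper, and it is written with POSIX expansions that give the same result as the bash substrings ${i:0:8} and ${i:8:2}.

```shell
#!/bin/bash
# parse_log_name NAME  (hypothetical helper)
# Sets DATE and HOUR from a name of the form yyyyMMddHH; fails on anything else.
parse_log_name() {
    name="$1"
    # Reject empty names and names containing a non-digit.
    case "$name" in
        ''|*[!0-9]*) echo "unexpected log file name: $name" >&2; return 1 ;;
    esac
    # Reject names that are not exactly ten digits long.
    [ "${#name}" -eq 10 ] || { echo "unexpected log file name: $name" >&2; return 1; }
    DATE="${name%??}"          # drop the last two characters
    HOUR="${name#????????}"    # drop the first eight characters
}

parse_log_name 2015082818 && echo "date=$DATE hour=$HOUR"
```

Rejecting unexpected names before calling Hive avoids registering a partition with a malformed date or hour.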
Test by querying the data:
hive> select * from track_log limit 1;
Create the set of analysis statements
-rw-rw-r-- 1 admin admin 209 Jul 26 16:15 create_result_info.hql
create table syllabus.result_info(
date string,
uv string,
pv string,
login_users string,
visit_users string,
avg_time string,
sec_hop string,
ip_count string
)
row format delimited fields terminated by '\t';
-rw-rw-r-- 1 admin admin 319 Jul 26 16:12 create_session_info.hql
create table if not exists syllabus.session_info(
session_id string,
guid string,
tracker_u string,
landing_url string,
landing_url_ref string,
user_id string,
pv string,
stay_time string,
min_tracktime string,
ip string,
province_id string)
partitioned by (date string)
row format delimited fields terminated by '\t';
-rw-rw-r-- 1 admin admin 314 Jul 26 16:13 insert_session_info_temp1.hql
insert overwrite table syllabus.session_info_temp1
select
sessionId,
max(guid),
max(endUserId),
count(url),
max(unix_timestamp(trackTime)) - min(unix_timestamp(trackTime)),
from_unixtime(min(unix_timestamp(trackTime))),
max(ip),
max(provinceId)
from syllabus.track_log where date='20150828'
group by
sessionId;
-rw-rw-r-- 1 admin admin 190 Jul 26 16:14 create_session_info_temp2.hql
create table syllabus.session_info_temp2(
session_id string,
tracktime string,
tracker_u string,
landing_url string,
landing_url_ref string
)
row format delimited fields terminated by '\t';
-rw-rw-r-- 1 admin admin 388 Jul 26 16:15 insert_result_info.hql
insert overwrite table syllabus.result_info
select
date,
count(distinct guid),
sum(pv),
count(case when user_id != '' then user_id else null end),
count(case when user_id = '' then user_id else null end),
avg(stay_time),
count(distinct (case when pv >= 2 then guid else null end))/count(distinct guid),
count(distinct ip)
from syllabus.session_info where date='20150828'
group by
date;
-rw-rw-r-- 1 admin admin 368 Jul 26 16:14 insert_session_info.hql
insert overwrite table syllabus.session_info partition(date='20150828')
select
p1.session_id,
p1.guid,
p2.tracker_u,
p2.landing_url,
p2.landing_url_ref,
p1.user_id,
p1.pv,
p1.stay_time,
p1.min_tracktime,
p1.ip,
p1.province_id
from syllabus.session_info_temp1 p1 join syllabus.session_info_temp2 p2
on p1.session_id=p2.session_id and p1.min_tracktime=p2.tracktime;
-rw-rw-r-- 1 admin admin 235 Jul 26 16:13 create_session_info_temp1.hql
create table if not exists syllabus.session_info_temp1(
session_id string,
guid string,
user_id string,
pv string,
stay_time string,
min_tracktime string,
ip string,
province_id string
)
row format delimited fields terminated by '\t';
-rw-rw-r-- 1 admin admin 153 Jul 26 16:14 insert_session_info_temp2.hql
insert overwrite table syllabus.session_info_temp2
select
sessionId,
trackTime,
trackerU,
url,
referer
from syllabus.track_log where date='20150828';
Write a script that runs them in order
echo "== Starting website traffic analysis =="
echo "== Checking all table structures =="
$HIVE_DIR/bin/hive -f $HQL_DIR/create_session_info.hql
$HIVE_DIR/bin/hive -f $HQL_DIR/create_session_info_temp1.hql
$HIVE_DIR/bin/hive -f $HQL_DIR/create_session_info_temp2.hql
$HIVE_DIR/bin/hive -f $HQL_DIR/create_result_info.hql
echo "== Starting data insertion =="
$HIVE_DIR/bin/hive \
--hiveconf DATE_NEW=$YESTERDAY \
-f $HQL_DIR/insert_session_info_temp1.hql
$HIVE_DIR/bin/hive \
--hiveconf DATE_NEW=$YESTERDAY \
-f $HQL_DIR/insert_session_info_temp2.hql
$HIVE_DIR/bin/hive \
--hiveconf DATE_NEW=$YESTERDAY \
-f $HQL_DIR/insert_session_info.hql
$HIVE_DIR/bin/hive \
--hiveconf DATE_NEW=$YESTERDAY \
-f $HQL_DIR/insert_result_info.hql
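Each insert above reads a table that the previous step filled, so a failed step should stop the chain. The script above keeps going regardless. A minimal sketch of a fail-fast runner follows; run_hql_steps and the HIVE_CMD override are hypothetical names, and the default paths are the ones used in these notes.

```shell
#!/bin/bash
# HIVE_CMD is a hypothetical override so the loop can be exercised without Hive.
HIVE_CMD="${HIVE_CMD:-/opt/modules/cdh/hive-0.13.1-cdh5.3.6/bin/hive}"
HQL_DIR="${HQL_DIR:-/opt/modules/cdh/clean}"

# run_hql_steps NAME...  (hypothetical helper)
# Runs $HQL_DIR/NAME.hql for each name, in order, stopping at the first failure.
run_hql_steps() {
    local step
    for step in "$@"; do
        echo "running $step"
        "$HIVE_CMD" -f "$HQL_DIR/$step.hql" || {
            echo "step failed: $step" >&2
            return 1
        }
    done
}

# Intended call:
#   run_hql_steps insert_session_info_temp1 insert_session_info_temp2 \
#                 insert_session_info insert_result_info
```

The --hiveconf options are left out because the insert files shown above hard-code their dates; they would be added to the hive call if the files referenced ${hiveconf:DATE_NEW}.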
Create the options (.opt) file that sqoop will execute
$ vi /opt/modules/cdh/clean/weblog_hive_2_mysql.opt
Contents:
export
--connect
jdbc:mysql://hadoop-senior01.itguigu.com:3306/syllabus_weblog
--username
root
--password
123456
--table
result_web_log
--num-mappers
1
--export-dir
/user/hive/warehouse/syllabus.db/result_info
--input-fields-terminated-by
"\t"
Log in to MySQL and create the corresponding database and table
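The notes do not show the MySQL DDL. The sketch below prints one possible version: the database and table names come from the options file above, the column names mirror syllabus.result_info, and every column type is an assumption (plain VARCHAR, matching the all-string Hive table). The helper name print_mysql_ddl is hypothetical.

```shell
#!/bin/bash
# print_mysql_ddl  (hypothetical helper)
# Prints DDL for the sqoop export target; all column types are assumptions.
print_mysql_ddl() {
    cat <<'SQL'
CREATE DATABASE IF NOT EXISTS syllabus_weblog;
CREATE TABLE IF NOT EXISTS syllabus_weblog.result_web_log (
  `date`        VARCHAR(16),
  `uv`          VARCHAR(32),
  `pv`          VARCHAR(32),
  `login_users` VARCHAR(32),
  `visit_users` VARCHAR(32),
  `avg_time`    VARCHAR(32),
  `sec_hop`     VARCHAR(32),
  `ip_count`    VARCHAR(32)
);
SQL
}

# Usage (prompts for the password):
#   print_mysql_ddl | mysql -h hadoop-senior01.itguigu.com -u root -p
```

The column order matches result_info because the options file gives no column list, in which case sqoop export fills the table's columns in order.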
Complete script
#!/bin/bash
# Source the system environment script to initialize environment variables
. /etc/profile
# Define the Hive, Hadoop, HQL and Sqoop directories
HIVE_DIR=/opt/modules/cdh/hive-0.13.1-cdh5.3.6
HADOOP_DIR=/opt/modules/cdh/hadoop-2.5.0-cdh5.3.6
HQL_DIR=/opt/modules/cdh/clean
SQOOP_DIR=/opt/modules/cdh/sqoop-1.4.5-cdh5.3.6
echo $HIVE_DIR
echo $HADOOP_DIR
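# Local directory where the logs are stored
WEB_LOG=/opt/modules/weblog
# Yesterday's date, used to locate the directory
YESTERDAY=$(date --date="1 day ago" +%Y%m%d)
# Create the target directory on HDFS
echo "====== Creating directory ======"
# e.g. /weblog/20170725
$HADOOP_DIR/bin/hdfs dfs -mkdir /weblog/$YESTERDAY
echo "====== Checking table ======"
$HIVE_DIR/bin/hive -f $HQL_DIR/create_table_track_log.hql
# Iterate over the log files in the directory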
for i in `ls $WEB_LOG/$YESTERDAY`
do
# e.g. 20150828
DATE=${i:0:8}
# e.g. 18
HOUR=${i:8:2}
# Upload the file to HDFS
echo "====== Uploading log ======"
# e.g. bin/hdfs dfs -put /opt/modules/weblog/20170726/2015082818 /weblog/20170726
$HADOOP_DIR/bin/hdfs dfs -put $WEB_LOG/$YESTERDAY/$i /weblog/$YESTERDAY
# Clean the log
echo "====== Starting cleaning ======"
# e.g. bin/yarn jar /home/admin/Desktop/clearlog.jar com.z.hive.etl.LogCleanMapReduce /weblog/20170726/2015082818 /user/hive/warehouse/syllabus.db/track_log/date=20150828/hour=18
$HADOOP_DIR/bin/yarn jar /home/admin/Desktop/clearlog.jar com.z.hive.etl.LogCleanMapReduce /weblog/$YESTERDAY/$i /user/hive/warehouse/syllabus.db/track_log/date=$DATE/hour=$HOUR
echo "====== Linking cleaned data ======"
$HIVE_DIR/bin/hive \
--hiveconf LOCATION_NEW=/user/hive/warehouse/syllabus.db/track_log/date=$DATE/hour=$HOUR \
--hiveconf DATE_NEW=$DATE \
--hiveconf HOUR_NEW=$HOUR \
-f $HQL_DIR/alter_table_track_log.hql
done
echo "== Starting website traffic analysis =="
echo "== Checking all table structures =="
$HIVE_DIR/bin/hive -f $HQL_DIR/create_session_info.hql
$HIVE_DIR/bin/hive -f $HQL_DIR/create_session_info_temp1.hql
$HIVE_DIR/bin/hive -f $HQL_DIR/create_session_info_temp2.hql
$HIVE_DIR/bin/hive -f $HQL_DIR/create_result_info.hql
echo "== Starting data insertion =="
$HIVE_DIR/bin/hive \
--hiveconf DATE_NEW=$YESTERDAY \
-f $HQL_DIR/insert_session_info_temp1.hql
$HIVE_DIR/bin/hive \
--hiveconf DATE_NEW=$YESTERDAY \
-f $HQL_DIR/insert_session_info_temp2.hql
$HIVE_DIR/bin/hive \
--hiveconf DATE_NEW=$YESTERDAY \
-f $HQL_DIR/insert_session_info.hql
$HIVE_DIR/bin/hive \
--hiveconf DATE_NEW=$YESTERDAY \
-f $HQL_DIR/insert_result_info.hql
echo "== Exporting data to MySQL =="
$SQOOP_DIR/bin/sqoop --options-file $HQL_DIR/weblog_hive_2_mysql.opt
echo "== Job finished =="
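if [ -r file ]
Numeric tests
-eq   true if equal
-ne   true if not equal
-gt   true if greater than
-ge   true if greater than or equal
-lt   true if less than
-le   true if less than or equal
String tests
=           true if equal
!=          true if not equal
-z string   true if the string has zero length
-n string   true if the string has non-zero length
File tests
-e file   true if the file exists
-r file   true if the file exists and is readable
-w file   true if the file exists and is writable
-x file   true if the file exists and is executable
-s file   true if the file exists and is not empty
-d file   true if the file exists and is a directory
-f file   true if the file exists and is a regular file
-c file   true if the file exists and is a character special file
-b file   true if the file exists and is a block special file
The test command also provides three logical operators, not (!), or (-o) and and (-a), for combining test conditions; in order of precedence, ! is highest, -a next, and -o lowest.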
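A few of the operators listed above in use, as a minimal sketch with illustrative function names. POSIX marks -a and -o as obsolescent, so the first three helpers chain separate [ ] tests with && instead; the last helper uses -a and -o only to show the precedence rule.

```shell
#!/bin/bash
# Numeric test: succeeds when 3 <= $1 < 10.
in_range() {
    [ "$1" -ge 3 ] && [ "$1" -lt 10 ]
}

# String test: succeeds only for a non-empty argument.
non_empty() {
    [ -n "$1" ]
}

# File test: succeeds for a readable regular file.
readable_file() {
    [ -f "$1" ] && [ -r "$1" ]
}

# Precedence: -a binds tighter than -o, so this is read as
# ( -n "" -a -n "" ) -o -n "x", which succeeds because the last test is true.
precedence_demo() {
    [ -n "" -a -n "" -o -n "x" ]
}

in_range 5 && echo "5 is in range"
non_empty "" || echo "empty string rejected"
precedence_demo && echo "-a binds tighter than -o"
```

If -o bound tighter than -a, precedence_demo would be read as -n "" -a ( -n "" -o -n "x" ) and would fail, so its success demonstrates the stated order.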