Hive Log Analysis: Data Cleaning

Requirements Analysis

  • Log format
"27.38.5.159" "-" "31/Aug/2015:00:04:37 +0800" "GET /course/view.php?id=27 HTTP/1.1" "303" "440" - "http://www.ibeifeng.com/user.php?act=mycourse" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36" "-" "learn.ibeifeng.com"
"27.38.5.159" "-" "31/Aug/2015:00:04:37 +0800" "GET /login/index.php HTTP/1.1" "303" "465" - "http://www.ibeifeng.com/user.php?act=mycourse" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36" "-" "learn.ibeifeng.com"

Data Cleaning

  1. Create the source table
    In the log file, fields are separated by spaces, but some fields also contain spaces internally, so the data cannot be loaded with a plain space delimiter. Instead, a regular expression is used to match each field: every parenthesized capture group in the regex corresponds to one column.
create table IF NOT EXISTS source_log (
remote_addr string,
remote_user string,
time_local string,
request string,
status string,
body_bytes_sent string,
request_body string,
http_referer string,
http_user_agent string,
http_x_forwarded_for string,
host string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\"[^ ]*\") (\"[-|^ ]*\") (\"[^\"]*\") (\"[^\"]*\") (\"[0-9]*\") (\"[0-9]*\") ([-|^ ]*) (\"[^ ]*\") (\"[^\"]*\") (\"[-|^ ]*\") (\"[^ ]*\")"
)
STORED AS TEXTFILE;
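To see how `input.regex` carves a raw log line into the table's eleven columns, the same pattern can be tried outside Hive. The sketch below uses Python's `re` module with the HiveQL string escaping removed; each capture group corresponds, in order, to one column of `source_log`:

```python
import re

# The same pattern as input.regex above, with the HiveQL \" escaping removed.
pattern = re.compile(
    r'("[^ ]*") ("[-|^ ]*") ("[^"]*") ("[^"]*") ("[0-9]*") ("[0-9]*")'
    r' ([-|^ ]*) ("[^ ]*") ("[^"]*") ("[-|^ ]*") ("[^ ]*")'
)

# First sample line from the log format shown earlier.
line = ('"27.38.5.159" "-" "31/Aug/2015:00:04:37 +0800" '
        '"GET /course/view.php?id=27 HTTP/1.1" "303" "440" - '
        '"http://www.ibeifeng.com/user.php?act=mycourse" '
        '"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36" '
        '"-" "learn.ibeifeng.com"')

m = pattern.match(line)
fields = m.groups()          # 11 groups -> 11 columns
print(fields[0])             # remote_addr, quotes still attached
print(fields[2])             # time_local
```

Note that the matched values still carry their surrounding double quotes; stripping them is handled later in the formatting step.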
  2. Load the log data into the source table
    Load:
load data local inpath '/opt/datas/moodle.ibeifeng.access.log' into table source_log;

Verify that the load succeeded:

hive (log)> select time_local from source_log limit 5;
  3. Filter the fields
    According to the requirements, only four fields are needed: the user's IP address, the time, the request URL, and the referer (the link the visit came from).
    The Parquet storage format is used here to save storage space.
    Create a new table:
create table clean_log(
remote_addr string,
time_local string,
request string,
http_referer string
)
row format delimited fields terminated by '\t'
stored as parquet tblproperties("parquet.compress"="SNAPPY");
  4. Format the fields
  • Strip the double quotes from every field
  • Convert the date format
  • Truncate the request URL
  • Truncate the referer down to its main address
    Custom UDFs implement these functions and are loaded into Hive; for the detailed steps, see my earlier articles:
    http://www.jianshu.com/p/c8a8aa590e1e
    http://www.jianshu.com/p/ff0913045610
    Function names after loading into Hive:
    Strip double quotes: rm
    Date conversion: df
    Request URL truncation: ct
    Referer truncation: rf
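The articles linked above contain the actual Java UDF implementations. As a rough illustration only, the four transformations can be sketched in Python; the exact output formats of `df`, `ct`, and `rf` below are assumptions for the sake of the example, not taken from the original UDF code:

```python
from datetime import datetime
from urllib.parse import urlparse

def rm(field):
    """Strip the surrounding double quotes from a field."""
    return field.strip('"')

def df(time_local):
    """Convert nginx time ('31/Aug/2015:00:04:37 +0800') to a
    sortable timestamp (assumed target format)."""
    dt = datetime.strptime(time_local, "%d/%b/%Y:%H:%M:%S %z")
    return dt.strftime("%Y-%m-%d %H:%M:%S")

def ct(request):
    """Keep only the path from 'METHOD path PROTOCOL' (assumption)."""
    parts = request.split(" ")
    return parts[1] if len(parts) >= 2 else request

def rf(referer):
    """Keep only the host ('main address') of the referer URL (assumption)."""
    host = urlparse(referer).netloc
    return host if host else referer

# Composed exactly as in the INSERT query below, e.g. df(rm(time_local)):
print(df(rm('"31/Aug/2015:00:04:37 +0800"')))
```

Note that in the Hive query each formatting function is applied on top of `rm`, since every raw field still carries its double quotes.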

Register the custom functions:

hive (log)> create temporary function rm as "hiveUDF.RemoveQuote";
hive (log)> create temporary function df as "hiveUDF.DateTransform";
hive (log)> create temporary function rf as "hiveUDF.TruncationMainAdd";
hive (log)> create temporary function ct as "hiveUDF.TruncationRequestAdd"; 

Load the formatted data into the new table:

insert overwrite table clean_log
select rm(remote_addr),df(rm(time_local)),ct(rm(request)),rf(rm(http_referer))
from source_log;


Author: 心_的方向
Link: https://www.jianshu.com/p/da1f9bc326ee
Source: 简书 (Jianshu)
Copyright belongs to the author. For commercial reuse, please contact the author for authorization; for non-commercial reuse, please credit the source.
