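grep has an efficient built-in string search, supports several regex flavors, and is rich in features, which is why it is known as one of the "three musketeers" of text processing on Linux, alongside sed and awk. In today's big-data, distributed era, though, this single-machine tool is showing its age.
We often need to search raw logs on Hadoop to verify ETL data, and many people still reach for the old approach: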
hadoop fs -cat /M_track/$yesterday/* | grep ooxx | wc -l
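This pulls the logs spread across the whole cluster onto a single machine, then runs grep and wc there. It is extremely painful: the bottleneck is clearly network IO, and with 100+ GB of logs a simple grep often has not finished after half an hour.
Since the data already lives on Hadoop, the obvious way to run a distributed query is to write a MapReduce job. For a simple count, though, that is too heavyweight: writing, packaging and uploading the job costs too much time. So the natural next choice is Hive, which can parse and query the raw logs directly.
In Hive, LIKE does wildcard matching, the same as in MySQL, while RLIKE/REGEXP does regular-expression matching. That covers most of what grep (regex or wildcard) plus awk can do, and Hive gets it done in minutes.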
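To make the mapping concrete, here is a minimal local sketch of how the two Hive predicates line up with grep. The sample lines and the function names `sample_lines`, `count_like` and `count_rlike` are made up for illustration; they are not from the real logs. Note that Hive uses Java regular expressions while `grep -E` uses POSIX extended ones, so patterns such as `\d` do not carry over unchanged.

```shell
#!/bin/bash
# Hypothetical sample lines, invented for this sketch only.
sample_lines() {
  printf '%s\n' \
    'u1 [03/Sep/2014:10:01:02] "GET /t?tag=pvstatall&x=1"' \
    'u2 [03/Sep/2014:10:05:09] "GET /t?tag=click&x=2"' \
    'u3 [03/Sep/2014:11:00:00] "GET /t?tag=pvstatall&x=3"'
}

# Fixed-substring match: local counterpart of  line LIKE '%needle%'
count_like() {
  sample_lines | grep -cF -- "$1"
}

# Regex match anywhere in the line: local counterpart of  line RLIKE 'regex'
count_rlike() {
  sample_lines | grep -cE -- "$1"
}

count_like 'tag=pvstatall'            # substring search
count_rlike 'tag=(pvstatall|click)'   # alternation needs a regex
```

Trying a pattern on a handful of lines like this is a cheap way to check it before running the real query on the cluster.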
create EXTERNAL table IF NOT EXISTS ext_M_track (
  line string
)
PARTITIONED BY (statDate STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/M_track';
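Tip: create the table as an EXTERNAL table. Otherwise Hive treats it as a managed table and takes over the original logs: loading data moves the files, and dropping the table deletes them.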
ALTER TABLE ext_M_track ADD PARTITION (statDate='20140903') LOCATION '/M_track/20140903';
select count(1) from ext_M_track where statDate='20140903' and line like '%tag=pvstatall%';
-- by hour
select regexp_extract(line, "(03/Sep/2014:\\d{2})", 1) hour, count(1) pv
from ext_M_track
where statDate='20140903' and line like '%tag=pvstatall%'
group by regexp_extract(line, "(03/Sep/2014:\\d{2})", 1);

-- by second
select regexp_extract(line, "(09/Mar/2015:\\d{2}:\\d{2}:\\d{2})", 1) second, count(1) pv
from ext_M_track
where statDate='20150309'
group by regexp_extract(line, "(09/Mar/2015:\\d{2}:\\d{2}:\\d{2})", 1);

-- result of the hourly query
03/Sep/2014:22    40867
03/Sep/2014:21    38951
03/Sep/2014:13    35113
03/Sep/2014:14    34285
03/Sep/2014:15    34120
03/Sep/2014:20    33852
03/Sep/2014:12    33308
03/Sep/2014:10    32644
03/Sep/2014:11    32362
03/Sep/2014:16    32284
03/Sep/2014:09    30031
03/Sep/2014:17    29023
03/Sep/2014:23    28247
03/Sep/2014:19    28125
03/Sep/2014:18    26250
03/Sep/2014:08    24452
03/Sep/2014:07    17456
03/Sep/2014:00    16103
03/Sep/2014:01    11679
03/Sep/2014:06    11074
03/Sep/2014:02    7262
03/Sep/2014:05    5367
03/Sep/2014:03    5047
                  4666
03/Sep/2014:04    4221
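Before running a grouping query like this over a whole day, it can help to check the extraction pattern on a few lines locally. The sketch below does the same "extract the hour, then count per hour" step with awk. The sample lines and the names `sample_lines` and `pv_by_hour` are made up for illustration. It takes the leftmost match on each line, and lines without a matching timestamp are simply skipped.

```shell
#!/bin/bash
# Hypothetical sample lines, invented for this sketch only.
sample_lines() {
  printf '%s\n' \
    'u1 [03/Sep/2014:10:01:02] "GET /t?tag=pvstatall&x=1"' \
    'u2 [03/Sep/2014:10:05:09] "GET /t?tag=click&x=2"' \
    'u3 [03/Sep/2014:11:00:00] "GET /t?tag=pvstatall&x=3"' \
    'u1 [03/Sep/2014:10:30:00] "GET /t?tag=pvstatall&x=4"'
}

# Read log lines on stdin and print "<day:hour> <pv>", highest pv first.
pv_by_hour() {
  awk '
    # match() finds the leftmost occurrence and sets RSTART/RLENGTH
    match($0, /03\/Sep\/2014:[0-9][0-9]/) {
      pv[substr($0, RSTART, RLENGTH)]++
    }
    END { for (h in pv) print h, pv[h] }
  ' | sort -k2,2nr -k1,1
}

# Same filter as the query: keep only tag=pvstatall lines, then group.
sample_lines | grep -F 'tag=pvstatall' | pv_by_hour
```

If the local counts on a small sample agree with what Hive returns for the same lines, the pattern is probably doing what you intended.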
set hive.cli.print.header=true;
select count(split(line, " ")[0]) totalPv,
       count(distinct split(line, " ")[0]) totalUv,
       count(split(split(line, " ")[0], "@@##")[0]) idwbPv,
       count(distinct split(split(line, " ")[0], "@@##")[0]) idwbUv,
       count(split(split(line, " ")[0], "@@##")[1]) idmPv,
       count(distinct split(split(line, " ")[0], "@@##")[1]) idmUv
from ext_M_track
where statDate='20140903' and line like '%tag=pvstatall%';

totalpv  totaluv  idwbpv  idwbuv  idmpv  idmuv
5967     386      5967    378     5921   363
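The PV/UV query above can be mirrored on a small sample with awk, which is handy for checking the logic of the nested split. This is only a sketch: the sample lines and the name `pv_uv` are made up, and it assumes the first space-separated field is an id whose two parts are joined by `@@##`, as the query suggests. Ids with no second part are left out of the last two counts, which would explain why idmPv can be smaller than totalPv.

```shell
#!/bin/bash
# Hypothetical sample lines, invented for this sketch only.
sample_lines() {
  printf '%s\n' \
    'a@@##m1 [03/Sep/2014:10:01:02] tag=pvstatall' \
    'a@@##m1 [03/Sep/2014:10:02:00] tag=pvstatall' \
    'a@@##m2 [03/Sep/2014:10:03:00] tag=pvstatall' \
    'b [03/Sep/2014:10:04:00] tag=pvstatall'
}

# Print: totalPv totalUv idwbPv idwbUv idmPv idmUv
pv_uv() {
  awk -F'[ ]' '
    {
      total_pv++
      if (!($1 in seen_id)) { seen_id[$1] = 1; total_uv++ }

      # Split the id field on the literal separator "@@##"
      n = split($1, part, "@@##")
      wb = (n >= 1) ? part[1] : ""
      idwb_pv++
      if (!(wb in seen_wb)) { seen_wb[wb] = 1; idwb_uv++ }

      # The second part is counted only when it exists
      if (n >= 2) {
        idm_pv++
        if (!(part[2] in seen_m)) { seen_m[part[2]] = 1; idm_uv++ }
      }
    }
    END {
      printf "%d %d %d %d %d %d\n",
        total_pv, total_uv, idwb_pv, idwb_uv, idm_pv, idm_uv
    }
  '
}

sample_lines | pv_uv
```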
select count(1)
from ext_M_track
where statDate='20140903' and line like '%tag=pvstatall%'
  and (split(line, " ")[0]='-'
       or (instr(line, 'trackURL={') > 0 and instr(line, '}&rand_id=') = 0)
       or size(split(line, "\"")) != 9);

16

select count(1)
from ext_M_track
where statDate='20140903' and line like '%tag=pvstatall%'
  and instr(line, '&smsc={') > 0;

517
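The first query above counts lines that look malformed in one of three ways: the first field is `-`, a `trackURL={` is never closed by `}&rand_id=`, or the line does not break into exactly 9 pieces around double quotes. A rough local version of the same three checks might look like the sketch below. The sample lines and the name `count_malformed` are made up for illustration, and the piece count assumes trailing empty pieces are kept, so 9 pieces means 8 double quotes.

```shell
#!/bin/bash
# Hypothetical sample lines: one well-formed, three malformed.
sample_lines() {
  printf '%s\n' \
    'u1 - - [03/Sep/2014:10:01:02] "GET /t?tag=pvstatall&trackURL={a}&rand_id=1" 200 "ref" "ua" "x"' \
    '- - - [03/Sep/2014:10:02:00] "GET /t?tag=pvstatall" 200 "ref" "ua" "x"' \
    'u2 - - [03/Sep/2014:10:03:00] "GET /t?tag=pvstatall&trackURL={a" 200 "ref" "ua" "x"' \
    'u3 - - [03/Sep/2014:10:04:00] "GET /t?tag=pvstatall" 200 "ref" "ua"'
}

# Count lines that fail any of the three checks.
count_malformed() {
  awk -F'[ ]' '
    {
      tmp = $0
      # gsub returns the number of quotes removed; pieces = quotes + 1
      pieces = gsub(/"/, "", tmp) + 1
      # trackURL opened but never followed by the closing marker
      open_url = (index($0, "trackURL={") > 0) && (index($0, "}&rand_id=") == 0)
      if ($1 == "-" || open_url || pieces != 9) bad++
    }
    END { print bad + 0 }
  '
}

sample_lines | count_malformed
```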
In step 2.2 we handled one day's logs by hand. In practice this is easy to automate: schedule the script below with crontab.
#!/bin/bash
# Add a daily partition to table ext_M_track in database tmpdb
yesterday=$(date -d '1 days ago' +'%Y%m%d')
hive -e "use tmpdb; ALTER TABLE ext_M_track ADD PARTITION (statDate='$yesterday') LOCATION '/M_track/$yesterday';"
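A scheduled job may get re-run for the same day, so one possible hardening is to build the statement with Hive's `ADD IF NOT EXISTS PARTITION` form and to validate the date first. The sketch below only prints the statement as a dry run. The function name `add_partition_sql` is made up for illustration, and the fixed day is the one used in the examples above.

```shell
#!/bin/bash
# Build the ADD PARTITION statement for one day given as YYYYMMDD.
# Prints the statement on stdout; returns 1 if the date looks wrong.
add_partition_sql() {
  case "$1" in
    [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
    *) echo "bad date: $1" >&2; return 1 ;;
  esac
  echo "use tmpdb; ALTER TABLE ext_M_track ADD IF NOT EXISTS PARTITION (statDate='$1') LOCATION '/M_track/$1';"
}

# Dry run with a fixed day: print the statement instead of executing it.
add_partition_sql 20140903

# In the cron script this would become something like:
#   yesterday=$(date -d '1 days ago' +'%Y%m%d')
#   hive -e "$(add_partition_sql "$yesterday")"
```

Keeping the statement builder separate from the `hive -e` call makes it easy to inspect what will be run before putting it on a schedule.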
[1] LanguageManual UDF
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
[2] archive-analysis
https://github.com/vinaygoel/archive-analysis/tree/master/hive/cdx
[3] Hive分析窗口函数(一) SUM,AVG,MIN,MAX (Hive analytic window functions, part 1: SUM, AVG, MIN, MAX)
http://superlxw1234.iteye.com/blog/2205770