按照国际惯例先来理论的介绍。
awk是一个强大的文本分析工具,相对于grep的查找,sed的编辑,awk在其对数据分析并生成报告时,显得尤为强大。简单来说awk就是把文件逐行的读入,以空格为默认分隔符将每行切片,切开的部分再进行各种分析处理。
awk有3个不同版本: awk、nawk和gawk,未作特别说明,一般指gawk,gawk 是 AWK 的 GNU 版本。
awk其名称得自于它的创始人 Alfred Aho 、Peter Weinberger 和 Brian Kernighan 姓氏的首个字母。实际上 AWK 的确拥有自己的语言: AWK 程序设计语言 , 三位创建者已将它正式定义为“样式扫描和处理语言”。它允许您创建简短的程序,这些程序读取输入文件、为数据排序、处理数据、对输入执行计算以及生成报表,还有无数其他的功能。
user表
id name addr
1 zhangsan hubei
3 lisi tianjin
4 wangmazi guangzhou
2 wangwu beijing
consumer表
id cost date
1 15 20121213
2 20 20121213
3 100 20121213
4 99 20121213
1 25 20121114
2 108 20121114
3 100 20121114
4 66 20121114
1 15 20121213
1 115 20121114
select * from user;
awk 1 user;
select * from consumer where cost > 100;
awk '$2>100' consumer
select distinct(date) from consumer;
awk '!a[$3]++{print $3}' consumer
select distinct(*) from consumer;
awk '!a[$0]++' consumer
select id from user order by id;
awk '{a[$1]}END{asorti(a);for(i=1;i<=length(a);i++){print a[i]}}' user
select * from consumer limit 2;
awk 'NR<=2' consumer
awk 'NR>2{exit}1' consumer # performance is better
select id, count(1), sum(cost) from consumer group by id having count(1) > 2;
awk '{a[$1]=a[$1]==""?$2:a[$1]","$2}END{for(i in a){c=split(a[i],b,",");if(c>2){sum=0;for(j in b){sum+=b[j]};print i"\t"c"\t"sum}}}' consumer
select name from user where name like 'wang%';
awk '$2 ~/^wang/{print $2}' user
select addr from user where addr like '%bei';
awk '/.*bei$/{print $3}' user
select addr from user where addr like '%bei%';
awk '$3 ~/bei/{print $3}' user
select a.* , b.* from user a inner join consumer b on a.id = b.id and b.id = 2;
awk 'ARGIND==1{a[$1]=$0;next}{if(($1 in a)&&$1==2){print a[$1]"\t"$2"\t"$3}}' user consumer
select a.* from user a union all select b.* from user b;
awk 1 user user
select a.* from user a union select b.* from user b;
awk '!a[$0]++' user user
SELECT * FROM consumer ORDER BY RAND() LIMIT 2;
awk 'BEGIN{srand();while(i<2){k=int(rand()*10)+1;if(!(k in a)){a[k];i++}}}(NR in a)' consumer
2.2.10 复杂的awk
awk -F'\t' 'BEGIN{OFS="\t"}{if(NR==FNR){install[$1]=$2}else{print $1,install[$1],$2}}' install.txt alive.txt >test.txt
实现需求:将install.txt中的数据读出,放入变量install[]中,类似java的map,第一列为key,第二列为value;读完install.txt后,再读入alive.txt,将alive.txt的第一列,install.txt中的对应的第二列,还有alive.txt的第二列,一起输出出来。