<Summary> awk Handbook

awk is a tool for analyzing and processing text files.

0x00 Usage

awk 'pattern { action }' [filenames]

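Here pattern selects the records to process. It is most often a regular expression enclosed in slashes, though it can also be an expression such as a comparison.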

action is the operation performed on each record that matches.
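As a rough illustration of the pattern and action parts, here is a minimal sketch that runs three small programs over the same input. The three-line sample data and the shell variable names are made up for this example.

```shell
# Hypothetical sample input: a name and a score on each line.
input='alice 90
bob 55
carol 72'

# Pattern only: print every line that matches the regex /o/.
pattern_only=$(printf '%s\n' "$input" | awk '/o/')

# Action only: run the action on every line.
action_only=$(printf '%s\n' "$input" | awk '{ print $1 }')

# Pattern and action: print the name when the score exceeds 60.
both=$(printf '%s\n' "$input" | awk '$2 > 60 { print $1 }')

printf '%s\n---\n%s\n---\n%s\n' "$pattern_only" "$action_only" "$both"
```

When the action is omitted, awk prints the matching line; when the pattern is omitted, the action runs on every line.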

0x01 Examples

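awk works as follows: it reads one record at a time, with records separated by the newline character \n, then splits the record into fields on the field separator (by default, runs of spaces or tabs). $0 refers to the whole record, $1 to the first field, and $n to the nth field.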

For example, run last -n 5:

last -n 5
root     pts/4        172.20.3.158     Mon Aug  1 11:20   still logged in   
root     pts/3        172.20.3.158     Mon Aug  1 10:58   still logged in   
root     pts/2        172.20.3.158     Mon Aug  1 10:57   still logged in   
root     pts/1        172.20.3.158     Mon Aug  1 10:57   still logged in   
root     pts/0        172.20.3.158     Mon Aug  1 10:57   still logged in   

wtmp begins Mon Apr 25 17:46:29 2016

Splitting that output on the default separator gives us the first and second fields:

last -n 5 | awk '{print $1,$2}'
root pts/4
root pts/3
root pts/2
root pts/1
root pts/0

wtmp begins

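Next, let's set the field separator ourselves, which is normally done with the -F option, and print the selected fields separated by a tab (\t).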

cat /etc/passwd | awk -F ':' '{print $1"\t"$7}'
at      /bin/bash
bin     /bin/bash
daemon  /bin/bash
ftp     /bin/bash
ftpsecure       /bin/false
games   /bin/bash
gdm     /bin/false
lp      /bin/bash
mail    /bin/false
man     /bin/bash
messagebus      /bin/false
news    /bin/bash
nobody  /bin/bash
nscd    /sbin/nologin
ntp     /bin/false
openslp /sbin/nologin
polkitd /sbin/nologin
postfix /bin/false
pulse   /sbin/nologin
root    /bin/zsh
rpc     /sbin/nologin
rtkit   /bin/false
scard   /usr/sbin/nologin
sshd    /bin/false
statd   /sbin/nologin
usbmux  /sbin/nologin
uucp    /bin/bash
vnc     /sbin/nologin
wwwrun  /bin/false
edward  /bin/zsh
ftp-edward      /bin/bash
lighthttpd      /bin/bash

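Next, let's use the BEGIN block, the main block (called PROC here), and the END block to control the flow of the program. awk first runs the BEGIN block, then reads the input and splits it on \n into records. For each record it fills in the fields and runs the main block. Once the last record has been processed, it runs the END block.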

Now let's modify the program above so that it first prints the header name,shell and finishes by printing Action Finished.

cat /etc/passwd | awk -F ':' 'BEGIN {print "name,shell"} {print $1","$7} END {print "Action Finished"}'
name,shell
at,/bin/bash
bin,/bin/bash
daemon,/bin/bash
ftp,/bin/bash
ftpsecure,/bin/false
games,/bin/bash
gdm,/bin/false
lp,/bin/bash
mail,/bin/false
man,/bin/bash
messagebus,/bin/false
news,/bin/bash
nobody,/bin/bash
nscd,/sbin/nologin
ntp,/bin/false
openslp,/sbin/nologin
polkitd,/sbin/nologin
postfix,/bin/false
pulse,/sbin/nologin
root,/bin/zsh
rpc,/sbin/nologin
rtkit,/bin/false
scard,/usr/sbin/nologin
sshd,/bin/false
statd,/sbin/nologin
usbmux,/sbin/nologin
uucp,/bin/bash
vnc,/sbin/nologin
wwwrun,/bin/false
edward,/bin/zsh
ftp-edward,/bin/bash
lighthttpd,/bin/bash
Action Finished

So how do we get the shell of the root account from /etc/passwd?

awk -F ':' '/root/{print $7}' /etc/passwd
/bin/zsh

Here the text between the slashes is the pattern: if the current line matches the regular expression root, the action is applied to that line.
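One caveat is that /root/ matches a line containing root anywhere, not only in the user name. The sketch below contrasts it with an exact comparison on the first field. The sample data is hypothetical, and "rooter" is an invented account used only to show the difference.

```shell
# Hypothetical passwd-style sample; "rooter" is an invented account.
sample='root:x:0:0:root:/root:/bin/zsh
rooter:x:1002:100::/home/rooter:/bin/bash
bin:x:1:1:bin:/bin:/bin/bash'

# Regex on the whole line: also matches the "rooter" entry.
loose=$(printf '%s\n' "$sample" | awk -F ':' '/root/ { print $7 }')

# Exact comparison on the first field: matches only the root account.
exact=$(printf '%s\n' "$sample" | awk -F ':' '$1 == "root" { print $7 }')

printf 'loose:\n%s\nexact:\n%s\n' "$loose" "$exact"
```

On a file where only one line mentions root the two forms give the same result, so the regex version is fine for a quick look.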

0x02 Built-in variables

awk provides a number of built-in variables that describe its environment, and these variables can be changed.

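ARGC               Number of command-line arguments
ARGV               Array of command-line arguments
ENVIRON            Associative array of the environment variables
FILENAME           Name of the file awk is currently reading
FNR                Record number within the current file
FS                 Input field separator; equivalent to the -F command-line option
NF                 Number of fields in the current record
NR                 Total number of records read so far
OFS                Output field separator
ORS                Output record separator
RS                 Input record separator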
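To make a few of these concrete, here is a small sketch that sets FS and OFS in BEGIN and prints NR, NF, the first field, and the last field of each record. The two sample records are taken from the /etc/passwd listing used in this post; the variable name vars_out is just for illustration.

```shell
# Two sample records in /etc/passwd format.
sample='root:x:0:0:root:/root:/bin/zsh
bin:x:1:1:bin:/bin:/bin/bash'

# FS and OFS are assigned in BEGIN; each comma in print is output as OFS.
vars_out=$(printf '%s\n' "$sample" | awk 'BEGIN { FS = ":"; OFS = " | " }
{ print NR, NF, $1, $NF }
END { print "records read: " NR }')

printf '%s\n' "$vars_out"
```

$NF is the last field because NF holds the number of fields in the current record.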

Now let's try them out:

awk -F ':' 'BEGIN {print "ARGC:" ARGC " ARGV:" ARGV[0]","ARGV[1] " Filename:" FILENAME " Total:" FNR " Field Separator:" FS " Row Separator:" RS} {print "currLine:" NR " currColumns:" NF " content:" $0}' /etc/passwd
ARGC:2 ARGV:awk,/etc/passwd Filename: Total:0 Field Separator: Row Separator:

currLine:1 currColumns:7 content:at:x:25:25:Batch jobs daemon:/var/spool/atjobs:/bin/bash
currLine:2 currColumns:7 content:bin:x:1:1:bin:/bin:/bin/bash
currLine:3 currColumns:7 content:daemon:x:2:2:Daemon:/sbin:/bin/bash
currLine:4 currColumns:7 content:ftp:x:40:49:FTP account:/srv/ftp:/bin/bash
currLine:5 currColumns:7 content:ftpsecure:x:488:65534:Secure FTP User:/var/lib/empty:/bin/false
currLine:6 currColumns:7 content:games:x:12:100:Games account:/var/games:/bin/bash
currLine:7 currColumns:7 content:gdm:x:486:485:Gnome Display Manager daemon:/var/lib/gdm:/bin/false
currLine:8 currColumns:7 content:lp:x:4:7:Printing daemon:/var/spool/lpd:/bin/bash
currLine:9 currColumns:7 content:mail:x:8:12:Mailer daemon:/var/spool/clientmqueue:/bin/false
currLine:10 currColumns:7 content:man:x:13:62:Manual pages viewer:/var/cache/man:/bin/bash
currLine:11 currColumns:7 content:messagebus:x:499:499:User for D-Bus:/var/run/dbus:/bin/false
currLine:12 currColumns:7 content:news:x:9:13:News system:/etc/news:/bin/bash
currLine:13 currColumns:7 content:nobody:x:65534:65533:nobody:/var/lib/nobody:/bin/bash
currLine:14 currColumns:7 content:nscd:x:496:495:User for nscd:/run/nscd:/sbin/nologin
currLine:15 currColumns:7 content:ntp:x:74:492:NTP daemon:/var/lib/ntp:/bin/false
currLine:16 currColumns:7 content:openslp:x:494:2:openslp daemon:/var/lib/empty:/sbin/nologin
currLine:17 currColumns:7 content:polkitd:x:497:496:User for polkitd:/var/lib/polkit:/sbin/nologin
currLine:18 currColumns:7 content:postfix:x:51:51:Postfix Daemon:/var/spool/postfix:/bin/false
currLine:19 currColumns:7 content:pulse:x:490:489:PulseAudio daemon:/var/lib/pulseaudio:/sbin/nologin
currLine:20 currColumns:7 content:root:x:0:0:root:/root:/bin/zsh
currLine:21 currColumns:7 content:rpc:x:495:65534:user for rpcbind:/var/lib/empty:/sbin/nologin
currLine:22 currColumns:7 content:rtkit:x:491:490:RealtimeKit:/proc:/bin/false
currLine:23 currColumns:7 content:scard:x:487:487:Smart Card Reader:/var/run/pcscd:/usr/sbin/nologin
currLine:24 currColumns:7 content:sshd:x:498:498:SSH daemon:/var/lib/sshd:/bin/false
currLine:25 currColumns:7 content:statd:x:489:65534:NFS statd daemon:/var/lib/nfs:/sbin/nologin
currLine:26 currColumns:7 content:usbmux:x:493:65534:usbmuxd daemon:/var/lib/usbmuxd:/sbin/nologin
currLine:27 currColumns:7 content:uucp:x:10:14:Unix-to-Unix CoPy system:/etc/uucp:/bin/bash
currLine:28 currColumns:7 content:vnc:x:492:491:user for VNC:/var/lib/empty:/sbin/nologin
currLine:29 currColumns:7 content:wwwrun:x:30:8:WWW daemon apache:/var/lib/wwwrun:/bin/false
currLine:30 currColumns:7 content:edward:x:1000:100:Edward:/home/edward:/bin/zsh
currLine:31 currColumns:7 content:ftp-edward:x:1001:100::/home/ftp-edward:/bin/bash
currLine:32 currColumns:7 content:lighthttpd:x:1004:1000::/home/lighthttpd:/bin/bash

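This shows that in BEGIN, before the input file has been opened, the file name is empty and the record count is 0. The separators, by contrast, already have values there: RS defaults to a newline, which produces the blank line after the first line of output. So we change the program to: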

awk -F ':' 'BEGIN {
  print "ARGC:" ARGC
  print "ARGV:"
  for (i = 0; i < ARGC; i++) print ARGV[i]
}
END {
  print "Filename:" FILENAME " Total:" FNR
}' /etc/passwd

Similarly, the printf function can be used to format the output and make it easier to read.
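As a sketch of what that might look like, the example below lines up the user name, UID, and shell in fixed-width columns. The two records come from the /etc/passwd listing above; the column widths are arbitrary choices for this example.

```shell
# Two sample records in /etc/passwd format.
sample='root:x:0:0:root:/root:/bin/zsh
ftpsecure:x:488:65534:Secure FTP User:/var/lib/empty:/bin/false'

# %-12s left-justifies the name in 12 columns; %5d right-justifies the UID in 5.
formatted=$(printf '%s\n' "$sample" | awk -F ':' '{ printf("%-12s%5d  %s\n", $1, $3, $7) }')

printf '%s\n' "$formatted"
```

Unlike print, printf does not append a newline on its own, so the format string ends with \n.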

0x03 awk programming

Variables and assignment

Besides its built-in variables, awk also supports user-defined variables.

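Below we count the users in /etc/passwd. Here count is initialized to 1; a variable that is never initialized starts at 0. Note that starting from 1 makes the final total one higher than the number of records, which is why the run below reports 33 for 32 entries.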

awk 'BEGIN {
  count = 1;
  print count;
}
{
  count++;
  print $0;
}
END {
  print "user count is "count;
}
' /etc/passwd
1
at:x:25:25:Batch jobs daemon:/var/spool/atjobs:/bin/bash
bin:x:1:1:bin:/bin:/bin/bash
daemon:x:2:2:Daemon:/sbin:/bin/bash
ftp:x:40:49:FTP account:/srv/ftp:/bin/bash
ftpsecure:x:488:65534:Secure FTP User:/var/lib/empty:/bin/false
games:x:12:100:Games account:/var/games:/bin/bash
gdm:x:486:485:Gnome Display Manager daemon:/var/lib/gdm:/bin/false
lp:x:4:7:Printing daemon:/var/spool/lpd:/bin/bash
mail:x:8:12:Mailer daemon:/var/spool/clientmqueue:/bin/false
man:x:13:62:Manual pages viewer:/var/cache/man:/bin/bash
messagebus:x:499:499:User for D-Bus:/var/run/dbus:/bin/false
news:x:9:13:News system:/etc/news:/bin/bash
nobody:x:65534:65533:nobody:/var/lib/nobody:/bin/bash
nscd:x:496:495:User for nscd:/run/nscd:/sbin/nologin
ntp:x:74:492:NTP daemon:/var/lib/ntp:/bin/false
openslp:x:494:2:openslp daemon:/var/lib/empty:/sbin/nologin
polkitd:x:497:496:User for polkitd:/var/lib/polkit:/sbin/nologin
postfix:x:51:51:Postfix Daemon:/var/spool/postfix:/bin/false
pulse:x:490:489:PulseAudio daemon:/var/lib/pulseaudio:/sbin/nologin
root:x:0:0:root:/root:/bin/zsh
rpc:x:495:65534:user for rpcbind:/var/lib/empty:/sbin/nologin
rtkit:x:491:490:RealtimeKit:/proc:/bin/false
scard:x:487:487:Smart Card Reader:/var/run/pcscd:/usr/sbin/nologin
sshd:x:498:498:SSH daemon:/var/lib/sshd:/bin/false
statd:x:489:65534:NFS statd daemon:/var/lib/nfs:/sbin/nologin
usbmux:x:493:65534:usbmuxd daemon:/var/lib/usbmuxd:/sbin/nologin
uucp:x:10:14:Unix-to-Unix CoPy system:/etc/uucp:/bin/bash
vnc:x:492:491:user for VNC:/var/lib/empty:/sbin/nologin
wwwrun:x:30:8:WWW daemon apache:/var/lib/wwwrun:/bin/false
edward:x:1000:100:Edward:/home/edward:/bin/zsh
ftp-edward:x:1001:100::/home/ftp-edward:/bin/bash
lighthttpd:x:1004:1000::/home/lighthttpd:/bin/bash
user count is 33
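To get a total that equals the number of records, the counter can start from 0, or the built-in NR can be used directly. A minimal sketch of both, using three of the records above as sample input (the shell variable names are illustrative):

```shell
# Three sample records in /etc/passwd format.
sample='at:x:25:25:Batch jobs daemon:/var/spool/atjobs:/bin/bash
bin:x:1:1:bin:/bin:/bin/bash
daemon:x:2:2:Daemon:/sbin:/bin/bash'

# An uninitialized variable starts at 0, so no BEGIN block is needed.
by_counter=$(printf '%s\n' "$sample" | awk '{ count++ } END { print "user count is " count }')

# NR still holds the number of records read when END runs.
by_nr=$(printf '%s\n' "$sample" | awk 'END { print "user count is " NR }')

printf '%s\n%s\n' "$by_counter" "$by_nr"
```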

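Next, we total the number of bytes used by the files in a directory. The first line of ls -l output (the "total" line) has no fifth field, which is why a blank line appears right after the [start] line.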

ls -l | awk 'BEGIN {
  size = 0;
  printf("[start]Initial Size is %s\n",size);
}{
  print $5;
  size = size + $5;
}
END {
  printf("[end]Final Size is %s\n",size);
}'
[start]Initial Size is 0

2713
0
472
244
58464
0
0
0
26
0
0
31729
0
49548
46
49548
47650
11
[end]Final Size is 240451

To show the total in megabytes (M) instead:

ls -l | awk 'BEGIN {
  size = 0;
  printf("[start]Initial Size is %s\n",size);
}{
  print $5;
  size = size + $5;
}
END {
  printf("[end]Final Size is %sM\n",size/1024/1024);
}'
[start]Initial Size is 0

2713
0
472
244
58464
0
0
0
26
0
0
31729
0
49548
46
49548
47650
11
[end]Final Size is 0.229312M
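A slightly tidier variant, sketched here under the assumption that the first line of the listing is the "total" line, skips that line with NR > 1 and rounds the result with %.2f. The listing is hypothetical sample data, since real ls -l output varies between systems.

```shell
# Hypothetical `ls -l` style listing; file names and sizes are invented.
listing='total 12
-rw-r--r-- 1 root root 2713 Aug  1 10:57 notes.txt
drwxr-xr-x 2 root root    0 Aug  1 10:57 empty
-rw-r--r-- 1 root root 1048576 Aug  1 10:58 blob.bin'

# NR > 1 skips the "total" line; size starts at 0 without initialization.
size_report=$(printf '%s\n' "$listing" | awk 'NR > 1 { size += $5 }
END { printf("%d bytes, %.2f MiB\n", size, size / 1024 / 1024) }')

printf '%s\n' "$size_report"
```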

Conditional statements

if (expression) {
    statement;
    statement;
    ... ...
}

if (expression) {
    statement;
} else {
    statement2;
}

if (expression) {
    statement1;
} else if (expression1) {
    statement2;
} else {
    statement3;
}
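As a concrete instance of the if / else if / else form, the sketch below classifies accounts by UID (the third field). The records come from the listing above; the cutoff of 1000 for regular users is an assumption made for this example, not something awk or /etc/passwd defines.

```shell
# Three sample records in /etc/passwd format.
sample='root:x:0:0:root:/root:/bin/zsh
bin:x:1:1:bin:/bin:/bin/bash
edward:x:1000:100:Edward:/home/edward:/bin/zsh'

# Classify each account by UID; the 1000 cutoff is an assumed convention.
classes=$(printf '%s\n' "$sample" | awk -F ':' '{
  if ($3 == 0) {
    kind = "superuser"
  } else if ($3 < 1000) {
    kind = "system"
  } else {
    kind = "regular"
  }
  print $1 ": " kind
}')

printf '%s\n' "$classes"
```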

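## Loop statements
Loops work much the same way: awk supports C-style while, do ... while, and for loops, plus for (key in array) for iterating over the keys of an array.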
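A minimal sketch of a for loop and a while loop over the fields of one record; the input line "3 4 5" and the variable names are made up for illustration.

```shell
loops_out=$(printf '3 4 5\n' | awk '{
  # for loop: add up every field of the record.
  sum = 0
  for (i = 1; i <= NF; i++) sum += $i

  # while loop: concatenate the fields in reverse order.
  i = NF
  rev = ""
  while (i >= 1) { rev = rev $i; i-- }

  print sum, rev
}')

printf '%s\n' "$loops_out"
```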

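## Arrays
In awk, array subscripts can be numbers or strings, so a subscript is usually called a key. Keys and values are stored internally in a hash table. Because a hash table does not keep its entries in order, the elements of an array may not come out in the order you expect when you display them. Like variables, arrays are created automatically on first use, and awk decides automatically whether each stored value is a number or a string. Arrays are typically used to **collect information** from records, for example to **compute totals**, **count words**, or **track how many times a pattern was matched**.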
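Word counting is a common use of arrays. In the sketch below, each word serves as the key and its count as the value. Because for (key in array) visits keys in an unspecified order, the output is piped through sort to make it stable. The sample text is invented for this example.

```shell
# Hypothetical sample text.
text='the quick fox
the lazy dog
the fox'

word_counts=$(printf '%s\n' "$text" | awk '{
  # The word itself is the array key; count[] is created on first use.
  for (i = 1; i <= NF; i++) count[$i]++
}
END {
  for (word in count) print word, count[word]
}' | LC_ALL=C sort)

printf '%s\n' "$word_counts"
```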
