gawk Ⅰ

awk vs gawk

Besides interactive text editors such as vi, Linux has two command line editors: sed and gawk. This article covers gawk.

gawk

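gawk is the GNU version of the Unix awk program. It is a programming language in its own right, and it lets you:

Define variables to store data
Use arithmetic and string operators to operate on data
Add logic control, such as if-then statements and loops
Extract data from a data file and generate well-formatted reports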
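
As a rough illustration of the first three points, the sketch below sums the numbers 1 to 5 inside a single gawk program. It uses a BEGIN block (covered later in this article) so that no input file is needed. The variable names total and i are made up for the example, and the first line is only a fallback for machines without gawk.

```shell
# Assumption: fall back to the system awk when gawk is not installed
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

result=$(gawk 'BEGIN {
    total = 0                      # variable holding the running sum
    for (i = 1; i <= 5; i++)       # loop
        total = total + i          # arithmetic operator
    if (total > 10)                # logic control
        print "total is " total    # strings are joined by writing them side by side
    else
        print "total is small"
}')
echo "$result"    # total is 15
```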

The gawk syntax is:

gawk options program file

The available options are:

Option	Description
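-F fs	Specifies the field separator used to split each data line
-f file	Reads the program commands from the specified file
-v var=value	Defines a variable and its initial value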
-mf N	Specifies the maximum number of fields to process in the data file
-mr N	Specifies the maximum record size in the data file
-W keyword	Specifies the compatibility mode or warning level for gawk
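
A small sketch of how -F and -v might be combined on the command line. The variable names greeting and n are invented for the example; they are not built into gawk.

```shell
# Assumption: fall back to the system awk when gawk is not installed
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# -v sets a program variable before any input is read
hello=$(echo "Rich" | gawk -v greeting="Hello" '{print greeting ", " $1}')
echo "$hello"     # Hello, Rich

# -F sets the separator; $n selects the field whose number is stored in n
field=$(echo "a:b:c" | gawk -F: -v n=2 '{print $n}')
echo "$field"     # b
```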

Reading the program script from the command line

The commands in a gawk program script go inside braces { }. gawk also expects the whole script to arrive as a single text string, so enclose it in single quotes:

gawk '{print "Hello World!"}'

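When you run this command, nothing is printed at first; gawk simply waits for you to type something. Because no file was given, gawk reads its data from STDIN by default.
The print command writes to STDOUT. Like sed, gawk processes the data stream one line at a time, so it prints Hello World! once for every line you enter. To end the gawk command, press Ctrl+D, which signals end of file (EOF).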
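
To see the line-by-line behavior without typing at the terminal, one option is to pipe a couple of lines into the same program. This is only a sketch; the input text is arbitrary.

```shell
# Assumption: fall back to the system awk when gawk is not installed
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# Two input lines arrive on STDIN, so the program body runs twice
out=$(printf 'first\nsecond\n' | gawk '{print "Hello World!"}')
echo "$out"
```

The pipe closes STDIN when printf finishes, so gawk sees end of file and exits by itself.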

Using data field variables

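While processing each line of data, gawk makes the following variables available:

$0 represents the entire line
$1 represents the first field in the line
$2 represents the second field in the line
$n represents the nth field in the line

By default, gawk treats any whitespace character, such as a tab or a space, as the field separator.
Using the default separator: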

$ cat data2.txt
One line of test text.
Two lines of test text.
Three lines of test text.

$ gawk '{print $1}' data2.txt
One
Two
Three

Specifying a separator:

$ gawk -F: '{print $1}' /etc/passwd
root
bin
daemon
adm
lp
sync
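
The two splitting modes differ in one detail that the listings above do not show: with the default separator a run of spaces and tabs counts as a single separator, while an explicit -F: splits at every colon and keeps empty fields. A minimal sketch with made-up input:

```shell
# Assumption: fall back to the system awk when gawk is not installed
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# Default separator: the mix of spaces and a tab between "One" and "line" is one separator
swapped=$(printf 'One \t  line of test text.\n' | gawk '{print $2, $1}')
echo "$swapped"   # line One

# Explicit separator: the empty field between the two colons is still counted as $2
third=$(echo "a::c" | gawk -F: '{print $3}')
echo "$third"     # c
```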

Using multiple commands in the program script

As with sed, to use multiple commands in a program script on the command line, separate them with semicolons:

$ echo "My name is Rich" | gawk '{$4="Christine"; print $0}'
My name is Christine

You can also enter the commands one line at a time at the shell's secondary prompt:

$ echo "My name is Colin" | gawk '{
> $4="f"
> print $0}'
My name is f
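
One side effect of assigning to a field is worth knowing: gawk rebuilds $0 and joins the fields with its output field separator, the built-in OFS variable, which defaults to a single space. The examples above hide this because their input is already space separated. A sketch with colon-separated input:

```shell
# Assumption: fall back to the system awk when gawk is not installed
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# Assigning $2 rebuilds the line with the default OFS (a space)
rebuilt=$(echo "a:b:c" | gawk -F: '{$2="X"; print $0}')
echo "$rebuilt"   # a X c

# Setting OFS to a colon keeps the original layout
kept=$(echo "a:b:c" | gawk -F: -v OFS=: '{$2="X"; print $0}')
echo "$kept"      # a:X:c
```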

Reading the program from a file

gawk can also read the program script from a file:

$ cat script2.gawk
{
text = "'s home directory is "
print $1 text $6
}

$ gawk -F : -f script2.gawk /etc/passwd
root's home directory is /root
bin's home directory is /bin
daemon's home directory is /sbin
adm's home directory is /var/adm
lp's home directory is /var/spool/lpd

Note: a gawk program script can define its own variables, and you reference them without a $ sign.
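
The same pattern can be tried without touching /etc/passwd. The sketch below writes a small program to a temporary file and feeds it one made-up passwd-style line; the file name comes from mktemp and the variable name text is arbitrary.

```shell
# Assumption: fall back to the system awk when gawk is not installed
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

prog=$(mktemp)
cat > "$prog" <<'EOF'
{
    text = " has shell "    # user variable: assigned and read without $
    print $1 text $7        # $1 and $7 are fields, so they keep the $
}
EOF

out=$(echo "root:x:0:0:root:/root:/bin/bash" | gawk -F: -f "$prog")
rm -f "$prog"
echo "$out"       # root has shell /bin/bash
```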

Running scripts before processing data

The BEGIN keyword tells gawk which program script to run before it processes any data lines:

$ cat data3.txt
Line 1
Line 2
Line 3

$ gawk 'BEGIN {print "The data3 file contents:"} {print $0}' data3.txt 
The data3 file contents:
Line 1
Line 2
Line 3

Running scripts after processing data

The END keyword tells gawk which program script to run after it has processed all the data lines:

$ gawk '
> BEGIN {print "The data3 file contents:"}
> {print $0}
> END {print "End of File"}' data3.txt
The data3 file contents:
Line 1
Line 2
Line 3
End of File
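
BEGIN and END become more useful when they share a variable with the main block. As a sketch, the following counts input lines; count is a made-up variable name, and the input mimics data3.txt.

```shell
# Assumption: fall back to the system awk when gawk is not installed
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

out=$(printf 'Line 1\nLine 2\nLine 3\n' | gawk '
BEGIN { count = 0 }                  # runs once, before any input is read
      { count++ }                    # runs once for every input line
END   { print count " lines read" }  # runs once, after the last line
')
echo "$out"       # 3 lines read
```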

Creating a gawk script

Contents of script4.gawk:

BEGIN {
  print "The latest list of users and shells"
  print " UserID \t Shell"
  print "-------- \t -------"
  FS=":"
}

{
  print $1 " \t " $7
}

END {
  print "This concludes the listing"
}

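Note: the FS variable defines the field separation character for the data lines. Setting it in the BEGIN block makes it take effect before the first line is read.

Use the script to generate the report: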

gawk -f script4.gawk /etc/passwd
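
Setting FS in a BEGIN block should behave the same as passing -F on the command line. The sketch below compares the two on a pair of made-up passwd-style lines, so the result does not depend on the local /etc/passwd.

```shell
# Assumption: fall back to the system awk when gawk is not installed
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# Made-up sample data in /etc/passwd format
sample='root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin'

# Separator given on the command line
with_flag=$(printf '%s\n' "$sample" | gawk -F: '{print $1 " \t " $7}')

# Separator set inside the program, as script4.gawk does
with_begin=$(printf '%s\n' "$sample" | gawk 'BEGIN {FS=":"} {print $1 " \t " $7}')

echo "$with_begin"
```

Keeping FS inside the script means the caller does not have to remember the right -F value.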

Reference: Linux Command Line and Shell Scripting Bible, 3rd Edition, Chapter 19