Awk comes in three variants: awk, nawk, and gawk. In practice the name usually refers to gawk; on many Linux systems awk is a link to gawk.
Syntax:
awk '{pattern + action}' {filenames}
There are three ways to invoke awk; mastering the command-line form is enough to start with.
1. Command line
awk [-F field-separator] 'commands' input-file(s)
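Here commands is the actual awk program and [-F field-separator] is optional. input-file(s) are the files to process.
In awk, each item on a line that is delimited by the field separator is called a field. When no -F field separator is given, the default is whitespace: fields are separated by runs of spaces and tabs.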
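As a small illustration of the field separator, the sketch below splits colon-separated records with -F: and then shows the default whitespace splitting. The sample data and the variable names (sample, homes, second) are made up for this example.

```shell
# Hypothetical sample data: colon-separated records, similar in shape to /etc/passwd.
sample='alice:x:1001:/home/alice
bob:x:1002:/home/bob'

# -F: makes ":" the field separator, so $1 is the name and $4 the home directory.
homes=$(printf '%s\n' "$sample" | awk -F: '{ print $1, $4 }')
printf '%s\n' "$homes"

# Without -F, awk splits on runs of blanks (spaces and tabs).
second=$(printf 'a   b\tc\n' | awk '{ print $2 }')
printf '%s\n' "$second"
```

The second command prints b even though the fields are separated by several spaces and a tab.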
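2. Shell script
Put all the awk commands in a file, make that file executable, and put the awk interpreter on the first line of the script, so that the program can be run by typing the script name.
This is the counterpart of the first line of a shell script: #!/bin/sh
Here it becomes: #!/bin/awk -f (the -f is required so that awk reads the program from the script file; the path to awk may differ on your system)
3. Put all the awk commands in a separate file, then invoke: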
awk -f awk-script-file input-file(s)
Here the -f option loads the awk program from awk-script-file, and input-file(s) is the same as above.
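A minimal sketch of the -f form, assuming mktemp is available; the temporary script file and the variable names (script, cols) are only for illustration.

```shell
# Write a tiny awk program to a temporary file (the file name is arbitrary).
script=$(mktemp)
cat > "$script" <<'EOF'
# print the second field of every line
{ print $2 }
EOF

# -f loads the program from the file instead of the command line.
cols=$(printf '111 aaa adfsk\n2222 bbb skeioskls\n' | awk -f "$script")
printf '%s\n' "$cols"
rm -f "$script"
```

Keeping the program in a file avoids shell quoting problems once the program grows beyond a line or two.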
Let's start with the simplest possible use.
$export PS1='fxb$';
fxb$cat abcd.log
111 aaa adfsk
2222 bbb skeioskls
33 cccc slweoklsdoi
44 ddddd slkwoelksl
Show only the second column:
fxb$cat abcd.log|awk '{print $2}'
aaa
bbb
cccc
ddddd
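You can also run fxb$awk '{print $2}' abcd.log with the same result, although piping is the more familiar habit.
This is the simplest use: no separator is specified because the default whitespace separator applies, and no pattern is specified, only an action.
Next, a simple example that uses a pattern:
Find the lines that match 222 and print the whole line: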
fxb$cat abcd.log|awk '/222/{print $0}'
2222 bbb skeioskls
fxb$cat abcd.log|awk '/222/'
2222 bbb skeioskls
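With or without print $0, the result is the same.
Here / / encloses a regular expression, and $0 stands for the whole line.
• For regular expressions, see the regular expression study notes.
These two examples illustrate how awk works:
awk scans the input line by line, from the first line to the last, and tests each line against the pattern. If the line matches, the specified action is performed. If no pattern is given, every line matches; if no action is given, the whole line is printed.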
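The three cases described above (pattern only, action only, pattern plus action) can be sketched as follows. The data reuses three lines in the style of abcd.log; the variable names are invented for this example.

```shell
data='111 aaa adfsk
2222 bbb skeioskls
33 cccc slweoklsdoi'

# Pattern only: matching lines are printed in full (implicit print $0).
pattern_only=$(printf '%s\n' "$data" | awk '/222/')

# Action only: the action runs for every line.
action_only=$(printf '%s\n' "$data" | awk '{ print $1 }')

# Pattern and action: the action runs only for matching lines.
both=$(printf '%s\n' "$data" | awk '/ccc/ { print $3 }')

printf '%s\n' "$pattern_only" "$action_only" "$both"
```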
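Next, a closer look at awk's patterns, actions, and variables.
Patterns can be:
• /regular expression/: uses an extended set of wildcard characters. This includes pattern-matching expressions, which use the operators ~ (matches) and !~ (does not match).
• Relational expression patterns
• Range patterns: a pair of patterns selects a range, from one line to another.
• BEGIN/END: rules that run at the start or at the end
• BEGINFILE/ENDFILE: Two special patterns for advanced control.
• Empty: The empty pattern, which matches every record.
The /regexp/ notation indicates a regular expression.
Note: the regular expression material is not repeated here; see the regular expression study notes.
A few additional points about awk regular expressions are worth studying:
• Dynamic regular expressions
What is a dynamic regexp, and what is it for?
The following is taken from the web (this is enough about dynamic regexps for now).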
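The right-hand side of a ‘~’ or ‘!~’ operator need not be a regexp constant (a string of characters between slashes, such as /pattern/). It may be any expression. The expression is evaluated and converted to a string if necessary, and the contents of that string are then used as the regexp. A regexp computed in this way is called a dynamic regexp. For example:
BEGIN { identifier_regexp = "[A-Za-z_][A-Za-z_0-9]+" }   # no slashes
$0 ~ identifier_regexp { print }
This sets identifier_regexp to a regular expression stored in an awk variable, then tests whether each input record matches it.
Note: with the ‘~’ and ‘!~’ operators, a regexp constant in slashes differs slightly from a string constant in double quotes. If you use a string constant, you must understand that the string is in effect scanned twice: once when awk reads your program, and again when the pattern on the right-hand side is matched against the string on the left. This holds for any string-valued expression (such as the variable identifier_regexp above), not only for string constants.
What difference does scanning twice make? The answer concerns escape sequences, and backslashes in particular: to get one backslash into a regexp held in a string, you must type two backslashes.
For example, /\*/ is a regexp constant that matches a literal ‘*’, and one backslash is enough. To do the same with a string you must type "\\*": the first backslash escapes the second, so that the string actually contains the two characters ‘\’ and ‘*’. This comes from how awk processes string constants; it is separate from shell metacharacter expansion.
Dynamic regexps can also be used in the match function. Their use with gsub, sub, and gensub is untested here; where possible, prefer regexp constants.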
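The difference between a regexp constant and a string used as a regexp can be tried out with the sketch below. The ^ and $ anchors in identifier_regexp are an addition for this example (so that only whole-line identifiers match), and the input lines are invented.

```shell
# A regexp stored in a variable (a dynamic regexp); no slashes around it.
ids=$(printf 'foo_1\n123\n_x9\n' | awk '
BEGIN { identifier_regexp = "^[A-Za-z_][A-Za-z_0-9]+$" }
$0 ~ identifier_regexp { print }')

# Matching a literal "*": one backslash in a regexp constant,
# two backslashes in a string that is used as a regexp.
const=$(printf 'a*b\nab\n' | awk '/\*/')
dyn=$(printf 'a*b\nab\n' | awk 'BEGIN { re = "\\*" } $0 ~ re')

printf '%s\n' "$ids" "$const" "$dyn"
```

Both const and dyn select only the line a*b; the awk program is inside single quotes, so the shell passes the backslashes through unchanged.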
Regexp operators specific to gawk, worth learning:
\s
Matches any whitespace character. Think of it as shorthand for ‘[[:space:]]’.
\S
Matches any character that is not whitespace. Think of it as shorthand for ‘[^[:space:]]’.
\w
Matches any word-constituent character—that is, it matches any letter, digit, or underscore. Think of it as shorthand for ‘[[:alnum:]_]’.
\W
Matches any character that is not word-constituent. Think of it as shorthand for ‘[^[:alnum:]_]’.
\<
Matches the empty string at the beginning of a word. For example, /\<away/ matches ‘away’ but not ‘stowaway’.
\>
Matches the empty string at the end of a word. For example, /stow\>/ matches ‘stow’ but not ‘stowaway’.
\y
Matches the empty string at either the beginning or the end of a word (i.e., the word boundary). For example, ‘\yballs?\y’ matches either ‘ball’ or ‘balls’, as a separate word.
\B
Matches the empty string that occurs between two word-constituent characters. For example, /\Brat\B/ matches ‘crate’, but it does not match ‘dirty rat’. ‘\B’ is essentially the opposite of ‘\y’.
There are two other operators that work on buffers. In Emacs, a buffer is, naturally, an Emacs buffer. Other GNU programs, including gawk, consider the entire string to match as the buffer. The operators are:
\`
Matches the empty string at the beginning of a buffer (string)
\'
Matches the empty string at the end of a buffer (string)
Escape Sequences
There are two kinds:
1) A backslash (‘\’) used to escape a character.
2) Other escape sequences that represent nonprintable characters, such as TAB.
Examples of ‘\’ as an escape:
$ awk 'BEGIN { print "a" }'
a
---- ordinary output
$ awk 'BEGIN { print "\"a\"" }'
"a"
---- the double quotes are escaped, so literal double quotes are printed.
$ awk 'BEGIN { print "\ta" }'
a
---- \t is used, so a horizontal TAB is printed before the a
The escape sequences are listed below; note the common ones such as \n, \t, and \nnn.
\\
A literal backslash, ‘\’.
\a
The “alert” character, Ctrl-g, ASCII code 7 (BEL). (This often makes some sort of audible noise.)
\b
Backspace, Ctrl-h, ASCII code 8 (BS).
\f
Formfeed, Ctrl-l, ASCII code 12 (FF).
\n
Newline, Ctrl-j, ASCII code 10 (LF).
\r
Carriage return, Ctrl-m, ASCII code 13 (CR).
\t
Horizontal TAB, Ctrl-i, ASCII code 9 (HT).
\v
Vertical TAB, Ctrl-k, ASCII code 11 (VT).
\nnn
The octal value nnn, where nnn stands for 1 to 3 digits between ‘0’ and ‘7’. For example, the code for the ASCII ESC (escape) character is ‘\033’.
Example:
fxb$awk 'BEGIN { print "Don\47t Panic!" }'
Don't Panic!
\xhh…
The hexadecimal value hh, where hh stands for a sequence of hexadecimal digits (‘0’–‘9’, and either ‘A’–‘F’ or ‘a’–‘f’). Like the same construct in ISO C, the escape sequence continues until the first nonhexadecimal digit is seen. (c.e.) However, using more than two hexadecimal digits produces undefined results. (The ‘\x’ escape sequence is not allowed in POSIX awk.)
CAUTION: The next major release of gawk will change, such that a maximum of two hexadecimal digits following the ‘\x’ will be used.
\/
A literal slash (necessary for regexp constants only). This sequence is used when you want to write a regexp constant that contains a slash (such as /.*:\/home\/[[:alnum:]]+:.*/; the ‘[[:alnum:]]’ notation is discussed in Bracket Expressions). Because the regexp is delimited by slashes, you need to escape any slash that is part of the pattern, in order to tell awk to keep processing the rest of the regexp.
\"
A literal double quote (necessary for string constants only). This sequence is used when you want to write a string constant that contains a double quote (such as "He said \"hi!\" to her."). Because the string is delimited by double quotes, you need to escape any quote that is part of the string, in order to tell awk to keep processing the rest of the string.
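Pattern matching with ~ (matches) and !~ (does not match) lets you require that a specific part of the text, such as one field, satisfy the regexp:
Example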
fxb$more 123.log
ssh scp find grep
scp ssh grep find
grep find scp ssh
With a bare regular expression, every line is printed, because every line contains ssh somewhere:
fxb$cat 123.log|awk '/ssh/{print $0}'
ssh scp find grep
scp ssh grep find
grep find scp ssh
Print only the lines whose second column matches ssh:
fxb$cat 123.log|awk '$2~/ssh/{print $0}'
scp ssh grep find
Print only the lines whose second column does not match ssh:
fxb$cat 123.log|awk '$2!~/ssh/{print $0}'
ssh scp find grep
grep find scp ssh
ok !!!
Relational expression examples
fxb$more guanxibiaoda.log
1 2 3 5
99 73 12 11
128 128 12 12
123 1122 222
212 93 09 38
223 239 23
Find the lines whose first column is greater than 150:
fxb$awk '$1 > 150' guanxibiaoda.log
212 93 09 38
223 239 23
Find the lines where the first column plus the second column is greater than 400:
fxb$awk '$1 + $2 > 400' guanxibiaoda.log
123 1122 222
223 239 23
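This is straightforward, but it helps to know awk's operators.
Remember to look these operators up in the awk reference manual.
Operator | Description
= += -= *= /= %= ^= **= | Assignment
?: | C-style conditional expression
|| | Logical OR
&& | Logical AND
~ !~ | Regexp match, no match
< <= > >= != == | Relational operators
(blank) | String concatenation
+ - | Addition, subtraction
* / % | Multiplication, division, remainder
+ - ! | Unary plus, unary minus, logical NOT
^ ** | Exponentiation
++ -- | Increment, decrement (prefix or postfix)
$ | Field reference
in | Array membership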
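A few of the operators in the table can be exercised in a BEGIN block, which needs no input. This is only a sketch; the variable names (x, arr, has_k, has_z, ops) are made up for the example.

```shell
ops=$(awk 'BEGIN {
    x = 7
    print x % 3                        # remainder
    print 2 ^ 10                       # exponentiation
    print "a" "b" x                    # concatenation by juxtaposition
    print (x > 5 ? "big" : "small")    # conditional expression
    arr["k"] = 1
    has_k = ("k" in arr)               # array membership, 1 if present
    has_z = ("z" in arr)               # 0 if absent
    print has_k, has_z
    x += 3; x++                        # assignment operator, then increment
    print x
}')
printf '%s\n' "$ops"
```

The six lines printed are 1, 1024, ab7, big, "1 0", and 11.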
A range pattern has the form ‘begpat, endpat’: output starts at a line that matches begpat and stops after a line that matches endpat.
A range pattern is made of two patterns separated by a comma, in the form ‘begpat, endpat’. It is used to match ranges of consecutive input records. The first pattern, begpat, controls where the range begins, while endpat controls where the pattern ends. For example, the following:
awk '$1 == "on", $1 == "off"' myfile
prints every record in myfile between ‘on’/‘off’ pairs, inclusive.
Example: print the lines from the one whose first column is "9" through the one whose first column is "iwer".
fxb$ cat range.log
5 skxoksdl
9 sdiowls
10 woksols
sdf woslsix
11 slsiex
iwer sdkxe
aaa sdkoexe
ccc sexocjoix5 skxoksdl
fxb$ cat range.log|awk '$1=="9",$1=="iwer" {print $0}'
9 sdiowls
10 woksols
sdf woslsix
11 slsiex
iwer sdkxe
Of course, you can also sort the original file first and then apply a range pattern.
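Sorting before applying a range pattern might look like the sketch below. The unsorted sample data and the variable names are invented for this example; sort -n orders the lines by their leading number.

```shell
nums='11 c
5 a
10 d
9 b
20 e'

# After the numeric sort the lines run 5, 9, 10, 11, 20,
# so the range from 9 through 11 is a block of consecutive lines.
ranged=$(printf '%s\n' "$nums" | sort -n | awk '$1 == 9, $1 == 11 { print $0 }')
printf '%s\n' "$ranged"
```

Without the sort, the range would start at "9 b" and run to the end of the input, because no later line has 11 in the first column.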
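A BEGIN pattern is followed by an action block that runs before awk processes any input. It can therefore be tested without any input at all. It is typically used to change built-in variables such as OFS, RS, and FS, and to print headers. For example: $ awk 'BEGIN{FS=":"; OFS="\t"; ORS="\n\n"}{print $1,$2,$3}' test. Before any input is processed, this sets the field separator (FS) to a colon, the output field separator (OFS) to a tab, and the output record separator (ORS) to two newlines.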
fxb$awk 'BEGIN {print "this begin test"}'
this begin test
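To see the effect of setting FS, OFS, and ORS in BEGIN, the sketch below feeds two colon-separated lines through awk. The input lines and variable names are made up; the command substitution strips the trailing newlines, so only the blank line between records remains visible in the captured text.

```shell
formatted=$(printf 'root:x:0:0\nbin:x:1:1\n' | awk '
BEGIN { FS = ":"; OFS = "\t"; ORS = "\n\n" }
{ print $1, $3 }')

# Expected text: fields joined by a TAB, records separated by an empty line.
expected=$(printf 'root\t0\n\nbin\t1')
printf '%s\n' "$formatted"
```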
END does not match any input record; its action block runs after all input has been processed. For example, $ awk 'END{print "The number of records is " NR}' test prints the number of records processed.
fxb$more range.log
5 skxoksdl
9 sdiowls
10 woksols
sdf woslsix
11 slsiex
iwer sdkxe
aaa sdkoexe
ccc sexocjoix5 skxoksdl
fxb$awk 'END{print "The number of records is " NR}' range.log
The number of records is 8
gawk-specific features:
Reference: http://compgroups.net/comp.lang.awk/xgawk-beginfile-endfile-extension-proposal/185394
Why do BEGINFILE and ENDFILE exist, and what are they for?
Related: the nextfile statement and the next statement.
(This part is skipped for now.)
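One motivation for per-file hooks is doing something at the start of each input file. The sketch below is not BEGINFILE itself; it is a portable approximation that uses FNR == 1, and it differs from BEGINFILE in at least two ways: it never fires for an empty file, and it cannot deal with a file that fails to open. The temporary files and names are made up for this example.

```shell
# Two small temporary input files (names are arbitrary).
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/one.txt"
printf 'c\n'    > "$dir/two.txt"

# FNR restarts at 1 for each input file, so FNR == 1 marks the first record of a file.
per_file=$(cd "$dir" && awk '
FNR == 1 { print "== " FILENAME " ==" }
{ print FNR ": " $0 }' one.txt two.txt)
printf '%s\n' "$per_file"
rm -rf "$dir"
```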
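Common predefined variables (focus on mastering these):
ARGC The number of command-line arguments
ARGV The array of command-line arguments
FILENAME The name of the current input file
FNR The record number within the current file
FS The input field separator; the default is a single space, which makes awk split on runs of whitespace
RS The input record separator; the default is a newline
NF The number of fields in the current record
NR The number of records read so far
OFS The output field separator; the default is a single space
ORS The output record separator; the default is a newline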
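A quick way to get a feel for NR and NF is to print them for each record, as in this sketch (the input lines and the variable name summary are invented). $NF is the last field, because NF holds the number of fields.

```shell
summary=$(printf 'a b c\nd e\n' | awk '{ print NR ": " NF " fields, last = " $NF }')
printf '%s\n' "$summary"
```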
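The GNU manual divides awk's predefined variables into three groups:
1. Variables that control awk's behavior, such as FS, which tells awk what to split fields on.
2. Variables through which awk conveys information, such as NR (the current record number) and FILENAME (the name of the current file).
3. ARGC and ARGV
In the lists below, # marks variables that are specific to gawk.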
BINMODE #
CONVFMT
FIELDWIDTHS #
FPAT #
FS
IGNORECASE #
LINT #
OFMT
OFS
ORS
PREC #
ROUNDMODE #
RS
SUBSEP
TEXTDOMAIN #
ARGC, ARGV
ARGIND #
ENVIRON
ERRNO #
FILENAME
FNR
NF
FUNCTAB #
NR
PROCINFO #
RLENGTH
RSTART
RT #
SYMTAB #
Experiment with these variables yourself.
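As one such experiment, the sketch below prints ARGC and the ARGV entries after the program name. The arguments first and second are arbitrary words; because the program has only a BEGIN rule, awk never tries to open them as files. ARGV[0] is left out because its value depends on the awk implementation.

```shell
args=$(awk 'BEGIN {
    print "ARGC =", ARGC
    for (i = 1; i < ARGC; i++)
        print i, ARGV[i]
}' first second)
printf '%s\n' "$args"
```

ARGC is 3 here: the program name in ARGV[0] plus the two arguments.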
An awk program can contain multiple rules and function definitions. Each rule consists of a pattern and an action, and either one may be omitted:
[pattern] { action }
pattern [{ action }]
…
function name(args) { … }
Note that the following two are different: the first does nothing, while the second prints every matching record.
/foo/ { } match foo, do nothing — empty action
/foo/ match foo, print the record — omitted action
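The difference between an empty action and an omitted action is easy to check. In this sketch the two input lines and the variable names are made up.

```shell
input='foo 1
bar 2'

# Empty action: the rule matches "foo 1" but does nothing, so nothing is printed.
empty_action=$(printf '%s\n' "$input" | awk '/foo/ { }')

# Omitted action: the default action prints the matching record.
omitted_action=$(printf '%s\n' "$input" | awk '/foo/')

printf 'empty: [%s]\nomitted: [%s]\n' "$empty_action" "$omitted_action"
```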
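An action can consist of the following:
• Expressions
• Control statements
• Compound statements
• Input statements
• Output statements
• Deletion statements
Expressions mostly assign values to variables. Note the less common assignment operators and forms; see "Assignment Expressions" in the manual.
Control statements include: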
• If Statement:
Conditionally execute some awk statements.
• While Statement:
Loop until some condition is satisfied.
• Do Statement:
Do specified action while looping until some condition is satisfied.
• For Statement:
Another looping statement, that provides initialization and increment clauses.
• Switch Statement:
Switch/case evaluation for conditional execution of statements based on a value.
• Break Statement:
Immediately exit the innermost enclosing loop.
• Continue Statement:
Skip to the end of the innermost enclosing loop.
• Next Statement:
Stop processing the current input record.
• Nextfile Statement:
Stop processing the current file.
• Exit Statement:
Stop execution of awk.
For details on these control statements, see the manual.
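Several of these statements can be combined in one small program. The sketch below, with invented input and variable names, uses exit, next, if/else, and a for loop; it is one possible arrangement, not a pattern taken from the manual.

```shell
report=$(printf '3\n-1\n4\nstop\n5\n' | awk '
$1 == "stop" { exit }            # exit: stop reading input (END still runs)
$1 < 0       { next }            # next: skip to the next record
{
    if ($1 % 2 == 0) kind = "even"; else kind = "odd"
    stars = ""
    for (i = 1; i <= $1; i++)    # for loop: build a bar of stars
        stars = stars "*"
    print $1, kind, stars
    total += $1
}
END { print "total", total }')
printf '%s\n' "$report"
```

The line -1 is skipped by next, and the 5 after stop is never read, so the total is 7.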
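Input statements: getline and the next statement
awk already processes input line by line, so why is getline needed? One way to see it: after handling the current line, you may want to read more lines yourself before awk moves on to the next one, for example to compare the next line with the current one and print something based on the result. That is where getline comes in. (Think of the use cases for Oracle's window functions.)
getline is also useful for reading lines from another file or from a pipe.
The full range of getline usage is fairly involved; the variants are roughly as follows (see the manual for details):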
• Plain Getline: Using getline with no arguments.
• Getline/Variable: Using getline into a variable.
• Getline/File: Using getline from a file.
• Getline/Variable/File: Using getline into a variable from a file.
• Getline/Pipe: Using getline from a pipe.
• Getline/Variable/Pipe: Using getline into a variable from a pipe.
• Getline/Coprocess: Using getline from a coprocess.
• Getline/Variable/Coprocess: Using getline into a variable from a coprocess.
• Getline Notes: Important things to know about getline.
• Getline Summary: Summary of getline Variants.
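Two of these variants are sketched below: getline into a variable, used to compare each line with the one after it, and getline from a pipe. The input numbers, the echo command, and the variable names (cur, nxt, diff, cmd, line) are made up for this example.

```shell
# getline var reads the next input record into var and leaves $0 unchanged.
pairs=$(printf '10\n12\n9\n9\n' | awk '{
    cur = $1
    while ((getline nxt) > 0) {      # returns 1 on success, 0 at end of input
        diff = nxt - cur
        print cur " -> " nxt " (" diff ")"
        cur = nxt
    }
}')
printf '%s\n' "$pairs"

# cmd | getline var reads one line of output from a command.
piped=$(awk 'BEGIN {
    cmd = "echo hello"
    if ((cmd | getline line) > 0)
        print "read from pipe:", line
    close(cmd)
}')
printf '%s\n' "$piped"
```

Testing the return value with > 0 matters, because getline returns -1 on error and a bare while (getline ...) loop would then never stop.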
Output statements, such as print and printf.
This part deserves detailed study; it covers the following (to be studied one by one later):
• Print: The print statement.
• Print Examples: Simple examples of print statements.
• Output Separators: The output separators and how to change them.
• OFMT: Controlling Numeric Output With print.
• Printf: The printf statement.
• Redirection: How to redirect output to multiple files and pipes.
• Special FD: Special files for I/O.
• Special Files: File name interpretation in gawk. gawk allows access to inherited file descriptors.
• Close Files And Pipes: Closing Input and Output Files and Pipes.
• Output Summary: Output summary.
• Output Exercises: Exercises.
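As a first taste of printf and redirection, the sketch below formats two invented records into aligned columns and then writes a line to a temporary file from inside awk. It assumes mktemp is available; the data and variable names are for illustration only.

```shell
# printf: left-justified name, right-justified count, total with two decimals.
table=$(printf 'apple 3 0.5\npear 12 1.25\n' | awk '{
    printf "%-6s %3d %6.2f\n", $1, $2, $2 * $3
}')
printf '%s\n' "$table"

# Redirection: print > file writes to the named file instead of standard output.
tmp=$(mktemp)
awk -v out="$tmp" 'BEGIN { print "saved line" > out; close(out) }'
saved=$(cat "$tmp")
rm -f "$tmp"
printf '%s\n' "$saved"
```

Unlike print, printf adds no newline of its own, so the format string ends with \n.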
Deletion statements are mainly used to delete array elements; see Delete.
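A minimal sketch of deleting an array element; the array name seen and its contents are made up.

```shell
remaining=$(awk 'BEGIN {
    seen["a"] = 1; seen["b"] = 2; seen["c"] = 3
    delete seen["b"]            # remove one element
    n = 0
    for (k in seen) n++         # count the elements that are left
    has_b = ("b" in seen)       # 0 once the element is gone
    print n, has_b
}')
printf '%s\n' "$remaining"
```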
Other notes:
awk accepts multiple rules in one command, for example:
$ awk '/12/ { print $0 }
> /21/ { print $0 }' mail-list inventory-shipped
A slightly more complex example: compute the total size of the files in the current directory that were last modified in Nov.
ls -l | awk '$6 == "Nov" { sum += $5 }
END { print sum }'
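A line of an awk program can hold a single rule or several rules. To split a single rule across lines at a point where a newline would otherwise end it, put a ‘\’ at the end of the preceding line.
Example: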
fxb$ cat 123.log
ssh scp find grep
scp ssh grep find
grep find scp ssh
On one line it works:
fxb$ cat 123.log|awk '$1~/ssh/{print $0}'
ssh scp find grep
Splitting the command across two lines produces an error:
fxb$ cat 123.log|awk '$1~/s
> sh/{print $0}'
awk: $1~/s
awk: ^ unterminated regexp
awk: cmd. line:1: sh/{print $0}
awk: cmd. line:1: ^ syntax error
With a ‘\’ at the end of the line, it works:
fxb$ cat 123.log|awk '$1~/s\
> sh/{print $0}'
ssh scp find grep
Now a test with multiple rules:
fxb$ cat 123.log|awk '$1~/ssh/{print $0} $3~/grep/{print "this second rule ---" $0}'
ssh scp find grep
this second rule ---scp ssh grep find
Reference
1.6 awk Statements Versus Lines
Most often, each line in an awk program is a separate statement or separate rule, like this:
awk '/12/ { print $0 }
/21/ { print $0 }' mail-list inventory-shipped
However, gawk ignores newlines after any of the following symbols and keywords:
, { ? : || && do else
A newline at any other point is considered the end of the statement.
If you would like to split a single statement into two lines at a point where a newline would terminate it, you can continue it by ending the first line with a backslash character (‘\’). The backslash must be the final character on the line in order to be recognized as a continuation character. A backslash is allowed anywhere in the statement, even in the middle of a string or regular expression. For example:
awk '/This regular expression is too long, so continue it\
on the next line/ { print $1 }'
We have generally not used backslash continuation in our sample programs. gawk places no limit on the length of a line, so backslash continuation is never strictly necessary; it just makes programs more readable. For this same reason, as well as for clarity, we have kept most statements short in the programs presented throughout the Web page. Backslash continuation is most useful when your awk program is in a separate source file instead of entered from the command line. You should also note that many awk implementations are more particular about where you may use backslash continuation. For example, they may not allow you to split a string constant using backslash continuation. Thus, for maximum portability of your awk programs, it is best not to split your lines in the middle of a regular expression or a string.
CAUTION: Backslash continuation does not work as described with the C shell. It works for awk programs in files and for one-shot programs, provided you are using a POSIX-compliant shell, such as the Unix Bourne shell or Bash. But the C shell behaves differently! There you must use two backslashes in a row, followed by a newline. Note also that when using the C shell, every newline in your awk program must be escaped with a backslash. To illustrate:
% awk 'BEGIN { \
? print \\
? "hello, world" \
? }'
-| hello, world
Here, the ‘%’ and ‘?’ are the C shell’s primary and secondary prompts, analogous to the standard shell’s ‘$’ and ‘>’.
Compare the previous example to how it is done with a POSIX-compliant shell:
$ awk 'BEGIN {
> print \
> "hello, world"
> }'
-| hello, world
awk is a line-oriented language. Each rule’s action has to begin on the same line as the pattern. To have the pattern and action on separate lines, you must use backslash continuation; there is no other option.
Another thing to keep in mind is that backslash continuation and comments do not mix. As soon as awk sees the ‘#’ that starts a comment, it ignores everything on the rest of the line. For example:
$ gawk 'BEGIN { print "dont panic" # a friendly \
> BEGIN rule
> }'
error→ gawk: cmd. line:2: BEGIN rule
error→ gawk: cmd. line:2: ^ syntax error
In this case, it looks like the backslash would continue the comment onto the next line. However, the backslash-newline combination is never even noticed because it is “hidden” inside the comment. Thus, the BEGIN is noted as a syntax error.
When awk statements within one rule are short, you might want to put more than one of them on a line. This is accomplished by separating the statements with a semicolon (‘;’). This also applies to the rules themselves. Thus, the program shown at the start of this section could also be written this way:
/12/ { print $0 } ; /21/ { print $0 }
NOTE: The requirement that states that rules on the same line must be separated with a semicolon was not in the original awk language; it was added for consistency with the treatment of statements within an action.
For further study:
In the example below, several patterns and actions appear to be executed. What is going on, and what is the ‘\’ at the end of the first line doing?
$cat /etc/passwd | awk -F: '\
NF != 7{\
printf("line %d,does not have 7 fields:%s\n",NR,$0)}\
$1 !~ /[A-Za-z0-9]/{printf("line %d,non alpha and numeric user id: %s\n",NR,$0)}\
$2 == "*" {printf("line %d, no password: %s\n",NR,$0)}'
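cat passes its output to awk, and awk uses a colon as the field separator.
If the number of fields (NF) is not 7, the following action runs.
printf prints the string "line ?? does not have 7 fields" and shows the record.
If the first field contains no letters or digits, printf prints "non alpha and numeric user id" and shows the record number and the record.
If the second field is an asterisk, it prints the string "no password", along with the record number and the record itself.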
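The same three checks can be tried on made-up data instead of the real /etc/passwd. In this sketch every record is tested against every rule in turn, which is why several patterns and actions appear to run; the rules are written one per line, so no backslash continuation is needed. The sample records and variable names are invented.

```shell
sample='root:x:0:0:root:/root:/bin/sh
broken:x:1
+++:x:5:5:odd:/tmp:/bin/sh
guest:*:9:9:guest:/tmp:/bin/sh'

checks=$(printf '%s\n' "$sample" | awk -F: '
NF != 7             { printf("line %d, does not have 7 fields: %s\n", NR, $0) }
$1 !~ /[A-Za-z0-9]/ { printf("line %d, non alpha and numeric user id: %s\n", NR, $0) }
$2 == "*"           { printf("line %d, no password: %s\n", NR, $0) }')
printf '%s\n' "$checks"
```

The first record passes all three checks and produces no output; each of the other three trips exactly one rule.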