awk performance differs hugely between versions, and 4.0 is impressive!

 

Test:
[root@apply1 67]# wc -l  duomi-delay-2012-11-01
37818115 duomi-delay-2012-11-01
[root@apply1 67]# /usr/local/bin/awk  --version
GNU Awk 4.0.1
Copyright (C) 1989, 1991-2012 Free Software Foundation.
[root@apply1 67]# time  /usr/local/bin/awk  '{if($0 ~/^\[ip_addr/){gsub(/[\[\]]/,"");a=$0}else if($0 ~/^<AD_DISPLAY/){print $0"|"a}}'   duomi-delay-2012-11-01 >1.txt

real 2m3.128s
user 1m10.753s
sys 0m4.142s
-----------------------------------------------------------------------------------------
[root@apply1 67]# /bin/gawk   --version
GNU Awk 3.1.5
Copyright (C) 1989, 1991-2005 Free Software Foundation.
[root@apply1 67]# time  /bin/gawk  '{if($0 ~/^\[ip_addr/){gsub(/[\[\]]/,"");a=$0}else if($0 ~/^<AD_DISPLAY/){print $0"|"a}}'   duomi-delay-2012-11-01 >1.txt
real 16m27.601s
user 12m36.505s
sys 0m6.733s
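The second run is clearly much slower: the same script finishes in about 2 minutes under 4.0.1 but takes more than 16 minutes under 3.1.5, roughly 8 times longer in wall-clock time (real 16m27.601s vs. 2m3.128s).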
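For readability, the one-liner used in both timings can be laid out as a multi-line program. This is only a sketch: the function name join_ip_to_ads and the sample lines are made up for illustration, and the assumed input layout (a [ip_addr...] line followed by the <AD_DISPLAY...> records it belongs to) is inferred from the regular expressions, not taken from the real log file.

```shell
#!/bin/sh
# Sketch: the benchmark one-liner, reformatted. The logic is meant to match
# the original command; the sample data below is hypothetical.
join_ip_to_ads() {
    awk '
    # Remember the most recent [ip_addr...] line, with the brackets stripped.
    /^\[ip_addr/ {
        gsub(/[\[\]]/, "")
        a = $0
        next
    }
    # Append the remembered value to every <AD_DISPLAY...> line.
    /^<AD_DISPLAY/ {
        print $0 "|" a
    }
    ' "$@"
}

# Hypothetical sample input (not from the real log file).
printf '%s\n' \
    '[ip_addr=10.0.0.1]' \
    '<AD_DISPLAY id=1>' \
    'unrelated line' \
    '<AD_DISPLAY id=2>' |
join_ip_to_ads
```

With this sample input the script prints two lines, `<AD_DISPLAY id=1>|ip_addr=10.0.0.1` and `<AD_DISPLAY id=2>|ip_addr=10.0.0.1`. Passing a file name to join_ip_to_ads instead of piping data in reads from that file, as in the timed commands.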
Below is the announcement describing what is new in 4.0:
This note announces the next major release of GNU Awk: version 4.0.0.

The following files may be retrieved from ftp://ftp.gnu.org/gnu/gawk,
or via HTTP from http://ftp.gnu.org/gnu/gawk:

-rw-r--r--    1 1003     65534     1589204 Jun 29 21:32 gawk-4.0.0.tar.xz
-rw-r--r--    1 1003     65534     2651812 Jun 29 21:31 gawk-4.0.0.tar.gz
-rw-r--r--    1 1003     65534     2063647 Jun 29 21:31 gawk-4.0.0.tar.bz2

This is a major new release, with a number of new features, including
revamped internals.  The relevant part of the NEWS file is appended below.

This release represents close to two years of very hard work by a number
of people.  I thank them all for their contributions, I could not have
done it by myself.

Differences from gawk 3.1.8 are not available; they would be too large.

The usual GNU build incantation should be used:

        tar -xpvzf gawk-4.0.0.tar.gz
        cd gawk-4.0.0
        ./configure && make && make check

Bug reports should be sent to address@hidden

Enjoy!

Arnold Robbins (on behalf of all the gawk developers)
arnold AT skeeve.com
------------------------------------------------------------
Changes from 3.1.8 to 4.0.0
---------------------------
1. The special files /dev/pid, /dev/ppid, /dev/pgrpid and /dev/user are
   now completely gone. Use PROCINFO instead.
2. The POSIX 2008 behavior for `sub' and `gsub' are now the default.
   THIS CHANGES BEHAVIOR!!!!
3. The \s and \S escape sequences are now recognized in regular expressions.
4. The split() function accepts an optional fourth argument which is an array
   to hold the values of the separators.
5. The new -b / --characters-as-bytes option means "hands off my data"; gawk
   won't try to treat input as a multibyte string.
6. There is a new --sandbox option; see the doc.
7. Indirect function calls are now available.
8. Interval expressions are now part of default regular expressions for
   GNU Awk syntax.
9. --gen-po is now correctly named --gen-pot.
10. switch / case is now enabled by default. There's no longer a need
    for a configure-time option.
11. Gawk now supports BEGINFILE and ENDFILE. See the doc for details.
12. Directories named on the command line now produce a warning, not
    a fatal error, unless --posix or --traditional.
13. The new FPAT variable allows you to specify a regexp that matches
    the fields, instead of matching the field separator. The new patsplit()
    function gives the same capability for splitting.
14. All long options now have short options, for use in `#!' scripts.
15. Support for IPv6 is added via the /inet6/... special file. /inet4/...
    forces IPv4 and /inet chooses the system default (probably IPv4).
16. Added a warning for /[:space:]/ that should be /[[:space:]]/.
17. Merged with John Haque's byte code internals. Adds dgawk debugger and
    possibly improved performance.
18. `break' and `continue' are no longer valid outside a loop, even with
    --traditional.
19. POSIX character classes work with --traditional (BWK awk supports them).
20. Nuked redundant --compat, --copyleft, and --usage long options.
21. Arrays of arrays added. See the doc.
22. Per the GNU Coding Standards, dynamic extensions must now define
    a global symbol indicating that they are GPL-compatible. See
    the documentation and example extensions.
    THIS CHANGES BEHAVIOR!!!!
23. In POSIX mode, string comparisons use strcoll/wcscoll.
    THIS CHANGES BEHAVIOR!!!!
24. The option for raw sockets was removed, since it was never implemented.
25. Gawk now treats ranges of the form [d-h] as if they were in the C
    locale, no matter what kind of regexp is being used, and even if
    --posix.  The latest POSIX standard allows this, and the documentation
    has been updated.  Maybe this will stop all the questions about
    [a-z] matching uppercase letters.
    THIS CHANGES BEHAVIOR!!!!
26. PROCINFO["strftime"] now holds the default format for strftime().
27. Updated to latest infrastructure: Autoconf 2.68, Automake 1.11.1,
    Gettext 0.18.1, Bison 2.5.
28. Many code cleanups. Removed code for many old, unsupported systems:
        - Atari
        - Amiga
        - BeOS
        - Cray
        - MIPS RiscOS
        - MS-DOS with Microsoft Compiler
        - MS-Windows with Microsoft Compiler
        - NeXT
        - SunOS 3.x, Sun 386 (Road Runner)
        - Tandem (non-POSIX)
        - Prestandard VAX C compiler for VAX/VMS
        - Probably others that I've forgotten
29. If PROCINFO["sorted_in"] exists, for(iggy in foo) loops sort the
    indices before looping over them.  The value of this element
    provides control over how the indices are sorted before the loop
    traversal starts. See the manual.
30. A new isarray() function exists to distinguish if an item is an array
    or not, to make it possible to traverse multidimensional arrays.
31. asort() and asorti() take a third argument specifying how to sort.
    See the doc.
