Notes: Regular Expressions


I recently revisited regular expressions; my brief notes follow.
Reference: Classic Shell Scripting, pp. 33-47


POSIX BRE and ERE metacharacters

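Character    BRE / ERE    Meaning in a pattern
\            Both         Turns off the special meaning of the following character; in BREs, also turns on the special meaning of \( \) and \{ \}
.            Both         Matches any single character (NUL need not be matchable)
*            Both         Matches zero or more of the preceding single character (in EREs, of the preceding regular expression)
^            Both         Anchors the match to the beginning of the line or string
$            Both         Anchors the match to the end of the line or string
[...]        Both         Bracket expression: matches any one of the enclosed characters
\{n,m\}      BRE          Interval expression: matches n to m occurrences of the preceding single character
\( \)        BRE          Saves the text matched by the enclosed pattern for reuse via \1 to \9
\n           BRE          Backreference: replays the text matched by the nth \( \) subexpression (n from 1 to 9)
+            ERE          Matches one or more of the preceding regular expression
?            ERE          Matches zero or one of the preceding regular expression
|            ERE          Alternation: matches the regular expression before or after it
( )          ERE          Grouping: applies a following operator to the enclosed regular expression
{n,m}        ERE          Same as the BRE \{n,m\}, written without the backslashes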
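
As a rough illustration of the BRE/ERE split in the table, the sketch below runs the same pattern text through grep (BRE) and grep -E (ERE). The sample input and variable names are made up for this note.

```shell
# Two candidate lines: one with repeated a's, one with a literal plus sign.
input=$(printf '%s\n' 'aab' 'a+b')

# BRE: "+" is an ordinary character, so only the literal text a+b matches.
bre_result=$(printf '%s\n' "$input" | grep 'a+b')

# ERE: "+" means "one or more of the preceding expression", so aab matches
# and the literal a+b does not (its a is followed by "+", not by b).
ere_result=$(printf '%s\n' "$input" | grep -E 'a+b')

printf 'BRE matched: %s\n' "$bre_result"
printf 'ERE matched: %s\n' "$ere_result"
```

The same pattern text therefore selects different lines depending on which flavor the tool uses.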


POSIX bracket expressions

Character classes

    Class            Matching characters
    [:alnum:]      Alphanumeric characters
    [:alpha:]       Alphabetic characters
    [:blank:]       Space and tab characters
    [:cntrl:]         Control characters
    [:digit:]         Numeric characters
    [:graph:]       Nonspace characters
    [:lower:]       Lowercase characters
    [:print:]        Printable characters
    [:punct:]       Punctuation characters
    [:space:]      Whitespace characters
    [:upper:]      Uppercase characters
    [:xdigit:]       Hexadecimal digits

Collating symbols

        A collating symbol is a multicharacter sequence that should be treated as a unit.
It consists of the characters bracketed by [. and .]. Collating symbols are specific to
the locale in which they are used.

Equivalence classes

        An equivalence class lists a set of characters that should be considered equivalent,
such as e and è. It consists of a named element from the locale, bracketed by [= and =].

        All three of these constructs must appear inside the square brackets of a bracket
expression. For example, [[:alpha:]!] matches any single alphabetic character or the
exclamation mark, and [[.ch.]] matches the collating element ch, but does not match just
the letter c or the letter h. In a French locale, [[=e=]] might match any of e, è, ë, ê, or é.
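
A small sketch of character classes inside a bracket expression, using grep. The input lines are invented, and LC_ALL=C is set only to keep the result independent of the local locale. Collating symbols and equivalence classes are left out because their behavior depends on the locale.

```shell
# One character per line: letters, a punctuation mark, a digit, a space.
input=$(printf '%s\n' 'a' '!' '7' ' ' 'Z')

# [[:alpha:]!] : the class [:alpha:] plus a literal "!", both inside [ ].
alpha_or_bang=$(printf '%s\n' "$input" | LC_ALL=C grep '^[[:alpha:]!]$')

# [[:digit:]] : a single decimal digit.
digits=$(printf '%s\n' "$input" | LC_ALL=C grep '^[[:digit:]]$')

printf 'alpha or !:\n%s\n' "$alpha_or_bang"
printf 'digits:\n%s\n' "$digits"
```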





Basic Regular Expressions

Matching single characters

    • Ordinary characters
    • Metacharacters: escaping it
    • The . (dot) character
    • Bracket expressions: a list such as [012345], a range such as [0-5], or a negated
set such as [^0-5]. Inside the brackets you can also use character classes (such as
[:digit:]), equivalence classes (such as [=e=]), and collating symbols (such as [.ch.]).

        Within bracket expressions, all other metacharacters lose their special meanings. Thus,
[*\.] matches a literal asterisk, a literal backslash, or a literal period. To get a ] into the set,
place it first in the list: []*\.] adds the ] to the list. To get a minus character into the set,
place it first in the list: [-*\.]. If you need both a right bracket and a minus, make the right
bracket the first character, and make the minus the last one in the list: []*\.-].
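
The placement rules for ] and - can be checked with a quick grep sketch. The input below is made up: one candidate character per line, with x as a control that should not match.

```shell
# Candidate characters, one per line; %s prints the backslash untouched.
input=$(printf '%s\n' ']' '*' '\' '.' '-' 'x')

# "]" comes first and "-" comes last, so both are taken literally;
# "*", "\" and "." have no special meaning inside the brackets.
matched=$(printf '%s\n' "$input" | LC_ALL=C grep '^[]*\.-]$')

printf '%s\n' "$matched"
```

Every line except x should be selected.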

Backreferences

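Pattern                                   Matches
\(ab\)\(cd\)[def]*\2\1                    abcdcdab, abcdeeecdab, abcdddeeffcdab, ...
\(why\).*\1                               A line with two occurrences of why
\([[:alpha:]_][[:alnum:]_]*\) = \1;       A simple C/C++ assignment of a variable to itself
\(["']\).*\1                              A quoted string whose opening and closing quotes are the same, such as "foo" or 'bar'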
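
To see two of these backreference patterns in action, here is a minimal grep sketch with invented sample lines.

```shell
input=$(printf '%s\n' 'abcdcdab' 'abcdeeecdab' 'abcdabcd' 'why oh why' 'why not')

# \2 replays "cd" and \1 replays "ab", in that order.
mirrored=$(printf '%s\n' "$input" | grep '\(ab\)\(cd\)[def]*\2\1')

# \1 replays "why", so the line must contain the word twice.
twice=$(printf '%s\n' "$input" | grep '\(why\).*\1')

printf 'mirrored:\n%s\n' "$mirrored"
printf 'twice:\n%s\n' "$twice"
```

Note that abcdabcd is rejected: after the first abcd the pattern requires cd then ab, not ab then cd.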

Matching multiple characters with one expression

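*            Zero or more of the preceding single character
\{N\}        Exactly N occurrences of the preceding single character
\{N,\}       At least N occurrences
\{N,M\}      Between N and M occurrences, inclusive
\{,M\}       At most M occurrences (a GNU extension; POSIX does not define this form)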
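
A short sketch of the POSIX interval forms with grep; the input is simply runs of the letter a, chosen for illustration.

```shell
input=$(printf '%s\n' 'a' 'aa' 'aaa' 'aaaa')

# \{2\}   : exactly two
exactly_two=$(printf '%s\n' "$input" | grep '^a\{2\}$')

# \{3,\}  : three or more
at_least_three=$(printf '%s\n' "$input" | grep '^a\{3,\}$')

# \{1,2\} : one or two
one_or_two=$(printf '%s\n' "$input" | grep '^a\{1,2\}$')

printf 'exactly two: %s\n' "$exactly_two"
printf 'at least three:\n%s\n' "$at_least_three"
printf 'one or two:\n%s\n' "$one_or_two"
```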

Anchoring text matches

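^            Matches at the beginning of the line or string; special only at the start of a BRE
$            Matches at the end of the line or string; special only at the end of a BRE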

BRE operator precedence

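Operator                Meaning (highest to lowest precedence)
[..] [==] [::]          Bracket symbols for character collation
\metacharacter          Escaped metacharacters
[]                      Bracket expressions
\(\) \digit             Subexpressions and backreferences
* \{\}                  Repetition of the preceding single-character regular expression
no symbol               Concatenation
^ $                     Anchors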
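
One practical consequence of this ordering is that repetition binds more tightly than concatenation, so * applies only to the single item before it unless a subexpression is used. A sketch with made-up input:

```shell
input=$(printf '%s\n' 'a' 'abbb' 'ababab')

# ab* : "a" followed by zero or more "b" (the star covers only the b).
star_on_char=$(printf '%s\n' "$input" | grep '^ab*$')

# \(ab\)* : zero or more repetitions of the whole subexpression "ab".
star_on_group=$(printf '%s\n' "$input" | grep '^\(ab\)*$')

printf 'ab*:\n%s\n' "$star_on_char"
printf '\\(ab\\)*:\n%s\n' "$star_on_group"
```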





Extended Regular Expressions

Matching single characters

        Same as in BREs, with one notable exception: in awk, \ is special inside bracket
expressions. Thus, to match a left bracket, dash, right bracket, or backslash, you could
use [\[\-\]\\].

Backreferences do not exist in EREs


Matching multiple regular expressions with one expression

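*            Zero or more of the preceding regular expression
+            One or more of the preceding regular expression
?            Zero or one of the preceding regular expression
{N}          Exactly N occurrences of the preceding regular expression
{N,}         At least N occurrences
{N,M}        Between N and M occurrences, inclusive
{,M}         At most M occurrences (a GNU extension; POSIX does not define this form)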

Alternation

|

Grouping

()
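
The ERE repetition, alternation, and grouping operators can be combined as in the sketch below, which uses grep -E on invented sample words.

```shell
input=$(printf '%s\n' 'color' 'colour' 'colouur' 'cat' 'dog' 'catdog' 'bird')

# ? makes the preceding "u" optional.
spellings=$(printf '%s\n' "$input" | grep -E '^colou?r$')

# ( ) groups the alternation so that + repeats "cat or dog" as a unit.
pets=$(printf '%s\n' "$input" | grep -E '^(cat|dog)+$')

printf 'colou?r:\n%s\n' "$spellings"
printf '(cat|dog)+:\n%s\n' "$pets"
```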

Anchoring text matches

        Same as in BREs, with one significant difference: in EREs, ^ and $ are always
metacharacters. Thus, regular expressions such as ab^cd and ef$gh are valid, but cannot
match anything, because a line can neither begin nor end in the middle of the matched text.
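
The difference is easy to observe with grep. This sketch assumes a grep that follows the POSIX rules described here (GNU grep does); the input lines are made up.

```shell
input=$(printf '%s\n' 'ab^cd' 'abcd')

# BRE: a ^ that is not at the start of the pattern is an ordinary character.
bre_match=$(printf '%s\n' "$input" | grep 'ab^cd')

# ERE: ^ is always an anchor, so this pattern can never match.
# "|| true" keeps the script going when grep exits non-zero on no match.
ere_match=$(printf '%s\n' "$input" | grep -E 'ab^cd' || true)

printf 'BRE matched: [%s]\n' "$bre_match"
printf 'ERE matched: [%s]\n' "$ere_match"
```

To match a literal caret in an ERE, escape it (ab\^cd) or put it in a bracket expression where it is not first (ab[\^]cd works in awk only; ab[x^]cd style sets are portable).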

ERE operator precedence

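Operator                Meaning (highest to lowest precedence)
[..] [==] [::]          Bracket symbols for character collation
\metacharacter          Escaped metacharacters
[]                      Bracket expressions
()                      Grouping
* + ? {}                Repetition of the preceding regular expression
no symbol               Concatenation
^ $                     Anchors
|                       Alternation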
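
Because alternation has the lowest precedence, anchors bind to each alternative separately unless parentheses are added. A sketch with invented input:

```shell
input=$(printf '%s\n' 'abcxyz' 'xyzdef' 'abc' 'def' 'xabcx')

# Parsed as (^abc)|(def$): starts with abc, OR ends with def.
loose=$(printf '%s\n' "$input" | grep -E '^abc|def$')

# Grouping first, then anchoring: the whole line is abc or def.
strict=$(printf '%s\n' "$input" | grep -E '^(abc|def)$')

printf 'loose:\n%s\n' "$loose"
printf 'strict:\n%s\n' "$strict"
```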





Additional GNU regular expression operators

Operator               Meaning
\w                        Matches any word-constituent character. Equivalent to [[:alnum:]_].
\W                        Matches any nonword-constituent character. Equivalent to [^[:alnum:]_].
\< \>                    Match the null string at the beginning and at the end of a word, respectively.
\b                         Matches the null string found at either the beginning or the end of a word.
                            This is a generalization of the \< and \> operators. Note: Because awk uses
                             \b to represent the backspace character, GNU awk (gawk) uses \y.
\B                         Matches the null string between two word-constituent characters.
\` \'                      Match the beginning and end of an emacs buffer, respectively. GNU
                            programs (besides emacs) generally treat these as being equivalent to ^ and $.

        Finally, although POSIX explicitly states that the NUL character need not be matchable, GNU
programs have no such restriction. If a NUL character occurs in input data, it can be matched by
the . metacharacter or a bracket expression.
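
A brief sketch of the word-related operators. It assumes GNU grep, since these operators are extensions and other grep implementations may not support them; the sample lines are invented.

```shell
input=$(printf '%s\n' 'cat' 'concatenate' 'a cat sat' 'cats')

# \< and \> : "cat" as a whole word.
whole_word=$(printf '%s\n' "$input" | grep '\<cat\>')

# \b : a word edge before "cat" (the word may continue after it).
word_start=$(printf '%s\n' "$input" | grep '\bcat')

# \B on both sides : "cat" strictly inside a longer word.
inside_word=$(printf '%s\n' "$input" | grep '\Bcat\B')

# \w : lines made up only of word-constituent characters.
only_word_chars=$(printf '%s\n' "$input" | grep -E '^\w+$')

printf 'whole word:\n%s\n' "$whole_word"
printf 'word start:\n%s\n' "$word_start"
printf 'inside word:\n%s\n' "$inside_word"
printf 'only word chars:\n%s\n' "$only_word_chars"
```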





Unix programs and their regular expression type

Type     grep   sed   ed   ex/vi   more   egrep   awk   lex
BRE       •      •    •      •      •
ERE                                          •      •     •
\< \>     •      •    •      •      •




