User manual for CRE2, the C interface to RE2

This is cre2.info, produced by makeinfo version 6.3 from cre2.texi.

This document describes version 0.3.4 of CRE2, a C language wrapper for
the C++ library RE2: a fast, safe, thread-friendly alternative to
backtracking regular expression engines like those used in PCRE, Perl,
and Python.

The latest release can be downloaded from:

       

development takes place at:

              

and as backup at:

            

Copyright (C) 2012, 2016 by Marco Maggi http://github.com/marcomaggi
Copyright (C) 2011 by Keegan McAllister http://github.com/kmcallister/

Portions of this document come from the source code of RE2 itself,
see the file ‘LICENSE.re2’ for the license notice.

 Permission is granted to copy, distribute and/or modify this
 document under the terms of the GNU Free Documentation License,
 Version 1.3 or any later version published by the Free Software
 Foundation; with Invariant Sections being "GNU Free Documentation
 License" and "GNU General Public License", no Front-Cover Texts,
 and no Back-Cover Texts.  A copy of the license is included in the
 section entitled "GNU Free Documentation License".

INFO-DIR-SECTION Development
START-INFO-DIR-ENTRY
* cre2: (cre2). C wrapper for RE2.
END-INFO-DIR-ENTRY


File: cre2.info, Node: Top, Next: overview, Up: (dir)

C wrapper for RE2


  • Menu:

  • overview:: Overview of the package.

  • version:: Version functions.
  • regexps:: Precompiled regular expressions
    construction.
  • options:: Matching configuration.
  • matching:: Matching regular expressions.
  • other:: Other matching functions.
  • tips:: Tips for using the regexp syntax.

Appendices

  • Package License:: Package license.
  • Documentation License:: GNU Free Documentation License.
  • references:: Bibliography and references.

Indexes

  • concept index:: An entry for each concept.
  • function index:: An entry for each function.
  • variable index:: An entry for each variable.
  • type index:: An entry for each type.


File: cre2.info, Node: overview, Next: version, Prev: Top, Up: Top

1 Overview of the package


CRE2 is a C language wrapper for the C++ library RE2: a fast, safe,
thread-friendly alternative to backtracking regular expression engines
like those used in PCRE, Perl, and Python. CRE2 is based on code by
Keegan McAllister for the ‘haskell-re2’ binding:

          

For the supported regular expression syntax, refer to the original RE2
documentation:

          

The C wrapper is meant to make it easier to interface RE2 with other
languages. The exposed API allows searching for substrings of text
matching regular expressions and reporting portions of text matching
parenthetical subexpressions.

CRE2 installs the single header file ‘cre2.h’. All the function
names in the API are prefixed with ‘cre2_’; all the constant names are
prefixed with ‘CRE2_’; all the type names are prefixed with ‘cre2_’ and
suffixed with ‘_t’.

When searching for the installed libraries with the GNU Autotools, we
can use the following macros in ‘configure.ac’:

 AC_CHECK_LIB([re2],[main],,
   [AC_MSG_FAILURE([test for RE2 library failed])])

 AC_CHECK_LIB([cre2],[cre2_version_string],,
   [AC_MSG_FAILURE([test for CRE2 library failed])])
 AC_CHECK_HEADERS([cre2.h],,
   [AC_MSG_ERROR([test for CRE2 header failed])])

notice that there is no need to check for the header file ‘re2/re2.h’.

It is customary for regular expression engines to provide methods to
replace backslash sequences like ‘\1’, ‘\2’, … in a given string with
portions of text that matched the first, second, … parenthetical
subexpression; CRE2 does not provide such methods in its public API,
because they require interacting with the storage mechanism in the
client code. However, it is not difficult to implement such
substitutions given the results of a regular expression matching
operation.
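
As a rough illustration of such a substitution, the following sketch
expands the sequences ‘\0’ to ‘\9’ in a template string, given an array
of matched substrings in which element 0 is the full match and element N
is the Nth parenthetical subexpression.  The names ‘span_t’ and
‘expand_template()’ are hypothetical and not part of CRE2; ‘span_t’ only
mirrors the two fields of ‘cre2_string_t’ so that the example compiles
without ‘cre2.h’.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in mirroring the fields of cre2_string_t. */
typedef struct {
  const char *data;
  int         length;
} span_t;

/* Hypothetical helper, not part of CRE2.  Expand "\0" ... "\9" in TMPL
   with the substrings in MATCH and store a zero-terminated result in
   OUT, which holds OUT_SIZE bytes.  Other characters, including other
   backslash sequences, are copied verbatim.  Return the number of bytes
   stored, terminator excluded, or -1 if OUT is too small or a group
   index is out of range. */
int
expand_template (const char *tmpl, const span_t *match, int nmatch,
                 char *out, size_t out_size)
{
  size_t used = 0;

  if (0 == out_size)
    return -1;
  for (const char *p = tmpl; *p; ++p) {
    const char *src = p;   /* by default copy the current character */
    size_t      len = 1;

    if ('\\' == p[0] && '0' <= p[1] && p[1] <= '9') {
      int idx = p[1] - '0';
      if (idx >= nmatch || match[idx].length < 0)
        return -1;
      src = match[idx].data;
      len = (size_t) match[idx].length;
      ++p;                 /* skip the digit */
    }
    if (used + len >= out_size)   /* keep room for the terminator */
      return -1;
    if (len > 0)
      memcpy(out + used, src, len);
    used += len;
  }
  out[used] = '\0';
  return (int) used;
}
```

With the groups ‘ciao’ and ‘hello’ at indices 1 and 2, the template
‘\2, \1!’ expands to ‘hello, ciao!’.  The caller owns the output buffer,
which is the storage interaction mentioned above.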

Some functions and methods from RE2 requiring memory allocation
handling are unofficially wrapped by CRE2 with unsafe code (execution
will succeed when no memory allocation errors happen). These
“problematic” functions are documented in the header file ‘cre2.h’ and,
at present, are not considered part of the public API of CRE2.

It is sometimes useful to try a program in the original C++ to verify
if a problem is caused by CRE2 or is in the original RE2 code; we may
want to start by customising this program:

 /* compile and run with:

    $ g++ -Wall -o proof proof.cpp -lre2 && ./proof
 */

 #include <re2/re2.h>
 #include <assert.h>

 static void try_match (RE2::Options& opt, const char * text);

 int
 main (int argc, const char *const argv[])
 {
   RE2::Options opt;
   opt.set_never_nl(true);
   try_match(opt, "abcdef");
   return 0;
 }
 void
 try_match (RE2::Options& opt, const char * text)
 {
   RE2 re("abcdef", opt);
   assert(re.ok());
   assert(RE2::FullMatch(text, re));
   //assert(RE2::PartialMatch(text, re));
 }


File: cre2.info, Node: version, Next: regexps, Prev: overview, Up: Top

2 Version functions


The installed libraries follow version numbering as established by the
GNU Autotools. For an explanation of interface numbers as managed by
GNU Libtool *Note interface: (libtool)Libtool versioning.

– Function: const char * cre2_version_string (void)
Return a pointer to a statically allocated ASCIIZ string
representing the interface version number.

– Function: int cre2_version_interface_current (void)
Return an integer representing the library interface current
number.

– Function: int cre2_version_interface_revision (void)
Return an integer representing the library interface current
revision number.

– Function: int cre2_version_interface_age (void)
Return an integer representing the library interface current age.
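
Under the GNU Libtool versioning rules, a library whose interface
numbers are CURRENT and AGE implements every interface number from
CURRENT - AGE to CURRENT, both included.  The sketch below applies that
rule; ‘interface_is_supported()’ is a hypothetical helper, not part of
CRE2, and in a real program its first two arguments would presumably
come from ‘cre2_version_interface_current()’ and
‘cre2_version_interface_age()’.

```c
#include <stdbool.h>

/* Hypothetical helper, not part of CRE2.  Return true if a library
   with interface numbers CURRENT and AGE implements the interface
   number WANTED, following the GNU Libtool versioning rules. */
bool
interface_is_supported (int current, int age, int wanted)
{
  return (current - age <= wanted) && (wanted <= current);
}
```

For example, with CURRENT equal to 3 and AGE equal to 1 the interfaces 2
and 3 are supported, while 1 and 4 are not.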


File: cre2.info, Node: regexps, Next: options, Prev: version, Up: Top

3 Precompiled regular expressions construction


Regular expression objects are built and finalised as follows:

 cre2_regexp_t *   rex;
 cre2_options_t *  opt;

 opt = cre2_opt_new();
 if (opt) {
   cre2_opt_set_log_errors(opt, 0);
   rex = cre2_new("ciao", 4, opt);
   if (rex) {
     if (!cre2_error_code(rex)) {
       /* successfully built */
     } else {
       /* an error occurred while compiling rex */
     }
     cre2_delete(rex);
   } else {
     /* rex memory allocation error */
   }
   cre2_opt_delete(opt);
 } else {
   /* opt memory allocation error */
 }

– Opaque Type: cre2_regexp_t
Opaque type for regular expression objects; it is meant to be used
to declare pointers to objects. Instances of this type can be used
for any number of matching operations and are safe for concurrent
use by multiple threads.

– Struct Typedef: cre2_string_t
Simple data structure used to reference a portion of another
string. It has the following fields:

 'const char * data'
      Pointer to the first byte in the referenced substring.

 'int length'
      The number of bytes in the referenced substring.
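
 Because the structure references bytes inside another buffer, the
 substring is in general not terminated by a zero byte; it has to be
 printed or copied with an explicit length.  A minimal sketch follows,
 using the ‘%.*s’ conversion of ‘snprintf()’; the names ‘span_t’ and
 ‘span_to_cstring()’ are hypothetical (‘span_t’ only mirrors the two
 fields above so that the example compiles without ‘cre2.h’).

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in mirroring the fields of cre2_string_t. */
typedef struct {
  const char *data;
  int         length;
} span_t;

/* Hypothetical helper, not part of CRE2.  Copy the bytes referenced by
   SPAN into OUT as a zero-terminated string, truncating if OUT_SIZE is
   too small.  Assumes SPAN->data is not NULL and SPAN->length is not
   negative.  Return the value returned by snprintf(). */
int
span_to_cstring (char *out, size_t out_size, const span_t *span)
{
  /* The precision ".*" limits the output to SPAN->length bytes. */
  return snprintf(out, out_size, "%.*s", span->length, span->data);
}
```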

– Enumeration Typedef: cre2_error_code_t
Enumeration type for error codes returned by ‘cre2_error_code()’.
It contains the following symbols:

 'CRE2_NO_ERROR'
      Defined as '0', represents a successful operation.

 'CRE2_ERROR_INTERNAL'
      Unexpected error.

 'CRE2_ERROR_BAD_ESCAPE'
      Bad escape sequence.

 'CRE2_ERROR_BAD_CHAR_CLASS'
      Bad character class.

 'CRE2_ERROR_BAD_CHAR_RANGE'
      Bad character class range.

 'CRE2_ERROR_MISSING_BRACKET'
      Missing closing ']'.

 'CRE2_ERROR_MISSING_PAREN'
      Missing closing ')'.

 'CRE2_ERROR_TRAILING_BACKSLASH'
      Trailing '\' at end of regexp.

 'CRE2_ERROR_REPEAT_ARGUMENT'
      Repeat argument missing, e.g.  '*'.

 'CRE2_ERROR_REPEAT_SIZE'
      Bad repetition argument.

 'CRE2_ERROR_REPEAT_OP'
      Bad repetition operator.

 'CRE2_ERROR_BAD_PERL_OP'
      Bad Perl operator.

 'CRE2_ERROR_BAD_UTF8'
      Invalid UTF-8 in regexp.

 'CRE2_ERROR_BAD_NAMED_CAPTURE'
      Bad named capture group.

 'CRE2_ERROR_PATTERN_TOO_LARGE'
      Pattern too large (compile failed).

– Function: cre2_regexp_t * cre2_new (const char * PATTERN, int
PATTERN_LEN, const cre2_options_t * OPT)
Build and return a new regular expression object representing the
PATTERN of length PATTERN_LEN bytes; the object is configured with
the options in OPT. If memory allocation fails: the return value
is a ‘NULL’ pointer.

 The options object OPT is duplicated in the internal state of the
 regular expression instance, so OPT can be safely mutated or
 finalised after this call.  If OPT is 'NULL': the regular
 expression object is built with the default set of options.

– Function: void cre2_delete (cre2_regexp_t * REX)
Finalise a regular expression object releasing all the associated
resources.

– Function: const char * cre2_pattern (const cre2_regexp_t * REX)
Whether REX is a successfully built regular expression object or
not: return a pointer to the pattern string. The returned pointer
is valid only while REX is alive: if ‘cre2_delete()’ is applied to
REX the pointer becomes invalid.

– Function: int cre2_num_capturing_groups (const cre2_regexp_t * REX)
If REX is a successfully built regular expression object: return a
non-negative integer representing the number of capturing groups
(parenthetical subexpressions) in the pattern. If an error
occurred while building REX: return ‘-1’.

– Function: int cre2_find_named_capturing_groups (const cre2_regexp_t
* REX, const char * NAME)
If REX is a successfully built regular expression object: return a
non-negative integer representing the index of the named capturing
group whose name is NAME. If an error occurred while building REX
or the name is invalid: return ‘-1’.

      const char *      pattern = "from (?P<S>.*) to (?P<D>.*)";
      cre2_options_t *  opt     = cre2_opt_new();
      cre2_regexp_t *   rex     = cre2_new(pattern, strlen(pattern),
                                           opt);
      {
        if (cre2_error_code(rex))
          { /* handle the error */ }
        int nmatch = cre2_num_capturing_groups(rex) + 1;
        cre2_string_t strings[nmatch];
        int e, SIndex, DIndex;

        const char * text = \
           "from Montreal, Canada to Lausanne, Switzerland";
        int text_len = strlen(text);

        e = cre2_match(rex, text, text_len, 0, text_len,
                       CRE2_UNANCHORED, strings, nmatch);
        if (0 == e)
          { /* handle the error */ }

        SIndex = cre2_find_named_capturing_groups(rex, "S");
        if (0 != strncmp("Montreal, Canada",
                         strings[SIndex].data, strings[SIndex].length))
          { /* handle the error */ }

        DIndex = cre2_find_named_capturing_groups(rex, "D");
        if (0 != strncmp("Lausanne, Switzerland",
                         strings[DIndex].data, strings[DIndex].length))
          { /* handle the error */ }
      }
      cre2_delete(rex);
      cre2_opt_delete(opt);
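
 When comparing a matched substring with an expected string, notice
 that ‘strncmp()’ limited to the length of the substring also reports
 equality when the substring is only a prefix of the expected string.
 A stricter comparison checks the lengths first; the sketch below uses
 the hypothetical names ‘span_t’ (a stand-in with the same fields as
 ‘cre2_string_t’) and ‘span_equals()’, neither of which is part of
 CRE2.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in mirroring the fields of cre2_string_t. */
typedef struct {
  const char *data;
  int         length;
} span_t;

/* Hypothetical helper, not part of CRE2.  Return true if the bytes
   referenced by SPAN are exactly the zero-terminated string EXPECTED. */
bool
span_equals (const span_t *span, const char *expected)
{
  size_t expected_len = strlen(expected);

  /* A negative length converts to a huge value and never compares
     equal; the lengths must agree before the bytes are examined. */
  if ((size_t) span->length != expected_len)
    return false;
  return (0 == expected_len)
    || (0 == memcmp(span->data, expected, expected_len));
}
```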

– Function: int cre2_program_size (const cre2_regexp_t * REX)
If REX is a successfully built regular expression object: return a
non-negative integer representing the program size, a very
approximate measure of a regexp’s “cost”; larger numbers are more
expensive than smaller numbers. If an error occurred while
building REX: return ‘-1’.

– Function: int cre2_error_code (const cre2_regexp_t * REX)
In case an error occurred while building REX: return an integer
representing the associated error code. Return zero if no error
occurred.

– Function: const char * cre2_error_string (const cre2_regexp_t * REX)
If an error occurred while building REX: return a pointer to an
ASCIIZ string representing the associated error message. The
returned pointer is valid only while REX is alive: if
‘cre2_delete()’ is applied to REX the pointer becomes invalid.

 If REX is a successfully built regular expression object: return a
 pointer to an empty string.

 The following code:

      cre2_regexp_t *   rex;

      rex = cre2_new("ci(ao", 5, NULL);
      {
        printf("error: code=%d, msg=\"%s\"\n",
               cre2_error_code(rex),
               cre2_error_string(rex));
      }
      cre2_delete(rex);

 prints:

      error: code=6, msg="missing ): ci(ao"

– Function: void cre2_error_arg (const cre2_regexp_t * REX,
cre2_string_t * ARG)
If an error occurred while building REX: fill the structure
referenced by ARG with the interval of bytes representing the
offending portion of the pattern.

 If REX is a successfully built regular expression object: ARG
 references an empty string.

 The following code:

      cre2_regexp_t *   rex;
      cre2_string_t     S;

      rex = cre2_new("ci(ao", 5, NULL);
      {
        cre2_error_arg(rex, &S);
        printf("arg: len=%d, data=\"%.*s\"\n",
               S.length, S.length, S.data);
      }
      cre2_delete(rex);

 prints:

      arg: len=5, data="ci(ao"


File: cre2.info, Node: options, Next: matching, Prev: regexps, Up: Top

4 Matching configuration


Compiled regular expressions can be configured, at construction-time,
with a number of options collected in a ‘cre2_options_t’ object. Notice
that, by default, when attempting to compile an invalid regular
expression pattern, RE2 will print to ‘stderr’ an error message; usually
we want to avoid this logging by disabling the associated option:

 cre2_options_t *  opt;

 opt = cre2_opt_new();
 cre2_opt_set_log_errors(opt, 0);

– Opaque Typedef: cre2_options_t
Type of opaque pointers to options objects. Any instance of this
type can be used to configure any number of regular expression
objects.

– Enumeration Typedef: cre2_encoding_t
Enumeration type for constants selecting encoding. It contains the
following values:

      CRE2_UNKNOWN
      CRE2_UTF8
      CRE2_Latin1

 The value 'CRE2_UNKNOWN' should never be used: it exists only in
 case there is a mismatch between the definitions of RE2 and CRE2.

– Function: cre2_options_t * cre2_opt_new (void)
Allocate and return a new options object. If memory allocation
fails: the return value is a ‘NULL’ pointer.

– Function: void cre2_opt_delete (cre2_options_t * OPT)
Finalise an options object releasing all the associated resources.
Compiled regular expressions configured with this object are not
affected by its destruction.

All the following functions are getters and setters for regular
expression options; the FLAG argument to the setter must be false to
disable the option and true to enable it; unless otherwise specified the
‘int’ return value is true if the option is enabled and false if it is
disabled.

– Function: cre2_encoding_t cre2_opt_encoding (cre2_options_t * OPT)
– Function: void cre2_opt_set_encoding (cre2_options_t * OPT,
cre2_encoding_t ENC)
By default, the regular expression pattern and input text are
interpreted as UTF-8. CRE2_Latin1 encoding causes them to be
interpreted as Latin-1.

 The getter returns 'CRE2_UNKNOWN' if the encoding value returned by
 RE2 is unknown.

– Function: int cre2_opt_posix_syntax (cre2_options_t * OPT)
– Function: void cre2_opt_set_posix_syntax (cre2_options_t * OPT, int
FLAG)
Restrict regexps to POSIX egrep syntax. Default is disabled.

– Function: int cre2_opt_longest_match (cre2_options_t * OPT)
– Function: void cre2_opt_set_longest_match (cre2_options_t * OPT, int
FLAG)
Search for longest match, not first match. Default is disabled.

– Function: int cre2_opt_log_errors (cre2_options_t * OPT)
– Function: void cre2_opt_set_log_errors (cre2_options_t * OPT, int
FLAG)
Log syntax and execution errors to ‘stderr’. Default is enabled.

– Function: int cre2_opt_literal (cre2_options_t * OPT)
– Function: void cre2_opt_set_literal (cre2_options_t * OPT, int FLAG)
Interpret the pattern string as literal, not as regular expression.
Default is disabled.

 Setting this option is equivalent to quoting all the special
 characters defining a regular expression pattern:

      cre2_regexp_t *   rex;
      cre2_options_t *  opt;
      const char *      pattern = "(ciao) (hello)";
      const char *      text    = pattern;
      int               len     = strlen(pattern);

      opt = cre2_opt_new();
      cre2_opt_set_literal(opt, 1);
      rex = cre2_new(pattern, len, opt);
      {
        /* successful match */
        cre2_match(rex, text, len, 0, len,
                   CRE2_UNANCHORED, NULL, 0);
      }
      cre2_delete(rex);
      cre2_opt_delete(opt);

– Function: int cre2_opt_never_nl (cre2_options_t * OPT)
– Function: void cre2_opt_set_never_nl (cre2_options_t * OPT, int
FLAG)
Never match a newline character, even if it is in the regular
expression pattern; default is disabled. Turning on this option
allows us to attempt a partial match, against the beginning of a
multiline text, without using subpatterns to exclude the newline in
the regexp pattern.

    * When set to true: matching always fails if the text or the
      regexp contains a newline.

    * When set to false: matching succeeds or fails taking normal
      account of newlines.

    * The option does *not* cause newlines to be skipped.

– Function: int cre2_opt_dot_nl (cre2_options_t * OPT)
– Function: void cre2_opt_set_dot_nl (cre2_options_t * OPT, int FLAG)
The dot matches everything, including the new line; default is
disabled.

– Function: int cre2_opt_never_capture (cre2_options_t * OPT)
– Function: void cre2_opt_set_never_capture (cre2_options_t * OPT, int
FLAG)
Parse all the parentheses as non-capturing; default is disabled.

– Function: int cre2_opt_case_sensitive (cre2_options_t * OPT)
– Function: void cre2_opt_set_case_sensitive (cre2_options_t * OPT,
int FLAG)
Match is case-sensitive; the regular expression pattern can
override this setting with ‘(?i)’ unless configured in POSIX syntax
mode. Default is enabled.

– Function: int cre2_opt_max_mem (cre2_options_t * OPT)
– Function: void cre2_opt_set_max_mem (cre2_options_t * OPT, int M)
The max memory option controls how much memory can be used to hold
the compiled form of the regular expression and its cached DFA
graphs. These functions set and get such amount of memory. See
the documentation of RE2 for details.

The following options are only consulted when POSIX syntax is
enabled; when POSIX syntax is disabled: these features are always
enabled and cannot be turned off.

– Function: int cre2_opt_perl_classes (cre2_options_t * OPT)
– Function: void cre2_opt_set_perl_classes (cre2_options_t * OPT, int
FLAG)
Allow Perl’s ‘\d’, ‘\s’, ‘\w’, ‘\D’, ‘\S’, ‘\W’. Default is
disabled.

– Function: int cre2_opt_word_boundary (cre2_options_t * OPT)
– Function: void cre2_opt_set_word_boundary (cre2_options_t * OPT, int
FLAG)
Allow Perl’s ‘\b’, ‘\B’ (word boundary and not). Default is
disabled.

– Function: int cre2_opt_one_line (cre2_options_t * OPT)
– Function: void cre2_opt_set_one_line (cre2_options_t * OPT, int
FLAG)
The patterns ‘^’ and ‘$’ only match at the beginning and end of the
text. Default is disabled.


File: cre2.info, Node: matching, Next: other, Prev: options, Up: Top

5 Matching regular expressions


Basic pattern matching goes as follows (with error checking omitted):

 cre2_regexp_t *   rex;
 cre2_options_t *  opt;
 const char *      pattern = "(ciao) (hello)";

 opt = cre2_opt_new();
 cre2_opt_set_posix_syntax(opt, 1);

 rex = cre2_new(pattern, strlen(pattern), opt);
 {
   const char *   text     = "ciao hello";
   int            text_len = strlen(text);
   int            nmatch   = 3;
   cre2_string_t  match[nmatch];

   cre2_match(rex, text, text_len, 0, text_len, CRE2_UNANCHORED,
              match, nmatch);

   /* prints: full match: ciao hello */
   printf("full match: ");
   fwrite(match[0].data, match[0].length, 1, stdout);
   printf("\n");

   /* prints: first group: ciao */
   printf("first group: ");
   fwrite(match[1].data, match[1].length, 1, stdout);
   printf("\n");

   /* prints: second group: hello */
   printf("second group: ");
   fwrite(match[2].data, match[2].length, 1, stdout);
   printf("\n");
 }
 cre2_delete(rex);
 cre2_opt_delete(opt);

– Enumeration Typedef: cre2_anchor_t
Enumeration type for the anchor point of matching operations. It
contains the following constants:

      CRE2_UNANCHORED
      CRE2_ANCHOR_START
      CRE2_ANCHOR_BOTH

– Function: int cre2_match (const cre2_regexp_t * REX, const char *
TEXT, int TEXT_LEN, int START_POS, int END_POS, cre2_anchor_t
ANCHOR, cre2_string_t * MATCH, int NMATCH)
Match a substring of the text referenced by TEXT and holding
TEXT_LEN bytes against the regular expression object REX. Return
true if the text matched, false otherwise.

 The zero-based indices START_POS (inclusive) and END_POS
 (exclusive) select the substring of TEXT to be examined.  ANCHOR
 selects the anchor point for the matching operation.

 Data about the matching groups is stored in the array MATCH, which
 must have at least NMATCH entries; the referenced substrings are
 portions of the TEXT buffer.  If we are only interested in
 verifying if the text matches or not (ignoring the matching
 portions of text): we can use 'NULL' as MATCH argument and 0 as
 NMATCH argument.

 The first element of MATCH (index 0) references the full portion of
 the substring of TEXT matching the pattern; the second element of
 MATCH (index 1) references the portion of text matching the first
 parenthetical subexpression, the third element of MATCH (index 2)
 references the portion of text matching the second parenthetical
 subexpression; and so on.
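
 Since START_POS is inclusive and END_POS is exclusive, the examined
 substring holds END_POS - START_POS bytes, and passing 0 and TEXT_LEN
 examines the whole buffer.  A caller that computes these indices may
 want to validate them before matching.  The helper below is a
 hypothetical sketch of such a check, not part of CRE2, written under
 the assumption that an interval reaching outside the buffer is never
 meaningful.

```c
#include <stdbool.h>

/* Hypothetical helper, not part of CRE2.  Return true if the half-open
   interval [START_POS, END_POS) lies inside a buffer of TEXT_LEN
   bytes.  An empty interval (START_POS == END_POS) is accepted. */
bool
match_interval_is_valid (int text_len, int start_pos, int end_pos)
{
  return (0 <= start_pos)
    && (start_pos <= end_pos)
    && (end_pos <= text_len);
}
```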

– Function: int cre2_easy_match (const char * PATTERN, int
PATTERN_LEN, const char * TEXT, int TEXT_LEN, cre2_string_t *
MATCH, int NMATCH)
Like ‘cre2_match()’ but the pattern is specified as string PATTERN
holding PATTERN_LEN bytes. Also the text is fully matched without
anchoring.

 If the text matches the pattern: the return value is 1.  If the
 text does not match the pattern: the return value is 0.  If the
 pattern is invalid: the return value is 2.
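
 Because an invalid pattern yields the non-zero value 2, the return
 value should not be tested as a plain boolean.  The following sketch
 makes the three cases explicit; ‘easy_match_outcome()’ is a
 hypothetical helper introduced here, not a CRE2 function.

```c
/* Hypothetical helper, not part of CRE2.  Translate the return value
   documented for cre2_easy_match() into a short description. */
const char *
easy_match_outcome (int rv)
{
  switch (rv) {
  case 0:  return "no match";
  case 1:  return "match";
  case 2:  return "invalid pattern";
  default: return "unexpected return value";
  }
}
```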

– Struct Typedef: cre2_range_t
Structure type used to represent a substring of the text to be
matched as starting and ending indices. It has the following
fields:

 'long start'
      Inclusive start byte index.

 'long past'
      Exclusive end byte index.

– Function: void cre2_strings_to_ranges (const char * TEXT,
cre2_range_t * RANGES, cre2_string_t * STRINGS, int NMATCH)
Given an array of STRINGS with NMATCH elements being the result of
matching TEXT against a regular expression: fill the array of
RANGES with the index intervals in the TEXT buffer representing the
same results.
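
 The conversion amounts to pointer arithmetic relative to the start of
 TEXT.  The sketch below shows the idea with the hypothetical stand-in
 types ‘span_t’ and ‘range_t’, which mirror the fields of
 ‘cre2_string_t’ and ‘cre2_range_t’; it illustrates the documented
 behaviour and is not the code used by CRE2.

```c
/* Hypothetical stand-ins mirroring cre2_string_t and cre2_range_t. */
typedef struct {
  const char *data;
  int         length;
} span_t;

typedef struct {
  long start;   /* inclusive start byte index */
  long past;    /* exclusive end byte index */
} range_t;

/* Hypothetical helper, not part of CRE2.  Fill RANGES with the index
   intervals of the NMATCH substrings in SPANS.  Assumes every span
   references bytes inside the buffer starting at TEXT. */
void
spans_to_ranges (const char *text, range_t *ranges,
                 const span_t *spans, int nmatch)
{
  for (int i = 0; i < nmatch; ++i) {
    ranges[i].start = (long) (spans[i].data - text);
    ranges[i].past  = ranges[i].start + spans[i].length;
  }
}
```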


File: cre2.info, Node: other, Next: tips, Prev: matching, Up: Top

6 Other matching functions


The following functions match a buffer of text against a regular
expression, allowing the extraction of portions of text matching
parenthetical subexpressions. All of them show the following behaviour:

  • If the text matches the pattern: the return value is 1; if the text
    does not match the pattern: the return value is 0.

  • If the pattern is invalid: the return value is 0; there is no way
    to distinguish this case from the case of text not matching other
    than looking at what RE2 prints to ‘stderr’.

  • It is impossible to turn off logging of error messages to ‘stderr’
    when the specification of the regular expression is invalid.

  • Data about the matching groups is stored in the array MATCH, which
    must have at least NMATCH slots; the referenced substrings are
    portions of the TEXT buffer.

  • The array MATCH can have a number of slots between zero and the
    number of parenthetical subexpressions in PATTERN (both included);
    if NMATCH is greater than the number of parenthetical
    subexpressions: the return value is 0.

  • If we are only interested in verifying if the text matches the
    pattern or not: we can use ‘NULL’ as MATCH argument and 0 as NMATCH
    argument.

  • The first slot of MATCH (index 0) references the portion of text
    matching the first parenthetical subexpression; the second slot of
    MATCH (index 1) references the portion of text matching the second
    parenthetical subexpression; and so on.

see the documentation of each function for the differences.

The following example is a successful match:

 const char *   pattern = "ci.*ut";
 const char *   text    = "ciao salut";
 cre2_string_t  input   = {
   .data   = text,
   .length = strlen(text)
 };
 int            result;
 result = cre2_full_match(pattern, &input, NULL, 0);

 result => 1

the following example is a successful match in which the parenthetical
subexpression is ignored:

 const char *   pattern = "(ciao) salut";
 const char *   text    = "ciao salut";
 cre2_string_t  input   = {
   .data   = text,
   .length = strlen(text)
 };
 int            result;
 result = cre2_full_match(pattern, &input, NULL, 0);

 result => 1

the following example is a successful match in which the portion of text
matching the parenthetical subexpression is reported:

 const char *   pattern = "(ciao) salut";
 const char *   text    = "ciao salut";
 cre2_string_t  input   = {
   .data   = text,
   .length = strlen(text)
 };
 int            nmatch  = 1;
 cre2_string_t  match[nmatch];
 int            result;
 result = cre2_full_match(pattern, &input, match, nmatch);

 result => 1
 strncmp(text, input.data, input.length)         => 0
 strncmp("ciao", match[0].data, match[0].length) => 0

– Function: int cre2_full_match (const char * PATTERN, const
cre2_string_t * TEXT, cre2_string_t * MATCH, int NMATCH)
– Function: int cre2_full_match_re (cre2_regexp_t * REX, const
cre2_string_t * TEXT, cre2_string_t * MATCH, int NMATCH)
Match the zero-terminated string PATTERN or the precompiled regular
expression REX against the full buffer TEXT.

 For example: the text 'abcdef' matches the pattern 'abcdef'
 according to this function, but neither the pattern 'abc' nor the
 pattern 'def' will match.

– Function: int cre2_partial_match (const char * PATTERN, const
cre2_string_t * TEXT, cre2_string_t * MATCH, int NMATCH)
– Function: int cre2_partial_match_re (cre2_regexp_t * REX, const
cre2_string_t * TEXT, cre2_string_t * MATCH, int NMATCH)
Match the zero-terminated string PATTERN or the precompiled regular
expression REX against the buffer TEXT, resulting in success if a
substring of TEXT matches; these functions behave like the full
match ones, but the matching text does not need to be anchored to
the beginning and end.

 For example: the text 'abcDEFghi' matches the pattern 'DEF'
 according to this function.

– Function: int cre2_consume (const char * PATTERN, cre2_string_t *
TEXT, cre2_string_t * MATCH, int NMATCH)
– Function: int cre2_consume_re (cre2_regexp_t * REX, cre2_string_t *
TEXT, cre2_string_t * MATCH, int NMATCH)
Match the zero-terminated string PATTERN or the precompiled regular
expression REX against the buffer TEXT, resulting in success if the
prefix of TEXT matches. The data structure referenced by TEXT is
mutated to reference text right after the last byte that matched
the pattern.

 For example: the text 'abcDEF' matches the pattern 'abc' according
 to this function; after the call TEXT will reference the text
 'DEF'.

– Function: int cre2_find_and_consume (const char * PATTERN,
cre2_string_t * TEXT, cre2_string_t * MATCH, int NMATCH)
– Function: int cre2_find_and_consume_re (cre2_regexp_t * REX,
cre2_string_t * TEXT, cre2_string_t * MATCH, int NMATCH)
Match the zero-terminated string PATTERN or the precompiled regular
expression REX against the buffer TEXT, resulting in success if,
after skipping a non-matching prefix in TEXT, a substring of TEXT
matches. The data structure referenced by TEXT is mutated to
reference text right after the last byte that matched the pattern.

 For example: the text 'abcDEFghi' matches the pattern 'DEF'
 according to this function; the prefix 'abc' is skipped; after the
 call TEXT will reference the text 'ghi'.
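
 The way the consuming functions advance TEXT can be pictured with a
 literal string in place of a regular expression.  In the sketch below
 the names ‘span_t’, ‘consume_literal()’ and
 ‘find_and_consume_literal()’ are hypothetical; the functions only
 imitate, for a fixed string, the mutation of TEXT documented above and
 do not call RE2.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in mirroring the fields of cre2_string_t. */
typedef struct {
  const char *data;
  int         length;
} span_t;

/* Hypothetical helper.  Succeed if TEXT starts with LITERAL; on
   success advance TEXT right after the matched bytes.  Assumes
   TEXT->data is not NULL. */
int
consume_literal (const char *literal, span_t *text)
{
  int len = (int) strlen(literal);

  if (len > text->length
      || 0 != memcmp(text->data, literal, (size_t) len))
    return 0;
  text->data   += len;
  text->length -= len;
  return 1;
}

/* Hypothetical helper.  Succeed if LITERAL occurs anywhere in TEXT; on
   success skip the non-matching prefix and advance TEXT right after
   the first occurrence.  Assumes TEXT->data is not NULL. */
int
find_and_consume_literal (const char *literal, span_t *text)
{
  int len = (int) strlen(literal);

  for (int skip = 0; skip + len <= text->length; ++skip) {
    if (0 == memcmp(text->data + skip, literal, (size_t) len)) {
      text->data   += skip + len;
      text->length -= skip + len;
      return 1;
    }
  }
  return 0;
}
```

 Calling such a function repeatedly on the same TEXT structure walks
 through the buffer one match at a time, which is the usual reason for
 preferring the consuming functions over the partial match ones.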


File: cre2.info, Node: tips, Next: Package License, Prev: other, Up: Top

7 Tips for using the regexp syntax


  • Menu:

  • tips dot:: Matching newlines with the
    ‘.’ subpattern.


File: cre2.info, Node: tips dot, Up: tips

7.1 Matching newlines with the ‘.’ subpattern

By default the dot subpattern ‘.’ matches any character but newlines; to
enable newline matching we have to enable the ‘s’ flag using the special
subpattern ‘(?<flags>)’ or ‘(?<flags>:<re>)’, where ‘<flags>’ is a
sequence of characters, one character for each flag, and ‘<re>’ is a
regexp subpattern. Notice that the parentheses in ‘(?<flags>:<re>)’ are
non-capturing.

So let’s consider the text ‘ciao\nhello’:

  • The regexp ‘ciao.hello’ does not match because ‘s’ is disabled.

  • The regexp ‘(?s)ciao.hello’ matches because the subpattern ‘(?s)’
    has enabled flag ‘s’ for the rest of the pattern, including the
    dot.

  • The regexp ‘ciao(?s).hello’ matches because the subpattern ‘(?s)’
    has enabled flag ‘s’ for the rest of the pattern, including the
    dot.

  • The regexp ‘ciao(?s:.)hello’ matches because the subpattern
    ‘(?s:.)’ has enabled flag ‘s’ for the subpattern ‘.’ which is the
    dot.


File: cre2.info, Node: Package License, Next: Documentation License, Prev: tips, Up: Top

Appendix A Package license


Copyright (C) 2012, 2016 Marco Maggi http://github.com/marcomaggi
Copyright (C) 2011 Keegan McAllister http://github.com/kmcallister/
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

  1. Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in the
    documentation and/or other materials provided with the
    distribution.

  3. Neither the name of the author nor the names of his contributors
    may be used to endorse or promote products derived from this
    software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
    PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR
    CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
    EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
    PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
    PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
    LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
    NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
    SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


File: cre2.info, Node: Documentation License, Next: references, Prev: Package License, Up: Top

Appendix B GNU Free Documentation License


                 Version 1.3, 3 November 2008

 Copyright (C) 2000, 2001, 2002, 2007, 2008 Free Software Foundation, Inc.
 

 Everyone is permitted to copy and distribute verbatim copies
 of this license document, but changing it is not allowed.
  1. PREAMBLE

    The purpose of this License is to make a manual, textbook, or other
    functional and useful document “free” in the sense of freedom: to
    assure everyone the effective freedom to copy and redistribute it,
    with or without modifying it, either commercially or
    noncommercially. Secondarily, this License preserves for the
    author and publisher a way to get credit for their work, while not
    being considered responsible for modifications made by others.

    This License is a kind of “copyleft”, which means that derivative
    works of the document must themselves be free in the same sense.
    It complements the GNU General Public License, which is a copyleft
    license designed for free software.

    We have designed this License in order to use it for manuals for
    free software, because free software needs free documentation: a
    free program should come with manuals providing the same freedoms
    that the software does. But this License is not limited to
    software manuals; it can be used for any textual work, regardless
    of subject matter or whether it is published as a printed book. We
    recommend this License principally for works whose purpose is
    instruction or reference.

  1. APPLICABILITY AND DEFINITIONS

    This License applies to any manual or other work, in any medium,
    that contains a notice placed by the copyright holder saying it can
    be distributed under the terms of this License. Such a notice
    grants a world-wide, royalty-free license, unlimited in duration,
    to use that work under the conditions stated herein. The
    “Document”, below, refers to any such manual or work. Any member
    of the public is a licensee, and is addressed as “you”. You accept
    the license if you copy, modify or distribute the work in a way
    requiring permission under copyright law.

    A “Modified Version” of the Document means any work containing the
    Document or a portion of it, either copied verbatim, or with
    modifications and/or translated into another language.

    A “Secondary Section” is a named appendix or a front-matter section
    of the Document that deals exclusively with the relationship of the
    publishers or authors of the Document to the Document’s overall
    subject (or to related matters) and contains nothing that could
    fall directly within that overall subject. (Thus, if the Document
    is in part a textbook of mathematics, a Secondary Section may not
    explain any mathematics.) The relationship could be a matter of
    historical connection with the subject or with related matters, or
    of legal, commercial, philosophical, ethical or political position
    regarding them.

    The “Invariant Sections” are certain Secondary Sections whose
    titles are designated, as being those of Invariant Sections, in the
    notice that says that the Document is released under this License.
    If a section does not fit the above definition of Secondary then it
    is not allowed to be designated as Invariant. The Document may
    contain zero Invariant Sections. If the Document does not identify
    any Invariant Sections then there are none.

    The “Cover Texts” are certain short passages of text that are
    listed, as Front-Cover Texts or Back-Cover Texts, in the notice
    that says that the Document is released under this License. A
    Front-Cover Text may be at most 5 words, and a Back-Cover Text may
    be at most 25 words.

    A “Transparent” copy of the Document means a machine-readable copy,
    represented in a format whose specification is available to the
    general public, that is suitable for revising the document
    straightforwardly with generic text editors or (for images composed
    of pixels) generic paint programs or (for drawings) some widely
    available drawing editor, and that is suitable for input to text
    formatters or for automatic translation to a variety of formats
    suitable for input to text formatters. A copy made in an otherwise
    Transparent file format whose markup, or absence of markup, has
    been arranged to thwart or discourage subsequent modification by
    readers is not Transparent. An image format is not Transparent if
    used for any substantial amount of text. A copy that is not
    “Transparent” is called “Opaque”.

    Examples of suitable formats for Transparent copies include plain
    ASCII without markup, Texinfo input format, LaTeX input format,
    SGML or XML using a publicly available DTD, and standard-conforming
    simple HTML, PostScript or PDF designed for human modification.
    Examples of transparent image formats include PNG, XCF and JPG.
    Opaque formats include proprietary formats that can be read and
    edited only by proprietary word processors, SGML or XML for which
    the DTD and/or processing tools are not generally available, and
    the machine-generated HTML, PostScript or PDF produced by some word
    processors for output purposes only.

    The “Title Page” means, for a printed book, the title page itself,
    plus such following pages as are needed to hold, legibly, the
    material this License requires to appear in the title page. For
    works in formats which do not have any title page as such, “Title
    Page” means the text near the most prominent appearance of the
    work’s title, preceding the beginning of the body of the text.

    The “publisher” means any person or entity that distributes copies
    of the Document to the public.

    A section “Entitled XYZ” means a named subunit of the Document
    whose title either is precisely XYZ or contains XYZ in parentheses
    following text that translates XYZ in another language. (Here XYZ
    stands for a specific section name mentioned below, such as
    “Acknowledgements”, “Dedications”, “Endorsements”, or “History”.)
    To “Preserve the Title” of such a section when you modify the
    Document means that it remains a section “Entitled XYZ” according
    to this definition.

    The Document may include Warranty Disclaimers next to the notice
    which states that this License applies to the Document. These
    Warranty Disclaimers are considered to be included by reference in
    this License, but only as regards disclaiming warranties: any other
    implication that these Warranty Disclaimers may have is void and
    has no effect on the meaning of this License.

  2. VERBATIM COPYING

    You may copy and distribute the Document in any medium, either
    commercially or noncommercially, provided that this License, the
    copyright notices, and the license notice saying this License
    applies to the Document are reproduced in all copies, and that you
    add no other conditions whatsoever to those of this License. You
    may not use technical measures to obstruct or control the reading
    or further copying of the copies you make or distribute. However,
    you may accept compensation in exchange for copies. If you
    distribute a large enough number of copies you must also follow the
    conditions in section 3.

    You may also lend copies, under the same conditions stated above,
    and you may publicly display copies.

  3. COPYING IN QUANTITY

    If you publish printed copies (or copies in media that commonly
    have printed covers) of the Document, numbering more than 100, and
    the Document’s license notice requires Cover Texts, you must
    enclose the copies in covers that carry, clearly and legibly, all
    these Cover Texts: Front-Cover Texts on the front cover, and
    Back-Cover Texts on the back cover. Both covers must also clearly
    and legibly identify you as the publisher of these copies. The
    front cover must present the full title with all words of the title
    equally prominent and visible. You may add other material on the
    covers in addition. Copying with changes limited to the covers, as
    long as they preserve the title of the Document and satisfy these
    conditions, can be treated as verbatim copying in other respects.

    If the required texts for either cover are too voluminous to fit
    legibly, you should put the first ones listed (as many as fit
    reasonably) on the actual cover, and continue the rest onto
    adjacent pages.

    If you publish or distribute Opaque copies of the Document
    numbering more than 100, you must either include a machine-readable
    Transparent copy along with each Opaque copy, or state in or with
    each Opaque copy a computer-network location from which the general
    network-using public has access to download using public-standard
    network protocols a complete Transparent copy of the Document, free
    of added material. If you use the latter option, you must take
    reasonably prudent steps, when you begin distribution of Opaque
    copies in quantity, to ensure that this Transparent copy will
    remain thus accessible at the stated location until at least one
    year after the last time you distribute an Opaque copy (directly or
    through your agents or retailers) of that edition to the public.

    It is requested, but not required, that you contact the authors of
    the Document well before redistributing any large number of copies,
    to give them a chance to provide you with an updated version of the
    Document.

  4. MODIFICATIONS

    You may copy and distribute a Modified Version of the Document
    under the conditions of sections 2 and 3 above, provided that you
    release the Modified Version under precisely this License, with the
    Modified Version filling the role of the Document, thus licensing
    distribution and modification of the Modified Version to whoever
    possesses a copy of it. In addition, you must do these things in
    the Modified Version:

    A. Use in the Title Page (and on the covers, if any) a title
    distinct from that of the Document, and from those of previous
    versions (which should, if there were any, be listed in the
    History section of the Document). You may use the same title
    as a previous version if the original publisher of that
    version gives permission.

    B. List on the Title Page, as authors, one or more persons or
    entities responsible for authorship of the modifications in
    the Modified Version, together with at least five of the
    principal authors of the Document (all of its principal
    authors, if it has fewer than five), unless they release you
    from this requirement.

    C. State on the Title page the name of the publisher of the
    Modified Version, as the publisher.

    D. Preserve all the copyright notices of the Document.

    E. Add an appropriate copyright notice for your modifications
    adjacent to the other copyright notices.

    F. Include, immediately after the copyright notices, a license
    notice giving the public permission to use the Modified
    Version under the terms of this License, in the form shown in
    the Addendum below.

    G. Preserve in that license notice the full lists of Invariant
    Sections and required Cover Texts given in the Document’s
    license notice.

    H. Include an unaltered copy of this License.

    I. Preserve the section Entitled “History”, Preserve its Title,
    and add to it an item stating at least the title, year, new
    authors, and publisher of the Modified Version as given on the
    Title Page. If there is no section Entitled “History” in the
    Document, create one stating the title, year, authors, and
    publisher of the Document as given on its Title Page, then add
    an item describing the Modified Version as stated in the
    previous sentence.

    J. Preserve the network location, if any, given in the Document
    for public access to a Transparent copy of the Document, and
    likewise the network locations given in the Document for
    previous versions it was based on. These may be placed in the
    “History” section. You may omit a network location for a work
    that was published at least four years before the Document
    itself, or if the original publisher of the version it refers
    to gives permission.

    K. For any section Entitled “Acknowledgements” or “Dedications”,
    Preserve the Title of the section, and preserve in the section
    all the substance and tone of each of the contributor
    acknowledgements and/or dedications given therein.

    L. Preserve all the Invariant Sections of the Document, unaltered
    in their text and in their titles. Section numbers or the
    equivalent are not considered part of the section titles.

    M. Delete any section Entitled “Endorsements”. Such a section
    may not be included in the Modified Version.

    N. Do not retitle any existing section to be Entitled
    “Endorsements” or to conflict in title with any Invariant
    Section.

    O. Preserve any Warranty Disclaimers.

    If the Modified Version includes new front-matter sections or
    appendices that qualify as Secondary Sections and contain no
    material copied from the Document, you may at your option designate
    some or all of these sections as invariant. To do this, add their
    titles to the list of Invariant Sections in the Modified Version’s
    license notice. These titles must be distinct from any other
    section titles.

    You may add a section Entitled “Endorsements”, provided it contains
    nothing but endorsements of your Modified Version by various
    parties–for example, statements of peer review or that the text
    has been approved by an organization as the authoritative
    definition of a standard.

    You may add a passage of up to five words as a Front-Cover Text,
    and a passage of up to 25 words as a Back-Cover Text, to the end of
    the list of Cover Texts in the Modified Version. Only one passage
    of Front-Cover Text and one of Back-Cover Text may be added by (or
    through arrangements made by) any one entity. If the Document
    already includes a cover text for the same cover, previously added
    by you or by arrangement made by the same entity you are acting on
    behalf of, you may not add another; but you may replace the old
    one, on explicit permission from the previous publisher that added
    the old one.

    The author(s) and publisher(s) of the Document do not by this
    License give permission to use their names for publicity for or to
    assert or imply endorsement of any Modified Version.

  5. COMBINING DOCUMENTS

    You may combine the Document with other documents released under
    this License, under the terms defined in section 4 above for
    modified versions, provided that you include in the combination all
    of the Invariant Sections of all of the original documents,
    unmodified, and list them all as Invariant Sections of your
    combined work in its license notice, and that you preserve all
    their Warranty Disclaimers.

    The combined work need only contain one copy of this License, and
    multiple identical Invariant Sections may be replaced with a single
    copy. If there are multiple Invariant Sections with the same name
    but different contents, make the title of each such section unique
    by adding at the end of it, in parentheses, the name of the
    original author or publisher of that section if known, or else a
    unique number. Make the same adjustment to the section titles in
    the list of Invariant Sections in the license notice of the
    combined work.

    In the combination, you must combine any sections Entitled
    “History” in the various original documents, forming one section
    Entitled “History”; likewise combine any sections Entitled
    “Acknowledgements”, and any sections Entitled “Dedications”. You
    must delete all sections Entitled “Endorsements.”

  6. COLLECTIONS OF DOCUMENTS

    You may make a collection consisting of the Document and other
    documents released under this License, and replace the individual
    copies of this License in the various documents with a single copy
    that is included in the collection, provided that you follow the
    rules of this License for verbatim copying of each of the documents
    in all other respects.

    You may extract a single document from such a collection, and
    distribute it individually under this License, provided you insert
    a copy of this License into the extracted document, and follow this
    License in all other respects regarding verbatim copying of that
    document.

  7. AGGREGATION WITH INDEPENDENT WORKS

    A compilation of the Document or its derivatives with other
    separate and independent documents or works, in or on a volume of a
    storage or distribution medium, is called an “aggregate” if the
    copyright resulting from the compilation is not used to limit the
    legal rights of the compilation’s users beyond what the individual
    works permit. When the Document is included in an aggregate, this
    License does not apply to the other works in the aggregate which
    are not themselves derivative works of the Document.

    If the Cover Text requirement of section 3 is applicable to these
    copies of the Document, then if the Document is less than one half
    of the entire aggregate, the Document’s Cover Texts may be placed
    on covers that bracket the Document within the aggregate, or the
    electronic equivalent of covers if the Document is in electronic
    form. Otherwise they must appear on printed covers that bracket
    the whole aggregate.

  8. TRANSLATION

    Translation is considered a kind of modification, so you may
    distribute translations of the Document under the terms of section 4.
    Replacing Invariant Sections with translations requires special
    permission from their copyright holders, but you may include
    translations of some or all Invariant Sections in addition to the
    original versions of these Invariant Sections. You may include a
    translation of this License, and all the license notices in the
    Document, and any Warranty Disclaimers, provided that you also
    include the original English version of this License and the
    original versions of those notices and disclaimers. In case of a
    disagreement between the translation and the original version of
    this License or a notice or disclaimer, the original version will
    prevail.

    If a section in the Document is Entitled “Acknowledgements”,
    “Dedications”, or “History”, the requirement (section 4) to
    Preserve its Title (section 1) will typically require changing the
    actual title.

  9. TERMINATION

    You may not copy, modify, sublicense, or distribute the Document
    except as expressly provided under this License. Any attempt
    otherwise to copy, modify, sublicense, or distribute it is void,
    and will automatically terminate your rights under this License.

    However, if you cease all violation of this License, then your
    license from a particular copyright holder is reinstated (a)
    provisionally, unless and until the copyright holder explicitly and
    finally terminates your license, and (b) permanently, if the
    copyright holder fails to notify you of the violation by some
    reasonable means prior to 60 days after the cessation.

    Moreover, your license from a particular copyright holder is
    reinstated permanently if the copyright holder notifies you of the
    violation by some reasonable means, this is the first time you have
    received notice of violation of this License (for any work) from
    that copyright holder, and you cure the violation prior to 30 days
    after your receipt of the notice.

    Termination of your rights under this section does not terminate
    the licenses of parties who have received copies or rights from you
    under this License. If your rights have been terminated and not
    permanently reinstated, receipt of a copy of some or all of the
    same material does not give you any rights to use it.

  10. FUTURE REVISIONS OF THIS LICENSE

    The Free Software Foundation may publish new, revised versions of
    the GNU Free Documentation License from time to time. Such new
    versions will be similar in spirit to the present version, but may
    differ in detail to address new problems or concerns. See
    http://www.gnu.org/copyleft/.

    Each version of the License is given a distinguishing version
    number. If the Document specifies that a particular numbered
    version of this License “or any later version” applies to it, you
    have the option of following the terms and conditions either of
    that specified version or of any later version that has been
    published (not as a draft) by the Free Software Foundation. If the
    Document does not specify a version number of this License, you may
    choose any version ever published (not as a draft) by the Free
    Software Foundation. If the Document specifies that a proxy can
    decide which future versions of this License can be used, that
    proxy’s public statement of acceptance of a version permanently
    authorizes you to choose that version for the Document.

  11. RELICENSING

    “Massive Multiauthor Collaboration Site” (or “MMC Site”) means any
    World Wide Web server that publishes copyrightable works and also
    provides prominent facilities for anybody to edit those works. A
    public wiki that anybody can edit is an example of such a server.
    A “Massive Multiauthor Collaboration” (or “MMC”) contained in the
    site means any set of copyrightable works thus published on the MMC
    site.

    “CC-BY-SA” means the Creative Commons Attribution-Share Alike 3.0
    license published by Creative Commons Corporation, a not-for-profit
    corporation with a principal place of business in San Francisco,
    California, as well as future copyleft versions of that license
    published by that same organization.

    “Incorporate” means to publish or republish a Document, in whole or
    in part, as part of another Document.

    An MMC is “eligible for relicensing” if it is licensed under this
    License, and if all works that were first published under this
    License somewhere other than this MMC, and subsequently
    incorporated in whole or in part into the MMC, (1) had no cover
    texts or invariant sections, and (2) were thus incorporated prior
    to November 1, 2008.

    The operator of an MMC Site may republish an MMC contained in the
    site under CC-BY-SA on the same site at any time before August 1,
    2009, provided the MMC is eligible for relicensing.

ADDENDUM: How to use this License for your documents

To use this License in a document you have written, include a copy of
the License in the document and put the following copyright and license
notices just after the title page:

   Copyright (C)  YEAR  YOUR NAME.
   Permission is granted to copy, distribute and/or modify this document
   under the terms of the GNU Free Documentation License, Version 1.3
   or any later version published by the Free Software Foundation;
   with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
   Texts.  A copy of the license is included in the section entitled “GNU
   Free Documentation License”.

If you have Invariant Sections, Front-Cover Texts and Back-Cover
Texts, replace the “with…Texts.” line with this:

     with the Invariant Sections being LIST THEIR TITLES, with
     the Front-Cover Texts being LIST, and with the Back-Cover Texts
     being LIST.

If you have Invariant Sections without Cover Texts, or some other
combination of the three, merge those two alternatives to suit the
situation.

If your document contains nontrivial examples of program code, we
recommend releasing these examples in parallel under your choice of free
software license, such as the GNU General Public License, to permit
their use in free software.


File: cre2.info, Node: references, Next: concept index, Prev: Documentation License, Up: Top

Appendix C Bibliography and references



File: cre2.info, Node: concept index, Next: function index, Prev: references, Up: Top

Appendix D An entry for each concept

