regcomp, regexec, regerror, regfree -- POSIX regex functions
Synopsis
#include <sys/types.h>
#include <regex.h>
int regcomp(regex_t *preg, const char *regex, int cflags);
int regexec(const regex_t *preg, const char *string, size_t nmatch, regmatch_t pmatch[], int eflags);
size_t regerror(int errcode, const regex_t *preg, char *errbuf, size_t errbuf_size);
void regfree(regex_t *preg);
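POSIX stands for Portable Operating System Interface.
regcomp() compiles a regular expression into a form suitable for subsequent regexec() searches.
regcomp() takes three arguments: preg, a pointer to the pattern buffer that will hold the compiled expression; regex, a pointer to the null-terminated ('\0') pattern string; and cflags, flags that determine the type of compilation.
All regular expression searching must go through a compiled pattern buffer, so regexec() must always be given the address of a pattern buffer initialized by regcomp().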
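cflags may be the bitwise OR of one or more of the following:
REG_EXTENDED
Interpret regex using POSIX Extended Regular Expression syntax. If this flag is not set, POSIX Basic Regular Expression syntax is used.
REG_ICASE
Ignore case. Subsequent regexec() searches using this pattern buffer are case-insensitive.
REG_NOSUB
Do not report the positions of matches. If the pattern buffer was compiled with this flag, regexec() ignores its nmatch and pmatch arguments.
REG_NEWLINE
Match-any-character operators do not match a newline.
A nonmatching list ([^...]) that does not contain a newline does not match a newline.
The match-beginning-of-line operator (^) matches the empty string immediately after a newline, regardless of whether eflags, the execution flags of regexec(), contains REG_NOTBOL.
The match-end-of-line operator ($) matches the empty string immediately before a newline, regardless of whether eflags contains REG_NOTEOL.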
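As a rough illustration of combining compilation flags, the sketch below compiles a pattern with REG_EXTENDED, REG_ICASE and REG_NOSUB and asks only whether a string matches. The helper name matches_ignoring_case and its return convention are assumptions made for this example, not part of the regex API.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if text matches pattern (ignoring case),
 * 0 if it does not, and -1 if the pattern fails to compile. */
int matches_ignoring_case(const char *pattern, const char *text)
{
    regex_t re;

    /* Extended syntax, case-insensitive, no match positions needed. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
        return -1;

    /* With REG_NOSUB, nmatch and pmatch are ignored, so pass 0 and NULL. */
    int rc = regexec(&re, text, 0, NULL, 0);

    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

For instance, matches_ignoring_case("^hel+o$", "HELLO") should report a match, because REG_ICASE makes the search case-insensitive.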
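regexec() matches a null-terminated string against the previously compiled pattern buffer preg. nmatch and pmatch are used to return information about the location of any matches. eflags may be the bitwise OR of one or both of REG_NOTBOL and REG_NOTEOL, which change the matching behavior as described below:
REG_NOTBOL
The match-beginning-of-line operator always fails to match (but see the compilation flag REG_NEWLINE above). This flag is useful when different portions of a string are passed to regexec() and the beginning of the string should not be interpreted as the beginning of a line.
REG_NOTEOL
The match-end-of-line operator always fails to match (but see the compilation flag REG_NEWLINE above).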
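Unless the pattern buffer was compiled with REG_NOSUB, the positions of matched substrings can be obtained. pmatch must have room for at least nmatch elements. regexec() fills these in with the offsets of the substring matches. Any unused structure elements contain the value -1.
The regmatch_t structure, which is the type of the pmatch elements, is defined in <regex.h>:
typedef struct {
    regoff_t rm_so;
    regoff_t rm_eo;
} regmatch_t;
Each rm_so element that is not -1 gives the start offset of a substring match within the string. The corresponding rm_eo element gives the end offset of that match, which is the offset of the first character after the matched text.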
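The offsets can be read back as in the following sketch. It assumes a fixed pattern with two parenthesized groups; the helper name find_range is hypothetical. Element 0 of pmatch describes the whole match and elements 1 and 2 describe the two groups.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: look for "<digits>-<digits>" in text.
 * Returns 0 on a match and fills m[0] (whole match), m[1] and m[2]
 * (first and second group); returns nonzero otherwise. */
int find_range(const char *text, regmatch_t m[3])
{
    regex_t re;

    if (regcomp(&re, "([0-9]+)-([0-9]+)", REG_EXTENDED) != 0)
        return -1;

    /* nmatch = 3: one slot for the whole match plus one per group. */
    int rc = regexec(&re, text, 3, m, 0);

    regfree(&re);
    return rc;
}
```

For the input "range 10-25 ok", the whole match spans offsets 6 to 11, so rm_eo - rm_so gives its length of 5, and text + m[1].rm_so points at the first group, "10".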
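regerror() turns the error codes returned by regcomp() and regexec() into error message strings.
regerror() is passed the error code errcode, the pattern buffer preg, a pointer to a character buffer errbuf, and the size of that buffer errbuf_size. It returns the size of the errbuf required to hold the null-terminated error message string. If both errbuf and errbuf_size are nonzero, errbuf is filled with the first errbuf_size - 1 characters of the error message followed by a terminating null byte ('\0').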
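Because regerror() returns the buffer size it needs, one possible approach is to call it twice: once with a zero-sized buffer to learn the size, then again with a buffer of exactly that size. The helper name regex_error_message below is an assumption for this sketch.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc'ed copy of the message for
 * errcode, or NULL if allocation fails. The caller must free() it. */
char *regex_error_message(int errcode, const regex_t *preg)
{
    /* First call: errbuf_size is 0, so only the required size is returned. */
    size_t needed = regerror(errcode, preg, NULL, 0);

    char *msg = malloc(needed);
    if (msg != NULL)
        regerror(errcode, preg, msg, needed);   /* second call fills the buffer */
    return msg;
}
```

A fixed-size buffer also works; in that case a message longer than errbuf_size - 1 characters is truncated.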
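Passing regfree() the address of a pattern buffer compiled by regcomp() frees the memory that the compilation allocated for that pattern buffer.
regcomp() returns 0 on successful compilation, or an error code on failure.
regexec() returns 0 on a successful match, or REG_NOMATCH on failure.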
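regcomp() may return the following errors:
REG_BADBR
Invalid use of the back reference operator.
REG_BADPAT
Invalid use of pattern operators such as group or list.
REG_BADRPT
Invalid use of repetition operators, such as using '*' as the first character.
REG_EBRACE
Unmatched brace interval operators.
REG_EBRACK
Unmatched bracket list operators.
REG_ECOLLATE
Invalid collating element.
REG_ECTYPE
Unknown character class name.
REG_EEND
Nonspecific error. This is not defined by POSIX.2.
REG_EESCAPE
Trailing backslash.
REG_EPAREN
Unmatched parenthesis group operators.
REG_ERANGE
Invalid use of the range operator, e.g., the ending point of the range occurs before the starting point.
REG_ESIZE
The compiled regular expression requires a pattern buffer larger than 64 kB. This is not defined by POSIX.2.
REG_ESPACE
The regex routines ran out of memory.
REG_ESUBREG
Invalid back reference to a subexpression.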
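Conforming to: POSIX.1-2001
My translation is rather rough, and I keep reflecting on the state of my English, so the English original is pasted here as well (man regex):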
REGEX(3) Linux Programmer's Manual REGEX(3)
NAME
regcomp, regexec, regerror, regfree - POSIX regex functions
SYNOPSIS
#include <sys/types.h>
#include <regex.h>
int regcomp(regex_t *preg, const char *regex, int cflags);
int regexec(const regex_t *preg, const char *string, size_t nmatch,
regmatch_t pmatch[], int eflags);
size_t regerror(int errcode, const regex_t *preg, char *errbuf,
size_t errbuf_size);
void regfree(regex_t *preg);
DESCRIPTION
POSIX Regex Compiling
regcomp() is used to compile a regular expression into a form that is suitable for subsequent regexec() searches.
regcomp() is supplied with preg, a pointer to a pattern buffer storage area; regex, a pointer to the null-terminated string and cflags, flags used to determine the type of compilation.
All regular expression searching must be done via a compiled pattern buffer, thus regexec() must always be supplied with the address of a regcomp() initialized pattern buffer.
cflags may be the bitwise-or of one or more of the following:
REG_EXTENDED
Use POSIX Extended Regular Expression syntax when interpreting regex. If not set, POSIX Basic Regular Expression syntax is used.
REG_ICASE
Do not differentiate case. Subsequent regexec() searches using this pattern buffer will be case insensitive.
REG_NOSUB
Support for substring addressing of matches is not required. The nmatch and pmatch arguments to regexec() are ignored if the pattern buffer supplied
was compiled with this flag set.
REG_NEWLINE
Match-any-character operators don't match a newline.
A nonmatching list ([^...]) not containing a newline does not match a newline.
Match-beginning-of-line operator (^) matches the empty string immediately after a newline, regardless of whether eflags, the execution flags of
regexec(), contains REG_NOTBOL.
Match-end-of-line operator ($) matches the empty string immediately before a newline, regardless of whether eflags contains REG_NOTEOL.
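The effect of REG_NEWLINE can be observed with a small sketch like the one below, which returns the start offset of the first match under caller-supplied compilation flags. The helper name first_match_offset is hypothetical, and cflags is assumed not to include REG_NOSUB.

```c
#include <regex.h>

/* Hypothetical helper: start offset of the first match of pattern in
 * text, or -1 if there is no match or the pattern does not compile.
 * cflags must not include REG_NOSUB, since the offset is read back. */
long first_match_offset(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    regmatch_t m;
    long offset = -1;

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;

    if (regexec(&re, text, 1, &m, 0) == 0)
        offset = (long)m.rm_so;

    regfree(&re);
    return offset;
}
```

With the text "a\nb", the pattern "^b" is expected to match (at offset 2) only when REG_NEWLINE is set, while "a.b" is expected to match only when it is not set, because the dot then no longer matches the newline.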
POSIX Regex Matching
regexec() is used to match a null-terminated string against the precompiled pattern buffer, preg. nmatch and pmatch are used to provide information regarding
the location of any matches. eflags may be the bitwise-or of one or both of REG_NOTBOL and REG_NOTEOL which cause changes in matching behavior described
below.
REG_NOTBOL
The match-beginning-of-line operator always fails to match (but see the compilation flag REG_NEWLINE above). This flag may be used when different portions of a string are passed to regexec() and the beginning of the string should not be interpreted as the beginning of the line.
REG_NOTEOL
The match-end-of-line operator always fails to match (but see the compilation flag REG_NEWLINE above).
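One situation where REG_NOTBOL matters is scanning a string for successive matches by calling regexec() on the remainder after each hit. The sketch below, with the hypothetical helper name count_matches, passes REG_NOTBOL on every call except the first so that ^ is not treated as matching at each restart position. How empty matches are counted is a design choice of this example.

```c
#include <regex.h>

/* Hypothetical helper: count successive non-overlapping matches of
 * pattern in text. Returns -1 if the pattern does not compile. */
int count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    const char *p = text;
    int count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* Only the first call starts at the real beginning of the line. */
    while (regexec(&re, p, 1, &m, p == text ? 0 : REG_NOTBOL) == 0) {
        count++;
        if (m.rm_eo > m.rm_so) {
            p += m.rm_eo;               /* continue right after the match */
        } else {
            /* Empty match: step over one character to guarantee progress. */
            if (p[m.rm_eo] == '\0')
                break;
            p += m.rm_eo + 1;
        }
    }

    regfree(&re);
    return count;
}
```

With this approach, count_matches("^a", "aaa") reports a single match; without REG_NOTBOL each restart would look like a new line start and every "a" would be counted.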
Byte Offsets
Unless REG_NOSUB was set for the compilation of the pattern buffer, it is possible to obtain substring match addressing information. pmatch must be dimensioned to have at least nmatch elements. These are filled in by regexec() with substring match addresses. Any unused structure elements will contain the value -1.
The regmatch_t structure which is the type of pmatch is defined in <regex.h>.
typedef struct {
regoff_t rm_so;
regoff_t rm_eo;
} regmatch_t;
Each rm_so element that is not -1 indicates the start offset of the next largest substring match within the string. The relative rm_eo element indicates the
end offset of the match, which is the offset of the first character after the matching text.
POSIX Error Reporting
regerror() is used to turn the error codes that can be returned by both regcomp() and regexec() into error message strings.
regerror() is passed the error code, errcode, the pattern buffer, preg, a pointer to a character string buffer, errbuf, and the size of the string buffer,
errbuf_size. It returns the size of the errbuf required to contain the null-terminated error message string. If both errbuf and errbuf_size are nonzero,
errbuf is filled in with the first errbuf_size - 1 characters of the error message and a terminating null byte ('\0').
POSIX Pattern Buffer Freeing
Supplying regfree() with a precompiled pattern buffer, preg will free the memory allocated to the pattern buffer by the compiling process, regcomp().
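A common way to use this lifecycle is to compile once, match many strings, and free once. The following is a minimal sketch under that assumption; the helper name count_matching_lines and its parameters are made up for the example.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: count how many of the n strings in lines[]
 * contain a match for pattern. Returns -1 if compilation fails. */
int count_matching_lines(const char *pattern, const char *const lines[], size_t n)
{
    regex_t re;
    int count = 0;

    /* Compile once; REG_NOSUB because only yes/no answers are needed. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;      /* regfree() is only called after a successful regcomp() */

    for (size_t i = 0; i < n; i++) {
        if (regexec(&re, lines[i], 0, NULL, 0) == 0)
            count++;
    }

    regfree(&re);       /* release the compiled pattern exactly once */
    return count;
}
```

Reusing the compiled buffer across calls avoids recompiling the pattern for every string. The buffer should not be passed to regexec() again after regfree().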
RETURN VALUE
regcomp() returns zero for a successful compilation or an error code for failure.
regexec() returns zero for a successful match or REG_NOMATCH for failure.
ERRORS
The following errors can be returned by regcomp():
REG_BADBR
Invalid use of back reference operator.
REG_BADPAT
Invalid use of pattern operators such as group or list.
REG_BADRPT
Invalid use of repetition operators such as using '*' as the first character.
REG_EBRACE
Un-matched brace interval operators.
REG_EBRACK
Un-matched bracket list operators.
REG_ECOLLATE
Invalid collating element.
REG_ECTYPE
Unknown character class name.
REG_EEND
Non specific error. This is not defined by POSIX.2.
REG_EESCAPE
Trailing backslash.
REG_EPAREN
Un-matched parenthesis group operators.
REG_ERANGE
Invalid use of the range operator, e.g., the ending point of the range occurs prior to the starting point.
REG_ESIZE
Compiled regular expression requires a pattern buffer larger than 64 kB. This is not defined by POSIX.2.
REG_ESPACE
The regex routines ran out of memory.
REG_ESUBREG
Invalid back reference to a subexpression.
CONFORMING TO
POSIX.1-2001.
SEE ALSO
grep(1), regex(7)
The glibc manual section, Regular Expression Matching
COLOPHON
This page is part of release 3.44 of the Linux man-pages project. A description of the project, and information about reporting bugs, can be found at
http://www.kernel.org/doc/man-pages/.
GNU 2012-06-11 REGEX(3)