regcomp(3C) regcomp(3C)
NAME
regcomp(), regerror(), regexec(), regfree() - regular expression
matching routines
SYNOPSIS
#include <regex.h>
int regcomp(regex_t *preg, const char *pattern, int cflags);
int regexec(
    const regex_t *preg,
    const char *string,
    size_t nmatch,
    regmatch_t pmatch[],
    int eflags
);
void regfree(regex_t *preg);
size_t regerror(
    int errcode,
    const regex_t *preg,
    char *errbuf,
    size_t errbuf_size
);
DESCRIPTION
These functions interpret regular expressions as described in
regexp(5). They support both basic and extended regular expressions.
The structures regex_t and regmatch_t are defined in the header
<regex.h>.
The regex_t structure contains at least the following member (use of
other members results in non-portable code):
     size_t   re_nsub    Number of parenthesized subexpressions.
The regmatch_t structure contains at least the following members:
     regoff_t rm_so      Byte offset from start of string to
                         start of substring.
     regoff_t rm_eo      Byte offset from start of string to the
                         first character after the end of the
                         substring.
regcomp() compiles the regular expression specified by the pattern
argument and places the results in the structure pointed to by preg.
The cflags argument is the bit-wise logical OR of zero or more of the
following flags (defined in <regex.h>):
Hewlett-Packard Company - 1 - HP-UX Release 11i: November 2000
REG_EXTENDED Use extended regular expressions.
REG_NOSUB Report only success or failure in regexec().
REG_NEWLINE If REG_NEWLINE is not set in cflags, a newline
character in pattern or string is treated as an
ordinary character. If REG_NEWLINE is set,
newlines are treated as ordinary characters
except as follows:
1. A newline in string is not matched by
a period outside of a bracket
expression or by any form of a
nonmatching list.
2. A circumflex (^) in pattern, when used
to specify expression anchoring,
matches the zero-length string
immediately after a newline in string,
regardless of the setting of
REG_NOTBOL.
3. A dollar-sign ($) in pattern, when
used to specify expression anchoring,
matches the zero-length string
immediately before a newline in
string, regardless of the setting of
REG_NOTEOL.
REG_ICASE Ignore case in match. If a character in
pattern is defined in the current LC_CTYPE
locale as having one or more opposite-case
counterparts, both the character and any
counterparts match the pattern character.
This applies to all portions of the pattern,
including a string of characters specified to
be matched via a back-reference expression
(\n).
Within bracket expressions: Collation ranges,
character classes, and equivalence classes are
effectively expanded into equivalent lists of
collating elements and characters. Opposite-case
counterparts are then generated for each
collating element or character to form the
complete matching list or non-matching list for
the bracket expression. Opposite-case
counterparts for a multi-character collating
element include all possible combinations of
opposite-case counterparts for each individual
character comprising the collating element.
These are then combined to form new valid
multi-character collating elements. For
example, the opposite-case counterparts for
[.ch.] could be [.Ch.], [.cH.], and [.CH.].
The default regular expression type for pattern is Basic Regular
Expression. The application can specify Extended Regular Expressions
by using the REG_EXTENDED cflags value.
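The effect of the cflags values can be checked with a small wrapper. The following is a minimal sketch, not part of the interface described here: matches_with() is a hypothetical helper name, and it always adds REG_NOSUB because only success or failure is needed.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile pattern with cflags (REG_NOSUB is always
   added), test it against string, and free the compiled expression.
   Returns 1 for a match, 0 for no match, -1 for any error. */
int
matches_with(const char *pattern, const char *string, int cflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, string, (size_t) 0, NULL, 0);
    regfree(&re);
    if (rc == 0)
        return 1;
    return (rc == REG_NOMATCH) ? 0 : -1;
}
```

For instance, matches_with("^two$", "one\ntwo\nthree", REG_NEWLINE) is expected to report a match, while the same call with cflags of 0 should not, because the anchors then apply only to the start and end of the whole string.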
If the function regcomp() succeeds, it returns zero; otherwise it
returns a non-zero value indicating the error.
If regcomp() succeeds, and if the REG_NOSUB flag was not set in
cflags, regcomp() sets re_nsub to the number of parenthesized
subexpressions (delimited by \( and \) in basic regular expressions or
( and ) in extended regular expressions) found in pattern.
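As an illustration of re_nsub, the sketch below compiles a pattern and returns the subexpression count. count_subexpressions() is a hypothetical name chosen for this example. The pattern is compiled without REG_NOSUB so that re_nsub is set.

```c
#include <regex.h>

/* Hypothetical helper: return the number of parenthesized
   subexpressions in pattern, or -1 if pattern does not compile. */
long
count_subexpressions(const char *pattern, int cflags)
{
    regex_t re;
    long n;

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    n = (long) re.re_nsub;      /* valid only after a successful regcomp() */
    regfree(&re);
    return n;
}
```

Note that plain ( and ) are ordinary characters in a basic regular expression, so count_subexpressions("(a)", 0) should return 0 while count_subexpressions("(a)", REG_EXTENDED) should return 1.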
regexec() matches the null-terminated string specified by string
against the compiled regular expression preg initialized by a previous
call to regcomp(). If it finds a match, regexec() returns zero;
otherwise it returns non-zero indicating either no match or an error.
The eflags argument is the bit-wise logical OR of zero or more of the following flags:
REG_NOTBOL The first character of the string pointed to by
string is not the beginning of the line.
Therefore, the circumflex character (^), when
taken as a special character, never matches.
REG_NOTEOL The last character of the string pointed to by
string is not the end of the line. Therefore,
the dollar sign ($), when taken as a special
character, never matches.
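The two eflags values can be exercised as follows. This is a sketch under the assumption that the pattern is a basic regular expression; exec_with() is a hypothetical helper name.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile pattern as a basic regular expression
   and run regexec() on string with the given eflags.
   Returns 1 for a match, 0 for no match, -1 for any error. */
int
exec_with(const char *pattern, const char *string, int eflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, string, (size_t) 0, NULL, eflags);
    regfree(&re);
    if (rc == 0)
        return 1;
    return (rc == REG_NOMATCH) ? 0 : -1;
}
```

With these flags, exec_with("^abc", "abcdef", REG_NOTBOL) should report no match even though the string begins with abc, because the start of the string is declared not to be the start of a line.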
If nmatch is not zero, and REG_NOSUB was not set in the cflags
argument to regcomp(), then regexec() fills in the pmatch array with
byte offsets to the substrings of string that correspond to the
parenthesized subexpressions of pattern: pmatch[i].rm_so is the byte
offset of the beginning and pmatch[i].rm_eo is the byte offset one
byte past the end of the substring i. (Subexpression i begins at the
ith matched left parenthesis, counting from 1). Offsets in pmatch[0]
identify the substring that corresponds to the entire regular
expression. Unused elements of pmatch are set to -1. If there are
more than nmatch subexpressions in pattern (pattern itself counts as a
subexpression), regexec() still does the match, but only records the
first nmatch substrings.
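The offsets in pmatch can be turned into substrings by copying from string. The following is a minimal sketch: capture() and MAX_GROUPS are hypothetical names, the pattern is assumed to be an extended regular expression, and the output is truncated if the caller's buffer is too small.

```c
#include <regex.h>
#include <string.h>

#define MAX_GROUPS 4    /* assumption: pmatch[0] plus up to three subexpressions */

/* Hypothetical helper: copy the substring matched by subexpression idx
   (0 means the whole match) into out, which holds outsz bytes.
   Returns the number of bytes copied, or -1 on no match, on error,
   or if subexpression idx did not participate in the match. */
int
capture(const char *pattern, const char *string, size_t idx,
        char *out, size_t outsz)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    size_t len;
    int rc;

    if (idx >= MAX_GROUPS || outsz == 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, string, MAX_GROUPS, pm, 0);
    regfree(&re);
    if (rc != 0 || pm[idx].rm_so == -1)
        return -1;
    len = (size_t) (pm[idx].rm_eo - pm[idx].rm_so);
    if (len >= outsz)
        len = outsz - 1;        /* truncate to fit the caller's buffer */
    memcpy(out, string + pm[idx].rm_so, len);
    out[len] = '\0';
    return (int) len;
}
```

Because rm_eo is the offset one byte past the end of the substring, the length is simply rm_eo minus rm_so.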
When matching a regular expression, any given parenthesized
subexpression of pattern might participate in the match of several
different substrings of string, or it might not match any substring,
even though the pattern as a whole did match. The following explains
which substrings are reported in pmatch when matching regular
expressions:
1. If subexpression i in a regular expression is not contained
within another subexpression, and it participated in the
match several times, the byte offsets in pmatch[i] delimit
the last such match.
2. If subexpression i is not contained within another
subexpression, and it did not participate in an otherwise
successful match (because either *, ?, or | was used), then
the byte offsets in pmatch[i] are -1.
3. If subexpression i is contained in subexpression j, and a
match of subexpression j is reported in pmatch[j], the match
or no-match reported in pmatch[i] is the last one that
occurred within the substring in pmatch[j].
4. If subexpression i is contained in subexpression j, and the
offsets in pmatch[j] are -1, the offsets in pmatch[i] will
also be -1.
5. If subexpression i matched a zero-length string, both
offsets in pmatch[i] refer to the character immediately
following the zero-length substring.
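Rules 1, 2, and 5 can be observed with a helper that reports the offsets recorded for the first subexpression. This is a sketch only; group1_offsets() is a hypothetical name and the patterns are assumed to be extended regular expressions.

```c
#include <regex.h>

/* Hypothetical helper: match the extended regular expression pattern
   against string and store the offsets of subexpression 1 in *so and
   *eo.  Returns 0 on a match; otherwise returns the non-zero value
   from regcomp() or regexec() and leaves *so and *eo untouched. */
int
group1_offsets(const char *pattern, const char *string, long *so, long *eo)
{
    regex_t re;
    regmatch_t pm[2];           /* pm[0]: whole match, pm[1]: subexpression 1 */
    int rc;

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;
    rc = regexec(&re, string, 2, pm, 0);
    regfree(&re);
    if (rc == 0) {
        *so = (long) pm[1].rm_so;
        *eo = (long) pm[1].rm_eo;
    }
    return rc;
}
```

Matching (a|b)+ against abab should report offsets 3 and 4, the last of the four participations (rule 1). Matching (a)|b against b should report -1 for both offsets (rule 2). Matching x(y*)z against xz should report 1 and 1, an empty substring (rule 5).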
If REG_NOSUB was set in cflags in the call to regcomp(), and nmatch is
not zero in the call to regexec(), the content of the pmatch array is
unspecified.
regfree() frees any memory allocated by regcomp() associated with
preg.
If the preg argument to regexec() or regfree() is not a compiled
regular expression returned by regcomp(), the result is undefined. A
preg can no longer be treated as a compiled regular expression after
it is given to regfree().
regerror() provides a mapping from error codes returned by regcomp()
and regexec() to printable strings. regerror() generates a string
corresponding to the value of the errcode parameter, which was the
last non-zero value returned by regcomp() or regexec() with the given
value of preg. The errcode parameter can take on any of the error
values defined in <regex.h>. If errbuf_size is not zero, regerror()
copies an appropriate error message into the buffer specified by
errbuf. If the error message (including the terminating null) cannot
fit in the buffer, it is truncated to errbuf_size - 1 bytes and null
terminated.
If errbuf_size is zero, the errbuf parameter is ignored, but the
return value is as defined below.
regerror() returns the size of the buffer (including terminating null)
that is required to hold the entire error message.
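Because the return value is the full size needed, a first call with errbuf_size of zero can be used to size a buffer exactly. The following is a minimal sketch; regerror_alloc() is a hypothetical name, and the caller is assumed to release the result with free().

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: return a malloc()ed copy of the complete error
   message for errcode, or NULL if memory cannot be allocated.
   The caller frees the result. */
char *
regerror_alloc(int errcode, const regex_t *preg)
{
    size_t need;
    char *msg;

    /* errbuf_size of 0: errbuf is ignored, only the size is returned */
    need = regerror(errcode, preg, NULL, 0);
    if (need == 0)
        return NULL;
    msg = malloc(need);
    if (msg != NULL)
        (void) regerror(errcode, preg, msg, need);
    return msg;
}
```

This avoids the truncation that can occur with a fixed-size buffer, at the cost of a second call to regerror().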
EXTERNAL INFLUENCES
Locale
The LC_COLLATE category determines the collating sequence used in
compiling and executing regular expressions.
The LC_CTYPE category determines the interpretation of text as single
and/or multi-byte characters, the characters matched by character-
class expressions in regular expressions, and the opposite-case
counterpart for each character.
International Code Set Support
Single- and multi-byte character code sets are supported.
RETURN VALUE
regcomp() returns zero for success and non-zero for an invalid
expression or other failure. regexec() returns zero if it finds a
match and non-zero for no match or other failure.
ERRORS
If regcomp() or regexec() detects one of the error conditions listed
below, it returns the corresponding non-zero error code. The error
codes are defined in the header <regex.h>.
REG_BADBR The contents within the pair \{ (backslash
left brace) and \} (backslash right brace)
are unusable: not a number, number too large,
more than two numbers, or first number larger
than second.
REG_BADPAT An invalid regular expression.
REG_BADRPT The ? (question mark), * (asterisk), or +
(plus sign) symbols are not preceded by a
valid regular expression.
REG_EBRACE The use of a pair of \{ (backslash left
brace) and \} (backslash right brace) or {}
(braces) is unbalanced.
REG_EBRACK The use of [] (brackets) is unbalanced.
REG_EBOL The ^ (caret) anchor is used, but not at the
beginning of a line.
REG_ECHAR There is an invalid multibyte character.
REG_ECOLLATE There is an unusable collating element
referenced.
REG_ECTYPE There is an unusable character class type
referenced.
REG_EEOL The $ (dollar) anchor is used, but not at the
end of a line.
REG_EESCAPE There is a trailing \ in the pattern.
REG_EPAREN The use of a pair of \( (backslash left
parenthesis) and \) (backslash right
parenthesis) or () is unbalanced.
REG_ERANGE There is an unusable endpoint in the range
expression.
REG_ESPACE There is insufficient memory space.
REG_ESUBREG The number in \digit is invalid or in error.
REG_NOMATCH The regexec() function failed to match.
EXAMPLES
#include <stdio.h>
#include <regex.h>

/* Match string against the extended regular expression in pattern,
   treating errors as no match. Return 1 for match, 0 for no match.
   Print an error message if an error occurs. */
int
match(const char *string, const char *pattern)
{
    int i;
    regex_t re;
    char buf[256];

    i = regcomp(&re, pattern, REG_EXTENDED|REG_NOSUB);
    if (i != 0) {
        (void)regerror(i, &re, buf, sizeof buf);
        printf("%s\n", buf);
        return(0);              /* report error */
    }
    i = regexec(&re, string, (size_t) 0, NULL, 0);
    if (i != 0 && i != REG_NOMATCH) {
        /* a real error, not just "no match"; call regerror()
           before regfree(), since re is unusable afterwards */
        (void)regerror(i, &re, buf, sizeof buf);
        printf("%s\n", buf);
    }
    regfree(&re);
    return(i == 0);
}
The following demonstrates how the REG_NOTBOL flag could be used with
regexec() to find all substrings in a line that match a pattern
supplied by a user.
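(void) regcomp(&re, pattern, 0);
p = buffer;                 /* p: start of the text not yet searched */
/* look for first match at start of line */
error = regexec(&re, p, 1, &pm, 0);
while (error == 0) {        /* while matches found */
    p += pm.rm_eo;          /* offsets in pm are relative to p */
    if (pm.rm_so == pm.rm_eo) {     /* empty match: step one byte */
        if (*p == '\0')
            break;
        p++;
    }
    /* find next match on line */
    error = regexec(&re, p, 1, &pm, REG_NOTBOL);
}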
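A complete version of this loop might look like the following. It is a sketch under stated assumptions: print_all_matches() is a hypothetical name, the pattern is a basic regular expression, and after an empty match the search position advances by one byte, which is only safe in a single-byte code set.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: print each non-overlapping match of the basic
   regular expression pattern found in line, one per output line.
   Returns the number of matches, or -1 if pattern does not compile. */
int
print_all_matches(const char *pattern, const char *line)
{
    regex_t re;
    regmatch_t pm;
    const char *p = line;       /* start of the text not yet searched */
    int eflags = 0;             /* first search starts at beginning of line */
    int count = 0;

    if (regcomp(&re, pattern, 0) != 0)
        return -1;
    while (regexec(&re, p, 1, &pm, eflags) == 0) {
        /* offsets are relative to p, the string passed to this call */
        printf("%.*s\n", (int) (pm.rm_eo - pm.rm_so), p + pm.rm_so);
        count++;
        p += pm.rm_eo;
        if (pm.rm_so == pm.rm_eo) {     /* empty match: avoid looping forever */
            if (*p == '\0')
                break;
            p++;                /* assumes a single-byte code set */
        }
        eflags = REG_NOTBOL;    /* later searches do not start a line */
    }
    regfree(&re);
    return count;
}
```

Keeping a pointer to the unsearched text matters because each call reports offsets relative to the string it was given, not to the start of the original line.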
AUTHOR
regcomp(), regerror(), regexec(), and regfree() were developed by OSF
and HP.
SEE ALSO
regexp(5).
STANDARDS CONFORMANCE
regcomp(): XPG4, POSIX.2
regerror(): XPG4, POSIX.2
regexec(): XPG4, POSIX.2
regfree(): XPG4, POSIX.2