src/lib/libc/regex/regex.3 - annotate

Annotation of src/lib/libc/regex/regex.3, Revision 1.3

1.3     ! cgd         1: .\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
        !             2: .\" Copyright (c) 1992, 1993, 1994
        !             3: .\"    The Regents of the University of California.  All rights reserved.
        !             4: .\"
        !             5: .\" This code is derived from software contributed to Berkeley by
        !             6: .\" Henry Spencer.
        !             7: .\"
        !             8: .\" Redistribution and use in source and binary forms, with or without
        !             9: .\" modification, are permitted provided that the following conditions
        !            10: .\" are met:
        !            11: .\" 1. Redistributions of source code must retain the above copyright
        !            12: .\"    notice, this list of conditions and the following disclaimer.
        !            13: .\" 2. Redistributions in binary form must reproduce the above copyright
        !            14: .\"    notice, this list of conditions and the following disclaimer in the
        !            15: .\"    documentation and/or other materials provided with the distribution.
        !            16: .\" 3. All advertising materials mentioning features or use of this software
        !            17: .\"    must display the following acknowledgement:
        !            18: .\"    This product includes software developed by the University of
        !            19: .\"    California, Berkeley and its contributors.
        !            20: .\" 4. Neither the name of the University nor the names of its contributors
        !            21: .\"    may be used to endorse or promote products derived from this software
        !            22: .\"    without specific prior written permission.
        !            23: .\"
        !            24: .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
        !            25: .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
        !            26: .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
        !            27: .\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
        !            28: .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
        !            29: .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
        !            30: .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
        !            31: .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
        !            32: .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
        !            33: .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
        !            34: .\" SUCH DAMAGE.
        !            35: .\"
        !            36: .\"    @(#)regex.3     8.4 (Berkeley) 3/20/94
        !            37: .\"
        !            38: .TH REGEX 3 "March 20, 1994"
1.1       jtc        39: .de ZR
                     40: .\" one other place knows this name:  the SEE ALSO section
1.2       cgd        41: .IR re_format (7) \\$1
1.1       jtc        42: ..
                     43: .SH NAME
                     44: regcomp, regexec, regerror, regfree \- regular-expression library
                     45: .SH SYNOPSIS
                     46: .ft B
                     47: .\".na
                     48: #include <sys/types.h>
                     49: .br
                     50: #include <regex.h>
                     51: .HP 10
                     52: int regcomp(regex_t\ *preg, const\ char\ *pattern, int\ cflags);
                     53: .HP
                     54: int\ regexec(const\ regex_t\ *preg, const\ char\ *string,
                     55: size_t\ nmatch, regmatch_t\ pmatch[], int\ eflags);
                     56: .HP
                     57: size_t\ regerror(int\ errcode, const\ regex_t\ *preg,
                     58: char\ *errbuf, size_t\ errbuf_size);
                     59: .HP
                     60: void\ regfree(regex_t\ *preg);
                     61: .\".ad
                     62: .ft
                     63: .SH DESCRIPTION
                     64: These routines implement POSIX 1003.2 regular expressions (``RE''s);
                     65: see
                     66: .ZR .
                     67: .I Regcomp
                     68: compiles an RE written as a string into an internal form,
                     69: .I regexec
                     70: matches that internal form against a string and reports results,
                     71: .I regerror
                     72: transforms error codes from either into human-readable messages,
                     73: and
                     74: .I regfree
                     75: frees any dynamically-allocated storage used by the internal form
                     76: of an RE.
                     77: .PP
                     78: The header
                     79: .I <regex.h>
                     80: declares two structure types,
                     81: .I regex_t
                     82: and
                     83: .IR regmatch_t ,
                     84: the former for compiled internal forms and the latter for match reporting.
                     85: It also declares the four functions,
                     86: a type
                     87: .IR regoff_t ,
                     88: and a number of constants with names starting with ``REG_''.
                     89: .PP
                     90: .I Regcomp
                     91: compiles the regular expression contained in the
                     92: .I pattern
                     93: string,
                     94: subject to the flags in
                     95: .IR cflags ,
                     96: and places the results in the
                     97: .I regex_t
                     98: structure pointed to by
                     99: .IR preg .
                    100: .I Cflags
                    101: is the bitwise OR of zero or more of the following flags:
                    102: .IP REG_EXTENDED \w'REG_EXTENDED'u+2n
                    103: Compile modern (``extended'') REs,
                    104: rather than the obsolete (``basic'') REs that
                    105: are the default.
                    106: .IP REG_BASIC
                    107: This is a synonym for 0,
                    108: provided as a counterpart to REG_EXTENDED to improve readability.
                    109: .IP REG_NOSPEC
                    110: Compile with recognition of all special characters turned off.
                    111: All characters are thus considered ordinary,
                    112: so the ``RE'' is a literal string.
                    113: This is an extension,
                    114: compatible with but not specified by POSIX 1003.2,
                    115: and should be used with
                    116: caution in software intended to be portable to other systems.
                    117: REG_EXTENDED and REG_NOSPEC may not be used
                    118: in the same call to
                    119: .IR regcomp .
                    120: .IP REG_ICASE
                    121: Compile for matching that ignores upper/lower case distinctions.
                    122: See
                    123: .ZR .
                    124: .IP REG_NOSUB
                    125: Compile for matching that need only report success or failure,
                    126: not what was matched.
                    127: .IP REG_NEWLINE
                    128: Compile for newline-sensitive matching.
                    129: By default, newline is a completely ordinary character with no special
                    130: meaning in either REs or strings.
                    131: With this flag,
                    132: `[^' bracket expressions and `.' never match newline,
                    133: a `^' anchor matches the null string after any newline in the string
                    134: in addition to its normal function,
                    135: and the `$' anchor matches the null string before any newline in the
                    136: string in addition to its normal function.
                    137: .IP REG_PEND
                    138: The regular expression ends,
                    139: not at the first NUL,
                    140: but just before the character pointed to by the
                    141: .I re_endp
                    142: member of the structure pointed to by
                    143: .IR preg .
                    144: The
                    145: .I re_endp
                    146: member is of type
                    147: .IR const\ char\ * .
                    148: This flag permits inclusion of NULs in the RE;
                    149: they are considered ordinary characters.
                    150: This is an extension,
                    151: compatible with but not specified by POSIX 1003.2,
                    152: and should be used with
                    153: caution in software intended to be portable to other systems.
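The effect of the compile-time flags can be seen with a small experiment. The following is a minimal sketch, not part of the library: re_matches is a hypothetical helper name, and it always adds REG_NOSUB because only success or failure is wanted.

```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the regex library).
 * Returns 1 if pattern matches text under the given cflags,
 * 0 if it does not, and -1 if the pattern fails to compile.
 */
static int re_matches(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: only success or failure is reported, no offsets. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Under these assumptions, re_matches("^b", "a\nb", REG_EXTENDED) should report no match, while adding REG_NEWLINE lets the `^' anchor match after the embedded newline. Conversely, "a.b" matches "a\nb" by default but not under REG_NEWLINE, since `.' then never matches newline.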
                    154: .PP
                    155: When successful,
                    156: .I regcomp
                    157: returns 0 and fills in the structure pointed to by
                    158: .IR preg .
                    159: One member of that structure
                    160: (other than
                    161: .IR re_endp )
                    162: is publicized:
                    163: .IR re_nsub ,
                    164: of type
                    165: .IR size_t ,
                    166: contains the number of parenthesized subexpressions within the RE
                    167: (except that the value of this member is undefined if the
                    168: REG_NOSUB flag was used).
                    169: If
                    170: .I regcomp
                    171: fails, it returns a non-zero error code;
                    172: see DIAGNOSTICS.
                    173: .PP
                    174: .I Regexec
                    175: matches the compiled RE pointed to by
                    176: .I preg
                    177: against the
                    178: .IR string ,
                    179: subject to the flags in
                    180: .IR eflags ,
                    181: and reports results using
                    182: .IR nmatch ,
                    183: .IR pmatch ,
                    184: and the returned value.
                    185: The RE must have been compiled by a previous invocation of
                    186: .IR regcomp .
                    187: The compiled form is not altered during execution of
                    188: .IR regexec ,
                    189: so a single compiled RE can be used simultaneously by multiple threads.
                    190: .PP
                    191: By default,
                    192: the NUL-terminated string pointed to by
                    193: .I string
                    194: is considered to be the text of an entire line, minus any terminating
                    195: newline.
                    196: The
                    197: .I eflags
                    198: argument is the bitwise OR of zero or more of the following flags:
                    199: .IP REG_NOTBOL \w'REG_STARTEND'u+2n
                    200: The first character of
                    201: the string
                    202: is not the beginning of a line, so the `^' anchor should not match before it.
                    203: This does not affect the behavior of newlines under REG_NEWLINE.
                    204: .IP REG_NOTEOL
                    205: The NUL terminating
                    206: the string
                    207: does not end a line, so the `$' anchor should not match before it.
                    208: This does not affect the behavior of newlines under REG_NEWLINE.
                    209: .IP REG_STARTEND
                    210: The string is considered to start at
                    211: \fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
                    212: and to have a terminating NUL located at
                    213: \fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
                    214: (there need not actually be a NUL at that location),
                    215: regardless of the value of
                    216: .IR nmatch .
                    217: See below for the definition of
                    218: .I pmatch
                    219: and
                    220: .IR nmatch .
                    221: This is an extension,
                    222: compatible with but not specified by POSIX 1003.2,
                    223: and should be used with
                    224: caution in software intended to be portable to other systems.
                    225: Note that a non-zero \fIrm_so\fR does not imply REG_NOTBOL;
                    226: REG_STARTEND affects only the location of the string,
                    227: not how it is matched.
                    228: .PP
                    229: See
                    230: .ZR
                    231: for a discussion of what is matched in situations where an RE or a
                    232: portion thereof could match any of several substrings of
                    233: .IR string .
                    234: .PP
                    235: Normally,
                    236: .I regexec
                    237: returns 0 for success and the non-zero code REG_NOMATCH for failure.
                    238: Other non-zero error codes may be returned in exceptional situations;
                    239: see DIAGNOSTICS.
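As an illustration of the return-value convention and the execution flags described above, here is a minimal sketch. The name match_line is hypothetical, and the choice of REG_EXTENDED with REG_NOSUB is an assumption made for the example, not a requirement.

```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the regex library).
 * Compiles an extended RE and runs it once against line with eflags.
 * Returns 0 on a match, REG_NOMATCH when nothing matched, or some
 * other non-zero code if compilation or matching failed.
 */
static int match_line(const char *pattern, const char *line, int eflags)
{
    regex_t re;
    int rc;

    rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    if (rc != 0)
        return rc;              /* compile error; nothing to free */
    rc = regexec(&re, line, 0, NULL, eflags);
    regfree(&re);
    return rc;
}
```

With this helper, match_line("^foo", "foobar", 0) should return 0, whereas passing REG_NOTBOL for the same pattern and string should yield REG_NOMATCH, because the `^' anchor is no longer allowed to match before the first character. Callers should treat any result other than 0 and REG_NOMATCH as an error.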
                    240: .PP
                    241: If REG_NOSUB was specified in the compilation of the RE,
                    242: or if
                    243: .I nmatch
                    244: is 0,
                    245: .I regexec
                    246: ignores the
                    247: .I pmatch
                    248: argument (but see below for the case where REG_STARTEND is specified).
                    249: Otherwise,
                    250: .I pmatch
                    251: points to an array of
                    252: .I nmatch
                    253: structures of type
                    254: .IR regmatch_t .
                    255: Such a structure has at least the members
                    256: .I rm_so
                    257: and
                    258: .IR rm_eo ,
                    259: both of type
                    260: .I regoff_t
                    261: (a signed arithmetic type at least as large as an
                    262: .I off_t
                    263: and a
                    264: .IR ssize_t ),
                    265: containing respectively the offset of the first character of a substring
                    266: and the offset of the first character after the end of the substring.
                    267: Offsets are measured from the beginning of the
                    268: .I string
                    269: argument given to
                    270: .IR regexec .
                    271: An empty substring is denoted by equal offsets,
                    272: both indicating the character following the empty substring.
                    273: .PP
                    274: The 0th member of the
                    275: .I pmatch
                    276: array is filled in to indicate what substring of
                    277: .I string
                    278: was matched by the entire RE.
                    279: Remaining members report what substring was matched by parenthesized
                    280: subexpressions within the RE;
                    281: member
                    282: .I i
                    283: reports subexpression
                    284: .IR i ,
                    285: with subexpressions counted (starting at 1) by the order of their opening
                    286: parentheses in the RE, left to right.
                    287: Unused entries in the array\(emcorresponding either to subexpressions that
                    288: did not participate in the match at all, or to subexpressions that do not
                    289: exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
                    290: .I rm_so
                    291: and
                    292: .I rm_eo
                    293: set to \-1.
                    294: If a subexpression participated in the match several times,
                    295: the reported substring is the last one it matched.
                    296: (Note, as an example in particular, that when the RE `(b*)+' matches `bbb',
                    297: the parenthesized subexpression matches each of the three `b's and then
                    298: an infinite number of empty strings following the last `b',
                    299: so the reported substring is one of the empties.)
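The reporting rules for pmatch can be sketched as follows. This is an illustrative example only: find_groups is a hypothetical helper name, and the pattern and sample text in the note below are made up for the demonstration.

```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the regex library).
 * Compiles pattern as an extended RE, matches it against text, and
 * stores up to n match records in pm.  pm[0] describes the whole
 * match; pm[i] describes parenthesized subexpression i.
 * Returns the regcomp or regexec result (0 means a match was found).
 */
static int find_groups(const char *pattern, const char *text,
                       regmatch_t *pm, size_t n)
{
    regex_t re;
    int rc;

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;
    rc = regexec(&re, text, n, pm, 0);
    regfree(&re);
    return rc;
}
```

For example, matching "([a-z]+)@([a-z]+)(\.org)?" against "mail bob@example now" with n set to 4 should give offsets 5 and 16 for the whole match, 5 and 8 for the first subexpression, and 9 and 16 for the second. The third subexpression does not participate in the match, so both of its offsets should be \-1.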
                    300: .PP
                    301: If REG_STARTEND is specified,
                    302: .I pmatch
                    303: must point to at least one
                    304: .I regmatch_t
                    305: (even if
                    306: .I nmatch
                    307: is 0 or REG_NOSUB was specified),
                    308: to hold the input offsets for REG_STARTEND.
                    309: Use for output is still entirely controlled by
                    310: .IR nmatch ;
                    311: if
                    312: .I nmatch
                    313: is 0 or REG_NOSUB was specified,
                    314: the value of
                    315: .IR pmatch [0]
                    316: will not be changed by a successful
                    317: .IR regexec .
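A possible use of REG_STARTEND is to search part of a buffer without copying it. The sketch below is hedged accordingly: match_range is a hypothetical helper, the \-1 return value is a sentinel invented for this example, and the #ifdef guard reflects the fact that REG_STARTEND is an extension that other systems may lack.

```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the regex library).
 * Matches the compiled RE against the bytes text[start] through
 * text[end - 1].  On success, out holds offsets measured from the
 * beginning of text, not from start.  Returns the regexec result,
 * or -1 (an invented sentinel) where REG_STARTEND is unavailable.
 */
static int match_range(const regex_t *re, const char *text,
                       size_t start, size_t end, regmatch_t *out)
{
#ifdef REG_STARTEND
    /* pmatch[0] carries the input range in, and the match out. */
    out->rm_so = (regoff_t)start;
    out->rm_eo = (regoff_t)end;
    return regexec(re, text, 1, out, REG_STARTEND);
#else
    (void)re; (void)text; (void)start; (void)end; (void)out;
    return -1;
#endif
}
```

Assuming the extension is present, searching for "b+" in "bbb abbbb c" with start 4 and end 11 should skip the leading run of `b's and report offsets 5 and 9.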
                    318: .PP
                    319: .I Regerror
                    320: maps a non-zero
                    321: .I errcode
                    322: from either
                    323: .I regcomp
                    324: or
                    325: .I regexec
                    326: to a human-readable, printable message.
                    327: If
                    328: .I preg
                    329: is non-NULL,
                    330: the error code should have arisen from use of
                    331: the
                    332: .I regex_t
                    333: pointed to by
                    334: .IR preg ,
                    335: and if the error code came from
                    336: .IR regcomp ,
                    337: it should have been the result from the most recent
                    338: .I regcomp
                    339: using that
                    340: .IR regex_t .
                    341: .RI ( Regerror
                    342: may be able to supply a more detailed message using information
                    343: from the
                    344: .IR regex_t .)
                    345: .I Regerror
                    346: places the NUL-terminated message into the buffer pointed to by
                    347: .IR errbuf ,
                    348: limiting the length (including the NUL) to at most
                    349: .I errbuf_size
                    350: bytes.
                    351: If the whole message won't fit,
                    352: as much of it as will fit before the terminating NUL is supplied.
                    353: In any case,
                    354: the returned value is the size of buffer needed to hold the whole
                    355: message (including terminating NUL).
                    356: If
                    357: .I errbuf_size
                    358: is 0,
                    359: .I errbuf
                    360: is ignored but the return value is still correct.
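Because the return value is the size needed even when errbuf_size is 0, a caller can size the buffer with a first call and fill it with a second. A minimal sketch of that pattern follows; re_strerror is a hypothetical helper name, and the use of malloc is an assumption of the example.

```c
#include <sys/types.h>
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of the regex library).
 * Returns a freshly allocated, NUL-terminated message for errcode,
 * or NULL if memory cannot be obtained.  The caller frees the result.
 */
static char *re_strerror(int errcode, const regex_t *preg)
{
    /* First call: errbuf_size is 0, so errbuf is ignored and the
     * return value is the buffer size needed, including the NUL. */
    size_t need = regerror(errcode, preg, NULL, 0);
    char *buf;

    if (need == 0)
        return NULL;
    buf = malloc(need);
    if (buf != NULL)
        (void)regerror(errcode, preg, buf, need);  /* second call fills it */
    return buf;
}
```

A fixed-size buffer on the stack is often sufficient in practice; the two-call form simply avoids truncating long messages.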
                    361: .PP
                    362: If the
                    363: .I errcode
                    364: given to
                    365: .I regerror
                    366: is first ORed with REG_ITOA,
                    367: the ``message'' that results is the printable name of the error code,
                    368: e.g. ``REG_NOMATCH'',
                    369: rather than an explanation thereof.
                    370: If
                    371: .I errcode
                    372: is REG_ATOI,
                    373: then
                    374: .I preg
                    375: shall be non-NULL and the
                    376: .I re_endp
                    377: member of the structure it points to
                    378: must point to the printable name of an error code;
                    379: in this case, the result in
                    380: .I errbuf
                    381: is the decimal digits of
                    382: the numeric value of the error code
                    383: (0 if the name is not recognized).
                    384: REG_ITOA and REG_ATOI are intended primarily as debugging facilities;
                    385: they are extensions,
                    386: compatible with but not specified by POSIX 1003.2,
                    387: and should be used with
                    388: caution in software intended to be portable to other systems.
                    389: Be warned also that they are considered experimental and changes are possible.
                    390: .PP
                    391: .I Regfree
                    392: frees any dynamically-allocated storage associated with the compiled RE
                    393: pointed to by
                    394: .IR preg .
                    395: The remaining
                    396: .I regex_t
                    397: is no longer a valid compiled RE
                    398: and the effect of supplying it to
                    399: .I regexec
                    400: or
                    401: .I regerror
                    402: is undefined.
                    403: .PP
                    404: None of these functions references global variables except for tables
                    405: of constants;
                    406: all are safe for use from multiple threads if the arguments are safe.
                    407: .SH IMPLEMENTATION CHOICES
                    408: There are a number of decisions that 1003.2 leaves up to the implementor,
                    409: either by explicitly saying ``undefined'' or by virtue of them being
                    410: forbidden by the RE grammar.
                    411: This implementation treats them as follows.
                    412: .PP
                    413: See
                    414: .ZR
                    415: for a discussion of the definition of case-independent matching.
                    416: .PP
                    417: There is no particular limit on the length of REs,
                    418: except insofar as memory is limited.
                    419: Memory usage is approximately linear in RE size, and largely insensitive
                    420: to RE complexity, except for bounded repetitions.
                    421: See BUGS for one short RE using them
                    422: that will run almost any system out of memory.
                    423: .PP
                    424: A backslashed character other than one specifically given a magic meaning
                    425: by 1003.2 (such magic meanings occur only in obsolete [``basic''] REs)
                    426: is taken as an ordinary character.
                    427: .PP
                    428: Any unmatched [ is a REG_EBRACK error.
                    429: .PP
                    430: Equivalence classes cannot begin or end bracket-expression ranges.
                    431: The endpoint of one range cannot begin another.
                    432: .PP
                    433: RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
                    434: .PP
                    435: A repetition operator (?, *, +, or bounds) cannot follow another
                    436: repetition operator.
                    437: A repetition operator cannot begin an expression or subexpression
                    438: or follow `^' or `|'.
                    439: .PP
                    440: `|' cannot appear first or last in a (sub)expression or after another `|',
                    441: i.e. an operand of `|' cannot be an empty subexpression.
                    442: An empty parenthesized subexpression, `()', is legal and matches an
                    443: empty (sub)string.
                    444: An empty string is not a legal RE.
                    445: .PP
                    446: A `{' followed by a digit is considered the beginning of bounds for a
                    447: bounded repetition, which must then follow the syntax for bounds.
                    448: A `{' \fInot\fR followed by a digit is considered an ordinary character.
                    449: .PP
                    450: `^' and `$' beginning and ending subexpressions in obsolete (``basic'')
                    451: REs are anchors, not ordinary characters.
                    452: .SH SEE ALSO
1.2       cgd       453: grep(1), re_format(7)
1.1       jtc       454: .PP
                    455: POSIX 1003.2, sections 2.8 (Regular Expression Notation)
                    456: and
                    457: B.5 (C Binding for Regular Expression Matching).
                    458: .SH DIAGNOSTICS
                    459: Non-zero error codes from
                    460: .I regcomp
                    461: and
                    462: .I regexec
                    463: include the following:
                    464: .PP
                    465: .nf
                    466: .ta \w'REG_ECOLLATE'u+3n
                    467: REG_NOMATCH    regexec() failed to match
                    468: REG_BADPAT     invalid regular expression
                    469: REG_ECOLLATE   invalid collating element
                    470: REG_ECTYPE     invalid character class
                    471: REG_EESCAPE    \e applied to unescapable character
                    472: REG_ESUBREG    invalid backreference number
                    473: REG_EBRACK     brackets [ ] not balanced
                    474: REG_EPAREN     parentheses ( ) not balanced
                    475: REG_EBRACE     braces { } not balanced
                    476: REG_BADBR      invalid repetition count(s) in { }
                    477: REG_ERANGE     invalid character range in [ ]
                    478: REG_ESPACE     ran out of memory
                    479: REG_BADRPT     ?, *, or + operand invalid
                    480: REG_EMPTY      empty (sub)expression
                    481: REG_ASSERT     ``can't happen''\(emyou found a bug
                    482: REG_INVARG     invalid argument, e.g. negative-length string
                    483: .fi
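To see which of these codes a given pattern produces, the result of regcomp can be inspected directly. The following is a small sketch under stated assumptions: compile_status is a hypothetical helper, and the exact code returned for a malformed pattern may differ between implementations.

```c
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper (not part of the regex library).
 * Returns 0 if pattern compiles under cflags, otherwise the
 * non-zero error code reported by regcomp.
 */
static int compile_status(const char *pattern, int cflags)
{
    regex_t re;
    int rc = regcomp(&re, pattern, cflags);

    if (rc == 0)
        regfree(&re);   /* only a successfully compiled RE is freed */
    return rc;
}
```

With this implementation, compile_status("[a", 0) is expected to return REG_EBRACK, and compile_status("(a", REG_EXTENDED) a non-zero code such as REG_EPAREN. The returned code can be passed to regerror to obtain a printable message.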
                    484: .SH HISTORY
1.3     ! cgd       485: Originally written by Henry Spencer.
        !           486: Altered for inclusion in the 4.4BSD distribution.
1.1       jtc       487: .SH BUGS
                    488: This is an alpha release with known defects.
                    489: Please report problems.
                    490: .PP
                    491: There is one known functionality bug.
                    492: The implementation of internationalization is incomplete:
                    493: the locale is always assumed to be the default one of 1003.2,
                    494: and only the collating elements etc. of that locale are available.
                    495: .PP
                    496: The back-reference code is subtle and doubts linger about its correctness
                    497: in complex cases.
                    498: .PP
                    499: .I Regexec
                    500: performance is poor.
                    501: This will improve with later releases.
                    502: .I Nmatch
                    503: exceeding 0 is expensive;
                    504: .I nmatch
                    505: exceeding 1 is worse.
                    506: .I Regexec
                    507: is largely insensitive to RE complexity \fIexcept\fR that back
                    508: references are massively expensive.
                    509: RE length does matter; in particular, there is a strong speed bonus
                    510: for keeping RE length under about 30 characters,
                    511: with most special characters counting roughly double.
                    512: .PP
                    513: .I Regcomp
                    514: implements bounded repetitions by macro expansion,
                    515: which is costly in time and space if counts are large
                    516: or bounded repetitions are nested.
                    517: An RE like, say,
                    518: `((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
                    519: will (eventually) run almost any existing machine out of swap space.
                    520: .PP
                    521: There are suspected problems with response to obscure error conditions.
                    522: Notably,
                    523: certain kinds of internal overflow,
                    524: produced only by truly enormous REs or by multiply nested bounded repetitions,
                    525: are probably not handled well.
                    526: .PP
                    527: Due to a mistake in 1003.2, things like `a)b' are legal REs because `)' is
                    528: a special character only in the presence of a previous unmatched `('.
                    529: This can't be fixed until the spec is fixed.
                    530: .PP
                    531: The standard's definition of back references is vague.
                    532: For example, does
                    533: `a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
                    534: Until the standard is clarified,
                    535: behavior in such cases should not be relied on.
                    536: .PP
                    537: The implementation of word-boundary matching is a bit of a kludge,
                    538: and bugs may lurk in combinations of word-boundary matching and anchoring.