src/lib/libc/regex/re_format.7 - annotate

Return to re_format.7 CVS log
Up to [cvs.NetBSD.org] / src / lib / libc / regex
Annotation of src/lib/libc/regex/re_format.7, Revision 1.11

1.11    ! wiz         1: .\"    $NetBSD: re_format.7,v 1.10 2013/01/25 11:51:42 wiz Exp $
1.5       cgd         2: .\"
1.4       cgd         3: .\" Copyright (c) 1992, 1993, 1994
                      4: .\"    The Regents of the University of California.  All rights reserved.
1.8       agc         5: .\"
                      6: .\" This code is derived from software contributed to Berkeley by
                      7: .\" Henry Spencer.
                      8: .\"
                      9: .\" Redistribution and use in source and binary forms, with or without
                     10: .\" modification, are permitted provided that the following conditions
                     11: .\" are met:
                     12: .\" 1. Redistributions of source code must retain the above copyright
                     13: .\"    notice, this list of conditions and the following disclaimer.
                     14: .\" 2. Redistributions in binary form must reproduce the above copyright
                     15: .\"    notice, this list of conditions and the following disclaimer in the
                     16: .\"    documentation and/or other materials provided with the distribution.
                     17: .\" 3. Neither the name of the University nor the names of its contributors
                     18: .\"    may be used to endorse or promote products derived from this software
                     19: .\"    without specific prior written permission.
                     20: .\"
                     21: .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
                     22: .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
                     23: .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
                     24: .\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
                     25: .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
                     26: .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
                     27: .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
                     28: .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
                     29: .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
                     30: .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
                     31: .\" SUCH DAMAGE.
                     32: .\"
                     33: .\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
1.4       cgd        34: .\"
                     35: .\" This code is derived from software contributed to Berkeley by
                     36: .\" Henry Spencer.
                     37: .\"
                     38: .\" Redistribution and use in source and binary forms, with or without
                     39: .\" modification, are permitted provided that the following conditions
                     40: .\" are met:
                     41: .\" 1. Redistributions of source code must retain the above copyright
                     42: .\"    notice, this list of conditions and the following disclaimer.
                     43: .\" 2. Redistributions in binary form must reproduce the above copyright
                     44: .\"    notice, this list of conditions and the following disclaimer in the
                     45: .\"    documentation and/or other materials provided with the distribution.
                     46: .\" 3. All advertising materials mentioning features or use of this software
                     47: .\"    must display the following acknowledgement:
                     48: .\"    This product includes software developed by the University of
                     49: .\"    California, Berkeley and its contributors.
                     50: .\" 4. Neither the name of the University nor the names of its contributors
                     51: .\"    may be used to endorse or promote products derived from this software
                     52: .\"    without specific prior written permission.
                     53: .\"
                     54: .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
                     55: .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
                     56: .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
                     57: .\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
                     58: .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
                     59: .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
                     60: .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
                     61: .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
                     62: .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
                     63: .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
                     64: .\" SUCH DAMAGE.
                     65: .\"
                     66: .\"    @(#)re_format.7 8.3 (Berkeley) 3/20/94
                     67: .\"
1.9       joerg      68: .Dd March 20, 1994
                     69: .Dt RE_FORMAT 7
                     70: .Os
                     71: .Sh NAME
                     72: .Nm re_format
                     73: .Nd POSIX 1003.2 regular expressions
                     74: .Sh DESCRIPTION
1.1       jtc        75: Regular expressions (``RE''s),
                     76: as defined in POSIX 1003.2, come in two forms:
                     77: modern REs (roughly those of
1.9       joerg      78: .Xr egrep 1 ;
1.1       jtc        79: 1003.2 calls these ``extended'' REs)
                     80: and obsolete REs (roughly those of
1.9       joerg      81: .Xr ed 1 ;
1.1       jtc        82: 1003.2 ``basic'' REs).
                     83: Obsolete REs mostly exist for backward compatibility in some old programs;
                     84: they will be discussed at the end.
                     85: 1003.2 leaves some aspects of RE syntax and semantics open;
1.11    ! wiz        86: `(*)' marks decisions on these aspects that
1.1       jtc        87: may not be fully portable to other 1003.2 implementations.
1.9       joerg      88: .Pp
1.11    ! wiz        89: A (modern) RE is one(*) or more non-empty(*)
1.9       joerg      90: .Em branches ,
1.1       jtc        91: separated by `|'.
                     92: It matches anything that matches one of the branches.
1.9       joerg      93: .Pp
1.11    ! wiz        94: A branch is one(*) or more
1.9       joerg      95: .Em pieces ,
                     96: concatenated.
1.1       jtc        97: It matches a match for the first, followed by a match for the second, etc.
1.9       joerg      98: .Pp
                     99: A piece is an
                    100: .Em atom
                    101: possibly followed
1.11    ! wiz       102: by a single(*) `*', `+', `?', or
1.9       joerg     103: .Em bound .
1.1       jtc       104: An atom followed by `*' matches a sequence of 0 or more matches of the atom.
                    105: An atom followed by `+' matches a sequence of 1 or more matches of the atom.
                    106: An atom followed by `?' matches a sequence of 0 or 1 matches of the atom.
1.9       joerg     107: .Pp
                    108: A
                    109: .Em bound
                    110: is `{' followed by an unsigned decimal integer, possibly followed by `,'
1.1       jtc       111: possibly followed by another unsigned decimal integer,
                    112: always followed by `}'.
1.11    ! wiz       113: The integers must lie between 0 and RE_DUP_MAX (255(*)) inclusive,
1.1       jtc       114: and if there are two of them, the first may not exceed the second.
1.9       joerg     115: An atom followed by a bound containing one integer
                    116: .Em i
                    117: and no comma matches a sequence of exactly
                    118: .Em i
                    119: matches of the atom.
                    120: An atom followed by a bound containing one integer
                    121: .Em i
                    122: and a comma matches a sequence of
                    123: .Em i
                    124: or more matches of the atom.
                    125: An atom followed by a bound containing two integers
                    126: .Em i
                    127: and
                    128: .Em j
                    129: matches a sequence of
                    130: .Em i
                    131: through
                    132: .Em j
                    133: (inclusive) matches of the atom.
                    134: .Pp
1.1       jtc       135: An atom is a regular expression enclosed in `()' (matching a match for the
1.11    ! wiz       136: regular expression), an empty set of `()' (matching the null string)(*), a
1.9       joerg     137: .Em bracket expression
                    138: (see below), `.' (matching any single character),
                    139: `^' (matching the null string at the beginning of a line),
                    140: `$' (matching the null string at the end of a line),
                    141: a `\e' followed by one of the characters `^.[$()|*+?{\e'
1.1       jtc       142: (matching that character taken as an ordinary character),
1.11    ! wiz       143: a `\e' followed by any other character(*)
1.1       jtc       144: (matching that character taken as an ordinary character,
1.11    ! wiz       145: as if the `\e' had not been present(*)),
1.1       jtc       146: or a single character with no other significance (matching that character).
                    147: A `{' followed by a character other than a digit is an ordinary
1.11    ! wiz       148: character, not the beginning of a bound(*).
1.1       jtc       149: It is illegal to end an RE with `\e'.
1.9       joerg     150: .Pp
                    151: A
                    152: .Em bracket expression
                    153: is a list of characters enclosed in `[]'.
1.1       jtc       154: It normally matches any single character from the list (but see below).
                    155: If the list begins with `^',
1.9       joerg     156: it matches any single character (but see below)
                    157: .Em not
                    158: from the rest of the list.
1.1       jtc       159: If two characters in the list are separated by `\-', this is shorthand
1.9       joerg     160: for the full
                    161: .Em range
                    162: of characters between those two (inclusive) in the collating sequence,
1.1       jtc       163: e.g. `[0-9]' in ASCII matches any decimal digit.
1.11    ! wiz       164: It is illegal(*) for two ranges to share an endpoint, e.g. `a-c-e'.
1.1       jtc       165: Ranges are very collating-sequence-dependent,
                    166: and portable programs should avoid relying on them.
1.9       joerg     167: .Pp
1.1       jtc       168: To include a literal `]' in the list, make it the first character
                    169: (following a possible `^').
                    170: To include a literal `\-', make it the first or last character,
                    171: or the second endpoint of a range.
                    172: To use a literal `\-' as the first endpoint of a range,
                    173: enclose it in `[.' and `.]' to make it a collating element (see below).
                    174: With the exception of these and some combinations using `[' (see next
                    175: paragraphs), all other special characters, including `\e', lose their
                    176: special significance within a bracket expression.
1.9       joerg     177: .Pp
1.1       jtc       178: Within a bracket expression, a collating element (a character,
                    179: a multi-character sequence that collates as if it were a single character,
                    180: or a collating-sequence name for either)
                    181: enclosed in `[.' and `.]' stands for the
                    182: sequence of characters of that collating element.
                    183: The sequence is a single element of the bracket expression's list.
1.6       wiz       184: A bracket expression containing a multi-character collating element
1.1       jtc       185: can thus match more than one character,
                    186: e.g. if the collating sequence includes a `ch' collating element,
                    187: then the RE `[[.ch.]]*c' matches the first five characters
                    188: of `chchcc'.
1.9       joerg     189: .Pp
1.1       jtc       190: Within a bracket expression, a collating element enclosed in `[=' and
                    191: `=]' is an equivalence class, standing for the sequences of characters
                    192: of all collating elements equivalent to that one, including itself.
                    193: (If there are no other equivalent collating elements,
                    194: the treatment is as if the enclosing delimiters were `[.' and `.]'.)
1.9       joerg     195: For example, if o and '\(^o' are the members of an equivalence class,
                    196: then `[[=o=]]', `[[=\(^o'=]]', and `[o\(^o']' are all synonymous.
1.11    ! wiz       197: An equivalence class may not(*) be an endpoint
1.1       jtc       198: of a range.
1.9       joerg     199: .Pp
                    200: Within a bracket expression, the name of a
                    201: .Em character class
                    202: enclosed in `[:' and `:]' stands for the list of all characters
                    203: belonging to that class.
1.1       jtc       204: Standard character class names are:
1.9       joerg     205: .Bl -column "alnum" "digit" "xdigit"
                    206: .It alnum      digit   punct
                    207: .It alpha      graph   space
                    208: .It blank      lower   upper
                    209: .It cntrl      print   xdigit
                    210: .El
                    211: .Pp
1.1       jtc       212: These stand for the character classes defined in
1.9       joerg     213: .Xr ctype 3 .
1.1       jtc       214: A locale may provide others.
                    215: A character class may not be used as an endpoint of a range.
1.9       joerg     216: .Pp
1.11    ! wiz       217: There are two special cases(*) of bracket expressions:
1.9       joerg     218: the bracket expressions `[[:\*[Lt]:]]' and `[[:\*[Gt]:]]' match
                    219: the null string at the beginning and end of a word respectively.
                    220: A word is defined as a sequence of word characters
                    221: which is neither preceded nor followed by word characters.
1.2       jtc       222: A word character is an
1.9       joerg     223: .Em alnum
1.2       jtc       224: character (as defined by
1.9       joerg     225: .Xr ctype 3 )
1.2       jtc       226: or an underscore.
1.9       joerg     227: This is an extension, compatible with but not specified by POSIX 1003.2,
                    228: and should be used with caution in software intended to be portable
                    229: to other systems.
                    230: .Pp
1.1       jtc       231: In the event that an RE could match more than one substring of a given
1.9       joerg     232: string, the RE matches the one starting earliest in the string.
1.1       jtc       233: If the RE could match more than one substring starting at that point,
                    234: it matches the longest.
                    235: Subexpressions also match the longest possible substrings, subject to
                    236: the constraint that the whole match be as long as possible,
                    237: with subexpressions starting earlier in the RE taking priority over
                    238: ones starting later.
                    239: Note that higher-level subexpressions thus take priority over
                    240: their lower-level component subexpressions.
1.9       joerg     241: .Pp
1.1       jtc       242: Match lengths are measured in characters, not collating elements.
                    243: A null string is considered longer than no match at all.
                    244: For example,
                    245: `bb*' matches the three middle characters of `abbbc',
                    246: `(wee|week)(knights|nights)' matches all ten characters of `weeknights',
                    247: when `(.*).*' is matched against `abc' the parenthesized subexpression
                    248: matches all three characters, and
                    249: when `(a*)*' is matched against `bc' both the whole RE and the parenthesized
                    250: subexpression match the null string.
1.9       joerg     251: .Pp
1.1       jtc       252: If case-independent matching is specified,
                    253: the effect is much as if all case distinctions had vanished from the
                    254: alphabet.
                    255: When an alphabetic that exists in multiple cases appears as an
                    256: ordinary character outside a bracket expression, it is effectively
                    257: transformed into a bracket expression containing both cases,
                    258: e.g. `x' becomes `[xX]'.
                    259: When it appears inside a bracket expression, all case counterparts
                    260: of it are added to the bracket expression, so that (e.g.) `[x]'
                    261: becomes `[xX]' and `[^x]' becomes `[^xX]'.
1.9       joerg     262: .Pp
1.11    ! wiz       263: No particular limit is imposed on the length of REs(*).
1.1       jtc       264: Programs intended to be portable should not employ REs longer
                    265: than 256 bytes,
                    266: as an implementation can refuse to accept such REs and remain
                    267: POSIX-compliant.
1.9       joerg     268: .Pp
1.1       jtc       269: Obsolete (``basic'') regular expressions differ in several respects.
                    270: `|', `+', and `?' are ordinary characters and there is no equivalent
                    271: for their functionality.
                    272: The delimiters for bounds are `\e{' and `\e}',
                    273: with `{' and `}' by themselves ordinary characters.
                    274: The parentheses for nested subexpressions are `\e(' and `\e)',
                    275: with `(' and `)' by themselves ordinary characters.
                    276: `^' is an ordinary character except at the beginning of the
1.11    ! wiz       277: RE or(*) the beginning of a parenthesized subexpression,
1.1       jtc       278: `$' is an ordinary character except at the end of the
1.11    ! wiz       279: RE or(*) the end of a parenthesized subexpression,
1.1       jtc       280: and `*' is an ordinary character if it appears at the beginning of the
                    281: RE or the beginning of a parenthesized subexpression
                    282: (after a possible leading `^').
1.9       joerg     283: Finally, there is one new type of atom, a
                    284: .Em back reference :
                    285: `\e' followed by a non-zero decimal digit
                    286: .Em d
1.1       jtc       287: matches the same sequence of characters
1.9       joerg     288: matched by the
                    289: .Em d Ns th parenthesized subexpression
1.1       jtc       290: (numbering subexpressions by the positions of their opening parentheses,
                    291: left to right),
                    292: so that (e.g.) `\e([bc]\e)\e1' matches `bb' or `cc' but not `bc'.
1.9       joerg     293: .Sh SEE ALSO
                    294: .Xr regex 3
                    295: .Pp
1.1       jtc       296: POSIX 1003.2, section 2.8 (Regular Expression Notation).
1.9       joerg     297: .Sh BUGS
1.1       jtc       298: Having two kinds of REs is a botch.
1.9       joerg     299: .Pp
1.1       jtc       300: The current 1003.2 spec says that `)' is an ordinary character in
                    301: the absence of an unmatched `(';
1.9       joerg     302: this was an unintentional result of a wording error, and change is likely.
1.1       jtc       303: Avoid relying on it.
1.9       joerg     304: .Pp
1.1       jtc       305: Back references are a dreadful botch,
                    306: posing major problems for efficient implementations.
                    307: They are also somewhat vaguely defined
1.9       joerg     308: (does `a\e(\e(b\e)*\e2\e)*d' match `abbbd'?).
1.1       jtc       309: Avoid using them.
1.9       joerg     310: .Pp
1.1       jtc       311: 1003.2's specification of case-independent matching is vague.
                    312: The ``one case implies all cases'' definition given above
                    313: is current consensus among implementors as to the right interpretation.
1.9       joerg     314: .Pp
1.2       jtc       315: The syntax for word boundaries is incredibly ugly.
CVSweb <webmaster@jp.NetBSD.org>