
Annotation of src/lib/libc/regex/re_format.7, Revision 1.14

1.14    ! wiz         1: .\" $NetBSD$
        !             2: .\"
1.13      christos    3: .\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
1.4       cgd         4: .\" Copyright (c) 1992, 1993, 1994
                      5: .\"    The Regents of the University of California.  All rights reserved.
1.8       agc         6: .\"
                      7: .\" This code is derived from software contributed to Berkeley by
                      8: .\" Henry Spencer.
                      9: .\"
                     10: .\" Redistribution and use in source and binary forms, with or without
                     11: .\" modification, are permitted provided that the following conditions
                     12: .\" are met:
                     13: .\" 1. Redistributions of source code must retain the above copyright
                     14: .\"    notice, this list of conditions and the following disclaimer.
                     15: .\" 2. Redistributions in binary form must reproduce the above copyright
                     16: .\"    notice, this list of conditions and the following disclaimer in the
                     17: .\"    documentation and/or other materials provided with the distribution.
1.4       cgd        18: .\" 3. All advertising materials mentioning features or use of this software
                     19: .\"    must display the following acknowledgement:
                     20: .\"    This product includes software developed by the University of
                     21: .\"    California, Berkeley and its contributors.
                     22: .\" 4. Neither the name of the University nor the names of its contributors
                     23: .\"    may be used to endorse or promote products derived from this software
                     24: .\"    without specific prior written permission.
                     25: .\"
                     26: .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
                     27: .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
                     28: .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
                     29: .\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
                     30: .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
                     31: .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
                     32: .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
                     33: .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
                     34: .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
                     35: .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
                     36: .\" SUCH DAMAGE.
                     37: .\"
                     38: .\"    @(#)re_format.7 8.3 (Berkeley) 3/20/94
1.13      christos   39: .\" $FreeBSD: head/lib/libc/regex/re_format.7 314373 2017-02-28 05:14:42Z glebius $
1.4       cgd        40: .\"
1.13      christos   41: .Dd February 22, 2021
1.9       joerg      42: .Dt RE_FORMAT 7
                     43: .Os
                     44: .Sh NAME
                     45: .Nm re_format
                     46: .Nd POSIX 1003.2 regular expressions
                     47: .Sh DESCRIPTION
1.13      christos   48: Regular expressions
                     49: .Pq Dq RE Ns s ,
                     50: as defined in
                     51: .St -p1003.2 ,
                     52: come in two forms:
1.1       jtc        53: modern REs (roughly those of
1.9       joerg      54: .Xr egrep 1 ;
1.13      christos   55: 1003.2 calls these
                     56: .Dq extended
                     57: REs)
1.1       jtc        58: and obsolete REs (roughly those of
1.9       joerg      59: .Xr ed 1 ;
1.13      christos   60: 1003.2
                     61: .Dq basic
                     62: REs).
1.1       jtc        63: Obsolete REs mostly exist for backward compatibility in some old programs;
                     64: they will be discussed at the end.
1.13      christos   65: .St -p1003.2
                     66: leaves some aspects of RE syntax and semantics open;
                     67: `\(dd' marks decisions on these aspects that
                     68: may not be fully portable to other
                     69: .St -p1003.2
                     70: implementations.
1.9       joerg      71: .Pp
1.13      christos   72: A (modern) RE is one\(dd or more non-empty\(dd
1.9       joerg      73: .Em branches ,
1.13      christos   74: separated by
                     75: .Ql \&| .
1.1       jtc        76: It matches anything that matches one of the branches.
1.9       joerg      77: .Pp
1.13      christos   78: A branch is one\(dd or more
1.9       joerg      79: .Em pieces ,
                     80: concatenated.
1.1       jtc        81: It matches a match for the first, followed by a match for the second, etc.
1.9       joerg      82: .Pp
                     83: A piece is an
                     84: .Em atom
                     85: possibly followed
1.13      christos   86: by a single\(dd
                     87: .Ql \&* ,
                     88: .Ql \&+ ,
                     89: .Ql \&? ,
                     90: or
1.9       joerg      91: .Em bound .
1.13      christos   92: An atom followed by
                     93: .Ql \&*
                     94: matches a sequence of 0 or more matches of the atom.
                     95: An atom followed by
                     96: .Ql \&+
                     97: matches a sequence of 1 or more matches of the atom.
                     98: An atom followed by
                     99: .Ql ?\&
                    100: matches a sequence of 0 or 1 matches of the atom.
1.9       joerg     101: .Pp
                    102: A
                    103: .Em bound
1.13      christos  104: is
                    105: .Ql \&{
                    106: followed by an unsigned decimal integer,
                    107: possibly followed by
                    108: .Ql \&,
1.1       jtc       109: possibly followed by another unsigned decimal integer,
1.13      christos  110: always followed by
                    111: .Ql \&} .
                    112: The integers must lie between 0 and
                    113: .Dv RE_DUP_MAX
                    114: (255\(dd) inclusive,
1.1       jtc       115: and if there are two of them, the first may not exceed the second.
1.9       joerg     116: An atom followed by a bound containing one integer
                    117: .Em i
1.13      christos  118: and no comma matches
                    119: a sequence of exactly
1.9       joerg     120: .Em i
                    121: matches of the atom.
1.13      christos  122: An atom followed by a bound
                    123: containing one integer
1.9       joerg     124: .Em i
1.13      christos  125: and a comma matches
                    126: a sequence of
1.9       joerg     127: .Em i
                    128: or more matches of the atom.
1.13      christos  129: An atom followed by a bound
                    130: containing two integers
1.9       joerg     131: .Em i
                    132: and
                    133: .Em j
1.13      christos  134: matches
                    135: a sequence of
1.9       joerg     136: .Em i
                    137: through
                    138: .Em j
                    139: (inclusive) matches of the atom.
                    140: .Pp
1.13      christos  141: An atom is a regular expression enclosed in
                    142: .Ql ()
                    143: (matching a match for the
                    144: regular expression),
                    145: an empty set of
                    146: .Ql ()
                    147: (matching the null string)\(dd,
                    148: a
1.9       joerg     149: .Em bracket expression
1.13      christos  150: (see below),
                    151: .Ql .\&
                    152: (matching any single character),
                    153: .Ql \&^
                    154: (matching the null string at the beginning of a line),
                    155: .Ql \&$
                    156: (matching the null string at the end of a line), a
                    157: .Ql \e
                    158: followed by one of the characters
                    159: .Ql ^.[$()|*+?{\e
1.1       jtc       160: (matching that character taken as an ordinary character),
1.13      christos  161: a
                    162: .Ql \e
                    163: followed by any other character\(dd
1.1       jtc       164: (matching that character taken as an ordinary character,
1.13      christos  165: as if the
                    166: .Ql \e
                    167: had not been present\(dd),
1.1       jtc       168: or a single character with no other significance (matching that character).
1.13      christos  169: A
                    170: .Ql \&{
                    171: followed by a character other than a digit is an ordinary
                    172: character, not the beginning of a bound\(dd.
                    173: It is illegal to end an RE with
                    174: .Ql \e .
1.9       joerg     175: .Pp
                    176: A
                    177: .Em bracket expression
1.13      christos  178: is a list of characters enclosed in
                    179: .Ql [] .
1.1       jtc       180: It normally matches any single character from the list (but see below).
1.13      christos  181: If the list begins with
                    182: .Ql \&^ ,
                    183: it matches any single character
                    184: (but see below)
1.9       joerg     185: .Em not
                    186: from the rest of the list.
1.13      christos  187: If two characters in the list are separated by
                    188: .Ql \&- ,
                    189: this is shorthand
1.9       joerg     190: for the full
                    191: .Em range
1.13      christos  192: of characters between those two (inclusive) in the
                    193: collating sequence,
                    194: .No e.g. Ql [0-9]
                    195: in ASCII matches any decimal digit.
                    196: It is illegal\(dd for two ranges to share an
                    197: endpoint,
                    198: .No e.g. Ql a-c-e .
1.1       jtc       199: Ranges are very collating-sequence-dependent,
                    200: and portable programs should avoid relying on them.
1.9       joerg     201: .Pp
1.13      christos  202: To include a literal
                    203: .Ql \&]
                    204: in the list, make it the first character
                    205: (following a possible
                    206: .Ql \&^ ) .
                    207: To include a literal
                    208: .Ql \&- ,
                    209: make it the first or last character,
1.1       jtc       210: or the second endpoint of a range.
1.13      christos  211: To use a literal
                    212: .Ql \&-
                    213: as the first endpoint of a range,
                    214: enclose it in
                    215: .Ql [.\&
                    216: and
                    217: .Ql .]\&
                    218: to make it a collating element (see below).
                    219: With the exception of these and some combinations using
                    220: .Ql \&[
                    221: (see next paragraphs), all other special characters, including
                    222: .Ql \e ,
                    223: lose their special significance within a bracket expression.
1.9       joerg     224: .Pp
1.1       jtc       225: Within a bracket expression, a collating element (a character,
                    226: a multi-character sequence that collates as if it were a single character,
                    227: or a collating-sequence name for either)
1.13      christos  228: enclosed in
                    229: .Ql [.\&
                    230: and
                    231: .Ql .]\&
                    232: stands for the
1.1       jtc       233: sequence of characters of that collating element.
                    234: The sequence is a single element of the bracket expression's list.
1.6       wiz       235: A bracket expression containing a multi-character collating element
1.1       jtc       236: can thus match more than one character,
1.13      christos  237: e.g.\& if the collating sequence includes a
                    238: .Ql ch
                    239: collating element,
                    240: then the RE
                    241: .Ql [[.ch.]]*c
                    242: matches the first five characters
                    243: of
                    244: .Ql chchcc .
1.9       joerg     245: .Pp
1.13      christos  246: Within a bracket expression, a collating element enclosed in
                    247: .Ql [=
                    248: and
                    249: .Ql =]
                    250: is an equivalence class, standing for the sequences of characters
1.1       jtc       251: of all collating elements equivalent to that one, including itself.
                    252: (If there are no other equivalent collating elements,
1.13      christos  253: the treatment is as if the enclosing delimiters were
                    254: .Ql [.\&
                    255: and
                    256: .Ql .] . )
                    257: For example, if
                    258: .Ql x
                    259: and
                    260: .Ql y
                    261: are the members of an equivalence class,
                    262: then
                    263: .Ql [[=x=]] ,
                    264: .Ql [[=y=]] ,
                    265: and
                    266: .Ql [xy]
                    267: are all synonymous.
                    268: An equivalence class may not\(dd be an endpoint
1.1       jtc       269: of a range.
1.9       joerg     270: .Pp
                    271: Within a bracket expression, the name of a
                    272: .Em character class
1.13      christos  273: enclosed in
                    274: .Ql [:
                    275: and
                    276: .Ql :]
                    277: stands for the list of all characters belonging to that
                    278: class.
1.1       jtc       279: Standard character class names are:
1.13      christos  280: .Bl -column "alnum" "digit" "xdigit" -offset indent
                    281: .It Em "alnum  digit   punct"
                    282: .It Em "alpha  graph   space"
                    283: .It Em "blank  lower   upper"
                    284: .It Em "cntrl  print   xdigit"
1.9       joerg     285: .El
                    286: .Pp
1.1       jtc       287: These stand for the character classes defined in
1.9       joerg     288: .Xr ctype 3 .
1.1       jtc       289: A locale may provide others.
                    290: A character class may not be used as an endpoint of a range.
1.9       joerg     291: .Pp
1.13      christos  292: A bracket expression like
                    293: .Ql [[:class:]]
                    294: can be used to match a single character that belongs to a character
                    295: class.
                    296: To do the reverse, matching any character that does not belong to a specific
                    297: class, use the negation operator of bracket expressions:
                    298: .Ql [^[:class:]] .
                    299: .Pp
                    300: There are two special cases\(dd of bracket expressions:
                    301: the bracket expressions
                    302: .Ql [[:<:]]
                    303: and
                    304: .Ql [[:>:]]
                    305: match the null string at the beginning and end of a word respectively.
1.9       joerg     306: A word is defined as a sequence of word characters
1.13      christos  307: which is neither preceded nor followed by
                    308: word characters.
1.2       jtc       309: A word character is an
1.9       joerg     310: .Em alnum
1.2       jtc       311: character (as defined by
1.9       joerg     312: .Xr ctype 3 )
1.2       jtc       313: or an underscore.
1.13      christos  314: This is an extension,
                    315: compatible with but not specified by
                    316: .St -p1003.2 ,
                    317: and should be used with
                    318: caution in software intended to be portable to other systems.
                    319: The additional word delimiters
                    320: .Ql \e<
                    321: and
                    322: .Ql \e>
                    323: are provided to ease compatibility with traditional
                    324: SVR4
                    325: systems but are not portable and should be avoided.
1.9       joerg     326: .Pp
1.1       jtc       327: In the event that an RE could match more than one substring of a given
1.13      christos  328: string,
                    329: the RE matches the one starting earliest in the string.
1.1       jtc       330: If the RE could match more than one substring starting at that point,
                    331: it matches the longest.
                    332: Subexpressions also match the longest possible substrings, subject to
                    333: the constraint that the whole match be as long as possible,
                    334: with subexpressions starting earlier in the RE taking priority over
                    335: ones starting later.
                    336: Note that higher-level subexpressions thus take priority over
                    337: their lower-level component subexpressions.
1.9       joerg     338: .Pp
1.1       jtc       339: Match lengths are measured in characters, not collating elements.
                    340: A null string is considered longer than no match at all.
                    341: For example,
1.13      christos  342: .Ql bb*
                    343: matches the three middle characters of
                    344: .Ql abbbc ,
                    345: .Ql (wee|week)(knights|nights)
                    346: matches all ten characters of
                    347: .Ql weeknights ,
                    348: when
                    349: .Ql (.*).*\&
                    350: is matched against
                    351: .Ql abc
                    352: the parenthesized subexpression
1.1       jtc       353: matches all three characters, and
1.13      christos  354: when
                    355: .Ql (a*)*
                    356: is matched against
                    357: .Ql bc
                    358: both the whole RE and the parenthesized
1.1       jtc       359: subexpression match the null string.
1.9       joerg     360: .Pp
1.1       jtc       361: If case-independent matching is specified,
                    362: the effect is much as if all case distinctions had vanished from the
                    363: alphabet.
                    364: When an alphabetic that exists in multiple cases appears as an
                    365: ordinary character outside a bracket expression, it is effectively
                    366: transformed into a bracket expression containing both cases,
1.13      christos  367: .No e.g. Ql x
                    368: becomes
                    369: .Ql [xX] .
1.1       jtc       370: When it appears inside a bracket expression, all case counterparts
1.13      christos  371: of it are added to the bracket expression, so that (e.g.)
                    372: .Ql [x]
                    373: becomes
                    374: .Ql [xX]
                    375: and
                    376: .Ql [^x]
                    377: becomes
                    378: .Ql [^xX] .
1.9       joerg     379: .Pp
1.13      christos  380: No particular limit is imposed on the length of REs\(dd.
1.1       jtc       381: Programs intended to be portable should not employ REs longer
                    382: than 256 bytes,
                    383: as an implementation can refuse to accept such REs and remain
                    384: POSIX-compliant.
1.9       joerg     385: .Pp
1.13      christos  386: Obsolete
                    387: .Pq Dq basic
                    388: regular expressions differ in several respects.
                    389: .Ql \&|
                    390: is an ordinary character and there is no equivalent
                    391: for its functionality.
                    392: .Ql \&+
                    393: and
                    394: .Ql ?\&
                    395: are ordinary characters, and their functionality
                    396: can be expressed using bounds
                    397: .Po
                    398: .Ql {1,}
                    399: or
                    400: .Ql {0,1}
                    401: respectively
                    402: .Pc .
                    403: Also note that
                    404: .Ql x+
                    405: in modern REs is equivalent to
                    406: .Ql xx* .
                    407: The delimiters for bounds are
                    408: .Ql \e{
                    409: and
                    410: .Ql \e} ,
                    411: with
                    412: .Ql \&{
                    413: and
                    414: .Ql \&}
                    415: by themselves ordinary characters.
                    416: The parentheses for nested subexpressions are
                    417: .Ql \e(
                    418: and
                    419: .Ql \e) ,
                    420: with
                    421: .Ql \&(
                    422: and
                    423: .Ql \&)
                    424: by themselves ordinary characters.
                    425: .Ql \&^
                    426: is an ordinary character except at the beginning of the
                    427: RE or\(dd the beginning of a parenthesized subexpression,
                    428: .Ql \&$
                    429: is an ordinary character except at the end of the
                    430: RE or\(dd the end of a parenthesized subexpression,
                    431: and
                    432: .Ql \&*
                    433: is an ordinary character if it appears at the beginning of the
1.1       jtc       434: RE or the beginning of a parenthesized subexpression
1.13      christos  435: (after a possible leading
                    436: .Ql \&^ ) .
1.9       joerg     437: Finally, there is one new type of atom, a
                    438: .Em back reference :
1.13      christos  439: .Ql \e
                    440: followed by a non-zero decimal digit
1.9       joerg     441: .Em d
1.1       jtc       442: matches the same sequence of characters
1.9       joerg     443: matched by the
1.13      christos  444: .Em d Ns th
                    445: parenthesized subexpression
1.1       jtc       446: (numbering subexpressions by the positions of their opening parentheses,
                    447: left to right),
1.13      christos  448: so that (e.g.)
                    449: .Ql \e([bc]\e)\e1
                    450: matches
                    451: .Ql bb
                    452: or
                    453: .Ql cc
                    454: but not
                    455: .Ql bc .
1.9       joerg     456: .Sh SEE ALSO
                    457: .Xr regex 3
1.13      christos  458: .Rs
                    459: .%T Regular Expression Notation
                    460: .%R IEEE Std
                    461: .%N 1003.2
                    462: .%P section 2.8
                    463: .Re
1.9       joerg     464: .Sh BUGS
1.1       jtc       465: Having two kinds of REs is a botch.
1.9       joerg     466: .Pp
1.13      christos  467: The current
                    468: .St -p1003.2
                    469: spec says that
                    470: .Ql \&)
                    471: is an ordinary character in
                    472: the absence of an unmatched
                    473: .Ql \&( ;
                    474: this was an unintentional result of a wording error,
                    475: and change is likely.
1.1       jtc       476: Avoid relying on it.
1.9       joerg     477: .Pp
1.1       jtc       478: Back references are a dreadful botch,
                    479: posing major problems for efficient implementations.
                    480: They are also somewhat vaguely defined
1.13      christos  481: (does
                    482: .Ql a\e(\e(b\e)*\e2\e)*d
                    483: match
                    484: .Ql abbbd ? ) .
1.1       jtc       485: Avoid using them.
1.9       joerg     486: .Pp
1.13      christos  487: .St -p1003.2
                    488: specification of case-independent matching is vague.
                    489: The
                    490: .Dq one case implies all cases
                    491: definition given above
1.1       jtc       492: is the current consensus among implementors as to the right interpretation.
1.9       joerg     493: .Pp
1.2       jtc       494: The syntax for word boundaries is incredibly ugly.
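
The matching rules described in the page above can be checked against a regex(3) implementation. The following is a minimal sketch, not part of re_format.7: match_span is a hypothetical helper name, and the block assumes a POSIX <regex.h> providing regcomp, regexec and regfree. A cflags value of 0 selects obsolete (basic) REs and REG_EXTENDED selects modern ones. Extensions marked with a dagger in the page, such as the word-boundary forms, are deliberately not exercised because they may not exist on other systems.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of regex(3)): compile `pattern` with
 * `cflags`, search `text`, and report the offsets of the whole match.
 * Returns 1 on a match, 0 on no match (or a run-time failure in
 * regexec), and -1 if regcomp rejects the pattern.
 * Assumes cflags does not include REG_NOSUB.
 */
static int
match_span(const char *pattern, int cflags, const char *text,
    int *start, int *end)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;
    if (start != NULL)
        *start = (int)m[0].rm_so;   /* offset of first matched byte */
    if (end != NULL)
        *end = (int)m[0].rm_eo;     /* offset just past the match */
    return 1;
}

#ifdef RE_FORMAT_DEMO_MAIN
/* Optional driver: build with -DRE_FORMAT_DEMO_MAIN to print results. */
int
main(void)
{
    int s, e;

    /* Modern RE: earliest start, then longest match. */
    if (match_span("bb*", REG_EXTENDED, "abbbc", &s, &e) == 1)
        printf("bb* on abbbc: [%d,%d)\n", s, e);

    /* Obsolete RE with a back reference. */
    printf("back reference on bb: %d\n",
        match_span("\\([bc]\\)\\1", 0, "bb", NULL, NULL));
    printf("back reference on bc: %d\n",
        match_span("\\([bc]\\)\\1", 0, "bc", NULL, NULL));
    return 0;
}
#endif
```

With such a helper, the examples from the page can be tried directly: `bb*` against `abbbc` should report the three middle characters, `(wee|week)(knights|nights)` against `weeknights` should report all ten characters, and the obsolete RE `\([bc]\)\1` should match `bb` but not `bc`. Passing REG_EXTENDED | REG_ICASE requests the case-independent matching described above.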
