Annotation of src/lib/libc/regex/re_format.7, Revision 1.11
1.11 ! wiz 1: .\" $NetBSD: re_format.7,v 1.10 2013/01/25 11:51:42 wiz Exp $
1.5 cgd 2: .\"
1.4 cgd 3: .\" Copyright (c) 1992, 1993, 1994
4: .\" The Regents of the University of California. All rights reserved.
1.8 agc 5: .\"
6: .\" This code is derived from software contributed to Berkeley by
7: .\" Henry Spencer.
8: .\"
9: .\" Redistribution and use in source and binary forms, with or without
10: .\" modification, are permitted provided that the following conditions
11: .\" are met:
12: .\" 1. Redistributions of source code must retain the above copyright
13: .\" notice, this list of conditions and the following disclaimer.
14: .\" 2. Redistributions in binary form must reproduce the above copyright
15: .\" notice, this list of conditions and the following disclaimer in the
16: .\" documentation and/or other materials provided with the distribution.
17: .\" 3. Neither the name of the University nor the names of its contributors
18: .\" may be used to endorse or promote products derived from this software
19: .\" without specific prior written permission.
20: .\"
21: .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
22: .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
23: .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
24: .\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
25: .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
26: .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
27: .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
28: .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
29: .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
30: .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
31: .\" SUCH DAMAGE.
32: .\"
33: .\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
1.4 cgd 34: .\"
35: .\" This code is derived from software contributed to Berkeley by
36: .\" Henry Spencer.
37: .\"
38: .\" Redistribution and use in source and binary forms, with or without
39: .\" modification, are permitted provided that the following conditions
40: .\" are met:
41: .\" 1. Redistributions of source code must retain the above copyright
42: .\" notice, this list of conditions and the following disclaimer.
43: .\" 2. Redistributions in binary form must reproduce the above copyright
44: .\" notice, this list of conditions and the following disclaimer in the
45: .\" documentation and/or other materials provided with the distribution.
46: .\" 3. All advertising materials mentioning features or use of this software
47: .\" must display the following acknowledgement:
48: .\" This product includes software developed by the University of
49: .\" California, Berkeley and its contributors.
50: .\" 4. Neither the name of the University nor the names of its contributors
51: .\" may be used to endorse or promote products derived from this software
52: .\" without specific prior written permission.
53: .\"
54: .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
55: .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
56: .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
57: .\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
58: .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
59: .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
60: .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
61: .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
62: .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
63: .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
64: .\" SUCH DAMAGE.
65: .\"
66: .\" @(#)re_format.7 8.3 (Berkeley) 3/20/94
67: .\"
1.9 joerg 68: .Dd March 20, 1994
69: .Dt RE_FORMAT 7
70: .Os
71: .Sh NAME
72: .Nm re_format
73: .Nd POSIX 1003.2 regular expressions
74: .Sh DESCRIPTION
1.1 jtc 75: Regular expressions (``RE''s),
76: as defined in POSIX 1003.2, come in two forms:
77: modern REs (roughly those of
1.9 joerg 78: .Xr egrep 1 ;
1.1 jtc 79: 1003.2 calls these ``extended'' REs)
80: and obsolete REs (roughly those of
1.9 joerg 81: .Xr ed 1 ;
1.1 jtc 82: 1003.2 ``basic'' REs).
83: Obsolete REs mostly exist for backward compatibility in some old programs;
84: they will be discussed at the end.
85: 1003.2 leaves some aspects of RE syntax and semantics open;
1.11 ! wiz 86: `(*)' marks decisions on these aspects that
1.1 jtc 87: may not be fully portable to other 1003.2 implementations.
1.9 joerg 88: .Pp
1.11 ! wiz 89: A (modern) RE is one(*) or more non-empty(*)
1.9 joerg 90: .Em branches ,
1.1 jtc 91: separated by `|'.
92: It matches anything that matches one of the branches.
1.9 joerg 93: .Pp
1.11 ! wiz 94: A branch is one(*) or more
1.9 joerg 95: .Em pieces ,
96: concatenated.
1.1 jtc 97: It matches a match for the first, followed by a match for the second, etc.
1.9 joerg 98: .Pp
99: A piece is an
100: .Em atom
101: possibly followed
1.11 ! wiz 102: by a single(*) `*', `+', `?', or
1.9 joerg 103: .Em bound .
1.1 jtc 104: An atom followed by `*' matches a sequence of 0 or more matches of the atom.
105: An atom followed by `+' matches a sequence of 1 or more matches of the atom.
106: An atom followed by `?' matches a sequence of 0 or 1 matches of the atom.
1.9 joerg 107: .Pp
108: A
109: .Em bound
110: is `{' followed by an unsigned decimal integer, possibly followed by `,'
1.1 jtc 111: possibly followed by another unsigned decimal integer,
112: always followed by `}'.
1.11 ! wiz 113: The integers must lie between 0 and RE_DUP_MAX (255(*)) inclusive,
1.1 jtc 114: and if there are two of them, the first may not exceed the second.
1.9 joerg 115: An atom followed by a bound containing one integer
116: .Em i
117: and no comma matches a sequence of exactly
118: .Em i
119: matches of the atom.
120: An atom followed by a bound containing one integer
121: .Em i
122: and a comma matches a sequence of
123: .Em i
124: or more matches of the atom.
125: An atom followed by a bound containing two integers
126: .Em i
127: and
128: .Em j
129: matches a sequence of
130: .Em i
131: through
132: .Em j
133: (inclusive) matches of the atom.
134: .Pp
1.1 jtc 135: An atom is a regular expression enclosed in `()' (matching a match for the
1.11 ! wiz 136: regular expression), an empty set of `()' (matching the null string)(*), a
1.9 joerg 137: .Em bracket expression
138: (see below), `.' (matching any single character),
139: `^' (matching the null string at the beginning of a line),
140: `$' (matching the null string at the end of a line),
141: a `\e' followed by one of the characters `^.[$()|*+?{\e'
1.1 jtc 142: (matching that character taken as an ordinary character),
1.11 ! wiz 143: a `\e' followed by any other character(*)
1.1 jtc 144: (matching that character taken as an ordinary character,
1.11 ! wiz 145: as if the `\e' had not been present(*)),
1.1 jtc 146: or a single character with no other significance (matching that character).
147: A `{' followed by a character other than a digit is an ordinary
1.11 ! wiz 148: character, not the beginning of a bound(*).
1.1 jtc 149: It is illegal to end an RE with `\e'.
1.9 joerg 150: .Pp
151: A
152: .Em bracket expression
153: is a list of characters enclosed in `[]'.
1.1 jtc 154: It normally matches any single character from the list (but see below).
155: If the list begins with `^',
1.9 joerg 156: it matches any single character (but see below)
157: .Em not
158: from the rest of the list.
1.1 jtc 159: If two characters in the list are separated by `\-', this is shorthand
1.9 joerg 160: for the full
161: .Em range
162: of characters between those two (inclusive) in the collating sequence,
1.1 jtc 163: e.g. `[0-9]' in ASCII matches any decimal digit.
1.11 ! wiz 164: It is illegal(*) for two ranges to share an endpoint, e.g. `a-c-e'.
1.1 jtc 165: Ranges are very collating-sequence-dependent,
166: and portable programs should avoid relying on them.
1.9 joerg 167: .Pp
1.1 jtc 168: To include a literal `]' in the list, make it the first character
169: (following a possible `^').
170: To include a literal `\-', make it the first or last character,
171: or the second endpoint of a range.
172: To use a literal `\-' as the first endpoint of a range,
173: enclose it in `[.' and `.]' to make it a collating element (see below).
174: With the exception of these and some combinations using `[' (see next
175: paragraphs), all other special characters, including `\e', lose their
176: special significance within a bracket expression.
1.9 joerg 177: .Pp
1.1 jtc 178: Within a bracket expression, a collating element (a character,
179: a multi-character sequence that collates as if it were a single character,
180: or a collating-sequence name for either)
181: enclosed in `[.' and `.]' stands for the
182: sequence of characters of that collating element.
183: The sequence is a single element of the bracket expression's list.
1.6 wiz 184: A bracket expression containing a multi-character collating element
1.1 jtc 185: can thus match more than one character,
186: e.g. if the collating sequence includes a `ch' collating element,
187: then the RE `[[.ch.]]*c' matches the first five characters
188: of `chchcc'.
1.9 joerg 189: .Pp
1.1 jtc 190: Within a bracket expression, a collating element enclosed in `[=' and
191: `=]' is an equivalence class, standing for the sequences of characters
192: of all collating elements equivalent to that one, including itself.
193: (If there are no other equivalent collating elements,
194: the treatment is as if the enclosing delimiters were `[.' and `.]'.)
1.9 joerg 195: For example, if o and '\(^o' are the members of an equivalence class,
196: then `[[=o=]]', `[[=\(^o'=]]', and `[o\(^o']' are all synonymous.
1.11 ! wiz 197: An equivalence class may not(*) be an endpoint
1.1 jtc 198: of a range.
1.9 joerg 199: .Pp
200: Within a bracket expression, the name of a
201: .Em character class
202: enclosed in `[:' and `:]' stands for the list of all characters
203: belonging to that class.
1.1 jtc 204: Standard character class names are:
1.9 joerg 205: .Bl -column "alnum" "digit" "xdigit"
206: .It alnum digit punct
207: .It alpha graph space
208: .It blank lower upper
209: .It cntrl print xdigit
210: .El
211: .Pp
1.1 jtc 212: These stand for the character classes defined in
1.9 joerg 213: .Xr ctype 3 .
1.1 jtc 214: A locale may provide others.
215: A character class may not be used as an endpoint of a range.
1.9 joerg 216: .Pp
1.11 ! wiz 217: There are two special cases(*) of bracket expressions:
1.9 joerg 218: the bracket expressions `[[:\*[Lt]:]]' and `[[:\*[Gt]:]]' match
219: the null string at the beginning and end of a word respectively.
220: A word is defined as a sequence of word characters
221: which is neither preceded nor followed by word characters.
1.2 jtc 222: A word character is an
1.9 joerg 223: .Em alnum
1.2 jtc 224: character (as defined by
1.9 joerg 225: .Xr ctype 3 )
1.2 jtc 226: or an underscore.
1.9 joerg 227: This is an extension, compatible with but not specified by POSIX 1003.2,
228: and should be used with caution in software intended to be portable
229: to other systems.
230: .Pp
1.1 jtc 231: In the event that an RE could match more than one substring of a given
1.9 joerg 232: string, the RE matches the one starting earliest in the string.
1.1 jtc 233: If the RE could match more than one substring starting at that point,
234: it matches the longest.
235: Subexpressions also match the longest possible substrings, subject to
236: the constraint that the whole match be as long as possible,
237: with subexpressions starting earlier in the RE taking priority over
238: ones starting later.
239: Note that higher-level subexpressions thus take priority over
240: their lower-level component subexpressions.
1.9 joerg 241: .Pp
1.1 jtc 242: Match lengths are measured in characters, not collating elements.
243: A null string is considered longer than no match at all.
244: For example,
245: `bb*' matches the three middle characters of `abbbc',
246: `(wee|week)(knights|nights)' matches all ten characters of `weeknights',
247: when `(.*).*' is matched against `abc' the parenthesized subexpression
248: matches all three characters, and
249: when `(a*)*' is matched against `bc' both the whole RE and the parenthesized
250: subexpression match the null string.
1.9 joerg 251: .Pp
1.1 jtc 252: If case-independent matching is specified,
253: the effect is much as if all case distinctions had vanished from the
254: alphabet.
255: When an alphabetic that exists in multiple cases appears as an
256: ordinary character outside a bracket expression, it is effectively
257: transformed into a bracket expression containing both cases,
258: e.g. `x' becomes `[xX]'.
259: When it appears inside a bracket expression, all case counterparts
260: of it are added to the bracket expression, so that (e.g.) `[x]'
261: becomes `[xX]' and `[^x]' becomes `[^xX]'.
1.9 joerg 262: .Pp
1.11 ! wiz 263: No particular limit is imposed on the length of REs(*).
1.1 jtc 264: Programs intended to be portable should not employ REs longer
265: than 256 bytes,
266: as an implementation can refuse to accept such REs and remain
267: POSIX-compliant.
1.9 joerg 268: .Pp
1.1 jtc 269: Obsolete (``basic'') regular expressions differ in several respects.
270: `|', `+', and `?' are ordinary characters and there is no equivalent
271: for their functionality.
272: The delimiters for bounds are `\e{' and `\e}',
273: with `{' and `}' by themselves ordinary characters.
274: The parentheses for nested subexpressions are `\e(' and `\e)',
275: with `(' and `)' by themselves ordinary characters.
276: `^' is an ordinary character except at the beginning of the
1.11 ! wiz 277: RE or(*) the beginning of a parenthesized subexpression,
1.1 jtc 278: `$' is an ordinary character except at the end of the
1.11 ! wiz 279: RE or(*) the end of a parenthesized subexpression,
1.1 jtc 280: and `*' is an ordinary character if it appears at the beginning of the
281: RE or the beginning of a parenthesized subexpression
282: (after a possible leading `^').
1.9 joerg 283: Finally, there is one new type of atom, a
284: .Em back reference :
285: `\e' followed by a non-zero decimal digit
286: .Em d
1.1 jtc 287: matches the same sequence of characters
1.9 joerg 288: matched by the
289: .Em d Ns th parenthesized subexpression
1.1 jtc 290: (numbering subexpressions by the positions of their opening parentheses,
291: left to right),
292: so that (e.g.) `\e([bc]\e)\e1' matches `bb' or `cc' but not `bc'.
1.9 joerg 293: .Sh SEE ALSO
294: .Xr regex 3
295: .Pp
1.1 jtc 296: POSIX 1003.2, section 2.8 (Regular Expression Notation).
1.9 joerg 297: .Sh BUGS
1.1 jtc 298: Having two kinds of REs is a botch.
1.9 joerg 299: .Pp
1.1 jtc 300: The current 1003.2 spec says that `)' is an ordinary character in
301: the absence of an unmatched `(';
1.9 joerg 302: this was an unintentional result of a wording error, and change is likely.
1.1 jtc 303: Avoid relying on it.
1.9 joerg 304: .Pp
1.1 jtc 305: Back references are a dreadful botch,
306: posing major problems for efficient implementations.
307: They are also somewhat vaguely defined
1.9 joerg 308: (does `a\e(\e(b\e)*\e2\e)*d' match `abbbd'?).
1.1 jtc 309: Avoid using them.
1.9 joerg 310: .Pp
1.1 jtc 311: 1003.2's specification of case-independent matching is vague.
312: The ``one case implies all cases'' definition given above
313: is current consensus among implementors as to the right interpretation.
1.9 joerg 314: .Pp
1.2 jtc 315: The syntax for word boundaries is incredibly ugly.
CVSweb <webmaster@jp.NetBSD.org>