Annotation of src/lib/libc/regex/regex.3, Revision 1.3
1.3 ! cgd 1: .\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
! 2: .\" Copyright (c) 1992, 1993, 1994
! 3: .\" The Regents of the University of California. All rights reserved.
! 4: .\"
! 5: .\" This code is derived from software contributed to Berkeley by
! 6: .\" Henry Spencer.
! 7: .\"
! 8: .\" Redistribution and use in source and binary forms, with or without
! 9: .\" modification, are permitted provided that the following conditions
! 10: .\" are met:
! 11: .\" 1. Redistributions of source code must retain the above copyright
! 12: .\" notice, this list of conditions and the following disclaimer.
! 13: .\" 2. Redistributions in binary form must reproduce the above copyright
! 14: .\" notice, this list of conditions and the following disclaimer in the
! 15: .\" documentation and/or other materials provided with the distribution.
! 16: .\" 3. All advertising materials mentioning features or use of this software
! 17: .\" must display the following acknowledgement:
! 18: .\" This product includes software developed by the University of
! 19: .\" California, Berkeley and its contributors.
! 20: .\" 4. Neither the name of the University nor the names of its contributors
! 21: .\" may be used to endorse or promote products derived from this software
! 22: .\" without specific prior written permission.
! 23: .\"
! 24: .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
! 25: .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
! 26: .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
! 27: .\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
! 28: .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
! 29: .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
! 30: .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
! 31: .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
! 32: .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
! 33: .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
! 34: .\" SUCH DAMAGE.
! 35: .\"
! 36: .\" @(#)regex.3 8.4 (Berkeley) 3/20/94
! 37: .\"
! 38: .TH REGEX 3 "March 20, 1994"
1.1 jtc 39: .de ZR
40: .\" one other place knows this name: the SEE ALSO section
1.2 cgd 41: .IR re_format (7) \\$1
1.1 jtc 42: ..
43: .SH NAME
44: regcomp, regexec, regerror, regfree \- regular-expression library
45: .SH SYNOPSIS
46: .ft B
47: .\".na
48: #include <sys/types.h>
49: .br
50: #include <regex.h>
51: .HP 10
52: int regcomp(regex_t\ *preg, const\ char\ *pattern, int\ cflags);
53: .HP
54: int\ regexec(const\ regex_t\ *preg, const\ char\ *string,
55: size_t\ nmatch, regmatch_t\ pmatch[], int\ eflags);
56: .HP
57: size_t\ regerror(int\ errcode, const\ regex_t\ *preg,
58: char\ *errbuf, size_t\ errbuf_size);
59: .HP
60: void\ regfree(regex_t\ *preg);
61: .\".ad
62: .ft
63: .SH DESCRIPTION
64: These routines implement POSIX 1003.2 regular expressions (``RE''s);
65: see
66: .ZR .
67: .I Regcomp
68: compiles an RE written as a string into an internal form,
69: .I regexec
70: matches that internal form against a string and reports results,
71: .I regerror
72: transforms error codes from either into human-readable messages,
73: and
74: .I regfree
75: frees any dynamically-allocated storage used by the internal form
76: of an RE.
77: .PP
78: The header
79: .I <regex.h>
80: declares two structure types,
81: .I regex_t
82: and
83: .IR regmatch_t ,
84: the former for compiled internal forms and the latter for match reporting.
85: It also declares the four functions,
86: a type
87: .IR regoff_t ,
88: and a number of constants with names starting with ``REG_''.
89: .PP
90: .I Regcomp
91: compiles the regular expression contained in the
92: .I pattern
93: string,
94: subject to the flags in
95: .IR cflags ,
96: and places the results in the
97: .I regex_t
98: structure pointed to by
99: .IR preg .
100: .I Cflags
101: is the bitwise OR of zero or more of the following flags:
102: .IP REG_EXTENDED \w'REG_EXTENDED'u+2n
103: Compile modern (``extended'') REs,
104: rather than the obsolete (``basic'') REs that
105: are the default.
106: .IP REG_BASIC
107: This is a synonym for 0,
108: provided as a counterpart to REG_EXTENDED to improve readability.
109: .IP REG_NOSPEC
110: Compile with recognition of all special characters turned off.
111: All characters are thus considered ordinary,
112: so the ``RE'' is a literal string.
113: This is an extension,
114: compatible with but not specified by POSIX 1003.2,
115: and should be used with
116: caution in software intended to be portable to other systems.
117: REG_EXTENDED and REG_NOSPEC may not be used
118: in the same call to
119: .IR regcomp .
120: .IP REG_ICASE
121: Compile for matching that ignores upper/lower case distinctions.
122: See
123: .ZR .
124: .IP REG_NOSUB
125: Compile for matching that need only report success or failure,
126: not what was matched.
127: .IP REG_NEWLINE
128: Compile for newline-sensitive matching.
129: By default, newline is a completely ordinary character with no special
130: meaning in either REs or strings.
131: With this flag,
132: `[^' bracket expressions and `.' never match newline,
133: a `^' anchor matches the null string after any newline in the string
134: in addition to its normal function,
135: and the `$' anchor matches the null string before any newline in the
136: string in addition to its normal function.
137: .IP REG_PEND
138: The regular expression ends,
139: not at the first NUL,
140: but just before the character pointed to by the
141: .I re_endp
142: member of the structure pointed to by
143: .IR preg .
144: The
145: .I re_endp
146: member is of type
147: .IR const\ char\ * .
148: This flag permits inclusion of NULs in the RE;
149: they are considered ordinary characters.
150: This is an extension,
151: compatible with but not specified by POSIX 1003.2,
152: and should be used with
153: caution in software intended to be portable to other systems.
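The effect of the compile-time flags above can be sketched with a small helper. This is only an illustration: the name matches_with_flags is hypothetical, and it always ORs in REG_NOSUB because it needs nothing beyond success or failure.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compile `pattern` with the caller's cflags
 * (plus REG_NOSUB, since only success/failure is needed) and test `text`.
 * Returns 1 on a match, 0 on no match (or a regexec error),
 * and -1 if the pattern fails to compile.
 */
int matches_with_flags(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;              /* nothing to free after a failed regcomp */

    /* nmatch is 0, so the pmatch argument is ignored. */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

For instance, matches_with_flags("^b$", "a\nb\nc", REG_EXTENDED) should report no match, while adding REG_NEWLINE lets the anchors match around the embedded newlines.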
154: .PP
155: When successful,
156: .I regcomp
157: returns 0 and fills in the structure pointed to by
158: .IR preg .
159: One member of that structure
160: (other than
161: .IR re_endp )
162: is publicized:
163: .IR re_nsub ,
164: of type
165: .IR size_t ,
166: contains the number of parenthesized subexpressions within the RE
167: (except that the value of this member is undefined if the
168: REG_NOSUB flag was used).
169: If
170: .I regcomp
171: fails, it returns a non-zero error code;
172: see DIAGNOSTICS.
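A minimal sketch of checking the return value and reading re_nsub follows. The function name count_subexpressions is hypothetical, and REG_NOSUB is deliberately not used, because re_nsub is undefined when that flag is given.

```c
#include <regex.h>

/*
 * Hypothetical helper: compile an extended RE and report how many
 * parenthesized subexpressions it contains.
 * Returns -1 if regcomp reports an error.
 */
long count_subexpressions(const char *pattern)
{
    regex_t re;
    long n;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;              /* non-zero means a compile error */

    n = (long)re.re_nsub;       /* filled in by a successful regcomp */
    regfree(&re);               /* release the compiled form */
    return n;
}
```

A caller can use the result to size a pmatch array of re_nsub + 1 entries before calling regexec.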
173: .PP
174: .I Regexec
175: matches the compiled RE pointed to by
176: .I preg
177: against the
178: .IR string ,
179: subject to the flags in
180: .IR eflags ,
181: and reports results using
182: .IR nmatch ,
183: .IR pmatch ,
184: and the returned value.
185: The RE must have been compiled by a previous invocation of
186: .IR regcomp .
187: The compiled form is not altered during execution of
188: .IR regexec ,
189: so a single compiled RE can be used simultaneously by multiple threads.
190: .PP
191: By default,
192: the NUL-terminated string pointed to by
193: .I string
194: is considered to be the text of an entire line, minus any terminating
195: newline.
196: The
197: .I eflags
198: argument is the bitwise OR of zero or more of the following flags:
199: .IP REG_NOTBOL \w'REG_STARTEND'u+2n
200: The first character of
201: the string
202: is not the beginning of a line, so the `^' anchor should not match before it.
203: This does not affect the behavior of newlines under REG_NEWLINE.
204: .IP REG_NOTEOL
205: The NUL terminating
206: the string
207: does not end a line, so the `$' anchor should not match before it.
208: This does not affect the behavior of newlines under REG_NEWLINE.
209: .IP REG_STARTEND
210: The string is considered to start at
211: \fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
212: and to have a terminating NUL located at
213: \fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
214: (there need not actually be a NUL at that location),
215: regardless of the value of
216: .IR nmatch .
217: See below for the definition of
218: .IR pmatch
219: and
220: .IR nmatch .
221: This is an extension,
222: compatible with but not specified by POSIX 1003.2,
223: and should be used with
224: caution in software intended to be portable to other systems.
225: Note that a non-zero \fIrm_so\fR does not imply REG_NOTBOL;
226: REG_STARTEND affects only the location of the string,
227: not how it is matched.
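One common use of REG_NOTBOL is scanning a string for successive matches: after the first match, the remaining text no longer starts at a line beginning. The following is a sketch under that assumption; count_matches is a hypothetical name, and the handling of empty matches (step forward one character) is one possible choice, not the only one.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: count non-overlapping matches of an extended RE
 * in `text`. Returns -1 if the pattern does not compile.
 */
long count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    const char *p = text;
    long count = 0;
    int eflags = 0;             /* first call: p really is start of line */

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    while (regexec(&re, p, 1, &m, eflags) == 0) {
        count++;
        if (m.rm_eo > 0)
            p += m.rm_eo;       /* resume just after this match */
        else if (*p != '\0')
            p++;                /* empty match at p: step one character */
        else
            break;              /* empty match at end of string */
        eflags = REG_NOTBOL;    /* later calls do not start a line */
    }

    regfree(&re);
    return count;
}
```

With this loop, the pattern `^a' is counted once in `aaa'; without REG_NOTBOL on the later calls, each remaining suffix would match again.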
228: .PP
229: See
230: .ZR
231: for a discussion of what is matched in situations where an RE or a
232: portion thereof could match any of several substrings of
233: .IR string .
234: .PP
235: Normally,
236: .I regexec
237: returns 0 for success and the non-zero code REG_NOMATCH for failure.
238: Other non-zero error codes may be returned in exceptional situations;
239: see DIAGNOSTICS.
240: .PP
241: If REG_NOSUB was specified in the compilation of the RE,
242: or if
243: .I nmatch
244: is 0,
245: .I regexec
246: ignores the
247: .I pmatch
248: argument (but see below for the case where REG_STARTEND is specified).
249: Otherwise,
250: .I pmatch
251: points to an array of
252: .I nmatch
253: structures of type
254: .IR regmatch_t .
255: Such a structure has at least the members
256: .I rm_so
257: and
258: .IR rm_eo ,
259: both of type
260: .I regoff_t
261: (a signed arithmetic type at least as large as an
262: .I off_t
263: and a
264: .IR ssize_t ),
265: containing respectively the offset of the first character of a substring
266: and the offset of the first character after the end of the substring.
267: Offsets are measured from the beginning of the
268: .I string
269: argument given to
270: .IR regexec .
271: An empty substring is denoted by equal offsets,
272: both indicating the character following the empty substring.
273: .PP
274: The 0th member of the
275: .I pmatch
276: array is filled in to indicate what substring of
277: .I string
278: was matched by the entire RE.
279: Remaining members report what substring was matched by parenthesized
280: subexpressions within the RE;
281: member
282: .I i
283: reports subexpression
284: .IR i ,
285: with subexpressions counted (starting at 1) by the order of their opening
286: parentheses in the RE, left to right.
287: Unused entries in the array\(emcorresponding either to subexpressions that
288: did not participate in the match at all, or to subexpressions that do not
289: exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
290: .I rm_so
291: and
292: .I rm_eo
293: set to \-1.
294: If a subexpression participated in the match several times,
295: the reported substring is the last one it matched.
296: (Note in particular, as an example, that when the RE `(b*)+' matches `bbb',
297: the parenthesized subexpression matches each of the three `b's and then
298: an infinite number of empty strings following the last `b',
299: so the reported substring is one of the empties.)
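The offsets in pmatch can be turned into substrings as sketched below. The helper extract_group and the fixed limit MAX_GROUPS are assumptions made for illustration; a real caller might size the array from re_nsub instead.

```c
#include <regex.h>
#include <string.h>

#define MAX_GROUPS 10           /* illustrative limit, not from regex.h */

/*
 * Hypothetical helper: match an extended RE against `text` and copy the
 * substring matched by subexpression `group` (0 = whole match) into `out`.
 * Returns the substring length, or -1 on compile error, no match,
 * a subexpression that did not participate, or a too-small buffer.
 */
long extract_group(const char *pattern, const char *text, size_t group,
                   char *out, size_t outsz)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    long len = -1;

    if (group >= MAX_GROUPS || outsz == 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* Unused entries come back with rm_so and rm_eo set to -1. */
    if (regexec(&re, text, MAX_GROUPS, pm, 0) == 0 && pm[group].rm_so != -1) {
        size_t n = (size_t)(pm[group].rm_eo - pm[group].rm_so);
        if (n < outsz) {
            memcpy(out, text + pm[group].rm_so, n);
            out[n] = '\0';
            len = (long)n;
        }
    }

    regfree(&re);
    return len;
}
```

Offsets are relative to the string passed to regexec, which is why the copy starts at text + rm_so.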
300: .PP
301: If REG_STARTEND is specified,
302: .I pmatch
303: must point to at least one
304: .I regmatch_t
305: (even if
306: .I nmatch
307: is 0 or REG_NOSUB was specified),
308: to hold the input offsets for REG_STARTEND.
309: Use for output is still entirely controlled by
310: .IR nmatch ;
311: if
312: .I nmatch
313: is 0 or REG_NOSUB was specified,
314: the value of
315: .IR pmatch [0]
316: will not be changed by a successful
317: .IR regexec .
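A sketch of searching a sub-range with REG_STARTEND is given below. Because the flag is an extension, the code is guarded with #ifdef and returns a distinct value where it is unavailable; search_range and its return conventions are hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: search only text[start..end) for an extended RE.
 * Returns the offset (from the start of `text`) where the match begins,
 * -1 if there is no match in the range, and -2 if the pattern does not
 * compile or REG_STARTEND is not provided by this regex library.
 */
long search_range(const char *pattern, const char *text,
                  size_t start, size_t end)
{
#ifdef REG_STARTEND
    regex_t re;
    regmatch_t m;
    long result = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -2;

    /* pmatch[0] carries the input range; no NUL is needed at text[end]. */
    m.rm_so = (regoff_t)start;
    m.rm_eo = (regoff_t)end;
    if (regexec(&re, text, 1, &m, REG_STARTEND) == 0)
        result = (long)m.rm_so; /* offsets are still relative to text */

    regfree(&re);
    return result;
#else
    (void)pattern; (void)text; (void)start; (void)end;
    return -2;                  /* extension not available here */
#endif
}
```

Since nmatch is 1 here, pmatch[0] is overwritten with the match offsets on success, so the input range must be set again before each call.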
318: .PP
319: .I Regerror
320: maps a non-zero
321: .I errcode
322: from either
323: .I regcomp
324: or
325: .I regexec
326: to a human-readable, printable message.
327: If
328: .I preg
329: is non-NULL,
330: the error code should have arisen from use of
331: the
332: .I regex_t
333: pointed to by
334: .IR preg ,
335: and if the error code came from
336: .IR regcomp ,
337: it should have been the result from the most recent
338: .I regcomp
339: using that
340: .IR regex_t .
341: .RI ( Regerror
342: may be able to supply a more detailed message using information
343: from the
344: .IR regex_t .)
345: .I Regerror
346: places the NUL-terminated message into the buffer pointed to by
347: .IR errbuf ,
348: limiting the length (including the NUL) to at most
349: .I errbuf_size
350: bytes.
351: If the whole message won't fit,
352: as much of it as will fit before the terminating NUL is supplied.
353: In any case,
354: the returned value is the size of buffer needed to hold the whole
355: message (including terminating NUL).
356: If
357: .I errbuf_size
358: is 0,
359: .I errbuf
360: is ignored but the return value is still correct.
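Because the return value is the buffer size needed, a message can be fetched in two calls: one to measure and one to fill. The following sketch assumes a hypothetical helper, compile_error_message, that returns a heap-allocated string the caller must free.

```c
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper: try to compile `pattern`; on failure return a
 * malloc'ed copy of the regerror message (caller frees it).
 * Returns NULL if the pattern compiles, or if malloc fails.
 */
char *compile_error_message(const char *pattern, int cflags)
{
    regex_t re;
    size_t need;
    char *msg;
    int rc = regcomp(&re, pattern, cflags);

    if (rc == 0) {
        regfree(&re);           /* compiled fine: nothing to report */
        return NULL;
    }

    /* First call: errbuf_size 0, so errbuf is ignored; get the size. */
    need = regerror(rc, &re, NULL, 0);
    msg = malloc(need);
    if (msg != NULL)
        regerror(rc, &re, msg, need);   /* second call fills the buffer */
    return msg;
}
```

A fixed-size buffer also works; the message is then truncated to fit, and comparing the return value with the buffer size shows whether truncation happened.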
361: .PP
362: If the
363: .I errcode
364: given to
365: .I regerror
366: is first ORed with REG_ITOA,
367: the ``message'' that results is the printable name of the error code,
368: e.g. ``REG_NOMATCH'',
369: rather than an explanation thereof.
370: If
371: .I errcode
372: is REG_ATOI,
373: then
374: .I preg
375: shall be non-NULL and the
376: .I re_endp
377: member of the structure it points to
378: must point to the printable name of an error code;
379: in this case, the result in
380: .I errbuf
381: is the decimal digits of
382: the numeric value of the error code
383: (0 if the name is not recognized).
384: REG_ITOA and REG_ATOI are intended primarily as debugging facilities;
385: they are extensions,
386: compatible with but not specified by POSIX 1003.2,
387: and should be used with
388: caution in software intended to be portable to other systems.
389: Be warned also that they are considered experimental and changes are possible.
390: .PP
391: .I Regfree
392: frees any dynamically-allocated storage associated with the compiled RE
393: pointed to by
394: .IR preg .
395: The remaining
396: .I regex_t
397: is no longer a valid compiled RE
398: and the effect of supplying it to
399: .I regexec
400: or
401: .I regerror
402: is undefined.
403: .PP
404: None of these functions references global variables except for tables
405: of constants;
406: all are safe for use from multiple threads if the arguments are safe.
407: .SH IMPLEMENTATION CHOICES
408: There are a number of decisions that 1003.2 leaves up to the implementor,
409: either by explicitly saying ``undefined'' or by virtue of their being
410: forbidden by the RE grammar.
411: This implementation treats them as follows.
412: .PP
413: See
414: .ZR
415: for a discussion of the definition of case-independent matching.
416: .PP
417: There is no particular limit on the length of REs,
418: except insofar as memory is limited.
419: Memory usage is approximately linear in RE size, and largely insensitive
420: to RE complexity, except for bounded repetitions.
421: See BUGS for one short RE using them
422: that will run almost any system out of memory.
423: .PP
424: A backslashed character other than one specifically given a magic meaning
425: by 1003.2 (such magic meanings occur only in obsolete [``basic''] REs)
426: is taken as an ordinary character.
427: .PP
428: Any unmatched [ is a REG_EBRACK error.
429: .PP
430: Equivalence classes cannot begin or end bracket-expression ranges.
431: The endpoint of one range cannot begin another.
432: .PP
433: RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
434: .PP
435: A repetition operator (?, *, +, or bounds) cannot follow another
436: repetition operator.
437: A repetition operator cannot begin an expression or subexpression
438: or follow `^' or `|'.
439: .PP
440: `|' cannot appear first or last in a (sub)expression or after another `|',
441: i.e. an operand of `|' cannot be an empty subexpression.
442: An empty parenthesized subexpression, `()', is legal and matches an
443: empty (sub)string.
444: An empty string is not a legal RE.
445: .PP
446: A `{' followed by a digit is considered the beginning of bounds for a
447: bounded repetition, which must then follow the syntax for bounds.
448: A `{' \fInot\fR followed by a digit is considered an ordinary character.
449: .PP
450: `^' and `$' beginning and ending subexpressions in obsolete (``basic'')
451: REs are anchors, not ordinary characters.
452: .SH SEE ALSO
1.2 cgd 453: grep(1), re_format(7)
1.1 jtc 454: .PP
455: POSIX 1003.2, sections 2.8 (Regular Expression Notation)
456: and
457: B.5 (C Binding for Regular Expression Matching).
458: .SH DIAGNOSTICS
459: Non-zero error codes from
460: .I regcomp
461: and
462: .I regexec
463: include the following:
464: .PP
465: .nf
466: .ta \w'REG_ECOLLATE'u+3n
467: REG_NOMATCH regexec() failed to match
468: REG_BADPAT invalid regular expression
469: REG_ECOLLATE invalid collating element
470: REG_ECTYPE invalid character class
471: REG_EESCAPE \e applied to unescapable character
472: REG_ESUBREG invalid backreference number
473: REG_EBRACK brackets [ ] not balanced
474: REG_EPAREN parentheses ( ) not balanced
475: REG_EBRACE braces { } not balanced
476: REG_BADBR invalid repetition count(s) in { }
477: REG_ERANGE invalid character range in [ ]
478: REG_ESPACE ran out of memory
479: REG_BADRPT ?, *, or + operand invalid
480: REG_EMPTY empty (sub)expression
481: REG_ASSERT ``can't happen''\(emyou found a bug
482: REG_INVARG invalid argument, e.g. negative-length string
483: .fi
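Callers can branch on these codes directly. The sketch below uses a hypothetical helper, compile_status, that simply returns the regcomp result; which code a given malformed pattern produces can differ between implementations, so the cases mentioned afterwards are assumptions rather than guarantees.

```c
#include <regex.h>

/*
 * Hypothetical helper: compile `pattern` and return regcomp's result,
 * 0 on success or one of the REG_* error codes on failure.
 * The compiled form is released immediately on success.
 */
int compile_status(const char *pattern, int cflags)
{
    regex_t re;
    int rc = regcomp(&re, pattern, cflags);

    if (rc == 0)
        regfree(&re);           /* only a successful compile needs freeing */
    return rc;
}
```

For example, with REG_EXTENDED the pattern `(a' is expected to yield REG_EPAREN and `[a' to yield REG_EBRACK.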
484: .SH HISTORY
1.3 ! cgd 485: Originally written by Henry Spencer.
! 486: Altered for inclusion in the 4.4BSD distribution.
1.1 jtc 487: .SH BUGS
488: This is an alpha release with known defects.
489: Please report problems.
490: .PP
491: There is one known functionality bug.
492: The implementation of internationalization is incomplete:
493: the locale is always assumed to be the default one of 1003.2,
494: and only the collating elements etc. of that locale are available.
495: .PP
496: The back-reference code is subtle and doubts linger about its correctness
497: in complex cases.
498: .PP
499: .I Regexec
500: performance is poor.
501: This will improve with later releases.
502: .I Nmatch
503: exceeding 0 is expensive;
504: .I nmatch
505: exceeding 1 is worse.
506: .I Regexec
507: is largely insensitive to RE complexity \fIexcept\fR that back
508: references are massively expensive.
509: RE length does matter; in particular, there is a strong speed bonus
510: for keeping RE length under about 30 characters,
511: with most special characters counting roughly double.
512: .PP
513: .I Regcomp
514: implements bounded repetitions by macro expansion,
515: which is costly in time and space if counts are large
516: or bounded repetitions are nested.
517: An RE like, say,
518: `((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
519: will (eventually) run almost any existing machine out of swap space.
520: .PP
521: There are suspected problems with response to obscure error conditions.
522: Notably,
523: certain kinds of internal overflow,
524: produced only by truly enormous REs or by multiply nested bounded repetitions,
525: are probably not handled well.
526: .PP
527: Due to a mistake in 1003.2, things like `a)b' are legal REs because `)' is
528: a special character only in the presence of a previous unmatched `('.
529: This can't be fixed until the spec is fixed.
530: .PP
531: The standard's definition of back references is vague.
532: For example, does
533: `a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
534: Until the standard is clarified,
535: behavior in such cases should not be relied on.
536: .PP
537: The implementation of word-boundary matching is a bit of a kludge,
538: and bugs may lurk in combinations of word-boundary matching and anchoring.