Annotation of src/external/bsd/tre/dist/README.md, Revision 1.1.1.1
1.1 rin 1: Introduction
2: ============
3:
4: TRE is a lightweight, robust, and efficient POSIX compliant regexp
5: matching library with some exciting features such as approximate
6: (fuzzy) matching.
7:
8: The matching algorithm used in TRE uses linear worst-case time in
9: the length of the text being searched, and quadratic worst-case
10: time in the length of the used regular expression.
11:
12: In other words, the time complexity of the algorithm is O(M^2N), where
13: M is the length of the regular expression and N is the length of the
14: text. The used space is also quadratic on the length of the regex, but
15: does not depend on the searched string. This quadratic behaviour
16: occurs only on pathological cases which are probably very rare in
17: practice.
18:
19:
20: Hacking
21: =======
22:
23: Here's how to work with this code.
24:
25: Prerequisites
26: -------------
27:
28: You will need the following tools installed on your system:
29:
30: - autoconf
31: - automake
32: - gettext
33: - libtool
34: - zip (optional)
35:
36:
37: Building
38: --------
39:
40: First, prepare the tre. Change to the root of the source directory
41: and run
42: ```
43: ./utils/autogen.sh
44: ```
45: This will regenerate various things using the prerequisite tools so
46: that you end up with a buildable tree.
47:
48: After this, you can run the configure script and build TRE as usual:
49: ```
50: ./configure
51: make
52: make check
53: make install
54: ```
55:
56:
57: Building a source code package
58: ------------------------------
59:
60: In a prepared tree, this command creates a source code tarball:
61: ```
62: ./configure && make dist
63: ```
64:
65: Alternatively, you can run
66: ```
67: ./utils/build-sources.sh
68: ```
69: which builds the source code packages and puts them in the `dist`
70: subdirectory. This script needs a working `zip` command.
71:
72:
73: Features
74: ========
75:
76: TRE is not just yet another regexp matcher. TRE has some features
77: which are not there in most free POSIX compatible implementations.
78: Most of these features are not present in non-free implementations
79: either, for that matter.
80:
81: Approximate matching
82: --------------------
83:
84: Approximate pattern matching allows matches to be approximate, that
85: is, allows the matches to be close to the searched pattern under
86: some measure of closeness. TRE uses the edit-distance measure (also
87: known as the Levenshtein distance) where characters can be
88: inserted, deleted, or substituted in the searched text in order to
89: get an exact match.
90:
91: Each insertion, deletion, or substitution adds the distance, or cost,
92: of the match. TRE can report the matches which have a cost lower than
93: some given threshold value. TRE can also be used to search for matches
94: with the lowest cost.
95:
96: TRE includes a version of the agrep (approximate grep) command line
97: tool for approximate regexp matching in the style of grep. Unlike
98: other agrep implementations (like the one by Sun Wu and Udi Manber
99: from University of Arizona) TRE agrep allows full regexps of any
100: length, any number of errors, and non-uniform costs for insertion,
101: deletion and substitution.
102:
103: Strict standard conformance
104: ---------------------------
105:
106: POSIX defines the behaviour of regexp functions precisely. TRE
107: attempts to conform to these specifications as strictly as possible.
108: TRE always returns the correct matches for subpatterns, for example.
109: Very few other implementations do this correctly. In fact, the only
110: other implementations besides TRE that I am aware of (free or not)
111: that get it right are Rx by Tom Lord, Regex++ by John Maddock, and the
112: AT&T ast regex by Glenn Fowler and Doug McIlroy.
113:
114: The standard TRE tries to conform to is the IEEE Std 1003.1-2001,
115: or Open Group Base Specifications Issue 6, commonly referred to as
116: "POSIX". It can be found online here. The relevant parts are the
117: base specifications on regular expressions (and the rationale) and
118: the description of the regcomp() API.
119:
120: For an excellent survey on POSIX regexp matchers, see the testregex
121: pages by Glenn Fowler of AT&T Labs Research.
122:
123: Predictable matching speed
124: --------------------------
125:
126: Because of the matching algorithm used in TRE, the maximum time
127: consumed by any regexec() call is always directly proportional to
128: the length of the searched string. There is one exception: if back
129: references are used, the matching may take time that grows
130: exponentially with the length of the string. This is because
131: matching back references is an NP complete problem, and almost
132: certainly requires exponential time to match in the worst case.
133:
134: Predictable and modest memory consumption
135: -----------------------------------------
136:
137: A regexec() call never allocates memory from the heap. TRE
138: allocates all the memory it needs during a regcomp() call, and some
139: temporary working space from the stack frame for the duration of
140: the regexec() call. The amount of temporary space needed is
141: constant during matching and does not depend on the searched
142: string. For regexps of reasonable size TRE needs less than 50K of
143: dynamically allocated memory during the regcomp() call, less than
144: 20K for the compiled pattern buffer, and less than two kilobytes of
145: temporary working space from the stack frame during a regexec()
146: call. There is no time/memory tradeoff. TRE is also small in code
147: size; statically linking with TRE increases the executable size
148: less than 30K (gcc-3.2, x86, GNU/Linux).
149:
150: Wide character and multibyte character set support
151: --------------------------------------------------
152:
153: TRE supports multibyte character sets. This makes it possible to
154: use regexps seamlessly with, for example, Japanese locales. TRE
155: also provides a wide character API.
156:
157: Binary pattern and data support
158: -------------------------------
159:
160: TRE provides APIs which allow binary zero characters both in
161: regexps and searched strings. The standard API cannot be easily
162: used to, for example, search for printable words from binary data
163: (although it is possible with some hacking). Searching for patterns
164: which contain binary zeroes embedded is not possible at all with
165: the standard API.
166:
167: Completely thread safe
168: ----------------------
169:
170: TRE is completely thread safe. All the exported functions are
171: re-entrant, and a single compiled regexp object can be used
172: simultaneously in multiple contexts; e.g. in main() and a signal
173: handler, or in many threads of a multithreaded application.
174:
175: Portable
176: --------
177:
178: TRE is portable across multiple platforms. Here's a table of
179: platforms and compilers that have been successfully used to compile
180: and run TRE:
181:
182: <table>
183: <tr><th>Platform(s)</th> <th>Compiler(s)</th></tr>
184: <tr><td>AIX 4.3.2 - 5.3.0</td> <td>GCC, C for AIX compiler version 5</td></tr>
185: <tr><td>Compaq Tru64 UNIX V5.1A/B</td> <td>Compaq C V6.4-014 - V6.5-011</td></tr>
186: <tr><td>Cygwin 1.3 - 1.5</td> <td>GCC</td></tr>
187: <tr><td>Digital UNIX V4.0</td> <td>DEC C V5.9-005</td></tr>
188: <tr><td>FreeBSD 4 and above</td> <td>GCC</td></tr>
189: <tr><td>GNU/Linux systems on x86, x86_64, ppc64, s390</td><td>GCC</td></tr>
190: <tr><td>HP-UX 10.20- 11.00</td> <td>GCC, HP C Compiler</td></tr>
191: <tr><td>IRIX 6.5</td> <td>GCC, MIPSpro Compilers 7.3.1.3m</td></tr>
192: <tr><td>Max OS X</td></tr>
193: <tr><td>NetBSD 1.5 and above</td> <td>GCC, egcs</td></tr>
194: <tr><td>OpenBSD 3.3 and above</td> <td>GCC</td></tr>
195: <tr><td>Solaris 2.7-10 sparc/x86</td> <td>GCC, Sun Workshop 6 compilers</td></tr>
196: <tr><td>Windows 98 - XP</td> <td>Microsoft Visual C++ 6.0</td></tr>
197: </table>
198:
199:
200: TRE 0.7.5 should compile without changes on all of the above
201: platforms. Tell me if you are using TRE on a platform that is not
202: listed above, and I'll add it to the list. Also let me know if TRE
203: does not work on a listed platform.
204:
205: Depending on the platform, you may need to install libutf8 to get
206: wide character and multibyte character set support.
207:
208: Free
209: ----
210:
211: TRE is released under a license which is essentially the same as
212: the "2 clause" BSD-style license used in NetBSD. See the file
213: LICENSE for details.
214:
215: Roadmap
216: -------
217:
218: There are currently two features, both related to collating
219: elements, missing from 100% POSIX compliance. These are:
220:
221: * Support for collating elements (e.g. [[.<X>.]], where <X> is a
222: collating element). It is not possible to support
223: multi-character collating elements portably, since POSIX does
224: not define a way to determine whether a character sequence is a
225: multi-character collating element or not.
226:
227: * Support for equivalence classes, for example [[=<X>=]], where
228: <X> is a collating element. An equivalence class matches any
229: character which has the same primary collation weight as
230: <X>. Again, POSIX provides no portable mechanism for
231: determining the primary collation weight of a collating
232: element.
233:
234: Note that other portable regexp implementations don't support
235: collating elements either. The single exception is Regex++, which
236: comes with its own database for collating elements for different
237: locales. Support for collating elements and equivalence classes has
238: not been widely requested and is not very high on the TODO list at
239: the moment.
240:
241: These are other features I'm planning to implement real soon now:
242:
243: * All the missing GNU extensions enabled in GNU regex, such as
244: [[:<:]] and [[:>:]]
245:
246: * A REG_SHORTEST regexec() flag for returning the shortest match
247: instead of the longest match.
248:
249: * Perl-compatible syntax
250: * `[:^class:]`
251: * Matches anything but the characters in class. Note that
252: * [^[:class:]] works already, this would be just a
253: * convenience shorthand.
254: *
255: * `\A`
256: * Match only at beginning of string
257: *
258: * `\Z`
259: * Match only at end of string, or before newline at the end
260: *
261: * `\z`
262: * Match only at end of string
263: *
264: * `\l`
265: * Lowercase next char (think vi)
266: *
267: * `\u`
268: * Uppercase next char (think vi)
269: *
270: * `\L`
271: * Lowercase till \E (think vi)
272: *
273: * `\U`
274: * Uppercase till \E (think vi)
275: *
276: * `(?=pattern)`
277: * Zero-width positive look-ahead assertions.
278: *
279: * `(?!pattern)`
280: * Zero-width negative look-ahead assertions.
281: *
282: * `(?<=pattern)`
283: * Zero-width positive look-behind assertions.
284: *
285: * `(?<!pattern)`
286: * Zero-width negative look-behind assertions.
287:
288: Documentation especially for the nonstandard features of TRE, such
289: as approximate matching, is a work in progress (with "progress"
290: loosely defined...)
CVSweb <webmaster@jp.NetBSD.org>