src/external/bsd/tre/dist/README.md - annotate

Return to README.md CVS log
Up to [cvs.NetBSD.org] / src / external / bsd / tre / dist
Annotation of src/external/bsd/tre/dist/README.md, Revision 1.1.1.1

1.1       rin         1: Introduction
                      2: ============
                      3:
                      4: TRE is a lightweight, robust, and efficient POSIX compliant regexp
                      5: matching library with some exciting features such as approximate
                      6: (fuzzy) matching.
                      7:
                      8: The matching algorithm used in TRE uses linear worst-case time in
                      9: the length of the text being searched, and quadratic worst-case
                     10: time in the length of the used regular expression.
                     11:
                     12: In other words, the time complexity of the algorithm is O(M^2N), where
                     13: M is the length of the regular expression and N is the length of the
                     14: text. The used space is also quadratic on the length of the regex, but
                     15: does not depend on the searched string. This quadratic behaviour
                     16: occurs only on pathological cases which are probably very rare in
                     17: practice.
                     18:
                     19:
                     20: Hacking
                     21: =======
                     22:
                     23: Here's how to work with this code.
                     24:
                     25: Prerequisites
                     26: -------------
                     27:
                     28: You will need the following tools installed on your system:
                     29:
                     30:   - autoconf
                     31:   - automake
                     32:   - gettext
                     33:   - libtool
                     34:   - zip (optional)
                     35:
                     36:
                     37: Building
                     38: --------
                     39:
                     40: First, prepare the tre.  Change to the root of the source directory
                     41: and run
                     42: ```
                     43: ./utils/autogen.sh
                     44: ```
                     45: This will regenerate various things using the prerequisite tools so
                     46: that you end up with a buildable tree.
                     47:
                     48: After this, you can run the configure script and build TRE as usual:
                     49: ```
                     50: ./configure
                     51: make
                     52: make check
                     53: make install
                     54: ```
                     55:
                     56:
                     57: Building a source code package
                     58: ------------------------------
                     59:
                     60: In a prepared tree, this command creates a source code tarball:
                     61: ```
                     62: ./configure && make dist
                     63: ```
                     64:
                     65: Alternatively, you can run
                     66: ```
                     67: ./utils/build-sources.sh
                     68: ```
                     69: which builds the source code packages and puts them in the `dist`
                     70: subdirectory.  This script needs a working `zip` command.
                     71:
                     72:
                     73: Features
                     74: ========
                     75:
                     76: TRE is not just yet another regexp matcher. TRE has some features
                     77: which are not there in most free POSIX compatible implementations.
                     78: Most of these features are not present in non-free implementations
                     79: either, for that matter.
                     80:
                     81: Approximate matching
                     82: --------------------
                     83:
                     84: Approximate pattern matching allows matches to be approximate, that
                     85: is, allows the matches to be close to the searched pattern under
                     86: some measure of closeness. TRE uses the edit-distance measure (also
                     87: known as the Levenshtein distance) where characters can be
                     88: inserted, deleted, or substituted in the searched text in order to
                     89: get an exact match.
                     90:
                     91: Each insertion, deletion, or substitution adds the distance, or cost,
                     92: of the match. TRE can report the matches which have a cost lower than
                     93: some given threshold value. TRE can also be used to search for matches
                     94: with the lowest cost.
                     95:
                     96: TRE includes a version of the agrep (approximate grep) command line
                     97: tool for approximate regexp matching in the style of grep. Unlike
                     98: other agrep implementations (like the one by Sun Wu and Udi Manber
                     99: from University of Arizona) TRE agrep allows full regexps of any
                    100: length, any number of errors, and non-uniform costs for insertion,
                    101: deletion and substitution.
                    102:
                    103: Strict standard conformance
                    104: ---------------------------
                    105:
                    106: POSIX defines the behaviour of regexp functions precisely. TRE
                    107: attempts to conform to these specifications as strictly as possible.
                    108: TRE always returns the correct matches for subpatterns, for example.
                    109: Very few other implementations do this correctly.  In fact, the only
                    110: other implementations besides TRE that I am aware of (free or not)
                    111: that get it right are Rx by Tom Lord, Regex++ by John Maddock, and the
                    112: AT&T ast regex by Glenn Fowler and Doug McIlroy.
                    113:
                    114: The standard TRE tries to conform to is the IEEE Std 1003.1-2001,
                    115: or Open Group Base Specifications Issue 6, commonly referred to as
                    116: "POSIX".  It can be found online here. The relevant parts are the
                    117: base specifications on regular expressions (and the rationale) and
                    118: the description of the regcomp() API.
                    119:
                    120: For an excellent survey on POSIX regexp matchers, see the testregex
                    121: pages by Glenn Fowler of AT&T Labs Research.
                    122:
                    123: Predictable matching speed
                    124: --------------------------
                    125:
                    126: Because of the matching algorithm used in TRE, the maximum time
                    127: consumed by any regexec() call is always directly proportional to
                    128: the length of the searched string. There is one exception: if back
                    129: references are used, the matching may take time that grows
                    130: exponentially with the length of the string. This is because
                    131: matching back references is an NP complete problem, and almost
                    132: certainly requires exponential time to match in the worst case.
                    133:
                    134: Predictable and modest memory consumption
                    135: -----------------------------------------
                    136:
                    137: A regexec() call never allocates memory from the heap. TRE
                    138: allocates all the memory it needs during a regcomp() call, and some
                    139: temporary working space from the stack frame for the duration of
                    140: the regexec() call. The amount of temporary space needed is
                    141: constant during matching and does not depend on the searched
                    142: string. For regexps of reasonable size TRE needs less than 50K of
                    143: dynamically allocated memory during the regcomp() call, less than
                    144: 20K for the compiled pattern buffer, and less than two kilobytes of
                    145: temporary working space from the stack frame during a regexec()
                    146: call. There is no time/memory tradeoff. TRE is also small in code
                    147: size; statically linking with TRE increases the executable size
                    148: less than 30K (gcc-3.2, x86, GNU/Linux).
                    149:
                    150: Wide character and multibyte character set support
                    151: --------------------------------------------------
                    152:
                    153: TRE supports multibyte character sets. This makes it possible to
                    154: use regexps seamlessly with, for example, Japanese locales. TRE
                    155: also provides a wide character API.
                    156:
                    157: Binary pattern and data support
                    158: -------------------------------
                    159:
                    160: TRE provides APIs which allow binary zero characters both in
                    161: regexps and searched strings. The standard API cannot be easily
                    162: used to, for example, search for printable words from binary data
                    163: (although it is possible with some hacking). Searching for patterns
                    164: which contain binary zeroes embedded is not possible at all with
                    165: the standard API.
                    166:
                    167: Completely thread safe
                    168: ----------------------
                    169:
                    170: TRE is completely thread safe. All the exported functions are
                    171: re-entrant, and a single compiled regexp object can be used
                    172: simultaneously in multiple contexts; e.g. in main() and a signal
                    173: handler, or in many threads of a multithreaded application.
                    174:
                    175: Portable
                    176: --------
                    177:
                    178: TRE is portable across multiple platforms. Here's a table of
                    179: platforms and compilers that have been successfully used to compile
                    180: and run TRE:
                    181:
                    182: <table>
                    183:    <tr><th>Platform(s)</th>                    <th>Compiler(s)</th></tr>
                    184:    <tr><td>AIX 4.3.2 - 5.3.0</td>              <td>GCC, C for AIX compiler version 5</td></tr>
                    185:    <tr><td>Compaq Tru64 UNIX V5.1A/B</td>      <td>Compaq C V6.4-014 - V6.5-011</td></tr>
                    186:    <tr><td>Cygwin 1.3 - 1.5</td>               <td>GCC</td></tr>
                    187:    <tr><td>Digital UNIX V4.0</td>              <td>DEC C V5.9-005</td></tr>
                    188:    <tr><td>FreeBSD 4 and above</td>            <td>GCC</td></tr>
                    189:    <tr><td>GNU/Linux systems on x86, x86_64, ppc64, s390</td><td>GCC</td></tr>
                    190:    <tr><td>HP-UX 10.20- 11.00</td>             <td>GCC, HP C Compiler</td></tr>
                    191:    <tr><td>IRIX 6.5</td>                       <td>GCC, MIPSpro Compilers 7.3.1.3m</td></tr>
                    192:    <tr><td>Max OS X</td></tr>
                    193:    <tr><td>NetBSD 1.5 and above</td>           <td>GCC, egcs</td></tr>
                    194:    <tr><td>OpenBSD 3.3 and above</td>          <td>GCC</td></tr>
                    195:    <tr><td>Solaris 2.7-10 sparc/x86</td>       <td>GCC, Sun Workshop 6 compilers</td></tr>
                    196:    <tr><td>Windows 98 - XP</td>                <td>Microsoft Visual C++ 6.0</td></tr>
                    197: </table>
                    198:
                    199:
                    200: TRE 0.7.5 should compile without changes on all of the above
                    201: platforms.  Tell me if you are using TRE on a platform that is not
                    202: listed above, and I'll add it to the list. Also let me know if TRE
                    203: does not work on a listed platform.
                    204:
                    205: Depending on the platform, you may need to install libutf8 to get
                    206: wide character and multibyte character set support.
                    207:
                    208: Free
                    209: ----
                    210:
                    211: TRE is released under a license which is essentially the same as
                    212: the "2 clause" BSD-style license used in NetBSD.  See the file
                    213: LICENSE for details.
                    214:
                    215: Roadmap
                    216: -------
                    217:
                    218: There are currently two features, both related to collating
                    219: elements, missing from 100% POSIX compliance. These are:
                    220:
                    221: * Support for collating elements (e.g. [[.<X>.]], where <X> is a
                    222:   collating element). It is not possible to support
                    223:   multi-character collating elements portably, since POSIX does
                    224:   not define a way to determine whether a character sequence is a
                    225:   multi-character collating element or not.
                    226:
                    227: * Support for equivalence classes, for example [[=<X>=]], where
                    228:   <X> is a collating element. An equivalence class matches any
                    229:   character which has the same primary collation weight as
                    230:   <X>. Again, POSIX provides no portable mechanism for
                    231:   determining the primary collation weight of a collating
                    232:   element.
                    233:
                    234: Note that other portable regexp implementations don't support
                    235: collating elements either. The single exception is Regex++, which
                    236: comes with its own database for collating elements for different
                    237: locales. Support for collating elements and equivalence classes has
                    238: not been widely requested and is not very high on the TODO list at
                    239: the moment.
                    240:
                    241: These are other features I'm planning to implement real soon now:
                    242:
                    243: * All the missing GNU extensions enabled in GNU regex, such as
                    244:   [[:<:]] and [[:>:]]
                    245:
                    246: * A REG_SHORTEST regexec() flag for returning the shortest match
                    247:   instead of the longest match.
                    248:
                    249: * Perl-compatible syntax
                    250:   * `[:^class:]`
                    251:   *  Matches anything but the characters in class. Note that
                    252:   *  [^[:class:]] works already, this would be just a
                    253:   *  convenience shorthand.
                    254:   *
                    255:   * `\A`
                    256:   * Match only at beginning of string
                    257:   *
                    258:   * `\Z`
                    259:   * Match only at end of string, or before newline at the end
                    260:   *
                    261:   * `\z`
                    262:   * Match only at end of string
                    263:   *
                    264:   * `\l`
                    265:   * Lowercase next char (think vi)
                    266:   *
                    267:   * `\u`
                    268:   * Uppercase next char (think vi)
                    269:   *
                    270:   * `\L`
                    271:   * Lowercase till \E (think vi)
                    272:   *
                    273:   * `\U`
                    274:   * Uppercase till \E (think vi)
                    275:   *
                    276:   * `(?=pattern)`
                    277:   * Zero-width positive look-ahead assertions.
                    278:   *
                    279:   * `(?!pattern)`
                    280:   * Zero-width negative look-ahead assertions.
                    281:   *
                    282:   * `(?<=pattern)`
                    283:   * Zero-width positive look-behind assertions.
                    284:   *
                    285:   * `(?<!pattern)`
                    286:   * Zero-width negative look-behind assertions.
                    287:
                    288: Documentation especially for the nonstandard features of TRE, such
                    289: as approximate matching, is a work in progress (with "progress"
                    290: loosely defined...)
CVSweb <webmaster@jp.NetBSD.org>