The NetBSD Project

CVS log for src/usr.sbin/makemandb/custom_apropos_tokenizer.c


Default branch: MAIN


Revision 1.6, Mon Aug 7 20:35:21 2023 UTC (6 months, 3 weeks ago) by tnn
Branch: MAIN
CVS Tags: HEAD
Changes since 1.5: +2 -2 lines

makemandb: don't return uninitialized token length if stemming fails

Revision 1.5, Thu Aug 3 07:49:23 2023 UTC (7 months ago) by rin
Branch: MAIN
Changes since 1.4: +27 -27 lines

makemandb: trailing whitespace

Revision 1.4, Sun Dec 5 08:03:32 2021 UTC (2 years, 2 months ago) by wiz
Branch: MAIN
CVS Tags: netbsd-10-base, netbsd-10-0-RC5, netbsd-10-0-RC4, netbsd-10-0-RC3, netbsd-10-0-RC2, netbsd-10-0-RC1, netbsd-10
Changes since 1.3: +2 -2 lines

preceds -> precedes

Revision 1.3, Sun Dec 5 07:13:49 2021 UTC (2 years, 2 months ago) by msaitoh
Branch: MAIN
Changes since 1.2: +2 -2 lines

s/preceed/preced/ in comment.

Revision 1.2, Tue Oct 31 10:14:27 2017 UTC (6 years, 4 months ago) by abhinav
Branch: MAIN
CVS Tags: phil-wifi-base, phil-wifi-20200421, phil-wifi-20200411, phil-wifi-20200406, phil-wifi-20191119, phil-wifi-20190609, phil-wifi, pgoyette-compat-merge-20190127, pgoyette-compat-base, pgoyette-compat-20190127, pgoyette-compat-20190118, pgoyette-compat-1226, pgoyette-compat-1126, pgoyette-compat-1020, pgoyette-compat-0930, pgoyette-compat-0906, pgoyette-compat-0728, pgoyette-compat-0625, pgoyette-compat-0521, pgoyette-compat-0502, pgoyette-compat-0422, pgoyette-compat-0415, pgoyette-compat-0407, pgoyette-compat-0330, pgoyette-compat-0322, pgoyette-compat-0315, pgoyette-compat, netbsd-9-base, netbsd-9-3-RELEASE, netbsd-9-2-RELEASE, netbsd-9-1-RELEASE, netbsd-9-0-RELEASE, netbsd-9-0-RC2, netbsd-9-0-RC1, netbsd-9, is-mlppp-base, is-mlppp, cjep_sun2x-base1, cjep_sun2x-base, cjep_sun2x, cjep_staticlib_x-base1, cjep_staticlib_x-base, cjep_staticlib_x
Changes since 1.1: +5 -2 lines

Casting a variable of type int * to size_t * may cause alignment
issues on some platforms (e.g. sparc64), so use a temporary variable
to avoid the cast.

Thanks to Martin@ for noticing the issue and also suggesting the fix.
Fixes PR bin/52678
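
The pattern described in this commit can be illustrated with a minimal sketch. The function names here (`measure_token`, `token_length`) are hypothetical and not taken from custom_apropos_tokenizer.c; the sketch only assumes a callee that reports a length through a `size_t *` and a caller that must hand it back through an `int *`.

```c
#include <stddef.h>

/* Hypothetical stand-in for a helper that reports a length via size_t *. */
static void
measure_token(const char *s, size_t *len_out)
{
	size_t n = 0;

	while (s[n] != '\0')
		n++;
	*len_out = n;
}

/*
 * Caller whose interface returns the length through an int *.
 * Passing (size_t *)nbytes would let the callee store sizeof(size_t)
 * bytes into an int-sized, int-aligned object. A size_t temporary
 * avoids the cast; the value is narrowed explicitly afterwards.
 */
static void
token_length(const char *s, int *nbytes)
{
	size_t tmp;

	measure_token(s, &tmp);
	*nbytes = (int)tmp;
}
```

On platforms where `int` is 4 bytes and `size_t` is 8, the cast version writes past the `int` and may use a misaligned store; the temporary keeps both objects correctly sized and aligned.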

Revision 1.1, Sun Jun 18 16:24:10 2017 UTC (6 years, 8 months ago) by abhinav
Branch: MAIN
CVS Tags: perseant-stdc-iso10646-base, perseant-stdc-iso10646

Add a custom tokenizer which does not stem certain keywords.

Which keywords should not be stemmed is specified in the nostem.txt file.
(Right now I have taken all the man page names, split them if they had
underscores, removed common English words and converted everything to
lowercase.)
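
As a rough sketch of the idea, a tokenizer could consult the keyword list before stemming a token. How the real code stores and searches the list is not described in this log; the sorted array, the `bsearch` lookup, the name `is_nostem` and the three sample entries below are all assumptions made for illustration.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Illustrative entries only; the real keywords come from nostem.txt.
 * The array must stay sorted in strcmp() order for bsearch() to work.
 */
static const char *const nostem[] = { "lwp", "makemandb", "unconst" };

static int
cmp_keyword(const void *key, const void *elem)
{
	return strcmp((const char *)key, *(const char *const *)elem);
}

/*
 * Returns 1 if the (already lowercased) token is on the no-stem list
 * and should be indexed verbatim, 0 if it may be passed to the stemmer.
 */
static int
is_nostem(const char *token)
{
	return bsearch(token, nostem, sizeof(nostem) / sizeof(nostem[0]),
	    sizeof(nostem[0]), cmp_keyword) != NULL;
}
```

A caller would stem a token only when `is_nostem(token)` returns 0.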

The tokenizer itself is based on the Porter stemming tokenizer shipped with
SQLite. The code in custom_apropos_tokenizer.c is a copy of that code with
some modifications to prevent stemming keywords specified in nostem.txt.

Additionally, the tokenizer now treats underscore `_' as a token delimiter.
A query for `lwp' therefore matches all `_lwp_*' man page names, and a query
for `unconst' matches `__UNCONST'. This was not possible earlier: underscore
was not a delimiter, so the index would have __UNCONST as a key rather than
UNCONST.
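
The effect of the delimiter change can be sketched as follows. This is not the code from custom_apropos_tokenizer.c; `is_delim` and `next_token` are hypothetical names, and the sketch assumes the C locale and that any byte other than an ASCII letter or digit separates tokens.

```c
#include <ctype.h>
#include <stddef.h>

/* Anything that is not a letter or digit is a delimiter, so '_' is one. */
static int
is_delim(unsigned char ch)
{
	return !isalnum(ch);
}

/*
 * Copy the next token of `in`, starting at offset *pos, into `out`
 * (lowercased and NUL-terminated, truncated to fit outsz). Advances
 * *pos past the token. Returns the number of bytes stored, or 0 when
 * no token is left.
 */
static size_t
next_token(const char *in, size_t *pos, char *out, size_t outsz)
{
	size_t i = *pos, n = 0;

	/* Skip leading delimiters such as the "__" in "__UNCONST". */
	while (in[i] != '\0' && is_delim((unsigned char)in[i]))
		i++;
	/* Collect token bytes until the next delimiter or end of input. */
	while (in[i] != '\0' && !is_delim((unsigned char)in[i])) {
		if (n + 1 < outsz)
			out[n++] = (char)tolower((unsigned char)in[i]);
		i++;
	}
	if (outsz > 0)
		out[n] = '\0';
	*pos = i;
	return n;
}
```

With this rule, `__UNCONST` yields the single token `unconst`, and `_lwp_create` yields `lwp` followed by `create`, which is why a query for `lwp` can match the `_lwp_*` names.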

The tokenizer needs the fts3_tokenizer.h file, which is not shipped with the
amalgamation build of SQLite, therefore it needs to be added here (unless
we decide there is a better place for it).

To enforce using the new tokenizer, a schema version bump is needed.

Since tokenization is done both at indexing time (via makemandb) and at
query time (via apropos or whatis), the schema version must be bumped
every time nostem.txt is modified. Otherwise the index will consist of old
tokens and the desired changes will not be seen with apropos.

This should also fix the issue reported in PR bin/46255. A similar suggestion
was also made on tech-userlevel@ recently:
<http://mail-index.netbsd.org/tech-userlevel/2017/06/08/msg010620.html>

Thanks to christos@ for multiple rounds of reviews of the tokenizer code.
