The NetBSD Project

CVS log for src/usr.sbin/makemandb/custom_apropos_tokenizer.c


Keyword substitution: kv
Default branch: MAIN


Revision 1.6
Mon Aug 7 20:35:21 2023 UTC (15 months, 4 weeks ago) by tnn
Branches: MAIN
CVS tags: perseant-exfatfs-base-20240630, perseant-exfatfs-base, perseant-exfatfs, HEAD
Changes since revision 1.5: +2 -2 lines
makemandb: don't return uninitialized token length if stemming fails
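
The kind of bug this message describes can be sketched as below. This is a minimal illustration under stated assumptions, not the actual change: `try_stem` and `stemmed_length` are hypothetical names, and the stand-in "stemmer" only copies its input. The point is that the length is given a defined value before the call, so the failure path never returns garbage.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a stemmer. On success it writes the result to
 * out, stores its length in *out_len and returns 0. On failure it returns
 * -1 and leaves *out_len untouched.
 */
static int
try_stem(const char *in, size_t in_len, char *out, size_t out_size,
    size_t *out_len)
{
	if (in_len == 0 || in_len >= out_size)
		return -1;
	memcpy(out, in, in_len);
	out[in_len] = '\0';
	*out_len = in_len;
	return 0;
}

/* Returns the stemmed token length, or 0 if stemming failed. */
static size_t
stemmed_length(const char *in, size_t in_len, char *out, size_t out_size)
{
	size_t len = 0;	/* initialized, so the failure path is well defined */

	if (try_stem(in, in_len, out, out_size, &len) != 0)
		return 0;
	return len;
}
```

Without the initializer and the explicit failure check, a caller would read `len` after a failed call and get an indeterminate value.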

Revision 1.5
Thu Aug 3 07:49:23 2023 UTC (16 months ago) by rin
Branches: MAIN
Changes since revision 1.4: +27 -27 lines
makemandb: trailing whitespace

Revision 1.4
Sun Dec 5 08:03:32 2021 UTC (3 years ago) by wiz
Branches: MAIN
CVS tags: netbsd-10-base, netbsd-10-0-RELEASE, netbsd-10-0-RC6, netbsd-10-0-RC5, netbsd-10-0-RC4, netbsd-10-0-RC3, netbsd-10-0-RC2, netbsd-10-0-RC1, netbsd-10
Changes since revision 1.3: +2 -2 lines
preceds -> precedes

Revision 1.3
Sun Dec 5 07:13:49 2021 UTC (3 years ago) by msaitoh
Branches: MAIN
Changes since revision 1.2: +2 -2 lines
s/preceed/preced/ in comment.

Revision 1.2
Tue Oct 31 10:14:27 2017 UTC (7 years, 1 month ago) by abhinav
Branches: MAIN
CVS tags: phil-wifi-base, phil-wifi-20200421, phil-wifi-20200411, phil-wifi-20200406, phil-wifi-20191119, phil-wifi-20190609, phil-wifi, pgoyette-compat-merge-20190127, pgoyette-compat-base, pgoyette-compat-20190127, pgoyette-compat-20190118, pgoyette-compat-1226, pgoyette-compat-1126, pgoyette-compat-1020, pgoyette-compat-0930, pgoyette-compat-0906, pgoyette-compat-0728, pgoyette-compat-0625, pgoyette-compat-0521, pgoyette-compat-0502, pgoyette-compat-0422, pgoyette-compat-0415, pgoyette-compat-0407, pgoyette-compat-0330, pgoyette-compat-0322, pgoyette-compat-0315, pgoyette-compat, netbsd-9-base, netbsd-9-4-RELEASE, netbsd-9-3-RELEASE, netbsd-9-2-RELEASE, netbsd-9-1-RELEASE, netbsd-9-0-RELEASE, netbsd-9-0-RC2, netbsd-9-0-RC1, netbsd-9, is-mlppp-base, is-mlppp, cjep_sun2x-base1, cjep_sun2x-base, cjep_sun2x, cjep_staticlib_x-base1, cjep_staticlib_x-base, cjep_staticlib_x
Changes since revision 1.1: +5 -2 lines
Casting a variable of type int * to size_t * may cause alignment issues
on some platforms (e.g. sparc64), so use a temporary variable to avoid
the cast.

Thanks to Martin@ for noticing the issue and also suggesting the fix.
Fixes PR bin/52678
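
The pattern described above can be sketched as follows. This is a minimal illustration rather than the actual diff; `get_length` and `token_length` are hypothetical names introduced only for the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper that reports a length through a size_t pointer. */
static void
get_length(const char *s, size_t *out)
{
	*out = strlen(s);
}

/*
 * Problematic form (not used here): get_length(s, (size_t *)&len) with
 * "int len". On an LP64 platform size_t is 8 bytes and int is 4, so the
 * store goes through a pointer to an object that is too small and may not
 * be suitably aligned.
 *
 * Safer form: let the callee write into a real size_t, then convert.
 */
static int
token_length(const char *s)
{
	size_t tmp;
	int len;

	get_length(s, &tmp);
	len = (int)tmp;
	return len;
}
```

The temporary costs one extra assignment and keeps the callee writing into an object of the type it expects.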

Revision 1.1
Sun Jun 18 16:24:10 2017 UTC (7 years, 5 months ago) by abhinav
Branches: MAIN
CVS tags: perseant-stdc-iso10646-base, perseant-stdc-iso10646
Add a custom tokenizer which does not stem certain keywords.

The keywords that should not be stemmed are listed in the nostem.txt file.
(Right now I have taken all the man page names, split them if they had
underscores, removed common English words and converted everything to
lowercase.)

The tokenizer itself is based on the Porter stemming tokenizer shipped with
SQLite. The code in custom_apropos_tokenizer.c is a copy of that code with
some modifications to prevent stemming keywords specified in nostem.txt.
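
A minimal sketch of the "do not stem these keywords" check, assuming the keyword list is available as a sorted array. The three entries and the names `nostem`, `cmp_keyword` and `is_nostem` are hypothetical; how the real tokenizer stores and searches the contents of nostem.txt is not shown in this log.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical excerpt of the keyword list; must stay sorted for bsearch(). */
static const char *const nostem[] = { "execve", "lwp", "unconst" };

/* bsearch() comparator: key is a string, elem points at an array slot. */
static int
cmp_keyword(const void *key, const void *elem)
{
	return strcmp((const char *)key, *(const char *const *)elem);
}

/* Returns nonzero if token is listed and should therefore skip stemming. */
static int
is_nostem(const char *token)
{
	return bsearch(token, nostem, sizeof(nostem) / sizeof(nostem[0]),
	    sizeof(nostem[0]), cmp_keyword) != NULL;
}
```

A tokenizer would call `is_nostem()` on each lowercased token and run the stemmer only when it returns zero.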

Additionally, it now treats underscore `_' as a token delimiter. It is
therefore possible to query for `lwp' and match all `_lwp_*' man page names,
or to query for `unconst' and match `__UNCONST'. This was not possible
earlier because underscore was not a delimiter, so the index contained
__UNCONST as a key rather than UNCONST.
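
The effect of treating underscore as a delimiter can be sketched with a small stand-alone splitter. This is an assumption-laden illustration, not the SQLite-derived code: `is_delim` and `next_token` are hypothetical names, and only ASCII input is handled.

```c
#include <ctype.h>
#include <stddef.h>

/* Any non-alphanumeric byte, including '_', separates tokens. */
static int
is_delim(unsigned char c)
{
	return !isalnum(c);
}

/*
 * Copy the next token from *inp into buf in lowercase, storing at most
 * buflen - 1 bytes plus a terminating NUL, and advance *inp past the token.
 * Returns the number of bytes stored; 0 means no tokens are left.
 */
static size_t
next_token(const char **inp, char *buf, size_t buflen)
{
	const char *p = *inp;
	size_t n = 0;

	/* Skip leading delimiters such as the "__" in "__UNCONST". */
	while (*p != '\0' && is_delim((unsigned char)*p))
		p++;
	/* Consume the token, lowercasing as we go. */
	while (*p != '\0' && !is_delim((unsigned char)*p)) {
		if (n + 1 < buflen)
			buf[n++] = (char)tolower((unsigned char)*p);
		p++;
	}
	if (buflen > 0)
		buf[n] = '\0';
	*inp = p;
	return n;
}
```

Called repeatedly on `_lwp_create`, this yields `lwp` and then `create`, which is why a query for `lwp` can match such names once both the index and the query are tokenized this way.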

The tokenizer needs the fts3_tokenizer.h file, which is not shipped with the
amalgamation build of SQLite, so it needs to be added here (unless
we decide there is a better place for it).

To enforce use of the new tokenizer, a schema version bump is needed.

Since tokenization is done both at indexing time (via makemandb) and
at query time (via apropos or whatis), the schema version must be bumped
every time nostem.txt is modified. Otherwise the index will contain old
tokens and the desired changes will not be visible in apropos.

This should also fix the issue reported in PR bin/46255. A similar suggestion
was also made on tech-userlevel@ recently:
<http://mail-index.netbsd.org/tech-userlevel/2017/06/08/msg010620.html>

Thanks to christos@ for multiple rounds of reviews of the tokenizer code.
