dbacl: Is It Working?
Laird A. Breyer
Introduction
dbacl is a UNIX/POSIX command line
toolset which can be used in scripts to classify a single email among one
or more previously learned categories.
This document is intended to help you check that your installation of
dbacl is working as it should. When trying out new software, there are
many things to learn and things that can go wrong, and because
dbacl is a statistical
tool, it can sometimes be surprising and unpredictable. When you get unexpected
results, is it because dbacl doesn't work as claimed, because of an undiscovered bug,
because of problems with learning, or because of problems in your scripts? Knowing what to
look for is half the battle.
In the text below, some commands are written after a '%'. The '%'
represents your shell prompt, and you are expected to enter the commands
following it. If a line doesn't start with a '%', then it represents
what you are likely to see as command output.
The basics
Let's first see if dbacl is properly installed on your system. At the shell prompt,
type:
% dbacl -V
dbacl version 1.11
Copyright (c) 2002,2003,2004,2005 L.A. Breyer. All rights reserved.
dbacl comes with ABSOLUTELY NO WARRANTY, and is licensed
to you under the terms of the GNU General Public License.
.
.
If you get something that looks like this, then you know that dbacl is installed
and ready to use. If the shell can't find the command dbacl, or you get some other
error, then you most likely need to download and install the program, as explained
in the next section.
If you've built the program from sources yourself and your shell still can't find dbacl,
then you need to give the full path to the program each time you run it.
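As a minimal sketch of that last point, a script can look for dbacl on the search path first and fall back to an explicit location. The function name find_dbacl and the fallback path are hypothetical; substitute the directory where your own build put the binary.

```shell
# Print the path of a usable dbacl executable, or fail if none is found.
# The default fallback below is a hypothetical example of a build directory.
find_dbacl() {
    fallback=${1:-"$HOME/dbacl-1.11/src/dbacl"}
    if command -v dbacl >/dev/null 2>&1; then
        # dbacl is on the PATH: report where the shell found it.
        command -v dbacl
    elif [ -x "$fallback" ]; then
        # Not on the PATH, but the explicit location exists and is executable.
        printf '%s\n' "$fallback"
    else
        return 1
    fi
}

# Usage:
#   DBACL=$(find_dbacl) && "$DBACL" -V
```

Storing the result in a variable once means the rest of a script does not have to repeat the full path.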
Installation time
Since dbacl is an open
source (GPL) program, you are likely to first encounter it in one
of two forms: a source code form as a compressed tar file named
something like dbacl-1.11.tar.gz, or a preinstalled binary
form that comes with your GNU/Linux or other operating system.
In the source code form you are expected to first build the program in the usual way:
% tar xfz dbacl-1.11.tar.gz
% cd dbacl-1.11
% ./configure && make
If something goes wrong during the build, you will see an error message and the
build will not finish. Troubleshooting the build is beyond the scope of
this document, so let's assume it finished without errors.
How do we know that the freshly built programs operate correctly? We run
some standard tests:
% make check
.
.
===================
All 55 tests passed
===================
.
.
This command produces a lot of output on your terminal, but the important
line is the one indicating whether all tests have passed. Normally, when
dbacl is packaged, all the tests are guaranteed to pass, but small differences
between computer systems can cause failures. By running "make check" you
will know if things are working on your system. If you see a failure,
look back through the messages printed during the build steps; most likely
one of them points to a problem which you can fix.
On some systems, it is possible that a handful of tests will fail
because the configure script cannot find a suitable Unicode
environment. In that case, the error output contains "nbsp;" tokens
which could not be converted to spaces. This is harmless.
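Because the summary line is easy to lose in the output, it can help to save the output to a file and search it afterwards. The sketch below assumes the summary looks like the "All 55 tests passed" line shown above, and that failures are reported as "N of M tests failed"; that second form is an assumption about the test harness, and the function name check_summary is made up for this example.

```shell
# Print the summary line(s) found in a saved "make check" log.
# Succeeds only if a line of the form "All N tests passed" is present.
check_summary() {
    log=$1
    # Show whichever summary line the log contains (the failure format is assumed).
    grep -E '^(All [0-9]+ tests passed|[0-9]+ of [0-9]+ tests failed)' "$log"
    # The exit status reflects only the "all passed" case.
    grep -Eq '^All [0-9]+ tests passed' "$log"
}

# Usage, from the build directory:
#   make check > check.log 2>&1
#   check_summary check.log
```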
Finally, you must install the freshly built programs in the correct location
on your system. The simplest way is to type as root:
% make install-strip
If your version of dbacl was preinstalled on your operating system, then
you do not need to build the programs from source, and you cannot "make check"
to see if the tests passed. However, you can normally trust the distributor
of your operating system and assume that all the build tests were successful.
Sanity checks
There are two simple ways to check that dbacl works correctly. The first way
is to read the tutorials and type the commands given there yourself.
The tutorials and where to find them are listed in the dbacl man page,
which you can read by typing
% man dbacl
You should expect to see output nearly identical to that described in the tutorials.
The second way to check that dbacl works is to run a small classification
test with your own email collections. This is a better test because it gives
you confidence that the system will work for you. Below is just a quick
explanation; the full details can be read by typing
% man mailcross
You will need two mbox files containing collected emails of two different
types. The types could be spam and notspam, or anything else,
but make sure that the two mbox files do not have messages in common and
represent different topics. There should also be roughly the same number
of messages in each. Now type the following (you can replace spam and
notspam by any names you like):
% mailcross prepare 5
% mailcross add spam /path/to/spam.mbox
% mailcross add notspam /path/to/notspam.mbox
% mailcross learn
% mailcross run
% mailcross summarize
These commands could take a while depending on how big your mbox files are.
The summary is a table which shows the number of errors in the experiment,
which consists of learning some random subsets of the spam.mbox and
notspam.mbox files and trying to predict the remaining messages. If all went
well, you should expect the error rate to be below 10%, but the exact percentage
depends on how many messages you have, how easy to separate they are, and
the default switches used by the test system.
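Before starting the experiment, you may want to confirm that the two mbox files really do hold roughly the same number of messages. The sketch below counts lines beginning with "From ", which is only an approximation of the message count in an mbox file. The function names and the factor of two used as the threshold for "roughly the same" are arbitrary choices made for this example.

```shell
# Approximate the number of messages in an mbox file by counting "From " lines.
count_messages() {
    grep -c '^From ' "$1"
}

# Succeed if both files are non-empty and the larger one holds at most
# twice as many messages as the smaller one (an arbitrary threshold).
roughly_balanced() {
    a=$(count_messages "$1")
    b=$(count_messages "$2")
    [ "$a" -gt 0 ] && [ "$b" -gt 0 ] || return 1
    if [ "$a" -ge "$b" ]; then
        big=$a; small=$b
    else
        big=$b; small=$a
    fi
    [ "$big" -le $((small * 2)) ]
}

# Usage:
#   roughly_balanced /path/to/spam.mbox /path/to/notspam.mbox || echo "unbalanced"
```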
Don't forget to clean up the mailcross temporary directory by typing
% mailcross clean
Particular symptoms
If you've tried the things above and you still suspect things aren't working
correctly, here is a small list of symptoms and their likely causes.
- dbacl seems to forget what it learns
You keep feeding data to dbacl for learning, but it seems to forget
everything very soon. This symptom occurs because dbacl truly does
forget data. Unlike certain other spam filters, dbacl doesn't support
incremental learning. You must teach it everything in one go. You can
simulate incremental learning by keeping and growing a collection of training
examples, and periodically teaching dbacl the full collection.
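The workaround described above can be sketched as a small shell function. It assumes the training examples are kept in a single growing mbox file and that the category is learned with "dbacl -T email -l"; check the dbacl man page for the switches that suit your setup. The function name relearn and the file names in the usage comment are hypothetical.

```shell
# Command used to invoke dbacl; override it if dbacl is not on your PATH.
DBACL=${DBACL:-dbacl}

# Append newly collected messages to the full corpus, then learn the
# category again from the whole corpus (not just from the new messages).
relearn() {
    category=$1
    corpus=$2
    new_mail=$3
    cat "$new_mail" >> "$corpus" || return 1
    "$DBACL" -T email -l "$category" "$corpus"
}

# Usage:
#   relearn spam "$HOME/training/spam.mbox" /path/to/new-spam.mbox
```

Each call appends the new file as it is, so passing the same new messages twice would duplicate them in the corpus.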