Basic operation: Learning
To learn your spam category, go to the directory containing your
$HOME/mail/spam folder (we assume mbox type here) and at the command prompt (written %), type:
% dbacl -T email -l spam $HOME/mail/spam
This reads all the messages in $HOME/mail/spam one at a time, ignoring certain mail
headers and attachments and removing HTML markup. The result is a binary
file called spam, which can be used by dbacl for classifications.
This file is a snapshot: it cannot be modified by learning extra mail messages
(unlike other spam filters, which sometimes let you learn incrementally;
see below for discussion).
If you get warning messages about the hash size being too small,
you need to increase the memory reserved for email tokens. Type:
% dbacl -T email -H 20 -l spam $HOME/mail/spam
which reserves space for up to 2^20 (about one million) unique words. The
default dbacl(1) settings are chosen to
keep memory requirements very low (about 32 thousand tokens). Once the limit is reached,
no new tokens are added and your category models will be strongly skewed towards the first few emails read. Heed the warnings.
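If you have a rough idea of how many distinct tokens your mail contains, the smallest suitable -H value can be worked out with a little arithmetic. The following is only a minimal sketch: the function name min_hash_bits is hypothetical, and the token estimate is something you must supply yourself.

```shell
# Hypothetical helper: print the smallest n such that 2^n >= the
# estimated number of unique tokens given as the first argument.
min_hash_bits() {
    tokens=$1
    bits=0
    # Keep doubling the capacity until it covers the estimate.
    while [ $(( 1 << bits )) -lt "$tokens" ]; do
        bits=$(( bits + 1 ))
    done
    echo "$bits"
}
```

For instance, min_hash_bits 1000000 prints 20, which matches the -H 20 switch used above.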
If your spam folder contains too many messages, you can tell dbacl to pick the emails to be learned randomly. This is done by using the decimation switch -x. For example,
% dbacl -T email -H 20 -x 1 -l spam $HOME/mail/spam
will learn only about 1/2 of the emails found in the spam folder. Similarly, -x 2 would learn about 1/4, and -x 3 about 1/8, of the available emails.
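To get a feel for what a given -x value means for your own folder, you can estimate the number of messages that would be learned. This sketch assumes an mbox file in which every message starts with a "From " separator line, and that decimation keeps roughly 1 in 2^N messages as described above. Both function names are hypothetical.

```shell
# Count the messages in an mbox file by counting "From " separator lines.
count_mbox_messages() {
    grep -c '^From ' "$1"
}

# Estimate how many messages are learned with decimation switch -x N,
# assuming about 1 in 2^N messages is kept (integer division).
expected_learned() {
    total=$1
    x=$2
    echo $(( total / (1 << x) ))
}
```

For a folder of 1000 messages, expected_learned 1000 3 prints 125. Since the selection is random, the actual number will vary around this estimate.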
dbacl(1) has several options designed to control the way email messages are
parsed. The man page lists them all, but of particular interest are the -T
and -e options.
If your email isn't kept in mbox format, dbacl(1) can open one or more directories
and read all the files in them.
For example, if your messages are stored in the directory $HOME/mh/, one file per email,
you can type
% dbacl -T email -l spam $HOME/mh
At present, dbacl(1) won't read the subdirectories, or look at the file names
to decide whether to read some messages and not others.
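Since several directories can be given at once, one possible workaround for nested mail folders is to let find(1) list every subdirectory and pass them all to dbacl(1) in a single invocation. This is a sketch under that assumption, not a tested recipe. The function name learn_tree and the DBACL_CMD variable are hypothetical.

```shell
# Hypothetical helper: learn category $2 from directory $1 and all of
# its subdirectories by passing every directory to dbacl explicitly.
# DBACL_CMD is an illustrative override (e.g. DBACL_CMD=echo for a dry run).
learn_tree() {
    find "$1" -type d -exec "${DBACL_CMD:-dbacl}" -T email -l "$2" {} +
}
```

You would call it as learn_tree $HOME/mh spam. Be aware that for a very large tree, find(1) may split the directory list over several dbacl(1) invocations, and since a category is recreated from scratch each time it is learned, only the last invocation would count.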
Another (but not necessarily faster) solution is to temporarily
convert your mail into an mbox format file and use that for learning:
% find $HOME/mh -type f | while read -r f; \
do formail <"$f"; done | dbacl -T email -l spam
It is not enough to learn the $HOME/mail/spam emails; you must also learn the
$HOME/mail/notspam emails. dbacl(1) can only choose among the categories which have been previously learned. It
cannot say that an email is unlike spam (that's an open ended statement), only that an
email is like spam or like notspam (these are both concrete statements). To learn
the notspam category, type:
% dbacl -T email -l notspam $HOME/mail/notspam
Make sure to use the same switches for both spam and notspam categories.
Once you have read the man page in full, you can start to mix and match switches.
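One way to guarantee that both categories are learned with identical switches is to keep the switches in a single place. The following is a minimal sketch. The names DBACL_SWITCHES, DRY_RUN and learn_category are hypothetical, and the switches are simply those used in the examples above.

```shell
# Switches shared by every category (taken from the examples above).
DBACL_SWITCHES="-T email -H 20"

# Hypothetical helper: learn one category from one mbox file.
# With DRY_RUN=1, print the command instead of running it.
learn_category() {
    category=$1
    mailbox=$2
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "dbacl $DBACL_SWITCHES -l $category $mailbox"
    else
        # DBACL_SWITCHES is left unquoted on purpose so that the
        # switches are split into separate arguments.
        dbacl $DBACL_SWITCHES -l "$category" "$mailbox"
    fi
}
```

You would then run learn_category spam $HOME/mail/spam followed by learn_category notspam $HOME/mail/notspam. Changing DBACL_SWITCHES affects both categories the next time they are relearned.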
Every time dbacl(1) learns a category, it writes
a binary file containing the statistical model into the hidden directory $HOME/.dbacl,
so for example after learning the category spam, you will have a small file named $HOME/.dbacl/spam which contains everything dbacl(1) learned. The file is recreated from scratch each time you relearn spam, and is loaded each time you classify an email.
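Before classifying, it can be useful to confirm that every category you intend to use has actually been learned. This sketch only checks for the category files described above. The function names and the MODEL_DIR variable are hypothetical, and the default location is assumed to be $HOME/.dbacl as stated in the text.

```shell
# Hypothetical helper: succeed if a learned category file exists.
# MODEL_DIR is an illustrative variable; it defaults to $HOME/.dbacl,
# the directory named in the text above.
have_category() {
    [ -f "${MODEL_DIR:-$HOME/.dbacl}/$1" ]
}

# Print each of the listed categories that still needs to be learned.
missing_categories() {
    for category in "$@"; do
        have_category "$category" || echo "$category"
    done
}
```

Running missing_categories spam notspam prints nothing once both categories have been learned.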