dbacl project homepage


Summary
Forums
CVS
Download

Laird Breyer

Download

previous next

Basic operation: Learning

To learn your spam category, go to the directory containing your $HOME/mail/notspam folder (we assume mbox type here) and at the command prompt (written %), type:

% dbacl -T email -l spam $HOME/mail/spam

This reads all the messages in $HOME/mail/notspam one at a time, ignoring certain mail headers and attachments, and also removes HTML markup. The result is a binary file called spam, which can be used by dbacl for classifications. This file is a snapshot, it cannot be modified by learning extra mail messages. (unlike other spam filters, which sometimes let you learn incrementally, see below for discussion).

If you get warning messages about the hash size being too small, you need to increase the memory reserved for email tokens. Type:

% dbacl -T email -H 20 -l spam $HOME/mail/spam

which reserves space for up to 2^20 (one million) unique words. The default dbacl(1) settings are chosen to limit dramatically the memory requirements (about 32 thousand tokens). Once the limit is reached, no new tokens are added and your category models will be strongly skewed towards the first few emails read. Heed the warnings.

If your spam folder contains too many messages, you can tell dbacl to pick the emails to be learned randomly. This is done by using the decimation switch -x. For example,

% dbacl -T email -H 20 -x 1 -l spam $HOME/mail/spam

will learn only about 1/2 or the emails found in the spam folder. Similarly, -x 2 would learn about 1/4, -x 3 would learn about 1/8 of available emails.

dbacl(1) has several options designed to control the way email messages are parsed. The man page lists them all, but of particular interest are the -T and -e options.

If your email isn't kept in mbox format, dbacl(1) can open one or more directories and read all the files in it. For example, if your messages are stored in the directory $HOME/mh/, one file per email, you can type

% dbacl -T email -l spam $HOME/mh

At present, dbacl(1) won't read the subdirectories, or look at the file names to decide whether to read some messages and not others. Another (but not necessarily faster) solution is to temporarily convert your mail into an mbox format file and use that for learning:

% find $HOME/mh -type f | while read f; \
    do formail <$f; done | dbacl -T email -l spam

It is not enough to learn $HOME/mail/notspam emails, you must also learn the $HOME/mail/notspam emails. dbacl(1) can only choose among the categories which have been previously learned. It cannot say that an email is unlike spam (that's an open ended statement), only that an email is like spam or like notspam (these are both concrete statements). To learn the notspam category, type:

% dbacl -T email -l notspam $HOME/mail/notspam

Make sure to use the same switches for both spam and notspam categories. Once you've fully read the man page later, you can start to mix and match switches.

Every time dbacl(1) learns a category, it writes a binary file containing the statistical model into the hidden directory $HOME/.dbacl, so for example after learning the category spam, you will have a small file named $HOME/.dbacl/spam which contains everything dbacl(1) learned. The file is recreated from scratch each time you relearn spam, and is loaded each time you classify an email.

previous next