Laird Breyer

Cross Validation

This section explains quality control and accuracy testing, but is not needed for daily use.

If you have time to kill, you might be inspired by the instructions above to write your own learning and filtering shell scripts. For example, you might have a script $HOME/ containing

dbacl -T email -H 19 -w 1 -l "$@"

With this script, you could learn your spam and notspam emails by typing

% ./ spam $HOME/mail/spam
% ./ notspam $HOME/mail/notspam
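
If you prefer a slightly more defensive version, the same learning command can be wrapped in a shell function that checks its arguments first. This is only a sketch: the function name learn_category is made up for illustration, and the dbacl switches are simply the ones shown above.

```shell
#!/bin/sh
# learn_category CATEGORY MBOX...
# Hypothetical wrapper around the learning command shown above.
# Refuses to run unless a category name and at least one mail file are given.
learn_category() {
    if [ "$#" -lt 2 ]; then
        echo "usage: learn_category CATEGORY MBOX..." >&2
        return 2
    fi
    # Quoting "$@" keeps paths containing spaces intact.
    dbacl -T email -H 19 -w 1 -l "$@"
}

# Example (not executed here):
#   learn_category spam "$HOME/mail/spam"
```

The argument check only catches the most common slip, forgetting the mail file; it does not verify that the file exists or is a valid mbox.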

A second script $HOME/ might contain

dbacl -T email -vna "$@" | bayesol -c "$HOME/my.risk" -v

With this script, you could classify an email by typing

% cat email.rfc | $HOME/ -c spam -c notspam

What we've done with these scripts saves typing, but isn't necessarily very useful. However, we can use these scripts together with another utility, mailcross(1).

mailcross(1) runs an n-fold cross-validation on your email, which gives you an idea of the effects of various switches. It splits each of your spam and notspam folders (which normally are used in full for learning categories) into n equal-sized subsets. For each category, one of these subsets is held out for testing and the remaining subsets are used for learning. Since you already know whether the held-out emails are spam or notspam, the filter's answers can be checked. This gives you an idea how good the filter is, and how it improves (or not!) when you change switches.
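
To make the splitting idea concrete, here is a small sketch that deals numbered messages into n folds round-robin and labels each one as "test" or "learn" for a chosen held-out fold. The function names fold_of and role_of are invented for this illustration, and mailcross(1) may well partition your messages differently.

```shell
#!/bin/sh
# Illustration only: round-robin assignment of messages to folds.
# This is not how mailcross(1) is known to split folders.

# fold_of INDEX N: print the fold (0..N-1) that message INDEX falls into.
fold_of() {
    echo $(( $1 % $2 ))
}

# role_of INDEX N K: print "test" if message INDEX is held out when
# fold K is the testing fold, and "learn" otherwise.
role_of() {
    if [ "$(fold_of "$1" "$2")" -eq "$3" ]; then
        echo test
    else
        echo learn
    fi
}

# Example: 6 messages, 3 folds, fold 0 held out for testing.
i=0
while [ "$i" -lt 6 ]; do
    echo "message $i: $(role_of "$i" 3 0)"
    i=$(( i + 1 ))
done
```

With 6 messages and 3 folds, messages 0 and 3 land in the testing fold and the other four are used for learning.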

Statistically, cross-validation is on very shaky ground, because it violates the fundamental principle that you must not reuse data for two different purposes, but that doesn't stop most of the world from using it.

Suppose you want to cross-validate your filtering scripts above. The following instructions only work with mbox files. First you should type:

% mailcross prepare 10
% mailcross add spam $HOME/mail/spam
% mailcross add notspam $HOME/mail/notspam

This creates several copies of all your spam and notspam emails for later testing. Your original $HOME/mail/{,not}spam files are not modified or even needed after these steps.
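
Since these instructions assume mbox files, you may want to know roughly how many messages each folder holds before splitting it. A minimal sketch, assuming the usual mbox convention that every message starts with a line beginning with "From " (the function name count_mbox_messages is hypothetical):

```shell
#!/bin/sh
# count_mbox_messages FILE
# Print the number of lines beginning with "From ", which in the mbox
# convention mark the start of each message. Body lines that begin with
# an unescaped "From " would be counted too, so treat this as an estimate.
count_mbox_messages() {
    grep -c '^From ' "$1"
}

# Example (not executed here):
#   count_mbox_messages "$HOME/mail/spam"
```

Note that grep exits with a nonzero status when the count is zero, even though it still prints 0.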

Next, you must indicate the learning and filtering scripts to use. Type


Finally, it is time to run the cross validation. Type

% mailcross learn
% mailcross run
% mailcross summarize

The results you will eventually see are based on learning about 9/10 of the emails in $HOME/mail/spam and $HOME/mail/notspam for their respective categories, and testing with the remaining 1/10.
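
The 9/10 and 1/10 proportions translate into message counts as follows. This is a back-of-the-envelope sketch using integer division; the function name fold_sizes is made up, and how mailcross(1) handles a folder whose size is not a multiple of the number of folds is not covered here.

```shell
#!/bin/sh
# fold_sizes TOTAL N
# Print how many messages are used for learning and for testing when
# TOTAL messages are split into N folds and one fold is held out.
fold_sizes() {
    total=$1
    n=$2
    test_size=$(( total / n ))            # one fold, rounded down
    learn_size=$(( total - test_size ))   # everything else
    echo "learn=$learn_size test=$test_size"
}

fold_sizes 1000 10   # prints: learn=900 test=100
```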

Beware that the commands above may take a long time and have all the excitement of watching paint dry. When you get bored with cross validation, don't forget to type

% mailcross clean