Cross Validation
This section explains quality control and accuracy testing, but
is not needed for daily use.
If you have time to kill, you might be inspired by the instructions above to write your own learning and filtering shell scripts. For example, you might
have a script $HOME/mylearner.sh containing
#!/bin/sh
# Quote "$@" so that arguments containing spaces are passed through intact.
dbacl -T email -H 19 -w 1 -l "$@"
With this script (made executable with chmod +x), you could learn your spam and notspam emails by typing
% $HOME/mylearner.sh spam $HOME/mail/spam
% $HOME/mylearner.sh notspam $HOME/mail/notspam
A second script $HOME/myfilter.sh might contain
#!/bin/sh
# Quote "$@" so that arguments containing spaces are passed through intact.
dbacl -T email -vna "$@" | bayesol -c $HOME/my.risk -v
With this script, you could classify an email by typing
% cat email.rfc | $HOME/myfilter.sh -c spam -c notspam
What we've done with these scripts saves typing, but isn't necessarily
very useful. However, we can use these scripts together with another
utility, mailcross(1).
What mailcross(1) does is run an n-fold cross-validation on your
email, thereby giving you an idea of the effects of
various switches. It accomplishes this by splitting each of your spam and notspam folders (which normally are used in full for learning categories) into
n equal-sized subsets. For each category, one of these subsets is held out for testing, and the remaining subsets are used for learning. Since you already know whether the held-out emails are spam or notspam, the filter's answers can be checked against the truth. This gives you an idea of how good the filter is, and how it improves (or not!) when you change switches.
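To make the splitting concrete, here is a minimal sketch that deals numbered messages into n folds round-robin and counts how many end up in the learning and testing sets. It is only an illustration of the idea; the function names fold_of and split_counts are made up for this example, and nothing here claims to match how mailcross(1) divides your folders internally.

```sh
#!/bin/sh
# Illustration only: NOT the mailcross(1) implementation.
# Messages are numbered 0..TOTAL-1 and dealt into N folds round-robin.

# fold_of INDEX N: print the fold (0..N-1) that message INDEX lands in.
fold_of() {
    echo $(( $1 % $2 ))
}

# split_counts TOTAL N HELDOUT: count the messages used for learning and
# for testing when fold HELDOUT is the one set aside for testing.
split_counts() {
    total=$1
    n=$2
    heldout=$3
    learn=0
    held=0
    i=0
    while [ "$i" -lt "$total" ]; do
        if [ "$(fold_of "$i" "$n")" -eq "$heldout" ]; then
            held=$((held + 1))
        else
            learn=$((learn + 1))
        fi
        i=$((i + 1))
    done
    echo "learn=$learn test=$held"
}

# 50 messages, 10 folds, fold 3 held out for testing.
split_counts 50 10 3
```

With 50 messages and 10 folds this prints learn=45 test=5, that is, nine tenths for learning and one tenth for testing.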
Statistically, cross-validation is on very shaky ground, because it violates the fundamental principle that you must not reuse data for two different purposes, but that doesn't stop most of the world from using it.
Suppose you want to cross-validate your filtering scripts above. The following
instructions only work with mbox files. First you should type:
% mailcross prepare 10
% mailcross add spam $HOME/mail/spam
% mailcross add notspam $HOME/mail/notspam
This creates several copies of all your spam and notspam emails
for later testing. Your original $HOME/mail/{,not}spam files are not
modified or even needed after these steps.
Next, you must indicate the learning and filtering scripts to use. Type
% export MAILCROSS_LEARNER="$HOME/mylearner.sh"
% export MAILCROSS_FILTER="$HOME/myfilter.sh"
Finally, it is time to run the cross validation. Type
% mailcross learn
% mailcross run
% mailcross summarize
The results you will eventually see are based on learning from about 9/10 of the emails in $HOME/mail/spam and $HOME/mail/notspam for their respective categories, and testing with the other 1/10.
Beware that the commands above may take a long time and have all the excitement of watching paint dry. When you get bored with cross validation, don't forget to type
% mailcross clean
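If you find yourself repeating the sequence above, it could be collected into a single shell function. The following is a rough sketch rather than a recommended setup: the function name crossval and the MAILCROSS_BIN variable are invented here (the latter only so that the command can be swapped out for a dry run), and the script paths are the ones assumed earlier in this section. It uses only the mailcross subcommands already shown.

```sh
#!/bin/sh
# Sketch: the cross-validation steps from this section as one function.
# crossval and MAILCROSS_BIN are hypothetical names, not part of mailcross(1).

# crossval FOLDS SPAM_MBOX NOTSPAM_MBOX
crossval() {
    folds=$1
    spam=$2
    notspam=$3
    mc=${MAILCROSS_BIN:-mailcross}

    # Learning and filtering scripts, as assumed earlier in this section.
    export MAILCROSS_LEARNER="$HOME/mylearner.sh"
    export MAILCROSS_FILTER="$HOME/myfilter.sh"

    # Stop at the first step that fails.
    "$mc" prepare "$folds" &&
    "$mc" add spam "$spam" &&
    "$mc" add notspam "$notspam" &&
    "$mc" learn &&
    "$mc" run &&
    "$mc" summarize
    status=$?

    # Always clean up, whether or not the steps above succeeded.
    "$mc" clean
    return "$status"
}
```

You might then type crossval 10 $HOME/mail/spam $HOME/mail/notspam. Setting MAILCROSS_BIN=echo beforehand prints the commands instead of running them. Note that cleaning up automatically means the prepare and add steps are repeated on every run; if you want to try several switches on the same split, you may prefer to run the steps by hand.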