MAILCROSS.TESTSUITE

NAME
SYNOPSIS
DESCRIPTION
EXIT STATUS
COMMANDS
USAGE
SCRIPT INTERFACE
ENVIRONMENT
WARNING
SOURCE
AUTHOR
SEE ALSO

NAME

mailcross.testsuite − cross validate several mail classifiers at once.

SYNOPSIS

mailcross.testsuite command [ command_arguments ]

DESCRIPTION

mailcross.testsuite can be used to cross validate several binary email classifiers simultaneously on a user’s own email corpus.

One of the easiest ways to compare the performance of trainable email classifiers is through cross validation. This consists of computing the average success/failure rates when the classifier learns most of the emails in the corpus, while a selection of unlearned messages is held back for testing. mailcross(1) is a tool designed to compute such rates for one email classifier at a time, such as dbacl(1).

mailcross.testsuite takes care of cross validating several classifiers simultaneously, by running mailcross(1) for each. To accomplish this task, each classifier is invoked through a wrapper script, which takes care of differences in the invocation of the various classifiers. mailcross.testsuite comes with a selection of predefined wrapper scripts for several popular email classifiers.

EXIT STATUS

mailcross.testsuite returns 0 on success, 1 if a problem occurred.

COMMANDS

list

Shows a list of available filters/wrapper scripts which can be selected.

select FILTER [FILTER]...

Prepares the filter(s) named FILTER to be used for cross validation. The filter name is the name of a wrapper script located in the directory @PKGDATADIR@/testsuite. Each filter has a rigid interface documented below, and the act of selecting it copies it to the mailcross.d/filters directory. Only filters located there are used in the cross validation.

deselect FILTER [FILTER]...

Removes the named filter(s) from the directory mailcross.d/filters so that they are not used in the cross validation.

run

Invokes mailcross(1) for every filter located in mailcross.d/filters.

status

Describes the scheduled cross validations.

summarize

Shows the cross validation results for all filters. Only makes sense after the run command.

USAGE

mailcross.testsuite is closely tied to mailcross(1), which must be invoked directly for some of the steps.

Before you can cross validate any of your email classifiers, you will need two sets of email messages in mbox format. One set should contain only junk mail (assumed henceforth to reside in the file $HOME/mail/junk.mbox), the other only ordinary mail (assumed to reside in the file $HOME/mail/good.mbox). It is important to keep the two types of email messages entirely separate, as mixing message types can impact classifier performance.

The first step in the cross validation is to create the necessary infrastructure. To do so, make sure you have plenty of disk space (about 10 times the combined size of both mbox files), and type the following:

% mailcross prepare 10
% mailcross add spam $HOME/mail/junk.mbox
% mailcross add notspam $HOME/mail/good.mbox

This will create a directory named mailcross.d which contains copies of your mail messages ready for use. The original mbox files are no longer needed or referenced. Next, you must choose which email classifiers to test. Every such classifier is called through a wrapper script by mailcross.testsuite, and you can view a list of available wrappers by typing:

% mailcross.testsuite list

Note that the wrapper scripts are NOT the actual email classifiers, which must be installed separately by your system administrator or otherwise. Once this is done, you can select one or more wrappers for the cross validation by typing, for example:

% mailcross.testsuite select dbacl ifile

If some of the selected classifiers cannot be found on the system, they are not selected. Note also that some wrappers can have hard-coded category names, e.g. if the classifier only supports binary classification. Heed the warning messages.

It remains only to run the cross validation. Beware, this can take a long time (several hours depending on the classifier).

% mailcross.testsuite run
% mailcross.testsuite summarize

Once you have finished cross validating, you can delete the working files, log files etc. by typing:

% mailcross clean

The progress of the cross validation is written silently to various log files located in the mailcross.d/log directory. Check these in case of problems.
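
The disk space guideline given earlier in this section (about 10 times the combined size of both mbox files) can be checked before running mailcross prepare. The following is a minimal sketch, not part of mailcross.testsuite: the function names estimate_kb and free_kb are hypothetical, and the parsing of df output assumes the POSIX format produced by df -Pk.

```shell
#!/bin/sh
# Hypothetical helpers, not shipped with mailcross.testsuite.

# Print ten times the combined size of the given files, in kilobytes
# (rounded up), following the "about 10 times" rule of thumb.
estimate_kb() {
    total=0
    for f in "$@"; do
        size=$(wc -c < "$f") || return 1
        total=$((total + size))
    done
    echo $(( (total * 10 + 1023) / 1024 ))
}

# Print the free space, in kilobytes, of the filesystem holding directory $1.
# Assumes POSIX df output: the fourth column of the second line is "Available".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Example use (commented out because the mbox paths are only assumed to exist):
#   need=$(estimate_kb "$HOME/mail/junk.mbox" "$HOME/mail/good.mbox")
#   have=$(free_kb .)
#   [ "$have" -ge "$need" ] || echo "warning: probably not enough disk space" >&2
```

The factor of 10 is only the rough figure quoted above, so treat the result as an estimate rather than a guarantee.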

SCRIPT INTERFACE

mailcross.testsuite takes care of learning and classifying your prepared email corpora for each selected classifier. Since classifiers have widely varying interfaces, this is only possible by wrapping those interfaces individually into a standard form which can be used by mailcross.testsuite.

Each wrapper script is a command line tool which accepts a single command followed by zero or more optional arguments, in the standard form:

wrapper command [argument]...

Each wrapper script also makes use of STDIN and STDOUT in a well defined way. If no behaviour is described, then no output or input should be used. The possible commands are described below:

filter

In this case, a single email is expected on STDIN, and a list of category filenames is expected in $2, $3, etc. The script writes the category name corresponding to the input email on STDOUT. No trailing newline is required or expected.

learn

In this case, a standard mbox stream is expected on STDIN, while a suitable category file name is expected in $2. No output is written to STDOUT.

clean

In this case, a directory is expected in $2, which is examined for old database information. If any old databases are found, they are purged or reset. No output is written to STDOUT.

describe

In this case, a single line of text is written to STDOUT, describing the filter’s functionality. The line should be kept short to prevent line wrapping on a terminal.

bootstrap

In this case, a directory is expected in $2. The wrapper script first checks for the existence of its associated classifier, and other prerequisites. If the check is successful, then the wrapper is cloned into the supplied directory. A courtesy notification should be given on STDOUT to express success or failure. It is also permissible to give longer descriptions or caveats.
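
The five commands above can be illustrated with a skeleton wrapper. This is only a sketch under stated assumptions, not one of the predefined wrappers: the "toy" classifier is invented for illustration (it scores a category by how many lines of the message were seen during learning), the category name is assumed to be the base name of the category file, stale state is assumed to live in *.tmp files, and a nonzero return on failure is an assumption rather than a documented requirement.

```shell
#!/bin/sh
# Illustrative wrapper skeleton. The "toy" classifier is NOT a real filter;
# replace the marked parts with calls to an actual classifier.

wrapper() {
    cmd=${1:-}
    case "$cmd" in
    filter)
        # A single email on STDIN, category files in $2, $3, ...
        shift
        [ "$#" -gt 0 ] || return 1
        msg=$(mktemp) || return 1
        cat > "$msg"
        best=
        bestscore=-1
        for cat_file in "$@"; do
            # Toy score: number of message lines also present in the learned file.
            score=$(grep -cFxf "$cat_file" "$msg" 2>/dev/null) || score=${score:-0}
            if [ "$score" -gt "$bestscore" ]; then
                best=$cat_file
                bestscore=$score
            fi
        done
        rm -f "$msg"
        # Assumption: the category name is the base name of the category file.
        # printf '%s' writes it without a trailing newline.
        printf '%s' "$(basename "$best")"
        ;;
    learn)
        # An mbox stream on STDIN, category file name in $2; no output.
        [ -n "${2:-}" ] || return 1
        cat > "$2"          # toy "learning": remember the lines verbatim
        ;;
    clean)
        # A directory in $2; purge old state. Assumption: *.tmp files are stale.
        [ -d "${2:-}" ] || return 1
        find "$2" -type f -name '*.tmp' -exec rm -f {} +
        ;;
    describe)
        # One short line on STDOUT.
        echo "toy line-overlap filter (illustration only)"
        ;;
    bootstrap)
        # A directory in $2; check prerequisites, then copy this script there.
        if command -v grep >/dev/null 2>&1 && [ -d "${2:-}" ]; then
            cp "$0" "$2"/ && echo "toy: selected"
        else
            echo "toy: prerequisites missing, not selected"
            return 1
        fi
        ;;
    *)
        echo "usage: $0 {filter|learn|clean|describe|bootstrap} [argument]..." >&2
        return 2
        ;;
    esac
}

# Dispatch only when called with arguments, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    wrapper "$@"
fi
```

Keeping all commands in one case statement mirrors the standard form wrapper command [argument]... described above; a real wrapper would replace the grep and cat lines with invocations of its classifier.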

ENVIRONMENT

The variables MAILCROSS_FILTER and MAILCROSS_LEARNER are set repeatedly during cross validation, and have unpredictable values at the end.

WARNING

Cross-validation is a widely used, but ad-hoc statistical procedure, completely unrelated to Bayesian theory, and subject to controversy. Use this at your own risk.

SOURCE

The source code for the latest version of this program is available at the following locations:

http://www.lbreyer.com/gpl.html
http://dbacl.sourceforge.net

AUTHOR

Laird A. Breyer <laird@lbreyer.com>

SEE ALSO

bayesol(1), dbacl(1), mailinspect(1), mailcross(1), regex(7)

