Laird Breyer

An Example: Preparations

Before you can run the mailcross(1) test suite, you need a set of sample emails for each category. Here, we shall test on two categories, named spam and notspam.

Take a moment to sift through your local mail folders for sample emails.

The instructions below assume you have two Unix mbox format files named $HOME/sample_spam.mbox and $HOME/sample_notspam.mbox, containing junk email and ordinary email respectively. These will not be modified in any way during the test.

Fill these folders with as many messages as you can. This lengthens the time the cross-validation takes to complete, but it also gives more accurate results. You should expect the tests to run overnight anyway.
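To get a rough idea of how many messages each sample file holds, you can count the "From " separator lines that begin each message in an mbox file. The sketch below is only an approximation under that assumption; the helper name count_messages is made up for illustration and is not part of dbacl or mailcross.

```shell
# Hypothetical helper: estimate the number of messages in an mbox file
# by counting lines that start with "From " (the mbox message separator).
count_messages() {
    if [ -r "$1" ]; then
        # grep -c prints the count; "|| true" hides its nonzero exit on 0 matches
        grep -c '^From ' "$1" || true
    else
        # Missing or unreadable file: report zero messages
        echo 0
    fi
}

# Report the approximate size of both sample folders used in this tutorial.
for category in spam notspam; do
    mbox="$HOME/sample_$category.mbox"
    echo "$category: $(count_messages "$mbox") messages"
done
```

If the two counts are very lopsided, or one of them is zero, it is probably worth collecting more samples before starting the tests.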

If your emails aren't in mbox format, you must convert them. For example, if $HOME/myspam is a directory containing your emails, one file per email, you can type:

% find $HOME/myspam -type f | while read -r f; \
do formail <"$f"; done > $HOME/sample_spam.mbox

Alternatively, if you don't have many emails for testing, you can download samples from a public corpus. For example, SpamAssassin maintains suitable sets of messages. Be kind to their server!

The SpamAssassin corpus doesn't come in mbox format. Here's what you must do to obtain usable files. First, download a compressed message archive. For example, you can download the file 20021010_hard_ham.tar.bz2, which contains a selection of non-junk messages. Then type

% tar xfj 20021010_hard_ham.tar.bz2

which will extract the files into a directory named hard_ham. If you inspect the directory by typing

% ls hard_ham

you will see many files with names like 0053.ccd1056dc3ff533d76a97044bac52087. These are all individual messages. Watch out for files with unusual names. Some archives contain a file named cmds, which is NOT a mail message. Delete all such files before proceeding. Next, type:

% find hard_ham -type f | while read -r f; \
do formail <"$f"; done > $HOME/sample_notspam.mbox
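Before running the conversion, it can help to list any files whose names don't look like messages, such as cmds. The sketch below assumes that every message file has a name starting with a digit and containing a dot, as in the example name shown earlier; that pattern is a guess from the sample and may not hold for every archive. The helper name list_odd_files is made up for illustration.

```shell
# Hypothetical helper: print files under a directory whose names do NOT
# start with a digit and contain a dot (the assumed message-name pattern).
list_odd_files() {
    find "$1" -type f ! -name '[0-9]*.*'
}

# Only list the candidates; review them by hand before deleting anything.
if [ -d hard_ham ]; then
    list_odd_files hard_ham
fi
```

Nothing is deleted here on purpose. Once you have checked the list, remove the unwanted files yourself.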

You can repeat the find command above for as many archives as needed, but remember to change the destination mbox name each time; otherwise the previous file will be overwritten.
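One way to avoid overwriting an earlier mbox by accident is to derive the destination name from the archive directory. The following is a minimal sketch under that assumption: the helper mbox_for and the resulting file names (for example $HOME/sample_hard_ham.mbox) are illustrative, and the directory names in the loop are placeholders for whatever archives you extracted.

```shell
# Hypothetical helper: build a distinct mbox path from an archive directory,
# e.g. hard_ham -> $HOME/sample_hard_ham.mbox
mbox_for() {
    printf '%s/sample_%s.mbox\n' "$HOME" "$(basename "$1")"
}

# Convert each extracted directory into its own mbox file.
# The directory names below are placeholders; missing ones are skipped.
for dir in hard_ham easy_ham; do
    [ -d "$dir" ] || continue
    find "$dir" -type f | while read -r f; do
        formail <"$f"
    done > "$(mbox_for "$dir")"
done
```

Each directory ends up in its own file, so you still need to decide which of them to use as the spam and notspam samples.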
