An Example: Preparations
Before you can run the mailcross(1) testsuite, you need a set
of sample emails for each category. Here, we shall test two
categories, named spam and notspam.
Take a moment to sift through your local mail folders for sample emails.
The instructions below assume you have two Unix mbox format files
named $HOME/sample_spam.mbox and $HOME/sample_notspam.mbox, containing
junk email and ordinary email respectively. These will not be
modified in any way during the test.
Fill these folders with as many messages as you can. More messages
lengthen the time it takes for the cross validation to complete, but
they also give more accurate results.
You should expect the tests to run overnight anyway.
If your emails aren't in mbox format, you must convert them. For example,
if $HOME/myspam is a directory containing your emails, one file per email, you
can type:
% find "$HOME/myspam" -type f | while read -r f; \
do formail <"$f"; done > "$HOME/sample_spam.mbox"
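As a quick sanity check after the conversion, you can count the messages in the resulting mbox file. The sketch below assumes that every message begins with a "From " separator line and that such lines inside message bodies are escaped, which is formail's default behaviour. The function name count_mbox_messages is our own and is not part of mailcross or formail.

```shell
#!/bin/sh
# count_mbox_messages FILE
# Print the number of lines in FILE that begin with "From ", i.e. the
# mbox message separators. Hypothetical helper, for illustration only.
count_mbox_messages() {
    grep -c '^From ' "$1"
}
```

For example, count_mbox_messages "$HOME/sample_spam.mbox" should print the same number as find "$HOME/myspam" -type f | wc -l, provided each file holds exactly one message.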
Alternatively, if you don't have many emails for testing, you can download
samples from a public corpus. For example, SpamAssassin maintains
suitable sets of messages at
http://spamassassin.org/publiccorpus/. Be kind to their server!
The SpamAssassin corpus doesn't come in mbox format. Here's what you must do
to obtain usable files: Download a compressed message archive. For example,
you can download the file 20021010_hard_ham.tar.bz2, which contains
a selection of nonjunk messages. Type
% tar xfj 20021010_hard_ham.tar.bz2
which will extract the files into a directory named hard_ham. If you
inspect the directory by typing
% ls hard_ham
you will see many files with similar-looking names.
These are all individual messages. Watch out for files with unusual names.
Some archives contain a file named cmds, which is NOT a mail message.
Delete all such files before proceeding. Next, type:
% find hard_ham -type f | while read -r f; \
do formail <"$f"; done > "$HOME/sample_notspam.mbox"
You can repeat this command for as many archives as needed, but remember to
change the destination mbox name, as it will get overwritten otherwise.
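If you would rather collect several archives into a single mbox file, one option is to append to the destination with >> instead of overwriting it with >. The following is a minimal sketch, not part of mailcross: the function name dirs_to_mbox is hypothetical, it assumes formail is available on your PATH, and it skips any file named cmds so that you need not delete it by hand.

```shell
#!/bin/sh
# dirs_to_mbox OUTPUT DIR...
# Append every regular file under each DIR to OUTPUT in mbox format,
# skipping files named "cmds". Hypothetical helper, for illustration only.
dirs_to_mbox() {
    out=$1
    shift
    for dir in "$@"; do
        # formail reads from "$f", so it does not consume the file list.
        find "$dir" -type f ! -name cmds | while IFS= read -r f; do
            formail <"$f"
        done >>"$out"
    done
}
```

For example, dirs_to_mbox "$HOME/sample_notspam.mbox" hard_ham other_dir would gather both directories into one file, where other_dir stands for a second extracted archive. Because the function appends, remove any old copy of the output file before running it again.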