Laird Breyer

Comparing dbacl with other classifiers

The key to comparing dbacl(1) with other classifiers is the mailcross(1) testsuite command.

Simply put, this command allows you to compare the error rates of several classifiers on a common set of training documents. The rates you obtain are of course only estimates, and likely to vary somewhat depending on the actual sample emails you use. Thus it is possible for one classifier to perform better than another with one set of documents, while performing worse with a different set.
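To get a feel for how much an estimated error rate can move from one sample to another, the following sketch computes a rough 95% interval around an observed error rate. It is not part of mailcross(1); the function name `error_rate_interval` is hypothetical, and the calculation assumes a normal approximation with independent classification errors, which is only roughly true in practice.

```python
import math

def error_rate_interval(errors, total, z=1.96):
    """Return (rate, low, high) for an observed error count.

    Assumption: errors are independent, so the rate behaves like a
    binomial proportion and a normal approximation applies.
    """
    rate = errors / total
    std_err = math.sqrt(rate * (1.0 - rate) / total)
    low = max(0.0, rate - z * std_err)
    high = min(1.0, rate + z * std_err)
    return rate, low, high

# Example: 10 misclassified emails out of 500 tested.
rate, low, high = error_rate_interval(10, 500)
print(f"rate={rate:.3f} interval=({low:.3f}, {high:.3f})")
```

With only a few hundred test emails the interval is wide relative to the rate itself, so two classifiers whose estimated rates differ slightly may not be distinguishable on that sample.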

Unfortunately, no set of email documents is truly representative for everyone on the planet. Moreover, even one person's email characteristics drift slowly over time. Consequently, it makes little sense to compare the performance of different classifiers on different sets of documents. Instead, the task of choosing the best classifier for yourself can only be done reliably by using your own email, and by comparing classifiers on exactly the same emails.

The mailcross(1) testsuite must be given a set of categories, with sample emails from each category in mbox format. After selecting all the classifiers to be compared, it remains only to leave the script running overnight. The summary is usually inspected the next morning.

The method used to estimate classification errors is standard cross validation. The training emails are split into a number of roughly equal-sized subsets. All of the subsets except one are used for learning, and the classifier then predicts the categories of the emails in the remaining subset, which it has not seen. This is repeated so that each subset serves once as the prediction subset. Finally, the percentage of errors is calculated for each category by averaging the results over all choices of the prediction subset.
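The procedure just described can be sketched in a few lines. This is only an illustration of the idea, not the code mailcross(1) actually runs: the names `split_folds`, `cross_validate`, `toy_train` and `toy_predict` are all hypothetical, the toy word-overlap classifier stands in for a real filter, and errors are pooled over the folds rather than averaged fold by fold (the two agree when the folds have equal size).

```python
from collections import defaultdict

def split_folds(items, k):
    """Deal items round-robin into k roughly equal-sized subsets."""
    return [items[i::k] for i in range(k)]

def cross_validate(samples, train, predict, k=10):
    """Estimate the percentage of errors per category.

    samples: dict mapping category name -> list of documents
    train:   function(dict category -> documents) -> model
    predict: function(model, document) -> category name
    """
    folds = {cat: split_folds(docs, k) for cat, docs in samples.items()}
    errors = defaultdict(int)
    totals = defaultdict(int)
    for held_out in range(k):
        # Learn from every subset except the held-out one.
        training = {
            cat: [doc for i, fold in enumerate(cat_folds)
                  if i != held_out for doc in fold]
            for cat, cat_folds in folds.items()
        }
        model = train(training)
        # Predict the held-out subset and count the mistakes.
        for cat, cat_folds in folds.items():
            for doc in cat_folds[held_out]:
                totals[cat] += 1
                if predict(model, doc) != cat:
                    errors[cat] += 1
    return {cat: 100.0 * errors[cat] / totals[cat] for cat in samples}

# A toy classifier (an assumption for this sketch): pick the category
# whose training vocabulary shares the most words with the document.
def toy_train(training):
    return {cat: {w for doc in docs for w in doc.split()}
            for cat, docs in training.items()}

def toy_predict(model, doc):
    words = set(doc.split())
    return max(sorted(model), key=lambda cat: len(words & model[cat]))

samples = {
    "spam": ["buy pills now", "cheap pills buy",
             "buy cheap now", "pills now cheap"],
    "notspam": ["meeting at noon", "lunch meeting today",
                "noon lunch today", "today at noon"],
}
print(cross_validate(samples, toy_train, toy_predict, k=2))
```

Because `train` and `predict` are passed in as plain functions, the same loop can be run with several classifiers on exactly the same subsets, which is the kind of like-for-like comparison this page argues for.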

Note that this is neither the only way to estimate prediction errors, nor one that all academics accept as a good way. However, it is independent of the classifier, widely used around the world, and easy to program.

You can cross validate as many categories as you like, provided the classifiers all support multiple categories. For example, you could compare dbacl(1) and ifile on many categories.

However, most email junk filters can only cope with two categories, representing junk mail and regular mail. When comparing the performance of such classifiers, bogofilter for example, the mailcross(1) testsuite is hard-coded to function with two categories named spam and notspam. You must use these category names, or the results will not make sense.
