Comparing dbacl with other classifiers
The key to comparing dbacl(1) with other
classifiers is the mailcross(1) testsuite command.
Simply put, this command allows you to compare the error rates of several
classifiers on a common set of training documents. The rates you obtain
are of course only estimates, and likely to vary somewhat depending on the
actual sample emails you use. Thus it is possible for one classifier to
perform better than another with one set of documents, while performing worse
with a different set.
Unfortunately, no set of email documents is truly representative
for everyone on the planet. Moreover, one person's email characteristics vary slowly over time. Consequently, it makes little sense to compare the
performance of different classifiers on different sets of documents. Instead,
the only reliable way to choose the best classifier for yourself is to use your own email, and to compare the classifiers on exactly the same emails.
The mailcross(1) testsuite must be given a set
of categories, with sample emails from each category in mbox format.
After selecting all the classifiers to be compared, it remains only to leave
the script running overnight. The summary is usually inspected the next morning.
The method used to estimate classification errors is standard cross-validation.
The training emails are split into a number of roughly equal-sized subsets. All of these except one are used for learning. The remaining subset, which was not used for learning, is then classified, and the predictions are compared with the true categories. Finally, the percentage of errors is calculated for each category by averaging the results over all possible choices of the prediction subset.
Note that this is neither the only way to estimate prediction errors, nor
even accepted as a good way by all academics. However, it's independent of
the classifier, widely used around the world, and easy to program.
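The cross-validation procedure described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the mailcross(1) implementation: the function names, the round-robin split, and the toy word-count classifier are all assumptions made for the example. Error counts are pooled across the held-out subsets, which matches the per-subset average when every subset holds the same number of emails from each category.

```python
from collections import Counter, defaultdict


def split_into_folds(items, k):
    """Deal items round-robin into k roughly equal-sized subsets."""
    return [items[i::k] for i in range(k)]


def cross_validate(samples, train, predict, k=10):
    """Estimate per-category error percentages by k-fold cross-validation.

    samples: list of (category, document) pairs.
    train:   hypothetical callable, list of samples -> model.
    predict: hypothetical callable, (model, document) -> category.
    """
    folds = split_into_folds(samples, k)
    errors, totals = Counter(), Counter()
    for held_out in range(k):
        # Learn from every subset except the held-out one.
        training = [s for i, fold in enumerate(folds)
                    if i != held_out for s in fold]
        model = train(training)
        # Classify the subset that was not learned and count mistakes.
        for category, doc in folds[held_out]:
            totals[category] += 1
            if predict(model, doc) != category:
                errors[category] += 1
    return {c: 100.0 * errors[c] / totals[c] for c in totals}


# A toy classifier (an assumption for this sketch): per-category word counts.
def train_word_counts(training):
    model = defaultdict(Counter)
    for category, doc in training:
        model[category].update(doc.lower().split())
    return model


def predict_by_overlap(model, doc):
    words = doc.lower().split()
    # Pick the category whose learned words overlap most with the document.
    return max(sorted(model), key=lambda c: sum(model[c][w] for w in words))


samples = ([("spam", "buy cheap pills now")] * 6
           + [("notspam", "lunch meeting on monday")] * 6)
rates = cross_validate(samples, train_word_counts, predict_by_overlap, k=3)
print(rates)
```

Because `cross_validate` only sees the `train` and `predict` callables, the same splits can be reused for any classifier, which is what makes the comparison fair.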
You can cross validate as many categories as you like, provided the classifiers all support multiple categories. For example, you could compare dbacl(1) and ifile on many categories.
However, most email junk filters can only cope with two categories, representing junk mail and regular mail. When comparing the performance of such classifiers, bogofilter for example, the mailcross(1) testsuite is hard-coded to function with two categories named spam and notspam. You must use these category names, or the results will not make sense.