
Laird Breyer

Before you begin

A statistical email classifier cannot work unless you show it examples of emails, as many as you can, for every category of interest. This requires some work, because you must separate your emails into dedicated folders. dbacl(1) works best with mbox-style folders, the standard UNIX folder type that most mail readers can import and export.

We'll assume that you want to define two categories called notspam and spam respectively. If you want dbacl(1) to recognize these categories, please take a moment to create two mbox folders named (for example) $HOME/mail/spam and $HOME/mail/notspam. You must make sure that the $HOME/mail/notspam folder doesn't contain any unwanted messages, and similarly $HOME/mail/spam must not contain any wanted messages. If you mix messages in the two folders, dbacl(1) will be somewhat confused, and its classification accuracy will drop.
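Before learning, it can help to check that both folders exist and roughly how many messages each one holds. The following is only a minimal sketch, not part of dbacl(1): it assumes classic mbox folders in which every message begins with a line starting with "From ", and the names count_messages and MAIL_ROOT are made up for this illustration.

```shell
#!/bin/sh
# Count the messages in an mbox folder by counting its "From " separator
# lines. Assumption: classic mbox format, where each message starts with
# such a line and body lines that look like one have been escaped.
count_messages() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # there are no matches, so "|| true" keeps this usable under "set -e".
    grep -c '^From ' "$1" || true
}

# MAIL_ROOT is a hypothetical variable; it defaults to the folder
# location used in this tutorial.
MAIL_ROOT="${MAIL_ROOT:-$HOME/mail}"

for category in spam notspam; do
    folder="$MAIL_ROOT/$category"
    if [ -f "$folder" ]; then
        echo "$category: $(count_messages "$folder") messages"
    else
        echo "$category: $folder does not exist yet"
    fi
done
```

A count of zero, or a folder that is much smaller than the other, is a hint that you should collect more examples for that category before learning.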

If you've used other Bayesian spam filters, you will find that dbacl(1) is slightly different. While other filters can sometimes learn incrementally, one message at a time, dbacl(1) always learns from scratch, using all the messages you give it, and only those. It is optimized for learning a large number of messages quickly in one go, and for classifying messages as fast as possible afterwards. The author has several good reasons for this choice, which are beyond the scope of this tutorial.

As time goes by, if you use dbacl(1) for classification, you will probably set up your filing system so that messages identified as spam go automatically into the $HOME/mail/spam folder, and messages identified as notspam go into the $HOME/mail/notspam folder. dbacl(1) is far from perfect, and can make mistakes. This will result in messages going to the wrong folder. When dbacl(1) relearns from folders that contain such misfiled messages, it will become slightly confused, and over time its ability to distinguish spam from notspam will diminish.

As with all email classifiers which learn your email, you should inspect your folders regularly, and if you find messages in the wrong folder, you must move them to the correct folder before relearning. If you keep your mail folders clean for learning, dbacl(1) will eventually make very few mistakes, and you will have plenty of time to inspect the folders once in a while. Or so the theory goes...

Since dbacl(1) must relearn categories from scratch each time, you will probably want to set up a cron(1) job to relearn your mail folders every day at midnight. This tutorial explains how to do this below. If you prefer, you need not relearn periodically at all. The author relearns his categories once every few months, without noticeable loss. This works as long as your mail folders contain enough representative emails for training.
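As a rough illustration of what such a periodic job might look like, here is a small wrapper script. It is a sketch under several assumptions: dbacl(1) is on your PATH, learning is done with the -T email and -l options, and the script name relearn.sh, the "run" argument, and the MAIL_ROOT variable are invented for this example. Where the learned category files end up depends on your setup, so check dbacl(1) before relying on it.

```shell
#!/bin/sh
# relearn.sh (hypothetical name): relearn both categories from scratch.
# A crontab(5) entry that would run it every day at midnight:
#   0 0 * * * $HOME/bin/relearn.sh run

# MAIL_ROOT is a hypothetical variable; it defaults to the folder
# location used in this tutorial.
MAIL_ROOT="${MAIL_ROOT:-$HOME/mail}"

relearn() {
    # $1 = category name, $2 = mbox folder to learn from
    if [ ! -s "$2" ]; then
        # Refuse to learn from a missing or empty folder.
        echo "skipping $1: $2 is missing or empty" >&2
        return 1
    fi
    # Assumed invocation: learn category $1 from the emails in folder $2.
    dbacl -T email -l "$1" "$2"
}

relearn_all() {
    for category in spam notspam; do
        relearn "$category" "$MAIL_ROOT/$category" || return 1
    done
}

# Relearn only when invoked with the "run" argument, so that the file can
# also be sourced to try the functions by hand.
if [ "${1:-}" = "run" ]; then
    relearn_all
fi
```

Stopping at the first folder that is missing or empty is a deliberate choice here: since each run rebuilds the categories from scratch, it seems safer to keep yesterday's categories than to replace one of them with something learned from no examples.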

Last but not least, if after reading this tutorial you have trouble getting classification to work, please read is_it_working.html.
