SourceForge.net Logo
Summary
Forums
CVS
Download

Laird Breyer
Download
contents introduction tutorial spam fun man related
previous next

Advanced operation: Costs

This section can be skipped. It is here for completeness, but probably won't be very useful to you, especially if you are a new user.

The classification performed by dbacl(1) as described above is known as a MAP estimate. The optimal category is chosen only by looking at the email contents. What is missing is your input as to the costs of misclassifications.

This section is by no means necessary for using dbacl(1) for most classification tasks. It is useful for tweaking dbacl's algorithms only. If you want to improve dbacl's accuracy, first try to learn bigger collections of email.

To understand the idea, imagine that an email being wrongly marked spam is likely to be sitting in the $HOME/mail/spam folder until you check through it, while an email wrongly marked notspam will prominently appear among your regular correspondence. For most people, the former case can mean a missed timely communication, while the latter case is merely an annoyance.

No classification system is perfect. Learned emails can only imperfectly predict never before seen emails. Statistical models vary in quality. If you try to lower one kind of error, you automatically increase the other kind.

The dbacl system allows you to specify how much you hate each type of misclassification, and does its best to accomodate this extra information. To input your settings, you will need a risk specification like this:

categories {
    spam, notspam
}
prior {
    1, 1
}
loss_matrix {
"" spam      [            0,  1^complexity ]
"" notspam   [ 2^complexity,  0            ]
}

This risk specification states that your cost for misclassifying spam emails into notspam is 1 for every word of the email (merely an annoyance). Your cost for misclassifying regular emails into spam is 2 for every word of the email (a more serious problem). The costs for classifying your email correctly are zero in each case. Note that the cost numbers are arbitrary, only their relative sizes matter. See the tutorial if you want to understand these statements.

Now save your risk specification above into a file named my.risk, and type

% cat email.rfc | dbacl -T email -c spam -c notspam \
  -vna | bayesol -c my.risk -v
notspam

The output category may or may not differ from the category selected via dbacl(1) alone, but over many emails, the resulting classifications will be more cautious about marking an email as spam.

Since dbacl(1) can output the score for each category (using the -n switch), you are also free to do your own processing and decision calculation, without using bayesol(1). For example, you could use:

% cat email.rfc | dbacl -T email -n -c spam -c notspam | \
  awk '{ if($2 * p1 * u12 > $4 * (1 - p1) * u21) { print $1; } \
       else { print $3; } }'

where p1 is the a priori probability that an email is spam, u12 is the cost of misclassifying spam as notspam, and u21 is the cost of seeing spam among your regular email.

When you take your misclassification costs into account, it is better to use the logarithmic scores (given by the -n) instead of the true probabilities (given by the -N switch).

The scores represent the amount of evidence away from each model, so the smaller the score the better. For each category, dbacl outputs both the score and the complexity of the email (ie the number of tokens/words actually looked at). For example

% cat email.rfc | dbacl -T email -c spam -c notspam -vn
spam 5.52 * 42 notspam 5.46 * 42
would indicate that there are 5.46 bits of evidence away from notspam, but 5.52 bits of evidence away from spam. This evidence is computed based on 42 tokens (it's a small email :-), and one would conclude that the notspam category is a slightly better fit.
previous next
contents introduction tutorial spam fun man related