Advanced operation: Costs
This section can be skipped. It is here for completeness, but probably
won't be very useful to you, especially if you are a new user.
The classification performed by dbacl(1) as described above is known as a MAP
(maximum a posteriori) estimate: the optimal category is chosen by looking only at the email contents. What is missing is your input
as to the costs of misclassification.
This section is by no means necessary for using dbacl(1) for most
classification tasks; it is useful only for fine-tuning dbacl's decision rule.
If you want to improve dbacl's accuracy, first try learning bigger collections
of email.
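For example, assuming you keep archives of past spam and regular mail in single mailbox files (the paths below are placeholders, not part of dbacl), relearning from them might look like this:

% cat $HOME/mail/spam-archive | dbacl -T email -l spam
% cat $HOME/mail/notspam-archive | dbacl -T email -l notspam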
To understand the idea, imagine that an email wrongly marked spam will likely
sit in the $HOME/mail/spam folder until you check through it, while an email wrongly marked notspam will appear prominently among your regular correspondence. For most people, the former case can mean a missed timely communication, while the latter is merely an annoyance.
No classification system is perfect: learned emails can only imperfectly predict never-before-seen emails, and statistical models vary in quality. If you try to lower one kind of error, you generally increase the other kind.
The dbacl system allows you to specify how much you hate each type of misclassification, and does its best to accommodate this extra information. To input your settings, you will need a risk specification like this:
categories {
    spam, notspam
}
prior {
    1, 1
}
loss_matrix {
"" spam    [ 0,            1^complexity ]
"" notspam [ 2^complexity, 0 ]
}
This risk specification states that your cost for misclassifying spam emails into notspam is 1 for every word of the email (merely an annoyance), while your cost for misclassifying regular emails into spam is 2 for every word of the email (a more serious problem). The costs for classifying your email correctly are zero in each case. Note that the cost numbers themselves are arbitrary; only their relative sizes matter. See the tutorial if you want to understand these statements.
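For example, if you believe that roughly nine out of ten of your incoming emails are spam, you could express this through the prior block. The weights below are illustrative only, and the assumption here is that, as with the loss matrix, only their relative sizes matter; the examples that follow keep the equal prior shown above.

prior {
    9, 1
}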
Now save your risk specification above into a file named my.risk, and type
% cat email.rfc | dbacl -T email -c spam -c notspam \
    -vna | bayesol -c my.risk -v
notspam
The output category may or may not differ from the category selected via dbacl(1) alone, but over
many emails, the resulting classifications will be more cautious about marking an email as spam.
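If you want to apply this decision automatically as mail arrives, a wrapper along the lines of the sketch below is one possibility. The folder names are placeholders, and it assumes that bayesol prints just the chosen category, as in the example above; a real delivery script would also need proper mbox or maildir handling.

#!/bin/sh
# sketch: read one message on stdin, file it according to the
# risk-adjusted decision made by dbacl + bayesol
tmp=`mktemp` || exit 75
cat > "$tmp"
category=`dbacl -T email -c spam -c notspam -vna < "$tmp" | bayesol -c my.risk -v`
case "$category" in
    spam) cat "$tmp" >> "$HOME/mail/spam" ;;   # placeholder folder
    *)    cat "$tmp" >> "$HOME/mail/inbox" ;;  # placeholder folder
esac
rm -f "$tmp"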
Since dbacl(1) can output the score for each category (using the -n switch),
you are also free to do your own processing and decision calculation, without using bayesol(1).
For example, you could use:
% cat email.rfc | dbacl -T email -n -c spam -c notspam | \
    awk '{ if($2 * p1 * u12 > $4 * (1 - p1) * u21) { print $1; } \
           else { print $3; } }'
where p1 is the a priori probability that an email is spam, u12 is the cost of misclassifying
a spam email as notspam (seeing spam among your regular correspondence), and u21 is the cost of misclassifying a regular email as spam (missing it in your spam folder).
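For instance, you could plug in concrete numbers through awk's -v option; the values below (p1 = 0.2, u12 = 1, u21 = 2) are purely illustrative:

% cat email.rfc | dbacl -T email -n -c spam -c notspam | \
    awk -v p1=0.2 -v u12=1 -v u21=2 \
        '{ if($2 * p1 * u12 > $4 * (1 - p1) * u21) { print $1; } \
           else { print $3; } }'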
When you take your misclassification costs into account, it is better to use
the logarithmic scores (given by the -n switch) than the true probabilities (given by the -N switch).
The scores represent the amount of evidence away from each model, so the smaller the score, the better. For each category, dbacl outputs both the score and the complexity of the email (i.e. the number of tokens/words actually looked at). For example:
% cat email.rfc | dbacl -T email -c spam -c notspam -vn
spam 5.52 * 42 notspam 5.46 * 42
would indicate that there are 5.46 bits of evidence away from notspam,
but 5.52 bits of evidence away from spam. This evidence is computed
based on 42 tokens (it's a small email :-), and one would conclude that the notspam category is a slightly better fit.
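If you would rather make that comparison yourself, a possible one-liner (assuming the two-category output format shown above) is to pick the category with the smaller score:

% cat email.rfc | dbacl -T email -c spam -c notspam -vn | \
    awk '{ if($2 < $6) { print $1; } else { print $5; } }'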