Laird Breyer

Appendix B: Extreme probabilities

Why is the result of a dbacl probability calculation nearly always so extreme (0% or 100%)?

% dbacl -c one -c two -c three sample4.txt -N
one 0.00% two  100.00% three  0.00%

The reason for this has to do with the type of model which dbacl uses. Let's look at some scores:

% dbacl -c one -c two -c three sample4.txt -n
one 13549.34 two 8220.22 three 13476.84 
% dbacl -c one -c two -c three sample4.txt -nv
one 26.11 * 519.0 two 15.84 * 519.0 three 25.97 * 519.0

The first set of numbers is minus the logarithm (base 2) of each category's probability of producing the full document sample4.txt. This represents the evidence against each category, and is measured in bits. one and three are fairly even, but two has by far the lowest score and hence the highest probability (in other words, the model for two is the least bad at predicting sample4.txt, so if there are only three possible choices, it's the best). To understand these numbers, it's best to split each of them up into a product of cross entropy and complexity, as is done in the second line.
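
To see how the extreme -N percentages follow from these scores, here is a small Python sketch. This is not dbacl code; it simply reuses the rounded numbers printed above and assumes the three categories are compared on an equal footing:

# Rough sketch: turn the -n scores (which are -log2 of each category's
# probability of producing sample4.txt) back into normalised percentages.
# The figures are the rounded numbers printed above, so this is only an
# approximation of what dbacl computes internally.
scores = {"one": 13549.34, "two": 8220.22, "three": 13476.84}   # bits

best = min(scores.values())
# Shift by the best score before exponentiating: the winner gets weight 1.0,
# the losers underflow harmlessly towards 0.0 instead of overflowing.
weights = {c: 2.0 ** -(s - best) for c, s in scores.items()}
total = sum(weights.values())
for c, w in weights.items():
    print(f"{c} {100.0 * w / total:.2f}%")    # one 0.00% two 100.00% three 0.00%

# The -nv line factors each score as (cross entropy) x (complexity),
# e.g. 15.84 bits/feature * 519 features is about 8220 bits, up to rounding.

A gap of over 5000 bits between the best and second-best score corresponds to a probability ratio of roughly 2^5000, which is why the percentages saturate at 0% and 100%.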

Remember that dbacl calculates its resemblance probabilities by weighing the evidence from all the features found in the input document. There are 519 features in sample4.txt, and each feature contributes on average 26.11 bits of evidence against category one, 15.84 bits against category two and 25.97 bits against category three. Let's see what happens if we only look at the first 25 lines of sample4.txt:

% head -25 sample4.txt | dbacl -c one -c two -c three -nv
one 20.15 * 324.0 two 15.18 * 324.0 three 20.14 * 324.0

There are fewer features in the first 25 lines of sample4.txt than in the full text file, but the picture is substantially unchanged.

% head -25 sample4.txt | dbacl -c one -c two -c three -N
one  0.00% two 100.00% three  0.00%

dbacl is still very sure, because it has looked at many features (324) and found small differences which add up to quite different scores. However, you can see that each feature now contributes less evidence on average (20.15, 15.18, 20.14 bits) than before (26.11, 15.84, 25.97 bits).
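
To make this concrete, here is a back-of-the-envelope check in Python. The per-feature figures are the rounded -nv numbers above, so the result is only approximate:

import math

# Even a ~5 bit/feature advantage becomes overwhelming over 324 features.
features = 324
gap_per_feature = 20.15 - 15.18            # bits by which "one" trails "two"
total_gap = gap_per_feature * features     # about 1610 bits overall
print(total_gap)                           # 1610.28
print(total_gap * math.log10(2))           # about 485
# So P(one)/P(two) is roughly 10^-485, which -N rounds to 0.00% versus 100.00%.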

Since category two is obviously the best choice (closest to zero score) among the three models, let's drop it for a moment and consider the other two categories. We also dramatically reduce the number of features (words) we look at. The first line of sample4.txt has 15 words:

% head -1 sample4.txt | dbacl -c one -c three -N
one 25.65% three 74.35%

Finally, we are getting probabilities we can understand! Unfortunately, this is somewhat misleading. Each of the 15 words gave a score, and these scores were added up for each category. Since both categories here are about equally bad at predicting the words in sample4.txt, the difference in the final scores for categories one and three amounts to less than 3 bits of information, which is why the probabilities are mixed:

% head -1 sample4.txt | dbacl -c one -c three -nv
one 16.61 * 15.0 three 16.51 * 15.0
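
Repeating the same back-of-the-envelope arithmetic in Python with the rounded figures above shows where the mixed percentages come from (the result only matches the -N output up to rounding):

# 15 words, roughly 0.1 bit/word separating the two categories.
gap = (16.61 - 16.51) * 15.0        # about 1.5 bits in total
ratio = 2.0 ** gap                  # P(three)/P(one), about 2.8
p_three = ratio / (1.0 + ratio)
print(f"one {100 * (1 - p_three):.2f}% three {100 * p_three:.2f}%")
# prints roughly one 26% three 74%, close to the -N output above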

So the interpretation of the probabilities is clear. dbacl weighs the evidence from each feature it finds, and reports the best fit among the choices it is offered. Because it sees so many features separately (usually hundreds), it believes its verdict is very sure. Wouldn't you be, after hundreds of checks? Of course, whether these features are truly independent, and whether they are the right features to look at for the best classification, is another matter entirely, and that is up to you to decide. dbacl can't do much about its inbuilt assumptions.
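
To illustrate the compounding effect with made-up numbers (a toy Python calculation, not dbacl output): suppose each feature carries only a tenth of a bit of evidence on average in favour of the better of two categories, and the features are weighed independently.

# Toy illustration with invented numbers: many individually weak, independent
# checks add up to near-certainty.
bits_per_feature = 0.1                  # tiny average advantage per feature
for n in (15, 100, 500):
    gap = bits_per_feature * n          # total evidence in bits
    p = 1.0 / (1.0 + 2.0 ** -gap)       # posterior of the better category
    print(f"{n} features: {100 * p:.4f}%")
# 15 features: 73.8835%   100 features: 99.9024%   500 features: 100.0000%

Under the independence assumption, 500 such features already leave the runner-up with less than one chance in 10^15.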

Last but not least, the probabilities above are not the same as the confidence percentages printed by the -U switch. The -U switch was developed to overcome the limitations above, by looking at dbacl's calculations from a higher level, but this is a topic for another time.
