Why is the result of a dbacl probability calculation always so accurate?
% dbacl -c one -c two -c three sample4.txt -N
one 0.00% two 100.00% three 0.00%
The reason for this has to do with the type of model which dbacl uses. Let's
look at some scores:
% dbacl -c one -c two -c three sample4.txt -n
one 13549.34 two 8220.22 three 13476.84
% dbacl -c one -c two -c three sample4.txt -nv
one 26.11 * 519.0 two 15.84 * 519.0 three 25.97 * 519.0
The numbers printed by the first command are minus the logarithm (base 2) of each
category's probability of producing the full document sample4.txt. Each number
represents the evidence against its category, and is measured in bits.
Categories one and three are fairly even, but two has by far
the lowest score and hence highest probability (in other words, the model
for two is the least bad at predicting sample4.txt, so if there are only
three possible choices, it's the best).
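The step from scores to percentages can be sketched in a few lines of Python. This is only an illustration: it assumes equal prior weight for each category and that each probability is proportional to 2 raised to minus the score. The function name posterior_from_scores is made up for this example and is not part of dbacl.

```python
def posterior_from_scores(scores):
    """Turn scores in bits (minus log2 probability) into normalized probabilities.

    Assumes equal priors; an illustrative helper, not dbacl's own code.
    """
    best = min(scores.values())
    # Subtract the best score first so the largest weight is exactly 1.0;
    # far worse categories then underflow harmlessly to 0.0.
    weights = {name: 2.0 ** -(s - best) for name, s in scores.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Scores for sample4.txt as printed above.
scores = {"one": 13549.34, "two": 8220.22, "three": 13476.84}
for name, p in posterior_from_scores(scores).items():
    print(f"{name} {100 * p:.2f}%")
```

With a gap of more than 5000 bits between two and the other categories, the weights for one and three are far too small to show up in two decimal places, which is why the percentages look so extreme.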
To understand these numbers, it's best to split each of them into
a product of cross entropy (bits per feature) and complexity (number of features),
as is done in the output of the second command.
Remember that dbacl calculates probabilities about resemblance
by weighing the evidence for all the features found in the input document.
There are 519 features in sample4.txt, and each feature contributes on average 26.11 bits of evidence against category one, 15.84 bits against category two and 25.97 bits against category three. Let's look at what happens if we only look at the first 25 lines of sample4.txt:
% head -25 sample4.txt | dbacl -c one -c two -c three -nv
one 20.15 * 324.0 two 15.18 * 324.0 three 20.14 * 324.0
There are fewer features in the first 25 lines of sample4.txt than in the full
text file, but the picture is substantially unchanged.
% head -25 sample4.txt | dbacl -c one -c two -c three -N
one 0.00% two 100.00% three 0.00%
dbacl is still very sure, because it has looked at many features (324) and found
small differences which add up to quite different scores. However, you can see that
each feature now contributes less information: (20.15, 15.18, 20.14) bits, compared
with (26.11, 15.84, 25.97) bits earlier.
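To see how small per-feature differences add up, the sketch below multiplies the per-feature gap between the best and second-best category by the number of features, using only the figures printed above. The helper name evidence_gap is hypothetical, and reading the gap as odds assumes equal priors.

```python
import math


def evidence_gap(best_bits, runner_up_bits, n_features):
    """Total evidence (in bits) separating two categories over n_features."""
    return (runner_up_bits - best_bits) * n_features


# (cross entropy of two, cross entropy of three, number of features)
runs = {
    "full file": (15.84, 25.97, 519),
    "first 25 lines": (15.18, 20.14, 324),
}

for label, (best, runner_up, n) in runs.items():
    gap = evidence_gap(best, runner_up, n)
    # Odds of 2**gap are too large to print directly, so report log10 instead.
    log10_odds = gap * math.log10(2)
    print(f"{label}: gap {gap:.1f} bits, odds about 10^{log10_odds:.0f} to 1")
```

Even for the first 25 lines, a gap of about 5 bits per feature grows to more than 1600 bits in total, so the rounded percentage for two is still 100.00%.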
Since category two is obviously the best choice (its score is closest to zero) among the
three models, let's drop it for a moment and consider the other two categories.
We also dramatically reduce the number of features (words) we look at. The first line
of sample4.txt has 15 words:
% head -1 sample4.txt | dbacl -c one -c three -N
one 25.65% three 74.35%
Finally, we are getting probabilities we can understand! Unfortunately, this
is somewhat misleading. Each of the 15 words contributed a score, and these scores
were added up for each category. Since both categories here are about equally
bad at predicting words in sample4.txt, the final scores for categories
one and three differ by only about 1.5 bits of information, which is why the
probabilities are mixed:
% head -1 sample4.txt | dbacl -c one -c three -nv
one 16.61 * 15.0 three 16.51 * 15.0
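The same arithmetic explains the mixed percentages. Here is a minimal sketch, assuming equal priors and a hypothetical helper prob_from_gap: with only two categories, the probability of the weaker one depends only on the score gap in bits. The last loop is purely hypothetical; it assumes the same 0.10 bit per-word gap would hold over longer inputs, which the real data need not do.

```python
def prob_from_gap(gap_bits):
    """Probability of the worse of two categories, given its score deficit in bits."""
    return 1.0 / (1.0 + 2.0 ** gap_bits)


# Per-word cross entropies and word count from the verbose output above.
gap = (16.61 - 16.51) * 15  # about 1.5 bits
p_one = prob_from_gap(gap)
print(f"gap {gap:.2f} bits: one {100 * p_one:.1f}% three {100 * (1 - p_one):.1f}%")

# Hypothetical: if the same 0.10 bit per-word gap held over more words.
for n_words in (15, 150, 1500):
    print(n_words, f"{100 * prob_from_gap(0.10 * n_words):.4f}%")
```

The result for 15 words lands close to the 25.65% reported by dbacl; the small difference presumably comes from the cross entropies being printed rounded to two decimals.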
So the interpretation of the probabilities is clear. dbacl weighs the
evidence from each feature it finds, and reports the best fit among the choices
it is offered. Because it treats so many features separately (usually hundreds), it is
very sure of its verdict. Wouldn't you be, after hundreds of checks?
Of course, whether these features are independent, and whether they are the right
features to look at for the best classification, is another matter entirely, and it is
up to you to decide. dbacl can't do much about its built-in assumptions.
Last but not least, the probabilities above are not the same as the
confidence percentages printed by the -U switch. The -U switch was developed
to overcome the limitations above, by looking at dbacl's calculations
from a higher level, but this is a topic for another time.
