Evaluating the models
Now that you have a grasp of the variety of language models which dbacl can generate, the important question is: which set of features should you use?
There is no easy answer to this problem.
Intuitively, a larger set of features always seems preferable, since it takes more information into account.
However, there is a tradeoff.
Comparing more features requires extra memory, but much more importantly, too many features can overfit the data.
This results in a model which is so good at predicting the learned documents that virtually no other documents are considered even remotely similar.
It is beyond the scope of this tutorial to describe the variety of statistical
methods which can help decide what features are meaningful. However, to get a
rough idea of the quality of the model, we can look at the cross entropy
reported by dbacl.
The cross entropy is measured in bits and has the following meaning:
If we use our probabilistic model to construct an optimal compression algorithm,
then the cross entropy of a text string is the predicted number of bits needed on average, after compression, for each separate feature.
This rough description isn't complete, since the cross entropy doesn't measure the amount of space also needed for the probability model itself. Moreover,
what we mean by compression is the act of compressing the features, not the full
document, which also contains punctuation and white space, both of which are ignored.
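To make the definition concrete, here is a minimal Python sketch. It is not dbacl's own code: the toy `model` dictionary, its probabilities, and the function name `cross_entropy_bits` are all assumptions made for illustration. It computes the average number of bits per feature that an ideal compressor built from the model would need.

```python
import math

def cross_entropy_bits(features, model):
    """Average bits per feature under `model`, a dict mapping feature -> probability."""
    # An ideal code spends -log2(p) bits on a feature of probability p.
    total_bits = -sum(math.log2(model[f]) for f in features)
    return total_bits / len(features)

# Hypothetical toy model over three word features (probabilities sum to 1).
model = {"the": 0.5, "cat": 0.25, "sat": 0.25}
features = ["the", "cat", "the", "sat"]

print(cross_entropy_bits(features, model))  # prints 1.5
```

Frequent features are cheap to encode and rare ones are expensive, so a model that assigns high probability to the features actually seen yields a low cross entropy.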
To compute the cross entropy of category one, type
% dbacl -c one sample1.txt -vn
one 7.42 * 678.0
The cross entropy is the first value (7.42) returned. The second value (678.0), called the complexity, essentially
measures how many features describe the document.
Now suppose we try other models trained on the same document:
% dbacl -c slick sample1.txt -vn
slick 4.68 * 677.5
% dbacl -c smooth sample1.txt -vn
smooth 6.03 * 640.5
The first thing to note is that the complexity terms are not the same. The
slick category is based on word pairs (also called bigrams), of
which there are 677 in this document. But there are 678 words, and the
fractional value indicates that the last word only counts for half a
feature. The smooth category also depends on word pairs, but
unlike slick, pairs cannot be counted if they straddle a
newline (this is a limitation of line-oriented regular expressions).
So in smooth, there are several missing word pairs, and various
single words which count as a fractional pair, giving a grand total of
640.5.
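The counting described above can be sketched in Python. This is only a toy counting rule assumed for illustration, not taken from dbacl's source: every adjacent word pair counts as one feature, and the last word of the text (or, in the line-by-line case, of each line) counts as half a feature. The function names and sample lines are hypothetical.

```python
def pair_count_whole(words):
    """Pairs across the whole text; the final word counts as half a feature."""
    return (len(words) - 1) + 0.5 if words else 0.0

def pair_count_by_line(lines):
    """Pairs counted within each line only, so pairs straddling a newline are lost."""
    return sum(pair_count_whole(line.split()) for line in lines)

# Hypothetical three-line text with seven words in total.
text_lines = ["the quick brown fox", "jumps over", "dogs"]
all_words = " ".join(text_lines).split()

print(pair_count_whole(all_words))     # whole-text count
print(pair_count_by_line(text_lines))  # smaller, since line breaks split pairs
```

Under this assumed rule, a text of 678 words counted as a whole gives 677.5, matching the complexity reported for slick, while every line break lowers the line-by-line total.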
The second thing to note is that both bigram models fit sample1.txt better than the one category does. This is
easy to see for slick, since the complexity (essentially the number of features)
is nearly the same as for one, so the comparison reduces to seeing
which cross entropy is lower.
Let's ask dbacl which category fits better:
% dbacl -c one -c slick sample1.txt -v
slick
You can do the same thing to compare one and smooth.
Let's ask dbacl which category fits better overall:
% dbacl -c one -c slick -c smooth sample1.txt -v
slick
We already know that slick is better than one, but why is
slick better than smooth? While slick looks at more
features than smooth (677.5 versus 640.5), it needs just 4.68 bits
of information per feature to represent the sample1.txt document,
while smooth needs 6.03 bits on average. So slick wins based
on economies of scale.
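The arithmetic behind this comparison can be checked with a short Python sketch. It assumes that the quantity being compared is the product of the two reported numbers (bits per feature times number of features), which is what the "7.42 * 678.0" output format suggests; the variable and function names are hypothetical.

```python
# (cross entropy in bits per feature, complexity) as reported by dbacl above.
scores = {
    "one": (7.42, 678.0),
    "slick": (4.68, 677.5),
    "smooth": (6.03, 640.5),
}

def total_bits(cross_entropy, complexity):
    """Assumed total cost of the document: bits per feature times feature count."""
    return cross_entropy * complexity

totals = {name: total_bits(h, n) for name, (h, n) in scores.items()}
best = min(totals, key=totals.get)

# List the categories from cheapest to most expensive.
for name, bits in sorted(totals.items(), key=lambda kv: kv[1]):
    print(f"{name}: {bits:.1f} bits")
print("best:", best)
```

Under this assumption slick needs roughly 3171 bits in total, smooth roughly 3862, and one roughly 5031, so slick has the smallest total even though it looks at more features than smooth.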
WARNING: it is not always appropriate to compare categories whose models
look at different feature sets, as we did above. The underlying statistical
basis for these comparisons is the likelihood, but it is easy to
compare "apples and oranges" incorrectly. It is safest if you learn and
classify documents by using exactly the same command line switches
for every category.
