Laird Breyer

Evaluating the models

Now that you have a grasp of the variety of language models which dbacl can generate, the important question is: which set of features should you use?

There is no easy answer to this problem. Intuitively, a larger set of features always seems preferable, since it takes more information into account. However, there is a tradeoff. Comparing more features requires extra memory, but much more importantly, too many features can overfit the data. This results in a model which is so good at predicting the learned documents that virtually no other documents are considered even remotely similar.

It is beyond the scope of this tutorial to describe the variety of statistical methods which can help decide what features are meaningful. However, to get a rough idea of the quality of the model, we can look at the cross entropy reported by dbacl.

The cross entropy is measured in bits and has the following meaning: if we use our probabilistic model to construct an optimal compression algorithm, then the cross entropy of a text string is the predicted number of bits needed on average, after compression, for each separate feature. This rough description isn't complete. The cross entropy doesn't measure the space needed for the probability model itself. Moreover, what we mean by compression is the act of compressing the features, not the full document, which also contains punctuation and white space that are ignored.
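
To make the "bits per feature" reading concrete, here is a minimal sketch in shell and awk. It assumes a toy model which assigns a made-up probability to each feature of a four-feature string. The function name avg_bits and the probabilities are purely illustrative and are not taken from dbacl, whose models are more elaborate. An optimal code spends about -log2(p) bits on a feature of probability p, so the average over all features plays the role of the cross entropy:

```shell
# avg_bits: read one (hypothetical) feature probability per line and
# print the average of -log2(p), i.e. the mean code length in bits.
avg_bits() {
    awk '
        $1 > 0 { bits += -log($1) / log(2); n++ }
        END    { if (n > 0) printf "%.2f\n", bits / n }
    '
}

# Toy probabilities for four features; they cost 1, 2, 3 and 2 bits.
printf '0.5\n0.25\n0.125\n0.25\n' | avg_bits
```

With these toy values the four features cost 8 bits in total, so the sketch prints 2.00 bits per feature.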

To compute the cross entropy of category one, type

% dbacl -c one sample1.txt -vn
one  7.42 * 678.0

The cross entropy is the first value (7.42) returned. The second value (678.0), called the complexity, essentially measures how many features describe the document. Now suppose we try other models trained on the same document:

% dbacl -c slick sample1.txt -vn
slick  4.68 * 677.5
% dbacl -c smooth sample1.txt -vn
smooth  6.03 * 640.5

The first thing to note is that the complexity terms are not the same. The slick category is based on word pairs (also called bigrams), of which there are 677 in this document. But there are 678 words, and the fractional value indicates that the last word only counts for half a feature. The smooth category also depends on word pairs, but unlike slick, pairs cannot be counted if they straddle a newline (this is a limitation of line-oriented regular expressions). So in smooth, there are several missing word pairs, and various single words which count as a fractional pair, giving a grand total of 640.5.
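
The effect of the newline restriction can be illustrated with a small sketch. It assumes plain whitespace tokenization, which is not dbacl's tokenizer, and it ignores the fractional features mentioned above, so it will not reproduce the 677.5 or 640.5 figures. The function name count_pairs is hypothetical. It only shows why pairs confined to a single line are fewer than pairs counted across the whole text:

```shell
# count_pairs: read text on stdin, split on whitespace, and report
#   all_pairs         = adjacent word pairs, newlines ignored
#   within_line_pairs = adjacent word pairs that do not straddle a newline
count_pairs() {
    awk '
        NF > 0 { words += NF; within += NF - 1 }
        END {
            across = (words > 0) ? words - 1 : 0
            printf "words=%d all_pairs=%d within_line_pairs=%d\n", words, across, within
        }
    '
}

printf 'the quick brown fox\njumps over\nthe lazy dog\n' | count_pairs
```

For this three-line toy text the sketch prints words=9 all_pairs=8 within_line_pairs=6: each line break costs one pair.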

The second thing to note is that both bigram models fit sample1.txt better than one does. This is easy to see for slick, since the complexity (essentially the number of features) is nearly the same as for one, so the comparison reduces to seeing which cross entropy is lower. Let's ask dbacl which category fits better:

% dbacl -c one -c slick sample1.txt -v

You can do the same thing to compare one and smooth. Let's ask dbacl which category fits better overall:

% dbacl -c one -c slick -c smooth sample1.txt -v

We already know that slick is better than one, but why is slick better than smooth? While slick looks at more features than smooth (677.5 versus 640.5), it needs just 4.68 bits of information per feature to represent the sample1.txt document, while smooth needs 6.03 bits on average. So slick wins based on economies of scale.
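
The comparison can also be checked by hand. The sketch below assumes that the overall score of a category is the product of the two reported numbers (cross entropy times complexity, as the "*" in the output suggests), and that the smallest total wins. The function name rank_models is hypothetical; the input lines are the -vn outputs shown earlier:

```shell
# rank_models: read lines of the form "name  entropy * complexity"
# and print "name total_bits", sorted with the smallest total first.
rank_models() {
    awk '{ printf "%s %.1f\n", $1, $2 * $4 }' | sort -k2,2n
}

rank_models <<'EOF'
one  7.42 * 678.0
slick  4.68 * 677.5
smooth  6.03 * 640.5
EOF
```

Under that assumption the totals are roughly 3170.7 bits for slick, 3862.2 for smooth and 5030.8 for one. Although smooth has fewer features to encode, its higher cost per feature leaves it behind slick.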

WARNING: it is not always appropriate to classify documents whose models look at different feature sets, as we did above. The underlying statistical basis for these comparisons is the likelihood, but it is easy to compare "apples and oranges" incorrectly. It is safest if you learn and classify documents by using exactly the same command line switches for every category.
