Beyond Naive Bayes
The classification methodology used by the dbacl project is fully Bayesian, a
fact reflected both in its choice of learning models and in its decision
algorithms.
When learning a text corpus, dbacl builds a Bayesian linguistic model by
maximizing the entropy (uncertainty) of a distribution constrained by selected
features. dbacl can cope with features defined by arbitrary regular
expressions, which gives it unified support for unigram ("naive Bayes"),
n-gram, and various random field models. Automatic feature smoothing is
achieved by combining maximum entropy with the digramic reference measure.
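Schematically, and in notation chosen here rather than taken from the dbacl
documentation, such a model assigns to a document w the probability

    P(w) = \frac{1}{Z(\lambda)} \, m(w) \, \exp\!\Big( \sum_i \lambda_i f_i(w) \Big)

where the f_i are the (possibly regex-defined) feature indicators, the weights
\lambda_i are fitted by entropy maximization subject to the observed feature
frequencies, m is the digramic reference measure that smooths features never
seen during training, and Z(\lambda) normalizes the distribution.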
When classifying new text with the utility bayesol(1), the optimal Bayesian
decision is computed from the prior probabilities, the calculated conditional
distribution, and the supplied misclassification costs.
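The decision rule itself is standard Bayes-risk minimization. The following
Python sketch illustrates the idea in the abstract; the category names, costs,
and probabilities are purely illustrative and do not reflect bayesol's actual
input or output formats.

    # Minimal sketch of a Bayes-optimal decision: pick the category that
    # minimizes expected misclassification cost, given prior probabilities,
    # per-category likelihoods, and a cost matrix.

    def bayes_optimal_decision(priors, likelihoods, costs):
        """priors[c]      -> P(c)
           likelihoods[c] -> P(text | c) under category c's model
           costs[(a, c)]  -> cost of deciding a when the true category is c"""
        # posterior P(c | text) by Bayes' rule
        evidence = sum(priors[c] * likelihoods[c] for c in priors)
        posterior = {c: priors[c] * likelihoods[c] / evidence for c in priors}
        # expected cost (Bayes risk) of each possible decision
        risk = {a: sum(costs[(a, c)] * posterior[c] for c in priors)
                for a in priors}
        return min(risk, key=risk.get)

    # hypothetical example: a false "spam" decision costs ten times more
    priors = {"spam": 0.5, "ham": 0.5}
    likelihoods = {"spam": 1e-40, "ham": 3e-41}   # illustrative values only
    costs = {("spam", "spam"): 0, ("spam", "ham"): 10,
             ("ham", "spam"): 1, ("ham", "ham"): 0}
    print(bayes_optimal_decision(priors, likelihoods, costs))   # prints "ham"

Note that with symmetric costs the rule reduces to choosing the category with
the highest posterior probability; the cost matrix is what lets a user trade
one kind of error against another.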
When calculating conditional distributions, dbacl(1) can report either
probabilities or universal information measures such as the cross entropy,
which allow direct comparison with other algorithms and can also be used in
minimum description length calculations.
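For a text T scored against a category c, the cross entropy can be written
(again in notation chosen here, not dbacl's) as

    H(T, c) = -\frac{1}{|T|} \sum_{w \in T} \log_2 P_c(w)

in bits per feature, so that |T| \cdot H(T, c) is approximately the number of
bits needed to encode T under category c's model; this description length is
the quantity compared in minimum description length arguments.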