dbacl project homepage


Summary
Forums
CVS
Download

Laird Breyer

Download

previous next

Basic operation: Classifying

Suppose you have a file named email.rfc which contains the body of a single email (in standard RFC822 format). You can classify the email into either spam or notspam by typing:

% cat email.rfc | dbacl -T email -c spam -c notspam -v
notspam

All you get is the name of the best category, the email itself is consumed. A variation of particular interest is to replace -v by -U, which gives a percentage representing how sure dbacl is of printing the correct category.

% cat email.rfc | dbacl -T email -c spam -c notspam -U
notspam # 67%

A result of 100% means dbacl is very sure of the printed category, while a result of 0% means at least one other category is equally likely. If you would like to see separate scores for each category, type:

% cat email.rfc | dbacl -T email -c spam -c notspam -n
spam 232.07 notspam 229.44

The winning category always has the smallest score (closest to zero). In fact, the numbers returned with the -n switch can be interpreted as unnormalized distances towards each category from the input document, in a mathematical space of high dimensions. If you prefer a return code, dbacl(1) returns a positive integer (1, 2, 3, ...) identifying the category by its position on the command line. So if you type:

% cat email.rfc | dbacl -T email -c spam -c notspam

then you get no output, but the return code is 2. If you use the bash(1) shell, the return code for the last run command is always in the variable $?.

previous next