Basic operation: Classifying
Suppose you have a file named email.rfc which contains the body of a single email
(in standard RFC822 format).
You can classify the email into either spam or notspam by typing:
% cat email.rfc | dbacl -T email -c spam -c notspam -v
notspam
All you get is the name of the best category; the email itself is consumed.
A particularly useful variation is to replace -v with -U, which also prints
a percentage indicating how sure dbacl is that the printed category is correct.
% cat email.rfc | dbacl -T email -c spam -c notspam -U
notspam # 67%
A result of 100% means dbacl is very sure of the printed category, while
a result of 0% means at least one other category is equally likely.
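A script might act on the category only when the percentage is high enough. The following is a minimal sketch, assuming the -U output is a single line of the form "category # NN%" with an integer percentage, exactly as in the example above. The function name sure_enough, the threshold argument and the "unsure" fallback are hypothetical and not part of dbacl.

```shell
# sure_enough THRESHOLD: read one line such as "notspam # 67%" on stdin.
# Print the category if the percentage is at least THRESHOLD,
# otherwise print "unsure". (Hypothetical helper, not part of dbacl.)
sure_enough() {
    threshold=$1
    # Split the line into: category, the "#" separator, and "NN%".
    read -r category _ percent
    # Strip the trailing "%" so the value can be compared as an integer.
    percent=${percent%\%}
    if [ "$percent" -ge "$threshold" ]; then
        echo "$category"
    else
        echo "unsure"
    fi
}

# Sample line taken from the example above; with dbacl installed, the
# output of the -U command could be piped into sure_enough instead.
echo "notspam # 67%" | sure_enough 50
```

Where the threshold should sit depends on how costly a misclassification is; the value 50 here is only for illustration.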
If you would like to see separate scores for each category, type:
% cat email.rfc | dbacl -T email -c spam -c notspam -n
spam 232.07 notspam 229.44
The winning category always has the smallest score (closest to zero). In fact, the numbers returned with
the -n switch can be interpreted as unnormalized distances from the input document to each category, in a high-dimensional mathematical space.
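The smallest-score rule can be applied mechanically to the -n output. The following is a minimal sketch, assuming the output is a single line of alternating category names and scores, as in the example above. The function name best_category is hypothetical and not part of dbacl.

```shell
# best_category: read lines of "name score name score ..." pairs on stdin
# and, for each line, print the name whose score is smallest.
# (Hypothetical helper, not part of dbacl.)
best_category() {
    awk '{
        best = ""
        # Walk the fields two at a time: $i is a name, $(i + 1) its score.
        for (i = 1; i + 1 <= NF; i += 2) {
            if (best == "" || $(i + 1) + 0 < min) {
                min = $(i + 1) + 0
                best = $i
            }
        }
        if (best != "") print best
    }'
}

# Sample line taken from the example above; with dbacl installed, the
# output of the -n command could be piped into best_category instead.
echo "spam 232.07 notspam 229.44" | best_category
```

For the sample line this prints notspam, since 229.44 is smaller than 232.07.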
If you prefer a return code,
dbacl(1) returns a positive integer (1, 2, 3, ...)
identifying the category by its position on the command line. So if you type:
% cat email.rfc | dbacl -T email -c spam -c notspam
then you get no output, but the return code is 2, because notspam is the second category on the command line. If you use the bash(1) shell, the return code
of the most recently run command is always available in the variable $?.
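In a script, the return code can be turned back into a category name by listing the categories in the same order as on the dbacl command line. The following is a minimal sketch; the function name category_from_status is hypothetical and not part of dbacl, and the dbacl invocation itself appears only in a comment so that the block runs without dbacl installed.

```shell
# category_from_status STATUS NAME...: print the NAME found at the
# 1-based position STATUS. (Hypothetical helper, not part of dbacl.)
category_from_status() {
    status=$1
    shift
    # Reject codes that do not correspond to any listed category.
    if [ "$status" -lt 1 ] || [ "$status" -gt "$#" ]; then
        echo "unexpected return code: $status" >&2
        return 1
    fi
    # Drop the names before the wanted one, then print it.
    shift $((status - 1))
    echo "$1"
}

# Intended use, assuming dbacl is installed and both categories exist:
#   cat email.rfc | dbacl -T email -c spam -c notspam
#   category_from_status $? spam notspam
# Here the return code 2 from the example above is passed in directly.
category_from_status 2 spam notspam
```

Note that $? must be read immediately after the dbacl pipeline, since any other command run in between overwrites it.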