Laird Breyer

Using bayesol

bayesol is a companion program for dbacl which makes the decision calculations easier. The bad news is that you still have to write down a prior and loss matrix yourself. Eventually, someone, somewhere may write a graphical interface. The good news is that for most classification tasks, you don't need to bother with bayesol at all, and can skip this section. Really.

bayesol reads a risk specification file, which is a text file containing information about the categories required, the prior distribution and the cost of misclassifications. For the toy example discussed earlier, the file toy.risk looks like this:

categories {
    one, two, three
}
prior {
    2, 1, 1
}
loss_matrix {
"" one   [ 0, 1, 2 ]
"" two   [ 3, 0, 5 ]
"" three [ 1, 1, 0 ]
}
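
To make the roles of the three blocks concrete, here is a minimal Python sketch of the kind of decision calculation such a file describes. It is not bayesol's source code. It assumes that each row of loss_matrix holds the costs for a document whose true category is the row label, with one column per category chosen (this reading is inferred from the way the text below speaks of "misclassifying two as three"). The likelihood values are made up for illustration, and the function names are hypothetical.

```python
# Hypothetical sketch of a minimum-expected-loss decision; not bayesol's code.
categories = ["one", "two", "three"]
prior = [2, 1, 1]          # unnormalized weights, as written in toy.risk
loss = [
    [0, 1, 2],             # true category one:   cost of choosing one, two, three
    [3, 0, 5],             # true category two
    [1, 1, 0],             # true category three
]

def posterior(prior, likelihood):
    """Combine prior weights with likelihoods and normalize to sum to 1."""
    weights = [p * l for p, l in zip(prior, likelihood)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_losses(post, loss):
    """Expected cost of choosing each category j, averaged over the truth i."""
    n = len(post)
    return [sum(post[i] * loss[i][j] for i in range(n)) for j in range(n)]

def decide(prior, likelihood, loss, categories):
    """Return the category with the smallest expected loss."""
    risks = expected_losses(posterior(prior, likelihood), loss)
    return categories[risks.index(min(risks))]

# Made-up values of P(document | category), for illustration only.
likelihood = [0.2, 0.5, 0.3]
print(decide(prior, likelihood, loss, categories))
```

With these made-up likelihoods the expected losses come out to 1.5, about 0.58 and 2.75, so the sketch picks two. The numbers are not those of sample4.txt.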

Let's see if our hand calculation was correct:

% dbacl -c one -c two -c three sample4.txt -vna | bayesol -c toy.risk -v
two

Good! However, as discussed above, the misclassification costs need improvement. This is completely up to you, but here are some possible suggestions to get you started.

To devise effective loss matrices, it pays to think about the way that dbacl computes the probabilities. Appendix B gives some details, but we don't need to go that far. Recall that the language models are based on features (which are usually kinds of words). Every feature counts towards the final probabilities, and a big document will have more features, hence more opportunities to steer the probabilities one way or another. So a feature is like an information bearing unit of text.

When we read a text document which doesn't accord with our expectations, we grow progressively more annoyed as we read further into the text. This is like an annoyance interest rate which compounds on information units within the text. For dbacl, the number of information bearing units is reported as the complexity of the text. This suggests that the cost of reading a misclassified document could have the form (1 + interest)^complexity. Here's an example loss matrix which uses this idea:

loss_matrix { 
"" one   [ 0,               (1.1)^complexity,  (1.1)^complexity ]
"" two   [(1.1)^complexity, 0,                 (1.7)^complexity ] 
"" three [(1.5)^complexity, (1.01)^complexity, 0 ]
} 

Remember, these aren't monetary interest rates, they are value judgements. You can see this loss matrix in action by typing

% dbacl -c one -c two -c three sample5.txt -vna | bayesol -c example1.risk -v
three

Now if we increase the cost of misclassifying two as three from 1.7 to 2.0, the optimal category becomes

% dbacl -c one -c two -c three sample5.txt -vna | bayesol -c example2.risk -v
two
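
To see how a single entry can tip the decision, here is a small Python sketch that evaluates the loss matrix above for a given complexity and picks the category with the smallest expected loss. The posterior probabilities and the complexity value are invented for illustration and are not those of sample5.txt. The function names are hypothetical, and the row/column reading (row = true category, column = chosen category) is an assumption.

```python
# Hypothetical sketch: complexity-dependent losses, not bayesol's code.
CATEGORIES = ("one", "two", "three")

def loss_matrix(complexity, two_as_three):
    """Build the example matrix; two_as_three is the base for row two, column three."""
    c = complexity
    return [
        [0.0,      1.1 ** c,  1.1 ** c],
        [1.1 ** c, 0.0,       two_as_three ** c],
        [1.5 ** c, 1.01 ** c, 0.0],
    ]

def best_category(post, loss):
    """Return the category whose expected loss under post is smallest."""
    n = len(post)
    risks = [sum(post[i] * loss[i][j] for i in range(n)) for j in range(n)]
    return CATEGORIES[risks.index(min(risks))]

post = [0.097, 0.003, 0.9]   # made-up posterior probabilities
complexity = 10              # made-up complexity

print(best_category(post, loss_matrix(complexity, 1.7)))
print(best_category(post, loss_matrix(complexity, 2.0)))
```

With these invented numbers, the document is most likely of category three, yet raising the base from 1.7 to 2.0 makes the small chance of it really being two expensive enough that choosing two becomes the cheaper decision.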

bayesol can also handle infinite costs. Just write "inf" where you need it. This is particularly useful with regular expressions. If you look at each row of loss_matrix above, you see an empty string "" before each category. This indicates that this row is to be used by default in the actual loss matrix. But sometimes, the losses can depend on seeing a particular string in the document we want to classify.

Suppose you normally like to use the loss matrix above, but if the document contains the word "Polly", the cost of misclassifying a document of category two becomes infinite. Here is an updated loss_matrix:

loss_matrix { 
""          one   [ 0,               (1.1)^complexity,  (1.1)^complexity ]
"Polly"     two   [ inf,             0,                 inf ]
""          two   [(1.1)^complexity, 0,                 (2.0)^complexity ] 
""          three [(1.5)^complexity, (1.01)^complexity, 0 ]
}

bayesol looks in its input for the regular expression "Polly", and if it is found, then for misclassifications away from two, it uses the row with the infinite values, otherwise it uses the default row, which starts with "". If you have several rows with regular expressions for each category, bayesol always uses the first one from the top which matches within the input. You must always have at least a default row for every category.
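
The "first matching row wins" rule can be sketched in a few lines of Python. This only illustrates the selection logic described above. Python's re module stands in for whatever regular expression engine bayesol uses, so the pattern syntax may differ in details. The names rows and select_rows are hypothetical.

```python
import math
import re

# Each entry: (regular expression, category, function of complexity giving the row).
# The empty pattern "" matches any input, so it acts as the default row.
rows = [
    ("",      "one",   lambda c: [0.0, 1.1 ** c, 1.1 ** c]),
    ("Polly", "two",   lambda c: [math.inf, 0.0, math.inf]),
    ("",      "two",   lambda c: [1.1 ** c, 0.0, 2.0 ** c]),
    ("",      "three", lambda c: [1.5 ** c, 1.01 ** c, 0.0]),
]

def select_rows(rows, text):
    """For each category, keep the first row (from the top) whose pattern matches."""
    chosen = {}
    for pattern, category, row in rows:
        if category not in chosen and re.search(pattern, text):
            chosen[category] = row
    return chosen

with_polly = select_rows(rows, " Polly wants a cracker\n")
without_polly = select_rows(rows, " nothing special here\n")

print(with_polly["two"](10))      # the row with infinite costs
print(without_polly["two"](10))   # the default row for category two
```

Because the "Polly" row is listed before the default row for two, it takes precedence whenever the word occurs. Swapping the two rows would make the default row match first every time.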

The regular expression facility can also be used to perform more complicated document dependent loss calculations. Suppose you would like to count the number of lines of the input document which start with the character '>', as a proportion of the total number of lines in the document. The following perl script transcribes its input and appends the calculated proportion.

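#!/usr/bin/perl
# this is file prop.pl

# count lines starting with " >" (dbacl -a puts a space before each original line)
$special = $normal = 0;
while (<STDIN>) {
    $special++ if /^ >/;
    $normal++;
    print;    # transcribe the input unchanged
}
# guard against empty input, which would otherwise divide by zero
$prop = $normal ? $special/$normal : 0;
print "proportion: $prop\n";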

If we used this script, then we could take the output of dbacl, append the proportion of lines starting with '>', and pass the result as input to bayesol. For example, the following line is included in the example3.risk specification:

"^proportion: ([0-9.]+)" one [ 0, (1+$1)^complexity, (1.2)^complexity ]

and through this, bayesol reads, if present, the line containing the proportion we calculated and takes this into account when it constructs the loss matrix. You can try this like so:

% dbacl -T email -c one -c two -c three sample6.txt -nav \
  | perl prop.pl | bayesol -c example3.risk -v

Note that in the loss_matrix specification above, $1 refers to the numerical value of the quantity inside the parentheses. Also, it is useful to remember that when using the -a switch, dbacl outputs all the original lines from the input document (sample6.txt here) with an extra space in front of them. If another instance of dbacl needs to read this output again (e.g. in a pipeline), then the latter should be invoked with the -A switch.
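
The substitution of $1 can be pictured with the following Python sketch, which extracts the captured number from the input and plugs it into the loss formula of the example3.risk line quoted above. It is an illustration only: the function name and the sample text are hypothetical, and Python's re module is used in place of bayesol's own regular expression handling.

```python
import re

# Same pattern as in the risk file; MULTILINE lets ^ match at any line start.
PATTERN = re.compile(r"^proportion: ([0-9.]+)", re.MULTILINE)

def row_one(text, complexity):
    """Return the row [0, (1+$1)^complexity, (1.2)^complexity], or None if no match."""
    match = PATTERN.search(text)
    if match is None:
        return None                    # the caller would fall back to the default row
    proportion = float(match.group(1)) # this plays the role of $1
    return [0.0, (1 + proportion) ** complexity, 1.2 ** complexity]

# Made-up input: two transcribed lines followed by the line appended by prop.pl.
sample = " > quoted text\n plain text\nproportion: 0.25\n"

print(row_one(sample, 4))
print(row_one(" no proportion line here\n", 4))
```

The more quoted lines the document contains, the larger the captured proportion and the faster the cost of misclassifying one as two grows with complexity.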
