Using bayesol
bayesol is a companion program for dbacl which makes the decision calculations
easier. The bad news is that you still have to write down a prior and loss matrix
yourself. Eventually, someone, somewhere may write a graphical interface.
The good news is that for most classification tasks, you don't need to
bother with bayesol at all, and can skip this section. Really.
bayesol reads a risk specification file, which is a text file describing the categories required, the prior distribution, and the costs of misclassification. For the toy example discussed earlier, the file toy.risk looks like this:
categories {
    one, two, three
}
prior {
    2, 1, 1
}
loss_matrix {
"" one   [ 0, 1, 2 ]
"" two   [ 3, 0, 5 ]
"" three [ 1, 1, 0 ]
}
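Under the hood, bayesol picks the category which minimizes the expected loss, averaging each column of the loss matrix against the posterior category probabilities. The following Perl sketch illustrates that calculation for the toy matrix above; it is not bayesol's actual code, and the posterior values are invented for illustration:

#!/usr/bin/perl
# sketch of bayesol's expected loss calculation (not the real code)
use strict;
use warnings;

my @cat  = qw(one two three);
# invented posteriors p(category|document), standing in for what
# dbacl's scores give once combined with the prior 2, 1, 1
my @p    = (0.5, 0.3, 0.2);
# $loss[$i][$j] = cost of filing a category $i document under $j
my @loss = ( [ 0, 1, 2 ],
             [ 3, 0, 5 ],
             [ 1, 1, 0 ] );

for my $j (0 .. $#cat) {
    my $risk = 0;
    $risk += $p[$_] * $loss[$_][$j] for 0 .. $#cat;
    printf "risk(%s) = %.2f\n", $cat[$j], $risk;
}

With these made-up posteriors, the risks come out as 1.10, 0.70 and 2.50 respectively, so two carries the smallest expected loss.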
Let's see if our hand calculation was correct:
% dbacl -c one -c two -c three sample4.txt -vna | bayesol -c toy.risk -v
two
Good! However, as discussed above, the misclassification costs need improvement. This is completely up to you, but here are some suggestions to get you started.
To devise effective loss matrices, it pays to think about the way that dbacl
computes the probabilities. Appendix B gives some
details, but we don't need to go that far. Recall that the language models are
based on features (which are usually kinds of words).
Every feature counts towards the final probabilities, and a big document
will have more features, hence more opportunities to steer the
probabilities one way or another. So a feature is like an information-bearing unit of text.
When we read a text document which doesn't accord with our expectations, we
grow progressively more annoyed as we read further into the text. This is like
an annoyance interest rate which compounds on the information-bearing units within the text. For dbacl, the number of information-bearing units is reported as the complexity of the text.
This suggests that the cost of reading a misclassified document could have the form (1 + interest)^complexity. Here's an example loss matrix which uses this idea:
loss_matrix {
"" one   [ 0, (1.1)^complexity, (1.1)^complexity ]
"" two   [ (1.1)^complexity, 0, (1.7)^complexity ]
"" three [ (1.5)^complexity, (1.01)^complexity, 0 ]
}
Remember, these aren't monetary interest rates; they are value judgements.
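To get a feel for how fast these costs grow, try a quick calculation (the complexity value 50 below is just an arbitrary example):

% perl -e 'printf "%.2f  %.2f\n", 1.1**50, 1.01**50'
117.39  1.64

On a 50-unit document, an annoyance rate of 1.1 costs roughly seventy times more than a rate of 1.01, so small differences between the rates express strong preferences.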
You can see this loss matrix in action by typing:
% dbacl -c one -c two -c three sample5.txt -vna | bayesol -c example1.risk -v
three
Now if we increase the cost of misclassifying two as three from 1.7 to 2.0, the optimal category becomes:
% dbacl -c one -c two -c three sample5.txt -vna | bayesol -c example2.risk -v
two
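Presumably, example2.risk is identical to example1.risk except for that single entry, i.e. its loss matrix reads:

loss_matrix {
"" one   [ 0, (1.1)^complexity, (1.1)^complexity ]
"" two   [ (1.1)^complexity, 0, (2.0)^complexity ]
"" three [ (1.5)^complexity, (1.01)^complexity, 0 ]
}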
bayesol can also handle infinite costs. Just write "inf" where you need it.
This is particularly useful with regular expressions. If you look at each
row of loss_matrix above, you see an empty string "" before each category.
This indicates that this row is to be used by default in the actual loss matrix.
But sometimes, the losses can depend on seeing a particular string in the document we want to classify. Suppose you normally like to use the loss matrix above, but if the document contains the word "Polly", then the cost of misclassifying it away from two is infinite. Here is an updated loss_matrix:
loss_matrix {
""      one   [ 0, (1.1)^complexity, (1.1)^complexity ]
"Polly" two   [ inf, 0, inf ]
""      two   [ (1.1)^complexity, 0, (2.0)^complexity ]
""      three [ (1.5)^complexity, (1.01)^complexity, 0 ]
}
bayesol looks in its input for the regular expression "Polly". If it is found, the row with the infinite values is used for misclassifications away from two; otherwise, the default row, which starts with "", is used. If you have several rows with regular expressions for a category, bayesol always uses the first one, counting from the top, which matches within the input. You must always have at least a default row for every category.
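For instance, a category can have a whole cascade of pattern rows before its default row. In the following made-up fragment, a document mentioning "Polly" gets the infinite row, a document mentioning only "parrot" gets the second row, and every other document falls through to the default row for two:

loss_matrix {
"Polly"  two   [ inf, 0, inf ]
"parrot" two   [ (3.0)^complexity, 0, (3.0)^complexity ]
""       two   [ (1.1)^complexity, 0, (2.0)^complexity ]
""       one   [ 0, (1.1)^complexity, (1.1)^complexity ]
""       three [ (1.5)^complexity, (1.01)^complexity, 0 ]
}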
The regular expression facility can also be used to perform more complicated document-dependent loss calculations. Suppose you would like to count the number of lines of the input document which start with the character '>', as a proportion of the total number of lines in the document. The following Perl script copies its input to its output and appends the calculated proportion:
#!/usr/bin/perl
# this is file prop.pl
$special = $normal = 0;
while(<STDIN>) {
    # dbacl -a prepends a space to each original input line,
    # so a quoted line arrives here as " >..."
    $special++ if /^ >/;
    $normal++;
    print;
}
$prop = $normal ? $special/$normal : 0;
print "proportion: $prop\n";
If we use this script, we can take the output of dbacl, append the proportion of lines starting with '>', and pass the result as input to bayesol.
For example, the following line is included in the example3.risk
specification:
"^proportion: ([0-9.]+)" one [ 0, (1+$1)^complexity, (1.2)^complexity ]
Through this row, bayesol reads the line containing the calculated proportion, if present, and takes it into account when constructing the loss matrix.
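For instance, if a quarter of the lines in the document start with '>', prop.pl appends the line "proportion: 0.25"; the regular expression then matches, $1 expands to 0.25, and the row for one is evaluated as if it read

"" one [ 0, (1.25)^complexity, (1.2)^complexity ]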
You can try this like so:
% dbacl -T email -c one -c two -c three sample6.txt -nav \
| perl prop.pl | bayesol -c example3.risk -v
Note that in the loss_matrix specification above, $1 refers to the numerical value of the quantity inside the parentheses. Also, it is useful to remember that when using the -a switch, dbacl outputs all the original lines of its input with an extra space in front of them. If another instance of dbacl needs to read this output again (e.g. in a pipeline), then the latter should be invoked with the -A switch.