Advanced operation: Parsing
dbacl(1) sets some default switches which should be acceptable
for email classification, but probably not optimal. If you like to experiment,
then this section should give you enough material to stay occupied, but
reading it is not strictly necessary.
When dbacl(1) inspects an email message, it only looks at certain words/tokens.
In all examples so far, the tokens picked up were purely alphabetic words.
Numbers are not picked up, and neither are special characters such as $, @, % or punctuation.
The success of text classification schemes depends not only on the statistical models used, but also strongly on the type of tokens considered. dbacl(1) allows you to try out different tokenization schemes. What works best depends on your email.
By default, dbacl(1) picks up only purely alphabetic words as tokens (this uses the least amount of memory). To pick up alphanumeric tokens, use the -e switch as follows:
% dbacl -e alnum -T email -l spam $HOME/mail/spam
% dbacl -e alnum -T email -l notspam $HOME/mail/notspam
% cat email.rfc | dbacl -T email -c spam -c notspam -v
notspam
You can also pick up printable words (use -e graph) or purely ASCII (use -e ascii) tokens.
Note that you do not need to specify the -e switch when classifying, but you should make sure
that all the categories were learned with the same -e switch.
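To get a feel for what each character class admits, the following sketch approximates the three classes with POSIX bracket expressions and grep. It does not call dbacl(1) and is not its tokenizer (it ignores details such as email header handling); the tokens helper and the sample line are made up for illustration, and grep -o is assumed to be available.

```shell
#!/bin/sh
# tokens CLASS: print every maximal run of characters in the given
# POSIX class (alpha, alnum, graph), one token per line.
# Hypothetical helper; it only approximates the -e classes.
tokens() {
    LC_ALL=C grep -oE "[[:$1:]]+"
}

line='Win $100 now at shop4u.example'

printf '%s\n' "$line" | tokens alpha   # Win now at shop u example
printf '%s\n' "$line" | tokens alnum   # Win 100 now at shop4u example
printf '%s\n' "$line" | tokens graph   # Win $100 now at shop4u.example
```

Each token is printed on its own line; the comments list them on one line for brevity. Note how the alphabetic class splits shop4u in two, while the printable class keeps $100 and the whole host name together.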
Besides single words, dbacl(1) can also look at consecutive
pairs of words, triples, or quadruples (n-grams). For example, a trigram
model based on alphanumeric tokens can be learned as follows:
% dbacl -e alnum -w 3 -T email -l spam $HOME/mail/spam
One thing to watch out for is that, in general, n-gram models require much more memory to learn. You will likely need to use the -H switch to reserve enough space.
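The memory cost comes from the number of distinct features that must be stored. As a rough sketch, the script below counts the distinct n-grams in a short string for n = 1, 2, 3. The ngrams helper and the sample text are hypothetical, tokens are approximated with a POSIX alphabetic class, and none of this reflects how dbacl(1) stores features internally.

```shell
#!/bin/sh
# ngrams N: read one token per line and print every run of N
# consecutive tokens, joined by single spaces. Hypothetical helper.
ngrams() {
    awk -v n="$1" '
        { tok[NR] = $0 }
        NR >= n {
            s = tok[NR - n + 1]
            for (i = NR - n + 2; i <= NR; i++) s = s " " tok[i]
            print s
        }'
}

text='buy now and save now or buy later and save more'

for n in 1 2 3; do
    count=$(printf '%s\n' "$text" \
        | LC_ALL=C grep -oE '[[:alpha:]]+' \
        | ngrams "$n" | sort -u | wc -l | tr -d ' ')
    echo "n=$n distinct=$count"
done
```

Running the same pipeline over one of your own mail folders gives a crude idea of how quickly the feature count grows with n, which can help when choosing a value for -H.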
If you prefer, you can specify the tokens to look at through a regular expression. The following
example picks up single words consisting of one or more alphabetic characters followed by zero or more digits. It can be considered an intermediate tokenization scheme between -e alpha and -e alnum:
% dbacl -T email \
-g '(^|[^[:alpha:]])([[:alpha:]]+[[:digit:]]*)||2' \
-l spam $HOME/mail/spam
% dbacl -T email \
-g '(^|[^[:alpha:]])([[:alpha:]]+[[:digit:]]*)||2' \
-l notspam $HOME/mail/notspam
% cat email.rfc | dbacl -T email -c spam -c notspam -v
notspam
Note that there is no need to repeat the -g switch when classifying.
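Before learning with a regular expression, it can help to try it on a sample line. The sketch below runs the same expression through grep and then strips the leading non-alphabetic character, on the assumption that the ||2 suffix tells dbacl(1) to keep the second parenthesized subexpression as the token (check the man page to confirm). The tokens_regex helper and the sample line are made up, and grep may not walk the input exactly as dbacl(1) does.

```shell
#!/bin/sh
# tokens_regex: approximate the -g expression from the example above.
# grep -o prints the whole match (group 1 plus group 2); sed then
# drops the single leading non-alphabetic character matched by
# group 1, leaving only what group 2 matched.
tokens_regex() {
    LC_ALL=C grep -oE '(^|[^[:alpha:]])([[:alpha:]]+[[:digit:]]*)' \
        | LC_ALL=C sed 's/^[^[:alpha:]]//'
}

printf '%s\n' 'Win $100 now, call ext42 today' | tokens_regex
# Win
# now
# call
# ext42
# today
```

On this line a purely alphabetic scheme would yield ext and an alphanumeric scheme would also yield 100, so the expression sits between the two.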