dbacl project homepage


Summary
Forums
CVS
Download

Laird Breyer

Download

previous next

Advanced operation: Parsing

dbacl(1) sets some default switches which should be acceptable for email classification, but probably not optimal. If you like to experiment, then this section should give you enough material to stay occupied, but reading it is not strictly necessary.

When dbacl(1) inspects an email message, it only looks at certain words/tokens. In all examples so far, the tokens picked up were purely alphabetic words. No numbers are picked up, or special characters such as $, @, % and punctuation.

The success of text classification schemes depends not only on the statistical models used, but also strongly on the type of tokens considered. dbacl(1) allows you to try out different tokenization schemes. What works best depends on your email.

By default, dbacl(1) picks up only purely alphabetic words as tokens (this uses the least amount of memory). To pick up alphanumeric tokens, use the -e switch as follows:

% dbacl -e alnum -T email -l spam $HOME/mail/notspam
% dbacl -e alnum -T email -l notspam $HOME/mail/notspam
% cat email.rfc | dbacl -T email -c spam -c notspam -v
notspam

You can also pick up printable words (use -e graph) or purely ASCII (use -e ascii) tokens. Note that you do not need to indicate the -e switch when classifying, but you should make sure that all the categories use the same -e switch when learning.

dbacl(1) can also look at single words, consecutive pairs of words, triples, quadruples. For example, a trigram model based on alphanumeric tokens can be learned as follows:

% dbacl -e alnum -w 3 -T email -l spam $HOME/mail/notspam

One thing to watch out for is that n-gram models require much more memory to learn in general. You will likely need to use the -H switch to reserve enough space.

If you prefer, you can specify the tokens to look at through a regular expression. The following example picks up single words which contain purely alphabetic characters followed by zero or more numeric characters. It can be considered an intermediate tokenization scheme between -e alpha and -e alnum:

% dbacl -T email \
  -g '(^|[^[:alpha:]])([[:alpha:]]+[[:digit:]]*)||2' \
  -l spam $HOME/mail/notspam
% dbacl -T email \
  -g '(^|[^[:alpha:]])([[:alpha:]]+[[:digit:]]*)||2' \
  -l notspam $HOME/mail/notspam
% cat email.rfc | dbacl -T email -c spam -c notspam -v
notspam

Note that there is no need to repeat the -g switch when classifying.

previous next