NAME
dbacl - a digramic Bayesian classifier for text recognition

SYNOPSIS

dbacl [-01dvnirmwMNDXW] [-T type] -l category [-h size]
[-H gsize] [-x decim] [-q quality] [-w max_order] [-e deftok]
[-o online] [-L measure] [-g regex]... [FILE]...

dbacl [-vnimNRX] [-h size] [-T type] -c category
[-c category]... [-f keep]... [FILE]...

OVERVIEW

dbacl is a Bayesian text and email classifier.
When using the -l switch, it learns a body of text
and produces a file named category which summarizes
the text. When using the -c switch, it compares an
input text stream with any number of category files,
and outputs the name of the closest match, or optionally
various numerical scores explained below.
Whereas this manual page is intended as a reference,
there are several tutorials and documents you can read to
get specialized information. Specific documentation about
the design of dbacl and the statistical models that
it uses can be found in dbacl.ps. For a basic overview of
text classification using dbacl, see tutorial.html. A
companion tutorial geared towards email filtering is
email.html. If you have trouble getting dbacl to classify
reliably, read is_it_working.html. The USAGE section of this
manual page also has some examples.
@PKGDATADIR@/doc/dbacl.ps
@PKGDATADIR@/doc/tutorial.html
@PKGDATADIR@/doc/email.html
@PKGDATADIR@/doc/is_it_working.html
dbacl uses a maximum entropy (minimum divergence)
language model constructed with respect to a digramic
reference measure (unknown tokens are predicted from
digrams, i.e. pairs of letters). Practically, this means
that a category is constructed from tokens in the
training set, while previously unseen tokens can be
predicted automatically from their letters. A token here is
either a word (fragment) or a combination of words
(fragments), selected according to various switches.
Learning roughly works by tweaking token probabilities until
the training data is least surprising.

DESCRIPTION

When using the -l command form, dbacl
learns a category when given one or more FILE names, which
should contain readable ASCII text. If no
FILE is given, dbacl learns from STDIN. If FILE is a
directory, it is opened and all its files are read, but not
its subdirectories. The result is saved in the binary file
named category, and completely replaces any previous
contents. As a convenience, if the environment variable
DBACL_PATH contains a directory, then that is prepended to
the file path, unless category starts with a
’/’ or a ’.’.
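For example (file and category names here are illustrative):

% export DBACL_PATH=$HOME/.dbacl
% dbacl -l spam corpus.txt     # category written to $HOME/.dbacl/spam
% dbacl -l ./spam corpus.txt   # leading '.' bypasses DBACL_PATH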
The input text for learning is assumed to be unstructured
plain text by default. This is not suitable for learning
email, because email contains various transport encodings
and formatting instructions which can reduce classification
effectiveness. You must use the -T switch in that
case so that dbacl knows it should perform decoding
and filtering of MIME and HTML as appropriate. Appropriate
switch values are "-T email" for RFC2822 email input,
"-T html" for HTML input, "-T xml" for generic XML style
input, and "-T text" for the default plain text format. There are other
values of the -T switch that also allow fine tuning
of the decoding capabilities.
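For example, assuming a hypothetical spam mailbox in mbox
format and a saved web page:

% dbacl -T email -l spam spam.mbox
% dbacl -T html -l news webpage.html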
When using the -c command form, dbacl
attempts to classify the text found in FILE, or STDIN if no
FILE is given. Each possible category must be given
separately, and should be the file name of a previously
learned text corpus. As a convenience, if the variable
DBACL_PATH contains a directory, it is prepended to each
file path which doesn’t start with a ’/’
or a ’.’. The visible output of the
classification depends on the combination of extra switches
used. If no switch is used, then no output is shown on
STDOUT. However, dbacl always produces an exit code
which can be tested.
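For example, the exit code can be tested directly from the
shell (category names are illustrative):

% dbacl -c work -c spam < message.txt
% echo $?    # 1 if work matched best, 2 if spam, 0 on error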
To see any output from a classification, you must use at
least one of the -v, -U, -n, -D, -d switches.
These can sometimes be used in combination to produce a
natural variation of their individual outputs. Where
applicable, dbacl also prints warnings on STDERR.
The -v switch outputs the name of the best
category among all the choices given.
The -U switch outputs the name of the best
category followed by a confidence percentage. Normally, this
is the switch that you want to use. A percentage of 100%
means that dbacl is sure of its choice, while a
percentage of 0% means that some other category is equally
likely.
The -n switch prints each category name followed
by the negative logarithm of its probability. The smallest
number gives the best category. A more convenient form is to
use both -n and -v which prints each category
name followed by the cross entropy and the number of tokens
analyzed. The cross entropy measures (in bits) the average
compression rate which is achievable, under the given
category model, per token of input text. If you use all
three of -n,-v,-X then an extra value
is output for each category, representing a kind of p-value
for each category score. This indicates how typical the
score is compared to the training documents, but only works
if the -X switch was used during learning, and only
for some types of models. These p-values are uniformly
distributed and independent (if the categories are
independent), so they can be combined using Fisher's
chi-squared test to obtain composite p-values for groupings
of categories.
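(For reference, Fisher's method combines n independent
p-values p_1,...,p_n through the statistic
-2 * (ln p_1 + ... + ln p_n), which follows a chi-squared
distribution with 2n degrees of freedom under the null
hypothesis.)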
The -v and -X switches together print each
category name followed by a detailed decomposition of the
category score, factored as
(divergence rate + shannon entropy rate) * token count @ p-value.
The -v and -U switches print each category
name followed by a decomposition of the category score as
(divergence rate + shannon entropy rate # score variance) * token count.
The -D switch prints out the input text as
modified internally by dbacl prior to tokenization.
For example, if a MIME encoded email document is classified,
then this prints the decoded text that will be actually
tokenized and classified. This switch is mainly useful for
debugging.
The -d switch dumps tokens and scores while they
are being read. It is useful for debugging, or if you want
to create graphical representations of the classification. A
detailed explanation of the output is beyond the scope of
this manual page, but is straightforward if you’ve
read dbacl.ps. Possible variations include -d
together with -n or -N.
Classification can be done with one or several categories
in principle. When two or more categories are used, the
Bayesian posterior probability is used, given the input
text, with a uniform prior distribution on categories. For
other choices of prior, see the companion utility
bayesol(1). When a single category is used,
classification can be done by comparing the score with a
threshold. In practice, however, much better results are
obtained with several categories.
Learning and classifying cannot be mixed in the same
command invocation. However, there are no locking issues, and
separate dbacl processes can operate simultaneously
with predictable results, because file operations are designed
to be atomic.
Finally, note that dbacl does not manage your
document corpora or your computed categories, and in
particular it does not allow you to extend an existing
category file with new documents. This is unlike various
current spam filters, which can learn new emails
incrementally. This limitation of dbacl is partially
due to the nonlinear procedure used in the learning
algorithm, and partially a desire for increased
flexibility.
You can simulate the effect of incremental learning by
saving your training documents into archives and adding to
these archives over time, relearning from scratch
periodically. Learning is actually faster if these archives
are compressed and decompressed on the fly when needed. By
keeping control of your archives, you never lose the
information in your categories, and you can easily
experiment with different switches, tokenizations or sets
of training documents if you like.
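For example, a sketch of this workflow with a compressed
mbox archive (file names are illustrative):

% gzip -c new_spam.mbox >> spam.mbox.gz        # accumulate training data
% zcat spam.mbox.gz | dbacl -T email -l spam   # relearn from scratch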
SECONDARY SWITCHES

By default, dbacl classifies the input text as a
whole. However, when using the -f option,
dbacl can be used to filter each input line
separately, printing only those lines which match one or
more models identified by keep (use the category name
or number to refer to a category). This is useful if you
want to filter out some lines, but note that if the lines
are short, then the error rate can be high.
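For example, assuming two previously learned categories named
english and noise, the following keeps only the lines which
look like English:

% dbacl -c english -c noise -f english messy.txt > clean.txt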
The -e, -w, -g, -j switches are
used for selecting an appropriate tokenization scheme. A
token is a word or word fragment or combination of words or
fragments. The shape of tokens is important because it forms
the basis of the language models used by dbacl. The
-e switch selects a predefined tokenization scheme,
which is speedy but limited. The -w switch specifies
composite tokens derived from the -e switch. For
example, "-e alnum -w 2" means that tokens should
be alphanumeric word fragments combined into overlapping
pairs (bigrams). When the -j switch is used, all
tokens are converted to lowercase, which reduces the number
of possible tokens and therefore memory consumption.
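For example, to learn a category from overlapping pairs of
alphanumeric word fragments:

% dbacl -e alnum -w 2 -l twain Mark_Twain.txt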
If the -g switch is used, you can completely
specify what the tokens should look like using a regular
expression. Several -g switches can be used to
construct complex tokenization schemes, and parentheses
within each expression can be used to select fragments and
combine them into n-grams. The cost of such flexibility is
reduced classification and learning speed. When
experimenting with tokenization schemes, try using the
-d or -D switches while learning or
classifying, as they will print the tokens explicitly so you
can see which text fragments are picked up or missed. For
regular expression syntax, see regex(7).
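For example, a quick way to check a tokenization scheme on a
small sample before committing to it (the regex and category
name are illustrative):

% echo "The 3 wise men" | dbacl -D -l tmpcat -g '([[:alpha:]]+)'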
The -h and -H switches regulate how much
memory dbacl may use for learning. Text
classification can use a lot of memory, and by default
dbacl limits itself even at the expense of learning
accuracy. In many cases if a limit is reached, a warning
message will be printed on STDERR with some advice.
When relearning the same category several times, a
significant speedup can be obtained by using the -1
switch, as this allows the previously learned probabilities
to be read from the category and reused.
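For example, when relearning a category nightly from a slowly
growing archive:

% dbacl -1 -T email -l spam spam.mbox   # reuses the previous weights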
Note that classification accuracy depends foremost on the
amount and quality of the training samples, and only
secondarily on the amount of switch tweaking.

EXIT STATUS

When using the -l command form, dbacl
returns zero on success. When using the -c form,
dbacl returns a positive integer (1,2,3...)
corresponding to the category with the highest
posterior probability. In case of a tie, the first most
probable category is chosen. If an error occurs,
dbacl returns zero.

OPTIONS

-0
When learning, prevents weight preloading. Normally,
dbacl checks if the category file already exists, and
if so, tries to use the existing weights as a starting
point. This can dramatically speed up learning. If the
-0 (zero) switch is set, then dbacl behaves as
if no category file already exists. This is mainly useful
for testing. This switch is now enabled by default, to
protect against weight drift which can reduce accuracy over
many learning iterations. Use -1 to force
preloading.

-1
Force weight preloading if the category file already
exists. See discussion of the -0 switch.

-a
Append scores. Every input line is written to STDOUT and
the dbacl scores are appended. This is useful for
postprocessing with bayesol(1). For ease of
processing, every original input line is indented by a
single space (to distinguish them from the appended scores),
and the line with the scores (if -n is used) is
prefixed with the string "scores ". If a second
copy of dbacl needs to read this output later, it
should be invoked with the -A switch.
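For example, to keep each input line together with its scores
for later postprocessing (category names are illustrative):

% dbacl -a -n -c work -c spam messages.txt > scored.txt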
-d
Dump the model parameters to STDOUT. In conjunction with
the -l option, this produces a human-readable summary
of the maximum entropy model. In conjunction with the
-c option, displays the contribution of each token to
the final score. Suppresses all other normal output.

-e
Select character class for default (not regex-based)
tokenization. By default, tokens are alphabetic strings
only. This corresponds to the case when deftok is
"alpha". Possible values for deftok are
"alpha", "alnum", "graph",
"cef" and "adp". The last two are custom
tokenizers intended for email messages. See also
isalpha(3).

-f
Filter each line of input separately, passing to STDOUT
only lines which match the category identified as
keep. This option should be used repeatedly for each
category which must be kept. keep can be
either the category file name, or a positive integer
representing the required category in the same order
it appears on the command line.

Output lines are flushed as soon as they are written. If
the input file is a pipe or character device, then an
attempt is made to use line buffering mode, otherwise the
more efficient block buffering is used.

-g
Learn only features described by the extended regular
expression regex. This overrides the default feature
selection method (see -w option) and learns, for each
line of input, only tokens constructed from the
concatenation of strings which match the tagged
subexpressions within the supplied regex. All
substrings which match regex within a suffix of each
input line are treated as features, even if they overlap on
the input line.
|
|
As an optional convenience, regex can include the
suffix ||xyz which indicates which parenthesized
subexpressions should be tagged. In this case, xyz
should consist exclusively of digits 1 to 9, numbering
exactly those subexpressions which should be tagged.
Alternatively, if no parentheses exist within regex,
then it is assumed that the whole expression must be
captured.
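For example, the following two commands pick out essentially
the same single-word tokens, first with explicit tagging via
the || suffix, then with the whole expression captured
implicitly:

% dbacl -l cat1 -g '(^|[^[:alpha:]])([[:alpha:]]+)||2' corpus.txt
% dbacl -l cat2 -g '[[:alpha:]]+' corpus.txt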
-h
Set the size of the hash table to 2^size
elements. When using the -l option, this refers to
the total number of features allowed in the maximum entropy
model being learned. When using the -c option
together with the -M switch and multinomial type
categories, this refers to the maximum number of features
taken into account during classification. Without the
-M switch, this option has no effect.

-i
Fully internationalized mode. Forces the use of wide
characters internally, which is necessary in some locales.
This incurs a noticeable performance penalty.

-j
Make features case sensitive. Normally, all features are
converted to lower case during processing, which reduces
storage requirements and improves statistical estimates for
small datasets. With this option, the original
capitalization is used for each feature. This can improve
classification accuracy.

-m
Aggressively maps categories into memory and locks them
into RAM to prevent swapping, if possible. This is useful
when speed is paramount and memory is plentiful, for example
when testing the classifier on large datasets.

Locking may require relaxing user limits with
ulimit(1). Ask your system administrator. Beware when
using the -m switch together with the -o
switch, as only one dbacl process must learn or classify at
a time to prevent file corruption. If no learning takes
place, then the -m switch for classifying is always
safe to use. See also the discussion for the -o
switch.

-n
Print scores for each category. Each score is the
product of two numbers, the cross entropy and the complexity
of the input text under each model. Multiplied together,
they represent the log probability that the input resembles
the model. To see these numbers separately, use also the
-v option. In conjunction with the -f option,
stops filtering but prints each input line prepended with a
list of scores for that line.

-q
Select quality of learning, where quality
can be 1,2,3,4. Higher values take longer to learn, and
should be slightly more accurate. The default quality
is 1 if the category file doesn’t exist or weights
cannot be preloaded, and 2 otherwise.

-o
When learning, reads/writes partial token counts so they
can be reused. Normally, category files are learned from
exactly the input data given, and don’t contain
extraneous information. When this option is in effect, some
extra information is saved in the file online, after
all input was read. This information can be reread the next
time that learning occurs, to continue where the previous
dataset left off. If online doesn’t exist, it
is created. If online exists, it is read before
learning, and updated afterwards. The file is approximately
3 times bigger (at least) than the learned
category.

In dbacl, file updates are atomic, but if using
the -o switch, two or more processes should not learn
simultaneously, as only one process will write a lasting
category and memory dump. The -m switch can also
speed up online learning, but beware of possible corruption.
Only one process should read or write a file. This option is
intended primarily for controlled test runs.
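For example, a sketch of a controlled test run using an online
memory dump (file names are illustrative):

% dbacl -o spam.onl -T email -l spam batch1.mbox
% dbacl -o spam.onl -T email -l spam batch2.mbox   # continues from batch1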
-r
Learn the digramic reference model only. Skips the
learning of extra features in the text corpus.

-v
Verbose mode. When learning, print out details of the
computation, when classifying, print out the name of the
most probable category. In conjunction with the
-n option, prints the scores as an explicit product
of the cross entropy and the complexity.

-w
Select default features to be n-grams up to
max_order. This is incompatible with the -g
option, which always takes precedence. If no -w or
-g options are given, dbacl assumes -w
1. Note that n-grams for n greater than 1 do not straddle
line breaks by default. The -S switch enables line
straddling.

-x
Set decimation probability to 1 - 2^(-decim). To
reduce memory requirements when learning, some inputs are
randomly skipped, and only a few are added to the model.
Exact behaviour depends on the applicable -T option
(default is -T "text"). When the type is
not "email" (eg "text"), then individual
input features are added with probability 2^(-decim).
When the type is "email", then full input messages
are added with probability 2^(-decim). Within each
such message, all features are used.
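For example, to learn from a very large spam archive while
keeping each message with probability 2^-1, i.e. about half
of them:

% dbacl -x 1 -T email -l spam huge_spam.mbox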
-A
Expect indented input and scores. With this switch,
dbacl expects input lines to be indented by a single
space character (which is then skipped). Lines starting with
any other character are ignored. This is the counterpart to
the -a switch above. When used together with the
-a switch, dbacl outputs the skipped lines as
they are, and reinserts the space at the front of each
processed input line.

-D
Print debug output. Not for normal use, but can be very
useful for displaying the list of features picked up while
learning.

-H
Allow hash table to grow up to a maximum of
2^gsize elements during learning. Initial size is
given by the -h option.

-L
Select the digramic reference measure for character
transitions. The measure can be one of
"uniform", "dirichlet" or
"maxent". Default is "uniform".
|
|
-M
|
|
Force multinomial calculations. When learning, forces
the model features to be treated multinomially. When
classifying, corrects entropy scores to reflect multinomial
probabilities (only applicable to multinomial type models,
if present). Scores will always be lower, because the
ordering of features is lost.

-N
Print posterior probabilities for each category.
This assumes the supplied categories form an exhaustive list
of possibilities. In conjunction with the -f option,
stops filtering but prints each input line prepended with a
summary of the posterior distribution for that line.

-R
Include an extra category for purely random text. The
category is called "random". Only makes sense when
using the -c option.

-S
Enable line straddling. This is useful together with the
-w option to allow n-grams for n > 1 to ignore
line breaks, so a complex token can continue past the end of
the line. This is not recommended for email.

-T
Specify nonstandard text format. By default,
dbacl assumes that the input text is a purely
ASCII text file. This corresponds to the case
when type is "text".

There are several types and subtypes which can be used to
clean the input text of extraneous tokens before actual
learning or classifying takes place. Each (sub)type you wish
to use must be indicated with a separate -T option on
the command line, and automatically implies the
corresponding type.
The "text" type is for unstructured plain text.
No cleanup is performed. This is the default if no types are
given on the command line.
The "email" type is for mbox format input files
or single RFC822 emails. Headers are recognized and most are
skipped. To include extra RFC822 standard headers (except
for trace headers), use the "email:headers"
subtype. To include trace headers, use the
"email:theaders" subtype. To name all headers in
the email, use the "email:xheaders" subtype. To
skip all headers, except the subject, use
"email:noheaders". To scan binary attachments for
strings, use the "email:atts" subtype.
When the "email" type is in effect, HTML markup
is automatically removed from text attachments except
text/plain attachments. To also remove HTML markup from
plain text attachments, use "email:noplain". To
prevent HTML markup removal in all text attachments, use
"email:plain".
The "html" type is for removing HTML markup
(between <html> and </html> tags) and
surrounding text. Note that if the "email" type is
enabled, then "html" is automatically enabled for
compatible message attachments only.
The "xml" type is like "html", but
doesn’t honour <html> and </html>, and
doesn’t interpret tags (so this should be more
properly called "angle markup" removal, and has
nothing to do with actual XML semantics).
When "html" is enabled, most markup attributes
are lost (for values of ’most’ close to
’all’). The "html:links" subtype
forces link urls to be parsed and learned, which would
otherwise be ignored. The "html:alt" subtype
forces parsing of alternative text in ALT attributes and
various other tags. The "html:scripts" subtype
forces parsing of scripts, "html:styles" forces
parsing of styles, "html:forms" forces parsing of
form values, while "html:comments" forces parsing
of HTML comments.
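For example, to learn spam from a mailbox while also reading
standard headers and link urls (each subtype automatically
implies its base type):

% dbacl -T email:headers -T html:links -l spam spam.mbox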
-U
Print (U)nambiguity. When used in conjunction with the
-v switch, prints scores followed by their empirical
standard deviations. When used alone, prints the best
category, followed by an estimated probability that this
category choice is unambiguous. More precisely, the
probability measures lack of overlap of CLT confidence
intervals for each category score (If there is overlap, then
there is ambiguity).

This estimated probability can be used as an
"unsure" flag, e.g. if the estimated probability
is lower than 50%. Formally, a score of 0% means another
category is equally likely to apply to the input, and a
score of 100% means no other category is likely to apply to
the input. Note that this type of confidence is unrelated to
the -X switch. Also, the probability estimate is
usually low if the document is short, or if the message
contains many tokens that have never been seen before (only
applies to uniform digramic measure).
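For example (treating anything under 50% as unsure is an
arbitrary but reasonable cutoff):

% dbacl -U -c work -c spam message.txt   # prints best category and confidence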
-V
Print the program version number and exit.

-W
Like -w, but prevents features from straddling newlines.
See the description of -w.

-X
Print the confidence in the score calculated for each
category, when used together with the -n or
-N switch. Prepares the model for confidence scores,
when used with the -l switch. The confidence is an
estimate of the typicality of the score, assuming the null
hypothesis that the given category is correct. When used
with the -v switch alone, factorizes the score as the
empirical divergence plus the shannon entropy, multiplied by
complexity, in that order. The -X switch is not
supported in all possible models, and displays a percentage
of "0.0" if it can’t be calculated. Note
that for unknown documents, it is quite common to have
confidences close to zero.

USAGE

To create two category files in the current directory
from two ASCII text files named
Mark_Twain.txt and William_Shakespeare.txt respectively,
type:
% dbacl -l twain Mark_Twain.txt
% dbacl -l shake William_Shakespeare.txt
Now you can classify input text, for example:
% echo "howdy" | dbacl -v -c twain -c shake
twain
% echo "to be or not to be" | dbacl -v -c twain -c
shake
shake
Note that the -v option is necessary, otherwise
dbacl does not print anything. The return value is 1
in the first case, 2 in the second. If you want to print a
simple confidence value together with the best category,
replace -v with -U.
Suppose a file document.txt contains English text lines
interspersed with noise lines. To filter out the noise lines
from the English lines, assuming you have an existing
category shake say, type:
% dbacl -c shake -f shake -R document.txt >
document.txt_eng
% dbacl -c shake -f random -R document.txt >
document.txt_rnd
Note that the quality of the results will vary depending
on how well the categories shake and random represent each
input line. It is sometimes useful to see the posterior
probabilities for each line without filtering:
% dbacl -c shake -f shake -RN document.txt >
document.txt_probs
You can now postprocess the posterior probabilities for
each line of text with another script, to replicate an
arbitrary Bayesian decision rule of your choice.
In the special case of exactly two categories, the
optimal Bayesian decision procedure can be implemented for
documents as follows: let p1 be the prior probability
that the input text is classified as category1.
Consequently, the prior probability of classifying as
category2 is 1 - p1. Let u12 be the
cost of misclassifying a category1 input text as
belonging to category2 and vice versa for u21.
We assume there is no cost for classifying correctly. Then
the following command implements the optimal Bayesian
decision:
% dbacl -n -c category1 -c category2 | awk
  '{ if($2 * p1 * u12 > $4 * (1 - p1) * u21) { print $1; }
     else { print $3; } }'
|
dbacl can also be used in conjunction with
procmail(1) to implement a simple Bayesian email
classification system. Assume that incoming mail should be
automatically delivered to one of three mail folders located
in $MAILDIR and named work, personal, and
spam. Initially, these must be created and filled
with appropriate sample emails. A crontab(1) file can
be used to learn the three categories once a day, e.g.
CATS=$HOME/.dbacl
5 0 * * * dbacl -T email -l $CATS/work $MAILDIR/work
10 0 * * * dbacl -T email -l $CATS/personal $MAILDIR/personal
15 0 * * * dbacl -T email -l $CATS/spam $MAILDIR/spam
To automatically deliver each incoming email into the
appropriate folder, the following procmailrc(5)
recipe fragment could be used:
CATS=$HOME/.dbacl
# run the spam classifier
:0 c
YAY=| dbacl -vT email -c $CATS/work -c $CATS/personal -c $CATS/spam
# send to the appropriate mailbox
:0:
* ? test -n "$YAY"
$MAILDIR/$YAY
:0:
$DEFAULT
Sometimes, dbacl will send the email to the wrong
mailbox. In that case, the misclassified message should be
removed from its wrong destination and placed in the correct
mailbox. The error will be corrected the next time your
messages are learned. If it is left in the wrong category,
dbacl will learn the wrong corpus statistics.
The default text features (tokens) read by dbacl
are purely alphabetic strings, which minimizes memory
requirements but can be unrealistic in some cases. To
construct models based on alphanumeric tokens, use the
-e switch. The example below also uses the optional
-D switch, which prints a list of actual tokens found
in the document:
% dbacl -e alnum -D -l twain Mark_Twain.txt | less
It is also possible to override the default feature
selection method used to learn the category model by means
of regular expressions. For example, the following
duplicates the default feature selection method in the C
locale, while being much slower:
% dbacl -l twain -g '^([[:alpha:]]+)' -g
  '[^[:alpha:]]([[:alpha:]]+)' Mark_Twain.txt
The category twain which is obtained depends only on
single alphabetic words in the text file Mark_Twain.txt (and
computed digram statistics for prediction). For a second
example, the following command builds a smoothed Markovian
(word bigram) model which depends on pairs of consecutive
words within each line (but pairs cannot straddle a line
break):
% dbacl -l twain2 -g '(^|[^[:alpha:]])([[:alpha:]]+)||2' -g
  '(^|[^[:alpha:]])([[:alpha:]]+)[^[:alpha:]]+([[:alpha:]]+)||23' Mark_Twain.txt
More generally, line-based n-gram models of all orders (up
to 7) can be built in a similar way. To construct paragraph
based models, you should reformat the input corpora with
awk(1) or sed(1) to obtain one paragraph per
line. Line size is limited by available memory, but note
that regex performance will degrade quickly for long
lines.
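For example, a hypothetical awk one-liner which joins each
paragraph of corpus.txt onto a single line:

% awk 'BEGIN { RS=""; ORS="\n" } { gsub(/\n/," "); print }' corpus.txt > corpus_1pp.txt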
PERFORMANCE
The underlying assumption of statistical learning is that
a relatively small number of training documents can
represent a much larger set of input documents, so that in
the long run, learning can grind to a halt without serious
impact on classification accuracy. While not strictly true
in reality, this assumption is surprisingly accurate for
problems such as email filtering. In practice, this means
that a well chosen corpus on the order of ten thousand
documents is sufficient for highly accurate results for
years, and continual learning past that critical mass brings
diminishing returns. Of course, when real world input
document patterns change dramatically, the predictive power
of the models can be lost.
dbacl is heavily optimized for the case of
frequent classifications but infrequent batch learning. This
is the long run optimum described above. Under ideal
conditions, dbacl can classify a hundred emails per
second on low end hardware (500MHz Pentium III). Learning
speed per token is not much slower, but learning large
document collections takes much longer in practice, for various reasons.
When using the -m switch, data structures are
aggressively mapped into memory if possible, reducing
overheads for both I/O and memory allocations.
dbacl throws away its input as soon as possible,
and has no limits on the input document size. Both
classification and learning speed are directly proportional
to the number of tokens in the input, but learning also
needs a nonlinear optimization step which takes time
proportional to the number of unique tokens discovered. At
time of writing, dbacl is one of the fastest open
source mail filters given its optimal usage scenario, but
uses more memory for learning than other filters.

MULTIPLE PROCESSES AND DATA CORRUPTION

When saving category files, dbacl first writes out
a temporary file in the same location, and renames it
afterwards. If a problem or crash occurs during learning,
the old category file is therefore left untouched. This
ensures that categories can never be corrupted, no matter
how many processes try to simultaneously learn or classify,
and means that valid categories are available for
classification at any time.
When using the -m switch, file contents are memory
mapped for speedy reading and writing. This, together with
the -o switch, is intended mainly for testing
purposes, when tens of thousands of messages must be learned
and scored in a laboratory to measure dbacl’s
accuracy. Because no file locking is attempted for
performance reasons, corruptions are possible, unless you
make sure that only one dbacl process reads or writes
any file at any given time. This is the only case (-m and -o
together) when corruption is possible.

MEMORY USE

When classifying a document, dbacl loads all
indicated categories into RAM, so the total memory needed is
approximately the sum of the category file sizes plus a
fixed overhead. The input document is consumed while being
read, so its size doesn’t matter, but very long lines
can take up space. When using the -m switch, the
categories are read using mmap(2) as available.
When learning, dbacl keeps a large structure in
memory which contains many objects which won’t be
saved into the output category. The size of this structure
is proportional to the number of unique tokens read, but not
the size of the input documents, since they are discarded
while being read. As a rough guide, this structure is 4x-5x
the size of the final category file that is produced.
To prevent unchecked memory growth, dbacl
allocates by default a fixed smallish amount of memory for
tokens. When this space is used up, further tokens are
discarded which has the effect of skewing the learned
category making it less usable as more tokens are dropped. A
warning is printed on STDERR in such a case.
The -h switch lets you fix the initial size of the
token space in powers of 2, i.e. "-h 17" means 2^17
= 131072 possible tokens. If you type "dbacl -V",
you can see the number of bytes needed for each token when
either learning or classifying. Multiply this number by the
maximum number of possible tokens to estimate the memory
needed for learning. The -H switch lets dbacl
grow its tables automatically if and when needed, up to a
maximum specified. So if you type "-H 21", then
the initial size will be doubled repeatedly if necessary, up
to approximately two million unique tokens.
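For example, to start with 2^17 token slots and let the table
grow up to 2^21 slots if needed:

% dbacl -V                                  # shows the bytes needed per token
% dbacl -h 17 -H 21 -l big big_corpus.txt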
When learning with the -X switch, a handful of
input documents are also kept in RAM throughout.

ENVIRONMENT

DBACL_PATH
When this variable is set, its value is prepended to
every category filename which doesn't start with a
'/' or a '.'.

SIGNALS

INT
If this signal is caught, dbacl simply exits
without doing any cleanup or other operations. This signal
can often be sent by pressing Ctrl-C on the keyboard. See
stty(1).
QUIT
If one of these signals is caught, dbacl stops
reading input and continues its operation as if no more
input was available. This is a way of quitting gracefully,
but note that in learning mode, a category file will be
written based on the incomplete input. The QUIT signal can
often be sent by pressing Ctrl-\ on the keyboard. See
stty(1).
USR1
If this signal is caught, dbacl reloads the
current categories at the earliest feasible opportunity.
This is not normally useful at all, but might be in special
cases, such as if the -f switch is invoked together
with input from a long running pipe.

NOTES

dbacl generated category files are in binary
format, and may or may not be portable to systems using a
different byte order architecture (this depends on how
dbacl was compiled). The -V switch prints out
whether categories are portable, or else you can just
experiment.
dbacl does not recognize functionally equivalent
regular expressions, and in this case duplicate features
will be counted several times.
With every learned category, the command line options
that were used are saved. When classifying, make sure that
every relevant category was learned with the same set of
options (regexes are allowed to differ), otherwise behaviour
is undefined. There is no need to repeat all the switches
when classifying.
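For example, if the categories are learned with a custom
tokenizer, the classification command need not repeat those
switches:

% dbacl -e alnum -w 2 -T email -l work work.mbox
% dbacl -e alnum -w 2 -T email -l spam spam.mbox
% dbacl -U -c work -c spam message.txt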
If you get many digitization warnings, then you are
trying to learn too much data at once, or your model is too
complex. dbacl is compiled to save memory by
digitizing final weights, but you can disable digitization
by editing dbacl.h and recompiling.
dbacl offers several built-in tokenizers (see
-e switch) with more to come in future versions, as
the author invents them. While the default tokenizer may
evolve, no tokenizer should ever be removed, so that you can
always simulate previous dbacl behaviour subject to
bug fixes and architectural changes.
The confidence estimates obtained through the -X
switch are underestimates, i.e. they are more conservative
than they should be.

BUGS

"Ya know, some day scientists are gonna invent
something that will outsmart a rabbit." (Robot Rabbit,
1953)

SOURCE

The source code for the latest version of this program is
available at the following locations:
http://www.lbreyer.com/gpl.html
http://dbacl.sourceforge.net

AUTHOR

Laird A. Breyer <laird@lbreyer.com>

SEE ALSO

awk(1), bayesol(1), crontab(1), less(1), mailcross(1),
mailfoot(1), mailinspect(1), mailtoe(1), procmailex(5),
regex(7), sed(1), stty(1)