When experimenting with complicated models, dbacl will quickly fill up its hash
tables. dbacl is designed to use a predictable amount of memory (to prevent nasty surprises on some systems). The default hash table size in version 1.1 is 15, which is enough for about 32,000 (2^15) unique features and produces a 512K category file on my system. You can use the -h switch to select the hash table size, given as a power-of-two exponent. Beware that learning takes much more memory than classifying: use the -V switch to find out the cost per feature. On my system, each feature costs 6 bytes for classifying, but 17 bytes for learning.
For testing, I use the collected works of Mark Twain, which is a 19MB plain text file. Timings are on a 500MHz Pentium III.
|dbacl -l twain1 Twain-Collected_Works.txt -w 1 -h 16
|dbacl -l twain2 Twain-Collected_Works.txt -w 2 -h 20
|dbacl -l twain3 Twain-Collected_Works.txt -w 3 -h 22
Including bigrams and trigrams has a noticeable memory and performance cost
during learning, which is why the commands above use progressively larger -h
values. Luckily, classification speed is only affected by the number of
features found in the unknown document.
|dbacl -c twain1 Twain-Collected_Works.txt
||unigrams only
|dbacl -c twain2 Twain-Collected_Works.txt
||unigrams and bigrams
|dbacl -c twain3 Twain-Collected_Works.txt
||unigrams, bigrams and trigrams
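The dependence on feature count is easy to quantify: an N-word document contains N unigrams, N-1 bigrams and N-2 trigrams, so the trigram model must score roughly three times as many features as the unigram model. A small shell sketch (the word count N is an arbitrary example, not taken from the Twain file):

```shell
# For an N-word document, count the features each model must score.
N=1000                             # hypothetical document length in words
UNI=$N                             # -w 1: unigrams only
BI=$(( N + (N - 1) ))              # -w 2: unigrams + bigrams
TRI=$(( N + (N - 1) + (N - 2) ))   # -w 3: unigrams + bigrams + trigrams
echo "unigram model: $UNI  bigram model: $BI  trigram model: $TRI"
```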
The heavy memory requirements during learning of complicated models can be
reduced at the expense of the model itself. dbacl has a feature decimation switch
which slows down the hash table filling rate by simply ignoring many of the
features found in the input.
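Conceptually, decimation just drops a fixed fraction of the features before they reach the hash table. The sketch below illustrates the principle with a deterministic keep-one-in-2^D rule; this is an illustration only, not dbacl's actual implementation (dbacl decimates probabilistically; see its documentation for the real switch and semantics):

```shell
# Illustrative decimation: keep one feature in every 2^D seen,
# so the hash table fills roughly 2^D times more slowly.
D=2                                # decimation level (hypothetical)
kept=0 total=1000 i=0
while [ $i -lt $total ]; do
  # deterministic stand-in for dbacl's probabilistic decimation
  [ $(( i % (1 << D) )) -eq 0 ] && kept=$((kept + 1))
  i=$((i + 1))
done
echo "kept $kept of $total features"
```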