Laird Breyer
contents introduction tutorial spam fun man related
previous next

Appendix A: memory requirements

When experimenting with complicated models, dbacl will quickly fill up its hash tables. dbacl is designed to use a predictable amount of memory (to prevent nasty surprises on some systems). The default hash table size in version 1.1 is 15, i.e. room for 2^15 (about 32,000) unique features, and produces a 512K category file on my system. You can use the -h switch to select the hash table size, as a power of two. Beware that learning takes much more memory than classifying. Use the -V switch to find out the cost per feature. On my system, each feature costs 6 bytes for classifying but 17 bytes for learning.
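The relationship between the -h value and memory use can be sketched with a little arithmetic. The per-feature byte costs below (6 for classifying, 17 for learning) are the values -V reported on my system; treat them as illustrative, not universal.

```python
def hash_capacity(h):
    """Number of unique feature slots in a hash table of size h (2^h)."""
    return 2 ** h

def learning_memory_bytes(h, bytes_per_feature=17):
    """Approximate memory needed while learning with table size h.

    17 bytes/feature is the learning cost reported by -V on one
    particular system; substitute the value -V prints on yours.
    """
    return hash_capacity(h) * bytes_per_feature

# The default table size 15 gives 32,768 slots (the "32,000 unique
# features" figure above), costing roughly 544K while learning.
print(hash_capacity(15))          # 32768
print(learning_memory_bytes(15))  # 557056
```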

For testing, I use the collected works of Mark Twain, which is a 19MB pure text file. Timings are on a 500MHz Pentium III.

Command                                                Unique features   Category size   Learning time
dbacl -l twain1 Twain-Collected_Works.txt -w 1 -h 16   49,251            512K            0m9.240s
dbacl -l twain2 Twain-Collected_Works.txt -w 2 -h 20   909,400           6.1M            1m1.100s
dbacl -l twain3 Twain-Collected_Works.txt -w 3 -h 22   3,151,718         24M             3m42.240s

As the table shows, including bigrams and trigrams noticeably increases both memory use and learning time. Luckily, classification speed is affected only by the number of features found in the unknown document.
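The -h values used in the learning table can be sanity-checked: a table of size h holds 2^h features, so the smallest adequate h is the ceiling of log2 of the expected unique feature count. A small sketch:

```python
import math

def min_hash_size(unique_features):
    """Smallest -h value whose 2^h-slot table holds all the features."""
    return math.ceil(math.log2(unique_features))

# Feature counts from the learning table above:
for n in (49_251, 909_400, 3_151_718):
    print(f"{n:>9,} unique features -> -h {min_hash_size(n)}")
# 49,251 -> -h 16; 909,400 -> -h 20; 3,151,718 -> -h 22
```

These match the -h 16, -h 20 and -h 22 choices in the table.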

Command                                      Features                         Classification time
dbacl -c twain1 Twain-Collected_Works.txt    unigrams                         0m4.860s
dbacl -c twain2 Twain-Collected_Works.txt    unigrams and bigrams             0m8.930s
dbacl -c twain3 Twain-Collected_Works.txt    unigrams, bigrams and trigrams   0m12.750s

The heavy memory requirements of learning complicated models can be reduced at the expense of model quality. dbacl has a feature decimation switch, which slows the rate at which the hash table fills by simply ignoring many of the features found in the input.
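The idea behind decimation can be illustrated conceptually (this is a hypothetical sketch of the technique, not dbacl's actual implementation): a stable hash of each feature decides whether it is counted, so with d decimation bits only about 1 in 2^d features survive, and repeated runs drop the same ones.

```python
import hashlib

def keep_feature(feature, decimation_bits):
    """Deterministically keep roughly 1 in 2^decimation_bits features.

    Hashing the feature (rather than, say, random sampling) means the
    same features are dropped every run, so learned counts stay
    consistent while the hash table fills about 2^d times more slowly.
    """
    h = int.from_bytes(hashlib.md5(feature.encode()).digest()[:4], "big")
    return h % (2 ** decimation_bits) == 0

words = "it was the best of times it was the worst of times".split()
kept = [w for w in words if keep_feature(w, 1)]  # keep roughly half
```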
