When experimenting with complicated models, dbacl will quickly fill up its hash
tables. dbacl is designed to use a predictable amount of memory (to prevent nasty surprises on some systems). The default hash table size in version 1.1 is 15, i.e. 2^15 slots, which is enough for about 32,000 unique features and produces a 512K category file on my system. You can select the hash table size with the -h switch, which takes the base-two logarithm of the number of slots. Beware that learning takes much more memory than classifying; use the -V switch to find out the cost per feature. On my system, each feature costs 6 bytes for classifying, but 17 bytes for learning.
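The relationship between the -h value and memory use can be sketched with shell arithmetic. The per-feature byte costs below are just the example figures quoted above; run dbacl -V to get the actual costs on your own system:

```shell
#!/bin/sh
# Estimate hash table capacity and memory for a given -h value.
h=15                                  # default hash table size (2^15 slots)
features=$(( 1 << h ))                # maximum number of unique features
learn_k=$(( features * 17 / 1024 ))   # learning: 17 bytes/feature (example figure)
class_k=$(( features * 6 / 1024 ))    # classifying: 6 bytes/feature (example figure)
echo "-h $h: up to $features features, ~${learn_k}K to learn, ~${class_k}K to classify"
```

Note that the category file on disk can be larger than the raw per-feature classifying cost suggests, since it also stores model metadata.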
For testing, I use the collected works of Mark Twain, a 19MB plain text file. Timings are on a 500MHz Pentium III.
  command                                              | unique features | category size | learning time
  dbacl -l twain1 Twain-Collected_Works.txt -w 1 -h 16 | 49,251          | 512K          | 0m9.240s
  dbacl -l twain2 Twain-Collected_Works.txt -w 2 -h 20 | 909,400         | 6.1M          | 1m1.100s
  dbacl -l twain3 Twain-Collected_Works.txt -w 3 -h 22 | 3,151,718       | 24M           | 3m42.240s
As can be seen from this table, including bigrams and trigrams noticeably
increases both memory use and learning time. Luckily, classification speed
depends only on the number of features found in the unknown document, not on
the size of the category model.
  command                                   | features                       | classification time
  dbacl -c twain1 Twain-Collected_Works.txt | unigrams                       | 0m4.860s
  dbacl -c twain2 Twain-Collected_Works.txt | unigrams and bigrams           | 0m8.930s
  dbacl -c twain3 Twain-Collected_Works.txt | unigrams, bigrams and trigrams | 0m12.750s
The heavy memory requirements of learning a complicated model can be
reduced at the cost of model quality. dbacl has a feature decimation switch,
which slows down the hash table filling rate by simply ignoring many of the
features found in the input.
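Assuming decimation drops each incoming feature independently with probability 1/2^d (a hypothetical parameterization for illustration; consult dbacl's documentation for the exact switch and semantics in your version), its effect on the fill rate is easy to estimate:

```shell
#!/bin/sh
# Rough effect of decimation on hash table fill rate, assuming each
# feature is kept with probability 1/2^d (hypothetical parameterization;
# see dbacl's documentation for the actual switch on your version).
d=2
n=3151718                 # trigram features in the Twain corpus (table above)
kept=$(( n / (1 << d) ))  # expected number of features surviving decimation
echo "d=$d keeps roughly $kept of $n features"
```

With d=2, roughly a quarter of the trigram features survive, so a correspondingly smaller -h value suffices, at the expense of a coarser model.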