When experimenting with complicated models, dbacl will quickly fill up its hash
tables. dbacl is designed to use a predictable amount of memory (to prevent nasty surprises on some systems). The default hash table size in version 1.1 is 15, i.e. 2^15 slots, which is enough for about 32,000 unique features and produces a 512K category file on my system. You can select the hash table size with the -h switch, which takes the base-two logarithm of the number of slots. Beware that learning takes much more memory than classifying; use the -V switch to find out the cost per feature. On my system, each feature costs 6 bytes for classifying, but 17 bytes for learning.
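The relationship between the -h value and memory use can be sketched with shell arithmetic. The per-feature byte costs below are just the example figures quoted above; run dbacl -V to get the actual costs on your own system:

```shell
#!/bin/sh
# Estimate hash table capacity and memory for a given -h value.
h=15                                  # default hash table size (2^15 slots)
features=$(( 1 << h ))                # maximum number of unique features
learn_k=$(( features * 17 / 1024 ))   # learning: 17 bytes/feature (example figure)
class_k=$(( features * 6 / 1024 ))    # classifying: 6 bytes/feature (example figure)
echo "-h $h: up to $features features, ~${learn_k}K to learn, ~${class_k}K to classify"
```

Note that the category file on disk can be larger than the raw per-feature classifying cost suggests, since it also stores model metadata.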
For testing, I use the collected works of Mark Twain, a 19MB plain text file. Timings are on a 500MHz Pentium III.
  command                                              | unique features | category size | learning time
  dbacl -l twain1 Twain-Collected_Works.txt -w 1 -h 16 | 49,251          | 512K          | 0m9.240s
  dbacl -l twain2 Twain-Collected_Works.txt -w 2 -h 20 | 909,400         | 6.1M          | 1m1.100s
  dbacl -l twain3 Twain-Collected_Works.txt -w 3 -h 22 | 3,151,718       | 24M           | 3m42.240s
As can be seen from this table, including bigrams and trigrams noticeably
increases both memory use and learning time. Luckily, classification speed
depends only on the number of features found in the unknown document, not on
the size of the category model.
  command                                   | features                       | classification time
  dbacl -c twain1 Twain-Collected_Works.txt | unigrams                       | 0m4.860s
  dbacl -c twain2 Twain-Collected_Works.txt | unigrams and bigrams           | 0m8.930s
  dbacl -c twain3 Twain-Collected_Works.txt | unigrams, bigrams and trigrams | 0m12.750s
The heavy memory requirements of learning a complicated model can be
reduced at the cost of model quality. dbacl has a feature decimation switch,
which slows down the hash table filling rate by simply ignoring many of the
features found in the input.
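Assuming decimation drops each incoming feature independently with probability 1/2^d (a hypothetical parameterization for illustration; consult dbacl's documentation for the exact switch and semantics in your version), its effect on the fill rate is easy to estimate:

```shell
#!/bin/sh
# Rough effect of decimation on hash table fill rate, assuming each
# feature is kept with probability 1/2^d (hypothetical parameterization;
# see dbacl's documentation for the actual switch on your version).
d=2
n=3151718                 # trigram features in the Twain corpus (table above)
kept=$(( n / (1 << d) ))  # expected number of features surviving decimation
echo "d=$d keeps roughly $kept of $n features"
```

With d=2, roughly a quarter of the trigram features survive, so a correspondingly smaller -h value suffices, at the expense of a coarser model.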