Laird Breyer
contents introduction tutorial spam fun man related
previous next

Language models

dbacl works by scanning the text it learns for features, which can be nearly anything you like. For example, unless you tell it otherwise, the standard features are all alphabetic single words in the document. dbacl builds a statistical model, ie a probability distribution, based only on those features, so anything that is not a feature will be ignored both during learning, and during classification.

This dependence on features is a double edged sword, because it helps dbacl focus on the things that matter (single alphabetic words by default), but if something else matters more then it is ignored if you don't tell dbacl it's a feature. This is the hard part, and it's up to you.

When telling dbacl what kind of features to look out for, you must use the language of regular expressions. For example, if you think the only interesting features for category one are words which contain the letter 'q', then you would type

% dbacl -l justq -g '^([a-zA-Z]*q[a-zA-Z]*)' \
  -g '[^a-zA-Z]([a-zA-Z]*q[a-zA-Z]*)' sample2.txt

The rule is that dbacl always takes as a feature whatever it finds within round brackets. Reading this can be painful if you don't know regular expressions, however.

In English, the first expression after the -g option above reads: take as a feature any string which looks like: "start of the line" (written ^) followed by "zero or more characters within the range a-z or A-Z" (written [a-zA-Z]*), followed by "the character q" (written q), followed by "zero or more characters within the range a-z or A-Z" (written [a-zA-Z]*). The second expression is nearly identical: "a single character which is not in the range a-zA-Z" (written [^a-zA-Z]), followed by "zero or more characters within the range a-z or A-Z" (can you guess?), followed by "the character q", followed by "zero or more characters within the range a-z or A-Z". The single quote marks are used to keep the whole expression together.

A regular expression is a simultaneous superposition of many text strings. Just like a word, you read and write it one character at a time.

Symbol What it means
. any character except newline
* zero or more copies of preceding character or parenthesized expression
+ one or more copies of preceding character or parenthesized expression
? zero or one copies of preceding character or parenthesized expression
^ beginning of line
$ end of line
a|b a or b
[abc] one character equal to a, b or c
[^abc] one character not equal to a, b or c
\*, \?, or \. the actual character *, ? or .

To get a feel for the kinds of features taken into account by dbacl in the example above, you can use the -D option. Retype the above in the slightly changed form

% dbacl -l justq -g '^([a-zA-Z]*q[a-zA-Z]*)' \
 -g '[^a-zA-Z]([a-zA-Z]*q[a-zA-Z]*)' sample2.txt -D | grep match
match d191e93e []acquired[](1) 1
match 8c56f142 []inquire[](1) 1
match 7a2ccda2 []inquiry[](1) 1
match 38f595f3 []consequently[](1) 1
match a52053f2 []questions[](1) 1
match 78a0e302 []question[](1) 1

This command lists the first few matches, one per line, which exist in the sample1.txt document. Obviously, only taking into account features which consist of words with the letter 'q' in them makes a poor model. However, when you are trying out regular expressions, you can compare this output with the contents of the document to see if your expression misses out on words or reads too many.

Sometimes, it's convenient to use parentheses which you want to throw away. dbacl understands the special notation ||xyz which you can place at the end of a regular expression, where x, y, z should be digits corresponding to the parentheses you want to keep. Here is an example for mixed Japanese and English documents, which matches alphabetic words and single ideograms:

% LANG=ja_JP dbacl -D -l konichiwa japanese.txt -i \
 -g '(^|[^a-zA-Z0-9])([a-zA-Z0-9]+|[[:alpha:]])||2'

Note that you need a multilingual terminal and Japanese fonts to view this, and your computer must have a Japanese locale available.

In the table below, you will find a list of some simple regular expressions to get you started:

If you want to match... Then you need this expression... Examples
alphabetic words (^|[^[:alpha:]]) ([[:alpha:]]+) ||2 hello, kitty
words in capitals (^|[^[A-Z]]) ([A-Z]+) ||2 MAKE, MONEY, FAST
strings of characters separated by spaces (^|[ ]) ([^ ]+) ||2 w$%&tf9(, amazing!, :-)
time of day (^|[^0-9]) ([0-9?[0-9]:[0-9][0-9](am|pm)) ||2 9:17am, 12:30pm
words which end in a number (^|[^a-zA-Z0-9]) ([a-zA-Z]+[0-9]+) [^a-zA-Z] ||2 borg17234, A1
alphanumeric word pairs (^|[^[:alnum:]]) ([[:alnum:]]+) [^[:alnum:]]+ ([[:alnum:]]+) ||23 good morning, how are

The last entry in the table above shows how to take word pairs as features. Such models are called bigram models, as opposed to the unigram models whose features are only single words, and they are used to capture extra information.

For example, in a unigram model the pair of words "well done" and "done well" have the same probability. A bigram model can learn that "well done" is more common in food related documents (provided this combination of words was actually found within the learning corpus).

However, there is a big statistical problem: because there exist many more meaningful bigrams than unigrams, you'll need a much bigger corpus to obtain meaningful statistics. One way around this is a technique called smoothing, which predicts unseen bigrams from already seen unigrams. To obtain such a combined unigram/bigram alphabetic word model, type

% dbacl -l smooth -g '(^|[^a-zA-Z])([a-zA-Z]+)||2' \
 -g '(^|[^a-zA-Z])([a-zA-Z]+)[^a-zA-Z]+([a-zA-Z]+)||23' sample1.txt

If all you want are alphabetic bigrams, trigrams, etc, there is a special switch -w you can use. The command

% dbacl -l slick -w 2 sample1.txt

produces a model slick which is nearly identical to smooth (the difference is that a regular expression cannot straddle newlines, but -w ngrams can). Let's look at the first few features: type

% dbacl -l slick -w 2 sample1.txt -D | grep match | head -10
match 818ad280 []tom[](1) 1
match 5d20c0e2 []tom[]no[](1) 2
match 3db5da99 []no[](1) 1
match 4a18ad66 []no[]answer[](1) 2
match eea4a1c4 []answer[](1) 1
match 95392743 []answer[]tom[](1) 2
match 61cc1403 []answer[]what[](1) 2
match 8c953ec2 []what[](1) 1
match 4291d86e []what[]s[](1) 2
match b09aa375 []s[](1) 1
You can see both pairs and single words, all lower case because dbacl converts everything to lower case unless you tell it otherwise. This saves a little on memory. But what did the original document look like?
% head -10 sample1.txt 

No answer.


No answer.

"What's gone with that boy,  I wonder? You TOM!"

Now you see how the pairs are formed. But wait, the pair of words ("TOM!", No) occurs twice in the text, but only once in the list of matches? Did we miss one? No, look again at the line
match 5d20c0e2 []tom[]no[](1) 2
and you will see that the last value is '2', since we've seen it twice. dbacl uses the frequencies of features to build its model.

Obviously, all this typing is getting tedious, and you will eventually want to automate the learning stage in a shell script. Use regular expressions sparingly, as they can quickly degrade the performance (speed and memory) of dbacl. See Appendix A for ways around this.

previous next
contents introduction tutorial spam fun man related