Laird Breyer

Figuring out what to learn

We know that dbacl can tell us if a piece of text is typical for a model, and in turn a model is learned by reading many examples together. So if a pattern occurs very often in the games being learned, then such a typical pattern will be recommended by dbacl. And if the pattern is rare then dbacl will recommend it rarely.

But we want dbacl to win. So we want it to recommend the kind of things winners often do. So when dbacl plays White, it must learn a model from games where White wins, and if dbacl plays Black, then its model must be from games where Black wins.

At least that's a good first assumption. Sometimes strong players lose a game against weaker players, and if dbacl learns from this type of game, it will pick up bad habits from the weaker player. But we'll assume that in most games the better player wins. Also, if we learn to play by studying games from terrible players, then we'll pick up bad habits no matter what. But that is for later, or we'll never get anywhere.

Unfortunately, we now have work to do. We must split our thousands of sample games into White-Win (1-0) and Black-Win (0-1). And what to do about draws (1/2-1/2)? We can put them in both categories or just ignore them.

The files I downloaded are zipped MS-DOS files called *.PGN whose lines end in "\r\n" instead of ending in "\n", which is the Unix convention.

% cd zipfiles
% for f in *.zip; do unzip "$f"; done
% mkdir ../gamefiles
% mv *.PGN ../gamefiles
% cd ..
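To see the line-ending difference concretely, here is a small sketch on a made-up one-line sample (not one of the downloaded files). It uses tr -d '\r', one common way to drop carriage returns, and compares the byte counts before and after.

```shell
# A made-up line with an MS-DOS ending: 8 characters of text, then \r\n.
dos=$(printf '1. e4 e5\r\n' | wc -c)

# The same line after deleting the carriage return, i.e. the Unix form.
unix=$(printf '1. e4 e5\r\n' | tr -d '\r' | wc -c)

# $(( )) strips any padding that wc may add around the number.
echo "with CR: $((dos)) bytes, without CR: $((unix)) bytes"
```

The one-byte difference per line is the stray "\r" that would otherwise end up stuck to the last move on each line.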

After inspecting a few *.PGN files, it's clear that a typical game takes several lines to write out fully, but the lines in between are either empty or contain all sorts of comments and useless information which must be scrubbed. We can do this by recombining the lines of a game into a single long line, and since all games start with a "1.", any lines left that don't start this way can be thrown away.

% cd gamefiles
% cat *.PGN  | sed -e 's/\r//g' \
  | sed -e :a -e '$!N;s/\n\([a-hKQNBRO0-9]\)/ \1/;ta' -e 'P;D' \
  | sed -e 's/^ *//' \
  | grep '^1\.' \
  > allgames.txt
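To check what this pipeline does before trusting it with thousands of games, it can be run on a tiny sample first. The sketch below builds a made-up two-game file (the name sample.PGN and the moves are invented for illustration) with "\r\n" line endings, then applies the same commands. Like the original command, it assumes a sed that understands \r, which GNU sed does.

```shell
# Build a tiny two-game PGN sample with MS-DOS (\r\n) line endings.
# The games and the file name sample.PGN are made up for illustration.
demo=$(mktemp -d)
printf '%s\r\n' \
  '[Event "Example"]' \
  '[Result "1-0"]' \
  '' \
  '1. e4 e5 2. Nf3 Nc6' \
  '3. Bb5 a6 1-0' \
  '' \
  '[Event "Example 2"]' \
  '[Result "0-1"]' \
  '' \
  '1. d4 d5 2. c4' \
  'e6 0-1' \
  > "$demo/sample.PGN"

# Same pipeline as above, applied to the sample only:
# strip \r, glue continuation lines onto the previous line,
# trim leading spaces, keep only lines that start with "1.".
cat "$demo/sample.PGN" | sed -e 's/\r//g' \
  | sed -e :a -e '$!N;s/\n\([a-hKQNBRO0-9]\)/ \1/;ta' -e 'P;D' \
  | sed -e 's/^ *//' \
  | grep '^1\.' \
  > "$demo/allgames.txt"

# Each game should now sit on a single line, with the tag lines gone.
cat "$demo/allgames.txt"
```

The tag lines start with "[", which is not in the character class [a-hKQNBRO0-9], so they are never glued onto a game and the final grep drops them.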

What we now have is a big file, allgames.txt, in which each very long line is a single game. In the PGN format, the result is marked at the end of the game, so it is easy for us to sort the games: we keep or throw away the lines which contain 1-0 (White wins) or 0-1 (Black wins). Draws (1/2-1/2) match neither pattern, so throwing away one side's wins leaves the other side's wins together with the draws. We also remove the move numbers, which we don't need anymore.

% cat allgames.txt | grep -v '0-1' | sed 's/[0-9]*\.[ ]*//g' > WhiteWinDraw.txt
% cat allgames.txt | grep -v '1-0' | sed 's/[0-9]*\.[ ]*//g' > BlackWinDraw.txt
% cat allgames.txt | grep '1-0' | sed 's/[0-9]*\.[ ]*//g' > WhiteWin.txt
% cat allgames.txt | grep '0-1' | sed 's/[0-9]*\.[ ]*//g' > BlackWin.txt

Let's see how many games we've got:

% wc -l *.txt
   46245 allgames.txt
   26809 BlackWinDraw.txt
   14913 BlackWin.txt
   31332 WhiteWinDraw.txt
   19436 WhiteWin.txt
  138735 total

All right, around 15-20 thousand winning games of each type. That should give dbacl something to read!
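The counts also give a quick consistency check: 31332 - 19436 and 26809 - 14913 both come to 11896, which is the number of draws (46245 - 19436 - 14913). The sketch below repeats that check on a made-up four-game sample, so it can be tried without the real files. The helper count and the sample games are assumptions for illustration only.

```shell
# Made-up sample: one White win, one Black win, two draws.
demo=$(mktemp -d)
printf '%s\n' \
  '1. e4 e5 2. Nf3 Nc6 1-0' \
  '1. d4 d5 2. c4 e6 0-1' \
  '1. c4 e5 2. Nc3 Nf6 1/2-1/2' \
  '1. Nf3 d5 2. g3 c5 1/2-1/2' \
  > "$demo/allgames.txt"

# The same four filters as above, run on the sample.
grep -v '0-1' "$demo/allgames.txt" | sed 's/[0-9]*\.[ ]*//g' > "$demo/WhiteWinDraw.txt"
grep -v '1-0' "$demo/allgames.txt" | sed 's/[0-9]*\.[ ]*//g' > "$demo/BlackWinDraw.txt"
grep '1-0' "$demo/allgames.txt" | sed 's/[0-9]*\.[ ]*//g' > "$demo/WhiteWin.txt"
grep '0-1' "$demo/allgames.txt" | sed 's/[0-9]*\.[ ]*//g' > "$demo/BlackWin.txt"

# Hypothetical helper: number of lines (games) in a file.
count() { grep -c '' "$1"; }

all=$(count "$demo/allgames.txt")
white=$(count "$demo/WhiteWin.txt")
black=$(count "$demo/BlackWin.txt")
draws=$((all - white - black))
echo "draws: $draws"

# Wins plus draws must equal the size of each WinDraw file.
[ $((white + draws)) -eq "$(count "$demo/WhiteWinDraw.txt")" ] && echo "WhiteWinDraw is consistent"
[ $((black + draws)) -eq "$(count "$demo/BlackWinDraw.txt")" ] && echo "BlackWinDraw is consistent"
```

If either check failed on the real files, a likely culprit would be a game line that contains both result strings or neither.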

Remember, each game is on its own line. I'm going to leave the final scores at the end of each line where they are, as they are harmless (they can't occur in the middle of a game in progress). Let's learn the models:

% cd ..
% ./dbacl/src/dbacl -T text -l ./WhiteWinDraw -e alnum -L uniform \
  -j -w 2 -H 20 ./gamefiles/WhiteWinDraw.txt
% ./dbacl/src/dbacl -T text -l ./BlackWinDraw -e alnum -L uniform \
  -j -w 2 -H 20 ./gamefiles/BlackWinDraw.txt

The most important option here is "-w 2", which tells dbacl that it must pick up single words as well as word pairs. We'll see later if that's a good idea. If all went well, then you should have two files in your chess directory.

% ls -lh *Win*
-rw-r-----  1 laird laird 3.2M 2005-06-24 17:16 BlackWinDraw
-rw-r-----  1 laird laird 3.2M 2005-06-24 17:15 WhiteWinDraw
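To get a feel for what "single words as well as word pairs" means for a game line, here is a rough sketch. It only splits on whitespace with awk and is not how dbacl works internally. The sample line is made up and chosen so that every token is purely alphanumeric.

```shell
# A made-up game fragment with move numbers already removed.
line='e4 e5 Nf3 Nc6 Bb5 a6'

# Single words: one token per line.
singles=$(printf '%s\n' "$line" | awk '{ for (i = 1; i <= NF; i++) print $i }')

# Word pairs: each token together with the token that follows it.
pairs=$(printf '%s\n' "$line" | awk '{ for (i = 1; i < NF; i++) print $i, $(i + 1) }')

printf '%s\n' "$singles"
printf '%s\n' "$pairs"
```

A pair such as "e4 e5" ties a move to the reply that followed it, which is the kind of pattern single words alone cannot capture.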