Setting up the game(s)
The first thing we have to do is obtain a (preferably large) collection
of chess games that we can learn from.
Not being an expert, I started off by
browsing the web for likely keywords. It soon became apparent that a large
collection of free games is available electronically in something called the
PGN format. So I ended up downloading all the files available from
Chessopolis and placing them into a subdirectory (you can download any other
collections of PGN files you like; the instructions below should be
detailed enough that you can adapt them easily).
% mkdir zipfiles
% cd zipfiles
% ls
100-pg.zip can92-pg.zip gmcom3pg.zip krp-pg.zip swede2pg.zip
4queenpg.zip cp8687pg.zip irish-pg.zip lonepine.zip ten-pg.zip
acbul-pg.zip cp8891pg.zip italy-pg.zip maxg-pgn.zip trap-pg.zip
bril-pg.zip croat-pg.zip kbp-pg.zip minis-pg.zip wchamppg.zip
brit60pg.zip denm-pg.zip knp-pg.zip pca9395.zip
brit70pg.zip gmcom1pg.zip kp-kp-pg.zip sams-pg.zip
cal-pg.zip gmcom2pg.zip kqp-pg.zip storm-pg.zip
Now that we have a collection, let's look at a typical game, say from sams-pg.zip:
% zcat sams-pg.zip | head -15
[Site "Active Chess Championship, Kuala Lumpur (Malays"]
[White "Anand Viswanathan (IND)"]
[Black "Sloan Sam"]
1. e4 e5 2. Nf3 Nc6 3. Bc4 Nf6 4. Ng5 Nxe4 5. Bxf7+ Ke7 6. d3 Nf6 7.
Bb3 d5 8. Nc3 Bg4 9. f3 Bf5 10. f4 Bg4 11. Qd2 h6 12. fxe5 Nxe5 13. Qe3
Kd6 14. d4 Nd3+ 15. Qxd3 Qe7+ 16. Be3 Re8 17. Nf7+ Qxf7 18. O-O c6 19.
Bf4+ Kd7 20. Be5 Be7 21. Rae1 Rhf8 22. Nxd5 cxd5 23. Ba4+ Kd8 24. Qc3
Bb4 25. Qxb4 Re6 26. c4 Rb6 27. Qa5 Bc8 28. c5 1-0
The trouble with data collections is that they are never exactly in the format
we want. The chess game is obviously the bit at the bottom, while the
text in square brackets looks quite useless for teaching our filter.
Looking at the game itself, the numbers obviously count the moves,
while the actual symbols that follow just seem like noise. But look
more closely, and each move number is actually followed by two expressions,
one for each player.
In chess, White always moves first,
and if you know that a chess board's columns are marked by letters,
and the rows are marked by numbers, then e4 is a square on the board.
The capital letters such as B, N, Q, K probably stand for Bishop,
kNight, Queen and King. Of course, if you get stuck, you might just
want to read the PGN format specification instead of guessing.
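To make the guesswork above concrete, here is a small shell sketch that names the piece behind a single move. The function name piece_of is made up for this illustration. It also assumes two things the text above doesn't spell out but which standard algebraic notation guarantees: R stands for Rook, and a move that starts with a lowercase column letter is a pawn move.

```shell
# piece_of: hypothetical helper, prints which piece a single move refers to.
# Assumes standard algebraic notation as it appears in the game above.
piece_of() {
  case "$1" in
    O-O*)   echo "castling" ;;  # O-O or O-O-O: king and rook move together
    K*)     echo "King" ;;
    Q*)     echo "Queen" ;;
    R*)     echo "Rook" ;;
    B*)     echo "Bishop" ;;
    N*)     echo "Knight" ;;    # K is taken by the King, hence N
    [a-h]*) echo "Pawn" ;;      # no piece letter, just a column letter
    *)      echo "unknown" ;;   # e.g. the result marker 1-0
  esac
}
```

So piece_of Nf3 says Knight, and piece_of e4 says Pawn.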
So now we know that each player's moves are separated by spaces, and
that the numbers ending in a dot are just there to help people read the
moves, and can be ignored just like the text in brackets with the names
of the players etc. The real game information could be simply written like this:
e4 e5 Nf3 Nc6 Bc4 Nf6 Ng5 Nxe4 Bxf7+ Ke7 d3 Nf6
Bb3 d5 Nc3 Bg4 f3 Bf5 f4 Bg4 Qd2 h6 fxe5 Nxe5 Qe3
Kd6 d4 Nd3+ Qxd3 Qe7+ Be3 Re8 Nf7+ Qxf7 O-O c6
Bf4+ Kd7 Be5 Be7 Rae1 Rhf8 Nxd5 cxd5 Ba4+ Kd8 Qc3
Bb4 Qxb4 Re6 c4 Rb6 Qa5 Bc8 c5 1-0
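For the impatient, here is a rough sketch of how such a cleanup could be done with sed. The function name pgn_moves is made up for this illustration, and the sketch assumes the files look like the sample above: tag lines start with a square bracket, move numbers are digits followed by one or more dots, and there are no comments or variations inside the move text.

```shell
# pgn_moves: hypothetical helper, reads PGN text on stdin and prints
# only the moves (plus the final result marker such as 1-0).
# Assumes no {comments} or (variations) in the move text.
pgn_moves() {
  sed -E \
    -e '/^\[/d' \
    -e 's/[0-9]+\.+ *//g' \
    -e 's/ +$//' \
    -e '/^$/d'
  # 1st expression: drop the [Tag "value"] lines
  # 2nd expression: drop move numbers such as "7." or "7..."
  # 3rd expression: trim spaces left behind at the end of a line
  # 4th expression: drop lines that are now empty
}
```

It could then be tried on one of the downloaded files:
% zcat sams-pg.zip | pgn_moves | head -5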
We'll come back to this later, but first let's talk about text classification.