Laird Breyer

Setting up the game(s)

The first thing we have to do is obtain a (preferably large) collection of chess games that we can learn from.

Not being an expert, I started off by browsing the web for likely keywords. It soon became apparent that a large collection of free games is available electronically in something called the PGN format. So I ended up downloading all the files available from Chessopolis and placing them into a subdirectory. (You can download any other collection of PGN files you like; the instructions below should be detailed enough for you to adapt them easily.)

% mkdir zipfiles
% cd zipfiles
% ls

Now that we have a collection, let's look at a typical game, say from

% zcat | head -15
[Event "?"]
[Site "Active Chess Championship, Kuala Lumpur (Malays"]
[Date "1989.??.??"]
[Round "?"]
[White "Anand Viswanathan (IND)"]
[Black "Sloan Sam"]
[Result "1-0"]
[ECO "C57"]

1. e4 e5 2. Nf3 Nc6 3. Bc4 Nf6 4. Ng5 Nxe4 5. Bxf7+ Ke7 6. d3 Nf6 7. 
Bb3 d5 8. Nc3 Bg4 9. f3 Bf5 10. f4 Bg4 11. Qd2 h6 12. fxe5 Nxe5 13. Qe3 
Kd6 14. d4 Nd3+ 15. Qxd3 Qe7+ 16. Be3 Re8 17. Nf7+ Qxf7 18. O-O c6 19. 
Bf4+ Kd7 20. Be5 Be7 21. Rae1 Rhf8 22. Nxd5 cxd5 23. Ba4+ Kd8 24. Qc3 
Bb4 25. Qxb4 Re6 26. c4 Rb6 27. Qa5 Bc8 28. c5 1-0

The trouble with data collections is that they are never exactly in the format we want. The chess game is obviously the bit at the bottom, while the text in square brackets looks quite useless for teaching our filter.

Looking at the game itself, the numbers obviously count the moves, while the actual symbols that follow just seem like noise. But look more closely, and each move number is actually followed by two expressions, one for each player.

In chess, White always moves first, so the first expression after each number is White's move and the second is Black's. If you know that a chess board's columns are marked by letters and its rows by numbers, then e4 is a square on the board. The capital letters such as B, N, Q, K probably stand for Bishop, kNight, Queen and King. Of course, if you get stuck, you might just want to read the PGN format specification instead of guessing.

So now we know that each player's moves are separated by spaces, and that the numbers ending in a dot are just there to help people read the moves, and can be ignored just like the text in brackets with the names of the players etc. The real game information could be simply written like this:

e4 e5 Nf3 Nc6 Bc4 Nf6 Ng5 Nxe4 Bxf7+ Ke7 d3 Nf6 
Bb3 d5 Nc3 Bg4 f3 Bf5 f4 Bg4 Qd2 h6 fxe5 Nxe5 Qe3 
Kd6 d4 Nd3+ Qxd3 Qe7+ Be3 Re8 Nf7+ Qxf7 O-O c6 
Bf4+ Kd7 Be5 Be7 Rae1 Rhf8 Nxd5 cxd5 Ba4+ Kd8 Qc3 
Bb4 Qxb4 Re6 c4 Rb6 Qa5 Bc8 c5 1-0
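
The stripping described above can be sketched as a small shell filter. This is only an illustration, not necessarily the method used later in this tutorial: the function name pgn_moves is made up here, and it assumes the PGN text contains no comments or variations (text in braces or parentheses), which it would pass through untouched.

```shell
# Hypothetical helper (not from the original text): read PGN on stdin,
# write only the moves and the final result on stdout.
pgn_moves() {
    # 1. Drop tag lines such as [Event "?"], which start with "[".
    # 2. Delete move numbers: digits followed by one or more dots
    #    ("1." or "3..."), plus any whitespace after them.
    #    A result such as 1-0 contains no dot, so it is left alone.
    grep -v '^\[' | sed -E 's/[0-9]+\.+[[:space:]]*//g'
}
```

Piping decompressed PGN text into pgn_moves should produce output like the listing above, with the original line breaks preserved.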

We'll come back to this later, but first let's talk about text classification.
