SourceForge.net Logo
Summary
Forums
CVS
Download

Laird Breyer
Download
contents introduction tutorial spam fun man related
previous next

Setting up the game(s)

The first thing we have to do is obtain a (preferably large) collection of chess games that we can learn.

Not being an expert, I started off by browsing the web for likely keywords. It became soon apparent that a large collection of free games is available electronically in something called the PGN format. So I ended up downloading all the files available from Chessopolis and placing them into a subdirectory (you can download any other collections of PGN files you like, the instructions below should be detailed enough so that you can adapt them easily).

% mkdir zipfiles
% cd zipfiles
% ls
100-pg.zip    can92-pg.zip  gmcom3pg.zip  krp-pg.zip    swede2pg.zip
4queenpg.zip  cp8687pg.zip  irish-pg.zip  lonepine.zip  ten-pg.zip
acbul-pg.zip  cp8891pg.zip  italy-pg.zip  maxg-pgn.zip  trap-pg.zip
bril-pg.zip   croat-pg.zip  kbp-pg.zip    minis-pg.zip  wchamppg.zip
brit60pg.zip  denm-pg.zip   knp-pg.zip    pca9395.zip
brit70pg.zip  gmcom1pg.zip  kp-kp-pg.zip  sams-pg.zip
cal-pg.zip    gmcom2pg.zip  kqp-pg.zip    storm-pg.zip

Now that we have a collection, let's look at a typical game, say from sams-pg.zip:

% zcat sams-pg.zip | head -15
[Event "?"]
[Site "Active Chess Championship, Kuala Lumpur (Malays"]
[Date "1989.??.??"]
[Round "?"]
[White "Anand Viswanathan (IND)"]
[Black "Sloan Sam"]
[Result "1-0"]
[ECO "C57"]

1. e4 e5 2. Nf3 Nc6 3. Bc4 Nf6 4. Ng5 Nxe4 5. Bxf7+ Ke7 6. d3 Nf6 7. 
Bb3 d5 8. Nc3 Bg4 9. f3 Bf5 10. f4 Bg4 11. Qd2 h6 12. fxe5 Nxe5 13. Qe3 
Kd6 14. d4 Nd3+ 15. Qxd3 Qe7+ 16. Be3 Re8 17. Nf7+ Qxf7 18. O-O c6 19. 
Bf4+ Kd7 20. Be5 Be7 21. Rae1 Rhf8 22. Nxd5 cxd5 23. Ba4+ Kd8 24. Qc3 
Bb4 25. Qxb4 Re6 26. c4 Rb6 27. Qa5 Bc8 28. c5 1-0

The trouble with data collections is that they are never exactly in the format we want. The chess game is obviously the bit at the bottom, while the text in square brackets looks quite useless to teach our filter.

Looking at the game itself, the numbers obviously count the moves, while the actual symbols that follow just seem like noise. But look more closely, and each move is actually followed by two expressions, one for each player.

In chess, the White player always starts first, and if you know that a chess board's columns are marked by letters, and the rows are marked by numbers, then e4 is a square on the board. The capital letters such as B, N, Q, K probably stand for Bishop, kNight, Queen and King. Of course, if you get stuck, you might just want to read the PGN format specification instead of guessing.

So now we know that each player's moves are separated by spaces, and that the numbers ending in a dot are just there to help people read the moves, and can be ignored just like the text in brackets with the names of the players etc. The real game information could be simply written like this:

e4 e5 Nf3 Nc6 Bc4 Nf6 Ng5 Nxe4 Bxf7+ Ke7 d3 Nf6 
Bb3 d5 Nc3 Bg4 f3 Bf5 f4 Bg4 Qd2 h6 fxe5 Nxe5 Qe3 
Kd6 d4 Nd3+ Qxd3 Qe7+ Be3 Re8 Nf7+ Qxf7 O-O c6 
Bf4+ Kd7 Be5 Be7 Rae1 Rhf8 Nxd5 cxd5 Ba4+ Kd8 Qc3 
Bb4 Qxb4 Re6 c4 Rb6 Qa5 Bc8 c5 1-0

We'll come back to this later, but first let's talk about text classification.

previous next
contents introduction tutorial spam fun man related