Header format for polyglot books.

Rationale

The Polyglot opening book format is the most widely used non-proprietary opening book format. It is understood by many chess engines as well as by xboard/winboard, icsdrone and cutechess.

Currently Polyglot book files do not contain any metadata. This is problematic for many reasons; for example, a book cannot state which chess variants it supports.

In this document we propose an extensible header format for Polyglot opening books. To facilitate implementation we include a utility for handling such headers.

Since it is anticipated that the adoption of this header format will take some time, we have opted for a header consisting entirely of printable ASCII characters (with some qualification, see below). The header can therefore be inspected with a text editor or with a simple dump program such as od on Linux. In particular the following command

od -w16 -a <file> | less

will display the header.

A note on backward compatibility

The new header is backward compatible in the sense that it is invisible to applications that use Polyglot books through lookup of individual keys corresponding to chess positions (e.g. chess engines).

Applications that handle Polyglot books in a more global way (such as the Polyglot book merging utility) and are unaware of the header may unintentionally corrupt it. But the resulting file will still behave correctly from the point of view of key lookup.

A broken header may be easily deleted and recreated with the pgheader utility which is further described below.

Polyglot as of version 1.4.70b is aware of the new header and will treat it correctly.

Polyglot books

A Polyglot book consists of a series of 16-byte records. The first 8 bytes of each record form the key. Normally the key is the Zobrist hash key of a chess position. There may be more than one record with the same key. The records are ordered according to their keys (lowest key first). Records with a null key 0x0000000000000000 will be called null records below. If they exist they must necessarily be at the beginning of the file.

The header data will be embedded in null records. Chess positions that correspond to null keys are actually known (they were constructed by Peter Österlund) but the probability that such a position would occur in an actual chess game is totally negligible, and moreover the probability of a collision with another key in the book is much larger. Of course a header aware application may simply regard a null key as invalid.
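
As an illustration, the key of a record can be decoded and tested for null as sketched below. The names pg_record_key and pg_is_null_record are hypothetical and not part of pgheader. The sketch assumes the key is stored most significant byte first, which is a property of the Polyglot book format that this document does not restate; for the null test itself the byte order is irrelevant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PG_RECORD_SIZE 16  /* one book record, in bytes */
#define PG_KEY_SIZE     8  /* the key is the first 8 bytes of a record */

/* Decode the 64-bit key of a record.
 * Assumption: the key is stored most significant byte first. */
uint64_t pg_record_key(const unsigned char rec[PG_RECORD_SIZE])
{
    uint64_t key = 0;

    assert(rec != NULL);
    for (int i = 0; i < PG_KEY_SIZE; i++)
        key = (key << 8) | rec[i];
    return key;
}

/* A null record is a record whose key is 0x0000000000000000. */
int pg_is_null_record(const unsigned char rec[PG_RECORD_SIZE])
{
    return pg_record_key(rec) == 0;
}
```

An application that only looks up positions never asks for the null key, which is why such records stay invisible to it.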

Header data

In a Polyglot book the header data is defined as the concatenation of the non-key data in the null records. The bytes in the header data are in the same order as in the file.

If there are no null records then the book is assumed to contain no header.
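
The definition above can be sketched in C as follows, for a book that has already been read into memory. The function name pg_header_data and its buffer-based interface are assumptions made for this illustration; this is not the code used by pgheader.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PG_RECORD_SIZE 16
#define PG_KEY_SIZE     8
#define PG_DATA_SIZE    (PG_RECORD_SIZE - PG_KEY_SIZE)

/* Collect the header data of an in-memory book into out (capacity out_cap).
 * Returns the total number of header data bytes present in the book; if this
 * exceeds out_cap, only the first out_cap bytes were stored. A return value
 * of 0 means there are no null records, hence no header. */
size_t pg_header_data(const unsigned char *book, size_t book_len,
                      unsigned char *out, size_t out_cap)
{
    static const unsigned char null_key[PG_KEY_SIZE] = {0};
    size_t total = 0;

    assert(book != NULL && out != NULL);
    /* Null records, if present, are at the beginning of the file. */
    for (size_t off = 0; off + PG_RECORD_SIZE <= book_len; off += PG_RECORD_SIZE) {
        if (memcmp(book + off, null_key, PG_KEY_SIZE) != 0)
            break;  /* first non-null key: the header data ends here */
        /* Append the 8 non-key bytes of this record, in file order. */
        for (size_t i = 0; i < PG_DATA_SIZE; i++) {
            if (total < out_cap)
                out[total] = book[off + PG_KEY_SIZE + i];
            total++;
        }
    }
    return total;
}
```

Since the records are sorted by key, the loop may stop at the first record with a non-null key.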

The logical header

The logical header is the part up to and including the first null character in the header data.

If there is no null character in the header data then the book is assumed to contain no header.

The logical header is a UTF-8 encoded Unicode character string (without byte order mark, see below). Note that a character string consisting of 7-bit ASCII characters is a valid UTF-8 string.

As the logical header may be arbitrarily long this may present problems for applications that use fixed length buffers.

An application may refuse to parse a header which it considers too long. However it should always be able to process a logical header of at most 2048 characters (including the null character).
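
A minimal sketch of how an application might locate the logical header inside the header data and apply its own length limit. The name pg_logical_header_len is hypothetical. Note the assumption that the limit is measured in bytes; this coincides with the number of characters only for headers that are pure 7-bit ASCII.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Every application should be able to process at least this much. */
#define PG_MIN_SUPPORTED_LEN 2048

/* Return the length of the logical header in bytes, including its
 * terminating null character. Returns 0 if the header data contains no null
 * character (no header) or if the logical header is longer than max_len
 * (the application refuses to parse it). */
size_t pg_logical_header_len(const unsigned char *data, size_t data_len,
                             size_t max_len)
{
    const unsigned char *nul;
    size_t len;

    assert(data != NULL);
    assert(max_len >= PG_MIN_SUPPORTED_LEN);
    nul = memchr(data, 0, data_len);  /* first null character, if any */
    if (nul == NULL)
        return 0;
    len = (size_t)(nul - data) + 1;
    return len <= max_len ? len : 0;
}
```

Anything after the first null character, such as the padding that fills the last null record, is not part of the logical header.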

Fields

The logical header is considered to be a sequence of fields separated by linefeed characters 0x0A. The linefeed characters are not part of the fields. The carriage return character 0x0D is not a field separator, even if it occurs together with a linefeed in the typical newline sequence "\r\n". Note that Windows has the habit of changing "\n" into "\r\n" behind your back, so please test on that platform.

A certain number of fields (depending on the format version) are predefined. The predefined fields should not contain leading or trailing spaces. Numbers are written in decimal form without leading zeros.

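The definition of the first three fields is independent of the format version. As the grammar and the example further below show, field 1 is the magic string @PG@, field 2 is the format version and field 3 is the number of predefined fields that follow it.

In the current version of the logical header (1.0), field 4 is the number n of variants supported by the book, and fields 5 to 5+n-1 are the names of those variants.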

Variant names should consist of printable ASCII characters and contain no spaces or upper case letters. For known variants the standard variant names from the Chess Engine Communication Protocol should be used.

Having zero variants is legal but the meaning of this is undefined in v1.0 of the format.

The non-predefined fields are free format. They should be regarded as comments and would typically include license information, author data, source files etc.
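
The field rules can be illustrated by a small parser for the predefined fields of a version 1.0 header. This is a sketch and not the parser found in pgheader.c: the names pg_parse_variants, pg_next_field, pg_valid_variant_name and pg_parse_number, the limit PG_MAX_VARIANTS and the choice to reject a header containing an ill-formed variant name are all assumptions made for the illustration. Fields are split on linefeed only, so a carriage return would remain part of a field.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PG_MAX_VARIANTS 16  /* arbitrary limit chosen for this sketch */

/* Return the field at *cursor, terminate it by overwriting its linefeed
 * with '\0', and advance *cursor. Returns NULL when no fields remain. */
static char *pg_next_field(char **cursor)
{
    char *field = *cursor;
    char *lf;

    if (field == NULL)
        return NULL;
    lf = strchr(field, '\n');
    if (lf != NULL) {
        *lf = '\0';
        *cursor = lf + 1;
    } else {
        *cursor = NULL;  /* this was the last field */
    }
    return field;
}

/* Variant names: printable ASCII, no spaces, no upper case letters. */
static int pg_valid_variant_name(const char *name)
{
    if (*name == '\0')
        return 0;
    for (; *name != '\0'; name++) {
        unsigned char c = (unsigned char)*name;
        if (c <= 0x20 || c > 0x7e || (c >= 'A' && c <= 'Z'))
            return 0;
    }
    return 1;
}

/* Decimal number without sign or leading zeros; returns -1 if malformed. */
static long pg_parse_number(const char *s)
{
    char *end;
    long v;

    if (s[0] < '0' || s[0] > '9' || (s[0] == '0' && s[1] != '\0'))
        return -1;
    v = strtol(s, &end, 10);
    return *end == '\0' ? v : -1;
}

/* Parse the predefined fields of a version 1.0 logical header in place
 * (buf is modified). On success the entries of variants[] point into buf
 * and the number of variants is returned; on error -1 is returned. */
int pg_parse_variants(char *buf, char *variants[PG_MAX_VARIANTS])
{
    char *cur = buf;
    char *field;
    long count, n;

    assert(buf != NULL && variants != NULL);
    field = pg_next_field(&cur);                 /* field 1: magic */
    if (field == NULL || strcmp(field, "@PG@") != 0)
        return -1;
    if (pg_next_field(&cur) == NULL)             /* field 2: version */
        return -1;
    field = pg_next_field(&cur);                 /* field 3: count */
    if (field == NULL || (count = pg_parse_number(field)) < 1)
        return -1;
    field = pg_next_field(&cur);                 /* field 4: n */
    if (field == NULL || (n = pg_parse_number(field)) < 0)
        return -1;
    if (n > PG_MAX_VARIANTS || n + 1 > count)
        return -1;
    for (long i = 0; i < n; i++) {               /* fields 5 to 5+n-1 */
        field = pg_next_field(&cur);
        if (field == NULL || !pg_valid_variant_name(field))
            return -1;
        variants[i] = field;
    }
    return (int)n;
}
```

Called on a writable copy of "@PG@\n1.0\n3\n2\nnormal\nsuicide\n(normally comments here)", the function returns 2 and stores pointers to "normal" and "suicide".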

A note on extensibility

It is recommended that newer versions of the format do not change the definition of predefined fields. Instead it is recommended to add new fields. The design of the format allows one to do this in a way which is invisible to applications supporting only an earlier version of the format.

Currently the logical header is structured like a shallow tree. It is recommended to keep this tree-like format for further versions of the format according to the following Backus-Naur form

<header>        := <magic>\n<version>\n<root-field>[\n<field>]*
<magic>         := @PG@
<version>       := <number>.<number>
<root-field>    := <count>[\n<multi-field>]*
<multi-field>   := [<root-field> | <field>]
<field>         := <string> 
<count>         := <number>

where <string> is assumed to contain no linefeed characters and <count> is the total number of fields contained in the corresponding subtree (but not including the <count> field itself).

A note on UTF-8

Although this specification allows the comment section in the header to contain UTF-8 encoded multi-byte characters (recognizable by the fact that the highest bit in the corresponding bytes is set) it is currently probably best for a widely distributed book to use only the backward compatible 7-bit ASCII subset of UTF-8 (i.e. only single byte characters). Indeed not all current GUIs may be prepared to display multi-byte characters although this is likely to change in the future.

For clarification it should also be pointed out that the header should not contain a byte order mark (BOM) since it breaks compatibility with 7-bit ASCII. But this issue is actually moot since a valid BOM would be at the beginning of the header where normally the magic string is. So a header containing a BOM would be invalid.

A BOM is not necessary since by design UTF-8 has no endianness ambiguity and moreover the official specification for UTF-8 specifies a BOM as optional.
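
A tool that creates books for wide distribution could check both points with something like the following sketch. The function names are hypothetical. The byte sequence EF BB BF is the UTF-8 encoding of the byte order mark U+FEFF.

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero if the data starts with a UTF-8 encoded byte order mark. */
int pg_starts_with_bom(const unsigned char *s, size_t len)
{
    assert(s != NULL);
    return len >= 3 && s[0] == 0xEF && s[1] == 0xBB && s[2] == 0xBF;
}

/* Nonzero if no byte has its highest bit set, i.e. the data uses only the
 * 7-bit ASCII subset of UTF-8. */
int pg_is_7bit_ascii(const unsigned char *s, size_t len)
{
    assert(s != NULL);
    for (size_t i = 0; i < len; i++) {
        if (s[i] & 0x80)
            return 0;  /* part of a multi-byte character */
    }
    return 1;
}
```

A header that starts with a BOM fails the ASCII test as well, and it would not start with the magic string either.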

Example

The following logical header (written as a C string) represents a book supporting normal and suicide chess.

"@PG@\n1.0\n3\n2\nnormal\nsuicide\n(normally comments here)"

In version 1.1 of the format it might perhaps be

"@PG@\n1.1\n4\n2\nnormal\nsuicide\n[somenewfield]\n(normally comments here)"

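The two examples show why the count in field 3 matters: an application that only knows version 1.0 can find the comments in the hypothetical version 1.1 header by skipping as many fields as the count says, without understanding the new field. A minimal sketch of this, with a hypothetical function name pg_comment_start, might look as follows.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a pointer to the first non-predefined field of a logical header
 * (the start of the comments). Returns NULL if the header is malformed or
 * has no fields after the predefined ones. The header is not modified. */
const char *pg_comment_start(const char *header)
{
    const char *p = header;
    char *end;
    long count;

    assert(header != NULL);
    if (strncmp(p, "@PG@\n", 5) != 0)      /* <magic> and its linefeed */
        return NULL;
    p = strchr(p + 5, '\n');               /* linefeed ending <version> */
    if (p == NULL || p[1] < '0' || p[1] > '9')
        return NULL;
    count = strtol(p + 1, &end, 10);       /* <count> of the root field */
    if (*end != '\n')
        return NULL;
    p = end;                               /* linefeed ending <count> */
    /* <count> is the total number of fields in the subtree, so skipping
     * that many linefeed-terminated fields passes all predefined fields,
     * including any that a newer format version has added. */
    for (long i = 0; i < count; i++) {
        p = strchr(p + 1, '\n');
        if (p == NULL)
            return NULL;
    }
    return p + 1;
}
```

For both example headers above the returned pointer refers to "(normally comments here)".
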
Sample code

Here is sample code that adds, displays and deletes headers in Polyglot books. This code may be freely used in GUIs, adaptors and other programs (including closed source ones).

$ ./pgheader -h
pgheader <options> [<file>];
Update a header, adding a default one if necessary
<file>            input file
Options:
-h                print this help 
-l                print the known variant list
-s                print the header
-S                print the header data
-d                delete the header
-v  <variants>    comma separated list of supported variants
-f                force inclusion of unknown variants
-c  <comment>     free format string, may contain newlines encoded as
                  two character strings "\n"

The following command adds a comment to the very widely used Polyglot book "performance.bin" by Marc Lacrosse.

$ ./pgheader performance.bin -c "performance.bin by Marc Lacrosse."

We verify that the header has indeed been added.

$ ./pgheader -s performance.bin
Variants supported:
normal
Comment:
performance.bin by Marc Lacrosse.

Here is the actual header data as shown by "./pgheader -S performance.bin".

    @    P    G    @   \n    1    .    0
   \n    2   \n    1   \n    n    o    r
    m    a    l   \n    p    e    r    f
    o    r    m    a    n    c    e    .
    b    i    n         b    y         M
    a    r    c         L    a    c    r
    o    s    s    e    .   \0   \0   \0
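
The dump shows the logical header followed by null characters that fill up the last record. The reverse step, embedding a logical header in null records, could be sketched as follows. This is an illustration only and not the implementation of pgheader_create_raw; the name pg_embed_header and its interface are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PG_RECORD_SIZE 16
#define PG_KEY_SIZE     8
#define PG_DATA_SIZE    (PG_RECORD_SIZE - PG_KEY_SIZE)

/* Embed a logical header (given as a C string) in a sequence of null
 * records. Returns a buffer allocated with calloc and stores its size in
 * *size, or returns NULL if the allocation fails. The caller frees it. */
unsigned char *pg_embed_header(const char *header, size_t *size)
{
    size_t len, nrec;
    unsigned char *raw;

    assert(header != NULL && size != NULL);
    len = strlen(header) + 1;                        /* keep the null */
    nrec = (len + PG_DATA_SIZE - 1) / PG_DATA_SIZE;  /* round up */
    raw = calloc(nrec, PG_RECORD_SIZE);  /* null keys and null padding */
    if (raw == NULL)
        return NULL;
    /* Byte i of the header goes into the data part of record i / 8. */
    for (size_t i = 0; i < len; i++) {
        size_t rec = i / PG_DATA_SIZE;
        size_t pos = i % PG_DATA_SIZE;
        raw[rec * PG_RECORD_SIZE + PG_KEY_SIZE + pos] = (unsigned char)header[i];
    }
    *size = nrec * PG_RECORD_SIZE;
    return raw;
}
```

For the header shown in the dump (53 characters plus the terminating null) this yields 7 records, i.e. 112 bytes, which would be placed in front of the ordinary records of the book.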

The actual API is contained in the source files pgheader.h and pgheader.c. It provides the following functions.

int pgheader_known_variant(const char *variant);
int pgheader_detect(const char *infile);
int pgheader_create(char **header, const char *variants, const char *comment);
int pgheader_create_raw(char **raw_header, const char *header, unsigned int *size);
int pgheader_parse(const char *header, char **variants, char **comment);
int pgheader_read(char **header, const char *infile);
int pgheader_read_raw(char **raw_header, const char *infile, unsigned int *size);
int pgheader_write(const char *header, const char *infile, const char *outfile);
int pgheader_delete(const char *infile, const char *outfile);
const char * pgheader_strerror(int pgerror);

For instructions about using these functions see the comments in pgheader.h.

Magics

The following magic may be used to recognize the new header on Linux.

#------------------------------------------------------------------------------
# polyglot:  file(1) magic polyglot chess opening book files
#
# From Michel Van den Bergh 

0       string          \x00\x00\x00\x00\x00\x00\x00\x00@PG@\x0a           Polyglot chess opening book
>13     string          1.0\x00\x00\x00\x00\x00\x00\x00\x00\x0a                (version 1.0)
!:mime  application/x-polyglot

It should be appended to /usr/share/file/magic and the latter file should then be recompiled with file -C .
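
For applications that cannot rely on file(1), the same two tests can be written directly in C. The sketch below mirrors the offsets of the magic entry above; the function names are hypothetical. It also shows why the version rule contains eight null bytes: the string "1.0" ends the data part of the first record, so the linefeed that follows it is stored after the null key of the second record.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* First rule: eight null bytes (the key of the first null record) followed
 * by the magic string and a linefeed. */
int pg_has_header_magic(const unsigned char *buf, size_t len)
{
    static const unsigned char magic[13] = {
        0, 0, 0, 0, 0, 0, 0, 0, '@', 'P', 'G', '@', '\n'
    };

    assert(buf != NULL);
    return len >= sizeof magic && memcmp(buf, magic, sizeof magic) == 0;
}

/* Second rule: "1.0" at offset 13, then the null key of the second record,
 * then the linefeed that ends the version field. */
int pg_is_version_1_0(const unsigned char *buf, size_t len)
{
    static const unsigned char tail[12] = {
        '1', '.', '0', 0, 0, 0, 0, 0, 0, 0, 0, '\n'
    };

    return pg_has_header_magic(buf, len)
        && len >= 13 + sizeof tail
        && memcmp(buf + 13, tail, sizeof tail) == 0;
}
```

Both functions only need the first 25 bytes of the file.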