I am currently implementing the Giraffe chess algorithm. Following this paper, I designed a neural network similar to the one proposed by the author and trained it using TD-Leaf(lambda). The procedure that I followed is described in the extract of the paper shown below.
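For reference, the TD-Leaf(lambda) update is usually written in the form below. This is a reconstruction of the standard formulation, not a copy of the extract, so the notation is an assumption: J(x_t, w) is the network's score for the leaf position at step t, alpha is the learning rate, and N is the number of positions in the sequence.

```latex
\Delta w = \alpha \sum_{t=1}^{N-1} \nabla_w J(x_t, w)
           \left[ \sum_{j=t}^{N-1} \lambda^{\,j-t} \, d_j \right],
\qquad
d_j = J(x_{j+1}, w) - J(x_j, w)
```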
The corresponding code, written with PyTorch, is the following (you can find it here).
    from functools import partial

    import chess
    import torch

    # push_random_move, push_move, find_best_move and giraffe_evaluation
    # are helpers defined elsewhere in the project.


    def self_play(batch, net, device, n_moves):
        '''Self play on n_moves of a given game'''
        boards = [chess.Board(b) for b in batch['board']]
        boards = list(map(push_random_move, boards))
        giraffe_move = partial(find_best_move, max_depth=0,
                               evaluator=partial(giraffe_evaluation, net=net, device=device))
        scores = []
        for _ in range(n_moves):
            moves_, scores_ = zip(*map(giraffe_move, boards))
            scores.append(torch.stack(scores_))
            boards = [push_move(board, move) for (board, move) in zip(boards, moves_)]
        # scores = list(zip(*scores))
        scores = torch.stack(scores)
        # Shape (batch, n_moves): one row of successive scores per game
        return torch.t(torch.squeeze(scores))


    def td_loss(scores, td_lamdba):
        L, N = scores.size()
        # Allocate on the same device as the scores to avoid CPU/GPU mismatches
        err = torch.zeros((L, N), device=scores.device)
        for t in range(N-1):
            discount = 1
            err_t = torch.zeros(L, device=scores.device)
            for j in range(t, N-1):
                dj = scores[:, j+1] - scores[:, j]
                # discount is updated before use, so d_j is weighted by
                # td_lamdba ** (j - t + 1)
                discount *= td_lamdba
                err_t += discount * dj
            err[:, t] = err_t
        # we include a minus sign because the optimizer performs gradient descent
        # by default, but we want to impose a custom update rule for the weights
        loss = torch.mean(torch.sum(-scores * err.detach(), dim=1))
        return loss


    def self_learn(batch, net, device, n_moves, optimizer):
        scores = self_play(batch, net, device, n_moves)
        loss = td_loss(scores, 0.7)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()
The neural net receives an encoded chess board as input and outputs a score between -1 (Black wins) and 1 (White wins). The problem is that, when I train the net using TD-Leaf, it learns to evaluate the board as 1 at every time step of the game, even at the very start of the game (which should ideally be evaluated as 0). On reflection, however, this result is not surprising: most of the moves played during the self-learning process are not endgame moves, so no true score is available for them. This means that even if the neural net always outputs 1, its successive predictions agree with each other and it is rarely penalized.
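The reasoning above can be checked on a toy example. The following is a minimal pure-Python sketch (no PyTorch), where the function name `td_errors` and the sample score sequences are hypothetical and chosen only for illustration. It computes the lambda-weighted sum of temporal differences for each step and compares a constant prediction with a sequence whose last score is assumed to be a true game result.

```python
def td_errors(scores, td_lambda):
    """For each step t, return sum over j >= t of
    td_lambda ** (j - t) * (scores[j + 1] - scores[j])."""
    n = len(scores)
    errors = [0.0] * n  # the last step has no successor, so its error stays 0
    # Work backwards using e_t = d_t + td_lambda * e_{t+1}
    for t in range(n - 2, -1, -1):
        d_t = scores[t + 1] - scores[t]
        errors[t] = d_t + td_lambda * errors[t + 1]
    return errors


# Hypothetical case 1: the net outputs 1 for every position and the
# sequence never reaches a finished game.
constant = td_errors([1.0, 1.0, 1.0, 1.0], td_lambda=0.7)

# Hypothetical case 2: same predictions, but the last position is a win
# for Black and is scored with the true result (-1) instead.
anchored = td_errors([1.0, 1.0, 1.0, -1.0], td_lambda=0.7)

print(constant)
print(anchored)
```

In the first case every temporal difference is zero, so the loss carries no gradient and a constant output is a fixed point of the update. In the second case the earlier steps all receive a negative error, which suggests that only sequences ending in a position with a known outcome can pull the predictions away from a constant.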
I would like to know what you think about this, since I am running out of ideas.
NB: Before using TD-Leaf, I pretrained my neural net using the Stockfish engine so that training would not start from a random weight initialization.