I would like to use reinforcement learning to make the engine improve by playing against itself. I have been reading about the topic but I am still quite confused.

Be warned: Reinforcement learning is a large complex subject. Although it might take you on a detour from game-playing bots, you may want to study RL basics. A good place to start is Sutton & Barto *Reinforcement Learning: An Introduction*

What other reward is there in a game a part the win-or-lose output (1 or 0)?

Depending on you game, that is usually it. Actually for a win/draw/lose game like chess then the reward from each action is 0 except for winning (+1) or losing (-1) at the end. In a zero-sum game then this aligns nicely to minimax, alphabeta pruning etc.

Reinforcement Learning is intended to address environments with delayed rewards. Adding "helper" rewards for interim non-goals is usually counter-productive.

If I use other rewards, like the output from the evaluation function at each turn, how can I implement it?

Typically you don't. What applying self-playing RL will do is learn a *return* (sometimes called *utility*) function which predicts the expectation of your total +1/0/-1 reward by the end of the game. You would use this in place of your current heuristic for minimax search. Or, potentially you would adjust your current heuristic function to output in the same range and use RL to optimise its weights to make the best approximation to the true optimal play return function (which is probably too complex to calculate exactly).

How do I modify the evaluation function to give better rewards iteration after iteration?

That is what the different RL approaches all attempt to do, there are a variety of different solvers. There is no short way to explain it. You could start with a simple method such as Q-Learning. Q-Learning learns estimates of Q(s,a) (called the action value) which is the expected return when in state s and taking action a, and thereafter following an optimal policy. It makes an arbitrary guess to start with and refines it closer to the true value with each step made in the learning environment. Simple tabular Q-learners do this refinement simply by storing a big table of all states and actions with the best estimate so far of the true value, and averaging in each new estimate as it is experienced.

It is also possible to combine an RL method for heuristics with look-ahead minimax search - that is what the original AlphaGo did, and what AlphaGo Zero does during training. This is a powerful approach because minimax search will work to double-check the RL-generated heuristics. Although for simple enough games, RL can learn perfect heuristics and you would only need local search (what the next move should be).

Unless your game is very simple (all possible states would fit in memory), you will need some kind of function approximator inside the RL algorithm. Neural networks are a standard choice. Having *something* for that part is unavoidable - although another good choice is to define a bunch of proxy features (that you might use to construct a heuristic by hand) and use a linear approximator - just a weighted sum of all the features. This can work well enough, and has been used for example in checkers (draughts) players trained using RL.

In fact, provided your own heuristic function is not too unusual, you can probably treat it just like a linear approximator and use RL to learn the best weights for it.

"Reinforcement Learning is intended to address environments with delayed rewards. Adding "helper" rewards for interim non-goals is usually counter-productive." I would like to note that there's a paper which attempts to solve the problem of sparse rewards by introducing intermediary goals "Hindsight Experience Replay".

– nbro – 2018-11-18T13:15:55.8731@nbro: There are

lotsof attempts to solve sparse rewards, it's a big open-ended question in RL, one way to increase the challenge of a problem is to make rewards more sparse. Eligibility traces are another attempt, Hierarchical RL is another promising area . . . I don't think I want to add these techniques to the answer here though, as it is more about feasibility for OP's problem and an intro to the subject – Neil Slater – 2018-11-18T13:26:13.227