Which Reinforcement Learning algorithms are efficient for episodic problems?



I have some episodic datasets extracted from a turn-based RTS game, in which the current action leading to the next state does not by itself determine the final outcome of the episode.

Learning is expected to terminate at a final state/termination condition (when the agent wins or loses) for each episode, and then move on to the next episode in the dataset.

I have been looking into Q-learning, Monte Carlo and SARSA, but I am confused about which one is best applicable.

If any of the mentioned algorithms is implemented, can a reward of zero be given in the preliminary states of each episode, with a positive/negative (win/loss) value given only at the termination state?


Posted 2018-01-13T03:48:19.280

Reputation: 33



When applying on-policy techniques like SARSA, one needs control over a simulator. If one can only access an episodic dataset, the only choice is Q-learning or off-policy Monte Carlo (or off-policy methods in general).

Can a reward of zero be given in the preliminary states of each episode, with a positive/negative (win/loss) value given only at the termination state?

With regard to the above question, the answer is yes. The task would then be a sparse-reward task, with a reward occurring only on the last transition. The issue one faces in a sparse-reward task is slow convergence (or even a lack of convergence).
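To make this concrete, here is a minimal sketch (not from the original answer) of tabular Q-learning run over a logged episodic dataset where every intermediate reward is zero and only the terminal transition carries +1/-1. The state names, action set, and dataset layout are illustrative assumptions.

```python
from collections import defaultdict

def q_learning_from_dataset(episodes, alpha=0.1, gamma=0.99, actions=(0, 1)):
    """Offline tabular Q-learning over logged (s, a, r, s_next) transitions.

    s_next is None on the terminal transition, so the target is just the
    terminal reward; all earlier transitions bootstrap from Q of the next state.
    """
    Q = defaultdict(float)  # Q[(state, action)] -> value
    for episode in episodes:
        for (s, a, r, s_next) in episode:
            if s_next is None:  # terminal: no bootstrapping
                target = r
            else:
                target = r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q

# Toy dataset: reward is 0 everywhere except the final transition.
episodes = [
    [("s1", 0, 0.0, "s2"), ("s2", 1, 0.0, "s3"), ("s3", 0, +1.0, None)],  # win
    [("s1", 1, 0.0, "s2"), ("s2", 0, 0.0, "s3"), ("s3", 1, -1.0, None)],  # loss
]
Q = q_learning_from_dataset(episodes)
```

Repeated passes over the dataset are needed for the terminal reward to bootstrap back to the earlier states.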

Some guidelines for tackling sparse-reward tasks are as follows:

  1. Monte Carlo and n-step Q-learning are preferred over one-step Q-learning/SARSA.

    Consider the 10-step chain MDP, where the only reward is +1 on the transition from s10 to END (figure: start → s1 → s2 → … → s10 → END).

    Let the first training episode be start → s1 → s2 → … → s10 → END. A one-step Q-learning update makes no meaningful change to states s1 through s9, because the Q-value of each next state is still its arbitrary initial value. The only state with a meaningful update is s10.

    However, with n-step Q-learning or a Monte-Carlo-based update, the Q-values of all the states are updated meaningfully, since the reward at the end of the episode propagates back to every state in the episode.

    n-step Q-learning can be ideal, since by adjusting the value of n one can trade off the benefits of Monte-Carlo methods (described above) against those of one-step Q-learning (lower variance).

  2. The use of pseudo/auxiliary rewards (reward shaping).

    This is not necessarily recommended, since adding new reward structures can cause unintended behaviour. On the flip side, it can lead to faster convergence.

    A simple example is as follows: consider a game of chess, where the only reward comes at the end of the game. Since chess episodes are very long, one can instead introduce the following reward structure:

    • +100 for winning the game
    • +1 for capturing any piece on the board

    Hence, the pseudo-reward may provide some direction for learning. Note the different scales of the two rewards (100:1). This is necessary because the primary goal of the task should remain winning the match, not capturing pieces.
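The propagation argument from the chain-MDP example in point 1 can be demonstrated with a small sketch (illustrative, not from the original answer). Each chain state is treated as having a single action, so the Q-table reduces to a value table; after one episode, a one-step update only moves s10, while a Monte-Carlo (full-return) update moves every state.

```python
GAMMA = 1.0
ALPHA = 0.5
states = [f"s{i}" for i in range(1, 11)]  # s1 .. s10; END is implicit
# Reward 0 everywhere except the final transition (s10 -> END).
episode = [(s, 0.0) for s in states[:-1]] + [("s10", 1.0)]

# One-step (TD/Q-learning-style) update: bootstrap from the next state's value.
V_td = {s: 0.0 for s in states}
for i, (s, r) in enumerate(episode):
    v_next = V_td[states[i + 1]] if i + 1 < len(states) else 0.0
    V_td[s] += ALPHA * (r + GAMMA * v_next - V_td[s])

# Monte-Carlo update: move each state toward the full return from that state.
V_mc = {s: 0.0 for s in states}
G = 0.0
for (s, r) in reversed(episode):
    G = r + GAMMA * G
    V_mc[s] += ALPHA * (G - V_mc[s])

print(V_td["s1"], V_td["s10"])  # only s10 has moved after one episode
print(V_mc["s1"], V_mc["s10"])  # every state has moved
```

An n-step update would interpolate between these two extremes: states within n steps of the terminal reward receive meaningful updates in the first episode.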


Posted 2018-01-13T03:48:19.280

Reputation: 26