What does it indicate when the average reward per episode is unstable during training?
If there is a large gap between the average reward per episode during training and the final reward in the test phase, what can we conclude?
For instance, in DeepMind's Atari paper, what do the left-hand plots of Figure 2 show? What is the difference between the left-hand and right-hand plots of Figure 2?
Figure 2: The two plots on the left show average reward per episode on Breakout and Seaquest respectively during training. The statistics were computed by running an ε-greedy policy with ε = 0.05 for 10000 steps. The two plots on the right show the average maximum predicted action-value of a held out set of states on Breakout and Seaquest respectively. One epoch corresponds to 50000 minibatch weight updates or roughly 30 minutes of training time.
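To make the two quantities in the caption concrete, here is a minimal sketch of how I understand them, not the paper's actual code. `ToyEnv`, `toy_q_fn`, and the `reset`/`step` interface are hypothetical stand-ins I made up for illustration; in the paper the Q-function would be the trained network and the environment an Atari game.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def average_reward_per_episode(env, q_fn, epsilon=0.05, num_steps=10000, seed=0):
    """Left-hand metric: run the epsilon-greedy policy for num_steps steps
    and average the total reward of the episodes that finished."""
    rng = random.Random(seed)
    state = env.reset()
    episode_return, returns = 0.0, []
    for _ in range(num_steps):
        action = epsilon_greedy_action(q_fn(state), epsilon, rng)
        state, reward, done = env.step(action)
        episode_return += reward
        if done:
            returns.append(episode_return)
            episode_return = 0.0
            state = env.reset()
    return sum(returns) / len(returns) if returns else 0.0


def average_max_q(q_fn, held_out_states):
    """Right-hand metric: mean of max_a Q(s, a) over a fixed set of states."""
    return sum(max(q_fn(s)) for s in held_out_states) / len(held_out_states)


class ToyEnv:
    """Hypothetical environment: every episode lasts 5 steps,
    action 1 yields reward 1 and action 0 yields reward 0."""

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, float(action == 1), self.t >= 5


def toy_q_fn(state):
    """Hypothetical Q-function that always prefers action 1."""
    return [0.0, 1.0]


if __name__ == "__main__":
    print(average_reward_per_episode(ToyEnv(), toy_q_fn))
    print(average_max_q(toy_q_fn, held_out_states=[0, 1, 2, 3]))
```

If this reading is right, the left-hand number depends on the actual rewards collected while playing (so it can jump around when small weight changes alter which states the policy visits), whereas the right-hand number only depends on the network's own predictions on a fixed set of states.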