I tried this on the OpenAI Gym environment LunarLander-v2. I wrote two versions of the algorithm with just one difference:
- Made it learn on each step.
- Made it learn at the end of each episode.
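To make the difference concrete, here is a minimal sketch of the two update schedules. The environment and agent are dummies, not my actual LunarLander code, and names like `agent.learn` are placeholders for whatever update routine the algorithm uses:

```python
import random

class DummyEnv:
    """Stand-in for a gym environment: 10 steps per episode, random rewards."""
    def reset(self):
        self.t = 0
        return 0.0
    def step(self, action):
        self.t += 1
        done = self.t >= 10
        return 0.0, random.random(), done

class DummyAgent:
    """Stand-in agent; `learn` just counts how many transitions it has seen."""
    def __init__(self):
        self.updates = 0
    def act(self, obs):
        return 0
    def learn(self, batch):
        self.updates += len(batch)

def run_episode(env, agent, learn_every_step):
    obs = env.reset()
    transitions = []
    done = False
    while not done:
        action = agent.act(obs)
        next_obs, reward, done = env.step(action)
        transitions.append((obs, action, reward, next_obs, done))
        if learn_every_step:
            # Variant 1: update immediately on the latest transition.
            agent.learn(transitions[-1:])
        obs = next_obs
    if not learn_every_step:
        # Variant 2: update once, on the whole episode, after it ends.
        agent.learn(transitions)
    return agent.updates
```

Both variants see the same transitions; only *when* the updates happen differs.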
There is a significant difference in performance between the two. The first one started reaching around 200 reward after 3000 episodes, whereas the second is still only getting around -50 reward after 8000 episodes. Incidentally, 3000 episodes of the first algorithm took about the same run time as 8000 episodes of the second.
I tried the same experiment on CartPole. The results were the same initially, but after some time the first algorithm's rewards dropped and kept fluctuating, while the second surpassed it and was more stable. I am not sure if this is supposed to happen. Also, I am not doing any normalization on either the observations or the rewards. Should I be, and if so, how?
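For the normalization part, is something like this what is usually meant: keep running statistics of the observations and standardize them before feeding them to the network? A sketch of what I have in mind (`RunningNorm` is my own name, not a Gym API; it uses Welford's online algorithm):

```python
import numpy as np

class RunningNorm:
    """Running mean/variance of observations via Welford's online algorithm."""
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)   # sum of squared deviations from the mean
        self.count = 0
        self.eps = eps

    def update(self, obs):
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (obs - self.mean)

    def normalize(self, obs):
        var = self.m2 / max(self.count - 1, 1)
        return (obs - self.mean) / np.sqrt(var + self.eps)
```

The idea would be to call `update(obs)` on every observation from `env.step`, and pass `normalize(obs)` to the agent instead of the raw observation. Is that the right approach, and should rewards get the same treatment?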