I can't understand what the problem is with applying value iteration in the reinforcement learning setting (where we don't know the reward function or the transition probabilities). In one of the lectures, the lecturer said it has to do with not being able to take a max with samples.
Following on from this, why does Q-learning solve the problem? In both algorithms we take a max over actions only, so what is the big breakthrough with Q-learning?
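For reference, here are the two updates as I understand them (standard notation; please correct me if I've misstated either one):

```latex
% Value iteration: the max is over the full one-step expectation,
% which requires knowing P and R.
V(s) \leftarrow \max_a \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma V(s') \bigr]

% Q-learning: given a single sampled transition (s, a, r, s'),
% the max is taken over the stored Q-values at the next state.
Q(s, a) \leftarrow Q(s, a) + \alpha \bigl[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \bigr]
```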
Lecture link: https://www.youtube.com/watch?v=ifma8G7LegE&feature=youtu.be&t=3431 (The lecturer says we don't know how to do maxes with samples; what does that mean?)