How Q-learning solves the issue with value iteration in model-free settings



  1. I can't understand what the problem is with applying value iteration in the reinforcement learning setting (where we don't know the reward and transition probabilities). In one of the lectures, the lecturer said it has to do with not being able to take a max with samples.

  2. Following on from this, why does Q-learning solve it? In both cases we take a max over actions. What is the big breakthrough with Q-learning?

Lecture Link: (The guy says we don't know how to do maxes with samples, what does that mean?)

Abhishek Bhatia

Posted 2016-10-30T23:56:59.217

Reputation: 354



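Standard value iteration needs the model, i.e. the transition probabilities $P(s' \mid s, a)$ (and the rewards), because each backup takes a max over actions of an expectation over next states. Q-learning instead updates from a single sampled transition, using only the observed reward and the Q values already stored: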

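Q value update (the standard Q-learning rule):

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

where $\alpha$ is the learning rate, $\gamma$ the discount factor, $r$ the observed reward, and $s'$ the observed next state.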
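To make the contrast concrete, here is a minimal Python sketch, assuming a small tabular problem stored in NumPy arrays. The function names, the array layout (`P[s, a, s_next]`, `R[s, a]`, `Q[s, a]`) and the parameters `alpha` and `gamma` are my own choices for illustration, not something from the lecture. The first function cannot run without the model; the second only needs one observed transition.

```python
import numpy as np

def value_iteration_backup(V, P, R, gamma):
    """One sweep of value iteration. Requires the model:
    P[s, a, s_next] = P(s_next | s, a) and R[s, a] = expected reward."""
    # Expectation over next states comes first (needs P), then the max.
    Q = R + gamma * (P @ V)      # shape (n_states, n_actions)
    return Q.max(axis=1)

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """One Q-learning step from a single sampled transition (s, a, r, s_next).
    No transition probabilities or reward function are needed."""
    td_target = r + gamma * Q[s_next].max()   # max over stored Q values only
    Q[s, a] += alpha * (td_target - Q[s, a])  # move the estimate toward the target
    return Q
```

In the value iteration backup the max sits outside the expectation, so a single sample of the next state is not enough to compute it. In the Q-learning step the max is applied to the stored table and the averaging over next states happens implicitly across many updates.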

The relation between the value function $V(s)$ and the Q function $Q(s, a)$ is that $V(s)$ is simply the Q value of the best action in state $s$, that is, $V(s) = \max_a Q(s, a)$.
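As a small illustration of that relation, assuming the Q function is stored as a table with one row per state and one column per action (the numbers below are made up), the value function and the greedy policy can be read straight off the table:

```python
import numpy as np

# Hypothetical Q table: 3 states (rows) x 2 actions (columns).
Q = np.array([[0.0, 1.5],
              [2.0, 0.5],
              [0.0, 0.0]])

V = Q.max(axis=1)                 # V(s) = max_a Q(s, a)
greedy_action = Q.argmax(axis=1)  # action achieving that max in each state
```

Note that no model is needed here either: once Q has been learned, both the values and the policy come from a max over the stored entries.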


Posted 2016-10-30T23:56:59.217

Reputation: 335