## Is the Q value updated at every episode?


I'm trying to understand the Bellman equation for updating the Q-table values. The initial update of a value is clear to me; what is unclear is how the value is updated afterwards. Is the value replaced with each episode? It doesn't seem like that would learn from the past. Should the value from the previous episode perhaps be averaged with the existing value?

Not specifically from the book. I'm using the equation

$$V(s) = \max_a(R(s, a) + \gamma V(s')),$$

where $$\gamma$$ is the learning rate. $$0.9$$ will encourage exploration, and $$0.99$$ will encourage exploitation. I'm working with a simple $$3 \times 4$$ matrix from a YouTube video.


I think you are mixing up the update function and the target.

The equation you have there, which is also what the video uses, estimates the true value of a given state. In temporal-difference (TD) algorithms this quantity is called the TD target.

The reason for your confusion might be that the video starts from the end state and works backwards, using that formula to get the final value of each state. That is not how you update the values, though; it is where you want to end up after iterating through the states.

The update formula can take several forms depending on the algorithm. For TD(0), a simple one-step look-ahead method in which what is being evaluated is the state (as in your case), the update function is:

$$V(s) = (1 - \alpha) V(s) + \alpha \left(R(s, a) + \gamma V(s')\right),$$

where $$\alpha$$ is the learning rate. It controls how much of your current estimate you change on each update. So the value is not replaced: you keep $$1 - \alpha$$ of the existing value and add $$\alpha$$ times the TD target, which is the reward for the current state plus the discounted estimate of the value of the next state. Typical values for $$\alpha$$ are around 0.1 to 0.3, for example.
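To make the blending concrete, here is a minimal sketch of that update in Python. The function name `td0_update`, the two-state dictionary, and the numbers are made up for illustration; they are not from the video.

```python
def td0_update(V, s, reward, s_next, alpha=0.1, gamma=0.9):
    """Blend the old estimate V[s] with the one-step TD target."""
    td_target = reward + gamma * V[s_next]          # R(s, a) + gamma * V(s')
    V[s] = (1 - alpha) * V[s] + alpha * td_target   # keep (1 - alpha) of the old value
    return V[s]


# Hypothetical two-state example: state "A" leads to state "B" with reward 0.
V = {"A": 0.0, "B": 1.0}

for visit in range(1, 4):
    td0_update(V, "A", reward=0.0, s_next="B")
    print(f"visit {visit}: V(A) = {V['A']:.4f}")
```

Each visit moves `V["A"]` a fraction `alpha` of the way towards the target (0.9 here), so earlier experience is never thrown away, only gradually outweighed.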

The estimate will slowly converge to the true value of the state, which is given by your equation: $$V(s) = \max_a(R(s, a) + \gamma V(s')).$$
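As an illustration of the values you are trying to reach, the sketch below applies your equation repeatedly to a made-up deterministic corridor of four states (not the $$3 \times 4$$ grid from the video). The environment, the `step` helper, and the reward of 1 for entering the last state are all assumptions chosen to keep the example small.

```python
GAMMA = 0.9
N_STATES = 4      # states 0..3 in a row
TERMINAL = 3      # entering state 3 gives reward 1 and ends the episode


def step(s, a):
    """Deterministic move: a is -1 (left) or +1 (right). Returns (next_state, reward)."""
    s_next = min(max(s + a, 0), N_STATES - 1)
    reward = 1.0 if s_next == TERMINAL else 0.0
    return s_next, reward


V = [0.0] * N_STATES
for sweep in range(50):
    for s in range(N_STATES):
        if s == TERMINAL:
            continue  # the terminal state keeps value 0
        # V(s) = max_a (R(s, a) + gamma * V(s'))
        V[s] = max(r + GAMMA * V[s2] for s2, r in (step(s, a) for a in (-1, 1)))

print([round(v, 4) for v in V])
```

The state next to the goal ends up with value 1, and each step further away multiplies that by $$\gamma$$, which is the same pattern the video produces by working backwards from the end state.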

Also, $$\gamma$$ is actually the discount factor applied to future states, as the video you referenced says. It determines how much importance you give to future rewards. If $$\gamma = 0$$, you only care about the reward in your current state when evaluating it (this is not what is used in practice). At the other extreme, if $$\gamma = 1$$, you give as much weight to a reward received in a state 5 steps ahead as to one received in the current state. With an intermediate value you give some importance to future rewards, but not as much as to the present one. A reward received in a state $$n$$ steps in the future is scaled by $$\gamma^n$$.
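A quick sketch of that $$\gamma^n$$ weighting, using a hypothetical reward sequence in which the only reward arrives 5 steps ahead (the helper name `discounted_return` is made up for this example):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**n * reward, where n is how many steps ahead the reward arrives."""
    return sum(gamma ** n * r for n, r in enumerate(rewards))


# Hypothetical sequence: nothing for 5 steps, then a reward of 1.
rewards = [0, 0, 0, 0, 0, 1]

for gamma in (0.0, 0.9, 1.0):
    print(f"gamma = {gamma}: return = {discounted_return(rewards, gamma):.5f}")
```

With $$\gamma = 0$$ the distant reward counts for nothing, with $$\gamma = 1$$ it counts in full, and with $$\gamma = 0.9$$ it is scaled by $$0.9^5$$.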

Another thing I would correct is that the exploration-exploitation balance is not related to $$\gamma$$ in any way. It is normally controlled by the policy, for example $$\epsilon$$-greedy. Under that policy a fraction $$\epsilon$$ of the actions you take are random, which in turn makes you explore states with lower estimated value.
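For reference, a minimal sketch of $$\epsilon$$-greedy action selection. It assumes you have a list of estimated values, one per action, for the current state; the function name and the example numbers are hypothetical.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the best-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))                          # explore
    return max(range(len(action_values)), key=lambda a: action_values[a])  # exploit


# Hypothetical value estimates for four actions in some state.
action_values = [0.1, 0.5, 0.2, 0.0]
rng = random.Random(0)  # seeded only so that repeated runs behave the same

picks = [epsilon_greedy(action_values, epsilon=0.1, rng=rng) for _ in range(1000)]
print("share of greedy picks:", picks.count(1) / len(picks))
```

Note that $$\gamma$$ does not appear anywhere in the selection step; raising or lowering `epsilon` is what shifts the balance between exploring and exploiting.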