## Reinforcement Learning (Fitted Q): Questions on Concept & Implementation


I hope to get some clarifications on Fitted Q-Learning ('FQL').

### My Research So Far

I've read Sutton's book (specifically, chapters 6 to 10), Ernst et al., and this paper.

I know that $Q^*(s,a)$ expresses the expected return of first taking action $a$ from state $s$ and then following the optimal policy forever.
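
For concreteness, the way I picture this (writing $s'$ for the next state) is the Bellman optimality equation:

$$Q^*(s,a) = \mathbb{E}\left[ r(s,a) + \gamma \max_{a'} Q^*(s', a') \right]$$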

I have tried my best to understand function approximation in large state spaces and $n$-step TD.

### My Questions

(1) Concept - Can someone explain the intuition behind how iteratively extending $N$ from 1 until a stopping condition is reached achieves optimality (Section 3.5 of Ernst et al.)? I have difficulty wrapping my mind around how this ties in with the basic definition of $Q^*(s,a)$ that I stated above.
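
For reference, the iteration I am asking about is, as I read it, the recursion

$$\hat{Q}_N(s,a) \approx r + \gamma \max_{a'} \hat{Q}_{N-1}(s', a'), \qquad \hat{Q}_0 \equiv 0,$$

fit by regression over the four-tuples $(s, a, r, s')$.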

(2) Implementation - Ernst et al. give the pseudo-code for the tabular form. But if I try to implement the function approximation form, is this correct? (A minimal code sketch of my understanding follows the list.)

Repeat until the stopping conditions are reached:
- N ← N + 1

- Build the training set TS based on the function $\hat{Q}_{N-1}$ and on the full set of four-tuples F: the inputs are the pairs $(s_t, a_t)$ and the labels are $r_t + \gamma \max_{a} \hat{Q}_{N-1}(s_{t+1}, a)$

- Train the regression algorithm on TS, yielding $\hat{Q}_N$

- Use the trained model to predict $\hat{Q}_N(s_{t+1}, a)$ for every next state in F and every action $a$

- Create the TS for the next iteration by updating the labels: new reward plus ($\gamma$ * max of the predicted values)
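
To make this concrete, here is a minimal Python sketch of my understanding, using scikit-learn's ExtraTreesRegressor (Ernst et al. use extremely randomized trees). The function name, the toy data layout, the hyperparameters, and the fixed iteration count standing in for a proper stopping condition are all my own assumptions:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(s, a, r, s_next, n_actions, gamma=0.95, n_iterations=50):
    """Fitted Q iteration over four-tuples (s, a, r, s_next).

    s, s_next: (n_samples, state_dim) arrays
    a: (n_samples,) int array of discrete actions
    r: (n_samples,) float array of rewards
    Terminal-state handling is omitted for brevity.
    """
    X = np.column_stack([s, a])  # regression inputs: (state, action) pairs
    y = r.copy()                 # Q_1 labels: the immediate reward alone
    model = None
    for _ in range(n_iterations):
        # Fit Q_N on the current training set (X, y).
        model = ExtraTreesRegressor(n_estimators=50)
        model.fit(X, y)
        # Predict Q_N(s_next, a') for every discrete action a' ...
        q_next = np.column_stack([
            model.predict(np.column_stack([s_next, np.full(len(s_next), a_)]))
            for a_ in range(n_actions)
        ])
        # ... and build the labels for the next iteration:
        # reward plus gamma times the best predicted next value.
        y = r + gamma * q_next.max(axis=1)
    return model
```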


I am just starting to learn RL as part of my course, so there are many gaps in my understanding. I hope to get some kind guidance. Thanks in advance!

(2): I think that's right. When you build the training set, use the greedy actions under the $Q_{N-1}$ model, i.e. for each next state take the maximum of its predictions over the actions. That maximum is an approximation of the return from starting in that state and acting optimally for $N-1$ steps, so the new label (reward plus $\gamma$ times it) approximates the optimal $N$-step return. Then you're learning an approximation of $Q_N$ from that, which looks right.
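
Concretely, with Ernst et al.'s four-tuples $(s_t, a_t, r_t, s_{t+1})$, the label at iteration $N$ is

$$y_t^N = r_t + \gamma \max_{a'} \hat{Q}_{N-1}(s_{t+1}, a'),$$

and $\hat{Q}_N$ is fit by regressing $y_t^N$ on the inputs $(s_t, a_t)$. As for (1), my intuition is that $\hat{Q}_N$ approximates the optimal expected return over an $N$-step horizon, and the discounted return beyond $N$ steps shrinks like $\gamma^N$, so the iterates approach $Q^*$ as $N$ grows. That is why iterating until a stopping condition gets you (approximate) optimality.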