
I hope to get some clarifications on Fitted Q-Learning ('FQL').

**My Research So Far**

I've read Sutton's book (specifically, chapters 6 to 10), Ernst et al., and this paper.

I know that `Q*(s,a)` expresses the expected value of first taking action `a` from state `s` and then following the optimal policy forever.
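For concreteness, that definition can also be written as the Bellman optimality equation (my own restatement, not a quote from Ernst et al.), which is the fixed point that the iteration in question (1) is chasing:

$$
Q^*(s,a) \;=\; \mathbb{E}\Big[\, r(s,a) \;+\; \gamma \,\max_{a'} Q^*(s',a') \,\Big]
$$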

I tried my best to understand function approximation in large state spaces and TD(n).

**My Questions**

(1) Concept - Can someone explain the intuition behind how iteratively extending N from 1 until a stopping condition is reached achieves optimality (Section 3.5 of Ernst et al.)? I have difficulty wrapping my mind around how this ties in with the basic definition of `Q*(s,a)` that I stated above.

(2) Implementation - Ernst et al. give the pseudo-code for the tabular form. But if I try to implement the *function-approximation* form, is this correct:

```
Initialise Q̂_0 to zero everywhere and N to 0
Repeat until stopping conditions are reached:
- N ← N + 1
- Build the training set TS from the full set of four-tuples F, using Q̂_{N-1}:
  inputs are (s, a); labels are r + gamma * max over a' of Q̂_{N-1}(s', a')
- Train the regression algorithm on TS; the trained model is Q̂_N
- On the next pass, use Q̂_N to predict on the (s', a') pairs and recompute the labels
```
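To check my own understanding, here is a minimal runnable sketch of the loop above. Everything problem-specific is my own assumption: the toy chain MDP, the one-hot featurisation, and the fixed 50-iteration budget as the stopping condition. Ernst et al. use extremely randomised trees as the regressor; I substitute plain least-squares regression so the example needs only numpy.

```python
import numpy as np

# Hypothetical toy problem (my assumption, not from Ernst et al.):
# a 1-D chain with states 0..4 and actions {0: left, 1: right};
# reward 1 only for moving right from state 3 into terminal state 4.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.9

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if (s == 3 and a == 1) else 0.0
    return r, s2, s2 == n_states - 1  # (reward, next state, terminal?)

# Collect a fixed set F of four-tuples (s, a, r, s') with a random policy.
F = []
for _ in range(200):
    s, a = rng.integers(0, n_states - 1), rng.integers(0, n_actions)
    r, s2, done = step(s, a)
    F.append((s, a, r, s2, done))

def phi(s, a):
    # One-hot features over (state, action) pairs: effectively a table,
    # but fit with the same regression machinery a real approximator uses.
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

X = np.array([phi(s, a) for s, a, *_ in F])
w = np.zeros(n_states * n_actions)  # Q̂_0 = 0 everywhere

for N in range(1, 51):  # fixed iteration budget as the stopping condition
    # Labels from Q̂_{N-1}: y = r + gamma * max_a' Q̂_{N-1}(s', a')
    y = np.array([
        r + (0.0 if done else gamma * max(phi(s2, a2) @ w
                                          for a2 in range(n_actions)))
        for s, a, r, s2, done in F
    ])
    # "Train algo on TS": least-squares regression of y on the features
    w, *_ = np.linalg.lstsq(X, y, rcond=None)

Q = w.reshape(n_states, n_actions)
greedy = Q.argmax(axis=1)  # greedy policy: should point right along the chain
```

Note that steps 4 and 5 of my pseudo-code collapse into the single label computation inside the loop: predicting with the current model and relabelling *is* how the next training set is built.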

I am just starting to learn RL as part of my course, so there are many gaps in my understanding. I hope to get some kind guidance. Thanks in advance!