I hope to get some clarification on Fitted Q Iteration ('FQI'), the algorithm of Ernst et al.
My Research So Far
I know that Q*(s,a) expresses the expected return of first taking action a from state s and then following the optimal policy forever.
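For concreteness, here is how I understand that definition written out. This is my own rendering of the standard Bellman optimality equation, not a formula copied from the paper:

```latex
Q^*(s,a) \;=\; \mathbb{E}\!\left[\, r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \;\middle|\; s_t = s,\; a_t = a \right]
```

i.e. the immediate reward plus the discounted value of acting optimally from the next state onward.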
I have tried my best to understand function approximation in large state spaces and n-step TD.
(1) Concept - Can someone explain the intuition behind how iteratively extending N from 1 until a stopping condition is reached achieves optimality (Section 3.5 of Ernst et al.)? I have difficulty seeing how this ties in with the basic definition of Q*(s,a) that I stated above.
(2) Implementation - Ernst et al. give the pseudo-code for the tabular form. If I try to implement the function-approximation form, is this correct:
Repeat until stopping conditions are reached:
- N ← N + 1
- Build the training set TS based on the function Q̂_{N−1} and on the full set of four-tuples F
- Train the algorithm on TS
- Use the trained model to predict on TS itself
- Create the TS for the next N by updating the labels: new reward plus (gamma * predicted values)
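To make my question concrete, here is a minimal sketch of the loop I described, in Python. The toy data, the fixed number of iterations, and the use of scikit-learn's `ExtraTreesRegressor` are all my own assumptions for illustration (Ernst et al. happen to use extremely randomized trees, but any regressor would fit the pattern):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Hypothetical batch of four-tuples F = (s, a, r, s'): 1-D states, actions in {0, 1}
rng = np.random.default_rng(0)
n = 200
S = rng.uniform(-1, 1, size=(n, 1))            # states
A = rng.integers(0, 2, size=(n, 1))            # actions
R = -np.abs(S - (2 * A - 1) * 0.5).ravel()     # toy rewards
S2 = np.clip(S + 0.1 * (2 * A - 1), -1, 1)     # toy next states
gamma = 0.9
actions = np.array([0, 1])

X = np.hstack([S, A])                          # regressor inputs: (state, action) pairs
y = R.copy()                                   # Q̂_1 labels are just the rewards
for _ in range(20):                            # N = 1 .. 20 (fixed horizon, no stopping test)
    model = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X, y)
    # Predict Q̂_{N-1}(s', a') for every next state and every candidate action
    q_next = np.column_stack([
        model.predict(np.hstack([S2, np.full((n, 1), a)])) for a in actions
    ])
    # Updated labels for the next iteration: r + gamma * max_a' Q̂_{N-1}(s', a')
    y = R + gamma * q_next.max(axis=1)
```

In words: each pass refits the regressor on the same (s, a) inputs but with labels bootstrapped from the previous iteration's model, which is what I understand steps 3-5 above to mean.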
I am just starting to learn RL as part of my course and thus, there are many gaps in my understanding. Hope to get some kind guidance. Thanks in advance!