
I'm trying to implement Q-learning (tabular, state-based representation; no neural / deep stuff) but I'm having a hard time getting it to learn anything.

I believe my issue is with the exploration function and/or the learning rate. The thing is, the sources I'm following explain these differently, so I'm no longer sure what the right way is.

What I understand so far is that Q-learning is TD with q-val iteration.

So a time-limited q-val iteration step is:

```
Q[k+1](s,a) = ∑(s'): t(s,a,s') * [r(s,a,s') + γ * max(a'):Q[k](s',a')]
```

Where:

```
Q = q-table: state,action -> real
t = MDP transition model
r = MDP reward func
γ = discount factor.
```

But since this is a model-free, sample-based setting, the above update step becomes:

```
Q(s,a) = Q(s,a) + α * (sample - Q(s,a))
```

Where:

```
sample = r + γ * max(a'):Q(s',a')
r = reward from the percept after taking action a in state s.
s' = next state from the percept after taking action a in state s.
```
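In code, that sample-based update is only a few lines. A minimal sketch (the dict-based Q-table and the names here are my own, not from any particular library):

```python
from collections import defaultdict

# Q-table: (state, action) -> real, defaults to 0
Q = defaultdict(float)

def td_update(s, a, r, s_next, actions_in, alpha=0.3, gamma=0.1):
    """One sample-based Q-learning update after observing (s, a, r, s')."""
    # max over a' of Q(s', a'); 0 if s' has no actions (terminal)
    next_best = max((Q[(s_next, a2)] for a2 in actions_in(s_next)), default=0.0)
    sample = r + gamma * next_best
    Q[(s, a)] += alpha * (sample - Q[(s, a)])
```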

Now for example, assume the following MDP:

```
     0    1    2    3    4
0 [10t][ s ][ ? ][ ? ][ 1t]

Discount: 0.1
Stochasticity: 0
t = terminal (only EXIT action is possible)
s = start
```

With all of the above, my algo (in pseudo code) is:

```
input: mdp, γ, episodes, percept
Q: s,a -> real, initialized to 0 for all s,a
α = 0.3
for each episode:
    s = mdp.start
    while s is not none:
        a = argmax(a): Q(s,a)
        s', r = percept(s, a)
        sample = r + γ * max(a'): Q(s',a')
        Q(s,a) = Q(s,a) + α * [sample - Q(s,a)]
        s = s'
```
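For concreteness, here's a runnable encoding of that loop on the 1x5 grid above (the grid encoding and all names are my own; the purely greedy action selection is kept on purpose, since that's exactly the bug in question):

```python
from collections import defaultdict

GAMMA, ALPHA = 0.1, 0.3
START = (0, 1)
EXIT_REWARD = {(0, 0): 10, (0, 4): 1}   # terminal cells and their exit rewards

def actions(s):
    return ['Exit'] if s in EXIT_REWARD else ['W', 'E']

def percept(s, a):
    """Deterministic transitions: returns (s', r)."""
    if a == 'Exit':
        return None, EXIT_REWARD[s]
    row, col = s
    return (row, col - 1 if a == 'W' else col + 1), 0

def run(episodes):
    Q = defaultdict(float)
    for _ in range(episodes):
        s = START
        while s is not None:
            # purely greedy; max() breaks ties by order, so 'W' wins at 0 vs 0
            a = max(actions(s), key=lambda act: Q[(s, act)])
            s2, r = percept(s, a)
            best_next = 0.0 if s2 is None else max(Q[(s2, a2)] for a2 in actions(s2))
            Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
            s = s2
    return Q
```

Running two episodes reproduces exactly the q-table walked through below: 0,1,W ends at 0.09, 0,0,Exit at 5.1, and 'E' is never tried.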

As stated above, this algorithm will not learn, because it gets greedy too fast.

It will start at 0,1 and choose the best action so far. All q-values are 0, so it will choose based on the arbitrary order in which the q-values are stored in Q. Assume 'W' (go west) is chosen. It will go to 0,0 with a reward of 0 and a q-value update of 0 (since we don't yet know that 0,0,EXIT yields 10).

In the next step it will take the only possible action EXIT from 0,0 and get 10.

At this point the q-table will be:

```
0,1,W: 0
0,0,Exit: 3 (reward of 10 scaled by the learning rate of 0.3)
```

And the episode is over because 0,0 was terminal. On the next episode, it will start at 0,1 again and take W again because of the arbitrary order. But now 0,1,W will be updated to 0.09. Then 0,0,Exit will be taken again (and 0,0,Exit updated to 5.1). Then the second episode will be over.

At this point the q-table is:

```
0,1,W: 0.09
0,0,Exit: 5.1
```

And the sequence 0,1,W->0,0,Exit will be taken ad infinitum.

So this takes me to learning rates and the exploration functions.

The book 'Artificial Intelligence: A Modern Approach' (3rd ed., by Russell and Norvig) first mentions (pages 839-842) the exploration function as something to put in the value update (because there it is discussing a model-based, value-iteration approach).

So extrapolating from the book's value-update discussion, I'd assume the q-value update becomes:

```
Q(s,a) = ∑(s'): t(s,a,s') * [r(s,a,s') + γ * max(a'):E(s',a')]
```

Where E would be an exploration function which according to the book could be something like:

```
E(s,a) = <bigValue> if visitCount(s,a) < <minVisits> else Q(s,a)
```

The idea is to artificially pump up the values of actions that have not been tried much yet, so that each will be tried at least `minVisits` times.

But then, on page 844, the book shows pseudo-code for Q-learning that does not use this E in the q-value update, but rather in the argmax of the action selection. I guess that makes sense, since exploration amounts to choosing an action...
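Used in action selection that way, it might look like the following (a sketch; `BIG` and `MIN_VISITS` are placeholder values you'd tune, and the names are mine):

```python
from collections import defaultdict

Q = defaultdict(float)          # (state, action) -> value
N = defaultdict(int)            # (state, action) -> visit count
BIG, MIN_VISITS = 100.0, 5      # optimistic value / exploration threshold

def E(s, a):
    """Optimistic value for under-tried actions, plain Q otherwise."""
    return BIG if N[(s, a)] < MIN_VISITS else Q[(s, a)]

def select_action(s, actions):
    # behavior policy: greedy over E, not over Q
    a = max(actions, key=lambda act: E(s, act))
    N[(s, a)] += 1
    return a
```

With an empty table this tries 'W' for `MIN_VISITS` steps, then switches to the still-optimistic 'E', so every action gets tried at least `MIN_VISITS` times.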

The other source I have is the UC Berkeley CS188 lecture videos/notes. In those (Reinforcement Learning 2, 2016) the exploration function is shown in the q-value update step. This is consistent with what I extrapolated from the book's discussion of value-iteration methods, but not with what the book shows for Q-learning (remember, the book uses the exploration function in the argmax instead).
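That update-step placement can be sketched like this (my own names; `big` and `min_visits` are hypothetical tuning knobs, not anything from the CS188 materials):

```python
from collections import defaultdict

def update_with_exploration(Q, N, s, a, r, s_next, next_actions,
                            alpha=0.3, gamma=0.1, big=100.0, min_visits=5):
    """Q-learning update whose target uses E(s', a') instead of Q(s', a')."""
    def E(s2, a2):
        # optimistic until (s2, a2) has been visited min_visits times
        return big if N[(s2, a2)] < min_visits else Q[(s2, a2)]
    target = r + gamma * max((E(s_next, a2) for a2 in next_actions), default=0.0)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```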

I tried placing exploration functions in the update step, in the action selection step, and in both at the same time... and still the thing eventually gets greedy and stuck.

So not sure where and how this should be implemented.

The other issue is the learning rate. The explanation usually goes "you need to decrease it over time." OK... but is there some heuristic? Right now, based on the book, I am doing:

`learn(s,a) = 0.3 / visitCount(s,a)`

But I have no idea if that decays too fast, too slowly, or just right.
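For what it's worth, a schedule like c / visitCount(s,a) does satisfy the usual stochastic-approximation conditions (Σα = ∞, Σα² < ∞), so it is at least a theoretically sound choice. As a sketch (names are mine):

```python
from collections import defaultdict

N = defaultdict(int)  # (state, action) -> number of updates so far

def alpha(s, a, c=0.3):
    """Decaying learning rate: c on the first update of (s, a), c/n on the nth."""
    N[(s, a)] += 1
    return c / N[(s, a)]
```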

Finally, assuming I had the exploration and learn right, how would I know how many episodes to train for?

I'm thinking I'd have to keep 2 versions of the Q-table and check at which point the q-vals do not change much from previous iterations (similar to value iteration for solving known MDPs).
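That snapshot-and-compare check could be as simple as the following (a sketch; `eps` is a tolerance you'd pick, and note that with ongoing exploration the q-values may keep fluctuating a little, so the test is only a heuristic):

```python
def converged(Q_old, Q_new, eps=1e-4):
    """True when no q-table entry moved by more than eps between snapshots."""
    keys = set(Q_old) | set(Q_new)
    return all(abs(Q_new.get(k, 0.0) - Q_old.get(k, 0.0)) <= eps for k in keys)
```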

Right, my exploration function was meant as an 'upgrade' from a strictly ε-greedy strategy (to mitigate thrashing by the time the optimal policy is learned). But I don't get why it won't work even if I only use it in the action selection (behavior policy). Also, I think the idea of plugging it into the update step is to propagate the optimism about exploring not only to the state in question but to the states leading to that state. As for separating behavior and learning, shouldn't the behavior be driven by what has been learned so far? – SaldaVonSchwartz – 2018-02-16T10:12:35.827

1@SaldaVonSchwartz: "the idea of plugging it in the update step I think is to propagate the optimism about exploring" - that sort of works with the function as you have described it, because it cuts out after certain number of actions. Eventually the Q-table will forget the large optimistic starting values due to the fixed learning rate. But really this is the same as behaving randomly to start with (e.g. for 100 steps) then behaving closer to optimally later. You still need some exploration after the exploration function is exhausted, thus need a non-greedy action selection. – Neil Slater – 2018-02-16T10:47:28.507