How to implement an exploration function and learning rate in Q-learning


I'm trying to implement Q-learning (state-based representation and no neural / deep stuff) but I'm having a hard time getting it to learn anything.

I believe my issue is with the exploration function and/or the learning rate. The thing is, I see different explanations in the sources I am following, so I'm not sure what the right way is anymore.

What I understand so far is that Q-learning is TD with q-val iteration.

So a time-limited q-val iteration step is:

Q[k+1](s,a) = ∑(s'): t(s,a,s') * [r(s,a,s') + γ * max(a'):Q[k](s',a')]


Q = q-table: state,action -> real
t = MDP transition model
r = MDP reward func
γ = discount factor.

But since this is a model-free, sample-based setting, the above update step becomes:

Q(s,a) = Q(s,a) + α * (sample - Q(s,a))


sample = r + γ * max(a'):Q(s',a')
r  = reward, coming from the percept after taking action a in state s.
s' = next state, coming from the percept after taking action a in state s.

Now for example, assume the following MDP:

    0    1    2    3    4  
0 [10t][ s ][ ? ][ ? ][ 1t]

Discount: 0.1 | 
Stochasticity: 0 | 
t = terminal (only EXIT action is possible)
s = start

With all of the above, my algo (in pseudo code) is:

input: mdp, episodes, percept
Q: s,a -> real is initialized to 0 for all a,s
α = .3

for all episodes:
    s = mdp.start

    while s not none:
        a  = argmax(a): Q(s,a) 
        s', r = percept(s,a)
        sample = r + γ * max(a'):Q(s',a')
        Q(s,a) = Q(s,a) + α * [sample - Q(s,a)]
        s = s'

As stated above, this algorithm will not learn, because it gets greedy fast.

It will start at 0,1 and choose the best action so far. All q-vals are 0, so it will choose based on the arbitrary order in which the q-vals are stored in Q. Assume 'W' (go west) is chosen. It will go to 0,0 with a reward of 0 and a q-val update of 0 (since we don't yet know that 0,0,EXIT yields 10).

In the next step it will take the only possible action EXIT from 0,0 and get 10.

At this point the q-table will be:

0,1,W:      0
0,0,Exit:   3 (reward of 10 scaled by the learning rate of .3: 0 + .3 * (10 - 0) = 3)

And the episode is over because 0,0 was terminal. On the next episode, it will start at 0,1 again and take W again because of the arbitrary order. But now 0,1,W will be updated to 0.09. Then 0,0,Exit will be taken again (and 0,0,Exit updated to 5.1). Then the second episode will be over.

At this point the q-table is:

0,1,W:      0.09
0,0,Exit:   5.1

And the sequence 0,1,W->0,0,Exit will be taken ad infinitum.
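For what it's worth, that two-episode trace can be checked mechanically. A small sketch (Q as a dict keyed by (state, action); the helper name is mine, not from any source):

```python
ALPHA, GAMMA = 0.3, 0.1  # learning rate and discount from the example

def update(Q, s, a, r, best_next):
    """One sample-based q-val update; best_next is max over a' of
    Q(s', a'), taken as 0 when s' is terminal."""
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + ALPHA * (r + GAMMA * best_next - q_sa)

Q = {}
# episode 1: start (0,1) -W-> (0,0), then EXIT
update(Q, (0, 1), 'W', 0, 0.0)                  # stays 0
update(Q, (0, 0), 'EXIT', 10, 0.0)              # -> 3.0
# episode 2: the same greedy sequence again
update(Q, (0, 1), 'W', 0, Q[((0, 0), 'EXIT')])  # -> ~0.09
update(Q, (0, 0), 'EXIT', 10, 0.0)              # -> ~5.1
```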

So this takes me to learning rates and the exploration functions.

The book 'Artificial Intelligence: A Modern Approach' (3rd ed., by Russell and Norvig) first mentions (pages 839-842) the exploration function as something to put in the value update (because at that point it is discussing a model-based, value-iteration approach instead).

So extrapolating from the val update discussion in the book, I'd assume the q-val update becomes:

Q(s,a) = ∑(s'): t(s,a,s') * [r(s,a,s') + γ * max(a'):E(s',a')]

Where E would be an exploration function which according to the book could be something like:

E(s,a) = <bigValue> if visitCount(s,a) < <minVisits> else Q(s,a)

The idea being to artificially pump up the vals of actions which have not been tried yet and so now they'll be tried out at least minVisits times.
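In Python that could be sketched like this (BIG_VALUE and MIN_VISITS are placeholder choices of mine, not values from the book):

```python
BIG_VALUE = 100.0   # optimistic guess, assumed larger than any real return
MIN_VISITS = 5      # try every (s, a) at least this many times

def explore_value(Q, visit_count, s, a):
    """Optimistic value estimate: untried (or rarely tried) actions
    look great, so a greedy argmax will pick them until they have
    been sampled MIN_VISITS times; after that, the real q-val."""
    if visit_count.get((s, a), 0) < MIN_VISITS:
        return BIG_VALUE
    return Q.get((s, a), 0.0)
```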

But then, on page 844, the book shows pseudo code for Q-learning that does not use this E in the q-val update but rather in the argmax of the action selection. I guess that makes sense? Since exploration amounts to choosing an action...

The other source I have is the UC Berkeley CS188 lecture videos/notes. In those (Reinforcement Learning 2: 2016) they show the exploration function in the q-val update step. This is consistent with what I extrapolated from the book's discussion on value iteration methods but not with what the book shows for Q-Learning (remember the book uses the exploration function in the argmax instead).

I tried placing the exploration function in the update step, in the action selection step, and in both at the same time... and still the thing eventually gets greedy and stuck.

So not sure where and how this should be implemented.

The other issue is the learning rate. The explanation usually goes "you need to decrease it over time." OK, but is there some heuristic? Right now, based on the book, I am doing:

learn(s,a) = 0.3 / visitCount(s,a). But I have no idea if that is too much, too little, or just right.
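As far as I know, 0.3 / visitCount(s,a) is a reasonable member of the usual family: any schedule with Σα = ∞ and Σα² < ∞ (which 1/n satisfies, so α₀/n does too) meets the standard convergence conditions for tabular Q-learning. A sketch, with names of my choosing:

```python
def learning_rate(n_visits, alpha0=0.3):
    """Decaying step size: alpha0 on the first visit to (s, a),
    alpha0 / n on the n-th. The 1/n decay satisfies the standard
    stochastic-approximation conditions (sum diverges, sum of
    squares converges), so convergence is not the worry; speed is."""
    return alpha0 / max(n_visits, 1)
```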

Finally, assuming I had the exploration and learn right, how would I know how many episodes to train for?

I'm thinking I'd have to keep 2 versions of the Q-table and check at which point the q-vals do not change much from previous iterations (similar to value iteration for solving known MDPs).
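That check could be as simple as the sketch below (keeping in mind that while exploration is still switched on, individual q-vals keep fluctuating, so in practice you would compare snapshots taken some episodes apart):

```python
def max_abs_diff(Q_old, Q_new):
    """Largest change in any q-val between two q-table snapshots;
    stop training once this stays below some tolerance."""
    keys = set(Q_old) | set(Q_new)
    return max((abs(Q_old.get(k, 0.0) - Q_new.get(k, 0.0)) for k in keys),
               default=0.0)
```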


Posted 2018-02-16T05:12:24.747




Your main problem is that you need to separate what drives the behaviour policy from the Q-table itself.

Q Learning is an off-policy algorithm. The Q-table that it eventually learns is for an optimal policy (also called the target policy). In order to be able to learn that policy, the agent needs to explore. The usual way to do this is to make the agent follow a different policy (called the behaviour policy). For efficient learning, you generally want the behaviour policy to be similar to the target policy. So it is common to also drive the behaviour policy from the Q-table, but not absolutely necessary.

You do not need an exploration function, but it is one good way to drive exploration.

The simplest behaviour policy, and one that will work in your case, is to behave completely randomly - ignore the Q-table and select actions at random. With your simple toy problem, that should work reasonably well.

A more common approach is to behave ε-greedily. For some probability ε (e.g. 0.1), behave randomly. Otherwise take the argmax over a of Q(s,a).
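A sketch of that (note the random tie-breaking, which on its own would already avoid the fixed W-first choice in your trace):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon act randomly; otherwise act greedily.
    Ties are broken at random so the arbitrary storage order of the
    q-table stops deciding the action."""
    if random.random() < epsilon:
        return random.choice(actions)
    best = max(Q.get((s, a), 0.0) for a in actions)
    return random.choice([a for a in actions
                          if Q.get((s, a), 0.0) == best])
```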

The exploration function approach is similar to what you have so far; you are just missing the separation of the behaviour policy from the Q-table updates. The Q-table update ignores the exploration function and maximises over the estimated next values:

Q(s,a) = Q(s,a) + α * (r + γ * max(a'):Q(s',a') - Q(s,a))

The behaviour policy for picking the actual next action to take can be decided by using the exploration function:

a' = argmax(a'):E(s',a')

Note that in stochastic environments (where the chosen action may lead randomly to multiple different states), you may be able to get away without any separate behaviour policy and always act greedily with respect to the Q-table. However, that is not a generic solution - such a learning agent would do badly in deterministic environments.
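Putting the pieces together, the whole loop might look like the sketch below. The percept / actions_in interface is my assumption, mirroring the question's pseudocode; the exploration bonus drives only the behaviour (action choice), while the update maximises over the plain Q values, which is what makes it off-policy.

```python
def q_learning(percept, start, actions_in, episodes,
               gamma=0.1, alpha=0.3, big_value=100.0, min_visits=2):
    """Tabular Q-learning with an optimistic exploration function.
    Assumed interface:
      percept(s, a) -> (s_next, reward)
      actions_in(s) -> list of legal actions ([] once the episode ends)."""
    Q, visits = {}, {}

    def explore(s, a):
        # Behaviour only: pretend rarely tried actions are great.
        if visits.get((s, a), 0) < min_visits:
            return big_value
        return Q.get((s, a), 0.0)

    for _ in range(episodes):
        s = start
        while actions_in(s):
            a = max(actions_in(s), key=lambda act: explore(s, act))
            visits[(s, a)] = visits.get((s, a), 0) + 1
            s_next, r = percept(s, a)
            # Update uses the plain max over Q, not the exploration bonus.
            best_next = max((Q.get((s_next, act), 0.0)
                             for act in actions_in(s_next)), default=0.0)
            q_sa = Q.get((s, a), 0.0)
            Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)
            s = s_next
    return Q
```

On the 1x5 grid from the question this learns Q(0,EXIT) close to 10 and Q for going west from the start close to γ·10.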

Neil Slater


1Right, my exploration function was meant as 'upgrade' from a strictly e-greedy strategy (to mitigate thrashing by the time the optimal policy is learned). But I don't get why then it won't work even if I only use it in the action selection (behavior policy). Also the idea of plugging it in the update step I think is to propagate the optimism about exploring not only to the state in question but to states leading to that state. As for separating behavior and learning, shouldn't the behavior be driven from what has been learned so far? – SaldaVonSchwartz – 2018-02-16T10:12:35.827

1@SaldaVonSchwartz: "the idea of plugging it in the update step I think is to propagate the optimism about exploring" - that sort of works with the function as you have described it, because it cuts out after certain number of actions. Eventually the Q-table will forget the large optimistic starting values due to the fixed learning rate. But really this is the same as behaving randomly to start with (e.g. for 100 steps) then behaving closer to optimally later. You still need some exploration after the exploration function is exhausted, thus need a non-greedy action selection. – Neil Slater – 2018-02-16T10:47:28.507

