Concrete Example for Q-Learning



I am not sure whether I have understood the Q-learning algorithm correctly, so I will give a concrete example and ask whether someone can tell me how to update the Q-value correctly.

First, I initialized a neural network with random weights. It evaluates the Q-values for all four possible actions given a state $S$.

Then the following happens: the agent plays and explores. Over 3 steps, the Q-values it evaluated were (0, -1, -5, 0), (0, -1, 0, 0), and (0, -0.6, 0, 0).

The rewards given were 0, 0, 1, and the actions taken were (1, 1, 1). In the random-walk example (with the same rewards), the actions taken were (1, 2, 3).

So what are the new Q-values, assuming a discount factor of 0.99 and a learning rate of 0.1?

For simplicity, the states are single numbers: 1, 1.3, 2.4, where 2.4 is the state that ends the game.

The same example holds when exploiting. Is the algorithm the same in that case?

Here is my latest implementation:

    public void rlearn(ArrayList<Tuple> tupels, double learningrate, double discountfactor) {
        // Walk the trajectory backwards and build a training set of
        // (state, target Q-vector) pairs for the network.
        for (int i = tupels.size() - 1; i >= 0; i--) { // i > 0 would skip the first tuple
            MLData in = new BasicMLData(45);
            MLData out = new BasicMLData(5);

            // Add the state as the input
            int index = 0;
            for (double w : tupels.get(i).statefirst.elements) {
                in.add(index++, w);
            }

            // Now update the Q-value of the action that was taken
            double qnew;
            if (i <= tupels.size() - 2) {
                // Non-terminal step: bootstrap from the next step's best Q-value
                qnew = tupels.get(i).rewardafter + discountfactor * qMax(tupels.get(i + 1));
            } else {
                // Terminal step: the target is just the reward
                qnew = tupels.get(i).rewardafter;
            }
            tupels.get(i).qactions.elements[tupels.get(i).actionTaken] = qnew;

            // Add the Q-values as the target output
            index = 0;
            for (double w : tupels.get(i).qactions.elements) {
                out.add(index++, w);
            }
            bigset.add(in, out);
        }
    }

Edit: This is the qMax function:

    private double qMax(Tuple tuple) {
        // Note: Double.MIN_VALUE is the smallest *positive* double, so
        // initialising max with it would fail for all-negative Q-values.
        double max = Double.NEGATIVE_INFINITY;
        for (double w : tuple.qactions.elements) {
            if (w > max) {
                max = w;
            }
        }
        return max;
    }


Posted 2019-01-04T14:11:38.477

Reputation: 301

Your example needs to give the action taken on each step that generated those sampled rewards. You should include at least one step where the non-maximising action was taken. For a full explanation, you should give the example data in the form of state_label, predicted_rewards, action_taken, actual_reward, next_state_label, end_flag - these don't all need to be in vector/numeric form, although it would help if the Q values and rewards are (as you have already done), plus the action id needs to be numeric in order to find what the predicted Q value was – Neil Slater – 2019-01-04T16:17:36.867

You're right; I have edited in the actions taken. What are end_flag and next_state_label? – TVSuchty – 2019-01-04T16:30:39.450

state_label and next_state_label identify the states in the trajectory - it is implied in your question but not stated that your neural network estimates $(q(s, a_0), q(s, a_1), q(s, a_2), q(s, a_3))$, and we need to know $s$. It is $q(s,a)$ that you revise, using $\max_{a'} q(s',a')$ to improve the estimate, so you need to identify $s$ (state_label), $s'$ (next_state_label) and $a$. The end_flag is boolean - whether the transition ends an episode - that is critical information on how you learn Q values, because $\max_{a'} q(s',a')$ is by definition $0$ in that case – Neil Slater – 2019-01-04T16:54:00.320

I understand the end_flag, but I do not think I understand the state label. What do you need the states for again? Just to recalculate the error of the net? (See above.) – TVSuchty – 2019-01-04T16:57:46.717

The ids of the states are needed to explain how the formula works. They are used both for calculating the new target value, and for showing which Q value is being updated. References to different states are used in different parts of the same step, so it is important to make it clear which state is being used and why. – Neil Slater – 2019-01-04T17:43:23.353

If your examples are sequential from the same trajectory (looks like they are), then you will end up with repeats, so first time step might have state_label $s_a$ and next_state_label $s_b$ then the second time step might have state_label $s_b$ and next_state_label $s_c$ etc - or you could make up a state vector for each one (because that's what you'd have for input to the NN). I am asking because I want you to add this information in the way that you understand it, so that the answer can explain things to you in your own terms. – Neil Slater – 2019-01-04T17:47:51.163

I have added the states. Can you now explain it to me? – TVSuchty – 2019-01-04T21:38:29.630

Let us continue this discussion in chat. – TVSuchty – 2019-01-04T21:41:27.837



Most Deep Q-learning implementations I have read are based on Deep Q-Networks (DQN). In DQN, the q-value network maps an input state to a vector of q-values, one for each action:

$$ Q(s, \mathbf{w}) \to \mathbf{v} $$

where $s$ is the input state from the environment, $\mathbf{w}$ are the parameters of the neural network, and $\mathbf{v}$ is a vector of q-values, where $v_i$ is the estimated q-value of the $i$th action. In the Sutton and Barto book, the q-value function is written as $Q(s, a, \mathbf{w})$, which corresponds to the network output for action $a$.
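To make the shape of this mapping concrete, here is a toy sketch in Python (a made-up linear "network" for illustration only, not the asker's Encog model): the network maps a state to one q-value per action, and the greedy policy takes the argmax of that vector.

```python
# Toy stand-in for a Q-network: maps a scalar state s to a vector of
# four q-values, one per action. A single linear layer replaces a real
# neural network purely for illustration.
def q_network(s, weights, biases):
    return [w * s + b for w, b in zip(weights, biases)]

def greedy_action(qvalues):
    # The greedy policy selects the action with the largest q-value.
    return max(range(len(qvalues)), key=lambda a: qvalues[a])

weights = [0.1, -0.2, 0.05, 0.0]     # hypothetical parameters w
biases = [0.0, 0.1, -0.3, 0.2]
v = q_network(1.0, weights, biases)  # q-value vector for state s = 1.0
a = greedy_action(v)                 # index of the greedy action
```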

Unlike tabular Q-learning, deep Q-learning updates the parameters of the neural network according to the gradient of the loss function with respect to the parameters. DQN uses the loss function

$$ L(\mathbf{w}) = [(r + \gamma \max_{a'} Q(s', a', \mathbf{w}^-)) - Q(s, a, \mathbf{w})]^2 $$

where $\gamma$ is the discount rate, $a$ is the selected action (chosen either greedily or randomly under an $\epsilon$-greedy behavior policy), $s'$ is the next state, $a'$ is the argmax action for the next state, and $\mathbf{w}^-$ is an older version of the network weights $\mathbf{w}$ that is used to help stabilize training.

In deep Q-learning, training directly updates the parameters, not the q-values. The parameters are updated by taking a small step in the direction that reduces the loss:

$$ \mathbf{w} \gets \mathbf{w} + \alpha [(r + \gamma \max_{a'} Q(s', a', \mathbf{w}^-)) - Q(s, a, \mathbf{w})] \nabla_{\mathbf{w}} Q(s, a, \mathbf{w}) $$

where $\alpha$ is the learning rate.
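As a worked illustration of this rule, consider a hypothetical linear approximator $Q(s, a, \mathbf{w}) = w_a \cdot s$ (my own toy example, not the asker's network). The gradient of $Q$ with respect to $w_a$ is just $s$, so only the selected action's weight moves:

```python
# One step of the DQN update rule for the toy linear approximator
# Q(s, a, w) = w[a] * s.
def dqn_update(w, w_target, s, a, r, s_next, gamma, alpha, terminal):
    q_sa = w[a] * s
    # max_{a'} Q(s', a', w^-) is defined to be 0 on a terminal transition.
    max_next = 0.0 if terminal else max(wt * s_next for wt in w_target)
    td_error = (r + gamma * max_next) - q_sa
    w = list(w)
    # dQ(s, a, w)/dw[a] = s; all other weights stay unchanged.
    w[a] += alpha * td_error * s
    return w, td_error

# First transition of the question: s = 1, action 1, reward 0, s' = 1.3,
# with the current network doubling as its own target network.
w = [0.0, -1.0, -5.0, 0.0]
w_new, err = dqn_update(w, w, s=1.0, a=1, r=0.0, s_next=1.3,
                        gamma=0.99, alpha=0.1, terminal=False)
```

Here the TD error is $0 + 0.99 \cdot 0 - (-1) = 1$, so the weight for action 1 moves from $-1$ to $-0.9$ with $\alpha = 0.1$.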

In frameworks like TensorFlow or PyTorch, the derivative is calculated automatically by passing the loss function and model parameters directly to an optimizer class, which uses some variation of mini-batch gradient descent. In eagerly executed TensorFlow, updating the parameters for a mini-batch might look something like this:

    batch = buffer.sample(batch_size)
    observations, actions, rewards, next_observations, dones = batch

    with tf.GradientTape() as tape:
        qvalues = model(observations, training=True)
        next_qvalues = target_model(next_observations)
        # r + gamma * max_{a'} Q(s', a') for the batch;
        # (1 - dones) zeroes the bootstrap term on terminal transitions
        max_next_qvalues = tf.reduce_max(next_qvalues, axis=-1)
        target_qvalues = rewards + gamma * (1.0 - dones) * max_next_qvalues
        # Q(s, a) for the batch
        selected_qvalues = tf.reduce_sum(
            tf.one_hot(actions, depth=qvalues.shape[-1]) * qvalues, axis=-1)
        loss = tf.reduce_mean((target_qvalues - selected_qvalues)**2)

    grads = tape.gradient(loss, model.variables)
    optimizer.apply_gradients(zip(grads, model.variables))

Though I am not familiar with the Encog neural network framework you are using, based on the example file from your GitHub repo, Chapter 5 of the Encog User Manual, and the Encog neural network examples on GitHub, it looks like weights are updated as follows:

  1. A training set is constructed from pairs of input and target output.
  2. A Propagation instance, train, is constructed with a network and training set. Different subclasses of Propagation use different loss functions to update the network parameters.
  3. The method train.iterate() is called to run the network on the inputs, calculate the loss between the network outputs and target outputs, and update the weights according to the loss.

For DQN, a training set is constructed from a random sample from the experience replay buffer to help stabilize training. A training set could also be the trajectory of an episode, which is what the tupels argument in the example code of the question appears to be.
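A minimal replay buffer along these lines might look like the following Python sketch (the class and method names are my own, not from any particular framework):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) transitions and
    returns uniform random mini-batches for training."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop off

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(list(self.buffer), batch_size)
        # Transpose the list of transitions into per-component tuples.
        return tuple(zip(*batch))

# The states of the question's trajectory: 1 -> 1.3 -> 2.4 (terminal).
buf = ReplayBuffer(capacity=1000)
buf.add(1.0, 1, 0.0, 1.3, False)
buf.add(1.3, 1, 0.0, 2.4, False)
buf.add(2.4, 1, 1.0, 2.4, True)
states, actions, rewards, next_states, dones = buf.sample(2)
```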

The input would be the statefirst member of each element of tupels. Since the network produces a vector of q-values, the target output must also be a vector of q-values.

The target output element for the selected action is $r + \gamma \max_{a'} Q(s', a', \mathbf{w}^-)$. In the example code of the question, this is:

    double qnew = 0;
    if (i <= tupels.size() - 2) {
        qnew = tupels.get(i).rewardafter + discountfactor * qMax(tupels.get(i + 1));
    } else {
        qnew = tupels.get(i).rewardafter;
    }
    tupels.get(i).qactions.elements[tupels.get(i).actionTaken] = qnew;

The target output elements for actions that were not selected should be $Q(s, b, \mathbf{w})$, where $b$ is one of the non-selected actions. This should have the effect of ignoring the q-values of non-selected actions by making the network output equal to the target output.
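Concretely, the target vector can be built by copying the network's current predictions and overwriting only the selected action's entry, e.g. in Python (a sketch of the idea, not Encog code):

```python
def build_target(qvalues, action, reward, max_next_q, gamma, terminal):
    # Start from the network's own predictions so that non-selected
    # actions contribute zero error to the loss.
    target = list(qvalues)
    bootstrap = 0.0 if terminal else gamma * max_next_q
    target[action] = reward + bootstrap
    return target

# First step of the question's trajectory: Q = (0, -1, -5, 0), action 1
# taken, reward 0, and the best q-value of the next state is 0.
t = build_target([0.0, -1.0, -5.0, 0.0], action=1, reward=0.0,
                 max_next_q=0.0, gamma=0.99, terminal=False)
```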

"So what are the new Q-values, assuming a discount factor of 0.99 and the learning rate 0.1?"

Assuming that by the new Q-values you mean the target outputs, and given the trajectory of actions (1, 1, 1) and the q-value vectors from the question, the concrete target outputs are (0, 0 + 0.99 * 0, -5, 0), (0, 0 + 0.99 * 0, 0, 0), and (0, 1 + 0, 0, 0).
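These numbers can be verified step by step with a few lines of Python using the question's data (the last transition is terminal, so it has no bootstrap term):

```python
gamma = 0.99
qvals = [[0, -1, -5, 0], [0, -1, 0, 0], [0, -0.6, 0, 0]]  # network outputs
actions = [1, 1, 1]
rewards = [0, 0, 1]

targets = []
for i in range(len(qvals)):
    target = list(qvals[i])
    if i + 1 < len(qvals):
        # Non-terminal: r + gamma * max_{a'} Q(s', a')
        target[actions[i]] = rewards[i] + gamma * max(qvals[i + 1])
    else:
        # Terminal: the target is just the reward.
        target[actions[i]] = rewards[i]
    targets.append(target)
```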


Posted 2019-01-04T14:11:38.477

Reputation: 119

$$\mathbf{w} \gets \mathbf{w} + \alpha [(r + \gamma \max_{a'} Q(s', a', \mathbf{w}^-)) - Q(s, a, \mathbf{w})] \nabla_{\mathbf{w}} Q(s, a, \mathbf{w})$$ Does this mean I always have to keep an older version of the network? I do not understand this. I understand what the loss function does; it is the normal loss function for most BP algorithms. But how do I calculate

$$\max_{a'} Q(s', a', \mathbf{w}^-)$$ – TVSuchty – 2019-01-05T09:55:11.890

Yes, the DQN method uses an older version of the model, which gets replaced every $N$ time steps by the current model. According to the paper, this improves training. – todddeluca – 2019-01-06T07:06:32.883

You could calculate $\max_{a'} Q(s', a', \mathbf{w}^-)$ by running the state $s'$ through an older copy of the model (i.e. the target network) and choosing the max of the outputs. – todddeluca – 2019-01-06T07:17:39.430
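The periodic copy described in these comments can be sketched as follows (Python with made-up names; in a real framework you would copy the weight tensors of the target network instead):

```python
SYNC_EVERY = 100  # the interval N is a tunable hyperparameter

def maybe_sync(step, online_weights, target_weights):
    # Every N steps, overwrite the target network's weights with the
    # online network's current weights; otherwise leave them frozen.
    if step % SYNC_EVERY == 0:
        return list(online_weights)
    return target_weights

target = [0.0, 0.0]
target = maybe_sync(100, [0.5, -0.2], target)  # step 100: copy happens
target = maybe_sync(101, [9.9, 9.9], target)   # step 101: copy is kept frozen
```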

Unfortunately, this still does not work. Is there anything else to consider? I must have made an obvious mistake. – TVSuchty – 2019-01-06T11:56:45.320


With considerable luck, code can sometimes be made to work without a proper understanding of the frameworks, theories, and algorithms involved, but there are more reliable ways to build an effective and maintainable piece of software on top of a technology that is not yet well understood. The following is a common development approach.

  • Review the framework documentation and example code to determine what design patterns and algorithms are used by the framework and list them.
  • Identify which designs and algorithms are most likely to achieve the project objective.
  • Search academic papers to find the origins of those designs and algorithms, and trace their references back to their beginnings.
  • Study the tree of papers and the nomenclature used in the mathematics until it is reasonably well understood. (Without this step, any programs written will likely be a hack.)
  • Design the piece of software and the one or more algorithms within it needed to achieve the project objective.
  • Find the one or more example programs that already work and run them to prove the environment is set up correctly and to begin with working code.
  • Incrementally modify and unit test and functionally test each modification until a working version of the desired design is achieved.

Ensuring that no bugs enter the code with each modification to working code is usually less time-consuming than writing large pieces of code that don't work and then trying to fix them. It is also important to use a code repository and commit frequently, after each piece of code passes its test suite. Agile methodology applies as much or more to AI as to conventional development.

These steps may seem incredibly inefficient, and it is tempting to believe that tinkering with lines of code will succeed; however, it rarely does in practice. A first attempt written without understanding may have significant shortcomings in both design and implementation, and will take longer in the vast majority of instances.

The reliability of an approach is often more important than the efficiency of its luckiest instances, which sit a few standard deviations from the mean case. The above approach also fosters a greater depth of reusable knowledge. Maintainability and extensibility depend on both good design and the depth of knowledge of the engineer.

For these reasons, the above steps describe the wiser approach for many of the questions here, including this one.

Douglas Daseeco

Posted 2019-01-04T14:11:38.477

Reputation: 7 174