Most Deep Q-learning implementations I have read are based on Deep Q-Networks (DQN). In DQN, the q-value network maps an input state to a vector of q-values, one for each action:

$$
Q(s, \mathbf{w}) \to \mathbf{v}
$$

where $s$ is the input state from the environment, $\mathbf{w}$ are the parameters of the neural network, and $\mathbf{v}$ is a vector of q-values, where $v_i$ is the estimated q-value of the ith action. In the Sutton and Barto book, the q-value function is written as $Q(s, a, \mathbf{w})$, which corresponds to the network output for action $a$.
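To make the mapping concrete, here is a minimal NumPy sketch of such a network; the layer sizes, parameter names, and the toy two-layer architecture are all illustrative assumptions, not anything from the question:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network: 8 state features, one hidden layer, 4 actions.
w = {
    "W1": rng.normal(size=(8, 16)), "b1": np.zeros(16),
    "W2": rng.normal(size=(16, 4)), "b2": np.zeros(4),
}

def q_values(s, w):
    """Map a state vector s to a vector of q-values, one per action."""
    h = np.tanh(s @ w["W1"] + w["b1"])
    return h @ w["W2"] + w["b2"]

s = rng.normal(size=8)
v = q_values(s, w)  # vector of 4 q-values
a = 2
q_sa = v[a]         # Q(s, a, w) is just the a-th output element
```

The point is that a single forward pass yields the q-values of *all* actions, and $Q(s, a, \mathbf{w})$ is recovered by indexing the output vector with the action.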

Unlike tabular Q-learning, deep Q-learning updates the parameters of the neural network according to the gradient of the loss function with respect to those parameters. DQN uses the loss function

$$
L(\mathbf{w}) = [(r + \gamma \max_{a'} Q(s', a', \mathbf{w}^-)) - Q(s, a, \mathbf{w})]^2
$$

where $\gamma$ is the discount factor, $a$ is the selected action (chosen either greedily or at random under an $\epsilon$-greedy behavior policy), $s'$ is the next state, $a'$ is the argmax action for the next state, and $\mathbf{w}^-$ is an older copy of the network weights $\mathbf{w}$ that is used to help stabilize training.

In deep Q-learning, training directly updates the parameters, not the q-values. The parameters are updated by taking a small step in the direction that reduces the loss:

$$
\mathbf{w} \gets \mathbf{w} + \alpha [(r + \gamma \max_{a'} Q(s', a', \mathbf{w}^-)) - Q(s, a, \mathbf{w})] \nabla_{\mathbf{w}} Q(s, a, \mathbf{w})
$$

where $\alpha$ is the learning rate.
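As a toy illustration of this update rule (not the question's setup), here is a minimal NumPy sketch with a *linear* Q-function, for which $\nabla_{\mathbf{w}} Q(s, a, \mathbf{w})$ is simply the state-feature vector in the selected action's row; all names and sizes are made up:

```python
import numpy as np

n_actions, n_features = 3, 4

# Linear Q-function: Q(s, a, w) = w[a] . s, so grad_w Q(s, a, w)
# is the feature vector s, placed in row a of w.
w = np.zeros((n_actions, n_features))
w_minus = w.copy()  # older copy w^- used for the bootstrap target

def td_update(w, w_minus, s, a, r, s_next, gamma=0.99, alpha=0.1):
    """One semi-gradient Q-learning step on the selected action's row."""
    td_target = r + gamma * np.max(w_minus @ s_next)
    td_error = td_target - w[a] @ s
    w[a] += alpha * td_error * s
    return td_error

s = np.ones(n_features)
s_next = np.ones(n_features)
delta = td_update(w, w_minus, s, a=0, r=1.0, s_next=s_next)
```

For a terminal transition the bootstrap term would be dropped and the target would be just $r$.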

In frameworks like TensorFlow or PyTorch, the derivative is calculated automatically: the loss function and the model parameters are handed directly to an optimizer class that applies some variation of mini-batch gradient descent. In eagerly executed TensorFlow, updating the parameters for a mini-batch might look something like

```python
batch = buffer.sample(batch_size)
observations, actions, rewards, next_observations = batch
with tf.GradientTape() as tape:
    qvalues = model(observations, training=True)
    next_qvalues = target_model(next_observations)
    # r + gamma * max_{a'} Q(s', a', w^-) for the batch
    target_qvalues = rewards + gamma * tf.reduce_max(next_qvalues, axis=-1)
    # Q(s, a, w) for the batch
    selected_qvalues = tf.reduce_sum(
        tf.one_hot(actions, depth=qvalues.shape[-1]) * qvalues, axis=-1)
    loss = tf.reduce_mean((target_qvalues - selected_qvalues) ** 2)
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
```

Though I am not familiar with the Encog neural network framework you are using, based on the `Brain.java` file from your GitHub repo, Chapter 5 of the Encog User Manual, and the Encog neural network examples on GitHub, it looks like weights are updated as follows:

- A training set is constructed from pairs of **input** and **target output**.
- A `Propagation` instance, `train`, is constructed with a network and a training set. Different subclasses of `Propagation` use different loss functions to update the network parameters.
- The method `train.iteration()` is called to run the network on the inputs, calculate the loss between the network outputs and the target outputs, and update the weights according to the loss.

For DQN, a training set is constructed from a random sample from the experience replay buffer to help stabilize training. A training set could also be the trajectory of a single episode, which is what the `tupels` argument in the example code of the question appears to be.

The **input** would be the `statefirst` member of each element of `tupels`. Since the network produces a vector of q-values, the **target output** must also be a vector of q-values.

The target output element for the selected action is $r + \gamma \max_{a'} Q(s', a', \mathbf{w}^-)$. In the example code of the question, this is

```java
double qnew = 0;
if (i <= tupels.size() - 2) {
    qnew = tupels.get(i).rewardafter + discountfactor * qMax(tupels.get(i + 1));
} else {
    qnew = tupels.get(i).rewardafter;
}
tupels.get(i).qactions.elements[tupels.get(i).actionTaken] = qnew;
```

The target output elements for the actions that were *not* selected should be $Q(s, b, \mathbf{w})$, where $b$ is a non-selected action. This has the effect of ignoring the q-values of non-selected actions: the network output already equals the target output for those entries, so they contribute zero error.
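A sketch of how such a target vector could be assembled, using plain NumPy and a hypothetical `make_target` helper (the `done` flag drops the bootstrap term on the last step of an episode, like the `else` branch in the Java snippet above):

```python
import numpy as np

def make_target(qvalues, action, reward, next_qvalues, done, gamma=0.99):
    """Build the target output vector for one transition: copy the current
    network outputs (so non-selected actions contribute zero error) and
    overwrite the selected action's entry with the TD target."""
    target = np.array(qvalues, dtype=float)
    bootstrap = 0.0 if done else gamma * float(np.max(next_qvalues))
    target[action] = reward + bootstrap
    return target

# Non-terminal transition: only the entry for action 1 changes.
t = make_target([0.0, 0.5, -5.0, 0.0], action=1, reward=0.0,
                next_qvalues=[0.0, 2.0, 0.0, 0.0], done=False)
# t is [0.0, 0.0 + 0.99 * 2.0, -5.0, 0.0]
```

The (input, target) pairs produced this way are what would go into the Encog training set.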

*So what are the new Q-values, assuming a discount factor of 0.99 and a learning rate of 0.1?*

Assuming that by "new Q-values" you mean the **target outputs**, and given the trajectory of actions `(1, 1, 1)` and the q-value vectors from the question, the concrete target outputs are `(0, 0 + 0.99 * 0, -5, 0)`, `(0, 0 + 0.99 * 0, 0, 0)`, and `(0, 1 + 0, 0, 0)`.

**Comments**

- See https://ai.stackexchange.com/questions/7298/reinforcement-learning-in-asteroid-game/8028#8028 – DrMcCleod – 2019-01-04T15:32:52.677
- Your example needs to give the action taken on each step that generated those sampled rewards. You should include at least one step where the non-maximising action was taken. For a full explanation, you should give the example data in the form of `state_label`, `predicted_rewards`, `action_taken`, `actual_reward`, `next_state_label`, `end_flag` - these don't all need to be in vector/numeric form, although it would help if the Q values and rewards are (as you have already done), plus the action id needs to be numeric in order to find what the predicted Q value was. – Neil Slater – 2019-01-04T16:17:36.867
- You're right, I edited the actions taken. What are `end_flag` and `next_state_label`? – TVSuchty – 2019-01-04T16:30:39.450
- `state_label` and `next_state_label` identify the states in the trajectory - it is implied in your question but not stated that your neural network estimates $(q(s, a_0), q(s, a_1), q(s, a_2), q(s, a_3))$, and we need to know $s$. It is $q(s, a)$ that you revise, using $\max_{a'} q(s', a')$ to improve the estimate, so you need to identify $s$ (`state_label`), $s'$ (`next_state_label`) and $a$. The `end_flag` is boolean - whether the transition ends an episode - that is critical information on how you learn Q values, because $\max_{a'} q(s', a')$ is by definition $0$ in that case. – Neil Slater – 2019-01-04T16:54:00.320
- I understood the `end_flag`. But I think I do not understand the state label. For what do you need the states again? Just to recalculate the error of the net? (see above) – TVSuchty – 2019-01-04T16:57:46.717
- The ids of the states are needed to explain how the formula works. They are used both for calculating the new target value, and for showing which Q value is being updated. References to different states are used in different parts of the same step, so it is important to make it clear which state is being used and why. – Neil Slater – 2019-01-04T17:43:23.353
- If your examples are sequential from the same trajectory (it looks like they are), then you will end up with repeats, so the first time step might have `state_label` $s_a$ and `next_state_label` $s_b$, then the second time step might have `state_label` $s_b$ and `next_state_label` $s_c$, etc - or you could make up a state vector for each one (because that's what you'd have for input to the NN). I am asking because I want you to add this information in the way that *you* understand it, so that the answer can explain things to you in your own terms. – Neil Slater – 2019-01-04T17:47:51.163
- I have added the states. Can you now explain it to me? – TVSuchty – 2019-01-04T21:38:29.630
- Let us continue this discussion in chat. – TVSuchty – 2019-01-04T21:41:27.837