Reason for the issues caused by correlation in the dataset in DQN


From the paper Human-level Control through Deep Reinforcement Learning, correlation in the data causes instability in the network and may cause the network to diverge. I wanted to understand what this instability and divergence mean, and why correlated data causes this instability.

Karthic Rao

Posted 2018-11-02T08:06:45.560

Reputation: 133

Answers


I wanted to understand what this instability and divergence mean.

These are with reference to learning curves for the neural network. If a neural network is stable and converges, it means that the value of the error or cost function reduces consistently over time and reaches a stable point of minimum error.

In practice this is often a noisy process rather than a smooth one. Some degree of noise is expected and acceptable when solving real problems. However, an unstable learning curve will oscillate wildly, and a divergent learning curve will show the error getting consistently worse during training.

A typical stable, converging learning curve (cost function versus training data consumed) might look like this:

Imaginary stable converging learning curve

Whilst an unstable, diverging learning curve might look like this:

Imaginary unstable diverging learning curve

These plots do not use the same vertical scale: the lowest point of the unstable curve will typically be higher than most or even all of the stable curve.
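Purely for illustration, curves with this general shape could be generated with something like the sketch below (matplotlib is assumed to be available, and the formulas are made up for the picture, not taken from any real training run):

import numpy as np
import matplotlib.pyplot as plt

steps = np.arange(1000)
# Made-up stable curve: noisy cost that decays towards a minimum
stable_cost = 1.0 / (1.0 + 0.01 * steps) + 0.02 * np.abs(np.random.randn(1000))
# Made-up unstable curve: cost that oscillates and drifts upwards
unstable_cost = 0.5 + 0.002 * steps + 0.5 * np.abs(np.random.randn(1000))

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(steps, stable_cost)
ax1.set_title("Stable, converging")
ax2.plot(steps, unstable_cost)
ax2.set_title("Unstable, diverging")
for ax in (ax1, ax2):
    ax.set_xlabel("training examples seen")
    ax.set_ylabel("cost")
plt.show()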

And why does correlated data cause this instability?

This is because, for gradient descent to work, the gradient samples used on each weight update step need to be unbiased estimates of the true gradient. In RL you have either an online learning process or non-stationary targets (and usually both), so you must use stochastic or mini-batch gradient descent, working with a few samples at a time. You need those samples to be independent, related to each other only by random chance; otherwise the gradient estimate will be biased and gradient descent will consistently make updates in the wrong overall direction.

A good way to illustrate the difference is a really simple example: using gradient descent updates to estimate a mean value (this is roughly equivalent to training a neural network with a single neuron whose weight is fixed to zero, learning only a bias value to represent the mean of the target; no actual input is required).

Say we have an array of values from 0 to 200 inclusive (example code is in Python):

import numpy as np
train_y = np.arange(0, 201)

If this array is kept sorted, then sequential values are highly correlated. If you plot consecutive pairs against each other, you will get a straight line.
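To see that correlation directly, you could plot each value against the one that follows it (a minimal sketch, assuming matplotlib is available):

import matplotlib.pyplot as plt

# For the sorted array, each value plotted against the next one lies on a straight line
plt.scatter(train_y[:-1], train_y[1:], s=4)
plt.xlabel("y[i]")
plt.ylabel("y[i+1]")
plt.show()

Repeating the same plot after shuffling the array (as done further below) gives a cloud of points with no visible structure.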

We can estimate the mean value by setting a "bias" value to some arbitrary number, and running an update rule based on MSE (between current bias and observed value):

mean_estimate = 0.0
alpha = 0.1  # learning rate
for y in train_y:
    # Gradient descent step on the squared error between estimate and observation
    mean_estimate += alpha * (y - mean_estimate)
print(mean_estimate)

This prints roughly $191$ as the estimate for the mean, almost double the true value.
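To see roughly where that number comes from: the update rule is an exponential moving average, so after one pass the estimate is approximately a geometrically weighted sum of the most recent values,

$$\hat{\mu} \approx \sum_{k=0}^{\infty} \alpha (1-\alpha)^k \, y_{n-1-k}.$$

With the sorted array, $y_{n-1-k} = 200 - k$, giving $\hat{\mu} \approx 200 - \frac{1-\alpha}{\alpha} = 200 - 9 = 191$: the estimate tracks the largest, most recent values rather than the whole dataset.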

However, if we shuffle the array first, this removes the correlation. If you plot consecutive pairs against each other, you will get a scatter graph with no apparent pattern. Adding the one line np.random.shuffle(train_y) changes the results radically:

mean_estimate = 0.0
alpha = 0.1
np.random.shuffle(train_y)  # remove the correlation between consecutive samples
for y in train_y:
    mean_estimate += alpha * (y - mean_estimate)
print(mean_estimate)

We get much better estimates (typically between $90$ and $110$), closer to the true value of $100$, and not biased to be higher or lower (run it enough times and you would find the expected result of this algorithm is very close to the true value).
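A quick way to check that last claim (reusing train_y and alpha from above) is to repeat the shuffled run many times and look at the average of the results:

estimates = []
for run in range(1000):
    np.random.shuffle(train_y)
    mean_estimate = 0.0
    for y in train_y:
        mean_estimate += alpha * (y - mean_estimate)
    estimates.append(mean_estimate)
# The average over many runs sits very close to the true mean of 100
print(np.mean(estimates), np.std(estimates))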

This is because in the first version of the code, the gradient of the error was not sampled fairly - it kept pointing "up" even for high estimates due to the correlation. In the shuffled version, gradients are likely to be in either direction depending only on the current estimate, and will appear roughly in the ratios necessary to find the correct value.

As an exercise you could extend this simple example with mini-batches and repeated "epochs" to show that the effect persists with those changes, and that shuffling remains the most important change for getting better estimates.
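A minimal sketch of that exercise might look like the following (the batch size and number of epochs are arbitrary choices); comment out the shuffle to see the biased behaviour return:

mean_estimate = 0.0
alpha = 0.1
batch_size = 10
data = np.arange(0, 201)
for epoch in range(5):
    np.random.shuffle(data)  # comment this out to keep the data sorted and correlated
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # One mini-batch update: average the per-sample gradients
        mean_estimate += alpha * np.mean(batch - mean_estimate)
print(mean_estimate)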

Neil Slater

Posted 2018-11-02T08:06:45.560

Reputation: 14 632

Thank you for the detailed answer. "The bias in gradient", "fair sampling", and "shoots up in one direction" were the key takeaways. Again, I appreciate your time. – Karthic Rao – 2018-11-03T11:19:43.413