Agent always takes the same action in DQN - Reinforcement Learning


I have trained an RL agent using the DQN algorithm. After 20000 episodes my rewards have converged. Now when I test this agent, it always takes the same action, irrespective of the state. I find this very weird. Can someone help me with this? Is there any reason anyone can think of why the agent is behaving this way?

Reward plot


When I test the agent

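import numpy as np

state = env.reset()
print('State: ', state)

# Reshape to a batch of one and pick the greedy action.
state_encod = np.reshape(state, [1, state_size])
q_values = model.predict(state_encod)
action_key = np.argmax(q_values)

# Collect the Q-value of each action for plotting.
q_values_plotting = []
for i in range(action_size):
    q_values_plotting.append(q_values[0][i])  # assumed body; the loop was cut off in the post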


Every time it gives the same q_values plot, even though the initial state is different every time. Below is the q_values plot.




from copy import deepcopy

test_rewards = []
for episode in range(1000):
    terminal_state = False
    state = env.reset()
    episode_reward = 0
    while not terminal_state:
        print('State: ', state)
        # Reshape to a batch of one and act greedily on the predicted Q-values.
        state_encod = np.reshape(state, [1, state_size])
        q_values = model.predict(state_encod)
        action_key = np.argmax(q_values)
        action = index_to_action_mapping[action_key]
        print('Action: ', action)
        # This custom environment takes the current state as well as the action.
        next_state, reward, terminal_state = env.step(state, action)
        print('Next_state: ', next_state)
        print('Reward: ', reward)
        print('Terminal_state: ', terminal_state, '\n')
        episode_reward += reward
        state = deepcopy(next_state)
    test_rewards.append(episode_reward)  # record the episode return for plotting
    print('Episode Reward: ' + str(episode_reward))


[Image: plot of test episode rewards]



Posted 2019-10-04T15:02:21.897

Reputation: 455

Is taking the same action in every state in any way close to optimal behaviour? Or is it worse than behaving randomly? How are you measuring "my rewards are converged" and what else are you measuring? Have you plotted episode return vs number of episodes (smoothed)? For concreteness, it may be useful to share details of the environment, your state representation, the actions and rewards. This would help in case you have made a mistake in problem analysis. It is more likely, though, that you have an implementation detail wrong, as there are lots of places where a DQN agent's implementation can go wrong. – Neil Slater – 2019-10-04T18:30:20.353
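
The smoothed return curve asked about in the comment above could be computed roughly as follows. This is only a sketch: `moving_average`, the window size, and the synthetic `episode_returns` list are illustrative assumptions, not part of the question's code.

```python
import numpy as np

def moving_average(returns, window=100):
    """Mean return over each run of `window` consecutive episodes."""
    returns = np.asarray(returns, dtype=float)
    if len(returns) < window:
        raise ValueError("need at least `window` episodes")
    kernel = np.ones(window) / window
    # mode="valid" keeps only positions where the full window fits.
    return np.convolve(returns, kernel, mode="valid")

# Synthetic returns for illustration (real values would come from training).
episode_returns = [0.0, 2.0, 4.0, 6.0, 8.0]
smoothed = moving_average(episode_returns, window=2)
```

Plotting `smoothed` instead of the raw returns makes it easier to judge whether the curve has really flattened or is just noisy.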

Hi, is there a way I can share my ipython notebook or code? – cvg – 2019-10-04T18:43:16.203

I am plotting the total reward per episode against the episode number. It converges after 10000 episodes. Please suggest any other criteria that should be checked before assuming the agent is trained enough. – cvg – 2019-10-04T18:48:15.663

Yes you can put a link to the notebook into the question. However, please don't expect volunteers here to work on and debug the project based on the question as is. Add the link, and also summarise the important details in the question - use [edit] – Neil Slater – 2019-10-04T18:48:28.653

One related question then - when you test the agent does it get the same amount of reward as you are plotting during training? – Neil Slater – 2019-10-04T18:49:42.793

When I test the trained agent, the rewards vary each time I run an episode. – cvg – 2019-10-05T13:09:43.387

That's not what I meant. Are the rewards the agent receives during testing consistent with the values it receives during training? In other words, your training routine appears to be converging on a stable expected reward total per episode, so you think training is complete. Then you test the agent and note that it is always taking the same action. If you plotted the results from those test episodes, the same way you plotted them during training, would the graph show a similar level? – Neil Slater – 2019-10-05T13:14:09.507

Hi, I have plotted the test results (edited and added to the question). The test rewards are similar to the training rewards. What does this mean? Why is the agent always taking the same action? – cvg – 2019-10-05T14:56:37.257

Not really possible to say. I don't see any obvious errors in your plotting code. You may need to explain the control problem itself. Am I correct in thinking from your Q values plot that you have 500 possible actions, and that it is picking an action with id around 250 as the maximising action in each state? – Neil Slater – 2019-10-05T15:24:44.280

Yes, your understanding is correct. – cvg – 2019-10-05T16:03:46.350

Can you think of any mistakes in training the agent that could lead to this behaviour? – cvg – 2019-10-05T16:10:51.893

Add randomness (an $\epsilon$-greedy strategy, etc.) and make sure the replay buffer receives new data each episode (also be sure to purge bad old replays). Also, could you check that predict outputs different values each time? – quester – 2019-10-15T20:26:53.497
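
A rough sketch of the two suggestions in the comment above: $\epsilon$-greedy action selection, and a check that the network's output actually depends on its input. The helper names (`epsilon_greedy`, `q_values_vary`) and the two stand-in "networks" are hypothetical; with a real model, `predict_fn` would be something like `model.predict`.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon, rng):
    """Random action with probability epsilon, otherwise the greedy one."""
    q = np.asarray(q_values).ravel()  # accepts shape (1, n) as well as (n,)
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

def q_values_vary(predict_fn, states, tol=1e-6):
    """True if the predicted Q-values differ across the given states."""
    q = np.vstack([np.asarray(predict_fn(s[None, :])).ravel() for s in states])
    spread = q.max(axis=0) - q.min(axis=0)  # per-action range over states
    return bool(spread.max() > tol)

# Stand-in predictors for illustration only (assumptions, not the real model).
states = rng.normal(size=(5, 3))
constant_net = lambda x: np.ones((1, 4))      # ignores its input
linear_net = lambda x: x @ np.ones((3, 4))    # depends on its input

input_ignored = not q_values_vary(constant_net, states)
input_used = q_values_vary(linear_net, states)
```

If the real model behaves like `constant_net` here, the problem is in the network or its inputs rather than in the action selection.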

Check whether your training data is skewed, e.g. 90% of the values are 0 or something similar. – quester – 2019-10-15T20:42:09.153
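
One way to look for the kind of skew mentioned above is to count how often each action and a zero reward appear in the stored transitions. This sketch assumes the replay buffer is a list of `(state, action_index, reward, next_state, done)` tuples; that layout and the function names are assumptions, since the question does not show the buffer.

```python
from collections import Counter

def action_share(transitions):
    """Fraction of stored transitions that used each action index."""
    counts = Counter(t[1] for t in transitions)
    total = sum(counts.values())
    return {action: n / total for action, n in counts.items()}

def zero_reward_share(transitions):
    """Fraction of stored transitions whose reward is exactly zero."""
    return sum(1 for t in transitions if t[2] == 0) / len(transitions)

# Tiny made-up buffer for illustration.
buffer = [
    (0, 1, 0.0, 1, False),
    (1, 1, 0.0, 2, False),
    (2, 1, 1.0, 3, False),
    (3, 0, 0.0, 4, True),
]
shares = action_share(buffer)
zeros = zero_reward_share(buffer)
```

A buffer where one action or one reward value dominates gives the network little signal to tell states apart.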



This may seem obvious, but have you tried using a Boltzmann distribution for action selection instead of argmax? This is known to encourage exploration and can be done by setting the action policy to

$$p(a|s) = \frac{\exp(\beta Q(a,s))}{\sum_{a'} \exp(\beta Q(a',s))},$$

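where $\beta$ is the inverse temperature parameter, which governs the exploration-exploitation trade-off (a larger $\beta$ gives greedier behaviour, and $\beta = 0$ gives uniformly random actions). This is also known as the softmax distribution.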

Put into code, this would be something like this:

beta = 1.0
q = q_values.flatten()  # model.predict returns shape (1, n_actions); p must be 1-D
p_a_s = np.exp(beta * q) / np.sum(np.exp(beta * q))
action_key = np.random.choice(a=len(p_a_s), p=p_a_s)

This can lead to numerical instabilities because of the exponential, but that can be handled, e.g., by first subtracting the highest Q-value:

q_values = q_values - np.max(q_values)
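
Putting the sampling and the max-subtraction together, a self-contained version might look like the sketch below. The function names are illustrative, and it assumes $\beta \ge 0$ and that `q_values` is either 1-D or has shape `(1, n_actions)`.

```python
import numpy as np

def boltzmann_probs(q_values, beta=1.0):
    """Softmax over Q-values, shifted so the largest exponent is zero."""
    q = np.asarray(q_values, dtype=float).ravel()
    z = beta * (q - np.max(q))  # all entries <= 0 for beta >= 0, so exp cannot overflow
    e = np.exp(z)
    return e / e.sum()

def boltzmann_action(q_values, beta=1.0, rng=None):
    """Sample an action index from the Boltzmann distribution."""
    rng = np.random.default_rng() if rng is None else rng
    p = boltzmann_probs(q_values, beta)
    return int(rng.choice(len(p), p=p))

# Large Q-values that would overflow a naive np.exp(q).
probs = boltzmann_probs([[1000.0, 1000.0]])
action = boltzmann_action([[0.2, 0.5, 0.1]], beta=2.0, rng=np.random.default_rng(0))
```

Subtracting the maximum does not change the probabilities, because the common factor cancels between numerator and denominator.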



Reputation: 1 752

astute observation – hh32 – 2020-04-09T09:09:41.647


  • The action taken by the agent may actually be the optimal action.
  • If the same state is fed in at every step, you will keep getting the same action and reward. The state might not be updating properly. Since next_state is carried over to the next step by your own loop, check the env.step call and the deepcopy.
  • The model might not be updating its parameters or its Q-values. Check how the model updates its parameters and Q-values.
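
To test the last point, one option is to snapshot the network's weights before and after a training step and compare them. The sketch below works on plain lists of NumPy arrays; with a Keras model these could come from `model.get_weights()`, but whether the question uses Keras is an assumption, and `weights_changed` is a hypothetical helper.

```python
import numpy as np

def weights_changed(before, after, tol=1e-8):
    """True if any weight array moved by more than tol between snapshots."""
    return any(np.max(np.abs(b - a)) > tol for b, a in zip(before, after))

# Made-up snapshots for illustration.
before = [np.zeros((2, 2)), np.zeros(2)]
after_no_update = [w.copy() for w in before]
after_update = [np.zeros((2, 2)), np.array([0.0, 0.1])]

stuck = not weights_changed(before, after_no_update)
learning = weights_changed(before, after_update)
```

If the weights never change during training, the optimizer step or the target computation is the first place to look.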



Reputation: 154