Different results every time I train a reinforcement learning agent


I am training an RL agent for a control problem using the PPO algorithm, with the stable-baselines library.

The objective of the agent is to maintain a temperature of 24 deg in a zone, and it takes actions every 15 mins. The length of an episode is 9 hrs. I have trained the model for 1 million steps and the rewards have converged, so I assume the agent is trained enough. I have done some experiments and have a few questions regarding the training:

  1. I test the agent by letting it take actions from a fixed initial state, and monitor the actions taken and the resulting states over an episode. When I test the agent multiple times, the actions taken and the resulting states are different every time. Why does this happen if the agent is trained enough?

  2. I train an agent for 1 million steps. I train another agent for 1 million steps on the same environment with the same set of hyperparameters and everything else the same. Both agents converge. But when I test these agents, the actions they take are not identical/similar. Why is this so?

Can someone help me with these?

Thank you


Posted 2019-11-06T11:03:14.093

Reputation: 455



  1. Part of the agent's behavior consists of taking random actions: there is some probability that the agent will take a random action instead of the one favored by its training. This is called "exploration". This page describes it as follows: "The amount of randomness in action selection depends on both initial conditions and the training procedure. Over the course of training, the policy typically becomes progressively less random, as the update rule encourages it to exploit rewards that it has already found."
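To illustrate the difference, here is a minimal NumPy sketch (not stable-baselines code; the probabilities are made up for illustration): a stochastic policy outputs a probability distribution over actions. Sampling from that distribution can give a different action on every call, while greedily taking the argmax always gives the same one.

```python
import numpy as np

# Hypothetical action probabilities produced by a trained stochastic policy
# for one observation (3 discrete actions).
action_probs = np.array([0.7, 0.2, 0.1])

rng = np.random.default_rng()

# Sampling, which is what a stochastic policy does at test time by default:
# the chosen action can differ between calls, even for the same observation.
sampled = [int(rng.choice(len(action_probs), p=action_probs)) for _ in range(5)]

# Greedy selection: always picks the highest-probability action.
greedy = int(np.argmax(action_probs))
print(sampled)  # may vary, e.g. mostly 0 with occasional 1 or 2
print(greedy)   # always 0 for these probabilities
```

If you want reproducible test-time behavior in stable-baselines, `model.predict(obs, deterministic=True)` selects actions greedily instead of sampling.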

  2. This is normal. The agent's network is initialized with random weights, and some of the actions it takes during training are also random (see above). So different training runs will produce different results. If you want to avoid this, you could use a fixed seed for the random-number generator.
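As a sketch of why seeding matters (pure NumPy, standing in for the random initialization and sampling that happen during training): two runs with the same seed produce identical random draws, so the resulting "agents" are identical, while runs with different seeds generally diverge.

```python
import numpy as np

def init_weights(seed):
    # Stand-in for a network's random weight initialization at the
    # start of a training run.
    rng = np.random.default_rng(seed)
    return rng.normal(size=(4, 2))

# Same seed -> identical initialization, so identical training trajectories
# (assuming everything else, including the environment, is also seeded).
w1 = init_weights(seed=42)
w2 = init_weights(seed=42)
assert np.array_equal(w1, w2)

# Different (or no) seed -> different weights, hence different trained agents.
w3 = init_weights(seed=7)
assert not np.array_equal(w1, w3)
```

In stable-baselines you would additionally need to seed every source of randomness: `set_global_seeds` from `stable_baselines.common` seeds Python, NumPy, and TensorFlow at once, and the environment has its own `env.seed(...)`.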


Posted 2019-11-06T11:03:14.093

Reputation: 41

I understand that PPO learns a stochastic policy, hence there is some randomness in the actions taken. But once the agent is trained enough, shouldn't it take the same action most of the time? – cvg – 2019-11-06T12:35:02.737

Thanks for clarifying the second point – cvg – 2019-11-06T12:42:20.463

The "randomness" of the actions taken depends on the entropy value. During training this value should reach a peak and then slowly decline, making the actions less random as the training progresses. Entropy is regulated by the entropy coefficient (ent_coef in baselines). If entropy declines too slowly during training, you should try decreasing the entropy coefficient. – user1939088 – 2019-11-06T13:46:17.187

To understand why your agent acts the way it does, you can use Tensorboard to monitor the value of entropy during training. If you don’t want to have any randomness at all after training, you can tweak the agent code to remove entropy from the equation. – user1939088 – 2019-11-06T13:47:28.963

Hey, for the second point: I understand that different seeds produce different results. So if I want to deploy an agent to production and I have two trained models which produce different results because of different seeds, which model should I use for deployment? – cvg – 2019-11-07T09:55:07.723

Also, I have integrated TensorBoard into my training and wish to monitor entropy during training; is entropy_loss the graph I should be monitoring? – cvg – 2019-11-07T10:05:53.113