I am training an RL agent for a control problem using the PPO algorithm from the stable-baselines library.
The agent's objective is to maintain a temperature of 24 °C in a zone. It takes an action every 15 minutes, and an episode lasts 9 hours (36 steps). I trained the model for 1 million steps and the rewards have converged, so I assume the agent is sufficiently trained. I ran some experiments and have a few questions about the results.
I test the agent by letting it act from a fixed initial state and recording the actions it takes and the resulting states over one episode. When I repeat this test several times, the actions and states are different every time. Why does this happen if the agent is sufficiently trained?
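A minimal sketch of the test described above, using only the standard library. Everything in it is an assumption for illustration: `policy_mean`, the additive toy dynamics, and `action_std` are hypothetical stand-ins, not the real environment or the trained PPO model. It contrasts sampling actions from a Gaussian policy with always taking the policy's mean action, starting from the same initial temperature.

```python
import random

SETPOINT = 24.0          # target zone temperature in deg C (from the post)
STEPS_PER_EPISODE = 36   # 9 h episode with one action every 15 min

def policy_mean(temp):
    """Hypothetical stand-in for the mean action of a trained policy."""
    return 0.5 * (SETPOINT - temp)

def rollout(initial_temp, deterministic, rng, action_std=0.3):
    """Run one episode and return the visited temperatures."""
    temp = initial_temp
    temps = []
    for _ in range(STEPS_PER_EPISODE):
        mean = policy_mean(temp)
        # Stochastic mode samples around the mean; deterministic mode uses the mean itself.
        action = mean if deterministic else rng.gauss(mean, action_std)
        temp = temp + action  # toy dynamics (assumption): the action shifts the temperature
        temps.append(temp)
    return temps

rng = random.Random(0)
sampled_a = rollout(20.0, deterministic=False, rng=rng)
sampled_b = rollout(20.0, deterministic=False, rng=rng)
greedy_a = rollout(20.0, deterministic=True, rng=rng)
greedy_b = rollout(20.0, deterministic=True, rng=rng)

print("sampled rollouts identical:", sampled_a == sampled_b)
print("deterministic rollouts identical:", greedy_a == greedy_b)
```

In stable-baselines, `model.predict` accepts a `deterministic` argument that plays a similar role to the flag in this sketch; whether it explains the behaviour in the actual setup would depend on how the test loop calls it and on whether the environment itself is stochastic.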
I train one agent for 1 million steps, then train a second agent for 1 million steps on the same environment with the same set of hyperparameters and everything else unchanged. Both agents converge. However, when I test them, the actions they take are not identical or even similar. Why is this so?
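A minimal sketch of the second experiment, again with hypothetical stand-ins. The "training" here is plain random hill climbing on a single controller gain, not PPO, and `episode_cost` is a toy model rather than the real environment. It only illustrates that two runs with identical settings but different random seeds can end at different parameters, while reusing the seed reproduces the run exactly.

```python
import random

SETPOINT = 24.0  # target zone temperature in deg C (from the post)

def episode_cost(gain, initial_temp=20.0, steps=36):
    """Sum of absolute temperature errors over one toy episode."""
    temp = initial_temp
    cost = 0.0
    for _ in range(steps):
        temp += gain * (SETPOINT - temp)  # toy proportional controller (assumption)
        cost += abs(SETPOINT - temp)
    return cost

def train(seed, iterations=200):
    """Random hill climbing on the gain; the seed drives all randomness."""
    rng = random.Random(seed)
    gain = rng.uniform(0.0, 0.2)  # random initialization
    best = episode_cost(gain)
    for _ in range(iterations):
        candidate = gain + rng.gauss(0.0, 0.05)  # random exploration step
        cost = episode_cost(candidate)
        if cost < best:  # keep the candidate only if it improves the cost
            gain, best = candidate, cost
    return gain

gain_a = train(seed=0)
gain_b = train(seed=1)
gain_a_again = train(seed=0)

print("seed 0:", gain_a)
print("seed 1:", gain_b)
print("same seed reproduces the result:", gain_a == gain_a_again)
```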
Can someone help me understand these two observations?