Hi, I am trying to develop an RL agent using the PPO algorithm. My agent takes an action (CFM) to maintain a state variable called RAT between 24 and 24.5. I am using the PPO implementation from the stable-baselines library, and I have trained the agent for 2M steps.
Hyper-parameters in the code:

```python
from stable_baselines import PPO2
from stable_baselines.common.policies import FeedForwardPolicy

class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           net_arch=[dict(pi=[64, 64], vf=[64, 64])],
                                           feature_extraction="mlp")

model = PPO2(CustomPolicy, env, gamma=0.8, n_steps=132, ent_coef=0.01,
             learning_rate=1e-3, vf_coef=0.5, max_grad_norm=0.5, lam=0.95,
             nminibatches=4, noptepochs=4, cliprange=0.2, cliprange_vf=None,
             verbose=0, tensorboard_log="./20_01_2020_logs/",
             _init_setup_model=True, policy_kwargs=None,
             full_tensorboard_log=False)
```
Once the agent is trained, I test the actions it takes over an episode.
For the time steps between 40 and 60, the RAT value is higher than 24.5. From domain knowledge, if the agent sets CFM to around 250, it can keep RAT between 24 and 24.5. But the agent does not take such actions; instead it keeps taking actions similar to those of the previous steps.
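To make the domain knowledge concrete, here is a toy stand-in for my environment (the dynamics and constants below are a rough assumption for illustration, not my real plant model): with CFM held at 250, RAT stays inside the band.

```python
class DummyRATEnv:
    """Toy stand-in for the real HVAC environment (an assumption for
    illustration): RAT drifts upward from a heat load and is pulled
    down in proportion to the CFM action."""

    def __init__(self):
        self.rat = 24.2

    def reset(self):
        self.rat = 24.2
        return self.rat

    def step(self, cfm):
        # simple first-order dynamics: constant heat gain of +0.3 per step,
        # cooling proportional to airflow (0.0012 * CFM)
        self.rat += 0.3 - 0.0012 * float(cfm)
        in_band = 24.0 <= self.rat <= 24.5
        # band reward: 0 inside [24, 24.5], negative distance outside
        reward = 0.0 if in_band else -abs(self.rat - 24.25)
        return self.rat, reward, False, {}

env_sketch = DummyRATEnv()
env_sketch.reset()
rat_trace = []
for t in range(60):
    rat, r, done, info = env_sketch.step(250.0)  # the action domain knowledge suggests
    rat_trace.append(rat)
```

Under these made-up dynamics, CFM = 250 exactly balances the heat load, so every value in `rat_trace` stays within the 24 to 24.5 band.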
Can someone help me with how to tackle this problem? Is there any particular hyperparameter I should try tuning?
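One thing I already suspect from my own back-of-envelope calculation: with gamma=0.8, the effective planning horizon of the agent is only about 5 steps, which may be too short if the effect of a CFM change on RAT takes longer than that to show up.

```python
# Rewards more than roughly 1/(1 - gamma) steps ahead are heavily
# discounted, so this is a common rule of thumb for the planning horizon.
gamma = 0.8
effective_horizon = 1.0 / (1.0 - gamma)
print(effective_horizon)  # 5.0
```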