I am training a reinforcement learning agent using DQN. My state space has 6 variables, and the agent has a single continuous action that I discretize into 500 actions.
My reward structure looks like this:
```python
thermal_coefficient = -0.1
zone_temperature = output

if zone_temperature < self.temp_sp_min:
    temp_penalty = self.temp_sp_min - zone_temperature
elif zone_temperature > self.temp_sp_max:
    temp_penalty = zone_temperature - self.temp_sp_max
else:
    temp_penalty = 0

reward = thermal_coefficient * temp_penalty
```
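To make the reward easy to check in isolation, here is a self-contained version of the same logic with the setpoints hard-coded (the standalone function name is just for illustration). Note the reward is zero inside the deadband and negative outside it, so it is never positive:

```python
TEMP_SP_MIN = 23.7
TEMP_SP_MAX = 24.5
THERMAL_COEFFICIENT = -0.1

def reward(zone_temperature):
    # Penalty grows linearly with the distance outside the [min, max] band
    if zone_temperature < TEMP_SP_MIN:
        temp_penalty = TEMP_SP_MIN - zone_temperature
    elif zone_temperature > TEMP_SP_MAX:
        temp_penalty = zone_temperature - TEMP_SP_MAX
    else:
        temp_penalty = 0.0
    # Negative coefficient: being out of band is punished
    return THERMAL_COEFFICIENT * temp_penalty
```
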
`temp_sp_min` is 23.7 and `temp_sp_max` is 24.5. When I train the agent with an epsilon-greedy action selection strategy, the rewards converge after around 10,000 episodes. But when I test the trained agent, the actions it takes don't make sense: when `zone_temperature` is below `temp_sp_min`, it takes an action that reduces `zone_temperature` even further.
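For reference, my action selection is the standard epsilon-greedy over the 500 discrete actions; roughly this (simplified sketch, variable and function names are illustrative, not my exact code):

```python
import numpy as np

rng = np.random.default_rng(0)

def select_action(q_values, epsilon):
    """Epsilon-greedy: explore with probability epsilon, else pick argmax Q."""
    if rng.random() < epsilon:
        # Uniform random action over the discretized action space
        return int(rng.integers(len(q_values)))
    # Greedy action: highest estimated Q-value
    return int(np.argmax(q_values))

# 500 discrete actions, as in the setup above
q_values = np.zeros(500)
q_values[42] = 1.0
print(select_action(q_values, 0.0))  # epsilon = 0 -> always greedy -> 42
```
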
I don't understand where I am going wrong. Can someone help me with this?