Reward is converging but the actions taken by the trained agent are illogical (reinforcement learning)


I am training a reinforcement learning agent using DQN. My state space has 6 variables, and the agent takes one action, which is discretized into 500 discrete actions.

My reward structure looks like this:

    thermal_coefficient = -0.1

    zone_temperature = output[6]

    if zone_temperature < self.temp_sp_min:
        temp_penalty = self.temp_sp_min - zone_temperature
    elif zone_temperature > self.temp_sp_max:
        temp_penalty = zone_temperature - self.temp_sp_max
    else:
        temp_penalty = 0

    reward = thermal_coefficient * temp_penalty

My temp_sp_min is 23.7 and temp_sp_max is 24.5. When I train the agent with an epsilon-greedy action-selection strategy, the rewards converge after around 10,000 episodes. When I then test the trained agent, the actions it takes don't make sense: when zone_temperature is below temp_sp_min, it takes an action that reduces zone_temperature even further.

I don't understand where I am going wrong. Can someone help me with this?



Posted 2019-10-03T11:47:18.593

Reputation: 455



Without seeing the rest of your code, this is a bit tricky to answer, but you have to make sure that reward = -temp_penalty, i.e. the negative of the penalty; otherwise you would learn exactly the behaviour you described. Here I'm assuming you have excluded all other potential sources of error.

Further, I think it might be helpful to issue a positive reward while the agent stays within the limits you defined, i.e. in the else clause set temp_penalty = -1. or something like that (with your negative thermal_coefficient, that turns into a positive reward). I personally found this kind of reward shaping to be very helpful for DQN.
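To make the suggestion concrete, here is a minimal sketch of the shaped reward, using the band limits and coefficient from your question; the in-band value of -1. is just the illustrative bonus I mentioned, and the standalone function signature is my own framing of your snippet:

```python
def compute_reward(zone_temperature, temp_sp_min=23.7, temp_sp_max=24.5,
                   thermal_coefficient=-0.1):
    """Penalty proportional to the distance outside the comfort band,
    plus a small positive reward while the temperature stays inside it."""
    if zone_temperature < temp_sp_min:
        temp_penalty = temp_sp_min - zone_temperature  # positive: below the band
    elif zone_temperature > temp_sp_max:
        temp_penalty = zone_temperature - temp_sp_max  # positive: above the band
    else:
        temp_penalty = -1.0  # inside the band: coefficient flips this to +0.1
    # thermal_coefficient is negative, so positive penalties become negative rewards
    return thermal_coefficient * temp_penalty
```

With this shaping, the reward is strictly higher inside the band than anywhere outside it, so the greedy policy has no incentive to drift away from the setpoint range.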


Posted 2019-10-03T11:47:18.593

Reputation: 1 409

Hi, updated the question with the reward function that I have been using. – cvg – 2019-10-03T14:38:01.957

I am giving a negative penalty only – cvg – 2019-10-03T14:43:16.627