Can exogenous variables be state features in reinforcement learning?


I have a question about state representation in Q-learning or DQN. I'm still a beginner in RL, so I'm not sure whether it is appropriate to use exogenous variables as state features.

For example, in my current project the agent decides whether to charge or discharge an electric vehicle in response to real-time fluctuating electricity prices. I'm wondering whether the prices from the past n steps, or the hour of the day, can be used as state features.

Because both the prices and the hour are simply given at every time step and do not depend on the charging/discharging actions, I doubt whether they theoretically qualify as state features.

If they are not qualified, could someone give me a reference or something that I can read?

JH Lee

Posted 2019-08-25T07:12:44.720

Reputation: 125



Including exogenous variables in your state representation can certainly be useful, as long as you expect them to be relevant for determining which action to pick. State features do not have to be variables that your agent can (partially) influence through its actions; they only need to be informative for choosing the next action or for predicting expected future rewards.
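
As a rough illustration, here is a minimal sketch of how such a state vector might be assembled for the charging example from the question. The function name `make_state`, the window length `n`, the cyclic hour encoding, and the `battery_level` feature are all assumptions made for this example, not something taken from the question; the battery level stands in for whatever part of the state the agent does influence.

```python
import math
from collections import deque


def make_state(recent_prices, hour, battery_level):
    """Build a flat feature vector for a hypothetical EV-charging agent.

    recent_prices: the last n observed prices (exogenous)
    hour:          hour of day, 0-23 (exogenous)
    battery_level: state of charge in [0, 1] (assumed; influenced by actions)
    """
    # Encode the hour on a circle so that 23:00 and 00:00 end up close together.
    angle = 2.0 * math.pi * hour / 24.0
    hour_features = [math.sin(angle), math.cos(angle)]
    return list(recent_prices) + hour_features + [battery_level]


n = 3  # assumed window length
price_window = deque([0.0] * n, maxlen=n)  # keeps only the n most recent prices
for price in [0.21, 0.19, 0.25, 0.30]:
    price_window.append(price)

state = make_state(price_window, hour=6, battery_level=0.5)
```

The resulting list can be fed to a DQN as-is, or discretised if you want a tabular Q-learning agent. The agent never changes the price or hour entries, but they still help it tell a good moment to charge from a bad one.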

However, if you only have exogenous variables, i.e. if you expect your agent to have no influence whatsoever on what states you'll end up in next... then the full problem definition typically used in RL (Markov decision processes) may be unnecessarily complex, and you may prefer to look into the Multi-Armed Bandits (MAB) problem formulation. If you're already familiar with RL / MDPs, you may think of MAB problems as (sequences of) single-step episodes, where you always just look at the current state and don't care at all about future states (because you expect to have 0 influence on them).

In theory, the RL / MDP framework is more general and is also applicable to those MAB problems, but RL algorithms that support this framework may perform worse than MAB algorithms in practice, because they (informally speaking) still put in effort trying to "learn" how their actions affect future states (a waste of effort when you expect there to be no such influence from the agent).
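
To make the difference concrete, here is a small sketch contrasting a tabular Q-learning update with a bandit-style update that conditions on the current state but ignores future states. The function names, the state and action labels, and the values of `alpha` and `gamma` are assumptions chosen for illustration only.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning: the target bootstraps from the next state."""
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])


def bandit_update(q, state, action, reward, alpha=0.1):
    """Bandit-style update: the target is just the immediate reward."""
    q[(state, action)] += alpha * (reward - q[(state, action)])


actions = ["charge", "discharge"]  # hypothetical action labels

q_rl = defaultdict(float)
q_rl[("s2", "charge")] = 1.0  # pretend the next state already looks valuable
q_learning_update(q_rl, "s1", "charge", 1.0, "s2", actions)

q_bandit = defaultdict(float)
bandit_update(q_bandit, "s1", "charge", 1.0)
```

The bandit-style update is what Q-learning reduces to when `gamma` is 0. If the agent truly has no influence on the next state, the bootstrapped term adds noise and extra learning effort without adding useful information.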

Dennis Soemers

Posted 2019-08-25T07:12:44.720

Reputation: 7 644