Why is it not advisable to have a 100 percent exploration rate?


During the learning phase, why don't we use a 100% exploration rate, to allow our agent to fully explore the environment and update the Q-values, and then bring in exploitation during testing? Does that make more sense than decaying the exploration rate?


Posted 2020-06-26T16:58:12.707

Reputation: 199

Question was closed 2020-06-29T19:47:43.617



No - imagine if you were playing an Atari game and took completely random actions. Your games would not last very long and you would never get to experience all of the state space because the game would end too soon. This is why you need to combine exploration and exploitation to fully explore the state space.
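For reference, the usual compromise is $\epsilon$-greedy action selection. A minimal sketch, assuming a tabular Q function stored as a dict keyed by `(state, action)` pairs (the names here are illustrative, not from any particular library):

```python
import random

def epsilon_greedy(q_table, state, actions, epsilon):
    """With probability epsilon take a random action (explore);
    otherwise take the action with the highest Q-value (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)                        # explore
    return max(actions, key=lambda a: q_table[(state, a)])   # exploit
```

With $\epsilon = 1$ this degenerates into the pure random play described above; intermediate values let episodes last long enough to reach deeper parts of the state space.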

David Ireland

Posted 2020-06-26T16:58:12.707

Reputation: 1 942

But you get my point: even if the game ends quickly, suppose you have a lot of episodes; eventually the whole state space would be explored, then when we want to test it, we bring in exploitation – Chukwudi – 2020-06-26T17:05:30.603

And also, after full training, why is it not advisable to turn off exploration? Why do we still leave room for a small amount of exploration? My guess is that it's for the case of a dynamic environment where new states might appear – Chukwudi – 2020-06-26T17:07:02.860

@Chukwudi no, if you took random actions in an Atari game you would never win the game in a feasible number of episodes. Maybe theoretically in the limit you would, but remember that we have limited compute power, and thus you need to exploit the good actions so that we can keep learning from them by backing up the Q function for these state-action values. However, exploration must be maintained to some degree to ensure that we aren't missing any better actions. – David Ireland – 2020-06-26T18:13:41.563

What if we're sure we've covered every state and the Q-values are very accurate, assuming our state space is small? Then what would be the use of still exploring? – Chukwudi – 2020-06-26T18:20:13.850

In terms of making Q-learning scale well in environments with a large state space, there must be both exploration and exploitation, since the aim of Q-learning is to find an optimal policy, i.e. an optimal path from start to goal in an MDP. Exploration checks whether there is a better action to take in a given state, leading to a better policy, while exploitation allows the agent to find it in a feasible amount of time. Note that some environments can be huge, e.g. $256^{128}$ states. – rert588 – 2020-06-26T18:27:58.470

@Chukwudi: If you are sure that you are done, then you are done and there is no need to continue learning process. But that is not what you asked in the question – Neil Slater – 2020-06-26T18:28:59.860

Decaying exploration as training proceeds is done because as the policy gets better there is not much need to explore since the agent might have already found the actions that give the highest reward, given a state. – rert588 – 2020-06-26T18:30:01.547

Decaying the exploration rate is not the only exploration-exploitation strategy, there are others like mellow-max. – rert588 – 2020-06-26T18:33:27.930
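To make the decaying-rate idea from the comments above concrete, a common (purely illustrative) schedule is exponential decay with a floor, so exploration shrinks as the policy improves but never vanishes entirely:

```python
def decayed_epsilon(episode, eps_start=1.0, eps_min=0.05, decay_rate=0.99):
    # Exponential decay per episode, floored at eps_min so some
    # exploration is always retained (useful if the environment drifts
    # or new states appear, as discussed above).
    return max(eps_min, eps_start * decay_rate ** episode)
```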

Thank you both, you’ve cleared my issue – Chukwudi – 2020-06-26T19:38:50.273

The policy you're improving is your absolutely greedy policy, right? – Chukwudi – 2020-06-26T19:39:22.230

epsilon-greedy policy – rert588 – 2020-06-26T20:07:08.223

No, epsilon-greedy is the behaviour policy, not the update policy – Chukwudi – 2020-06-26T20:53:16.273


While theoretically you can do something like this if you're very confident you'll cover most of the state space in exploration, this is still a suboptimal strategy. Even in the case of multi-armed bandits, this strategy can be much less sample efficient than $\epsilon$-greedy, and exploration is much easier in this case.

So, even if your strategy miraculously works on a decently sized MDP, it'll be worse than combining exploration and exploitation.
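To see the bandit point concretely, here is a toy sketch (all names and parameters are illustrative) comparing a pure explore-then-exploit strategy against $\epsilon$-greedy on a Bernoulli bandit:

```python
import random

def run_bandit(strategy, probs, steps=5000, seed=0):
    """Run one bandit experiment; returns the total reward collected."""
    rng = random.Random(seed)
    counts = [0] * len(probs)    # pulls per arm
    values = [0.0] * len(probs)  # empirical mean reward per arm
    total = 0
    for t in range(steps):
        arm = strategy(t, values, rng)
        reward = 1 if rng.random() < probs[arm] else 0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        total += reward
    return total

def explore_then_exploit(t, values, rng, explore_steps=50):
    # Explore uniformly for a fixed budget, then commit to the
    # empirically best arm forever -- a noisy estimate can lock in
    # the wrong arm for the rest of the run.
    if t < explore_steps:
        return rng.randrange(len(values))
    return values.index(max(values))

def eps_greedy(t, values, rng, eps=0.1):
    # Keep exploring at a small rate forever, so a bad early
    # estimate can still be corrected later.
    if rng.random() < eps:
        return rng.randrange(len(values))
    return values.index(max(values))
```

Running both on, say, `probs = [0.2, 0.8]` and averaging over many seeds typically shows the committed strategy paying for any unlucky exploration phase, which is the regret argument above.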


Posted 2020-06-26T16:58:12.707

Reputation: 606

Why do you say it'll be worse? The whole point of me using just exploration is to cover all the states and improve my policy – Chukwudi – 2020-06-28T01:02:40.713

Because if you're doing 100% exploration, you'll be revisiting states very often as well. Furthermore, if your state space is large, covering the whole state space is intractable. In theory, as long as you explore forever (like in $\epsilon$-greedy, for instance), you can still cover the whole state space, but you'll be exploiting way more and incurring less regret. – harwiltz – 2020-06-28T14:47:59.197

But I need a concrete answer as to why we need to combine exploration and exploitation, because to me, 100% exploration would cover the entire state space after, say, 1000 episodes; then we can start exploiting – Chukwudi – 2020-06-29T02:04:05.767

Unless your state space is very small, you won't cover 100% of the state space with full exploration. Especially if the environment is difficult to navigate, without any exploitation your agent will probably just keep failing immediately (think CartPole for instance). For a concrete answer, like I said, look into the Multi-Armed Bandit literature (even Sutton and Barto). You can prove that your strategy does not achieve optimal regret. – harwiltz – 2020-06-29T14:21:42.747

Someone used the example of a robot balancing on a rope: if I use 100% exploration, then the random actions will make the episodes shorter, and I will never be able to explore the environment in that number of episodes – Chukwudi – 2020-06-29T19:47:23.617

Right, that's a good example – harwiltz – 2020-06-29T21:46:47.987