Why can the Monte Carlo epsilon-soft approach not compute $$\max Q(s,a)$$?


I am new to reinforcement learning and am currently reading up on the estimation of $$Q_{\pi}(s, a)$$ values using the MC epsilon-soft approach, and came across this algorithm. The link to the algorithm is this website:

https://www.analyticsvidhya.com/blog/2018/11/reinforcement-learning-introduction-monte-carlo-learning-openai-gym/

import random

# Helper functions create_random_policy, create_state_action_dictionary and
# run_game are defined earlier in the linked blog post.

def monte_carlo_e_soft(env, episodes=100, policy=None, epsilon=0.01):
    if not policy:
        policy = create_random_policy(env)

    # Dictionary mapping each state to a dictionary of action values Q[s][a]
    Q = create_state_action_dictionary(env, policy)

    # Dictionary mapping each (state, action) pair to its list of sampled returns
    returns = {}

    for _ in range(episodes):  # Loop over episodes
        G = 0  # Return (cumulative reward), initialised at 0
        # run_game plays one episode and returns a list of [state, action, reward] triples
        episode = run_game(env=env, policy=policy, display=False)

        # Loop backwards through the episode: the return G is accumulated from the
        # last timestep to the first, propagating reward back from the future.
        # episode = [[s1, a1, r1], [s2, a2, r2], ..., [sn, an, rn]]
        for i in reversed(range(0, len(episode))):
            s_t, a_t, r_t = episode[i]
            state_action = (s_t, a_t)
            G += r_t  # Add the reward at the current timestep to the return

            # First-visit MC: only update if this state-action pair does not
            # appear earlier in the episode.
            if state_action not in [(x[0], x[1]) for x in episode[0:i]]:
                if returns.get(state_action):
                    # (s, a) already seen in previous episodes: append this return
                    returns[state_action].append(G)
                else:
                    # First time (s, a) is seen: create a new entry
                    returns[state_action] = [G]

                # returns maps (s, a) -> [G1, G2, ...]; the Q estimate is the
                # average of the sampled returns.
                Q[s_t][a_t] = sum(returns[state_action]) / len(returns[state_action])

                # Find a greedy action A* for s_t, breaking ties at random
                max_value = max(Q[s_t].values())
                best_actions = [a for a, q in Q[s_t].items() if q == max_value]
                A_star = random.choice(best_actions)

                # Epsilon-soft policy improvement for s_t:
                # the greedy action gets probability 1 - epsilon + epsilon/|A(s)|,
                # every other action gets epsilon/|A(s)|.
                n_actions = len(policy[s_t])
                for a in policy[s_t]:
                    if a == A_star:
                        policy[s_t][a] = 1 - epsilon + (epsilon / n_actions)
                    else:
                        policy[s_t][a] = epsilon / n_actions

    return policy
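
For completeness, the helper functions used above are defined earlier in the linked post. Paraphrased, and assuming a discrete environment with the classic gym API (env.reset() returning a state and env.step() returning a 4-tuple), they look roughly like this:

import random

def create_random_policy(env):
    # Uniform random policy: policy[state][action] = 1 / number of actions
    n_actions = env.action_space.n
    return {s: {a: 1.0 / n_actions for a in range(n_actions)}
            for s in range(env.observation_space.n)}

def create_state_action_dictionary(env, policy):
    # Q[state][action] initialised to 0 for every state-action pair
    return {s: {a: 0.0 for a in range(env.action_space.n)} for s in policy}

def run_game(env, policy, display=False):
    # Play one episode following the stochastic policy and return
    # a list of [state, action, reward] triples.
    episode = []
    state = env.reset()
    done = False
    while not done:
        acts = list(policy[state].keys())
        probs = list(policy[state].values())
        action = random.choices(acts, weights=probs)[0]
        next_state, reward, done, _ = env.step(action)
        episode.append([state, action, reward])
        state = next_state
    return episode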


This algorithm computes $$Q_{\pi}(s, a)$$ for all state-action pairs that the policy visits. If $$\pi$$ is a random policy, and after running this algorithm I take, for each state, the $$\max_a Q(s,a)$$ over all possible actions, why would that not be equal to $$Q_{\pi^*}(s, a)$$ (the optimal Q function)?
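
Concretely, by taking the max I mean something like this (assuming the Q dictionary built inside the algorithm were also returned, which is my addition):

greedy_actions = {s: max(Q[s], key=Q[s].get) for s in Q}  # argmax_a Q[s][a] for each state s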

The website claims to have found the optimal policy by running this algorithm.

I have also read up a bit on Q-learning, and its update equation is different from that of MC epsilon-soft. However, I can't clearly see how these two approaches differ.
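
For reference, the Q-learning update I have seen is

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$

whereas the MC method above simply stores every sampled return $$G$$ for each $$(s, a)$$ pair and uses their average as the estimate of $$Q(s, a)$$.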


If $$\pi$$ is a random policy, and after running this algorithm I take, for each state, the $$\max_a Q(s,a)$$ over all possible actions, why would that not be equal to $$Q_{\pi^*}(s, a)$$ (the optimal Q function)?

Assuming the estimates of $$Q_{\pi}(s,a)$$ have converged close to their true values after many samples, a policy based on $$\pi'(s) = \text{argmax}_a Q_{\pi}(s,a)$$ is still not guaranteed to be an optimal policy, unless the policy $$\pi$$ being measured is already the optimal policy.

This is because the action value $$Q_{\pi}(s,a)$$ gives the expected future reward from taking action $$a$$ in state $$s$$, and from that point on following the policy $$\pi$$. The function does not, by itself, adapt to the idea that you might change other action choices as well. It is a measure of immediate differences between action choices at any given time step. Therefore, if there are any long-term dependencies, where your action choice at $$t$$ would be different if only you could guarantee a certain choice at $$t+1$$ or later, this cannot be resolved simply by taking the maximum of $$Q_{\pi}(s,a)$$ when $$\pi$$ was a simple random policy.
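
To make this concrete, here is a minimal sketch on a toy MDP (my own construction, not from the question or the linked post). At the start state s1, action L gives a small safe reward (+1) and terminates, while action R moves to s2, where L falls off a cliff (-10) and R collects a larger reward (+3). Under a uniform random policy the value of s2 is negative, so a single greedification at s1 picks the "safe" action, even though the optimal policy goes right twice:

# Deterministic episodic MDP: transitions[state][action] = (next_state, reward).
# 'T' is a terminal state with value 0. Discount gamma = 1.
transitions = {
    's1': {'L': ('T', 1.0),     # small safe reward, episode ends
           'R': ('s2', 0.0)},   # move towards the risky but valuable area
    's2': {'L': ('T', -10.0),   # cliff
           'R': ('T', 3.0)},    # larger reward
}
actions = ['L', 'R']

def evaluate_q(policy, sweeps=100):
    # Policy evaluation by repeated Bellman backups (exact for this tiny MDP)
    V = {s: 0.0 for s in transitions}
    V['T'] = 0.0
    for _ in range(sweeps):
        for s in transitions:
            V[s] = sum(policy[s][a] * (r + V[s2])
                       for a, (s2, r) in transitions[s].items())
    return {s: {a: r + V[s2] for a, (s2, r) in transitions[s].items()}
            for s in transitions}

def greedy(Q):
    # Deterministic policy that picks argmax_a Q[s][a] in every state
    return {s: {a: (1.0 if a == max(Q[s], key=Q[s].get) else 0.0) for a in actions}
            for s in transitions}

random_policy = {s: {a: 0.5 for a in actions} for s in transitions}

Q_pi = evaluate_q(random_policy)
print(Q_pi['s1'])            # {'L': 1.0, 'R': -3.5} -> greedy picks L, but optimal is R

pi_1 = greedy(Q_pi)                         # one improvement step: still suboptimal at s1
pi_2 = greedy(evaluate_q(pi_1))             # evaluate the new policy and improve again
print(max(pi_2['s1'], key=pi_2['s1'].get))  # 'R' -> optimal after repeating the process

The first greedy policy chooses L at s1 because $$Q_{\pi}(s_1, R)$$ was measured under a policy that behaves randomly at $$s_2$$; only after re-evaluating Q under the improved policy does the larger reward become visible from $$s_1$$.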

However, if you do decide to change the policy so that you always follow $$\pi'(s) = \text{argmax}_a Q_{\pi}(s,a)$$ in every state, then you can say this: for each state, $$V_{\pi'}(s) \ge V_{\pi}(s)$$. That is, $$\pi'(s)$$ is no worse than, and may be a strict improvement over, $$\pi(s)$$. Better than that, $$\pi'(s)$$ will be a strict improvement over $$\pi(s)$$ if $$Q_{\pi}(s,a)$$ is accurate and $$\pi(s)$$ is not already the optimal policy $$\pi^*(s)$$. This is the basis of the Policy Improvement Theorem, which shows that if you repeat the process of measuring $$Q_{\pi^k}$$ and then creating a new policy $$\pi^{k+1}(s) = \text{argmax}_a Q_{\pi^k}(s,a)$$, you will eventually find the optimal policy. You only have to repeat your idea many times to eventually find $$\pi^*$$.

The Dynamic Programming technique Policy Iteration does exactly this. All other value-based Reinforcement Learning methods are variations on this idea and rely at least in part on this proof.
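
For illustration, a minimal tabular Policy Iteration sketch (my own formulation, with the MDP supplied as a table P[s][a] of (probability, next_state, reward) tuples, not code from any particular library) could look like this:

import random

def policy_evaluation(P, policy, gamma=0.9, tol=1e-8):
    # Iterative Bellman backups for a fixed deterministic policy
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = sum(p * (r + gamma * V.get(s2, 0.0))
                    for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

def policy_iteration(P, gamma=0.9):
    # Start from an arbitrary deterministic policy
    policy = {s: random.choice(list(P[s])) for s in P}
    while True:
        V = policy_evaluation(P, policy, gamma)
        stable = True
        for s in P:
            # Greedy improvement: pick the action maximising the one-step look-ahead
            q = {a: sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in P[s][a])
                 for a in P[s]}
            best = max(q, key=q.get)
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, V

Applied to the toy MDP above, e.g. P = {'s1': {'L': [(1.0, 'T', 1.0)], 'R': [(1.0, 's2', 0.0)]}, 's2': {'L': [(1.0, 'T', -10.0)], 'R': [(1.0, 'T', 3.0)]}}, it returns the "go right" policy after a couple of evaluate/improve rounds.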

Hey Neil, thanks for answering! I think I have a clearer understanding of why it is not Q opt! What do you mean in the second paragraph when you say "stuck measuring immediate differences between ...", from that point onwards up till the end of the paragraph? – calveeen – 2020-01-16T13:42:39.803

Thank you, the explanation helped a lot as well! – calveeen – 2020-01-16T14:21:35.937

I have one more question regarding the last paragraph where you talk about policy iteration. Policy iteration does not necessarily output Q opt in the case where not all actions are explored when following a policy pi, am I right? If we take the case of a grid problem, where we can move "up", "down", "left" and "right", then if our current policy specifies moving only "right" for cell number (1,1), we would only get the Q value for the action "right" at the end of the whole Monte Carlo simulation, and there would be no way to update the policy for cell number (1,1)? – calveeen – 2020-01-16T15:21:26.627

@calveeen: Policy Iteration from Dynamic Programming explores all state-action pairs. That means it doesn't scale well. However, it is theoretically solid. The example you are asking about uses Monte Carlo updates - these approach the optimal policy in a more stochastic way over time. To do that you need to add some randomness - typically epsilon-greedy policies, so that sometimes the agent will take a different action than the policy derived by taking the maximising action, and learn about the other directions. – Neil Slater – 2020-01-16T16:01:13.150

@calveeen: In my second paragraph I am trying to cover that an accurate Q(s,a) gives only a one-step look-ahead. It does not measure longer-term effects. That's what I meant by "stuck" - just that the values do not change instantly for $$Q_{\pi}(s_2, \cdot)$$ solely because you changed your policy choice when looking at $$Q_{\pi}(s_1, \cdot)$$. You need to make that policy change for real (or in simulation) and then re-estimate your Q values after using it for a while. – Neil Slater – 2020-01-16T16:04:52.540