
I am new to reinforcement learning and am currently reading up on the estimation of $Q_\pi(s, a)$ values using the Monte Carlo (MC) epsilon-soft approach, and chanced upon the algorithm below. The link to the algorithm is found from this website.

```
def monte_carlo_e_soft(env, episodes=100, policy=None, epsilon=0.01):
    if not policy:
        policy = create_random_policy(env)
    # Create an empty dictionary to store state-action values
    Q = create_state_action_dictionary(env, policy)
    # Empty dictionary for storing returns for each state-action pair
    returns = {}
    for _ in range(episodes):  # Loop through episodes
        G = 0  # Cumulative reward, initialised at 0
        episode = run_game(env=env, policy=policy, display=False)  # List of (state, action, reward) triples
        # Loop through the episode indices in reverse. The logic behind going
        # backwards is that the eventual reward is at the end, so we work back
        # from the last timestep to the first, propagating the return.
        # episode = [[s1, a1, r1], [s2, a2, r2], ..., [sn, an, rn]]
        for i in reversed(range(0, len(episode))):
            s_t, a_t, r_t = episode[i]
            state_action = (s_t, a_t)
            G += r_t  # Add the reward at the current timestep to the return
            # First-visit check: if the state-action pair does not occur
            # earlier in the episode, this is its first appearance.
            if state_action not in [(x[0], x[1]) for x in episode[0:i]]:
                # If the returns dict already has this state-action pair from
                # previous episodes, append the current return to its list;
                # otherwise create a new entry.
                if returns.get(state_action):
                    returns[state_action].append(G)
                else:
                    returns[state_action] = [G]
                # returns maps (s, a) -> [G1, G2, ...] across episodes
                Q[s_t][a_t] = sum(returns[state_action]) / len(returns[state_action])  # Average return across episodes
                # Find the action with maximum value, breaking ties randomly
                Q_list = list(map(lambda x: x[1], Q[s_t].items()))
                indices = [i for i, x in enumerate(Q_list) if x == max(Q_list)]
                A_star = random.choice(indices)
                # Update the epsilon-soft action probabilities for s_t;
                # the denominator should be |A(s)|, the number of actions
                # (the original code divided by the sum of the probabilities)
                for a in policy[s_t].items():
                    if a[0] == A_star:
                        policy[s_t][a[0]] = 1 - epsilon + (epsilon / len(policy[s_t]))
                    else:
                        policy[s_t][a[0]] = epsilon / len(policy[s_t])
    return policy
```

This algorithm computes $Q(s, a)$ for all state-action pairs visited under the policy. If $\pi$ is a random policy, and after running this algorithm we take, for each state, $\max_a Q(s, a)$ over all possible actions, why would that not be equal to $Q_{\pi^*}(s, a)$ (the optimal Q function)?

The website claims to have found the optimal policy by running this algorithm.

I have read up a bit on Q-learning, and its update equation is different from that of MC epsilon-soft. However, I can't seem to understand clearly how these two approaches differ.

Hey Neil, thanks for answering! I think I have a clearer understanding of why it is not $Q^*$! What do you mean in the second paragraph when you say "stuck measuring immediate differences between ..." from that point onwards up to the end of the paragraph? – calveeen – 2020-01-16T13:42:39.803

Thank you, the explanation helped a lot as well! – calveeen – 2020-01-16T14:21:35.937

I have one more question regarding the last paragraph, where you talk about policy iteration. Policy iteration does not necessarily output $Q^*$ when not all actions are explored while following a policy $\pi$, am I right? Take the case of a grid problem where we can move "up", "down", "left" and "right": if our current policy specifies only "right" for cell (1,1), we would only get the Q value for the action "right" at the end of the whole Monte Carlo simulation, and there would be no way to update the policy for cell (1,1)? – calveeen – 2020-01-16T15:21:26.627

@calveen: Policy Iteration from Dynamic Programming explores all state-action pairs. That means it doesn't scale well; however, it is theoretically solid. The example you are asking about uses Monte Carlo updates - these approach the optimal policy in a more stochastic way over time. To do that you need to add some randomness - typically epsilon-greedy policies - so that the agent will sometimes take a different action than the policy derived by taking the maximising action, and learn about the other directions – Neil Slater – 2020-01-16T16:01:13.150

@calveen: In my second paragraph I am trying to cover that an accurate $Q(s,a)$ gives only a one-step look-ahead. It does not measure longer-term effects. That's what I meant by "stuck" - the values of $Q_{\pi}(s_2, \cdot)$ do not change instantly solely because you changed your policy choice when looking at $Q_{\pi}(s_1, \cdot)$. You need to make that policy change for real (or in simulation) and then re-estimate your Q values after using it for a while – Neil Slater – 2020-01-16T16:04:52.540