Why can't the Monte Carlo epsilon-soft approach compute $\max Q(s,a)$?


I am new to reinforcement learning and am currently reading up on estimating $Q_\pi(s, a)$ values using the MC epsilon-soft approach, and I chanced upon the algorithm below. It comes from this website:

https://www.analyticsvidhya.com/blog/2018/11/reinforcement-learning-introduction-monte-carlo-learning-openai-gym/

import random

def monte_carlo_e_soft(env, episodes=100, policy=None, epsilon=0.01):
    if not policy:
        policy = create_random_policy(env)
    # Create an empty dictionary to store state-action values
    Q = create_state_action_dictionary(env, policy)

    # Empty dictionary for storing returns for each state-action pair,
    # mapping (s, a) -> [G1, G2, ...]
    returns = {}

    for _ in range(episodes):  # Loop over episodes
        G = 0  # Cumulative return, initialised to 0
        # run_game returns [[s1, a1, r1], [s2, a2, r2], ..., [sn, an, rn]]
        episode = run_game(env=env, policy=policy, display=False)

        # Loop through the episode in reverse: the return at each timestep
        # is the reward received there plus everything that follows, so we
        # propagate G from the last timestep back to the first.
        for i in reversed(range(0, len(episode))):
            s_t, a_t, r_t = episode[i]
            state_action = (s_t, a_t)
            G += r_t  # Add this timestep's reward to the return

            # First-visit check: only update if this state-action pair does
            # not occur earlier in the episode.
            if state_action not in [(x[0], x[1]) for x in episode[0:i]]:
                # If returns already has this state-action pair from previous
                # episodes, append the current return; otherwise create a
                # new entry.
                if returns.get(state_action):
                    returns[state_action].append(G)
                else:
                    returns[state_action] = [G]

                # Estimate Q(s, a) as the average return across episodes
                Q[s_t][a_t] = sum(returns[state_action]) / len(returns[state_action])

                # Find the greedy action, breaking ties randomly. Note that
                # random.choice picks an *index* into Q[s_t], so it must be
                # mapped back to the corresponding action key.
                actions = list(Q[s_t].keys())
                Q_list = list(Q[s_t].values())
                indices = [j for j, x in enumerate(Q_list) if x == max(Q_list)]
                A_star = actions[random.choice(indices)]

                # Epsilon-soft policy update: the greedy action gets
                # probability 1 - epsilon + epsilon/|A(s)|, every other
                # action gets epsilon/|A(s)|.
                n_actions = len(policy[s_t])
                for a in policy[s_t]:
                    if a == A_star:
                        policy[s_t][a] = 1 - epsilon + (epsilon / n_actions)
                    else:
                        policy[s_t][a] = epsilon / n_actions

    return policy

This algorithm computes $Q(s, a)$ for all state-action pairs that the policy visits. If $\pi$ is a random policy, and after running this algorithm we take, for each state, the $\max Q(s,a)$ over all possible actions, why would that not be equal to $Q_{\pi^*}(s, a)$ (the optimal Q function)?

The website claims to have found the optimal policy by running this algorithm.

I have read up a bit on Q-learning, and its update equation is different from that of MC epsilon-soft. However, I can't clearly see how these two approaches differ.
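For concreteness, here is my understanding of the two update rules as a sketch (the function names, `alpha`, and `gamma` are my own, not from the blog's code; the dictionaries mirror the `Q` and `returns` structures above):

```python
# MC: estimate Q(s, a) as the average of complete observed returns G.
def mc_update(Q, returns, s, a, G):
    returns.setdefault((s, a), []).append(G)
    Q[s][a] = sum(returns[(s, a)]) / len(returns[(s, a)])

# Q-learning: bootstrap from the current estimate of the best next action,
# instead of waiting for the full return of the episode.
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=1.0):
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
```

As I understand it, MC needs the whole episode to finish before updating, while Q-learning updates after every single transition.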

calveeen

Posted 2020-01-16T08:24:37.940

Reputation: 909

Answers


If $\pi$ is a random policy, and after running through this algorithm, and for each state take the $\max Q(s,a)$ for all possible actions, why would that not be equal to $Q_{\pi^*}(s, a)$ (optimal Q function)?

Assuming that the estimates for $Q_{\pi}(s,a)$ have converged close to their true values from many samples, a policy based on $\pi'(s) = \text{argmax}_a Q_{\pi}(s,a)$ is not guaranteed to be an optimal policy unless the policy $\pi$ being measured is already the optimal policy.

This is because the action value $Q_{\pi}(s,a)$ gives the expected future reward from taking action $a$ in state $s$, and from that point on following the policy $\pi$. The function does not, by itself, adapt to the idea that you might change other action choices as well. It is a measure of immediate differences between action choices at any given time step. Therefore if there are any long-term dependencies where your action choice at $t$ would be different if only you could guarantee a certain choice at $t+1$ or later, this cannot be resolved by simply taking the maximum $Q_{\pi}(s,a)$ when $\pi$ was a simple random policy.
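To make this concrete, here is a small hypothetical two-state MDP (my own construction, not from the question) where acting greedily with respect to $Q_{\pi}$ of a random policy is suboptimal. From $s_0$, action `L` ends the episode with reward $+1$, while `R` gives reward $0$ and moves to $s_1$; from $s_1$, `L` gives $-10$ and `R` gives $+10$, both terminal:

```python
# Exact Q_pi for the tiny MDP above, no discounting.
# Under pi, s1's action probabilities determine V_pi(s1), which in turn
# determines Q_pi(s0, "R").
def q_values(policy):
    v_s1 = policy["s1"]["L"] * (-10.0) + policy["s1"]["R"] * 10.0
    return {
        "s0": {"L": 1.0, "R": 0.0 + v_s1},
        "s1": {"L": -10.0, "R": 10.0},
    }

random_policy = {"s0": {"L": 0.5, "R": 0.5}, "s1": {"L": 0.5, "R": 0.5}}
q = q_values(random_policy)

# Under the random policy V_pi(s1) = 0, so Q_pi(s0, "R") = 0 < 1 = Q_pi(s0, "L"):
# the greedy action at s0 is "L", even though the optimal policy takes "R"
# then "R" for a return of 10. The long-term value of "R" only shows up once
# the policy at s1 is fixed and Q is re-estimated.
greedy_s0 = max(q["s0"], key=q["s0"].get)
print(greedy_s0, q["s0"])  # prints: L {'L': 1.0, 'R': 0.0}
```

The random behaviour at $s_1$ drags down the measured value of `R` at $s_0$, which is exactly the long-term dependency described above.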

However, if you do decide to change the policy such that you always follow actions $\pi'(s) = \text{argmax}_a Q_{\pi}(s,a)$ for all states, then you can say this: for each state, $V_{\pi'}(s) \ge V_{\pi}(s)$. I.e. $\pi'(s)$ is no worse than, and may be a strict improvement over, $\pi(s)$. Better still, $\pi'(s)$ will be a strict improvement over $\pi(s)$ if $Q_{\pi}(s,a)$ is accurate and $\pi(s)$ is not already the optimal policy $\pi^*(s)$. This is the basis for the Policy Improvement Theorem, which shows that if you repeat the process of measuring $Q_{\pi^k}$ and then creating a new policy $\pi^{k+1}(s) = \text{argmax}_a Q_{\pi^k}(s,a)$, you will eventually find the optimal policy. You only have to repeat your idea many times to eventually find $\pi^*$.

The Dynamic Programming technique Policy Iteration does exactly this. All other value-based reinforcement learning methods are variations of this idea and rely at least in part on this proof.
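A minimal tabular sketch of that evaluate-then-improve loop might look like this. It assumes a transition table in the style of Gym's toy-text environments, where `P[s][a]` is a list of `(prob, next_state, reward, done)` tuples; the function name and interface are illustrative, not from the post:

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-8):
    policy = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: iterate the Bellman expectation backup until
        # the value estimates stop changing.
        while True:
            delta = 0.0
            for s in range(n_states):
                v = sum(p * (r + gamma * V[s2] * (not done))
                        for p, s2, r, done in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to Q_pi.
        stable = True
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2] * (not done))
                     for p, s2, r, done in P[s][a])
                 for a in range(n_actions)]
            best = int(np.argmax(q))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, V
```

Each greedy step is exactly the `argmax` the question proposes; the point is that it has to alternate with fresh evaluation of the new policy until nothing changes.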

Neil Slater

Posted 2020-01-16T08:24:37.940

Reputation: 14 632

Hey Neil, thanks for answering ! I think I have a clearer understanding of why it is not Q opt ! What do you mean in the second paragraph when you say “stuck measuring immediate differences between ... “ from that point onwards up till the end of the paragraph ? – calveeen – 2020-01-16T13:42:39.803

Thank you, the explanation helped a lot as well ! – calveeen – 2020-01-16T14:21:35.937

I have one more question regarding the last paragraph where you talk about policy iteration. Policy iteration does not necessarily output Q opt in the case where not all actions are explored when following a policy pi, am I right? If we take the case of a grid problem, where we can move "up", "down", "left" and "right", then if our current policy specifies moving only "right" for cell number (1,1), we would only get the Q value for the action "right" at the end of the whole Monte Carlo simulation, and there would be no way to update the policy for cell number (1,1)? – calveeen – 2020-01-16T15:21:26.627

@calveeen: Policy Iteration from Dynamic Programming explores all state-action pairs. That means it doesn't scale well. However, it is theoretically solid. The example you are asking about uses Monte Carlo updates - these approach the optimal policy in a more stochastic way over time. To do that you need to add some randomness - typically epsilon-greedy policies, so that sometimes the agent will take a different action than the policy derived by taking the maximising action, and learn about the other directions – Neil Slater – 2020-01-16T16:01:13.150
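The epsilon-greedy action selection mentioned in this comment can be sketched as follows (a minimal illustration; the function name is my own, and `Q_s` is assumed to be one state's `{action: value}` dict as in the question's code):

```python
import random

def epsilon_greedy_action(Q_s, epsilon=0.1):
    # With probability epsilon, explore: pick any action uniformly at random.
    if random.random() < epsilon:
        return random.choice(list(Q_s))
    # Otherwise exploit: pick the action with the highest current Q estimate.
    return max(Q_s, key=Q_s.get)
```

With `epsilon > 0` every action keeps a nonzero probability of being tried, which is what lets Monte Carlo control keep estimating Q values for actions the greedy policy would never take.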

@calveeen: In my second paragraph I am trying to cover that an accurate $Q(s,a)$ gives only a one-step look ahead. It does not measure longer-term effects. That's what I meant by "stuck" - just that the values do not change instantly for $Q_{\pi}(s_2, \cdot)$ solely because you changed your policy choice when looking at $Q_{\pi}(s_1, \cdot)$. You need to make that policy change for real (or in simulation) and then re-estimate your Q values after using it for a while – Neil Slater – 2020-01-16T16:04:52.540