## What is the relation between Q-learning and policy gradient methods?


As far as I understand, Q-learning and policy gradients (PG) are the two major approaches used to solve RL problems. While Q-learning aims to predict the reward of a certain action taken in a certain state, policy gradients directly predict the action itself.

However, both approaches appear identical to me, i.e. predicting the maximum reward for an action (Q-learning) is equivalent to predicting the probability of taking the action directly (PG). Is the difference in the way the loss is back-propagated?

25

> However, both approaches appear identical to me i.e. predicting the maximum reward for an action (Q-learning) is equivalent to predicting the probability of taking the action directly (PG).

Both methods are theoretically driven by the Markov Decision Process construct, and as a result use similar notation and concepts. In addition, in simple solvable environments you should expect both methods to result in the same - or at least equivalent - optimal policies.

However, they are actually different internally. The most fundamental difference between the approaches is in how they handle action selection, both whilst learning and as the output (the learned policy). In Q-learning, the goal is to learn a single deterministic action from a discrete set of actions by finding the maximum value. With policy gradients, and other direct policy searches, the goal is to learn a map from state to action, which can be stochastic, and which works in continuous action spaces.
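To make the difference in action selection concrete, here is a minimal sketch for a single state with three discrete actions. The numbers in `q_values` and `policy_logits` are made up for illustration, and the softmax parameterisation of the policy is an assumption, not the only option.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outputs for one state with three discrete actions.
q_values = np.array([1.0, 2.5, 0.3])        # value-based: estimated return per action
policy_logits = np.array([0.2, 1.1, -0.4])  # policy-based: unnormalised action preferences

# Q-learning style: the learned policy is the deterministic argmax over the values.
greedy_action = int(np.argmax(q_values))

# Policy gradient style: the model outputs a distribution and the agent samples from it.
probs = np.exp(policy_logits - policy_logits.max())  # numerically stable softmax
probs /= probs.sum()
sampled_action = int(rng.choice(len(probs), p=probs))

print("greedy action from Q-values:", greedy_action)
print("action probabilities from policy:", probs)
print("sampled action from policy:", sampled_action)
```

The value-based agent always returns the same action for this state, whilst the policy-based agent can return any action with the probability that its parameters assign to it.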

As a result, policy gradient methods can solve problems that value-based methods cannot:

• Large and continuous action space. However, with value-based methods, this can still be approximated with discretisation - and this is not a bad choice, since the mapping function in policy gradient has to be some kind of approximator in practice.

• Stochastic policies. A value-based method cannot solve an environment where the optimal policy is stochastic and requires specific probabilities, such as Scissors/Paper/Stone. That is because there are no trainable parameters in Q-learning that control the probabilities of actions, and the problem formulation in TD learning assumes that a deterministic agent can be optimal.
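As a rough illustration of the stochastic policy point, the sketch below scores two policies for Scissors/Paper/Stone against an opponent who always picks the best reply. The payoff matrix and the helper `worst_case_value` are illustrative names of my own, and the "best-reply opponent" is a simplifying assumption.

```python
import numpy as np

# Payoff to the agent: rows = agent's move, columns = opponent's move
# (order: stone, paper, scissors). +1 = win, -1 = loss, 0 = draw.
payoff = np.array([
    [ 0, -1,  1],
    [ 1,  0, -1],
    [-1,  1,  0],
])

def worst_case_value(policy):
    """Expected payoff when the opponent plays the best reply to `policy`."""
    expected_per_opponent_move = policy @ payoff
    return expected_per_opponent_move.min()

deterministic = np.array([1.0, 0.0, 0.0])  # always stone, like a greedy argmax policy
uniform = np.full(3, 1 / 3)                # the stochastic equilibrium policy

print("always stone:", worst_case_value(deterministic))
print("uniform random:", worst_case_value(uniform))
```

Any deterministic policy can be exploited for a loss every game, whilst the uniform random policy cannot be pushed below a draw on average. A greedy policy derived from Q-values is always of the first kind.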

However, value-based methods like Q-learning have some advantages too:

• Simplicity. You can implement Q functions as simple discrete tables, and this gives some guarantees of convergence. There are no tabular versions of policy gradient, because you need a mapping function $p(a \mid s, \theta)$ which also must have a smooth gradient with respect to $\theta$.

• Speed. TD learning methods that bootstrap are often much faster to learn a policy than methods which must purely sample from the environment in order to evaluate progress.
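Both advantages can be seen in a small tabular Q-learning sketch. The four-state chain environment, the `step` function and the hyperparameters are all invented for this example; only the update rule is the standard Q-learning one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2           # hypothetical chain: states 0..3, state 3 is terminal
alpha, gamma, epsilon = 0.5, 0.9, 0.2
Q = np.zeros((n_states, n_actions))  # the whole "model" is a lookup table

def step(state, action):
    """Toy dynamics: action 1 moves right, action 0 moves left; reaching state 3 pays +1."""
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward, next_state == n_states - 1

for _ in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy behaviour policy.
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done = step(state, action)
        # Bootstrapped TD target: uses the current estimate for the next state
        # instead of waiting for the sampled return at the end of the episode.
        target = reward + (0.0 if done else gamma * Q[next_state].max())
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state

greedy_policy = Q.argmax(axis=1)
print(Q)
print("greedy action per state:", greedy_policy)
```

Each update happens immediately after a single transition, which is the bootstrapping that the speed point refers to. The row for the terminal state is never updated and can be ignored.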

There are other reasons why you might care to use one or other approach:

• You may want to know the predicted return whilst the process is running, to help other planning processes associated with the agent.

• The state representation of the problem lends itself more easily to either a value function or a policy function. A value function may turn out to have a very simple relationship to the state and the policy function a very complex one that is hard to learn, or vice-versa.

Some state-of-the-art RL solvers actually use both approaches together, such as Actor-Critic. This combines strengths of value and policy gradient methods.

What do you mean when you say that actor-critic combines the strength of both methods? To my understanding, the actor evaluates the best action to take based on state, and the critic evaluates that state's value, then feeds reward to the actor. Treating them as a single "Policy" unit still looks like policy gradient to me. Why is this actually like Q-learning? – Gulzar – 2019-01-25T20:04:50.920

@Guizar: The critic learns using a value-based method (e.g. Q-learning). So, overall, actor-critic is a combination of a value method and a policy gradient method, and it benefits from the combination. One notable improvement over "vanilla" PG is that gradients can be assessed on each step, instead of at the end of each episode. If you are looking for a more detailed answer on this subject you should ask a question on the site. – Neil Slater – 2019-01-26T08:45:18.450

@Guizar: Actually scratch the (e.g. Q-learning) as I'm getting confused between advantage actor-critic (which adjusts the baseline to be based on action-values) and the critic which is usually a simpler state value. However, the rest of my description is still the same: the critic is usually updated using value-based TD methods, of which Q-learning is also an example. – Neil Slater – 2019-01-26T10:32:26.753

"A value function may turn out to have very simple relationship to the state and the policy function very complex and hard to learn." Is this really true? The policy is just the argmax of the value function. You can easily (locally) get the policy from the value function, but not viceversa. – user76284 – 2020-01-23T03:51:30.250

@user76284: Your statement "the policy is just the argmax of the value function" only applies to value-based approaches estimating an optimal policy e.g. Q-learning. Policy gradient methods do not derive the policy in this way - it is learned directly as a function of the state. So yes my statement is really true. – Neil Slater – 2020-01-23T09:53:12.270

You misunderstood. What I meant is that the optimal policy cannot be more complex than the value function, since the former can be easily extracted from the latter. – user76284 – 2020-01-23T18:18:34.327

The value function for a given problem exists regardless of whether you decided to learn it. My point is about the problem’s value function and optimal policy: that the latter cannot be “more complex” than the former. It’s a reduction. – user76284 – 2020-01-23T18:21:25.620

Another way to phrase it is that an environment’s optimal policy will always be at least as simple as the environment’s value function. – user76284 – 2020-01-23T18:32:40.683

@user76284: I'm not convinced we are talking about quite the same thing. The reduction function of argmax doesn't necessarily compress simply. A value function of $Q(s, a) = \sin(s + a)$ is pretty simple to model. Stick an argmax over $a$ on that, and you will get a much more complex relationship. In reality you don't have the analytic forms, so complexity of the function in the sense I mean in the answer is about how learnable it is, not composition or Kolmogorov complexity etc. Is there a way I could phrase that to make it clear? – Neil Slater – 2020-01-23T19:49:47.683

What is the action space in your example? – user76284 – 2020-01-23T20:57:29.460

If it’s finite, a simple softmax of the Q table suffices. If it’s infinite, the argmax for your example is just $\frac{\pi}{2} - s$, a linear function of $s$. – user76284 – 2020-01-23T21:08:58.053

@user76284 Perhaps that is not a good example for this discussion then. Consider this: You need to estimate, given a state, either your optimal policy, or the value of acting in multiple different ways so that you can figure out the optimal policy. Your state is your position, bearing and speed, plus a target's position (bearing and distance from you). The total reward is negative the time it takes to get to the target. The actions are to change bearing or change speed. – Neil Slater – 2020-01-23T21:13:49.377

The optimal policy is simple - change bearing to face target, and accelerate towards it. Calculating the expected total reward for the different actions is harder. You would not choose to do it first and figure out the maximum. – Neil Slater – 2020-01-23T21:14:30.980

You say the optimal policy is simpler than (i.e. not harder than) the value function. Isn’t that what I’m saying too? Sorry if I misunderstood. – user76284 – 2020-01-23T21:19:23.037

@user76284: It is in that case. What I am also saying is that it is not always so. If you feel that the best or only way to calculate it would be to calculate the Q table and argmax over it, then you aren't likely to be using policy gradients any more. This discussion has got too long though - if you think there is something wrong/unclear in this answer, then would you mind starting a new question about it? Then if anything crops up in that question that I could reference or include in this answer to make it better, I would be happy to. – Neil Slater – 2020-01-23T21:32:40.477

2

This Tutorial by OpenAI offers a great comparison of different RL methods.
I'll try to summarize the differences between Q-Learning and Policy Gradient methods:

1. Objective Function

   1. In Q-learning we learn a Q-function that satisfies the Bellman (optimality) equation. This is most often achieved by minimizing the Mean Squared Bellman Error (MSBE) as the loss function. The Q-function is then used to obtain a policy (e.g. by greedily selecting the action with maximum value).
   2. Policy gradient methods directly try to maximize the expected return by taking small steps in the direction of the policy gradient. The policy gradient is the derivative of the expected return w.r.t. the policy parameters.

2. On- vs. Off-Policy

   1. The policy gradient is derived as an expectation over trajectories ($s_1, a_1, r_1, s_2, a_2, \ldots, r_n$), which is estimated by a sample mean. To get an unbiased estimate of the gradient, the trajectories have to be sampled from the current policy. Thus, policy gradient methods are on-policy methods.
   2. Q-learning only makes sure to satisfy the Bellman equation. This equation has to hold true for all transitions. Therefore, Q-learning can also use experiences collected from previous policies and is off-policy.

3. Stability and Sample Efficiency

   1. Because they directly optimize the return, and thus the actual performance on a given task, policy gradient methods tend to converge more stably to good behavior. However, being on-policy makes them very sample-inefficient.
   2. Q-learning only tries to find a function that satisfies the Bellman equation, and this does not guarantee near-optimal behavior. Several tricks are used to improve convergence, and when it does converge, Q-learning is more sample-efficient.
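For reference, the two objectives described above can be written out roughly as follows. The notation is my own assumption rather than something taken from the question: $Q_\phi$ is the Q-function with parameters $\phi$, $\pi_\theta$ is the policy with parameters $\theta$, $\mathcal{D}$ is a set of stored transitions, $d$ is 1 if $s'$ is terminal and 0 otherwise, and $R(\tau)$ is the return of trajectory $\tau$.

Q-learning minimizes the mean squared Bellman error over transitions that may come from any policy:

$$L(\phi) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}} \left[ \left( Q_\phi(s,a) - \left( r + \gamma (1 - d) \max_{a'} Q_\phi(s',a') \right) \right)^2 \right]$$

Policy gradient methods ascend the gradient of the expected return, which is an expectation over trajectories sampled from the current policy:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau) \right]$$

The first expectation is over a dataset $\mathcal{D}$ that does not depend on the current parameters, which is why old experience can be reused. The second is over $\tau \sim \pi_\theta$, so the samples go stale as soon as $\theta$ changes.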