However, both approaches appear identical to me, i.e. predicting the maximum reward for an action (Q-learning) seems equivalent to predicting the probability of taking the action directly (PG).

Both methods are theoretically grounded in the Markov Decision Process framework, and as a result use similar notation and concepts. In addition, in simple solvable environments you should expect both methods to result in the same - or at least equivalent - optimal policies.

However, they are actually different internally. The most fundamental difference between the approaches is in how they handle action selection, both whilst learning and as the output (the learned policy). In Q-learning, the goal is to learn a single deterministic action from a discrete set of actions, by finding the action with the maximum value. With policy gradients, and other direct policy searches, the goal is to learn a map from state to action, which can be stochastic and which works in continuous action spaces.
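To make the contrast concrete, here is a minimal sketch of the two action-selection styles. The numbers (the Q-value estimates and the policy logits) are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Value-based (Q-learning): the learned policy is deterministic -
# pick the argmax over estimated action values for the current state.
q_values = np.array([1.2, 0.4, -0.3])      # hypothetical Q(s, a) estimates
greedy_action = int(np.argmax(q_values))    # always selects action 0 here

# Policy-based (PG): the learned policy is a distribution pi(a|s)
# that you sample from, so it can remain stochastic.
logits = np.array([2.0, 0.5, -1.0])         # hypothetical policy output for state s
probs = np.exp(logits) / np.exp(logits).sum()
sampled_action = int(rng.choice(len(probs), p=probs))
```

The argmax step has no notion of probability; the softmax step has no notion of value - which is exactly the internal difference described above.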

As a result, policy gradient methods can solve problems that value-based methods cannot:

Large and continuous action spaces. However, with value-based methods, this can still be approximated with discretisation - and this is not a bad choice, since the mapping function in policy gradient methods has to be some kind of approximator in practice anyway.

Stochastic policies. A value-based method cannot solve an environment where the optimal policy is stochastic and requires specific action probabilities, such as Scissor/Paper/Stone. That is because there are no trainable parameters in Q-learning that control the probabilities of actions; the problem formulation in TD learning assumes that a deterministic agent can be optimal.
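As an illustration of the Scissor/Paper/Stone point, the sketch below (the payoff matrix and helper function are my own construction) shows that any deterministic policy - which is all an argmax can produce - can be exploited for +1 per round, while the uniform stochastic policy cannot be exploited at all:

```python
import numpy as np

# Payoff for the row player: +1 win, 0 draw, -1 loss.
# Actions: 0 = stone, 1 = paper, 2 = scissors.
payoff = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

def best_response_value(policy):
    """Best expected payoff an adversary can secure against a fixed mixed policy."""
    opponent_payoffs = -(policy @ payoff)   # adversary's expected payoff per action
    return opponent_payoffs.max()

always_stone = np.array([1.0, 0.0, 0.0])    # what a deterministic argmax policy looks like
uniform = np.ones(3) / 3                    # the optimal stochastic policy
```

An adversary playing paper wins every round against `always_stone`, but can do no better than break even against `uniform` - so the optimum genuinely requires specific probabilities, which Q-learning has no parameters to represent.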

However, value-based methods like Q-learning have some advantages too:

Simplicity. You can implement Q functions as simple discrete tables, and this gives some guarantees of convergence. There are no tabular versions of policy gradient, because you need a mapping function $p(a \mid s, \theta)$, which must also have a smooth gradient with respect to $\theta$.

Speed. TD learning methods that bootstrap are often much faster to learn a policy than methods which must purely sample from the environment in order to evaluate progress.
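Both advantages show up in a minimal tabular Q-learning update: the whole learner is one array, and every single transition is used immediately via a bootstrapped TD target, with no need to wait for an episode to end. This is a generic sketch with hypothetical state/action counts and hyperparameters:

```python
import numpy as np

n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))   # the entire learned "model" is this one table
alpha, gamma = 0.5, 0.9               # learning rate and discount (arbitrary here)

def q_update(s, a, r, s_next):
    """One Q-learning step: bootstraps from the current estimate of the next
    state's best value instead of waiting for the full episodic return."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

# A single (s, a, r, s') transition is enough to learn from.
q_update(0, 1, 1.0, 2)
```

A "vanilla" policy gradient method, by contrast, typically has to sample whole trajectories before it can form an update.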

There are other reasons why you might prefer one or the other approach:

You may want to know the predicted return whilst the process is running, to help other planning processes associated with the agent.

The state representation of the problem lends itself more easily to either a value function or a policy function. A value function may turn out to have a very simple relationship to the state, and the policy function a very complex one that is hard to learn, or *vice-versa*.

Some state-of-the-art RL solvers actually use both approaches together, such as Actor-Critic. This combines strengths of value and policy gradient methods.
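A rough sketch of a one-step actor-critic update may help show how the two strengths combine. This is a simplified toy version of my own (tabular softmax actor, tabular critic - real implementations use function approximators): the critic's TD error, a value-method quantity available at every step, scales the policy-gradient step for the actor.

```python
import numpy as np

n_states, n_actions = 3, 2
theta = np.zeros((n_states, n_actions))   # actor: policy logits, pi(a|s) = softmax(theta[s])
V = np.zeros(n_states)                    # critic: state-value estimates
alpha_pi, alpha_v, gamma = 0.1, 0.1, 0.99

def actor_critic_step(s, a, r, s_next):
    """One online actor-critic update from a single transition."""
    td_error = r + gamma * V[s_next] - V[s]   # value-based part (bootstrapped)
    V[s] += alpha_v * td_error                # critic update

    probs = np.exp(theta[s]) / np.exp(theta[s]).sum()
    grad_log = -probs
    grad_log[a] += 1.0                        # gradient of log pi(a|s) for softmax
    theta[s] += alpha_pi * td_error * grad_log  # policy-gradient part

actor_critic_step(0, 1, 1.0, 2)
```

Because the TD error is available after every step, the actor gets a gradient signal per step rather than per episode - the improvement over vanilla PG mentioned in the comments below.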

What do you mean when you say that actor-critic combines the strength of both methods? To my understanding, the actor evaluates the best action to take based on state, and the critic evaluates that state's value, then feeds reward to the actor. Treating them as a single "Policy" unit still looks like policy gradient to me. Why is this actually like Q-learning? – Gulzar – 2019-01-25T20:04:50.920

@Gulzar: The critic learns using a value-based method (e.g. Q-learning). So, overall, actor-critic is a combination of a value-based method and a policy gradient method, and it benefits from the combination. One notable improvement over "vanilla" PG is that gradients can be assessed on each step, instead of at the end of each episode. If you are looking for a more detailed answer on this subject, you should ask a question on the site. – Neil Slater – 2019-01-26T08:45:18.450

@Gulzar: Actually, scratch the "(e.g. Q-learning)", as I'm getting confused between advantage actor-critic (which adjusts the baseline to be based on action values) and the critic, which usually estimates a simpler state value. However, the rest of my description is still the same: the critic is usually updated using value-based TD methods, of which Q-learning is also an example. – Neil Slater – 2019-01-26T10:32:26.753

"A value function may turn out to have very simple relationship to the state and the policy function very complex and hard to learn." Is this really true? The policy is just the argmax of the value function. You can easily (locally) get the policy from the value function, but not vice versa. – user76284 – 2020-01-23T03:51:30.250

@user76284: Your statement "the policy is just the argmax of the value function" only applies to value-based approaches estimating an optimal policy, e.g. Q-learning. Policy gradient methods do not derive the policy in this way - it is learned directly as a function of the state. So yes, my statement is really true. – Neil Slater – 2020-01-23T09:53:12.270

You misunderstood. What I meant is that the optimal policy cannot be *more complex* than the value function, since the former can be easily extracted from the latter. – user76284 – 2020-01-23T18:18:34.327

The value function for a given problem exists regardless of whether you decided to learn it. My point is about the *problem's* value function and optimal policy: that the latter cannot be "more complex" than the former. It's a reduction. – user76284 – 2020-01-23T18:21:25.620

Another way to phrase it is that an environment's optimal policy will always be *at least as simple* as the environment's value function. – user76284 – 2020-01-23T18:32:40.683

@user76284: I'm not convinced we are talking about quite the same thing. The reduction function of argmax doesn't necessarily compress simply. A value function of $Q(s, a) = \sin(s + a)$ is pretty simple to model. Stick an argmax over $a$ on that, and you will get a much more complex relationship. In reality you don't have the analytic forms, so complexity of the function *in the sense I mean in the answer* is about how learnable it is, not composition or Kolmogorov complexity etc. Is there a way I could phrase that to make it clear? – Neil Slater – 2020-01-23T19:49:47.683

What is the action space in your example? – user76284 – 2020-01-23T20:57:29.460

If it’s finite, a simple softmax of the Q table suffices. If it’s infinite, the argmax for your example is just $\frac{\pi}{2} - s$, a linear function of $s$. – user76284 – 2020-01-23T21:08:58.053

@user76284 Perhaps that is not a good example for this discussion then. Consider this: You need to estimate, given a state, either your optimal policy, or the *value* of acting in multiple different ways so that you can figure out the optimal policy. Your state is your position, bearing and speed, plus a target's position (bearing and distance from you). The total reward is negative the time it takes to get to the target. The actions are to change bearing or change speed. – Neil Slater – 2020-01-23T21:13:49.377

The optimal policy is simple - change bearing to face the target, and accelerate towards it. Calculating the expected total reward for the different actions is harder. You would not choose to do it first and figure out the maximum. – Neil Slater – 2020-01-23T21:14:30.980

You say the optimal policy is simpler (i.e. not harder than) than the value function. Isn’t that what I’m saying too? Sorry if I misunderstood. – user76284 – 2020-01-23T21:19:23.037

@user76284: It is in that case. That it is not *always* the case is also what I am saying. If you feel that the best or only way to calculate it would be to calculate the Q table and argmax over it, then you aren't likely to be using policy gradients any more. This discussion has got too long though - if you think there is something wrong/unclear in this answer, then would you mind starting a new question about it? Then if anything crops up in that that I could reference or include in this answer to make it better, I would be happy to. – Neil Slater – 2020-01-23T21:32:40.477