## How to design two different neural nets for actor and critic RL?


In order to have an actor-critic RL model, two conditions must be satisfied.

1. The value function approximation $f_w$ should converge to a local minimum of the squared error, i.e.

$$\sum_s d^{\pi}(s) \sum_a \pi(s,a)[Q^{\pi}(s,a) - f_w(s,a)]\frac{\partial f_w(s,a)}{\partial w} = 0$$

2. The parameterization should satisfy the following compatibility condition:

$$\frac{\partial f_w(s,a)}{\partial w} = \frac{\partial \pi(s,a)}{\partial \theta} \frac{1}{ \pi(s,a)}$$
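For reference, since $\frac{\partial \pi(s,a)}{\partial \theta}\frac{1}{\pi(s,a)} = \frac{\partial \log \pi(s,a)}{\partial \theta}$, condition 2 can equivalently be written as

$$\frac{\partial f_w(s,a)}{\partial w} = \frac{\partial \log \pi(s,a)}{\partial \theta},$$

i.e. the gradient of the critic with respect to $w$ must equal the score function of the policy. One classic way this is met (compatible function approximation) is to make $f_w$ linear in those score features:

$$f_w(s,a) = w^\top \frac{\partial \log \pi(s,a)}{\partial \theta}.$$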

So, specifically, how can we design a model that meets the second condition?

**Update**

Here I want to highlight the value function approximation in actor-critic methods. We need to optimize the critic as well, as we did in Q-learning, but on-policy: the TD error is computed according to the actor's policy. Here I will put the final equation of actor-critic.
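For concreteness, one standard form of the one-step actor-critic update (a common textbook version; the step sizes $\alpha$ and $\beta$ are illustrative) is

$$\delta_t = r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)$$

$$w \leftarrow w + \beta\, \delta_t\, \nabla_w V_w(s_t)$$

$$\theta \leftarrow \theta + \alpha\, \delta_t\, \nabla_\theta \log \pi_\theta(s_t, a_t)$$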

Here, we can simply take the critic neural net's output as the state value function (the $f$ above). So how do we satisfy condition 2?

May I ask why you are interested in the on-policy critic form? Also, how do you intend to train the critic's NN on-policy? – Constantinos – 2017-12-09T20:40:49.887


There are many papers out there that deal with neural networks and RL. This blog post gives very good insight into policy gradient networks: Deep RL with PG

Now for your question. You really need to be familiar with how we train a neural network; a simple one for classification will do. If you check the derivations and how the weights get updated, it will become very clear how you can implement the above.

I will describe it as simply as possible so you see the link. A neural network, in a very broad sense, consists of nested functions. The function that contains all the others is the one at your output layer. In the case of stochastic policy gradients, this is your Boltzmann (softmax) function. So your output layer takes all the previous layer's outputs and passes them through the softmax. In the case of NNs, the parameterization comes from all the previous layers.
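As an illustration (a minimal sketch with made-up layer sizes, not code from the blog), the Boltzmann/softmax output layer sitting on top of a hidden layer could look like this:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the output sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def policy_forward(x, W1, W2):
    """Tiny two-layer policy net: ReLU hidden layer, softmax (Boltzmann) output."""
    h = np.maximum(0.0, W1 @ x)   # hidden layer
    logits = W2 @ h               # output layer takes the previous layer's outputs
    return softmax(logits), h     # action probabilities

rng = np.random.default_rng(0)
x = rng.normal(size=4)            # example state with 4 features
W1 = rng.normal(size=(8, 4)) * 0.1
W2 = rng.normal(size=(3, 8)) * 0.1
probs, _ = policy_forward(x, W1, W2)   # distribution over 3 actions
```

The parameterization of the output distribution comes from all the weights below it, which is the point made above.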

The blog post linked above describes a very nice and simple example of a vanilla policy gradient with a NN (the REINFORCE algorithm). By working through the code (plus your understanding of feedforward networks), you will see that the gradients are multiplied by the reward. It is a very good exercise!
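To sketch the "gradients multiplied by the reward" point (a hypothetical linear-softmax policy, not the blog's exact code): the REINFORCE gradient is the score $\nabla_\theta \log \pi(a|s)$ scaled by the return $G$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_grad(theta, s, a, G):
    """REINFORCE gradient for a linear-softmax policy pi(.|s) = softmax(theta @ s).

    For this policy, grad_theta log pi(a|s) has row k equal to
    (1[k == a] - pi(k|s)) * s, and REINFORCE scales it by the return G.
    """
    probs = softmax(theta @ s)
    score = -np.outer(probs, s)   # rows: -pi(k|s) * s
    score[a] += s                 # add the one-hot term for the taken action
    return G * score              # gradient multiplied by the reward/return

theta = np.zeros((3, 4))          # 3 actions, 4 state features
s = np.array([1.0, 0.5, -0.5, 2.0])
g = reinforce_grad(theta, s, a=1, G=2.0)
```

With zero weights the policy is uniform, so the gradient pushes up the log-probability of the taken action in proportion to the return.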

For actor-critic, you generally need a network performing PG (stochastic or deterministic), your actor, and a network that gives you the reward signal (like the simple case in the blog). However, for various reasons, instead of the actual reward we use another network that estimates it by performing Q-learning, as in deep Q-learning (minimizing the squared error between the estimated and true reward).
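To make the two-network picture concrete, here is a hedged sketch with linear function approximators standing in for the two nets (the names and step sizes are illustrative): the critic is updated by a semi-gradient step on the squared TD error, and the actor by the score function scaled with that TD error.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def actor_critic_step(theta, w, s, a, r, s_next, gamma=0.99,
                      alpha_actor=0.01, alpha_critic=0.1):
    """One on-policy actor-critic update with a linear actor and critic."""
    # Critic: TD error from the current value estimates V(s) = w @ s.
    td_error = r + gamma * (w @ s_next) - (w @ s)
    # Critic update: semi-gradient descent on the squared TD error.
    w = w + alpha_critic * td_error * s
    # Actor update: score of the linear-softmax policy, scaled by the TD error.
    probs = softmax(theta @ s)
    score = -np.outer(probs, s)
    score[a] += s
    theta = theta + alpha_actor * td_error * score
    return theta, w, td_error

theta = np.zeros((2, 3))   # actor parameters (2 actions, 3 state features)
w = np.zeros(3)            # critic parameters
s = np.array([1.0, 0.0, 0.5])
s_next = np.array([0.0, 1.0, 0.5])
theta, w, delta = actor_critic_step(theta, w, s, a=0, r=1.0, s_next=s_next)
```

In practice both approximators would be deep networks trained with an optimizer, but the structure of the update is the same.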

Hope this helps!

The relationship between this answer and equation 2 from the OP's question is a bit tenuous: basically you are saying the explanation is available in a paper linked from the blog/tutorial. I'm not sure I understand your third paragraph well enough to tell whether it summarises the paper accurately. I am still stuck on questioning the assumption that $w$ and $\theta$ can even be guaranteed to have the same dimension, since $f_w$ is always a scalar value but $\pi$ can have any number of dimensions . . . I guess it would be nice to see some equations, with the terms in them explained. – Neil Slater – 2017-12-08T18:25:14.287

Well, not really. It's not the optimization of the neural net I asked about. In actor-critic methods we maintain two neural networks: one for the policy, which is optimized with policy gradients, and one for estimating the reward, which is optimized with the TD error (value approximation can be simplified to TD when we use a baseline reward function, which is a value function, to reduce variance). So in the second condition there is a derivative that needs to be matched. How do we fulfill that? – Shamane Siriwardhana – 2017-12-09T17:32:18.633

Neil, I totally agree with you that the answer I gave is kind of ambiguous. However, the question is not well defined, in my opinion. The OP asks how to design AC with neural nets. The details in the question, though, reveal that he actually implies that AC with NNs should be designed by satisfying the two conditions. This is not true in the case of NNs. In general, and to the extent of my knowledge, there are no theoretical guarantees that NNs will converge (training is very unstable). The two conditions are in most cases relaxed. – Constantinos – 2017-12-09T18:19:38.683

The second condition is almost never true. There are approximations that can be made for certain classes of functions. Usually you end up having $f_w = w^\top \nabla_\theta \log\pi(\theta)$. But with multiple nonlinear approximators such as NNs, theoretical performance guarantees are impossible. Other methods, such as target networks and slow updates, are used to make the training stable. I am not sure, Neil, what you mean about $\pi$ having infinite dimensions. The $f_w$ approximates the advantage function $A(s,a)$, and $\pi(s,a;\theta)$ is your policy. – Constantinos – 2017-12-09T18:38:07.127

I mean that $\nabla_{w}f_w$ and $\nabla_{\theta} \text{log}(\pi)$ have trouble being equal unless their vector sizes match, which means both $w$ and $\theta$ must hold exactly the same number of parameters. This is quite hard to arrange if $\pi$ has anything other than a single dimension, which it won't in many problems (the only exception would be a single real-valued action). BTW, the LaTeX code you were looking for is \nabla – Neil Slater – 2017-12-11T19:48:14.693

Yes, for the equality to hold we need equal-size parameters. However, as I stated above, in the case of NNs these conditions do not hold. The OP makes the assumption that these conditions should hold for every parametrization, which is not correct. And yes, you do not have theoretical guarantees for AC with NNs, at least to the extent of my knowledge. Thanks! I don't see how I can edit my comment... – Constantinos – 2017-12-12T22:20:15.110