25 What is the relation between Q-learning and policy gradient methods? 2018-04-28T03:11:16.087

11 How can policy gradients be applied in the case of multiple continuous actions? 2017-09-21T08:27:28.160

9 Do off-policy policy gradient methods exist? 2017-12-23T18:41:29.570

9 Why is a baseline conditioned on the state at some timestep unbiased? 2018-09-09T20:31:07.373

8 Why does it make sense to normalize rewards per episode in reinforcement learning? 2019-01-24T13:56:08.333

6 Why do Bellman equations indirectly create a policy? 2017-12-18T13:27:20.397

5 Reinforcement Learning with more actions than states 2019-03-18T23:16:00.460

5 Is REINFORCE the same as 'vanilla policy gradient'? 2019-03-27T13:01:19.460

4 Why does the "reward to go" trick in policy gradient methods work? 2018-12-20T01:00:04.310

4 How is the policy gradient calculated in REINFORCE? 2019-04-21T19:23:33.580

4 Eligibility vector for softmax policy with policy gradients 2019-12-05T19:23:35.307

4 How to calculate the advantage in policy gradient functions? 2020-03-17T08:49:36.980

4 How does the Ornstein-Uhlenbeck process work, and how is it used in DDPG? 2020-08-21T20:00:04.873

3 Can I use deterministic policy gradient methods for stochastic policy learning? 2019-01-31T01:33:56.620

3 Meaning of Actor Output in Actor Critic Reinforcement Learning 2019-02-06T15:04:21.750

3 How large should the replay buffer be? 2019-04-04T14:40:34.553

3 Reinforcement Learning without state space 2019-06-12T14:28:23.273

3 Policy gradient methods for continuous action space 2019-08-07T16:23:08.983

3 Are these two TRPO objective functions equivalent? 2019-10-07T05:15:32.387

3 What could be the cause of the drop in the reward in A3C? 2019-10-28T07:47:59.513

3 Purpose of using actor-critic algorithms under deterministic MDP dynamics? 2019-11-12T14:25:44.363

3 Why is the stationary distribution independent of the initial state in the proof of the policy gradient theorem? 2019-12-03T10:50:41.380

3 Reinforcement Learning on quantum circuit 2019-12-18T09:43:09.777

3 In the policy gradient equation, is $\pi(a_{t} | s_{t}, \theta)$ a distribution or a function? 2020-02-21T16:23:15.443

3 What is the purpose of argmax in the PPO algorithm? 2020-03-13T08:42:01.437

3 Representation of state space, action space and reward system for Reinforcement Learning problem 2020-03-22T20:22:34.407

3 Appropriate algorithm for RL problem with sparse rewards, continuous actions and significant stochasticity 2020-04-23T09:39:41.123

3 What does it mean to parameterise a policy in policy gradient methods? 2020-05-06T11:04:09.407

3 Does being on-policy prevent us from using the replay buffer with policy gradients? 2020-05-12T15:17:21.890

3 What happens when you select actions using softmax instead of epsilon greedy in DQN? 2020-06-23T16:47:51.683

3 Choosing a policy improvement algorithm for a continuing problem with continuous action and state-space 2020-07-22T08:00:14.133

3 Comparing the derivation of the Deterministic Policy Gradient Theorem with the standard Policy Gradient Theorem 2020-08-04T07:10:33.170

3 Why does REINFORCE work at all? 2020-08-15T12:30:38.393

3 Why does the distribution of states (not) depend on the policy parameters that induce it? 2020-08-27T10:36:32.770

2 How to include exploration in Gaussian policy 2019-01-22T12:02:32.973

2 What information should be cached in experience replay for actor-critic? 2019-01-31T23:35:08.680

2 Calculating gradient for log policy when variance is not constant 2019-02-13T11:27:58.547

2 How does the TRPO surrogate loss account for the error in the policy? 2019-05-02T15:31:08.017

2 How to enforce covariance-matrix output as part of the last layer of a Policy Network? 2019-07-14T08:47:20.760

2 What are the pros and cons of using standard deviation or entropy for exploration in PPO? 2019-09-08T10:51:15.487

2 How do policy gradients compute an infinite probability distribution from a neural network 2019-09-15T21:28:37.853

2 Solving a Multi-Armed, "Multi-Bandit" Problem 2019-10-29T01:23:30.953

2 What is the complexity of policy gradient algorithms compared to discrete action space algorithms? 2019-11-07T07:24:35.957

2 How is the gradient calculated in Andrej Karpathy's pong code? 2019-11-07T14:48:22.037

2 How to set the multiple continuous actions with constraints 2019-11-25T07:55:47.423

2 What is the effect of picking actions deterministically at inference with Policy Gradient Methods? 2019-11-28T23:19:57.650

2 What is the difference between Sutton's and Levine's REINFORCE algorithm? 2020-01-07T22:47:17.180

2 Is the negative of the policy loss function in a simple policy gradient algorithm an estimator of expected returns? 2020-02-24T00:12:19.963

2 How can I constraint the actions with dependent coordinates? 2020-02-26T20:13:22.920

2 Monte Carlo updates on policy gradient with no terminal state 2020-02-27T00:31:05.983

2 Policy Gradient Reward Oscillation in MATLAB 2020-03-17T14:58:40.440

2 Could we update the policy network with previous trajectories using supervised learning? 2020-04-12T10:08:58.267

2 How can I sample the output distribution multiple times when pruning the filters with reinforcement learning? 2020-04-26T01:16:07.933

2 How is the log-derivative trick of a trajectory derived? 2020-04-26T21:42:54.467

2 What is the gradient of the Q function with respect to the policy's parameters? 2020-04-30T00:00:24.280

2 PPO algorithm converges on only one action 2020-05-04T19:26:31.697

2 How long should the state-dependent baseline for policy gradient methods be trained at each iteration? 2020-05-08T11:15:34.553

2 Advantage computed the wrong way? 2020-05-14T21:47:16.960

2 Is this the correct gradient for log of softmax? 2020-05-17T21:25:43.497

2 How do I derive the gradient with respect to the parameters of the softmax policy? 2020-05-19T05:02:46.307

2 Learning policy where action involves discrete and continuous parameters 2020-05-22T05:20:47.083

2 Non-differentiable reward function to update a neural network 2020-06-09T19:42:50.047

2 Understanding the "unrolling" step in the proof of the policy gradient theorem 2020-06-23T08:21:16.683

2 How should we interpret all the different metrics in reinforcement learning? 2020-07-07T16:25:28.417

2 How can I classify policy gradient methods in RL? 2020-07-11T08:25:32.980

2 What kind of policy evaluation and policy improvement are AlphaGo, AlphaGo Zero and AlphaZero using? 2020-07-17T14:02:38.373

2 Is it common to have extreme policy probabilities? 2020-07-20T21:11:43.370

2 Generation of 'new log probabilities' in continuous action space PPO 2020-08-26T20:02:03.287

2 What's an example of a simple policy but a complex value function? 2020-08-27T10:16:55.680

1 Impact of Varying Length Trajectories on Policy Gradient Optimization 2019-01-12T00:54:38.873

1 Neural network with logical hidden layer - how to train it? Is it a policy gradient problem? Chaining NNs? 2019-01-29T14:23:01.237

1 Policy gradient loss for neural network training 2019-04-20T20:36:44.257

1 Understanding policy update in PPO2 2019-07-29T11:32:37.603

1 Why are image classification tasks dominated by minimizing cost functions instead of maximizing them? 2019-10-14T01:28:18.200

1 How does the policy gradient's derivative work? 2019-11-08T02:31:39.923

1 Deciding std. deviation for policy network output? 2019-12-10T06:49:06.140

1 Should I consider the mean or a sampled value for action selection in the PPO algorithm? 2019-12-10T18:16:49.123

1 Is the TD-residual defined for timesteps $t$ past the length of the episode? 2020-04-03T16:09:34.237

1 Is there a good and easy paper to code policy gradient algorithms (REINFORCE) from scratch? 2020-04-19T01:34:40.647

1 How can I design a DQN or policy gradient model to explore and collect all optimal solutions? 2020-04-24T12:23:53.930

1 Will subtracting the entropy from our policy gradient prevent our agent from getting stuck in a local minimum? 2020-04-25T13:47:44.187

1 What if the rewards induced by an environment are related to the policy too? 2020-05-01T19:31:56.960

1 Is the reward following after time step $t+1$ collected based on current policy? 2020-05-10T14:27:15.683

1 How does the gradient increase the probabilities of the path with a positive reward in policy gradient? 2020-05-12T09:51:11.073

1 Why can a single trajectory be used to update the policy network $\theta$ in A3C? 2020-05-12T14:33:18.140

1 In vanilla policy gradient is the baseline lagging behind the policy? 2020-05-22T12:58:12.750

1 Policy Gradient on Tic-Tac-Toe not working 2020-05-22T17:17:16.697

1 How can I perform a policy update in Python? 2020-05-23T07:29:04.767

1 Using a model-based method to build an accurate day trading environment model 2020-05-27T00:52:09.220

1 Should I use exploration strategy in Policy Gradient algorithms? 2020-06-06T21:38:24.673

1 What is the proof that "reward-to-go" reduces variance of policy gradient? 2020-06-10T13:38:53.023

1 Is this figure a correct representation of off-policy actor-critic methods? 2020-06-21T14:27:22.673

1 How to optimize neural network parameters with REINFORCE 2020-07-13T15:24:21.633

1 In continuous action spaces, how is the standard deviation, associated with Gaussian distribution from which actions are sampled, represented? 2020-07-18T16:42:24.300

1 Why is the policy loss the mean of $-Q(s, \mu(s))$ in the DDPG algorithm? 2020-07-22T01:18:42.457

1 What is the difference between vanilla policy gradient and advantage actor-critic? 2020-07-27T04:40:57.510

1 DDPG doesn't converge for MountainCarContinuous-v0 gym environment 2020-08-09T15:50:26.347

1 Customized food for people based on their profiles using Reinforcement Learning 2020-08-13T06:55:54.493

0 Can gradient descent training be used for nonsmooth loss functions? 2019-01-10T08:29:05.757