9 Is the mean-squared error always convex in the context of neural networks? 2017-08-22T14:26:51.633

9 What are the best known gradient-free training methods for deep learning? 2017-08-24T12:42:23.943

8 Can a non-differentiable layer be used in a neural network, if it's not learned? 2018-08-24T17:34:05.080

6 How is the gradient calculated for the middle layer's weights? 2018-03-08T16:22:55.040

6 CNN backpropagation with stride > 1 2018-03-22T10:48:09.347

6 How can a neural network learn when the derivative of the activation function is 0? 2019-01-24T04:22:43.037

6 How are local minima possible in gradient descent? 2019-04-23T19:12:53.727

6 What is the formula for the momentum and Adam optimisers? 2020-01-13T07:04:57.613

5 How to avoid falling into the "local minima" trap? 2016-08-05T10:39:31.520

5 For each epoch, can I use only a subset of the full training dataset to train the neural network? 2018-07-27T17:48:59.153

5 How do I choose the optimal batch size? 2018-10-21T17:09:04.533

5 Can the mean squared error be negative? 2018-11-17T15:47:15.667

5 Are on-line backpropagation iterations perpendicular to the constraint? 2019-03-23T16:03:49.737

5 Can neuroevolution be combined with gradient descent? 2019-07-08T10:29:58.843

5 Is the gradient at a layer independent of the activations of the previous layers? 2019-10-11T05:27:08.410

4 Can a neural network learn to avoid wrong decisions using backpropagation? 2017-08-21T16:53:00.563

4 What are some concrete steps to deal with the vanishing gradient problem? 2018-01-25T18:31:10.407

4 Do good approximations produce good gradients? 2018-09-21T22:38:36.357

4 Which function $(\hat{y} - y)^2$ or $(y - \hat{y})^2$ should I use to compute the gradient? 2019-05-31T12:54:50.300

4 An intuitive explanation of Adagrad, its purpose and its formula 2019-08-16T16:51:46.013

4 Why is batch gradient descent performing worse than stochastic and minibatch gradient descent? 2019-09-26T16:21:35.740

4 Eligibility vector for softmax policy with policy gradients 2019-12-05T19:23:35.307

4 Is running more epochs really a direct cause of overfitting? 2019-12-28T15:39:18.030

4 Is there a reason to choose regular momentum over Nesterov momentum for neural networks? 2020-02-04T21:58:02.993

4 How is the Jacobian a generalisation of the gradient? 2020-05-13T02:18:59.550

3 How to calculate the gradient of a filter in a convolutional network 2018-04-13T05:50:47.720

3 How to perform gradient checking in a neural network with batch normalization? 2018-04-24T12:52:12.787

3 Why use feature scaling for skewed contours? 2018-06-23T06:52:10.587

3 How do I calculate the gradient of the hinge loss function? 2018-10-06T11:49:18.957

3 Deep Q-Learning: why don't we use mini-batches during experience replay? 2018-11-05T09:10:34.903

3 Could the error surface shape be useful to detect which local minimum is better for generalization? 2019-03-01T20:46:51.720

3 Does Retina-net's focal loss accomplish its goal? 2019-08-03T14:53:51.103

3 What's the function that SGD takes to calculate the gradient? 2020-01-14T22:02:19.673

3 What is the purpose of argmax in the PPO algorithm? 2020-03-13T08:42:01.437

3 How to implement the mean squared error loss function in mini-batch GD 2020-04-09T17:51:33.530

3 What exactly is averaged when doing batch gradient descent? 2020-04-18T21:21:21.223

3 How does SGD escape local minima? 2020-05-31T09:56:07.290

2 Does Musk know what gradient descent is? 2017-04-21T05:38:50.523

2 Are gradients of weights in RNNs dependent on the gradient of every neuron in that layer? 2017-08-04T22:44:21.400

2 How is the direction of weight change determined by the gradient descent algorithm? 2018-03-14T06:11:01.467

2 Why use semi-gradient instead of full gradient in RL problems, when using function approximation? 2018-04-24T23:11:25.637

2 How do I implement softmax forward propagation and backpropagation to replace sigmoid in a neural network? 2018-05-10T18:43:12.950

2 What does the notation $\nabla_\theta \mathcal{L}$ mean? 2018-07-06T21:26:06.797

2 How can we calculate the gradient of the Boltzmann policy over reward function? 2018-07-14T13:27:12.740

2 Should the weights of a neural network be updated after each example or at the end of the batch? 2018-10-19T19:53:17.767

2 Will LMS always be a convex function? If yes, then why do we change it for neural networks? 2019-01-30T13:29:47.810

2 What is the gradient of the objective function in the Soft Actor-Critic paper? 2019-02-13T08:33:16.277

2 Is backpropagation applied for each data point or for a batch of data points? 2019-04-05T08:34:13.667

2 How can we reach global optimum? 2019-04-09T05:10:55.027

2 What is the right formula for the weight update rule in logistic regression using stochastic gradient descent? 2019-05-07T17:23:47.053

2 Neural networks when gradient descent is not possible 2019-06-19T08:28:00.287

2 How to calculate multiobjective optimization cost for ordinary problems? 2019-10-02T01:35:17.870

2 How to reduce variance of the model loss during training? 2019-12-01T11:49:36.080

2 Why are gradients so small in deep learning? 2020-02-29T08:43:50.950

2 Different methods of calculating gradients of the cost function (loss function) 2020-03-31T11:59:46.407

2 Why do we update all layers simultaneously while training a neural network? 2020-04-16T06:37:29.933

2 How long should the state-dependent baseline for policy gradient methods be trained at each iteration? 2020-05-08T11:15:34.553

2 If the normal equation works, why do we need gradient descent? 2020-07-08T14:15:23.210

1 Can a second network take as input the weights of a first network and help training the first network? 2017-07-07T22:42:50.787

1 What would 1D gradient descent look like? 2017-07-18T13:24:40.043

1 How is the adaptive gradient calculated? 2018-02-19T16:54:33.217

1 How should I update the weights of a neural network, given the gradient? 2018-05-03T21:06:29.713

1 What can be deduced about the "algorithm" of backpropagation/gradient descent? 2018-05-13T15:48:56.120

1 Mini-batch training and the gradient 2018-08-09T03:31:38.700

1 Is gradient descent with backpropagation better than conjugate gradient descent for neural networks? 2018-11-07T15:13:22.593

1 SARSA won't work for a linear function approximator for MountainCar-v0 in the OpenAI environment. What are the possible causes? 2018-12-18T17:02:51.690

1 Feed-forward neural network using NumPy for the Iris dataset 2018-12-22T17:34:42.387

1 neuralnetworksanddeeplearning.com chapter 5 problems 2018-12-24T17:12:58.153

1 Cost function increasing with SGD 2019-01-10T03:38:42.657

1 Neural network with a logical hidden layer - how to train it? Is it a policy gradient problem? Chaining NNs? 2019-01-29T14:23:01.237

1 How to obtain a formula for loss, when given an iterative update rule in gradient descent? 2019-02-12T12:32:36.097

1 Which local minima to choose according to the shape of the error surface? 2019-03-01T21:47:19.977

1 When are bias values updated in backpropagation? 2019-04-08T16:29:47.503

1 How does NEAT find the most successful generation without gradients? 2019-06-04T19:53:37.460

1 Sensitivity of neural network to inputs 2019-06-24T11:13:58.117

1 In a NN, as the number of gradient descent iterations increases, the accuracy on the test/CV set decreases. How can I resolve this? 2019-06-28T09:03:02.233

1 Online Learning for Neural Networks 2019-08-19T11:08:49.433

1 How to plot the loss landscape with more than 2 weights in the network 2019-10-18T03:28:00.377

1 Understanding the partial derivative with respect to the weight matrix and bias 2019-10-19T08:56:10.823

1 Can Grad CAM feature maps be used for Training? 2019-11-28T13:12:13.470

1 How can I train a neural network to find the hyper-parameters with which the data was generated? 2019-12-24T20:55:22.940

1 How many parameters are being optimised over in a simple CNN? 2019-12-26T19:18:10.280

1 Is Gradient Descent algorithm a part of Calculus of Variations? 2020-02-04T08:00:06.863

1 Oscillating around the saddle point in gradient descent? 2020-03-04T13:09:59.623

1 Understanding the derivation of the first-order model-agnostic meta-learning 2020-03-09T09:28:08.753

1 How to prove that gradient descent doesn't necessarily find the global optimum? 2020-03-16T02:36:18.900

1 Which activation functions can lead to the vanishing gradient problem? 2020-03-16T11:58:22.280

1 What is the difference between batch and mini-batch gradient descent? 2020-03-28T05:03:17.583

1 How are the weights retained for filters for a particular class in a CNN? 2020-03-31T04:17:51.197

1 What is the relation between gradient descent and regularization in deep learning? 2020-04-01T06:03:10.190

1 What do these numbers represent in this picture of a surface? 2020-04-10T22:11:51.313

1 What is the gradient of a non-linear SVM with respect to the input? 2020-04-11T08:57:39.667

1 If the output of a model is a ridge function, what should the activation functions at all the nodes be? 2020-05-20T07:12:26.083

1 What is the equation of the learning rate decay in the Adam optimiser? 2020-06-17T07:07:17.663

1 Why does the cost/loss start to increase for some iterations during the training phase? 2020-07-06T23:30:18.850

1 Isn't it true that using max over a softmax will be much slower because there is not a smooth gradient? 2020-07-09T19:37:22.937

1 Implementing the gradient descent algorithm in Python, a bit confused regarding the equations 2020-08-11T15:31:42.437

0 What is the proof behind the gradient of a curve being proportional to the distance between the two coordinates on the x-axis? 2018-01-20T10:58:40.943

0 Behaviour of cost 2018-07-31T17:42:27.820
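Several entries above ask for the concrete update rules behind these optimisers (the momentum and Adam formulas, how weights are updated given the gradient, how SGD behaves on a simple objective). As a minimal sketch, assuming standard textbook forms of classical momentum and Adam and an illustrative toy quadratic objective (none of this is taken from any particular question above):

```python
import math

def momentum_step(w, grad, v, lr=0.01, beta=0.9):
    """Classical (heavy-ball) momentum: v is an exponentially
    decaying accumulation of past gradients; w moves against v."""
    v = beta * v + grad
    return w - lr * v, v

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: bias-corrected first-moment (m) and second-moment (v)
    estimates; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Toy example: minimise f(w) = w^2 (gradient 2w) starting from w = 5.
w_m, v_m = 5.0, 0.0
w_a, m_a, v_a = 5.0, 0.0, 0.0
for t in range(1, 501):
    w_m, v_m = momentum_step(w_m, 2 * w_m, v_m, lr=0.05)
    w_a, m_a, v_a = adam_step(w_a, 2 * w_a, m_a, v_a, t, lr=0.05)
print(w_m, w_a)  # both end close to the minimum at w = 0
```

Note the design difference the questions circle around: momentum scales the step by the raw accumulated gradient, while Adam also normalises by a running estimate of the gradient's magnitude, which is why Adam's effective step size is roughly bounded by `lr` regardless of gradient scale.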