How can the performance of a neural network vary considerably without changing any parameters?

1

I am training a neural network with 1 sigmoid hidden layer and a linear output layer. The network simply approximates a cosine function. The weights are initialized according to Nguyen-Widrow initialization and the biases are initialized to 1. I am using MATLAB as a platform.

Running the network a number of times without changing any parameters, I am getting results (mean squared error) which range from 0.5 to 0.5*10^-6. I cannot understand how the results can vary that much; I'd imagine there would at least be a narrower and more consistent window of errors.

What could be causing such a large variance?

edgaralienfoe

Posted 2015-03-12T11:08:43.067

Reputation: 113

Answers

2

In general, there is no guarantee that ANNs such as a multi-layer perceptron network will converge to the global minimum of the mean squared error (MSE). The final state of the network can be heavily dependent on how the network weights are initialized. Since most initialization schemes (including Nguyen-Widrow) use random numbers to generate the initial weights, it is quite possible that some initial states will result in convergence to local minima, whereas others will converge to the global MSE solution.
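This run-to-run variance is easy to reproduce outside MATLAB. The sketch below (Python/NumPy rather than the asker's MATLAB setup; all names and hyperparameters are illustrative) trains the same tiny network — one sigmoid hidden layer, linear output, biases initialized to 1 as in the question — on cos(x) several times, changing nothing but the random weight initialization:

```python
import numpy as np

def train_once(seed, hidden=4, epochs=2000, lr=0.05):
    """Train a 1-hidden-layer (sigmoid) network with a linear output
    to approximate cos(x) using plain batch gradient descent."""
    rng = np.random.default_rng(seed)
    x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
    y = np.cos(x)
    # Random initial weights -- the only thing that differs between runs.
    # Biases start at 1, as in the question.
    W1 = rng.normal(0, 1, (1, hidden)); b1 = np.ones((1, hidden))
    W2 = rng.normal(0, 1, (hidden, 1)); b2 = np.ones((1, 1))
    for _ in range(epochs):
        h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))   # sigmoid hidden layer
        yhat = h @ W2 + b2                          # linear output layer
        err = yhat - y
        # Backpropagation of the squared-error loss.
        gW2 = h.T @ err / len(x); gb2 = err.mean(0, keepdims=True)
        dh = (err @ W2.T) * h * (1 - h)             # sigmoid derivative
        gW1 = x.T @ dh / len(x); gb1 = dh.mean(0, keepdims=True)
        W2 -= lr * gW2; b2 -= lr * gb2
        W1 -= lr * gW1; b1 -= lr * gb1
    return float((err ** 2).mean())

mses = [train_once(seed) for seed in range(5)]
print(mses)  # final MSE differs from run to run
```

Each seed lands in a different basin of the (non-convex) error surface, so the final MSEs span a wide range even though nothing else changed.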

bogatron

Posted 2015-03-12T11:08:43.067

Reputation: 826

I agree. I primarily tested with the weights initialized to 1 (I have my reasons) and it's how I first noticed the large variance. Then I initialized the weights using NW to check if the results would be better. There was some improvement but the large variance was still present. – edgaralienfoe – 2015-03-12T13:28:36.473

If you repeat training with the same set of initial weights, I would expect the same result (unless you have some sort of asynchronous processing going on). One other point: the difference in MSE between 0.5 and 0.5*10^-6 is only about 0.5, which isn't necessarily a large difference, depending on your training set size, number of outputs, and initial MSE. – bogatron – 2015-03-12T15:00:01.067

The variance was still being observed with weights and biases all initialized to 1, which is why I felt confused when this happened. The data set is also consistent and it contains 10,000 values. There is less variance now even with such initialization, however every now and then it tends to happen and it seems very strange. Is it possible that the error surface being created is different every time and hence the network sometimes gets stuck in a local minimum by chance? – edgaralienfoe – 2015-03-12T15:44:52.797

What makes you say that 0.5 vs. 0.5*10^-6 is a large variance? What is the MSE at the start of training? Also, I'm suspicious about all weights being initialized to 1. If all the weights in an MLP are initialized to the same value, then I would expect all final weights for a given layer to converge to a common value. – bogatron – 2015-03-12T15:53:26.390

I have one output since it's approximating a one-dimensional cosine function, and 10,000 values for the training set size. The initial MSE (after the first epoch using Levenberg-Marquardt BP) is 14.6268. The neural network has one hidden layer with 4 neurons and it is approximating cos(Pi). All the weights and biases are initialized to 1. It might be worth noting that the data division is happening at random (I'm calling dividerand in MATLAB) every time. Again, the dataset is consistent, and each point is evenly spaced out (that is, the cosine function is sampled with a fixed step). – edgaralienfoe – 2015-03-12T19:12:01.543

0

I asked a similar question:

https://stats.stackexchange.com/q/140406/70282

What I ended up doing is having a loop repeatedly create NNs of a small size, looking for the best root mean square error.

Essentially:

    For hidden layer size of 1 to 10:
        For trials of 1 to 10:
            Create NN and test for RMSE with test data.
            Remember the NN and best RMSE for this hidden layer size.

Review the results, choosing the smallest hidden layer size which had a small RMSE.
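That selection loop could be sketched as follows (Python/NumPy rather than MATLAB; the network, hyperparameters, and the 1.5x "close enough" tolerance are all illustrative assumptions, not part of the original recipe):

```python
import numpy as np

def fit_rmse(x_tr, y_tr, x_te, y_te, hidden, rng, epochs=1500, lr=0.05):
    """Train a minimal 1-hidden-layer sigmoid net (linear output) and
    return RMSE on the held-out test data."""
    W1 = rng.normal(0, 1, (1, hidden)); b1 = np.zeros((1, hidden))
    W2 = rng.normal(0, 1, (hidden, 1)); b2 = np.zeros((1, 1))
    for _ in range(epochs):
        h = 1 / (1 + np.exp(-(x_tr @ W1 + b1)))
        err = h @ W2 + b2 - y_tr
        dh = (err @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ err / len(x_tr); b2 -= lr * err.mean(0, keepdims=True)
        W1 -= lr * x_tr.T @ dh / len(x_tr); b1 -= lr * dh.mean(0, keepdims=True)
    h = 1 / (1 + np.exp(-(x_te @ W1 + b1)))
    return float(np.sqrt(((h @ W2 + b2 - y_te) ** 2).mean()))

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1); y = np.cos(x)
idx = rng.permutation(len(x)); tr, te = idx[:150], idx[150:]  # hold-out split
best = {}
for hidden in range(1, 6):          # outer loop: hidden layer size
    for _ in range(5):              # inner loop: independent random restarts
        rmse = fit_rmse(x[tr], y[tr], x[te], y[te], hidden, rng)
        best[hidden] = min(best.get(hidden, np.inf), rmse)
# Pick the smallest hidden layer whose best RMSE is close to the overall best.
target = min(best.values()) * 1.5
chosen = min(h for h, r in best.items() if r <= target)
print(best, chosen)
```

Scoring on held-out data (rather than training error) is what keeps the loop from simply picking the largest, most overfit network.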

By separating your data into training and testing data, you don't pick an overfit NN.

Chris

Posted 2015-03-12T11:08:43.067

Reputation: 221