Why do neural networks trained on identical datasets with identical hyper-parameters achieve different performance?



I found that fully connected neural networks trained on identical datasets with identical hyper-parameters can achieve different performance or accuracy (a deviation of 7-8%). Is this an unusual phenomenon? What causes it? What is an acceptable deviation?


Posted 2018-05-16T09:26:49.697

Reputation: 61

Question was closed 2020-08-18T23:11:14.357

Are you using dropout by any chance? – razvanc92 – 2020-01-17T12:52:50.050



This is most likely simply because network weight initialization is random: unless you fix the random seed, each run starts from different weights and converges to a different solution.
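A minimal NumPy sketch (not from the question, which names no framework) of this effect: the dataset and hyper-parameters are fixed, and only the seed controlling weight initialization and example shuffling changes, yet each run can end at a different accuracy.

```python
import numpy as np

def train(seed, epochs=50):
    """Train a tiny one-hidden-layer network on a fixed toy dataset.
    Only the RNG seed (weight init + shuffling) varies between calls."""
    rng = np.random.default_rng(seed)
    data_rng = np.random.default_rng(0)       # dataset is identical for every run
    X = data_rng.normal(size=(200, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(float)

    W1 = rng.normal(scale=0.5, size=(5, 8))   # random initialization differs per seed
    W2 = rng.normal(scale=0.5, size=(8,))
    lr = 0.1
    for _ in range(epochs):
        idx = rng.permutation(len(X))         # SGD-style shuffling also differs per seed
        for i in idx:
            h = np.tanh(X[i] @ W1)
            p = 1 / (1 + np.exp(-(h @ W2)))
            g = p - y[i]                      # gradient of logistic loss w.r.t. the logit
            grad_W2 = g * h
            grad_W1 = np.outer(X[i], g * W2 * (1 - h ** 2))
            W2 -= lr * grad_W2
            W1 -= lr * grad_W1
    h = np.tanh(X @ W1)
    p = 1 / (1 + np.exp(-(h @ W2)))
    return float(((p > 0.5) == y).mean())

accuracies = [train(seed) for seed in range(3)]
print(accuracies)
```

The spread across seeds shrinks if you fix the seed, train longer, or add regularization.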

k.c. sayz 'k.c sayz'

Posted 2018-05-16T09:26:49.697

Reputation: 1 835


Supervised learning is a statistical process that draws from a finite, discrete set of examples, so some variance is to be expected. (Recall that standard deviation is expressed in the same units as the measurement, while variance is in squared units.) In the case of stochastic gradient descent, which improves training speed and reliability under many conditions, pseudo-random factors are introduced deliberately, further contributing to the variance in accuracy measurements.
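A quick illustration of the sampling part of that variance, with assumed numbers: even if a fixed model classifies each held-out example correctly with probability 0.85, accuracy measured on a finite test set of 200 examples fluctuates noticeably from draw to draw.

```python
import numpy as np

rng = np.random.default_rng(42)

p_correct = 0.85   # assumed per-example probability of a correct prediction
n_test = 200       # assumed test-set size

# Each simulated "test run" counts correct predictions on a fresh finite sample;
# the resulting accuracy estimates scatter around 0.85 from sampling noise alone.
estimates = rng.binomial(n_test, p_correct, size=1000) / n_test
print(estimates.std())
```

The analytic standard deviation is sqrt(0.85 * 0.15 / 200), about 0.025, so a couple of percentage points of run-to-run spread arise before any training randomness is considered.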

However, the threshold of acceptable accuracy variance is usually not the primary metric of interest to system architects and corporate stakeholders. The maximum theoretical and empirical inaccuracy is of primary concern.

In the PAC (probably approximately correct) framework, the required accuracy threshold is $1 - \epsilon$. The amount of data needed to guarantee a minimum reliability, $1 - \delta$, can be determined for specific types of training objectives and for a given $\epsilon$.
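As a concrete instance, the classic PAC sample-complexity bound for a finite, consistent hypothesis class is $m \geq \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$. A small sketch with illustrative values:

```python
import math

def pac_sample_bound(epsilon, delta, hypothesis_count):
    """Classic PAC bound for a finite, consistent hypothesis class:
    m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon)

# Illustrative numbers: |H| = 2**20 hypotheses, 95% accuracy (epsilon = 0.05),
# 99% reliability (delta = 0.01).
m = pac_sample_bound(0.05, 0.01, 2 ** 20)
print(m)  # -> 370 examples suffice under these assumptions
```

Note the bound grows only logarithmically in the hypothesis-class size and in 1/delta, but linearly in 1/epsilon.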

One could theoretically extend the PAC framework to consider mean, variance, and skew, but it would not be of much interest to architects and stakeholders, who are accustomed to communicating assurances in forms like, "We know that with the data we have for training we can achieve 99.1% accuracy 89.3% of the time."

Douglas Daseeco

Posted 2018-05-16T09:26:49.697

Reputation: 7 174


If the runs use identical data and hyper-parameters but take place at different times, you have a variance problem. I suggest studying basic machine learning and the bias-variance tradeoff.

The intuition: the model has a large capacity and little or no regularization, so it has a large degree of freedom in how it fits the data.

What is an acceptable standard deviation level? Roughly +/- 0.05.

Is that an unusual phenomenon? No; see the insights in "Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks".
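The +/- 0.05 heuristic above can be checked directly: record the accuracy of each repeated run and compare the standard deviation against the threshold. A short sketch with assumed accuracy values:

```python
import statistics

# Hypothetical accuracies from five identical training runs (assumed values).
accuracies = [0.83, 0.85, 0.80, 0.86, 0.79]

mean = statistics.mean(accuracies)
std = statistics.pstdev(accuracies)   # population std dev over the observed runs
print(f"mean={mean:.3f}, std={std:.3f}")

# Heuristic from the answer above: std within ~0.05 is acceptable.
acceptable = std <= 0.05
```

With these numbers the spread is about 0.027, comfortably inside the heuristic; a std much above 0.05 would suggest adding regularization or reducing model capacity.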

Fadi Bakoura

Posted 2018-05-16T09:26:49.697

Reputation: 348