Regularization is an important prerequisite for improving the reliability, speed, and accuracy of convergence, but it is not a solution to every problem. Irregularity in the data is only one of many root causes of slow or otherwise inadequate learning, and as the results in the question indicate, regularization can itself reduce reliability, speed, or accuracy in some cases.
For those who are new to this topic, here are a few other root causes.
- Complexity beyond the capacity of the computing machinery to model adequately
- Insufficient number of examples for training
- Poor distribution alignment between data sets (training, testing, validation, production)
- Saturation of back-propagation values in floating point
- Outliers caused by errors in example creation or labeling
- Local minima in the loss function combined with insufficient stochastic injection in SGD
- Leaning too heavily on artificial neural network convergence and neglecting other known algorithms as part of the overall system architecture
- Using activation functions or hyper-parameters that are not well tuned to the example set or the model
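The saturation point above is easy to demonstrate. The sketch below (plain NumPy, function names my own) shows how the logistic function's gradient collapses as pre-activations grow in magnitude, which starves back-propagation of signal:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the logistic function: s * (1 - s)
    s = sigmoid(z)
    return s * (1.0 - s)

# The gradient peaks at 0.25 for z = 0 and vanishes quickly after that.
for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}  sigmoid'(z) = {sigmoid_grad(z):.6f}")
```

Once weights push pre-activations into that flat region, each layer multiplies an already tiny gradient, and learning effectively stalls.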
In the case of regularization, L1 and L2 have known properties, proven in theorem form, that guarantee convergence in fewer examples or epochs, but they rely on very specific models toward which convergence is targeted. L1 and L2 are not always beneficial for polynomial models, and trying to force them to be tends to lead to over-fitting.
The resulting trained networks lack the generalization needed to produce reliable behavior on data outside the training set. Such irregularities might not always be caught during testing and verification, and deployed systems can exhibit the signs of over-fitting over time because the network learned time-specific features.
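To make the L1/L2 mechanics concrete, here is a minimal sketch (plain NumPy, function names and parameter names my own, not from the question's code) of how the two penalties attach to the quadratic cost the question mentions:

```python
import numpy as np

def quadratic_cost(pred, target):
    # Mean squared error with the conventional 1/2 factor.
    return 0.5 * np.mean((pred - target) ** 2)

def regularized_cost(pred, target, weights, l1=0.0, l2=0.0):
    """Quadratic cost plus optional L1 (absolute) and L2 (squared) weight penalties."""
    penalty = l1 * np.sum(np.abs(weights)) + l2 * np.sum(weights ** 2)
    return quadratic_cost(pred, target) + penalty
```

The L1 term pushes small weights to exactly zero (sparsity), while the L2 term shrinks all weights smoothly; which behavior helps depends on the model family, which is why they are not universally beneficial.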
Overfitting can also be mitigated with dropout regularization, a technique first proposed in *Dropout: A Simple Way to Prevent Neural Networks from Overfitting* (Srivastava et al., 2014).
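The idea in that paper is simple enough to sketch in a few lines. This is the common "inverted dropout" variant (my own NumPy sketch, not code from the paper or the question): randomly zero a fraction of activations during training and rescale the survivors so the expected activation stays unchanged at test time:

```python
import numpy as np

def dropout(activations, p_drop, rng, train=True):
    """Inverted dropout: zero each unit with probability p_drop during training,
    rescale survivors by 1 / (1 - p_drop) so expected values are preserved."""
    if not train or p_drop == 0.0:
        return activations  # no-op at inference time
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)
```

Applied to hidden layers during training, this discourages units from co-adapting, which is the over-fitting mechanism the paper targets.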
Two other things to look at ...
- Is the logistic function the best choice, and for which layers? Its use seems to be trending down. More people are using ReLU and its softer varieties because of the higher performance shown in a number of problem domains and models.
- Is the noise injection clean? That is, is it a good pseudo-noise source, and is it configured properly for whatever SGD methods and libraries are being used?
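For reference, ReLU and two of its "softer" varieties are one-liners (a NumPy sketch, with the usual textbook definitions; the `alpha` parameter name is my own convention):

```python
import numpy as np

def relu(z):
    # Hard rectifier: max(0, z)
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Small negative slope avoids completely dead units.
    return np.where(z > 0, z, alpha * z)

def softplus(z):
    # Smooth approximation of ReLU: log(1 + e^z)
    return np.log1p(np.exp(z))
```

Unlike the logistic function, these do not saturate for large positive inputs, which is one reason they often train faster.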
The question describes the setup:

> I am training a multilayer neural net with 146 samples (97 for the training set, 20 for the validation set, and 29 for the testing set). I am using: automatic differentiation, the SGD method, a fixed learning rate plus a momentum term, the logistic function, a quadratic cost function, L1 and L2 regularization, and 3% artificial noise.
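For clarity on what "fixed learning rate + momentum term" means in that setup, here is a sketch of one classical-momentum SGD update (plain NumPy; the function and parameter names are mine, not from the question's code):

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One fixed-learning-rate SGD update with a classical momentum term.

    The velocity accumulates an exponentially decaying sum of past
    gradients, which damps oscillation and speeds travel along ravines.
    """
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity
```

With a fixed learning rate, the momentum coefficient is doing much of the work, so it is worth tuning alongside `lr` rather than leaving it at a default.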
The question indicated the best results for the current training set and model: the best result was reportedly obtained with a lambda of 0.001, without regularization.
That is the starting point for further improvement. Since fully automated convergence is not yet something libraries and frameworks provide, the engineer normally has to converge on the combination of techniques and settings that best helps the network converge.