Regularization is one of the important prerequisites for improving the reliability, speed, and accuracy of convergence, but it is not a solution to every problem. Irregularity in data is only one of many root causes of slow or otherwise inadequate learning, and as the results in the question indicate, regularization can actually reduce reliability, speed, or accuracy in some cases.

For those who are new to this topic, here are a few other root causes.

- Complexity beyond the capacity of the computing machinery to model adequately
- Insufficient number of examples for training
- Poor distribution alignment between data sets (training, testing, validation, production)
- Saturation of backpropagation values in floating point
- Outliers caused by errors in example creation or labeling
- Local minima in the loss function combined with insufficient stochastic injection in SGD
- Leaning too heavily on artificial network convergence and neglecting other known algorithms as part of an overall system architecture
- Using activation functions or hyperparameters that are not well tuned to the example set or the model
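The saturation item above is easy to demonstrate numerically: the logistic function's gradient collapses toward zero for large-magnitude inputs, so backpropagated values vanish. A minimal sketch, assuming NumPy (function names are mine):

```python
import numpy as np

def logistic(x):
    """Standard logistic (sigmoid) activation."""
    return 1.0 / (1.0 + np.exp(-x))

def logistic_grad(x):
    """Derivative of the logistic function: s(x) * (1 - s(x))."""
    s = logistic(x)
    return s * (1.0 - s)

# Near zero the gradient is healthy; at saturated inputs it collapses,
# which is what stalls backpropagation in floating point.
print(logistic_grad(0.0))   # 0.25
print(logistic_grad(10.0))  # ~4.5e-5
```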

In the case of regularization, L1 and L2 have known properties, proven in theorem form, that guarantee convergence in fewer examples or epochs, but those theorems rely on very specific models toward which convergence is targeted. L1 and L2 are not always beneficial for polynomial models, and trying to force them to be tends to lead to overfitting.
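To make the L1/L2 discussion concrete, here is a minimal sketch of how the two penalties attach to a quadratic cost. The function and parameter names are illustrative, not from any particular library:

```python
import numpy as np

def regularized_cost(y_pred, y_true, weights, l1=0.0, l2=0.0):
    """Quadratic cost plus optional L1 and L2 penalty terms.

    l1 and l2 play the role of the lambda hyperparameter; setting
    both to zero recovers the plain quadratic cost.
    """
    quadratic = 0.5 * np.mean((y_pred - y_true) ** 2)
    l1_term = l1 * sum(np.abs(w).sum() for w in weights)        # lambda * sum |w|
    l2_term = 0.5 * l2 * sum((w ** 2).sum() for w in weights)   # (lambda/2) * sum w^2
    return quadratic + l1_term + l2_term
```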

The resulting trained networks lack the generalization needed to behave reliably on data outside the training set. Such irregularities may not be caught during testing and verification, and deployed systems can begin to show signs of overfitting over time because the network learned time-specific features.

Overfitting can be prevented using dropout regularization, a technique first proposed in *Dropout: A Simple Way to Prevent Neural Networks from Overfitting* by Srivastava et al., 2014.
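A minimal sketch of the (inverted) dropout idea from that paper, assuming NumPy; the rescaling by 1/(1 - p_drop) keeps the expected activation the same between training and inference:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5, training=True):
    """Inverted dropout: zero each unit with probability p_drop during
    training and rescale the survivors; do nothing at inference time."""
    if not training or p_drop == 0.0:
        return activations
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)
```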

Two other things to look at ...

- Is the logistic function the best choice, and for which layers? Its use seems to be trending down; more people are using ReLU and its softer variants because of the higher performance shown in a number of problem domains and models.
- Is the noise injection clean, i.e., is it a good pseudo-random noise source, and is it configured properly with whatever SGD methods and libraries are being used?
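For the first point, a sketch of ReLU and two of its softer variants (leaky ReLU and softplus), assuming NumPy:

```python
import numpy as np

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """A softer ReLU variant: small negative slope instead of a hard zero."""
    return np.where(x > 0, x, alpha * x)

def softplus(x):
    """A smooth approximation of ReLU: log(1 + e^x)."""
    return np.log1p(np.exp(x))
```

Unlike the logistic function, ReLU's gradient does not vanish for large positive inputs, which is one reason it often trains faster in deep stacks.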

> I am training a multilayer neural net with 146 samples (97 for the training set, 20 for the validation set, and 29 for the testing set). I am using: automatic differentiation, SGD, a fixed learning rate plus a momentum term, the logistic function, a quadratic cost function, L1 and L2 regularization, and 3% artificial noise.
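The update rule that setup implies can be sketched as follows. The 3% noise is shown here as multiplicative Gaussian noise on the inputs, which is one common reading of "artificial noise"; both that placement and all names are my assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

def add_input_noise(x, frac=0.03):
    """Inject 3% multiplicative Gaussian noise into the inputs
    (one possible reading of 'artificial noise'; an assumption here)."""
    return x * (1.0 + frac * rng.standard_normal(x.shape))

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One SGD update with a fixed learning rate and a momentum term."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity
```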

The question reported the best result for the current training set and model.

The best result was obtained using a lambda of 0.001 without regularization.

That's the starting point for further improvement. Since fully automated convergence is not yet something libraries and frameworks provide, the engineer normally has to converge on the combination of techniques and settings that best helps the network converge.

Welcome to ai.se... It has been found that regularization of the polynomial form doesn't work well for NNs. Instead, use dropout. This makes intuitive sense, since the complexity of an NN does not depend much on the weights. – DuttaA – 2018-10-26T12:52:02.080

Hello. Thanks for your answer. Do you know where I can find the information (paper) about L1 and the polynomial form? Thanks!! – LVoltz – 2018-10-26T14:11:56.567

By polynomial form I meant |weights|^n. – DuttaA – 2018-10-26T14:25:43.643

I understand! Sorry for the misunderstanding. But I'd like to know more about why L1 does not fit well to NNs. Do you have some references to recommend? Thanks! – LVoltz – 2018-10-26T14:56:27.053

Not ALL polynomials. Proofs describe conditions under which convergence is guaranteed or improved, but not everything that has yet to be proven is inviable. There may be a large class of polynomials that perform better with L1 or L2 that hasn't been investigated or represented well on the web. There could also be conditions on the input that can be proven to give an advantage to L1 or L2. – FelicityC – 2018-10-26T15:48:21.437