I am writing a custom framework, and in it I'm trying to train a simple network to learn the addition function.
- 1 hidden layer of 3 neurons
- 1 output layer
- the cost function is squared error (not MSE, to avoid precision problems)
- Identity transfer function to make things simple at first
- no special updaters, just a plain gradient step with a fixed step size
- no learning rate decay
- no regularization
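For concreteness, here is a minimal NumPy sketch of the forward pass for this setup (my own variable names, not the framework's): 2 inputs -> 3 hidden neurons -> 1 output, identity transfer throughout.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(3, 2))   # hidden layer weights (3 neurons, 2 inputs)
b1 = np.zeros(3)                          # hidden layer biases
W2 = rng.normal(scale=0.1, size=(1, 3))   # output layer weights
b2 = np.zeros(1)                          # output layer bias

def forward(x):
    h = W1 @ x + b1        # identity transfer: no nonlinearity applied
    return W2 @ h + b2
```

With identity activations the whole network collapses to a single affine map, y = (W2·W1)·x + (W2·b1 + b2), so an exact optimum exists: any parameters with W2·W1 = [1, 1] and W2·b1 + b2 = 0 reproduce addition perfectly.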
The training set:
- ~500 samples of the form [n1][n2] => [n1 + n2]
- every element is between 0 and 1, e.g.:
  [0.5][0.3] => [0.8]
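A minimal sketch of generating such a set (the [0, 0.5] input range is my assumption, just one way to keep the target within [0, 1] as in the example):

```python
import numpy as np

# ~500 samples of [n1][n2] => [n1 + n2]
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 0.5, size=(500, 2))   # inputs n1, n2
y = X.sum(axis=1, keepdims=True)           # targets n1 + n2
```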
The algorithm I'm using to optimize (a full training-loop sketch follows this list):
- it samples 64 elements for each epoch
- for each sample: it evaluates the error
- then propagates the error back
- and then based on the error values calculates the gradients
- the gradients for each element are added up into one vector, then normalized by dividing by the number of samples evaluated
- after the gradients are calculated, a step size of 1e-2 is used to update the weights
- training stops when the sum of the errors over the 500 data elements is below 1e-2
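Putting this together, a self-contained sketch of the whole procedure (again with my own names, not the framework's; the backprop for the identity network is written out by hand):

```python
import numpy as np

rng = np.random.default_rng(42)

# training set: ~500 samples of [n1][n2] => [n1 + n2]
X = rng.uniform(0.0, 0.5, size=(500, 2))
y = X.sum(axis=1, keepdims=True)

# 2 -> 3 -> 1 network, identity transfer, with biases
W1 = rng.normal(scale=0.1, size=(3, 2)); b1 = np.zeros(3)
W2 = rng.normal(scale=0.1, size=(1, 3)); b2 = np.zeros(1)

step_size = 1e-2
for epoch in range(200_000):
    # sample 64 elements for this epoch
    idx = rng.choice(len(X), size=64, replace=False)
    xb, yb = X[idx], y[idx]

    # forward pass
    h = xb @ W1.T + b1                 # (64, 3)
    pred = h @ W2.T + b2               # (64, 1)

    # backpropagate the squared error e = (pred - y)^2
    d_pred = 2.0 * (pred - yb)         # (64, 1)
    gW2 = d_pred.T @ h / 64            # gradients averaged over the batch
    gb2 = d_pred.mean(axis=0)
    d_h = d_pred @ W2                  # error signal at the hidden layer
    gW1 = d_h.T @ xb / 64
    gb1 = d_h.mean(axis=0)

    # plain gradient step: no momentum, decay, or regularization
    W1 -= step_size * gW1; b1 -= step_size * gb1
    W2 -= step_size * gW2; b2 -= step_size * gb2

    # stop when the summed error over all 500 samples drops below 1e-2
    if np.sum(((X @ W1.T + b1) @ W2.T + b2 - y) ** 2) < 1e-2:
        break
```

In this sketch the bias gradients are just the backpropagated error signals averaged over the batch, handled identically to the weight gradients.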
I don't have a test dataset yet; first I'd like to overfit the training set to see whether the network can do even that. Without a bias, training converges to an optimum in about 4k epochs.
When I include bias tuning in the training, performance is much worse: the network does not converge to the optimum; instead the biases and the weights oscillate around one another.
Is this a normal effect of introducing a bias?