What is the best XOR neural network configuration out there in terms of low error?


I'm trying to understand what would be the best neural network for implementing an XOR gate. I consider a neural network good if it can produce all the expected outcomes with the lowest possible error.

It looks like my initial choice of random weights has a big impact on the end result: the final error of my net varies a lot depending on which initial weights I happen to draw.

I'm starting with a 2 x 2 x 1 neural net, with a bias in the input and hidden layers, using the sigmoid activation function and a learning rate of 0.5. Below is my initial setup, with weights chosen randomly:

[diagram: the initial 2 x 2 x 1 network with its random weights]

The initial performance is bad, as one would expect:

Input | Output | Expected | Error
(0,0) | 0.8845 | 0        | 39.117%
(1,1) | 0.1134 | 0        | 0.643%
(1,0) | 0.7057 | 1        | 4.3306%
(0,1) | 0.1757 | 1        | 33.9735%
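For what it's worth, the Error column matches half the squared error, ½·(output − expected)², written as a percentage. A quick check that reproduces the table's numbers:

```python
# Error column = half the squared error, as a percentage
# (inferred from the numbers in the table, not stated in the post).
rows = [
    ((0, 0), 0.8845, 0),
    ((1, 1), 0.1134, 0),
    ((1, 0), 0.7057, 1),
    ((0, 1), 0.1757, 1),
]
for inp, out, expected in rows:
    err = 0.5 * (out - expected) ** 2
    print(inp, f"{100 * err:.4f}%")  # e.g. (0, 0) 39.1170%
```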

Then I proceed to train my network through backpropagation, feeding the XOR training set 100,000 times. After training is complete, my new weights are:

[diagram: the same network with its weights after training]

And the performance improved to:

Input | Output | Expected | Error
(0,0) | 0.0103 | 0        | 0.0053%
(1,1) | 0.0151 | 0        | 0.0114%
(1,0) | 0.9838 | 1        | 0.0131%
(0,1) | 0.9899 | 1        | 0.0051%
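The per-sample (online) training procedure described here can be sketched roughly as below. This is not the poster's actual code: the initialization range, random seed, and epoch count are illustrative, and whether it converges depends on the initial weights, which is exactly the behavior being asked about.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR training set
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])

rng = np.random.default_rng(1)
W1 = rng.uniform(-1, 1, (2, 2)); b1 = rng.uniform(-1, 1, 2)  # hidden layer
W2 = rng.uniform(-1, 1, (1, 2)); b2 = rng.uniform(-1, 1, 1)  # output layer
lr = 0.5

for epoch in range(10_000):  # the post uses 100,000 passes; fewer shown here
    for x, t in zip(X, T):
        # forward pass
        h = sigmoid(W1 @ x + b1)
        y = sigmoid(W2 @ h + b2)
        # backward pass: squared-error loss, sigmoid derivative s * (1 - s)
        dy = (y - t) * y * (1 - y)
        dh = (W2.T @ dy) * h * (1 - h)
        # online update: weights change after every single example
        W2 -= lr * np.outer(dy, h); b2 -= lr * dy
        W1 -= lr * np.outer(dh, x); b1 -= lr * dh

for x in X:
    print(x, sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2))
```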

So my questions are:

  1. Has anyone figured out the best weights for an XOR neural network with this configuration (i.e. 2 x 2 x 1 with bias)?

  2. Why does my initial choice of random weights make such a big difference to the end result? I was lucky in the example above, but depending on the initial random weights I sometimes end up, after training, with errors as large as 50%, which is very bad.

  3. Am I doing anything wrong or making any wrong assumptions?

Below is an example of initial weights that I cannot train, for some reason I don't understand. I suspect I might be doing the backpropagation training incorrectly: I'm not using batches, and I update my weights after every single example from the training set.

Weights: ((-9.2782, -.4981, -9.4674, 4.4052, 2.8539, 3.395), (1.2108, -7.934, -2.7631))

[diagram: the network with the untrainable weights above]
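Anyone wanting to probe these weights can evaluate the network with them directly. The sketch below assumes a layout for the two tuples (first tuple = hidden weights row by row, then hidden biases; second tuple = output weights, then output bias); the post's real ordering is only visible in its diagram, so this mapping is a guess.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# ASSUMED layout of the listed weights -- the actual assignment of these
# numbers to connections is only shown in the post's diagram.
W1 = np.array([[-9.2782, -0.4981],
               [-9.4674,  4.4052]])
b1 = np.array([2.8539, 3.395])
W2 = np.array([[1.2108, -7.934]])
b2 = np.array([-2.7631])

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = sigmoid(W1 @ np.array(x, dtype=float) + b1)
    print(x, sigmoid(W2 @ h + b2))
```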


Posted 2018-04-25T12:31:48.297


Welcome to ai.se... Have you tried to vary the learning rate? – DuttaA – 2018-04-25T13:19:00.797

During backpropagation training, I'm adjusting my weights for every training point. Perhaps I should be using batching or some other kind of average? – rdalmeida – 2018-04-25T16:15:23.310

Try a learning rate of 0.1 – DuttaA – 2018-04-25T16:17:25.213

I did. Same problem :( – rdalmeida – 2018-04-25T16:48:20.077



Well, weight initialization has a big impact on the results. I'm not sure about the XOR gate specifically, but the error surface can have suboptimal local minima that the network gets "stuck" in during training. Stochastic gradient descent adds some randomness that can help the error escape these pits. Also, for the sigmoid function, weights should be initialized so that the input to the activation lands near the region with the highest derivative, which makes training faster.
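To illustrate the last point: the sigmoid's derivative is s(x)·(1 − s(x)), which peaks at x = 0 and vanishes for large |x|; that is why initial weights are usually drawn small, so pre-activations land where there is still gradient. A quick numerical check (Python sketch, not from the original post):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    # derivative of the sigmoid: s(x) * (1 - s(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Largest gradient at 0, almost none far from it:
print(dsigmoid(0.0))   # 0.25, the maximum
print(dsigmoid(10.0))  # ~4.5e-05: training barely moves out here
```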

dan dan




It looks like you are right: https://datascience.stackexchange.com/a/21792

– rdalmeida – 2018-04-26T13:00:27.423


It's good to read some literature on neural networks, like this one, which explains everything: http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf

– dan dan – 2018-04-26T20:03:26.793


Two perceptrons without bias, plus one in the output layer to get the result as a single number.

[diagram: a two-layer network in MATLAB's block notation]




Welcome to AI! Love the graphic, but feel free to elaborate a little, and share your insights. – DukeZhou – 2018-04-25T18:23:08.257

@DukeZhou This is a MATLAB representation of a 2-layer NN. It starts with the inputs (green box), a 2x1 vector; these inputs are passed to 2 artificial neurons through weights "w", summed up "+", and passed through the activation function "bipolar sigmoid". Layer 2 is the output layer, doing the same, but its output is just linear (in other words, there is no specific activation function, because for a linear function input = output). – new_stacker – 2018-04-26T03:56:16.313

I recommend learning the MATLAB toolkit; it is simple, provides all information (it can print all the weights), and you can supervise the learning process. Or just code it on your own: a 2-layer NN is simple, and you can check all the values yourself. DON'T use biases for this simple case; that just adds more hyperparameters. – new_stacker – 2018-04-26T03:58:11.430


I'd bet you're doing something wrong, though I can't tell what it is. Try changing the learning rate dynamically, try training in varying order, ...

On second thought, it looks like you're using the standard sigmoid function. In that case you're doing it fundamentally wrong: the output can only be exactly 1 if the input is infinite, or so big that the floating-point arithmetic rounds it to 1.

That's very wrong for two reasons:

  • You're forcing the network into a broken state, with huge weights and tiny derivatives. That amounts to imposing numerical instability on an otherwise sane algorithm. Just don't do it; map your booleans better (see below).
  • You're doing something you don't need. Any value close enough to the wanted result (0 or 1) can simply be evaluated as correct. When you get 0.9 instead of 1, you can simply stop and call it perfect. Remember, all you want is a boolean.

A better mapping would be false=0.1 and true=0.9. This avoids the need for infinite weights and the problems that come with them.
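The difference is easy to check numerically: a target of exactly 1 needs an infinite pre-activation, while a target of 0.9 is reached at a small finite one, namely the logit of 0.9:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# sigmoid(z) = 0.9 is solved by the logit (inverse sigmoid) of 0.9:
z = math.log(0.9 / 0.1)
print(z)           # ~2.197 -- a perfectly ordinary pre-activation
print(sigmoid(z))  # ~0.9
```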

Even better may be using a symmetrical activation function (e.g., tanh) and a symmetrical mapping like false=-0.9 and true=0.9.

Also consider using ReLU.


