What does fast loss convergence indicate for a CNN?


I'm training two CNNs (AlexNet and GoogLeNet) with two different DL libraries (Caffe and TensorFlow). The networks were implemented by the dev teams of each library (here and here).

I reduced the original ImageNet dataset to 1024 images of a single category -- but kept the networks configured to classify 1000 categories.

So I trained the CNNs, varying the processing unit (CPU/GPU) and the batch sizes, and I observed that the loss converges quickly to near zero (in most cases before 1 epoch is completed), as in this graph (AlexNet on TensorFlow):

In Portuguese, 'Épocas' means epochs and 'Perda' means loss.

The weight decay and initial learning rate are the same as in the models I downloaded; I only changed the dataset and the batch sizes.

Why are my networks converging this way, and not like this?


Posted 2017-12-05T07:34:53.323

Reputation: 159

What do you mean, you set 1000 categories? – BlueMoon93 – 2017-12-05T10:31:06.900

I specified 1000 different output categories to the network/library, but in practice I only ever feed it 1. I don't know exactly how TensorFlow/Caffe work internally, but I think this value (1000) is used in the softmax layer – 648trindade – 2017-12-05T17:47:26.513



OK, so the fact that you always give the same category means the network only has to output the same value for any input. This is exceedingly easy to learn (all the weights always shift in the same directions), and that is what produces this graph.
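You can see the "all the weights shift in the same directions" effect directly in the gradient of softmax cross-entropy when every example carries the same label. This is a minimal NumPy sketch (a bare linear softmax layer, not the actual AlexNet/GoogLeNet code; all sizes and initializations here are made up for illustration): the class-0 bias gradient is negative for any batch, so every single update pushes the true class up and the loss down.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim = 1000, 64          # hypothetical sizes for illustration
W = rng.normal(scale=0.01, size=(dim, num_classes))
b = np.zeros(num_classes)

def bias_grad(xb):
    # Softmax cross-entropy gradient w.r.t. the biases,
    # with every example labelled class 0 (the single-category dataset).
    logits = xb @ W + b
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    d = probs
    d[:, 0] -= 1.0                   # true label is always class 0
    return d.mean(axis=0)

# Two unrelated random batches still produce gradients that agree in sign:
g1 = bias_grad(rng.normal(size=(32, dim)))
g2 = bias_grad(rng.normal(size=(32, dim)))
# g1[0] and g2[0] are both negative (class 0 always pushed up),
# every other component is positive (all other classes pushed down).
print(g1[0], g2[0])
```

Because there is no disagreement between batches, there is no noise for SGD to average out, and the loss falls almost monotonically, which matches the graph in the question.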

Once you have more categories (try 10, for example), you'll see that the graph gets closer to what you'd expect.
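The contrast is easy to reproduce with a toy linear softmax classifier (again a hypothetical NumPy sketch, not the real networks): train it for the same number of steps on (a) data where every example is class 0 and (b) data with 10 random labels. The single-label loss collapses toward zero, while the 10-label loss stays far higher after the same amount of training.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 64, 256                     # made-up sizes for illustration
X = rng.normal(size=(n, dim))        # random "features"

def final_loss(y, num_classes=1000, steps=50, lr=0.5):
    # Full-batch gradient descent on softmax cross-entropy.
    W = np.zeros((dim, num_classes))
    b = np.zeros(num_classes)
    for _ in range(steps + 1):       # last pass only measures the loss
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        loss = -np.log(probs[np.arange(n), y]).mean()
        d = probs
        d[np.arange(n), y] -= 1
        d /= n
        W -= lr * (X.T @ d)
        b -= lr * d.sum(axis=0)
    return loss

one = final_loss(np.zeros(n, dtype=int))        # everything is class 0
ten = final_loss(rng.integers(0, 10, size=n))   # 10 random labels
# 'one' collapses to near zero; 'ten' stays much higher, because with
# several categories the per-example gradients no longer all agree.
print(one, ten)
```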


Posted 2017-12-05T07:34:53.323

Reputation: 803

That makes a lot of sense. But why does the loss converge faster when my batch is very small? With sizes below 32, the networks converge within just 10% of an epoch (approx. 100 images) – 648trindade – 2017-12-05T18:10:29.560

You have less variance when you have a smaller batch size, so the network ignores the noise even faster – BlueMoon93 – 2017-12-05T18:23:13.757