## What could an oscillating training loss curve represent?


I tried to create a simple model that receives an $$80 \times 130$$ pixel image. I have only 35 training images and 10 test images. I trained this model for a binary classification task. The architecture of the model is described below.

Layer (type)                 Output Shape              Param #
=================================================================
conv2d_1 (Conv2D)            (None, 80, 130, 64)       640
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 78, 128, 64)       36928
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 39, 64, 64)        0
_________________________________________________________________
dropout_1 (Dropout)          (None, 39, 64, 64)        0
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 39, 64, 128)       73856
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 37, 62, 128)       147584
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 18, 31, 128)       0
_________________________________________________________________
dropout_2 (Dropout)          (None, 18, 31, 128)       0
_________________________________________________________________
flatten_1 (Flatten)          (None, 71424)             0
_________________________________________________________________
dense_1 (Dense)              (None, 512)               36569600
_________________________________________________________________
dropout_3 (Dropout)          (None, 512)               0
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 513


What could the oscillating training loss curve above represent? Why is the validation loss constant?

What loss function are you using? And this is ABSURDLY large for the amount of data you have. Your model is going to have real difficulty. – mshlis – 2019-08-21T13:39:13.577


Try lowering the learning rate.

Such a loss curve can indicate that the learning rate is too high. With a high learning rate, the algorithm takes large steps in the direction of the gradient and overshoots the local minimum. It then tries to come back to the minimum in the next step and overshoots it again.

You may also try switching to a momentum-based GD algorithm. Such a training loss curve can be indicative of a loss contour like in this example, for which momentum-based GD methods are helpful.
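The overshooting and momentum effects can be seen on a toy problem. The sketch below (plain numpy, not the poster's model) runs gradient descent on an elongated quadratic bowl $$f(x, y) = \frac{1}{2}(a x^2 + b y^2)$$ with $$a \gg b$$ — the kind of loss contour for which momentum helps; all step sizes and coefficients here are illustrative choices, not recommendations:

```python
import numpy as np

# Gradient descent on f(x, y) = 0.5 * (a*x**2 + b*y**2) with a >> b.
# With too large a step, the iterates overshoot along the steep direction
# on every update and the loss blows up; with a smaller step (or momentum)
# they settle into the minimum.

def run(lr, momentum=0.0, steps=200, a=50.0, b=1.0):
    w = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        grad = np.array([a * w[0], b * w[1]])
        v = momentum * v - lr * grad   # momentum=0.0 gives plain gradient descent
        w = w + v
    return 0.5 * (a * w[0] ** 2 + b * w[1] ** 2)   # final loss value

diverging = run(lr=0.05)                    # lr too high: overshoots, loss explodes
converged = run(lr=0.01)                    # smaller lr: converges
with_momentum = run(lr=0.01, momentum=0.9)  # momentum damps the zig-zagging
```

The same comparison can be done in Keras by switching the optimizer, e.g. `SGD(learning_rate=..., momentum=0.9)`.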

I noticed that you have a very small training set. You may have better luck with a larger training data (~1000 examples) or using a pre-trained Conv network as a starting point.

Lowering the learning rate did not help (0.1, 0.01, 0.005, or even 0.001). – Krishnakumar – 2019-08-27T05:18:53.303


# Overview

As has already been observed, your main problem, besides training-related issues like tuning the learning rate, is that you basically have no chance of learning such a big model from scratch with such a small dataset.

So, focusing on the real problem, here are some techniques you could use:

• dataset augmentation
• transfer learning
  • from a pre-trained model
  • from the encoder stage of an autoencoder (a last-resort option before getting into more advanced topics)

# Dataset Augmentation

Let's assume that

• $$I$$ is an input image

• $$l$$ is its associated label

• $$f(I;\theta) \rightarrow I_{\theta}$$ is a parametric transformation that affects appearance but not semantics, for example a rotation by an angle $$\theta$$

then you can augment your dataset by generating $$\{I_{\theta}, l\}$$, a set of transformed (e.g. rotated) images associated with the same label $$l$$.
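A minimal numpy sketch of this idea (a real Keras pipeline would typically use `ImageDataGenerator` instead): each transform changes appearance but not semantics, so every generated image $$I_{\theta}$$ keeps the original label $$l$$.

```python
import numpy as np

# Label-preserving transforms: flips and 90-degree rotations.
# Each (transformed_image, label) pair reuses the original label l.

def augment(image, label):
    """Return (transformed_image, label) pairs generated from one example."""
    pairs = [(image, label)]
    pairs.append((np.fliplr(image), label))   # horizontal flip
    pairs.append((np.flipud(image), label))   # vertical flip
    for k in (1, 2, 3):                       # 90/180/270 degree rotations
        pairs.append((np.rot90(image, k), label))
    return pairs

img = np.arange(12.0).reshape(3, 4)   # tiny stand-in for an 80x130 grayscale image
augmented = augment(img, label=1)     # 6 labelled samples from 1 original
```

Whether a given transform really preserves semantics depends on your data (e.g. a vertical flip is wrong for traffic signs), so pick the set of transforms per task.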

# Transfer Learning

The fundamental idea of transfer learning is to re-use a NN which has been trained to solve one task in order to solve other tasks, retraining only a selected subset of the weights.

It means using a pre-trained convolutive backend (the part of the model with Conv2D and pooling layers) and training only the dense layers with dropout (but you should still probably think about reducing the dimensionality there).

• $$f_{C}(I; \theta_{C})$$ : convolutive processing of the input image

  • it is the part of the CNN composed of Conv2D and MaxPooling2D layers
  • $$\theta_{C}$$ is the set of convolutive learnable weights

• $$b = f_{C}(I; \theta_{C})$$ : bottleneck feature representation

  • it is the result of the Flatten layer

• $$f_{D}(b; \theta_{D})$$ : dense processing

  • it is the part of the model composed of Dense layers
  • $$\theta_{D}$$ is the set of dense learnable weights

The idea is to pick $$\theta_{C}$$ from a training performed on another, bigger dataset and keep it fixed while training on your task. This reduces the number of parameters to be trained. However, beware that the dense layers account for most of the weights, as you can also see from your model summary, which means you should also focus on reducing that number, for example by reducing the size of the bottleneck feature tensor.
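A numpy sketch of this split, with a fixed random projection standing in for the pre-trained backend $$f_{C}$$ ($$\theta_{C}$$ is frozen, never updated) and only $$\theta_{D}$$, a single dense sigmoid unit, trained on a small synthetic dataset:

```python
import numpy as np

# theta_C: frozen "backend" weights; theta_D: the only trainable weights.
# All data below is synthetic, for illustration only.

rng = np.random.default_rng(0)
X = rng.normal(size=(35, 64))              # 35 "images", already flattened
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # synthetic binary labels

theta_C = rng.normal(size=(64, 16))        # frozen: never updated below
b = np.tanh(X @ theta_C)                   # bottleneck features b = f_C(I; theta_C)

theta_D = np.zeros(16)                     # trainable dense weights
bias = 0.0
lr = 0.5
for _ in range(500):                       # gradient descent on binary cross-entropy
    p = 1.0 / (1.0 + np.exp(-(b @ theta_D + bias)))
    theta_D -= lr * (b.T @ (p - y)) / len(y)
    bias -= lr * np.mean(p - y)

p = 1.0 / (1.0 + np.exp(-(b @ theta_D + bias)))
train_acc = np.mean((p > 0.5) == (y > 0.5))   # accuracy of the trained head
```

In Keras the same effect is obtained by setting `layer.trainable = False` on the pre-trained convolutive layers before compiling.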

## Transfer Learning from Pre-Trained Model

For example, if your actual goal were to perform binary classification on some kind of MNIST-like data, then you could use a convolutive backend from a CNN which has been pre-trained on the MNIST 0..9 classification task, or you could train it yourself. What is important is that the $$\theta_{C}$$ weights are learned from the MNIST dataset, which is much bigger than yours, even if the task is (slightly) different.

Furthermore, in the case of MNIST-like data, please consider whether you really need your full 80 x 130 resolution: your input tensor, considering I can deduce from your model summary that it is grayscale (no color), needs to be $$(80,130,1)$$, but you could rescale to the 28 x 28 MNIST resolution and work with a smaller $$(28,28,1)$$ tensor.
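As a sketch, a naive nearest-neighbour rescaling of a grayscale $$(80,130,1)$$ image to the $$(28,28,1)$$ MNIST shape can be done with plain numpy indexing (a real pipeline would rather use a proper resize routine such as PIL's `Image.resize`):

```python
import numpy as np

# Pick one source row/column index per target row/column (nearest-neighbour).
img = np.random.default_rng(0).random((80, 130, 1))   # stand-in input image
rows = np.arange(28) * 80 // 28    # source row index for each of the 28 target rows
cols = np.arange(28) * 130 // 28   # source column index for each target column
small = img[rows][:, cols]         # shape (28, 28, 1)
```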

My suggestion is to start from an architecture like this MNIST Keras Model, as

• it has a bottleneck representation of size 64, which could be enough for your task, and
• I also suggest removing the first dense layer, so as to significantly reduce $$\theta_{D}$$, the number of learnable parameters, hence going for something like

    model = Sequential()
    # ... pre-trained convolutive backend (Conv2D + MaxPooling2D + Dropout) here ...
    model.add(Flatten())
    # output layer
    model.add(Dense(1, activation='sigmoid'))
Then compile the model with the binary_crossentropy loss, and maybe start by giving the adam optimizer a try.
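For reference, this is what the binary_crossentropy loss computes, sketched in numpy: for a true label $$y \in \{0, 1\}$$ and a predicted probability $$p$$ from the final sigmoid unit, the per-sample loss is $$-(y \log p + (1-y) \log(1-p))$$, averaged over the batch.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over a batch of labels and probabilities."""
    p = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
good = binary_crossentropy(y_true, np.array([0.9, 0.1, 0.8, 0.2]))  # ~0.164
bad = binary_crossentropy(y_true, np.array([0.1, 0.9, 0.2, 0.8]))   # much larger
```

Confident wrong predictions are penalized heavily, which is why this loss pairs naturally with a sigmoid output for binary classification.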

## Transfer Learning from Autoencoder

If your data is so special that you can't find any big enough and similar enough dataset to use this strategy, and you can't come up with any transformation to use for dataset augmentation, then, without getting into advanced things, you could try to play one last card: use an autoencoder to learn a compressed representation aimed at reconstructing the original image, and perform transfer learning with the encoder only.

For example, again under the assumption of working with a $$(28,28,1)$$ tensor, you could start with an architecture like the following one

    def build_ae(input_img):
        x = Conv2D(16, (3, 3), activation='relu', padding='same')(input_img)
        # (28,28,16)

        encoded = MaxPooling2D((7, 7), padding='same')(x)
        # (4,4,16) latent representation

        x = Conv2D(8, (3, 3), activation='relu', padding='same')(encoded)
        # (4,4,8)

        x = UpSampling2D((4, 4))(x)
        # (16,16,8)

        x = Conv2D(16, (3, 3), activation='relu')(x)
        # Note: convolving without padding='same' in order to get a w-2 and h-2
        # dimensionality reduction, so that the following upsampling can lead to
        # the desired 28x28 spatial resolution
        # (14,14,16)

        x = UpSampling2D((2, 2))(x)
        # (28,28,16)

        decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)
        autoencoder = Model(input_img, decoded)
        return autoencoder



In this case, the full model has 2633 weights, but the encoding stage consists only of Conv2D+ReLU+MaxPooling, which means in total 3x3x1x16 kernel weights plus 16 biases for the convolutive step (ReLU and MaxPooling add no weights), i.e. 160 weights only. The latent representation is a $$(4,4,16)$$ tensor, which flattens to a 256-dimensional vector; hence, assuming as before that the binary classification is performed with a dense sigmoid layer, it would mean 256+1 weights to learn in the actual binary classification task.
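These counts can be checked directly, assuming 3x3 kernels and the layer sequence of the autoencoder code (one 16-filter encoder conv on a single-channel input; 8-, 16- and 1-filter decoder convs):

```python
# Conv2D parameter count: k*k*in_ch*out_ch kernel weights + out_ch biases.

def conv2d_params(in_ch, out_ch, k=3):
    """Number of learnable parameters of one Conv2D layer."""
    return k * k * in_ch * out_ch + out_ch

encoder = conv2d_params(1, 16)        # 160: the only weights in the encoding stage
decoder = (conv2d_params(16, 8)       # 1160
           + conv2d_params(8, 16)     # 1168
           + conv2d_params(16, 1))    # 145
total = encoder + decoder             # 2633 weights in the full autoencoder
```

Pooling, upsampling, and ReLU layers contribute no parameters, which is why the frozen encoder is so cheap.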

Of course it is possible to go for an even more compressed latent representation, in both the spatial and the channel domains, with a consequently smaller flattened vector and ultimately even fewer weights to learn.

If you share more details about your problem, and also your dataset, we could try to help more.


Nicola Bernini's answer is quite comprehensive. Here are my insights.

First of all, think about whether you really need neural networks to solve your problem. Consider whether traditional computer vision operations like edge detection or region-based methods can solve it (OpenCV can help you here). Think about your data again. In case you decide to use neural networks, here are some things to try out:

1. Your dataset size is too small. Recall that we are learning to approximate functions (universal approximation theorem). Less data + more parameters means a high chance of overfitting. Use transfer learning. (Try resizing the image, performing a random resized square crop, and using that as input to your neural network. This may or may not work, since I don't know what exactly you are doing.) Also try data augmentations that make sense (e.g. a vertical flip of traffic sign images doesn't make sense).

2. Try reducing the learning rate, or use different learning rates for different parts of your network if you decide to use transfer learning.

3. Check whether your train and test dataset distributions are the same, i.e. don't train with 95% of label 0 and then test with a set that has 95% label 1. I do not know whether your dataset is highly class-imbalanced or whether you are doing some kind of anomaly detection.
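A quick way to check this point is to compare the class fractions of the two splits (the labels below are made up for illustration):

```python
import numpy as np

# Compare label distributions of the train and test splits.
train_labels = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
test_labels = np.array([1, 1, 1, 0, 1])

def label_fractions(labels):
    """Fraction of each class among the labels."""
    counts = np.bincount(labels, minlength=2)
    return counts / counts.sum()

train_frac = label_fractions(train_labels)   # [0.7, 0.3]
test_frac = label_fractions(test_labels)     # [0.2, 0.8]
skew = np.abs(train_frac - test_frac).max()  # large gap -> mismatched distributions
```

If the gap is large, use a stratified split so both sets keep the same class proportions.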