Is there any proven disadvantage of transfer learning for CNNs?



Suppose I know that I want to use a ResNet-101 architecture for my specific problem. There are ResNet-101 models trained on ImageNet.

Is there any disadvantage of using those pre-trained models and just resetting the last (few) layers to match the specific number of classes or should that be the default option?

Please don't simply post your gut feeling... I have a gut feeling as well, but I want to know.

Martin Thoma

Posted 2018-04-26T09:16:51.497

Reputation: 15 590



Based on my experience, not just with ImageNet, if you have enough data it's better to train your network from scratch. There are numerous reasons for this, which I can explain.

  • First of all, I don't know whether you have had this experience or not, but I've trained complicated CNNs with over 25 million parameters. After reaching 95% accuracy and converging, I increased the learning rate a bit to look for another, possibly better, local minimum. I have not found an answer to this to date, but whenever I did so my accuracy dropped significantly and never recovered, even after thousands of additional epochs.

  • The other problem is that whenever you use transfer learning, your training data has to satisfy two conditions. First, the distribution of the data your pre-trained model was trained on should be like the data you are going to face at test time, or at least not vary from it too much. Second, the amount of training data should be large enough that you do not overfit the model. If you have only a few labeled training examples, and a typical pre-trained model like ResNet has millions of parameters, you have to guard against overfitting by choosing appropriate evaluation metrics and good test data that represents the distribution of the real population.

  • The next problem is that you cannot remove layers with confidence to reduce the number of parameters. As you know, the number of layers is a hyper-parameter for which there is no consensus on how it should be chosen. If you remove the first convolutional layers then, again based on experience, you won't get good learning, because by the nature of the architecture those layers find low-level features. Furthermore, if you remove the first layers you will have problems with your dense layers, because the number of trainable parameters changes. Densely connected layers and the deep convolutional layers are better candidates for reduction, but it may take time to find how many layers and neurons should be removed so that the model does not overfit.

If you don't have enough data and there is already a pre-trained model, you can do something that can help you. I divide my answer into two parts for this situation:

The pre-trained model does not have common labels for most of its classes

If this is the case, you can discard all the fully connected layers and replace them with new ones. Basically, what the fully connected layers do is classify the features that the network has found, to reduce the error rate. About the convolutional layers, you have to consider two major points:

  1. Pooling layers try to summarize the information in a local neighbourhood and build a higher-level representation of their inputs. Suppose that the inputs of a pooling layer contain a nose, eyes, eyebrows and so on; the pooling layer in effect checks whether these exist in a neighbourhood. Consequently, convolutional layers after pooling layers often keep information that may be irrelevant to your task. There is a downside to this interpretation: the information may be distributed among different activation maps, as Jason Yosinski et al. investigated in Understanding neural networks through deep visualization, where you can read: "One of the most interesting conclusions so far has been that representations on some layers seem to be surprisingly local. Instead of finding distributed representations on all layers, we see, for example, detectors for text, flowers, fruit, and faces on conv4 and conv5. These conclusions can be drawn either from the live visualization or the optimized images (or, best, by using both in concert) and suggest several directions for future research" and "These visualizations suggest that further study into the exact nature of learned representations—whether they are local to a single channel or distributed across several ...". A partial solution is to keep the first layers, which find low-level features that are usually shared across different data distributions, and remove the deeper convolutional layers, which find higher-level abstractions.

  2. As stated, the main problem with convolutional layers is that the information they find may be distributed among different activation maps. Consequently, you cannot be sure whether removing a layer will improve performance or not.

The pre-trained model has some common classes with your labels

If this is the case, you can visualize the activations using the techniques described here. It has been shown that although human face is not a label in ImageNet, some internal activation maps are activated by faces. Similar observations hold for other labels: for instance, networks trained to decide whether a scene contains a car are usually sensitive to roads and trees. The image below shows which parts of the outputs are activated by which parts of the images. This can help you when you don't have enough data and you have to use transfer learning.

[Image: activation maps highlighting which image regions activate particular internal units]

Based on the answer here: "The standard classification setting is an input distribution $p(X)$ and a label distribution $p(Y|X)$. Domain adaptation: when $p(X)$ changes between training and test. Transfer learning: when $p(Y|X)$ changes between training and test. In other words, in DA the input distribution changes but the labels remain the same; in TL, the input distributions stay the same, but the labels change." Consequently, domain adaptation problems can also be considered for the solutions mentioned above.


Posted 2018-04-26T09:16:51.497

Reputation: 12 077

"Based on my experience" on which datasets / architectures? – Martin Thoma – 2018-04-26T12:24:16.913

Actually, for numerous architectures and datasets: ResNet50 on Othello, Inception on Go, AlexNet for ImageNet, AlexNet for NotMNIST, and others. The former had more than 50 million training examples, if I remember. Even my hand-made architectures for eastern OCR. – Media – 2018-04-26T12:34:56.483

And on which data set and with which architecture did you make the experience that training on a pretrained model is worse than training on random weights? – Martin Thoma – 2018-04-26T19:41:20.777

Almost for all architectures, whenever we increased the learning rate we observed the mentioned behavior. "experience that training on a pretrained model is worse than training on random weights?" I really did not mean that. If your distribution changes, yes. That's the solution. And if you don't have enough data, again yes :) But about your question: for landmark detection, we had 1000 images and there were pretrained models; after using the pretrained model, our results were not bad but it couldn't generalize well. We made an architecture from scratch with fewer than 5 million parameters. – Media – 2018-04-26T21:10:42.450

I also experienced another problem: I used the NotMNIST data set on a model pretrained on MNIST. The pre-trained model didn't learn very fast and the accuracy was not very good; I don't remember the exact number, maybe less than 91%. Our model had about 200 thousand parameters. I made a new architecture and employed ST for two consecutive layers; the number of parameters was less than 50 thousand, I guess. We got over 98% accuracy. – Media – 2018-04-26T21:16:08.340

After all, transfer learning is a good technique, but you should be careful about overfitting and data distribution, although I've read your thesis and I know you are aware of them. – Media – 2018-04-26T21:17:15.583

Also consider this point: it highly depends on the environment and the hardware requirements of your agent; transfer learning on architectures like ResNet has a high cost. If you have two class labels which are already in ImageNet, it is not wise to use all the convolutional layers. – Media – 2018-04-26T21:21:46.203

Thank you for your input. However, if you change multiple things besides the weight initialization, you can't compare the results. So I will not accept your answer, as it doesn't answer my question. – Martin Thoma – 2018-04-27T04:51:03.460

@MartinThoma whether the distribution changes and how much training data you have in hand should determine whether you use transfer learning or not, as I mentioned. After that, I described the solutions for cases where you can employ transfer learning. Finally, if you want an intuition: suppose that the initialization somehow centralizes the weights; they may go in any direction after learning. If you use transfer learning, they are already biased toward a particular corner. If you then train on your data, and you have the same distribution, training will be fast; but if you don't, it will not be so helpful. – Media – 2018-04-27T05:23:07.640

About accepting or not, it is your choice and there's no need for an explanation :) We are all friends here and share our opinions. – Media – 2018-04-27T05:23:52.190

Well, in this case I'm more interested in facts than opinions ;-) – Martin Thoma – 2018-04-27T06:16:42.430

No one really knows why CNNs work... it's just that they do... because, considering the parameters involved, it's beyond human reach... – Aditya – 2018-05-14T13:57:01.027

@Aditya let me not share your opinion. Although the features are shared among different feature maps, we know that they find fine and coarse features. The cited paper has discussed that in detail. – Media – 2018-05-14T14:42:02.800

Thanks @Media, your answer is very informative... What I meant is that, given the parameters involved, it's too tough for a human to comprehend. That's all. – Aditya – 2018-05-14T14:57:38.333


I would not do that if your data is very different from the data in ImageNet. This is not typically the case, since ImageNet has lots of images representing many different things. However, let's say that your data comes from pictures taken by telescopes. In that case, even the most basic features (the very first layers) of a model trained on ImageNet will not be useful to your model, and it might be more convenient to train the model from scratch using random initialization. One thing you can do, and that might work better, is to use the ImageNet model as your initial parameters and update all the parameters of the network.

However, in most cases your images look like a subset of ImageNet, which makes transfer learning a very powerful technique.

David Masip

Posted 2018-04-26T09:16:51.497

Reputation: 5 101

Is there any report on the difference in performance of a pretrained model which is fine-tuned on very different data compared to a model which is trained on random initial weights the same time on the same data? – Martin Thoma – 2018-04-26T10:58:39.053

Maybe this one, but it doesn't say anything about data:

– David Masip – 2018-04-26T11:10:59.877

No, he uses different models (see the model size at the end). – Martin Thoma – 2018-04-26T11:20:47.037


I found this question a year or so after it was asked, and I still don't think there are any studies that have comprehensively tested this.

When most people discuss transfer learning, what they think of is:

"Will ImageNet weights help me get better accuracy (or at least speed up training) when using [insert popular Google/Facebook/Microsoft architecture] on my data?"

In reality, if I have medical images of arm bones and I've previously trained a model on leg bones, that would probably be more applicable transfer learning than using ImageNet weights... so stating that "de novo is better" or "transfer learning is better" is kind of meaningless, because you may simply not have access to weights more relevant than ImageNet's.

The early layers always contain 'simple' features, and these are probably always somewhat transferable to any other visual problem, provided the scale of the objects of interest in the images is not drastically different from the original data. But to answer the question directly: no, there is no "proven" disadvantage to transferring weights from a previously trained network.

Tristan HB

Posted 2018-04-26T09:16:51.497

Reputation: 11