Are there any better visual models for transfer rather than ImageNet?

3

1

Similar to the recent pushes in Pretrained Language Models (BERT, GPT-2, XLNet), I was wondering if such a thrust exists in Computer Vision?

From my understanding, it seems the community has converged on ImageNet-trained classifiers as the "Pretrained Visual Model". But relative to the data we have access to, shouldn't there exist something stronger? Also, classification as a sole task has its own constrictions on domain transfer (based on assumptions about what these loss manifolds look like).

Are there any better visual models for transfer rather than ImageNet successes? If no, why? Is it because of the domain's fluidity in shape, resolution, etc., in comparison to text?

mshlis

Posted 2019-06-25T18:05:47.027

Reputation: 1 845

Answers

3

Why is ImageNet so popular for transfer learning?

Models pre-trained on the ImageNet dataset have been the de-facto choice for many years now. The reasons most commonly given for why ImageNet is so effective for transfer learning are the following:

  • ImageNet is a truly large-scale dataset that contains over 1 million images, each of which has a decent resolution.
  • ImageNet has a wide and diverse set of classes (1000 in total) ranging from animals to humans, cars, trees, etc. Most Computer Vision tasks operate in similar domains, though there are some notable exceptions (e.g. medical images).
    For example, an object detection model for autonomous driving would benefit from ImageNet transfer learning, as the pre-trained model has seen images with similar content (e.g. roads, people, cars, street signs), even though it tries to solve a different task (i.e. object detection, not classification).
  • The above two reasons allow models trained on ImageNet to identify and extract very generic features, especially in their initial layers, that can be effectively re-used.
  • ImageNet has a lot of similar classes. This is an interesting argument because it contradicts the second one. Due to the closeness of some classes (e.g. multiple breeds of cats), networks learn to extract more fine-grained features.

Another overlooked reason I find very important is that:

  • ImageNet has been the benchmark for image classifier performance for years now. When, for example, you are selecting a pre-trained ResNet to use, you know that the model has already been shown to perform at a high level. Other datasets don't have challenges as notable as the ILSVRC. That challenge is what made VGG and ResNet popular in the first place, so it comes naturally that people would want to use those weights.

In practice, due to the way in which CNNs identify and extract features from images, they can easily be "transferred" from task to task.
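As a rough illustration of the point about generic early-layer features, the sketch below freezes the first part of a pre-trained backbone and leaves the rest trainable. It assumes TensorFlow's Keras API is installed. The function name `build_partially_frozen`, the choice of ResNet50 and the cut-off of 100 layers are illustrative assumptions, not something prescribed in this answer.

```python
import tensorflow as tf


def build_partially_frozen(num_classes, n_frozen=100, weights="imagenet"):
    """Return a classifier whose first `n_frozen` backbone layers are frozen.

    `n_frozen` is an arbitrary illustrative cut-off, not a recommended value.
    Pass weights=None to skip the ImageNet download (random initialization).
    """
    backbone = tf.keras.applications.ResNet50(
        weights=weights,
        include_top=False,          # drop the 1000-class ImageNet head
        input_shape=(224, 224, 3),
        pooling="avg",              # global average pooling: one vector per image
    )

    # Early layers tend to hold generic features (edges, textures): keep them fixed.
    for layer in backbone.layers[:n_frozen]:
        layer.trainable = False
    # Later layers stay trainable so they can adapt to the new task.
    for layer in backbone.layers[n_frozen:]:
        layer.trainable = True

    # New task-specific head on top of the pooled backbone features.
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(backbone.output)
    return tf.keras.Model(inputs=backbone.input, outputs=outputs)
```

Calling `build_partially_frozen(10)` would give a 10-class model that starts from ImageNet weights. How many layers to freeze is a judgment call that depends on how close the new data is to ImageNet and how much of it you have.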

Is it actually better than other datasets?

This question was explored in depth by Huh et al. ("What makes ImageNet good for transfer learning?"), who tried to identify the reasons that made the ImageNet dataset better than other ones for transfer learning.

In short, they found that most of the reasons people thought made ImageNet so good (i.e. the ones I mentioned above) weren't necessarily correct. Furthermore, the amount and diversity of images and classes required to effectively train a CNN has been highly overestimated. So there is no particular reason people should choose this specific dataset.

Now, to answer your questions:

I was wondering if such a thrust exists in Computer Vision?

No, ImageNet is currently established as the de-facto choice, as evidenced by the fact that all 10 keras.applications models offer weights only for ImageNet.

But relative to the data we have access to, shouldn't there exist something stronger?

This is an interesting question, as the consensus is that deep learning models keep getting better with more data. There is, however, evidence that indicates otherwise (i.e. that CNN models don't have as much capacity as we thought). You can read the aforementioned study for more details. In any case, this is still an open research question.

Even if models could get better, though, with more data, it is possible that it still wouldn't matter because ImageNet pre-trained models are strong enough.

classification as a sole task has its own constrictions on domain transfer

There have been numerous cases where models initialized from pre-trained ImageNet weights have done well in settings other than classification (e.g. regression, object detection). I'd argue that initialization from ImageNet is almost always better than random initialization.
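One way to check this on your own data is to build the same network twice, once from ImageNet weights and once from random initialization, and compare them. The sketch below does this for a single-value regression task, again assuming TensorFlow's Keras API. `build_regressor` and its parameters are hypothetical names used for illustration, and ResNet50 is just one of the available backbones.

```python
import tensorflow as tf


def build_regressor(weights="imagenet", freeze_backbone=True):
    """Backbone from keras.applications with a single-value regression head.

    weights="imagenet" starts from pre-trained weights; weights=None gives
    the random-initialization baseline for comparison.
    """
    backbone = tf.keras.applications.ResNet50(
        weights=weights,
        include_top=False,          # remove the ImageNet classification layer
        input_shape=(224, 224, 3),
        pooling="avg",
    )
    backbone.trainable = not freeze_backbone

    # Inputs are expected to be preprocessed with
    # tf.keras.applications.resnet50.preprocess_input before being fed in.
    inputs = tf.keras.Input(shape=(224, 224, 3))
    features = backbone(inputs)
    prediction = tf.keras.layers.Dense(1)(features)  # linear output for regression

    model = tf.keras.Model(inputs, prediction)
    model.compile(optimizer="adam", loss="mse")
    return model
```

Training `build_regressor(weights="imagenet")` and `build_regressor(weights=None, freeze_backbone=False)` on the same split and comparing validation error gives a direct, if informal, test of the claim for a specific problem.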

Are there any better visual models for transfer rather than ImageNet successes? If no, why? Is it because of the domain's fluidity in shape, resolution, etc., in comparison to text?

Partly, yes. I think that, in comparison to text, images have some useful properties that CNNs exploit, which makes what they learn more transferable. This claim, however, is based on intuition; I can't back it up with evidence.

Djib2011

Posted 2019-06-25T18:05:47.027

Reputation: 2 624
