Why is deep learning hyped despite bad VC dimension?

67


Known bounds on the VC dimension of neural networks range from $O(E)$ to $O(E^2)$, with $O(E^2V^2)$ in the worst case, where $E$ is the number of edges and $V$ is the number of nodes. The number of training samples needed to have a strong guarantee of generalization is linear in the VC dimension.

This means that for a network with billions of edges, as in the case of successful deep learning models, the training dataset needs on the order of billions of training samples in the best case, and up to quadrillions in the worst case. The largest training sets currently have about a hundred billion samples. Since there is not enough training data, it is unlikely that deep learning models are generalizing. Instead, they are overfitting the training data. This means the models will not perform well on data that is dissimilar to the training data, which is an undesirable property for machine learning.
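To make the arithmetic concrete, here is a rough back-of-the-envelope sketch (purely illustrative: the edge and node counts are hypothetical, and all constants in the bounds are ignored):

```python
# Back-of-the-envelope sketch (illustrative only): treat the VC dimension as
# growing like E, E^2, or E^2 * V^2, and assume the number of samples needed
# for generalization scales linearly with the VC dimension.
E = 1e9          # hypothetical number of edges (weights) in a large deep network
V = 1e6          # hypothetical number of nodes
dataset = 1e11   # roughly the largest training-set size mentioned above

for name, d in [("O(E)", E), ("O(E^2)", E ** 2), ("O(E^2 V^2)", E ** 2 * V ** 2)]:
    print(f"{name}: ~{d:.0e} samples needed vs {dataset:.0e} available "
          f"(ratio {d / dataset:.0e})")
```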

Given the inability of deep learning to generalize, according to VC dimensional analysis, why are deep learning results so hyped? Merely having a high accuracy on some dataset does not mean much in itself. Is there something special about deep learning architectures that reduces the VC-dimension significantly?

If you do not think the VC-dimension analysis is relevant, please provide evidence or an explanation that deep learning is generalizing and not overfitting. That is, does it have good recall AND precision, or just good recall? 100% recall is trivial to achieve, as is 100% precision. Getting both close to 100% is very difficult.
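For instance, here is a toy sketch (illustrative numbers only) of why either metric alone is cheap:

```python
# Toy illustration: on an imbalanced binary problem, always predicting "positive"
# gives 100% recall with poor precision, while predicting "positive" only on one
# sure case gives 100% precision with poor recall.
y_true = [1] * 10 + [0] * 90          # 10 positives, 90 negatives

def precision_recall(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

always_positive = [1] * 100           # 100% recall, 10% precision
one_sure_positive = [1] + [0] * 99    # 100% precision, 10% recall
print(precision_recall(y_true, always_positive))   # (0.1, 1.0)
print(precision_recall(y_true, one_sure_positive)) # (1.0, 0.1)
```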

As a contrary example, here is evidence that deep learning is overfitting. An overfit model is easy to fool since it has incorporated deterministic/stochastic noise. See the following image for an example of overfitting.

[Image: example of underfitting, fitting, and overfitting]

Also, see lower ranked answers to this question to understand the problems with an overfit model despite good accuracy on test data.

Some have responded that regularization solves the problem of a large VC dimension. See this question for further discussion.

yters

Posted 2017-05-13T12:43:43.370

Reputation: 369

Comments are not for extended discussion; this conversation has been moved to chat.

– D.W. – 2017-05-15T03:37:15.120

I don't think questions about why something is "hyped" are good ones. The answer is "because people". People take interest in things for a plethora of reasons, including marketing. – luk32 – 2017-05-16T13:49:26.787

Answers

66

"If the map and the terrain disagree, trust the terrain."

It's not really understood why deep learning works as well as it does, but certainly old concepts from learning theory such as VC dimensions appear not to be very helpful.

The matter is hotly debated, see e.g.:

Regarding the issue of adversarial examples, the problem was discovered in:

It is further developed in:

There is a lot of follow-on work.

Martin Berger

Posted 2017-05-13T12:43:43.370

Reputation: 6 705

Comments are not for extended discussion; this conversation has been moved to chat.

– D.W. – 2017-05-15T03:36:13.787

When you say "There is a lot of follow-on work", are you referring to the last 2014 paper? The first two papers you mention are fairly recent. Could you update with the papers you're referring to? – VF1 – 2017-06-17T18:55:29.177

61

"Given the inability of Deep Learning to generalize, according to VC dimensional analysis [...]"

No, that's not what VC dimensional analysis says. VC dimensional analysis gives some sufficient conditions under which generalization is guaranteed. But the converse ain't necessarily so. Even if you fail to meet those conditions, the ML method still might generalize.

Put another way: deep learning works better than VC dimensional analysis would lead you to expect (better than VC analysis "predicts"). That's a shortcoming of VC dimensional analysis, not a shortcoming of deep learning. It doesn't imply that deep learning is flawed. Rather, it means that we don't know why deep learning works -- and VC analysis is unable to provide any useful insights.

High VC dimension does not imply that deep learning can be fooled. High VC dimension doesn't guarantee anything at all about whether it can be fooled in practical situations. VC dimension provides a unidirectional, worst-case bound: if you meet these conditions, then good things happen, but if you don't meet these conditions, we don't know what will happen (maybe good things will still happen anyway, if nature behaves better than the worst possible case; VC analysis doesn't promise that good things can't/won't happen).
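For reference, one commonly quoted form of the VC generalization bound (a standard textbook statement, added here for context rather than taken from this answer) makes that one-directional, worst-case character explicit: with probability at least $1-\delta$ over the draw of $m$ samples,

$$R(h) \;\le\; \hat{R}_m(h) + \sqrt{\frac{d\left(\ln\frac{2m}{d}+1\right)+\ln\frac{4}{\delta}}{m}} \quad \text{for all } h \in \mathcal{H},$$

where $d$ is the VC dimension, $R$ the true risk and $\hat{R}_m$ the empirical risk. When $d/m$ is large, the right-hand side exceeds 1 and the bound is vacuous; nothing in it forces the true risk to actually be large.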

It could be that the VC dimension of the model space is large (it can represent very complex patterns), but nature is explained by simple patterns, and the ML algorithm learns the simple pattern present in nature (e.g., because of regularization) -- in this case, VC dimension would be high but the model would generalize (for the particular pattern that is present in nature).
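A minimal sketch of that situation (my own toy example, not from the answer): a high-capacity model class fit to data generated by a simple pattern, where L2 regularization steers the learner toward the simple pattern.

```python
# Illustrative sketch: degree-15 polynomials (a high-capacity class) fit to data
# generated by a simple pattern (a line) plus noise. Plain least squares tends to
# chase the noise; ridge (L2) regularization typically recovers something close
# to the simple underlying pattern and gets much lower test error.
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 20)
y_train = 2 * x_train + rng.normal(0, 0.2, 20)    # simple pattern + noise
x_test = np.linspace(-1, 1, 200)
y_test = 2 * x_test                               # noise-free ground truth

def design(x, degree=15):
    # polynomial features up to x^degree: a deliberately over-capacious model class
    return np.vander(x, degree + 1)

def fit(x, y, lam):
    # ridge regression via an augmented least-squares problem; lam = 0 is plain least squares
    A = design(x)
    n = A.shape[1]
    A_aug = np.vstack([A, np.sqrt(lam) * np.eye(n)])
    y_aug = np.concatenate([y, np.zeros(n)])
    w, *_ = np.linalg.lstsq(A_aug, y_aug, rcond=None)
    return w

for lam in (0.0, 0.1):
    w = fit(x_train, y_train, lam)
    print(f"lambda={lam}: test MSE = {np.mean((design(x_test) @ w - y_test) ** 2):.4f}")
```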

That said... there is growing evidence that deep learning can be fooled by adversarial examples. But be careful about your chain of reasoning. The conclusions you are drawing don't follow from the premises you started with.

D.W.

Posted 2017-05-13T12:43:43.370

Reputation: 83 008

High VC dimension does imply it's harder to generalize (in some sense, at least when dealing with arbitrary distributions). The $\Omega\left(\sqrt{\frac{d}{n}}\right)$ generalization error lower bound means exactly that for a number of samples small compared to the VC dimension, there exists a distribution such that relative to it any algorithm will experience high generalization error (with high probability). – Ariel – 2017-05-14T06:49:23.190

-1 for "High VC dimension doesn't guarantee anything at all." This is not true: high VC dimension implies sample complexity lower bounds for PAC learning. A good answer should address worst-case vs "real-life" distributions. – Sasho Nikolov – 2017-05-15T04:19:10.903

@SashoNikolov, good point -- thank you! Edited. – D.W. – 2017-05-15T16:56:55.840

23

Industry people have no regard for VC dimension, hooligans...

On a more serious note, although the PAC model is an elegant way to think about learning (in my opinion at least), and is complex enough to give rise to interesting concepts and questions (such as VC dimension and its connection to sample complexity), it has very little to do with real life situations.

Remember that in the PAC model you are required to handle arbitrary distributions; this means that your algorithm should handle adversarial distributions. When trying to learn some phenomena in the real world, no one is giving you "adversarial data" to mess up your results, so requiring a concept class to be PAC learnable might be way too strong. Sometimes you can bound the generalization error independently of the VC dimension, for a specific class of distributions. This is the case for margin bounds, which are formulated independently of the VC dimension. They can promise low generalization error if you can guarantee a high empirical margin (which, of course, cannot happen for all distributions, e.g. take two close points on the plane with opposite tags, and focus the distribution on them).
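For concreteness, here is a small sketch (mine, not the answer's) of the quantity a margin bound depends on, together with the two-close-points counterexample just mentioned:

```python
# Illustrative sketch: the "empirical margin" of a linear classifier is the smallest
# signed distance of a labeled training point (y in {-1,+1}) to the decision boundary.
# Margin-based bounds trade the VC dimension for a quantity like this, so they can be
# meaningful for huge model classes, but only on distributions that allow a large margin.
import numpy as np

def empirical_margin(w, b, X, y):
    return np.min(y * (X @ w + b) / np.linalg.norm(w))

rng = np.random.default_rng(1)

# A "nice" distribution: two well-separated clusters, labeled +1 and -1.
X_good = np.vstack([rng.normal(+3.0, 0.5, (50, 2)), rng.normal(-3.0, 0.5, (50, 2))])
y_good = np.array([+1] * 50 + [-1] * 50)
w_good, b_good = np.array([1.0, 1.0]), 0.0     # a separating direction chosen by hand

# The counterexample from the answer: two nearly identical points, opposite labels.
X_bad = np.array([[0.0, 0.0], [1e-6, 0.0]])
y_bad = np.array([+1, -1])
w_bad, b_bad = np.array([-1.0, 0.0]), 0.5e-6   # even the best separator is this close

print("well-separated data, empirical margin:", empirical_margin(w_good, b_good, X_good, y_good))
print("two close points,    empirical margin:", empirical_margin(w_bad, b_bad, X_bad, y_bad))
```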

So, putting the PAC model and VC dimension aside, I think the hype comes from the fact that they just seem to work, and succeed in tasks that were previously not possible (one of the latest achievements that comes to mind is AlphaGo). I know very little about neural nets, so I hope someone with more experience will pitch in, but to my knowledge there are no good guarantees yet (definitely not like in the PAC model). Perhaps under the right assumptions one could formally justify the success of neural nets (I assume there is work on the formal treatment of neural nets and "deep learning", so I'm hoping people with more knowledge on the subject can link some papers).

Ariel

Posted 2017-05-13T12:43:43.370

Reputation: 9 296

Comments are not for extended discussion; this conversation has been moved to chat.

– D.W. – 2017-05-15T03:37:45.297

15

Given the inability of Deep Learning to generalize,

I don't know where you got that from. Empirically, generalization is measured as the score (e.g. accuracy) on unseen data.

The answer why CNNs are used is simple: CNNs work much better than anything else. See ImageNet 2012 for example:

  • CNNs: 15.315% top-5 error (that was an early result; CNNs are much better now, at about 4% top-5 error)
  • Best non-CNN: 26.172% top-5 error (source; to my knowledge, techniques which do not use CNNs have not gotten below 25% top-5 error)

Create a classifier which is better and people will shift to that.

UPDATE: I will award an answer to anyone providing published evidence that machine learning in general is easily fooled, like this evidence for Deep Learning.

This is not the case. You can create an extremely simple classifier on a simple dataset. It will not be possible to fool it (it doesn't even matter what "easy" means), but it is also not interesting.
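A minimal sketch of such a classifier (purely illustrative):

```python
# An extremely simple classifier on an extremely simple dataset: a single threshold,
# with every example far from it. No small ("adversarial") perturbation can flip a
# prediction, so it is robust -- and also completely uninteresting.
def classify(x, threshold=0.5):
    return 1 if x >= threshold else 0

negatives = [0.0, 0.05, 0.1]   # all well below the threshold
positives = [0.9, 0.95, 1.0]   # all well above the threshold
epsilon = 0.2                  # largest perturbation an "attacker" may apply

robust = (all(classify(x + d) == 0 for x in negatives for d in (-epsilon, epsilon)) and
          all(classify(x + d) == 1 for x in positives for d in (-epsilon, epsilon)))
print("robust to perturbations of size", epsilon, ":", robust)  # True
```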

Martin Thoma

Posted 2017-05-13T12:43:43.370

Reputation: 1 214

A low error does not imply generalization. It is a necessary, but not sufficient, condition. – yters – 2017-05-15T02:05:53.563

@yters Please define generalization then. – Martin Thoma – 2017-05-16T06:39:36.990

Generalization means the model is a good fit for the entire population, not just the sample. If your sample happens to be a good representation, then great, but otherwise you need a model that can generalize well. It is analogous to how humans can extrapolate properties that hold for a very large population from just a few instances. That is generalization. Memorization is like memorizing the answers to a test without understanding why they are the answers. – yters – 2017-05-16T13:57:14.013

@yters, this comment makes me think you haven't read much about Machine Learning. Martin said accuracy on unseen data. You're talking about accuracy on training data. You're basically correct about what generalization is, but please realize that everyone else here understands that too. – Ken Williams – 2017-05-16T15:32:02.517

@KenWilliams, there is a difference between the unseen data in the dataset, and the unseen data in the population in general. Good accuracy on unseen data in the dataset may indicate generalization, but if the dataset is not a good representation of the population and the model overfits, then it will do badly on unseen data in the general population. It is the difference between building a model in a test environment, and how it performs in production. – yters – 2017-05-16T17:49:21.087

@KenWilliams to relate this to my question: it is clear that DL performs well on benchmark datasets. However, it is erroneous to say this means DL generalizes if the benchmark samples are not typical for the population. For example, say the DL model classifies birds well in the benchmark. But, the benchmark photos are all high quality. As long as the general population of photos are all high quality, then the model is fine. But, if the photos are amateur, then the model will not perform well. An SVM with a higher error may still generalize better in this case, due to lower VC dimension. – yters – 2017-05-16T17:55:38.973

@KenWilliams for an example of the problem with a DNN overfitting, see this paper: https://arxiv.org/pdf/1604.04004.pdf. A little noise or blur causes DNN performance to drop significantly. – yters – 2017-05-16T17:56:25.153

@yters I am pretty sure Ken (and many people on this site, including myself) knows this. If your test set, however, does not represent your dataset, you can't make any statement about generalization. While it is worth keeping this in mind, I do not see how this helps you in any way for this question. You just have to assume / make sure that your test set does represent your data at production time. In fact, it is really easy to show that you can make any classifier arbitrarily bad if the training samples do not represent the distribution. – Martin Thoma – 2017-05-16T19:26:50.483

@MartinThoma Perhaps I can put it this way. Assume we are trying to fit the function x = y. The samples we use in development include noise from one distribution, but the production environment has noise from another distribution. If our model overfits, then it may do well on our dataset, but it will include the noise, and perform badly in production when trying to predict x = y with a different noise distribution. A model with low VCD that does not overfit will do better in production. – yters – 2017-05-16T19:31:33.853
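A toy version of this scenario (illustrative code, not part of the original comment): the target is y = x, development data has one kind of noise, production data another; a model that memorizes the development data looks perfect in development but gains nothing in production, while a simple fit behaves consistently.

```python
import numpy as np

rng = np.random.default_rng(2)
x_dev = rng.uniform(0, 1, 50)
y_dev = x_dev + rng.normal(0, 0.05, 50)         # development noise
x_prod = rng.uniform(0, 1, 200)
y_prod = x_prod + rng.uniform(-0.3, 0.3, 200)   # different production noise

def one_nn_predict(x_query, x_ref, y_ref):
    # memorizes the reference set: returns the label of the nearest reference point
    idx = np.argmin(np.abs(x_ref[:, None] - x_query[None, :]), axis=0)
    return y_ref[idx]

a, b = np.polyfit(x_dev, y_dev, 1)              # simple linear fit y ~ a*x + b

models = {
    "1-NN (memorizing)": (one_nn_predict(x_dev, x_dev, y_dev), one_nn_predict(x_prod, x_dev, y_dev)),
    "linear fit":        (a * x_dev + b,                        a * x_prod + b),
}
for name, (dev_pred, prod_pred) in models.items():
    print(name,
          "dev MSE:", round(float(np.mean((dev_pred - y_dev) ** 2)), 4),
          "prod MSE:", round(float(np.mean((prod_pred - y_prod) ** 2)), 4))
```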

That's obvious. You can't expect a model to generalize well if it's trained and validated on the wrong data. You need better data, not a better model. – Emre – 2017-05-18T05:24:04.580

9

The one-word answer is "regularization". The naive VC-dimension formula does not really apply here, because regularization means the weights cannot be arbitrary. Only a tiny (infinitesimal?) proportion of weight combinations have acceptable loss after regularization. The true dimension is many orders of magnitude lower as a result, so generalization can occur with the training sets we have. Real-life results bear out that overfitting is not, in general, happening.
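One way to make the "much lower true dimension" intuition concrete is the ridge-regression analogy, where the effect of L2 regularization on capacity can be computed exactly (this is an illustration of the general idea, not a claim about deep nets specifically):

```python
# For ridge (L2-regularized) linear regression, the effective degrees of freedom is
#   df(lam) = sum_i s_i^2 / (s_i^2 + lam),
# where s_i are the singular values of the design matrix. As the regularization
# strength grows, the effective capacity drops far below the nominal parameter count.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 50))           # 50 nominal parameters
s = np.linalg.svd(X, compute_uv=False)   # singular values of the design matrix

for lam in (0.0, 1.0, 10.0, 100.0, 1000.0):
    df = np.sum(s ** 2 / (s ** 2 + lam))
    print(f"lambda={lam:7.1f}  effective dof ~ {df:5.1f} of 50")
```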

David Khoo

Posted 2017-05-13T12:43:43.370

Reputation: 91

I've seen the repeated claim that real-life results show deep learning generalizes. What exactly are the results that show generalization? All I've seen so far is that DL achieves low error rates on particular datasets, which does not in itself mean that DL generalizes. – yters – 2017-05-16T01:44:34.997

It shows good results ("good" = better than other ML methods) on data that it was not trained on. I'm not sure how else you want to practically measure generalization. – lvilnis – 2017-05-16T13:53:14.017

3

We address the paper Understanding Deep Learning Requires Rethinking Generalization in:

Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior, Charles H. Martin and Michael W. Mahoney

See: https://arxiv.org/pdf/1710.09553.pdf

Basically, we argue that the VC bounds are too loose because the fundamental approach, and the way the statistical limit is taken, is unrealistic.

A better approach lies in Statistical Mechanics, which considers a class of data-dependent functions and takes the thermodynamic limit (not just the limit of large numbers).

Moreover, we also point out how the natural discontinuities in deep nets lead to phase transitions in the learning curve, which we believe are being observed in the Google paper (above).

With regard to the limits, see section 4.2 of our paper

"Clearly, if we fix the sample size m and let [the size of the function class] N → ∞ , [or vise versa, fix N, let m → ∞] the we should not expect a non-trivial result, since [N] is becoming larger but the sample size is fixed. Thus, [in Statistical Mechanics] one typically considers the case that m, N → ∞ such that α = m/N is a fixed constant."

That is, very rarely would we just add more data (m) to a deep net. We always increase the size of the net (N) too, because we know that we can capture more detailed features / information from the data. Instead, we do in practice what we argue for in the paper--take the limit of large size, with the ratio m/N fixed (as opposed to, say, fixing m and letting N increase).
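Schematically (my paraphrase of the two limits, not a quote from the paper):

$$\text{classical limit: } N \text{ fixed},\; m \to \infty \qquad \text{vs.} \qquad \text{thermodynamic limit: } m, N \to \infty \text{ with } \alpha = \tfrac{m}{N} \text{ fixed},$$

so that quantities such as the generalization error are studied as functions of the load $\alpha$ rather than of $m$ alone.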

These results are well known in the Statistical Mechanics of Learning. The analysis is more complicated, but the results lead to a much richer structure that explains many phenomena in deep learning.

Also, and in particular, it is known that many bounds from statistics become either trivial or do not apply to non-smooth probability distributions, or when the variables take on discrete values. With neural networks, non-trivial behavior arises because of discontinuities (in the activation functions), leading to phase transitions (which arise in the thermodynamic limit).

The paper we wrote tries to explain the salient ideas to a computer science audience.

Vapnik himself realized that his theory was not really applicable to neural networks...way back in 1994

"The extension of [the VC dimension] to multilayer networks faces [many] difficulties..the existing learning algorithms can not be viewed as minimizing the empirical risk over the entire set of functions implementable by the network...[because] it is likely...the search will be confined to a subset of [these] functions...The capacity of this set can be much lower than the capacity of the whole set...[and] may change with the number of observations.  This may require a theory that considers the notion of a non-constant capacity with an 'active' subset of functions"
Vapnik, Levin, and LeCun 1994

http://yann.lecun.com/exdb/publis/pdf/vapnik-levin-lecun-94.pdf

While not easy to treat with VC theory, this is not an issue for stat mech... and what they describe looks very much like the energy landscape theory of protein folding (which will be the topic of a future paper).

Charles Martin

Posted 2017-05-13T12:43:43.370

Reputation: 31

This sounds interesting, but I'm not sure I follow your argument. Can you elaborate on the first sentence, i.e., on how the fundamental approach / statistical limit is unrealistic, in a self-contained way that doesn't require understanding statistical mechanics? What assumptions do VC bounds make, and why are they unrealistic? Perhaps you can edit your answer to include that information? – D.W. – 2017-11-26T18:40:46.653

I added a reference to the original work by Vapnik and LeCun (1994) that discusses the issue. – Charles Martin – 2017-11-27T19:47:34.143

And added some clarification. – Charles Martin – 2017-11-27T19:54:04.580