## Why does estimation error increase with $|H|$ and decrease with $m$ in PAC learning?



I came across this statement in Section 5.2 of the book "Understanding Machine Learning: From Theory to Algorithms". If you search for "increases (logarithmically)" in the text, you can find the sentence.

I just can't understand the statement, and there is no proof in the book either. What I would like to do is prove that the estimation error $$\epsilon_{est}$$ increases (logarithmically) with $$|H|$$ and decreases with $$m$$. Hope you can help me out. A rigorous proof would be even better!


Certainly, you can find the proof in different resources (for example, in these notes or in the paper that originally proposed PAC learnability, A Theory of the Learnable). However, the intuition behind your question is this: when the size of the hypothesis class increases, if you do not change anything else, you cannot cover more of the space, so the estimation error will increase. Moreover, when you increase the number of samples, you have a better chance of exploring more of the hypothesis space; hence, the estimation error decreases.
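To make this intuition concrete, here is a toy Monte Carlo sketch (the threshold class, the uniform distribution, and all names are my own illustrative choices, not from the book): an ERM learner over a finite class of 100 threshold functions. The average true error of the returned hypothesis shrinks as the sample size $$m$$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite hypothesis class: thresholds t, with h_t(x) = 1[x >= t].
# The true labeling function is the threshold at 0.5; data is uniform on [0, 1].
thresholds = np.linspace(0.0, 0.99, 100)
TRUE_T = 0.5

def true_error(t):
    # Under the uniform distribution, h_t and h_{0.5} disagree
    # exactly on an interval of length |t - 0.5|.
    return abs(t - TRUE_T)

def erm_true_error(m):
    # Draw m labeled samples, pick an empirical-risk minimizer,
    # and return its true error.
    x = rng.uniform(0.0, 1.0, m)
    y = x >= TRUE_T
    emp_risks = [np.mean((x >= t) != y) for t in thresholds]
    return true_error(thresholds[int(np.argmin(emp_risks))])

for m in [10, 100, 1000]:
    avg = np.mean([erm_true_error(m) for _ in range(200)])
    print(f"m = {m:4d}  average true error = {avg:.3f}")
```

Averaged over repeated draws, the true error of the ERM hypothesis visibly decreases with $$m$$, which is the qualitative behavior the estimation-error bound predicts.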

Also, you can see some lemma about the relation of the PAC learnability and other similar concepts in the Wikipedia article Probably approximately correct learning:

Under some regularity conditions these three conditions are equivalent:

1. The concept class $$C$$ is PAC learnable.
2. The VC dimension of $$C$$ is finite.
3. $$C$$ is a uniform Glivenko-Cantelli class.

Thanks. But it seems that this paper doesn't provide the proof, only the sample complexity. For a finite class $H$, $m_H(\epsilon,\delta)\le\lceil\frac{\log(|H|/\delta)}{\epsilon}\rceil$. But how could we draw the conclusion that we discussed? – Ben – 2019-09-16T11:45:56.953

@ChenglinBen this part is the definition of PAC learnability, and a definition does not need any proof. – OmG – 2019-09-16T12:23:54.677

I know that. I mean that maybe our discussion about how the estimation error increases (logarithmically) with $|H|$ and decreases with $m$ in PAC learning can be inspired by this. I do not intend to prove the definition of PAC learnability. – Ben – 2019-09-16T12:26:40.817

@ChenglinBen maybe the last part of the post (it has been updated) could help. Indeed, the relation of PAC learnability to other similar concepts such as the VC dimension could help you understand the concept better. The proof of the equivalence of these concepts could help prove what you want. – OmG – 2019-09-16T12:28:38.237

First, thank you very much. But I'm sorry that maybe I didn't state my question clearly. What I would like to do is prove that the estimation error $\epsilon_{est}$ increases (logarithmically) with $|H|$ and decreases with $m$. Because the book I mentioned initially stated this conclusion only roughly, I would like to figure out the rigorous proof. Hope you can help me out. – Ben – 2019-09-16T12:32:53.413


The book has actually proven the theorem rigorously in Chapter 2. I won't reproduce the whole proof here, but you can look it up. I will try to explain the parts which are non-obvious (and somewhat confusing, given the book's presentation).

So for PAC learning (under the realizability assumption) the theory is that given a data set of size:

$$m \geq \left\lceil\frac{\log(|H|/\delta)}{\epsilon}\right\rceil$$ where $$|H|$$ is the size of the finite hypothesis class.

which when simplified is nothing but:

$$|H|e^{-\epsilon m} \leq \delta$$

where $$\delta$$ is the probability that your sample is not representative of the underlying distribution (according to the book; hence the term Probably in PAC learning) and $$\epsilon$$ is the maximum tolerated probability that your learned hypothesis $$h$$ predicts new unseen samples wrongly (essentially the error of your hypothesis; hence the term Approximately Correct in PAC learning).
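To see the logarithmic dependence on $$|H|$$ numerically, here is a small sketch (the function name is my own illustrative choice) that evaluates the bound above using the natural logarithm: each tenfold increase in $$|H|$$ adds only a roughly constant number of extra samples.

```python
import math

def sample_complexity(h_size, eps, delta):
    # Finite-class, realizable PAC bound: m >= ceil(ln(|H| / delta) / eps).
    return math.ceil(math.log(h_size / delta) / eps)

eps, delta = 0.1, 0.05
for h_size in [10, 100, 1000, 10**6]:
    print(f"|H| = {h_size:>7}  ->  m >= {sample_complexity(h_size, eps, delta)}")
```

Growing $$|H|$$ from 10 to 1,000,000 (a factor of 100,000) only raises the required sample size from 53 to 169, which is exactly the "increases (logarithmically)" behavior the book describes.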

This equation/bound comes from the last step of the proof which states:

$$D^m[\{S|_x : L_{(D,f)}(h_S) > \epsilon\}] \leq |H_B|e^{-\epsilon m} \leq |H|e^{-\epsilon m}$$ where $$H_B$$ is the set of all bad (over-fitting) hypotheses.

The estimation error increases linearly with $$|H|$$ and decreases exponentially with $$m$$ in PAC learning.
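This linear/exponential behavior can be read off the bound directly. A minimal sketch (names are my own) evaluating $$|H|e^{-\epsilon m}$$:

```python
import math

def failure_bound(h_size, eps, m):
    # Upper bound on the probability of drawing a misleading sample:
    # P <= |H| * exp(-eps * m).
    return h_size * math.exp(-eps * m)

# Linear in |H|: doubling the class size doubles the bound.
print(failure_bound(200, 0.1, 100) / failure_bound(100, 0.1, 100))  # → 2.0

# Exponential in m: 1/eps extra samples shrink the bound by a factor of e.
print(failure_bound(100, 0.1, 110) / failure_bound(100, 0.1, 100))  # ≈ 0.368
```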

Now here comes the tricky part, following this equation the proof directly jumps to:

$$|H|e^{-\epsilon m} \leq \delta$$

The justification for this is given in an earlier part of the proof (I am not entirely sure if this is the justification they meant, but it seems to be the only one):

Since the realizability assumption implies that $$L_S (h_S ) = 0$$, it follows that the event $$L_{(D,f )} (h_S ) > \epsilon$$ can only happen if for some $$h ∈ H_B$$ we have $$L_S (h) = 0$$. In other words, this event will only happen if our sample is in the set of misleading samples.

Do not mistakenly conflate misleading with non-representative; otherwise we will not be able to justify the aforementioned jump ($$\epsilon$$ and $$\delta$$ would become dependent on each other).

The actual interpretation of $$\delta$$ is that it is our confidence parameter, i.e. we want to ensure $$D^m[\{S|_x : L_{(D,f)}(h_S) > \epsilon\}] \leq \delta$$, which means we are $$1-\delta$$ confident that our learned $$h_S$$ will satisfy $$L_{(D,f)}(h_S) \leq \epsilon$$ (the complementary event).

NOTE: This idea is skipped in most resources I have read; I found its explanation here.

Now, coming to the statement $$m_H \leq \left\lceil\frac{\log(|H|/\delta)}{\epsilon}\right\rceil$$: $$m_H$$, unlike $$m$$, is defined as:

If $$H$$ is PAC learnable, there are many functions $$m_H$$ that satisfy the requirements given in the definition of PAC learnability. Therefore, to be precise, we will define the sample complexity of learning $$H$$ to be the "minimal function," in the sense that for any $$\epsilon, \delta$$, $$m_H(\epsilon, \delta)$$ is the minimal integer that satisfies the requirements of PAC learning with accuracy $$\epsilon$$ and confidence $$\delta$$.

And hence the inequality sign is reversed, since a good sample can yield a good hypothesis with fewer examples than the bound requires.
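Under this "minimal function" reading, one can check the definition against the closed-form bound directly. This sketch (function names are my own) searches for the smallest $$m$$ with $$|H|e^{-\epsilon m} \leq \delta$$ and confirms it coincides with $$\lceil\log(|H|/\delta)/\epsilon\rceil$$ for sample values:

```python
import math

def minimal_m(h_size, eps, delta):
    # Smallest integer m satisfying |H| * exp(-eps * m) <= delta,
    # found by direct search.
    m = 0
    while h_size * math.exp(-eps * m) > delta:
        m += 1
    return m

h_size, eps, delta = 1000, 0.1, 0.05
m_star = minimal_m(h_size, eps, delta)
print(m_star)  # → 100

# Matches the closed-form ceiling bound for these values.
assert m_star == math.ceil(math.log(h_size / delta) / eps)
```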

Side note: All conventions are from Understanding Machine Learning: From Theory to Algorithms.

All symbols have their usual meanings if not mentioned. – DuttaA – 2020-03-08T16:40:59.220

Nice attempt to answer this technical question! I will try to have a close look at it later and provide some feedback ;) – nbro – 2020-03-08T16:54:31.800

@nbro thanks. I am hoping for a 2nd opinion since I am unsure of that part where a jump in the proof occurs. – DuttaA – 2020-03-08T17:01:55.270

I think you should explain more in detail the formula that comes after "This equation/bound comes from the last step of the proof which states". For example, what is D, what is L, etc.? – nbro – 2020-03-12T21:16:24.247

@nbro I mean that would just be copying and pasting from the book... I could do that, but it would be pointless, I think. All conventions are the same as those used in the book mentioned by the OP. Since the two standard books I know use different conventions, I didn't make it specific. On second thought, I'll add some details. – DuttaA – 2020-03-12T21:31:26.040

@nbro I was trying to add more details, but I feel it won't make any sense to an unsuspecting reader. So I have kept the old answer. – DuttaA – 2020-03-13T13:22:18.413

I haven't yet upvoted your answer because I need time to check that all your statements are correct, but I really appreciate someone that attempts to give this type of answer. In the future, hopefully, I will have more time to check that this answer is correct. – nbro – 2020-03-13T13:54:07.807

@nbro Yes I know...but to tell the truth there isn't really much substance to my answer since all of it is provided in the book (maybe a little unclear at some places). Hopefully by next year I'll be giving more technical answers. Really want to see this stack grow since it contributed so much to my initial learning, – DuttaA – 2020-03-13T14:23:53.447