The book actually proves the theorem rigorously in Chapter 2. I don't want to reproduce the proof here, but you can look it up. I will try to explain the parts which are non-obvious (and somewhat confusing as presented in the book).

So for PAC learning under the realizability assumption, the theory says that given a data-set of size

$$m \geq \frac{\log(|H|/\delta)}{\epsilon}$$

where $|H|$ is the size of the finite hypothesis class, ERM returns, with probability at least $1-\delta$ over the choice of the sample, a hypothesis $h_S$ with $L_{(D,f)}(h_S) \leq \epsilon$.

This condition on $m$, rearranged, is nothing but:

$$|H|e^{-\epsilon m} \leq \delta$$

where $\delta$ is the probability that your sample is not representative of the underlying distribution (according to the book; hence the term Probably in PAC learning), and $\epsilon$ is the maximum tolerated probability that your learned hypothesis $h$ mispredicts a new unseen sample (it controls the accuracy of your hypothesis; hence the term Approximately Correct in PAC learning).
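As a quick sanity check on the arithmetic (the numbers below are arbitrary values chosen just for illustration, not from the book), a few lines of Python confirm that an $m$ satisfying the first inequality indeed gives $|H|e^{-\epsilon m} \leq \delta$, and show how mildly the required $m$ grows with $|H|$:

```python
import math

def sample_size(H_size, epsilon, delta):
    """Smallest integer m satisfying m >= log(|H| / delta) / epsilon."""
    return math.ceil(math.log(H_size / delta) / epsilon)

epsilon, delta = 0.1, 0.05                        # illustrative values

m = sample_size(1000, epsilon, delta)
print(m)                                          # 100
print(1000 * math.exp(-epsilon * m) <= delta)     # True: |H| e^{-eps m} <= delta

# The dependence on |H| is only logarithmic: making |H| a thousand times
# larger adds roughly log(1000)/epsilon ~ 69 samples; it does not
# multiply m by 1000.
print(sample_size(1_000_000, epsilon, delta))     # 169
```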

This equation/bound comes from the last step of the proof which states:

$$D^m \big[ \{S|_x : L_{(D,f)}(h_S) > \epsilon\} \big] \leq |H_B|e^{-\epsilon m} \leq |H|e^{-\epsilon m}$$
where $H_B$ is the set of all bad hypotheses (the over-fitting hypotheses, i.e. those $h \in H$ with $L_{(D,f)}(h) > \epsilon$)
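For completeness, here is a condensed restatement of how the book's Chapter 2 argument arrives at the right-hand side: for a single fixed bad hypothesis $h \in H_B$, the probability of drawing an i.i.d. sample on which $h$ still has zero empirical error is

$$D^m \big[ \{S|_x : L_S(h) = 0\} \big] = \big(1 - L_{(D,f)}(h)\big)^m \leq (1-\epsilon)^m \leq e^{-\epsilon m},$$

and a union bound over all of $H_B$ then introduces the factor $|H_B|$:

$$D^m \big[ \{S|_x : \exists h \in H_B \text{ with } L_S(h) = 0\} \big] \leq \sum_{h \in H_B} D^m \big[ \{S|_x : L_S(h) = 0\} \big] \leq |H_B|\, e^{-\epsilon m}.$$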

which is your answer to the **question**:

Estimation error increases linearly with $|H|$ and decreases exponentially with $m$ in PAC learning

(more precisely, the bound says the probability of drawing a misleading sample grows at most linearly with $|H|$ and decays exponentially with $m$; equivalently, the sample size needed for a fixed $\epsilon, \delta$ grows only logarithmically with $|H|$).
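If you want to see this in action rather than just on paper, here is a small Monte Carlo sketch. The setup (a class of threshold functions on a finite domain) is my own toy example, not one from the book: it draws many samples of the size suggested by the bound and checks that the fraction of misleading samples, i.e. samples on which some $\epsilon$-bad hypothesis still has zero empirical error, stays below $\delta$:

```python
import math
import random

# Toy empirical check of  D^m[misleading sample] <= |H| e^{-eps m} <= delta.
# Setup (my own illustration): domain X = {0, ..., d-1} with D uniform,
# H = threshold functions h_t(x) = 1 iff x >= t for t in {0, ..., d}
# (so |H| = d + 1), and the true labelling is f = h_{t_star}, so the
# realizability assumption holds.

d, t_star = 200, 77
epsilon, delta = 0.1, 0.05
m = math.ceil(math.log((d + 1) / delta) / epsilon)    # sample size from the bound

def true_error(t):
    # under the uniform distribution, h_t and f disagree on |t - t_star| points
    return abs(t - t_star) / d

trials, misleading = 1000, 0
for _ in range(trials):
    xs = [random.randrange(d) for _ in range(m)]
    # thresholds with zero empirical error on this sample
    consistent = [t for t in range(d + 1)
                  if all((x >= t) == (x >= t_star) for x in xs)]
    # the sample is misleading if some epsilon-bad hypothesis survives ERM
    if any(true_error(t) > epsilon for t in consistent):
        misleading += 1

print("m =", m)
print("empirical P(misleading sample) =", misleading / trials)
print("bound |H| e^{-eps m}           =", (d + 1) * math.exp(-epsilon * m))
print("delta                          =", delta)
```

The empirical rate typically comes out far below $\delta$, which is expected: the union bound over the whole class is conservative.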

Now here comes the tricky part: following this equation, the proof directly jumps to:

$$|H|e^{-\epsilon m} \leq \delta$$

The justification for this is given in an earlier part of the proof (I am not entirely sure this is the justification they meant, but it seems to be the only one):

Since the realizability assumption implies that $L_S(h_S) = 0$, it follows that the event $L_{(D,f)}(h_S) > \epsilon$ can only happen if for some $h \in H_B$ we have $L_S(h) = 0$. In other words, this event will only happen if our sample is in the set of **misleading** samples.

Do not mistakenly equate **misleading** with **non-representative**; otherwise we will not be able to justify the aforementioned jump ($\epsilon$ and $\delta$ would become dependent on each other).

The actual interpretation of $\delta$ is that it is our confidence parameter, i.e. we want to ensure:
$$D^m \big[ \{S|_x : L_{(D,f)}(h_S) > \epsilon\} \big] \leq \delta$$
which means we are $1-\delta$ confident that our learned $h_S$ will satisfy $L_{(D,f)}(h_S) \leq \epsilon$ (the complementary event).
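Putting the two displayed bounds together makes the jump explicit: we simply choose $m$ large enough that the proof's upper bound on the probability of a misleading sample is itself below the confidence parameter we are willing to tolerate,

$$D^m \big[ \{S|_x : L_{(D,f)}(h_S) > \epsilon\} \big] \;\leq\; |H|\,e^{-\epsilon m} \;\leq\; \delta, \qquad \text{and} \qquad |H|\,e^{-\epsilon m} \leq \delta \iff m \geq \frac{\log(|H|/\delta)}{\epsilon}.$$

So $\epsilon$ and $\delta$ remain two independent knobs; the only thing that depends on both of them is the sample size $m$.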

**NOTE:** This idea is skipped in most resources I have read; I found its explanation here.

Now, coming to the statement:
$$m_H(\epsilon, \delta) \leq \left\lceil \frac{\log(|H|/\delta)}{\epsilon} \right\rceil$$ where $m_H$, unlike $m$, is defined as:

If $H$ is PAC learnable, there are many functions $m_H$ that satisfy the requirements given in the definition of PAC learnability. Therefore, to be precise, we will define the sample complexity of learning $H$ to be the “minimal function,” in the sense that for any $\epsilon, \delta$, $m_H(\epsilon, \delta)$ is the minimal integer that satisfies the requirements of PAC learning with accuracy $\epsilon$ and confidence $\delta$.

And hence the direction of the inequality is reversed: any $m$ at least $\lceil \log(|H|/\delta)/\epsilon \rceil$ already satisfies the PAC requirements, so the minimal such integer, $m_H(\epsilon, \delta)$, can only be smaller than or equal to that quantity (a particularly good sample may let a good hypothesis be found with even fewer examples).
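In symbols, this is just the two statements above placed side by side (nothing new, only a restatement):

$$\text{every } m \geq \left\lceil \frac{\log(|H|/\delta)}{\epsilon} \right\rceil \text{ meets the } (\epsilon,\delta)\text{ requirement} \quad\Longrightarrow\quad m_H(\epsilon,\delta) = \min\{m : m \text{ meets the requirement}\} \leq \left\lceil \frac{\log(|H|/\delta)}{\epsilon} \right\rceil$$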

Side note: All conventions are from Understanding Machine Learning: From Theory to Algorithms.

Thanks. But it seems that this paper doesn't provide the proof but the sample complexity. For a finite class $H$, $m_H(\epsilon,\delta)\le\lceil\frac{\log(|H|/\delta)}{\epsilon}\rceil$. But how could we draw the conclusion that we discussed? – Ben – 2019-09-16T11:45:56.953

@ChenglinBen this part is the definition of PAC learnability, and a definition does not need any proof. – OmG – 2019-09-16T12:23:54.677

I know that. I mean that maybe our discussion that the estimation error increases (logarithmically) with $|H|$ and decreases with $m$ in PAC learning can be inspired by this. I don't intend to prove the definition of PAC learnability. – Ben – 2019-09-16T12:26:40.817

@ChenglinBen maybe the last part of the post (it has been updated) could help. Indeed, the relation of PAC learnability to other similar concepts such as the VC dimension could help in understanding the concept better. The proof of the equivalence of these concepts could help to prove what you want. – OmG – 2019-09-16T12:28:38.237

First, thank you very much. But I'm sorry that maybe I didn't state my question clearly. What I would like to do is prove that the estimation error $\epsilon_{est}$ increases (logarithmically) with $|H|$ and decreases with $m$. Because the book I mentioned initially stated this conclusion only roughly, I would like to figure out the rigorous proof. Hope you can help me out. – Ben – 2019-09-16T12:32:53.413