← Back to home

Rademacher Complexity and the Beautiful Inversion — from Scratch

May 2026

This is Part 2 of a two-part series on the foundations of statistical learning theory. Part 1 built the Fundamental Theorem of Statistical Learning from scratch: it proved that a hypothesis class is learnable if and only if it has finite VC dimension, with matching upper and lower bounds on sample complexity. Every tool — Markov, Hoeffding, Sauer–Shelah, symmetrization, KL divergence, Pinsker, Le Cam — was constructed from definitions.

But the Fundamental Theorem gives a binary answer: learnable or not learnable. It doesn’t tell you how fast a specific class generalizes on your specific data. VC dimension is a worst-case, distribution-free number. Can we do better?

This post introduces Rademacher complexity — a probabilistic measure that asks how well the hypothesis class can correlate with pure noise. It gives data-dependent generalization bounds that adapt to the actual training set, not just worst-case combinatorics. We prove everything from scratch, building McDiarmid’s inequality, the Rademacher generalization bound, Massart’s lemma, and the connection between VC dimension and Rademacher complexity. The post ends with the Beautiful Inversion: the full proof chain, visualized as one complete cycle.

13. Rademacher Complexity

VC dimension is a combinatorial measure: it asks which labelings are achievable. Rademacher complexity is a probabilistic refinement: it asks how well a hypothesis class can correlate with random noise. This gives tighter, data-dependent generalization bounds.

Definition (Rademacher Random Variables) A Rademacher random variable \(\sigma\) takes values \(+1\) and \(-1\) each with probability \(1/2\). A sequence \(\sigma_1, \ldots, \sigma_m\) of i.i.d. Rademacher variables is a random labeling with no structure.
Definition (Empirical Rademacher Complexity) Given a function class \(\mathcal{F}\) (functions from some domain \(\mathcal{Z}\) to \(\mathbb{R}\)) and a fixed sample \(S = (z_1, \ldots, z_m)\), the empirical Rademacher complexity is $$ \hat{\mathcal{R}}_S(\mathcal{F}) = \mathbb{E}_\sigma\!\left[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z_i)\right], $$ where the expectation is over i.i.d. Rademacher \(\sigma_1, \ldots, \sigma_m\).
Definition (Rademacher Complexity) The Rademacher complexity of \(\mathcal{F}\) for sample size \(m\) is $$ \mathcal{R}_m(\mathcal{F}) = \mathbb{E}_S[\hat{\mathcal{R}}_S(\mathcal{F})], $$ where \(S \sim \mathcal{D}^m\).
Intuition \(\hat{\mathcal{R}}_S(\mathcal{F})\) measures how well the best function in \(\mathcal{F}\) can "fit" a purely random labeling \(\sigma\). If \(\mathcal{F}\) is very expressive, some \(f \in \mathcal{F}\) will correlate well with noise, giving a high Rademacher complexity. If \(\mathcal{F}\) is simple, no \(f\) can fit noise, and the Rademacher complexity is small.

A class that can't fit noise can't overfit data — this is the key link to generalization.

For binary classification, we apply this to the loss class \(\ell \circ \mathcal{H} = \{(x,y) \mapsto \mathbf{1}[h(x) \neq y] : h \in \mathcal{H}\}\).

Example 1: A Single Hypothesis

Example If \(\mathcal{F} = \{f\}\) contains a single function, then: $$ \hat{\mathcal{R}}_S(\mathcal{F}) = \frac{1}{m}\left|\mathbb{E}_\sigma\!\left[\sum_{i=1}^m \sigma_i f(z_i)\right]\right| = 0 $$ since each \(\mathbb{E}[\sigma_i] = 0\). With one function, there is no ability to fit noise.

Example 2: Two Hypotheses

Example Let \(\mathcal{F} = \{f, -f\}\) where \(f(z_i) \in \{-1, +1\}\). Then: $$ \hat{\mathcal{R}}_S(\mathcal{F}) = \mathbb{E}_\sigma\!\left[\frac{1}{m}\left|\sum_{i=1}^m \sigma_i f(z_i)\right|\right]. $$ Since \(\sigma_i f(z_i)\) are still Rademacher (product of two \(\pm 1\) r.v.'s with one Rademacher is Rademacher), this equals \(\mathbb{E}[|W|/m]\) where \(W = \sum \sigma_i\). For large \(m\), \(W \approx \mathcal{N}(0, m)\) and \(\mathbb{E}[|W|] \approx \sqrt{2m/\pi}\), giving \(\hat{\mathcal{R}}_S \approx \sqrt{2/(\pi m)}\). Vanishes as \(m \to \infty\).

Example 3: All Binary Functions

Example Let \(\mathcal{F} = \{-1, +1\}^m\) (all possible binary functions on \(m\) points). Then for every \(\sigma\), we can pick \(f(z_i) = \sigma_i\), giving \(\sum \sigma_i f(z_i) = m\). So: $$ \hat{\mathcal{R}}_S(\mathcal{F}) = \frac{m}{m} = 1. $$ Maximum Rademacher complexity — the class can perfectly fit any noise pattern. This matches infinite VC dimension: a class this expressive cannot generalize.

14. McDiarmid’s Inequality

To prove the Rademacher generalization bound, we need a concentration inequality for functions of independent random variables (not just sums). McDiarmid's inequality is a powerful generalization of Hoeffding's inequality.

Definition (Bounded Differences) A function \(g : \mathcal{Z}^m \to \mathbb{R}\) satisfies the bounded differences condition with constants \(c_1, \ldots, c_m\) if for all \(i\) and all \(z_1, \ldots, z_m, z'_i\): $$ |g(z_1, \ldots, z_i, \ldots, z_m) - g(z_1, \ldots, z'_i, \ldots, z_m)| \leq c_i. $$ Changing any single input changes the output by at most \(c_i\).
Theorem (McDiarmid’s Inequality) Let \(Z_1, \ldots, Z_m\) be independent random variables and let \(g\) satisfy the bounded differences condition with constants \(c_1, \ldots, c_m\). Then for any \(t > 0\): $$ \Pr\!\left[g(Z_1, \ldots, Z_m) - \mathbb{E}[g] \geq t\right] \leq \exp\!\left(\frac{-2t^2}{\sum_{i=1}^m c_i^2}\right). $$
Proof

The proof mirrors Hoeffding's inequality but uses a martingale decomposition.

Step 1: Martingale differences. Define:

$$ V_i = \mathbb{E}[g \mid Z_1, \ldots, Z_i] - \mathbb{E}[g \mid Z_1, \ldots, Z_{i-1}]. $$

Then \(g(Z) - \mathbb{E}[g] = \sum_{i=1}^m V_i\) (telescoping), and \(\mathbb{E}[V_i \mid Z_1, \ldots, Z_{i-1}] = 0\) (martingale difference property).

Step 2: Bounding \(V_i\). For fixed \(Z_1, \ldots, Z_{i-1}\), define:

$$ g_i(z) = \mathbb{E}[g(Z) \mid Z_1, \ldots, Z_{i-1}, Z_i = z]. $$

Then \(V_i = g_i(Z_i) - \mathbb{E}[g_i(Z_i)]\). Since \(g\) has bounded differences with constant \(c_i\) in coordinate \(i\), the function \(g_i\) (which is an average of \(g\) over the remaining coordinates) also has range at most \(c_i\). So \(V_i\) is zero-mean and lies in an interval of width at most \(c_i\).

Step 3: Chernoff method. For any \(s > 0\):

$$ \Pr[g(Z) - \mathbb{E}[g] \geq t] = \Pr\!\left[e^{s \sum V_i} \geq e^{st}\right] \leq e^{-st}\, \mathbb{E}\!\left[e^{s\sum V_i}\right]. $$

Step 4: Iterative conditioning.

$$\mathbb{E}\!\left[e^{s\sum_{i=1}^m V_i}\right] = \mathbb{E}\!\left[e^{s\sum_{i=1}^{m-1} V_i} \cdot \mathbb{E}\!\left[e^{sV_m} \mid Z_1, \ldots, Z_{m-1}\right]\right].$$

By Hoeffding's lemma applied to \(V_m\) (which is zero-mean and bounded in width \(c_m\), conditioned on \(Z_1, \ldots, Z_{m-1}\)):

$$ \mathbb{E}[e^{sV_m} \mid Z_1, \ldots, Z_{m-1}] \leq e^{s^2 c_m^2 / 8}. $$

Iterating over \(i = m, m-1, \ldots, 1\):

$$ \mathbb{E}\!\left[e^{s\sum V_i}\right] \leq \prod_{i=1}^m e^{s^2 c_i^2/8} = e^{s^2 \sum c_i^2 / 8}. $$

Step 5: Optimize.

$$ \Pr[g(Z) - \mathbb{E}[g] \geq t] \leq e^{-st + s^2\sum c_i^2/8}. $$

Minimizing over \(s\) at \(s^* = 4t/\sum c_i^2\):

$$ \Pr[g(Z) - \mathbb{E}[g] \geq t] \leq \exp\!\left(\frac{-2t^2}{\sum c_i^2}\right). \quad \square $$
Example (Recovering Hoeffding) Let \(g(z_1, \ldots, z_m) = \frac{1}{m}\sum z_i\) where \(z_i \in [a, b]\). Changing one \(z_i\) changes \(g\) by at most \((b-a)/m\), so \(c_i = (b-a)/m\). McDiarmid gives: $$ \Pr[\bar{Z} - \mathbb{E}[\bar{Z}] \geq t] \leq \exp\!\left(\frac{-2t^2}{m \cdot (b-a)^2/m^2}\right) = \exp\!\left(\frac{-2m t^2}{(b-a)^2}\right), $$ which is exactly Hoeffding's inequality.
Example (Supremum of Empirical Errors) Let \(g(S) = \sup_{h \in \mathcal{H}} |L_S(h) - L_{\mathcal{D}}(h)|\). Changing one sample point \((x_i, y_i)\) changes each \(L_S(h)\) by at most \(1/m\), hence changes \(g\) by at most \(1/m\). So \(c_i = 1/m\) and McDiarmid gives: $$ \Pr\!\left[g(S) - \mathbb{E}[g(S)] \geq t\right] \leq e^{-2mt^2}. $$ This is crucial: the supremum of the generalization gap concentrates around its mean.

15. The Rademacher Generalization Bound

We now prove that Rademacher complexity controls generalization, with a complete proof from the ingredients we've built.

Theorem (Rademacher Generalization Bound) Let \(\mathcal{F}\) be a class of functions mapping \(\mathcal{Z}\) to \([0, 1]\). Then with probability at least \(1 - \delta\) over \(S \sim \mathcal{D}^m\), for all \(f \in \mathcal{F}\): $$ \mathbb{E}_{\mathcal{D}}[f] \leq \frac{1}{m}\sum_{i=1}^m f(z_i) + 2\mathcal{R}_m(\mathcal{F}) + \sqrt{\frac{\ln(1/\delta)}{2m}}. $$ Equivalently, for all \(f \in \mathcal{F}\): $$ \mathbb{E}_{\mathcal{D}}[f] - \frac{1}{m}\sum_{i=1}^m f(z_i) \leq 2\hat{\mathcal{R}}_S(\mathcal{F}) + 3\sqrt{\frac{\ln(2/\delta)}{2m}}. $$

For binary classification with \(f_{h}(x,y) = \mathbf{1}[h(x) \neq y]\), this says:

$$ L_{\mathcal{D}}(h) \leq L_S(h) + 2\mathcal{R}_m(\ell \circ \mathcal{H}) + \sqrt{\frac{\ln(1/\delta)}{2m}} \quad \text{for all } h \in \mathcal{H}. $$
Proof

Let \(\Phi(S) = \sup_{f \in \mathcal{F}} (\mathbb{E}_{\mathcal{D}}[f] - \hat{\mathbb{E}}_S[f])\) where \(\hat{\mathbb{E}}_S[f] = \frac{1}{m}\sum f(z_i)\).

Step 1: Bound \(\mathbb{E}[\Phi(S)]\) by Rademacher complexity.

Let \(S' = (z'_1, \ldots, z'_m)\) be an independent ghost sample. Since \(\mathbb{E}_{S'}[\hat{\mathbb{E}}_{S'}[f]] = \mathbb{E}_{\mathcal{D}}[f]\):

$$ \mathbb{E}_S[\Phi(S)] = \mathbb{E}_S\!\left[\sup_{f \in \mathcal{F}} \left(\mathbb{E}_{\mathcal{D}}[f] - \hat{\mathbb{E}}_S[f]\right)\right] = \mathbb{E}_S\!\left[\sup_{f \in \mathcal{F}} \left(\mathbb{E}_{S'}[\hat{\mathbb{E}}_{S'}[f]] - \hat{\mathbb{E}}_S[f]\right)\right]. $$

Since \(\sup\) of an expectation \(\leq\) expectation of \(\sup\) (by Jensen, since \(\sup\) is convex):

$$ \leq \mathbb{E}_{S,S'}\!\left[\sup_{f \in \mathcal{F}} \left(\hat{\mathbb{E}}_{S'}[f] - \hat{\mathbb{E}}_S[f]\right)\right] = \mathbb{E}_{S,S'}\!\left[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^m \left(f(z'_i) - f(z_i)\right)\right]. $$

Since \((z_i, z'_i)\) are i.i.d. pairs, swapping \(z_i \leftrightarrow z'_i\) doesn't change the distribution. So multiplying each difference by an independent Rademacher \(\sigma_i\) doesn't change the expectation:

$$ = \mathbb{E}_{S,S',\sigma}\!\left[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^m \sigma_i\left(f(z'_i) - f(z_i)\right)\right]. $$

By the triangle inequality for \(\sup\):

$$ \leq \mathbb{E}_{S,\sigma}\!\left[\sup_{f} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z_i)\right] + \mathbb{E}_{S',\sigma}\!\left[\sup_{f} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z'_i)\right] = 2\mathcal{R}_m(\mathcal{F}). $$

Wait — why the split? We used \(\sup(A + B) \leq \sup A + \sup B\) and then the fact that \(-\sigma_i\) is also Rademacher (so the term with \(-f(z_i)\) becomes \(+\sigma_i f(z_i)\) after flipping signs). Each term is the same expectation by symmetry.

So: \(\mathbb{E}_S[\Phi(S)] \leq 2\mathcal{R}_m(\mathcal{F})\).

Step 2: Concentrate \(\Phi\) around its mean.

Changing one sample point \(z_i\) in \(S\) changes \(\hat{\mathbb{E}}_S[f]\) by at most \(1/m\) for each \(f\), so \(\Phi(S)\) changes by at most \(1/m\). By McDiarmid's inequality:

$$ \Pr\!\left[\Phi(S) - \mathbb{E}[\Phi(S)] \geq t\right] \leq e^{-2mt^2}. $$

Step 3: Combine. Setting \(e^{-2mt^2} = \delta\), so \(t = \sqrt{\ln(1/\delta)/(2m)}\):

$$ \Pr\!\left[\Phi(S) > 2\mathcal{R}_m(\mathcal{F}) + \sqrt{\frac{\ln(1/\delta)}{2m}}\right] \leq \delta. $$

This means: with probability \(\geq 1 - \delta\), for all \(f \in \mathcal{F}\):

$$ \mathbb{E}_{\mathcal{D}}[f] \leq \hat{\mathbb{E}}_S[f] + 2\mathcal{R}_m(\mathcal{F}) + \sqrt{\frac{\ln(1/\delta)}{2m}}. \quad \square $$
Data-Dependent Bounds Unlike the VC bound, the Rademacher bound using \(\hat{\mathcal{R}}_S(\mathcal{F})\) depends on the actual training data, not just worst-case combinatorics. On "easy" data where the hypothesis class has low effective complexity, the bound is tighter. This is why Rademacher complexity is preferred in modern learning theory.

16. Massart’s Lemma and Properties of Rademacher Complexity

We now develop tools for computing Rademacher complexity in specific cases.

Massart’s Lemma

Lemma (Massart) Let \(A \subseteq \mathbb{R}^m\) be a finite set with \(\max_{a \in A} \|a\| \leq B\). Then: $$ \mathbb{E}_\sigma\!\left[\max_{a \in A} \sum_{i=1}^m \sigma_i a_i\right] \leq B\sqrt{2\ln|A|}. $$
Proof

For any \(\lambda > 0\):

$$ \mathbb{E}\!\left[\max_{a \in A} \sum_i \sigma_i a_i\right] = \frac{1}{\lambda}\mathbb{E}\!\left[\ln \exp\!\left(\lambda \max_{a \in A} \sum_i \sigma_i a_i\right)\right]. $$

By Jensen's inequality (\(\ln\) is concave):

$$ \leq \frac{1}{\lambda}\ln \mathbb{E}\!\left[\exp\!\left(\lambda \max_{a \in A} \sum_i \sigma_i a_i\right)\right] = \frac{1}{\lambda}\ln \mathbb{E}\!\left[\max_{a \in A} \exp\!\left(\lambda \sum_i \sigma_i a_i\right)\right]. $$

Since \(\max \leq \sum\):

$$ \leq \frac{1}{\lambda}\ln \sum_{a \in A} \mathbb{E}\!\left[\exp\!\left(\lambda \sum_i \sigma_i a_i\right)\right]. $$

By independence of \(\sigma_i\):

$$ = \frac{1}{\lambda}\ln \sum_{a \in A} \prod_{i=1}^m \mathbb{E}[e^{\lambda \sigma_i a_i}]. $$

Now \(\mathbb{E}[e^{\lambda \sigma_i a_i}] = \cosh(\lambda a_i) \leq e^{\lambda^2 a_i^2/2}\) (this is Hoeffding's lemma for Rademacher \(\sigma_i\) bounded in \(\{-1, +1\}\)). So:

$$ \leq \frac{1}{\lambda}\ln \sum_{a \in A} e^{\lambda^2 \|a\|^2/2} \leq \frac{1}{\lambda}\ln\!\left(|A| \cdot e^{\lambda^2 B^2/2}\right) = \frac{\ln|A|}{\lambda} + \frac{\lambda B^2}{2}. $$

Minimizing over \(\lambda\): set \(\lambda = \sqrt{2\ln|A|}/B\), giving:

$$ \frac{B\sqrt{\ln|A|}}{\sqrt{2}} + \frac{\sqrt{2\ln|A|} \cdot B}{2 \cdot B} \cdot B = B\sqrt{2\ln|A|}. \quad \square $$

Rademacher Complexity of Finite Classes

Corollary If \(|\mathcal{H}| < \infty\) and \(\mathcal{F} = \ell \circ \mathcal{H}\) is the associated loss class (with values in \(\{0,1\}\)): $$ \mathcal{R}_m(\mathcal{F}) \leq \sqrt{\frac{2\ln|\mathcal{H}|}{m}}. $$
Proof

For a fixed sample \(S\), the set \(A = \{(f(z_1), \ldots, f(z_m)) : f \in \mathcal{F}\} \subseteq \{0,1\}^m\) has \(|A| \leq |\mathcal{H}|\) and each \(a \in A\) has \(\|a\| \leq \sqrt{m}\). By Massart:

$$ \hat{\mathcal{R}}_S(\mathcal{F}) = \frac{1}{m}\mathbb{E}_\sigma\!\left[\max_{a \in A} \sum \sigma_i a_i\right] \leq \frac{\sqrt{m} \cdot \sqrt{2\ln|\mathcal{H}|}}{m} = \sqrt{\frac{2\ln|\mathcal{H}|}{m}}. $$

Taking \(\mathbb{E}_S\) preserves the bound. \(\square\)

Plugging into the Rademacher generalization bound recovers (up to constants) the finite class bound from Part 1. So Rademacher complexity subsumes the earlier result.

Rademacher Complexity of Linear Classes

Example (Bounded Linear Functions) Let \(\mathcal{F} = \{x \mapsto \langle w, x \rangle : \|w\| \leq B\}\) with data satisfying \(\|x_i\| \leq R\). Then: $$ \hat{\mathcal{R}}_S(\mathcal{F}) = \frac{1}{m}\mathbb{E}_\sigma\!\left[\sup_{\|w\| \leq B} \left\langle w, \sum_{i=1}^m \sigma_i x_i \right\rangle\right] = \frac{B}{m}\mathbb{E}_\sigma\!\left[\left\|\sum_{i=1}^m \sigma_i x_i\right\|\right] $$ since the supremum of a linear function over a ball is achieved at \(w = B \cdot v/\|v\|\) where \(v = \sum \sigma_i x_i\). By Jensen: $$ \leq \frac{B}{m}\sqrt{\mathbb{E}\!\left[\left\|\sum \sigma_i x_i\right\|^2\right]} = \frac{B}{m}\sqrt{\sum_{i=1}^m \|x_i\|^2} \leq \frac{BR}{\sqrt{m}}, $$ using \(\mathbb{E}[\sigma_i \sigma_j] = \delta_{ij}\) to expand the squared norm.

This gives a generalization bound for linear models that depends on the norm constraints \(B, R\) and scales as \(1/\sqrt{m}\), independent of the input dimension.

17. The Connection: VC Dimension and Rademacher Complexity

The two measures are tightly linked. The growth function (controlled by VC dimension via Sauer–Shelah) bounds the Rademacher complexity:

Theorem Let \(\mathcal{H}\) be a class of binary functions with \(\operatorname{VCdim}(\mathcal{H}) = d\). Then: $$ \mathcal{R}_m(\mathcal{H}) \leq \sqrt{\frac{2d\ln(em/d)}{m}}. $$
Proof

For a fixed sample \(S\), the set of restrictions \(\mathcal{H}_S = \{(h(x_1), \ldots, h(x_m)) : h \in \mathcal{H}\} \subseteq \{0,1\}^m\) satisfies \(|\mathcal{H}_S| \leq \Pi_{\mathcal{H}}(m)\). Each vector has norm \(\leq \sqrt{m}\). By Massart's lemma:

$$ \hat{\mathcal{R}}_S(\mathcal{H}) = \frac{1}{m}\mathbb{E}_\sigma\!\left[\max_{a \in \mathcal{H}_S} \sum \sigma_i a_i\right] \leq \frac{\sqrt{m} \cdot \sqrt{2\ln \Pi_{\mathcal{H}}(m)}}{m} = \sqrt{\frac{2\ln \Pi_{\mathcal{H}}(m)}{m}}. $$

By Sauer–Shelah: \(\Pi_{\mathcal{H}}(m) \leq (em/d)^d\), so \(\ln \Pi_{\mathcal{H}}(m) \leq d\ln(em/d)\).

$$ \hat{\mathcal{R}}_S(\mathcal{H}) \leq \sqrt{\frac{2d\ln(em/d)}{m}}. $$

Since this holds for every \(S\), \(\mathcal{R}_m(\mathcal{H}) = \mathbb{E}_S[\hat{\mathcal{R}}_S(\mathcal{H})] \leq \sqrt{\frac{2d\ln(em/d)}{m}}\). \(\square\)

The converse direction also holds (up to constants): if \(\mathcal{R}_m(\mathcal{H}) \to 0\) as \(m \to \infty\), then \(\operatorname{VCdim}(\mathcal{H}) < \infty\). So Rademacher complexity and VC dimension characterize the same notion of learnability, but Rademacher complexity gives finer, data-dependent information.

18. The Beautiful Inversion

Let's step back and see the full chain of ideas:

VC Dimension combinatorics Sauer–Shelah: Π(m) ≤ (em/d)ᵈ Growth Function counting Massart: R ≤ √(2d·ln(em/d)/m) Rademacher Complexity probability Symmetrization + McDiarmid Generalization Bound statistics ERM + uniform convergence Learning prediction No Free Lunch (converse) Each arrow crosses a conceptual boundary. The full cycle is an if-and-only-if.
The proof chain: from a combinatorial property to a statistical guarantee. The dashed arrow shows the converse (No-Free-Lunch), closing the equivalence.

Each step crosses a conceptual boundary:

And the No-Free-Lunch theorem closes the loop: if VC dimension is infinite (the combinatorial property fails), then no algorithm can learn (the prediction goal fails). The chain is an if-and-only-if.

The Takeaway

The question "when does learning work?" has a precise, beautiful answer: when the hypothesis class can't represent everything.

VC dimension quantifies "how much" the class can represent. Finite VC dimension means polynomial growth, which means noise can't be fit, which means training error approximates true error, which means ERM works. Infinite VC dimension means exponential growth, which means noise can be fit, which means overfitting is unavoidable, which means no algorithm works.

Rademacher complexity refines this binary verdict into a continuous measure: it tells you not just whether you can learn, but how fast, and it adapts to the specific data you're working with.

Appendix: Proof Details and Worked Examples

A.1 Why \(\cosh(x) \leq e^{x^2/2}\)

This identity is used in Massart's lemma. We verify it term by term:

$$ \cosh(x) = \sum_{k=0}^{\infty} \frac{x^{2k}}{(2k)!}, \qquad e^{x^2/2} = \sum_{k=0}^{\infty} \frac{x^{2k}}{2^k \cdot k!}. $$

For each \(k\): \((2k)! = (2k)(2k-1)\cdots 1 \geq 2^k \cdot k!\) (pair up consecutive terms: \((2k)(2k-1) \geq 2k\), etc.). So \(\frac{1}{(2k)!} \leq \frac{1}{2^k k!}\), and each term of \(\cosh\) is \(\leq\) the corresponding term of \(e^{x^2/2}\). \(\checkmark\)

A.2 The Rademacher Complexity of Thresholds

\(\mathcal{H}\) = thresholds on \(\mathbb{R}\), \(\operatorname{VCdim} = 1\). For a sample of size \(m\), \(\Pi_{\mathcal{H}}(m) = m + 1\) (each threshold selects a suffix of the sorted points, plus the empty set). By our bound:

$$ \mathcal{R}_m(\mathcal{H}) \leq \sqrt{\frac{2\ln(m+1)}{m}}. $$

For \(m = 100\): \(\mathcal{R}_{100} \leq \sqrt{2\ln(101)/100} \approx \sqrt{0.0924} \approx 0.304\).

For \(m = 10{,}000\): \(\mathcal{R}_{10000} \leq \sqrt{2\ln(10001)/10000} \approx \sqrt{0.00184} \approx 0.043\).

Vanishes as \(O(\sqrt{\ln m / m})\) — thresholds are very learnable. \(\checkmark\)

A.3 Complete Worked Example: Generalization Bound for Linear Classifiers in \(\mathbb{R}^{10}\)

Setup: \(\mathcal{H}\) = linear classifiers in \(\mathbb{R}^{10}\), \(d = \operatorname{VCdim}(\mathcal{H}) = 11\). We want uniform convergence within \(\epsilon = 0.1\) with confidence \(\delta = 0.05\), using the VC bound:

$$ 4\left(\frac{2em}{d}\right)^d e^{-m\epsilon^2/8} \leq \delta. $$

Substituting: \(4(2em/11)^{11} \cdot e^{-m/800} \leq 0.05\).

Taking logarithms: \(\ln 4 + 11\ln(2em/11) - m/800 \leq \ln(0.05)\).

At \(m = 12{,}000\): LHS \(\approx 1.39 + 11\ln(5{,}922) - 15 = 1.39 + 11(8.69) - 15 = 1.39 + 95.56 - 15 = 81.95\) and RHS \(= -3.0\). Not enough.

At \(m = 100{,}000\): LHS \(\approx 1.39 + 11\ln(49{,}350) - 125 = 1.39 + 118.8 - 125 = -4.8\). This works since \(-4.8 < -3.0\). \(\checkmark\)

So the VC bound asks for about \(m \approx 100{,}000\) samples. This is conservative — the Rademacher bound (which can use norm constraints) typically gives much tighter sample requirements in practice.