Rademacher Complexity and the Beautiful Inversion — from Scratch
May 2026
This is Part 2 of a two-part series on the foundations of statistical learning theory. Part 1 built the Fundamental Theorem of Statistical Learning from scratch: it proved that a hypothesis class is learnable if and only if it has finite VC dimension, with matching upper and lower bounds on sample complexity. Every tool — Markov, Hoeffding, Sauer–Shelah, symmetrization, KL divergence, Pinsker, Le Cam — was constructed from definitions.
But the Fundamental Theorem gives a binary answer: learnable or not learnable. It doesn’t tell you how fast a specific class generalizes on your specific data. VC dimension is a worst-case, distribution-free number. Can we do better?
This post introduces Rademacher complexity — a probabilistic measure that asks how well the hypothesis class can correlate with pure noise. It gives data-dependent generalization bounds that adapt to the actual training set, not just worst-case combinatorics. We prove everything from scratch, building McDiarmid’s inequality, the Rademacher generalization bound, Massart’s lemma, and the connection between VC dimension and Rademacher complexity. The post ends with the Beautiful Inversion: the full proof chain, visualized as one complete cycle.
13. Rademacher Complexity
VC dimension is a combinatorial measure: it asks which labelings are achievable. Rademacher complexity is a probabilistic refinement: it asks how well a hypothesis class can correlate with random noise. This gives tighter, data-dependent generalization bounds.
A class that can't fit noise can't overfit data — this is the key link to generalization.
For binary classification, we apply this to the loss class \(\ell \circ \mathcal{H} = \{(x,y) \mapsto \mathbf{1}[h(x) \neq y] : h \in \mathcal{H}\}\).
Example 1: A Single Hypothesis
Example 2: Two Hypotheses
Example 3: All Binary Functions
14. McDiarmid’s Inequality
To prove the Rademacher generalization bound, we need a concentration inequality for functions of independent random variables (not just sums). McDiarmid's inequality is a powerful generalization of Hoeffding's inequality.
The proof mirrors Hoeffding's inequality but uses a martingale decomposition.
Step 1: Martingale differences. Define:
$$ V_i = \mathbb{E}[g \mid Z_1, \ldots, Z_i] - \mathbb{E}[g \mid Z_1, \ldots, Z_{i-1}]. $$Then \(g(Z) - \mathbb{E}[g] = \sum_{i=1}^m V_i\) (telescoping), and \(\mathbb{E}[V_i \mid Z_1, \ldots, Z_{i-1}] = 0\) (martingale difference property).
Step 2: Bounding \(V_i\). For fixed \(Z_1, \ldots, Z_{i-1}\), define:
$$ g_i(z) = \mathbb{E}[g(Z) \mid Z_1, \ldots, Z_{i-1}, Z_i = z]. $$Then \(V_i = g_i(Z_i) - \mathbb{E}[g_i(Z_i)]\). Since \(g\) has bounded differences with constant \(c_i\) in coordinate \(i\), the function \(g_i\) (which is an average of \(g\) over the remaining coordinates) also has range at most \(c_i\). So \(V_i\) is zero-mean and lies in an interval of width at most \(c_i\).
Step 3: Chernoff method. For any \(s > 0\):
$$ \Pr[g(Z) - \mathbb{E}[g] \geq t] = \Pr\!\left[e^{s \sum V_i} \geq e^{st}\right] \leq e^{-st}\, \mathbb{E}\!\left[e^{s\sum V_i}\right]. $$Step 4: Iterative conditioning.
$$\mathbb{E}\!\left[e^{s\sum_{i=1}^m V_i}\right] = \mathbb{E}\!\left[e^{s\sum_{i=1}^{m-1} V_i} \cdot \mathbb{E}\!\left[e^{sV_m} \mid Z_1, \ldots, Z_{m-1}\right]\right].$$By Hoeffding's lemma applied to \(V_m\) (which is zero-mean and bounded in width \(c_m\), conditioned on \(Z_1, \ldots, Z_{m-1}\)):
$$ \mathbb{E}[e^{sV_m} \mid Z_1, \ldots, Z_{m-1}] \leq e^{s^2 c_m^2 / 8}. $$Iterating over \(i = m, m-1, \ldots, 1\):
$$ \mathbb{E}\!\left[e^{s\sum V_i}\right] \leq \prod_{i=1}^m e^{s^2 c_i^2/8} = e^{s^2 \sum c_i^2 / 8}. $$Step 5: Optimize.
$$ \Pr[g(Z) - \mathbb{E}[g] \geq t] \leq e^{-st + s^2\sum c_i^2/8}. $$Minimizing over \(s\) at \(s^* = 4t/\sum c_i^2\):
$$ \Pr[g(Z) - \mathbb{E}[g] \geq t] \leq \exp\!\left(\frac{-2t^2}{\sum c_i^2}\right). \quad \square $$15. The Rademacher Generalization Bound
We now prove that Rademacher complexity controls generalization, with a complete proof from the ingredients we've built.
For binary classification with \(f_{h}(x,y) = \mathbf{1}[h(x) \neq y]\), this says:
Let \(\Phi(S) = \sup_{f \in \mathcal{F}} (\mathbb{E}_{\mathcal{D}}[f] - \hat{\mathbb{E}}_S[f])\) where \(\hat{\mathbb{E}}_S[f] = \frac{1}{m}\sum f(z_i)\).
Step 1: Bound \(\mathbb{E}[\Phi(S)]\) by Rademacher complexity.
Let \(S' = (z'_1, \ldots, z'_m)\) be an independent ghost sample. Since \(\mathbb{E}_{S'}[\hat{\mathbb{E}}_{S'}[f]] = \mathbb{E}_{\mathcal{D}}[f]\):
$$ \mathbb{E}_S[\Phi(S)] = \mathbb{E}_S\!\left[\sup_{f \in \mathcal{F}} \left(\mathbb{E}_{\mathcal{D}}[f] - \hat{\mathbb{E}}_S[f]\right)\right] = \mathbb{E}_S\!\left[\sup_{f \in \mathcal{F}} \left(\mathbb{E}_{S'}[\hat{\mathbb{E}}_{S'}[f]] - \hat{\mathbb{E}}_S[f]\right)\right]. $$Since \(\sup\) of an expectation \(\leq\) expectation of \(\sup\) (by Jensen, since \(\sup\) is convex):
$$ \leq \mathbb{E}_{S,S'}\!\left[\sup_{f \in \mathcal{F}} \left(\hat{\mathbb{E}}_{S'}[f] - \hat{\mathbb{E}}_S[f]\right)\right] = \mathbb{E}_{S,S'}\!\left[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^m \left(f(z'_i) - f(z_i)\right)\right]. $$Since \((z_i, z'_i)\) are i.i.d. pairs, swapping \(z_i \leftrightarrow z'_i\) doesn't change the distribution. So multiplying each difference by an independent Rademacher \(\sigma_i\) doesn't change the expectation:
$$ = \mathbb{E}_{S,S',\sigma}\!\left[\sup_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^m \sigma_i\left(f(z'_i) - f(z_i)\right)\right]. $$By the triangle inequality for \(\sup\):
$$ \leq \mathbb{E}_{S,\sigma}\!\left[\sup_{f} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z_i)\right] + \mathbb{E}_{S',\sigma}\!\left[\sup_{f} \frac{1}{m}\sum_{i=1}^m \sigma_i f(z'_i)\right] = 2\mathcal{R}_m(\mathcal{F}). $$Wait — why the split? We used \(\sup(A + B) \leq \sup A + \sup B\) and then the fact that \(-\sigma_i\) is also Rademacher (so the term with \(-f(z_i)\) becomes \(+\sigma_i f(z_i)\) after flipping signs). Each term is the same expectation by symmetry.
So: \(\mathbb{E}_S[\Phi(S)] \leq 2\mathcal{R}_m(\mathcal{F})\).
Step 2: Concentrate \(\Phi\) around its mean.
Changing one sample point \(z_i\) in \(S\) changes \(\hat{\mathbb{E}}_S[f]\) by at most \(1/m\) for each \(f\), so \(\Phi(S)\) changes by at most \(1/m\). By McDiarmid's inequality:
$$ \Pr\!\left[\Phi(S) - \mathbb{E}[\Phi(S)] \geq t\right] \leq e^{-2mt^2}. $$Step 3: Combine. Setting \(e^{-2mt^2} = \delta\), so \(t = \sqrt{\ln(1/\delta)/(2m)}\):
$$ \Pr\!\left[\Phi(S) > 2\mathcal{R}_m(\mathcal{F}) + \sqrt{\frac{\ln(1/\delta)}{2m}}\right] \leq \delta. $$This means: with probability \(\geq 1 - \delta\), for all \(f \in \mathcal{F}\):
$$ \mathbb{E}_{\mathcal{D}}[f] \leq \hat{\mathbb{E}}_S[f] + 2\mathcal{R}_m(\mathcal{F}) + \sqrt{\frac{\ln(1/\delta)}{2m}}. \quad \square $$16. Massart’s Lemma and Properties of Rademacher Complexity
We now develop tools for computing Rademacher complexity in specific cases.
Massart’s Lemma
For any \(\lambda > 0\):
$$ \mathbb{E}\!\left[\max_{a \in A} \sum_i \sigma_i a_i\right] = \frac{1}{\lambda}\mathbb{E}\!\left[\ln \exp\!\left(\lambda \max_{a \in A} \sum_i \sigma_i a_i\right)\right]. $$By Jensen's inequality (\(\ln\) is concave):
$$ \leq \frac{1}{\lambda}\ln \mathbb{E}\!\left[\exp\!\left(\lambda \max_{a \in A} \sum_i \sigma_i a_i\right)\right] = \frac{1}{\lambda}\ln \mathbb{E}\!\left[\max_{a \in A} \exp\!\left(\lambda \sum_i \sigma_i a_i\right)\right]. $$Since \(\max \leq \sum\):
$$ \leq \frac{1}{\lambda}\ln \sum_{a \in A} \mathbb{E}\!\left[\exp\!\left(\lambda \sum_i \sigma_i a_i\right)\right]. $$By independence of \(\sigma_i\):
$$ = \frac{1}{\lambda}\ln \sum_{a \in A} \prod_{i=1}^m \mathbb{E}[e^{\lambda \sigma_i a_i}]. $$Now \(\mathbb{E}[e^{\lambda \sigma_i a_i}] = \cosh(\lambda a_i) \leq e^{\lambda^2 a_i^2/2}\) (this is Hoeffding's lemma for Rademacher \(\sigma_i\) bounded in \(\{-1, +1\}\)). So:
$$ \leq \frac{1}{\lambda}\ln \sum_{a \in A} e^{\lambda^2 \|a\|^2/2} \leq \frac{1}{\lambda}\ln\!\left(|A| \cdot e^{\lambda^2 B^2/2}\right) = \frac{\ln|A|}{\lambda} + \frac{\lambda B^2}{2}. $$Minimizing over \(\lambda\): set \(\lambda = \sqrt{2\ln|A|}/B\), giving:
$$ \frac{B\sqrt{\ln|A|}}{\sqrt{2}} + \frac{\sqrt{2\ln|A|} \cdot B}{2 \cdot B} \cdot B = B\sqrt{2\ln|A|}. \quad \square $$Rademacher Complexity of Finite Classes
For a fixed sample \(S\), the set \(A = \{(f(z_1), \ldots, f(z_m)) : f \in \mathcal{F}\} \subseteq \{0,1\}^m\) has \(|A| \leq |\mathcal{H}|\) and each \(a \in A\) has \(\|a\| \leq \sqrt{m}\). By Massart:
$$ \hat{\mathcal{R}}_S(\mathcal{F}) = \frac{1}{m}\mathbb{E}_\sigma\!\left[\max_{a \in A} \sum \sigma_i a_i\right] \leq \frac{\sqrt{m} \cdot \sqrt{2\ln|\mathcal{H}|}}{m} = \sqrt{\frac{2\ln|\mathcal{H}|}{m}}. $$Taking \(\mathbb{E}_S\) preserves the bound. \(\square\)
Plugging into the Rademacher generalization bound recovers (up to constants) the finite class bound from Part 1. So Rademacher complexity subsumes the earlier result.
Rademacher Complexity of Linear Classes
This gives a generalization bound for linear models that depends on the norm constraints \(B, R\) and scales as \(1/\sqrt{m}\), independent of the input dimension.
17. The Connection: VC Dimension and Rademacher Complexity
The two measures are tightly linked. The growth function (controlled by VC dimension via Sauer–Shelah) bounds the Rademacher complexity:
For a fixed sample \(S\), the set of restrictions \(\mathcal{H}_S = \{(h(x_1), \ldots, h(x_m)) : h \in \mathcal{H}\} \subseteq \{0,1\}^m\) satisfies \(|\mathcal{H}_S| \leq \Pi_{\mathcal{H}}(m)\). Each vector has norm \(\leq \sqrt{m}\). By Massart's lemma:
$$ \hat{\mathcal{R}}_S(\mathcal{H}) = \frac{1}{m}\mathbb{E}_\sigma\!\left[\max_{a \in \mathcal{H}_S} \sum \sigma_i a_i\right] \leq \frac{\sqrt{m} \cdot \sqrt{2\ln \Pi_{\mathcal{H}}(m)}}{m} = \sqrt{\frac{2\ln \Pi_{\mathcal{H}}(m)}{m}}. $$By Sauer–Shelah: \(\Pi_{\mathcal{H}}(m) \leq (em/d)^d\), so \(\ln \Pi_{\mathcal{H}}(m) \leq d\ln(em/d)\).
$$ \hat{\mathcal{R}}_S(\mathcal{H}) \leq \sqrt{\frac{2d\ln(em/d)}{m}}. $$Since this holds for every \(S\), \(\mathcal{R}_m(\mathcal{H}) = \mathbb{E}_S[\hat{\mathcal{R}}_S(\mathcal{H})] \leq \sqrt{\frac{2d\ln(em/d)}{m}}\). \(\square\)
The converse direction also holds (up to constants): if \(\mathcal{R}_m(\mathcal{H}) \to 0\) as \(m \to \infty\), then \(\operatorname{VCdim}(\mathcal{H}) < \infty\). So Rademacher complexity and VC dimension characterize the same notion of learnability, but Rademacher complexity gives finer, data-dependent information.
18. The Beautiful Inversion
Let's step back and see the full chain of ideas:
Each step crosses a conceptual boundary:
- Sauer–Shelah converts a combinatorial property (can't shatter large sets) into a counting bound (polynomial growth).
- Massart's lemma converts a counting bound into a probabilistic bound (Rademacher complexity).
- Symmetrization + McDiarmid convert a probabilistic bound into a statistical guarantee (uniform convergence).
- ERM converts a statistical guarantee into a prediction algorithm.
And the No-Free-Lunch theorem closes the loop: if VC dimension is infinite (the combinatorial property fails), then no algorithm can learn (the prediction goal fails). The chain is an if-and-only-if.
The question "when does learning work?" has a precise, beautiful answer: when the hypothesis class can't represent everything.
VC dimension quantifies "how much" the class can represent. Finite VC dimension means polynomial growth, which means noise can't be fit, which means training error approximates true error, which means ERM works. Infinite VC dimension means exponential growth, which means noise can be fit, which means overfitting is unavoidable, which means no algorithm works.
Rademacher complexity refines this binary verdict into a continuous measure: it tells you not just whether you can learn, but how fast, and it adapts to the specific data you're working with.
Appendix: Proof Details and Worked Examples
A.1 Why \(\cosh(x) \leq e^{x^2/2}\)
This identity is used in Massart's lemma. We verify it term by term:
$$ \cosh(x) = \sum_{k=0}^{\infty} \frac{x^{2k}}{(2k)!}, \qquad e^{x^2/2} = \sum_{k=0}^{\infty} \frac{x^{2k}}{2^k \cdot k!}. $$For each \(k\): \((2k)! = (2k)(2k-1)\cdots 1 \geq 2^k \cdot k!\) (pair up consecutive terms: \((2k)(2k-1) \geq 2k\), etc.). So \(\frac{1}{(2k)!} \leq \frac{1}{2^k k!}\), and each term of \(\cosh\) is \(\leq\) the corresponding term of \(e^{x^2/2}\). \(\checkmark\)
A.2 The Rademacher Complexity of Thresholds
\(\mathcal{H}\) = thresholds on \(\mathbb{R}\), \(\operatorname{VCdim} = 1\). For a sample of size \(m\), \(\Pi_{\mathcal{H}}(m) = m + 1\) (each threshold selects a suffix of the sorted points, plus the empty set). By our bound:
$$ \mathcal{R}_m(\mathcal{H}) \leq \sqrt{\frac{2\ln(m+1)}{m}}. $$For \(m = 100\): \(\mathcal{R}_{100} \leq \sqrt{2\ln(101)/100} \approx \sqrt{0.0924} \approx 0.304\).
For \(m = 10{,}000\): \(\mathcal{R}_{10000} \leq \sqrt{2\ln(10001)/10000} \approx \sqrt{0.00184} \approx 0.043\).
Vanishes as \(O(\sqrt{\ln m / m})\) — thresholds are very learnable. \(\checkmark\)
A.3 Complete Worked Example: Generalization Bound for Linear Classifiers in \(\mathbb{R}^{10}\)
Setup: \(\mathcal{H}\) = linear classifiers in \(\mathbb{R}^{10}\), \(d = \operatorname{VCdim}(\mathcal{H}) = 11\). We want uniform convergence within \(\epsilon = 0.1\) with confidence \(\delta = 0.05\), using the VC bound:
$$ 4\left(\frac{2em}{d}\right)^d e^{-m\epsilon^2/8} \leq \delta. $$Substituting: \(4(2em/11)^{11} \cdot e^{-m/800} \leq 0.05\).
Taking logarithms: \(\ln 4 + 11\ln(2em/11) - m/800 \leq \ln(0.05)\).
At \(m = 12{,}000\): LHS \(\approx 1.39 + 11\ln(5{,}922) - 15 = 1.39 + 11(8.69) - 15 = 1.39 + 95.56 - 15 = 81.95\) and RHS \(= -3.0\). Not enough.
At \(m = 100{,}000\): LHS \(\approx 1.39 + 11\ln(49{,}350) - 125 = 1.39 + 118.8 - 125 = -4.8\). This works since \(-4.8 < -3.0\). \(\checkmark\)
So the VC bound asks for about \(m \approx 100{,}000\) samples. This is conservative — the Rademacher bound (which can use norm constraints) typically gives much tighter sample requirements in practice.