The Johnson–Lindenstrauss Lemma from Scratch
May 17, 2026
The Johnson–Lindenstrauss (JL) lemma says something that feels like it shouldn't be true: you can take any set of \(n\) points in absurdly high-dimensional space and squash them down into just \(O(\log n)\) dimensions, while approximately preserving every pairwise distance.
This post is a complete, self-contained walkthrough of the easiest proof I know. Every tool we use — expectation, variance, moment generating functions, Markov's inequality, the union bound — will be built up from definitions with examples, so you can see how a collection of small, elementary pieces snap together into one of the most useful results in all of high-dimensional geometry.
Proof Roadmap
Here is the complete chain of ideas we will build, one on top of the other:
- Expectation & Variance — the two fundamental summaries of a random variable.
- Moment Generating Functions (MGFs) — a device that encodes all moments in one function.
- Markov's Inequality — the most basic probability tail bound.
- The MGF Chernoff Method — combining Markov with MGFs to get exponentially sharp tail bounds.
- Sub-Gaussian Random Variables — the key property of Gaussian projections.
- Concentration of \(\chi^2\) — a sum of squared Gaussians concentrates around its mean.
- The JL Lemma for a Single Pair — random projection nearly preserves one distance.
- Union Bound — from one pair to all \(\binom{n}{2}\) pairs simultaneously.
- The Full JL Lemma — putting it all together.
1. The Statement
Let's state what we're trying to prove before building the machinery.
Theorem (Johnson–Lindenstrauss, 1984)
For any \(0 < \eps < 1\), any integer \(n \ge 2\), and any set \(P\) of \(n\) points in \(\R^d\), there exists a map \(f : \R^d \to \R^k\) with
$$k = O\!\left(\frac{\log n}{\eps^2}\right)$$such that for every pair \(u, v \in P\):
$$(1 - \eps)\,\norm{u - v}^2 \;\le\; \norm{f(u) - f(v)}^2 \;\le\; (1 + \eps)\,\norm{u - v}^2.$$Read that again: the target dimension \(k\) depends only on the number of points \(n\) and the distortion tolerance \(\eps\). It is completely independent of the original dimension \(d\). You could start in \(\R^{1{,}000{,}000}\) and, with say 1000 points and 10% distortion, land in roughly \(k \approx 700\) dimensions.
Concrete Example
Suppose you have \(n = 10{,}000\) images, each represented as a vector in \(\R^{1{,}000{,}000}\) (a million pixel-intensities). You want to preserve all pairwise distances up to \(\eps = 0.1\) (10% distortion). The JL lemma says you can project down to:
$$k = C \cdot \frac{\log 10{,}000}{0.01} = C \cdot \frac{4 \ln 10}{0.01} \approx 921\,C$$where \(C\) is a modest constant (around 4–8 depending on the proof). So roughly \(k \sim 4{,}000\) to \(8{,}000\) dimensions suffice — a compression from a million down to a few thousand, with all distances nearly intact.
The proof strategy: we will construct \(f\) as a random linear map. Specifically, we take a \(k \times d\) matrix \(A\) whose entries are i.i.d. \(\mathcal{N}(0, 1)\), and define:
$$f(x) = \frac{1}{\sqrt{k}}\,A\,x.$$We will show that for any fixed pair of points, this map preserves their distance with high probability. Then a union bound over all \(\binom{n}{2}\) pairs finishes the job.
2. Expectation and Variance
We start at the very beginning. These are the two most fundamental summaries of a random variable.
Definition: Expected Value
Let \(X\) be a random variable taking values in \(\R\). If \(X\) is discrete with possible values \(x_1, x_2, \ldots\) and probabilities \(p_1, p_2, \ldots\), then:
$$\E[X] = \sum_{i} x_i \, p_i.$$If \(X\) is continuous with probability density function \(f_X\), then:
$$\E[X] = \int_{-\infty}^{\infty} x \, f_X(x) \, dx.$$The expected value is the "center of mass" of the distribution of \(X\).
Example: Fair Die
Roll a fair six-sided die. The random variable \(X\) takes values \(1, 2, 3, 4, 5, 6\), each with probability \(1/6\).
$$\E[X] = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = \frac{21}{6} = 3.5.$$The expected value 3.5 is not even a possible outcome — it's the long-run average.
Linearity of Expectation
The single most useful property of expectation: for any random variables \(X, Y\) (dependent or not):
$$\E[X + Y] = \E[X] + \E[Y].$$Proof (discrete case)
Let \(X\) and \(Y\) be discrete random variables on the same probability space. By definition of expectation over the joint distribution:
$$\E[X + Y] = \sum_{x}\sum_{y}(x + y)\,\Pr(X = x, Y = y)$$ $$= \sum_{x}\sum_{y} x \,\Pr(X = x, Y = y) + \sum_{x}\sum_{y} y \,\Pr(X = x, Y = y)$$In the first sum, \(x\) doesn't depend on \(y\), so summing over \(y\) gives the marginal:
$$= \sum_{x} x \,\Pr(X = x) + \sum_{y} y \, \Pr(Y = y) = \E[X] + \E[Y].$$Notice we never assumed independence.
■
Definition: Variance
The variance of a random variable \(X\) measures how spread out it is around its mean \(\mu = \E[X]\):
$$\Var(X) = \E\!\big[(X - \mu)^2\big] = \E[X^2] - (\E[X])^2.$$Proof of the shortcut formula \(\Var(X) = \E[X^2] - (\E[X])^2\)
Expand the square inside the expectation:
$$\Var(X) = \E[(X - \mu)^2] = \E[X^2 - 2\mu X + \mu^2]$$By linearity of expectation (and the fact that \(\mu\) is a constant):
$$= \E[X^2] - 2\mu\,\E[X] + \mu^2 = \E[X^2] - 2\mu^2 + \mu^2 = \E[X^2] - \mu^2.$$■
Example: Fair Die Variance
We already know \(\E[X] = 3.5\). Now:
$$\E[X^2] = \frac{1}{6}(1 + 4 + 9 + 16 + 25 + 36) = \frac{91}{6} \approx 15.17.$$ $$\Var(X) = \frac{91}{6} - \left(\frac{7}{2}\right)^2 = \frac{91}{6} - \frac{49}{4} = \frac{182 - 147}{12} = \frac{35}{12} \approx 2.92.$$Variance of Independent Sums
Fact: Variance of a Sum of Independent Variables
If \(X_1, \ldots, X_k\) are independent, then:
$$\Var(X_1 + \cdots + X_k) = \Var(X_1) + \cdots + \Var(X_k).$$Proof (for two variables; the general case follows by induction)
Let \(S = X + Y\) with \(X, Y\) independent. Write \(\mu_X = \E[X]\), \(\mu_Y = \E[Y]\).
$$\Var(S) = \E[(S - \E[S])^2] = \E[((X - \mu_X) + (Y - \mu_Y))^2]$$ $$= \E[(X - \mu_X)^2] + 2\,\E[(X - \mu_X)(Y - \mu_Y)] + \E[(Y - \mu_Y)^2].$$The cross term: since \(X, Y\) are independent,
$$\E[(X - \mu_X)(Y - \mu_Y)] = \E[X - \mu_X]\cdot\E[Y - \mu_Y] = 0 \cdot 0 = 0.$$So \(\Var(S) = \Var(X) + \Var(Y)\).
■
3. Moment Generating Functions
The moment generating function (MGF) is a single function that packages all the moments \(\E[X], \E[X^2], \E[X^3], \ldots\) of a random variable into one tidy object. It is the secret weapon behind every sharp concentration inequality.
Definition: Moment Generating Function
For a random variable \(X\), the MGF is the function:
$$M_X(t) = \E[e^{tX}]$$defined for all \(t \in \R\) where this expectation is finite.
Why "moment generating"?
Recall the Taylor expansion of the exponential:
$$e^{tX} = 1 + tX + \frac{(tX)^2}{2!} + \frac{(tX)^3}{3!} + \cdots$$Taking expectations term-by-term (justified when the MGF exists in a neighborhood of 0):
$$M_X(t) = \E[e^{tX}] = 1 + t\,\E[X] + \frac{t^2}{2!}\,\E[X^2] + \frac{t^3}{3!}\,\E[X^3] + \cdots$$So the \(n\)-th derivative at \(t = 0\) gives the \(n\)-th moment:
$$M_X^{(n)}(0) = \E[X^n].$$Example: MGF of a Standard Normal
Let \(Z \sim \mathcal{N}(0, 1)\) with density \(f_Z(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\). Then:
$$M_Z(t) = \E[e^{tZ}] = \int_{-\infty}^{\infty} e^{tz} \cdot \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz.$$Complete the square in the exponent:
$$tz - \frac{z^2}{2} = -\frac{1}{2}(z^2 - 2tz) = -\frac{1}{2}(z - t)^2 + \frac{t^2}{2}.$$So:
$$M_Z(t) = e^{t^2/2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-(z-t)^2/2}\,dz = e^{t^2/2} \cdot 1 = e^{t^2/2}.$$The integral equals 1 because the integrand is the density of \(\mathcal{N}(t, 1)\).
So the MGF of a standard normal is:
We can verify the moments: \(M_Z'(t) = t\,e^{t^2/2}\), so \(M_Z'(0) = 0 = \E[Z]\). And \(M_Z''(t) = (1 + t^2)\,e^{t^2/2}\), so \(M_Z''(0) = 1 = \E[Z^2]\). Checks out.
Key Property: MGFs Factor Under Independence
Fact
If \(X\) and \(Y\) are independent, then:
$$M_{X+Y}(t) = M_X(t) \cdot M_Y(t).$$Proof
$$M_{X+Y}(t) = \E[e^{t(X+Y)}] = \E[e^{tX} \cdot e^{tY}].$$Since \(X\) and \(Y\) are independent, \(e^{tX}\) and \(e^{tY}\) are also independent (functions of independent r.v.'s are independent). For independent random variables, the expectation of a product equals the product of expectations:
$$= \E[e^{tX}] \cdot \E[e^{tY}] = M_X(t) \cdot M_Y(t).$$■
4. Markov's Inequality
Markov's inequality is the most elementary tail bound in all of probability. It's almost embarrassingly simple, yet it is the foundation of every sharper bound we'll derive.
Theorem (Markov's Inequality)
Let \(X\) be a non-negative random variable. Then for any \(a > 0\):
$$\Pr(X \ge a) \le \frac{\E[X]}{a}.$$Proof
We use a simple indicator trick. Define the indicator random variable:
$$\mathbf{1}_{X \ge a} = \begin{cases} 1 & \text{if } X \ge a \\ 0 & \text{otherwise.}\end{cases}$$Since \(X \ge 0\), whenever \(X \ge a\) we certainly have \(X \ge a \cdot 1 = a\). In general:
$$X \ge a \cdot \mathbf{1}_{X \ge a}.$$This is the heart of the proof: if \(X < a\), the right side is 0 and the inequality holds trivially (since \(X \ge 0\)). If \(X \ge a\), the right side is \(a\) and the inequality says \(X \ge a\), which is true by assumption.
Now take expectations of both sides (expectation preserves inequalities):
$$\E[X] \ge a \cdot \E[\mathbf{1}_{X \ge a}] = a \cdot \Pr(X \ge a).$$Dividing both sides by \(a > 0\):
$$\Pr(X \ge a) \le \frac{\E[X]}{a}.$$■
Example
Suppose the average commute time in a city is \(\E[X] = 30\) minutes. Without knowing anything else about the distribution, Markov tells us:
$$\Pr(X \ge 120) \le \frac{30}{120} = \frac{1}{4}.$$At most 25% of people commute 2 hours or more. This bound is loose — the real fraction is probably much smaller — but it's remarkable that we got anything at all from just the average.
5. The Chernoff Bound Method
Here's the beautiful idea: Markov's inequality is weak because it only uses the mean. But what if we applied Markov to a cleverer non-negative random variable derived from \(X\)?
The Core Trick
For any \(t > 0\), the function \(x \mapsto e^{tx}\) is non-negative and strictly increasing. So for any real-valued \(X\) and any \(a\):
$$\Pr(X \ge a) = \Pr(e^{tX} \ge e^{ta}).$$Now \(e^{tX}\) is non-negative, so we can apply Markov:
$$\Pr(X \ge a) = \Pr(e^{tX} \ge e^{ta}) \le \frac{\E[e^{tX}]}{e^{ta}} = e^{-ta}\,M_X(t).$$Since this holds for every \(t > 0\), we can optimize:
This is the Chernoff bound (or Chernoff method). Let's make sure every step is clear:
Step-by-step derivation
Step 1: Pick any \(t > 0\). Since \(e^{tx}\) is strictly increasing: \(X \ge a \iff e^{tX} \ge e^{ta}\).
Step 2: The random variable \(e^{tX}\) is always positive (the exponential is always positive). Apply Markov's inequality with the non-negative random variable \(e^{tX}\) and threshold \(e^{ta}\):
$$\Pr(e^{tX} \ge e^{ta}) \le \frac{\E[e^{tX}]}{e^{ta}}.$$Step 3: Recognize that \(\E[e^{tX}] = M_X(t)\), the moment generating function.
Step 4: Since the bound holds for every \(t > 0\), take the infimum (tightest bound) over all \(t > 0\).
■
Example: Chernoff Bound for a Standard Normal
Let \(Z \sim \mathcal{N}(0,1)\). We showed \(M_Z(t) = e^{t^2/2}\). So for any \(a > 0\):
$$\Pr(Z \ge a) \le \inf_{t > 0}\; e^{-ta} \cdot e^{t^2/2} = \inf_{t > 0}\; e^{t^2/2 - ta}.$$Minimize the exponent \(g(t) = t^2/2 - ta\). Take derivative: \(g'(t) = t - a = 0\), so \(t^* = a\). Then:
$$\Pr(Z \ge a) \le e^{a^2/2 - a^2} = e^{-a^2/2}.$$This is a Gaussian tail bound: the probability of a standard normal exceeding \(a\) drops like \(e^{-a^2/2}\). For \(a = 3\): Markov gives only \(\Pr(|Z| \ge 3) \le \E[Z^2]/9 = 1/9 \approx 0.11\), while Chernoff gives \(\Pr(Z \ge 3) \le e^{-4.5} \approx 0.011\), much tighter.
6. Sub-Gaussian Random Variables
The Chernoff bound for the standard normal gave us \(\Pr(Z \ge a) \le e^{-a^2/2}\). The key input was that \(M_Z(t) = e^{t^2/2}\). Many other random variables have MGFs bounded by a similar Gaussian-like form. These are called sub-Gaussian.
Definition: Sub-Gaussian Random Variable
A random variable \(X\) with \(\E[X] = 0\) is called \(\sigma^2\)-sub-Gaussian if its MGF satisfies:
$$M_X(t) = \E[e^{tX}] \le e^{\sigma^2 t^2 / 2} \qquad \text{for all } t \in \R.$$In words: the MGF of \(X\) is dominated by the MGF of a \(\mathcal{N}(0, \sigma^2)\) random variable.
Examples of Sub-Gaussian Random Variables
| Random Variable | Sub-Gaussian parameter \(\sigma^2\) |
|---|---|
| \(\mathcal{N}(0, \sigma^2)\) | \(\sigma^2\) (exactly) |
| Rademacher: \(\pm 1\) with prob \(1/2\) each | \(1\) |
| Uniform on \([-a, a]\) | \(a^2\) |
| Any bounded r.v. on \([a, b]\) with mean 0 | \((b-a)^2/4\) (Hoeffding) |
Why is a Rademacher variable 1-sub-Gaussian?
Let \(X = \pm 1\) with equal probability. Then:
$$M_X(t) = \E[e^{tX}] = \frac{e^t + e^{-t}}{2} = \cosh(t).$$We need to show \(\cosh(t) \le e^{t^2/2}\) for all \(t\). Use the Taylor series:
$$\cosh(t) = \sum_{k=0}^{\infty} \frac{t^{2k}}{(2k)!}, \qquad e^{t^2/2} = \sum_{k=0}^{\infty} \frac{t^{2k}}{2^k \cdot k!}.$$We need \(\frac{1}{(2k)!} \le \frac{1}{2^k \cdot k!}\) for all \(k \ge 0\). This is equivalent to \(2^k \cdot k! \le (2k)!\).
Check by induction. Base case \(k=0\): \(1 \le 1\). Inductive step: assume \(2^k \cdot k! \le (2k)!\). Then:
$$2^{k+1}(k+1)! = 2 \cdot (k+1) \cdot 2^k \cdot k! \le 2(k+1) \cdot (2k)!.$$We need \(2(k+1)(2k)! \le (2k+2)! = (2k+2)(2k+1)(2k)!\), i.e., \(2(k+1) \le (2k+2)(2k+1)\). Since \(2k+2 = 2(k+1)\), this becomes \(1 \le 2k+1\), true for \(k \ge 0\).
■
Why sub-Gaussians matter
Applying the Chernoff method to a \(\sigma^2\)-sub-Gaussian variable \(X\) immediately gives:
$$\Pr(X \ge a) \le \inf_{t > 0}\; e^{-ta} \cdot e^{\sigma^2 t^2/2}.$$Optimizing: \(g(t) = \sigma^2 t^2/2 - ta\), \(g'(t) = \sigma^2 t - a = 0\), so \(t^* = a/\sigma^2\):
By symmetry (if \(-X\) is also \(\sigma^2\)-sub-Gaussian, which it is when \(X\) is symmetric or Gaussian):
$$\Pr(|X| \ge a) \le 2\,e^{-a^2/(2\sigma^2)}.$$7. Gaussian Vectors and the Chi-Squared Distribution
A random projection maps a vector \(v \in \R^d\) to \(\frac{1}{\sqrt{k}} A v\) where \(A\) is a \(k \times d\) matrix of i.i.d. standard normals. To analyze what happens to the length of \(v\), we need to understand the distribution of the squared norm of a Gaussian vector.
Definition: Chi-Squared Distribution
If \(Z_1, \ldots, Z_k\) are independent \(\mathcal{N}(0,1)\) random variables, then the sum
$$Q = Z_1^2 + Z_2^2 + \cdots + Z_k^2$$has the chi-squared distribution with \(k\) degrees of freedom, written \(Q \sim \chi^2_k\).
Mean and Variance of \(\chi^2_k\)
Fact
If \(Q \sim \chi^2_k\), then \(\E[Q] = k\) and \(\Var(Q) = 2k\).
Proof
Mean: Each \(Z_i \sim \mathcal{N}(0,1)\), so \(\E[Z_i^2] = \Var(Z_i) + (\E[Z_i])^2 = 1 + 0 = 1\). By linearity:
$$\E[Q] = \sum_{i=1}^{k} \E[Z_i^2] = k.$$Variance: We need \(\Var(Z_i^2)\). Since \(\E[Z_i^2] = 1\), we need \(\E[Z_i^4]\).
For \(Z \sim \mathcal{N}(0,1)\):
$$\E[Z^4] = \int_{-\infty}^{\infty} z^4 \cdot \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz.$$Integrate by parts with \(u = z^3\), \(dv = z \cdot \frac{1}{\sqrt{2\pi}}e^{-z^2/2}\,dz\). Note \(v = -\frac{1}{\sqrt{2\pi}}e^{-z^2/2}\). The boundary terms vanish, giving:
$$\E[Z^4] = 3\,\E[Z^2] = 3.$$(More directly: from the MGF \(M_Z(t) = e^{t^2/2}\), we get \(M_Z^{(4)}(0) = 3\).)
So \(\Var(Z_i^2) = \E[Z_i^4] - (\E[Z_i^2])^2 = 3 - 1 = 2\).
Since the \(Z_i\) are independent, the \(Z_i^2\) are also independent, so:
$$\Var(Q) = \sum_{i=1}^{k} \Var(Z_i^2) = 2k.$$■
8. Concentration of Chi-Squared: The Key Lemma
This is the technical heart of the entire proof. We need to show that \(Q/k\) is not just close to 1 on average, but close to 1 with exponentially high probability.
Lemma (Chi-Squared Concentration)
Let \(Q \sim \chi^2_k\). For any \(0 < \eps < 1\):
$$\Pr\!\left(\frac{Q}{k} \ge 1 + \eps\right) \le \left(\frac{e^{\eps}}{(1 + \eps)^{1+\eps}}\right)^{k/2}$$ $$\Pr\!\left(\frac{Q}{k} \le 1 - \eps\right) \le \left(\frac{e^{-\eps}}{(1 - \eps)^{1-\eps}}\right)^{k/2}$$In particular, both probabilities are at most \(e^{-k\eps^2/8}\) for \(0 < \eps < 1\).
Before proving this, let's build the two ingredients we need: the MGF of a \(\chi^2_1\) random variable, and the Chernoff method applied to a sum.
Step 1: MGF of \(Z^2\) where \(Z \sim \mathcal{N}(0,1)\)
Claim
If \(Z \sim \mathcal{N}(0,1)\) and \(t < 1/2\), then:
$$M_{Z^2}(t) = \E[e^{tZ^2}] = \frac{1}{\sqrt{1 - 2t}}.$$Proof
Compute directly:
$$\E[e^{tZ^2}] = \int_{-\infty}^{\infty} e^{tz^2} \cdot \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-z^2(1/2 - t)}\,dz.$$For this to converge, we need \(1/2 - t > 0\), i.e., \(t < 1/2\). Let \(\sigma^2 = \frac{1}{1 - 2t}\). Then:
$$= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-z^2/(2\sigma^2)}\,dz = \frac{1}{\sqrt{2\pi}} \cdot \sigma\sqrt{2\pi} = \sigma = \frac{1}{\sqrt{1-2t}}.$$The integral equals \(\sigma\sqrt{2\pi}\) because it's the integral of an (unnormalized) Gaussian with variance \(\sigma^2\).
■
Step 2: MGF of the Chi-Squared Sum
Since \(Q = Z_1^2 + \cdots + Z_k^2\) with independent \(Z_i\), the factoring property of MGFs gives:
$$M_Q(t) = \prod_{i=1}^{k} M_{Z_i^2}(t) = \left(\frac{1}{\sqrt{1-2t}}\right)^k = (1 - 2t)^{-k/2}.$$Step 3: Apply the Chernoff Method
Proof of the Upper Tail: \(\Pr(Q/k \ge 1 + \eps)\)
We want to bound \(\Pr(Q \ge k(1+\eps))\). By the Chernoff method, for any \(0 < t < 1/2\):
$$\Pr(Q \ge k(1+\eps)) \le e^{-tk(1+\eps)} \cdot M_Q(t) = e^{-tk(1+\eps)} \cdot (1-2t)^{-k/2}.$$Optimize over \(t\). Let \(h(t) = -t(1+\eps) - \frac{1}{2}\ln(1-2t)\). We want to minimize this (the exponent is \(k \cdot h(t)\), and we want the bound \(e^{k \cdot h(t)}\) as small as possible).
$$h'(t) = -(1+\eps) + \frac{1}{1-2t} = 0 \implies 1 - 2t = \frac{1}{1+\eps} \implies t^* = \frac{\eps}{2(1+\eps)}.$$Check: \(0 < t^* < 1/2\) for \(\eps > 0\). Good. Substituting back:
$$1 - 2t^* = \frac{1}{1+\eps}, \qquad t^*(1+\eps) = \frac{\eps}{2}.$$ $$h(t^*) = -\frac{\eps}{2} + \frac{1}{2}\ln(1+\eps).$$So the bound is:
$$\Pr\!\left(\frac{Q}{k} \ge 1+\eps\right) \le e^{k(-\eps/2 + \ln(1+\eps)/2)} = \left(\frac{(1+\eps)^{1/2}}{e^{\eps/2}}\right)^k = \left(\frac{e^{\eps}}{(1+\eps)^{1+\eps}}\right)^{k/2}.$$(The last step: raise to the power \(k\), rearrange, and note we can write the exponent in the stated form.)
■
Proof of the Lower Tail: \(\Pr(Q/k \le 1 - \eps)\)
For the lower tail, we use the trick: \(\Pr(Q \le k(1-\eps)) = \Pr(-Q \ge -k(1-\eps))\). Apply the Chernoff method with \(e^{-tQ}\) for \(t > 0\):
$$\Pr(Q \le k(1-\eps)) = \Pr(e^{-tQ} \ge e^{-tk(1-\eps)}) \le e^{tk(1-\eps)} \cdot \E[e^{-tQ}].$$Now \(\E[e^{-tQ}] = M_Q(-t) = (1+2t)^{-k/2}\) (valid for all \(t > 0\)). So:
$$\Pr(Q \le k(1-\eps)) \le e^{tk(1-\eps)} \cdot (1+2t)^{-k/2}.$$Optimize: set \(t^* = \frac{\eps}{2(1-\eps)}\) (by the same calculus). Substituting gives the stated bound.
■
Simplifying to get \(e^{-k\eps^2/8}\)
The exact bounds above are a bit unwieldy. We now show they imply the cleaner bound \(e^{-k\eps^2/8}\) for \(0 < \eps < 1\).
Simplification
It suffices to show that for \(0 < \eps < 1\):
$$\frac{\eps}{2} - \frac{\ln(1+\eps)}{2} \ge \frac{\eps^2}{8}.$$Equivalently: \(\eps - \ln(1+\eps) \ge \eps^2/4\). Define \(f(\eps) = \eps - \ln(1+\eps) - \eps^2/4\). Then \(f(0) = 0\) and:
$$f'(\eps) = 1 - \frac{1}{1+\eps} - \frac{\eps}{2} = \frac{\eps}{1+\eps} - \frac{\eps}{2} = \eps\left(\frac{1}{1+\eps} - \frac{1}{2}\right) = \eps \cdot \frac{1 - \eps}{2(1+\eps)}.$$For \(0 < \eps < 1\), we have \(f'(\eps) > 0\). Since \(f(0) = 0\) and \(f\) is increasing on \((0,1)\), we get \(f(\eps) \ge 0\) for \(\eps \in [0, 1]\). A similar (slightly more involved) calculation handles the lower tail.
Therefore both tails satisfy:
■
Example: How tight is the concentration?
Take \(k = 1000\) and \(\eps = 0.1\). Then:
$$\Pr\!\left(\left|\frac{Q}{1000} - 1\right| \ge 0.1\right) \le 2\,e^{-1000 \cdot 0.01/8} = 2\,e^{-1.25} \approx 0.57.$$That's for \(k = 1000\). With \(k = 10{,}000\):
$$2\,e^{-10000 \cdot 0.01 / 8} = 2\,e^{-12.5} \approx 7.4 \times 10^{-6}.$$The concentration becomes extremely strong as \(k\) grows. The constant 8 in the denominator is not tight — more careful analysis can replace it with roughly \(2 + 2/3\) — but \(1/8\) suffices for our purposes.
9. Random Projection Preserves One Distance
Now we connect chi-squared concentration to random projections. This is where the pieces start clicking together.
The Setup
Let \(A\) be a \(k \times d\) matrix with i.i.d. entries \(A_{ij} \sim \mathcal{N}(0, 1)\). Define the random projection:
$$f(x) = \frac{1}{\sqrt{k}}\,A\,x, \qquad x \in \R^d.$$Fix a single vector \(v \in \R^d\) (think of \(v = u_1 - u_2\) as the difference between two data points). We want to analyze \(\norm{f(v)}^2\).
Lemma (Single-Vector Projection)
For any fixed \(v \in \R^d\) with \(v \ne 0\) and any \(0 < \eps < 1\):
$$\Pr\!\left(\left|\,\norm{f(v)}^2 - \norm{v}^2\,\right| \ge \eps\,\norm{v}^2\right) \le 2\,e^{-k\eps^2/8}.$$Proof
Step 1: Compute \(\norm{f(v)}^2\).
The \(i\)-th coordinate of \(f(v) = \frac{1}{\sqrt{k}} Av\) is:
$$[f(v)]_i = \frac{1}{\sqrt{k}} \sum_{j=1}^{d} A_{ij}\,v_j.$$Each \(A_{ij}\) is independent \(\mathcal{N}(0,1)\), so the sum \(W_i = \sum_{j=1}^{d} A_{ij}\,v_j\) is a linear combination of independent normals.
Step 2: What is the distribution of \(W_i\)?
A linear combination of independent normals is normal. The mean is 0 (since each \(A_{ij}\) has mean 0). The variance is:
$$\Var(W_i) = \sum_{j=1}^{d} v_j^2 \,\Var(A_{ij}) = \sum_{j=1}^{d} v_j^2 = \norm{v}^2.$$So \(W_i \sim \mathcal{N}(0, \norm{v}^2)\), and we can write \(W_i = \norm{v} \cdot Z_i\) where \(Z_i \sim \mathcal{N}(0,1)\).
Step 3: Different rows are independent.
The rows of \(A\) are independent (each is a fresh draw of \(d\) independent Gaussians), so \(W_1, \ldots, W_k\) are independent.
Step 4: Express \(\norm{f(v)}^2\) in terms of chi-squared.
$$\norm{f(v)}^2 = \sum_{i=1}^{k} [f(v)]_i^2 = \frac{1}{k}\sum_{i=1}^{k} W_i^2 = \frac{\norm{v}^2}{k}\sum_{i=1}^{k} Z_i^2 = \frac{\norm{v}^2}{k}\,Q$$where \(Q = \sum_{i=1}^{k} Z_i^2 \sim \chi^2_k\).
Step 5: Apply chi-squared concentration.
$$\left|\norm{f(v)}^2 - \norm{v}^2\right| = \norm{v}^2 \cdot \left|\frac{Q}{k} - 1\right|.$$So:
$$\Pr\!\left(\left|\norm{f(v)}^2 - \norm{v}^2\right| \ge \eps\,\norm{v}^2\right) = \Pr\!\left(\left|\frac{Q}{k} - 1\right| \ge \eps\right) \le 2\,e^{-k\eps^2/8}.$$■
10. The Union Bound
We've shown that a random projection preserves the length of any one vector with high probability. But we need it to work for all \(\binom{n}{2}\) pairwise difference vectors simultaneously. The tool for this is the union bound.
Theorem (Union Bound / Boole's Inequality)
For any (finite or countable) collection of events \(E_1, E_2, \ldots, E_m\):
$$\Pr(E_1 \cup E_2 \cup \cdots \cup E_m) \le \Pr(E_1) + \Pr(E_2) + \cdots + \Pr(E_m).$$Proof
We prove it for two events first, then extend by induction.
Two events: By inclusion-exclusion,
$$\Pr(E_1 \cup E_2) = \Pr(E_1) + \Pr(E_2) - \Pr(E_1 \cap E_2).$$Since \(\Pr(E_1 \cap E_2) \ge 0\), we get \(\Pr(E_1 \cup E_2) \le \Pr(E_1) + \Pr(E_2)\).
General case by induction: Assume true for \(m-1\) events. Then:
$$\Pr\!\left(\bigcup_{i=1}^{m} E_i\right) = \Pr\!\left(\left(\bigcup_{i=1}^{m-1} E_i\right) \cup E_m\right) \le \Pr\!\left(\bigcup_{i=1}^{m-1} E_i\right) + \Pr(E_m)$$ $$\le \sum_{i=1}^{m-1}\Pr(E_i) + \Pr(E_m) = \sum_{i=1}^{m}\Pr(E_i).$$■
Example: The Birthday Bound
What's the probability that among 23 people, at least two share a birthday? The union bound over all \(\binom{23}{2} = 253\) pairs, each with collision probability \(1/365\), gives:
$$\Pr(\text{collision}) \le 253 \cdot \frac{1}{365} \approx 0.693.$$The true answer is about 0.507 — the union bound overestimates because it doesn't account for the overlaps, but it's in the right ballpark and trivial to compute.
11. The Full Proof
We now have every ingredient. Let's assemble them.
Theorem (JL Lemma — Restated)
Let \(P = \{x_1, \ldots, x_n\} \subset \R^d\), let \(0 < \eps < 1\), and let
$$k \ge \frac{8 \ln n}{\eps^2}.$$Draw a \(k \times d\) random matrix \(A\) with i.i.d. \(\mathcal{N}(0,1)\) entries and define \(f(x) = \frac{1}{\sqrt{k}} Ax\). Then with probability at least \(1 - 1/n\), for all pairs \(i, j\):
$$(1 - \eps)\norm{x_i - x_j}^2 \le \norm{f(x_i) - f(x_j)}^2 \le (1 + \eps)\norm{x_i - x_j}^2.$$Proof
Step 1: Linearity reduces distances to lengths.
Since \(f\) is linear: \(f(x_i) - f(x_j) = f(x_i - x_j)\). So:
$$\norm{f(x_i) - f(x_j)}^2 = \norm{f(x_i - x_j)}^2.$$It suffices to show that \(f\) approximately preserves the length of every difference vector \(v_{ij} = x_i - x_j\).
Step 2: Per-pair concentration (from Section 9).
For each fixed pair \((i, j)\) with \(i \lt j\), define the bad event:
$$E_{ij} = \left\{\left|\,\norm{f(v_{ij})}^2 - \norm{v_{ij}}^2\,\right| > \eps\,\norm{v_{ij}}^2\right\}.$$By the Single-Vector Projection Lemma:
$$\Pr(E_{ij}) \le 2\,e^{-k\eps^2/8}.$$Step 3: Union bound over all pairs.
The number of pairs is \(\binom{n}{2} = \frac{n(n-1)}{2} < \frac{n^2}{2}\). By the union bound:
$$\Pr\!\left(\bigcup_{i \lt j} E_{ij}\right) \le \binom{n}{2} \cdot 2\,e^{-k\eps^2/8} \lt \frac{n^2}{2} \cdot 2\,e^{-k\eps^2/8} = n^2\,e^{-k\eps^2/8}.$$Step 4: Choose \(k\) to make this small.
We want \(n^2\,e^{-k\eps^2/8} \le 1/n\), i.e., \(e^{-k\eps^2/8} \le n^{-3}\), i.e.,
$$k\eps^2/8 \ge 3 \ln n \quad \Longleftrightarrow \quad k \ge \frac{24 \ln n}{\eps^2}.$$With \(k = \lceil 24 \ln n / \eps^2 \rceil\), we get:
$$\Pr(\text{some pair distorted by more than } \eps) \le \frac{1}{n}.$$In particular, with probability at least \(1 - 1/n\), every pairwise distance is preserved within factor \((1 \pm \eps)\).
Remark on the constant: We used \(k \ge 24\ln n / \eps^2\) for the \(1 - 1/n\) success probability. More generally, to get failure probability at most \(\delta\), we need:
$$k \ge \frac{8(2\ln n + \ln(1/\delta))}{\eps^2}.$$The constant 8 can be improved to roughly 3 with a tighter analysis of chi-squared tails. The \(O(\log n / \eps^2)\) scaling is what matters.
■
The Johnson–Lindenstrauss Lemma
A random Gaussian projection into \(k = O(\log n / \eps^2)\) dimensions preserves all pairwise distances within factor \((1 \pm \eps)\), with high probability.
12. The Proof at a Glance
Here is the entire logical chain, compressed into one view:
Dependency Graph of the Proof
Linearity of Expectation Variance of Independent Sums \ / v v E[Q] = k Var(Q) = 2k \ / \ / v v MGF of N(0,1): e^{t²/2} | v MGF of Z²: (1-2t)^{-1/2} | v MGF of χ²_k: (1-2t)^{-k/2} | Markov's Inequality | v Chernoff Method | v χ² Concentration: Pr(|Q/k - 1| ≥ ε) ≤ 2e^{-kε²/8} | v Single Pair: Pr(distortion > ε) ≤ 2e^{-kε²/8} | Union Bound | v All Pairs: Pr(any distortion > ε) ≤ n² · 2e^{-kε²/8} | Choose k = O(log n / ε²) | v JL LEMMA ✓
13. A Worked Numerical Example
Let's trace through the entire argument with concrete numbers to make sure nothing is hiding behind the notation.
Setup
We have \(n = 100\) points in \(\R^{10{,}000}\) and want \(\eps = 0.2\) (20% distortion).
Step 1: Choose \(k\).
$$k = \left\lceil \frac{24 \ln 100}{0.04}\right\rceil = \left\lceil \frac{24 \times 4.605}{0.04}\right\rceil = \lceil 2763 \rceil = 2763.$$So we project from \(\R^{10{,}000}\) down to \(\R^{2763}\).
Step 2: Per-pair failure probability.
$$2\,e^{-k\eps^2/8} = 2\,e^{-2763 \times 0.04/8} = 2\,e^{-13.815} \approx 2 \times 10^{-6}.$$Step 3: Total failure probability.
Number of pairs: \(\binom{100}{2} = 4950\).
$$\Pr(\text{failure}) \le 4950 \times 2 \times 10^{-6} \approx 0.0099 < 0.01.$$Conclusion: With probability at least 99%, a single random \(2763 \times 10{,}000\) Gaussian matrix, scaled by \(1/\sqrt{2763}\), preserves all 4950 pairwise distances to within 20%. And this takes about 27.6 million random numbers — easily generated on a laptop.
14. What Makes This Proof Tick
Let me highlight the key ideas that made this proof possible:
15. Broader Perspective
The JL lemma is one of the most applied results in theoretical computer science and machine learning:
- Nearest-neighbor search: Project data into low dimensions first, then search. Distances are approximately preserved, so approximate nearest neighbors are preserved.
- Sketching and streaming: JL-type projections underlie many streaming algorithms that process massive datasets in small memory.
- Compressed sensing: The same random matrix ideas appear in the theory of recovering sparse signals from few measurements.
- Kernel approximation: Random features for kernel methods are essentially JL projections in a lifted space.
Extensions and Variations
| Variant | Key Idea |
|---|---|
| Sparse JL (Achlioptas, 2003) | Use entries from \(\{-1, 0, +1\}\) instead of Gaussians — 2/3 of entries are 0, so projection is faster. |
| Fast JL (Ailon-Chazelle, 2009) | Use randomized Hadamard transform + sparse projection. Runs in \(O(d \log d)\) time. |
| Optimal JL (Larsen-Nelson, 2017) | The dimension \(k = \Theta(\log n / \eps^2)\) is optimal: no map (linear or not) can do better. |
| Terminal JL | Preserve distances from all points to a fixed set of "terminals." Allows even lower \(k\) in some cases. |
16. Summary
We built the JL lemma from absolute scratch. The logical chain was:
- Define expectation and variance, prove linearity and the variance-of-sums formula.
- Define the moment generating function, compute it for a standard Gaussian, prove it factors under independence.
- Prove Markov's inequality from the indicator trick.
- Combine Markov with MGFs to get the Chernoff method, yielding exponential tail bounds.
- Compute the MGF of a chi-squared distribution and apply Chernoff to get chi-squared concentration.
- Show that the squared norm of a Gaussian projection of a fixed vector is a scaled chi-squared, so it concentrates around the original squared norm.
- Apply the union bound over all \(\binom{n}{2}\) pairs to show all distances are preserved simultaneously.
- Set \(k = O(\log n / \eps^2)\) to make the union-bounded failure probability small.
Each step used only the tools developed in the steps before it. No magic, no black boxes, just small pieces stacked carefully on top of each other. That is how you prove theorems.
References:
- W. B. Johnson and J. Lindenstrauss, "Extensions of Lipschitz mappings into a Hilbert space," Contemporary Mathematics, 1984.
- S. Dasgupta and A. Gupta, "An elementary proof of a theorem of Johnson and Lindenstrauss," Random Structures & Algorithms, 2003.
- R. Vershynin, High-Dimensional Probability, Cambridge University Press, 2018.