← Back to all posts

The Johnson–Lindenstrauss Lemma from Scratch

May 17, 2026

The Johnson–Lindenstrauss (JL) lemma says something that feels like it shouldn't be true: you can take any set of \(n\) points in absurdly high-dimensional space and squash them down into just \(O(\log n)\) dimensions, while approximately preserving every pairwise distance.

This post is a complete, self-contained walkthrough of the easiest proof I know. Every tool we use — expectation, variance, moment generating functions, Markov's inequality, the union bound — will be built up from definitions with examples, so you can see how a collection of small, elementary pieces snap together into one of the most useful results in all of high-dimensional geometry.

Proof Roadmap

Here is the complete chain of ideas we will build, one on top of the other:

  1. Expectation & Variance — the two fundamental summaries of a random variable.
  2. Moment Generating Functions (MGFs) — a device that encodes all moments in one function.
  3. Markov's Inequality — the most basic probability tail bound.
  4. The MGF Chernoff Method — combining Markov with MGFs to get exponentially sharp tail bounds.
  5. Sub-Gaussian Random Variables — the key property of Gaussian projections.
  6. Concentration of \(\chi^2\) — a sum of squared Gaussians concentrates around its mean.
  7. The JL Lemma for a Single Pair — random projection nearly preserves one distance.
  8. Union Bound — from one pair to all \(\binom{n}{2}\) pairs simultaneously.
  9. The Full JL Lemma — putting it all together.

1. The Statement

Let's state what we're trying to prove before building the machinery.

Theorem (Johnson–Lindenstrauss, 1984)

For any \(0 < \eps < 1\), any integer \(n \ge 2\), and any set \(P\) of \(n\) points in \(\R^d\), there exists a map \(f : \R^d \to \R^k\) with

$$k = O\!\left(\frac{\log n}{\eps^2}\right)$$

such that for every pair \(u, v \in P\):

$$(1 - \eps)\,\norm{u - v}^2 \;\le\; \norm{f(u) - f(v)}^2 \;\le\; (1 + \eps)\,\norm{u - v}^2.$$

Read that again: the target dimension \(k\) depends only on the number of points \(n\) and the distortion tolerance \(\eps\). It is completely independent of the original dimension \(d\). You could start in \(\R^{1{,}000{,}000}\) and, with say 1000 points and 10% distortion, land in roughly \(k \approx 700\) dimensions.

Concrete Example

Suppose you have \(n = 10{,}000\) images, each represented as a vector in \(\R^{1{,}000{,}000}\) (a million pixel-intensities). You want to preserve all pairwise distances up to \(\eps = 0.1\) (10% distortion). The JL lemma says you can project down to:

$$k = C \cdot \frac{\log 10{,}000}{0.01} = C \cdot \frac{4 \ln 10}{0.01} \approx 921\,C$$

where \(C\) is a modest constant (around 4–8 depending on the proof). So roughly \(k \sim 4{,}000\) to \(8{,}000\) dimensions suffice — a compression from a million down to a few thousand, with all distances nearly intact.

Why should this be possible? In high dimensions, most of the "volume" is concentrated in a thin shell. Random directions are nearly orthogonal to each other. A random subspace of dimension \(O(\log n)\) is rich enough to approximately capture the geometry of any \(n\) points — because there are only \(\binom{n}{2}\) pairwise distances to preserve, and each one is determined by just one direction.

The proof strategy: we will construct \(f\) as a random linear map. Specifically, we take a \(k \times d\) matrix \(A\) whose entries are i.i.d. \(\mathcal{N}(0, 1)\), and define:

$$f(x) = \frac{1}{\sqrt{k}}\,A\,x.$$

We will show that for any fixed pair of points, this map preserves their distance with high probability. Then a union bound over all \(\binom{n}{2}\) pairs finishes the job.

2. Expectation and Variance

We start at the very beginning. These are the two most fundamental summaries of a random variable.

Definition: Expected Value

Let \(X\) be a random variable taking values in \(\R\). If \(X\) is discrete with possible values \(x_1, x_2, \ldots\) and probabilities \(p_1, p_2, \ldots\), then:

$$\E[X] = \sum_{i} x_i \, p_i.$$

If \(X\) is continuous with probability density function \(f_X\), then:

$$\E[X] = \int_{-\infty}^{\infty} x \, f_X(x) \, dx.$$

The expected value is the "center of mass" of the distribution of \(X\).

Example: Fair Die

Roll a fair six-sided die. The random variable \(X\) takes values \(1, 2, 3, 4, 5, 6\), each with probability \(1/6\).

$$\E[X] = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = \frac{21}{6} = 3.5.$$

The expected value 3.5 is not even a possible outcome — it's the long-run average.

Linearity of Expectation

The single most useful property of expectation: for any random variables \(X, Y\) (dependent or not):

$$\E[X + Y] = \E[X] + \E[Y].$$

Proof (discrete case)

Let \(X\) and \(Y\) be discrete random variables on the same probability space. By definition of expectation over the joint distribution:

$$\E[X + Y] = \sum_{x}\sum_{y}(x + y)\,\Pr(X = x, Y = y)$$ $$= \sum_{x}\sum_{y} x \,\Pr(X = x, Y = y) + \sum_{x}\sum_{y} y \,\Pr(X = x, Y = y)$$

In the first sum, \(x\) doesn't depend on \(y\), so summing over \(y\) gives the marginal:

$$= \sum_{x} x \,\Pr(X = x) + \sum_{y} y \, \Pr(Y = y) = \E[X] + \E[Y].$$

Notice we never assumed independence.

Definition: Variance

The variance of a random variable \(X\) measures how spread out it is around its mean \(\mu = \E[X]\):

$$\Var(X) = \E\!\big[(X - \mu)^2\big] = \E[X^2] - (\E[X])^2.$$

Proof of the shortcut formula \(\Var(X) = \E[X^2] - (\E[X])^2\)

Expand the square inside the expectation:

$$\Var(X) = \E[(X - \mu)^2] = \E[X^2 - 2\mu X + \mu^2]$$

By linearity of expectation (and the fact that \(\mu\) is a constant):

$$= \E[X^2] - 2\mu\,\E[X] + \mu^2 = \E[X^2] - 2\mu^2 + \mu^2 = \E[X^2] - \mu^2.$$

Example: Fair Die Variance

We already know \(\E[X] = 3.5\). Now:

$$\E[X^2] = \frac{1}{6}(1 + 4 + 9 + 16 + 25 + 36) = \frac{91}{6} \approx 15.17.$$ $$\Var(X) = \frac{91}{6} - \left(\frac{7}{2}\right)^2 = \frac{91}{6} - \frac{49}{4} = \frac{182 - 147}{12} = \frac{35}{12} \approx 2.92.$$

Variance of Independent Sums

Fact: Variance of a Sum of Independent Variables

If \(X_1, \ldots, X_k\) are independent, then:

$$\Var(X_1 + \cdots + X_k) = \Var(X_1) + \cdots + \Var(X_k).$$

Proof (for two variables; the general case follows by induction)

Let \(S = X + Y\) with \(X, Y\) independent. Write \(\mu_X = \E[X]\), \(\mu_Y = \E[Y]\).

$$\Var(S) = \E[(S - \E[S])^2] = \E[((X - \mu_X) + (Y - \mu_Y))^2]$$ $$= \E[(X - \mu_X)^2] + 2\,\E[(X - \mu_X)(Y - \mu_Y)] + \E[(Y - \mu_Y)^2].$$

The cross term: since \(X, Y\) are independent,

$$\E[(X - \mu_X)(Y - \mu_Y)] = \E[X - \mu_X]\cdot\E[Y - \mu_Y] = 0 \cdot 0 = 0.$$

So \(\Var(S) = \Var(X) + \Var(Y)\).

3. Moment Generating Functions

The moment generating function (MGF) is a single function that packages all the moments \(\E[X], \E[X^2], \E[X^3], \ldots\) of a random variable into one tidy object. It is the secret weapon behind every sharp concentration inequality.

Definition: Moment Generating Function

For a random variable \(X\), the MGF is the function:

$$M_X(t) = \E[e^{tX}]$$

defined for all \(t \in \R\) where this expectation is finite.

Why "moment generating"?

Recall the Taylor expansion of the exponential:

$$e^{tX} = 1 + tX + \frac{(tX)^2}{2!} + \frac{(tX)^3}{3!} + \cdots$$

Taking expectations term-by-term (justified when the MGF exists in a neighborhood of 0):

$$M_X(t) = \E[e^{tX}] = 1 + t\,\E[X] + \frac{t^2}{2!}\,\E[X^2] + \frac{t^3}{3!}\,\E[X^3] + \cdots$$

So the \(n\)-th derivative at \(t = 0\) gives the \(n\)-th moment:

$$M_X^{(n)}(0) = \E[X^n].$$

Example: MGF of a Standard Normal

Let \(Z \sim \mathcal{N}(0, 1)\) with density \(f_Z(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\). Then:

$$M_Z(t) = \E[e^{tZ}] = \int_{-\infty}^{\infty} e^{tz} \cdot \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz.$$

Complete the square in the exponent:

$$tz - \frac{z^2}{2} = -\frac{1}{2}(z^2 - 2tz) = -\frac{1}{2}(z - t)^2 + \frac{t^2}{2}.$$

So:

$$M_Z(t) = e^{t^2/2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-(z-t)^2/2}\,dz = e^{t^2/2} \cdot 1 = e^{t^2/2}.$$

The integral equals 1 because the integrand is the density of \(\mathcal{N}(t, 1)\).

So the MGF of a standard normal is:

$$M_Z(t) = e^{t^2/2}$$

We can verify the moments: \(M_Z'(t) = t\,e^{t^2/2}\), so \(M_Z'(0) = 0 = \E[Z]\). And \(M_Z''(t) = (1 + t^2)\,e^{t^2/2}\), so \(M_Z''(0) = 1 = \E[Z^2]\). Checks out.

Key Property: MGFs Factor Under Independence

Fact

If \(X\) and \(Y\) are independent, then:

$$M_{X+Y}(t) = M_X(t) \cdot M_Y(t).$$

Proof

$$M_{X+Y}(t) = \E[e^{t(X+Y)}] = \E[e^{tX} \cdot e^{tY}].$$

Since \(X\) and \(Y\) are independent, \(e^{tX}\) and \(e^{tY}\) are also independent (functions of independent r.v.'s are independent). For independent random variables, the expectation of a product equals the product of expectations:

$$= \E[e^{tX}] \cdot \E[e^{tY}] = M_X(t) \cdot M_Y(t).$$

Why this matters: When we eventually analyze a sum of independent random variables (which is exactly what a random projection produces), the MGF of the sum will factor into a product of individual MGFs. Products of exponentials are easy to handle — they just add in the exponent.

4. Markov's Inequality

Markov's inequality is the most elementary tail bound in all of probability. It's almost embarrassingly simple, yet it is the foundation of every sharper bound we'll derive.

Theorem (Markov's Inequality)

Let \(X\) be a non-negative random variable. Then for any \(a > 0\):

$$\Pr(X \ge a) \le \frac{\E[X]}{a}.$$

Proof

We use a simple indicator trick. Define the indicator random variable:

$$\mathbf{1}_{X \ge a} = \begin{cases} 1 & \text{if } X \ge a \\ 0 & \text{otherwise.}\end{cases}$$

Since \(X \ge 0\), whenever \(X \ge a\) we certainly have \(X \ge a \cdot 1 = a\). In general:

$$X \ge a \cdot \mathbf{1}_{X \ge a}.$$

This is the heart of the proof: if \(X < a\), the right side is 0 and the inequality holds trivially (since \(X \ge 0\)). If \(X \ge a\), the right side is \(a\) and the inequality says \(X \ge a\), which is true by assumption.

Now take expectations of both sides (expectation preserves inequalities):

$$\E[X] \ge a \cdot \E[\mathbf{1}_{X \ge a}] = a \cdot \Pr(X \ge a).$$

Dividing both sides by \(a > 0\):

$$\Pr(X \ge a) \le \frac{\E[X]}{a}.$$

Example

Suppose the average commute time in a city is \(\E[X] = 30\) minutes. Without knowing anything else about the distribution, Markov tells us:

$$\Pr(X \ge 120) \le \frac{30}{120} = \frac{1}{4}.$$

At most 25% of people commute 2 hours or more. This bound is loose — the real fraction is probably much smaller — but it's remarkable that we got anything at all from just the average.

Limitation: Markov's inequality requires \(X \ge 0\) and only uses \(\E[X]\). It gives no information about the shape of the distribution. To get sharper bounds, we need to use more information about \(X\). That's where the MGF trick comes in.

5. The Chernoff Bound Method

Here's the beautiful idea: Markov's inequality is weak because it only uses the mean. But what if we applied Markov to a cleverer non-negative random variable derived from \(X\)?

The Core Trick

For any \(t > 0\), the function \(x \mapsto e^{tx}\) is non-negative and strictly increasing. So for any real-valued \(X\) and any \(a\):

$$\Pr(X \ge a) = \Pr(e^{tX} \ge e^{ta}).$$

Now \(e^{tX}\) is non-negative, so we can apply Markov:

$$\Pr(X \ge a) = \Pr(e^{tX} \ge e^{ta}) \le \frac{\E[e^{tX}]}{e^{ta}} = e^{-ta}\,M_X(t).$$

Since this holds for every \(t > 0\), we can optimize:

$$\Pr(X \ge a) \le \inf_{t > 0} \; e^{-ta}\,M_X(t)$$

This is the Chernoff bound (or Chernoff method). Let's make sure every step is clear:

Step-by-step derivation

Step 1: Pick any \(t > 0\). Since \(e^{tx}\) is strictly increasing: \(X \ge a \iff e^{tX} \ge e^{ta}\).

Step 2: The random variable \(e^{tX}\) is always positive (the exponential is always positive). Apply Markov's inequality with the non-negative random variable \(e^{tX}\) and threshold \(e^{ta}\):

$$\Pr(e^{tX} \ge e^{ta}) \le \frac{\E[e^{tX}]}{e^{ta}}.$$

Step 3: Recognize that \(\E[e^{tX}] = M_X(t)\), the moment generating function.

Step 4: Since the bound holds for every \(t > 0\), take the infimum (tightest bound) over all \(t > 0\).

Why is this so much better than plain Markov? Markov uses only \(\E[X]\) — one number. The Chernoff method uses \(\E[e^{tX}]\) for the best \(t\), which effectively leverages all moments of \(X\) simultaneously (via the Taylor expansion of \(e^{tX}\)). For well-behaved distributions, this gives bounds that decay exponentially in the deviation, not just polynomially.

Example: Chernoff Bound for a Standard Normal

Let \(Z \sim \mathcal{N}(0,1)\). We showed \(M_Z(t) = e^{t^2/2}\). So for any \(a > 0\):

$$\Pr(Z \ge a) \le \inf_{t > 0}\; e^{-ta} \cdot e^{t^2/2} = \inf_{t > 0}\; e^{t^2/2 - ta}.$$

Minimize the exponent \(g(t) = t^2/2 - ta\). Take derivative: \(g'(t) = t - a = 0\), so \(t^* = a\). Then:

$$\Pr(Z \ge a) \le e^{a^2/2 - a^2} = e^{-a^2/2}.$$

This is a Gaussian tail bound: the probability of a standard normal exceeding \(a\) drops like \(e^{-a^2/2}\). For \(a = 3\): Markov gives only \(\Pr(|Z| \ge 3) \le \E[Z^2]/9 = 1/9 \approx 0.11\), while Chernoff gives \(\Pr(Z \ge 3) \le e^{-4.5} \approx 0.011\), much tighter.

6. Sub-Gaussian Random Variables

The Chernoff bound for the standard normal gave us \(\Pr(Z \ge a) \le e^{-a^2/2}\). The key input was that \(M_Z(t) = e^{t^2/2}\). Many other random variables have MGFs bounded by a similar Gaussian-like form. These are called sub-Gaussian.

Definition: Sub-Gaussian Random Variable

A random variable \(X\) with \(\E[X] = 0\) is called \(\sigma^2\)-sub-Gaussian if its MGF satisfies:

$$M_X(t) = \E[e^{tX}] \le e^{\sigma^2 t^2 / 2} \qquad \text{for all } t \in \R.$$

In words: the MGF of \(X\) is dominated by the MGF of a \(\mathcal{N}(0, \sigma^2)\) random variable.

Examples of Sub-Gaussian Random Variables

Random Variable Sub-Gaussian parameter \(\sigma^2\)
\(\mathcal{N}(0, \sigma^2)\) \(\sigma^2\) (exactly)
Rademacher: \(\pm 1\) with prob \(1/2\) each \(1\)
Uniform on \([-a, a]\) \(a^2\)
Any bounded r.v. on \([a, b]\) with mean 0 \((b-a)^2/4\) (Hoeffding)

Why is a Rademacher variable 1-sub-Gaussian?

Let \(X = \pm 1\) with equal probability. Then:

$$M_X(t) = \E[e^{tX}] = \frac{e^t + e^{-t}}{2} = \cosh(t).$$

We need to show \(\cosh(t) \le e^{t^2/2}\) for all \(t\). Use the Taylor series:

$$\cosh(t) = \sum_{k=0}^{\infty} \frac{t^{2k}}{(2k)!}, \qquad e^{t^2/2} = \sum_{k=0}^{\infty} \frac{t^{2k}}{2^k \cdot k!}.$$

We need \(\frac{1}{(2k)!} \le \frac{1}{2^k \cdot k!}\) for all \(k \ge 0\). This is equivalent to \(2^k \cdot k! \le (2k)!\).

Check by induction. Base case \(k=0\): \(1 \le 1\). Inductive step: assume \(2^k \cdot k! \le (2k)!\). Then:

$$2^{k+1}(k+1)! = 2 \cdot (k+1) \cdot 2^k \cdot k! \le 2(k+1) \cdot (2k)!.$$

We need \(2(k+1)(2k)! \le (2k+2)! = (2k+2)(2k+1)(2k)!\), i.e., \(2(k+1) \le (2k+2)(2k+1)\). Since \(2k+2 = 2(k+1)\), this becomes \(1 \le 2k+1\), true for \(k \ge 0\).

Why sub-Gaussians matter

Applying the Chernoff method to a \(\sigma^2\)-sub-Gaussian variable \(X\) immediately gives:

$$\Pr(X \ge a) \le \inf_{t > 0}\; e^{-ta} \cdot e^{\sigma^2 t^2/2}.$$

Optimizing: \(g(t) = \sigma^2 t^2/2 - ta\), \(g'(t) = \sigma^2 t - a = 0\), so \(t^* = a/\sigma^2\):

$$\Pr(X \ge a) \le e^{-a^2/(2\sigma^2)} \qquad \text{(sub-Gaussian tail bound)}$$

By symmetry (if \(-X\) is also \(\sigma^2\)-sub-Gaussian, which it is when \(X\) is symmetric or Gaussian):

$$\Pr(|X| \ge a) \le 2\,e^{-a^2/(2\sigma^2)}.$$

7. Gaussian Vectors and the Chi-Squared Distribution

A random projection maps a vector \(v \in \R^d\) to \(\frac{1}{\sqrt{k}} A v\) where \(A\) is a \(k \times d\) matrix of i.i.d. standard normals. To analyze what happens to the length of \(v\), we need to understand the distribution of the squared norm of a Gaussian vector.

Definition: Chi-Squared Distribution

If \(Z_1, \ldots, Z_k\) are independent \(\mathcal{N}(0,1)\) random variables, then the sum

$$Q = Z_1^2 + Z_2^2 + \cdots + Z_k^2$$

has the chi-squared distribution with \(k\) degrees of freedom, written \(Q \sim \chi^2_k\).

Mean and Variance of \(\chi^2_k\)

Fact

If \(Q \sim \chi^2_k\), then \(\E[Q] = k\) and \(\Var(Q) = 2k\).

Proof

Mean: Each \(Z_i \sim \mathcal{N}(0,1)\), so \(\E[Z_i^2] = \Var(Z_i) + (\E[Z_i])^2 = 1 + 0 = 1\). By linearity:

$$\E[Q] = \sum_{i=1}^{k} \E[Z_i^2] = k.$$

Variance: We need \(\Var(Z_i^2)\). Since \(\E[Z_i^2] = 1\), we need \(\E[Z_i^4]\).

For \(Z \sim \mathcal{N}(0,1)\):

$$\E[Z^4] = \int_{-\infty}^{\infty} z^4 \cdot \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz.$$

Integrate by parts with \(u = z^3\), \(dv = z \cdot \frac{1}{\sqrt{2\pi}}e^{-z^2/2}\,dz\). Note \(v = -\frac{1}{\sqrt{2\pi}}e^{-z^2/2}\). The boundary terms vanish, giving:

$$\E[Z^4] = 3\,\E[Z^2] = 3.$$

(More directly: from the MGF \(M_Z(t) = e^{t^2/2}\), we get \(M_Z^{(4)}(0) = 3\).)

So \(\Var(Z_i^2) = \E[Z_i^4] - (\E[Z_i^2])^2 = 3 - 1 = 2\).

Since the \(Z_i\) are independent, the \(Z_i^2\) are also independent, so:

$$\Var(Q) = \sum_{i=1}^{k} \Var(Z_i^2) = 2k.$$

What this tells us: The random variable \(Q/k\) (a normalized chi-squared) has mean 1 and variance \(2/k\). As \(k\) grows, \(Q/k\) concentrates tightly around 1. This is the seed of the JL lemma: the squared norm of a random projection, when properly normalized, concentrates around the original squared norm.

8. Concentration of Chi-Squared: The Key Lemma

This is the technical heart of the entire proof. We need to show that \(Q/k\) is not just close to 1 on average, but close to 1 with exponentially high probability.

Lemma (Chi-Squared Concentration)

Let \(Q \sim \chi^2_k\). For any \(0 < \eps < 1\):

$$\Pr\!\left(\frac{Q}{k} \ge 1 + \eps\right) \le \left(\frac{e^{\eps}}{(1 + \eps)^{1+\eps}}\right)^{k/2}$$ $$\Pr\!\left(\frac{Q}{k} \le 1 - \eps\right) \le \left(\frac{e^{-\eps}}{(1 - \eps)^{1-\eps}}\right)^{k/2}$$

In particular, both probabilities are at most \(e^{-k\eps^2/8}\) for \(0 < \eps < 1\).

Before proving this, let's build the two ingredients we need: the MGF of a \(\chi^2_1\) random variable, and the Chernoff method applied to a sum.

Step 1: MGF of \(Z^2\) where \(Z \sim \mathcal{N}(0,1)\)

Claim

If \(Z \sim \mathcal{N}(0,1)\) and \(t < 1/2\), then:

$$M_{Z^2}(t) = \E[e^{tZ^2}] = \frac{1}{\sqrt{1 - 2t}}.$$

Proof

Compute directly:

$$\E[e^{tZ^2}] = \int_{-\infty}^{\infty} e^{tz^2} \cdot \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-z^2(1/2 - t)}\,dz.$$

For this to converge, we need \(1/2 - t > 0\), i.e., \(t < 1/2\). Let \(\sigma^2 = \frac{1}{1 - 2t}\). Then:

$$= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-z^2/(2\sigma^2)}\,dz = \frac{1}{\sqrt{2\pi}} \cdot \sigma\sqrt{2\pi} = \sigma = \frac{1}{\sqrt{1-2t}}.$$

The integral equals \(\sigma\sqrt{2\pi}\) because it's the integral of an (unnormalized) Gaussian with variance \(\sigma^2\).

Step 2: MGF of the Chi-Squared Sum

Since \(Q = Z_1^2 + \cdots + Z_k^2\) with independent \(Z_i\), the factoring property of MGFs gives:

$$M_Q(t) = \prod_{i=1}^{k} M_{Z_i^2}(t) = \left(\frac{1}{\sqrt{1-2t}}\right)^k = (1 - 2t)^{-k/2}.$$

Step 3: Apply the Chernoff Method

Proof of the Upper Tail: \(\Pr(Q/k \ge 1 + \eps)\)

We want to bound \(\Pr(Q \ge k(1+\eps))\). By the Chernoff method, for any \(0 < t < 1/2\):

$$\Pr(Q \ge k(1+\eps)) \le e^{-tk(1+\eps)} \cdot M_Q(t) = e^{-tk(1+\eps)} \cdot (1-2t)^{-k/2}.$$

Optimize over \(t\). Let \(h(t) = -t(1+\eps) - \frac{1}{2}\ln(1-2t)\). We want to minimize this (the exponent is \(k \cdot h(t)\), and we want the bound \(e^{k \cdot h(t)}\) as small as possible).

$$h'(t) = -(1+\eps) + \frac{1}{1-2t} = 0 \implies 1 - 2t = \frac{1}{1+\eps} \implies t^* = \frac{\eps}{2(1+\eps)}.$$

Check: \(0 < t^* < 1/2\) for \(\eps > 0\). Good. Substituting back:

$$1 - 2t^* = \frac{1}{1+\eps}, \qquad t^*(1+\eps) = \frac{\eps}{2}.$$ $$h(t^*) = -\frac{\eps}{2} + \frac{1}{2}\ln(1+\eps).$$

So the bound is:

$$\Pr\!\left(\frac{Q}{k} \ge 1+\eps\right) \le e^{k(-\eps/2 + \ln(1+\eps)/2)} = \left(\frac{(1+\eps)^{1/2}}{e^{\eps/2}}\right)^k = \left(\frac{e^{\eps}}{(1+\eps)^{1+\eps}}\right)^{k/2}.$$

(The last step: raise to the power \(k\), rearrange, and note we can write the exponent in the stated form.)

Proof of the Lower Tail: \(\Pr(Q/k \le 1 - \eps)\)

For the lower tail, we use the trick: \(\Pr(Q \le k(1-\eps)) = \Pr(-Q \ge -k(1-\eps))\). Apply the Chernoff method with \(e^{-tQ}\) for \(t > 0\):

$$\Pr(Q \le k(1-\eps)) = \Pr(e^{-tQ} \ge e^{-tk(1-\eps)}) \le e^{tk(1-\eps)} \cdot \E[e^{-tQ}].$$

Now \(\E[e^{-tQ}] = M_Q(-t) = (1+2t)^{-k/2}\) (valid for all \(t > 0\)). So:

$$\Pr(Q \le k(1-\eps)) \le e^{tk(1-\eps)} \cdot (1+2t)^{-k/2}.$$

Optimize: set \(t^* = \frac{\eps}{2(1-\eps)}\) (by the same calculus). Substituting gives the stated bound.

Simplifying to get \(e^{-k\eps^2/8}\)

The exact bounds above are a bit unwieldy. We now show they imply the cleaner bound \(e^{-k\eps^2/8}\) for \(0 < \eps < 1\).

Simplification

It suffices to show that for \(0 < \eps < 1\):

$$\frac{\eps}{2} - \frac{\ln(1+\eps)}{2} \ge \frac{\eps^2}{8}.$$

Equivalently: \(\eps - \ln(1+\eps) \ge \eps^2/4\). Define \(f(\eps) = \eps - \ln(1+\eps) - \eps^2/4\). Then \(f(0) = 0\) and:

$$f'(\eps) = 1 - \frac{1}{1+\eps} - \frac{\eps}{2} = \frac{\eps}{1+\eps} - \frac{\eps}{2} = \eps\left(\frac{1}{1+\eps} - \frac{1}{2}\right) = \eps \cdot \frac{1 - \eps}{2(1+\eps)}.$$

For \(0 < \eps < 1\), we have \(f'(\eps) > 0\). Since \(f(0) = 0\) and \(f\) is increasing on \((0,1)\), we get \(f(\eps) \ge 0\) for \(\eps \in [0, 1]\). A similar (slightly more involved) calculation handles the lower tail.

Therefore both tails satisfy:

$$\Pr\!\left(\left|\frac{Q}{k} - 1\right| \ge \eps\right) \le 2\,e^{-k\eps^2/8}$$

Example: How tight is the concentration?

Take \(k = 1000\) and \(\eps = 0.1\). Then:

$$\Pr\!\left(\left|\frac{Q}{1000} - 1\right| \ge 0.1\right) \le 2\,e^{-1000 \cdot 0.01/8} = 2\,e^{-1.25} \approx 0.57.$$

That's for \(k = 1000\). With \(k = 10{,}000\):

$$2\,e^{-10000 \cdot 0.01 / 8} = 2\,e^{-12.5} \approx 7.4 \times 10^{-6}.$$

The concentration becomes extremely strong as \(k\) grows. The constant 8 in the denominator is not tight — more careful analysis can replace it with roughly \(2 + 2/3\) — but \(1/8\) suffices for our purposes.

9. Random Projection Preserves One Distance

Now we connect chi-squared concentration to random projections. This is where the pieces start clicking together.

The Setup

Let \(A\) be a \(k \times d\) matrix with i.i.d. entries \(A_{ij} \sim \mathcal{N}(0, 1)\). Define the random projection:

$$f(x) = \frac{1}{\sqrt{k}}\,A\,x, \qquad x \in \R^d.$$

Fix a single vector \(v \in \R^d\) (think of \(v = u_1 - u_2\) as the difference between two data points). We want to analyze \(\norm{f(v)}^2\).

Lemma (Single-Vector Projection)

For any fixed \(v \in \R^d\) with \(v \ne 0\) and any \(0 < \eps < 1\):

$$\Pr\!\left(\left|\,\norm{f(v)}^2 - \norm{v}^2\,\right| \ge \eps\,\norm{v}^2\right) \le 2\,e^{-k\eps^2/8}.$$

Proof

Step 1: Compute \(\norm{f(v)}^2\).

The \(i\)-th coordinate of \(f(v) = \frac{1}{\sqrt{k}} Av\) is:

$$[f(v)]_i = \frac{1}{\sqrt{k}} \sum_{j=1}^{d} A_{ij}\,v_j.$$

Each \(A_{ij}\) is independent \(\mathcal{N}(0,1)\), so the sum \(W_i = \sum_{j=1}^{d} A_{ij}\,v_j\) is a linear combination of independent normals.

Step 2: What is the distribution of \(W_i\)?

A linear combination of independent normals is normal. The mean is 0 (since each \(A_{ij}\) has mean 0). The variance is:

$$\Var(W_i) = \sum_{j=1}^{d} v_j^2 \,\Var(A_{ij}) = \sum_{j=1}^{d} v_j^2 = \norm{v}^2.$$

So \(W_i \sim \mathcal{N}(0, \norm{v}^2)\), and we can write \(W_i = \norm{v} \cdot Z_i\) where \(Z_i \sim \mathcal{N}(0,1)\).

Step 3: Different rows are independent.

The rows of \(A\) are independent (each is a fresh draw of \(d\) independent Gaussians), so \(W_1, \ldots, W_k\) are independent.

Step 4: Express \(\norm{f(v)}^2\) in terms of chi-squared.

$$\norm{f(v)}^2 = \sum_{i=1}^{k} [f(v)]_i^2 = \frac{1}{k}\sum_{i=1}^{k} W_i^2 = \frac{\norm{v}^2}{k}\sum_{i=1}^{k} Z_i^2 = \frac{\norm{v}^2}{k}\,Q$$

where \(Q = \sum_{i=1}^{k} Z_i^2 \sim \chi^2_k\).

Step 5: Apply chi-squared concentration.

$$\left|\norm{f(v)}^2 - \norm{v}^2\right| = \norm{v}^2 \cdot \left|\frac{Q}{k} - 1\right|.$$

So:

$$\Pr\!\left(\left|\norm{f(v)}^2 - \norm{v}^2\right| \ge \eps\,\norm{v}^2\right) = \Pr\!\left(\left|\frac{Q}{k} - 1\right| \ge \eps\right) \le 2\,e^{-k\eps^2/8}.$$

The reduction in a nutshell: The squared norm of a random Gaussian projection of \(v\) is just \(\norm{v}^2\) times a normalized chi-squared. So the question "does the projection preserve length?" reduces to "does chi-squared concentrate around its mean?" — which we already answered.

10. The Union Bound

We've shown that a random projection preserves the length of any one vector with high probability. But we need it to work for all \(\binom{n}{2}\) pairwise difference vectors simultaneously. The tool for this is the union bound.

Theorem (Union Bound / Boole's Inequality)

For any (finite or countable) collection of events \(E_1, E_2, \ldots, E_m\):

$$\Pr(E_1 \cup E_2 \cup \cdots \cup E_m) \le \Pr(E_1) + \Pr(E_2) + \cdots + \Pr(E_m).$$

Proof

We prove it for two events first, then extend by induction.

Two events: By inclusion-exclusion,

$$\Pr(E_1 \cup E_2) = \Pr(E_1) + \Pr(E_2) - \Pr(E_1 \cap E_2).$$

Since \(\Pr(E_1 \cap E_2) \ge 0\), we get \(\Pr(E_1 \cup E_2) \le \Pr(E_1) + \Pr(E_2)\).

General case by induction: Assume true for \(m-1\) events. Then:

$$\Pr\!\left(\bigcup_{i=1}^{m} E_i\right) = \Pr\!\left(\left(\bigcup_{i=1}^{m-1} E_i\right) \cup E_m\right) \le \Pr\!\left(\bigcup_{i=1}^{m-1} E_i\right) + \Pr(E_m)$$ $$\le \sum_{i=1}^{m-1}\Pr(E_i) + \Pr(E_m) = \sum_{i=1}^{m}\Pr(E_i).$$

Example: The Birthday Bound

What's the probability that among 23 people, at least two share a birthday? The union bound over all \(\binom{23}{2} = 253\) pairs, each with collision probability \(1/365\), gives:

$$\Pr(\text{collision}) \le 253 \cdot \frac{1}{365} \approx 0.693.$$

The true answer is about 0.507 — the union bound overestimates because it doesn't account for the overlaps, but it's in the right ballpark and trivial to compute.

How we'll use it: Let \(E_{ij}\) be the "bad event" that the projection distorts the distance between points \(i\) and \(j\) by more than \(\eps\). We've shown \(\Pr(E_{ij}) \le 2e^{-k\eps^2/8}\). The projection fails only if at least one pair is distorted too much, i.e., the event \(\bigcup_{i \lt j} E_{ij}\). The union bound says the failure probability is at most \(\binom{n}{2}\) times the per-pair failure probability. We need \(\binom{n}{2} \cdot 2e^{-k\eps^2/8}\) to be small, which forces \(k = O(\log n / \eps^2)\).

11. The Full Proof

We now have every ingredient. Let's assemble them.

Theorem (JL Lemma — Restated)

Let \(P = \{x_1, \ldots, x_n\} \subset \R^d\), let \(0 < \eps < 1\), and let

$$k \ge \frac{8 \ln n}{\eps^2}.$$

Draw a \(k \times d\) random matrix \(A\) with i.i.d. \(\mathcal{N}(0,1)\) entries and define \(f(x) = \frac{1}{\sqrt{k}} Ax\). Then with probability at least \(1 - 1/n\), for all pairs \(i, j\):

$$(1 - \eps)\norm{x_i - x_j}^2 \le \norm{f(x_i) - f(x_j)}^2 \le (1 + \eps)\norm{x_i - x_j}^2.$$

Proof

Step 1: Linearity reduces distances to lengths.

Since \(f\) is linear: \(f(x_i) - f(x_j) = f(x_i - x_j)\). So:

$$\norm{f(x_i) - f(x_j)}^2 = \norm{f(x_i - x_j)}^2.$$

It suffices to show that \(f\) approximately preserves the length of every difference vector \(v_{ij} = x_i - x_j\).

Step 2: Per-pair concentration (from Section 9).

For each fixed pair \((i, j)\) with \(i \lt j\), define the bad event:

$$E_{ij} = \left\{\left|\,\norm{f(v_{ij})}^2 - \norm{v_{ij}}^2\,\right| > \eps\,\norm{v_{ij}}^2\right\}.$$

By the Single-Vector Projection Lemma:

$$\Pr(E_{ij}) \le 2\,e^{-k\eps^2/8}.$$

Step 3: Union bound over all pairs.

The number of pairs is \(\binom{n}{2} = \frac{n(n-1)}{2} < \frac{n^2}{2}\). By the union bound:

$$\Pr\!\left(\bigcup_{i \lt j} E_{ij}\right) \le \binom{n}{2} \cdot 2\,e^{-k\eps^2/8} \lt \frac{n^2}{2} \cdot 2\,e^{-k\eps^2/8} = n^2\,e^{-k\eps^2/8}.$$

Step 4: Choose \(k\) to make this small.

We want \(n^2\,e^{-k\eps^2/8} \le 1/n\), i.e., \(e^{-k\eps^2/8} \le n^{-3}\), i.e.,

$$k\eps^2/8 \ge 3 \ln n \quad \Longleftrightarrow \quad k \ge \frac{24 \ln n}{\eps^2}.$$

With \(k = \lceil 24 \ln n / \eps^2 \rceil\), we get:

$$\Pr(\text{some pair distorted by more than } \eps) \le \frac{1}{n}.$$

In particular, with probability at least \(1 - 1/n\), every pairwise distance is preserved within factor \((1 \pm \eps)\).

Remark on the constant: We used \(k \ge 24\ln n / \eps^2\) for the \(1 - 1/n\) success probability. More generally, to get failure probability at most \(\delta\), we need:

$$k \ge \frac{8(2\ln n + \ln(1/\delta))}{\eps^2}.$$

The constant 8 can be improved to roughly 3 with a tighter analysis of chi-squared tails. The \(O(\log n / \eps^2)\) scaling is what matters.

The Johnson–Lindenstrauss Lemma

A random Gaussian projection into \(k = O(\log n / \eps^2)\) dimensions preserves all pairwise distances within factor \((1 \pm \eps)\), with high probability.

12. The Proof at a Glance

Here is the entire logical chain, compressed into one view:

Dependency Graph of the Proof

Linearity of Expectation Variance of Independent Sums \ / v v E[Q] = k Var(Q) = 2k \ / \ / v v MGF of N(0,1): e^{t²/2} | v MGF of Z²: (1-2t)^{-1/2} | v MGF of χ²_k: (1-2t)^{-k/2} | Markov's Inequality | v Chernoff Method | v χ² Concentration: Pr(|Q/k - 1| ≥ ε) ≤ 2e^{-kε²/8} | v Single Pair: Pr(distortion > ε) ≤ 2e^{-kε²/8} | Union Bound | v All Pairs: Pr(any distortion > ε) ≤ n² · 2e^{-kε²/8} | Choose k = O(log n / ε²) | v JL LEMMA ✓

13. A Worked Numerical Example

Let's trace through the entire argument with concrete numbers to make sure nothing is hiding behind the notation.

Setup

We have \(n = 100\) points in \(\R^{10{,}000}\) and want \(\eps = 0.2\) (20% distortion).

Step 1: Choose \(k\).

$$k = \left\lceil \frac{24 \ln 100}{0.04}\right\rceil = \left\lceil \frac{24 \times 4.605}{0.04}\right\rceil = \lceil 2763 \rceil = 2763.$$

So we project from \(\R^{10{,}000}\) down to \(\R^{2763}\).

Step 2: Per-pair failure probability.

$$2\,e^{-k\eps^2/8} = 2\,e^{-2763 \times 0.04/8} = 2\,e^{-13.815} \approx 2 \times 10^{-6}.$$

Step 3: Total failure probability.

Number of pairs: \(\binom{100}{2} = 4950\).

$$\Pr(\text{failure}) \le 4950 \times 2 \times 10^{-6} \approx 0.0099 < 0.01.$$

Conclusion: With probability at least 99%, a single random \(2763 \times 10{,}000\) Gaussian matrix, scaled by \(1/\sqrt{2763}\), preserves all 4950 pairwise distances to within 20%. And this takes about 27.6 million random numbers — easily generated on a laptop.

14. What Makes This Proof Tick

Let me highlight the key ideas that made this proof possible:

1. Reduction to a one-dimensional problem. The projection is a linear map, so preserving distances reduces to preserving the length of each difference vector. The squared length of a Gaussian projection of a fixed vector is a scalar random variable (a scaled chi-squared). We reduced a high-dimensional geometry problem to analyzing a single well-understood distribution.
2. The exponential trick (Chernoff method). Markov's inequality alone gives weak polynomial bounds. The Chernoff method — applying Markov to \(e^{tX}\) and optimizing over \(t\) — gives exponential bounds. This is why we can afford the union bound over \(\binom{n}{2} \sim n^2\) pairs: the per-pair failure probability decays like \(e^{-k\eps^2}\), and \(n^2 = e^{2\ln n}\) can be absorbed by choosing \(k \propto \log n\).
3. The union bound as the finishing glue. We didn't need any clever coupling between different pairs. Each pair was analyzed independently, and the union bound — the crudest way to go from one event to many — was strong enough because the per-event bound was exponentially sharp.
4. Gaussians are magical. The Gaussian distribution is rotationally invariant: the distribution of \(Av\) depends only on \(\norm{v}\), not on the direction of \(v\). This is why the analysis for one difference vector applies uniformly to all of them, without us having to worry about alignment issues.

15. Broader Perspective

The JL lemma is one of the most applied results in theoretical computer science and machine learning:

Extensions and Variations

Variant Key Idea
Sparse JL (Achlioptas, 2003) Use entries from \(\{-1, 0, +1\}\) instead of Gaussians — 2/3 of entries are 0, so projection is faster.
Fast JL (Ailon-Chazelle, 2009) Use randomized Hadamard transform + sparse projection. Runs in \(O(d \log d)\) time.
Optimal JL (Larsen-Nelson, 2017) The dimension \(k = \Theta(\log n / \eps^2)\) is optimal: no map (linear or not) can do better.
Terminal JL Preserve distances from all points to a fixed set of "terminals." Allows even lower \(k\) in some cases.

16. Summary

We built the JL lemma from absolute scratch. The logical chain was:

  1. Define expectation and variance, prove linearity and the variance-of-sums formula.
  2. Define the moment generating function, compute it for a standard Gaussian, prove it factors under independence.
  3. Prove Markov's inequality from the indicator trick.
  4. Combine Markov with MGFs to get the Chernoff method, yielding exponential tail bounds.
  5. Compute the MGF of a chi-squared distribution and apply Chernoff to get chi-squared concentration.
  6. Show that the squared norm of a Gaussian projection of a fixed vector is a scaled chi-squared, so it concentrates around the original squared norm.
  7. Apply the union bound over all \(\binom{n}{2}\) pairs to show all distances are preserved simultaneously.
  8. Set \(k = O(\log n / \eps^2)\) to make the union-bounded failure probability small.

Each step used only the tools developed in the steps before it. No magic, no black boxes, just small pieces stacked carefully on top of each other. That is how you prove theorems.

References: