The Mathematics of Google's TurboQuant

May 2026

In March 2026, Google Research published a paper that rattled memory chip stocks and made Cloudflare's CEO call it "Google's DeepSeek moment." The paper is about compressing numbers. Specifically, it describes an algorithm called TurboQuant that compresses the key-value cache of large language models to ~3 bits per value — a 6x reduction — with effectively no loss in quality.

What makes this interesting is not the engineering result (impressive as it is) but the mathematics underneath. TurboQuant is a clean demonstration of how three old ideas — random rotations, optimal quantization, and the Johnson-Lindenstrauss lemma — combine into something greater than the sum of their parts.

This post walks through the math. I'll start with a puzzle, then build up each piece of the algorithm, and end with why the result is close to the best anyone can ever do.

A Puzzle to Start

The Random Rotation Puzzle

Imagine you have a vector in 1000 dimensions. All of its energy is concentrated in a single coordinate:

$$\mathbf{v} = (1000, \; 0, \; 0, \; 0, \; \ldots, \; 0) \in \mathbb{R}^{1000}$$

Now apply a random rotation to this vector — pick a uniformly random orthogonal matrix $Q$ and compute $Q\mathbf{v}$.

Question: After the rotation, roughly what does each coordinate of $Q\mathbf{v}$ look like? Are most of them still zero? Are a few of them large? Or does something else happen entirely?

Bonus: What probability distribution do the individual coordinates follow?

The Answer

Something surprising happens: every single coordinate becomes roughly the same size. Not a few large ones and many zeros — all of them, roughly equal. Let's see why, step by step.

Step 1: What does $Q\mathbf{v}$ actually compute?

Our vector is $\mathbf{v} = (1000, 0, 0, \ldots, 0)$. Multiplying by an orthogonal matrix $Q$ gives:

$$(Q\mathbf{v})_i = \sum_{j=1}^{d} Q_{ij} \, v_j = Q_{i1} \cdot 1000$$

So the $i$-th coordinate of the rotated vector is just $1000 \times Q_{i1}$ — it's 1000 times the first column of $Q$. The question reduces to: what do the entries of a single column of a uniformly random orthogonal matrix look like?

Step 2: A column of a random orthogonal matrix is a random unit vector

What does "a uniformly random orthogonal matrix" even mean? An orthogonal matrix is one whose columns are mutually perpendicular unit vectors. The set of all such $d \times d$ matrices forms a group called $O(d)$. We need to pick one "at random" from this set — but it's a continuous set, so we can't just pick each entry independently.

The right notion is: choose $Q$ so that the distribution is invariant under further rotation. That is, if $Q$ is our random matrix and $R$ is any fixed orthogonal matrix, then $RQ$ has exactly the same distribution as $Q$. There's a theorem (due to Haar) that exactly one probability distribution on $O(d)$ has this property. It's called the Haar measure, and it's the natural meaning of "uniformly random rotation."

A concrete way to sample from it: take a $d \times d$ matrix of i.i.d. standard Gaussians and run the Gram-Schmidt process (or QR decomposition) on its columns. The resulting orthogonal matrix is Haar-distributed.

The key consequence for us: by the rotational invariance property, no direction in $\mathbb{R}^d$ is special. Every column of $Q$ is a uniformly random point on the unit sphere $S^{d-1}$. The first column $\mathbf{u} = (Q_{11}, Q_{21}, \ldots, Q_{d1})$ satisfies:

$$\|\mathbf{u}\|^2 = \sum_{i=1}^{d} u_i^2 = 1$$

So our problem becomes: take a point $\mathbf{u}$ uniformly distributed on the unit sphere in $\mathbb{R}^{1000}$, and look at one coordinate $u_i$. What distribution does it follow?

Step 3: Deriving the distribution (where the Beta comes from)

Here's the standard construction. To generate a uniform random point on $S^{d-1}$, take $d$ independent standard Gaussians $g_1, \ldots, g_d \sim \mathcal{N}(0,1)$ and normalize:

$$u_i = \frac{g_i}{\sqrt{g_1^2 + g_2^2 + \cdots + g_d^2}}$$

Now consider the squared coordinate $u_i^2$:

$$u_i^2 = \frac{g_i^2}{g_1^2 + g_2^2 + \cdots + g_d^2} = \frac{g_i^2}{g_i^2 + \underbrace{(g_1^2 + \cdots + g_{i-1}^2 + g_{i+1}^2 + \cdots + g_d^2)}_{\text{sum of } d-1 \text{ other squared Gaussians}}}$$

The numerator $g_i^2 \sim \chi^2(1)$ is a chi-squared random variable with 1 degree of freedom. The remaining sum in the denominator is $\chi^2(d-1)$, independent of $g_i^2$. A classical result in probability says that if $X \sim \chi^2(a)$ and $Y \sim \chi^2(b)$ are independent, then:

$$\frac{X}{X + Y} \sim \operatorname{Beta}\!\left(\frac{a}{2}, \frac{b}{2}\right)$$

Applying this with $a = 1$ and $b = d - 1$:

$$\boxed{u_i^2 \sim \operatorname{Beta}\!\left(\frac{1}{2}, \; \frac{d-1}{2}\right)}$$

This is the exact distribution. For our puzzle with $d = 1000$, each squared coordinate of the unit vector follows $\operatorname{Beta}(1/2, \; 499.5)$.

Step 4: What does this Beta distribution look like? (the magnitude argument)

The mean of $\operatorname{Beta}(\alpha, \beta)$ is $\alpha / (\alpha + \beta)$. So:

$$\mathbb{E}[u_i^2] = \frac{1/2}{1/2 + (d-1)/2} = \frac{1}{d}$$

This makes geometric sense: there are $d$ coordinates, they must share a total squared length of 1, and by symmetry each gets an equal share of $1/d$.

The variance of this Beta is:

$$\operatorname{Var}(u_i^2) = \frac{\alpha \beta}{(\alpha+\beta)^2(\alpha+\beta+1)} = \frac{\frac{1}{2} \cdot \frac{d-1}{2}}{\left(\frac{d}{2}\right)^2 \cdot \frac{d+2}{2}} \approx \frac{2}{d^2} \quad \text{for large } d$$

The standard deviation of $u_i^2$ is $\approx \sqrt{2}/d$. Notice that this is actually larger than the mean $1/d$ by a factor of $\sqrt{2}$ — so any individual $u_i^2$ is not tightly concentrated around its mean. Individual coordinates have real spread.

But here's what matters for TurboQuant: we don't need individual coordinates to be concentrated. We need every coordinate to follow the same known distribution, regardless of the input vector's direction. And that's exactly what the rotation guarantees. The distribution $\operatorname{Beta}(1/2, (d-1)/2)$ depends only on $d$ — not on $\mathbf{v}$. This means we can precompute the optimal quantizer for this distribution once, and it works for every vector we'll ever see.

Now, coming back to our original vector: $(Q\mathbf{v})_i = 1000 \cdot u_i$, so:

$$\mathbb{E}[(Q\mathbf{v})_i^2] = 1000^2 \cdot \mathbb{E}[u_i^2] = 1000^2 \cdot \frac{1}{1000} = 1000$$

The root-mean-square value of each coordinate is $\sqrt{1000} \approx 31.6$. That's the typical magnitude, and that's where the $\|\mathbf{v}\| / \sqrt{d}$ figure comes from. Individual coordinates will vary around this — some larger, some smaller — but they're all drawn from the same predictable distribution.

Step 5: The Gaussian approximation (why it looks normal)

Recall that $u_i = g_i / \sqrt{g_1^2 + \cdots + g_d^2}$. By the law of large numbers, the denominator concentrates:

$$\frac{1}{d}\sum_{j=1}^{d} g_j^2 \;\xrightarrow{\;d \to \infty\;}\; 1 \qquad \text{so} \qquad \sqrt{\sum g_j^2} \approx \sqrt{d}$$

Therefore:

$$u_i \approx \frac{g_i}{\sqrt{d}} \sim \mathcal{N}\!\left(0, \frac{1}{d}\right)$$

And for the rotated vector:

$$(Q\mathbf{v})_i = 1000 \cdot u_i \;\approx\; \mathcal{N}\!\left(0, \;\frac{1000^2}{1000}\right) = \mathcal{N}(0, \; 1000)$$

This is the Gaussian approximation. It becomes accurate as $d$ grows because the denominator $\sqrt{\sum g_j^2}$ has vanishing relative fluctuations: its coefficient of variation is $O(1/\sqrt{d})$. At $d = 1000$, the approximation is already very good.

Step 6: Why are the coordinates nearly independent?

This is the subtlest point. The coordinates of a uniform random unit vector are not exactly independent — they satisfy the constraint $\sum u_i^2 = 1$, which couples them. But the coupling becomes negligible in high dimensions. Here's why.

Consider two coordinates $u_i$ and $u_j$ with $i \neq j$. Their covariance is:

$$\operatorname{Cov}(u_i, u_j) = \mathbb{E}[u_i \, u_j] - \mathbb{E}[u_i]\mathbb{E}[u_j]$$

By symmetry, $\mathbb{E}[u_i] = 0$ (the sphere is symmetric under sign flips). So $\operatorname{Cov}(u_i, u_j) = \mathbb{E}[u_i \, u_j]$.

To compute this, use the Gaussian construction again: $u_i u_j = g_i g_j / \sum g_k^2$. Since $g_i$ and $g_j$ are independent standard Gaussians, $\mathbb{E}[g_i g_j] = 0$. The denominator $\sum g_k^2$ is independent of the product $g_i g_j$ (by a symmetry argument involving rotational invariance of the Gaussian), so:

$$\mathbb{E}[u_i \, u_j] = \mathbb{E}\!\left[\frac{g_i g_j}{\sum g_k^2}\right] = 0$$

Wait — the covariance is exactly zero? Not just small? Almost. The covariance is indeed exactly zero: $\operatorname{Cov}(u_i, u_j) = 0$. But zero covariance doesn't imply independence.

The deeper fact is about higher-order correlations. For example, consider $\operatorname{Cov}(u_i^2, u_j^2)$. To compute this, we need $\mathbb{E}[u_i^2 u_j^2]$. Let's derive it from scratch.

Using the Gaussian construction, $u_i = g_i / \|\mathbf{g}\|$, so:

$$\mathbb{E}[u_i^2 u_j^2] = \mathbb{E}\!\left[\frac{g_i^2 \, g_j^2}{\left(\sum_{k=1}^d g_k^2\right)^2}\right]$$

Since $g_i$ and $g_j$ are independent of each other (and of all other $g_k$'s), the numerator and denominator are not independent — $g_i^2$ and $g_j^2$ both appear in the sum. But we can use a known identity for ratios of Gaussian products. For independent $g_k \sim \mathcal{N}(0,1)$ and distinct indices $i \neq j$:

$$\mathbb{E}\!\left[\frac{g_i^2 \, g_j^2}{\left(\sum_{k=1}^d g_k^2\right)^2}\right] = \frac{\mathbb{E}[g_i^2] \cdot \mathbb{E}[g_j^2]}{d(d+2)} = \frac{1 \cdot 1}{d(d+2)} = \frac{1}{d(d+2)}$$

Where does the $d(d+2)$ come from? It's a standard result about moments of the Dirichlet distribution. The vector $(u_1^2, u_2^2, \ldots, u_d^2)$ follows a symmetric Dirichlet distribution with parameter $\alpha = 1/2$. For a symmetric Dirichlet with parameter $\alpha$ and $d$ components, the second mixed moment is:

$$\mathbb{E}[u_i^2 u_j^2] = \frac{\alpha(\alpha+1) \cdot \alpha(\alpha+1)}{(d\alpha)(d\alpha+1)(d\alpha+2)(d\alpha+3)} \cdot \ldots$$

Actually, let's get there more directly. The trick is to recognize that $(g_1^2, \ldots, g_d^2) / \sum g_k^2$ has the same distribution as a normalized vector of independent $\chi^2(1)$ variables. For such vectors, there's a clean formula: if $X_1, \ldots, X_d$ are i.i.d. $\chi^2(1)$ and $S = \sum X_k$, then for $i \neq j$:

$$\mathbb{E}\!\left[\frac{X_i X_j}{S^2}\right] = \frac{\mathbb{E}[X_i] \cdot \mathbb{E}[X_j]}{\mathbb{E}[S] \cdot (\mathbb{E}[S] + 2)} = \frac{1 \cdot 1}{d \cdot (d + 2)}$$

This follows because $X_i/S$ and $X_j/S$ are marginally $\operatorname{Beta}(1/2, (d-1)/2)$, and their joint distribution is Dirichlet. The factor $(d+2)$ instead of $(d+1)$ comes from the specific moments of the chi-squared: $\mathbb{E}[X^2] = \operatorname{Var}(X) + (\mathbb{E}[X])^2 = 2 + 1 = 3$ for $\chi^2(1)$, and this propagates through the ratio as the "+2" correction.

Now we can compute the covariance. We already know $\mathbb{E}[u_i^2] = 1/d$, so:

$$\operatorname{Cov}(u_i^2, u_j^2) = \mathbb{E}[u_i^2 u_j^2] - \mathbb{E}[u_i^2]\mathbb{E}[u_j^2] = \frac{1}{d(d+2)} - \frac{1}{d^2} = \frac{d - (d+2)}{d^2(d+2)} = -\frac{2}{d^2(d+2)}$$

This is negative (knowing one coordinate is large makes others slightly smaller, by the sphere constraint) but scales as $O(1/d^3)$, while each variance is $O(1/d^2)$. The correlation coefficient is:

$$\rho(u_i^2, u_j^2) = \frac{\operatorname{Cov}(u_i^2, u_j^2)}{\operatorname{Var}(u_i^2)} \approx \frac{-2/d^3}{2/d^2} = -\frac{1}{d}$$

At $d = 1000$, this correlation is $-0.001$. For all practical purposes, the coordinates are independent. This is a manifestation of the concentration of measure phenomenon on high-dimensional spheres: local behavior (a single coordinate) is almost decoupled from the global constraint ($\sum u_i^2 = 1$) because each coordinate carries only a $1/d$ fraction of the total.

Putting it all together

After a random rotation of any vector $\mathbf{v} \in \mathbb{R}^d$:

Each coordinate of $Q\mathbf{v}$ has exactly the distribution $\|\mathbf{v}\| \cdot u_i$ where $u_i^2 \sim \operatorname{Beta}(1/2, (d-1)/2)$.
The typical magnitude is $\|\mathbf{v}\|/\sqrt{d}$, with tight concentration around this value.
For large $d$, each coordinate is well-approximated by $\mathcal{N}(0, \|\mathbf{v}\|^2/d)$.
Distinct coordinates are uncorrelated and, for large $d$, nearly independent (correlations scale as $1/d$).

This is the core mathematical trick behind TurboQuant. If you can make every coordinate look the same — drawn from a known, predictable distribution with near-independence — then you can design an optimal scalar quantizer once and apply it to every coordinate. No need to inspect the data. No need for calibration. Just rotate and quantize.

The Problem: LLMs Are Drowning in Memory

Large language models like Llama, Gemini, and GPT run on a mechanism called attention. Every time the model processes a new token, it computes inner products between the current token's query vector and the key vectors of all previous tokens. The key and value vectors from all past tokens are stored in something called the KV cache.

The KV cache grows linearly with context length. For Llama 3.1 70B processing a 128K-token context, the KV cache alone consumes roughly 40 GB of GPU memory — often more than the model weights themselves.

The question is: can we store these vectors at lower precision without breaking the attention computation?

This is a vector quantization problem, and it has been studied since Shannon's 1959 source coding theory. The challenge is that naive quantization (just round each number to the nearest low-precision value) introduces errors that accumulate as bias in the attention scores. TurboQuant solves this elegantly.

The Three-Paper Arc

TurboQuant didn't appear from nowhere. It's the final piece of a research program spanning three papers:

AAAI 2025 — QJL (Quantized Johnson-Lindenstrauss)
Showed that the sign of a random projection preserves enough angular information to debias dot product estimates at a cost of 1 bit per dimension.

AISTATS 2026 — PolarQuant
Showed that randomly rotating a vector makes each coordinate follow a known Beta distribution, enabling precomputed optimal scalar quantizers with zero calibration.

ICLR 2026 — TurboQuant
Combined PolarQuant (rotation + Lloyd-Max quantization) with QJL (1-bit residual correction) to achieve near-optimal distortion with unbiased inner products.

Let's look at each piece.

Stage 1: PolarQuant — Rotation Makes Everything Predictable

Consider a vector $\mathbf{x} \in \mathbb{R}^d$ that we want to quantize. In a KV cache, $d$ is the head dimension (typically 64–128). The vector could have any shape — some coordinates large, others small, the distribution varying from token to token.

PolarQuant's insight is to apply a random orthogonal rotation first:

$$\tilde{\mathbf{x}} = Q \mathbf{x}$$

where $Q \in \mathbb{R}^{d \times d}$ is a random orthogonal matrix (generated once, shared across all vectors). This is exactly the setup from our puzzle.

Why does rotation help?

After rotation, each coordinate $\tilde{x}_i$ of the transformed vector follows a distribution that depends only on $\|\mathbf{x}\|$ and $d$, not on the direction of $\mathbf{x}$. Specifically:

$$\tilde{x}_i = \|\mathbf{x}\| \cdot z_i$$

where $z_i$ is the $i$-th coordinate of a uniformly random unit vector in $\mathbb{R}^d$. This coordinate follows a shifted and scaled Beta distribution:

$$z_i^2 \;\sim\; \operatorname{Beta}\!\bigl(\tfrac{1}{2}, \; \tfrac{d-1}{2}\bigr)$$

For large $d$, the central limit theorem kicks in and $z_i \approx \mathcal{N}(0, 1/d)$. Moreover, distinct coordinates become nearly independent.

The key consequence: Because the post-rotation distribution is known analytically and does not depend on the input data, we can precompute the optimal quantizer once, before seeing any data. This is what makes TurboQuant data-oblivious — no calibration set, no fine-tuning, no codebook training.

The Lloyd-Max Quantizer

Given a known probability distribution $p(z)$ and a budget of $b$ bits (giving $2^b$ possible output levels), the Lloyd-Max quantizer is the scalar quantizer that minimizes mean squared error (MSE).

It works by solving a one-dimensional optimization: partition the real line into $2^b$ intervals and assign a representative value to each interval, chosen to minimize:

$$\operatorname{MSE} = \int_{-\infty}^{\infty} (z - Q_b(z))^2 \, p(z) \, dz$$

where $Q_b(z)$ maps $z$ to its nearest representative. The optimal partition boundaries and representatives satisfy a pair of necessary conditions (essentially: boundaries are midpoints of adjacent representatives, and representatives are conditional means within their intervals). These can be solved numerically by alternating between the two conditions — which is exactly Lloyd's algorithm (also known as k-means in one dimension).

Because the Beta distribution is known, PolarQuant precomputes the Lloyd-Max quantizer for each target bitwidth once. At runtime, quantizing a coordinate is a table lookup.

This is easier to understand with examples. Let's build up from the simplest case.

Example: 1-bit quantizer for a uniform distribution

Suppose our values are drawn from a uniform distribution on $[-1, 1]$, and we have 1 bit — just 2 possible output levels. Where should we place them?

Lloyd-Max says: partition $[-1, 1]$ into two intervals, and set each representative to the conditional mean within its interval. By symmetry, the boundary is at 0:

-1 0 +1 |------[--x--]--|--[--x--]------| r₁=-0.5 r₂=+0.5 Interval 1: [-1, 0) → output -0.5 Interval 2: [0, +1] → output +0.5

The representative for $[-1, 0)$ is $\mathbb{E}[z \mid z \in [-1, 0)] = -0.5$. Similarly for the other half. Any value in $[-1, 0)$ gets mapped to $-0.5$; any value in $[0, 1]$ gets mapped to $+0.5$.

The MSE is $\mathbb{E}[(z - Q(z))^2] = \frac{1}{12} \approx 0.083$. (Each half contributes the variance of a uniform on an interval of width 1, which is $1/12$.)

Example: 2-bit quantizer for a uniform distribution

Now we have 2 bits — 4 output levels. For the uniform distribution on $[-1, 1]$, Lloyd-Max splits into 4 equal intervals:

-1 -0.5 0 +0.5 +1 |--[x]---|---[x]---|---[x]---|---[x]--| r₁=-0.75 r₂=-0.25 r₃=+0.25 r₄=+0.75 Interval 1: [-1, -0.5) → output -0.75 Interval 2: [-0.5, 0) → output -0.25 Interval 3: [0, +0.5) → output +0.25 Interval 4: [0.5, +1] → output +0.75

MSE = $\frac{1}{48} \approx 0.021$. Going from 1 bit to 2 bits cut the MSE by 4x. This is the $4^{-b}$ scaling: each extra bit reduces MSE by a factor of 4 (for well-behaved distributions).

Example: 2-bit quantizer for a Gaussian — non-uniform intervals

Now the values follow a standard normal $\mathcal{N}(0, 1)$ instead of a uniform. Most values cluster near 0, with rare outliers. If we used equal-width intervals (like above), we'd waste two output levels on the tails where almost no data lives.

Lloyd-Max adapts: it places narrower intervals where the density is high (near 0) and wider intervals in the sparse tails:

density: ___ / \ / \ / \ __/ \__ ____/ \____ ---|-----|---------|---------|-----|--- -∞ -1.51 -0.45 +0.45 +1.51 +∞ Interval 1: (-∞, -0.9816) → output -1.51 Interval 2: [-0.9816, 0) → output -0.45 Interval 3: [0, +0.9816) → output +0.45 Interval 4: [+0.9816, +∞) → output +1.51

Notice the asymmetry in interval widths: the inner intervals (covering the dense center) are narrower ($\approx 0.98$ wide) than the outer intervals (stretching to infinity). This is not something you can get from simple "round to nearest" quantization — it specifically accounts for where the data actually lives.

The MSE for this optimal 2-bit Gaussian quantizer is about 0.2356, compared to the variance of 1.0 — so we capture about 76% of the signal with just 4 levels.

The representatives ($\pm 0.45$, $\pm 1.51$) and boundaries ($\pm 0.98$) are computed once by running Lloyd's algorithm on the known Gaussian density. They never change.

Example: What PolarQuant actually precomputes

In TurboQuant, after rotation, each coordinate follows approximately $\mathcal{N}(0, 1/d)$. Rescaling by $\|\mathbf{x}\|$, the distribution is known. PolarQuant precomputes a Lloyd-Max quantizer for this distribution at each bitwidth.

For a 3-bit quantizer ($2^3 = 8$ levels) on a standard Gaussian, the precomputed table looks roughly like:

Level Interval Representative Probability ----- -------- -------------- ----------- 0 (-∞, -1.748) -2.152 3.6% 1 [-1.748, -1.050) -1.344 6.4% 2 [-1.050, -0.501) -0.756 12.7% 3 [-0.501, 0) -0.245 19.2% 4 [ 0, +0.501) +0.245 19.2% 5 [+0.501, +1.050) +0.756 12.7% 6 [+1.050, +1.748) +1.344 6.4% 7 [+1.748, +∞) +2.152 3.6%

At runtime, quantizing one coordinate means: compare it against 7 boundaries, pick the matching interval, store a 3-bit index (0–7). Dequantizing means: look up the representative from a table of 8 values. No floating-point arithmetic, no per-vector statistics, no calibration. The boundaries and representatives are baked in at compile time.

This is what makes TurboQuant fast: the entire quantization step is 7 comparisons and a table index per coordinate.

Eliminating normalization overhead

A subtle but important advantage: traditional quantization schemes (like min-max or absmax) need to store per-block scale factors, typically adding 1–2 extra bits per coordinate of overhead. PolarQuant eliminates this entirely. Since the distribution after rotation is fixed, the quantizer boundaries are fixed. No per-vector metadata is needed.

The only thing we need to store alongside the quantized coordinates is $\|\mathbf{x}\|$ (a single scalar per vector) to reconstruct the scale. This is negligible overhead.

Stage 2: QJL — Fixing the Bias with One Bit

Stage 1 gives us excellent MSE. But there's a problem that MSE doesn't capture.

In attention, we don't care about reconstructing the vector itself — we care about inner products. The attention score between a query $\mathbf{q}$ and a key $\mathbf{k}$ is $\langle \mathbf{q}, \mathbf{k} \rangle$. If we quantize $\mathbf{k}$ to $\hat{\mathbf{k}}$, we need $\langle \mathbf{q}, \hat{\mathbf{k}} \rangle \approx \langle \mathbf{q}, \mathbf{k} \rangle$.

The problem is that MSE-optimal quantizers introduce systematic bias in inner product estimates. The quantization error $\mathbf{e} = \mathbf{k} - \hat{\mathbf{k}}$ is correlated with $\hat{\mathbf{k}}$, which means:

$$\mathbb{E}[\langle \mathbf{q}, \hat{\mathbf{k}} \rangle] \neq \langle \mathbf{q}, \mathbf{k} \rangle$$

This bias accumulates across tokens in the attention softmax, silently degrading output quality. TurboQuant fixes this with a 1-bit correction based on the Johnson-Lindenstrauss lemma.

The Johnson-Lindenstrauss Lemma

The JL lemma (1984) is one of the most useful results in high-dimensional geometry. It says:

JL Lemma: For any set of $n$ points in $\mathbb{R}^d$, there exists a linear map $A: \mathbb{R}^d \to \mathbb{R}^k$ with $k = O(\log n / \varepsilon^2)$ that preserves all pairwise distances within a factor of $(1 \pm \varepsilon)$. Moreover, a random Gaussian matrix works with high probability.

In other words: random projections preserve geometry, even when you're projecting into a much lower-dimensional space.

How QJL corrects the bias

Let $\mathbf{e} = \mathbf{k} - \hat{\mathbf{k}}$ be the residual error from Stage 1. We want to estimate $\langle \mathbf{q}, \mathbf{e} \rangle$ — the part of the inner product lost to quantization — so we can add it back. But we can't store $\mathbf{e}$ itself (that would defeat the purpose of compression). QJL stores a 1-bit sketch of $\mathbf{e}$ instead.

Step 1: Storing the sketch (1 bit per projection)

Fix $m$ random Gaussian vectors $\mathbf{g}_1, \ldots, \mathbf{g}_m \in \mathbb{R}^d$, drawn once and shared across all keys. For each key's residual $\mathbf{e}$, store:

$$s_i = \operatorname{sign}(\langle \mathbf{g}_i, \mathbf{e} \rangle) \in \{-1, +1\} \quad \text{for } i = 1, \ldots, m$$

Each $s_i$ is a single bit. The inner product $\langle \mathbf{g}_i, \mathbf{e} \rangle$ is a real number, but we only keep its sign. This feels like throwing away almost all the information — how can a sign bit help estimate an inner product?

Step 2: Where $\pi$ comes from — the sign-angle identity

The key mathematical fact is a classical result about Gaussian random projections. For any two fixed vectors $\mathbf{a}, \mathbf{b} \in \mathbb{R}^d$ and a random Gaussian vector $\mathbf{g} \sim \mathcal{N}(\mathbf{0}, I_d)$:

$$\mathbb{E}\big[\operatorname{sign}(\langle \mathbf{g}, \mathbf{a} \rangle) \cdot \langle \mathbf{g}, \mathbf{b} \rangle\big] = \sqrt{\frac{2}{\pi}} \, \|\mathbf{a}\| \cos\theta$$

Wait — let me derive this from scratch so the $\pi$ doesn't feel like magic.

Consider the simpler 1D case first. Let $g \sim \mathcal{N}(0,1)$. What is $\mathbb{E}[\operatorname{sign}(g) \cdot g]$?

$$\mathbb{E}[\operatorname{sign}(g) \cdot g] = \mathbb{E}[|g|] = \int_{-\infty}^{\infty} |g| \cdot \frac{1}{\sqrt{2\pi}} e^{-g^2/2} \, dg = 2 \int_{0}^{\infty} g \cdot \frac{1}{\sqrt{2\pi}} e^{-g^2/2} \, dg$$

The integral $\int_0^\infty g \, e^{-g^2/2} \, dg = 1$ (substitute $u = g^2/2$), so:

$$\mathbb{E}[|g|] = \frac{2}{\sqrt{2\pi}} = \sqrt{\frac{2}{\pi}}$$

That's where the $\sqrt{2/\pi}$ comes from — it's the mean of the half-normal distribution. It arises because we're taking the absolute value (or equivalently, multiplying by the sign) of a Gaussian.

Now generalize to $\mathbb{R}^d$. Let $\mathbf{g} \sim \mathcal{N}(\mathbf{0}, I_d)$, and let $\theta$ be the angle between fixed vectors $\mathbf{e}$ and $\mathbf{q}$. We can decompose $\mathbf{g}$ into a component along $\mathbf{e}$ and a component perpendicular to $\mathbf{e}$. The projection $\langle \mathbf{g}, \mathbf{e} \rangle / \|\mathbf{e}\|$ is a scalar Gaussian, and the sign of $\langle \mathbf{g}, \mathbf{e} \rangle$ depends only on this component. Meanwhile, $\langle \mathbf{g}, \mathbf{q} \rangle$ decomposes as:

$$\langle \mathbf{g}, \mathbf{q} \rangle = \underbrace{\langle \mathbf{g}, \mathbf{q}_\parallel \rangle}_{\text{component along } \mathbf{e}} + \underbrace{\langle \mathbf{g}, \mathbf{q}_\perp \rangle}_{\text{component } \perp \text{ to } \mathbf{e}}$$

where $\mathbf{q}_\parallel = \frac{\langle \mathbf{q}, \mathbf{e} \rangle}{\|\mathbf{e}\|^2} \mathbf{e}$ is the projection of $\mathbf{q}$ onto $\mathbf{e}$, and $\mathbf{q}_\perp = \mathbf{q} - \mathbf{q}_\parallel$.

The perpendicular part $\langle \mathbf{g}, \mathbf{q}_\perp \rangle$ is independent of $\operatorname{sign}(\langle \mathbf{g}, \mathbf{e} \rangle)$ (because $\mathbf{q}_\perp$ is orthogonal to $\mathbf{e}$, so this projection involves different Gaussian components). Therefore its contribution vanishes in expectation:

$$\mathbb{E}\big[\operatorname{sign}(\langle \mathbf{g}, \mathbf{e} \rangle) \cdot \langle \mathbf{g}, \mathbf{q}_\perp \rangle\big] = 0$$

For the parallel part, let $\alpha = \langle \mathbf{g}, \mathbf{e} \rangle / \|\mathbf{e}\|$ (a standard Gaussian scalar). Then:

$$\langle \mathbf{g}, \mathbf{q}_\parallel \rangle = \frac{\langle \mathbf{q}, \mathbf{e} \rangle}{\|\mathbf{e}\|} \cdot \alpha$$

and:

$$\mathbb{E}\big[\operatorname{sign}(\langle \mathbf{g}, \mathbf{e} \rangle) \cdot \langle \mathbf{g}, \mathbf{q}_\parallel \rangle\big] = \frac{\langle \mathbf{q}, \mathbf{e} \rangle}{\|\mathbf{e}\|} \cdot \mathbb{E}[\operatorname{sign}(\alpha) \cdot \alpha] = \frac{\langle \mathbf{q}, \mathbf{e} \rangle}{\|\mathbf{e}\|} \cdot \sqrt{\frac{2}{\pi}}$$

Combining the parallel and perpendicular parts:

The sign-projection identity: $$\mathbb{E}\big[\operatorname{sign}(\langle \mathbf{g}, \mathbf{e} \rangle) \cdot \langle \mathbf{g}, \mathbf{q} \rangle\big] = \sqrt{\frac{2}{\pi}} \cdot \frac{\langle \mathbf{q}, \mathbf{e} \rangle}{\|\mathbf{e}\|}$$ This holds for any fixed $\mathbf{e}, \mathbf{q}$ and random $\mathbf{g} \sim \mathcal{N}(\mathbf{0}, I)$. The $\sqrt{2/\pi}$ is the mean absolute value of a standard Gaussian.

Step 3: Building the unbiased estimator

The identity above tells us that $\operatorname{sign}(\langle \mathbf{g}, \mathbf{e} \rangle) \cdot \langle \mathbf{g}, \mathbf{q} \rangle$ has expected value proportional to $\langle \mathbf{q}, \mathbf{e} \rangle$, but scaled by a factor of $\sqrt{2/\pi} / \|\mathbf{e}\|$. To get an unbiased estimator of $\langle \mathbf{q}, \mathbf{e} \rangle$ itself, we need to undo this scaling.

If we knew $\|\mathbf{e}\|$, the estimator from a single projection would be:

$$\hat{T}_i = \sqrt{\frac{\pi}{2}} \cdot \|\mathbf{e}\| \cdot s_i \cdot \langle \mathbf{g}_i, \mathbf{q} \rangle$$

and by the identity: $\mathbb{E}[\hat{T}_i] = \sqrt{\frac{\pi}{2}} \cdot \|\mathbf{e}\| \cdot \sqrt{\frac{2}{\pi}} \cdot \frac{\langle \mathbf{q}, \mathbf{e} \rangle}{\|\mathbf{e}\|} = \langle \mathbf{q}, \mathbf{e} \rangle$. Unbiased.

In practice, TurboQuant stores $\|\mathbf{e}\|$ alongside the sketch (it's a single scalar per vector, negligible overhead). Averaging over $m$ independent projections reduces variance:

$$\widehat{\langle \mathbf{q}, \mathbf{e} \rangle} = \frac{1}{m} \sum_{i=1}^{m} \sqrt{\frac{\pi}{2}} \cdot \|\mathbf{e}\| \cdot s_i \cdot \langle \mathbf{g}_i, \mathbf{q} \rangle$$

This is sometimes written more compactly as $\frac{\pi}{2m} \sum s_i \langle \mathbf{g}_i, \mathbf{q} \rangle$ (absorbing $\|\mathbf{e}\|$ into the normalization of $\mathbf{g}_i$), which is the formula you see in the paper.

Step 4: Proving unbiasedness explicitly

Let's verify this end to end. The estimator for the full inner product is:

$$\widehat{\langle \mathbf{q}, \mathbf{k} \rangle} = \underbrace{\langle \mathbf{q}, \hat{\mathbf{k}} \rangle}_{\text{from Stage 1}} + \underbrace{\widehat{\langle \mathbf{q}, \mathbf{e} \rangle}}_{\text{QJL correction}}$$

Taking expectations (over the randomness of the Gaussian vectors $\mathbf{g}_i$):

$$\mathbb{E}\big[\widehat{\langle \mathbf{q}, \mathbf{k} \rangle}\big] = \langle \mathbf{q}, \hat{\mathbf{k}} \rangle + \mathbb{E}\big[\widehat{\langle \mathbf{q}, \mathbf{e} \rangle}\big] = \langle \mathbf{q}, \hat{\mathbf{k}} \rangle + \langle \mathbf{q}, \mathbf{e} \rangle$$

Now use $\mathbf{e} = \mathbf{k} - \hat{\mathbf{k}}$:

$$= \langle \mathbf{q}, \hat{\mathbf{k}} \rangle + \langle \mathbf{q}, \mathbf{k} - \hat{\mathbf{k}} \rangle = \langle \mathbf{q}, \hat{\mathbf{k}} \rangle + \langle \mathbf{q}, \mathbf{k} \rangle - \langle \mathbf{q}, \hat{\mathbf{k}} \rangle = \langle \mathbf{q}, \mathbf{k} \rangle$$

The $\langle \mathbf{q}, \hat{\mathbf{k}} \rangle$ terms cancel perfectly. The estimator recovers the true inner product in expectation, regardless of how large the quantization error $\mathbf{e}$ is.

Step 5: A concrete numerical example

Example: QJL correction in 2D

Let's trace through this with small numbers to see it working.

Setup: Dimension $d = 2$. Suppose:

True key: $\mathbf{k} = (3.0, \; 4.0)$
Quantized key: $\hat{\mathbf{k}} = (2.5, \; 4.5)$ (Stage 1 output)
Residual: $\mathbf{e} = \mathbf{k} - \hat{\mathbf{k}} = (0.5, \; -0.5)$
Query: $\mathbf{q} = (1.0, \; 2.0)$

True inner product: $\langle \mathbf{q}, \mathbf{k} \rangle = 1 \cdot 3 + 2 \cdot 4 = 11.0$

Naive (Stage 1 only): $\langle \mathbf{q}, \hat{\mathbf{k}} \rangle = 1 \cdot 2.5 + 2 \cdot 4.5 = 11.5$

Error without correction: $11.5 - 11.0 = 0.5$ (this is the bias)

Lost inner product: $\langle \mathbf{q}, \mathbf{e} \rangle = 1 \cdot 0.5 + 2 \cdot (-0.5) = -0.5$

QJL correction with $m = 3$ projections:

Draw 3 random Gaussian vectors and compute the sketch:

$\mathbf{g}_1 = (0.8, \; -0.3)$: $\langle \mathbf{g}_1, \mathbf{e} \rangle = 0.4 + 0.15 = 0.55$, so $s_1 = +1$
$\mathbf{g}_2 = (-1.2, \; 0.7)$: $\langle \mathbf{g}_2, \mathbf{e} \rangle = -0.6 - 0.35 = -0.95$, so $s_2 = -1$
$\mathbf{g}_3 = (0.1, \; 1.5)$: $\langle \mathbf{g}_3, \mathbf{e} \rangle = 0.05 - 0.75 = -0.70$, so $s_3 = -1$

Store only: $s_1 = +1, \; s_2 = -1, \; s_3 = -1$ and $\|\mathbf{e}\| = \sqrt{0.25 + 0.25} = \frac{1}{\sqrt{2}} \approx 0.707$.

At query time, compute $\langle \mathbf{g}_i, \mathbf{q} \rangle$ for each:

$\langle \mathbf{g}_1, \mathbf{q} \rangle = 0.8 + (-0.6) = 0.2$
$\langle \mathbf{g}_2, \mathbf{q} \rangle = -1.2 + 1.4 = 0.2$
$\langle \mathbf{g}_3, \mathbf{q} \rangle = 0.1 + 3.0 = 3.1$

QJL estimate of $\langle \mathbf{q}, \mathbf{e} \rangle$:

$$\widehat{\langle \mathbf{q}, \mathbf{e} \rangle} = \frac{\sqrt{\pi/2} \cdot \|\mathbf{e}\|}{m} \sum_{i} s_i \langle \mathbf{g}_i, \mathbf{q} \rangle = \frac{1.253 \cdot 0.707}{3} \big[(+1)(0.2) + (-1)(0.2) + (-1)(3.1)\big]$$ $$= \frac{0.886}{3} \cdot (0.2 - 0.2 - 3.1) = 0.295 \times (-3.1) = -0.915$$

Corrected estimate: $11.5 + (-0.915) = 10.585$

True value: $11.0$

The correction moved us from an error of $+0.5$ to an error of $-0.415$. On this single draw, the correction overshot — but that's because $m = 3$ is very small and we got unlucky with $\mathbf{g}_3$. In expectation (averaged over many random draws of the $\mathbf{g}_i$), the estimate would be exactly 11.0. With a realistic $m$ (say 64 or 128, matching the head dimension), the variance shrinks dramatically and the estimate would be very close to 11.0 on any single draw.

Step 6: Why the two stages complement each other

Now we can see precisely how the two stages work together:

Stage 1 (PolarQuant) makes $\|\mathbf{e}\|$ small. The better the Lloyd-Max quantizer, the smaller the residual. This directly reduces the variance of the QJL estimator (which scales with $\|\mathbf{e}\|^2$).
Stage 2 (QJL) makes the estimate unbiased. No matter how large or small $\mathbf{e}$ is, the correction recovers the true inner product in expectation.

Without Stage 1, the residual would be large and the QJL estimator would have high variance. Without Stage 2, the quantized inner product would have low variance but persistent bias. Together: low variance and zero bias.

The total inner product estimate is:

$$\langle \mathbf{q}, \mathbf{k} \rangle \;\approx\; \langle \mathbf{q}, \hat{\mathbf{k}} \rangle \;+\; \frac{1}{m}\sum_{i=1}^{m} \sqrt{\frac{\pi}{2}} \cdot \|\mathbf{e}\| \cdot s_i \cdot \langle \mathbf{g}_i, \mathbf{q} \rangle$$

The Full TurboQuant Pipeline

Input vector x ∈ ℝᵈ | v [Random Rotation] x̃ = Qx | v [Lloyd-Max Quantize] x̂ = decode(quantize(x̃)) ... b bits/coord | v [Compute Residual] e = x̃ - x̂ | v [QJL 1-bit Sketch] sᵢ = sign(⟨gᵢ, e⟩) ... 1 bit/coord | v Store: b-bit codes + 1-bit sketch + ∥x∥ At query time: ⟨q, k⟩ ≈ ⟨q, k̂⟩ + (π/2m) Σ sᵢ · ⟨gᵢ, q⟩

TurboQuant compresses each KV vector in two stages: rotation + scalar quantization (PolarQuant), then 1-bit residual correction (QJL).

The total storage per coordinate is $b + 1$ bits (e.g., 3 bits PolarQuant + 1 bit QJL = 4 bits total, compared to 16 bits for FP16). This gives the 4–6x memory reduction.

How Good Is This? The Information-Theoretic Limit

Now for the punchline. How close is TurboQuant to the best possible?

Shannon's rate-distortion theory provides a hard lower bound: for any $b$-bit quantizer operating on unit-norm vectors in $\mathbb{R}^d$, the minimum achievable MSE scales as:

$$\operatorname{MSE}_{\min} \;\geq\; C \cdot 4^{-b}$$

for some constant $C$. No quantizer — no matter how clever, how much data it sees, or how much computation it uses — can beat this bound.

The TurboQuant paper (Theorem 3 in arXiv:2504.19874) proves that TurboQuant achieves:

$$\operatorname{MSE}_{\mathrm{TurboQuant}} \;\leq\; \frac{\sqrt{3\pi}}{2} \cdot 4^{-b} \;\approx\; 2.7 \cdot 4^{-b}$$

TurboQuant is within a factor of ~2.7 of the information-theoretic limit at every bitwidth.

This factor is constant — it does not grow with $d$ or $b$.

To put this in perspective: no future algorithm can improve on TurboQuant by more than 2.7x in MSE, regardless of approach. The remaining gap is between the Lloyd-Max scalar quantizer (optimal for scalar quantization of a known distribution) and the Shannon bound for full vector quantization (which requires exponential codebook searches in theory).

Inner product guarantee

For the inner product estimator (which is what attention actually uses), TurboQuant provides:

Unbiased: $\mathbb{E}[\langle \mathbf{q}, \hat{\mathbf{k}} \rangle_{\mathrm{TQ}}] = \langle \mathbf{q}, \mathbf{k} \rangle$
Variance bound: $\operatorname{Var}[\langle \mathbf{q}, \hat{\mathbf{k}} \rangle_{\mathrm{TQ}}] \leq \frac{\sqrt{3\pi}}{2} \cdot \frac{\|\mathbf{q}\|^2}{d} \cdot 4^{-b}$

The variance shrinks exponentially with the bit budget and inversely with the dimension. For typical LLM head dimensions ($d = 128$) and 3–4 bit quantization, this variance is tiny.

What This Looks Like in Practice

Configuration	Bits per value	KV cache size (Llama 70B, 128K ctx)	Quality
FP16 (baseline)	16	~40 GB	Full
TurboQuant 4-bit	4	~10 GB	Matches FP16
TurboQuant 3.5-bit	3.5	~8.75 GB	Quality-neutral
TurboQuant 3-bit	3	~7.5 GB	Near quality-neutral
TurboQuant 2.5-bit	2.5	~6.25 GB	Marginal degradation

At 4-bit precision, TurboQuant also delivers up to 8x speedup for attention logit computation on H100 GPUs, since lower-precision arithmetic is faster.

When benchmarked on long-context tasks (LongBench, Needle-in-a-Haystack, RULER), 3.5-bit TurboQuant is statistically indistinguishable from FP16. The algorithm achieves this without any training, fine-tuning, or calibration data.

The Deeper Puzzle: Why Is This So Close to Optimal?

The factor of 2.7 deserves a closer look. Where does it come from, and why can't we close the gap?

The gap exists because TurboQuant uses scalar quantization (quantizing each coordinate independently) rather than vector quantization (quantizing all coordinates jointly). Scalar quantization is fast — $O(d)$ per vector — but it treats each coordinate in isolation, missing correlations between coordinates.

Full vector quantization could, in theory, exploit these correlations to achieve lower distortion. But it requires a codebook of size $2^{bd}$ and search time exponential in $d$. For $d = 128$ and $b = 3$, that's $2^{384}$ codewords — a number larger than the atoms in the observable universe.

A Question to Sit With

The random rotation in PolarQuant makes coordinates nearly independent in high dimensions. But "nearly" isn't "exactly." There are residual correlations that the scalar quantizer ignores.

Is there a tractable quantization scheme — one that runs in polynomial time — that could exploit these residual correlations to close the 2.7x gap? Or is there a computational lower bound that says this gap is inherent for efficient algorithms?

As far as I know, this question is open.

Why This Matters Beyond LLMs

TurboQuant is framed as a KV cache trick, but the mathematics applies to any setting where you need to store high-dimensional vectors compactly while preserving inner products:

Vector databases (Pinecone, Weaviate, etc.): similarity search at lower memory cost
Embedding storage: compress sentence/image embeddings for retrieval
Streaming quantization: the algorithm is online — each vector is quantized independently with no look-ahead, making it suitable for real-time applications
Federated learning: compress gradient updates for communication-efficient training

The data-oblivious property is particularly valuable: you don't need to retrain or recalibrate when the data distribution shifts.

What's Left on the Table

TurboQuant is close to optimal for compressing the data as-is. But the authors note that most future gains in LLM memory efficiency will likely come from changing what data needs to be stored in the first place:

Multi-query attention / grouped-query attention: share key-value heads across query heads, reducing the number of vectors stored
Sliding window attention: don't store the full context, just a recent window
Architectural innovations: models designed from the start for efficient KV storage

Compression is approaching the theoretical wall. Architecture is where the headroom is.

Summary

TurboQuant combines three clean mathematical ideas:

Random rotation makes every coordinate look the same (PolarQuant insight).
Lloyd-Max quantization exploits the known distribution to minimize MSE with zero calibration.
1-bit JL sketch of the residual eliminates inner product bias (QJL insight).

The result is within 2.7x of the information-theoretic limit, requires no training data, and runs fast enough for real-time inference. It's a nice example of how theoretical tools from high-dimensional geometry (JL lemma), information theory (rate-distortion bounds), and signal processing (Lloyd-Max quantization) combine to solve a practical engineering problem.

Paper: arXiv:2504.19874 — Zandieh, Daliri, Hadian, Mirrokni (2025). Accepted at ICLR 2026.

Appendix: Mathematical Background

This blog uses several probability distributions and their interrelationships. If you're rusty on any of them, this appendix walks through each one from scratch with worked examples that connect directly to the arguments above.

A.1 — The Chi-Squared Distribution

Definition. If $g_1, g_2, \ldots, g_k$ are independent standard normal random variables ($g_i \sim \mathcal{N}(0,1)$), then the sum of their squares $$X = g_1^2 + g_2^2 + \cdots + g_k^2$$ follows a chi-squared distribution with $k$ degrees of freedom, written $X \sim \chi^2(k)$.

Key properties:

Mean: $\mathbb{E}[X] = k$
Variance: $\operatorname{Var}(X) = 2k$
The PDF is $f(x) = \frac{1}{2^{k/2}\,\Gamma(k/2)} \, x^{k/2-1} \, e^{-x/2}$ for $x > 0$.
Additive: if $X \sim \chi^2(a)$ and $Y \sim \chi^2(b)$ are independent, then $X + Y \sim \chi^2(a+b)$.

Special case used in this blog: A single squared Gaussian $g^2$ where $g \sim \mathcal{N}(0,1)$ is $\chi^2(1)$. Its mean is 1 and its variance is 2. This is the building block for everything that follows.

Worked Example A.1: Sum of squared Gaussians

Problem: Let $g_1, \ldots, g_{1000} \sim \mathcal{N}(0,1)$ independently. What is the distribution of $S = \sum_{i=1}^{1000} g_i^2$? What are its mean and standard deviation? How tightly is it concentrated?

Solution:

$S \sim \chi^2(1000)$
$\mathbb{E}[S] = 1000$
$\operatorname{Var}(S) = 2 \times 1000 = 2000$, so $\operatorname{SD}(S) = \sqrt{2000} \approx 44.7$
Coefficient of variation: $\operatorname{SD}/\operatorname{Mean} = 44.7 / 1000 \approx 0.045$

The relative fluctuation is under 5%. This is why, in Step 5 of the puzzle solution, we could approximate $\sqrt{\sum g_j^2} \approx \sqrt{d}$ — the sum $\sum g_j^2$ is tightly concentrated around its mean $d = 1000$.

Worked Example A.2: Moments of $\chi^2(1)$

Problem: If $X \sim \chi^2(1)$ (i.e., $X = g^2$ for $g \sim \mathcal{N}(0,1)$), compute $\mathbb{E}[X]$, $\mathbb{E}[X^2]$, and $\operatorname{Var}(X)$.

Solution:

$\mathbb{E}[X] = \mathbb{E}[g^2] = 1$ (since $g$ has mean 0 and variance 1).
$\mathbb{E}[X^2] = \mathbb{E}[g^4]$. For a standard normal, $\mathbb{E}[g^4] = 3$ (this is the fourth moment, equal to $3!! = 3 \cdot 1 = 3$).
$\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = 3 - 1 = 2$. This matches the general formula $\operatorname{Var}(\chi^2(k)) = 2k$ with $k = 1$.

This is exactly the calculation that produces the "$+2$" in the $d(d+2)$ denominator in Step 6 of the puzzle solution. The second moment of $\chi^2(1)$ being 3 (not 1) is what creates that correction term.

A.2 — The Beta Distribution

Definition. A random variable $Z$ follows a Beta distribution with parameters $\alpha > 0$ and $\beta > 0$, written $Z \sim \operatorname{Beta}(\alpha, \beta)$, if it takes values in $[0, 1]$ with PDF: $$f(z) = \frac{z^{\alpha-1}(1-z)^{\beta-1}}{B(\alpha, \beta)} \quad \text{for } z \in [0,1]$$ where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$ is the beta function.

Key properties:

Mean: $\displaystyle\mathbb{E}[Z] = \frac{\alpha}{\alpha + \beta}$
Variance: $\displaystyle\operatorname{Var}(Z) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$
Mode (for $\alpha, \beta > 1$): $\displaystyle\frac{\alpha - 1}{\alpha + \beta - 2}$
If $\alpha = \beta$, the distribution is symmetric around $1/2$.
If $\alpha < 1$ and $\beta \gg 1$, the distribution is concentrated near 0 with a right skew.

Where it appears in this blog: The squared coordinate $u_i^2$ of a uniform random unit vector follows $\operatorname{Beta}(1/2, (d-1)/2)$. This has $\alpha = 1/2 < 1$, so the PDF diverges at 0 — most of the probability mass is near zero, which makes sense because each coordinate of a high-dimensional unit vector is small.

Connection: Chi-squared → Beta. This is the key link used in Step 3 of the puzzle solution. If $X \sim \chi^2(a)$ and $Y \sim \chi^2(b)$ are independent, then: $$\frac{X}{X + Y} \sim \operatorname{Beta}\!\left(\frac{a}{2}, \frac{b}{2}\right)$$ In words: the fraction of a total chi-squared sum attributable to one part follows a Beta distribution. The proof goes through the change-of-variables formula for joint densities, transforming $(X, Y) \to (X/(X+Y),\; X+Y)$.

Worked Example A.3: The Beta distribution of a single coordinate

Problem: A uniform random unit vector $\mathbf{u}$ on the sphere in $\mathbb{R}^5$. What is the exact distribution of $u_1^2$? Compute its mean and variance.

Solution: Using the Gaussian construction, $u_1 = g_1 / \|\mathbf{g}\|$, so:

$$u_1^2 = \frac{g_1^2}{g_1^2 + g_2^2 + g_3^2 + g_4^2 + g_5^2} = \frac{X}{X + Y}$$

where $X = g_1^2 \sim \chi^2(1)$ and $Y = g_2^2 + \cdots + g_5^2 \sim \chi^2(4)$, independent of $X$.

By the chi-squared-to-Beta connection:

$$u_1^2 \sim \operatorname{Beta}\!\left(\frac{1}{2}, \frac{4}{2}\right) = \operatorname{Beta}\!\left(\frac{1}{2}, 2\right)$$

Now compute moments:

$\mathbb{E}[u_1^2] = \dfrac{1/2}{1/2 + 2} = \dfrac{1/2}{5/2} = \dfrac{1}{5}$. Checks out: $d = 5$, and by symmetry each of the 5 coordinates gets $1/5$ of the squared norm.
$\operatorname{Var}(u_1^2) = \dfrac{(1/2)(2)}{(5/2)^2(5/2 + 1)} = \dfrac{1}{(25/4)(7/2)} = \dfrac{1}{87.5} = \dfrac{4}{350} = \dfrac{2}{175}$
$\operatorname{SD}(u_1^2) = \sqrt{2/175} \approx 0.107$. The mean is $0.2$, so the coefficient of variation is about 0.53 — substantial spread, as discussed in Step 4.

Worked Example A.4: Seeing the Gaussian approximation emerge

Problem: For $d = 5$ vs. $d = 1000$, compare the Beta distribution of $u_1^2$ to the Gaussian approximation.

Solution: The coordinate $u_1$ (not squared) has mean 0 and variance $1/d$. The Gaussian approximation says $u_1 \approx \mathcal{N}(0, 1/d)$.

For $d = 5$: $u_1 \approx \mathcal{N}(0, 0.2)$. But the actual distribution of $u_1$ has support only on $[-1, 1]$ (it's a coordinate of a unit vector), and the density has noticeable curvature at the tails. The Gaussian leaks probability outside $[-1, 1]$. The approximation is rough.

For $d = 1000$: $u_1 \approx \mathcal{N}(0, 0.001)$, so $\operatorname{SD}(u_1) \approx 0.032$. Almost all probability mass sits in $[-0.1, 0.1]$, far from the boundary at $\pm 1$. The tails of the Gaussian that leak outside $[-1, 1]$ have negligible mass ($< 10^{-200}$). The approximation is excellent.

This is why PolarQuant works well at LLM head dimensions ($d = 64$ to 128) — large enough that the Gaussian approximation to the Beta is already very accurate, and the precomputed Lloyd-Max quantizer designed for this distribution performs nearly optimally.

A.3 — The Dirichlet Distribution

Definition. A random vector $(Z_1, Z_2, \ldots, Z_d)$ follows a Dirichlet distribution with parameters $(\alpha_1, \alpha_2, \ldots, \alpha_d)$, written $\operatorname{Dir}(\alpha_1, \ldots, \alpha_d)$, if:

Each $Z_i \geq 0$
$\sum_{i=1}^{d} Z_i = 1$ (the components live on the probability simplex)
The joint PDF is $\displaystyle f(z_1, \ldots, z_d) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_d)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_d)} \prod_{i=1}^{d} z_i^{\alpha_i - 1}$

When all $\alpha_i = \alpha$, this is a symmetric Dirichlet with parameter $\alpha$.

Key properties:

Mean of each component: $\displaystyle\mathbb{E}[Z_i] = \frac{\alpha_i}{\alpha_0}$ where $\alpha_0 = \sum \alpha_j$
Variance: $\displaystyle\operatorname{Var}(Z_i) = \frac{\alpha_i(\alpha_0 - \alpha_i)}{\alpha_0^2(\alpha_0 + 1)}$
Covariance (for $i \neq j$): $\displaystyle\operatorname{Cov}(Z_i, Z_j) = \frac{-\alpha_i \alpha_j}{\alpha_0^2(\alpha_0 + 1)}$
Each marginal is Beta: $Z_i \sim \operatorname{Beta}(\alpha_i, \alpha_0 - \alpha_i)$

Connection: Chi-squared → Dirichlet → Beta. If $X_1, \ldots, X_d$ are independent with $X_i \sim \chi^2(2\alpha_i)$, and $S = \sum X_i$, then: $$\left(\frac{X_1}{S}, \frac{X_2}{S}, \ldots, \frac{X_d}{S}\right) \sim \operatorname{Dir}(\alpha_1, \ldots, \alpha_d)$$ The Beta distribution is the special case $d = 2$: splitting a total into two parts gives a single ratio that's Beta. The Dirichlet generalizes this to splitting a total into $d$ parts simultaneously.

Where it appears in this blog: The vector of squared coordinates $(u_1^2, u_2^2, \ldots, u_d^2)$ of a uniform random unit vector follows:

$$\operatorname{Dir}\!\left(\frac{1}{2}, \frac{1}{2}, \ldots, \frac{1}{2}\right) \quad \text{(symmetric Dirichlet with } \alpha = 1/2 \text{)}$$

This is because $u_i^2 = g_i^2 / \sum g_k^2$, and each $g_i^2 \sim \chi^2(1)$, and $\chi^2(1) = \chi^2(2 \times 1/2)$, so the construction above applies with $\alpha_i = 1/2$ for all $i$.

Worked Example A.5: Dirichlet moments and the $d(d+2)$ formula

Problem: For $(Z_1, \ldots, Z_d) \sim \operatorname{Dir}(1/2, \ldots, 1/2)$ (the distribution of squared coordinates of a random unit vector), compute $\mathbb{E}[Z_i Z_j]$ for $i \neq j$.

Solution: We use the general Dirichlet second moment formula. For a Dirichlet with parameters $(\alpha_1, \ldots, \alpha_d)$ and $\alpha_0 = \sum \alpha_k$:

$$\mathbb{E}[Z_i Z_j] = \frac{\alpha_i \alpha_j}{\alpha_0(\alpha_0 + 1)} \quad \text{for } i \neq j$$

Let's derive this. We have:

$$\mathbb{E}[Z_i Z_j] = \mathbb{E}[Z_i \, \mathbb{E}[Z_j \mid Z_i]]$$

But it's easier to go through the chi-squared construction. Let $X_k \sim \chi^2(2\alpha_k)$ independently, $S = \sum X_k$, and $Z_k = X_k/S$. Then:

$$\mathbb{E}[Z_i Z_j] = \mathbb{E}\!\left[\frac{X_i X_j}{S^2}\right]$$

Since $X_i$ and $X_j$ are independent of each other (and of the rest), we can condition on $S$. But actually, the cleanest approach uses the identity: for independent Gamma variables $X_i \sim \text{Gamma}(\alpha_i, 1)$, with $S = \sum X_k$:

$$\mathbb{E}\!\left[\frac{X_i X_j}{S^2}\right] = \frac{\mathbb{E}[X_i]\,\mathbb{E}[X_j]}{\mathbb{E}[S]\,(\mathbb{E}[S]+1)} = \frac{\alpha_i \, \alpha_j}{\alpha_0(\alpha_0+1)}$$

(This identity follows from the fact that $Z_i$ is independent of $S$ — a defining property of the Dirichlet — and from the formula for $\mathbb{E}[1/S]$ for a Gamma sum.)

In our case, $\alpha_i = \alpha_j = 1/2$ and $\alpha_0 = d/2$, so:

$$\mathbb{E}[Z_i Z_j] = \frac{(1/2)(1/2)}{(d/2)(d/2 + 1)} = \frac{1/4}{(d/2)\cdot(d+2)/2} = \frac{1/4}{d(d+2)/4} = \frac{1}{d(d+2)}$$

That's the $1/d(d+2)$ from Step 6. Now the covariance:

$$\operatorname{Cov}(Z_i, Z_j) = \mathbb{E}[Z_i Z_j] - \mathbb{E}[Z_i]\mathbb{E}[Z_j] = \frac{1}{d(d+2)} - \frac{1}{d} \cdot \frac{1}{d} = \frac{1}{d(d+2)} - \frac{1}{d^2}$$ $$= \frac{d - (d+2)}{d^2(d+2)} = \frac{-2}{d^2(d+2)}$$

For $d = 1000$: $\operatorname{Cov}(Z_i, Z_j) = -2/(1000^2 \cdot 1002) \approx -2.0 \times 10^{-9}$. Tiny.

Worked Example A.6: Checking the Dirichlet formula numerically

Problem: Verify the formula $\mathbb{E}[Z_i Z_j] = 1/d(d+2)$ for $d = 3$ by computing it from the chi-squared definition.

Solution: The formula predicts $\mathbb{E}[Z_1 Z_2] = 1/(3 \times 5) = 1/15$.

Let's check with the Dirichlet moment formula directly. For $\operatorname{Dir}(1/2, 1/2, 1/2)$, the general cross-moment is:

$$\mathbb{E}\!\left[\prod Z_i^{n_i}\right] = \frac{B(\alpha_1+n_1, \ldots, \alpha_d+n_d)}{B(\alpha_1, \ldots, \alpha_d)}$$

where $B$ is the multivariate beta function. With $n_1 = 1, n_2 = 1, n_3 = 0$:

$$\mathbb{E}[Z_1 Z_2] = \frac{\Gamma(1/2 + 1)\Gamma(1/2 + 1)\Gamma(1/2)}{\Gamma(3/2 + 2)} \cdot \frac{\Gamma(3/2)}{\Gamma(1/2)\Gamma(1/2)\Gamma(1/2)}$$

Using $\Gamma(1/2) = \sqrt{\pi}$, $\Gamma(3/2) = \frac{1}{2}\sqrt{\pi}$, $\Gamma(5/2) = \frac{3}{4}\sqrt{\pi}$, $\Gamma(7/2) = \frac{15}{8}\sqrt{\pi}$:

$$= \frac{(\frac{1}{2}\sqrt{\pi})(\frac{1}{2}\sqrt{\pi})(\sqrt{\pi})}{(\frac{15}{8}\sqrt{\pi})} \cdot \frac{(\frac{1}{2}\sqrt{\pi})}{(\sqrt{\pi})(\sqrt{\pi})(\sqrt{\pi})} = \frac{\frac{1}{4}\pi\sqrt{\pi}}{\frac{15}{8}\sqrt{\pi}} \cdot \frac{1}{2\pi} = \frac{\frac{1}{4}}{\frac{15}{8}} \cdot \frac{1}{2} = \frac{2}{15} \cdot \frac{1}{2} = \frac{1}{15}$$

Matches $1/d(d+2) = 1/15$.

A.4 — The Family Tree: How They All Connect

Here is how the three distributions relate, in the order that matters for this blog:

Level 0: Gaussian. Start with independent $g_1, \ldots, g_d \sim \mathcal{N}(0,1)$.

Level 1: Chi-squared. Square them: each $g_i^2 \sim \chi^2(1)$. Sum $k$ of them: $\sum_{i=1}^{k} g_i^2 \sim \chi^2(k)$.

Level 2: Beta. Take the ratio of one part to the whole: $\dfrac{g_i^2}{\sum_{j=1}^d g_j^2} \sim \operatorname{Beta}(1/2, (d-1)/2)$. This is a single squared coordinate of a random unit vector.

Level 3: Dirichlet. Take all such ratios jointly: $\left(\dfrac{g_1^2}{\sum g_j^2}, \ldots, \dfrac{g_d^2}{\sum g_j^2}\right) \sim \operatorname{Dir}(1/2, \ldots, 1/2)$. This is the full vector of squared coordinates. Each marginal is Beta (Level 2), but the Dirichlet captures the joint structure — including the constraint that they sum to 1, which is what creates the small negative covariances in Step 6.

d independent Gaussians: g₁, ..., g⇸ ~ N(0,1) | v Square each: g₁², ..., g⇸² ~ χ²(1) each | +------------------+-------------------+ | | v v Sum all d: Normalize by sum: S = Σ gᵢ² ~ χ²(d) (g₁²/S, ..., g⇸²/S) ~ Dir(½,...,½) | v Any single ratio: gᵢ²/S ~ Beta(½, (d-1)/2) | v Take square root: uᵢ = gᵢ/√S ≈ N(0, 1/d) [coordinate of random unit vector]

The chain from Gaussians to the coordinate distribution used by PolarQuant. Each level builds on the previous.

Worked Example A.7: Tracing through the full chain for $d = 4$

Problem: Draw $g_1 = 1.2$, $g_2 = -0.5$, $g_3 = 0.8$, $g_4 = -1.1$ from $\mathcal{N}(0,1)$. Compute the unit vector, the Dirichlet vector, and verify the distributions.

Solution:

Chi-squared level: $g_1^2 = 1.44$, $g_2^2 = 0.25$, $g_3^2 = 0.64$, $g_4^2 = 1.21$. Sum: $S = 3.54$. (One draw from $\chi^2(4)$; the mean of $\chi^2(4)$ is 4, so 3.54 is a reasonable value.)

Unit vector (take square root of S): $\|\mathbf{g}\| = \sqrt{3.54} = 1.881$.

$$(u_1, u_2, u_3, u_4) = \frac{1}{1.881}(1.2,\; -0.5,\; 0.8,\; -1.1) = (0.638,\; -0.266,\; 0.425,\; -0.585)$$

Check: $0.638^2 + 0.266^2 + 0.425^2 + 0.585^2 = 0.407 + 0.071 + 0.181 + 0.342 = 1.001 \approx 1$.

Dirichlet level (squared coordinates):

$$(u_1^2, u_2^2, u_3^2, u_4^2) = (0.407, 0.071, 0.181, 0.342)$$

These sum to 1 and each is in $[0,1]$ — one draw from $\operatorname{Dir}(1/2, 1/2, 1/2, 1/2)$. The expected value of each component is $1/d = 1/4 = 0.25$.

Beta level (single component): $u_1^2 = 0.407$ is one draw from $\operatorname{Beta}(1/2, 3/2)$. The mean of this Beta is $(1/2)/(1/2 + 3/2) = 1/4 = 0.25$. Our draw of 0.407 is above the mean, which is fine — the distribution has substantial spread at $d = 4$.

Gaussian approximation: $u_1 = 0.638$. The approximation says $u_1 \approx \mathcal{N}(0, 1/4)$, so $\operatorname{SD} = 0.5$. Our value of 0.638 is 1.28 standard deviations out — perfectly typical. At $d = 4$ the approximation is coarse; at $d = 128$ (a real LLM head dimension) it would be far more accurate.

Worked Example A.8: Why the Dirichlet constraint matters (and why it doesn't)

Problem: In the Dirichlet vector above, suppose you observe that $u_1^2 = 0.9$ (one coordinate hogs 90% of the squared norm). What can you say about the remaining coordinates?

Solution: Since $\sum u_i^2 = 1$, the remaining three squared coordinates must share $1 - 0.9 = 0.1$. Their conditional distribution is $\operatorname{Dir}(1/2, 1/2, 1/2)$ rescaled to sum to 0.1 instead of 1. Each remaining coordinate has conditional mean $0.1/3 \approx 0.033$, much less than the unconditional mean of $0.25$.

This is the negative covariance at work: knowing one coordinate is large forces the others to be small. But at $d = 1000$, the probability of any single coordinate hogging 90% of the norm is astronomically small (it would require a $\operatorname{Beta}(1/2, 499.5)$ random variable to take the value 0.9, which is many hundreds of standard deviations from the mean of 0.001). In practice, no coordinate deviates enough from $1/d$ to meaningfully constrain the others. That's the concentration of measure argument for near-independence.

The Mathematics of Google's TurboQuant

A Puzzle to Start

The Random Rotation Puzzle

The Answer

Step 1: What does \(Q\mathbf{v}\) actually compute?

Step 2: A column of a random orthogonal matrix is a random unit vector

Step 3: Deriving the distribution (where the Beta comes from)

Step 4: What does this Beta distribution look like? (the magnitude argument)

Step 5: The Gaussian approximation (why it looks normal)

Step 6: Why are the coordinates nearly independent?

Putting it all together

The Problem: LLMs Are Drowning in Memory

The Three-Paper Arc

Stage 1: PolarQuant — Rotation Makes Everything Predictable

Why does rotation help?

The Lloyd-Max Quantizer

Example: 1-bit quantizer for a uniform distribution

Example: 2-bit quantizer for a uniform distribution

Example: 2-bit quantizer for a Gaussian — non-uniform intervals

Example: What PolarQuant actually precomputes

Eliminating normalization overhead

Stage 2: QJL — Fixing the Bias with One Bit

The Johnson-Lindenstrauss Lemma

How QJL corrects the bias

Step 1: Storing the sketch (1 bit per projection)

Step 2: Where \(\pi\) comes from — the sign-angle identity

Step 3: Building the unbiased estimator

Step 4: Proving unbiasedness explicitly

Step 5: A concrete numerical example

Example: QJL correction in 2D

Step 6: Why the two stages complement each other

The Full TurboQuant Pipeline

How Good Is This? The Information-Theoretic Limit

Inner product guarantee

What This Looks Like in Practice

The Deeper Puzzle: Why Is This So Close to Optimal?

A Question to Sit With

Why This Matters Beyond LLMs

What's Left on the Table

Summary

Further Reading

Appendix: Mathematical Background

A.1 — The Chi-Squared Distribution

Worked Example A.1: Sum of squared Gaussians

Worked Example A.2: Moments of \(\chi^2(1)\)

A.2 — The Beta Distribution

Worked Example A.3: The Beta distribution of a single coordinate

Worked Example A.4: Seeing the Gaussian approximation emerge

A.3 — The Dirichlet Distribution

Worked Example A.5: Dirichlet moments and the \(d(d+2)\) formula

Worked Example A.6: Checking the Dirichlet formula numerically

A.4 — The Family Tree: How They All Connect

Worked Example A.7: Tracing through the full chain for \(d = 4\)

Worked Example A.8: Why the Dirichlet constraint matters (and why it doesn't)