The Mathematics of Google's TurboQuant
In March 2026, Google Research published a paper that rattled memory chip stocks and made Cloudflare's CEO call it "Google's DeepSeek moment." The paper is about compressing numbers. Specifically, it describes an algorithm called TurboQuant that compresses the key-value cache of large language models to ~3 bits per value — a 6x reduction — with effectively no loss in quality.
What makes this interesting is not the engineering result (impressive as it is) but the mathematics underneath. TurboQuant is a clean demonstration of how three old ideas — random rotations, optimal quantization, and the Johnson-Lindenstrauss lemma — combine into something greater than the sum of their parts.
This post walks through the math. I'll start with a puzzle, then build up each piece of the algorithm, and end with why the result is close to the best anyone can ever do.
A Puzzle to Start
The Random Rotation Puzzle
Imagine you have a vector in 1000 dimensions. All of its energy is concentrated in a single coordinate:
$$\mathbf{v} = (1000, \; 0, \; 0, \; 0, \; \ldots, \; 0) \in \mathbb{R}^{1000}$$Now apply a random rotation to this vector — pick a uniformly random orthogonal matrix \(Q\) and compute \(Q\mathbf{v}\).
Question: After the rotation, roughly what does each coordinate of \(Q\mathbf{v}\) look like? Are most of them still zero? Are a few of them large? Or does something else happen entirely?
Bonus: What probability distribution do the individual coordinates follow?
The Problem: LLMs Are Drowning in Memory
Large language models like Llama, Gemini, and GPT run on a mechanism called attention. Every time the model processes a new token, it computes inner products between the current token's query vector and the key vectors of all previous tokens. The key and value vectors from all past tokens are stored in something called the KV cache.
The KV cache grows linearly with context length. For Llama 3.1 70B processing a 128K-token context, the KV cache alone consumes roughly 40 GB of GPU memory — often more than the model weights themselves.
The question is: can we store these vectors at lower precision without breaking the attention computation?
This is a vector quantization problem, and it has been studied since Shannon's 1959 source coding theory. The challenge is that naive quantization (just round each number to the nearest low-precision value) introduces errors that accumulate as bias in the attention scores. TurboQuant solves this elegantly.
The Three-Paper Arc
TurboQuant didn't appear from nowhere. It's the final piece of a research program spanning three papers:
Showed that the sign of a random projection preserves enough angular information to debias dot product estimates at a cost of 1 bit per dimension.
Showed that randomly rotating a vector makes each coordinate follow a known Beta distribution, enabling precomputed optimal scalar quantizers with zero calibration.
Combined PolarQuant (rotation + Lloyd-Max quantization) with QJL (1-bit residual correction) to achieve near-optimal distortion with unbiased inner products.
Let's look at each piece.
Stage 1: PolarQuant — Rotation Makes Everything Predictable
Consider a vector \(\mathbf{x} \in \mathbb{R}^d\) that we want to quantize. In a KV cache, \(d\) is the head dimension (typically 64–128). The vector could have any shape — some coordinates large, others small, the distribution varying from token to token.
PolarQuant's insight is to apply a random orthogonal rotation first:
$$\tilde{\mathbf{x}} = Q \mathbf{x}$$where \(Q \in \mathbb{R}^{d \times d}\) is a random orthogonal matrix (generated once, shared across all vectors). This is exactly the setup from our puzzle.
Why does rotation help?
After rotation, each coordinate \(\tilde{x}_i\) of the transformed vector follows a distribution that depends only on \(\|\mathbf{x}\|\) and \(d\), not on the direction of \(\mathbf{x}\). Specifically:
$$\tilde{x}_i = \|\mathbf{x}\| \cdot z_i$$where \(z_i\) is the \(i\)-th coordinate of a uniformly random unit vector in \(\mathbb{R}^d\). This coordinate follows a shifted and scaled Beta distribution:
For large \(d\), the central limit theorem kicks in and \(z_i \approx \mathcal{N}(0, 1/d)\). Moreover, distinct coordinates become nearly independent.
The Lloyd-Max Quantizer
Given a known probability distribution \(p(z)\) and a budget of \(b\) bits (giving \(2^b\) possible output levels), the Lloyd-Max quantizer is the scalar quantizer that minimizes mean squared error (MSE).
It works by solving a one-dimensional optimization: partition the real line into \(2^b\) intervals and assign a representative value to each interval, chosen to minimize:
$$\operatorname{MSE} = \int_{-\infty}^{\infty} (z - Q_b(z))^2 \, p(z) \, dz$$where \(Q_b(z)\) maps \(z\) to its nearest representative. The optimal partition boundaries and representatives satisfy a pair of necessary conditions (essentially: boundaries are midpoints of adjacent representatives, and representatives are conditional means within their intervals). These can be solved numerically by alternating between the two conditions — which is exactly Lloyd's algorithm (also known as k-means in one dimension).
Because the Beta distribution is known, PolarQuant precomputes the Lloyd-Max quantizer for each target bitwidth once. At runtime, quantizing a coordinate is a table lookup.
This is easier to understand with examples. Let's build up from the simplest case.
Example: 1-bit quantizer for a uniform distribution
Suppose our values are drawn from a uniform distribution on \([-1, 1]\), and we have 1 bit — just 2 possible output levels. Where should we place them?
Lloyd-Max says: partition \([-1, 1]\) into two intervals, and set each representative to the conditional mean within its interval. By symmetry, the boundary is at 0:
The representative for \([-1, 0)\) is \(\mathbb{E}[z \mid z \in [-1, 0)] = -0.5\). Similarly for the other half. Any value in \([-1, 0)\) gets mapped to \(-0.5\); any value in \([0, 1]\) gets mapped to \(+0.5\).
The MSE is \(\mathbb{E}[(z - Q(z))^2] = \frac{1}{12} \approx 0.083\). (Each half contributes the variance of a uniform on an interval of width 1, which is \(1/12\).)
Example: 2-bit quantizer for a uniform distribution
Now we have 2 bits — 4 output levels. For the uniform distribution on \([-1, 1]\), Lloyd-Max splits into 4 equal intervals:
MSE = \(\frac{1}{48} \approx 0.021\). Going from 1 bit to 2 bits cut the MSE by 4x. This is the \(4^{-b}\) scaling: each extra bit reduces MSE by a factor of 4 (for well-behaved distributions).
Example: 2-bit quantizer for a Gaussian — non-uniform intervals
Now the values follow a standard normal \(\mathcal{N}(0, 1)\) instead of a uniform. Most values cluster near 0, with rare outliers. If we used equal-width intervals (like above), we'd waste two output levels on the tails where almost no data lives.
Lloyd-Max adapts: it places narrower intervals where the density is high (near 0) and wider intervals in the sparse tails:
Notice the asymmetry in interval widths: the inner intervals (covering the dense center) are narrower (\(\approx 0.98\) wide) than the outer intervals (stretching to infinity). This is not something you can get from simple "round to nearest" quantization — it specifically accounts for where the data actually lives.
The MSE for this optimal 2-bit Gaussian quantizer is about 0.2356, compared to the variance of 1.0 — so we capture about 76% of the signal with just 4 levels.
The representatives (\(\pm 0.45\), \(\pm 1.51\)) and boundaries (\(\pm 0.98\)) are computed once by running Lloyd's algorithm on the known Gaussian density. They never change.
Example: What PolarQuant actually precomputes
In TurboQuant, after rotation, each coordinate follows approximately \(\mathcal{N}(0, 1/d)\). Rescaling by \(\|\mathbf{x}\|\), the distribution is known. PolarQuant precomputes a Lloyd-Max quantizer for this distribution at each bitwidth.
For a 3-bit quantizer (\(2^3 = 8\) levels) on a standard Gaussian, the precomputed table looks roughly like:
At runtime, quantizing one coordinate means: compare it against 7 boundaries, pick the matching interval, store a 3-bit index (0–7). Dequantizing means: look up the representative from a table of 8 values. No floating-point arithmetic, no per-vector statistics, no calibration. The boundaries and representatives are baked in at compile time.
This is what makes TurboQuant fast: the entire quantization step is 7 comparisons and a table index per coordinate.
Eliminating normalization overhead
A subtle but important advantage: traditional quantization schemes (like min-max or absmax) need to store per-block scale factors, typically adding 1–2 extra bits per coordinate of overhead. PolarQuant eliminates this entirely. Since the distribution after rotation is fixed, the quantizer boundaries are fixed. No per-vector metadata is needed.
The only thing we need to store alongside the quantized coordinates is \(\|\mathbf{x}\|\) (a single scalar per vector) to reconstruct the scale. This is negligible overhead.
Stage 2: QJL — Fixing the Bias with One Bit
Stage 1 gives us excellent MSE. But there's a problem that MSE doesn't capture.
In attention, we don't care about reconstructing the vector itself — we care about inner products. The attention score between a query \(\mathbf{q}\) and a key \(\mathbf{k}\) is \(\langle \mathbf{q}, \mathbf{k} \rangle\). If we quantize \(\mathbf{k}\) to \(\hat{\mathbf{k}}\), we need \(\langle \mathbf{q}, \hat{\mathbf{k}} \rangle \approx \langle \mathbf{q}, \mathbf{k} \rangle\).
The problem is that MSE-optimal quantizers introduce systematic bias in inner product estimates. The quantization error \(\mathbf{e} = \mathbf{k} - \hat{\mathbf{k}}\) is correlated with \(\hat{\mathbf{k}}\), which means:
$$\mathbb{E}[\langle \mathbf{q}, \hat{\mathbf{k}} \rangle] \neq \langle \mathbf{q}, \mathbf{k} \rangle$$This bias accumulates across tokens in the attention softmax, silently degrading output quality. TurboQuant fixes this with a 1-bit correction based on the Johnson-Lindenstrauss lemma.
The Johnson-Lindenstrauss Lemma
The JL lemma (1984) is one of the most useful results in high-dimensional geometry. It says:
In other words: random projections preserve geometry, even when you're projecting into a much lower-dimensional space.
How QJL corrects the bias
Let \(\mathbf{e} = \mathbf{k} - \hat{\mathbf{k}}\) be the residual error from Stage 1. We want to estimate \(\langle \mathbf{q}, \mathbf{e} \rangle\) — the part of the inner product lost to quantization — so we can add it back. But we can't store \(\mathbf{e}\) itself (that would defeat the purpose of compression). QJL stores a 1-bit sketch of \(\mathbf{e}\) instead.
Step 1: Storing the sketch (1 bit per projection)
Fix \(m\) random Gaussian vectors \(\mathbf{g}_1, \ldots, \mathbf{g}_m \in \mathbb{R}^d\), drawn once and shared across all keys. For each key's residual \(\mathbf{e}\), store:
$$s_i = \operatorname{sign}(\langle \mathbf{g}_i, \mathbf{e} \rangle) \in \{-1, +1\} \quad \text{for } i = 1, \ldots, m$$Each \(s_i\) is a single bit. The inner product \(\langle \mathbf{g}_i, \mathbf{e} \rangle\) is a real number, but we only keep its sign. This feels like throwing away almost all the information — how can a sign bit help estimate an inner product?
Step 2: Where \(\pi\) comes from — the sign-angle identity
The key mathematical fact is a classical result about Gaussian random projections. For any two fixed vectors \(\mathbf{a}, \mathbf{b} \in \mathbb{R}^d\) and a random Gaussian vector \(\mathbf{g} \sim \mathcal{N}(\mathbf{0}, I_d)\):
$$\mathbb{E}\big[\operatorname{sign}(\langle \mathbf{g}, \mathbf{a} \rangle) \cdot \langle \mathbf{g}, \mathbf{b} \rangle\big] = \sqrt{\frac{2}{\pi}} \, \|\mathbf{a}\| \cos\theta$$Wait — let me derive this from scratch so the \(\pi\) doesn't feel like magic.
Consider the simpler 1D case first. Let \(g \sim \mathcal{N}(0,1)\). What is \(\mathbb{E}[\operatorname{sign}(g) \cdot g]\)?
$$\mathbb{E}[\operatorname{sign}(g) \cdot g] = \mathbb{E}[|g|] = \int_{-\infty}^{\infty} |g| \cdot \frac{1}{\sqrt{2\pi}} e^{-g^2/2} \, dg = 2 \int_{0}^{\infty} g \cdot \frac{1}{\sqrt{2\pi}} e^{-g^2/2} \, dg$$The integral \(\int_0^\infty g \, e^{-g^2/2} \, dg = 1\) (substitute \(u = g^2/2\)), so:
$$\mathbb{E}[|g|] = \frac{2}{\sqrt{2\pi}} = \sqrt{\frac{2}{\pi}}$$That's where the \(\sqrt{2/\pi}\) comes from — it's the mean of the half-normal distribution. It arises because we're taking the absolute value (or equivalently, multiplying by the sign) of a Gaussian.
Now generalize to \(\mathbb{R}^d\). Let \(\mathbf{g} \sim \mathcal{N}(\mathbf{0}, I_d)\), and let \(\theta\) be the angle between fixed vectors \(\mathbf{e}\) and \(\mathbf{q}\). We can decompose \(\mathbf{g}\) into a component along \(\mathbf{e}\) and a component perpendicular to \(\mathbf{e}\). The projection \(\langle \mathbf{g}, \mathbf{e} \rangle / \|\mathbf{e}\|\) is a scalar Gaussian, and the sign of \(\langle \mathbf{g}, \mathbf{e} \rangle\) depends only on this component. Meanwhile, \(\langle \mathbf{g}, \mathbf{q} \rangle\) decomposes as:
$$\langle \mathbf{g}, \mathbf{q} \rangle = \underbrace{\langle \mathbf{g}, \mathbf{q}_\parallel \rangle}_{\text{component along } \mathbf{e}} + \underbrace{\langle \mathbf{g}, \mathbf{q}_\perp \rangle}_{\text{component } \perp \text{ to } \mathbf{e}}$$where \(\mathbf{q}_\parallel = \frac{\langle \mathbf{q}, \mathbf{e} \rangle}{\|\mathbf{e}\|^2} \mathbf{e}\) is the projection of \(\mathbf{q}\) onto \(\mathbf{e}\), and \(\mathbf{q}_\perp = \mathbf{q} - \mathbf{q}_\parallel\).
The perpendicular part \(\langle \mathbf{g}, \mathbf{q}_\perp \rangle\) is independent of \(\operatorname{sign}(\langle \mathbf{g}, \mathbf{e} \rangle)\) (because \(\mathbf{q}_\perp\) is orthogonal to \(\mathbf{e}\), so this projection involves different Gaussian components). Therefore its contribution vanishes in expectation:
$$\mathbb{E}\big[\operatorname{sign}(\langle \mathbf{g}, \mathbf{e} \rangle) \cdot \langle \mathbf{g}, \mathbf{q}_\perp \rangle\big] = 0$$For the parallel part, let \(\alpha = \langle \mathbf{g}, \mathbf{e} \rangle / \|\mathbf{e}\|\) (a standard Gaussian scalar). Then:
$$\langle \mathbf{g}, \mathbf{q}_\parallel \rangle = \frac{\langle \mathbf{q}, \mathbf{e} \rangle}{\|\mathbf{e}\|} \cdot \alpha$$and:
$$\mathbb{E}\big[\operatorname{sign}(\langle \mathbf{g}, \mathbf{e} \rangle) \cdot \langle \mathbf{g}, \mathbf{q}_\parallel \rangle\big] = \frac{\langle \mathbf{q}, \mathbf{e} \rangle}{\|\mathbf{e}\|} \cdot \mathbb{E}[\operatorname{sign}(\alpha) \cdot \alpha] = \frac{\langle \mathbf{q}, \mathbf{e} \rangle}{\|\mathbf{e}\|} \cdot \sqrt{\frac{2}{\pi}}$$Combining the parallel and perpendicular parts:
Step 3: Building the unbiased estimator
The identity above tells us that \(\operatorname{sign}(\langle \mathbf{g}, \mathbf{e} \rangle) \cdot \langle \mathbf{g}, \mathbf{q} \rangle\) has expected value proportional to \(\langle \mathbf{q}, \mathbf{e} \rangle\), but scaled by a factor of \(\sqrt{2/\pi} / \|\mathbf{e}\|\). To get an unbiased estimator of \(\langle \mathbf{q}, \mathbf{e} \rangle\) itself, we need to undo this scaling.
If we knew \(\|\mathbf{e}\|\), the estimator from a single projection would be:
$$\hat{T}_i = \sqrt{\frac{\pi}{2}} \cdot \|\mathbf{e}\| \cdot s_i \cdot \langle \mathbf{g}_i, \mathbf{q} \rangle$$and by the identity: \(\mathbb{E}[\hat{T}_i] = \sqrt{\frac{\pi}{2}} \cdot \|\mathbf{e}\| \cdot \sqrt{\frac{2}{\pi}} \cdot \frac{\langle \mathbf{q}, \mathbf{e} \rangle}{\|\mathbf{e}\|} = \langle \mathbf{q}, \mathbf{e} \rangle\). Unbiased.
In practice, TurboQuant stores \(\|\mathbf{e}\|\) alongside the sketch (it's a single scalar per vector, negligible overhead). Averaging over \(m\) independent projections reduces variance:
$$\widehat{\langle \mathbf{q}, \mathbf{e} \rangle} = \frac{1}{m} \sum_{i=1}^{m} \sqrt{\frac{\pi}{2}} \cdot \|\mathbf{e}\| \cdot s_i \cdot \langle \mathbf{g}_i, \mathbf{q} \rangle$$This is sometimes written more compactly as \(\frac{\pi}{2m} \sum s_i \langle \mathbf{g}_i, \mathbf{q} \rangle\) (absorbing \(\|\mathbf{e}\|\) into the normalization of \(\mathbf{g}_i\)), which is the formula you see in the paper.
Step 4: Proving unbiasedness explicitly
Let's verify this end to end. The estimator for the full inner product is:
$$\widehat{\langle \mathbf{q}, \mathbf{k} \rangle} = \underbrace{\langle \mathbf{q}, \hat{\mathbf{k}} \rangle}_{\text{from Stage 1}} + \underbrace{\widehat{\langle \mathbf{q}, \mathbf{e} \rangle}}_{\text{QJL correction}}$$Taking expectations (over the randomness of the Gaussian vectors \(\mathbf{g}_i\)):
$$\mathbb{E}\big[\widehat{\langle \mathbf{q}, \mathbf{k} \rangle}\big] = \langle \mathbf{q}, \hat{\mathbf{k}} \rangle + \mathbb{E}\big[\widehat{\langle \mathbf{q}, \mathbf{e} \rangle}\big] = \langle \mathbf{q}, \hat{\mathbf{k}} \rangle + \langle \mathbf{q}, \mathbf{e} \rangle$$Now use \(\mathbf{e} = \mathbf{k} - \hat{\mathbf{k}}\):
$$= \langle \mathbf{q}, \hat{\mathbf{k}} \rangle + \langle \mathbf{q}, \mathbf{k} - \hat{\mathbf{k}} \rangle = \langle \mathbf{q}, \hat{\mathbf{k}} \rangle + \langle \mathbf{q}, \mathbf{k} \rangle - \langle \mathbf{q}, \hat{\mathbf{k}} \rangle = \langle \mathbf{q}, \mathbf{k} \rangle$$The \(\langle \mathbf{q}, \hat{\mathbf{k}} \rangle\) terms cancel perfectly. The estimator recovers the true inner product in expectation, regardless of how large the quantization error \(\mathbf{e}\) is.
Step 5: A concrete numerical example
Example: QJL correction in 2D
Let's trace through this with small numbers to see it working.
Setup: Dimension \(d = 2\). Suppose:
- True key: \(\mathbf{k} = (3.0, \; 4.0)\)
- Quantized key: \(\hat{\mathbf{k}} = (2.5, \; 4.5)\) (Stage 1 output)
- Residual: \(\mathbf{e} = \mathbf{k} - \hat{\mathbf{k}} = (0.5, \; -0.5)\)
- Query: \(\mathbf{q} = (1.0, \; 2.0)\)
True inner product: \(\langle \mathbf{q}, \mathbf{k} \rangle = 1 \cdot 3 + 2 \cdot 4 = 11.0\)
Naive (Stage 1 only): \(\langle \mathbf{q}, \hat{\mathbf{k}} \rangle = 1 \cdot 2.5 + 2 \cdot 4.5 = 11.5\)
Error without correction: \(11.5 - 11.0 = 0.5\) (this is the bias)
Lost inner product: \(\langle \mathbf{q}, \mathbf{e} \rangle = 1 \cdot 0.5 + 2 \cdot (-0.5) = -0.5\)
QJL correction with \(m = 3\) projections:
Draw 3 random Gaussian vectors and compute the sketch:
- \(\mathbf{g}_1 = (0.8, \; -0.3)\): \(\langle \mathbf{g}_1, \mathbf{e} \rangle = 0.4 + 0.15 = 0.55\), so \(s_1 = +1\)
- \(\mathbf{g}_2 = (-1.2, \; 0.7)\): \(\langle \mathbf{g}_2, \mathbf{e} \rangle = -0.6 - 0.35 = -0.95\), so \(s_2 = -1\)
- \(\mathbf{g}_3 = (0.1, \; 1.5)\): \(\langle \mathbf{g}_3, \mathbf{e} \rangle = 0.05 - 0.75 = -0.70\), so \(s_3 = -1\)
Store only: \(s_1 = +1, \; s_2 = -1, \; s_3 = -1\) and \(\|\mathbf{e}\| = \sqrt{0.25 + 0.25} = \frac{1}{\sqrt{2}} \approx 0.707\).
At query time, compute \(\langle \mathbf{g}_i, \mathbf{q} \rangle\) for each:
- \(\langle \mathbf{g}_1, \mathbf{q} \rangle = 0.8 + (-0.6) = 0.2\)
- \(\langle \mathbf{g}_2, \mathbf{q} \rangle = -1.2 + 1.4 = 0.2\)
- \(\langle \mathbf{g}_3, \mathbf{q} \rangle = 0.1 + 3.0 = 3.1\)
QJL estimate of \(\langle \mathbf{q}, \mathbf{e} \rangle\):
$$\widehat{\langle \mathbf{q}, \mathbf{e} \rangle} = \frac{\sqrt{\pi/2} \cdot \|\mathbf{e}\|}{m} \sum_{i} s_i \langle \mathbf{g}_i, \mathbf{q} \rangle = \frac{1.253 \cdot 0.707}{3} \big[(+1)(0.2) + (-1)(0.2) + (-1)(3.1)\big]$$ $$= \frac{0.886}{3} \cdot (0.2 - 0.2 - 3.1) = 0.295 \times (-3.1) = -0.915$$Corrected estimate: \(11.5 + (-0.915) = 10.585\)
True value: \(11.0\)
The correction moved us from an error of \(+0.5\) to an error of \(-0.415\). On this single draw, the correction overshot — but that's because \(m = 3\) is very small and we got unlucky with \(\mathbf{g}_3\). In expectation (averaged over many random draws of the \(\mathbf{g}_i\)), the estimate would be exactly 11.0. With a realistic \(m\) (say 64 or 128, matching the head dimension), the variance shrinks dramatically and the estimate would be very close to 11.0 on any single draw.
Step 6: Why the two stages complement each other
Now we can see precisely how the two stages work together:
- Stage 1 (PolarQuant) makes \(\|\mathbf{e}\|\) small. The better the Lloyd-Max quantizer, the smaller the residual. This directly reduces the variance of the QJL estimator (which scales with \(\|\mathbf{e}\|^2\)).
- Stage 2 (QJL) makes the estimate unbiased. No matter how large or small \(\mathbf{e}\) is, the correction recovers the true inner product in expectation.
Without Stage 1, the residual would be large and the QJL estimator would have high variance. Without Stage 2, the quantized inner product would have low variance but persistent bias. Together: low variance and zero bias.
The total inner product estimate is:
$$\langle \mathbf{q}, \mathbf{k} \rangle \;\approx\; \langle \mathbf{q}, \hat{\mathbf{k}} \rangle \;+\; \frac{1}{m}\sum_{i=1}^{m} \sqrt{\frac{\pi}{2}} \cdot \|\mathbf{e}\| \cdot s_i \cdot \langle \mathbf{g}_i, \mathbf{q} \rangle$$The Full TurboQuant Pipeline
The total storage per coordinate is \(b + 1\) bits (e.g., 3 bits PolarQuant + 1 bit QJL = 4 bits total, compared to 16 bits for FP16). This gives the 4–6x memory reduction.
How Good Is This? The Information-Theoretic Limit
Now for the punchline. How close is TurboQuant to the best possible?
Shannon's rate-distortion theory provides a hard lower bound: for any \(b\)-bit quantizer operating on unit-norm vectors in \(\mathbb{R}^d\), the minimum achievable MSE scales as:
$$\operatorname{MSE}_{\min} \;\geq\; C \cdot 4^{-b}$$for some constant \(C\). No quantizer — no matter how clever, how much data it sees, or how much computation it uses — can beat this bound.
The TurboQuant paper (Theorem 3 in arXiv:2504.19874) proves that TurboQuant achieves:
$$\operatorname{MSE}_{\mathrm{TurboQuant}} \;\leq\; \frac{\sqrt{3\pi}}{2} \cdot 4^{-b} \;\approx\; 2.7 \cdot 4^{-b}$$TurboQuant is within a factor of ~2.7 of the information-theoretic limit at every bitwidth.
This factor is constant — it does not grow with \(d\) or \(b\).
To put this in perspective: no future algorithm can improve on TurboQuant by more than 2.7x in MSE, regardless of approach. The remaining gap is between the Lloyd-Max scalar quantizer (optimal for scalar quantization of a known distribution) and the Shannon bound for full vector quantization (which requires exponential codebook searches in theory).
Inner product guarantee
For the inner product estimator (which is what attention actually uses), TurboQuant provides:
- Unbiased: \(\mathbb{E}[\langle \mathbf{q}, \hat{\mathbf{k}} \rangle_{\mathrm{TQ}}] = \langle \mathbf{q}, \mathbf{k} \rangle\)
- Variance bound: \(\operatorname{Var}[\langle \mathbf{q}, \hat{\mathbf{k}} \rangle_{\mathrm{TQ}}] \leq \frac{\sqrt{3\pi}}{2} \cdot \frac{\|\mathbf{q}\|^2}{d} \cdot 4^{-b}\)
The variance shrinks exponentially with the bit budget and inversely with the dimension. For typical LLM head dimensions (\(d = 128\)) and 3–4 bit quantization, this variance is tiny.
What This Looks Like in Practice
| Configuration | Bits per value | KV cache size (Llama 70B, 128K ctx) | Quality |
|---|---|---|---|
| FP16 (baseline) | 16 | ~40 GB | Full |
| TurboQuant 4-bit | 4 | ~10 GB | Matches FP16 |
| TurboQuant 3.5-bit | 3.5 | ~8.75 GB | Quality-neutral |
| TurboQuant 3-bit | 3 | ~7.5 GB | Near quality-neutral |
| TurboQuant 2.5-bit | 2.5 | ~6.25 GB | Marginal degradation |
At 4-bit precision, TurboQuant also delivers up to 8x speedup for attention logit computation on H100 GPUs, since lower-precision arithmetic is faster.
When benchmarked on long-context tasks (LongBench, Needle-in-a-Haystack, RULER), 3.5-bit TurboQuant is statistically indistinguishable from FP16. The algorithm achieves this without any training, fine-tuning, or calibration data.
The Deeper Puzzle: Why Is This So Close to Optimal?
The factor of 2.7 deserves a closer look. Where does it come from, and why can't we close the gap?
The gap exists because TurboQuant uses scalar quantization (quantizing each coordinate independently) rather than vector quantization (quantizing all coordinates jointly). Scalar quantization is fast — \(O(d)\) per vector — but it treats each coordinate in isolation, missing correlations between coordinates.
Full vector quantization could, in theory, exploit these correlations to achieve lower distortion. But it requires a codebook of size \(2^{bd}\) and search time exponential in \(d\). For \(d = 128\) and \(b = 3\), that's \(2^{384}\) codewords — a number larger than the atoms in the observable universe.
A Question to Sit With
The random rotation in PolarQuant makes coordinates nearly independent in high dimensions. But "nearly" isn't "exactly." There are residual correlations that the scalar quantizer ignores.
Is there a tractable quantization scheme — one that runs in polynomial time — that could exploit these residual correlations to close the 2.7x gap? Or is there a computational lower bound that says this gap is inherent for efficient algorithms?
As far as I know, this question is open.
Why This Matters Beyond LLMs
TurboQuant is framed as a KV cache trick, but the mathematics applies to any setting where you need to store high-dimensional vectors compactly while preserving inner products:
- Vector databases (Pinecone, Weaviate, etc.): similarity search at lower memory cost
- Embedding storage: compress sentence/image embeddings for retrieval
- Streaming quantization: the algorithm is online — each vector is quantized independently with no look-ahead, making it suitable for real-time applications
- Federated learning: compress gradient updates for communication-efficient training
The data-oblivious property is particularly valuable: you don't need to retrain or recalibrate when the data distribution shifts.
What's Left on the Table
TurboQuant is close to optimal for compressing the data as-is. But the authors note that most future gains in LLM memory efficiency will likely come from changing what data needs to be stored in the first place:
- Multi-query attention / grouped-query attention: share key-value heads across query heads, reducing the number of vectors stored
- Sliding window attention: don't store the full context, just a recent window
- Architectural innovations: models designed from the start for efficient KV storage
Compression is approaching the theoretical wall. Architecture is where the headroom is.
Summary
TurboQuant combines three clean mathematical ideas:
- Random rotation makes every coordinate look the same (PolarQuant insight).
- Lloyd-Max quantization exploits the known distribution to minimize MSE with zero calibration.
- 1-bit JL sketch of the residual eliminates inner product bias (QJL insight).
The result is within 2.7x of the information-theoretic limit, requires no training data, and runs fast enough for real-time inference. It's a nice example of how theoretical tools from high-dimensional geometry (JL lemma), information theory (rate-distortion bounds), and signal processing (Lloyd-Max quantization) combine to solve a practical engineering problem.
Paper: arXiv:2504.19874 — Zandieh, Daliri, Hadian, Mirrokni (2025). Accepted at ICLR 2026.
Further Reading
- Google Research blog post on TurboQuant
- TurboQuant paper (arXiv:2504.19874)
- Johnson-Lindenstrauss lemma (Wikipedia)
- Lloyd's algorithm / k-means (Wikipedia)
- Rate-distortion theory (Wikipedia)
Appendix: Mathematical Background
This blog uses several probability distributions and their interrelationships. If you're rusty on any of them, this appendix walks through each one from scratch with worked examples that connect directly to the arguments above.
A.1 — The Chi-Squared Distribution
Key properties:
- Mean: \(\mathbb{E}[X] = k\)
- Variance: \(\operatorname{Var}(X) = 2k\)
- The PDF is \(f(x) = \frac{1}{2^{k/2}\,\Gamma(k/2)} \, x^{k/2-1} \, e^{-x/2}\) for \(x > 0\).
- Additive: if \(X \sim \chi^2(a)\) and \(Y \sim \chi^2(b)\) are independent, then \(X + Y \sim \chi^2(a+b)\).
Special case used in this blog: A single squared Gaussian \(g^2\) where \(g \sim \mathcal{N}(0,1)\) is \(\chi^2(1)\). Its mean is 1 and its variance is 2. This is the building block for everything that follows.
Worked Example A.1: Sum of squared Gaussians
Problem: Let \(g_1, \ldots, g_{1000} \sim \mathcal{N}(0,1)\) independently. What is the distribution of \(S = \sum_{i=1}^{1000} g_i^2\)? What are its mean and standard deviation? How tightly is it concentrated?
Solution:
- \(S \sim \chi^2(1000)\)
- \(\mathbb{E}[S] = 1000\)
- \(\operatorname{Var}(S) = 2 \times 1000 = 2000\), so \(\operatorname{SD}(S) = \sqrt{2000} \approx 44.7\)
- Coefficient of variation: \(\operatorname{SD}/\operatorname{Mean} = 44.7 / 1000 \approx 0.045\)
The relative fluctuation is under 5%. This is why, in Step 5 of the puzzle solution, we could approximate \(\sqrt{\sum g_j^2} \approx \sqrt{d}\) — the sum \(\sum g_j^2\) is tightly concentrated around its mean \(d = 1000\).
Worked Example A.2: Moments of \(\chi^2(1)\)
Problem: If \(X \sim \chi^2(1)\) (i.e., \(X = g^2\) for \(g \sim \mathcal{N}(0,1)\)), compute \(\mathbb{E}[X]\), \(\mathbb{E}[X^2]\), and \(\operatorname{Var}(X)\).
Solution:
- \(\mathbb{E}[X] = \mathbb{E}[g^2] = 1\) (since \(g\) has mean 0 and variance 1).
- \(\mathbb{E}[X^2] = \mathbb{E}[g^4]\). For a standard normal, \(\mathbb{E}[g^4] = 3\) (this is the fourth moment, equal to \(3!! = 3 \cdot 1 = 3\)).
- \(\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = 3 - 1 = 2\). This matches the general formula \(\operatorname{Var}(\chi^2(k)) = 2k\) with \(k = 1\).
This is exactly the calculation that produces the "\(+2\)" in the \(d(d+2)\) denominator in Step 6 of the puzzle solution. The second moment of \(\chi^2(1)\) being 3 (not 1) is what creates that correction term.
A.2 — The Beta Distribution
Key properties:
- Mean: \(\displaystyle\mathbb{E}[Z] = \frac{\alpha}{\alpha + \beta}\)
- Variance: \(\displaystyle\operatorname{Var}(Z) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\)
- Mode (for \(\alpha, \beta > 1\)): \(\displaystyle\frac{\alpha - 1}{\alpha + \beta - 2}\)
- If \(\alpha = \beta\), the distribution is symmetric around \(1/2\).
- If \(\alpha < 1\) and \(\beta \gg 1\), the distribution is concentrated near 0 with a right skew.
Where it appears in this blog: The squared coordinate \(u_i^2\) of a uniform random unit vector follows \(\operatorname{Beta}(1/2, (d-1)/2)\). This has \(\alpha = 1/2 < 1\), so the PDF diverges at 0 — most of the probability mass is near zero, which makes sense because each coordinate of a high-dimensional unit vector is small.
Worked Example A.3: The Beta distribution of a single coordinate
Problem: A uniform random unit vector \(\mathbf{u}\) on the sphere in \(\mathbb{R}^5\). What is the exact distribution of \(u_1^2\)? Compute its mean and variance.
Solution: Using the Gaussian construction, \(u_1 = g_1 / \|\mathbf{g}\|\), so:
$$u_1^2 = \frac{g_1^2}{g_1^2 + g_2^2 + g_3^2 + g_4^2 + g_5^2} = \frac{X}{X + Y}$$where \(X = g_1^2 \sim \chi^2(1)\) and \(Y = g_2^2 + \cdots + g_5^2 \sim \chi^2(4)\), independent of \(X\).
By the chi-squared-to-Beta connection:
$$u_1^2 \sim \operatorname{Beta}\!\left(\frac{1}{2}, \frac{4}{2}\right) = \operatorname{Beta}\!\left(\frac{1}{2}, 2\right)$$Now compute moments:
- \(\mathbb{E}[u_1^2] = \dfrac{1/2}{1/2 + 2} = \dfrac{1/2}{5/2} = \dfrac{1}{5}\). Checks out: \(d = 5\), and by symmetry each of the 5 coordinates gets \(1/5\) of the squared norm.
- \(\operatorname{Var}(u_1^2) = \dfrac{(1/2)(2)}{(5/2)^2(5/2 + 1)} = \dfrac{1}{(25/4)(7/2)} = \dfrac{1}{87.5} = \dfrac{4}{350} = \dfrac{2}{175}\)
- \(\operatorname{SD}(u_1^2) = \sqrt{2/175} \approx 0.107\). The mean is \(0.2\), so the coefficient of variation is about 0.53 — substantial spread, as discussed in Step 4.
Worked Example A.4: Seeing the Gaussian approximation emerge
Problem: For \(d = 5\) vs. \(d = 1000\), compare the Beta distribution of \(u_1^2\) to the Gaussian approximation.
Solution: The coordinate \(u_1\) (not squared) has mean 0 and variance \(1/d\). The Gaussian approximation says \(u_1 \approx \mathcal{N}(0, 1/d)\).
For \(d = 5\): \(u_1 \approx \mathcal{N}(0, 0.2)\). But the actual distribution of \(u_1\) has support only on \([-1, 1]\) (it's a coordinate of a unit vector), and the density has noticeable curvature at the tails. The Gaussian leaks probability outside \([-1, 1]\). The approximation is rough.
For \(d = 1000\): \(u_1 \approx \mathcal{N}(0, 0.001)\), so \(\operatorname{SD}(u_1) \approx 0.032\). Almost all probability mass sits in \([-0.1, 0.1]\), far from the boundary at \(\pm 1\). The tails of the Gaussian that leak outside \([-1, 1]\) have negligible mass (\(< 10^{-200}\)). The approximation is excellent.
This is why PolarQuant works well at LLM head dimensions (\(d = 64\) to 128) — large enough that the Gaussian approximation to the Beta is already very accurate, and the precomputed Lloyd-Max quantizer designed for this distribution performs nearly optimally.
A.3 — The Dirichlet Distribution
- Each \(Z_i \geq 0\)
- \(\sum_{i=1}^{d} Z_i = 1\) (the components live on the probability simplex)
- The joint PDF is \(\displaystyle f(z_1, \ldots, z_d) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_d)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_d)} \prod_{i=1}^{d} z_i^{\alpha_i - 1}\)
Key properties:
- Mean of each component: \(\displaystyle\mathbb{E}[Z_i] = \frac{\alpha_i}{\alpha_0}\) where \(\alpha_0 = \sum \alpha_j\)
- Variance: \(\displaystyle\operatorname{Var}(Z_i) = \frac{\alpha_i(\alpha_0 - \alpha_i)}{\alpha_0^2(\alpha_0 + 1)}\)
- Covariance (for \(i \neq j\)): \(\displaystyle\operatorname{Cov}(Z_i, Z_j) = \frac{-\alpha_i \alpha_j}{\alpha_0^2(\alpha_0 + 1)}\)
- Each marginal is Beta: \(Z_i \sim \operatorname{Beta}(\alpha_i, \alpha_0 - \alpha_i)\)
Where it appears in this blog: The vector of squared coordinates \((u_1^2, u_2^2, \ldots, u_d^2)\) of a uniform random unit vector follows:
$$\operatorname{Dir}\!\left(\frac{1}{2}, \frac{1}{2}, \ldots, \frac{1}{2}\right) \quad \text{(symmetric Dirichlet with } \alpha = 1/2 \text{)}$$This is because \(u_i^2 = g_i^2 / \sum g_k^2\), and each \(g_i^2 \sim \chi^2(1)\), and \(\chi^2(1) = \chi^2(2 \times 1/2)\), so the construction above applies with \(\alpha_i = 1/2\) for all \(i\).
Worked Example A.5: Dirichlet moments and the \(d(d+2)\) formula
Problem: For \((Z_1, \ldots, Z_d) \sim \operatorname{Dir}(1/2, \ldots, 1/2)\) (the distribution of squared coordinates of a random unit vector), compute \(\mathbb{E}[Z_i Z_j]\) for \(i \neq j\).
Solution: We use the general Dirichlet second moment formula. For a Dirichlet with parameters \((\alpha_1, \ldots, \alpha_d)\) and \(\alpha_0 = \sum \alpha_k\):
$$\mathbb{E}[Z_i Z_j] = \frac{\alpha_i \alpha_j}{\alpha_0(\alpha_0 + 1)} \quad \text{for } i \neq j$$Let's derive this. We have:
$$\mathbb{E}[Z_i Z_j] = \mathbb{E}[Z_i \, \mathbb{E}[Z_j \mid Z_i]]$$But it's easier to go through the chi-squared construction. Let \(X_k \sim \chi^2(2\alpha_k)\) independently, \(S = \sum X_k\), and \(Z_k = X_k/S\). Then:
$$\mathbb{E}[Z_i Z_j] = \mathbb{E}\!\left[\frac{X_i X_j}{S^2}\right]$$Since \(X_i\) and \(X_j\) are independent of each other (and of the rest), we can condition on \(S\). But actually, the cleanest approach uses the identity: for independent Gamma variables \(X_i \sim \text{Gamma}(\alpha_i, 1)\), with \(S = \sum X_k\):
$$\mathbb{E}\!\left[\frac{X_i X_j}{S^2}\right] = \frac{\mathbb{E}[X_i]\,\mathbb{E}[X_j]}{\mathbb{E}[S]\,(\mathbb{E}[S]+1)} = \frac{\alpha_i \, \alpha_j}{\alpha_0(\alpha_0+1)}$$(This identity follows from the fact that \(Z_i\) is independent of \(S\) — a defining property of the Dirichlet — and from the formula for \(\mathbb{E}[1/S]\) for a Gamma sum.)
In our case, \(\alpha_i = \alpha_j = 1/2\) and \(\alpha_0 = d/2\), so:
$$\mathbb{E}[Z_i Z_j] = \frac{(1/2)(1/2)}{(d/2)(d/2 + 1)} = \frac{1/4}{(d/2)\cdot(d+2)/2} = \frac{1/4}{d(d+2)/4} = \frac{1}{d(d+2)}$$That's the \(1/d(d+2)\) from Step 6. Now the covariance:
$$\operatorname{Cov}(Z_i, Z_j) = \mathbb{E}[Z_i Z_j] - \mathbb{E}[Z_i]\mathbb{E}[Z_j] = \frac{1}{d(d+2)} - \frac{1}{d} \cdot \frac{1}{d} = \frac{1}{d(d+2)} - \frac{1}{d^2}$$ $$= \frac{d - (d+2)}{d^2(d+2)} = \frac{-2}{d^2(d+2)}$$For \(d = 1000\): \(\operatorname{Cov}(Z_i, Z_j) = -2/(1000^2 \cdot 1002) \approx -2.0 \times 10^{-9}\). Tiny.
Worked Example A.6: Checking the Dirichlet formula numerically
Problem: Verify the formula \(\mathbb{E}[Z_i Z_j] = 1/d(d+2)\) for \(d = 3\) by computing it from the chi-squared definition.
Solution: The formula predicts \(\mathbb{E}[Z_1 Z_2] = 1/(3 \times 5) = 1/15\).
Let's check with the Dirichlet moment formula directly. For \(\operatorname{Dir}(1/2, 1/2, 1/2)\), the general cross-moment is:
$$\mathbb{E}\!\left[\prod Z_i^{n_i}\right] = \frac{B(\alpha_1+n_1, \ldots, \alpha_d+n_d)}{B(\alpha_1, \ldots, \alpha_d)}$$where \(B\) is the multivariate beta function. With \(n_1 = 1, n_2 = 1, n_3 = 0\):
$$\mathbb{E}[Z_1 Z_2] = \frac{\Gamma(1/2 + 1)\Gamma(1/2 + 1)\Gamma(1/2)}{\Gamma(3/2 + 2)} \cdot \frac{\Gamma(3/2)}{\Gamma(1/2)\Gamma(1/2)\Gamma(1/2)}$$Using \(\Gamma(1/2) = \sqrt{\pi}\), \(\Gamma(3/2) = \frac{1}{2}\sqrt{\pi}\), \(\Gamma(5/2) = \frac{3}{4}\sqrt{\pi}\), \(\Gamma(7/2) = \frac{15}{8}\sqrt{\pi}\):
$$= \frac{(\frac{1}{2}\sqrt{\pi})(\frac{1}{2}\sqrt{\pi})(\sqrt{\pi})}{(\frac{15}{8}\sqrt{\pi})} \cdot \frac{(\frac{1}{2}\sqrt{\pi})}{(\sqrt{\pi})(\sqrt{\pi})(\sqrt{\pi})} = \frac{\frac{1}{4}\pi\sqrt{\pi}}{\frac{15}{8}\sqrt{\pi}} \cdot \frac{1}{2\pi} = \frac{\frac{1}{4}}{\frac{15}{8}} \cdot \frac{1}{2} = \frac{2}{15} \cdot \frac{1}{2} = \frac{1}{15}$$Matches \(1/d(d+2) = 1/15\).
A.4 — The Family Tree: How They All Connect
Here is how the three distributions relate, in the order that matters for this blog:
Level 0: Gaussian. Start with independent \(g_1, \ldots, g_d \sim \mathcal{N}(0,1)\).
Level 1: Chi-squared. Square them: each \(g_i^2 \sim \chi^2(1)\). Sum \(k\) of them: \(\sum_{i=1}^{k} g_i^2 \sim \chi^2(k)\).
Level 2: Beta. Take the ratio of one part to the whole: \(\dfrac{g_i^2}{\sum_{j=1}^d g_j^2} \sim \operatorname{Beta}(1/2, (d-1)/2)\). This is a single squared coordinate of a random unit vector.
Level 3: Dirichlet. Take all such ratios jointly: \(\left(\dfrac{g_1^2}{\sum g_j^2}, \ldots, \dfrac{g_d^2}{\sum g_j^2}\right) \sim \operatorname{Dir}(1/2, \ldots, 1/2)\). This is the full vector of squared coordinates. Each marginal is Beta (Level 2), but the Dirichlet captures the joint structure — including the constraint that they sum to 1, which is what creates the small negative covariances in Step 6.
Worked Example A.7: Tracing through the full chain for \(d = 4\)
Problem: Draw \(g_1 = 1.2\), \(g_2 = -0.5\), \(g_3 = 0.8\), \(g_4 = -1.1\) from \(\mathcal{N}(0,1)\). Compute the unit vector, the Dirichlet vector, and verify the distributions.
Solution:
Chi-squared level: \(g_1^2 = 1.44\), \(g_2^2 = 0.25\), \(g_3^2 = 0.64\), \(g_4^2 = 1.21\). Sum: \(S = 3.54\). (One draw from \(\chi^2(4)\); the mean of \(\chi^2(4)\) is 4, so 3.54 is a reasonable value.)
Unit vector (take square root of S): \(\|\mathbf{g}\| = \sqrt{3.54} = 1.881\).
$$(u_1, u_2, u_3, u_4) = \frac{1}{1.881}(1.2,\; -0.5,\; 0.8,\; -1.1) = (0.638,\; -0.266,\; 0.425,\; -0.585)$$Check: \(0.638^2 + 0.266^2 + 0.425^2 + 0.585^2 = 0.407 + 0.071 + 0.181 + 0.342 = 1.001 \approx 1\).
Dirichlet level (squared coordinates):
$$(u_1^2, u_2^2, u_3^2, u_4^2) = (0.407, 0.071, 0.181, 0.342)$$These sum to 1 and each is in \([0,1]\) — one draw from \(\operatorname{Dir}(1/2, 1/2, 1/2, 1/2)\). The expected value of each component is \(1/d = 1/4 = 0.25\).
Beta level (single component): \(u_1^2 = 0.407\) is one draw from \(\operatorname{Beta}(1/2, 3/2)\). The mean of this Beta is \((1/2)/(1/2 + 3/2) = 1/4 = 0.25\). Our draw of 0.407 is above the mean, which is fine — the distribution has substantial spread at \(d = 4\).
Gaussian approximation: \(u_1 = 0.638\). The approximation says \(u_1 \approx \mathcal{N}(0, 1/4)\), so \(\operatorname{SD} = 0.5\). Our value of 0.638 is 1.28 standard deviations out — perfectly typical. At \(d = 4\) the approximation is coarse; at \(d = 128\) (a real LLM head dimension) it would be far more accurate.
Worked Example A.8: Why the Dirichlet constraint matters (and why it doesn't)
Problem: In the Dirichlet vector above, suppose you observe that \(u_1^2 = 0.9\) (one coordinate hogs 90% of the squared norm). What can you say about the remaining coordinates?
Solution: Since \(\sum u_i^2 = 1\), the remaining three squared coordinates must share \(1 - 0.9 = 0.1\). Their conditional distribution is \(\operatorname{Dir}(1/2, 1/2, 1/2)\) rescaled to sum to 0.1 instead of 1. Each remaining coordinate has conditional mean \(0.1/3 \approx 0.033\), much less than the unconditional mean of \(0.25\).
This is the negative covariance at work: knowing one coordinate is large forces the others to be small. But at \(d = 1000\), the probability of any single coordinate hogging 90% of the norm is astronomically small (it would require a \(\operatorname{Beta}(1/2, 499.5)\) random variable to take the value 0.9, which is many hundreds of standard deviations from the mean of 0.001). In practice, no coordinate deviates enough from \(1/d\) to meaningfully constrain the others. That's the concentration of measure argument for near-independence.