thepragmaticquant.com

The complete series · 6 parts

driftbound

Part 01 · June 12, 2026

Thirty-one papers, zero error analyses

TL;DR — The streaming sliding-window inner-product kernel — the primitive deployed as the matrix profile — updates by subtracting one product and adding one, reusing yesterday’s value forever. Classical analysis says its rounding error grows like O(N·u) in stream length, and across the entire MP-I..MP-XXXI literature no paper had ever bounded that drift. Measured against a 128-bit oracle on the sweep’s cleanest corpus: naive f64 drift climbs 3.8 decades over 16.7M ticks; at f32, even Kahan–Babuška–Neumaier (KBN) and Ogita–Rump–Oishi compensated summation grow ~3,009× and ~412× on the same data where they are essentially exact at f64. The fix is one line of topology — reseed the carry chain every k ticks — and it collapses to a closed-form tuning rule you set in a config file: fp64-safe at the default, but silently 43× off at fp32. All 18 anchored cells hold the bound with zero violations, and I keep the row where the fix loses. The math is in a public preprint (doi:10.5281/zenodo.20599315).

Nobody can tell you whether the world’s most-deployed time-series similarity kernel survives a trillion ticks at fp32, on hardware with no fp64 unit — the question has no published answer. The primitive underneath is a sliding-window inner product carried forward one product at a time; it ships as the matrix profile, the engine behind motif discovery, discord detection, and a long line of anomaly-mining systems, across thirty-one papers in its canonical series, MP-I through MP-XXXI. None of them carries a forward-error analysis of the streaming recurrence the deployed kernels actually run. So I did the math, and I started by reading the shipped code.

The kernel everyone ships

The streaming kernel maintains a sliding-window inner product per row with a rank-one subtract–add update: one operand pair leaves the window, one enters, and the cached value from the previous tick is updated in place.

text
# the streaming update, schematically
qt = qt_prev - leaving_a * leaving_b + entering_a * entering_b

That reuse is the whole story. Every tick commits a rounding error into the accumulator, and because tomorrow’s value is computed from today’s, the error never leaves. The carry chain grows with the stream.

I audited the deployed implementations at their public release tags to see who actually carries this chain. stumpy@v1.13.0 carries the recurrence indefinitely with only a single per-tick head refresh. go-matrixprofile@v0.2.0 goes to the other extreme and recomputes a full FFT every tick. SCAMP@v4.0.3 and matrixprofile@v1.1.10 are batch-only — no streaming carry at all. Every implementation sits at an extreme; none occupies the middle ground of a periodic full-state reseed, which is the regime the analysis ends up recommending.

Where each shipped kernel sits on the carry-retention axis. The deployed implementations cluster at the extremes; the bounded periodic-reseed regime the analysis recommends is the empty middle. Release tags verified by re-cloning each repository and checking out the tag.
data table
implementationrelease tagkernel class
stumpyv1.13.0streaming carry, kept indefinitely
SCAMPv4.0.3batch, per-tile, no carry
matrixprofilev1.1.10batch, diagonal sampling
go-matrixprofilev0.2.0full FFT recompute per tick
The carry chain: one rounding error committed at tick t rides inside the accumulator for every remaining tick. Periodic re-anchoring cuts the chain. Schematic of the mechanism — no measured data in this figure (doi:10.5281/zenodo.20599315).

What linear drift looks like

I ran the recurrence as a single unbroken carry chain across stream lengths N = 2¹³ to 2²⁴ — past 16.7 million ticks — against a 128-bit MPFR reference oracle (~38 decimal digits), over seven synthetic corpora, 30 seeds each, three accumulator strategies, two precisions. On the cleanest corpus in the sweep — a deterministic linear-congruential walk, which adds no rounding of its own above the accumulator — the naive f64 accumulator climbs roughly 3.8 decades, from 1.75×10⁻¹⁰ to 1.08×10⁻⁶, tracing the predicted O(N·u) envelope. Compensated Kahan–Babuška–Neumaier (KBN) at f64 holds the same chain near the oracle floor — never above 7.5×10⁻⁹. Ogita–Rump–Oishi at f64 is correctly rounded on this corpus: drift identically zero, off the log axis.

Then the f32 twist. On the same corpus, the Ogita correctly-rounded condition is first crossed around n = 2²⁰ — the exact crossing point depends on a small constant in the condition — and both compensated paths grow near-linearly from there: KBN-f32 grows ~3,009× to 9.27, Ogita-f32 ~412× to 1.27, and naive f32 reaches 1.71×10³. The algorithms did not get worse — the precision tier moved the condition boundary under them. Compensated summation alone cannot arrest the growth once the correctly-rounded condition is crossed, and dropping to f32 crossed it on this data.

Measured drift of the unbroken carry chain vs stream length, per accumulator and precision, against a 128-bit MPFR oracle. Each charted point is the worst case over the seven corpora (30-seed median); the f32 lines blow up past n = 2²⁰, where the compensated correctly-rounded condition is crossed. Chart: the printed drift-vs-N figure in doi:10.5281/zenodo.20599315. The table below is a different cut: the supplement's per-corpus drift-range table for the LCG corpus, whose minima sit below the charted worst-case lines.
data table
precisionaccumulatordrift min (LCG)drift max (LCG)
f64naive1.75e-101.08e-6
f64KBN5.82e-117.45e-9
f64Ogita00
f32naive1.25e-21.71e3
f32KBN3.08e-39.27
f32Ogita3.08e-31.27

Why not just use fp64

Because deployment already left it behind. High-throughput GPU similarity joins are bounded by memory bandwidth, not arithmetic: holding the stream in a 64-bit storage format halves effective cache and bus throughput against a 32-bit one. Power- and area-constrained edge processors frequently ship no double-precision unit at all. This is the same economics that put bf16 and fp8 into deep-learning training — reduced precision is the operating point, not a tuning choice. What was missing is a forward-error bound parametric in the working precision, stating the conditions under which the stream stays inside tolerance.

One caveat up front, because it is the structural difference the bound has to price in: a streaming carry chain compounds across ticks. That is strictly worse than one-shot GEMM accumulation, where each output element’s error closes out when the reduction ends. Here nothing ever closes out — and the production target is the trillions-of-ticks regime, several decades beyond where the naive recurrence already fails.

Cut the carry chain

The fix is one line of topology. Reseed the recurrence from a full dot product every k ticks, and the error bound stops depending on how long you stream — the stream length N drops out of the formula exactly. The streaming kernel’s whole problem is that a rounding error committed at tick t rides inside the accumulator for every remaining tick. The re-anchored topology breaks that ride: at every tick, two anchor rows are reseeded from a full length-w inner product, and a Periodic{k} policy reseeds the whole window every k ticks. A carry chain that reaches any cell now has length pk, never N.

The resulting deterministic envelope, quoted from the preprint (doi:10.5281/zenodo.20599315, Theorem 4.1):

text
|computed QT − exact QT|  ≤  γ_{4(p+w)}(u_acc) · (2p + w) · T★

The Higham envelope γn(u) = (1+u)n − 1 ≈ nu counts compounded roundings; 4(p+w) over-counts the roundings along the longest surviving chain — at most four ops per carry step (4p) plus the length-w reseed’s 2w − 1 roundings, together ≤ 4(p+w); (2p+w) counts the operands whose magnitudes can feed the error; T bounds the magnitude of any single pairwise product of stream samples (T = M² for a sample magnitude bound M); uacc is the accumulator’s unit roundoff. Nothing in the formula grows with N. Stream for a trillion ticks and the bound is the same number it was at tick k.

The theorem becomes a knob

A bound is only useful if someone can act on it. Because the envelope is strictly monotone in the carry length p, it inverts into a tuning rule: given an error tolerance τ, the largest safe reseed horizon is

text
k★  ≈  sqrt( τ / (8 · u_acc · T★) )

— the 8 is just the operation-count factor 4 times the operand factor 2. Deployment is that one division and a square root. The theorem compiles down to one integer:

yaml
# streaming similarity join — retention policy
anchor:
  policy: periodic
  k: 4096   # construction default: fp64-safe, ~4000x headroom
  # k: 628  # the fp32-safe bisection root of the envelope constraint (closed form ~676)

The numbers, from the preprint’s retention-tuning section (§SM10.1): at fp64 the safe root is around 1.7 × 107, so the construction default k = 4096 sits under it with roughly 4,000x headroom. At fp32 the safe root is k* = 628. The same default sits ~4,000x under the fp64 root and 6.5x over the fp32 one — ship fp32 at k = 4096 and you silently accept a (4096/628)² ≈ 43x tolerance excess. Nothing in the config changed — only the precision — and the theorem says the safe value moved by four orders of magnitude. That is what a bound in a config file buys you: the failure is computable before you ship it.

The tuning rule k* ~ sqrt(tau / (8 u T*)) across accumulator precision, with the exact bisection roots and the k = 4096 default marked. The default clears fp64 by ~4,000x and overshoots the fp32 root by ~6.5x — a ~43x tolerance excess. Computed from the closed form and roots printed in §SM10.1 of the preprint, doi:10.5281/zenodo.20599315.

The row where anchoring loses

The measured validation, run on the self-anchored sliding sum-of-squares kernel that carries the same recurrence topology (Table 2 of the preprint): 18 (precision, accumulator, k) cells across N = 213…224, seven corpora, 30 seeds. Every cell holds its measured drift below the envelope — zero violations. At f64 the anchored cells keep drift at or below 4.9 × 10−8 against uniform-in-N envelopes of 7.6 × 10−4 (k = 676) and 2.5 × 10−2 (k = 4096), while the unanchored baseline’s envelope grows to 4.1 × 105 by N = 224.

Measured worst-case drift (points) against the proven envelope (dashes) for all 18 cells — twelve anchored plus the six unanchored k=0 baselines. Zero violations. The vermillion point is the row where anchoring loses. Data: the anchored Periodic{k} table (Table 2) in the preprint, doi:10.5281/zenodo.20599315.
data table
precisionaccumulatorkdrift maxenvelope max
f64naive01.07e-064.08e+05
f64naive6761.49e-087.60e-04
f64naive40964.84e-082.49e-02
f64kbn004.08e+05
f64kbn6762.24e-087.60e-04
f64kbn40961.49e-082.49e-02
f64ogita07.45e-094.08e+05
f64ogita6762.24e-087.60e-04
f64ogita40961.49e-082.49e-02
f32naive01.71e+033.07e+12
f32naive6761.73e+014.08e+05
f32naive40963.16e+011.34e+07
f32kbn09.27e+003.07e+12
f32kbn6766.89e+004.08e+05
f32kbn40965.27e+001.34e+07
f32ogita01.27e+003.07e+12
f32ogita6766.89e+004.08e+05
f32ogita40965.27e+001.34e+07

Now the row that argues against the fix. The unanchored Ogita–Rump accumulator at f32 reaches a worst-case drift of 1.27 — below every anchored f32 cell (those run 5.27 to 31.6). The reason is mechanical: the reseed itself rounds a length-w sum every k ticks, a cost an unbroken compensated chain never pays. Where the accumulator was already strong enough to tame the full chain, anchoring adds rounding events. Where it was designed to help, it helps unambiguously: naive at every tier (1.7 × 103 down to ~30 at f32), KBN at f32 (9.27 down to 5.27). And for every accumulator it wins in the worst case, where only bounded retention keeps the envelope from growing with N.

The regime boundary

The envelope is a theorem with a boundary. It says something only in the convergence regime 4(p+wuacc < 1. At the w = 64 deployment with an fp32 accumulator, 4w·uacc ≈ 1.5 × 10−5 — comfortable. Swap in a bf16 accumulator (uacc = 2−8) and 4w·uacc ≈ 1.0 already at p = 0: the envelope is vacuous before the first reseed. This is why the bound keeps storage precision and accumulator precision as separate parameters — bf16 storage is fine; bf16 accumulation is outside the theorem (preprint, Table SM2).

driftbound: frequently asked questions

What is driftbound?
driftbound is a forward-error analysis of the streaming sliding-window inner product, the kernel deployed as the matrix profile for motif discovery, discord detection, and anomaly mining. It bounds how far the streaming result can drift from the exact value, and reduces the fix to a single integer you set in a config file. The bound is machine-checked in Lean and Coq.
What problem does driftbound solve?
The streaming kernel updates by subtracting one product and adding one. It reuses the previous tick's value indefinitely, so each tick's rounding error stays in the accumulator and grows linearly with stream length. Across thirty-one matrix-profile papers, none carried a forward-error analysis of this drift. At 32-bit precision, even compensated summation grows without bound past roughly a million ticks.
What is the fix?
Reseed the carry chain from a full dot product every k ticks. Stream length then drops out of the error bound, so the bound holds the same at a trillion ticks as at the first reseed. The largest safe k follows a closed-form tuning rule that reduces to one integer in a config file. At 32-bit precision the safe value is about 43 times tighter than at 64-bit, so a mis-set config is computable before you ship.
Is the bound proven and validated?
Yes. The deterministic envelope is proven in a public preprint (doi:10.5281/zenodo.20599315) and machine-checked. Across 18 precision, accumulator, and k configurations over seven corpora and stream lengths to 16.7 million ticks, every cell held its measured drift below the bound with zero violations. This includes the one disclosed row where periodic reseeding loses to compensated summation. Two papers are under review.

Where the series goes

That is the deterministic result: an envelope uniform in stream length, and a knob to set it with. What it cannot cover is the worst case itself — this bound has a matching adversary no summation algorithm escapes, and a stall input that makes round-to-nearest discard every increment in the same direction. After that: the kernel production actually runs, how the proofs were made checkable, and a closing scoreboard where every empirical leg of the program reports its violation count. Next: the adversary and the coin flip that breaks it.

Part 02 · June 12, 2026

The accumulator that never moved

TL;DR — There is an input on which round-to-nearest discards every increment, in the same direction, forever: the computed accumulator never moves while the true sum climbs. That stall proves the window-magnitude factor in the opening article’s envelope is necessary, and it proves the hypothesis behind the natural probabilistic refinement false in the worst case, which is why that refinement ships as a conjecture. A separate matching lower bound (products pre-rounded, accumulator exact) shows the linear error growth itself is intrinsic: no summation algorithm escapes it. The escape is a different rounding mode: stochastic rounding gives an unconditional √n envelope — the model-derived fp32 deep-n slope is 0.5014, a direct 20,000-seed emulator run to n = 3×10⁷ measures 0.4978, and both confidence intervals exclude the deterministic rate 1.

Park an accumulator at an exactly representable value. Feed it increments that are each just under half an ulp of that value. Round-to-nearest evaluates the addition, finds the result closer to where it started, and snaps back. Every step. The same direction. The computed sum is stationary while the true sum climbs monotonically — and the discarded residuals stack into an error of order p(w−p)·u·T★, quadratic in the window parameters when the carry length sits near half the window. The round-back induction is machine-checked in Coq/Flocq (what that means, and what it does not, is a later article’s job).

The drift floor no algorithm escapes

The opening article gave an upper bound that grows linearly in the carry length p. The reflexive objection is that the linearity is analysis slack — surely a better accumulator beats it. My matching lower bound says no: there is an adversarial input forcing drift of at least c·p·u·T★ even if the accumulator performs every summation step in exact arithmetic. This is a different construction from the stall above, its mirror image: here the products are rounded before they ever reach the accumulator, the adversary places each one a hair past a rounding midpoint so every error lands with the same sign, and the chain has no internal cancellation to spend. In the stall, the products are benign and the accumulator’s own rounding carries the error. (The model excludes fused multiply-add; that is a different machine.)

That construction is why the upper bound is tight. Kahan, Ogita–Rump, pairwise — pick any summation scheme you like. The error floor is set by the rounding mode, before the algorithm gets a vote.

A conjecture, not a theorem

The natural next move is probabilistic: model the rounding errors as mean-zero given the past, run a martingale argument, and the linear rate sharpens to √(p+w). That result appears in my paper — as a conjecture, not a theorem. Its load-bearing hypothesis, mean-independence of the rounding errors under round-to-nearest (the paper calls it L1), is what my own stall construction falsifies, and I did not have to hunt for the counterexample: the lower-bound witness and the counterexample are the same object. The input I built to prove the magnitude factor necessary is also an input on which every error is discarded in the same direction, so conditioned on the past the errors are deterministically same-signed; mean-zero fails outright. The hypothesis is provably false in the worst case. Whether it still holds for generic zero-mean data is open — the same question flagged in the recent numerical-analysis literature (research problem 16.1 of Mary’s 2025 habilitation). Demoting it had a price: the deterministic √(p+w) rate stays unclaimed, and the unconditional stochastic-rounding theorem below had to carry the paper’s headline instead.

Flip a coin

Stochastic rounding rounds up or down at random, with probability proportional to proximity. That randomness removes the one property the stall relies on: every error landing the same way. Under stochastic rounding the per-step error splits into a mean-zero random part and a small deterministic residual, and the mean-zero part is mean-zero by construction, for every input. No L1 hypothesis, no distributional assumption. My main theorem turns that into an unconditional high-probability bound at the √n rate, parametric in three quantities: the storage precision, the accumulator precision, and the hardware random-bit budget r. The residual deterministic bias scales as 2−r, so a finite random-bit budget leaves a finite bias.

20,000 seeds say √n

A probabilistic envelope is a claim about a distribution, so I validated it with an ensemble: 20,000 independent stochastic-rounding seeds, run on the coherent regime where round-to-nearest is at its worst. The rate is the claim that matters. At fp32 the in-binade walk is provably a binomial process, and that model puts the deep-n slope at 0.5014 with a confidence interval [0.4990, 0.5039]; a direct emulator run out to n = 3×10⁷ measures 0.4978 [0.4898, 0.5058]. Both intervals exclude the deterministic slope 1, and fp16 (0.5027) and bf16 (0.5144) read the same way. The mean-zero claim checks out too: a 95% interval brackets zero at every precision tier. Envelope coverage is 1.000 at every tested failure probability (λ = 0.1, 0.05, 0.01), with Wilson 99.99% lower bounds at 0.999 — and the bound is not slack for all that coverage, because the empirical-to-theoretical constant ratio sits between 0.56 and 0.72, order one, which is the tightness check.

Error growth rendered from the printed rates: round-to-nearest's worst-case linear envelope (slope 1) against the stochastic-rounding ensemble's model-derived fp32 deep-n slope of 0.5014 (a direct emulator run to n = 3×10⁷ measures 0.4978), normalized at n = 10³. Slopes and coverage from the SR ensemble coverage table in the preprint (doi:10.5281/zenodo.20599315).
data table
tiercoverage (λ=0.1 / 0.05 / 0.01)slope [CI]const. ratio
fp321.000 [0.999] at all three0.5014 [0.4990, 0.5039] (model) / 0.4978 [0.4898, 0.5058] (measured)0.72
fp161.000 [0.999] at all three0.5027 [0.4885, 0.5168]0.65
bf161.000 [0.999] at all three0.5144 [0.4544, 0.5745]0.56
demo (u=2⁻¹³)1.000 [0.999] at all three0.4964 [0.4877, 0.5051]0.69
The stall, drawn: a bucket that silently rejects every sub-half-ulp drop while the true level rises. A sketch of the Coq-checked construction; no measured data in this figure.

This is the same mechanism deep-learning accelerators already ship, here with a proven envelope and the random-bit budget as an explicit parameter. The budget is documented on one vendor’s hardware-SR path (the Graphcore IPU, r ≈ 23), inferable for NVIDIA Hopper, and undisclosed for Blackwell’s down-conversion path. So the bound reads as a hardware contract: the number of random bits a device must supply to sustain a streaming join without coherent drift.

All of this covers the textbook inner-product kernel. The one production actually ships is structurally different — and it had no analysis at all.

Part 03 · June 12, 2026

The error analysis everyone cites is for a kernel nobody runs

TL;DR — Production deployments of the primitive (SCAMP, STUMPY) do not run the textbook sliding inner product the earlier articles analyzed; they accumulate a mean-centred covariance with cached, reused increments. At mu = 0 the two kernels compute the same values — and their floating-point computation graphs still differ, by two structural breaks and one bias amplification that defeat the raw proof’s machinery and carry a second paper (doi:10.5281/zenodo.20599478). The centred envelope held in all 100 measured containment cells, worst ratio 7.54e-3.

A reviewer will say the centred analysis is a corollary of the raw bound: set the mean to zero and the centred kernel’s values coincide with the raw one’s. I re-derived both kernels symbolically to settle it, and the answer is no. The values coincide at mu = 0; the floating-point computation graph does not — by two structural breaks and one bias amplification the raw proof’s machinery cannot absorb. That distinction is the entire second paper.

What production actually accumulates

SCAMP and STUMPY do not slide a raw dot product. They slide a cached covariance by the SCAMP update,

text
cov[i+1, j+1] = cov[i, j] + df[i] * dg[j] + dg[i] * df[j]

where the mean-deviation increments — df[i] = (x[i+m] - x[i]) / 2 and its dg sibling — are precomputed once per stream position and reused at every diagonal cell that visits position i. I verified this against the public release tags: at SCAMP@v4.0.3 the per-diagonal covariance accumulation lives in the CPU stats and tile-kernel sources, and at stumpy@v1.13.0 in the diagonal-compute routine. Both round each product before the staged add — four rounding events per carry tick.

Two things now break the raw kernel’s structure. Each cached increment is read at up to min(i+1, P−i) distinct diagonal steps. And the carry adds two products per tick and evicts none. A third feature is shared with the raw kernel but enters differently: because increments are reused across all near-diagonal cells, the deterministic rounding bias they carry is correlated across those cells — the sqrt-of-cell-count averaging an independent-bias model would grant simply is not available, so the bias contribution is stronger, not new.

The cache-reuse trap

The raw paper’s probabilistic bound rests on a no-double-count lemma: every rounding event occupies a distinct martingale index, exactly once. Increment reuse kills that hypothesis directly. One rounding committed when df[i] was formed re-enters up to min(i+1, P−i) later computations — the same coin flip appearing at many martingale positions. Counted naively, the argument dies.

The fix is a re-partition, one of the paper’s two new lemmas; the companion composition algebra that assembles the resulting martingale bound is machine-checked in Coq. Fix the precompute draws once and fold the reuse into deterministic coefficient perturbations of the form (1+ζ); what remains is a clean event count — the seed fires 4m−1 rounding events, each carry tick fires 4 genuinely fresh ones, for (4m−1) + 4p in total — with every index distinct.

Increment reuse fan-out versus the raw telescope, and the healing-in-the-gap locality of the localization corollary: a spike in the dead zone between a cell's two anchor windows provably cannot touch that cell. Mechanism sketch, no measured data; the localization corollary it depicts is proved in the preprint (doi:10.5281/zenodo.20599478).

Add two, evict none

The raw carry adds one product and removes one, so errors can partially telescope. The centred carry adds two and removes nothing. Counterintuitively, this makes the lower bound easier to prove: the summed errors are 2p entering products with no leaving term to cancel against, so the adversary just pushes every rounding the same way. The paper’s lower bounds land at Θ(p·u·M_cov) and Θ(p·m·u·M_cov) against an upper envelope of Θ((2p+m)(p+m)·u·M_cov) — matched on the carry-rate in p and on the magnitude weighting; a residual gap of order m/p remains when p is much smaller than m.

Conditioning is not a prefactor

The way mean-centring is usually justified conflates two effects — I conflated them myself until the derivation forced the split. Centring wins because it avoids catastrophic cancellation: the small covariance you divide by is computed directly, not as the difference of two large inner products. It does not win by a smaller error constant — the centred magnitude scale M_cov can be up to twice the raw scale T★. The bounds quantify each effect separately.

The remaining check is whether the centred envelope holds on the kernel that ships. The measured check ran the genuine two-product off-diagonal carry against an f64-exact reference at N = 2^20, across 7 synthetic corpora, 30 seeds per cell, windows 16 through 256, reference distances up to 4096, near and far diagonal bands. All 100 (precision, window, distance, band) cells were contained; the table below prints the per-(precision, window) worst-case rows of that grid. The worst ratio of measured error to envelope was 7.54e-3 — two orders of magnitude of headroom — and at f32 with window 256 the measured error was 1.89 against an envelope of 290.

Worst-cell ratio of measured forward error to the centred envelope, per precision and window. Every cell sits roughly two orders of magnitude below the containment line. Data: Table 1 of the preprint (doi:10.5281/zenodo.20599478).
data table
precisionwtightest (p, band)measured errcentred enveloperatio
f6416(64, near)2.04e-105.06e-084.02e-3
f6432(64, near)4.66e-106.75e-086.90e-3
f6464(64, near)7.57e-101.08e-077.01e-3
f64128(64, near)1.63e-092.16e-077.54e-3
f64256(64, near)3.61e-095.40e-076.68e-3
f3216(64, near)9.74e-022.72e+013.59e-3
f3232(64, near)1.89e-013.62e+015.22e-3
f3264(64, near)3.81e-015.80e+016.57e-3
f32128(64, near)7.25e-011.16e+026.25e-3
f32256(64, near)1.89e+002.90e+026.51e-3

This experiment exercises the off-diagonal carry specifically, and the design earns a closer look: a validation of the same shape can stream the wrong quantity, compare it against the wrong envelope, and still pass with every cell green. How a green experiment can be measuring nothing is the next article’s story.

Part 04 · June 12, 2026

Green is not evidence

TL;DR — Three verification gate designs can read green while verifying nothing: an empirical validation that measures the wrong quantity against the wrong envelope, a canary that returns the same exit code whether the gate is alive or dead, and a reviewer-facing audit command that produces no evidence and exits 0. Each failure mode became a mechanical rule — the table at the end carries all three.

An empirical validation that streams a diagonal sum-of-squares and checks it against the raw kernel’s envelope — the wrong quantity, against the wrong bound — will pass. Of course it will pass: an experiment that validates the wrong thing looks exactly like one that validates the right thing. Both produce a table of cells under a curve and a green check mark. The centred paper’s validation instead runs the genuine off-diagonal cross-covariance grid whose numbers the previous article reports — every cell contained, against the centred envelope that actually governs it. That corrected design is the good news. This article is about the failure mode: how the wrong design passes, and two other gate designs that fail for equally hollow reasons.

Consistency is not faithfulness

Formal gates — proof checkers, CI assertions — check that conclusions follow from premises. None of them check that the premises are true of the deployed code. The wrong-envelope experiment above is internally consistent: the harness runs, the comparison executes, every cell lands under the curve. The flaw sits upstream of everything the gates can see — the quantity being streamed is not the quantity the deployed kernel streams, and the envelope it is compared against is not the theorem that covers it.

This is the same failure class as the misformalization the previous article describes, one layer down: there, analyzing the textbook kernel while production runs a different one; here, validating a different quantity while believing you validated the real one. A claim can ship false while every formal gate is green. Consistency and faithfulness need different machinery.

The uncomfortable property of this failure class is why it attracts no scrutiny. A passing test looks fine: the harness is careful, the comparison is correct given its inputs, and reviewer instinct spends attention on red cells, of which there are none. Nothing in the output reveals that the quantity being measured is the wrong one. The wrongness lives in the binding between the experiment and the thing it claims to be about — a place no amount of staring at output can reach. So the binding is what a machine must test, deliberately, in every build.

Strict equality or nothing

The fix was a rebuild, not a patch: a different streamed quantity, a different envelope, the whole grid rerun from scratch. The machinery that keeps it fixed: one symbolic oracle as the single source of truth for what the deployed recurrence computes, fuzzed against the actual pip-installed kernel on integer-valued inputs where floating point is exact — so the comparison is strict ==. Any tolerance parameter is rejected outright, because tolerance creep is the named decay path: every atol starts as “just to absorb harmless noise” and ends as the hole the next wrong-quantity bug walks through.

python
# the shape of the rule, not project source
assert oracle_value == deployed_value          # accepted
assert abs(oracle_value - deployed_value) < 1e-9   # rejected at review, always

And because the oracle itself is now the single point of failure, it gets the same treatment: a deliberately poisoned mutant oracle is swapped in on every run, and the harness must catch it or the build fails. The gate demonstrates it can go red in every build that goes green.

The canary that couldn’t sing

That last rule has its own postmortem. One gate’s self-test — its canary — ran the gate against deliberately broken input and asserted a non-zero exit. The canary returned exit 1 in the healthy branch and in the broken branch. A dead gate read as PASS, indistinguishable from a live one, for the entire lifetime of the canary. An adversarial review pass — run by an AI agent — is the class of check that catches it. The fix is to give every outcome a distinct signature: poison caught, poison missed, clean input passed. That fix, when applied to the first canary, immediately exposed the identical defect in the suite’s one other, older canary: two canaries, one shared blind spot.

The canary truth table, before and after. Before the fix, both the healthy and the broken gate produced the same exit code on poisoned input. A sketch of the documented bug mechanism, not project source.

Days later, an audit found that the planted poisons themselves proved less than they appeared to. Both canaries’ poisons were true statements whose only falsehood was a placeholder proof. They fired — but one well-meaning edit completing the placeholder would have disarmed them silently, leaving a canary that reports healthy and tests nothing. Every poison now states a proposition that is actually false.

Exit-0 theater

The third postmortem is the simplest. A reviewer-facing audit command is supposed to print, for each theorem in the certificate, its full assumption set — the artifact a referee checks. In this failure class it prints section headers and no evidence, for dozens of theorems, and exits 0. The cause: the prover’s batch mode silently ignores input on stdin and reports success having checked nothing. The verification command verifies nothing, cleanly, every time — and no measurement can catch it without an independent content assertion.

sh
# the trap: success is inferred from the absence of error
check_all | tee audit.txt ; echo "exit=$?"   # exit=0, audit.txt: headers only, zero axiom sets

# the fix: assert on content, not on exit code
for thm in $THEOREMS; do
  out=$(probe "$thm")
  [ -n "$out" ] || { echo "FATAL: $thm produced no evidence"; exit 1; }
done

The rebuilt command runs one compile probe per theorem and fails loudly if any probe does not produce its evidence. Exit 0 is a claim about plumbing — the process ran and nothing crashed. It is not a claim about the world.

The rules

what read greenwhat was actually truethe rule it became
Validation passed, all cells under the envelopeWrong quantity, wrong envelope — the real kernel was never measuredOne oracle as source of truth; strict equality at trust boundaries; no tolerance parameters, ever
Canary asserted non-zero exit, got itSame exit code from a live gate and a dead oneDistinct failure signatures per outcome; mutation-test the test
Planted poisons fired and the canaries reported healthyThe poisons were true statements with placeholder proofs — one edit would have disarmed them silentlyEvery poison states an actually false proposition, and demonstrably fires
Audit command exited 0Headers and no evidence — the prover’s batch mode checked nothingAssert on evidence produced, never on absence of error

The generalization is one the AI crowd already lives with: these are evals. A benchmark harness that passes while measuring the wrong thing is indistinguishable from one that works — until you plant a failure that must fail and watch whether it does. If a planted failure doesn’t fail CI, your gate is decoration, and green is a color, not evidence. The gates that survived are the ones where judgment lives in an artifact a small kernel re-checks rather than in an exit code — the next article is about those.

Part 05 · June 12, 2026

Three axioms and one disclosed seam

TL;DR — The headline error bounds in this series carry machine-checked proof obligations across four independent tools — five of the six results; the sixth is an open conjecture — and the entire Lean development reduces, mechanically, to three classical axioms — propext, Classical.choice, Quot.sound — with zero unfinished proofs. This article is the from-zero tour of what that certificate means and of the one cross-prover seam no tool can compile away.

AlphaProof — the trained reinforcement-learning prover — ran standalone at roughly 64 TPU-hours per problem and solved none of the nine. A basic agent that just alternated LLM generation with Lean verification replicated all nine, costing more only on the hardest. That is the ablation buried in DeepMind’s formal-proof-search paper this spring (arXiv:2605.22763), and it is the part I keep quoting; the headline that got the press is the paper’s most capable agent autonomously resolving 9 of 353 open Erdős problems, at a few hundred dollars each. The documented failure mode is the other part worth quoting: proof sketches leaning on lemmas the model claimed were “established results in the mathematical literature.” On inspection, hallucinations. Prompting against the behavior failed. The sorry-free requirement — end-to-end formal verification — is what exposed them. The kernel is the whole game. Here is what the one behind my two papers buys.

The published ablation, redrawn: a bare LLM + Lean-verifier loop replicated all 9 resolved Erdős problems; the trained prover standalone solved 0 of 9. Data from arXiv:2605.22763; the 9/353 headline belongs to the paper's most capable agent.
data table
systemresolved (of the 9)
basic LLM + Lean verifier loop9
AlphaProof standalone (tree search, ~64 TPU-hours/problem)0

Four tools, one claim

No single tool checks my papers. Lean 4 + Mathlib carries the probabilistic core — the Azuma–Hoeffding martingale envelope behind the stochastic-rounding bound, plus the matching lower bounds. Coq + Flocq carries the bit-level facts: the round-to-nearest stall from the lower-bound article and the deterministic recurrence composition, compiled with zero Admitted. Gappa certifies the per-step interval round-off bounds. sympy and mpmath discharge the symbolic operand-count and rate-match obligations. Five of the six headline results carry machine-checked proof obligations; the sixth is the round-to-nearest conjecture, left open with a stated target: pass/fail criteria can be pre-registered against a bound before the experiments that test it run, the way trials pre-register endpoints. That habit comes from empirical work and is separate from what the published papers claim.

ResultSubstrateAxiom surface
deterministic envelope, uniform in NLean + Coq/Flocq + Gappaclassical triple
unconditional SR Azuma envelopeLean + sympyclassical triple
carry-chain matching lower bound (p ≤ w)Lean + Coq/Flocqclassical triple
magnitude-factor necessity stallLean + Coq/Flocqclassical triple
SR removes adversarial coherenceLeanclassical triple
round-to-nearest refinement, conditional on the open L1 hypothesisopen conjecture

Transcribed from the per-result machine-verification coverage table printed in the preprint (doi:10.5281/zenodo.20599315).

A schematic informed by the preprints' coverage sections (doi:10.5281/zenodo.20599315, doi:10.5281/zenodo.20599478): four lanes, a per-result substrate map, and the cross-prover boundary drawn as an explicit labeled seam. Not measured data.

Three axioms, every theorem

A Lean proof is not trusted because the proof script looks right. The script is an untrusted program; a small kernel re-derives every step. What the kernel will not volunteer on its own is the axiom inventory — so for each public bridge theorem, the development byte-pins the #print axioms output behind a #guard_msgs gate: any new axiom, or a sorry buried ten imports deep surfacing as sorryAx, changes the pinned message and breaks the build. The papers state that this gate is itself the machine proof that the development is sorry-free and classical-axioms-only on every theorem. The pinned transcript ships with the paper, verifiable without the library source:

text
'MpAnchor.mp_qt_forward_error' depends on axioms: [propext, Classical.choice, Quot.sound]

That one line is the certificate. No project-specific axiom, no unfinished proof, three axioms classical mathematics already runs on.

Verified has gradations

Even inside a prover, some paths bypass the kernel. Lean’s native_decide compiles the goal to native code and admits the answer through an extra axiom — fast, and exactly the kind of shortcut a trusted-base audit deny-lists. And a proof can pass every axiom check while certifying nothing: define rounding as the identity function and the whole development green-builds, vacuously. So certificates are labeled by strength: an implication certificate carries a weaker guarantee than a model certificate — the inference is kernel-checked but a premise crosses a disclosed boundary, versus the claim holding end-to-end inside one kernel’s model — and the label says which one you are holding.

The seam you cannot compile away

In 2026 there is no production tool that transports a Coq proof into Lean. The bit-level facts live in Coq because Flocq is where floating-point theory exists; the probability lives in Lean because Mathlib is where martingales exist. So the boundary between them is disclosed, not hidden: where a fact proved in Coq matters to a Lean theorem, the Lean statement binds it as an explicit named hypothesis, visible in the theorem statement itself, and the load-bearing shared fact is certified independently on both substrates — the round-to-nearest stall is constructed in Lean while the Coq/Flocq lane independently certifies the same stall. That is why every public bridge can print the classical triple with no project-specific axiom. If a binder were ever dropped — a Coq fact entering Lean as an axiom instead of a hypothesis — the right design is one disclosed bridge axiom that verbatim-mirrors the Qed-closed Coq theorem, content-pinned and independently re-checked by Coq’s own coqchk, with a firewall restricting which theorems may touch it and a canary proving the firewall admits the allowed consumers and fires on everyone else. That is a design sketch for the failure case, not a description of the shipped development. Pin everything: a floating reference is an attack surface whether it points at a package or at an axiom — the supply-chain lesson, one layer down.

Why this does not generalize for free

Mainstream Lean libraries carry essentially no floating-point theory, so the deterministic lane had to be human-authored in Coq — no LLM loop, however good, can retrieve lemmas that do not exist. The DeepMind result worked because Mathlib is a deep substrate for the mathematics it targeted; “AI can prove things” is substrate-dependent, and the narrowness of my domain is what made the faithfulness checks mechanizable.

A certificate, though, is only as good as the data thrown at it. The next article takes the proven envelope to a real earthquake record.

Part 06 · June 12, 2026

One earthquake pinned my error bound forever

TL;DR — The deterministic error bound carries a global magnitude factor T★ — a single worst-case constant over the entire stream. One large spike pins it at a uselessly large value for every tick that follows, even though the actual error stays tiny. The fix is a windowed factor that heals once the spike ages out: on the printed synthetic run the post-exit ratio between the two envelopes averages ~3.2×10⁵. The capstone scoreboard covers the main empirical legs of both papers: zero violations.

The last article ended on a warning: a certificate is only as good as the data you throw at it. So here is the data. The bound’s magnitude factor T★ is a single global constant — the largest product magnitude anywhere in the entire stream. One large spike anywhere fixes it for the whole bound. Feed the kernel that spike early, and T★ sits at the spike’s scale for every remaining tick. The error itself stays tiny; the envelope around it stays enormous. The theorem is true, and on exactly the high-dynamic-range data that motivates streaming in the first place — seismic data — its global form tells you almost nothing.

Global pins, local heals

My answer is locality. The first patch I reached for — a prefix maximum T_t that grows as the stream does — repaired nothing: it is non-decreasing, so after the spike it equals T★ at every later tick.

The patch that works: window the magnitude factor on the re-anchoring topology itself. Under a Periodic{k} reseed, only a bounded window of recent ticks can still influence the carry, so the magnitude factor can be taken over that window alone. Call it T_local(t). Once the spike exits the window, T_local collapses back to the pre-spike level — on the printed run, the global-only envelope I would otherwise have shipped runs five orders of magnitude looser for the entire post-exit stream. The global bound remembers forever; the local one is allowed to forget.

The printed synthetic run makes the gap concrete. Stream of 20,000 ticks, window 50, reseed every 200, fp32, worst-case carry lag 199. A spike of scale 10³ enters at tick 300 and leaves the data 21 ticks later. The global envelope jumps from about 0.23 to about 8.1×10⁴ and never returns. The local envelope rides at the same inflated level only while the spike sits inside its window — then, at tick ≈621, it heals back to roughly where it started. (In the chart below the drop lands at the t = 600 re-anchor boundary, where the envelope is next re-evaluated, just ahead of the printed window-exit tick of ≈621.) Post-exit ratio between the two: a mean of ~3.2×10⁵. Same theorem, same stream, five orders of magnitude of difference in what the bound is worth.

Global versus windowed envelope on the synthetic healing run: one spike pins the global bound forever; the local bound heals at tick ≈621, post-exit ratio mean ≈3.2×10⁵. Measured error never exceeds either. Rebuilt from the data behind Figure 3 of the preprint (doi:10.5281/zenodo.20599315).
data table
tickglobal envelopelocal envelopemeasured error
250 (pre-spike)0.2260.2267.3e-7
320 (in spike)8.09e48.09e41.9e-4
620 (post-heal)8.09e40.2342.2e-7
19,999 (end)8.09e40.3361.4e-6

The earthquake

The synthetic spike is a planted toy, so I ran the same comparison on the real Parkfield broadband seismogram. The toy’s pathology reproduced exactly: the global factor stays pinned at the quake’s peak for the entire post-event stream, while T_local recovers to four to five orders of magnitude smaller across the benign tail. The measured drift never exceeded either envelope at any point of the post-event stream. The bound held on data that was designed, by nature rather than by me, to embarrass it.

The figure below traces the whole record — every series across the full stream, not a window around the event. The mainshock arrives partway through. The global envelope steps up to the quake’s peak and stays there flat to the end. The local envelope rides up with it through the event, then bleeds back down across the long benign tail that follows. The measured drift sits a dozen-plus orders of magnitude below both the whole way across, with no excursion toward either line at the mainshock or anywhere after it.

The full long-record run: the global envelope pins at the mainshock's peak and never recovers, while the windowed local envelope heals down across the benign tail that follows; the measured drift stays a dozen-plus orders of magnitude below both, with zero violations across the whole stream. Per-bucket max downsample (600 buckets, peaks preserved). Drawn from the data behind the earthquake section of the preprint (doi:10.5281/zenodo.20599315).

What “holds” means

Here is the empirical record of both papers, with the failure count printed. This table doubles as the series index.

empirical legscaleresultowner
unbroken-chain drift sweepN = 2¹³…2²⁴, 7 corpora × 30 seeds, f64 + f32traces the predicted carry-chain growth (numbers in Part 1)Part 1
anchored Periodic{k} grid18 (precision, accumulator, k) cells0 envelope violationsPart 1
regime-boundary sweepfull regime-boundary grid0 failures; every in-regime cell containedPart 1
stochastic-rounding ensemble20,000 seeds × 4 precision tierscoverage held at every tested failure probabilityPart 2
centred cross-covariance grid100 (precision, window, distance, band) cells0 envelope violationsPart 3
Parkfield seismic recordmainshock broadband recordmeasured drift never exceeded either envelopethis article

A worst-case bound that survives all of this is a falsifiable prediction that was given every chance to fail — planted adversaries, 20,000 random seeds, a real earthquake — and did not. That is the standard the gate-design article argues for: these checks could have failed, visibly, in any of those cells.

One narrow family of kernels

These are theorems about one family of streaming inner-product kernels under IEEE-754 arithmetic, machine-checked because the domain is narrow enough that the spec-faithfulness question can be mechanized — the trust stack works because the territory is small. Both manuscripts are public preprints (doi:10.5281/zenodo.20599315, doi:10.5281/zenodo.20599478) and currently under review. The work is part of an ongoing research program, and that is all this series will say about what comes next.