Three axioms and one disclosed seam
TL;DR — The headline error bounds in this series carry machine-checked proof obligations across four independent tools — five of the six results; the sixth is an open conjecture — and the entire Lean development reduces, mechanically, to three classical axioms — propext, Classical.choice, Quot.sound — with zero unfinished proofs. This article is the from-zero tour of what that certificate means and of the one cross-prover seam no tool can compile away.
AlphaProof — the trained reinforcement-learning prover — ran standalone at roughly 64 TPU-hours per problem and solved none of the nine. A basic agent that just alternated LLM generation with Lean verification replicated all nine, costing more only on the hardest. That is the ablation buried in DeepMind’s formal-proof-search paper this spring (arXiv:2605.22763), and it is the part I keep quoting; the headline that got the press is the paper’s most capable agent autonomously resolving 9 of 353 open Erdős problems, at a few hundred dollars each. The documented failure mode is the other part worth quoting: proof sketches leaning on lemmas the model claimed were “established results in the mathematical literature.” On inspection, hallucinations. Prompting against the behavior failed. The sorry-free requirement — end-to-end formal verification — is what exposed them. The kernel is the whole game. Here is what the one behind my two papers buys.
data table
| system | resolved (of the 9) |
|---|---|
| basic LLM + Lean verifier loop | 9 |
| AlphaProof standalone (tree search, ~64 TPU-hours/problem) | 0 |
Four tools, one claim
No single tool checks my papers. Lean 4 + Mathlib carries the probabilistic core — the Azuma–Hoeffding martingale envelope behind the stochastic-rounding bound, plus the matching lower bounds. Coq + Flocq carries the bit-level facts: the round-to-nearest stall from the lower-bound article and the deterministic recurrence composition, compiled with zero Admitted. Gappa certifies the per-step interval round-off bounds. sympy and mpmath discharge the symbolic operand-count and rate-match obligations. Five of the six headline results carry machine-checked proof obligations; the sixth is the round-to-nearest conjecture, left open with a stated target: pass/fail criteria can be pre-registered against a bound before the experiments that test it run, the way trials pre-register endpoints. That habit comes from empirical work and is separate from what the published papers claim.
| Result | Substrate | Axiom surface |
|---|---|---|
| deterministic envelope, uniform in N | Lean + Coq/Flocq + Gappa | classical triple |
| unconditional SR Azuma envelope | Lean + sympy | classical triple |
| carry-chain matching lower bound (p ≤ w) | Lean + Coq/Flocq | classical triple |
| magnitude-factor necessity stall | Lean + Coq/Flocq | classical triple |
| SR removes adversarial coherence | Lean | classical triple |
| round-to-nearest refinement, conditional on the open L1 hypothesis | — | open conjecture |
Transcribed from the per-result machine-verification coverage table printed in the preprint (doi:10.5281/zenodo.20599315).
Three axioms, every theorem
A Lean proof is not trusted because the proof script looks right. The script is an untrusted program; a small kernel re-derives every step. What the kernel will not volunteer on its own is the axiom inventory — so for each public bridge theorem, the development byte-pins the #print axioms output behind a #guard_msgs gate: any new axiom, or a sorry buried ten imports deep surfacing as sorryAx, changes the pinned message and breaks the build. The papers state that this gate is itself the machine proof that the development is sorry-free and classical-axioms-only on every theorem. The pinned transcript ships with the paper, verifiable without the library source:
'MpAnchor.mp_qt_forward_error' depends on axioms: [propext, Classical.choice, Quot.sound]That one line is the certificate. No project-specific axiom, no unfinished proof, three axioms classical mathematics already runs on.
Verified has gradations
Even inside a prover, some paths bypass the kernel. Lean’s native_decide compiles the goal to native code and admits the answer through an extra axiom — fast, and exactly the kind of shortcut a trusted-base audit deny-lists. And a proof can pass every axiom check while certifying nothing: define rounding as the identity function and the whole development green-builds, vacuously. So certificates are labeled by strength: an implication certificate carries a weaker guarantee than a model certificate — the inference is kernel-checked but a premise crosses a disclosed boundary, versus the claim holding end-to-end inside one kernel’s model — and the label says which one you are holding.
The seam you cannot compile away
In 2026 there is no production tool that transports a Coq proof into Lean. The bit-level facts live in Coq because Flocq is where floating-point theory exists; the probability lives in Lean because Mathlib is where martingales exist. So the boundary between them is disclosed, not hidden: where a fact proved in Coq matters to a Lean theorem, the Lean statement binds it as an explicit named hypothesis, visible in the theorem statement itself, and the load-bearing shared fact is certified independently on both substrates — the round-to-nearest stall is constructed in Lean while the Coq/Flocq lane independently certifies the same stall. That is why every public bridge can print the classical triple with no project-specific axiom. If a binder were ever dropped — a Coq fact entering Lean as an axiom instead of a hypothesis — the right design is one disclosed bridge axiom that verbatim-mirrors the Qed-closed Coq theorem, content-pinned and independently re-checked by Coq’s own coqchk, with a firewall restricting which theorems may touch it and a canary proving the firewall admits the allowed consumers and fires on everyone else. That is a design sketch for the failure case, not a description of the shipped development. Pin everything: a floating reference is an attack surface whether it points at a package or at an axiom — the supply-chain lesson, one layer down.
Why this does not generalize for free
Mainstream Lean libraries carry essentially no floating-point theory, so the deterministic lane had to be human-authored in Coq — no LLM loop, however good, can retrieve lemmas that do not exist. The DeepMind result worked because Mathlib is a deep substrate for the mathematics it targeted; “AI can prove things” is substrate-dependent, and the narrowness of my domain is what made the faithfulness checks mechanizable.
A certificate, though, is only as good as the data thrown at it. The next article takes the proven envelope to a real earthquake record.