The complete series · 2 parts
stillpoint
Part 01 · June 24, 2026
When is the swarm actually done?
TL;DR — When one AI agent hands work to another, which hands it to a third, no single participant can tell whether the whole chain actually finished. Each agent only sees its own immediate neighbor say “got it.” Below is a tiny three-program demo you can copy, paste, and run in two minutes: a root caller asks A, A asks B, B answers 200 accepted — and then silently drops the work it just promised. Root prints SUCCESS. Every process exits 0. The system looks healthy and is lying, and nobody in the chain misbehaved. The reason isn’t a bug you can patch in the demo: once A’s request to B closes, A literally cannot learn B’s downstream fate. There is no global observer of the cascade. This has a name distributed-systems researchers have studied for forty years — distributed termination detection — and the agent ecosystem is building swarms as if it doesn’t exist.
Picture a relay race where you, the coach, can see the first runner take off but not the finish line. The first runner passes the baton, jogs back, and tells you “handed it off, all good.” You write down race complete. But the runner three legs downstream just tripped, dropped the baton, and walked off the track. Your first runner doesn’t know that. You don’t know that. The only report you got — “I passed the baton” — was completely honest and completely useless for the question you actually care about: did anyone cross the finish line?
Swap runners for AI agents and that is the exact shape of a problem the agent industry is racing past. An orchestrator agent delegates to a specialist; the specialist delegates to a tool-running sub-agent; that one delegates again. Everyone reports back to the agent directly above them, and everyone reports the truth. And the system as a whole can still have no idea whether the work got done.
Let me make that concrete enough to run on your laptop.
Three programs that lie to each other politely
The entire system fits in three scripts. Root is a script. A and B are two tiny web servers on two different ports — separate processes, talking over the network, exactly like real agents do. Pure FastAPI and httpx, no agent frameworks, nothing private. You need only pip install fastapi uvicorn httpx — uvicorn is the tiny web server that runs A and B.
(If async/await isn’t your daily language: read await as “wait here for this.” That’s enough to follow every line.)
b.py — the agent that accepts work and then drops it:
```python
# b.py -- run with: uvicorn b:app --port 8002
import asyncio, logging

from fastapi import FastAPI

app = FastAPI()
_running = set()  # anchor background tasks so they can't be silently garbage-collected


async def do_the_real_work():
    # The actual job B promised to do. Pretend it is a long task.
    try:
        await asyncio.sleep(1)
        raise RuntimeError("B's worker died here")
    except Exception:
        logging.exception("WORK DROPPED -- and no one upstream is listening")


@app.post("/work")
async def work():
    # Kick off the real work in the background, do NOT wait for it...
    t = asyncio.create_task(do_the_real_work())
    _running.add(t)
    t.add_done_callback(_running.discard)
    # ...and immediately answer the caller: "accepted!"
    return {"status": "accepted"}  # HTTP 200, instantly
```

a.py — the middle agent that trusts B’s “accepted”:
```python
# a.py -- run with: uvicorn a:app --port 8001
import httpx
from fastapi import FastAPI

app = FastAPI()


@app.post("/run")
async def run():
    async with httpx.AsyncClient() as client:
        resp = await client.post("http://localhost:8002/work")
    # A sees B's 200 "accepted" and considers ITS OWN job done.
    return {"status": "done", "downstream": resp.json()}
```

root.py — the orchestrator that believes A:
```python
# root.py -- run with: python root.py
import httpx

resp = httpx.post("http://localhost:8001/run")
print("Root got:", resp.json())
print("SUCCESS" if resp.json()["status"] == "done" else "FAILURE")
```

Start the two servers in two terminals, then run the root in a third:
```sh
pip install fastapi uvicorn httpx
uvicorn b:app --port 8002   # terminal 1
uvicorn a:app --port 8001   # terminal 2
python root.py              # terminal 3
```

Root’s terminal prints:
```text
Root got: {'status': 'done', 'downstream': {'status': 'accepted'}}
SUCCESS
```

SUCCESS. Exit code 0 (the success code). Meanwhile, moments later, in B’s terminal, you will see a line like WORK DROPPED -- and no one upstream is listening followed by a traceback — because B already answered 200 before the work ran, so when the work blows up, the failure has nowhere to go. The job you asked for did not happen, and every single participant believes everything is fine.
The line t = asyncio.create_task(do_the_real_work()) is the whole trick: it starts the work running but does not wait for it — Python schedules it to run later and immediately moves on to return the 200. That single line is the gap between “accepted” and “done,” and it is everywhere. Swap create_task for a queue, a background thread, or a message to another service, and the trap is identical: nearly every async API in the world answers accepted and lets you assume completed.
That gap is the crux. What each reply actually licenses you to believe:
| What the reply says | What it actually means | What it does NOT mean |
|---|---|---|
| 200 / accepted | B received the request | B finished, or even started, the work |
| A returns done | A heard B say accepted | the work below A succeeded |
| Root prints SUCCESS | every reply came back | the job got done |
The work is gone. The system is green. No one lied.
Why A genuinely cannot know
If you go looking for a bug in the code, you won’t find one — not in the usual sense. Each node behaved correctly given what it could see. The failure is in the shape of the system, not in any one node’s code.
Walk the timeline. A opens an HTTP request to B. B accepts it and returns 200. That response closes the connection. From that instant, A and B share no channel. The work B does next happens on B’s side of a boundary A can no longer see across. A asking “did B finish?” after the request closed is like asking “is the runner still running?” after they have left the stadium — there is no wire left to carry the answer.
You could patch this demo (have B wait for the work, or call A back when it is done). But every patch just moves the boundary. Make B call A back, and now B can’t see whether A finished relaying that result up to Root. Add a C downstream of B and the frontier — the set of still-active hops in the cascade — runs off the edge of everyone’s vision again. The structural fact survives every patch.
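The first patch mentioned above, having B wait for the work before it answers, can be sketched without a web server. This is a minimal asyncio-only illustration, not the demo's code: `handle_fire_and_forget`, `handle_awaited`, and the `dropped` list are hypothetical names, and the 10 ms sleep stands in for the long task.

```python
import asyncio

dropped = []  # stands in for B's local log; nothing upstream ever reads it


async def do_the_real_work():
    await asyncio.sleep(0.01)  # pretend this is the long task
    raise RuntimeError("B's worker died here")


async def handle_fire_and_forget(background):
    """Original shape: answer 'accepted' before the work has run."""
    async def runner():
        try:
            await do_the_real_work()
        except RuntimeError as exc:
            dropped.append(str(exc))  # the failure stays on B's side

    task = asyncio.create_task(runner())
    background.add(task)  # keep a strong reference to the task
    task.add_done_callback(background.discard)
    return {"status": "accepted"}


async def handle_awaited():
    """Patched shape: the reply is sent only after the work has settled."""
    try:
        await do_the_real_work()
    except RuntimeError as exc:
        return {"status": "failed", "error": str(exc)}
    return {"status": "done"}


async def main():
    background = set()
    first = await handle_fire_and_forget(background)
    second = await handle_awaited()
    await asyncio.gather(*background)  # let the background task finish before exiting
    return first, second, list(dropped)


first, second, dropped_log = asyncio.run(main())
print(first)
print(second)
print(dropped_log)
```

The awaited handler tells its caller the truth about B's own work, but the boundary only moves: if that work delegated to a C that answers "accepted", the awaited reply would again be a local fact about one edge.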
That distinction is the spine of the whole problem:
A local fact is something one node can see for itself. A global fact is something true of the whole cascade. No node in a delegation chain can directly observe a global fact — it can only see its own edges.
Look at the right-hand column: every row that actually answers “are we done?” reads nobody.
| Fact | Kind | Who can see it directly |
|---|---|---|
| “I sent my request” | Local | The sender |
| “I got a 200 back” | Local | The caller on that one hop |
| “My immediate child accepted” | Local | The parent of that hop |
| “B’s background work succeeded” | Global | Nobody — it happens after the hop closed |
| “The entire A→B→C cascade has settled” | Global | Nobody — no participant sees all the edges |
Every row a node can see is local. The information each agent holds is true and insufficient, all at once.
What each node can — and can’t — see
The reason no clever logging fixes this is that the knowledge is partitioned by design. Each participant holds one slice of the truth and none holds the union:
| Node | Knows for certain | Is blind to |
|---|---|---|
| Root | “A returned done.” | Whether B (or anyone below) actually did the work. |
| A | “B returned accepted.” | Whether B’s accepted work ran, dropped, or died. |
| B | “I accepted the work; it later crashed.” | That Root already declared SUCCESS and stopped listening. |
Stack those rows and the gap is visible: there is no column for “the whole system,” because there is no node standing where the whole system is visible.
This is the part that separates a mesh from a single box. On one machine, in one process, you can at least imagine one watcher seeing everything — the same memory, one event loop you could instrument. (That single-box version has its own quiet failure mode, covered in “Green is not evidence” — but it lives inside one process where, in principle, one observer could exist.) The moment you cross a network boundary, that imagined watcher is gone for good. A and B are separate processes. The instant A’s request to B closes, the only wire between them is severed. There is no global observer, and there cannot be one. This isn’t a missing feature someone forgot to build; no single participant sits where it could see the whole picture. That is what “distributed” means.
“Just wait a few seconds” is not an answer
Faced with this, the reflex is to throw a timeout at it: give the cascade ten seconds, then call it done. It feels prudent. It is the wrong shape of answer, not just a weak one.
A timeout asks the clock. The question you actually have is about the mesh: has every hop, including ones you can’t see, settled? Those are different questions, and the clock’s answer is never the mesh’s answer:
| Your 10-second timeout fires; you assume done | What is really happening | Verdict it gives you |
|---|---|---|
| Cascade settled in 1 second | The work finished long ago | Right answer, wasted 9 seconds |
| A hop is still working at second 11 | Live work, declared dead | Wrong — green over running work |
| A hop died silently at second 3 | The work is already a corpse | Wrong — SUCCESS over a corpse |
| Cascade settled in exactly 10 seconds | — | Right by luck |
A timeout is right only by coincidence. It can be too slow and too fast in the same system, because the work it is guessing about has no fixed duration. Tune it short and it fails healthy chains; tune it long and it passes dead ones. There is no right number, because a timeout is not a slightly-weak version of the right check — it is a different category. It samples one node’s clock; the property you want is about all the edges being empty at once. You cannot measure a global property with a local stopwatch.
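The table above can be replayed in a few lines. This is a scaled-down sketch under stated assumptions: `hop` and `timeout_gate` are hypothetical names, a 50 ms sleep stands in for the ten-second timeout, and the sketch peeks at the task afterwards only to show what the gate never looked at.

```python
import asyncio


async def hop(duration, fails=False):
    """A downstream hop with an unknown duration and an unknown fate."""
    await asyncio.sleep(duration)
    if fails:
        raise RuntimeError("hop died")
    return "settled"


async def timeout_gate(duration, fails, timeout):
    """Return (what the timeout concluded, what was actually true)."""
    task = asyncio.create_task(hop(duration, fails))
    await asyncio.sleep(timeout)  # ask the clock, never the hop
    verdict = "done"              # the gate says the same thing every time

    # Everything below is the sketch inspecting reality; the gate itself never does.
    if not task.done():
        actual = "still running"
        task.cancel()
        await asyncio.gather(task, return_exceptions=True)  # tidy up the sketch
    elif task.exception() is not None:
        actual = "died"
    else:
        actual = "settled"
    return verdict, actual


async def main():
    timeout = 0.05  # stands in for the 10-second wait, scaled down
    cases = [
        (0.005, False),  # settles long before the timeout fires
        (0.2, False),    # still working when the timeout fires
        (0.015, True),   # dies silently before the timeout fires
    ]
    return [await timeout_gate(d, f, timeout) for d, f in cases]


results = asyncio.run(main())
for verdict, actual in results:
    print(f"timeout gate: {verdict!r} | reality: {actual!r}")
```

The verdict column never changes because the gate's only input is elapsed time. Any difference between the three cases lives in state the gate does not read.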
This problem has a name (and it’s older than you think)
This converts a vague unease into a known, heavy-artillery problem with a literature behind it — the kind of thing worth carrying into your next architecture review, or a board meeting.
What you are looking at is not a quirk of FastAPI or of agents. Proving that a computation spread across many independent participants has globally stopped — every node idle, and no work still in flight between them — is a named, studied, genuinely hard problem in distributed systems, and computer scientists have wrestled with it since the early 1980s:
Distributed termination detection — proving that a spread-out computation has globally stopped: every participant is idle, and no work is still traveling between them.
It is hard for exactly the reason the demo shows. Any single node can be idle right now while a message carrying more work is traveling toward it. To declare the whole system done, you need to know that all nodes are idle and nothing is in transit — a global property no single participant can see alone. (Our two-hop toy shows the idle-misread half; the “in transit” half is what makes the general A→B→C problem, with a real C still being fed, even harder.) There are real algorithms for it (Dijkstra–Scholten, the Dijkstra–Feijen–van Gasteren token ring, and others), entire textbook chapters, and proven lower bounds on how much coordination it costs. It is hard by nature, which is precisely why it tends to get skipped.
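To make the idea less abstract, here is a simplified single-process simulation in the spirit of Dijkstra–Scholten deficit counting. It is a sketch under assumptions, not a network implementation: the `Node` class, the `wire` queue, the acyclic topology, and the rule that a crashed node simply swallows messages are all illustrative choices. The key difference from the demo is what an acknowledgement means: a node acknowledges its parent only when it is idle and every message it sent has itself been acknowledged.

```python
from collections import deque


class Node:
    def __init__(self, children):
        self.children = children  # who this node delegates to
        self.parent = None        # whoever first engaged this node
        self.deficit = 0          # work messages sent but not yet acknowledged
        self.engaged = False


def root_detects_termination(topology, root, crashed=frozenset()):
    """Simulate one cascade; return True only if the root can prove it settled."""
    nodes = {name: Node(children) for name, children in topology.items()}
    wire = deque()  # messages in transit: (kind, sender, receiver)
    terminated = False

    def send_work(sender, receiver):
        nodes[sender].deficit += 1
        wire.append(("work", sender, receiver))

    def release_if_settled(name):
        nonlocal terminated
        node = nodes[name]
        if node.engaged and node.deficit == 0:
            node.engaged = False
            if node.parent is None:
                terminated = True  # only the root has no parent
            else:
                wire.append(("ack", name, node.parent))
                node.parent = None

    nodes[root].engaged = True
    for child in nodes[root].children:
        send_work(root, child)
    release_if_settled(root)

    while wire:
        kind, sender, receiver = wire.popleft()
        if receiver in crashed:
            continue  # the message is swallowed; no ack ever comes back
        node = nodes[receiver]
        if kind == "ack":
            node.deficit -= 1
        elif node.engaged:
            wire.append(("ack", receiver, sender))  # already engaged: ack at once
        else:
            node.engaged, node.parent = True, sender
        if kind == "work":
            for child in node.children:
                send_work(receiver, child)  # doing the work means delegating onward
        release_if_settled(receiver)
    return terminated


chain = {"root": ["A"], "A": ["B"], "B": ["C"], "C": []}
print(root_detects_termination(chain, "root"))
print(root_detects_termination(chain, "root", crashed={"B"}))
```

With a healthy chain the acknowledgements collapse back to the root and it can declare termination. With B crashed, A's deficit never returns to zero, so the root never receives the acknowledgement it would need and does not report success. The simulation loop itself acts as a global observer, which a real mesh does not have; the point is only what the root may conclude from the acknowledgements it receives.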
For the reader who wants one glanceable glossary to walk away with:
| Plain phrase | The CS name | Why it’s hard |
|---|---|---|
| “Is the whole swarm done?” | Distributed termination detection | No node can see the global state |
| “I waited 10s, assume done” | Timeout-based liveness | Guesses with a clock; never observes the mesh |
| “B said accepted” | A local liveness fact | True, but nowhere near sufficient |
Stated plainly: the agent ecosystem is currently blind to it. The frameworks now wiring agents into delegating chains — one model calling another, fanning work across services — are rebuilding multi-node distributed systems without the vocabulary the multi-node world spent forty years building. They ship the topology and skip the termination detection. The result is exactly our demo at scale: swarms of A→B→C cascades that print SUCCESS and exit 0 while the frontier is quietly non-empty.
I think quiescence — knowing a mesh has actually gone quiet — deserves to be a first-class thing you can ask for, not a wall-clock guess. And the bar for a real answer is exact: a sound quiescence signal must fire only when every node is idle and no work is in transit between them — it has to observe the whole active frontier of the mesh, not just the edges any one node can see. A closed stream is not that. A clock is not that. Whether you reach it with the classic algorithms (Dijkstra–Scholten, token rings) or something new, that is the invariant a real answer has to satisfy. We haven’t closed the gap yet, but the shape of the missing primitive is strictly defined.
The proof that this is real, not theory
This is not an extrapolation from a toy. This exact blind spot was sitting in Google’s reference A2A SDK — a teardown bug I found and fixed, now open as a PR upstream. And it is not one vendor’s slip: the A2A protocol itself punts on the hard part, and the official conformance test kit fakes the answer. Three independent admissions, from three layers of the stack, that the ecosystem has solved “the stream closed” and has not solved “the mesh is done.”
That is the subject of “Task was destroyed but it is pending”: the same blindness, this time with primary sources you can click. The fix I submitted there is real, tested, and local: it makes one process’s teardown deterministic. It does not hand the ecosystem the missing drain primitive. The in-process bug and the cross-network gap in this demo are the same shape at two zoom levels, but solving the first does not solve the second. The problem is systemic and profound; the fix I submitted is local and precise.
The portable rule
| What read green | What was actually true | The rule it became |
|---|---|---|
| Root printed SUCCESS; exit 0 | B dropped the work it accepted; no one upstream can ever learn it | A reply means “my neighbor received it,” never “the work finished” |
| “The request returned 200” | A local fact about one edge | Never let a local fact stand in for the global one it can’t speak to |
| “We waited 10 seconds, nothing screamed” | The clock advanced; the mesh’s actual state was never observed | Gate on an observed settlement signal, never on a wall-clock timeout |
| Each node reported truthfully | True local facts composed into a false global one | No participant in a mesh can see the whole frontier — so don’t ask one to |
If a planted, work-dropping node can survive your cascade and your system stays green, your “done” is decoration. Stop asking one node whether the swarm finished. No node can answer. That is the portable rule — and the reason “are we done yet?” is a harder question than the entire agent industry is currently treating it as.
The in-process priors are next door. “Your AI coding agents can’t hear each other” is the same blindness on a single box — peers that cannot tell when a neighbor finishes or fails. “Green is not evidence” is the false-green lens this whole cascade wears. This piece is the cross-process, cross-network sequel: the single box can be given a nervous system, but the mesh does not yet have a way to know it has gone quiet.
Frequently asked questions
- What is the demo actually showing?
- Three programs across a network — Root calls A, A calls B. B answers HTTP 200 'accepted' and then drops the work it promised, which dies a moment later with nobody listening. A only ever saw the 200, so it reports 'done', and Root prints SUCCESS and exits 0. The work never happened, and every participant told the truth about what it could see.
- Why can't A just check whether B finished?
- Because once A's HTTP request to B returns, the connection closes and A and B share no channel. B's work happens on the far side of a boundary A can no longer see across. A can't learn B's downstream fate even in principle — the information channel is gone.
- What is distributed termination detection?
- It is the classic distributed-systems problem of proving that a computation spread across many independent participants has globally stopped: every node idle AND no work still in transit between them. It has been studied since the early 1980s, with named algorithms (Dijkstra–Scholten and others) and proven lower bounds on coordination cost. It is hard precisely because no single participant can see the global state.
- How is this different from 'green is not evidence'?
- That piece is a single-process failure — one program exits 0 over work it never observed, on one machine, where in principle one observer could exist. This piece is a topology failure across a network boundary: three or more separate processes where no participant can see the whole frontier, so no single observer can exist at all. It is the cross-network cousin, not a rerun.
- Does the author claim to have solved this?
- No. This piece names the problem; it does not claim to fix the mesh. The only concrete fix — covered in the next piece — is local: it makes one process's teardown deterministic. The cross-network gap stays open. The stance across both pieces is deliberate: name a systemic failure honestly, while standing behind only the localized fix you can actually defend.
The proof that this is real, not theory — the bug in Google’s A2A SDK, the protocol punting on it, the official test kit faking it — is the next piece.
Part 02 · June 24, 2026
Task was destroyed but it is pending
TL;DR — A server running Google’s A2A SDK shuts down mid-cascade and prints Task was destroyed but it is pending!. Most teams scroll past it. In a2aproject/a2a-python it was a real teardown deadlock: a background producer coroutine (a function that can pause and resume) left running while the EventQueue subscriber it fed was torn down underneath it. The fix — Issue #1101, PR #1105, 313 lines with tests, CI green, open for review — is a public async aclose() with four load-bearing moves, one per section below: it closes the queues immediate=True (drop pending events instead of waiting to flush them), cancels-and-gathers the background tasks, refuses to resurrect a finished task, and releases an internal lock before it waits, because the very tasks it is shutting down need that same lock to finish. Hold the lock and everything deadlocks. The fix closes one race. The gap it exposes stays open: the agent ecosystem has no sound way to know when an async mesh is actually done.
An A2A server is shutting down. A push-delegation cascade was in flight a moment ago: agent A handed work to B, who handed it to C. Somewhere in that chain, a producer coroutine is still mid-await (paused, waiting on something that will now never arrive), feeding an EventQueue whose subscriber is already being torn down. A producer coroutine fills a queue; a subscriber drains it; the queue is the buffer between them. The event loop, the single scheduler that runs every coroutine in an async Python program, closes. Python prints its one line and moves on:
```text
Task was destroyed but it is pending!
```

Whatever that producer was about to deliver is gone, swallowed with the loop. No exception that reaches your code. No failed status anyone reads. Just an in-flight delegation that quietly never happened, and a warning that fired at interpreter exit, disconnected by then from the request that spawned it.
That line is the cheapest thing in async Python to ignore. It arrives at shutdown, after the interesting logs, looking like janitorial noise. In a2aproject/a2a-python it was not noise. It was a deadlock that happened to lose the race to print before the interpreter exited — the producer’s cleanup was hung, the loop tore down around it, and the “pending” warning was the corpse, not the crime. I went and read it.
A warning is a claim about plumbing
Treat Task was destroyed but it is pending! the way you should treat a green check: as a claim about plumbing, not about the world. CPython (the standard Python interpreter) logs it when a Task, an asyncio object wrapping a running coroutine, is garbage-collected while still pending. That typically happens at shutdown, once the event loop has closed and the task is still parked on an await it will never resume. The message is the interpreter telling you the truth it noticed on its way out the door: I was about to run more of your code, and then I wasn’t.
Almost always it is benign. A fire-and-forget coroutine nobody was waiting on, cancelled cleanly at shutdown. So the instinct to scroll past it is, statistically, correct. Which is exactly why it is dangerous when it is not noise. In #1101 the producer was the opposite of idle. It was blocked, on purpose, by the SDK’s own teardown, holding work the subscriber would never read. The only reason you saw a warning instead of a hang is that the interpreter tore the loop down before the deadlock could announce itself.
Where the producer goes to die
To see why the producer was blocked and not idle, walk the lifecycle the PR exposes. A DefaultRequestHandlerV2 owns an ActiveTaskRegistry; the registry owns one ActiveTask per in-flight request. Each ActiveTask runs a producer — the coroutine generating events for that request — that writes into an EventQueue, and a subscriber that drains the queue and forwards events out, to the caller or, in a delegating mesh, to the next agent in the cascade. Producer fills, subscriber empties; the queue is the seam between them.
The reproduction is exact, and it is in the public issue. When a request reaches a terminal state at the application layer — finished, failed, or cancelled, no more events coming — the owning ActiveTask begins cleaning up its background producer, and that cleanup awaits:
```python
# the shape of the trap on the public PR surface, not project source
await self._event_queue_subscribers.close(immediate=False)
```

close(immediate=False) means drain gracefully: wait for every queued event and every child sink to flush. But the subscriber on the other end of that queue is the thing being torn down. Nobody is reading. The drain waits for a consumer that is never coming back, the producer’s cleanup never returns, and the task stays pending until the loop is yanked out from under it at shutdown. The graceful path is the hang.
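The hang described above can be reproduced in miniature with the standard library. This sketch uses `asyncio.Queue.join()` as a stand-in for a graceful drain; that mapping is an assumption made for illustration, not the SDK's implementation, and `graceful_drain` is a hypothetical name.

```python
import asyncio


async def graceful_drain(queue):
    """Wait until every queued event has been consumed and marked done."""
    await queue.join()


async def main():
    queue = asyncio.Queue()
    queue.put_nowait("event-1")  # the producer queued work...
    # ...but the subscriber has been torn down: nobody calls get() or task_done().

    try:
        # The timeout exists only so the sketch terminates and can report the hang.
        await asyncio.wait_for(graceful_drain(queue), timeout=0.05)
        return "drained"
    except asyncio.TimeoutError:
        return "hung: the drain waited for a consumer that never came"


outcome = asyncio.run(main())
print(outcome)
```

Without the `wait_for` guard, `graceful_drain` would stay pending until the loop itself shut down.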
A correct teardown has to reach both ends of the seam — close the reader and reckon with the still-live writer — without leaving the loop holding a coroutine blocked on the very queue you just closed. The pair was started together and was never stopped together. That asymmetry, a spawn with no symmetric awaited teardown, is the bug in one sentence. The fix is four moves; here they are in the order the teardown actually executes them.
immediate=True, or you drain forever
The first move runs against instinct. The reflex at shutdown is to drain gracefully: let the queue empty, let the producer finish its last item, then close. You cannot do that here. You cannot gracefully drain a queue whose producer you are simultaneously cancelling. “Drain” means “let the producer finish pushing and the consumer finish reading,” and you are in the middle of cancelling that producer. A graceful drain would wait for events the cancelled producer will never push, which is the original hang wearing a more reassuring name.
So the queues close immediate=True: stop accepting new events, and force a producer blocked on the queue to raise out of its await instead of parking there forever. Close immediately first. Only then cancel-and-gather the background tasks — that is the next move, and it is where the real bug lives.
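What "force a producer blocked on the queue to raise out of its await" might look like can be sketched with a small hand-rolled queue. `TinyQueue`, `QueueClosed`, and `close_immediate` are hypothetical names built on `asyncio.Condition`; the SDK's EventQueue is not shown here and may work differently.

```python
import asyncio
from collections import deque


class QueueClosed(Exception):
    """Raised to a producer whose queue was closed underneath it."""


class TinyQueue:
    """Hypothetical bounded queue; not the SDK's EventQueue."""

    def __init__(self, maxsize):
        self._items = deque()
        self._maxsize = maxsize
        self._closed = False
        self._cond = asyncio.Condition()

    async def put(self, item):
        async with self._cond:
            # Park until there is room, or until the queue is closed.
            await self._cond.wait_for(
                lambda: self._closed or len(self._items) < self._maxsize
            )
            if self._closed:
                raise QueueClosed
            self._items.append(item)

    async def close_immediate(self):
        async with self._cond:
            self._closed = True
            self._items.clear()      # drop pending events instead of flushing them
            self._cond.notify_all()  # wake every producer parked in put()


async def producer(queue):
    sent = 0
    try:
        while True:
            await queue.put(f"event-{sent}")
            sent += 1
    except QueueClosed:
        return sent  # raised out of the await instead of parking forever


async def main():
    queue = TinyQueue(maxsize=1)  # no subscriber will ever read from it
    task = asyncio.create_task(producer(queue))
    await asyncio.sleep(0.01)     # the producer fills the queue, then blocks in put()
    blocked_before_close = not task.done()
    await queue.close_immediate()
    sent = await asyncio.wait_for(task, timeout=1)
    return blocked_before_close, sent


blocked_before_close, sent = asyncio.run(main())
print(blocked_before_close, sent)
```

The producer was parked on its second `put()`; the immediate close wakes it with an exception, so the coroutine finishes on its own terms instead of being left pending.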
The obvious fix re-deadlocks
The obvious next step is to stop waiting on a dead consumer: cancel the background tasks, gather them (wait for all of them together) so their exceptions surface, and be done. And because the registry is shared mutable state being changed during teardown, you hold the registry lock while you do it. That reads as careful.
It looks correct. It hangs forever.
```python
# the trap: cancel-and-gather WHILE holding the registry lock
async def aclose(self) -> None:
    async with self._lock:  # take the lock
        for task in self._active.values():
            task.cancel()
        await asyncio.gather(*self._active.values(),
                             return_exceptions=True)  # <-- never returns: deadlock
```

The reason is the re-entrant detail that separates a real fix from a plausible one. When you cancel an ActiveTask, its own teardown path runs, and that teardown reaches back into the registry to deregister itself from the map — which needs the same registry lock you are currently holding. So gather waits for the tasks to finish, the tasks wait for the lock to deregister, and the lock waits for gather to return it. Three parties, one cycle, no progress. The cancel you issued to break the deadlock created a tighter one. (Re-entrant here just means the cancelled task reaches back for a lock the canceller already holds.)
gather holds the lock and waits for the tasks; the tasks need that same lock to deregister. Hold the lock across the await and the loop is sealed.

Holding the lock is not owning the teardown. The lock protects the membership set — the act of mutating the map. It was never a license to block inside it while the things you are tearing down need it to finish. The fix shrinks the lock to exactly what it protects and moves the await outside:
```python
# the fix: snapshot under the lock, await OUTSIDE it
async def aclose(self) -> None:
    async with self._lock:  # lock protects the snapshot only
        self._is_finished = True
        tasks = list(self._active.values())
    # lock released here — the tasks' teardown can now re-acquire it
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
```

Take the lock only long enough to read a consistent snapshot and mark the registry closed. Release it. Then cancel and gather, so each cancelled task can re-enter, deregister, and complete, and the gather returns. That single move — snapshot inside, await outside — is the difference between a fix that works and one that re-deadlocks while passing review. You only find it by reading what the cancelled tasks do on their way out, not by reasoning about the canceller alone.
The gather with return_exceptions=True is doing more than waiting. It means the CancelledError each cancelled coroutine raises, plus any real exception, surfaces into the gathered result instead of evaporating as an exit-time “pending” warning. A bare cancel() without the gather drops the tasks’ exceptions on the floor and reintroduces the exact ghost. The gather is how you convert a silent loop-finalization into a value you can inspect — the difference between a teardown and a del.
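Both shapes can be run side by side in a self-contained miniature. This is a sketch under assumptions, not the SDK's code: `Registry`, `_producer`, and `aclose_holding_lock` are hypothetical names, the producer is an idle coroutine whose only job is to deregister itself on the way out, and the `wait_for` timeout exists only so the deadlock can be reported instead of hanging the script.

```python
import asyncio


class Registry:
    """Hypothetical miniature; not the SDK's ActiveTaskRegistry."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self._active = {}

    async def _producer(self, key):
        try:
            await asyncio.Event().wait()     # parked until cancelled
        finally:
            async with self._lock:           # teardown re-enters the registry
                self._active.pop(key, None)  # deregister from the map

    async def start(self, key):
        async with self._lock:
            self._active[key] = asyncio.create_task(self._producer(key))

    async def aclose_holding_lock(self):
        async with self._lock:               # the trap: await inside the lock
            tasks = list(self._active.values())
            for task in tasks:
                task.cancel()
            return await asyncio.gather(*tasks, return_exceptions=True)

    async def aclose(self):
        async with self._lock:               # the lock covers the snapshot only
            tasks = list(self._active.values())
        for task in tasks:
            task.cancel()
        return await asyncio.gather(*tasks, return_exceptions=True)


async def populated_registry():
    registry = Registry()
    await registry.start("req-1")
    await registry.start("req-2")
    await asyncio.sleep(0)                   # let both producers reach their await
    return registry


async def main():
    trap = await populated_registry()
    try:
        await asyncio.wait_for(trap.aclose_holding_lock(), timeout=0.05)
        trap_outcome = "returned"
    except asyncio.TimeoutError:
        trap_outcome = "deadlocked"

    fixed = await populated_registry()
    results = await fixed.aclose()
    return trap_outcome, results, len(fixed._active)


trap_outcome, results, remaining = asyncio.run(main())
print(trap_outcome, [type(r).__name__ for r in results], remaining)
```

In the fixed path the gathered result holds one `CancelledError` per producer, which is the inspectable value the paragraph above describes, and the registry map ends up empty because each producer was able to take the lock and deregister.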
No resurrection
Releasing the lock before awaiting is correct, but it forces a second invariant to carry weight, and aclose() is incomplete without it. While aclose() is mid-gather with the lock released, a late start() can arrive — a request that was already in flight when shutdown began — and try to spawn a fresh producer into the registry aclose() thinks it just emptied. Leave that path open and you finish shutting down with a new background coroutine running past the barrier you closed, feeding a queue nobody will ever drain.
The flag and the lock-discipline are one invariant, not two fixes. aclose() sets _is_finished in the same critical section as the snapshot — atomic close-marking — and start() checks that flag under the same lock before it spawns:
```python
# finished is terminal: refuse to start
async def start(self) -> None:
    async with self._lock:
        if self._is_finished:
            raise RuntimeError("task is finished; refusing to start")
        self._spawn_producer()
```

Because _is_finished is set inside the lock and read inside the lock, a start() that races the gather sees the closed flag and refuses, every time. There is no window. Name it as a rule the state machine enforces: a finished state that a late start() can quietly undo is just a suggestion. The lifecycle gets a head, a tail, and a one-way door between them.
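The race is easy to stage in miniature. In this sketch `Lifecycle` and `race` are hypothetical names, the producer is a long sleep, and the `enforce_finished` toggle exists only to contrast the behaviour with and without the flag check; none of it is the SDK's code.

```python
import asyncio


class Lifecycle:
    """Hypothetical miniature of a task lifecycle; not the SDK's ActiveTask."""

    def __init__(self, enforce_finished):
        self._lock = asyncio.Lock()
        self._is_finished = False
        self._enforce_finished = enforce_finished  # toggle exists only for contrast
        self.producers = []

    async def start(self):
        async with self._lock:
            if self._enforce_finished and self._is_finished:
                raise RuntimeError("task is finished; refusing to start")
            self.producers.append(asyncio.create_task(asyncio.sleep(3600)))

    async def aclose(self):
        async with self._lock:
            self._is_finished = True   # mark closed and snapshot in one section
            tasks = list(self.producers)
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)  # lock is free here


async def race(enforce_finished):
    lifecycle = Lifecycle(enforce_finished)
    await lifecycle.start()
    closer = asyncio.create_task(lifecycle.aclose())
    await asyncio.sleep(0)             # aclose() is now waiting in its gather
    try:
        await lifecycle.start()        # the late start()
        refused = False
    except RuntimeError:
        refused = True
    await closer
    leaked = [t for t in lifecycle.producers if not t.done()]
    for task in leaked:                # tidy up so the sketch exits cleanly
        task.cancel()
    await asyncio.gather(*leaked, return_exceptions=True)
    return refused, len(leaked)


without_flag = asyncio.run(race(enforce_finished=False))
with_flag = asyncio.run(race(enforce_finished=True))
print(without_flag, with_flag)
```

Without the check, the late `start()` succeeds and one producer outlives `aclose()`. With it, the late `start()` is refused and nothing is left running past the barrier.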
That is the whole fix: a public async aclose() on DefaultRequestHandlerV2, ActiveTaskRegistry, and ActiveTask that closes queues immediate=True, releases the lock before awaiting, cancels-then-gathers, and refuses resurrection. PR #1105 is 313 lines, with tests that reproduce the deadlock deterministically and CI green; it is open for review against the reference SDK. The load-bearing hunk is the lock-scope change in aclose() — open the PR and the snapshot-under-lock, gather-outside-lock shape is right there.
Zoom out: this was a symptom
Now the part that outlasts the bug.
This race isn’t a one-off process quirk; it is a structural class of failure. Step back from the single Python loop and look at the cascade: there is no local way to know when an A→B→C push-delegating chain has actually settled. A closed its stream, but a closed stream on A says nothing about whether B’s push to C is still in flight.
The teardown race is precisely what the gap between a local fact and a global one looks like inside a single process.
The handler believed the request was done — terminal status reached — while a background producer was still mid-await. “This stream closed” is a local fact. “The cascade has settled” is a global one, and nothing in the loop was positioned to tell the difference. Multiply that across processes and a network and you get the same disagreement at the scale of a mesh: A reports done while B is still pushing to C, and no participant is positioned to notice.
The protocol made this harder on purpose. The A2A v1.0 spec removed the per-task final flag. Earlier drafts let a task-status event carry an explicit final: true — a producer-asserted “this stream is done” marker. v1.0 dropped it; the changelog says to “leverage protocol binding specific stream closure mechanism instead”. Completion is now inferred from stream-closure or terminal status, never asserted. That is a defensible protocol simplification, and it quietly relocated a hard problem from the wire onto every implementer’s teardown path.
Go looking for the primitive that would close the gap, and it is not there. No A2A SDK — Python, JS, or Go — exposes a wait_for_idle, a drain, or any “all background work is done” signal. The thing every distributed-systems textbook calls termination detection, proving a distributed computation has globally stopped, has no entry in the agent-protocol vocabulary.
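For a sense of what such a signal could look like inside a single process, here is a minimal sketch. `InFlightTracker`, `track`, and `wait_for_idle` are hypothetical names and not part of any A2A SDK. It only observes work that was started through `track()` in the same event loop, so it says nothing about hops on the far side of a network boundary and is not the missing mesh-wide primitive.

```python
import asyncio


class InFlightTracker:
    """Hypothetical single-process drain signal; not an A2A SDK API."""

    def __init__(self):
        self._pending = 0
        self._idle = asyncio.Event()
        self._idle.set()  # nothing tracked yet, so idle
        self.failures = []

    def track(self, coro):
        self._pending += 1
        self._idle.clear()
        task = asyncio.create_task(coro)
        task.add_done_callback(self._settled)
        return task

    def _settled(self, task):
        if not task.cancelled() and task.exception() is not None:
            self.failures.append(task.exception())  # observed, not dropped
        self._pending -= 1
        if self._pending == 0:
            self._idle.set()

    async def wait_for_idle(self):
        await self._idle.wait()  # returns on an observed event, not on a clock
        return list(self.failures)


async def background(delay, fails=False):
    await asyncio.sleep(delay)
    if fails:
        raise RuntimeError("background work failed")


async def main():
    tracker = InFlightTracker()
    tracker.track(background(0.01))
    tracker.track(background(0.03, fails=True))
    failures = await tracker.wait_for_idle()
    return [str(exc) for exc in failures]


failures = asyncio.run(main())
print(failures)
```

The caller learns two things a timeout cannot tell it: that every tracked task has settled, and which of them failed.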
Fifteen seconds is not a proof
The conformance kit gives the gap away. The canonical A2A conformance kit, a2a-tck, tests push-notification delivery — and to decide whether a webhook arrived, it falls back to a hardcoded fifteen-second wait. Not a drain signal. Not a quiescence check. A wall-clock guess, because there is no quiescent signal to wait on.
When the reference conformance suite — the thing whose whole job is to define “correct” — resorts to “wait fifteen seconds and assume,” that is not a tooling shortcut. It is the ecosystem admitting it has no sound primitive to test against. This is the green-is-not-evidence failure class natively ported to agents: a shutdown that exits 0 over a leaked, undrained producer reads identically to a clean one. Exit-0 guarantees the loop returned; it proves nothing about delegations draining. A fifteen-second timeout just means the clock advanced, not that the mesh settled.
And a timeout is not a slightly-weak quiescence check — it is the wrong shape of check. A push-delegating A→B→C cascade has no single owner who can observe that the whole frontier — every still-active hop in the cascade — is empty. A holds an edge to B; B holds an edge to C; no participant sees the others’ in-flight work. Quiescence is a global property of the mesh; a per-edge timeout is a local guess. Term by term: stream-closure on one edge is necessary for “the cascade settled,” and nowhere near sufficient, because the edge that closed tells you nothing about the edge three hops downstream that is busiest right now. This is genuinely distributed termination detection, wearing an agent-protocol hat — a problem with decades of theory and no agent-protocol implementation. It’s hard, and the field skipped it for being hard. That is not a reason to keep skipping it.
What this does not claim
Keep the disclosed loss in view. The #1101 fix is real, tested, and closes one race in one process. It does not hand the ecosystem the missing drain primitive. A correct aclose() makes a single handler’s teardown deterministic; it does nothing to tell agent A that agent B’s push to C has landed. The in-process bug and the cross-network gap are the same shape at two zoom levels, but solving the first does not solve the second — and a piece that pretended otherwise would be the exact false-green it is warning about.
That distinction is also why quiescence belongs as a first-class primitive, not a fifteen-second hope. Naming the gap is not solving it; the fix upstream and the primitive are two different commitments, and only one of them is a concrete, tested patch sitting in a vendor SDK’s PR queue today.
The rule it became
| what read green | what was actually true | the rule it became |
|---|---|---|
| Shutdown exited 0, no traceback | A producer was left running; an in-flight delegation evaporated with the loop | Assert on drain completed, never on an exit code returned |
| cancel() + gather() under the lock looked careful | The tasks’ own teardown re-acquired that lock: instant deadlock | Snapshot under the lock; release it; cancel-and-gather outside it |
| start() succeeded during teardown | A late spawn leaked a fresh producer past the barrier aclose() closed | Set the finished flag and check it under the same lock; refuse, loudly |
| a2a-tck passed the push test after 15s | The clock advanced; the mesh’s actual settlement was never observed | Gate on an observed drain signal, never on a wall-clock timeout |
If a planted in-flight task can survive your shutdown and your gate stays green, your teardown is decoration. That is the portable rule.
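The first three rows can be sketched together in a few lines of asyncio. This is an illustration of the pattern only, assuming a hypothetical `Registry` class; it is not the code in a2a-python or in PR #1105. Tasks deregister themselves under the lock during teardown, which is exactly why `aclose()` has to release the lock before it gathers.

```python
import asyncio


class Registry:
    """Hypothetical stand-in for a task registry, for illustration only."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self._tasks = set()
        self._finished = False
        self.drained = False

    async def start(self, coro_fn):
        async with self._lock:
            # Flag is checked under the same lock aclose() sets it under.
            if self._finished:
                raise RuntimeError("registry closed; refusing late start")
            task = asyncio.create_task(self._run(coro_fn))
            self._tasks.add(task)
            return task

    async def _run(self, coro_fn):
        try:
            await coro_fn()
        finally:
            # Teardown re-acquires the lock to deregister. If aclose() still
            # held the lock while gathering, this line would deadlock.
            async with self._lock:
                self._tasks.discard(asyncio.current_task())

    async def aclose(self):
        async with self._lock:
            self._finished = True          # one-way: no resurrection
            snapshot = list(self._tasks)   # snapshot under the lock
        # Lock is released here; cancel and gather outside it.
        for task in snapshot:
            task.cancel()
        await asyncio.gather(*snapshot, return_exceptions=True)
        self._tasks.difference_update(snapshot)
        self.drained = all(task.done() for task in snapshot)  # observed drain


async def planted():
    """A planted in-flight task: runs forever unless cancelled."""
    await asyncio.Event().wait()


async def demo():
    registry = Registry()
    task = await registry.start(planted)
    await asyncio.sleep(0)  # let the planted task begin running
    # The timeout is only a guard so a deadlock fails the check loudly.
    await asyncio.wait_for(registry.aclose(), timeout=1.0)
    try:
        await registry.start(planted)
        late_start_refused = False
    except RuntimeError:
        late_start_refused = True
    return task.cancelled(), registry.drained, late_start_refused


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The check asserts on three observed facts: the planted task was cancelled, the drain completed, and a late start was refused. None of them is an exit code.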
Read it yourself
The absence is checkable without taking my word for it. Three primary sources, one find-target each:
- Read PR #1105 and find the lock release before the gather in aclose().
- Read the v1 changelog and find the line that removes the per-task final flag.
- Read a2a-tck and find the hardcoded fifteen-second fallback in the push-notification test.
Three clicks, one conclusion: the agent ecosystem has named “the stream closed” and has not named “the mesh is done.”
The in-process priors are next door. “Coding agents can’t hear each other” is the same blindness on one box, where peers cannot tell when a neighbor finishes or fails; this piece is the cross-process, cross-network sequel. “Green is not evidence” is the false-green lens this whole teardown wears. And the drain-before-return discipline the wire actually needs runs through source to subscriber in milliseconds. The single box has a nervous system. The mesh does not yet have a way to know it has gone quiet, and the producer that died at shutdown is what that looks like in a stack trace.
Frequently asked questions
- What is the 'Task was destroyed but it is pending!' warning?
- It is a message asyncio logs when a Task object is destroyed (garbage-collected) while still pending, that is, still parked on an await it will never resume. It typically surfaces at event-loop or interpreter shutdown. It is usually benign exit-time noise, which is exactly why a real deadlock that prints it can hide in plain sight.
- What was the bug in a2a-python Issue #1101?
- During teardown, a background producer coroutine inside an ActiveTask was left running while the EventQueue subscriber it fed was torn down underneath it. The naive fix — cancel-and-gather while holding the registry lock — re-deadlocked, because each cancelled task's own teardown re-acquires that same lock to deregister itself.
- What did PR #1105 change?
- It added a public async aclose() on DefaultRequestHandlerV2, ActiveTaskRegistry, and ActiveTask that closes the queues with immediate=True, releases the registry lock before awaiting the gather, cancels-and-gathers the background tasks, and refuses to resurrect a finished task via a one-way finished state checked under the same lock. 313 lines with tests, CI green, open for review.
- What is the ecosystem-level gap the bug exposes?
- There is no sound way to know when an async A2A mesh has settled. The v1.0 spec removed the per-task final flag, no SDK exposes a wait_for_idle or drain primitive, and the a2a-tck conformance kit falls back to a hardcoded 15-second timeout. That is distributed termination detection, and the agent-protocol stack has no name for it yet.