Task was destroyed but it is pending
TL;DR — A server running Google’s A2A SDK shuts down mid-cascade and prints Task was destroyed but it is pending!. Most teams scroll past it. In a2aproject/a2a-python it was a real teardown deadlock: a background producer coroutine (a function that can pause and resume) left running while the EventQueue subscriber it fed was torn down underneath it. The fix — Issue #1101, PR #1105, 313 lines with tests, CI green, open for review — is a public async aclose() with four load-bearing moves, one per section below: it closes the queues immediate=True (drop pending events instead of waiting to flush them), cancels-and-gathers the background tasks, refuses to resurrect a finished task, and releases an internal lock before it waits, because the very tasks it is shutting down need that same lock to finish. Hold the lock and everything deadlocks. The fix closes one race. The gap it exposes stays open: the agent ecosystem has no sound way to know when an async mesh is actually done.
An A2A server is shutting down. A push-delegation cascade was in flight a moment ago: agent A handed work to B, who handed it to C. Somewhere in that chain, a producer coroutine is still mid-await (paused, waiting on something that will now never arrive), feeding an EventQueue whose subscriber is already being torn down. A producer coroutine fills a queue; a subscriber drains it; the queue is the buffer between them. The event loop, the single scheduler that runs every coroutine in an async Python program, closes. Python prints its one line and moves on:
    Task was destroyed but it is pending!

Whatever that producer was about to deliver is gone, swallowed with the loop. No exception that reaches your code. No failed status anyone reads. Just an in-flight delegation that quietly never happened, and a warning that fired at interpreter exit, disconnected by then from the request that spawned it.
That line is the cheapest thing in async Python to ignore. It arrives at shutdown, after the interesting logs, looking like janitorial noise. In a2aproject/a2a-python it was not noise. It was a deadlock that happened to lose the race to print before the interpreter exited — the producer’s cleanup was hung, the loop tore down around it, and the “pending” warning was the corpse, not the crime. I went and read it.
A warning is a claim about plumbing
Treat “Task was destroyed but it is pending!” the way you should treat a green check: as a claim about plumbing, not about the world. asyncio, in CPython (the standard Python interpreter), logs it when a Task (an asyncio object wrapping a running coroutine) is destroyed while still pending, parked on an await it will never resume. In practice that usually happens as the event loop is closed or the interpreter exits around it. The message is the interpreter telling you the truth it noticed on its way out the door: I was about to run more of your code, and then I wasn’t.
Almost always it is benign. A fire-and-forget coroutine nobody was waiting on, cancelled cleanly at shutdown. So the instinct to scroll past it is, statistically, correct. Which is exactly why it is dangerous when it is not noise. In #1101 the producer was the opposite of idle. It was blocked, on purpose, by the SDK’s own teardown, holding work the subscriber would never read. The only reason you saw a warning instead of a hang is that the interpreter tore the loop down before the deadlock could announce itself.
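The warning is easy to reproduce in isolation. The sketch below is a minimal illustration, not code from the SDK: it parks a task on a future that never resolves, closes the loop, drops the last references, and captures what asyncio logs. The names `producer`, `reproduce`, and `Capture` are made up for the example.

```python
import asyncio
import gc
import logging


async def producer(never: asyncio.Future) -> None:
    # Parks on a future nobody will ever resolve, like a producer
    # whose subscriber has gone away.
    await never


def reproduce() -> list:
    """Return the messages asyncio logs when a pending task is destroyed."""
    records = []

    class Capture(logging.Handler):
        def emit(self, record: logging.LogRecord) -> None:
            records.append(record.getMessage())

    logger = logging.getLogger("asyncio")
    handler = Capture()
    logger.addHandler(handler)
    try:
        loop = asyncio.new_event_loop()
        never = loop.create_future()
        task = loop.create_task(producer(never))
        loop.run_until_complete(asyncio.sleep(0))  # let producer reach its await
        loop.close()                               # shutdown, task still pending
        del task, never                            # drop the last references
        gc.collect()                               # destroy the pending task
    finally:
        logger.removeHandler(handler)
    return records


if __name__ == "__main__":
    for message in reproduce():
        print(message.splitlines()[0])
```

Nothing in this run raises into caller code; the only trace is a log record on the `asyncio` logger, which is why the message is so easy to miss.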
Where the producer goes to die
To see why the producer was blocked and not idle, walk the lifecycle the PR exposes. A DefaultRequestHandlerV2 owns an ActiveTaskRegistry; the registry owns one ActiveTask per in-flight request. Each ActiveTask runs a producer — the coroutine generating events for that request — that writes into an EventQueue, and a subscriber that drains the queue and forwards events out, to the caller or, in a delegating mesh, to the next agent in the cascade. Producer fills, subscriber empties; the queue is the seam between them.
The reproduction is exact, and it is in the public issue. When a request reaches a terminal state at the application layer — finished, failed, or cancelled, no more events coming — the owning ActiveTask begins cleaning up its background producer, and that cleanup awaits:
    # the shape of the trap on the public PR surface, not project source
    await self._event_queue_subscribers.close(immediate=False)

close(immediate=False) means drain gracefully: wait for every queued event and every child sink to flush. But the subscriber on the other end of that queue is the thing being torn down. Nobody is reading. The drain waits for a consumer that is never coming back, the producer’s cleanup never returns, and the task stays pending until the loop is yanked out from under it at shutdown. The graceful path is the hang.
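The same shape can be shown with the standard library alone. This is a sketch under an assumption: a plain `asyncio.Queue` stands in for the SDK's EventQueue, and `queue.join()` plays the role of a graceful drain. The helper name `graceful_drain_hangs` is illustrative.

```python
import asyncio


async def graceful_drain_hangs(timeout: float = 0.05) -> bool:
    """Return True if a graceful drain with no consumer fails to finish."""
    queue: asyncio.Queue = asyncio.Queue()
    await queue.put("event-1")  # produced, but no subscriber will ever read it
    try:
        # join() waits until every queued item has been marked task_done().
        # With the consumer gone, that never happens.
        await asyncio.wait_for(queue.join(), timeout)
    except asyncio.TimeoutError:
        return True
    return False


if __name__ == "__main__":
    print(asyncio.run(graceful_drain_hangs()))
```

The timeout exists only so the demonstration terminates; in the real teardown there was no such bound, which is why the cleanup stayed pending until the loop went away.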
A correct teardown has to reach both ends of the seam — close the reader and reckon with the still-live writer — without leaving the loop holding a coroutine blocked on the very queue you just closed. The pair was started together and was never stopped together. That asymmetry, a spawn with no symmetric awaited teardown, is the bug in one sentence. The fix is four moves; here they are in the order the teardown actually executes them.
immediate=True, or you drain forever
The first move runs against instinct. The reflex at shutdown is to drain gracefully: let the queue empty, let the producer finish its last item, then close. You cannot do that here. You cannot gracefully drain a queue whose producer you are simultaneously cancelling. “Drain” means “let the producer finish pushing and the consumer finish reading,” and you are in the middle of cancelling that producer. A graceful drain would wait for events the cancelled producer will never push, which is the original hang wearing a more reassuring name.
So the queues close immediate=True: stop accepting new events, and force a producer blocked on the queue to raise out of its await instead of parking there forever. Close immediately first. Only then cancel-and-gather the background tasks — that is the next move, and it is where the real bug lives.
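To make the two close modes concrete, here is a small stand-in queue. It is a hypothetical `ClosableQueue` written for this illustration, not the SDK's EventQueue, and `QueueClosed` is likewise an invented exception. It shows the one property that matters: an immediate close wakes a producer blocked in `put()` and makes it raise instead of parking forever.

```python
import asyncio
from collections import deque


class QueueClosed(Exception):
    """Raised to a producer whose queue was closed underneath it."""


class ClosableQueue:
    """Hypothetical bounded queue with graceful and immediate close."""

    def __init__(self, maxsize: int) -> None:
        self._maxsize = maxsize
        self._items = deque()
        self._closed = False
        self._cond = asyncio.Condition()

    async def put(self, item) -> None:
        async with self._cond:
            # Wait for room, or for the queue to be closed.
            await self._cond.wait_for(
                lambda: self._closed or len(self._items) < self._maxsize
            )
            if self._closed:
                raise QueueClosed("queue closed; event dropped")
            self._items.append(item)
            self._cond.notify_all()

    async def close(self, immediate: bool) -> None:
        async with self._cond:
            if not immediate:
                # Graceful: wait for a consumer to empty the queue.
                await self._cond.wait_for(lambda: not self._items)
            self._items.clear()       # immediate: pending events are dropped
            self._closed = True
            self._cond.notify_all()   # wake any producer blocked in put()


async def demo() -> str:
    queue = ClosableQueue(maxsize=1)
    await queue.put("event-1")                            # queue is now full
    producer = asyncio.create_task(queue.put("event-2"))  # blocks: no room
    await asyncio.sleep(0)                                # let it reach its await
    await queue.close(immediate=True)
    try:
        await producer
    except QueueClosed:
        return "producer raised out of its await"
    return "producer completed"


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Calling `close(immediate=False)` in the same demo would wait for a consumer that does not exist. For comparison, Python 3.13 added `asyncio.Queue.shutdown(immediate=True)` with similar semantics in the standard library.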
The obvious fix re-deadlocks
The obvious next step is to stop waiting on a dead consumer: cancel the background tasks, gather them (wait for all of them together) so their exceptions surface, and be done. And because the registry is shared mutable state being changed during teardown, you hold the registry lock while you do it. That reads as careful.
It looks correct. It hangs forever.
    # the trap: cancel-and-gather WHILE holding the registry lock
    async def aclose(self) -> None:
        async with self._lock:  # take the lock
            for task in self._active.values():
                task.cancel()
            await asyncio.gather(*self._active.values(),
                                 return_exceptions=True)  # <-- never returns: deadlock

The reason is the re-entrant detail that separates a real fix from a plausible one. When you cancel an ActiveTask, its own teardown path runs, and that teardown reaches back into the registry to deregister itself from the map, which needs the same registry lock you are currently holding. So gather waits for the tasks to finish, the tasks wait for the lock to deregister, and the lock waits for gather to return it. Three parties, one cycle, no progress. The cancel you issued to break the deadlock created a tighter one. (Re-entrant here just means the cancelled task reaches back for a lock the canceller already holds.)
aclose() holds the lock while gather waits for the tasks; the tasks need that same lock to deregister. Hold the lock across the await and the loop is sealed.

Holding the lock is not owning the teardown. The lock protects the membership set, the act of mutating the map. It was never a license to block inside it while the things you are tearing down need it to finish. The fix shrinks the lock to exactly what it protects and moves the await outside:
    # the fix: snapshot under the lock, await OUTSIDE it
    async def aclose(self) -> None:
        async with self._lock:  # lock protects the snapshot only
            self._is_finished = True
            tasks = list(self._active.values())
        # lock released here: the tasks' teardown can now re-acquire it
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)

Take the lock only long enough to read a consistent snapshot and mark the registry closed. Release it. Then cancel and gather, so each cancelled task can re-enter, deregister, and complete, and the gather returns. That single move (snapshot inside, await outside) is the difference between a fix that works and one that re-deadlocks while passing review. You only find it by reading what the cancelled tasks do on their way out, not by reasoning about the canceller alone.
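Both variants can be run side by side on a toy model. The `ToyRegistry` below is a hypothetical miniature written for this illustration, not the SDK's ActiveTaskRegistry. Its tasks deregister themselves under the registry lock on the way out, which is the one behaviour the argument depends on. A short `wait_for` timeout stands in for "hangs forever" so the demonstration terminates.

```python
import asyncio


class ToyRegistry:
    """Hypothetical miniature of the registry/task relationship."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._active = {}
        self._is_finished = False

    async def _run(self, name: str) -> None:
        try:
            await asyncio.Event().wait()   # park, like a blocked producer
        finally:
            async with self._lock:         # teardown re-enters the registry lock
                self._active.pop(name, None)

    async def start(self, name: str) -> None:
        async with self._lock:
            self._active[name] = asyncio.create_task(self._run(name))

    async def aclose_holding_lock(self) -> None:
        # The trap: gather is awaited while the lock is still held.
        async with self._lock:
            tasks = list(self._active.values())
            for task in tasks:
                task.cancel()
            await asyncio.gather(*tasks, return_exceptions=True)

    async def aclose(self) -> None:
        # The fix: snapshot under the lock, cancel and gather outside it.
        async with self._lock:
            self._is_finished = True
            tasks = list(self._active.values())
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


async def demo() -> tuple:
    trapped = ToyRegistry()
    await trapped.start("a")
    await asyncio.sleep(0)                 # let the task reach its park
    try:
        await asyncio.wait_for(trapped.aclose_holding_lock(), timeout=0.05)
        deadlocked = False
    except asyncio.TimeoutError:
        deadlocked = True

    fixed = ToyRegistry()
    await fixed.start("a")
    await fixed.start("b")
    await asyncio.sleep(0)
    await asyncio.wait_for(fixed.aclose(), timeout=1.0)
    return deadlocked, len(fixed._active) == 0


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The only difference between the two methods is where the `async with` block ends, which is why the trap survives a casual review.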
The gather with return_exceptions=True is doing more than waiting. It means the CancelledError each cancelled coroutine raises, plus any real exception, surfaces into the gathered result instead of evaporating as an exit-time “pending” warning. A bare cancel() without the gather drops the tasks’ exceptions on the floor and reintroduces the exact ghost. The gather is how you convert a silent loop-finalization into a value you can inspect — the difference between a teardown and a del.
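A small sketch of what "a value you can inspect" means in practice. The coroutines `parked` and `failing` are invented for the example: one is still blocked at shutdown, the other had already failed with a real error.

```python
import asyncio


async def parked() -> None:
    await asyncio.Event().wait()   # still blocked when shutdown begins


async def failing() -> None:
    raise ValueError("producer failed before shutdown")


async def demo() -> list:
    tasks = [asyncio.create_task(parked()), asyncio.create_task(failing())]
    await asyncio.sleep(0)         # parked() reaches its await; failing() raises
    for task in tasks:
        task.cancel()              # no effect on the already-finished task
    results = await asyncio.gather(*tasks, return_exceptions=True)
    # Each outcome comes back as a value, in task order.
    return [type(result).__name__ for result in results]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Both outcomes land in the result list: the cancellation of the parked task and the real failure of the other one, neither of which would otherwise reach the caller.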
No resurrection
Releasing the lock before awaiting is correct, but it forces a second invariant to carry weight, and aclose() is incomplete without it. While aclose() is mid-gather with the lock released, a late start() can arrive — a request that was already in flight when shutdown began — and try to spawn a fresh producer into the registry aclose() thinks it just emptied. Leave that path open and you finish shutting down with a new background coroutine running past the barrier you closed, feeding a queue nobody will ever drain.
The flag and the lock-discipline are one invariant, not two fixes. aclose() sets _is_finished in the same critical section as the snapshot — atomic close-marking — and start() checks that flag under the same lock before it spawns:
    # finished is terminal: refuse to start
    async def start(self) -> None:
        async with self._lock:
            if self._is_finished:
                raise RuntimeError("task is finished; refusing to start")
            self._spawn_producer()

Because _is_finished is set inside the lock and read inside the lock, a start() that races the gather sees the closed flag and refuses, every time. There is no window. Name it as a rule the state machine enforces: a finished state that a late start() can quietly undo is just a suggestion. The lifecycle gets a head, a tail, and a one-way door between them.
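The race can be exercised directly. `ToyActiveTask` is a hypothetical stand-in written for this sketch, not the SDK's ActiveTask; it keeps only the flag, the lock, and a producer that parks. The demo runs `aclose()` and a late `start()` concurrently and reports what the late caller got.

```python
import asyncio


class ToyActiveTask:
    """Hypothetical stand-in: one lock, one flag, one parked producer."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._is_finished = False
        self._producer = None

    async def _produce(self) -> None:
        await asyncio.Event().wait()

    async def start(self) -> None:
        async with self._lock:
            if self._is_finished:      # checked under the same lock
                raise RuntimeError("task is finished; refusing to start")
            self._producer = asyncio.create_task(self._produce())

    async def aclose(self) -> None:
        async with self._lock:
            self._is_finished = True   # set with the snapshot
            producer = self._producer
        if producer is not None:
            producer.cancel()
            await asyncio.gather(producer, return_exceptions=True)


async def demo() -> tuple:
    task = ToyActiveTask()
    await task.start()
    # A late start() races the shutdown.
    results = await asyncio.gather(
        task.aclose(), task.start(), return_exceptions=True
    )
    return type(results[1]).__name__, task._producer.cancelled()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The late `start()` is refused rather than silently spawning a second producer, and the original producer ends up cancelled and gathered.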
That is the whole fix: a public async aclose() on DefaultRequestHandlerV2, ActiveTaskRegistry, and ActiveTask that closes queues immediate=True, releases the lock before awaiting, cancels-then-gathers, and refuses resurrection. PR #1105 is 313 lines, with tests that reproduce the deadlock deterministically and CI green; it is open for review against the reference SDK. The load-bearing hunk is the lock-scope change in aclose() — open the PR and the snapshot-under-lock, gather-outside-lock shape is right there.
Zoom out: this was a symptom
Now the part that outlasts the bug.
This race isn’t a one-off process quirk; it is a structural class of failure. Step back from the single Python loop and look at the cascade: there is no local way to know when an A→B→C push-delegating chain has actually settled. A closed its stream, but a closed stream on A says nothing about whether B’s push to C is still in flight.
The teardown race is precisely what the gap between a local fact and a global one looks like inside a single process.
The handler believed the request was done — terminal status reached — while a background producer was still mid-await. “This stream closed” is a local fact. “The cascade has settled” is a global one, and nothing in the loop was positioned to tell the difference. Multiply that across processes and a network and you get the same disagreement at the scale of a mesh: A reports done while B is still pushing to C, and no participant is positioned to notice.
The protocol made this harder on purpose. The A2A v1.0 spec removed the per-task final flag. Earlier drafts let a task-status event carry an explicit final: true — a producer-asserted “this stream is done” marker. v1.0 dropped it; the changelog says to “leverage protocol binding specific stream closure mechanism instead”. Completion is now inferred from stream-closure or terminal status, never asserted. That is a defensible protocol simplification, and it quietly relocated a hard problem from the wire onto every implementer’s teardown path.
Go looking for the primitive that would close the gap, and it is not there. No A2A SDK — Python, JS, or Go — exposes a wait_for_idle, a drain, or any “all background work is done” signal. The thing every distributed-systems textbook calls termination detection, proving a distributed computation has globally stopped, has no entry in the agent-protocol vocabulary.
Fifteen seconds is not a proof
The conformance kit gives the gap away. The canonical A2A conformance kit, a2a-tck, tests push-notification delivery — and to decide whether a webhook arrived, it falls back to a hardcoded fifteen-second wait. Not a drain signal. Not a quiescence check. A wall-clock guess, because there is no quiescent signal to wait on.
When the reference conformance suite — the thing whose whole job is to define “correct” — resorts to “wait fifteen seconds and assume,” that is not a tooling shortcut. It is the ecosystem admitting it has no sound primitive to test against. This is the green-is-not-evidence failure class natively ported to agents: a shutdown that exits 0 over a leaked, undrained producer reads identically to a clean one. Exit-0 guarantees the loop returned; it proves nothing about delegations draining. A fifteen-second timeout just means the clock advanced, not that the mesh settled.
And a timeout is not a slightly-weak quiescence check — it is the wrong shape of check. A push-delegating A→B→C cascade has no single owner who can observe that the whole frontier — every still-active hop in the cascade — is empty. A holds an edge to B; B holds an edge to C; no participant sees the others’ in-flight work. Quiescence is a global property of the mesh; a per-edge timeout is a local guess. Term by term: stream-closure on one edge is necessary for “the cascade settled,” and nowhere near sufficient, because the edge that closed tells you nothing about the edge three hops downstream that is busiest right now. This is genuinely distributed termination detection, wearing an agent-protocol hat — a problem with decades of theory and no agent-protocol implementation. It’s hard, and the field skipped it for being hard. That is not a reason to keep skipping it.
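For contrast with a wall-clock wait, here is what gating on an observed signal looks like inside one process. Everything in it is hypothetical: `InFlightTracker` and its `wait_for_idle` are invented for this sketch, and no A2A SDK is claimed to expose anything like them. It also covers only local work; it does nothing for the cross-process case described above.

```python
import asyncio


class InFlightTracker:
    """Hypothetical in-process quiescence signal (not an A2A SDK API)."""

    def __init__(self) -> None:
        self._count = 0
        self._idle = asyncio.Event()
        self._idle.set()               # nothing in flight yet

    def track(self, coro) -> asyncio.Task:
        self._count += 1
        self._idle.clear()
        task = asyncio.create_task(coro)
        task.add_done_callback(self._on_done)
        return task

    def _on_done(self, _task: asyncio.Task) -> None:
        self._count -= 1
        if self._count == 0:
            self._idle.set()           # observed: the last tracked task finished

    async def wait_for_idle(self) -> None:
        await self._idle.wait()


async def push(delay: float, log: list, name: str) -> None:
    await asyncio.sleep(delay)         # stand-in for a push delegation
    log.append(name)


async def demo() -> list:
    tracker = InFlightTracker()
    log = []
    tracker.track(push(0.01, log, "A->B"))
    tracker.track(push(0.03, log, "B->C"))
    await tracker.wait_for_idle()      # returns on completion, not on a clock
    return log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The wait ends because both pushes were observed to finish, however long they took. Extending that observation across agents, where no single process owns the counter, is the termination-detection problem that remains open.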
What this does not claim
Keep the disclosed loss in view. The #1101 fix is real, tested, and closes one race in one process. It does not hand the ecosystem the missing drain primitive. A correct aclose() makes a single handler’s teardown deterministic; it does nothing to tell agent A that agent B’s push to C has landed. The in-process bug and the cross-network gap are the same shape at two zoom levels, but solving the first does not solve the second — and a piece that pretended otherwise would be the exact false-green it is warning about.
That distinction is also why quiescence belongs as a first-class primitive, not a fifteen-second hope. Naming the gap is not solving it; the fix upstream and the primitive are two different commitments, and only one of them is a concrete, tested patch sitting in a vendor SDK’s PR queue today.
The rule it became
| what read green | what was actually true | the rule it became |
|---|---|---|
| Shutdown exited 0, no traceback | A producer was left running; an in-flight delegation evaporated with the loop | Assert on drain completed, never on an exit code returned |
| cancel() + gather() under the lock looked careful | The tasks’ own teardown re-acquired that lock: instant deadlock | Snapshot under the lock; release it; cancel-and-gather outside it |
| start() succeeded during teardown | A late spawn leaked a fresh producer past the barrier aclose() closed | Set the finished flag and check it under the same lock; refuse, loudly |
| a2a-tck passed the push test after 15s | The clock advanced; the mesh’s actual settlement was never observed | Gate on an observed drain signal, never on a wall-clock timeout |
If a planted in-flight task can survive your shutdown and your gate stays green, your teardown is decoration. That is the portable rule.
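The rule translates into a test you can run against any shutdown routine. This is a sketch with invented names (`leaked_tasks_after`, `decorative_shutdown`, `real_shutdown`): it plants a parked task, runs the shutdown under test, and reports which tasks are still pending afterwards.

```python
import asyncio


async def leaked_tasks_after(shutdown) -> list:
    """Plant a parked task, run `shutdown`, and name whatever is still pending."""
    planted = asyncio.create_task(
        asyncio.Event().wait(), name="planted-producer"
    )
    await asyncio.sleep(0)             # let the planted task reach its await
    await shutdown([planted])
    current = asyncio.current_task()
    return sorted(
        task.get_name()
        for task in asyncio.all_tasks()
        if task is not current and not task.done()
    )


async def decorative_shutdown(tasks) -> None:
    # Returns cleanly and tears nothing down.
    return None


async def real_shutdown(tasks) -> None:
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)


if __name__ == "__main__":
    print(asyncio.run(leaked_tasks_after(decorative_shutdown)))
    print(asyncio.run(leaked_tasks_after(real_shutdown)))
```

Both shutdowns return without error; only the list of survivors tells them apart, which is the point of asserting on the drain and not on the exit.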
Read it yourself
The absence is checkable without taking my word for it. Three primary sources, one find-target each:
- Read PR #1105 and find the lock release before the gather in aclose().
- Read the v1 changelog and find the line that removes the per-task final flag.
- Read a2a-tck and find the hardcoded fifteen-second fallback in the push-notification test.
Three clicks, one conclusion: the agent ecosystem has named “the stream closed” and has not named “the mesh is done.”
The in-process priors are next door. “Coding agents can’t hear each other” is the same blindness on one box: peers that cannot tell when a neighbor finishes or fails; this piece is the cross-process, cross-network sequel. “Green is not evidence” is the false-green lens this whole teardown wears. And the drain-before-return discipline the wire actually needs runs through source to subscriber in milliseconds. The single box has a nervous system. The mesh does not yet have a way to know it has gone quiet, and the producer that died at shutdown is what that looks like in a stack trace.
Frequently asked questions
- What is the 'Task was destroyed but it is pending!' warning?
- It is a message asyncio logs when a Task is destroyed while still pending, typically as the event loop is closed or the interpreter exits: the Task is still parked on an await it will never resume. It is usually benign exit-time noise, which is exactly why a real deadlock that prints it can hide in plain sight.
- What was the bug in a2a-python Issue #1101?
- During teardown, a background producer coroutine inside an ActiveTask was left running while the EventQueue subscriber it fed was torn down underneath it. The naive fix — cancel-and-gather while holding the registry lock — re-deadlocked, because each cancelled task's own teardown re-acquires that same lock to deregister itself.
- What did PR #1105 change?
- It added a public async aclose() on DefaultRequestHandlerV2, ActiveTaskRegistry, and ActiveTask that closes the queues immediate=True, releases the registry lock before awaiting the gather, cancels-and-gathers the background tasks, and refuses to resurrect a finished task via a one-way finished state checked under the same lock. 313 lines with tests, CI green, open for review.
- What is the ecosystem-level gap the bug exposes?
- There is no sound way to know when an async A2A mesh has settled. The v1.0 spec removed the per-task final flag, no SDK exposes a wait_for_idle or drain primitive, and the a2a-tck conformance kit falls back to a hardcoded 15-second timeout. That is distributed termination detection, and the agent-protocol stack has no name for it yet.