thepragmaticquant.com

How waitbus works: from event source to a waiting agent, over MCP

TL;DR — How waitbus works, and why it is built the way it is. Four modules — a listener, a SQLite event store, an eventfd doorbell, and a broadcast fan-out — turn an upstream change into a wake in single-digit milliseconds. An agent talks to that bus over MCP: tools to query it, resources to read events, and a push channel so the agent is notified instead of polling. The load-bearing claim is the ratio: waitbus wakes an agent in single-digit-to-low-teens milliseconds against seconds of polling — 100 to 400x faster, on whatever machine you draw. The decisions underneath — AF_UNIX over Redis, SQLite over an in-memory queue, systemd-creds over the keyring library — each cost something, and one of them shipped a bug I caught, named, and fixed. Whether you can trust the latency number is the next piece: why my first benchmarks lied.

A coding agent’s waitbus wait --source github --match "conclusion=success" call just returned. The path inside waitbus during those milliseconds: the webhook arrived at the listener, which verified the HMAC signature, normalized the payload into a small JSON envelope, and committed it to SQLite. Before the handler returned, it pulsed a doorbell — a single byte written to the daemon’s AF_UNIX socket, which wakes the broadcast loop (the daemon coalesces these into an eventfd on Linux). The daemon read the new row, serialized it, and wrote a length-prefixed frame to each subscriber’s socket. No network stack, no broker, no round trip to a remote service.

Eighty polls collapse to one wake.

Architecture in one pass

Four modules do the active work between an upstream event and a subscriber waking up.

The wake path (doorbell -> broadcast -> wait/MCP/subscriber) accented. The write to SQLite happens before the doorbell rings — that ordering is the whole correctness argument.

The ordering — commit to SQLite, then ring the doorbell — means a crash between the two is a bounded delay, never a lost event: the row is already durable when the waiter next reads.

python
# waitbus/_doorbell.py — the writer side of the wake (both platforms).
import socket


def ring(path) -> None:
    # Connect to the daemon's AF_UNIX listener and write one byte. On Linux the
    # daemon forwards that byte into an internal eventfd — its coalescing wake
    # primitive, registered with the asyncio loop via add_reader; on macOS the
    # loop reads the socket directly.
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(str(path))
        s.sendall(b".")

The cross-process wake is one byte on a unix socket. The daemon-internal coalescing layer — an eventfd on Linux — is what the broadcast loop actually waits on.

The local trust boundary

Installing a daemon that listens for external triggers on a shared box is a reasonable thing to be nervous about, so here is the model exactly as the code implements it. The trust boundary is a single UNIX user on one machine — waitbus is not multi-tenant, and it does not pretend to be. Two surfaces face outward. The inbound side is the webhook listener: it binds 127.0.0.1:9000, loopback only, never a routable interface, and every accepted body is checked against an HMAC-SHA256 signature in constant time before a row is written — a missing, malformed, or mismatched X-Hub-Signature-256 is a 401 and nothing is stored. The outbound side is the broadcast socket subscribers read from: at accept time the daemon reads the connecting peer’s UID straight from the kernel (SO_PEERCRED on Linux, getpeereid() on macOS) and silently closes any connection whose UID is not the daemon’s own. A different user on the same host cannot subscribe; they are dropped before they send a byte. The socket itself is mode 0600 and the whole state tree is 0700, so the kernel refuses the connection before that check even runs.
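The inbound check can be sketched in a few lines of standard-library Python. This is a minimal illustration of a constant-time HMAC-SHA256 comparison against an X-Hub-Signature-256 header, not the waitbus listener itself; the function name and the way the secret is passed in are assumptions made for the example.

```python
import hashlib
import hmac
from typing import Optional

PREFIX = "sha256="


def signature_is_valid(secret: bytes, body: bytes, header: Optional[str]) -> bool:
    """Return True only for a well-formed, matching sha256= signature header."""
    if not header or not header.startswith(PREFIX):
        # Missing or malformed header: the caller would answer 401 and store nothing.
        return False
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    received = header[len(PREFIX):]
    # compare_digest runs in time that does not depend on where the inputs differ.
    return hmac.compare_digest(expected.encode(), received.encode())
```

The point of the shape is that every failure mode (no header, wrong prefix, wrong digest) collapses to the same False before any row is written.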

To be precise about the boundary: the same-UID check is exactly that — same UID, not same process. Any process running as you can connect to the broadcast socket and read your event stream, and any process running as you can write to the SQLite store or ring the doorbell. The doorbell in particular has no credential check at all; it only triggers a re-read of the event table, so the worst a local same-UID caller does there is make the daemon run a SELECT it was about to run anyway — it cannot inject an event through it. Event injection is gated by filesystem permission on the store, not by a capability. So the honest one-line version: waitbus defends you against the network and against other users on the box, and assumes every process under your own UID is already you. On a single-developer workstation — what this is for — that is the right boundary. On a genuinely shared multi-user host where you do not trust your own other processes, it is not a sandbox.
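For the outbound side, here is a sketch of what a same-UID check at accept time can look like on Linux. It is an illustration under assumptions: the helper names are made up for this example, the macOS getpeereid() branch is left out, and the real daemon may structure the check differently.

```python
import os
import socket
import struct


def peer_uid(conn: socket.socket) -> int:
    """Kernel-reported UID of the peer on a connected AF_UNIX socket (Linux only)."""
    # struct ucred is three native ints: pid, uid, gid.
    size = struct.calcsize("3i")
    raw = conn.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, size)
    _pid, uid, _gid = struct.unpack("3i", raw)
    return uid


def accept_same_uid(conn: socket.socket) -> bool:
    """Close the connection and return False unless the peer runs under our UID."""
    if peer_uid(conn) != os.getuid():
        conn.close()  # silent drop: no error frame, nothing for the peer to read
        return False
    return True
```

Note what the check does not do: it compares UIDs, so any other process running as the same user passes it, which is exactly the boundary described above.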

Commit-then-ring, step by step. A crash anywhere is a bounded delay, not a lost event.
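The ordering argument is easier to see in code. The sketch below uses an in-memory SQLite database and a hypothetical events table; the schema, publish() and read_new() are illustrative names, not the waitbus API. The first ring fails on purpose, and the reader still sees both events because it reads by cursor rather than by notification.

```python
import sqlite3


def publish(db, payload, ring):
    """Commit the event first, then ring. A failed ring never undoes the commit."""
    with db:  # commits on success, rolls back if the INSERT raises
        db.execute("INSERT INTO events (payload) VALUES (?)", (payload,))
    try:
        ring()
    except OSError:
        pass  # daemon down or restarting: the row is durable and is read later


def read_new(db, cursor):
    """Return (payloads with id > cursor, advanced cursor)."""
    rows = db.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id", (cursor,)
    ).fetchall()
    if rows:
        cursor = rows[-1][0]
    return [payload for _id, payload in rows], cursor


def broken_ring():
    raise ConnectionRefusedError("daemon not listening")


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
)
publish(db, '{"source": "github"}', broken_ring)   # ring lost
publish(db, '{"source": "pytest"}', lambda: None)  # ring delivered
events, cursor = read_new(db, 0)                   # both rows come back
```

Reversing the two steps would break this: a ring before the commit could wake a reader that finds nothing, and a crash after the ring would leave no row to recover.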

The per-source comparison matrix

Every waitbus wake is measured end to end — from a state change on the source to the moment a subscriber’s recv() returns. The polling column is not a head-to-head race: it is the poll-interval ceiling each tool’s recommended pattern implies — a poller that re-checks every T seconds waits up to T, so its p99 is essentially that interval. The headline multiplier is therefore poll interval ÷ waitbus latency, and the 100–400x spread across sources reflects different recommended intervals (gh run watch re-checks roughly every 3 s, docker ps every 2 s, the pytest and fs pollers every 1 s), not different waitbus performance — waitbus is the same single-digit-millisecond wake regardless of source.

Polling is seconds; waitbus is milliseconds, 100 to 400x faster. The one row kept on purpose: inotifywait, which reads the kernel's inotify events directly, beats waitbus on raw filesystem latency by ~50x.
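| source           | polling p99 (ms) | waitbus p99 (ms) | result                |
|------------------|------------------|------------------|-----------------------|
| github           | 2,978            | 7.4              | 402x faster           |
| pytest           | 992              | 7.4              | 134x faster           |
| docker           | 2,079            | 6.0              | 346x faster           |
| fs               | 992              | 6.0              | 167x faster           |
| fs · inotifywait | 0.116 (kernel)   | 6.0              | ▼ waitbus loses 51x   |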

The kernel’s filesystem notifier is ~50x faster than waitbus on raw fs latency, and the inotifywait row stays in the table. The reason to use waitbus anyway is the multi-source predicate: one waitbus wait that fires on a pytest run finishing AND a Docker container exiting AND a file change is something inotifywait cannot express.
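The multiplier column is plain division, and it is worth recomputing from the rounded figures in the table. This is a small check, not part of the benchmark harness; the speedup() helper is a name introduced here.

```python
# (poll-interval ceiling p99 in ms, waitbus p99 in ms), copied from the table.
ROWS = {
    "github": (2978, 7.4),
    "pytest": (992, 7.4),
    "docker": (2079, 6.0),
    "fs": (992, 6.0),
}


def speedup(slower_ms: float, faster_ms: float) -> float:
    """How many times larger the slower latency is than the faster one."""
    return slower_ms / faster_ms


for source, (polling_ms, waitbus_ms) in ROWS.items():
    print(f"{source}: {speedup(polling_ms, waitbus_ms):.1f}x")

# The inotifywait row runs the other way: waitbus is the slower side.
print(f"fs vs inotifywait: waitbus {speedup(6.0, 0.116):.1f}x slower")
```

From the rounded inputs the fs row comes out a little above 165x rather than the published 167x. My assumption is that the table divides the unrounded latencies (the fs interval in the next table starts at 5.92 ms), which would account for the gap.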

The tail is the story

Only the three measured percentiles, with confidence intervals. Polling's tail explodes; waitbus is flat across percentiles in this capture (sustained-load drift over hours is a separate measurement).
The same waitbus p99s with their 95% confidence intervals. Tight where I ran 5,000 samples; visibly wider for docker, where I only ran 500 — the honest way to show how sure each number is.
| source | n     | p99 (ms) | 95% CI (ms)  |
|--------|-------|----------|--------------|
| github | 5,000 | 7.4      | [7.40, 7.44] |
| pytest | 5,000 | 7.4      | [7.32, 7.47] |
| docker | 500   | 6.0      | [5.89, 7.13] |
| fs     | 5,000 | 6.0      | [5.92, 6.00] |
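This piece does not say how the intervals were computed. One common way to put a confidence interval on a tail percentile is a percentile bootstrap, sketched below on synthetic latencies. Everything here is an assumption for illustration: the method, the nearest-rank p99 definition, the resample count, and the lognormal stand-in data are not taken from the benchmark.

```python
import random


def p99(samples):
    """Nearest-rank 99th percentile."""
    ordered = sorted(samples)
    rank = max(1, -(-99 * len(ordered) // 100))  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def bootstrap_ci(samples, stat, resamples=200, level=0.95, seed=0):
    """Percentile-bootstrap interval for stat(samples)."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement, recompute the statistic, and sort the results.
    stats = sorted(stat(rng.choices(samples, k=n)) for _ in range(resamples))
    tail = (1.0 - level) / 2.0
    lo = stats[int(tail * resamples)]
    hi = stats[min(resamples - 1, int((1.0 - tail) * resamples))]
    return lo, hi


# Synthetic stand-in data, not measurements.
rng = random.Random(42)
latencies_ms = [rng.lognormvariate(1.5, 0.2) for _ in range(5000)]

print("n=5000:", bootstrap_ci(latencies_ms, p99))
print("n=500: ", bootstrap_ci(latencies_ms[:500], p99))
```

With 500 samples only about five observations sit above the 99th percentile, so the interval is typically much wider than with 5,000, which is the effect visible in the docker row.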

How an agent actually talks to the bus

The architecture above is the wake path; what rides on top of it is an agent. You just pushed a branch. CI is running. The old path: the agent polls gh run list every few seconds, reads “in_progress” forty times, burns forty turns of context, then finally gets the result. The waitbus path: the agent calls a tool, blocks until the run completes, and gets back structured data. Two tool calls instead of eighty polling iterations.

The waitbus CLI surface — the wait primitive an MCP server exposes to the agent.

MCP in brief

Model Context Protocol is the standardized interface for tools and resources that AI coding agents consume. An MCP server exposes tools (callable functions), resources (readable URIs), and optional notifications (push updates) over JSON-RPC. Nearly all clients support calling tools (pull); far fewer surface server-initiated notifications (push). waitbus is built so the broadly portable path is the pull path, and push is a bonus where the client supports it.

Four tools on the pull path, two notification kinds on the push path, one socket underneath.

The wait predicate, and its failure edges

A blocking primitive is only trustworthy if you can see how it ends. waitbus wait resolves on a match, a timeout, or a peer/source failure.

Every exit edge is explicit. The 270-second cap returns control before a long wait can hit the multi-minute tool-call timeouts MCP clients impose.
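A sketch of a blocking wait with those three exits, using a plain queue as a stand-in for the subscriber socket. The Outcome enum, the SourceFailed sentinel, and wait_for() are hypothetical names for this illustration; only the 270-second cap comes from the text.

```python
import enum
import queue
import time

MAX_WAIT_SECONDS = 270.0  # the cap named above


class Outcome(enum.Enum):
    MATCH = "match"
    TIMEOUT = "timeout"
    SOURCE_FAILED = "source_failed"


class SourceFailed(Exception):
    """Hypothetical sentinel a feeder puts on the queue when the peer or source dies."""


def wait_for(events, predicate, timeout):
    """Block until an event matches, the deadline passes, or the source fails."""
    deadline = time.monotonic() + min(timeout, MAX_WAIT_SECONDS)
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return Outcome.TIMEOUT, None
        try:
            item = events.get(timeout=remaining)
        except queue.Empty:
            return Outcome.TIMEOUT, None
        if isinstance(item, SourceFailed):
            return Outcome.SOURCE_FAILED, None
        if predicate(item):
            return Outcome.MATCH, item
        # Non-matching event: keep waiting against the same deadline.
```

The deadline is computed once, so a stream of non-matching events cannot extend the wait past the cap.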

The 64-KiB escape hatch

Raw webhook payloads are attacker-controlled and can be large. Rather than truncate silently, a read over the cap returns a marker with a raw_uri pointer to the full payload.

Explicit-consent UX: the cap is a gate, not a wall. A tiny-task agent never pays the context cost; one that needs the full payload follows the pointer.
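In code, the gate might look like the following sketch. The field names, the URI scheme, and the choice to treat a payload of exactly 64 KiB as inline are all assumptions for illustration; only the cap itself and the raw_uri pointer come from the text.

```python
import json

MAX_INLINE_BYTES = 64 * 1024


def read_event(event_id: int, raw_payload: bytes) -> dict:
    """Return the parsed payload inline, or a marker pointing at the full body."""
    if len(raw_payload) <= MAX_INLINE_BYTES:
        return {
            "id": event_id,
            "truncated": False,
            "payload": json.loads(raw_payload),
        }
    return {
        "id": event_id,
        "truncated": True,
        "size_bytes": len(raw_payload),
        # Hypothetical URI shape; the real resource path may differ.
        "raw_uri": f"waitbus://events/{event_id}/raw",
    }
```

The oversize branch never parses the body at all, so a hostile payload costs the agent a few dozen bytes of marker rather than 64 KiB or more of context.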

The SDK pin

waitbus pins mcp to a single minor — >=1.27.1,<1.28 — rather than leaving the ceiling open, because the test suite byte-replays a two-tier wire fixture corpus and any minor bump has to pass both before the ceiling moves. There is also a subclass that flips a hardcoded resources.subscribe=False in the SDK until a specific upstream fix ships in a released version.

The decisions, and what they cost

The broker itself barely took an afternoon, and then a year went into everything wrapped around it: the wire protocol, the schema-ownership story, the security model, the macOS port, the open-loop benchmark methodology, the audit cycles, the supply-chain plumbing, and the multilingual-snippet test that catches any backwards-incompatible wire change at the same commit that introduces it.

systemd-creds, not the keyring library. An audit measured that keyring pulled in ten transitive packages and added 21.6 MiB to read one secret. The replacement is two lines and zero dependencies. Measure the dependency closure of any auth-touching library before you import it.

AF_UNIX SOCK_STREAM, not Redis or NATS or TCP loopback. SO_PEERCRED gives the kernel-vouched UID of any connecting peer, and there is no port-allocation problem with two workstations side by side. The wire was originally SOCK_SEQPACKET until the macOS port forced length-prefixed SOCK_STREAM (Darwin has no SEQPACKET on AF_UNIX). Cross-platform constraints picked the wire shape, not theoretical purity.
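SOCK_STREAM has no message boundaries, which is why the length prefix is needed once SEQPACKET is gone. A minimal framing sketch follows; the 4-byte big-endian header is an assumption made for the example, and the actual waitbus header layout may differ.

```python
import socket
import struct

HEADER = struct.Struct("!I")  # assumed layout: 4-byte big-endian payload length


def send_frame(sock: socket.socket, payload: bytes) -> None:
    """Write one length-prefixed frame."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes; recv() on a stream may return fewer than asked."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed mid-frame")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_frame(sock: socket.socket) -> bytes:
    """Read one frame: the header first, then exactly the advertised length."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)
```

Two frames written back to back arrive as one undifferentiated byte stream; the reader recovers the boundaries only because it reads the header and then exactly that many bytes.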

SQLite, not an in-memory queue. A workstation daemon does not strictly need durability, but the broadcast daemon’s in-memory state is derived state: on restart the cursor reseeds from the events table, so a missed doorbell ring is a bounded delay, not data loss.

What the audits caught, and what they missed

Eight named audits over five days, each a four-pass template (wide-strict mypy, project-health, code-review, code-simplifier). A finding that can be mechanically checked becomes a test or a CI gate — that pattern is consistent enough to be a project rule.

But the audits did not find every bug. The canonical benchmark capture was running on a cloud box when bench 6 of 15 crashed, deep in CPython 3.12’s _wait_for_tstate_lock. The same bench passed on the dev box, which runs Python 3.14. Five minutes of reading the traceback explained it: a bench script had class _Driver(threading.Thread) that did self._stop = threading.Event() in __init__. On 3.12, _stop is a private Thread method that join() calls by way of _wait_for_tstate_lock. Assigning an Event to that name shadows the method, so the call lands on an object that is not callable. On 3.14 the shadowed call site changed enough that the bug is latent; on 3.12 it raises.

The fix was a rename. The real cost was that I had produced the buggy file by copy-pasting a template across four bench scripts — so I grepped the shadow’s signature across the batch, found three more siblings, and patched all four in one commit.
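The rename, plus a mechanical check in the spirit of turning findings into tests, can be sketched like this. The class and helper names are illustrative, not the bench scripts themselves, and the helper is an idea rather than something the project is known to ship.

```python
import threading


class Driver(threading.Thread):
    """Bench driver that stops on request, using a name Thread does not own."""

    def __init__(self) -> None:
        super().__init__(daemon=True)
        # The buggy template had `self._stop = threading.Event()` here, which on
        # CPython 3.12 hides the private Thread._stop method used during join().
        self._stop_requested = threading.Event()
        self.ticks = 0

    def run(self) -> None:
        # wait() returns True once the event is set, which ends the loop.
        while not self._stop_requested.wait(timeout=0.01):
            self.ticks += 1

    def stop(self) -> None:
        self._stop_requested.set()


def shadowed_thread_names(thread: threading.Thread) -> set:
    """Instance attributes that hide a callable defined on threading.Thread."""
    return {
        name
        for name in vars(thread)
        if callable(getattr(threading.Thread, name, None))
    }


driver = Driver()
driver.start()
driver.stop()
driver.join(timeout=2)
```

The check depends on the interpreter: it only flags names that the running Python's Thread class actually defines as callables, so it would need to run under the same version as the capture host to have caught this one.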

The audits could not have caught the _stop shadow: none of the passes runs the bench scripts under Python 3.12 against the canonical capture host. The bug was caught by running the bench on a different machine, under a different Python version, against a different workload than any audit ran. Audits and cross-environment runs catch different things, and I needed both. A project that runs eight named audits in five days catches more than a project that runs zero, and still misses bugs that only surface when the bench runs on a host that differs from the dev box.

That is the architecture, the wiring, and the decisions. But a latency number is no better than the way it was measured — and mine were a lie until I fixed a subtle methodology bug, then found the same code running at two different speeds on cloud hosts that are supposedly identical. That story is the next piece: Why my first benchmarks lied.