Why we’re publishing this
Most infrastructure startups treat their architecture like a moat. We don’t. The moat is the Trust Ledger — the per-vendor reliability dataset that compounds with every job we route. The shape of the control plane is just engineering, and engineering decisions get better when more people see them.
This is the first entry in an open log. Future entries cover the choices that come after V0: the Trust Ledger schema once it has data, how we score suppliers, what we learned from the first real failover, and the boring procurement work nobody else writes about.
The shape of V0
Four moving parts. Each is its own package in the workspace, and each is replaceable without touching the others.
- SDK. A thin `pip install nodusclient`. Nothing routes here; it speaks the public REST API. Versioned independently from the control plane so we can break internal contracts without breaking customers.
- API. FastAPI app. Validates the spec, asks the router engine for a placement, writes a placed job to Postgres, enqueues a dispatch task. It does not call any supplier directly — that’s the worker’s job.
- Worker. An `arq` process that consumes the queue, calls the chosen supplier, polls until terminal, writes the outcome to the Trust Ledger, and reports usage to billing.
- Suppliers. Five-method adapter ABC. Add a new vendor by implementing it and registering the class. The router treats every adapter the same; a sketch of what that ABC might look like follows this list.
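Here is a minimal sketch of that five-method ABC. Only submit, poll, and cancel are named in this post; quote and health are guesses, borrowed from the hot paths mentioned below (quote aggregation, supplier health probing).

```python
# A sketch of the adapter ABC. submit/poll/cancel come from this post;
# quote and health are assumed names for the remaining two methods.
from abc import ABC, abstractmethod

class SupplierAdapter(ABC):
    @abstractmethod
    async def submit(self, spec: "JobSpec") -> str:
        """Submit a job to the vendor; return the vendor's job id."""

    @abstractmethod
    async def poll(self, supplier_job_id: str) -> "JobResult":
        """Fetch current status; the worker loops until a terminal state."""

    @abstractmethod
    async def cancel(self, supplier_job_id: str) -> None:
        """Best-effort cancel of a running job."""

    @abstractmethod
    async def quote(self, spec: "JobSpec") -> float:
        """Assumed: price in USD for this spec."""

    @abstractmethod
    async def health(self) -> bool:
        """Assumed: liveness probe against the vendor API."""
```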
Why Python
Not because it’s the fastest. Because it’s the only language ML engineers will actually install your SDK in. The customer surface dictates the control-plane language: if the SDK is Python, every internal type that round-trips across the SDK boundary is cheaper to share when the API speaks Python too.
We share Pydantic v2 models across the SDK, the API, and the worker via a single `nodus_core` package. One source of truth for what a job is. Wire format and in-memory format are the same object.
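For illustration, here is roughly what those shared models could look like. `JobSpec`, `Placement`, and `JobRecord` appear in the submit loop later in this post; any field not shown there (e.g. `Placement.region`) is an assumption.

```python
# A sketch of the shared nodus_core models; fields beyond those visible in
# the submit-loop example below are assumptions.
from enum import Enum
from pydantic import BaseModel

class JobStatus(str, Enum):
    PLACED = "PLACED"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"  # assumed terminal state

class JobSpec(BaseModel):
    image: str
    gpu: str  # e.g. "h100_80gb"

class Placement(BaseModel):
    supplier: str
    region: str  # assumed field

class JobRecord(BaseModel):
    spec: JobSpec
    placement: Placement
    status: JobStatus

class JobResult(BaseModel):
    status: JobStatus
    # assumed: vendor payload, cost, and timing fields would live here too
```

The same classes serialize on the wire via `model_dump()` and validate on the way back in, which is what "wire format and in-memory format are the same object" buys you.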
The control plane will eventually have hot paths in Go or Rust — quote aggregation, the scoring engine, supplier health probing. Python stays at the edges, where humans write code.
Why FastAPI plus arq
Job lifecycles are I/O-bound, not CPU-bound. Every interesting moment is waiting on a vendor: submit, poll, cancel. Async-native is the right shape, and FastAPI plus arq is the smallest stack that delivers async end-to-end without bridging frameworks at the queue.
We considered Celery and rejected it: heavier, sync by default, and its retry semantics are configured by prayer. arq is ~2k lines, asyncio-first, deduplicates jobs by id out of the box. We pin the arq job id to the Postgres job PK, which means a retried submit is a no-op — the worker won’t double-dispatch.
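A sketch of that dedup trick, using arq's public enqueue API and a hypothetical `dispatch_job` task name:

```python
# Pin arq's job id to the Postgres job PK. arq will not enqueue a second job
# with the same _job_id while one is queued or running, so a retried submit
# is a no-op: enqueue_job returns None instead of queueing a duplicate.
from arq import create_pool
from arq.connections import RedisSettings

async def enqueue_dispatch(job_pk: str) -> None:
    redis = await create_pool(RedisSettings())
    await redis.enqueue_job("dispatch_job", job_pk, _job_id=job_pk)
```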
Why a Trust Ledger from day one
Every job writes a row to `job_records`: spec, placement, outcome, cost, vendor job id, latency. A periodic job aggregates those rows into `supplier_scores` — success rate, p50 and p95 provision time, average cost — per (supplier, region, gpu) tuple.
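For concreteness, the rollup could be a single Postgres query along these lines. The column names (`provision_seconds`, `cost`) and the flat layout are assumptions; this post names the aggregates, not the schema.

```python
# A hedged sketch of the periodic aggregation. Assumes flat columns on
# job_records and a unique index on supplier_scores(supplier, region, gpu).
AGGREGATE_SUPPLIER_SCORES = """
    INSERT INTO supplier_scores
        (supplier, region, gpu, success_rate, p50_provision_s, p95_provision_s, avg_cost)
    SELECT supplier, region, gpu,
           avg((status = 'COMPLETED')::int)::float AS success_rate,
           percentile_cont(0.5)  WITHIN GROUP (ORDER BY provision_seconds) AS p50_provision_s,
           percentile_cont(0.95) WITHIN GROUP (ORDER BY provision_seconds) AS p95_provision_s,
           avg(cost) AS avg_cost
    FROM job_records
    GROUP BY supplier, region, gpu
    ON CONFLICT (supplier, region, gpu) DO UPDATE
        SET success_rate    = EXCLUDED.success_rate,
            p50_provision_s = EXCLUDED.p50_provision_s,
            p95_provision_s = EXCLUDED.p95_provision_s,
            avg_cost        = EXCLUDED.avg_cost;
"""
```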
The router does not read the score table yet. It will, the moment the table has rows worth reading. The point of writing the schema before the data exists is that every shipping decision — how we model placements, what we persist about a vendor failure, which fields the SDK exposes — is made under the assumption that we will need to score on it later. You cannot retrofit telemetry. You can only plan for it.
Anyone can write a router. Nobody else can write the ledger that tells the router what to do. That asymmetry is the entire reason this company exists.
What we deferred (and why)
V0 is honest about what it is: a working loop with the right shape. The following are scaffolded with wire-compatible interfaces but not yet implemented. Each one ships when there is a real reason to ship it, not before.
- Real auth. One dev API key from env. Replaced by HMAC-signed per-customer keys when the first prod customer signs. Route signatures don’t change.
- KYC and OFAC pipeline. Required before we onboard a regulated buyer; not required to land an async AI team. Two months out.
- Lambda Labs adapter. Wire-compatible, raises `supplier_not_implemented`. Roughly 80 lines from scaffold to real submit; we are waiting until we have a customer with a real workload to shape the integration around. (A sketch of the stub pattern follows this list.)
- Dashboard. No customer has asked for one. They have asked for an OpenAPI spec and good error codes. We shipped both.
- Multi-region API. One uvicorn process is fine until it isn’t.
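The stub pattern for the Lambda Labs adapter, sketched against the ABC above; `SupplierNotImplemented` is a hypothetical exception class behind the `supplier_not_implemented` error code:

```python
# Every method carries the production signature and raises the same error.
# When the real adapter lands, only the bodies change; no caller does.
class SupplierNotImplemented(Exception):
    """Surfaces as the supplier_not_implemented error code on the wire."""

class LambdaLabsAdapter(SupplierAdapter):
    async def submit(self, spec: "JobSpec") -> str:
        raise SupplierNotImplemented("lambda_labs")

    async def poll(self, supplier_job_id: str) -> "JobResult":
        raise SupplierNotImplemented("lambda_labs")

    async def cancel(self, supplier_job_id: str) -> None:
        raise SupplierNotImplemented("lambda_labs")

    async def quote(self, spec: "JobSpec") -> float:
        raise SupplierNotImplemented("lambda_labs")

    async def health(self) -> bool:
        raise SupplierNotImplemented("lambda_labs")
```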
The submit loop, end to end
Five hops between a customer call and a job record in the ledger. None of them are clever; that’s the point.
```python
# 1. customer
job = nodus.Client(api_key=...).run(image=..., gpu="h100_80gb")

# 2. api: validate, choose placement, persist, enqueue
placement = await router_engine.choose_placement(spec, registry)
JobRepo(s).insert(JobRecord(spec=spec, placement=placement, status=PLACED))
await enqueue_dispatch(record.id)

# 3. worker: pull task, call supplier
adapter = registry.get(record.placement.supplier)
supplier_job_id = await adapter.submit(spec)

# 4. worker: poll until terminal, then write outcome
result = await adapter.poll(supplier_job_id)
JobRepo(s).update_status(jid, COMPLETED, result=result.model_dump())

# 5. worker: report usage to billing (no-op without stripe key)
get_billing_client().report_gpu_hours(customer_id=..., hours=..., job_id=...)
```
The whole flow runs locally against a mock supplier that always succeeds in two seconds. That mattered: the loop has to be debuggable on a laptop without a vendor key, or nobody will ever debug it.
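A sketch of that mock, built on the adapter ABC sketched earlier; everything beyond "always succeeds in two seconds" is invented for illustration:

```python
# Terminal after ~2 seconds, always COMPLETED, no vendor key required.
import asyncio
import uuid

class MockSupplier(SupplierAdapter):
    async def submit(self, spec: "JobSpec") -> str:
        return f"mock-{uuid.uuid4().hex}"

    async def poll(self, supplier_job_id: str) -> "JobResult":
        await asyncio.sleep(2)  # simulated provisioning latency
        return JobResult(status=JobStatus.COMPLETED)

    async def cancel(self, supplier_job_id: str) -> None:
        return None  # nothing to cancel; the mock always completes

    async def quote(self, spec: "JobSpec") -> float:
        return 0.0  # free, like all good mocks

    async def health(self) -> bool:
        return True
```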
The rule we used
If a piece of V0 is a placeholder, it has to share a signature with what replaces it. If it doesn’t, it’s not a placeholder — it’s a rewrite waiting to happen.
Auth, billing, the Lambda adapter, the scoring engine — all four are stubbed with the production signature. The day we replace any one of them, no caller changes. That is the difference between a prototype and a V0.
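One concrete example of the rule, using the billing call from the loop above. The `report_gpu_hours` signature is the one the worker already calls; `NoopBillingClient` and the `Protocol` are assumptions.

```python
# The stub and the future Stripe-backed client share one signature;
# swapping the implementation never touches the worker.
from typing import Protocol

class BillingClient(Protocol):
    def report_gpu_hours(self, *, customer_id: str, hours: float, job_id: str) -> None: ...

class NoopBillingClient:
    def report_gpu_hours(self, *, customer_id: str, hours: float, job_id: str) -> None:
        return  # deliberately nothing until a Stripe key exists

def get_billing_client() -> BillingClient:
    # the real client drops in here when billing goes live
    return NoopBillingClient()
```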