How we cut V0 to four moving parts.

The first version of Nodus is one Python workspace, four processes, two stateful stores, and roughly 1,500 lines of code that do real work. Here is what we kept, what we deferred, and the rule we used to tell the difference.

Why we’re publishing this

Most infrastructure startups treat their architecture like a moat. We don’t. The moat is the Trust Ledger — the per-vendor reliability dataset that compounds with every job we route. The shape of the control plane is just engineering, and engineering decisions get better when more people see them.

This is the first entry in an open log. Future entries cover the choices that come after V0: the Trust Ledger schema once it has data, how we score suppliers, what we learned from the first real failover, and the boring procurement work nobody else writes about.

The shape of V0

Four moving parts. Each is its own package in the workspace, and each is replaceable without touching the others.

[Figure: Nodus V0 architecture. The customer SDK calls the FastAPI control plane, which writes a job record to Postgres and enqueues a dispatch task on Redis. The arq worker pulls the task, asks the supplier registry for a healthy adapter, runs the job, and writes the outcome back to the Trust Ledger.]
V0 control plane. The dashed edge is the queue boundary — everything to its left is request-time; everything to its right runs on the worker’s clock.
  • SDK. A thin pip install nodus client. Nothing routes here; it speaks the public REST API. Versioned independently from the control plane so we can break internal contracts without breaking customers.
  • API. FastAPI app. Validates the spec, asks the router engine for a placement, writes a placed job to Postgres, enqueues a dispatch task. It does not call any supplier directly — that’s the worker’s job.
  • Worker. An arq process that consumes the queue, calls the chosen supplier, polls until terminal, writes the outcome to the Trust Ledger, and reports usage to billing.
  • Suppliers. Five-method adapter ABC. Add a new vendor by implementing it and registering the class. The router treats every adapter the same.
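A minimal sketch of what that five-method adapter ABC could look like. The method names (quote, submit, poll, cancel, health) are inferred from the operations this post mentions, and the `register` decorator is an assumption about how the registry is populated — treat both as illustrative, not the real interface.

```python
import abc
from typing import Any


class SupplierAdapter(abc.ABC):
    """Illustrative base class for a vendor integration.
    Method names are assumptions inferred from the post, not the real ABC."""

    @abc.abstractmethod
    async def quote(self, spec: dict[str, Any]) -> float:
        """Return a price estimate for the given job spec."""

    @abc.abstractmethod
    async def submit(self, spec: dict[str, Any]) -> str:
        """Start the job at the vendor; return the vendor's job id."""

    @abc.abstractmethod
    async def poll(self, supplier_job_id: str) -> dict[str, Any]:
        """Fetch the current status or result of a vendor job."""

    @abc.abstractmethod
    async def cancel(self, supplier_job_id: str) -> None:
        """Best-effort cancel of a running vendor job."""

    @abc.abstractmethod
    async def health(self) -> bool:
        """Is the vendor currently accepting work?"""


# Hypothetical registry: the router looks adapters up by name and
# treats every one of them identically.
REGISTRY: dict[str, type[SupplierAdapter]] = {}


def register(name: str):
    """Class decorator that makes an adapter visible to the router."""
    def deco(cls: type[SupplierAdapter]) -> type[SupplierAdapter]:
        REGISTRY[name] = cls
        return cls
    return deco
```

Adding a vendor is then exactly the workflow described above: implement the five methods, decorate the class, done.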

Why Python

Not because it’s the fastest. Because it’s the only language ML engineers will actually install your SDK in. The customer surface dictates the control-plane language — if the SDK is Python, every internal type that has to round-trip across the SDK boundary is cheaper to share if the API speaks Python too.

We share Pydantic v2 models across the SDK, the API, and the worker via a single nodus_core package. One source of truth for what a job is. Wire format and in-memory format are the same object.

The control plane will eventually have hot paths in Go or Rust — quote aggregation, the scoring engine, supplier health probing. Python stays at the edges, where humans write code.

Why FastAPI plus arq

Job lifecycles are I/O-bound, not CPU-bound. Every interesting moment is waiting on a vendor: submit, poll, cancel. Async-native is the right shape, and FastAPI plus arq is the smallest stack that delivers it end-to-end without bridging frameworks at the queue.

We considered Celery and rejected it: heavier, sync by default, and its retry semantics are configured by prayer. arq is ~2k lines, asyncio-first, deduplicates jobs by id out of the box. We pin the arq job id to the Postgres job PK, which means a retried submit is a no-op — the worker won’t double-dispatch.
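The dedup semantics are worth spelling out. This is an in-memory sketch of the behavior, not arq itself: enqueueing under a job id that is already queued is a no-op, and because we pin that id to the Postgres PK, a retried submit cannot produce a second dispatch.

```python
# In-memory model of the dedup behavior described above. In the real
# system, arq enforces this in Redis; here a dict stands in for the queue.
queued: dict[str, dict] = {}


def enqueue_dispatch(record_id: int) -> bool:
    """Enqueue a dispatch task, keyed by the Postgres job PK.

    Returns True if a task was enqueued, False if one with the same
    id already exists (the retry no-op case).
    """
    job_id = f"dispatch:{record_id}"  # pinned to the job record's PK
    if job_id in queued:
        return False                  # duplicate submit: nothing happens
    queued[job_id] = {"record_id": record_id}
    return True


first = enqueue_dispatch(42)
retry = enqueue_dispatch(42)  # retried submit is a no-op
```

The property you get for free is idempotent submission: the API can safely retry an enqueue on a transient failure without ever double-dispatching a job.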

Why a Trust Ledger from day one

Every job writes a row to job_records: spec, placement, outcome, cost, vendor job id, latency. A periodic job aggregates those rows into supplier_scores — success rate, p50 and p95 provision time, average cost — per (supplier, region, gpu) tuple.
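A sketch of that aggregation step, with made-up rows and a nearest-rank percentile. The column names and score fields are assumptions shaped by the description above, not the real `supplier_scores` schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical job_records rows; the real ones live in Postgres.
# (supplier, region, gpu, success, provision_seconds, cost_usd)
rows = [
    ("vendor_a", "us-east", "h100_80gb", True, 40.0, 2.10),
    ("vendor_a", "us-east", "h100_80gb", True, 55.0, 2.05),
    ("vendor_a", "us-east", "h100_80gb", False, 120.0, 0.00),
]


def pct(sorted_vals: list[float], p: float) -> float:
    """Nearest-rank percentile on a pre-sorted list."""
    idx = round(p / 100 * (len(sorted_vals) - 1))
    return sorted_vals[idx]


def aggregate(rows):
    """Roll job outcomes up into scores keyed by (supplier, region, gpu)."""
    grouped = defaultdict(list)
    for supplier, region, gpu, ok, secs, cost in rows:
        grouped[(supplier, region, gpu)].append((ok, secs, cost))
    scores = {}
    for key, outcomes in grouped.items():
        times = sorted(secs for _, secs, _ in outcomes)
        scores[key] = {
            "success_rate": sum(ok for ok, _, _ in outcomes) / len(outcomes),
            "p50_provision_s": pct(times, 50),
            "p95_provision_s": pct(times, 95),
            # Average cost over successful jobs only (failures bill nothing).
            "avg_cost_usd": mean(cost for _, _, cost in outcomes if cost > 0),
        }
    return scores
```

In production this would be a periodic SQL rollup rather than Python over a list, but the shape of the output table is the same.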

The router does not read the score table yet. It will, the moment the table has rows worth reading. The point of writing the schema before the data exists is that every shipping decision — how we model placements, what we persist about a vendor failure, which fields the SDK exposes — is made under the assumption that we will need to score on it later. You cannot retrofit telemetry. You can only plan for it.

Anyone can write a router. Nobody else can write the ledger that tells the router what to do. That asymmetry is the entire reason this company exists.

What we deferred (and why)

V0 is honest about what it is: a working loop with the right shape. The following are scaffolded with wire-compatible interfaces but not yet implemented. Each one ships when there is a real reason to ship it, not before.

  • Real auth. One dev API key from env. Replaced by HMAC-signed per-customer keys when the first prod customer signs. Route signatures don’t change.
  • KYC and OFAC pipeline. Required before we onboard a regulated buyer; not required to land an async AI team. Two months out.
  • Lambda Labs adapter. Wire-compatible, raises supplier_not_implemented. Roughly 80 lines from scaffold to real submit; we are waiting until we have a customer with a real workload to shape the integration around.
  • Dashboard. No customer has asked for one. They have asked for an OpenAPI spec and good error codes. We shipped both.
  • Multi-region API. One uvicorn process is fine until it isn’t.

The submit loop, end to end

Five hops between a customer call and a job record in the ledger. None of them are clever; that’s the point.

# 1. customer
job = nodus.Client(api_key=...).run(image=..., gpu="h100_80gb")

# 2. api: validate, choose placement, persist, enqueue
placement = await router_engine.choose_placement(spec, registry)
record = JobRepo(session).insert(JobRecord(spec=spec, placement=placement, status=PLACED))
await enqueue_dispatch(record.id)

# 3. worker: pull task, call supplier
adapter = registry.get(record.placement.supplier)
supplier_job_id = await adapter.submit(spec)

# 4. worker: poll until terminal, then write outcome
result = await adapter.poll(supplier_job_id)
JobRepo(session).update_status(record.id, COMPLETED, result=result.model_dump())

# 5. worker: report usage to billing (no-op without stripe key)
get_billing_client().report_gpu_hours(customer_id=..., hours=..., job_id=...)

The whole flow runs locally against a mock supplier that always succeeds in two seconds. That mattered: the loop has to be debuggable on a laptop without a vendor key, or nobody will ever debug it.
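A sketch of what that always-succeeding mock supplier could look like. The class and method names are illustrative assumptions, not the production adapter; the delay is configurable so tests don't have to wait the full two seconds.

```python
import asyncio
import time


class MockSupplier:
    """Local stand-in for a real vendor: every job succeeds after a fixed
    delay. Names here are assumptions, not the production interface."""

    def __init__(self, delay_s: float = 2.0):
        self.delay_s = delay_s
        self._done_at: dict[str, float] = {}

    async def submit(self, spec: dict) -> str:
        # Record when this fake job will be "done" and hand back an id.
        job_id = f"mock-{len(self._done_at) + 1}"
        self._done_at[job_id] = time.monotonic() + self.delay_s
        return job_id

    async def poll(self, job_id: str) -> dict:
        # Wait out the remaining fake provisioning time, then succeed.
        remaining = self._done_at[job_id] - time.monotonic()
        if remaining > 0:
            await asyncio.sleep(remaining)
        return {"status": "completed"}


async def run_once() -> dict:
    supplier = MockSupplier(delay_s=0.05)  # shortened from 2s for the demo
    job_id = await supplier.submit({"image": "train:latest", "gpu": "h100_80gb"})
    return await supplier.poll(job_id)
```

Because the worker only ever talks to the adapter interface, pointing it at this class instead of a real vendor is a one-line registry change — which is what makes the loop debuggable on a laptop.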

The rule we used

If a piece of V0 is a placeholder, it has to share a signature with what replaces it. If it doesn’t, it’s not a placeholder — it’s a rewrite waiting to happen.

Auth, billing, the Lambda adapter, the scoring engine — all four are stubbed with the production signature. The day we replace any one of them, no caller changes. That is the difference between a prototype and a V0.
