Python SDK
A thin client over the Nodus REST API. Sync and async, with typed errors, automatic retries, idempotent job submission, and a small CLI. Versioned and released independently of the control plane.
# Playground
Configure a job spec, watch it route, fail over, and land in the Trust Ledger — all in your browser. No API key required. The simulator runs the same state machine the real control plane does, against a pool of vetted mock suppliers with realistic latency and pricing. Flip "Simulate a supplier outage" to see the failover path.
# Install
Python 3.10 or newer. The only runtime dependencies are httpx and pydantic.

```shell
$ pip install nodus-sdk        # or: uv pip install nodus-sdk
```
# Authentication
Every client reads its credentials from the environment by default. Pass api_key= explicitly when you need multiple keys in one process.

| Variable | Default | Notes |
|---|---|---|
| NODUS_API_KEY | — | Required. Issued from the console as nk_live_… / nk_test_…. |
| NODUS_BASE_URL | https://api.nodus.run | Override for staging or a self-hosted control plane. |
# Quick start
Submit a job, wait for it, print the result. This is the whole shape of the SDK; every other call is a variant of one of these three primitives.
```python
# export NODUS_API_KEY=nk_live_...
>>> import os, nodus
>>>
>>> with nodus.Client() as client:
...     job = client.run(
...         image="ghcr.io/acme/train:v3",
...         command=["python", "train.py", "--epochs", "10"],
...         gpu="h100_80gb",
...         gpu_count=8,
...         env={"WANDB_API_KEY": os.environ["WANDB_API_KEY"]},
...         max_runtime_seconds=18 * 3600,
...     )
...     job.wait(timeout_seconds=20 * 3600)
...     print(job.status, job.supplier, job.cost_usd)
JobStatus.COMPLETED lambda 47.62
```
run() returns immediately with a Job handle. Call wait() to block until the job reaches a terminal state, or refresh() to fetch the current state on demand.
# Async
The async client mirrors the sync one exactly. Same method names, same arguments, same return types — just await them.
```python
>>> import asyncio, nodus
>>>
>>> async def main():
...     async with nodus.AsyncClient() as client:
...         job = await client.run(image="ghcr.io/acme/train:v3", gpu="h100_80gb")
...         await job.wait()
...         print(job.status, job.cost_usd)
>>>
>>> asyncio.run(main())
```
# CLI
Installing the package puts a nodus command on your $PATH. Anything you can do in three lines of Python you can do in one shell line — useful for one-off jobs, health checks in deploy scripts, and CI smoke tests.
```shell
$ export NODUS_API_KEY=nk_live_...
$ nodus run \
    --image ghcr.io/acme/train:v3 --gpu h100_80gb --gpu-count 8 \
    --env WANDB_API_KEY=$WANDB_API_KEY \
    --wait --timeout 64800 \
    -- python train.py --epochs 10
$ nodus list --limit 5
$ nodus get 01HX8ZV... --wait
```
# Client
nodus.Client is the sync entry point. One instance per process is enough — httpx handles connection pooling internally. Use it as a context manager or call close() explicitly.

Class

```python
nodus.Client(api_key=None, *, base_url=None, timeout=30.0, max_retries=2)
```
| Parameter | Type | Default | Notes |
|---|---|---|---|
| api_key | str \| None | env NODUS_API_KEY | Raises ConfigurationError if neither is set. |
| base_url | str \| None | https://api.nodus.run | Overridden by env NODUS_BASE_URL. |
| timeout | float | 30.0 | Per-request seconds. |
| max_retries | int | 2 | Transient failures only (network, 429, 5xx). Set 0 to disable. |
Methods
Method

```python
client.run(*, image, command=None, gpu="h100_80gb", gpu_count=1, env=None, max_runtime_seconds=3600, require_tiers=None, prefer_region=None, idempotency_key=None) → Job
```
Submit a job. Returns immediately with a Job handle. If idempotency_key is omitted the SDK generates one so the built-in retries are safe. Pass your own key for caller-controlled deduplication (e.g. retrying from a workflow engine).
| Parameter | Type | Notes |
|---|---|---|
| image | str | OCI image reference, e.g. ghcr.io/acme/train:v3. |
| command | list[str] \| None | Argv list. None uses the image CMD. |
| gpu | GpuType \| str | Enum or its wire value. Defaults to h100_80gb. |
| gpu_count | int | 1–8. |
| env | dict[str, str] \| None | Environment variables injected into the container. |
| max_runtime_seconds | int | 60 to 172 800 (48 hours). |
| require_tiers | list[SupplierKind] \| None | Hard filter on supplier classes. |
| prefer_region | str \| None | Soft preference; nudges placement score. |
| idempotency_key | str \| None | Max 128 chars. Submitting the same key twice returns the original job. |
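For caller-controlled deduplication, the key only needs to be stable across re-executions. One way to get that (an illustrative helper, not part of the SDK) is to derive it from the workflow step id and a canonical serialisation of the spec:

```python
import hashlib
import json

def idempotency_key_for(step_id: str, spec: dict) -> str:
    """Derive a stable, <=128-char idempotency key from a workflow step id
    and its job spec, so re-executions of the step reuse one submission.
    (Illustrative helper; not part of the SDK.)"""
    digest = hashlib.sha256(
        json.dumps(spec, sort_keys=True).encode()
    ).hexdigest()
    return f"{step_id}-{digest}"[:128]

key = idempotency_key_for(
    "train-sweep-7",
    {"image": "ghcr.io/acme/train:v3", "gpu": "h100_80gb"},
)
```

Pass the result as idempotency_key=key; sort_keys=True matters, since two dicts with the same entries in different order must produce the same key.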
Method

```python
client.list(*, limit: int = 50) → list[Job]
```

List recent jobs for the authenticated customer. Ordered by creation time, newest first. limit is capped server-side.
Method

```python
client.iter_jobs(*, page_size: int = 50) → Iterator[Job]
```
Convenience iterator. Shaped like a cursor-paginated API so your code won’t change when the server gains a real cursor.
Method

```python
client.healthz() → dict
```
Hit the control-plane health endpoint. Useful as a deploy-time smoke test.
Context manager

```python
with nodus.Client() as client:
    ...
# or, without the context manager:
client.close()
```

Release the underlying HTTP connection pool. Prefer the context-manager form in application code.
# Job
The handle returned by client.run() and client.get(). Mutable: wait() and refresh() update fields in place so you can read them after the call without another round-trip. Not safe to share across threads.
Attributes
| Attribute | Type | Notes |
|---|---|---|
| id | str | Server-issued UUID. |
| status | JobStatus | queued, placed, running, completed, failed, cancelled. |
| supplier | str \| None | Resolved supplier name. None until placement. |
| region | str \| None | Region the job landed in. |
| cost_usd | float \| None | Final cost. Populated once terminal. |
| error_message | str \| None | Set if status == FAILED. |
| is_terminal | bool | True when status is completed, failed, or cancelled. |
| succeeded | bool | Shortcut for status == COMPLETED. |
Methods
Method

```python
job.refresh() → Job
```
Re-fetch and update fields in place.
Method

```python
job.wait(*, poll_seconds=2.0, timeout_seconds=None) → Job
```

Poll until the job reaches a terminal state. Raises TimeoutError if timeout_seconds elapses first. Pass timeout_seconds=None to wait forever.
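Conceptually, wait() is just a poll loop over refresh() with a deadline. A minimal sketch of the same loop against any status source (illustrative; not the SDK's internals):

```python
import time

TERMINAL = {"completed", "failed", "cancelled"}

def wait_until_terminal(get_status, *, poll_seconds=2.0, timeout_seconds=None):
    """Poll a zero-arg status callable until it returns a terminal state,
    or raise TimeoutError once the deadline passes."""
    deadline = None if timeout_seconds is None else time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

The monotonic clock keeps the deadline immune to wall-clock adjustments, which matters for multi-hour waits.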
# AsyncClient & AsyncJob
Identical to Client and Job, with every I/O method made awaitable. Use async with nodus.AsyncClient() as client: and call await client.aclose() when not using the context manager.

- await client.run(…) → AsyncJob
- await client.get(job_id) → AsyncJob
- await client.list(limit=50) → list[AsyncJob]
- await job.refresh(), await job.wait()
# Types
Enums and models are duplicated inside the SDK (not imported from nodus_core) so the SDK stays a zero-internal-dependency package you can pin independently of the control plane.
nodus.GpuType
| Member | Wire value |
|---|---|
| H100_80GB | "h100_80gb" |
| A100_80GB | "a100_80gb" |
| A100_40GB | "a100_40gb" |
| L40S | "l40s" |
| A10G | "a10g" |
nodus.SupplierKind
| Member | Includes |
|---|---|
| HYPERSCALER | AWS, GCP, Azure, OCI. |
| TIER3 | CoreWeave, Lambda, Crusoe. |
| NEOCLOUD | RunPod, Vast, Hyperbolic. |
nodus.JobStatus
QUEUED, PLACED, RUNNING, COMPLETED, FAILED, CANCELLED. The last three are terminal.
nodus.JobSpec
Validated client-side copy of the submission payload. You rarely instantiate it directly — client.run() builds one for you — but it is exposed for libraries that want to construct specs ahead of time.
# Errors
Every failure mode is a subclass of nodus.NodusError. Catch specific classes for recoverable cases; catch the base for a last-resort logger. Each instance carries a stable .code string, a human-readable .message, .status_code (if HTTP), .request_id, and the raw .payload.
| Class | When | Example .code |
|---|---|---|
| ConfigurationError | Before any network call. Missing API key or bad config. | missing_api_key |
| AuthenticationError | 401 or 403 from the API. | invalid_api_key |
| NotFoundError | 404 from the API. | job_not_found |
| ValidationError | 400 or 422 from the API. | invalid_spec |
| RateLimitError | 429 from the API. Has .retry_after in seconds. | rate_limited |
| SupplierUnavailableError | 503 — no supplier could satisfy the spec right now. | no_supplier |
| BudgetExceededError | 402 — submission would breach the key's monthly spend cap. See Keys & budgets. | budget_exceeded |
| SignatureError | 401 — Nodus-Signature was missing, stale, or failed MAC verification. See Signed requests. | signature_invalid |
| APIError | Any other 4xx or 5xx. | http_error |
| APIConnectionError | Network-layer failure. DNS, TCP reset, TLS. | connection_error |
| APITimeoutError | Request exceeded the client timeout. | timeout |
```python
>>> import time, nodus
>>>
>>> try:
...     client.run(image="ghcr.io/acme/train:v3", gpu="h100_80gb")
... except nodus.AuthenticationError:
...     raise  # bad / missing API key — fail loud
... except nodus.RateLimitError as e:
...     time.sleep(e.retry_after or 1)
... except nodus.SupplierUnavailableError:
...     # no capacity right now; back off, try again later
...     ...
... except nodus.ValidationError as e:
...     print(e.code, e.message, e.request_id)
```
# Keys & budgets
When an agent or service needs to run jobs on its own, mint it a dedicated key with a monthly spend cap. The cap is enforced at submission time against the sum of realised spend plus the estimated cost of every in-flight job issued by that key, so a fleet of agents can burn through budget concurrently without racing each other into overspend.
Each key carries a principal kind. This is the one bit of attribution that propagates all the way to the Trust Ledger row and onto the Stripe meter event, so finance can always answer “how much did the agents spend last month?” without joining three tables.
| Principal kind | Intended use | Typical cap |
|---|---|---|
| human | Console-issued keys a person uses from a laptop or a CI job they own. | None or generous. |
| agent | An LLM-driven loop, a scheduled planner, a RAG pipeline that dispatches training itself. | Tight. Budget is the blast radius. |
| service | A first-party internal service — a training orchestrator, a batch refresher. | Per-service, tuned to the workload. |
```shell
# One-time, from an operator shell with DB access.
$ python -m nodus_ledger.admin mint \
    --customer-id acme \
    --kind agent \
    --label planner-prod \
    --monthly-cap-usd 2500 \
    --signing-required

api_key         nk_live_7Z4d…q3
signing_secret  sgn_9bfe…14     # shown once, never again
key_id          0f3c-…-b210
principal_kind  agent
monthly_cap     $2500.00
```
The plaintext API key and signing secret are displayed exactly once at mint time. The database stores only hashes; the plaintext signing secret is also written to the control plane's Redis cache so request verification can run without a second round-trip. Treat both values like you would a production database password.
When a submission would push month-to-date spend past the cap, the API answers HTTP 402 budget_exceeded and the SDK raises BudgetExceededError. The error payload tells the caller exactly how much headroom it has, so an agent can shrink the spec and retry without a separate usage query.
```python
>>> try:
...     client.run(image="ghcr.io/acme/eval:v2", gpu="h100_80gb", gpu_count=8)
... except nodus.BudgetExceededError as e:
...     headroom = e.payload["monthly_spend_cap_usd"] - e.payload["month_to_date_usd"]
...     # downscale and retry, or escalate to a human
...     log.warning("budget blocked", quoted=e.payload["estimated_cost_usd"], headroom=headroom)
```
Subscribe to the budget.threshold_reached webhook (default trigger: 80% of cap) to surface near-exhaustion to an operator before the first 402 ever fires. See Webhooks.
# Signed requests
Keys minted with --signing-required must accompany every request with an HMAC-SHA256 signature covering the method, path, body, and a timestamp. This pins each request to a single operation and a narrow time window (default tolerance: 5 minutes), so a leaked bearer alone cannot replay writes and a stolen log cannot be mined for live credentials.
Signed string format (identical to the webhook scheme, so one implementation covers both sides):
```
signed_string = f"{ts}.{METHOD}.{path}.{sha256_hex(body)}"
mac = HMAC-SHA256(signing_secret, signed_string).hexdigest()

Nodus-Signature: v1=<mac>,ts=<unix_seconds>
```
The SDK ships a helper so you don't have to reimplement this. It returns a dict of headers you can merge into any request — useful if you're calling an endpoint the SDK doesn't yet wrap, or building request signing into a different HTTP stack.
```python
>>> import json, os, httpx, nodus
>>>
>>> body = json.dumps({"image": "ghcr.io/acme/train:v3", "gpu": "h100_80gb"}).encode()
>>> headers = {
...     "Authorization": "Bearer " + os.environ["NODUS_API_KEY"],
...     "Content-Type": "application/json",
... }
>>> headers.update(nodus.sign_request(
...     signing_secret=os.environ["NODUS_SIGNING_SECRET"],
...     method="POST",
...     path="/v1/jobs",
...     body=body,
... ))
>>> httpx.post("https://api.nodus.run/v1/jobs", content=body, headers=headers)
<Response [202 Accepted]>
```
Verification failures surface as SignatureError (401 signature_invalid, signature_stale, or signing_secret_unavailable). The correct response to each is different: re-sign with the same secret, resync the clock, or rotate the key, respectively. Don't collapse them into a generic retry.
Body canonicalisation: always sign the exact bytes you put on the wire. Re-serialising the body between signing and sending (e.g. letting httpx build JSON from a dict after you signed a pre-encoded copy) will break verification. Sign once, send once.
# Webhooks
Instead of polling, register an HTTPS endpoint and Nodus will POST state-change events to it. Every delivery is signed with the same v1 HMAC scheme as outbound requests, so one verifier covers both directions. All deliveries are audited — delivery attempts, HTTP status, latency, and next retry are written to the Trust Ledger and exposed on the console.
| Event | Fires when | Payload extras |
|---|---|---|
| job.placed | The API has accepted a submission and enqueued it for dispatch. | estimated_cost_usd, principal_kind |
| job.completed | The worker reports a successful terminal state. | cost_usd, supplier, region, duration_s |
| job.failed | Terminal failure — all retries and failovers exhausted. | error_code, error_message, last_supplier |
| budget.threshold_reached | Month-to-date spend on a key crosses the configured fraction (default 80%) of its cap. | month_to_date_usd, monthly_spend_cap_usd, fraction |
| budget.exceeded | A submission was blocked by the monthly cap. | month_to_date_usd, monthly_spend_cap_usd, estimated_cost_usd |
Register an endpoint and subscribe it to one or more events. The signing secret is returned once in the response body; store it next to your other secrets.
```shell
$ curl -sS -X POST https://api.nodus.run/v1/webhooks \
    -H "Authorization: Bearer $NODUS_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "url": "https://hooks.acme.internal/nodus",
      "events": ["job.completed", "job.failed", "budget.threshold_reached"]
    }'

{
  "id": "wh_b71c...",
  "url": "https://hooks.acme.internal/nodus",
  "events": ["job.completed", "job.failed", "budget.threshold_reached"],
  "secret": "whsec_4a9e...a1"   // shown once
}
```
Verify deliveries on your side with the same SDK helper that signs outbound requests, in reverse. The signed path is the one on your endpoint (strip any reverse-proxy mount prefix first).
```python
>>> # FastAPI route, but the helper is framework-agnostic.
>>> @app.post("/nodus")
... async def inbound(request: Request):
...     body = await request.body()
...     ok = nodus.verify_webhook(
...         signing_secret=os.environ["NODUS_WEBHOOK_SECRET"],
...         method="POST",
...         path="/nodus",
...         body=body,
...         signature_header=request.headers.get("Nodus-Signature"),
...     )
...     if not ok:
...         raise HTTPException(401, "bad signature")
...     event = json.loads(body)
...     # event["type"] in {"job.completed", "job.failed", ...}
```
Deliveries use exponential backoff with capped jitter (roughly: immediate, 30s, 2m, 10m, 1h) and are marked permanently failed after the final attempt. The Trust Ledger keeps the full attempt history so you can replay manually from the console if your endpoint was down.
# Retries & idempotency
Both clients retry transient failures automatically: network errors, HTTP 408, 409, 425, 429, and 5xx. Retry-After is honoured on 429; everything else uses decorrelated-jitter backoff. The default budget is two retries; raise or lower with max_retries=.
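Decorrelated jitter, assumed here in its commonly described form (each sleep drawn uniformly between the base and three times the previous sleep, then capped), spreads concurrent retries apart instead of synchronising them into thundering herds. The constants below are illustrative, not the SDK's:

```python
import random

def decorrelated_jitter(base: float = 0.5, cap: float = 30.0):
    """Yield an endless sequence of backoff sleeps. Each draw depends on
    the previous one, so a fleet of retrying clients quickly desynchronises.
    (Sketch of the general technique; not the SDK's exact parameters.)"""
    sleep = base
    while True:
        sleep = min(cap, random.uniform(base, sleep * 3))
        yield sleep
```

Compared with plain exponential backoff, the randomised lower bound means two clients that failed at the same instant almost never wake up at the same instant again.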
POST /v1/jobs is retried safely because the SDK generates an idempotency key for every run() call the caller did not key themselves. If you pass your own idempotency_key, the SDK forwards it unchanged and the server guarantees that submitting the same key twice returns the original job.
# Versioning
SemVer. The SDK is versioned independently of the control plane: a major bump on the server never forces a major bump on the SDK. New wire fields are additive — older SDK versions ignore fields they don't know about (JobView uses extra="ignore"). We will publish a deprecation notice at least one minor release before removing any public symbol.

The public surface is everything re-exported from nodus/__init__.py. Anything prefixed with an underscore, or in nodus._http, is internal and can change without notice.