Python SDK
A thin client over the Nodus REST API. Sync and async, with typed errors, automatic retries, idempotent job submission, and a small CLI. Versioned and released independently of the control plane.
# Playground
Configure a job spec, watch it route, fail over, and land in the Trust Ledger — all in your browser. No API key required. The simulator runs the same state machine the real control plane does, against a pool of vetted mock suppliers with realistic latency and pricing. Flip "Simulate a supplier outage" to see the failover path.
# Install
Python 3.10 or newer. The only runtime dependencies are httpx and pydantic.

```shell
$ pip install nodus-sdk        # or: uv pip install nodus-sdk
```
# Authentication
Every client reads its credentials from the environment by default. Pass api_key= explicitly when you need multiple keys in one process.

| Variable | Default | Notes |
|---|---|---|
| NODUS_API_KEY | — | Required. Issued from the console as nk_live_… / nk_test_…. |
| NODUS_BASE_URL | https://api.nodus.run | Override for staging or a self-hosted control plane. |
# Quick start
Submit a job, wait for it, print the result. This is the whole shape of the SDK; every other call is a variant of one of these three primitives.
```python
# export NODUS_API_KEY=nk_live_...
>>> import os, nodus
>>>
>>> with nodus.Client() as client:
...     job = client.run(
...         image="ghcr.io/acme/train:v3",
...         command=["python", "train.py", "--epochs", "10"],
...         gpu="h100_80gb",
...         gpu_count=8,
...         env={"WANDB_API_KEY": os.environ["WANDB_API_KEY"]},
...         max_runtime_seconds=18 * 3600,
...     )
...     job.wait(timeout_seconds=20 * 3600)
...     print(job.status, job.supplier, job.cost_usd)
JobStatus.COMPLETED lambda 47.62
```
run() returns immediately with a Job handle. Call wait() to block until the job reaches a terminal state, or refresh() to fetch the current state on demand.
# Async
The async client mirrors the sync one exactly. Same method names, same arguments, same return types — just await them.
```python
>>> import asyncio, nodus
>>>
>>> async def main():
...     async with nodus.AsyncClient() as client:
...         job = await client.run(image="ghcr.io/acme/train:v3", gpu="h100_80gb")
...         await job.wait()
...         print(job.status, job.cost_usd)
>>>
>>> asyncio.run(main())
```
# CLI
Installing the package puts a nodus command on your $PATH. Anything you can do in three lines of Python you can do in one shell line — useful for one-off jobs, health checks in deploy scripts, and CI smoke tests.
```shell
$ export NODUS_API_KEY=nk_live_...
$ nodus run \
    --image ghcr.io/acme/train:v3 --gpu h100_80gb --gpu-count 8 \
    --env WANDB_API_KEY=$WANDB_API_KEY \
    --wait --timeout 64800 \
    -- python train.py --epochs 10
$ nodus list --limit 5
$ nodus get 01HX8ZV... --wait
```
# Client
nodus.Client is the sync entry point. One instance per process is enough — httpx handles connection pooling internally. Use it as a context manager or call close() explicitly.

Class

```python
nodus.Client(api_key=None, *, base_url=None, timeout=30.0, max_retries=2)
```
| Parameter | Type | Default | Notes |
|---|---|---|---|
| api_key | str \| None | env NODUS_API_KEY | Raises ConfigurationError if neither is set. |
| base_url | str \| None | https://api.nodus.run | Overridden by env NODUS_BASE_URL. |
| timeout | float | 30.0 | Per-request seconds. |
| max_retries | int | 2 | Transient failures only (network, 429, 5xx). Set 0 to disable. |
Methods
Method

```python
client.run(*, image, command=None, gpu="h100_80gb", gpu_count=1, env=None, max_runtime_seconds=3600, require_tiers=None, prefer_region=None, idempotency_key=None) → Job
```
Submit a job. Returns immediately with a Job handle. If idempotency_key is omitted the SDK generates one so the built-in retries are safe. Pass your own key for caller-controlled deduplication (e.g. retrying from a workflow engine).
| Parameter | Type | Notes |
|---|---|---|
| image | str | OCI image reference, e.g. ghcr.io/acme/train:v3. |
| command | list[str] \| None | Argv list. None uses the image CMD. |
| gpu | GpuType \| str | Enum or its wire value. Defaults to h100_80gb. |
| gpu_count | int | 1–8. |
| env | dict[str, str] \| None | Environment variables injected into the container. |
| max_runtime_seconds | int | 60 to 172 800 (48 hours). |
| require_tiers | list[SupplierKind] \| None | Hard filter on supplier classes. |
| prefer_region | str \| None | Soft preference; nudges placement score. |
| idempotency_key | str \| None | Max 128 chars. Submitting the same key twice returns the original job. |
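For caller-controlled deduplication, the key only needs to be stable across re-executions. One way to get that (an illustrative helper, not part of the SDK) is to derive it from the workflow step id and a canonical serialisation of the spec:

```python
import hashlib
import json

def idempotency_key_for(step_id: str, spec: dict) -> str:
    """Derive a stable, <=128-char idempotency key from a workflow step id
    and its job spec, so re-executions of the step reuse one submission.
    (Illustrative helper; not part of the SDK.)"""
    digest = hashlib.sha256(
        json.dumps(spec, sort_keys=True).encode()
    ).hexdigest()
    return f"{step_id}-{digest}"[:128]

key = idempotency_key_for(
    "train-sweep-7",
    {"image": "ghcr.io/acme/train:v3", "gpu": "h100_80gb"},
)
```

Pass the result as idempotency_key=key; sort_keys=True matters, since two dicts with the same entries in different order must produce the same key.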
Method

```python
client.list(*, limit: int = 50) → list[Job]
```

List recent jobs for the authenticated customer. Ordered by creation time, newest first. limit is capped server-side.
Method

```python
client.iter_jobs(*, page_size: int = 50) → Iterator[Job]
```
Convenience iterator. Shaped like a cursor-paginated API so your code won’t change when the server gains a real cursor.
Method

```python
client.healthz() → dict
```
Hit the control-plane health endpoint. Useful as a deploy-time smoke test.
Context manager

```python
with nodus.Client() as client:
    ...
# or, without the context manager:
client.close()
```

Release the underlying HTTP connection pool. Prefer the context-manager form in application code.
# Job
The handle returned by client.run() and client.get(). Mutable: wait() and refresh() update fields in place so you can read them after the call without another round-trip. Not safe to share across threads.
Attributes
| Attribute | Type | Notes |
|---|---|---|
| id | str | Server-issued UUID. |
| status | JobStatus | queued, placed, running, completed, failed, cancelled. |
| supplier | str \| None | Resolved supplier name. None until placement. |
| region | str \| None | Region the job landed in. |
| cost_usd | float \| None | Final cost. Populated once terminal. |
| error_message | str \| None | Set if status == FAILED. |
| is_terminal | bool | True when status is completed, failed, or cancelled. |
| succeeded | bool | Shortcut for status == COMPLETED. |
Methods
Method

```python
job.refresh() → Job
```
Re-fetch and update fields in place.
Method

```python
job.wait(*, poll_seconds=2.0, timeout_seconds=None) → Job
```

Poll until the job reaches a terminal state. Raises TimeoutError if timeout_seconds elapses first. Pass timeout_seconds=None to wait forever.
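Conceptually, wait() is just a poll loop over refresh() with a deadline. A minimal sketch of the same loop against any status source (illustrative; not the SDK's internals):

```python
import time

TERMINAL = {"completed", "failed", "cancelled"}

def wait_until_terminal(get_status, *, poll_seconds=2.0, timeout_seconds=None):
    """Poll a zero-arg status callable until it returns a terminal state,
    or raise TimeoutError once the deadline passes."""
    deadline = None if timeout_seconds is None else time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

The monotonic clock keeps the deadline immune to wall-clock adjustments, which matters for multi-hour waits.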
# AsyncClient & AsyncJob
Identical to Client and Job, with every I/O method made awaitable. Use async with nodus.AsyncClient() as client: and call await client.aclose() when not using the context manager.

- await client.run(…) → AsyncJob
- await client.get(job_id) → AsyncJob
- await client.list(limit=50) → list[AsyncJob]
- await job.refresh(), await job.wait()
# Types
Enums and models are duplicated inside the SDK (not imported from nodus_core) so the SDK stays a zero-internal-dependency package you can pin independently of the control plane.
nodus.GpuType
| Member | Wire value |
|---|---|
| H100_80GB | "h100_80gb" |
| A100_80GB | "a100_80gb" |
| A100_40GB | "a100_40gb" |
| L40S | "l40s" |
| A10G | "a10g" |
nodus.SupplierKind
| Member | Includes |
|---|---|
| HYPERSCALER | AWS, GCP, Azure, OCI. |
| TIER3 | CoreWeave, Lambda, Crusoe. |
| NEOCLOUD | RunPod, Vast, Hyperbolic. |
nodus.JobStatus
QUEUED, PLACED, RUNNING, COMPLETED, FAILED, CANCELLED. The last three are terminal.
nodus.JobSpec
Validated client-side copy of the submission payload. You rarely instantiate it directly — client.run() builds one for you — but it is exposed for libraries that want to construct specs ahead of time.
# Errors
Every failure mode is a subclass of nodus.NodusError. Catch specific classes for recoverable cases; catch the base for a last-resort logger. Each instance carries a stable .code string, a human-readable .message, .status_code (if HTTP), .request_id, and the raw .payload.
| Class | When | Example .code |
|---|---|---|
| ConfigurationError | Before any network call. Missing API key or bad config. | missing_api_key |
| AuthenticationError | 401 or 403 from the API. | invalid_api_key |
| NotFoundError | 404 from the API. | job_not_found |
| ValidationError | 400 or 422 from the API. | invalid_spec |
| RateLimitError | 429 from the API. Has .retry_after in seconds. | rate_limited |
| SupplierUnavailableError | 503 — no supplier could satisfy the spec right now. | no_supplier |
| BudgetExceededError | 402 — submission would breach the key's monthly spend cap. See Keys & budgets. | budget_exceeded |
| SignatureError | 401 — Nodus-Signature was missing, stale, or failed MAC verification. See Signed requests. | signature_invalid |
| APIError | Any other 4xx or 5xx. | http_error |
| APIConnectionError | Network-layer failure. DNS, TCP reset, TLS. | connection_error |
| APITimeoutError | Request exceeded the client timeout. | timeout |
```python
>>> import time, nodus
>>>
>>> try:
...     client.run(image="ghcr.io/acme/train:v3", gpu="h100_80gb")
... except nodus.AuthenticationError:
...     raise  # bad / missing API key — fail loud
... except nodus.RateLimitError as e:
...     time.sleep(e.retry_after or 1)
... except nodus.SupplierUnavailableError:
...     # no capacity right now; back off, try again later
...     ...
... except nodus.ValidationError as e:
...     print(e.code, e.message, e.request_id)
```
# Keys & budgets
When an agent or service needs to run jobs on its own, mint it a dedicated key with a monthly spend cap. The cap is enforced at submission time against the sum of realised spend plus the estimated cost of every in-flight job issued by that key, so a fleet of agents can burn through budget concurrently without racing each other into overspend.
Each key carries a principal kind. This is the one bit of attribution that propagates all the way to the Trust Ledger row and onto the Stripe meter event, so finance can always answer “how much did the agents spend last month?” without joining three tables.
| Principal kind | Intended use | Typical cap |
|---|---|---|
| human | Console-issued keys a person uses from a laptop or a CI job they own. | None or generous. |
| agent | An LLM-driven loop, a scheduled planner, a RAG pipeline that dispatches training itself. | Tight. Budget is the blast radius. |
| service | A first-party internal service — a training orchestrator, a batch refresher. | Per-service, tuned to the workload. |
```shell
# One-time, from an operator shell with DB access.
$ python -m nodus_ledger.admin mint \
    --customer-id acme \
    --kind agent \
    --label planner-prod \
    --monthly-cap-usd 2500 \
    --signing-required

api_key         nk_live_7Z4d…q3
signing_secret  sgn_9bfe…14     # shown once, never again
key_id          0f3c-…-b210
principal_kind  agent
monthly_cap     $2500.00
```
The plaintext API key and signing secret are displayed exactly once at mint time. The database stores only hashes; the plaintext signing secret is also written to the control plane's Redis cache so request verification can run without a second round-trip. Treat both values like you would a production database password.
When a submission would push month-to-date spend past the cap, the API answers HTTP 402 budget_exceeded and the SDK raises BudgetExceededError. The error payload tells the caller exactly how much headroom it has, so an agent can shrink the spec and retry without a separate usage query.
```python
>>> try:
...     client.run(image="ghcr.io/acme/eval:v2", gpu="h100_80gb", gpu_count=8)
... except nodus.BudgetExceededError as e:
...     headroom = e.payload["monthly_spend_cap_usd"] - e.payload["month_to_date_usd"]
...     # downscale and retry, or escalate to a human
...     log.warning("budget blocked", quoted=e.payload["estimated_cost_usd"], headroom=headroom)
```
Subscribe to the budget.threshold_reached webhook (default trigger: 80% of cap) to surface near-exhaustion to an operator before the first 402 ever fires. See Webhooks.
# Signed requests
Keys minted with --signing-required must accompany every request with an HMAC-SHA256 signature covering the method, path, body, and a timestamp. This pins each request to a single operation and a narrow time window (default tolerance: 5 minutes), so a leaked bearer alone cannot replay writes and a stolen log cannot be mined for live credentials.
Signed string format (identical to the webhook scheme, so one implementation covers both sides):
```
signed_string = f"{ts}.{METHOD}.{path}.{sha256_hex(body)}"
mac = HMAC-SHA256(signing_secret, signed_string).hexdigest()

Nodus-Signature: v1=<mac>,ts=<unix_seconds>
```
The SDK ships a helper so you don't have to reimplement this. It returns a dict of headers you can merge into any request — useful if you're calling an endpoint the SDK doesn't yet wrap, or building request signing into a different HTTP stack.
```python
>>> import json, os, httpx, nodus
>>>
>>> body = json.dumps({"image": "ghcr.io/acme/train:v3", "gpu": "h100_80gb"}).encode()
>>> headers = {
...     "Authorization": "Bearer " + os.environ["NODUS_API_KEY"],
...     "Content-Type": "application/json",
... }
>>> headers.update(nodus.sign_request(
...     signing_secret=os.environ["NODUS_SIGNING_SECRET"],
...     method="POST",
...     path="/v1/jobs",
...     body=body,
... ))
>>> httpx.post("https://api.nodus.run/v1/jobs", content=body, headers=headers)
<Response [202 Accepted]>
```
Verification failures surface as SignatureError (401 signature_invalid, signature_stale, or signing_secret_unavailable). The correct response to each is different: re-sign with the same secret, resync the clock, or rotate the key, respectively. Don't collapse them into a generic retry.
Body canonicalisation: always sign the exact bytes you put on the wire. Re-serialising the body between signing and sending (e.g. letting httpx build JSON from a dict after you signed a pre-encoded copy) will break verification. Sign once, send once.
# Webhooks
Instead of polling, register an HTTPS endpoint and Nodus will POST state-change events to it. Every delivery is signed with the same v1 HMAC scheme as outbound requests, so one verifier covers both directions. All deliveries are audited — delivery attempts, HTTP status, latency, and next retry are written to the Trust Ledger and exposed on the console.
| Event | Fires when | Payload extras |
|---|---|---|
| job.placed | The API has accepted a submission and enqueued it for dispatch. | estimated_cost_usd, principal_kind |
| job.completed | The worker reports a successful terminal state. | cost_usd, supplier, region, duration_s |
| job.failed | Terminal failure — all retries and failovers exhausted. | error_code, error_message, last_supplier |
| budget.threshold_reached | Month-to-date spend on a key crosses the configured fraction (default 80%) of its cap. | month_to_date_usd, monthly_spend_cap_usd, fraction |
| budget.exceeded | A submission was blocked by the monthly cap. | month_to_date_usd, monthly_spend_cap_usd, estimated_cost_usd |
Register an endpoint and subscribe it to one or more events. The signing secret is returned once in the response body; store it next to your other secrets.
```shell
$ curl -sS -X POST https://api.nodus.run/v1/webhooks \
    -H "Authorization: Bearer $NODUS_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "url": "https://hooks.acme.internal/nodus",
      "events": ["job.completed", "job.failed", "budget.threshold_reached"]
    }'

{
  "id": "wh_b71c...",
  "url": "https://hooks.acme.internal/nodus",
  "events": ["job.completed", "job.failed", "budget.threshold_reached"],
  "secret": "whsec_4a9e...a1"   // shown once
}
```
Verify deliveries on your side with the same SDK helper that signs outbound requests, in reverse. The signed path is the one on your endpoint (strip any reverse-proxy mount prefix first).
```python
>>> # FastAPI route, but the helper is framework-agnostic.
>>> @app.post("/nodus")
... async def inbound(request: Request):
...     body = await request.body()
...     ok = nodus.verify_webhook(
...         signing_secret=os.environ["NODUS_WEBHOOK_SECRET"],
...         method="POST",
...         path="/nodus",
...         body=body,
...         signature_header=request.headers.get("Nodus-Signature"),
...     )
...     if not ok:
...         raise HTTPException(401, "bad signature")
...     event = json.loads(body)
...     # event["type"] in {"job.completed", "job.failed", ...}
```
Deliveries use exponential backoff with capped jitter (roughly: immediate, 30s, 2m, 10m, 1h) and are marked permanently failed after the final attempt. The Trust Ledger keeps the full attempt history so you can replay manually from the console if your endpoint was down.
# Retries & idempotency
Both clients retry transient failures automatically: network errors, HTTP 408, 409, 425, 429, and 5xx. Retry-After is honoured on 429; everything else uses decorrelated-jitter backoff. The default budget is two retries; raise or lower with max_retries=.
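Decorrelated jitter, assumed here in its commonly described form (each sleep drawn uniformly between the base and three times the previous sleep, then capped), spreads concurrent retries apart instead of synchronising them into thundering herds. The constants below are illustrative, not the SDK's:

```python
import random

def decorrelated_jitter(base: float = 0.5, cap: float = 30.0):
    """Yield an endless sequence of backoff sleeps. Each draw depends on
    the previous one, so a fleet of retrying clients quickly desynchronises.
    (Sketch of the general technique; not the SDK's exact parameters.)"""
    sleep = base
    while True:
        sleep = min(cap, random.uniform(base, sleep * 3))
        yield sleep
```

Compared with plain exponential backoff, the randomised lower bound means two clients that failed at the same instant almost never wake up at the same instant again.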
POST /v1/jobs is retried safely because the SDK generates an idempotency key for every run() call the caller did not key themselves. If you pass your own idempotency_key, the SDK forwards it unchanged and the server guarantees that submitting the same key twice returns the original job.
# Versioning
SemVer. The SDK is versioned independently of the control plane: a major bump on the server never forces a major bump on the SDK. New wire fields are additive — older SDK versions ignore fields they don't know about (JobView uses extra="ignore"). We will publish a deprecation notice at least one minor release before removing any public symbol.

The public surface is everything re-exported from nodus/__init__.py. Anything prefixed with an underscore, or in nodus._http, is internal and can change without notice.