Package: nodus-sdk · Language: Python ≥ 3.10 · Version: 0.1.0 · Status: Stable

Python SDK

A thin client over the Nodus REST API. Sync and async, with typed errors, automatic retries, idempotent job submission, and a small CLI. Versioned and released independently of the control plane.

# Playground

Configure a job spec, then watch it route, fail over, and land in the Trust Ledger, all in your browser. No API key required. The simulator runs the same state machine as the real control plane, against a pool of vetted mock suppliers with realistic latency and pricing. Toggle "Simulate a supplier outage" to see the failover path.


    # Install

    Python 3.10 or newer. The only runtime dependencies are httpx and pydantic.

    install ~
    $ pip install nodus-sdk
    # or: uv pip install nodus-sdk

    # Authentication

    Every client reads its credentials from the environment by default. Pass api_key= explicitly when you need multiple keys in one process.

    Variable          Default                  Notes
    NODUS_API_KEY     (none)                   Required. Issued from the console as nk_live_… / nk_test_….
    NODUS_BASE_URL    https://api.nodus.run    Override for staging or a self-hosted control plane.

    # Quick start

    Submit a job, wait for it, print the result. This is the whole shape of the SDK; every other call is a variant of one of these three primitives.

    quickstart · python ~/your-app
    # export NODUS_API_KEY=nk_live_...
    >>> import os, nodus
    >>>
    >>> with nodus.Client() as client:
    ...     job = client.run(
    ...         image="ghcr.io/acme/train:v3",
    ...         command=["python", "train.py", "--epochs", "10"],
    ...         gpu="h100_80gb",
    ...         gpu_count=8,
    ...         env={"WANDB_API_KEY": os.environ["WANDB_API_KEY"]},
    ...         max_runtime_seconds=18 * 3600,
    ...     )
    ...     job.wait(timeout_seconds=20 * 3600)
    ...     print(job.status, job.supplier, job.cost_usd)
    JobStatus.COMPLETED lambda 47.62

    run() returns immediately with a Job handle. Call wait() to block until the job reaches a terminal state, or refresh() to fetch the current state on demand.

    # Async

    The async client mirrors the sync one exactly. Same method names, same arguments, same return types — just await them.

    asyncio · python ~/your-app
    >>> import asyncio, nodus
    >>>
    >>> async def main():
    ...     async with nodus.AsyncClient() as client:
    ...         job = await client.run(image="ghcr.io/acme/train:v3", gpu="h100_80gb")
    ...         await job.wait()
    ...         print(job.status, job.cost_usd)
    >>>
    >>> asyncio.run(main())

    # CLI

    Installing the package puts a nodus command on your $PATH. Anything you can do in three lines of Python you can do in one shell line — useful for one-off jobs, health checks in deploy scripts, and CI smoke tests.

    nodus · cli ~
    $ export NODUS_API_KEY=nk_live_...
    
    $ nodus run \
        --image ghcr.io/acme/train:v3 --gpu h100_80gb --gpu-count 8 \
        --env WANDB_API_KEY=$WANDB_API_KEY \
        --wait --timeout 64800 \
        -- python train.py --epochs 10
    
    $ nodus list --limit 5
    $ nodus get 01HX8ZV... --wait

    # Client

    nodus.Client is the sync entry point. One instance per process is enough — httpx handles connection pooling internally. Use it as a context manager or call close() explicitly.

    Class

    nodus.Client(api_key=None, *, base_url=None, timeout=30.0, max_retries=2)
    Parameter      Type          Default                  Notes
    api_key        str | None    env NODUS_API_KEY        Raises ConfigurationError if neither is set.
    base_url       str | None    https://api.nodus.run    The default is overridden by env NODUS_BASE_URL.
    timeout        float         30.0                     Per-request timeout, in seconds.
    max_retries    int           2                        Transient failures only (network errors, 429, 5xx, and the other statuses listed under Retries & idempotency). Set to 0 to disable.

    Methods

    Method

    client.run(*, image, command=None, gpu="h100_80gb", gpu_count=1, env=None,
               max_runtime_seconds=3600, require_tiers=None, prefer_region=None,
               idempotency_key=None) → Job

    Submit a job. Returns immediately with a Job handle. If idempotency_key is omitted the SDK generates one so the built-in retries are safe. Pass your own key for caller-controlled deduplication (e.g. retrying from a workflow engine).

    Parameter              Type                         Notes
    image                  str                          OCI image reference, e.g. ghcr.io/acme/train:v3.
    command                list[str] | None             Argv list. None uses the image CMD.
    gpu                    GpuType | str                Enum or its wire value. Defaults to "h100_80gb".
    gpu_count              int                          1 to 8.
    env                    dict[str, str] | None        Environment variables injected into the container.
    max_runtime_seconds    int                          60 to 172800 (48 hours).
    require_tiers          list[SupplierKind] | None    Hard filter on supplier classes.
    prefer_region          str | None                   Soft preference; nudges the placement score.
    idempotency_key        str | None                   Max 128 chars. Submitting the same key twice returns the original job.

    Method

    client.get(job_id: str) → Job

    Fetch a single job by id.

    Method

    client.list(*, limit: int = 50) → list[Job]

    List recent jobs for the authenticated customer. Ordered by creation time, newest first. limit is capped server-side.

    Method

    client.iter_jobs(*, page_size: int = 50) → Iterator[Job]

    Convenience iterator. Shaped like a cursor-paginated API so your code won’t change when the server gains a real cursor.

    Method

    client.healthz() → dict

    Hit the control-plane health endpoint. Useful as a deploy-time smoke test.

    Context manager

    with nodus.Client() as client: ...
    client.close()

    Release the underlying HTTP connection pool. Prefer the context-manager form in application code.

    # Job

    The handle returned by client.run() and client.get(). Mutable: wait() and refresh() update fields in place so you can read them after the call without another round-trip. Not safe to share across threads.

    Attributes

    Attribute        Type            Notes
    id               str             Server-issued UUID.
    status           JobStatus       queued, placed, running, completed, failed, cancelled.
    supplier         str | None      Resolved supplier name. None until placement.
    region           str | None      Region the job landed in.
    cost_usd         float | None    Final cost. Populated once terminal.
    error_message    str | None      Set if status == FAILED.
    is_terminal      bool            True when status is completed, failed, or cancelled.
    succeeded        bool            Shortcut for status == COMPLETED.

    Methods

    Method

    job.refresh() → Job

    Re-fetch and update fields in place.

    Method

    job.wait(*, poll_seconds=2.0, timeout_seconds=None) → Job

    Poll every poll_seconds until the job reaches a terminal state. Raises TimeoutError if timeout_seconds elapses first. With the default timeout_seconds=None, it waits indefinitely.
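The polling behaviour described above can be sketched with the standard library alone. This is an illustration, not the SDK's implementation: `wait_for_terminal` and `fetch_status` are hypothetical names, and the injectable `sleep` and `clock` arguments exist only so the loop can be exercised without real delays.

```python
import time
from typing import Callable, Optional

# Terminal states as listed for JobStatus in this document.
TERMINAL = {"completed", "failed", "cancelled"}


def wait_for_terminal(
    fetch_status: Callable[[], str],
    poll_seconds: float = 2.0,
    timeout_seconds: Optional[float] = None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll fetch_status() until it returns a terminal status.

    Raises TimeoutError if timeout_seconds elapses first; None waits forever.
    """
    deadline = None if timeout_seconds is None else clock() + timeout_seconds
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if deadline is not None and clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_seconds}s")
        sleep(poll_seconds)
```

A monotonic clock is used for the deadline so that wall-clock adjustments during a long job do not shorten or extend the wait.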

    # AsyncClient & AsyncJob

    Identical to Client and Job, with every I/O method made awaitable. Use async with nodus.AsyncClient() as client: and call await client.aclose() when not using the context manager.

    • await client.run(…) → AsyncJob
    • await client.get(job_id) → AsyncJob
    • await client.list(limit=50) → list[AsyncJob]
    • await job.refresh(), await job.wait()

    # Types

    Enums and models are duplicated inside the SDK (not imported from nodus_core) so the SDK stays a zero-internal-dependency package you can pin independently of the control plane.

    nodus.GpuType

    Member       Wire value
    H100_80GB    "h100_80gb"
    A100_80GB    "a100_80gb"
    A100_40GB    "a100_40gb"
    L40S         "l40s"
    A10G         "a10g"

    nodus.SupplierKind

    Member         Includes
    HYPERSCALER    AWS, GCP, Azure, OCI.
    TIER3          CoreWeave, Lambda, Crusoe.
    NEOCLOUD       RunPod, Vast, Hyperbolic.

    nodus.JobStatus

    QUEUED, PLACED, RUNNING, COMPLETED, FAILED, CANCELLED. The last three are terminal.
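To show how "enum or its wire value" can work in practice, here is a minimal sketch of string-backed enums built from the members and wire values listed above. The class bodies mirror this document; `coerce_gpu` and `TERMINAL_STATUSES` are hypothetical helpers, and the SDK's real definitions may differ.

```python
from enum import Enum
from typing import Union


class GpuType(str, Enum):
    H100_80GB = "h100_80gb"
    A100_80GB = "a100_80gb"
    A100_40GB = "a100_40gb"
    L40S = "l40s"
    A10G = "a10g"


class JobStatus(str, Enum):
    QUEUED = "queued"
    PLACED = "placed"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# The last three statuses are terminal, per the description above.
TERMINAL_STATUSES = frozenset(
    {JobStatus.COMPLETED, JobStatus.FAILED, JobStatus.CANCELLED}
)


def coerce_gpu(value: Union[GpuType, str]) -> GpuType:
    """Accept either an enum member or its wire value (hypothetical helper)."""
    return value if isinstance(value, GpuType) else GpuType(value)
```

Looking up a member by value, as in `GpuType("l40s")`, raises ValueError for an unknown wire string, which gives a cheap client-side check before any network call.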

    nodus.JobSpec

    Validated client-side copy of the submission payload. You rarely instantiate it directly — client.run() builds one for you — but it is exposed for libraries that want to construct specs ahead of time.
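As an illustration of what client-side validation could look like, the sketch below enforces the limits documented for client.run(). It uses a plain dataclass rather than the SDK's own model, and `JobSpecSketch` is a hypothetical name; the real JobSpec may validate more fields or report errors differently.

```python
from dataclasses import dataclass
from typing import List, Optional

# Wire values from the GpuType table in this document.
VALID_GPUS = {"h100_80gb", "a100_80gb", "a100_40gb", "l40s", "a10g"}


@dataclass(frozen=True)
class JobSpecSketch:
    """Hypothetical stand-in for a validated submission payload."""

    image: str
    gpu: str = "h100_80gb"
    gpu_count: int = 1
    max_runtime_seconds: int = 3600
    command: Optional[List[str]] = None
    idempotency_key: Optional[str] = None

    def __post_init__(self) -> None:
        if not self.image:
            raise ValueError("image must be a non-empty OCI reference")
        if self.gpu not in VALID_GPUS:
            raise ValueError(f"unknown gpu {self.gpu!r}")
        if not 1 <= self.gpu_count <= 8:
            raise ValueError("gpu_count must be between 1 and 8")
        if not 60 <= self.max_runtime_seconds <= 172_800:
            raise ValueError("max_runtime_seconds must be between 60 and 172800")
        if self.idempotency_key is not None and len(self.idempotency_key) > 128:
            raise ValueError("idempotency_key is limited to 128 characters")
```

Validating before submission turns an avoidable 400 or 422 round-trip into an immediate local error.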

    # Errors

    Every failure mode is a subclass of nodus.NodusError. Catch specific classes for recoverable cases; catch the base for a last-resort logger. Each instance carries a stable .code string, a human-readable .message, .status_code (if HTTP), .request_id, and the raw .payload.

    Class                       When                                                                                                 Example .code
    ConfigurationError          Before any network call. Missing API key or bad config.                                              missing_api_key
    AuthenticationError         401 or 403 from the API.                                                                             invalid_api_key
    NotFoundError               404 from the API.                                                                                    job_not_found
    ValidationError             400 or 422 from the API.                                                                             invalid_spec
    RateLimitError              429 from the API. Has .retry_after in seconds.                                                       rate_limited
    SupplierUnavailableError    503: no supplier could satisfy the spec right now.                                                   no_supplier
    BudgetExceededError         402: submission would breach the key's monthly spend cap. See Keys & budgets.                        budget_exceeded
    SignatureError              401: Nodus-Signature was missing, stale, or failed MAC verification. See Signed requests.            signature_invalid
    APIError                    Any other 4xx or 5xx.                                                                                http_error
    APIConnectionError          Network-layer failure. DNS, TCP reset, TLS.                                                          connection_error
    APITimeoutError             Request exceeded the client timeout.                                                                 timeout
    errors · python ~/your-app
    >>> import time, nodus
    >>>
    >>> try:
    ...     client.run(image="ghcr.io/acme/train:v3", gpu="h100_80gb")
    ... except nodus.AuthenticationError:
    ...     raise   # bad / missing API key — fail loud
    ... except nodus.RateLimitError as e:
    ...     time.sleep(e.retry_after or 1)
    ... except nodus.SupplierUnavailableError:
    ...     # no capacity right now; back off, try again later
    ...     ...
    ... except nodus.ValidationError as e:
    ...     print(e.code, e.message, e.request_id)
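For readers curious how a client might turn HTTP responses into the classes in the table, here is a rough sketch using local stand-in exceptions. Everything in it is an assumption for illustration: the class names ending in `Sketch`, the `error_from_response` function, and the `{"code": ..., "message": ...}` payload shape are not taken from the SDK. SignatureError is left out because it shares status 401 with AuthenticationError and would need to be distinguished by code.

```python
from typing import Any, Dict, Optional


class SketchError(Exception):
    """Stand-in for the base error, carrying the fields described above."""

    def __init__(self, code: str, message: str, status_code: Optional[int] = None,
                 request_id: Optional[str] = None,
                 payload: Optional[Dict[str, Any]] = None) -> None:
        super().__init__(message)
        self.code = code
        self.message = message
        self.status_code = status_code
        self.request_id = request_id
        self.payload = payload or {}


class AuthenticationSketch(SketchError): pass
class NotFoundSketch(SketchError): pass
class ValidationSketch(SketchError): pass
class RateLimitSketch(SketchError): pass
class BudgetExceededSketch(SketchError): pass
class SupplierUnavailableSketch(SketchError): pass


# Status-to-class mapping that follows the "When" column of the table.
BY_STATUS = {
    400: ValidationSketch, 422: ValidationSketch,
    401: AuthenticationSketch, 403: AuthenticationSketch,
    402: BudgetExceededSketch,
    404: NotFoundSketch,
    429: RateLimitSketch,
    503: SupplierUnavailableSketch,
}


def error_from_response(status_code: int, payload: Dict[str, Any],
                        request_id: Optional[str] = None) -> SketchError:
    """Pick the most specific class; fall back to the base for other 4xx/5xx."""
    cls = BY_STATUS.get(status_code, SketchError)
    return cls(
        code=payload.get("code", "http_error"),  # assumed payload field
        message=payload.get("message", ""),      # assumed payload field
        status_code=status_code,
        request_id=request_id,
        payload=payload,
    )
```

Because every class derives from one base, callers can catch narrowly for recoverable cases and still have a single catch-all for logging.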

    # Keys & budgets

    When an agent or service needs to run jobs on its own, mint it a dedicated key with a monthly spend cap. The cap is enforced at submission time against the sum of realised spend plus the estimated cost of every in-flight job issued by that key, so a fleet of agents can burn through budget concurrently without racing each other into overspend.
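The cap check described above amounts to simple arithmetic over committed spend. The sketch below is an assumption-laden illustration, not the control plane's code: `would_exceed_cap` and `headroom` are hypothetical names, and it assumes the new submission's own estimate is added to the total before comparing against the cap.

```python
from typing import Iterable


def committed_spend(realised_usd: float, in_flight_estimates_usd: Iterable[float]) -> float:
    """Realised spend plus the estimated cost of every in-flight job on the key."""
    return realised_usd + sum(in_flight_estimates_usd)


def headroom(cap_usd: float, realised_usd: float,
             in_flight_estimates_usd: Iterable[float]) -> float:
    """Budget still available to the key, never negative."""
    return max(0.0, cap_usd - committed_spend(realised_usd, in_flight_estimates_usd))


def would_exceed_cap(cap_usd: float, realised_usd: float,
                     in_flight_estimates_usd: Iterable[float],
                     new_estimate_usd: float) -> bool:
    """True when accepting the new job would push committed spend past the cap."""
    in_flight = list(in_flight_estimates_usd)
    return committed_spend(realised_usd, in_flight) + new_estimate_usd > cap_usd
```

For example, with a 2500 USD cap, 2000 USD realised, and one in-flight job estimated at 300 USD, a 150 USD submission fits (2450 total) while a 250 USD submission does not (2550 total). Counting in-flight estimates is what keeps concurrent submissions from each seeing the same stale headroom.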

    Each key carries a principal kind. This is the one bit of attribution that propagates all the way to the Trust Ledger row and onto the Stripe meter event, so finance can always answer “how much did the agents spend last month?” without joining three tables.

    Principal kind    Intended use                                                                                   Typical cap
    human             Console-issued keys a person uses from a laptop or a CI job they own.                          None or generous.
    agent             An LLM-driven loop, a scheduled planner, a RAG pipeline that dispatches training itself.      Tight. Budget is the blast radius.
    service           A first-party internal service, such as a training orchestrator or a batch refresher.         Per-service, tuned to the workload.
    mint-key · shell ~/ops
    # One-time, from an operator shell with DB access.
    $ python -m nodus_ledger.admin mint \
    .     --customer-id acme \
    .     --kind agent \
    .     --label planner-prod \
    .     --monthly-cap-usd 2500 \
    .     --signing-required
    
    api_key         nk_live_7Z4d…q3
    signing_secret  sgn_9bfe…14     # shown once, never again
    key_id          0f3c-…-b210
    principal_kind  agent
    monthly_cap     $2500.00

    The plaintext API key and signing secret are displayed exactly once at mint time. The database stores only hashes; the plaintext signing secret is also written to the control plane's Redis cache so request verification can run without a second round-trip. Treat both values like you would a production database password.

    When a submission would push month-to-date spend past the cap, the API answers HTTP 402 budget_exceeded and the SDK raises BudgetExceededError. The error payload tells the caller exactly how much headroom it has, so an agent can shrink the spec and retry without a separate usage query.

    budget · python ~/your-agent
    >>> try:
    ...     client.run(image="ghcr.io/acme/eval:v2", gpu="h100_80gb", gpu_count=8)
    ... except nodus.BudgetExceededError as e:
    ...     headroom = e.payload["monthly_spend_cap_usd"] - e.payload["month_to_date_usd"]
    ...     # downscale and retry, or escalate to a human
    ...     # log is assumed to be a structlog-style logger that accepts keyword fields
    ...     log.warning("budget blocked", quoted=e.payload["estimated_cost_usd"], headroom=headroom)

    Subscribe to the budget.threshold_reached webhook (default trigger: 80% of cap) to surface near-exhaustion to an operator before the first 402 ever fires. See Webhooks.

    # Signed requests

    Keys minted with --signing-required must accompany every request with an HMAC-SHA256 signature covering the method, path, body, and a timestamp. This pins each request to a single operation and a narrow time window (default tolerance: 5 minutes), so a leaked bearer alone cannot replay writes and a stolen log cannot be mined for live credentials.

    Signed string format (identical to the webhook scheme, so one implementation covers both sides):

    signed-string · format v1
    signed_string = f"{ts}.{METHOD}.{path}.{sha256_hex(body)}"
    mac           = HMAC-SHA256(signing_secret, signed_string).hexdigest()
    
    Nodus-Signature: v1=<mac>,ts=<unix_seconds>
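The v1 format above can be reproduced with the standard library. The sketch below follows the signed-string definition literally; `build_signature_header` is a hypothetical name, and two details are assumptions not stated in the format: the signing secret is used as UTF-8 bytes, and the method is upper-cased before signing.

```python
import hashlib
import hmac
import time
from typing import Dict, Optional


def build_signature_header(signing_secret: str, method: str, path: str,
                           body: bytes, ts: Optional[int] = None) -> Dict[str, str]:
    """Return a Nodus-Signature header for the exact bytes in body."""
    ts = int(time.time()) if ts is None else ts
    body_hash = hashlib.sha256(body).hexdigest()
    # signed_string = f"{ts}.{METHOD}.{path}.{sha256_hex(body)}"
    signed_string = f"{ts}.{method.upper()}.{path}.{body_hash}"
    mac = hmac.new(
        signing_secret.encode("utf-8"),   # assumption: secret is UTF-8 text
        signed_string.encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()
    return {"Nodus-Signature": f"v1={mac},ts={ts}"}
```

Passing `ts` explicitly makes the function deterministic, which is convenient for unit tests; in normal use it defaults to the current Unix time.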

    The SDK ships a helper so you don't have to reimplement this. It returns a dict of headers you can merge into any request — useful if you're calling an endpoint the SDK doesn't yet wrap, or building request signing into a different HTTP stack.

    sign · python ~/your-agent
    >>> import json, os, httpx, nodus
    >>>
    >>> body = json.dumps({"image": "ghcr.io/acme/train:v3", "gpu": "h100_80gb"}).encode()
    >>> headers = {
    ...     "Authorization": "Bearer " + os.environ["NODUS_API_KEY"],
    ...     "Content-Type": "application/json",
    ... }
    >>> headers.update(nodus.sign_request(
    ...     signing_secret=os.environ["NODUS_SIGNING_SECRET"],
    ...     method="POST",
    ...     path="/v1/jobs",
    ...     body=body,
    ... ))
    >>> httpx.post("https://api.nodus.run/v1/jobs", content=body, headers=headers)
    <Response [202 Accepted]>

    Verification failures surface as SignatureError with HTTP 401 and one of three codes: signature_invalid, signature_stale, or signing_secret_unavailable. Each calls for a different response: for signature_invalid, re-sign with the same secret; for signature_stale, resync the clock; for signing_secret_unavailable, rotate the key. Don't collapse them into a generic retry.

    Body canonicalisation: always sign the exact bytes you put on the wire. Re-serialising the body between signing and sending (e.g. letting httpx build JSON from a dict after you signed a pre-encoded copy) will break verification. Sign once, send once.

    # Webhooks

    Instead of polling, register an HTTPS endpoint and Nodus will POST state-change events to it. Every delivery is signed with the same v1 HMAC scheme as outbound requests, so one verifier covers both directions. All deliveries are audited — delivery attempts, HTTP status, latency, and next retry are written to the Trust Ledger and exposed on the console.

    Event                       Fires when                                                                                  Payload extras
    job.placed                  The API has accepted a submission and enqueued it for dispatch.                             estimated_cost_usd, principal_kind
    job.completed               The worker reports a successful terminal state.                                             cost_usd, supplier, region, duration_s
    job.failed                  Terminal failure: all retries and failovers exhausted.                                      error_code, error_message, last_supplier
    budget.threshold_reached    Month-to-date spend on a key crosses the configured fraction (default 80%) of its cap.      month_to_date_usd, monthly_spend_cap_usd, fraction
    budget.exceeded             A submission was blocked by the monthly cap.                                                month_to_date_usd, monthly_spend_cap_usd, estimated_cost_usd

    Register an endpoint and subscribe it to one or more events. The signing secret is returned once in the response body; store it next to your other secrets.

    register-webhook · shell ~/your-app
    $ curl -sS -X POST https://api.nodus.run/v1/webhooks \
    .     -H "Authorization: Bearer $NODUS_API_KEY" \
    .     -H "Content-Type: application/json" \
    .     -d '{
    .       "url": "https://hooks.acme.internal/nodus",
    .       "events": ["job.completed", "job.failed", "budget.threshold_reached"]
    .     }'
    {
      "id": "wh_b71c...",
      "url": "https://hooks.acme.internal/nodus",
      "events": ["job.completed", "job.failed", "budget.threshold_reached"],
      "secret": "whsec_4a9e...a1"        // shown once
    }

    Verify deliveries on your side with the same SDK helper that signs outbound requests, in reverse. The signed path is the one on your endpoint (strip any reverse-proxy mount prefix first).

    verify · python ~/your-app
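    >>> import json, os, nodus
    >>> from fastapi import FastAPI, HTTPException, Request
    >>>
    >>> app = FastAPI()
    >>>
    >>> # FastAPI route, but the helper is framework-agnostic.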
    >>> @app.post("/nodus")
    ... async def inbound(request: Request):
    ...     body = await request.body()
    ...     ok = nodus.verify_webhook(
    ...         signing_secret=os.environ["NODUS_WEBHOOK_SECRET"],
    ...         method="POST",
    ...         path="/nodus",
    ...         body=body,
    ...         signature_header=request.headers.get("Nodus-Signature"),
    ...     )
    ...     if not ok:
    ...         raise HTTPException(401, "bad signature")
    ...     event = json.loads(body)
    ...     # event["type"] in {"job.completed", "job.failed", ...}

    Deliveries use exponential backoff with capped jitter (roughly: immediate, 30s, 2m, 10m, 1h) and are marked permanently failed after the final attempt. The Trust Ledger keeps the full attempt history so you can replay manually from the console if your endpoint was down.
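To make the delivery schedule concrete, the sketch below encodes the approximate delays quoted above and adds a small, capped amount of jitter. The base delays come from this document; the jitter fraction, the jitter cap, and the name `delivery_delay` are assumptions chosen for illustration, since the exact jitter parameters are not specified here.

```python
import random
from typing import Optional

# Approximate schedule from the text: immediate, 30s, 2m, 10m, 1h.
BASE_DELAYS_SECONDS = [0, 30, 120, 600, 3600]


def delivery_delay(attempt: int, jitter_fraction: float = 0.1,
                   max_jitter_seconds: float = 60.0,
                   rng: Optional[random.Random] = None) -> Optional[float]:
    """Delay before the given 0-based attempt, or None once attempts are exhausted."""
    if attempt >= len(BASE_DELAYS_SECONDS):
        return None  # delivery would be marked permanently failed
    rng = rng or random.Random()
    base = BASE_DELAYS_SECONDS[attempt]
    jitter_cap = min(base * jitter_fraction, max_jitter_seconds)
    return base + rng.uniform(0, jitter_cap)
```

A receiver that was down for under an hour would, under this schedule, still see the event on a later attempt; beyond that, manual replay from the console is the fallback the text describes.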

    # Retries & idempotency

    Both clients retry transient failures automatically: network errors, HTTP 408, 409, 425, 429, and 5xx. Retry-After is honoured on 429; everything else uses decorrelated-jitter backoff. The default budget is two retries; raise or lower with max_retries=.
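Decorrelated jitter picks each delay at random between a base and three times the previous delay, capped at a maximum. The sketch below shows that rule alongside the Retry-After override for 429 responses. It is not the SDK's implementation: the function names and the `base` and `cap` values are assumptions.

```python
import random
from typing import Iterator, Optional


def decorrelated_jitter(base: float = 0.5, cap: float = 30.0,
                        rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield an endless stream of backoff delays in seconds."""
    rng = rng or random.Random()
    delay = base
    while True:
        # Next delay is drawn from [base, 3 * previous], then capped.
        delay = min(cap, rng.uniform(base, delay * 3))
        yield delay


def next_delay(status_code: Optional[int], retry_after: Optional[float],
               backoff: Iterator[float]) -> float:
    """Honour Retry-After on 429; otherwise take the next jittered delay."""
    if status_code == 429 and retry_after is not None:
        return float(retry_after)
    return next(backoff)
```

Compared with plain exponential backoff, this spreads concurrent clients apart so they do not retry in lockstep after a shared outage.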

    POST /v1/jobs is retried safely because the SDK generates an idempotency key for every run() call the caller did not key themselves. If you pass your own idempotency_key, the SDK forwards it unchanged and the server guarantees that submitting the same key twice returns the original job.
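The reason a generated key makes retries safe is easiest to see with a toy server. In the sketch below, `FakeJobServer` and `submit_with_retries` are hypothetical stand-ins: the key is chosen once per logical submission and reused on every attempt, so repeated attempts resolve to the same job instead of creating duplicates.

```python
import uuid
from typing import Any, Dict, Optional


class FakeJobServer:
    """In-memory stand-in for a server that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self._jobs_by_key: Dict[str, Dict[str, Any]] = {}
        self.created = 0

    def submit(self, spec: Dict[str, Any], idempotency_key: str) -> Dict[str, Any]:
        if idempotency_key in self._jobs_by_key:
            return self._jobs_by_key[idempotency_key]  # same key: original job
        self.created += 1
        job = {"id": f"job-{self.created}", "spec": spec}
        self._jobs_by_key[idempotency_key] = job
        return job


def submit_with_retries(server: FakeJobServer, spec: Dict[str, Any],
                        idempotency_key: Optional[str] = None,
                        attempts: int = 3) -> Dict[str, Any]:
    """Generate the key once, then reuse it on every attempt."""
    key = idempotency_key or uuid.uuid4().hex
    job: Dict[str, Any] = {}
    for _ in range(attempts):
        # Pretend each response was lost in transit, forcing another attempt.
        job = server.submit(spec, key)
    return job
```

Generating the key inside the retry loop instead would defeat the purpose, since each attempt would look like a new submission.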

    # Versioning

    SemVer. The SDK is versioned independently of the control plane: a major bump on the server never forces a major bump on the SDK. New wire fields are additive — older SDK versions ignore fields they don’t know about (JobView uses extra="ignore"). We will publish a deprecation notice at least one minor release before removing any public symbol.

    The public surface is everything re-exported from nodus/__init__.py. Anything prefixed with an underscore, or in nodus._http, is internal and can change without notice.
