
Production Deployment

This guide covers running ktsu in production: service topology, state persistence, configuration, scaling, health checks, and observability.



Service Topology

A ktsu deployment has three services:

  1. Orchestrator — The kernel. Manages the workflow DAG, validates step outputs through the Air-Lock, tracks heartbeats, and persists all run state. Exposes the public HTTP API for workflow invocation.
  2. Agent Runtime — The worker. Executes stateless agent reasoning loops. Receives invocation payloads from the Orchestrator, connects to tool servers, and calls the Gateway for LLM inference.
  3. LLM Gateway — The security boundary. Normalizes LLM providers, holds credentials, enforces cost budgets, and is the only service with outbound internet access to provider APIs.

All inter-service communication is plain HTTP. There are no message queues or shared memory between services.

Communication flow:

CLI / external → Orchestrator → Runtime → Gateway → LLM provider
                                        → Tool servers (MCP/SSE)

The Runtime's address is registered with the Orchestrator via KTSU_RUNTIME_URL. The Gateway's address is set on both the Orchestrator and the Runtime via KTSU_GATEWAY_URL.

Minimum Viable Deployment

For single-host or development use, run all three services on one machine:

ktsu start gateway --config gateway.yaml   # :5052 by default
ktsu start orchestrator                    # :5050 by default
ktsu start runtime                         # :5051 by default

For a scaled deployment, run each service on its own host or container. All three listen ports are configurable (KTSU_ORCHESTRATOR_PORT, KTSU_RUNTIME_PORT, KTSU_GATEWAY_PORT); point the services at each other via environment variables.
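As a sketch of that wiring, the shell functions below export the variables each host would need in a three-host layout. The variable names and default ports come from the Environment Configuration tables in this guide; the helper names (service_url, orchestrator_env, runtime_env) and the hostnames in the usage lines are hypothetical.

```shell
# Build a plain-HTTP base URL from a host and a port.
service_url() {
  printf 'http://%s:%s\n' "$1" "$2"
}

# Export the Orchestrator's view of the deployment.
# Args: orchestrator host, runtime host, gateway host (placeholder names).
orchestrator_env() {
  export KTSU_OWN_URL="$(service_url "$1" 5050)"
  export KTSU_RUNTIME_URL="$(service_url "$2" 5051)"
  export KTSU_GATEWAY_URL="$(service_url "$3" 5052)"
}

# Export the Runtime's view of the deployment.
# Args: orchestrator host, gateway host.
runtime_env() {
  export KTSU_ORCHESTRATOR_URL="$(service_url "$1" 5050)"
  export KTSU_GATEWAY_URL="$(service_url "$2" 5052)"
}

# Usage on the Orchestrator host (hostnames are placeholders):
#   orchestrator_env orch.internal runtime.internal gateway.internal
#   ktsu start orchestrator
```

The ports are hard-coded to the documented defaults; if you change a listen port, change the matching URL on every service that points at it.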


State Persistence

The Orchestrator is stateless — all run and step state lives in the Store, which is the only component that must be backed up.

Store backends:

| Backend | KTSU_STORE_TYPE | Notes |
|---|---|---|
| In-memory | memory (default) | No persistence; state is lost on restart. For development only. |
| SQLite | sqlite | Persists to a file. Suitable for single-host or low-volume production. |
| Postgres | postgres | Defined in the interface but not yet implemented. |

SQLite in production:

Set KTSU_STORE_TYPE=sqlite and KTSU_DB_PATH to an absolute path on a persistent volume:

KTSU_STORE_TYPE=sqlite
KTSU_DB_PATH=/data/ktsu.db
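A pre-flight check can catch the two easy mistakes here: leaving the store on the in-memory default, or pointing KTSU_DB_PATH at a relative path. The sketch below is one way to do that before starting the Orchestrator; check_store_config is a hypothetical helper, not a ktsu command, and its fallback values mirror the documented defaults.

```shell
# Fail fast if the store settings are not suitable for persistent state.
# Unset variables fall back to the documented defaults (memory, ktsu.db).
check_store_config() {
  case "${KTSU_STORE_TYPE:-memory}" in
    sqlite) ;;
    *)
      echo "KTSU_STORE_TYPE must be sqlite for persistent state" >&2
      return 1
      ;;
  esac
  case "${KTSU_DB_PATH:-ktsu.db}" in
    /*) return 0 ;;
    *)
      echo "KTSU_DB_PATH must be an absolute path" >&2
      return 1
      ;;
  esac
}

# Usage:
#   check_store_config && ktsu start orchestrator
```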

The SQLite store does not enable WAL mode automatically. For workloads with concurrent readers, enable WAL mode manually on the database file before starting the Orchestrator:

sqlite3 /data/ktsu.db "PRAGMA journal_mode=WAL;"
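The sqlite3 shell prints the journal mode that is in effect after the PRAGMA runs, so the change can be verified rather than assumed. A minimal sketch, with enable_wal as a hypothetical helper name:

```shell
# Enable WAL on a database file and confirm that SQLite accepted the change.
# Returns non-zero if sqlite3 fails or reports any mode other than "wal".
enable_wal() {
  wal_mode="$(sqlite3 "$1" "PRAGMA journal_mode=WAL;")" || return 1
  [ "$wal_mode" = "wal" ]
}

# Usage:
#   enable_wal /data/ktsu.db || echo "WAL mode was not enabled" >&2
```

WAL mode is stored in the database file itself, so it only needs to be set once per file and stays in effect across restarts.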

Back up ktsu.db regularly. Because the Orchestrator is stateless, recovery is straightforward: restore the database file and restart.
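One way to take a consistent copy while the Orchestrator is running is the sqlite3 shell's .backup command. The sketch below assumes a backup directory and a timestamped file naming scheme of its own choosing; backup_path and backup_db are hypothetical helpers, and paths containing single quotes are not handled.

```shell
# Compose a backup file path from a directory and a timestamp.
backup_path() {
  printf '%s/ktsu-%s.db\n' "$1" "$2"
}

# Copy a live SQLite database with the sqlite3 shell's .backup command.
# Args: database path, backup directory. Prints the backup path on success.
backup_db() {
  bk_dest="$(backup_path "$2" "$(date -u +%Y%m%dT%H%M%SZ)")"
  mkdir -p "$2" || return 1
  sqlite3 "$1" ".backup '$bk_dest'" || return 1
  printf '%s\n' "$bk_dest"
}

# Usage (the /backups directory is a placeholder):
#   backup_db /data/ktsu.db /backups
```

Using .backup rather than copying the file matters most in WAL mode, where recent writes may still sit in the separate -wal file next to the database.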


Environment Configuration

Orchestrator

| Variable | Default | Description |
|---|---|---|
| KTSU_ORCHESTRATOR_HOST | (empty; all interfaces) | Bind host |
| KTSU_ORCHESTRATOR_PORT | 5050 | Listen port |
| KTSU_RUNTIME_URL | | URL of the Agent Runtime |
| KTSU_GATEWAY_URL | | URL of the LLM Gateway |
| KTSU_OWN_URL | | The Orchestrator's own URL (used for Runtime callbacks) |
| KTSU_STORE_TYPE | memory | State backend: memory or sqlite |
| KTSU_DB_PATH | ktsu.db | SQLite file path |
| KTSU_PROJECT_DIR | . | Project root for resolving agent and server paths |

Agent Runtime

| Variable | Default | Description |
|---|---|---|
| KTSU_RUNTIME_HOST | (empty; all interfaces) | Bind host |
| KTSU_RUNTIME_PORT | 5051 | Listen port |
| KTSU_ORCHESTRATOR_URL | http://localhost:5050 | Orchestrator to register with and send heartbeats to |
| KTSU_GATEWAY_URL | http://localhost:5052 | Gateway for LLM calls |

LLM Gateway

| Variable | Default | Description |
|---|---|---|
| KTSU_GATEWAY_HOST | (empty; all interfaces) | Bind host |
| KTSU_GATEWAY_PORT | 5052 | Listen port |
| ANTHROPIC_API_KEY | | Injected into the gateway container; referenced via {{ env.ANTHROPIC_API_KEY }} in gateway config |

Provider Credentials

Do not hardcode provider API keys in workflow or gateway YAML files. Declare them in the env: section and reference them with {{ env.VAR }} substitution:

# gateway.yaml
env:
  - name: ANTHROPIC_API_KEY
    secret: true

providers:
  - name: anthropic
    type: anthropic
    config:
      api_key: "{{ env.ANTHROPIC_API_KEY }}"

The gateway resolves {{ env.ANTHROPIC_API_KEY }} at startup from the process environment. Inject secrets via your orchestration platform's secret management (Docker secrets, Kubernetes secrets, AWS Secrets Manager, etc.) rather than committing them to files.
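When the platform delivers secrets as mounted files rather than environment variables, a small entrypoint step can bridge the two. The sketch below reads a secret file and exports it into the process environment before the gateway starts. load_secret is a hypothetical helper, and the secret file name is a placeholder; /run/secrets/ is where Docker mounts secrets by default.

```shell
# Read a secret from a file and export it under the given variable name.
# Args: variable name, path to the secret file.
load_secret() {
  if [ ! -r "$2" ]; then
    echo "secret file not readable: $2" >&2
    return 1
  fi
  # Command substitution strips the trailing newline most secret files carry.
  secret_value="$(cat "$2")"
  if [ -z "$secret_value" ]; then
    echo "secret file is empty: $2" >&2
    return 1
  fi
  export "$1=$secret_value"
  unset secret_value
}

# Usage in a gateway entrypoint script (secret name is a placeholder):
#   load_secret ANTHROPIC_API_KEY /run/secrets/anthropic_api_key || exit 1
#   exec ktsu start gateway --config gateway.yaml
```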


Scaling

Orchestrator: Stateless. Instances can run behind a load balancer as long as they all share one Store. With SQLite, that means every instance must run on the same host and open the same database file. Multi-host scale-out requires the Postgres backend, which is not yet implemented.

Agent Runtime: Handles one agent reasoning loop per request. Scale out Runtime instances to increase concurrent workflow throughput. Each Runtime registers with the Orchestrator independently via KTSU_ORCHESTRATOR_URL.

LLM Gateway: Stateless. Can be horizontally scaled behind a load balancer. All instances read credentials from environment variables.

Tool Servers: Independent processes. Scale custom tool servers independently based on observed load. Each tool server is referenced by URL in agent definitions, so multiple instances can sit behind a load balancer without changes to workflow config.


Health Checks

All three core services expose a GET /health endpoint that returns {"status":"ok"} when the service is ready.

| Service | Default URL |
|---|---|
| Orchestrator | http://localhost:5050/health |
| Agent Runtime | http://localhost:5051/health |
| LLM Gateway | http://localhost:5052/health |

Verify all three after startup:

curl -s http://localhost:5050/health   # orchestrator
curl -s http://localhost:5051/health   # agent runtime
curl -s http://localhost:5052/health   # LLM gateway

A healthy startup sequence starts the Gateway first (it has no dependencies), then the Orchestrator (which depends on the Gateway being healthy), then the Runtime (which connects to both). The Docker Compose setup enforces this order via depends_on with condition: service_healthy.
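Outside Docker Compose, the same ordering can be approximated by polling each /health endpoint before starting the next service. The sketch below is one way to do that with curl. The function names, retry count, and delay are assumptions, and the usage lines assume ktsu start runs in the foreground and therefore needs to be backgrounded.

```shell
# Return success if GET <base-url>/health reports {"status":"ok"}.
is_healthy() {
  # Whitespace is stripped so that {"status": "ok"} also matches.
  hc_body="$(curl -fsS --max-time 2 "$1/health" 2>/dev/null | tr -d '[:space:]')"
  case "$hc_body" in
    *'"status":"ok"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Poll a service until it is healthy or the attempts run out.
# Args: base URL, attempts (default 30), delay in seconds (default 1).
wait_for_health() {
  wf_attempts="${2:-30}"
  wf_delay="${3:-1}"
  wf_i=0
  while [ "$wf_i" -lt "$wf_attempts" ]; do
    is_healthy "$1" && return 0
    wf_i=$((wf_i + 1))
    sleep "$wf_delay"
  done
  return 1
}

# Usage: start each service only after its dependency reports healthy.
#   ktsu start gateway --config gateway.yaml &
#   wait_for_health http://localhost:5052 || exit 1
#   ktsu start orchestrator &
#   wait_for_health http://localhost:5050 || exit 1
#   ktsu start runtime &
```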

The Agent Runtime sends a heartbeat POST to <KTSU_ORCHESTRATOR_URL>/heartbeat every 5 seconds while agent steps are active. The Orchestrator uses these heartbeats to detect stuck agents and fail steps that go silent.
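To illustrate the detection logic, the sketch below decides whether a step has gone silent based on the time of its last heartbeat. The 5-second interval is from this guide; the tolerance of three missed heartbeats and the function name are assumptions, since the Orchestrator's actual timeout is not stated here.

```shell
HEARTBEAT_INTERVAL=5      # seconds between heartbeats, as documented
MISSED_BEATS_ALLOWED=3    # assumption: not a documented ktsu setting

# Return success if the last heartbeat is older than the allowed window.
# Args: last heartbeat time, current time (both in epoch seconds).
heartbeat_is_stale() {
  hb_limit=$((HEARTBEAT_INTERVAL * MISSED_BEATS_ALLOWED))
  [ $(($2 - $1)) -gt "$hb_limit" ]
}

# Usage:
#   heartbeat_is_stale "$last_seen" "$(date +%s)" && echo "step went silent"
```

Allowing a few missed beats before failing a step trades slower detection for tolerance of brief network hiccups between the Runtime and the Orchestrator.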


Observability

Reserved output fields:

Agents may emit two reserved fields in step output that the Orchestrator surfaces for logging and monitoring — they have no effect on downstream pipeline data:

  • ktsu_flags — An array of string labels (e.g. ["low_risk", "requires_review"]). Use these to drive alerting rules on step output logs.
  • ktsu_rationale — A free-text string explaining the agent's decision. Useful for audit trails and debugging.
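A simple alerting rule can be built by scanning step output logs for a given flag. The sketch below assumes each step output is logged as compact single-line JSON with ktsu_flags as an array of plain strings; has_flag is a hypothetical helper and the sample values are illustrative. It is a pattern match, not a JSON parser, so use a real parser where one is available.

```shell
# Return success if the JSON text lists the given label in ktsu_flags.
# Args: single-line JSON text, flag label (letters, digits, underscores).
has_flag() {
  printf '%s' "$1" |
    grep -Eq "\"ktsu_flags\"[[:space:]]*:[[:space:]]*\[[^]]*\"$2\""
}

# Usage (sample step output; the rationale text is made up):
#   step='{"ktsu_flags": ["low_risk", "requires_review"], "ktsu_rationale": "..."}'
#   has_flag "$step" requires_review && echo "alert: step requires review"
```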

Inspecting run state:

ktsu runs list
ktsu runs get <run-id>

ktsu runs get returns the full run record including step statuses, outputs, and error messages. This is the primary tool for diagnosing failed or stuck runs.

Heartbeat monitoring:

The Runtime posts active step IDs to the Orchestrator every 5 seconds. If a Runtime crashes or becomes unresponsive, the Orchestrator will detect the missing heartbeat and mark the affected step as failed. Monitor for steps that transition to failed with a timeout error as a signal of Runtime instability.


Last updated April 2026