Production Deployment
This guide covers running ktsu in production: service topology, state persistence, configuration, scaling, health checks, and observability.
Service Topology
A ktsu deployment has three services:
- Orchestrator — The kernel. Manages the workflow DAG, validates step outputs through the Air-Lock, tracks heartbeats, and persists all run state. Exposes the public HTTP API for workflow invocation.
- Agent Runtime — The worker. Executes stateless agent reasoning loops. Receives invocation payloads from the Orchestrator, connects to tool servers, and calls the Gateway for LLM inference.
- LLM Gateway — The security boundary. Normalizes LLM providers, holds credentials, enforces cost budgets, and is the only service with outbound internet access to provider APIs.
All inter-service communication is plain HTTP. There are no message queues or shared memory between services.
Communication flow:
CLI / external → Orchestrator → Runtime → Gateway → LLM provider
                                Runtime → Tool servers (MCP/SSE)
The Runtime is registered with the Orchestrator via KTSU_RUNTIME_URL. The Gateway address is set on both the Orchestrator and the Runtime via KTSU_GATEWAY_URL.
Minimum Viable Deployment
For single-host or development use, run all three services on one machine:
ktsu start gateway --config gateway.yaml # :5052 by default
ktsu start orchestrator # :5050
ktsu start runtime # :5051
For a scaled deployment, run each service on its own host or container. The Orchestrator and Gateway ports are configurable; point the services at each other via environment variables, as in the sketch below.
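A minimal Docker Compose sketch of that wiring, using the environment variables documented later in this guide. The image names, service names, and volume are placeholders, not official artifacts:

# docker-compose.yaml (sketch; image names and volumes are placeholders)
services:
  gateway:
    image: ktsu/gateway                        # placeholder image name
    # mount gateway.yaml and pass it with --config, per the single-host commands above
    environment:
      KTSU_GATEWAY_PORT: "5052"
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}  # injected from the host or a secret store
  orchestrator:
    image: ktsu/orchestrator                   # placeholder image name
    environment:
      KTSU_ORCHESTRATOR_PORT: "5050"
      KTSU_RUNTIME_URL: http://runtime:5051
      KTSU_GATEWAY_URL: http://gateway:5052
      KTSU_OWN_URL: http://orchestrator:5050
      KTSU_STORE_TYPE: sqlite
      KTSU_DB_PATH: /data/ktsu.db
    volumes:
      - ktsu-data:/data
  runtime:
    image: ktsu/runtime                        # placeholder image name
    environment:
      KTSU_RUNTIME_PORT: "5051"
      KTSU_ORCHESTRATOR_URL: http://orchestrator:5050
      KTSU_GATEWAY_URL: http://gateway:5052
volumes:
  ktsu-data: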
State Persistence
The Orchestrator is stateless — all run and step state lives in the Store, which is the only component that must be backed up.
Store backends:
| Backend | KTSU_STORE_TYPE | Notes |
|---|---|---|
| In-memory | memory (default) | No persistence; state is lost on restart. For development only. |
| SQLite | sqlite | Persists to a file. Suitable for single-host or low-volume production. |
| Postgres | postgres | Defined in the interface but not yet implemented. |
SQLite in production:
Set KTSU_STORE_TYPE=sqlite and KTSU_DB_PATH to an absolute path on a persistent volume:
KTSU_STORE_TYPE=sqlite
KTSU_DB_PATH=/data/ktsu.db
The SQLite store does not enable WAL mode automatically. For workloads with concurrent readers, enable WAL mode manually on the database file before starting the Orchestrator:
sqlite3 /data/ktsu.db "PRAGMA journal_mode=WAL;"
Back up ktsu.db regularly. Because the Orchestrator is stateless, recovery is straightforward: restore the database file and restart.
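One way to take those backups is sqlite3's online .backup command, which produces a consistent copy while the Orchestrator keeps running (the destination path is illustrative):

# Consistent snapshot without stopping the Orchestrator; destination path is illustrative
sqlite3 /data/ktsu.db ".backup '/backups/ktsu-$(date +%F).db'"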
Environment Configuration
Orchestrator
| Variable | Default | Description |
|---|---|---|
| KTSU_ORCHESTRATOR_HOST | `` (all interfaces) | Bind host |
| KTSU_ORCHESTRATOR_PORT | 5050 | Listen port |
| KTSU_RUNTIME_URL | — | URL of the Agent Runtime |
| KTSU_GATEWAY_URL | — | URL of the LLM Gateway |
| KTSU_OWN_URL | — | The Orchestrator's own URL (used for Runtime callbacks) |
| KTSU_STORE_TYPE | memory | State backend: memory or sqlite |
| KTSU_DB_PATH | ktsu.db | SQLite file path |
| KTSU_PROJECT_DIR | . | Project root for resolving agent and server paths |
Agent Runtime
| Variable | Default | Description |
|---|---|---|
| KTSU_RUNTIME_HOST | `` (all interfaces) | Bind host |
| KTSU_RUNTIME_PORT | 5051 | Listen port |
| KTSU_ORCHESTRATOR_URL | http://localhost:5050 | Orchestrator to register with and send heartbeats to |
| KTSU_GATEWAY_URL | http://localhost:5052 | Gateway for LLM calls |
LLM Gateway
| Variable | Default | Description |
|---|---|---|
| KTSU_GATEWAY_HOST | `` (all interfaces) | Bind host |
| KTSU_GATEWAY_PORT | 5052 | Listen port |
| ANTHROPIC_API_KEY | — | Injected into the gateway container; referenced via {{ env.ANTHROPIC_API_KEY }} in gateway config |
Provider Credentials
Do not hardcode provider API keys in workflow or gateway YAML files. Declare them in the env: section and reference them with {{ env.VAR }} substitution:
# gateway.yaml
env:
  - name: ANTHROPIC_API_KEY
    secret: true

providers:
  - name: anthropic
    type: anthropic
    config:
      api_key: "{{ env.ANTHROPIC_API_KEY }}"
The gateway resolves {{ env.ANTHROPIC_API_KEY }} at startup from the process environment. Inject secrets via your orchestration platform's secret management (Docker secrets, Kubernetes secrets, AWS Secrets Manager, etc.) rather than committing them to files.
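For instance, in a Kubernetes Deployment the gateway container can take the key from a pre-created Secret rather than from a file on disk (the Secret name here is illustrative):

# Gateway container spec (sketch); assumes a Secret named ktsu-gateway-secrets already exists
env:
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        name: ktsu-gateway-secrets
        key: ANTHROPIC_API_KEY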
Scaling
Orchestrator: Stateless. It can be horizontally scaled behind a load balancer as long as all instances share the same Store. With SQLite that means a single host sharing one database file; multi-host scale-out requires the Postgres backend, which is not yet implemented.
Agent Runtime: Handles one agent reasoning loop per request. Scale out Runtime instances to increase concurrent workflow throughput. Each Runtime registers with the Orchestrator independently via KTSU_ORCHESTRATOR_URL.
LLM Gateway: Stateless. Can be horizontally scaled behind a load balancer. All instances read credentials from environment variables.
Tool Servers: Independent processes. Scale custom tool servers independently based on observed load. Each tool server is referenced by URL in agent definitions, so multiple instances can sit behind a load balancer without changes to workflow config.
Health Checks
All three core services expose a GET /health endpoint that returns {"status":"ok"} when the service is ready.
| Service | Default URL |
|---|---|
| Orchestrator | http://localhost:5050/health |
| Agent Runtime | http://localhost:5051/health |
| LLM Gateway | http://localhost:5052/health |
Verify all three after startup:
curl -s http://localhost:5050/health # orchestrator
curl -s http://localhost:5051/health # agent runtime
curl -s http://localhost:5052/health # LLM gateway
A healthy startup sequence: Gateway starts first (it has no dependencies), then the Orchestrator (depends on Gateway being healthy), then the Runtime (registers with both). The Docker Compose setup enforces this order via depends_on with condition: service_healthy.
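A sketch of that ordering in Compose, layered onto the services defined earlier; the healthcheck command assumes curl is available inside the images, and the intervals are illustrative:

# Compose healthcheck sketch; assumes curl exists in the images
services:
  gateway:
    healthcheck:
      test: ["CMD", "curl", "-fsS", "http://localhost:5052/health"]
      interval: 5s
      timeout: 3s
      retries: 10
  orchestrator:
    depends_on:
      gateway:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-fsS", "http://localhost:5050/health"]
      interval: 5s
      timeout: 3s
      retries: 10
  runtime:
    depends_on:
      orchestrator:
        condition: service_healthy
      gateway:
        condition: service_healthy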
The Agent Runtime sends a heartbeat POST to <KTSU_ORCHESTRATOR_URL>/heartbeat every 5 seconds while agent steps are active. The Orchestrator uses these heartbeats to detect stuck agents and fail steps that go silent.
Observability
Reserved output fields:
Agents may emit two reserved fields in step output that the Orchestrator surfaces for logging and monitoring — they have no effect on downstream pipeline data:
- ktsu_flags — An array of string labels (e.g. ["low_risk", "requires_review"]). Use these to drive alerting rules on step output logs.
- ktsu_rationale — A free-text string explaining the agent's decision. Useful for audit trails and debugging.
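For example, a step output carrying both reserved fields next to its ordinary payload; the summary and confidence keys are made up for illustration, only the ktsu_* fields are defined by ktsu:

{
  "summary": "Quarterly figures reconciled against the ledger",
  "confidence": 0.92,
  "ktsu_flags": ["low_risk", "requires_review"],
  "ktsu_rationale": "Totals matched within tolerance; one manual journal entry could not be verified."
}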
Inspecting run state:
ktsu runs list
ktsu runs get <run-id>
ktsu runs get returns the full run record including step statuses, outputs, and error messages. This is the primary tool for diagnosing failed or stuck runs.
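Assuming the command prints the run record as JSON with a steps array (an assumption; verify against your version's actual output shape), failed steps can be filtered with jq:

# Assumes JSON output with a "steps" array of objects carrying a "status" field
ktsu runs get <run-id> | jq '.steps[] | select(.status == "failed")'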
Heartbeat monitoring:
The Runtime posts active step IDs to the Orchestrator every 5 seconds. If a Runtime crashes or becomes unresponsive, the Orchestrator will detect the missing heartbeat and mark the affected step as failed. Monitor for steps that transition to failed with a timeout error as a signal of Runtime instability.
Last updated April 2026