# Local AI Swarm (Router + Multi-Node Local Models)

This repo already has a strong local-first AI layer (`app/local_ai.py`) used by Ask AI and Search. The **Local AI Swarm** adds a production-minded **routing + backpressure** layer so the website can use **multiple local/LAN model nodes** safely and observably.

## What this is

When enabled, the swarm layer:
- routes local model calls across **one or more local/LAN model nodes**, including:
  - Ollama HTTP (`/api/chat`, `/api/embeddings`), and
  - OpenAI-compatible HTTP (`/v1/chat/completions`, `/v1/embeddings`) for servers like vLLM, llama.cpp server, LM Studio, or Ollama’s `/v1`,
- enforces **global + per-node concurrency** (backpressure),
- applies **timeouts** and basic **circuit-breaker cooldowns** on repeated failures,
- exposes an operator/status endpoint (`GET /api/swarm/status`) that is **public-safe by default**.

It is designed to be **off by default** and to preserve existing behavior when disabled.

## Traceable routing (operator debugging)

Every HTTP response from the app includes:
- `X-Request-ID` — per-request id (generated server-side).
- `X-PPIA-Trace-ID` — a short, public-safe trace id (`sha256(X-Request-ID)[:12]`).

When the Local AI Swarm is enabled, the router records recent per-request events (node selection, queue waits, overload/circuit-breaker opens). Operators can inspect these events with:
- Web UI (loopback): `GET /admin/swarm?trace=<X-PPIA-Trace-ID>` → **Trace** panel auto-loads.
- CLI (loopback): `python3 scripts/ppia swarm trace <trace_id> [--scope process|global] [--window-s 900] [--limit 200]`
- Raw JSON (strict loopback only): `GET /api/swarm/trace/<trace_id>?scope=process|global&window_s=...&limit=...`

Notes:
- `/api/swarm/trace/*` is **strict loopback only** (DNS-rebinding + forwarded-header hardened).
- The `global` scope filters the router’s persisted JSONL events (best-effort; cross-process).
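
A minimal loopback sketch of that lookup flow (assumes the app on `127.0.0.1:8000` and `jq`; a request that never reached the router will simply have no events):

```bash
# Grab the public-safe trace id from any response header, then query its router events.
trace_id=$(curl -sS -D - -o /dev/null http://127.0.0.1:8000/ \
  | awk -F': ' 'tolower($1) == "x-ppia-trace-id" {print $2}' | tr -d '\r')
echo "trace: ${trace_id}"
curl -fsS "http://127.0.0.1:8000/api/swarm/trace/${trace_id}?scope=process&limit=50" | jq
```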

Implementation notes:
- Swarm hot paths use pooled HTTP keep-alive clients (httpx, `trust_env=0`, `follow_redirects=0`) to reduce LAN tail latency.
- Pool sizing is env-tunable for larger clusters:
  - `LOCAL_SWARM_HTTPX_MAX_CONNECTIONS`, `LOCAL_SWARM_HTTPX_MAX_KEEPALIVE_CONNECTIONS`, `LOCAL_SWARM_HTTPX_KEEPALIVE_EXPIRY_S`
  - `LOCAL_SWARM_HTTPX_ASYNC_MAX_CONNECTIONS`, `LOCAL_SWARM_HTTPX_ASYNC_MAX_KEEPALIVE_CONNECTIONS`, `LOCAL_SWARM_HTTPX_ASYNC_KEEPALIVE_EXPIRY_S`
  - `LOCAL_SWARM_RAY_GATEWAY_HTTPX_MAX_CONNECTIONS`, `LOCAL_SWARM_RAY_GATEWAY_HTTPX_MAX_KEEPALIVE_CONNECTIONS`, `LOCAL_SWARM_RAY_GATEWAY_HTTPX_KEEPALIVE_EXPIRY_S` (website → Ray gateway)
- When `LOCAL_SWARM_USE_FILE_LOCKS=1`, routing order also considers cross-process slot occupancy (file-lock based) to avoid repeatedly selecting already-full nodes under multi-worker deployments.
- When `LOCAL_AI_TRAINING_LOCK_FILE` exists and is fresh, chat/stream routing is softly biased away from localhost nodes (helps avoid inference fighting local fine-tuning over GPU/VRAM).
- Ollama health checks prefer a lightweight `/api/version` probe, falling back to cached `/api/tags` only when the version endpoint is unavailable.
- Streaming TTFT failover: `LOCAL_SWARM_STREAM_TTFT_TIMEOUT_MS` can skip stale/down nodes that don’t produce a first token promptly when multiple candidates exist (default: 8000ms; set to 0 to disable). Enforced by `/api/ai/chat/stream` and the optional `/v1/chat/completions` proxy. TTFT fast-fail is recorded as a **stream node failure** (not `client_canceled`) so the router down-ranks stale nodes quickly across workers.
- Best-effort session affinity (stickiness): the website uses a small, random, httpOnly cookie (`ppia_swarm_aff`) as an affinity key so multi-turn chats prefer the same base when it is still competitive (avoids cache thrash on multi-node clusters). Tunables: `LOCAL_SWARM_AFFINITY_ENABLED` and `LOCAL_SWARM_AFFINITY_PROMOTE_MIN_RATIO`.
  - Observability: `GET /api/swarm/metrics` includes `affinity.promoted_total` and `nodes[].affinity_promoted_n` (base-id only).
- Optional guardrail: cap distinct node attempts per request (0 = unlimited): `LOCAL_SWARM_MAX_NODE_ATTEMPTS_CHAT`, `LOCAL_SWARM_MAX_NODE_ATTEMPTS_EMBED`, `LOCAL_SWARM_MAX_NODE_ATTEMPTS_STREAM`.
- TTFT observability: `/api/swarm/metrics` (and `/api/swarm/metrics.prom`) report stream **TTFT** percentiles per-op and per-node when available.
- Backpressure observability: `/api/swarm/metrics` (and Prometheus text) report **queue wait** percentiles for admission to the swarm concurrency pools (semaphores/locks), plus per-op queue timeout counts.
  - Tuning: `LOCAL_SWARM_QUEUE_WAIT_MS` / `LOCAL_SWARM_QUEUE_WAIT_MS_BG`
  - Persistence threshold (for ok events): `LOCAL_SWARM_METRICS_QUEUE_WAIT_PERSIST_MS` (default: 10ms; lower = more disk IO)
- Probe results are cached on-disk to share state across multiple Uvicorn workers (best-effort, safe to delete):
  - Path: `${LOCAL_SWARM_LOCK_DIR:-$PPIA_DATA_DIR/locks/swarm}/probe_cache/<kind>.<base_id>.json`
  - Shape: `{ts, ok, models[]?}` keyed by `base_id` (hash; no base URLs)
  - Router request-path hard failures (timeouts / connection errors) also write `ok=false` hints (throttled) so other workers down-rank nodes quickly between periodic probes.
  - Active only when `LOCAL_SWARM_USE_FILE_LOCKS=1` and file locks are supported.
  - Opt-out (advanced): set `LOCAL_SWARM_CACHE_ENABLED=0` (or `PPIA_SWARM_CACHE_ENABLED=0`) to disable probe/overload/breaker cache reads+writes while keeping the file-lock backpressure gates.
  - Loopback fallback (recommended): `LOCAL_SWARM_FALLBACK_LOOPBACK_ON_FAILURE=1` allows a final attempt via `LOCAL_AI_URL` **only when it is strict loopback** (127.0.0.1/::1/localhost) after all allowlisted nodes fail. Disable for strict allowlist-only behavior.
  - Note: when `PPIA_DATA_DIR` is not exported in the process environment, the swarm config will still
    read `aipowerprogressia.com/.env` for `PPIA_DATA_DIR` (best-effort) to avoid multi-worker drift.
- Overload cooldowns are also shared on-disk (best-effort, safe to delete):
  - Path: `${LOCAL_SWARM_LOCK_DIR:-$PPIA_DATA_DIR/locks/swarm}/overload_cache/<kind>.<base_id>.json`
  - Shape: `{ts, until_ts, status?, retry_after_s?}` keyed by `base_id` only (no base URLs)
  - Purpose: reduces 429 storms under multi-worker deployments when a node is alive but temporarily overloaded.
  - Written on `HTTP 429/503` overload responses; cleared on subsequent successful calls (throttled).
  - Active only when file locks are enabled and caches are enabled (`LOCAL_SWARM_CACHE_ENABLED=1`).
- Breaker cooldowns (circuit-breaker open periods) are also shared on-disk (best-effort, safe to delete):
  - Path: `${LOCAL_SWARM_LOCK_DIR:-$PPIA_DATA_DIR/locks/swarm}/breaker_cache/<kind>.<base_id>.json`
  - Shape: `{ts, until_ts?, until_ts_chat?, until_ts_embed?, until_ts_stream?}` keyed by `base_id` only (no base URLs)
    - Keys omitted = no change
    - Values `<= 0` clear that breaker cooldown
  - Purpose: reduces multi-worker flapping by sharing open-breaker cooldowns (and clears) across Uvicorn workers.
  - Written when a breaker opens after repeated failures; cleared on subsequent successful calls (throttled).
  - Optional poll knob (advanced): `LOCAL_SWARM_BREAKER_CACHE_POLL_S` (alias: `PPIA_SWARM_BREAKER_CACHE_POLL_S`).
- Node EWMA snapshots are also persisted (best-effort, safe to delete):
  - Path: `${LOCAL_SWARM_LOCK_DIR:-$PPIA_DATA_DIR/locks/swarm}/state/node_state.<base_id>.json`
  - Contents: compact per-node EWMA latency/TTFT state keyed by `base_id` only (no base URLs).
  - Purpose: reduces cold-start routing churn after restarts (keeps “which node is faster for streams” learning).
  - Active only when file locks are enabled (same gating as metrics persistence).
- Node mode overrides are also persisted (operator-only; safe to delete):
  - Path: `${LOCAL_SWARM_LOCK_DIR:-$PPIA_DATA_DIR/locks/swarm}/state/node_override.<base_id>.json`
  - Shape: `{mode:"enabled|drain|disabled", reason?, updated_utc}` keyed by `base_id` only (no base URLs)
  - Purpose: safe maintenance controls (drain/disable nodes) without editing the node list.
  - Operator API (strict loopback): `GET /api/swarm/nodes/overrides`, `POST /api/swarm/nodes/override`
  - Operator UI: `/admin/swarm` (Node override panel + mode column).
  - Operator CLI (strict loopback):
    - List overrides: `python3 scripts/ppia swarm node-overrides`
    - Drain (prefer other nodes): `python3 scripts/ppia swarm node-override <base_id> --mode drain --reason "maintenance"`
    - Disable (never select): `python3 scripts/ppia swarm node-override <base_id> --mode disabled --reason "offline"`
    - Re-enable: `python3 scripts/ppia swarm node-override <base_id> --mode enabled --reason "back online"`
  - Mode semantics:
    - `enabled`: normal routing
    - `drain`: used only when no enabled nodes remain
    - `disabled`: never used (even as a last resort)
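
The observability surfaces mentioned above (`/api/swarm/metrics` and `/api/swarm/metrics.prom`) can be spot-checked from the web host; the JSON layout beyond the documented `affinity.promoted_total` field is an assumption:

```bash
# Affinity promotions (documented field); detail is gated on loopback/admin.
curl -fsS http://127.0.0.1:8000/api/swarm/metrics | jq '.affinity'
# Prometheus text exposition (TTFT / queue-wait percentiles appear here when available).
curl -fsS http://127.0.0.1:8000/api/swarm/metrics.prom | head -n 20
```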

## Optional: OpenAI-compatible `/v1` gateway on the website

When enabled, the FastAPI app also exposes a small OpenAI-compatible surface backed by the **same Swarm router**
(so clients get the same routing, backpressure, failover, and metrics):

- `GET /v1/models`
- `POST /v1/chat/completions` (supports `stream: true`)
- `POST /v1/embeddings`

Enable (in `aipowerprogressia.com/.env` or your process env):

```bash
LOCAL_OPENAI_PROXY_ENABLED=1
```

Auth + safety defaults:
- Strict loopback requests are allowed by default.
- If you set `LOCAL_OPENAI_PROXY_TOKEN`, a bearer token is required **even on loopback** (recommended when using OpenAI SDKs).
- Non-loopback clients are always token-gated, and must come from a private/loopback IP by default.
  - Setting `LOCAL_OPENAI_PROXY_ALLOW_PUBLIC_CLIENTS=1` relaxes this (not recommended).
- Observability: when enabled, `/v1/*` requests are logged to `ai_runs` (metadata only; no prompt/response stored). Disable with `LOCAL_OPENAI_PROXY_LOG_RUNS=0`.
- Input caps (defense-in-depth): `LOCAL_OPENAI_PROXY_MAX_MESSAGES`, `LOCAL_OPENAI_PROXY_MAX_MESSAGE_CHARS`, `LOCAL_OPENAI_PROXY_MAX_EMBED_ITEMS`, `LOCAL_OPENAI_PROXY_MAX_EMBED_TEXT_CHARS` (defaults align with `AI_MAX_MESSAGE_CHARS`).
- Streaming resilience: when multiple stream candidates exist, `/v1/chat/completions` uses the same TTFT fast-fail contract as `/api/ai/chat/stream` (`LOCAL_SWARM_STREAM_TTFT_TIMEOUT_MS`, default 8000ms; set to `0` to disable).

Quick sanity checks:

```bash
# Disabled (default): 404
curl -i http://127.0.0.1:8000/v1/models

# Enabled:
LOCAL_OPENAI_PROXY_ENABLED=1 bash scripts/run_app.sh
curl -fsS http://127.0.0.1:8000/v1/models | head
```

OpenAI Python SDK example (loopback):

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="local",  # set to your LOCAL_OPENAI_PROXY_TOKEN if configured
)

resp = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Hello from the local swarm gateway."}],
)
print(resp.choices[0].message.content)
```
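
The `/v1/embeddings` surface follows the same OpenAI-compatible shape. A minimal loopback curl sketch (the model name is an illustrative assumption; add `-H 'authorization: Bearer <LOCAL_OPENAI_PROXY_TOKEN>'` if you configured a token):

```bash
curl -fsS http://127.0.0.1:8000/v1/embeddings \
  -H 'content-type: application/json' \
  -d '{"model": "nomic-embed-text", "input": ["hello from the local swarm gateway"]}' \
  | jq '.data[0].embedding | length'
```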

## Ask AI protection knobs (recommended when public-ish)

These are optional server-side controls that help keep the website responsive under load:

- **Expensive request rate bucket** (opt-in): set `AI_RATE_LIMIT_CHAT_EXPENSIVE` / `AI_RATE_LIMIT_CHAT_STREAM_EXPENSIVE` (and windows) to separately limit requests that can multiply backend work (e.g. `use_web_search=1` or `orchestrate=1`).
- **Per-IP concurrent stream cap**: set `AI_CHAT_STREAM_MAX_CONCURRENT_PER_IP` to limit concurrent `/api/ai/chat/stream` connections per client (cross-process; file-lock based). Keep `0` for local dev.
- **Per-IP concurrent chat cap**: set `AI_CHAT_MAX_CONCURRENT_PER_IP` to limit concurrent `/api/ai/chat` requests per client (cross-process; file-lock based). Keep `0` for local dev.
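
For example, in `aipowerprogressia.com/.env` (values are illustrative; keep `0` during local development as noted above):

```bash
AI_CHAT_MAX_CONCURRENT_PER_IP=3
AI_CHAT_STREAM_MAX_CONCURRENT_PER_IP=3
```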

## 5-minute quickstart (local dev)

Prereqs (recommended; one-time):

```bash
cd aipowerprogressia.com
python3 -m venv .venv310
. .venv310/bin/activate
python3 -m pip install -r requirements.txt
```

Note: repo-shipped systemd units reference `.venv310/bin/python` by default.

1) Ensure a local model server is running (default: Ollama):

```bash
curl -fsS http://127.0.0.1:11434/api/tags >/dev/null
# (fallback) curl -fsS http://127.0.0.1:11434/api/version >/dev/null
```

2) Enable swarm (in `aipowerprogressia.com/.env` or your process env):

```bash
LOCAL_SWARM_ENABLED=1
LOCAL_SWARM_DEFAULT_MODE=on
LOCAL_SWARM_USE_FILE_LOCKS=1
```

Compatibility notes:
- `LOCAL_SWARM_MODE=router` is accepted as an alias for enabling the swarm when `LOCAL_SWARM_ENABLED` is unset.
- If you are using an OpenAI-compatible local inference server (vLLM, llama.cpp server, LM Studio, LocalAI, or Ollama `/v1`) and you have not configured `LOCAL_AI_NODES{,_PATH}` yet, you can set:
  - `LOCAL_AI_PROVIDER=openai_compat`
  - `LOCAL_AI_BASE_URL=http://127.0.0.1:11434/v1` (or `LOCAL_AI_BASE_URLS=...` for multiple)
  The swarm will synthesize a default node list from these values.

3) (Optional, multi-node) Configure nodes via a file:

```bash
LOCAL_AI_NODES_PATH=config/local_ai.nodes.json
```

Security note:
- The nodes file is treated as **operator-private config** and is ignored if it is a symlink or **group/world writable**.
- If you create it by hand on systems with a group-writable umask (common: `umask 002`), run `chmod 600 config/local_ai.nodes.json`.
- `python3 scripts/refresh_swarm_nodes.py` already writes the file with `0600`.

Example `config/local_ai.nodes.json`:

```json
[
  {"name":"local","kind":"ollama","baseUrl":"http://127.0.0.1:11434","maxConcurrency":2,"capabilities":["chat","embed"],"tags":["cpu"]},
  {"name":"lan-a","kind":"openai_compat","baseUrl":"http://<LAN_IP>:8000","maxConcurrency":2,"capabilities":["chat","embed"],"tags":["gpu"]}
]
```

If you have a repo-root `cluster_config.sh`, you can refresh the nodes file automatically:

```bash
python3 scripts/refresh_swarm_nodes.py
python3 scripts/probe_swarm_models.py --strict
# If your node list lives in a file, pass it explicitly:
python3 scripts/probe_swarm_models.py --strict --nodes-path "$LOCAL_AI_NODES_PATH"
```

Notes:
- `python3 scripts/refresh_swarm_nodes.py` (and the loopback operator endpoint `POST /api/swarm/nodes/refresh`) writes a private latest-run report to `${LOCAL_SWARM_LOCK_DIR:-$PPIA_DATA_DIR/locks/swarm}/health/swarm_nodes_refresh_latest.json`.
- Operator view (counts only; no host inventory): `GET /api/swarm/nodes/refresh/latest` (and the `/admin/swarm` dashboard).
  - The refresh report includes `ray_diag` (summary only; never host inventory) when Ray-backed host discovery is enabled:
    - `enabled`, `ok`, `error`, `timeout_s`, `host_count`
  - The refresh report also includes `hardware_summary` (counts only) to highlight cluster utilization gaps:
    - `ray.gpu_hosts` / `ray.k210_hosts` vs `discovered.gpu_hosts` / `discovered.k210_hosts`
  - For operator debugging (inventory; strict loopback only): `GET /api/swarm/nodes/coverage` shows per-host model-server presence (useful when Ray reports GPU hosts but discovery finds no GPU model nodes).
    - CLI: `python3 scripts/ppia swarm coverage` (optional: `--max-hosts 40`, `--include-models`)
- By default, refresh preserves previously-known nodes on cluster hosts even if temporarily unreachable. To prune unreachable nodes after hardware/IP reshuffles, run refresh with:
  - `LOCAL_SWARM_NODES_CARRY_UNREACHABLE=0 python3 scripts/refresh_swarm_nodes.py`

Optional (Ray-backed host discovery):
- If your LAN cluster membership is managed by Ray and `cluster_config.sh` is stale, you can include **live Ray membership** as an additional host source:
  - Prereqs (safe-by-default; no Ray dependency inside the website venv):
    - Ensure you have a Ray-enabled Python available at one of:
      - `RAY_PYTHON=/path/to/python` (preferred), or
      - `~/ray-env/bin/python` or `~/ray-env-py310/bin/python3.10` (auto-detected).
    - Ensure Ray can connect to your cluster:
      - set `RAY_ADDRESS=<HEAD_IP>:<PORT>`, or
      - set `RAY_HEAD_IP` / `RAY_PORT`, or
      - keep a correct repo-root `cluster_config.sh` (`HEAD_IP` + `RAY_PORT` are read as data).
  - set `LOCAL_SWARM_DISCOVER_FROM_RAY=1` (and optionally `LOCAL_SWARM_DISCOVER_RAY_TIMEOUT_S=2.6`) before running `refresh_swarm_nodes.py`, or
  - run discovery directly with `python3 scripts/discover_swarm_nodes.py --from-ray ...`.
  - Note: the repo-shipped `swarm-nodes-refresh.service` (system + user profiles) enables Ray-backed host discovery by default.
    - To force config-only discovery, override with `LOCAL_SWARM_DISCOVER_FROM_RAY=0` in a systemd drop-in.
  - Optional: `swarm-smoke.timer` (installed by the `swarm`/`core` systemd profiles) runs a lightweight periodic smoke check
    (services + swarm + ray + gating + metrics + automation; includes a tiny `ai_chat` + `route→draft→safety`) and logs results to journald.

Optional (Ray gateway probing for localhost-only model servers):

- If each node runs its model server bound to `127.0.0.1` (recommended), set `LOCAL_SWARM_NODES_REFRESH_MODE=via-ray` when running `refresh_swarm_nodes.py`.
  - This probes `http://127.0.0.1:<port>` **inside** Ray tasks on each host via `ray_swarm_gateway.py`, then writes nodes with `"viaRay": true` and `"localBaseUrl": "http://127.0.0.1:<port>"`.
- Requires:
  - `ray-swarm-gateway.service` running on the Ray head (default `LOCAL_SWARM_RAY_GATEWAY_URL=http://127.0.0.1:9892`), and
  - `LOCAL_SWARM_RAY_ENABLED=1` to route requests to `"viaRay": true` nodes (when off, `viaRay` nodes are ignored).
- Safety note: refresh output and latest reports remain inventory-safe (hosts are not printed; reports keep counts only).
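
Putting that together, a typical via-Ray refresh from the web host looks like:

```bash
# Probes http://127.0.0.1:<port> inside Ray tasks; requires ray-swarm-gateway.service on the head.
LOCAL_SWARM_NODES_REFRESH_MODE=via-ray python3 scripts/refresh_swarm_nodes.py
python3 scripts/ppia swarm models --probe
```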

4) Start the app and verify:

```bash
bash scripts/run_app.sh
curl -fsS 'http://127.0.0.1:8000/api/swarm/status?detail=1' | head
python3 scripts/ppia overview
```

## Operator UI (loopback/admin)

- Swarm router dashboard: `http://127.0.0.1:8000/admin/swarm`
  - Nodes, models union, metrics, route previews, smoke/warmup/nodes-refresh status.
- Automation dashboard: `http://127.0.0.1:8000/admin/automation`
  - Queue depth + worker heartbeats.
  - Run or enqueue built-in workflows (including `wf_swarm_multiagent_task_v1` route→draft→safety).
- AI runs explorer: `http://127.0.0.1:8000/admin/ai-runs`
  - Streaming calls include `X-AI-Run-ID` in the response headers.
  - `ai_runs` now also stores first-class swarm routing columns (`swarm_base_id`, `swarm_kind`, `swarm_op`) for easier filtering/debugging (no host/IP exposure).
    - The Swarm dashboard links node rows directly to `admin/ai-runs?swarm=...`.
  - Swarm routing metadata is surfaced under the “Swarm:” line; for streaming runs the stored timings include:
    - `swarm_route` (candidate base-id list), and
    - `swarm_used_base_id` (the base-id that actually produced the first token when fallback/retry occurs).
  - The Ask AI widget on `127.0.0.1`/`localhost` also appends `used=...` to its debug line after generation by fetching the loopback-only run summary endpoint:
    - `GET /api/ai/runs/{run_id}` (detail=0; timings only; strict loopback).

## Operator CLI (recommended after reboot)

These commands are convenient wrappers over the swarm operator endpoints. Some actions are **loopback-only** by design.

```bash
# Public-safe status (works anywhere)
python3 scripts/ppia swarm status

# Detail (loopback/admin only)
python3 scripts/ppia swarm status --detail 1

# Probe model inventories (strict loopback only; writes operator caches)
python3 scripts/ppia swarm models --probe

# List persisted node mode overrides (strict loopback only)
python3 scripts/ppia swarm node-overrides

# Disable a flapping node by base_id (strict loopback only; reversible)
python3 scripts/ppia swarm node-override 0123abcd4567 --mode disabled --reason "probe timed out"

# Refresh LOCAL_AI_NODES_PATH from cluster_config.sh + live Ray membership (strict loopback only)
python3 scripts/ppia swarm refresh-nodes --source both

# Optional (safer): build `viaRay` nodes by probing localhost-only model servers from inside Ray tasks.
# This avoids exposing Ollama/vLLM ports on the LAN, but requires `ray-swarm-gateway.service` on the head.
python3 scripts/ppia swarm refresh-nodes --source ray_via

# Smoke check (strict loopback only): services + swarm + ray + automation
python3 scripts/ppia swarm smoke --full

# Post-reboot self-heal (strict loopback only): refresh nodes + probe models
python3 scripts/ppia doctor --fix-swarm

# Post-reboot self-heal (strict loopback only): prune unreachable nodes (useful after IP/DHCP reshuffles)
python3 scripts/ppia doctor --fix-swarm --fix-swarm-prune

# Optional (slower) post-reboot warmup: also preloads already-present models across nodes
python3 scripts/ppia doctor --fix-swarm --fix-swarm-warmup

# Route preview (strict loopback only): see candidate ordering for an op/model/tags combination
python3 scripts/ppia swarm route-preview --op chat --model llama3.1:8b --tags gpu

# Optional: clear split-brain systemd state when the system app unit is active but the user unit is failed/active
python3 scripts/ppia doctor --fix-systemd-app-conflict

# Optional: clear split-brain timer state when both system + user timers are enabled (avoids duplicate runs)
python3 scripts/ppia doctor --fix-systemd-timer-split-brain
```

## Optional: Codex CLI MCP tools (Swarm + Agent Jobs)

This workspace includes a small local STDIO MCP server under `local-mcp-server/` (repo root) that can:

- call PPIA loopback endpoints (Swarm status/models), and
- dispatch token-gated agent jobs via the website (`/api/ai/agent/*`) without printing per-run tokens in chat logs.

Setup (run from the repo root):

```bash
cd local-mcp-server
npm run build

# Register once (use an absolute path on your machine)
codex mcp add local-mcp-server -- node /ABS/PATH/local-mcp-server/build/index.js
```

Configuration (env):

- `PPIA_BASE_URL` (default: `http://127.0.0.1:8000`)
- `PPIA_MCP_ALLOW_NON_LOOPBACK=1` to allow private/loopback IP bases (LAN)
- `PPIA_MCP_ALLOW_HOSTNAMES=1` (only when non-loopback is enabled) to allow hostname bases

Tools (names):

- `health_check`, `list_models`, `list_agents`
- `swarm_config_diag`, `swarm_metrics`, `swarm_route_preview`
- `swarm_refresh_nodes`, `swarm_warmup` (operator actions; disabled by default in the MCP server)
- `dispatch_task`, `get_task_result`, `cancel_task`

Safety posture:
- Loopback-only by default; rejects hostnames unless explicitly allowed.
- Stores per-run tokens in memory (tokens are not returned by tools; MCP server restart loses them).

Workflow discovery:
- The site exposes `GET /api/ai/agent/workflows` (gated by the same policy as enqueue) so MCP clients can stay in sync with the server’s allowlisted agent workflows and input schemas.
- `dispatch_task` supports an optional `workflow_id` field; invalid/unsupported values are rejected server-side (allowlist enforced).
- Optional: MCP can persist per-run tokens to a private file for post-reboot polling/cancel (opt-in via `PPIA_MCP_PERSIST_AGENT_TOKENS=1`). This is convenient, but treat the token cache as sensitive.
- Operator actions via MCP are disabled by default; set `PPIA_MCP_ENABLE_OPERATOR_ACTIONS=1` to allow `swarm_refresh_nodes` and `swarm_warmup`.

### Post-reboot checklist (single host + LAN cluster)

Safe-by-default sequence (run on the web host / operator box):

1) Verify the app and services health matrix:

```bash
curl -fsS http://127.0.0.1:8000/api/services/status | head
python3 scripts/ppia overview
# Optional: fail fast on slow networks
python3 scripts/ppia overview --timeout-s 8

# Optional: validate the Ray cluster deterministically (run from repo root; LAN-dependent)
cd .. && bash ray_validate_cluster.sh && cd aipowerprogressia.com
```

2) Verify swarm routing is effective and file-lock backpressure is healthy:

```bash
python3 scripts/ppia swarm status
python3 scripts/ppia swarm status --detail 1
```

3) If nodes change after reboot (DHCP/Ray membership drift), self-heal the node list + inventories (strict loopback):

```bash
python3 scripts/ppia doctor --fix-swarm
```

If you recently changed hardware, swapped NICs, or your LAN IPs/DHCP leases moved, prune unreachable/stale nodes during the refresh:

```bash
python3 scripts/ppia doctor --fix-swarm --fix-swarm-prune
```

4) Optional: warm already-present models to reduce post-reboot tail latency (strict loopback; avoids model pulls):

```bash
python3 scripts/ppia doctor --fix-swarm --fix-swarm-warmup
```

If `python3 scripts/ppia doctor` reports a systemd app service conflict (system unit active but user unit failed/active), clear it with:

```bash
python3 scripts/ppia doctor --fix-systemd-app-conflict
```

If `python3 scripts/ppia doctor` reports duplicate swarm/automation timers active in both system + user scopes, clear it with:

```bash
python3 scripts/ppia doctor --fix-systemd-timer-split-brain
```

5) If you installed systemd timers, confirm they are running (choose one scope; do not run both):

```bash
# System scope (typical on a server)
sudo systemctl status swarm-nodes-refresh.timer swarm-models-probe.timer swarm-smoke.timer swarm-warmup.timer automation-worker.timer

# User scope (typical on a dev box)
systemctl --user status swarm-nodes-refresh.timer swarm-models-probe.timer swarm-smoke.timer swarm-warmup.timer automation-worker.timer
```

Warmup scheduling note (systemd):
- System scope defaults:
  - `swarm-nodes-refresh.timer`: ~45 seconds after boot, then every 30 minutes (keeps the nodes file in sync with `cluster_config.sh` and (optionally) Ray membership).
  - `swarm-models-probe.timer`: ~75 seconds after boot, then every 10 minutes (keeps model inventories fresh).
  - `swarm-smoke.timer`: ~4 minutes after boot, then every 10 minutes (light end-to-end smoke checks).
  - `swarm-warmup.timer`: ~2 minutes after boot (+ up to ~2 minutes randomized delay), then once per day.
- User scope defaults are slightly slower (gives the app time to start):
  - `swarm-nodes-refresh.timer`: ~2 minutes after boot, then every 30 minutes.
  - `swarm-models-probe.timer`: ~3 minutes after boot, then every 10 minutes.
  - `swarm-smoke.timer`: ~4 minutes after login/user-manager start, then every 10 minutes.
  - `swarm-warmup.timer`: ~6 minutes after boot (+ up to ~10 minutes randomized delay), then once per day.
- This warmup is best-effort: it only preloads already-present models (no pulls) to reduce post-reboot tail latency.
- The warmup unit orders after `swarm-nodes-refresh.service` + `swarm-models-probe.service` to avoid warming stale/empty node inventories.
- Warmup also publishes a best-effort probe-cache hint for faster routing; it avoids clobbering a larger cached inventory with a suspiciously tiny list from a transient upstream glitch.

Cluster node sanity checks (run from the web host; do not expose these ports to the public internet):
- Ollama node: `curl -fsS http://<LAN_IP>:11434/api/tags >/dev/null`
- OpenAI-compatible node: `curl -fsS http://<LAN_IP>:8000/v1/models >/dev/null` (or `:8001` depending on your server)

### LAN node setup (common gotcha: localhost-only bind)

Most local model servers default to listening on `127.0.0.1` only. For a multi-node swarm, you have two safe patterns:

1) **LAN-reachable model ports (simple)**
- each model node listens on a LAN-reachable interface, and
- you firewall the port to your LAN (Ollama typically has no auth).

2) **Ray Swarm Gateway + `viaRay` nodes (safer; no LAN model ports)**
- each model node can keep the model server bound to localhost (`127.0.0.1`), and
- the web host uses Ray task placement to execute the HTTP call *from inside the target node*.
- this avoids exposing model ports on the LAN, at the cost of Ray overhead and a gateway process on the head.

Quick “add a node” checklist (operator):

1) Decide which pattern you are using:
   - **LAN-reachable**: bind the model server to a LAN interface and firewall to your LAN.
   - **viaRay (recommended)**: keep the model server bound to loopback and let Ray place the call on the node.
2) Ensure the node can run Ray tasks (it must be a Ray worker or the head).
3) Ensure the Ray Swarm Gateway is up on the Ray head:
   - `sudo systemctl status ray-swarm-gateway.service --no-pager || true`
   - `curl -fsS http://127.0.0.1:9892/health`
   - (if `/health` fails) `sudo journalctl -u ray-swarm-gateway.service -n 200 --no-pager || true`
4) Refresh and probe from the web host (strict loopback):
   - LAN-reachable nodes: `python3 scripts/ppia swarm refresh-nodes --source both`
   - viaRay nodes: `python3 scripts/ppia swarm refresh-nodes --source ray_via`
   - Then: `python3 scripts/ppia swarm models --probe`
5) If IPs changed after hardware/DHCP reshuffles, prune stale entries:
   - `python3 scripts/ppia swarm refresh-nodes --source both --prune` (or `--source ray_via --prune`)
6) If a single node is slow/unreliable, drain it during investigation:
   - `python3 scripts/ppia swarm node-override <base_id> --mode drain --reason "investigating latency"`

Ollama node example:
- Ensure it listens on LAN: set `OLLAMA_HOST=0.0.0.0:11434` (or `OLLAMA_HOST=<LAN_IP>:11434`) and restart Ollama.
- Minimal systemd drop-in:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment=OLLAMA_HOST=0.0.0.0:11434
```

vLLM node example (OpenAI-compatible):
- Run with a LAN bind: `vllm serve <model> --host 0.0.0.0 --port 8001 …`
- Repo helper: `vllm_serve.sh` defaults to `VLLM_HOST=127.0.0.1`; set `VLLM_HOST=0.0.0.0` on LAN nodes.

Loopback-only examples (for `viaRay` nodes):
- Ollama: keep the default loopback bind (`OLLAMA_HOST=127.0.0.1:11434`) and ensure the port is *not* exposed on the LAN.
- vLLM: `vllm serve <model> --host 127.0.0.1 --port 8001 …` (same `--host` rule for LM Studio / llama.cpp server).

#### Ray Swarm Gateway (localhost-only model servers)

Prereqs:
- Ray control plane running (`ray-head.service` + workers).
- Gateway running on the Ray head (loopback-only).

Install + start the gateway (from repo root on the head):

```bash
sudo install -m 0644 ray-swarm-gateway.service /etc/systemd/system/ray-swarm-gateway.service
sudo systemctl daemon-reload
sudo systemctl enable --now ray-swarm-gateway.service
curl -fsS http://127.0.0.1:9892/health
# Logs (when health fails):
sudo journalctl -u ray-swarm-gateway.service -n 200 --no-pager || true
```

Note: if you update `ray_swarm_gateway.py`, restart the gateway to pick up changes:

```bash
sudo systemctl restart ray-swarm-gateway.service
```

Alternative (recommended): install via the site’s unit installer (includes the gateway in `core`/`swarm` profiles):

```bash
cd aipowerprogressia.com
sudo bash scripts/install_systemd_units.sh --scope system --profile core --apply
sudo systemctl enable --now ray-swarm-gateway.service
```

User-scope alternative (no sudo; useful for local dev):

```bash
cd aipowerprogressia.com
bash scripts/install_systemd_units.sh --scope user --profile swarm --apply
systemctl --user enable --now ray-swarm-gateway.service
curl -fsS http://127.0.0.1:9892/health
```

Gateway operator endpoints (on the head; loopback by default):

- `GET http://127.0.0.1:9892/health`
- `GET http://127.0.0.1:9892/metrics` (JSON; inflight + per-op latency/ok-rate)
- `GET http://127.0.0.1:9892/metrics.prom` (Prometheus text)

Gateway security + bind notes:
- Default bind is loopback (`RAY_SWARM_GATEWAY_BIND=127.0.0.1`) to avoid accidental LAN exposure.
- To bind to a private LAN IP, you must set `RAY_SWARM_GATEWAY_ALLOW_PRIVATE_BIND=1` and a non-empty `RAY_SWARM_GATEWAY_TOKEN` (sent as `x-ppia-ray-token`). When LAN-bound, `/health` and metrics endpoints require the token. Non-private binds are refused.
- Default defense-in-depth: `RAY_SWARM_GATEWAY_LOCAL_BASE_LOOPBACK_ONLY=1` requires `localBaseUrl` to be loopback (e.g., `http://127.0.0.1:11434`). Set to `0` only if you truly need node-internal RFC1918 hops.
- Optional SSRF guardrail: restrict which upstream ports the gateway will call via `RAY_SWARM_GATEWAY_ALLOWED_PORTS` (e.g. `11434,8000,8001`). When unset, any port is allowed (backwards-compatible).
- HTTP connection reuse: the gateway currently uses stdlib `urllib.request` which includes `Connection: close`, so keep-alive pooling is not implemented. Prefer batching endpoints (notably Ollama `/api/embed`) to avoid N× small requests.

Placement notes:
- When available, the gateway uses Ray’s `NodeAffinitySchedulingStrategy` to place tasks by NodeID resolved from `nodeIp` (cached briefly).
- Fallback for older Ray versions uses a custom resource key (`node:<ip>`).
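
For context, a minimal sketch of the general Ray pattern this describes (illustrative only, assuming a reachable Ray cluster; the gateway's real implementation lives in `ray_swarm_gateway.py` and may differ):

```python
# Illustrative sketch only (not the gateway's actual code): pin an HTTP call to the
# Ray node that owns a given LAN IP, so 127.0.0.1 inside the task refers to that
# node's loopback model server.
import urllib.request

import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy


@ray.remote(num_cpus=0)
def fetch_on_node(url: str) -> bytes:
    # Executes on the selected node; loopback URLs hit that node's model server.
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read()


def node_id_for_ip(node_ip: str) -> str:
    # Resolve a Ray NodeID from a node's IP (NodeManagerAddress).
    for node in ray.nodes():
        if node.get("Alive") and node.get("NodeManagerAddress") == node_ip:
            return node["NodeID"]
    raise RuntimeError(f"no live Ray node with IP {node_ip}")


ray.init(address="auto")
strategy = NodeAffinitySchedulingStrategy(node_id=node_id_for_ip("<LAN_IP>"), soft=False)
ref = fetch_on_node.options(scheduling_strategy=strategy).remote("http://127.0.0.1:11434/api/version")
print(ray.get(ref, timeout=30))
```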

Enable Ray-backed swarm execution on the website:
- set `LOCAL_SWARM_RAY_ENABLED=1`
- back-compat: `AI_RAY_ENABLED=1` also enables Ray exec (no-op unless a node is `viaRay: true`)
- optional: `LOCAL_SWARM_RAY_GATEWAY_URL=http://127.0.0.1:9892`
- `LOCAL_SWARM_RAY_GATEWAY_TOKEN=...` (shared secret; sent as `x-ppia-ray-token`; required when `LOCAL_SWARM_RAY_GATEWAY_URL` is non-loopback)

Configure nodes with `viaRay` (example):

```json
[
  {
    "name": "gpu-a",
    "kind": "ollama",
    "baseUrl": "http://<LAN_IP>:11434",
    "localBaseUrl": "http://127.0.0.1:11434",
    "viaRay": true,
    "rayNodeIp": "<LAN_IP>",
    "tags": ["gpu"],
    "maxConcurrency": 2
  }
]
```

Notes:
- `baseUrl` stays unique per node (used for routing + concurrency + metrics); it does **not** need to be reachable from the web host when `viaRay` is enabled.
- `localBaseUrl` is used *inside* the Ray task on that node (defaults to `http://127.0.0.1:<port>` if omitted).
- Streaming: `viaRay` nodes support streaming via the Ray Swarm Gateway (`POST /v1/exec/chat/stream`). When a node is reachable directly, the router may still stream via direct HTTP, but `viaRay` enables fully-local-on-node model servers (bound to localhost) while still supporting streaming on the website (`/api/ai/chat/stream`, `/v1/chat/completions`).

## Where it lives (code map)

- Config: `app/swarm/config.py`
- Router + node state: `app/swarm/router.py`
- Ray execution client (website → gateway): `app/swarm/ray_exec.py`
- Cross-process lock semaphores (optional): `app/swarm/locks.py`
- Drop-in integration wrappers: `app/swarm/integration.py`
- Status endpoint (public-safe; detail gated): `app/swarm/status.py`
- OpenAI-compatible adapter (for `/v1` nodes): `app/swarm/providers/openai_compat.py`

Integration points:
- The legacy app (`app/_app.py`) shadows selected `app/local_ai.py` functions with swarm wrappers when imported.
- Ask AI (`app/ai_routes.py`) supports per-request swarm hints via request JSON.
- Automation (`app/automation.py`) uses the same `call_local_ai` entrypoint, so queued workflows can benefit from swarm routing too.

### Orchestrated Ask AI (advanced; opt-in beyond loopback)

Ask AI supports an optional multi-step **route → draft → safety** orchestration mode on the normal chat endpoint:

- `POST /api/ai/chat` with `"orchestrate": true` (aliases: `"orch": true`, `"swarm_orchestrate": true`, `"swarm_orch": true`)
- Optional (higher cost): set `"orchestrate": "v2"` (or `"orch": "v2"`) to run **route → draft → critic → revise → safety**.
  - Use this when you want higher-quality, more self-checked answers and you can afford extra local model calls.
  - `critic` returns strict JSON feedback; `revise` produces a revised answer before the final safety pass.
- Streaming note: orchestration is supported on `POST /api/ai/chat/stream`, but it streams **only** the final safety-reviewed response (no unreviewed draft tokens).
  - Expect higher TTFT vs single-pass streaming (because the server runs route/draft/(critic/revise)/safety first).
  - Use `POST /api/ai/chat` when you need a structured JSON response body and/or want to inspect step metadata (loopback/admin only by default).
- CLI support:
  - `python3 scripts/ppia --orch v1 ask "..."` (route → draft → safety; final-only stream)
  - `python3 scripts/ppia --orch v2 ask "..."` (route → draft → critic → revise → safety; final-only stream)
- Default security posture: honored only for strict loopback/admin requests; public/forwarded requests ignore the flag.
- Optional widening (operator-controlled):
  - `LOCAL_SWARM_ORCH_ALLOW_AUTH=1` enables orchestration for authenticated user sessions (non-admin).
  - `LOCAL_SWARM_ORCH_ALLOW_PUBLIC=1` enables orchestration for all callers (not recommended on the public internet).
  - `bash scripts/run_app.sh` will also read these keys from `.env` when they are not already set in the environment.
- When enabled, the handler forces swarm routing on for the request and builds a compact evidence packet from already-fetched
  retrieval context (docs/grid/web/resources) before drafting.
- Optional per-step overrides for v2:
  - Models: `LOCAL_SWARM_ORCH_CRITIC_MODEL`, `LOCAL_SWARM_ORCH_REVISE_MODEL` (also supports the `PPIA_` aliases)
  - Timeouts: `LOCAL_SWARM_ORCH_TIMEOUT_S_CRITIC`, `LOCAL_SWARM_ORCH_TIMEOUT_S_REVISE` (seconds; unset uses the normal local-ai timeout)
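
A loopback sketch of the non-streaming orchestrated call (the `message` field name is an assumption; check `app/ai_routes.py` for the exact request schema):

```bash
# Strict loopback/admin by default; "message" is an assumed field name.
curl -sS -X POST http://127.0.0.1:8000/api/ai/chat \
  -H 'content-type: application/json' \
  -d '{"message": "Summarize the swarm routing design.", "orchestrate": "v2"}' | jq
```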

## Architecture

```mermaid
flowchart TB
  U[Browser / PWA] -->|/api/ai/chat<br/>/api/ai/chat/stream| API[FastAPI app<br/>app/main.py + app/ai_routes.py]
  API -->|local_ai wrappers| SW[Swarm Router<br/>app/swarm/router.py]
  SW -->|Ollama HTTP| N1[(Ollama node A<br/>127.0.0.1:11434)]
  SW -->|OpenAI-compatible HTTP| N2[(OpenAI-compat node B<br/>LAN 192.168.x.y:8000)]
  API -->|operator visibility| ST[/api/swarm/status<br/>public-safe detail gating/]
  API -->|health matrix| SS[/api/services/status/]
  API -->|admin-only queue| AU[Automation runs<br/>/api/automation/*]
  AU --> SW
```

## Configuration

### Enable / disable

- `LOCAL_SWARM_ENABLED=1` enables swarm routing globally.
- `LOCAL_SWARM_DEFAULT_MODE=on|off` controls the default behavior when a request does not explicitly specify a swarm hint.

Per-request override (Ask AI):
- `POST /api/ai/chat` body may include `"swarm": true|false` (or `"use_swarm": ...`).
- `POST /api/ai/chat/stream` body may include `"swarm": true|false` (or `"use_swarm": ...`).
- Optional routing preference tags (Ask AI):
  - `POST /api/ai/chat` body may include `"swarm_tags": "gpu"` (also accepts `"prefer_tags"`, lists, or comma strings).
  - `POST /api/ai/chat/stream` body may include `"swarm_tags": "gpu"` (also accepts `"prefer_tags"`, lists, or comma strings).

Notes:
- Public/remote callers may opt out (`"swarm": false`) for a single request.
- Force-enabling swarm (`"swarm": true`) is honored only for strict loopback or admin identity (otherwise the hint is ignored and `LOCAL_SWARM_DEFAULT_MODE` applies).
- Tag preferences (`swarm_tags` / `prefer_tags`) are honored only for strict loopback or admin identity (otherwise ignored).
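
For example, the hint fields of a chat request body look like this (the rest of the chat body is omitted; both hints are honored only for loopback/admin callers as noted above):

```json
{
  "swarm": true,
  "swarm_tags": ["gpu"]
}
```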

### Concurrency + timeouts

- `LOCAL_SWARM_MAX_PARALLEL_TASKS=4` caps in-flight swarm tasks.
  - With `LOCAL_SWARM_USE_FILE_LOCKS=1` (default), this is enforced **across Uvicorn workers** (cross-process).
  - Without file locks (or on non-POSIX platforms), it falls back to a **per-process** limit.
- `LOCAL_SWARM_MAX_PARALLEL_EMBED_TASKS=4` optionally caps **embedding** concurrency separately (must be `<= LOCAL_SWARM_MAX_PARALLEL_TASKS`).
  - Default is “no separate pool” (embed max == global max).
  - Set this lower (e.g. `1` or `2`) to prevent background embedding bursts from starving interactive chat.
- `LOCAL_SWARM_MAX_PARALLEL_STREAM_TASKS=4` optionally caps **streaming** concurrency separately (must be `<= LOCAL_SWARM_MAX_PARALLEL_TASKS`).
  - Default is “no separate pool” (stream max == global max).
  - Set this lower (e.g. `1`) to prevent long streams from starving chat/embed work.
- `LOCAL_SWARM_MAX_PARALLEL_BG_TASKS=4` optionally caps **background/worker** swarm traffic separately (must be `<= LOCAL_SWARM_MAX_PARALLEL_TASKS`).
  - Default is “no separate pool” (bg max == global max).
  - Set this lower (e.g. `1`) to prevent automation/background jobs from starving interactive Ask AI usage.
  - Compatibility alias: `LOCAL_SWARM_QUEUE_CONCURRENCY` is accepted as an alias for BG concurrency when `LOCAL_SWARM_MAX_PARALLEL_BG_TASKS` is not set.
- `LOCAL_SWARM_EMBED_BATCH_SIZE=4` controls embedding batching when available (payload `"input": [...]`).
  - OpenAI-compatible nodes: `/v1/embeddings` (batch).
  - Ollama nodes: `/api/embed` (batch, best-effort; falls back to single-item `/api/embeddings` when unsupported).
    - Stable client errors from `/api/embed` list inputs are negative-cached per-node for ~10 minutes to avoid repeated failed batch attempts.
  - `1` disables batching and always embeds one string per request.
  - `ollama_embed_many()` respects the provided `timeout` **across the full call** (batch + fallback) to avoid holding embed/global slots indefinitely.
- `LOCAL_SWARM_EMBED_SHARD_WORKERS=1` optionally fans out `ollama_embed_many()` chunks across multiple nodes for higher LAN throughput.
  - When `>1` and there are multiple chunks, the router runs chunk embedding in a small thread pool and preserves output order.
  - Per-node concurrency limits still apply (`maxConcurrency` + cross-process file locks when enabled).
- `LOCAL_SWARM_QUEUE_WAIT_MS=150` controls how long the router waits for a global slot before returning `429`.
  - Set `0` to fail fast under load.
- `LOCAL_SWARM_QUEUE_WAIT_MS_BG=150` optionally sets a different queue-wait budget for **background/worker** traffic.
  - Used only when callers mark the context as background (`swarm_background_context(True)`).
  - Default is `LOCAL_SWARM_QUEUE_WAIT_MS` (same behavior).
- `LOCAL_SWARM_NODE_MAX_CONCURRENCY=2` default per-node concurrency cap.
- `LOCAL_SWARM_REQUEST_TIMEOUT_MS=120000` bounds per-attempt timeouts in the router.
- Streaming behavior (Ask AI): `POST /api/ai/chat/stream` uses an **async upstream stream** (cancellation-aware) so client disconnects stop upstream work and release swarm slots promptly. The router enforces a true **wall-clock** cap using the same timeout budget (prevents “infinite stream” nodes from holding stream/global slots forever).
- `LOCAL_AI_REMOTE_TIMEOUT=30` bounds total request time (seconds) for **LAN/remote** Ollama nodes (clamped to `LOCAL_AI_TIMEOUT`).
  - If you see `502: timed out` on remote nodes during grounded/structured requests, increase this (e.g. `30` → `45`).
- `LOCAL_SWARM_HEALTHCHECK_INTERVAL_MS=30000` is used as a TTL for lightweight health/model probe caching (router probes are still request-driven).
- `LOCAL_SWARM_PROBE_MODELS_ON_ROUTE=0|1` (default `0`) opt-in request-path probing of per-node model inventories to avoid obvious model mismatches.
  - Recommended default is `0` (operators can refresh inventories explicitly via `/api/swarm/models?probe=1`).
- `LOCAL_SWARM_STRICT_MODEL_FILTER=0|1` (default `0`) opt-in to **drop** endpoints that appear not to support the requested model when at least one other candidate appears to support it.
  - Useful on heterogeneous clusters where “missing model” is a hard mismatch (reduces wasted retries and accidental model pulls).
  - Keep disabled if you expect model inventories to be incomplete/stale and want maximum fallback.
- `LOCAL_SWARM_PREFER_TAGS_DEFAULT=gpu` optionally biases routing toward nodes tagged with `gpu` when no per-request tag hint is provided.
  - This is a preference (soft boost), not a hard requirement; the router still falls back to other healthy nodes.
- `LOCAL_SWARM_PREFER_TAGS_CHAT=gpu` optionally biases **chat + streaming** routing when no per-request tag hint is provided.
  - Use this when you want chat/streams to prefer GPU nodes but keep embeddings elsewhere.
- `LOCAL_SWARM_PREFER_TAGS_EMBED=cpu` optionally biases **embedding** routing when no per-request tag hint is provided.
  - Common pattern on mixed clusters: `PREFER_TAGS_CHAT=gpu` + `PREFER_TAGS_EMBED=cpu` to keep GPU capacity for generation/streaming.
- `LOCAL_SWARM_ORCH_PREFER_TAGS_ROUTE|DRAFT|SAFETY` optionally biases the **multi-agent pipeline** (`route → draft → safety`) by step.
  - Used by the operator endpoints and automation workflow `wf_swarm_multiagent_task_v1`.
  - Example pattern to better utilize mixed CPU/GPU clusters:
    - `LOCAL_SWARM_ORCH_PREFER_TAGS_ROUTE=cpu` (fast routing/classification)
    - `LOCAL_SWARM_ORCH_PREFER_TAGS_DRAFT=gpu` (heavier generation)
    - `LOCAL_SWARM_ORCH_PREFER_TAGS_SAFETY=cpu` (lightweight safety check)
- `LOCAL_SWARM_ORCH_DEFAULT_STEP_TAGS=1` (default `1`) enables built-in step tag preferences when step tags are unset:
  - route → `cpu`, draft → `gpu`, safety → `cpu`
  - Set `0` to disable built-in defaults and rely only on explicit `LOCAL_SWARM_ORCH_PREFER_TAGS_*` or request-provided tags.
- `LOCAL_SWARM_ORCH_STRICT_STEP_TAGS=1` (default `1`) makes step tags **try matching-tag nodes first** (ordering, not hard failure).
  - Keeps the default intent (route/safety on `cpu`, draft on `gpu`) on heterogeneous clusters when both node types support the same model.
  - Still falls back to other healthy nodes if no matching nodes exist or all matching nodes fail/saturate.
- `LOCAL_SWARM_ORCH_ROUTE_MODEL|DRAFT_MODEL|SAFETY_MODEL` optionally overrides the **model name** used per step.
  - This is useful when your “fast” model only exists on GPU nodes but you still want routing/safety on CPU nodes.
  - Example:
    - `LOCAL_SWARM_ORCH_ROUTE_MODEL=llama3.2:3b`
    - `LOCAL_SWARM_ORCH_SAFETY_MODEL=llama3.2:3b`
- `LOCAL_SWARM_LOG_LEVEL=info` controls swarm router logs (`debug|info|warn|error`). Logs are base-id only (never node URLs).
- `LOCAL_SWARM_USE_FILE_LOCKS=1` enables cross-process backpressure using `flock` (Linux/macOS). Recommended when `WORKERS>1`.
- `LOCAL_SWARM_LOCK_DIR=` optionally overrides where lock files live (default: `$PPIA_DATA_DIR/locks/swarm` when set via env or `.env`, otherwise `<repo>/data/locks/swarm`).
- `PPIA_AI_API_MAX_BODY_BYTES=1048576` (default 1MiB) hard-caps request body size for `/api/ai/`, `/api/local-ai/`, and `/api/swarm/` (defense-in-depth against accidental mega-prompts / DoS).
- `LOCAL_AI_MAX_RESPONSE_BYTES=2000000` caps **legacy non-swarm** Local AI JSON responses (defense-in-depth). The swarm router uses its own upstream response caps (`LOCAL_SWARM_MAX_UPSTREAM_RESPONSE_BYTES`).
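
Putting several of these together, an illustrative `.env` sketch for a small mixed CPU/GPU cluster (all values are assumptions to tune for your hardware, not repo recommendations):

```bash
LOCAL_SWARM_ENABLED=1
LOCAL_SWARM_USE_FILE_LOCKS=1
LOCAL_SWARM_MAX_PARALLEL_TASKS=4
LOCAL_SWARM_MAX_PARALLEL_STREAM_TASKS=2
LOCAL_SWARM_MAX_PARALLEL_EMBED_TASKS=2
LOCAL_SWARM_MAX_PARALLEL_BG_TASKS=1
LOCAL_SWARM_QUEUE_WAIT_MS=150
LOCAL_SWARM_REQUEST_TIMEOUT_MS=120000
LOCAL_SWARM_PREFER_TAGS_CHAT=gpu
LOCAL_SWARM_PREFER_TAGS_EMBED=cpu
```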

### Nodes (single host or LAN cluster)

Configure nodes via one JSON env var:

- Ollama example:
  - `LOCAL_AI_NODES='[{"name":"ollama-local","kind":"ollama","baseUrl":"http://127.0.0.1:11434","maxConcurrency":2,"capabilities":["chat","embed"]}]'`
- OpenAI-compatible example:
  - `LOCAL_AI_NODES='[{"name":"vllm-a","kind":"openai_compat","baseUrl":"http://127.0.0.1:8000/v1","apiKeyEnv":"VLLM_A_API_KEY","maxConcurrency":2,"capabilities":["chat","embed"],"models":["llama-3.1-8b-instruct"]}]'`
    - Note: `baseUrl` is normalized to `http(s)://host:port` internally; the provider re-adds `/v1`.

Schema (per node):
- `name` (string, optional)
- `kind` / `provider` (string, optional): `ollama` (default) or `openai_compat`
- `baseUrl` (string, required): **must be `http(s)://host:port`**
- `maxConcurrency` (int, optional)
- `weight` (float, optional)
- `capabilities` (list or comma string): `chat`, `embed` (and `vision` reserved)
- `models` (optional list): restrict the node to specific model names
- `modelsPinned` (optional bool): when `true`, node refresh preserves the existing `models` allowlist instead of replacing it from discovery.
  - Use this if you intentionally curate a narrow allowlist (e.g., to keep a node from serving large models).
- `modelAliases` (optional object): map a requested model name to a node-specific model id
  - Use this when different nodes expose different model IDs (common across Ollama vs OpenAI-compatible servers).
  - Shape: `{"requested":"node_model", ...}` (exact match; if the requested model is tagged like `name:latest`, the router also tries a `name` lookup).
- `apiKeyEnv` (optional string): **environment variable name** containing a Bearer token for OpenAI-compatible nodes.
  - This is never stored in the nodes JSON file. Set the actual secret only in your environment or `.env`.
  - If unset, the router falls back to `LOCAL_AI_API_KEY` (global).
  - Optional defense-in-depth: set `LOCAL_SWARM_API_KEY_ENV_ALLOWLIST=...` to restrict which env var names may be used by `apiKeyEnv`.
    - When an allowlist is set and `apiKeyEnv` is not in it, **no auth header is sent** (and there is no fallback to `LOCAL_AI_API_KEY`).
- `tags` (optional list): freeform (hardware hints)
  - Used as an **operator preference hint** for routing (e.g., tag a node with `gpu` and prefer it for heavy workloads).
  - Optional default bias: `LOCAL_SWARM_PREFER_TAGS_DEFAULT=gpu`
  - Optional per-op bias: `LOCAL_SWARM_PREFER_TAGS_CHAT=gpu` and `LOCAL_SWARM_PREFER_TAGS_EMBED=cpu`

Example `modelAliases` usage (OpenAI-compatible servers often use different model IDs than Ollama):

```json
{
  "name": "vllm-a",
  "kind": "openai_compat",
  "baseUrl": "http://127.0.0.1:8000/v1",
  "capabilities": ["chat", "embed"],
  "models": ["meta-llama/Llama-3.1-8B-Instruct"],
  "modelAliases": {
    "llama3.1:8b": "meta-llama/Llama-3.1-8B-Instruct",
    "llama3.1": "meta-llama/Llama-3.1-8B-Instruct"
  }
}
```
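
If a node sets `"apiKeyEnv": "VLLM_A_API_KEY"` (as in the earlier OpenAI-compatible example), the secret itself lives only in the environment or `.env`, optionally pinned by the allowlist:

```bash
# .env on the web host (the secret is never stored in the nodes JSON file)
VLLM_A_API_KEY=replace-with-your-real-token
LOCAL_SWARM_API_KEY_ENV_ALLOWLIST=VLLM_A_API_KEY
```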

If `LOCAL_AI_NODES` is not set, the swarm uses the existing `LOCAL_AI_*_URL` settings from `app/local_ai.py` as its default node set.

Node source precedence (first match wins):
- `LOCAL_AI_NODES` (JSON env)
- `LOCAL_AI_NODES_PATH` / `LOCAL_SWARM_NODES_PATH` / `LOCAL_AI_NODES_FILE` (JSON file)
- models registry (if present)
- `LOCAL_AI_*_URL` fallbacks from `app/local_ai.py`

Reload behavior:
- Changes in `.env` require restarting the app process.
- Changes to the nodes JSON file are picked up automatically (best-effort; short TTL per process).

#### Nodes via file path (optional)

If shell-escaping JSON is annoying, store the node list in a local file (JSON) and point the app at it:

- `LOCAL_AI_NODES_PATH=/path/to/local_ai_nodes.json` (also supports `LOCAL_SWARM_NODES_PATH` and `LOCAL_AI_NODES_FILE`)

The file must contain the same JSON list described above (e.g., `[{...},{...}]`).

#### Node discovery helper (optional)

To generate a `LOCAL_AI_NODES_PATH` file by probing a set of private hosts:

```bash
cd aipowerprogressia.com
python3 scripts/discover_swarm_nodes.py \
  --hosts "<LAN_IP_A> <LAN_IP_B> <LAN_IP_C>" \
  --write "$HOME/.ppia/local_ai_nodes.json"
```

If you maintain repo-root `cluster_config.sh` with `NODES=(...)`, you can use:

```bash
cd aipowerprogressia.com
python3 scripts/discover_swarm_nodes.py --from-cluster-config --write "$HOME/.ppia/local_ai_nodes.json"
```

If your cluster membership is managed by Ray (and you have a Ray Python env installed on this machine), you can also use:

```bash
cd aipowerprogressia.com
python3 scripts/discover_swarm_nodes.py --from-ray --write "$HOME/.ppia/local_ai_nodes.json"
```

Then set `LOCAL_AI_NODES_PATH=$HOME/.ppia/local_ai_nodes.json` and restart the app.

Notes:
- Discovery probes Ollama via `/api/tags` (preferred) and OpenAI-compatible servers via `/v1/models` (fallback).
- Discovery caps per-probe JSON reads via `LOCAL_SWARM_DISCOVER_MAX_JSON_BYTES` (default `2000000` bytes) to avoid accidental memory blowups.
  - Legacy alias: `LOCAL_SWARM_DISCOVERY_MAX_JSON_BYTES`.
- By default, only **private/loopback** IPs are probed. Use `--allow-public-hosts` only if you understand the risk.

#### Node discovery API (operator; loopback-only)

For quick on-box discovery without running the script directly:

```bash
curl -sS -X POST http://127.0.0.1:8000/api/swarm/discover \
  -H 'content-type: application/json' \
  -d '{"hosts":["<LAN_IP_A>","<LAN_IP_B>"],"tags":["gpu"],"openai_ports":[8000,8001,1234]}' | jq
```

Security notes:
- This endpoint is **strict loopback only** (forwarded/proxied requests are blocked).
- For browser requests, unsafe methods require **same-origin** headers (Origin/Referer/Sec-Fetch-Site) to mitigate CSRF-to-localhost.
- Public-host probing is disabled unless you explicitly set `LOCAL_SWARM_ALLOW_PUBLIC_NODES=1`.

#### Cluster-config hosts (operator; loopback-only)

If you maintain repo-root `cluster_config.sh`, you can load its host inventory from the website:

```bash
curl -sS http://127.0.0.1:8000/api/swarm/cluster-hosts | jq
```

This endpoint is **strict loopback only** and returns a de-duped list derived from:
- `HEAD_IP`
- `NODES[]`
- `WINDOWS_NODES[]`
- `OPTIONAL_LINUX_NODES[]`

On `http://127.0.0.1:8000/local-ai` you can click **From cluster_config** to populate the host list, then click **Discover**.

### Operator workflow (recommended)

1. Run the app locally and open `http://127.0.0.1:8000/local-ai`.
   - Several swarm operator actions are **strict loopback only** and will 403 when accessed through a reverse proxy (forwarded headers).
   - To operate from another machine: `ssh -L 8000:127.0.0.1:8000 <your-box>` then open `http://127.0.0.1:8000/local-ai`.
2. In **Local AI Swarm**, use:
   - **Discover** to probe a list of private hosts and generate a node list.
   - **Copy LOCAL_AI_NODES** and paste into `aipowerprogressia.com/.env`, then restart.
   - **Probe models** to refresh per-node model inventories and see a union model count.
   - **Routing Preview** to inspect scored node ordering for a given op/model (and optional `prefer tags`).
   - **Swarm metrics** to see rolling ok-rate and latency percentiles (public-safe; detail gated on localhost/admin).
   - **Swarm Multi-Agent Test** to run an operator-only `route → draft → safety` pipeline through swarm routing (strict loopback only).
     - Optional: set **prefer tags** (e.g., `gpu`) to bias routing toward tagged nodes.
3. Optional: use the CLI with per-request swarm controls (best on localhost):
   - `python3 scripts/ppia chat --swarm --swarm-tags gpu`
   - `python3 scripts/ppia ask "Summarize today's top AI news." --swarm --swarm-tags gpu`

### Auto-refresh `LOCAL_AI_NODES_PATH` (optional)

If you prefer a **nodes file** (instead of embedding JSON into `.env`), you can keep it fresh automatically:

1. Pick a nodes file path.
   - Recommended default: `~/.ppia/local_ai_nodes.json`
   - Set it via `LOCAL_AI_NODES_PATH=...` (env or systemd unit) so the app and timers use the same file.

Optional: trigger a refresh from the website UI (strict loopback only):
- Open `http://127.0.0.1:8000/local-ai`
- In **Local AI Swarm**, click **Refresh nodes file**
  - This calls `POST /api/swarm/nodes/refresh` (strict loopback only).
  - Requires `LOCAL_AI_NODES_PATH` to be set in the app process environment.
  - Preserves operator-tuned per-node fields (e.g., `maxConcurrency`, `weight`, `modelAliases`) from the existing nodes file.
  - Refreshes per-node `models` inventories by default; to pin a curated allowlist, set `modelsPinned: true` on that node.
  - Keeps previously-known nodes on cluster hosts even if temporarily unreachable (prevents silent shrink-on-flake).
2. Run once:

```bash
cd aipowerprogressia.com
python3 scripts/refresh_swarm_nodes.py
```

3. Optional: install and enable the systemd timer:

```bash
cd aipowerprogressia.com
sudo bash scripts/install_systemd_units.sh --scope system --profile swarm --apply --enable --now
sudo systemctl status aipowerprogressia-app.service swarm-nodes-refresh.timer swarm-models-probe.timer swarm-smoke.timer swarm-warmup.timer automation-worker.timer
sudo systemctl list-timers 'swarm-*' 'automation-*' --no-pager
sudo journalctl -u swarm-nodes-refresh.service -n 200 --no-pager
sudo journalctl -u swarm-models-probe.service -n 200 --no-pager
sudo journalctl -u swarm-smoke.service -n 200 --no-pager
```

User-level systemd (no sudo) is also supported for on-box dev/ops:

```bash
cd aipowerprogressia.com
bash scripts/install_systemd_units.sh --scope user --profile swarm --apply --enable

# Start timers immediately (avoids restarting your app if it's already running manually).
systemctl --user start swarm-nodes-refresh.timer swarm-models-probe.timer swarm-smoke.timer automation-worker.timer
systemctl --user start swarm-warmup.timer
```

User-level timer note:
- User timers run only while your per-user systemd manager is running.
- If you want user timers to run at boot on a headless box (no login session), enable lingering:
  - `sudo loginctl enable-linger "$USER"`

Note: if you already run the **system-level** timers (`sudo systemctl status swarm-nodes-refresh.timer`), do **not**
also enable the user-level timers — pick one scope to avoid duplicate refresh/probe runs.

Systemd profile cheat sheet:
- `--profile swarm`: app + swarm refresh + swarm model probes + swarm smoke + automation worker
- `--profile core`: swarm + healthcheck + backups/prunes + feeds/digest
- `--profile all`: everything (including frontier-cycle style automation)

Notes:
- The refresh script reads repo-root `cluster_config.sh` as **data** (parsed; never sourced).
- Logs are count-only; it avoids printing host IPs to journald by default.
- The repo-shipped systemd unit templates include baseline hardening + background QoS:
  - Hardening: `NoNewPrivileges=yes`, `PrivateTmp=yes`, and kernel/cgroup protections.
  - QoS (oneshot/timers): `Nice=10` + best-effort IO scheduling, so probes/refresh/warmup don't starve interactive Ask AI.
  - To relax/tune these, use a systemd drop-in (example: `sudo systemctl edit swarm-smoke.service`).
- System scope units are rendered to a specific `User=` + absolute paths at install time.
  - `sudo bash scripts/install_systemd_units.sh ...` uses `SUDO_USER` by default (override: `PPIA_SYSTEMD_USER=...`).
  - If you move the repo or change users, re-run the installer or override via drop-ins (example: `sudo systemctl edit swarm-nodes-refresh.service`).
  - Optional tuning (env vars):
    - `LOCAL_SWARM_AUTOMATION_FORCE=1` (default) forces swarm routing for background automation/scripts even when `LOCAL_SWARM_DEFAULT_MODE=off`.
    - `LOCAL_SWARM_DISCOVER_OLLAMA_PORT=11434`
    - `LOCAL_SWARM_DISCOVER_OPENAI_PORTS=8000 8001 1234` (1234 = LM Studio default)
    - `LOCAL_SWARM_DISCOVER_TIMEOUT_S=0.8`
    - `LOCAL_SWARM_DISCOVER_MAX_MODELS=60`
    - `LOCAL_SWARM_DISCOVER_TAGS_PATH=$HOME/.ppia/swarm_host_tags.json` (optional)
    - `LOCAL_SWARM_PROBE_CACHE_POLL_S=6` (optional) poll interval for reading shared probe cache files (multi-worker IO tuning; higher = less disk churn).
    - `LOCAL_SWARM_OVERLOAD_CACHE_POLL_S=4` (optional) poll interval for reading shared overload cooldown cache files (multi-worker IO tuning).
    - `LOCAL_SWARM_XPROC_HINT_ENABLED=1` (optional) set to 0 to skip cross-process slot-occupancy *estimation* even when file locks are enabled (reduces request-path filesystem work).
    - `LOCAL_SWARM_XPROC_HINT_TTL_S=0.25` (optional) cache TTL for cross-process inflight estimates (lower = fresher, higher = less filesystem work).
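As a concrete sketch, a `.env` fragment using a few of the tuning variables above might look like this (all values are illustrative, not recommendations):

```bash
# Illustrative discovery/IO tuning for the refresh script and router caches
LOCAL_SWARM_DISCOVER_OLLAMA_PORT=11434
LOCAL_SWARM_DISCOVER_OPENAI_PORTS="8000 8001 1234"
LOCAL_SWARM_DISCOVER_TIMEOUT_S=0.8
LOCAL_SWARM_DISCOVER_MAX_MODELS=60
LOCAL_SWARM_PROBE_CACHE_POLL_S=6
LOCAL_SWARM_OVERLOAD_CACHE_POLL_S=4
```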

### Health checks (node → swarm → end-to-end)

```bash
# Node health
curl -fsS http://127.0.0.1:11434/api/tags >/dev/null

# Swarm view (public-safe)
python3 scripts/ppia swarm status
python3 scripts/probe_swarm_models.py --strict

# End-to-end (requires app running on localhost)
python3 scripts/swarm_smoke.py --base-url http://127.0.0.1:8000 --metrics --chat --stream --orch --check-gating

# Optional: warm models after a reboot (loopback-only; avoids model pulls)
python3 scripts/swarm_warmup.py --base-url http://127.0.0.1:8000
# (CLI wrapper)
python3 scripts/ppia swarm warmup
```

Optional: periodic model probes
- `swarm-models-probe.timer` runs `python3 scripts/probe_swarm_models.py` to keep per-node model inventories warm.
- This improves routing accuracy without enabling request-path probing (`LOCAL_SWARM_PROBE_MODELS_ON_ROUTE=0`).

Optional: periodic swarm smoke checks
- `swarm-smoke.timer` runs a small end-to-end check (`services` + `swarm` + `ray` + a tiny `ai_chat` + `ai_stream` + `route→draft→safety`).
  - It is safe-by-default (HTTP only). It makes small local AI calls but does not mutate state.
  - If the web app is down, the unit will fail (intentionally visible), but it will not attempt restarts.
  - It writes a private JSON report to `${LOCAL_SWARM_LOCK_DIR:-$PPIA_DATA_DIR/locks/swarm}/health/swarm_smoke_latest.json` (mode `0600`).
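To peek at the latest report on-box, the path above can be read directly (run as the app user; the file is private):

```bash
# Pretty-print the latest swarm smoke report
jq . "${LOCAL_SWARM_LOCK_DIR:-$PPIA_DATA_DIR/locks/swarm}/health/swarm_smoke_latest.json"
```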

Optional: warmup report (operator)
- `POST /api/swarm/warmup` runs a small warmup pass across configured swarm nodes.
  - It warms only models that appear in per-node inventories (avoids accidental model pulls).
  - Use `LOCAL_SWARM_WARMUP_CHAT_MODELS_PER_NODE=2` to cap per-node chat warm calls (0 disables chat warmup).
  - Optional: `LOCAL_SWARM_WARMUP_KEEP_ALIVE=30m` asks Ollama nodes to keep warmed chat models resident longer (best-effort; the upstream server decides how the value is parsed).
  - Default: `LOCAL_SWARM_WARMUP_FALLBACK_ANY_MODEL=1` falls back to warming a small-ish already-present chat model when the requested model(s) time out on a cold boot. Set to `0` to require only the explicitly requested models.
  - It writes `${LOCAL_SWARM_LOCK_DIR:-$PPIA_DATA_DIR/locks/swarm}/health/swarm_warmup_latest.json` (mode `0600`).
- `GET /api/swarm/warmup/latest` returns a public-safe summary (counts only); `?detail=1` requires loopback/admin.
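A minimal loopback sketch (assumes the app listens on `127.0.0.1:8000` and that the warmup endpoint accepts an empty POST body):

```bash
# Trigger a warmup pass, then read the public-safe summary
curl -sS -X POST http://127.0.0.1:8000/api/swarm/warmup | jq
curl -sS http://127.0.0.1:8000/api/swarm/warmup/latest | jq
```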

Operator surfaces:
- `GET /api/swarm/smoke/latest` returns a public-safe summary of the latest smoke run.
  - `?detail=1` is loopback/admin only and includes automation/queue details.
- `/admin/swarm` renders the latest smoke summary + steps.
- `GET /api/services/status` includes a `Swarm Smoke (Latest)` row (optional) for quick visibility.
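For example (loopback assumed for the `detail=1` call):

```bash
# Public-safe summary of the latest smoke run
curl -sS http://127.0.0.1:8000/api/swarm/smoke/latest | jq

# Loopback/admin detail (includes automation/queue details)
curl -sS 'http://127.0.0.1:8000/api/swarm/smoke/latest?detail=1' | jq
```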

### Host tags (CPU/GPU-aware routing)

Swarm supports a **soft tag preference** to bias routing toward certain nodes (without removing fallbacks):
- per request: `--swarm-tags gpu`
- default: `LOCAL_SWARM_PREFER_TAGS_DEFAULT=gpu`
- per op (optional): `LOCAL_SWARM_PREFER_TAGS_CHAT=gpu` and `LOCAL_SWARM_PREFER_TAGS_EMBED=cpu`

To make this meaningful on a multi-node LAN cluster, you can label discovered hosts with tags via `LOCAL_SWARM_DISCOVER_TAGS_PATH`.

Example `swarm_host_tags.json` (private; do not commit):

```json
{
  "default_tags": ["cpu"],
  "hosts": {
    "node-a": ["gpu"],
    "node-b": ["cpu", "fast"]
  }
}
```

Notes:
- Keys under `hosts` must match the host strings returned by `cluster_config.sh` parsing (exact match).
- This file is only used by `scripts/refresh_swarm_nodes.py` / `swarm-nodes-refresh.timer` when generating `LOCAL_AI_NODES_PATH`.
- Tags never include URLs; routing/metrics remain base-id only on public-safe surfaces.

### OpenAI-compatible auth (optional)

Some OpenAI-compatible local servers require a bearer token. If needed:
- `LOCAL_AI_API_KEY=...`

This key is **never** exposed on public status endpoints.

Safety note:
- If you opt into routing to public nodes (`LOCAL_SWARM_ALLOW_PUBLIC_NODES=1`), the router will **not** send `LOCAL_AI_API_KEY` to non-private targets unless you also set `LOCAL_SWARM_ALLOW_AUTH_TO_PUBLIC_NODES=1`.
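To sanity-check that a node actually accepts the token, you can hit its OpenAI-compatible models endpoint directly (host/port below are placeholders; this bypasses the router; export `LOCAL_AI_API_KEY` first, e.g. via `set -a; source .env; set +a`):

```bash
# Direct node check with the bearer token (placeholder LAN host/port)
curl -fsS -H "Authorization: Bearer $LOCAL_AI_API_KEY" \
  http://192.168.1.50:8000/v1/models | jq
```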

### Safety: allowed node hosts

Swarm node URLs are validated at config load time:
- only `http`/`https`,
- **explicit port required**,
- no userinfo (`user:pass@`), query strings, or fragments,
- by default, only **loopback + private RFC1918/ULA** IPs are allowed.

Opt-ins (use with care):
- `LOCAL_SWARM_ALLOW_HOSTNAMES=1` allows non-IP hostnames (simple hostnames only unless you also allow public nodes).
- `LOCAL_SWARM_ALLOW_PUBLIC_NODES=1` allows public IPs and dotted hostnames.
- `LOCAL_SWARM_ALLOWED_PORTS=11434,8000,8001` optionally restricts node target ports (empty = allow all).
- `LOCAL_SWARM_ALLOWED_CIDRS=<LAN_CIDR>` optionally restricts node target IPs to a CIDR allowlist (loopback targets are always allowed).
- `LOCAL_SWARM_MAX_UPSTREAM_JSON_BYTES=...` caps the JSON payload size sent to local model servers (default: `2000000`; set `0` to disable). When enabled, the router filters out node kinds that would exceed the cap and returns HTTP 413 if no kind can accept the request.
- `LOCAL_SWARM_MAX_UPSTREAM_JSON_BYTES_VISION=...` caps JSON payloads for vision calls that include base64 image data (default: `12000000`; set `0` to disable for vision). This avoids breaking `/api/ai/vision` when the general chat cap is small.
- `AI_VISION_MAX_IMAGE_MB=...` caps uploaded image size for `/api/ai/vision` (default: `6`, clamped to `MAX_UPLOAD_MB`). This keeps in-memory vision requests bounded and prevents accidental mega-base64 payloads.
- `LOCAL_SWARM_MAX_UPSTREAM_RESPONSE_BYTES=...` caps upstream response bodies (JSON + stream) from model servers (bytes; `0` disables). This protects the router from misbehaving nodes returning huge payloads.

Important: link-local ranges (e.g., `169.254/16`, `fe80::/10`) are blocked to reduce metadata/SSRF risk.

### Threat model (quick)

This project treats local AI/swarm as a **networked compute subsystem** and the website as a **public-facing untrusted input** surface.

**Attacker models**
- Remote unauthenticated caller hitting public endpoints (e.g., `POST /api/ai/chat`, `POST /api/ai/chat/stream`).
- Remote authenticated caller (future or optional) with a session cookie.
- Local browser attacker attempting CSRF-to-localhost (e.g., DNS rebinding / local HTML file / cross-site fetch).
- Compromised/buggy LAN model node returning malicious payloads or very large responses.
- Misconfiguration (operator accidentally enables public nodes / hostnames / debug logging).

**Primary risks and the controls already in place**
- SSRF / unexpected egress from the swarm router
  - Node admission allowlists in `app/swarm/config.py` (`LOCAL_SWARM_ALLOWED_PORTS`, `LOCAL_SWARM_ALLOWED_CIDRS`, `LOCAL_SWARM_ALLOW_PUBLIC_NODES`, `LOCAL_SWARM_ALLOW_HOSTNAMES`).
  - Link-local / multicast / unspecified / reserved ranges rejected (metadata-adjacent SSRF mitigation).
  - Upstream clients do **not** trust env proxies and do **not** follow redirects (Ollama: `app/local_ai.py`; OpenAI-compat: `app/swarm/providers/openai_compat.py`; discovery tooling: `app/swarm/discovery.py`).
- Tooling exposure (filesystem/network/admin helpers) via agent flows
  - Swarm orchestrator is explicitly **no-tools** (`app/swarm/orchestrator.py`).
  - MCP tool proxy is operator-only + allowlisted + no-proxy/no-redirect (`app/mcp_proxy.py`).
  - The server only advertises MCP actions in the system prompt for strict loopback/admin callers (`app/ai_routes.py`).
  - “Agent mode” actions are a **client-side allowlist** (no arbitrary server-side execution); server stores only bounded metadata (`app/agent_actions.py`, `static/site.js`, `static/site_widget.js`).
- Prompt-injection from retrieved/web/page content
  - Untrusted snippets are normalized + wrapped (`<BEGIN_UNTRUSTED_…>`) and a centralized guard is appended only when needed (`app/prompt_safety.py`).
  - Swarm orchestration steps wrap any untrusted blocks and apply the same guard (`app/swarm/orchestrator.py`).
- DoS / resource exhaustion (mega-prompts, too many concurrent calls, stream abuse)
  - ASGI receive-layer request body caps for `/api/ai/*`, `/api/local-ai/*`, `/api/swarm/*` (`app/http_middleware.py`, wired in `app/_app.py` via `PPIA_AI_API_MAX_BODY_BYTES`).
  - SQLite-backed IP rate limiting (multi-worker safe) and per-IP concurrency locks for chat/stream (`app/rate_limit.py`, `app/_app.py`, `app/ai_routes.py`, `app/swarm/locks.py`).
  - Swarm router global/per-node concurrency + overload cooldown + bounded timeouts (`app/swarm/router.py`).
- Sensitive data exposure via logs / status surfaces
  - Public-safe swarm endpoints return **base IDs**, not base URLs (base URLs require strict loopback and are never exposed to remote admins) (`GET /api/swarm/status`, `GET /api/swarm/models`).
  - Error strings are sanitized/redacted (`app/safety.py`, `app/ai_routes.py`, swarm providers).
  - Debug switches that can reveal upstream error bodies are off by default (`LOCAL_SWARM_DEBUG_UPSTREAM_ERROR_BODY=0`).
- Auth/token exfiltration to public nodes
  - OpenAI-compatible auth headers are not sent to non-private nodes unless explicitly allowed (`LOCAL_SWARM_ALLOW_AUTH_TO_PUBLIC_NODES=1`) (`app/swarm/router.py`).

**Operator hardening checklist (recommended)**
- Keep `LOCAL_SWARM_ALLOW_PUBLIC_NODES=0` and `LOCAL_SWARM_ALLOW_HOSTNAMES=0` unless you *intentionally* route outside your LAN.
- Set `LOCAL_SWARM_ALLOWED_PORTS=11434,8000,8001` and `LOCAL_SWARM_ALLOWED_CIDRS=<your LAN CIDR>` on any machine that may be exposed to untrusted traffic.
- Set `PPIA_LOOPBACK_TOKEN=...` on laptops/desktops where you browse the open web (defense-in-depth for strict loopback endpoints).
- Keep `LOCAL_SWARM_DEBUG_UPSTREAM_ERROR_BODY=0` in production.
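Taken together, a hardened `.env` fragment might look like this (CIDR is illustrative; match it to your LAN):

```bash
LOCAL_SWARM_ALLOW_PUBLIC_NODES=0
LOCAL_SWARM_ALLOW_HOSTNAMES=0
LOCAL_SWARM_ALLOWED_PORTS=11434,8000,8001
LOCAL_SWARM_ALLOWED_CIDRS=192.168.1.0/24
PPIA_LOOPBACK_TOKEN=change-me-long-random
LOCAL_SWARM_DEBUG_UPSTREAM_ERROR_BODY=0
```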

## Observability

### Swarm status

- `GET /api/swarm/status` returns a **public-safe** snapshot (no internal base URLs).
- `GET /api/swarm/status?detail=1` is allowed only from **strict loopback** requests or an **admin** session with `admin:read` (when wired by the app).
  - `detail=1` also includes `swarm.httpx_pools` (effective keep-alive pool sizing for upstream + Ray gateway clients).
- `GET /api/swarm/status?detail=2` (strict loopback only) also includes per-node `base` URLs for on-box debugging.
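For example (on the app host; `127.0.0.1:8000` assumed):

```bash
# Public-safe snapshot
curl -sS http://127.0.0.1:8000/api/swarm/status | jq

# Strict loopback detail levels
curl -sS 'http://127.0.0.1:8000/api/swarm/status?detail=1' | jq
curl -sS 'http://127.0.0.1:8000/api/swarm/status?detail=2' | jq
```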

Optional hardening (defense-in-depth):
- Set `PPIA_LOOPBACK_TOKEN=...` to require `x-ppia-loopback-token` on all strict loopback endpoints (prevents accidental exposure via misconfigured local proxies).

Routing traceability:
- Upstream calls include `X-Request-ID` (when available) and `User-Agent: ppia-swarm-router/1` to help correlate node logs.

### Refresh nodes file (operator; loopback-only)

If `LOCAL_AI_NODES_PATH` is set, you can refresh the nodes file in-process:

- `POST /api/swarm/nodes/refresh` (strict loopback only)

This probes hosts derived from repo-root `cluster_config.sh` and rewrites the nodes JSON file only when it changed.
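For example (strict loopback; assumes the endpoint accepts an empty POST body):

```bash
curl -sS -X POST http://127.0.0.1:8000/api/swarm/nodes/refresh | jq
```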

### Config diagnostics (operator; loopback-only)

When swarm routing looks “enabled but ineffective”, this endpoint helps explain why a nodes file might be ignored (symlink/perms/invalid JSON):

- `GET /api/swarm/config/diag` (strict loopback only)
- CLI: `python3 scripts/ppia swarm config-diag`

### Route preview (operator; loopback-only)

Preview scored node ordering (useful when tuning `weight`, `maxConcurrency`, and tags):

```bash
curl -sS -X POST http://127.0.0.1:8000/api/swarm/route/preview \
  -H 'content-type: application/json' \
  -d '{"op":"chat","model":"llama3","tags":["gpu"],"limit":10}' | jq
```

Notes:
- This endpoint is **strict loopback only** (blocked behind proxies).
- Add `"include_bases": true` for on-box debugging (still loopback-only).

Status fields (selected):
- `swarm.inflight_total` is the total number of in-flight calls tracked by the router in this process (useful for debugging saturation).
- `swarm.locks_ok` indicates whether cross-process file locks are working (when `LOCAL_SWARM_USE_FILE_LOCKS=1`).
- `swarm.cache_enabled` indicates whether on-disk probe/overload/breaker caches are enabled (`LOCAL_SWARM_CACHE_ENABLED=1`).
- When `detail=1` and `swarm.locks_ok=true`, `swarm.inflight_locks` includes **cross-process** estimates derived from file-lock slot occupancy:
  - `global|embed|stream|bg`: `{inflight, slots}`
  - `nodes[]`: `{base_id, inflight, slots}` (per-node; **strict loopback only**)
    - Non-loopback admin sessions (detail=1) can see the global pools, but not per-node slot occupancy.

File-lock safety (defense-in-depth):
- The lock directory is required to be **private** (owned by the app user; `0700`). If it is not private, the router disables file locks (and cross-process probe cache / inflight estimates) rather than risk interference from other local users.
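If locks were disabled for this reason, tightening the directory is usually enough (run as the app user; the path uses the same default shown earlier):

```bash
# Ensure the lock directory exists and is private (0700, owned by the app user)
install -d -m 0700 "${LOCAL_SWARM_LOCK_DIR:-$PPIA_DATA_DIR/locks/swarm}"
chmod 0700 "${LOCAL_SWARM_LOCK_DIR:-$PPIA_DATA_DIR/locks/swarm}"
```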

Overload handling:
- If a node responds with `HTTP 429` or `HTTP 503`, the router treats it as **overloaded but alive**:
  - it does **not** increment `fail_count` or open the breaker,
  - it applies a short per-node cooldown (`nodes[].overload_until_utc`) and down-ranks the node temporarily,
    - if the upstream sends `Retry-After`, the cooldown respects it (bounded),
  - if *all* nodes are overloaded, chat requests return `429 local swarm overloaded (all nodes overloaded)`.

Interpreting node fields:
- `nodes[].open_until_utc` ⇒ circuit breaker open (consecutive failures; node treated as down temporarily)
- `nodes[].overload_until_utc` ⇒ overload cooldown (node is alive but busy; temporary down-rank)

Debugging note:
- By default, upstream HTTP error bodies are **not** included in exception/log text (to reduce accidental leakage of prompt/context).
- To opt into verbose upstream error bodies for on-box debugging, set: `LOCAL_SWARM_DEBUG_UPSTREAM_ERROR_BODY=1` (truncated; disable after triage).

### Swarm models

- `GET /api/swarm/models` returns **counts only** (public-safe).
- `GET /api/swarm/models?detail=1&probe=1` (strict loopback only) returns per-node model IDs and refreshes cached model inventories.
- `GET /api/swarm/models?detail=2&probe=1` (strict loopback only) includes per-node `base` URLs for on-box debugging.

Notes:
- Probing calls `GET /v1/models` (OpenAI-compatible) or `GET /api/tags` (Ollama) with short timeouts.
- Public requests cannot trigger probes (avoids turning public surfaces into internal scanners).
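For example:

```bash
# Counts only (public-safe)
curl -sS http://127.0.0.1:8000/api/swarm/models | jq

# Strict loopback: refresh cached inventories and include per-node model IDs
curl -sS 'http://127.0.0.1:8000/api/swarm/models?detail=1&probe=1' | jq
```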

### Swarm metrics

- `GET /api/swarm/metrics` returns a **public-safe** rolling snapshot aggregated from recent router events:
  - overall counts + ok_rate
  - router-level backpressure/validation counts for calls that **never reached a node**:
    - `metrics.router_total` / `metrics.router_ok_rate`
    - `router_errors[]` (top `op` + `error` codes, e.g. `busy_global_pool`, `busy_stream_lock`, `payload_too_large`)
  - p50/p95 latency
  - per-operation aggregates (`chat`, `stream`, `embed`, `embed_batch`, …)
  - per-node aggregates keyed by `base_id` (no URLs)
- `GET /api/swarm/metrics?detail=1` is loopback/admin:
  - strict loopback: includes node `name`, `tags`, and bounded `last_error` (no base URLs unless a separate loopback-only detail level is added in the future)
  - non-loopback admin: returns an **admin-safe** node view (no `name`, no `tags`, no `last_error`)
- `GET /api/swarm/metrics?scope=global` (loopback/admin) aggregates **persisted** router events across **all Uvicorn workers** (best-effort).
  - Useful when running multiple workers and you want a single windowed ok_rate/latency view.
- `GET /api/swarm/metrics.prom` returns the same snapshot in **Prometheus text format** (public-safe; default `scope=process`).
  - `scope=global` is loopback/admin only (same as JSON).
  - Notes: labels intentionally avoid base URLs and model names (low-cardinality; safer for public surfaces).
  - Router backpressure metrics:
    - `ppia_swarm_router_events_total_window{scope="process|global"}`
    - `ppia_swarm_router_event_errors_total_window{scope,op,error}`
  - Queue wait breakdown (helps separate semaphore pressure vs lock pressure):
    - `ppia_swarm_queue_wait_ms_p95_by_pool_window{scope,pool}`
    - `ppia_swarm_queue_timeouts_by_pool_total_window{scope,pool}`
    - `ppia_swarm_queue_error_by_pool_total_window{scope,pool,error}`
  - Cross-process pool utilization (file-lock based; best-effort):
    - `ppia_swarm_pool_inflight{scope,pool}`
    - `ppia_swarm_pool_slots{scope,pool}`
    - `ppia_swarm_pool_utilization{scope,pool}`
  - Overload metrics (node reached but responded overloaded / busy):
    - `ppia_swarm_op_overload_total_window{scope,op}`

JSON fields added (public-safe):
- `queue.pools[]` / `queue.pool_errors[]` include queue wait + timeout/error aggregates keyed by `pool`.
- `concurrency` includes cross-process pool occupancy (`global|embed|stream|bg`) when file locks are enabled.
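For example (`scope=global` is loopback/admin only, as noted above):

```bash
# Rolling snapshot for this worker process (public-safe)
curl -sS http://127.0.0.1:8000/api/swarm/metrics | jq

# Cross-worker window (loopback/admin; requires metrics persistence)
curl -sS 'http://127.0.0.1:8000/api/swarm/metrics?scope=global' | jq

# Prometheus text format (public-safe)
curl -sS http://127.0.0.1:8000/api/swarm/metrics.prom | head -n 40
```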

Metrics persistence (operator):
- `LOCAL_SWARM_METRICS_PERSIST=1` writes compact router events to a private JSONL file under `LOCAL_SWARM_LOCK_DIR` / `PPIA_DATA_DIR`.
  - Default: on when `LOCAL_SWARM_USE_FILE_LOCKS=1`.
- Optional bounds:
  - `LOCAL_SWARM_METRICS_MAX_BYTES=8000000` (per-process file; rotated at this size)
  - `LOCAL_SWARM_METRICS_RETENTION_DAYS=7` (prunes old per-process event files)

### Services status

`GET /api/services/status` includes a “Local AI Swarm” row (optional) for quick operator visibility.

### Swarm benchmark (operator)

If you want a reproducible request-level benchmark that also prints the router’s own metrics snapshot:

```bash
python3 scripts/bench_swarm_router.py --mode stream --requests 40 --concurrency 8 --model llama3 --message "bench"
```

## Queueing / multi-agent workflows (admin-only)

The app already has a durable automation queue with audit trails. Swarm routing is available to these runs too.

Built-in workflows:
- `wf_swarm_research_answer_v1` (kind: `swarm_research_answer_v1`)
- `wf_swarm_multiagent_task_v2` (kind: `swarm_multiagent_task_v2`)

### `wf_swarm_research_answer_v1` (grounded, multi-source)

This workflow produces a grounded answer by building a “grounding packet” from:
- local docs (stdlib local RAG fallback),
- Daily Intelligence digests (SQLite mirror if present; JSON fallback), and
- Brave Search results (web/news auto).

Then it runs a guarded multi-agent pipeline:
- route → grounded draft → safety

Citation contract:
- Local docs: cite as `[D1]`, `[D2]`, …
- Daily intelligence items: cite as `[I1]`, `[I2]`, …
- Brave web/news results: cite as `[1]`, `[2]`, …

Inputs (selected):
- `question` (required)
- `goal` (default: `answer`; can be `plan|summarize|content_optimize|code_help`)
- `docs_limit` (default: 4; 0 disables local docs)
- `intel_days` (default: 3; 0 disables intelligence)
- `intel_items_limit` (default: 6; 0 disables intelligence)
- `source|count|country|search_lang` (Brave Search controls)
- `tags` / `swarm_tags` / `prefer_tags` (routing hints; optional)
  - string/list: applied to every step
  - object: keys `route|draft|safety|default` (per-step tag preferences)

Example (enqueue):

```bash
curl -sS -X POST http://127.0.0.1:8000/api/automation/runs/enqueue \
  -H 'content-type: application/json' \
  -H 'cookie: ppia_admin_session=...' \
  -d '{"workflow_id":"wf_swarm_research_answer_v1","input":{"question":"What are the most important local-first AI developments this week?","goal":"answer","docs_limit":4,"intel_days":3,"intel_items_limit":6,"source":"auto","count":5,"country":"us","search_lang":"en","tags":null}}'
```

Then inspect via:
- `GET /api/automation/runs/{run_id}` (admin)

Operator UI:
- `/admin/automation` shows the queue + run detail (including per-step swarm `base_id`, model, and durations).
- `/admin/swarm` includes a “Queue a Research Answer” panel for the same workflow.

CLI example:

```bash
# On localhost/operator sessions, this can work without tokens.
# If your server requires explicit write auth, pass --ai-write-token or --admin-token.
python3 scripts/ppia research "What are the most important local-first AI developments this week?" --wait

# Print full JSON run detail (steps + outputs) instead of the plain answer:
python3 scripts/ppia research "Summarize today's Daily Intelligence into 5 bullets." --raw-json
```

Enqueue via CLI (avoids cookie copy/paste):

```bash
cd aipowerprogressia.com
set -a; source .env; set +a
python3 scripts/ppia --admin-token "$APP_ADMIN_TOKEN" automation enqueue \
  --workflow-id wf_swarm_research_answer_v1 \
  --input '{"question":"What changed in local-first AI this week?","goal":"answer","docs_limit":4,"intel_days":3,"intel_items_limit":6,"source":"auto","count":5,"tags":{"route":"cpu","draft":"gpu","safety":"cpu"}}' \
  --wait --max-wait 300
```

Note: queued runs require `automation-worker.timer` (or run `python3 scripts/automation_worker.py --once` to drain once).
Tip: the oneshot worker (`--once`, used by `automation-worker.timer`) uses a stable `worker_id` derived from the hostname so `/api/automation/status` stays readable and the heartbeat table doesn’t grow unbounded.

Guardrails:
- Enqueue caps: `AUTOMATION_MAX_QUEUED_RUNS` (default `400`) and optional `AUTOMATION_MAX_ACTIVE_RUNS` (default `0` disabled).
- Retention (operator-only apply): `AUTOMATION_RETENTION_KEEP_DAYS` / `AUTOMATION_RETENTION_KEEP_LAST` prune finalized automation history via `/api/storage/cleanup_apply` (disabled by default).

### Token-gated agent jobs (website integration)

For narrow website integration (without exposing admin-only automation detail), the site includes a token-gated enqueue/poll API:

- `POST /api/ai/agent/enqueue` → returns `{run_id, run_token}`
- `GET /api/ai/agent/runs/{run_id}` with header `x-ppia-run-token: ...` → returns a public-safe snapshot + `final`

Defaults:
- loopback/admin-only by default (see `AI_AGENT_JOBS_ALLOW_AUTH` / `AI_AGENT_JOBS_ALLOW_PUBLIC` in `.env.example`).
- the `run_token` is stored only as a sha256 hash in SQLite.
- By default the enqueue path uses the grounded workflow `wf_swarm_research_answer_v1` (catalog + docs + intelligence + optional Brave), and it supports optional toggle fields:
  - `use_resources` (bool) — canonical catalog resources (default: true)
  - `resources_limit` (int) — max catalog items to include (default: 6)
  - `use_docs` (bool)
  - `use_web_search` (`on|off|auto`) — `auto` only hits Brave for news/time-sensitive queries (best-effort)
  - `context` (string) — untrusted conversation/page context (not evidence; do not cite)

Example:

```bash
curl -sS -X POST http://127.0.0.1:8000/api/ai/agent/enqueue \
  -H 'content-type: application/json' \
  -d '{"prompt":"Summarize the PowerSearch Grid value prop in 5 bullets.","goal":"summarize","tags":{"route":"cpu","draft":"gpu","critic":"cpu","revise":"gpu","safety":"cpu"}}' | jq

# Then poll (replace RUN_ID + RUN_TOKEN):
curl -sS http://127.0.0.1:8000/api/ai/agent/runs/RUN_ID \
  -H "x-ppia-run-token: RUN_TOKEN" | jq
```

Important: queued runs still require a worker (`automation-worker.timer` or `python3 scripts/automation_worker.py --once`).

### `wf_swarm_multiagent_task_v2` (route → draft → critic → revise → safety)

This workflow runs a simple multi-step pipeline (route → draft → critic → revise → safety) and stores step outputs in `automation_run_steps`.

Optional: step-level routing tags (operator hint)
- The workflow accepts `tags` (or `swarm_tags` / `prefer_tags`) as:
  - a string / list (applies to every step), or
  - an object with keys `route|draft|critic|revise|safety|default` (per-step tag preferences).
- This is useful when you want lightweight routing/safety on CPU nodes but drafts on GPU nodes.

Optional: step-level routing tags (operator defaults via env)
- If you run the multi-agent pipeline without passing `tags`, you can set defaults via env:
  - `LOCAL_SWARM_ORCH_PREFER_TAGS_ROUTE=cpu`
  - `LOCAL_SWARM_ORCH_PREFER_TAGS_DRAFT=gpu`
  - `LOCAL_SWARM_ORCH_PREFER_TAGS_SAFETY=cpu`
  - Optional global fallback: `LOCAL_SWARM_ORCH_PREFER_TAGS_ALL=cpu` (or `LOCAL_SWARM_ORCH_PREFER_TAGS=cpu`)
- These are soft preferences (boosts) — the router still falls back to other healthy nodes.

Example (enqueue):

```bash
curl -sS -X POST http://127.0.0.1:8000/api/automation/runs/enqueue \
  -H 'content-type: application/json' \
  -H 'cookie: ppia_admin_session=...' \
  -d '{"workflow_id":"wf_swarm_multiagent_task_v2","input":{"prompt":"Summarize the PowerSearch Grid value prop in 5 bullets.","goal":"summarize","tags":{"route":"cpu","draft":"gpu","critic":"cpu","revise":"gpu","safety":"cpu"}}}'
```

Then inspect via:
- `GET /api/automation/runs/{run_id}` (admin)

## RAG / docs index: embedding via swarm

The repo’s SQLite **docs index** (`/docs` RAG fallback) embeds content in batches using `app/rag/embed_ollama.py`.

When:
- `LOCAL_SWARM_ENABLED=1`, and
- the docs index is configured with the default `ollama_host` (loopback; `http://localhost:11434`),

then `embed_texts()` routes embeddings through the **swarm router** so large embedding jobs can use multiple LAN nodes and OpenAI-compatible servers (vLLM/LM Studio) when configured.

To force the docs index to talk to a specific remote Ollama host (bypass swarm), set:
- `DOCS_INDEX_OLLAMA_HOST=http://<host>:11434` (or the equivalent config entry).
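For example (host is illustrative):

```bash
# Pin the docs index embeddings to one Ollama host (bypasses swarm routing)
DOCS_INDEX_OLLAMA_HOST=http://192.168.1.50:11434
```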

## Troubleshooting

- Swarm disabled? Check `LOCAL_SWARM_ENABLED` and `LOCAL_SWARM_DEFAULT_MODE`.
- No nodes? Verify `LOCAL_AI_NODES` JSON and that URLs are private/loopback (or set the allow flags intentionally).
- Ray shows GPU hosts but swarm discovers no GPU model nodes?
  - Run coverage (loopback-only inventory): `python3 scripts/ppia swarm coverage` (or `/admin/swarm` → “Load coverage”).
  - If GPU hosts are marked `missing`, start a node-local model server on those hosts (Ollama `127.0.0.1:11434` or OpenAI-compatible `/v1`) and re-run `python3 scripts/ppia doctor --fix-swarm`.
- OpenAI-compatible nodes failing? Verify your server supports:
  - `GET /v1/models`
  - `POST /v1/chat/completions`
  - `POST /v1/embeddings`
  - If `GET /v1/models` returns `404`, the node is **not** OpenAI-compatible at that URL. Either:
    - point `baseUrl` at the server’s real OpenAI-compatible `/v1` base, or
    - configure the node as `kind: "ollama"` and use the Ollama HTTP API (`:11434`).
- Verify health quickly:
  - `GET /api/services/status`
  - `GET /api/swarm/status`
  - `python3 scripts/swarm_smoke.py --metrics --chat --orch --check-gating` (HTTP wiring + loopback-only orchestration)
  - `python3 scripts/swarm_smoke.py --json-out auto --metrics --automation --chat --orch --check-gating` (also writes `swarm_smoke_latest.json`)
  - `python3 scripts/swarm_smoke.py --metrics --chat --orch --orch-strict` (fails if tag preferences aren't satisfied)
  - `python3 scripts/swarm_smoke.py --probe-models` (loopback-only: refresh model inventories)
- After reboot, nodes look stale or model inventories are empty?
  - Run: `python3 scripts/ppia doctor --fix-swarm` (strict loopback; refresh nodes + probe models).
  - If you rely on timers, check they’re running (pick one scope; don’t enable both system + user timers):
    - `sudo systemctl status swarm-nodes-refresh.timer swarm-models-probe.timer swarm-smoke.timer`
    - `systemctl --user status swarm-nodes-refresh.timer swarm-models-probe.timer swarm-smoke.timer`
  - If `systemctl --user ...` fails (no user bus / no session), either:
    - use the system-level timers (`sudo systemctl ...`), or
    - enable lingering: `sudo loginctl enable-linger "$USER"`
- Common `429` meanings:
  - `local swarm busy (global concurrency)` ⇒ raise `LOCAL_SWARM_MAX_PARALLEL_TASKS` or reduce load (use file locks for multi-worker)
  - `local swarm busy (all nodes saturated)` ⇒ per-node `maxConcurrency` too low / too few nodes
  - `local swarm overloaded (all nodes overloaded)` ⇒ upstream nodes are returning `429/503` (add capacity, reduce concurrency, or add nodes)
- If `swarm.locks_ok=false` on `/api/swarm/status`:
  - single-worker dev is fine (locks are optional), but for multi-worker setups ensure `LOCAL_SWARM_LOCK_DIR` is writable and on local disk, or set `LOCAL_SWARM_USE_FILE_LOCKS=0` intentionally.
- If `scope=global` metrics look wrong or empty:
  - `scope=global` requires metrics persistence (default on when `LOCAL_SWARM_USE_FILE_LOCKS=1`).
  - Confirm `LOCAL_SWARM_LOCK_DIR` is consistent across workers and contains `metrics/` files.
  - Avoid polling `scope=global` at high frequency; it may read/parse multiple JSONL files (best-effort, bounded).
- If LAN nodes are timing out:
  - Verify the node is reachable from the web host: `curl -fsS http://<LAN_IP>:11434/api/version` (Ollama) or `curl -fsS http://<LAN_IP>:8000/v1/models` / `curl -fsS http://<LAN_IP>:8001/v1/models` (OpenAI-compat).
  - Check firewall/routing (LAN-only recommended) and consider increasing `LOCAL_AI_REMOTE_TIMEOUT`.
- If `viaRay` nodes are timing out / always skipped:
  - Check the gateway on the Ray head:
    - `curl -fsS http://127.0.0.1:9892/health`
    - `curl -fsS http://127.0.0.1:9892/metrics | head`
  - Confirm the website has Ray exec enabled: `LOCAL_SWARM_RAY_ENABLED=1` (compat: `AI_RAY_ENABLED=1`).
  - Confirm the nodes file actually contains `viaRay: true` entries (loopback): `python3 scripts/ppia swarm status --detail 1`
  - On the affected node, verify the localhost model server is healthy (run on that node): `curl -fsS http://127.0.0.1:11434/api/version >/dev/null` (Ollama) or `curl -fsS http://127.0.0.1:8000/v1/models >/dev/null` (OpenAI-compat).
- If one node is slow/flaky and you need the site to stay responsive during investigation:
  - Identify the node (loopback): `python3 scripts/ppia swarm metrics --scope global --detail 1 --window-s 900`
  - Drain it temporarily: `python3 scripts/ppia swarm node-override <base_id> --mode drain --reason "investigating latency"`
  - Disable it if needed: `python3 scripts/ppia swarm node-override <base_id> --mode disabled --reason "offline"`
  - Re-enable after the fix: `python3 scripts/ppia swarm node-override <base_id> --mode enabled --reason "back online"`
- If Ray-backed discovery isn’t finding hosts:
  - Check `ray_diag` in `GET /api/swarm/nodes/refresh/latest` (strict loopback) or in `python3 scripts/ppia swarm refresh-nodes --source both`.
  - Common fixes:
    - set `RAY_PYTHON=...` to a Python that has Ray installed (keeps the website venv Ray-free),
    - set `RAY_ADDRESS=...` (or `RAY_HEAD_IP`/`RAY_PORT`) so membership queries can connect.
- If you changed any `static/site*.js`, remember: `npm run build:js-compat` (per `AGENTS.md`).
