# Power outage recovery (operator runbook)
If power flickers or the machine loses power unexpectedly, the system can come back up in a **degraded-but-working**
state: one worker offline, a few systemd units failed, or a device/PWA stuck on stale cached assets.
This runbook is **safe by default** (checks + restarts; no deletions).
## 0) Quick operator checks (2–5 minutes)
From repo root:
```bash
# Failed system services (GPU, robotics stack, Ray, etc.)
systemctl --failed --no-pager || true
# Failed user services (periodic automation loops)
systemctl --user --failed --no-pager || true
# Snapshot cluster + website health (WARN-only in degraded states)
bash cluster_excellence_run.sh --quick
bash cluster_2026_excellence_audit.sh --quick --ssh-warn
# Capture an audit bundle under logs/codex_head_audit_YYYYMMDD_HHMMSS/
bash complete_audit_run.sh --quick
```
If the website is running, open:
- `/status` (health dashboard)
- `/reset-app` (if a phone/PWA is stuck on “Loading…”)
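From a shell, a loopback probe works too (a minimal sketch; port 8000 is an assumption here, matching the Ray smoke step later in this runbook):
```bash
# Expect 200 when the site is up (the port is an assumption about your deployment)
curl -fsS -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8000/status || true
```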
## 1) Website correctness (non-regression gates)
```bash
cd aipowerprogressia.com
python3 scripts/run_core_release_gates.py
```
## 2) PWA “Loading…” loops (Android)
On the affected device:
- open `/reset-app`
- choose **Reset app cache**
- reload `/status`
If you recently edited `static/site*.js`:
```bash
cd aipowerprogressia.com
npm run build:js-compat
bash scripts/run_unit_tests.sh
```
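If the device still loops after a rebuild, confirm the server is actually serving the fresh bundle (a sketch; the exact asset filename and port are assumptions about this deployment):
```bash
# Inspect the caching headers on the served bundle (filename/port are assumptions)
curl -fsS -I http://127.0.0.1:8000/static/site.js | rg -i 'etag|cache-control|last-modified' || true
```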
## 3) Storage safety (avoid split-state after restarts)
If you use an external `PPIA_DATA_DIR`, ensure you don’t have a stale repo-local `./data/app.db` causing split-state:
```bash
cd aipowerprogressia.com
bash scripts/check_storage_layout.sh
# Safe-by-default archive (rename) of repo-local ./data/app.db{,-wal,-shm}
bash scripts/archive_repo_local_app_db.sh
```
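To eyeball the layout by hand before archiving (a minimal sketch; assumes `PPIA_DATA_DIR` is exported to your external location):
```bash
# A live app.db in both locations at once is the split-state to avoid
ls -l ./data/app.db* 2>/dev/null || true
ls -l "${PPIA_DATA_DIR}/app.db"* 2>/dev/null || true
```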
## 4) Cluster degraded states (one worker offline)
If a worker is intentionally powered off, mark it optional (so strict audits still mean something):
- edit `cluster_config.sh` → add the node to `OPTIONAL_LINUX_NODES=(...)`
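A hypothetical entry (`worker-02` stands in for your node's hostname):
```bash
# cluster_config.sh — "worker-02" is illustrative; use the powered-off node's name
OPTIONAL_LINUX_NODES=(worker-02)
```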
If you expect the worker to be online, bring it back and re-run:
```bash
bash ray_validate_cluster.sh
bash cluster_excellence_run.sh --repair --quick
```
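If Ray is installed on the head, the standard Ray CLI also gives a quick membership view, independent of the repo scripts:
```bash
# Standard Ray CLI: prints node membership and resource totals from the head
ray status || true
```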
## 5) Ray queue / background routing (optional)
The `/status` Services grid includes a **Ray Queue** card (best-effort). If it is down after a reboot, verify the exporter
and queue bootstrap are running:
```bash
systemctl status ray-queue-metrics.service ray-queue-router.service --no-pager || true
curl -fsS http://127.0.0.1:9891/metrics | rg '^ray_queue_up' || true
```
If `ray_queue_up` is `0`, restart the queue bootstrap:
```bash
sudo systemctl restart ray-queue-router.service
```
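Then give the bootstrap a moment and re-run the same metric probe as above to confirm recovery:
```bash
# Expect ray_queue_up to read 1 once the bootstrap has recovered
sleep 5
curl -fsS http://127.0.0.1:9891/metrics | rg '^ray_queue_up' || true
```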
## 6) Ray distributed execution smoke (optional)
The `/status` Services grid includes a **Ray Smoke (Latest)** card. This verifies *distributed execution* (spread tasks +
node-affinity probes), not just membership.
Check the latest report:
```bash
systemctl status ray-smoke.timer ray-smoke.service --no-pager || true
curl -fsS http://127.0.0.1:8000/api/cluster/ray/smoke/latest | head
```
The on-host artifact is written to:
- `${PPIA_DATA_DIR:-data}/locks/cluster/health/ray_smoke_latest.json`
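To pretty-print that artifact without assuming anything about its schema:
```bash
# Pretty-print the latest smoke report (field names intentionally not assumed)
python3 -m json.tool "${PPIA_DATA_DIR:-data}/locks/cluster/health/ray_smoke_latest.json" | head -n 40
```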
## 7) Local AI Swarm recovery (optional)
If Ask AI or background workflows feel slow/flaky after a reboot, refresh the swarm node registry and probe inventories
from loopback:
```bash
cd aipowerprogressia.com
# Orchestrator-driven heal (plain, and with a warmup pass)
bash scripts/continuous_orchestrator.sh heal --profile swarm
bash scripts/continuous_orchestrator.sh heal --profile swarm --warmup
# ppia doctor fix-swarm variants (base, prune, warmup)
python3 scripts/ppia doctor --fix-swarm
python3 scripts/ppia doctor --fix-swarm --fix-swarm-prune
python3 scripts/ppia doctor --fix-swarm --fix-swarm-warmup
# Orchestrator heal with pruning enabled via the environment
PPIA_HEAL_SWARM_PRUNE=1 bash scripts/continuous_orchestrator.sh heal --profile swarm
# Direct swarm maintenance commands
python3 scripts/ppia swarm warmup
python3 scripts/ppia swarm refresh-nodes --prune
# End-to-end swarm smoke test against the local site
python3 scripts/swarm_smoke.py --base-url http://127.0.0.1:8000 --timeout-s 10 --json-out auto --metrics --automation --check-gating --chat --stream --orch
```
If you use `viaRay` nodes (localhost-only model servers on workers), also confirm the Ray Swarm Gateway is running on the Ray head:
```bash
sudo systemctl status ray-swarm-gateway.service --no-pager || true
curl -fsS http://127.0.0.1:9892/health
sudo journalctl -u ray-swarm-gateway.service -n 200 --no-pager || true
```
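If the health probe fails, a restart mirrors the queue-router recovery in section 5:
```bash
sudo systemctl restart ray-swarm-gateway.service
curl -fsS http://127.0.0.1:9892/health || true
```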