# Power outage recovery (operator runbook)
If power flickers or the machine loses power unexpectedly, the system can come back up in a **degraded-but-working**
state: one worker offline, a few systemd units failed, or a device/PWA stuck on stale cached assets.
This runbook is **safe by default** (checks + restarts; no deletions).
## 0) Quick operator checks (2–5 minutes)
From repo root:
```bash
# Failed system services (GPU, robotics stack, Ray, etc.)
systemctl --failed --no-pager || true
# Failed user services (periodic automation loops)
systemctl --user --failed --no-pager || true
# Snapshot cluster + website health (WARN-only in degraded states)
bash cluster_excellence_run.sh --quick
bash cluster_2026_excellence_audit.sh --quick --ssh-warn
# Capture an audit bundle under logs/codex_head_audit_YYYYMMDD_HHMMSS/
bash complete_audit_run.sh --quick
```
If the website is running, open:
- `/status` (health dashboard)
- `/reset-app` (if a phone/PWA is stuck on “Loading…”)
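From a shell, a loopback probe works too (a minimal sketch; port 8000 is an assumption here, matching the Ray smoke step later in this runbook):
```bash
# Expect 200 when the site is up (the port is an assumption about your deployment)
curl -fsS -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8000/status || true
```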
## 1) Website correctness (non-regression gates)
```bash
cd aipowerprogressia.com
python3 scripts/run_core_release_gates.py
```
## 2) PWA “Loading…” loops (Android)
On the affected device:
- open `/reset-app`
- choose **Reset app cache**
- reload `/status`
If you recently edited `static/site*.js`:
```bash
cd aipowerprogressia.com
npm run build:js-compat
bash scripts/run_unit_tests.sh
```
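If the device still loops after a rebuild, confirm the server is actually serving the fresh bundle (a sketch; the exact asset filename and port are assumptions about this deployment):
```bash
# Inspect the caching headers on the served bundle (filename/port are assumptions)
curl -fsS -I http://127.0.0.1:8000/static/site.js | rg -i 'etag|cache-control|last-modified' || true
```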
## 3) Storage safety (avoid split-state after restarts)
If you use an external `PPIA_DATA_DIR`, ensure you don’t have a stale repo-local `./data/app.db` causing split-state:
```bash
cd aipowerprogressia.com
bash scripts/check_storage_layout.sh
# Safe-by-default archive (rename) of repo-local ./data/app.db{,-wal,-shm}
bash scripts/archive_repo_local_app_db.sh
```
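To eyeball the layout by hand before archiving (a minimal sketch; assumes `PPIA_DATA_DIR` is exported to your external location):
```bash
# A live app.db in both locations at once is the split-state to avoid
ls -l ./data/app.db* 2>/dev/null || true
ls -l "${PPIA_DATA_DIR}/app.db"* 2>/dev/null || true
```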
## 4) Cluster degraded states (one worker offline)
If a worker is intentionally powered off, mark it optional (so strict audits still mean something):
- edit `cluster_config.sh` → add the node to `OPTIONAL_LINUX_NODES=(...)`
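A hypothetical entry (`worker-02` stands in for your node's hostname):
```bash
# cluster_config.sh — "worker-02" is illustrative; use the powered-off node's name
OPTIONAL_LINUX_NODES=(worker-02)
```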
If you expect the worker to be online, bring it back and re-run:
```bash
bash ray_validate_cluster.sh
bash cluster_excellence_run.sh --repair --quick
```
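If Ray is installed on the head, the standard Ray CLI also gives a quick membership view, independent of the repo scripts:
```bash
# Standard Ray CLI: prints node membership and resource totals from the head
ray status || true
```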
## 5) Ray queue / background routing (optional)
The `/status` Services grid includes a **Ray Queue** card (best-effort). If it is down after a reboot, verify the exporter
and queue bootstrap are running:
```bash
systemctl status ray-queue-metrics.service ray-queue-router.service --no-pager || true
curl -fsS http://127.0.0.1:9891/metrics | rg '^ray_queue_up' || true
```
If `ray_queue_up` is `0`, restart the queue bootstrap:
```bash
sudo systemctl restart ray-queue-router.service
```
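Then give the bootstrap a moment and re-run the same metric probe as above to confirm recovery:
```bash
# Expect ray_queue_up to read 1 once the bootstrap has recovered
sleep 5
curl -fsS http://127.0.0.1:9891/metrics | rg '^ray_queue_up' || true
```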
## 6) Ray distributed execution smoke (optional)
The `/status` Services grid includes a **Ray Smoke (Latest)** card. This verifies *distributed execution* (spread tasks +
node-affinity probes), not just membership.
Check the latest report:
```bash
systemctl status ray-smoke.timer ray-smoke.service --no-pager || true
curl -fsS http://127.0.0.1:8000/api/cluster/ray/smoke/latest | head
```
The on-host artifact is written to:
- `${PPIA_DATA_DIR:-data}/locks/cluster/health/ray_smoke_latest.json`
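To pretty-print that artifact without assuming anything about its schema:
```bash
# Pretty-print the latest smoke report (field names intentionally not assumed)
python3 -m json.tool "${PPIA_DATA_DIR:-data}/locks/cluster/health/ray_smoke_latest.json" | head -n 40
```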
## 7) Local AI Swarm recovery (optional)
If Ask AI or background workflows feel slow/flaky after a reboot, refresh the swarm node registry and probe inventories
from loopback:
```bash
cd aipowerprogressia.com
# Orchestrator-driven heal (plain, and with a warmup pass)
bash scripts/continuous_orchestrator.sh heal --profile swarm
bash scripts/continuous_orchestrator.sh heal --profile swarm --warmup
# ppia doctor fix-swarm variants (base, prune, warmup)
python3 scripts/ppia doctor --fix-swarm
python3 scripts/ppia doctor --fix-swarm --fix-swarm-prune
python3 scripts/ppia doctor --fix-swarm --fix-swarm-warmup
# Orchestrator heal with pruning enabled via the environment
PPIA_HEAL_SWARM_PRUNE=1 bash scripts/continuous_orchestrator.sh heal --profile swarm
# Direct swarm maintenance commands
python3 scripts/ppia swarm warmup
python3 scripts/ppia swarm refresh-nodes --prune
# End-to-end swarm smoke test against the local site
python3 scripts/swarm_smoke.py --base-url http://127.0.0.1:8000 --timeout-s 10 --json-out auto --metrics --automation --check-gating --chat --stream --orch
```
If you use `viaRay` nodes (localhost-only model servers on workers), also confirm the Ray Swarm Gateway is running on the Ray head:
```bash
sudo systemctl status ray-swarm-gateway.service --no-pager || true
curl -fsS http://127.0.0.1:9892/health
sudo journalctl -u ray-swarm-gateway.service -n 200 --no-pager || true
```
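If the health probe fails, a restart mirrors the queue-router recovery in section 5:
```bash
sudo systemctl restart ray-swarm-gateway.service
curl -fsS http://127.0.0.1:9892/health || true
```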