# PowerSearch Grid (MVP) — Architecture + Trust Model
PowerSearch Grid is the opt-in, privacy-first distributed search subsystem of **AI Power Progress iA**.
It provides:
- a control-plane (this FastAPI app) that **signs typed jobs** and stores state in SQLite
- edge agents that **poll** for work, **verify signatures**, and run only **whitelisted handlers**
- a local-first searchable index (`grid_docs` + FTS + optional embeddings) and a `/grid` product surface
- Ask AI grounding over Grid docs via bounded, permission-safe excerpts (no arbitrary URL fetch)
## Architecture map
```mermaid
flowchart LR
  browser["Browser /grid"] -->|GET /grid| web["FastAPI static page"]
  browser -->|Grid Search| api_search["/api/grid/search"]
  browser -->|Ask AI over results| api_ai["/api/ai/chat/stream"]
  browser -->|Trust metadata| api_assets["/api/grid/assets"]
  browser -->|Cluster status| api_status["/api/grid/status"]
  browser -->|Operator console| api_admin["/api/grid/admin/overview"]
  api_search --> db[(SQLite app.db)]
  api_status --> db
  api_assets --> fs[(static/grid assets)]
  api_admin --> db
  api_ai --> db
  api_ai --> ollama["Ollama (local AI)<br/>optional"]
  agent["Edge agent (Python)"] -->|register| api_reg["/api/grid/nodes/register"]
  agent -->|heartbeat| api_hb["/api/grid/nodes/{node_id}/heartbeat"]
  agent -->|poll| api_poll["/api/grid/jobs/poll"]
  agent -->|result| api_res["/api/grid/jobs/{job_id}/result"]
  api_reg --> db
  api_hb --> db
  api_poll --> db
  api_res --> db
  api_submit["/api/grid/jobs/submit<br/>(admin)"] --> db
  api_submit -->|sign manifest| sig["Ed25519 signing key<br/>(local file)"]
  api_poll -->|serves signed manifest| sig
```
## Install / distribution flow (trust-first)
Primary conversion surface: `GET /grid` (`static/grid.html`).
The UX intentionally separates:
1. **Download Edge Agent** (inspectable source)
2. **Quick Install** (platform-specific commands)
3. **Verify / Checksums** (SHA-256 + optional signed manifest)
Key endpoints/assets:
- Agent: `static/grid/edge_agent.py`
- Verify helper: `static/grid/verify_grid_release.py` (verifies signed manifest + optional pinned key + local file checksums)
- Docs:
  - `GET /grid/docs` (inspectable markdown viewer)
  - `GET /grid/docs.md` (raw markdown download)
- Post-install node check:
  - `GET /api/grid/nodes/{node_id}/status_public` (minimal node status; uses `node_id` only, never the token)
- Linux installer: `static/grid/install_edge_agent.sh` (`--print-plan`, `--uninstall`)
  - Safe to re-run: preserves existing `agent.json` (node_id + policy) by default.
  - Upgrade-only mode: `--upgrade` (skips registration; restarts service).
  - Rotate node identity (policy preserved): `--re-register`.
  - Managed caps (Linux `systemd --user`): the generated unit applies `cpu_max_percent` → `CPUQuota` and `ram_max_gb` → `MemoryMax` on install/upgrade. Custom units are preserved.
  - Optional `GRID_DISPLAY_NAME=...` sets a human-readable node name at registration time (defaults to empty for privacy).
  - Supports pinned key verification: set `GRID_PUBKEY_FPR_SHA256=<sha256 fingerprint>` to enforce the signed release-manifest public key.
  - If `GRID_PUBKEY_FPR_SHA256` is set, the installer also writes a pinned key into `agent.json`:
    - `signing_public_key_fingerprint_sha256` (observed)
    - `signing_public_key_fingerprint_sha256_expected` (pinned; agent blocks work if it does not match)
- Uninstaller: `static/grid/uninstall_edge_agent.sh`
- Registration helper: `static/grid/register_node.py`
- Windows installer: `static/grid/install_edge_agent_windows.ps1`
  - Downloads + verifies `SHA256SUMS` for `edge_agent.py`, `register_node.py`, and `verify_grid_release.py`.
  - Verifies the signed release manifest (Ed25519) via `verify_grid_release.py` (optional pinned key via `GRID_PUBKEY_FPR_SHA256` / `-PubKeyFingerprintSha256`).
  - Safe to re-run: reuses existing `agent.json` by default; use `-ReRegister` to rotate node identity (policy preserved).
- Windows uninstaller: `static/grid/uninstall_edge_agent_windows.ps1`
  - Removes agent + config directories from the user profile (does not remove Python or pip packages).
- Optional background operation:
  - macOS: LaunchAgent template: `static/grid/powersearch-grid-agent.plist` (replace `__HOME__`, then load with `launchctl`).
  - Windows: Task Scheduler (recommended). Create a user logon task to run `python edge_agent.py --config agent.json` (copyable commands are provided on `/grid`).
- Trust metadata: `GET /api/grid/assets` (hashes + signed release manifest when crypto is available)
  - Includes a stable `release_id` for the current asset set (derived from the manifest content).
  - Includes `sha256sums_ok` + `sha256sums_note` so operators can detect a stale `static/grid/SHA256SUMS` file.
  - The API serves computed checksums for the currently-served assets; if the on-disk `SHA256SUMS` is stale, `sha256sums_ok=false`.
  - Release helper: run `python3 scripts/update_grid_sha256sums.py` whenever files in `static/grid/` change.
- Policy helpers:
  - `GET /api/grid/policy/default` (default conservative policy)
  - `POST /api/grid/policy/validate` (merge + validate a policy; used by the `/grid` policy editor)
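
The checksum half of the verify step can be reproduced by hand. The following is a minimal sketch, assuming `SHA256SUMS` uses the common `sha256sum` line format (`<hex digest>  <filename>`). The function names are illustrative and are not taken from `verify_grid_release.py`, and the sketch does not cover the signed-manifest check.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_sha256sums(text: str) -> dict[str, str]:
    """Parse sha256sum-style lines into {filename: hex digest} (assumed format)."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        digest, _, name = line.partition(" ")
        # sha256sum marks binary mode with a leading "*" on the filename.
        expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify_dir(asset_dir: Path, sums_text: str) -> dict[str, bool]:
    """Compare each listed file in asset_dir against its expected digest."""
    results = {}
    for name, digest in parse_sha256sums(sums_text).items():
        target = asset_dir / name
        results[name] = target.is_file() and sha256_file(target) == digest
    return results
```

A missing file is reported as `False` rather than raising, so a single call gives a complete pass/fail map for the downloaded asset set.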
## Agent doctor (read-only diagnostics)
`edge_agent.py` supports a local doctor mode for trust-first debugging:
```bash
~/.local/share/powersearch-grid-agent/venv/bin/python ~/.local/share/powersearch-grid-agent/edge_agent.py \
--config ~/.config/powersearch-grid/agent.json \
--doctor
```
Doctor checks include:
- config validity + policy parse
- pinned signing key match (if configured)
- control-plane reachability (`/health`, `/api/grid/status`, `status_public`)
- Ollama reachability (best-effort)
## Trust + consent model (MVP)
Core invariants:
- **Off by default:** `policy.enabled=false` until the user explicitly opts in.
- **Emergency stop (local):** if `policy.emergency_stop_enabled=true`, creating an `EMERGENCY_STOP` file next to `agent.json` pauses work immediately (delete it to resume).
- **Pull-only:** agents poll the control-plane; no peer-to-peer pushes.
- **Typed jobs only:** `GRID_JOB_TYPES = {health_check, crawl_url, ollama_chat}`.
- **Signed manifests:** edge agents verify Ed25519 signatures on job manifests.
- **Whitelisted handlers:** the agent executes only hard-coded handlers for allowed job types.
- **Minimal telemetry (honest):** agents send heartbeats (coarse CPU/RAM/disk + agent version; no hostname by default) and job results. For `crawl_url`, the agent fetches **text content only** (robots-respecting, bounded size/time, allowlisted targets) and uploads page text for indexing. The control-plane stores the page text in `grid_docs` for search; `grid_job_results` stores metadata only (`content_omitted=true`) to keep the DB compact.
- **SSRF defense-in-depth:** for `crawl_url`, both control-plane and agents enforce domain allowlists and block private IPs by default.
- **Rate limiting:** per-node throttles on poll/heartbeat/result (token-authenticated) and per-IP throttles on public endpoints like `/api/grid/search`.
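
The first two invariants amount to a local gate that runs before any poll. A minimal sketch of that gate follows; the function name and return shape are hypothetical, while the policy keys and the `EMERGENCY_STOP` file name come from the list above.

```python
from pathlib import Path


def work_gate(config_path: Path, policy: dict) -> tuple[bool, str]:
    """Decide locally whether the agent may pull work (illustrative only)."""
    # Off by default: a missing key is treated the same as enabled=false.
    if not policy.get("enabled", False):
        return False, "policy.enabled is false (not opted in)"
    # The stop file lives next to agent.json and is honoured only when enabled.
    stop_file = config_path.parent / "EMERGENCY_STOP"
    if policy.get("emergency_stop_enabled", False) and stop_file.exists():
        return False, "EMERGENCY_STOP file present"
    return True, "ok"
```

Returning a reason string alongside the boolean makes it straightforward to surface why a node is not pulling jobs.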
Policy keys (agent-enforced):
- `crawl_allowlist_domains`: list of allowed domains (empty → default to `base_url` host)
- `allow_private_crawl_ips`: default `false`
- `crawl_max_redirects`: default `3`
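
To illustrate how the allowlist and private-IP keys might combine, here is a hedged sketch of a crawl-target check. It is not the project's `grid_validate_crawl_url`; the function names are hypothetical, subdomain matching is an assumption, and resolved addresses are passed in so the logic stays testable without DNS. A real agent would also need to repeat the check on every redirect hop.

```python
import ipaddress
from urllib.parse import urlsplit


def host_allowed(host: str, allowlist: list[str]) -> bool:
    """True if host equals an allowlisted domain or is a subdomain of one."""
    host = host.lower().rstrip(".")
    for domain in allowlist:
        domain = domain.lower().strip().rstrip(".")
        if domain and (host == domain or host.endswith("." + domain)):
            return True
    return False


def check_crawl_target(url, resolved_ips, allowlist, allow_private=False):
    """Return (allowed, reason) for a crawl URL and its resolved addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False, "unsupported URL"
    if not host_allowed(parts.hostname, allowlist):
        return False, "host not in allowlist"
    if not allow_private:
        for ip in resolved_ips:
            # is_global is False for private, loopback, and link-local ranges.
            if not ipaddress.ip_address(ip).is_global:
                return False, f"non-global address {ip}"
    return True, "ok"
```

The caller is expected to substitute the `base_url` host when `crawl_allowlist_domains` is empty, as described above.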
Policy keys (best-effort / heuristic):
- `idle_only`, `quiet_hours`, `plugged_in_only` (best-effort), `thermal_throttle_enabled` (Linux-only), `reserve_cores_for_user`
- `reserve_ram_gb_for_user` (Linux-only RAM reserve; other platforms treat RAM availability as unknown)
Control-plane knobs (operator env vars):
- `GRID_CRAWL_ALLOWLIST` (defaults to `PUBLIC_BASE_URL` host)
- `GRID_ALLOW_PRIVATE_CRAWL_IPS=1` to permit private IPs (LAN-only deployments)
- `GRID_REGISTRATION_TOKEN` to require a join token on public networks
## Job lifecycle map
1. **Submit (admin/operator)**: `POST /api/grid/jobs/submit` (token-gated)
2. **Validate**: `grid_validate_crawl_url` blocks off-allowlist and non-global targets
3. **Queue**: job persisted in `grid_jobs` (status `queued`)
4. **Poll**: agent calls `POST /api/grid/jobs/poll` with its current policy/capabilities
5. **Assign**: control-plane chooses a node and returns `{manifest, signature_b64}`
6. **Verify (agent)**: Ed25519 signature verification + policy allowlist checks
7. **Execute**: whitelisted handler runs:
   - `health_check` (capabilities + health)
   - `crawl_url` (bounded crawl + robots + allowlist + private-IP blocking)
   - `ollama_chat` (local Ollama; model bounded by node config)
8. **Submit result**: `POST /api/grid/jobs/{job_id}/result`
9. **Index (control-plane)**: crawl results can be indexed into `grid_docs` (+ optional FTS)
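
Steps 6 and 7 can be pictured as a fixed dispatch table on the agent. This is a sketch under stated assumptions: the manifest field names `type` and `payload` are hypothetical, the handlers are stubs, and the Ed25519 check is represented by a boolean computed elsewhere.

```python
GRID_JOB_TYPES = {"health_check", "crawl_url", "ollama_chat"}


def handle_health_check(payload: dict) -> dict:
    return {"ok": True}  # stub: a real handler would report capabilities


def handle_crawl_url(payload: dict) -> dict:
    return {"ok": False, "error": "not implemented in this sketch"}


def handle_ollama_chat(payload: dict) -> dict:
    return {"ok": False, "error": "not implemented in this sketch"}


# Hard-coded table: a job type without an entry here can never execute.
HANDLERS = {
    "health_check": handle_health_check,
    "crawl_url": handle_crawl_url,
    "ollama_chat": handle_ollama_chat,
}


def dispatch(manifest: dict, signature_ok: bool) -> dict:
    """Run a job only if its signature verified and its type is whitelisted."""
    if not signature_ok:
        raise PermissionError("manifest signature did not verify")
    job_type = manifest.get("type")  # assumed field name
    handler = HANDLERS.get(job_type)
    if job_type not in GRID_JOB_TYPES or handler is None:
        raise ValueError(f"job type not allowed: {job_type!r}")
    return handler(manifest.get("payload", {}))  # assumed field name
```

Because the table is a literal in the agent source, the control-plane cannot introduce a new job type without a new agent release.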
Reliability note: during polls, the control-plane makes a best-effort attempt to **reclaim stale `assigned` jobs** (it requeues them, or marks them failed when `attempts>=max_attempts`) so the queue can make progress if a node disappears mid-job. Tunables:
- `GRID_JOB_STALE_MULTIPLIER` (2.0)
- `GRID_JOB_STALE_GRACE_S` (60)
- `GRID_JOB_STALE_MIN_S` (180)
- `GRID_JOB_STALE_MAX_S` (7200)
- `GRID_JOB_STALE_SCAN_LIMIT` (32)
- `GRID_JOB_STALE_RECLAIM_LIMIT` (6)
- `GRID_JOB_STALE_RECLAIM_INTERVAL_S` (20)
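
One plausible way the first four stale-job tunables could combine is sketched below. The formula (job timeout times multiplier, plus grace, clamped to the min/max window) is an assumption for illustration and is not confirmed by this document. Only the requeue-versus-fail rule is taken from the note above.

```python
def stale_after_s(
    job_timeout_s: float,
    multiplier: float = 2.0,   # GRID_JOB_STALE_MULTIPLIER
    grace_s: float = 60.0,     # GRID_JOB_STALE_GRACE_S
    min_s: float = 180.0,      # GRID_JOB_STALE_MIN_S
    max_s: float = 7200.0,     # GRID_JOB_STALE_MAX_S
) -> float:
    """Assumed formula: clamp(timeout * multiplier + grace, min, max)."""
    raw = job_timeout_s * multiplier + grace_s
    return max(min_s, min(max_s, raw))


def reclaim_action(assigned_age_s, job_timeout_s, attempts, max_attempts) -> str:
    """Return 'keep', 'requeue', or 'failed' for an assigned job."""
    if assigned_age_s < stale_after_s(job_timeout_s):
        return "keep"
    # From the reliability note: fail once attempts reach max_attempts.
    return "failed" if attempts >= max_attempts else "requeue"
```

Under this assumed formula, a job with a 30-second timeout would be treated as stale after 180 seconds (the floor), and one with a 300-second timeout after 660 seconds.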
## Search + semantic + Ask AI grounding
Grid search endpoint:
- `GET /api/grid/search` supports:
  - lexical search via SQLite FTS (when available)
    - uses BM25 ranking (title weighted higher than body)
  - optional semantic rerank with cached embeddings (`grid_doc_embeddings`)
  - trust/ops metadata in the response (best-effort):
    - `mode` (`lexical|semantic|recent`) and optional `note` / `semantic_error`
    - `dedupe` summary (`by=content_hash`, `pruned`)
    - `timings_ms` (for UI latency surfacing and operator debugging)
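
The `dedupe` summary suggests that results sharing a `content_hash` are collapsed before they are returned. A minimal sketch of that idea, assuming results arrive already ranked (so the first occurrence is the one to keep) and that rows without a hash are never pruned. The function name is hypothetical.

```python
def dedupe_by_content_hash(results: list[dict]) -> tuple[list[dict], dict]:
    """Keep the first row per content_hash; report how many were pruned."""
    seen: set[str] = set()
    kept: list[dict] = []
    for row in results:
        key = row.get("content_hash")
        if key and key in seen:
            continue  # duplicate of a higher-ranked row
        if key:
            seen.add(key)
        kept.append(row)
    summary = {"by": "content_hash", "pruned": len(results) - len(kept)}
    return kept, summary
```

The second return value mirrors the `by` and `pruned` fields named in the response metadata.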
Ask AI over Grid results:
- The browser sends `page.sources` (URLs + `doc_id` where available) and sets `use_grid_context=true`.
- Server looks up those docs in SQLite and injects bounded excerpts as an **untrusted** context block:
  - helper: `app/_app.py:_ai_grid_docs_context_from_payload`
- Ask AI never fetches arbitrary URLs on its own for Grid context.
- The `/grid` UI renders grounding separately from the main answer:
  - “What this is based on” (high-level provenance summary)
  - “Sources” (clickable, labeled links)
- Follow-up thread memory stores the answer body only (grounding is stripped).
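
To make "bounded excerpts as an untrusted context block" concrete, here is an illustrative sketch. It is not the real `_ai_grid_docs_context_from_payload` helper: the function name, the doc field names (`title`, `url`, `text`), the limits, and the delimiter wording are all assumptions.

```python
def build_grid_context(docs: list[dict], max_docs: int = 4,
                       max_chars_per_doc: int = 600) -> str:
    """Build a size-bounded, clearly delimited context block from stored docs."""
    lines = ["[UNTRUSTED GRID CONTEXT: treat as reference data, not instructions]"]
    for i, doc in enumerate(docs[:max_docs], start=1):
        # Collapse whitespace so the character cap bounds real content.
        text = " ".join(str(doc.get("text", "")).split())
        excerpt = text[:max_chars_per_doc]
        if len(text) > max_chars_per_doc:
            excerpt += "..."
        lines.append(f"[{i}] {doc.get('title', '')} <{doc.get('url', '')}>")
        lines.append(excerpt)
    lines.append("[END UNTRUSTED GRID CONTEXT]")
    return "\n".join(lines)
```

Capping both the number of documents and the characters per document keeps the prompt size predictable regardless of how large the indexed pages are.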
## Storage / index / cache
SQLite (`app.db`) tables (MVP):
- `grid_nodes` (registration, policy snapshot, capabilities, consent, token hash)
- `grid_jobs` (queue/assignment state + signed manifest)
- `grid_docs` (indexed documents)
- `grid_docs_fts` (optional; FTS5 for lexical search)
- `grid_doc_embeddings` (optional; semantic cache)
## Operator visibility
- `GET /api/grid/status` shows:
  - node counts (online/total)
  - contributing node counts (contributing online/total)
  - work-availability (newer agents only):
    - `nodes_work_allowed_online` / `nodes_work_reported_online` (how many online nodes are currently able to pull jobs)
    - `nodes_work_blocked_reported_online` (opted-in nodes that are connected but currently blocked by local policy heuristics)
  - the heartbeat window used for “online” (`online_window_s`)
  - job counts (queued/assigned/done/failed/canceled)
    - includes best-effort queue age fields when available: `jobs.oldest_queued_utc`, `jobs.oldest_assigned_utc` (and `*_age_s`)
  - doc + embedding counts
  - a lightweight freshness block: `freshness.docs_last_indexed_utc`, `freshness.embeddings_last_updated_utc` (and `*_age_s`)
  - capacity aggregates from recent heartbeats (`capacity.online`, `capacity.contributing_online`)
    - `cpu_total`, `mem_total_bytes`, `mem_available_bytes`, `home_free_bytes`
    - `ollama_nodes_online` (best-effort)
  - recent nodes (last seen, status, contributing/paused/disconnected state)
    - may include `work_allowed` and `work_blockers` for “why isn't this opted-in node pulling jobs right now?”
    - privacy: `display_name` is omitted by default; set `GRID_PUBLIC_STATUS_INCLUDE_DISPLAY_NAME=1` to include it
Operator console (token-gated or direct-local only):
- `GET /api/grid/admin/overview` returns:
  - nodes (policy + heartbeat/capabilities summary)
  - jobs (status + attempts + result summary)
  - audit events
- `GET /api/grid/admin/jobs/{job_id}` returns:
  - per-job trace (slim job fields + assigned node summary + related audit events; best-effort)
- `POST /api/grid/jobs/submit` queues an admin job (used by the `/grid` operator console “Submit job” form)
- `POST /api/grid/admin/jobs/{job_id}/retry` re-queues a copy of a job
- `POST /api/grid/admin/jobs/{job_id}/cancel` cancels a queued **or assigned** job (status becomes `canceled`)
- `POST /api/grid/admin/jobs/reclaim-stale` forces a best-effort reclaim of stale assigned jobs (returns `{requeued, failed, scanned}`)
Auth model:
- If `GRID_ADMIN_TOKEN` (or `APP_ADMIN_TOKEN`) is set, requests must provide the token header.
- If no tokens are configured, operator endpoints allow **direct loopback only** (no forwarded headers).
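
The two auth rules can be expressed as a single decision function. This is a sketch only: the header name `X-Admin-Token` and the list of forwarded headers are assumptions, since the document does not name them.

```python
import hmac
import ipaddress

# Assumed set of proxy headers that indicate the request was not direct.
FORWARDED_HEADERS = ("forwarded", "x-forwarded-for", "x-real-ip")


def operator_allowed(headers: dict, client_host: str, admin_token) -> bool:
    """Token when configured; otherwise direct loopback with no proxy headers."""
    headers = {k.lower(): v for k, v in headers.items()}
    if admin_token:
        supplied = headers.get("x-admin-token", "")  # hypothetical header name
        # Constant-time comparison avoids leaking the token via timing.
        return hmac.compare_digest(supplied.encode(), admin_token.encode())
    if any(name in headers for name in FORWARDED_HEADERS):
        return False
    try:
        return ipaddress.ip_address(client_host).is_loopback
    except ValueError:
        return False  # not an IP literal, so it cannot be proven loopback
```

Rejecting any request that carries forwarded headers prevents a reverse proxy on the same host from making remote traffic look like loopback traffic.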
## Top issues (severity × leverage)
1. Release/versioning: automate `SHA256SUMS` + signed-manifest distribution + changelog.
2. Key pinning UX: better “verify signature” flow + rotation/runbook story.
3. Policy editor UX: safe defaults + inline explanations + validation errors.
4. Node health details: show CPU/RAM/disk + last heartbeat payload on `/grid` + `/status`.
5. Operator job queue UI: submit/inspect/retry/cancel jobs (token-gated).
6. Agent upgrade path: idempotent updates + “what changed” diff view.
7. Constraint enforcement: normalize timeouts/max_bytes/redirects consistently across plane + agent.
8. Crawl pipeline: canonicalization + dedupe + content-type enforcement + snippet quality.
9. Robots handling: caching + clearer failure modes + operator overrides (still conservative).
10. Stronger consent UX: explicit resource caps + “pause/stop” UX + local logs.
11. Public endpoint hardening: rate limits + safer errors + audit trail coverage.
12. Registration safety: token requirements + replay protection + clearer “public vs LAN” modes.
13. Multi-tenant scoping: org/workspace isolation for nodes/jobs/docs (permission-aware search).
14. Search relevance: better blending of FTS + semantic; near-duplicate clustering.
15. Embedding lifecycle: TTL/invalidation + model versioning + cache health endpoints.
16. Ask AI grounding: UI to select which Grid docs are used; better citations for excerpts.
17. Degraded mode: clearer UI when grid disabled/unavailable (local-only fallback).
18. Observability: per-stage latency metrics + job success rates + node SLOs.
19. Packaging: Docker image for agent + signed releases for Linux/macOS/Windows.
20. Sandbox posture: stricter network egress for crawls (optional), safer redirect policies, headers.
## Implementation order (recommended)
1. Automate `SHA256SUMS` + signed-manifest release generation (single source of truth).
2. Ship a token-gated operator UI for nodes + jobs (status, queue, retries, audit trail).
3. Ship a user-facing policy editor + consent explainer (opt-in clarity; safe defaults).
4. Improve indexing/crawl quality (canonicalization, dedupe, content-type rules, snippets).
5. Improve Ask AI “grounded over Grid” UX (doc selection, better citations, fallbacks).
6. Add observability + SLOs (latency + job success + node health trend lines).
## Baseline measurement plan
Local checks:
- Unit tests: `bash scripts/run_unit_tests.sh`
- Regression smoke: `python3 scripts/regression_smoke.py`
Grid-specific smoke:
- `curl -fsS http://127.0.0.1:8000/api/grid/assets | head`
- `curl -fsS http://127.0.0.1:8000/api/grid/status`
- Open `http://127.0.0.1:8000/grid` and verify:
- Download + Verify section loads checksums
- platform tabs switch
- Grid Search works (empty state + query)
- Ask AI over results streams and cites Grid URLs