# Sol Chat Web

This document covers the production web chat exposed at `https://sol.system42.one/chat`, the same-origin backend route at `/api/chat`, and the replacement of the legacy `www/sol-chat.html` page.

## Recent Changes

Changes from the last 72 hours, condensed:

- the live reasoning backend was switched to the local `DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf` lane on `127.0.0.1:18080`
- the website now exposes a chat route that can be opened directly in a browser and imported as a GPT Action, through:
  - `GET /api/chat/query`
  - `GET /chat-openapi.json`
- both assistant surfaces now share the same backend service:
  - the dedicated `/chat` client
  - the floating desktop assistant in `www/index.html`
- the floating desktop assistant was hardened after page-summary failures:
  - streamed turns now fall back to non-stream JSON if the stream faults or yields no usable text
  - the backend now applies the non-empty-answer retry path to streamed generations too
- the floating desktop assistant now caches page-grounded replies by prompt plus page fingerprint, and caches generated read-aloud narration text by page fingerprint
- desktop page-grounded turns now default to one-shot JSON instead of SSE to avoid stream startup overhead on summary-style prompts
- the desktop assistant now exposes a real play/pause/resume transport and stops queued playback when the popup is closed
- the desktop assistant prompt strip now keeps the first suggestion anchored while rerolling the second and third suggestions from a broader context-aware pool on each popup open
- `/chat` now sends a real `page_context` payload built from the visible interface, including stack cards, debug metrics, diagnostics, and recent transcript context
- debug-mode retrieval metadata now persists across reloads via session history instead of disappearing after the original streamed turn
- direct query answers are model-generated on cache miss and reused from a file-backed cache on identical repeats
- direct query grounding is richer than before:
  - top retrieval snippets
  - current source-file contents for the top hit files when available
  - live site-state and sensor context from `site-metrics.json`
- this was added specifically to improve low-signal turns such as greetings and diagnostics, so prompts like `hello?` can be answered from local context instead of generic assistant chatter
- creative prompts still keep the same retrieval/source-file bundle in play; that context is meant to act as seed material rather than being bypassed

## Architecture

- UI:
  - canonical client: `/home/david/random/www/chat/index.html`
  - supporting assets: `/home/david/random/www/chat/chat.css`, `/home/david/random/www/chat/chat.js`
  - desktop shell assistant: `/home/david/random/www/index.html`
  - deprecated entry: `/home/david/random/www/sol-chat.html` now redirects to `/chat`
  - public face now mirrors the desktop assistant language from `www/index.html`: retro dialog framing, Sol orb presence, voice controls, and on-page debug/metrics blocks
- origin routing:
  - Caddy serves `/chat` by rewriting to `/chat/index.html`
  - Caddy now sends `Cache-Control: no-store` for `/chat` and `/chat/*` so mobile clients do not sit on stale CSS/JS after a page refresh
  - Caddy reverse-proxies `/api/chat*` to `127.0.0.1:8895`
- backend:
  - daemon: `/home/david/random/bin/sol_chat_api.py`
  - runtime style: stdlib `ThreadingHTTPServer`, matching the small local daemons already used for knowledge/logbook/gui metadata
- persistence:
  - default root: `/home/david/.local/share/sol_chat_web`
  - per-session JSON files under `sessions/`
  - each session stores `created_at`, `updated_at`, `metadata`, and `messages[]`
- adjacent local pipeline docs:
  - `/home/david/random/docs/sol-multimodal-pipeline.md` covers the direct webcam/bootstrap scripts, local model lanes, and current bring-up status outside the web-only surface
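The per-session persistence layout can be sketched as follows. Only `created_at`, `updated_at`, `metadata`, and `messages[]` are documented above; the helper name, the demo directory, and any other field details are illustrative assumptions, not the daemon's actual code.

```python
import json
import time
from pathlib import Path

# Hypothetical sketch of the per-session JSON files under sessions/.
# The real schema lives in sol_chat_api.py; this demo uses a temp path.
SESSIONS_DIR = Path("/tmp/sol_chat_web_demo/sessions")

def save_session(session_id: str, messages: list[dict], metadata: dict) -> Path:
    SESSIONS_DIR.mkdir(parents=True, exist_ok=True)
    path = SESSIONS_DIR / f"{session_id}.json"
    now = time.time()
    # Preserve created_at across rewrites; bump updated_at every save.
    existing = json.loads(path.read_text()) if path.exists() else {}
    doc = {
        "created_at": existing.get("created_at", now),
        "updated_at": now,
        "metadata": metadata,
        "messages": messages,
    }
    path.write_text(json.dumps(doc, indent=2))
    return path

p = save_session("demo", [{"role": "user", "content": "hello"}], {"client": "chat"})
print(p)
```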

## Request Surface

- `POST /api/chat`
  - request body:
    - `message`
    - `session`
    - `stream`
    - optional `images[]` with browser-supplied data URLs for multimodal turns
    - optional `profile` to prefer `vision` or `vision_fast` when images are present
  - response:
    - JSON when streaming is disabled
    - `text/event-stream` when streaming is enabled
- `GET /api/chat/query?query=...`
  - browser-openable direct JSON chat route
  - stateless by default:
    - if `session` is omitted and `persist` is not set, the turn is processed without being written into history
  - message generation behavior:
    - this route forces model generation for `message` on normal cache misses instead of using the extractive fallback path
    - probe-style connectivity and diagnostic requests are allowed to use the deterministic live-site-state fallback so they stay concise and avoid overflowing the local `4096` token context window
    - repeated identical direct-query requests reuse a file-backed cache
    - response metadata includes `cache_hit`
  - useful query params:
    - `query` or `message`
    - `session`
    - `persist=1`
    - `profile`
    - `page_title`
    - `page_target`
    - `page_content_type`
    - repeated `page_heading`
    - repeated `page_question`
    - `page_content`
    - or `page_context` as a JSON-encoded object
- `GET /api/chat/history?session=...`
  - returns the stored session history
- `POST /api/chat/reset`
  - clears the session back to the system prompt
- `GET /api/chat/health`
  - lightweight readiness/config surface without exposing backend topology details
- `GET /chat-openapi.json`
  - public OpenAPI schema for importing the combined Sol Action surface as a GPT Action
  - import URL:
    - `https://sol.system42.one/chat-openapi.json`
  - privacy policy URL:
    - `https://sol.system42.one/privacy.html`

## Frontend Behavior

Both website assistant surfaces now talk to a single same-origin backend service, `sol_chat_api.py`, through `/api/chat*`.

The dedicated `/chat` page is not just a thin wrapper over the API anymore. It now carries the same Sol presence language as the desktop shell while keeping a public-facing layout:

- persisted `sol_session` in `localStorage`
- same-origin history load, send, and reset flows
- image attachment flow with local preview cards before send
- `/chat` now builds a local page-context payload for text turns from the visible UI itself:
  - stack summary
  - stack cards
  - debug metrics
  - client diagnostics
  - recent transcript tail
- Sol orb presence using `www/assets/hue-visualizer.js`
- voice controls:
  - voice arm/disarm
  - speak latest assistant message
  - stop playback
- automatic speech pre-cache for the latest visible reply
- site-wide debug metrics block sourced from `site-metrics.json`
- per-message retrieval diagnostics when debug mode is enabled

The page is intentionally same-origin and local-first. The browser never talks directly to the model backend or any LAN-only llama endpoint. Image attachments are serialized in-browser and posted only to the same-origin Sol API.

The floating desktop assistant in `www/index.html` uses the same route family:

- `POST /api/chat`
- `GET /api/chat/history`
- `GET /api/chat/speak`

This keeps the public `/chat` surface and the shell assistant on the same local reasoning backend, retrieval path, speech cache, and session persistence format.

The two clients no longer use the transport identically:

- `/chat` remains SSE-first for visible incremental transcript updates
- the floating desktop assistant now prefers non-stream JSON when a grounded `page_context` is present, because page-summary turns benefit more from lower startup overhead than from token-by-token UI updates
- the floating desktop assistant still uses SSE for non-page/freeform turns and retries without stream if the streamed path fails or yields an empty final message

## Context Assembly

The site chat now treats context as a ranked bundle, not a single blob:

- system prompt:
  - `/home/david/random/prompts/sol_chat_system_prompt.txt`
  - tells Sol to treat page context, retrieval, and history as evidence to synthesize from rather than text to parrot
- page context:
  - for `/chat`, built directly from the visible interface state
  - includes the page title, stack summary, debug metrics, diagnostics, and a bounded recent transcript
  - for the floating desktop assistant, page context is fingerprinted from target, title, content type, headings, and bounded content
  - that fingerprint now drives browser-side reuse of:
    - grounded reply text for repeated page questions
    - generated narration text for repeated read-aloud requests
- retrieval:
  - still queried from the knowledge API for text turns
  - used as supporting evidence when the question is broader than the current page
  - the chat backend can now also synthesize one supplemental knowledge query for the same turn and call `/api/knowledge/query` again with that generated query
  - the supplemental lookup is additive; it does not replace the default retrieval query
  - the response surface exposes this as `retrieval.supplemental`
  - the backend now also loads the current source file for top retrieval hits when possible, so the model sees both:
    - the quoted embedding hit
    - the current file contents behind that hit
  - for HTML sources, the backend now renders readable visible text from the current file before matching snippets or building excerpts
  - retrieval metadata records this with `current_file_representation=rendered_html_text`
  - if an indexed snippet reflects an older snapshot and the current file differs, the model is told that explicitly
  - if an indexed snippet is still present in the current readable file text, `snippet_found_in_current_file` is set to `true`
- live site state:
  - for text turns, the backend now injects a live block from `site-metrics.json` as additional context rather than reserving it for diagnostics only
  - this includes current traffic summary, top paths, recent requests, sensor readouts, and runtime service status when available
  - greetings and diagnostic pings use this block instead of defaulting to generic assistant chatter
  - concise diagnostic pings now prioritize this live site-state path and skip archive retrieval so they stay under the local `4096` token context ceiling
  - direct `queryChat` connectivity probes also use this path now, so GPT Action test invocations can return a short grounded confirmation instead of a long model answer or a 502 from context overflow
- history:
  - still persisted, but the live request budget now favors current page context and grounding evidence over older transcript bulk
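The fingerprint-keyed reuse described above can be sketched as a stable hash over the page fields. The exact fields and hashing scheme in the client are assumptions here; only the field list (target, title, content type, headings, bounded content) comes from the description above.

```python
import hashlib
import json

def page_fingerprint(target, title, content_type, headings, content, max_content=4000):
    """Hypothetical sketch: stable fingerprint over the page fields the
    desktop assistant is described as fingerprinting. Content is bounded
    before hashing so huge pages still key consistently."""
    payload = json.dumps([target, title, content_type, headings, content[:max_content]])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

# Grounded replies keyed by (prompt, fingerprint); narration by fingerprint alone.
reply_cache: dict[tuple[str, str], str] = {}

fp = page_fingerprint("/chat", "Sol Chat", "chat_ui", ["Sol / Chat"], "status: idle")
reply_cache[("summarize this page", fp)] = "cached grounded reply"
print(fp)
```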

Important runtime constraint:

- the live reasoning lane is currently running on a `4096` context window because that is the hardware-safe configuration for the local `DeepSeek-R1-Distill-Qwen-7B` service on this GPU
- because of that, the backend now spends less of the request budget on stale history and more on active page context plus retrieval hits
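The budget trade-off above can be sketched as priority-ordered assembly with a hard character cap; the priorities and cap value here are illustrative, not the daemon's actual numbers.

```python
def assemble_context(blocks, budget_chars):
    """Hypothetical sketch of budget-first assembly: lower-priority-number
    blocks (page context, retrieval) are admitted before stale history,
    and anything that would blow the cap is skipped entirely."""
    out, used = [], 0
    for priority, text in sorted(blocks, key=lambda b: b[0]):
        if used + len(text) > budget_chars:
            continue  # skip blocks that would exceed the budget
        out.append(text)
        used += len(text)
    return "\n\n".join(out)

blocks = [
    (0, "SYSTEM PROMPT"),
    (1, "PAGE CONTEXT ..."),
    (2, "RETRIEVAL HITS ..."),
    (3, "OLD HISTORY " * 500),  # large stale history loses out under the cap
]
ctx = assemble_context(blocks, budget_chars=2000)
print(len(ctx))
```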

## Backend Routing

The API no longer assumes a single always-on model lane:

- text-only turns default to the reasoning backend
- image turns route to the vision backend, with `vision_fast` available as the smaller fallback lane
- health now exposes stack inventory plus backend runtime state (`running`, `model_id`, `base_url`)
- vision backends are launched on demand by `sol_chat_api.py` via local `llama-server` and are reaped after an idle timeout

Incomplete downloads are not treated as installed models. The shared stack registry checks minimum file sizes so partially-downloaded GGUF files do not get advertised as ready.
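The incomplete-download guard can be sketched as a size-floor check before a model is advertised; the threshold below is an assumed illustrative value, not the registry's actual minimum.

```python
import os
import tempfile
from pathlib import Path

# Assumed floor: a real quantized GGUF model should be at least this large.
MIN_GGUF_BYTES = 100 * 1024 * 1024

def model_ready(path: str, min_bytes: int = MIN_GGUF_BYTES) -> bool:
    """A model file counts as installed only if it exists and meets the
    minimum size, so truncated partial downloads are never advertised."""
    p = Path(path)
    return p.is_file() and p.stat().st_size >= min_bytes

# A tiny truncated file fails the check.
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as f:
    f.write(b"\x00" * 1024)
    partial = f.name
ready = model_ready(partial)
os.unlink(partial)
print(ready)
```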

## Grounding And Retrieval

Each user turn triggers a retrieval request against the public knowledge API:

- default source: `https://sol.system42.one/api/knowledge/query`
- default `top_k`: `3`

The backend formats retrieval into transient system context, then adds a grounding contract for the current turn. The model is told to:

- prioritize retrieved evidence for factual claims
- state uncertainty when the evidence is weak
- avoid inventing names, biographies, citations, and titles

Strict grounding mode is enabled by default. In strict mode, the backend falls back to an extractive answer when:

- retrieval quality is weak
- the query is clearly profile/explanation style and retrieval should dominate
- the model output introduces named entities not present in the user query or retrieved evidence

If retrieval fails outright, the turn continues without retrieval and the failure is logged instead of crashing the request path.

When page context is present and the question is clearly about the current page, stack, session, or debug metrics, retrieval is skipped and the answer is synthesized directly from page context instead.
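One of the strict-mode triggers above, flagging model output that introduces named entities absent from both the query and the evidence, can be sketched roughly with a capitalized-token heuristic. The real check in `sol_chat_api.py` may differ; this is illustrative only.

```python
import re

def novel_entities(answer: str, query: str, evidence: str) -> set[str]:
    """Rough sketch: capitalized multi-letter tokens in the answer that
    appear in neither the user query nor the retrieved evidence."""
    def caps(text):
        return set(re.findall(r"\b[A-Z][a-z]{2,}\b", text))
    allowed = caps(query) | caps(evidence)
    return caps(answer) - allowed

flagged = novel_entities(
    answer="Sol was designed by Dr. Hartwell in Geneva.",
    query="who designed Sol?",
    evidence="Sol is a local assistant stack.",
)
print(sorted(flagged))
```

A non-empty result would trip the extractive fallback; names already present in the query or evidence (like `Sol` here) pass through.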

Direct-query grounding details:

- `/api/chat/query` still forces a model-generated `message` on cache miss for normal queries
- for probe-style requests about diagnostics, connectivity, or minimal status confirmation, the route now permits deterministic live-site-state fallback instead
- the retrieval/debug payload for each turn can now include:
  - `source_documents`
    - current file contents or bounded excerpts for the top retrieved files
    - for HTML files, these contents are visible rendered text rather than raw markup
    - each item may include `current_file_representation` and `snippet_found_in_current_file`
  - `live_site_state`
    - traffic, sensor, recent-request, and runtime-service telemetry from `site-metrics.json` and the local dashboard stack
- this was added specifically so low-signal prompts like `hello?` and diagnostic pings stop collapsing into a single embedding chunk and instead answer from:
  - archive hit text
  - current source file state
  - current site metrics
- metrics are now additive context for ordinary text turns too; they are not stripped out just because retrieval or page context is present
- creative/story prompts are not excluded from this path; they can still pick up archive material and current file text as seed context

## Config

Environment variables:

- `SOL_CHAT_HOST`
- `SOL_CHAT_PORT`
- `SOL_CHAT_BACKEND_BASE_URL`
- `SOL_CHAT_MODEL`
- `SOL_CHAT_VISION_BACKEND_BASE_URL`
- `SOL_CHAT_VISION_FAST_BACKEND_BASE_URL`
- `SOL_CHAT_TIMEOUT`
- `SOL_CHAT_STREAM`
- `SOL_CHAT_HISTORY_DIR`
- `SOL_CHAT_SYSTEM_PROMPT_FILE`
- `SOL_CHAT_ENABLE_STRICT_GROUNDING`
- `SOL_CHAT_HISTORY_WINDOW_CHARS`
- `SOL_CHAT_MAX_HISTORY_MESSAGES`
- `SOL_CHAT_KNOWLEDGE_URL`
- `SOL_CHAT_KNOWLEDGE_TOP_K`
- `SOL_CHAT_KNOWLEDGE_TIMEOUT`
- `SOL_CHAT_TEMPERATURE`
- `SOL_CHAT_TOP_P`
- `SOL_CHAT_MAX_TOKENS`
- `SOL_CHAT_VISION_IDLE_TIMEOUT`
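Resolution of these variables typically looks like typed lookups with defaults. The defaults shown are assumptions for illustration (the port matches the proxy target and `top_k` matches the documented default, but the daemon's actual fallbacks may differ).

```python
import os

# Sketch: each SOL_CHAT_* variable resolved with a typed default.
def env_str(name, default):
    return os.environ.get(name, default)

def env_int(name, default):
    try:
        return int(os.environ.get(name, default))
    except ValueError:
        return default  # malformed values fall back rather than crash

config = {
    "host": env_str("SOL_CHAT_HOST", "127.0.0.1"),
    "port": env_int("SOL_CHAT_PORT", 8895),
    "knowledge_top_k": env_int("SOL_CHAT_KNOWLEDGE_TOP_K", 3),
    "strict_grounding": env_str("SOL_CHAT_ENABLE_STRICT_GROUNDING", "1") != "0",
}
print(config)
```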

Editable prompt file:

- `/home/david/random/prompts/sol_chat_system_prompt.txt`

Default backend assumption:

- the reasoning lane is commonly kept warm on `SOL_CHAT_BACKEND_BASE_URL`
- the web API can also start local vision backends itself when image turns arrive

Voice/TTS behavior:

- `/api/chat/speak` is served by the same backend daemon
- synthesized audio is cached on disk under `/home/david/.local/share/sol_chat_web/tts_cache`
- repeated identical speech requests reuse cached MP3 output across sessions and refreshes
- because the floating desktop assistant now reuses cached narration text for unchanged pages, repeated page read-aloud runs also tend to hit the same server-side MP3 cache entry
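The cross-session MP3 reuse described above implies a content-derived cache key. A hash-of-text scheme like the following is an assumption; the real daemon may also key on voice parameters.

```python
import hashlib
from pathlib import Path

# Sketch: identical narration text maps to the same cache filename, so
# repeat read-aloud runs hit the same MP3 entry. Demo path only.
TTS_CACHE = Path("/tmp/sol_tts_cache_demo")

def tts_cache_path(text: str) -> Path:
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return TTS_CACHE / f"{digest}.mp3"

a = tts_cache_path("Welcome to Sol.")
b = tts_cache_path("Welcome to Sol.")
print(a == b)
```

This also explains why the desktop assistant's narration-text cache compounds with the server-side cache: unchanged text yields an unchanged key.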

Desktop assistant playback behavior:

- the floating assistant now has an explicit transport button whose label tracks state:
  - `Pause` while audio is playing
  - `Resume` when paused with buffered playback available
  - `Play` when no buffered playback is active
- closing the popup performs a hard playback stop and clears pending continuation so hidden playback does not resume later
- quick prompt suggestions are now partially dynamic:
  - first suggestion remains anchored
  - second and third are regenerated from the current page or fallback prompt pool each time the popup is reopened

Asset freshness behavior:

- `www/chat/index.html` now references versioned asset URLs for `chat.css`, `chat.js`, and `hue-visualizer.js`
- Caddy also marks `/chat` and `/chat/*` as `no-store`
- this combination was added after mobile clients kept reusing stale public JS/CSS while `/chat` HTML itself had already updated

## Deployment

1. Start the backend service:

```bash
python3 /home/david/random/bin/sol_chat_api.py
```

2. Ensure Caddy is using `/home/david/random/bin/Caddyfile.pkd_share`, which now includes:

- `/chat` rewrite to `/chat/index.html`
- `Cache-Control: no-store` on `/chat` and `/chat/*`
- `/api/chat*` reverse proxy to `127.0.0.1:8895`

3. For persistent boot behavior, add a user service similar to the existing site daemons:

```ini
[Unit]
Description=Sol chat web API
After=network-online.target

[Service]
ExecStart=/usr/bin/python3 /home/david/random/bin/sol_chat_api.py
Restart=always
RestartSec=2

[Install]
WantedBy=default.target
```

Suggested unit path:

- `/home/david/.config/systemd/user/sol-chat-api.service`
- the model backend is commonly paired with `/home/david/.config/systemd/user/sol-chat-model.service`

After installing or changing the unit:

```bash
systemctl --user daemon-reload
systemctl --user enable --now sol-chat-api.service
systemctl --user status --no-pager sol-chat-api.service
systemctl --user status --no-pager sol-chat-model.service
```

## Logging And Verification

The backend logs structured JSON events to stdout/journal with:

- request start/completion
- session id
- retrieval success or failure
- model latency
- fallback-grounding usage

Debug persistence:

- per-assistant-turn retrieval metadata is now stored in session metadata and returned by `GET /api/chat/history`
- the `/chat` frontend restores those stored traces when debug mode is enabled after a reload
- this fixes the earlier behavior where debug detail existed only on the live DOM node during the original response stream

Prompt/response checks used during tuning:

```text
Prompt: "What page is open?"
Result: page-context answer from /chat UI state, retrieval skipped, stack summary + metrics included.

Prompt: "Which local models are active right now?"
Result: page-context answer naming Qwen3-VL, Gemma small, and DeepSeek, with the active reasoning backend file.

Prompt: "What do the debug metrics show?"
Result: page-context answer summarizing visible metrics, transport state, grounding state, backend profile, and readouts.
```

Direct browser-query examples:

```text
/api/chat/query?query=what%20is%20Sol%3F
```

```text
/api/chat/query?query=What%20page%20is%20open%3F&page_title=Sol%20Chat&page_target=%2Fchat&page_content_type=chat_ui&page_heading=Sol%20%2F%20Chat&page_content=status%3A%20idle...
```

GPT Action import details:

```text
Schema URL: https://sol.system42.one/chat-openapi.json
Privacy URL: https://sol.system42.one/privacy.html
Available actions: queryChat, chatHealth, queryKnowledge, knowledgeHealth
```

Cache behavior check used during tuning:

```text
1. GET /api/chat/query?query=what%20is%20Sol%20really%3F
   -> model-generated answer, cache_hit: false

2. repeat same URL
   -> same answer returned from cache, cache_hit: true
```

Regression/contract check:

```bash
python3 /home/david/random/bin/check_sol_chat_api_contract.py
python3 /home/david/random/bin/check_sol_chat_asset_versioning.py
python3 /home/david/random/bin/check_sol_chat_tts_cache.py
```

The contract check starts fake knowledge and model backends, boots `sol_chat_api.py` against them, and verifies:

- `POST /api/chat`
- streaming SSE output
- history persistence
- reset behavior
- same-origin route shape compatibility

The asset versioning check verifies that `/chat` references cache-busted JS/CSS URLs. The TTS check verifies that speech caching is still active and writable.

## Legacy Replacement

The old `www/sol-chat.html` page was retired for three reasons:

- it depended on jQuery for a trivial interaction
- it hardcoded a LAN target instead of using same-origin routing
- it framed Sol as a placeholder journaling companion rather than a production chat surface

The file now exists only as a redirect to `/chat`, so old links still land on the current interface without preserving the old copy or behavior.
