Sol Chat Web

Executive Summary

This document covers the production web chat exposed at https://sol.system42.one/chat, the same-origin backend route at /api/chat, and the replacement of the legacy www/sol-chat.html page.

Recent Changes

Last 72 hours, condensed:

Architecture

Request Surface

Frontend Behavior

Both website assistant surfaces now use the same backend service, sol_chat_api.py, through the same-origin /api/chat route.

The dedicated /chat page is not just a thin wrapper over the API anymore. It now carries the same Sol presence language as the desktop shell while keeping a public-facing layout:

  • persisted sol_session in localStorage
  • same-origin history load, send, and reset flows
  • image attachment flow with local preview cards before send
  • a local page-context payload for text turns, built from the visible UI itself:
      • stack summary
      • stack cards
      • debug metrics
      • client diagnostics
      • recent transcript tail
  • Sol orb presence using www/assets/hue-visualizer.js
  • voice controls:
      • voice arm/disarm
      • speak latest assistant message
      • stop playback
      • automatic speech pre-cache for the latest visible reply
  • site-wide debug metrics block sourced from site-metrics.json
  • per-message retrieval diagnostics when debug mode is enabled

The page is intentionally same-origin and local-first. The browser never talks directly to the model backend or any LAN-only llama endpoint. Image attachments are serialized in-browser and posted only to the same-origin Sol API.
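As a rough sketch of the idea, a page-context payload assembled from visible UI state might look like this (field names and values are illustrative assumptions, not the actual wire format):

```python
import json

def build_page_context(title, stack_summary, metrics, diagnostics,
                       transcript, max_transcript=5):
    """Assemble an illustrative page-context payload from visible UI state.

    Field names here are assumptions for illustration; the real /chat page
    defines its own payload shape.
    """
    return {
        "page_title": title,
        "stack_summary": stack_summary,
        "debug_metrics": metrics,
        "client_diagnostics": diagnostics,
        # Only a bounded tail of the transcript is included, mirroring the
        # documented "recent transcript tail" behavior.
        "recent_transcript": transcript[-max_transcript:],
    }

payload = build_page_context(
    title="Sol / Chat",
    stack_summary="reasoning lane warm, vision idle",
    metrics={"requests_24h": 120},
    diagnostics={"transport": "sse"},
    transcript=["hi", "hello", "what page is open?"],
)
print(json.dumps(payload, indent=2))
```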

The floating desktop assistant in www/index.html uses the same route family:

  • POST /api/chat
  • GET /api/chat/history
  • GET /api/chat/speak

This keeps the public /chat surface and the shell assistant on the same local reasoning backend, retrieval path, speech cache, and session persistence format.
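A minimal same-origin client for this route family could be sketched as follows. The endpoint paths come from the list above; the `session`, `message`, and `text` parameter names are assumptions for illustration:

```python
# Illustrative same-origin client for the shared route family. The endpoint
# paths are from the docs; parameter names ("session", "message", "text")
# are assumptions for illustration.
import json
import urllib.parse
import urllib.request

BASE = "https://sol.system42.one"

def post_chat(session_id: str, message: str) -> urllib.request.Request:
    """Build a POST /api/chat request; the caller would urlopen() it."""
    body = json.dumps({"session": session_id, "message": message}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE}/api/chat", data=body,
        headers={"Content-Type": "application/json"},
    )

def history_url(session_id: str) -> str:
    """GET /api/chat/history for a persisted session."""
    return f"{BASE}/api/chat/history?" + urllib.parse.urlencode({"session": session_id})

def speak_url(text: str) -> str:
    """GET /api/chat/speak; identical text should hit the server-side MP3 cache."""
    return f"{BASE}/api/chat/speak?" + urllib.parse.urlencode({"text": text})
```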

The two clients no longer use the transport identically:

  • /chat remains SSE-first for visible incremental transcript updates
  • the floating desktop assistant now prefers non-stream JSON when a grounded page_context is present, because page-summary turns benefit more from lower startup overhead than from token-by-token UI updates
  • the floating desktop assistant still uses SSE for non-page/freeform turns and retries without stream if the streamed path fails or yields an empty final message
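The documented transport policy boils down to a small decision function; this sketch mirrors the rules above and is not the actual client code:

```python
def choose_transport(has_page_context: bool, is_retry: bool) -> str:
    """Pick a transport per the documented policy (sketch only):
    grounded page turns prefer non-stream JSON, freeform turns use SSE,
    and a failed or empty streamed turn retries without streaming."""
    if is_retry:
        return "json"  # retry path always falls back to non-stream JSON
    if has_page_context:
        return "json"  # page-summary turns favor lower startup overhead
    return "sse"       # freeform turns keep incremental transcript updates
```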

Context Assembly

The site chat now treats context as a ranked bundle, not a single blob:

  • system prompt:
      • /home/david/random/prompts/sol_chat_system_prompt.txt
      • tells Sol to treat page context, retrieval, and history as evidence to synthesize from rather than text to parrot
  • page context:
      • for /chat, built directly from the visible interface state
      • includes the page title, stack summary, debug metrics, diagnostics, and a bounded recent transcript
      • for the floating desktop assistant, page context is fingerprinted from target, title, content type, headings, and bounded content
      • that fingerprint now drives browser-side reuse of:
          • grounded reply text for repeated page questions
          • generated narration text for repeated read-aloud requests
  • retrieval:
      • still queried from the knowledge API for text turns
      • used as supporting evidence when the question is broader than the current page
      • the chat backend can now also synthesize one supplemental knowledge query for the same turn and call /api/knowledge/query again with that generated query
          • the supplemental lookup is additive; it does not replace the default retrieval query
          • the response surface exposes this as retrieval.supplemental
      • the backend now also loads the current source file for top retrieval hits when possible, so the model sees both:
          • the quoted embedding hit
          • the current file contents behind that hit
      • for HTML sources, the backend now renders readable visible text from the current file before matching snippets or building excerpts
          • retrieval metadata records this with current_file_representation=rendered_html_text
      • if an indexed snippet reflects an older snapshot and the current file differs, the model is told that explicitly
      • if an indexed snippet is still present in the current readable file text, snippet_found_in_current_file is set to true
  • live site state:
      • for text turns, the backend now injects a live block from site-metrics.json as additional context rather than reserving it for diagnostics only
      • this includes current traffic summary, top paths, recent requests, sensor readouts, and runtime service status when available
      • greetings and diagnostic pings use this block instead of defaulting to generic assistant chatter
      • concise diagnostic pings now prioritize this live site-state path and skip archive retrieval so they stay under the local 4096-token context ceiling
      • direct queryChat connectivity probes also use this path, so GPT Action test invocations can return a short grounded confirmation instead of a long model answer or a 502 from context overflow
  • history:
      • still persisted, but the live request budget now favors current page context and grounding evidence over older transcript bulk

Important runtime constraint:

  • the live reasoning lane currently runs with a 4096-token context window because that is the hardware-safe configuration for the local DeepSeek-R1-Distill-Qwen-7B service on this GPU
  • because of that, the backend now spends less of the request budget on stale history and more on active page context plus retrieval hits
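One way to picture that budgeting policy: spend the fixed window on the system prompt, page context, and retrieval first, then backfill with the newest history. This sketch uses character counts as a stand-in for tokens and is illustrative only, not the backend's actual algorithm:

```python
def assemble_context(system_prompt, page_context, retrieval_hits, history,
                     budget_chars=12000):
    """Budget sketch: fixed parts (system prompt, page context, retrieval)
    are always kept; history is backfilled newest-first until the budget
    runs out, so stale transcript bulk is dropped before live grounding."""
    parts = [system_prompt, page_context] + list(retrieval_hits)
    used = sum(len(p) for p in parts)
    kept = []
    for msg in reversed(history):  # walk history newest message first
        if used + len(msg) > budget_chars:
            break
        kept.append(msg)
        used += len(msg)
    return parts + list(reversed(kept))  # restore chronological order
```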

Backend Routing

The API no longer assumes a single always-on model lane:

  • text-only turns default to the reasoning backend
  • image turns route to the vision backend, with vision_fast available as the smaller fallback lane
  • health now exposes stack inventory plus backend runtime state (running, model_id, base_url)
  • vision backends are launched on demand by sol_chat_api.py via local llama-server and are reaped after an idle timeout

Incomplete downloads are not treated as installed models. The shared stack registry checks minimum file sizes so partially-downloaded GGUF files do not get advertised as ready.
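The routing and installed-model rules above can be sketched like this (the size threshold is a made-up illustration, not the registry's real value):

```python
import os

# Illustrative threshold only; the shared stack registry defines its own
# per-model minimum sizes.
MIN_GGUF_BYTES = 500_000_000

def pick_backend(has_image: bool, vision_ready: bool) -> str:
    """Route a turn by modality, per the documented policy (sketch)."""
    if not has_image:
        return "reasoning"        # text-only turns default to the reasoning lane
    return "vision" if vision_ready else "vision_fast"

def model_is_installed(path: str) -> bool:
    """Treat a partially downloaded GGUF as not installed."""
    return os.path.isfile(path) and os.path.getsize(path) >= MIN_GGUF_BYTES
```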

Grounding And Retrieval

Each user turn triggers a retrieval request against the public knowledge API:

  • default source: https://sol.system42.one/api/knowledge/query
  • default top_k: 3

The backend formats retrieval into transient system context, then adds a grounding contract for the current turn. The model is told to:

  • prioritize retrieved evidence for factual claims
  • state uncertainty when the evidence is weak
  • avoid inventing names, biographies, citations, and titles

Strict grounding mode is enabled by default. In strict mode, the backend falls back to an extractive answer when:

  • retrieval quality is weak
  • the query is clearly profile/explanation style and retrieval should dominate
  • the model output introduces named entities not present in the user query or retrieved evidence

If retrieval fails outright, the turn continues without retrieval and the failure is logged instead of crashing the request path.
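The third strict-mode condition, flagging named entities absent from both the query and the evidence, could look roughly like this naive sketch; the real heuristic is surely more nuanced:

```python
import re

def introduces_new_entities(model_output: str, user_query: str,
                            evidence: str) -> bool:
    """Naive illustration of the strict-grounding entity check: flag
    capitalized words in the output that appear in neither the user
    query nor the retrieved evidence. The backend's actual check is
    an assumption-free black box here."""
    known = set(re.findall(r"\b[A-Z][a-z]+\b", user_query + " " + evidence))
    produced = set(re.findall(r"\b[A-Z][a-z]+\b", model_output))
    return bool(produced - known)
```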

When page context is present and the question is clearly about the current page, stack, session, or debug metrics, retrieval is skipped and the answer is synthesized directly from page context instead.

Direct-query grounding details:

  • /api/chat/query still forces a model-generated message on cache miss for normal queries
  • for probe-style requests about diagnostics, connectivity, or minimal status confirmation, the route now permits deterministic live-site-state fallback instead
  • the retrieval/debug payload for each turn can now include:
  • source_documents
  • current file contents or bounded excerpts for the top retrieved files
  • for HTML files, these contents are visible rendered text rather than raw markup
  • each item may include current_file_representation and snippet_found_in_current_file
  • live_site_state
  • traffic, sensor, recent-request, and runtime-service telemetry from site-metrics.json and the local dashboard stack
  • this was added specifically so low-signal prompts like hello? and diagnostic pings stop collapsing into a single embedding chunk and instead answer from:
  • archive hit text
  • current source file state
  • current site metrics
  • metrics are now additive context for ordinary text turns too; they are not stripped out just because retrieval or page context is present
  • creative/story prompts are not excluded from this path; they can still pick up archive material and current file text as seed context
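Putting the documented keys together, a per-turn retrieval/debug payload might be shaped like this (the key names come from the list above; every value is invented for illustration):

```python
# Illustrative shape of the per-turn retrieval/debug payload. Only the key
# names are taken from the docs; all values here are made up.
debug_payload = {
    "source_documents": [
        {
            "snippet": "quoted embedding hit text",
            "current_file_excerpt": "bounded excerpt of the file as it is today",
            "current_file_representation": "rendered_html_text",
            "snippet_found_in_current_file": True,
        }
    ],
    "live_site_state": {
        "traffic": {"requests_last_hour": 42},
        "sensors": {"cpu_temp_c": 55.0},
        "recent_requests": ["/chat", "/api/chat/history"],
        "runtime_services": {"sol-chat-api": "running"},
    },
    "retrieval": {"supplemental": {"query": "generated follow-up query"}},
}
```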

Config

Environment variables:

  • SOL_CHAT_HOST
  • SOL_CHAT_PORT
  • SOL_CHAT_BACKEND_BASE_URL
  • SOL_CHAT_MODEL
  • SOL_CHAT_VISION_BACKEND_BASE_URL
  • SOL_CHAT_VISION_FAST_BACKEND_BASE_URL
  • SOL_CHAT_TIMEOUT
  • SOL_CHAT_STREAM
  • SOL_CHAT_HISTORY_DIR
  • SOL_CHAT_SYSTEM_PROMPT_FILE
  • SOL_CHAT_ENABLE_STRICT_GROUNDING
  • SOL_CHAT_HISTORY_WINDOW_CHARS
  • SOL_CHAT_MAX_HISTORY_MESSAGES
  • SOL_CHAT_KNOWLEDGE_URL
  • SOL_CHAT_KNOWLEDGE_TOP_K
  • SOL_CHAT_KNOWLEDGE_TIMEOUT
  • SOL_CHAT_TEMPERATURE
  • SOL_CHAT_TOP_P
  • SOL_CHAT_MAX_TOKENS
  • SOL_CHAT_VISION_IDLE_TIMEOUT
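A typical pattern for consuming these variables is a small helper with casts and fallbacks. The defaults below are illustrative assumptions, except the knowledge URL and top_k, which the Grounding And Retrieval section documents:

```python
import os

def env(name, default, cast=str):
    """Read one SOL_CHAT_* variable, falling back to a default when unset.
    Defaults in this sketch are assumptions except KNOWLEDGE_URL and
    KNOWLEDGE_TOP_K, whose documented defaults are used."""
    raw = os.environ.get(name)
    return cast(raw) if raw is not None else default

KNOWLEDGE_TOP_K = env("SOL_CHAT_KNOWLEDGE_TOP_K", 3, int)
KNOWLEDGE_URL = env("SOL_CHAT_KNOWLEDGE_URL",
                    "https://sol.system42.one/api/knowledge/query")
STREAM = env("SOL_CHAT_STREAM", True,
             lambda v: v.strip().lower() in ("1", "true", "yes"))
```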

Editable prompt file:

  • /home/david/random/prompts/sol_chat_system_prompt.txt

Default backend assumption:

  • the reasoning lane is commonly kept warm on SOL_CHAT_BACKEND_BASE_URL
  • the web API can also start local vision backends itself when image turns arrive

Voice/TTS behavior:

  • /api/chat/speak is served by the same backend daemon
  • synthesized audio is cached on disk under /home/david/.local/share/sol_chat_web/tts_cache
  • repeated identical speech requests reuse cached MP3 output across sessions and refreshes
  • because the floating desktop assistant now reuses cached narration text for unchanged pages, repeated page read-aloud runs also tend to hit the same server-side MP3 cache entry
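Cross-session reuse implies a deterministic cache key derived from the speech text. A plausible sketch, noting that the daemon's real key derivation may differ:

```python
import hashlib
import pathlib

# Cache directory from the docs; the key scheme below is an assumption.
CACHE_DIR = pathlib.Path.home() / ".local/share/sol_chat_web/tts_cache"

def tts_cache_path(text: str) -> pathlib.Path:
    """Map speech text to a stable on-disk MP3 path. Identical text yields
    the same path across sessions and refreshes, so repeated requests can
    reuse the cached synthesis."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return CACHE_DIR / f"{key}.mp3"
```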

Desktop assistant playback behavior:

  • the floating assistant now has an explicit transport button whose label tracks state:
      • Pause while audio is playing
      • Resume when paused with buffered playback available
      • Play when no buffered playback is active
  • closing the popup performs a hard playback stop and clears pending continuation so hidden playback does not resume later
  • quick prompt suggestions are now partially dynamic:
      • the first suggestion remains anchored
      • the second and third are regenerated from the current page or fallback prompt pool each time the popup is reopened

Asset freshness behavior:

Deployment

  1. Start the backend service:

     ```bash
     python3 /home/david/random/bin/sol_chat_api.py
     ```

  2. Ensure Caddy is using /home/david/random/bin/Caddyfile.pkd_share, which now includes:

  3. For persistent boot behavior, add a user service similar to the existing site daemons:
  1. For persistent boot behavior, add a user service similar to the existing site daemons:
```ini
[Unit]
Description=Sol chat web API
After=network-online.target

[Service]
ExecStart=/usr/bin/python3 /home/david/random/bin/sol_chat_api.py
Restart=always
RestartSec=2

[Install]
WantedBy=default.target
```

Suggested unit path:

After installing or changing the unit:

```bash
systemctl --user daemon-reload
systemctl --user enable --now sol-chat-api.service
systemctl --user status --no-pager sol-chat-api.service
systemctl --user status --no-pager sol-chat-model.service
```

Logging And Verification

The backend logs structured JSON events to stdout/journal with:

Debug persistence:

Prompt/response checks used during tuning:

```text
Prompt: "What page is open?"
Result: page-context answer from /chat UI state, retrieval skipped, stack summary + metrics included.

Prompt: "Which local models are active right now?"
Result: page-context answer naming Qwen3-VL, Gemma small, and DeepSeek, with the active reasoning backend file.

Prompt: "What do the debug metrics show?"
Result: page-context answer summarizing visible metrics, transport state, grounding state, backend profile, and readouts.
```

Direct browser-query examples:

```text
/api/chat/query?query=what%20is%20Sol%3F
/api/chat/query?query=What%20page%20is%20open%3F&page_title=Sol%20Chat&page_target=%2Fchat&page_content_type=chat_ui&page_heading=Sol%20%2F%20Chat&page_content=status%3A%20idle...
```

GPT Action import details:

```text
Schema URL: https://sol.system42.one/chat-openapi.json
Privacy URL: https://sol.system42.one/privacy.html
Available actions: queryChat, chatHealth, queryKnowledge, knowledgeHealth
```

Cache behavior check used during tuning:

```text
1. GET /api/chat/query?query=what%20is%20Sol%20really%3F
   -> model-generated answer, cache_hit: false

2. repeat same URL
   -> same answer returned from cache, cache_hit: true
```

Regression/contract check:

```bash
python3 /home/david/random/bin/check_sol_chat_api_contract.py
python3 /home/david/random/bin/check_sol_chat_asset_versioning.py
python3 /home/david/random/bin/check_sol_chat_tts_cache.py
```

The contract check starts fake knowledge and model backends, boots sol_chat_api.py against them, and verifies:

The asset versioning check verifies that /chat references cache-busted JS/CSS URLs. The TTS check verifies that speech caching is still active and writable.

Legacy Replacement

The old www/sol-chat.html page was retired for three reasons:

The file now exists only as a redirect to /chat, so old links still land on the current interface without preserving the old copy or behavior.