Sol Chat Web

Executive Summary

This document covers the production web chat exposed at https://sol.system42.one/chat, the same-origin backend route at /api/chat, and the replacement of the legacy www/sol-chat.html page.

Recent Changes

Last 72 hours, condensed:

Architecture

Request Surface

Frontend Behavior

Both website assistant surfaces now use the same backend service, sol_chat_api.py, through the same-origin /api/chat route.

The dedicated /chat page is not just a thin wrapper over the API anymore. It now carries the same Sol presence language as the desktop shell while keeping a public-facing layout:

  • persisted sol_session in localStorage
  • same-origin history load, send, and reset flows
  • image attachment flow with local preview cards before send
  • a local page-context payload for text turns, built from the visible UI itself:
      • stack summary
      • stack cards
      • debug metrics
      • client diagnostics
      • recent transcript tail
  • Sol orb presence using www/assets/hue-visualizer.js
  • voice controls:
      • voice arm/disarm
      • speak latest assistant message
      • stop playback
      • automatic speech pre-cache for the latest visible reply
  • site-wide debug metrics block sourced from site-metrics.json
  • per-message retrieval diagnostics when debug mode is enabled

The page is intentionally same-origin and local-first. The browser never talks directly to the model backend or any LAN-only llama endpoint. Image attachments are serialized in-browser and posted only to the same-origin Sol API.
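As a rough sketch of the idea, a page-context payload assembled from visible UI state might look like this (field names and values are illustrative assumptions, not the actual wire format):

```python
import json

def build_page_context(title, stack_summary, metrics, diagnostics,
                       transcript, max_transcript=5):
    """Assemble an illustrative page-context payload from visible UI state.

    Field names here are assumptions for illustration; the real /chat page
    defines its own payload shape.
    """
    return {
        "page_title": title,
        "stack_summary": stack_summary,
        "debug_metrics": metrics,
        "client_diagnostics": diagnostics,
        # Only a bounded tail of the transcript is included, mirroring the
        # documented "recent transcript tail" behavior.
        "recent_transcript": transcript[-max_transcript:],
    }

payload = build_page_context(
    title="Sol / Chat",
    stack_summary="reasoning lane warm, vision idle",
    metrics={"requests_24h": 120},
    diagnostics={"transport": "sse"},
    transcript=["hi", "hello", "what page is open?"],
)
print(json.dumps(payload, indent=2))
```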

The floating desktop assistant in www/index.html uses the same route family:

  • POST /api/chat
  • GET /api/chat/history
  • GET /api/chat/speak

This keeps the public /chat surface and the shell assistant on the same local reasoning backend, retrieval path, speech cache, and session persistence format.
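A minimal same-origin client for this route family could be sketched as follows. The endpoint paths come from the list above; the `session`, `message`, and `text` parameter names are assumptions for illustration:

```python
# Illustrative same-origin client for the shared route family. The endpoint
# paths are from the docs; parameter names ("session", "message", "text")
# are assumptions for illustration.
import json
import urllib.parse
import urllib.request

BASE = "https://sol.system42.one"

def post_chat(session_id: str, message: str) -> urllib.request.Request:
    """Build a POST /api/chat request; the caller would urlopen() it."""
    body = json.dumps({"session": session_id, "message": message}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE}/api/chat", data=body,
        headers={"Content-Type": "application/json"},
    )

def history_url(session_id: str) -> str:
    """GET /api/chat/history for a persisted session."""
    return f"{BASE}/api/chat/history?" + urllib.parse.urlencode({"session": session_id})

def speak_url(text: str) -> str:
    """GET /api/chat/speak; identical text should hit the server-side MP3 cache."""
    return f"{BASE}/api/chat/speak?" + urllib.parse.urlencode({"text": text})
```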

The two clients no longer use the transport identically:

  • /chat remains SSE-first for visible incremental transcript updates
  • the floating desktop assistant now prefers non-stream JSON when a grounded page_context is present, because page-summary turns benefit more from lower startup overhead than from token-by-token UI updates
  • the floating desktop assistant still uses SSE for non-page/freeform turns and retries without stream if the streamed path fails or yields an empty final message
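The documented transport policy boils down to a small decision function; this sketch mirrors the rules above and is not the actual client code:

```python
def choose_transport(has_page_context: bool, is_retry: bool) -> str:
    """Pick a transport per the documented policy (sketch only):
    grounded page turns prefer non-stream JSON, freeform turns use SSE,
    and a failed or empty streamed turn retries without streaming."""
    if is_retry:
        return "json"  # retry path always falls back to non-stream JSON
    if has_page_context:
        return "json"  # page-summary turns favor lower startup overhead
    return "sse"       # freeform turns keep incremental transcript updates
```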

Context Assembly

The site chat now treats context as a ranked bundle, not a single blob:

  • system prompt:
      • /home/david/random/prompts/sol_chat_system_prompt.txt
      • tells Sol to treat page context, retrieval, and history as evidence to synthesize from rather than text to parrot
  • page context:
      • for /chat, built directly from the visible interface state
      • includes the page title, stack summary, debug metrics, diagnostics, and a bounded recent transcript
      • for the floating desktop assistant, page context is fingerprinted from target, title, content type, headings, and bounded content
      • that fingerprint now drives browser-side reuse of:
          • grounded reply text for repeated page questions
          • generated narration text for repeated read-aloud requests
  • retrieval:
      • still queried from the knowledge API for text turns
      • used as supporting evidence when the question is broader than the current page
      • the chat backend can now also synthesize one supplemental knowledge query for the same turn and call /api/knowledge/query again with that generated query
          • the supplemental lookup is additive; it does not replace the default retrieval query
          • the response surface exposes this as retrieval.supplemental
      • the backend now also loads the current source file for top retrieval hits when possible, so the model sees both:
          • the quoted embedding hit
          • the current file contents behind that hit
      • for HTML sources, the backend now renders readable visible text from the current file before matching snippets or building excerpts
          • retrieval metadata records this with current_file_representation=rendered_html_text
      • if an indexed snippet reflects an older snapshot and the current file differs, the model is told that explicitly
      • if an indexed snippet is still present in the current readable file text, snippet_found_in_current_file is set to true
  • live site state:
      • for text turns, the backend now injects a live block from site-metrics.json as additional context rather than reserving it for diagnostics only
      • this includes current traffic summary, top paths, recent requests, sensor readouts, and runtime service status when available
      • greetings and diagnostic pings use this block instead of defaulting to generic assistant chatter
      • concise diagnostic pings now prioritize this live site-state path and skip archive retrieval so they stay under the local 4096-token context ceiling
      • direct queryChat connectivity probes also use this path, so GPT Action test invocations can return a short grounded confirmation instead of a long model answer or a 502 from context overflow
  • history:
      • still persisted, but the live request budget now favors current page context and grounding evidence over older transcript bulk

Important runtime constraint:

  • the live reasoning lane currently runs with a 4096-token context window because that is the hardware-safe configuration for the local DeepSeek-R1-Distill-Qwen-7B service on this GPU
  • because of that, the backend now spends less of the request budget on stale history and more on active page context plus retrieval hits
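One way to picture that budgeting policy: spend the fixed window on the system prompt, page context, and retrieval first, then backfill with the newest history. This sketch uses character counts as a stand-in for tokens and is illustrative only, not the backend's actual algorithm:

```python
def assemble_context(system_prompt, page_context, retrieval_hits, history,
                     budget_chars=12000):
    """Budget sketch: fixed parts (system prompt, page context, retrieval)
    are always kept; history is backfilled newest-first until the budget
    runs out, so stale transcript bulk is dropped before live grounding."""
    parts = [system_prompt, page_context] + list(retrieval_hits)
    used = sum(len(p) for p in parts)
    kept = []
    for msg in reversed(history):  # walk history newest message first
        if used + len(msg) > budget_chars:
            break
        kept.append(msg)
        used += len(msg)
    return parts + list(reversed(kept))  # restore chronological order
```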

Backend Routing

The API no longer assumes a single always-on model lane:

  • text-only turns default to the reasoning backend
  • image turns route to the vision backend, with vision_fast available as the smaller fallback lane
  • health now exposes stack inventory plus backend runtime state (running, model_id, base_url)
  • vision backends are launched on demand by sol_chat_api.py via local llama-server and are reaped after an idle timeout

Incomplete downloads are not treated as installed models. The shared stack registry checks minimum file sizes so partially-downloaded GGUF files do not get advertised as ready.
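The routing and installed-model rules above can be sketched like this (the size threshold is a made-up illustration, not the registry's real value):

```python
import os

# Illustrative threshold only; the shared stack registry defines its own
# per-model minimum sizes.
MIN_GGUF_BYTES = 500_000_000

def pick_backend(has_image: bool, vision_ready: bool) -> str:
    """Route a turn by modality, per the documented policy (sketch)."""
    if not has_image:
        return "reasoning"        # text-only turns default to the reasoning lane
    return "vision" if vision_ready else "vision_fast"

def model_is_installed(path: str) -> bool:
    """Treat a partially downloaded GGUF as not installed."""
    return os.path.isfile(path) and os.path.getsize(path) >= MIN_GGUF_BYTES
```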

Grounding And Retrieval

Each user turn triggers a retrieval request against the public knowledge API:

  • default source: https://sol.system42.one/api/knowledge/query
  • default top_k: 3

The backend formats retrieval into transient system context, then adds a grounding contract for the current turn. The model is told to:

  • prioritize retrieved evidence for factual claims
  • state uncertainty when the evidence is weak
  • avoid inventing names, biographies, citations, and titles

Strict grounding mode is enabled by default. In strict mode, the backend falls back to an extractive answer when:

  • retrieval quality is weak
  • the query is clearly profile/explanation style and retrieval should dominate
  • the model output introduces named entities not present in the user query or retrieved evidence

If retrieval fails outright, the turn continues without retrieval and the failure is logged instead of crashing the request path.
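The third strict-mode condition, flagging named entities absent from both the query and the evidence, could look roughly like this naive sketch; the real heuristic is surely more nuanced:

```python
import re

def introduces_new_entities(model_output: str, user_query: str,
                            evidence: str) -> bool:
    """Naive illustration of the strict-grounding entity check: flag
    capitalized words in the output that appear in neither the user
    query nor the retrieved evidence. The backend's actual check is
    an assumption-free black box here."""
    known = set(re.findall(r"\b[A-Z][a-z]+\b", user_query + " " + evidence))
    produced = set(re.findall(r"\b[A-Z][a-z]+\b", model_output))
    return bool(produced - known)
```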

When page context is present and the question is clearly about the current page, stack, session, or debug metrics, retrieval is skipped and the answer is synthesized directly from page context instead.

Direct-query grounding details:

  • /api/chat/query still forces a model-generated message on cache miss for normal queries
  • for probe-style requests about diagnostics, connectivity, or minimal status confirmation, the route now permits deterministic live-site-state fallback instead
  • the retrieval/debug payload for each turn can now include:
  • source_documents
  • current file contents or bounded excerpts for the top retrieved files
  • for HTML files, these contents are visible rendered text rather than raw markup
  • each item may include current_file_representation and snippet_found_in_current_file
  • live_site_state
  • traffic, sensor, recent-request, and runtime-service telemetry from site-metrics.json and the local dashboard stack
  • this was added specifically so low-signal prompts like hello? and diagnostic pings stop collapsing into a single embedding chunk and instead answer from:
  • archive hit text
  • current source file state
  • current site metrics
  • metrics are now additive context for ordinary text turns too; they are not stripped out just because retrieval or page context is present
  • creative/story prompts are not excluded from this path; they can still pick up archive material and current file text as seed context
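Putting the documented keys together, a per-turn retrieval/debug payload might be shaped like this (the key names come from the list above; every value is invented for illustration):

```python
# Illustrative shape of the per-turn retrieval/debug payload. Only the key
# names are taken from the docs; all values here are made up.
debug_payload = {
    "source_documents": [
        {
            "snippet": "quoted embedding hit text",
            "current_file_excerpt": "bounded excerpt of the file as it is today",
            "current_file_representation": "rendered_html_text",
            "snippet_found_in_current_file": True,
        }
    ],
    "live_site_state": {
        "traffic": {"requests_last_hour": 42},
        "sensors": {"cpu_temp_c": 55.0},
        "recent_requests": ["/chat", "/api/chat/history"],
        "runtime_services": {"sol-chat-api": "running"},
    },
    "retrieval": {"supplemental": {"query": "generated follow-up query"}},
}
```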

Config

Environment variables:

  • SOL_CHAT_HOST
  • SOL_CHAT_PORT
  • SOL_CHAT_BACKEND_BASE_URL
  • SOL_CHAT_MODEL
  • SOL_CHAT_VISION_BACKEND_BASE_URL
  • SOL_CHAT_VISION_FAST_BACKEND_BASE_URL
  • SOL_CHAT_TIMEOUT
  • SOL_CHAT_STREAM
  • SOL_CHAT_HISTORY_DIR
  • SOL_CHAT_SYSTEM_PROMPT_FILE
  • SOL_CHAT_ENABLE_STRICT_GROUNDING
  • SOL_CHAT_HISTORY_WINDOW_CHARS
  • SOL_CHAT_MAX_HISTORY_MESSAGES
  • SOL_CHAT_KNOWLEDGE_URL
  • SOL_CHAT_KNOWLEDGE_TOP_K
  • SOL_CHAT_KNOWLEDGE_TIMEOUT
  • SOL_CHAT_TEMPERATURE
  • SOL_CHAT_TOP_P
  • SOL_CHAT_MAX_TOKENS
  • SOL_CHAT_VISION_IDLE_TIMEOUT
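A typical pattern for consuming these variables is a small helper with casts and fallbacks. The defaults below are illustrative assumptions, except the knowledge URL and top_k, which the Grounding And Retrieval section documents:

```python
import os

def env(name, default, cast=str):
    """Read one SOL_CHAT_* variable, falling back to a default when unset.
    Defaults in this sketch are assumptions except KNOWLEDGE_URL and
    KNOWLEDGE_TOP_K, whose documented defaults are used."""
    raw = os.environ.get(name)
    return cast(raw) if raw is not None else default

KNOWLEDGE_TOP_K = env("SOL_CHAT_KNOWLEDGE_TOP_K", 3, int)
KNOWLEDGE_URL = env("SOL_CHAT_KNOWLEDGE_URL",
                    "https://sol.system42.one/api/knowledge/query")
STREAM = env("SOL_CHAT_STREAM", True,
             lambda v: v.strip().lower() in ("1", "true", "yes"))
```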

Editable prompt file:

  • /home/david/random/prompts/sol_chat_system_prompt.txt

Default backend assumption:

  • the reasoning lane is commonly kept warm on SOL_CHAT_BACKEND_BASE_URL
  • the web API can also start local vision backends itself when image turns arrive

Voice/TTS behavior:

  • /api/chat/speak is served by the same backend daemon
  • synthesized audio is cached on disk under /home/david/.local/share/sol_chat_web/tts_cache
  • repeated identical speech requests reuse cached MP3 output across sessions and refreshes
  • because the floating desktop assistant now reuses cached narration text for unchanged pages, repeated page read-aloud runs also tend to hit the same server-side MP3 cache entry
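Cross-session reuse implies a deterministic cache key derived from the speech text. A plausible sketch, noting that the daemon's real key derivation may differ:

```python
import hashlib
import pathlib

# Cache directory from the docs; the key scheme below is an assumption.
CACHE_DIR = pathlib.Path.home() / ".local/share/sol_chat_web/tts_cache"

def tts_cache_path(text: str) -> pathlib.Path:
    """Map speech text to a stable on-disk MP3 path. Identical text yields
    the same path across sessions and refreshes, so repeated requests can
    reuse the cached synthesis."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return CACHE_DIR / f"{key}.mp3"
```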

Desktop assistant playback behavior:

  • the floating assistant now has an explicit transport button whose label tracks state:
      • Pause while audio is playing
      • Resume when paused with buffered playback available
      • Play when no buffered playback is active
  • closing the popup performs a hard playback stop and clears pending continuation so hidden playback does not resume later
  • quick prompt suggestions are now partially dynamic:
      • the first suggestion remains anchored
      • the second and third are regenerated from the current page or fallback prompt pool each time the popup is reopened

Asset freshness behavior:

Deployment

  1. Start the backend service:

     ```bash
     python3 /home/david/random/bin/sol_chat_api.py
     ```

  2. Ensure Caddy is using /home/david/random/bin/Caddyfile.pkd_share, which now includes:

  3. For persistent boot behavior, add a user service similar to the existing site daemons:
  1. For persistent boot behavior, add a user service similar to the existing site daemons:
```ini
[Unit]
Description=Sol chat web API
After=network-online.target

[Service]
ExecStart=/usr/bin/python3 /home/david/random/bin/sol_chat_api.py
Restart=always
RestartSec=2

[Install]
WantedBy=default.target
```

Suggested unit path:

After installing or changing the unit:

```bash
systemctl --user daemon-reload
systemctl --user enable --now sol-chat-api.service
systemctl --user status --no-pager sol-chat-api.service
systemctl --user status --no-pager sol-chat-model.service
```

Logging And Verification

The backend logs structured JSON events to stdout/journal with:

Debug persistence:

Prompt/response checks used during tuning:

```text
Prompt: "What page is open?"
Result: page-context answer from /chat UI state, retrieval skipped, stack summary + metrics included.

Prompt: "Which local models are active right now?"
Result: page-context answer naming Qwen3-VL, Gemma small, and DeepSeek, with the active reasoning backend file.

Prompt: "What do the debug metrics show?"
Result: page-context answer summarizing visible metrics, transport state, grounding state, backend profile, and readouts.
```

Direct browser-query examples:

```text
/api/chat/query?query=what%20is%20Sol%3F
/api/chat/query?query=What%20page%20is%20open%3F&page_title=Sol%20Chat&page_target=%2Fchat&page_content_type=chat_ui&page_heading=Sol%20%2F%20Chat&page_content=status%3A%20idle...
```

GPT Action import details:

```text
Schema URL: https://sol.system42.one/chat-openapi.json
Privacy URL: https://sol.system42.one/privacy.html
Available actions: queryChat, chatHealth, queryKnowledge, knowledgeHealth
```

Cache behavior check used during tuning:

```text
1. GET /api/chat/query?query=what%20is%20Sol%20really%3F
   -> model-generated answer, cache_hit: false

2. repeat same URL
   -> same answer returned from cache, cache_hit: true
```

Regression/contract check:

```bash
python3 /home/david/random/bin/check_sol_chat_api_contract.py
python3 /home/david/random/bin/check_sol_chat_asset_versioning.py
python3 /home/david/random/bin/check_sol_chat_tts_cache.py
```

The contract check starts fake knowledge and model backends, boots sol_chat_api.py against them, and verifies:

The asset versioning check verifies that /chat references cache-busted JS/CSS URLs. The TTS check verifies that speech caching is still active and writable.

Legacy Replacement

The old www/sol-chat.html page was retired for three reasons:

The file now exists only as a redirect to /chat, so old links still land on the current interface without preserving the old copy or behavior.