Sol Multimodal Pipeline
This document covers the local multimodal stack assembled on the workstation: webcam capture, local GGUF models on the Vulkan llama.cpp runtime, the Sol web chat integration, and the bootstrap shell scripts used for direct local bring-up.
Purpose
The system is designed as a routed stack rather than a single model:
- sensor input: webcam-feed --snapshot /tmp/frame.png
- fast vision lane: gemma-3-4b-it-Q4_K_M.gguf
- primary vision lane: Qwen3VL-8B-Thinking-Q4_K_M.gguf
- reasoning lane: DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf
- web surface: /home/david/random/www/chat/index.html
- API layer: /home/david/random/bin/sol_chat_api.py
- future memory hook: /api/knowledge/query
Target flow:
```text
webcam -> snapshot -> vision -> reasoning -> stdout
```
Longer-term flow:
```text
webcam -> vision -> reasoning -> knowledge query -> synthesis
```
Runtime
The local inference runtime is the Vulkan llama.cpp build under:
/home/david/.local/lib/llama.cpp-vulkan/llama-b8195/
Relevant binaries on this host:
- /home/david/.local/lib/llama.cpp-vulkan/llama-b8195/llama-cli
- /home/david/.local/lib/llama.cpp-vulkan/llama-b8195/llama-mtmd-cli
- /home/david/.local/lib/llama.cpp-vulkan/llama-b8195/llama-server
Notes:
- this machine uses the Vulkan backend, not a local CUDA ./main build
- the old main-style examples are useful conceptually, but the executable names here are different
- direct multimodal CLI runs should use llama-mtmd-cli
- text-only local inference should use llama-cli or llama-server
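As an illustrative sketch of a direct one-shot multimodal invocation: the binary and file paths below are the ones documented for this host, but the flag names (-m, --mmproj, --image, -p) are assumptions that should be verified against the installed llama-b8195 build's --help output.

```python
import subprocess  # used by the commented-out execution line below

# Path from this stack; flag names are assumptions to verify against
# the installed llama-mtmd-cli build.
BIN = "/home/david/.local/lib/llama.cpp-vulkan/llama-b8195/llama-mtmd-cli"

def build_vision_cmd(model, mmproj, image, prompt):
    """Assemble the argv for a one-shot vision pass."""
    return [
        BIN,
        "-m", model,
        "--mmproj", mmproj,
        "--image", image,
        "-p", prompt,
    ]

cmd = build_vision_cmd(
    "/home/david/.cache/models/gemma-3-4b-it-Q4_K_M.gguf",
    "/home/david/.cache/models/mmproj-model-f16.gguf",
    "/tmp/frame.png",
    "Describe this frame in two sentences.",
)
# subprocess.run(cmd, capture_output=True, text=True)  # uncomment to execute
```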
GPU
The working GPU on this box is:
NVIDIA GeForce GTX 1660 SUPER (6144 MiB VRAM)
Useful verification:
```bash
nvidia-smi
```
The Vulkan backend is visible during runtime startup logs and is currently the active local path for Sol.
Models
Expected model directory:
/home/david/.cache/models/
Core files used by this stack:
- /home/david/.cache/models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf
- /home/david/.cache/models/gemma-3-4b-it-Q4_K_M.gguf
- /home/david/.cache/models/Qwen3VL-8B-Thinking-Q4_K_M.gguf
- /home/david/.cache/models/mmproj-model-f16.gguf
- /home/david/.cache/models/mmproj-Qwen3VL-8B-Thinking-F16.gguf
Important rule:
- vision inference requires both a model file and a matching projector file
- Gemma fast vision uses mmproj-model-f16.gguf in the current local stack
- Qwen3-VL uses mmproj-Qwen3VL-8B-Thinking-F16.gguf
The shared registry used by Sol lives in:
/home/david/random/bin/local_model_stack.py
That registry also enforces minimum-size thresholds so partially-downloaded files are not advertised as ready.
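The readiness rule can be sketched as follows. This is a simplified illustration of the idea, not the actual local_model_stack.py code, and the threshold values shown are hypothetical:

```python
from pathlib import Path

# Hypothetical minimum sizes in bytes; the real thresholds live in
# local_model_stack.py and may differ.
MIN_BYTES = {
    "gemma-3-4b-it-Q4_K_M.gguf": 2_000_000_000,
    "Qwen3VL-8B-Thinking-Q4_K_M.gguf": 4_000_000_000,
}

def is_ready(path: str, min_bytes: int) -> bool:
    """A model file counts as 'ready' only if it exists and meets the
    size floor, so a partially-downloaded GGUF is never advertised."""
    p = Path(path)
    return p.is_file() and p.stat().st_size >= min_bytes
```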
Webcam Input
The local capture utility is:
- script: /home/david/random/bin/webcam_feed.py
- launcher: /home/david/bin/webcam-feed
Supporting low-level capture helper:
/home/david/random/bin/webcam_feed_v4l2_gray_stream.c
Why it exists:
- the attached webcam negotiates an awkward padded Bayer/GRBG format
- browser preview and generic video tools were unreliable for this camera
- webcam-feed directly reads the V4L2 stream, strips the padding, and emits corrected grayscale frames
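The padding-strip step can be illustrated in miniature. The real logic lives in the C helper (webcam_feed_v4l2_gray_stream.c); this is a simplified Python sketch of the same idea, where V4L2 reports a bytes-per-line stride larger than the visible width:

```python
def strip_row_padding(frame: bytes, width: int, height: int, stride: int) -> bytes:
    """Drop per-row padding from a captured frame.

    V4L2 can report a bytes-per-line 'stride' that exceeds the visible
    width; keeping only the first `width` bytes of each row yields a
    tightly-packed grayscale image. Simplified from the real C helper.
    """
    if len(frame) < height * stride:
        raise ValueError("frame buffer shorter than height * stride")
    return b"".join(
        frame[y * stride : y * stride + width] for y in range(height)
    )
```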
Useful commands:
```bash
webcam-feed
webcam-feed --info
webcam-feed --snapshot /tmp/frame.png
```
Bootstrap Scripts
The direct local bootstrap scripts live in:
- /home/david/random/bin/run_vision.sh
- /home/david/random/bin/run_reasoning.sh
- /home/david/random/bin/run_pipeline.sh
- /home/david/random/bin/loop_vision.sh
They are executable shell entry points for direct local testing outside the web API.
run_vision.sh
Purpose:
- route an image to the fast vision lane or the full vision lane
Current behavior:
- fast mode uses gemma-3-4b-it-Q4_K_M.gguf with mmproj-model-f16.gguf via llama-mtmd-cli
- full mode uses Qwen3VL-8B-Thinking-Q4_K_M.gguf with mmproj-Qwen3VL-8B-Thinking-F16.gguf via llama-mtmd-cli
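The lane selection can be expressed as a small table. This sketch mirrors the fast/full mapping described above (the dict-based structure is illustrative, not the actual run_vision.sh implementation):

```python
# Mirrors the fast/full lane selection in run_vision.sh; the model and
# projector paths are the ones documented for this stack.
MODELS = "/home/david/.cache/models"

VISION_LANES = {
    "fast": (
        f"{MODELS}/gemma-3-4b-it-Q4_K_M.gguf",
        f"{MODELS}/mmproj-model-f16.gguf",
    ),
    "full": (
        f"{MODELS}/Qwen3VL-8B-Thinking-Q4_K_M.gguf",
        f"{MODELS}/mmproj-Qwen3VL-8B-Thinking-F16.gguf",
    ),
}

def select_lane(mode: str):
    """Return (model, mmproj) for a lane, rejecting unknown modes."""
    if mode not in VISION_LANES:
        raise ValueError(f"unknown vision mode: {mode!r} (use 'fast' or 'full')")
    return VISION_LANES[mode]
```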
Usage:
```bash
/home/david/random/bin/run_vision.sh /tmp/frame.png fast
/home/david/random/bin/run_vision.sh /tmp/frame.png full
```
run_reasoning.sh
Purpose:
- feed a textual visual description into DeepSeek for interpretation
Runtime:
- binary: llama-cli
- model: DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf
Usage:
```bash
/home/david/random/bin/run_reasoning.sh "A webcam frame shows a desk and monitor."
printf '%s' "visual description text" | /home/david/random/bin/run_reasoning.sh -
```
run_pipeline.sh
Purpose:
- glue the vision and reasoning stages together
Usage:
```bash
/home/david/random/bin/run_pipeline.sh /tmp/frame.png fast
/home/david/random/bin/run_pipeline.sh /tmp/frame.png full
```
Output shape:
```text
=== VISION ===
...
=== REASONING ===
...
```
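A downstream consumer of this output can split it on the section headers. This is a hypothetical helper, not part of the current scripts:

```python
def split_pipeline_output(text: str) -> dict:
    """Split run_pipeline.sh output into its VISION and REASONING
    sections, keyed by the lowercase header name."""
    sections, current = {}, None
    for line in text.splitlines():
        if line.startswith("=== ") and line.rstrip().endswith(" ==="):
            current = line.strip("= ").strip().lower()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {k: "\n".join(v).strip() for k, v in sections.items()}
```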
loop_vision.sh
Purpose:
- take repeated webcam snapshots and run the pipeline in a loop
Usage:
```bash
/home/david/random/bin/loop_vision.sh fast
/home/david/random/bin/loop_vision.sh full
```
The loop currently captures /tmp/frame.png, clears the terminal, runs the pipeline, then sleeps for two seconds.
Current Bring-Up Status
What is working:
- the webcam snapshot path works
- the Vulkan llama.cpp runtime is installed and detects the local GPU
- the Sol web API is already structured as a routed stack with reasoning, fast vision, and primary vision profiles
- the website at /chat now has image attachment UI and multimodal request plumbing
- the bootstrap shell scripts exist and validate syntactically
What is partially working:
- the reasoning lane is established and can be targeted through the local Sol stack
- direct shell orchestration for vision -> reasoning is in place
What is still unresolved:
- the current direct llama-mtmd-cli smoke test exits successfully but returns empty stdout for the Gemma vision pass
- because of that, run_pipeline.sh currently stalls at an empty vision result in direct CLI mode
- the production Sol web path uses llama-server and OpenAI-style chat requests for multimodal turns, which is the more reliable path to keep using while the one-shot CLI invocation is narrowed down
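The server-path multimodal turn can be sketched as an OpenAI-style payload with an inline base64 image. The /v1/chat/completions route is llama-server's standard OpenAI-compatible endpoint; the PNG mime type matches the snapshot format used here, and the port is whatever the vision profile's server was launched with:

```python
import base64

def build_multimodal_request(image_path: str, prompt: str) -> dict:
    """Build an OpenAI-style chat payload with an inline base64 PNG,
    as accepted by llama-server's /v1/chat/completions endpoint."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

# POST this as JSON to http://127.0.0.1:<port>/v1/chat/completions
# (port depends on how llama-server was launched for the vision profile).
```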
This distinction matters:
- the website/API multimodal architecture is already in the right shape
- the direct bootstrap shell path still needs one more runtime-specific correction for llama-mtmd-cli
Sol Website Integration
The website and API files most relevant to this stack are:
- /home/david/random/www/chat/index.html
- /home/david/random/www/chat/chat.css
- /home/david/random/www/chat/chat.js
- /home/david/random/bin/sol_chat_api.py
- /home/david/random/bin/local_model_stack.py
Current behavior:
- text-only turns route to the reasoning profile
- image turns can route to vision or vision_fast
- health reporting exposes profile availability and runtime state
- vision backends are started on demand and can be reaped after idle timeouts
- the dedicated /chat page now sends its own page-context payload on text turns so the model can answer questions about the visible chat UI, stack cards, and debug metrics without relying on archive retrieval alone
- a browser-openable GET /api/chat/query?query=... route now exists for direct JSON chat queries in the same general style as the knowledge query endpoint
- that direct query route now forces model-generated message output and reuses a file-backed cache for identical requests
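Constructing a URL for that direct query route can be sketched like this. The host and port follow the health endpoint shown later in this document; proper encoding matters because natural-language queries contain spaces and punctuation:

```python
import urllib.parse

def chat_query_url(query: str, base: str = "http://127.0.0.1:8895") -> str:
    """Build a browser-openable URL for the direct JSON chat query
    route. Host/port matches the /api/chat/health check in this doc."""
    return f"{base}/api/chat/query?" + urllib.parse.urlencode({"query": query})
```

Opening the resulting URL in a browser (or fetching it with urllib.request) should return JSON containing the model-generated message field described above.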
The web chat implementation should be treated as the canonical integration surface. The shell scripts are for local bring-up, testing, and direct operator workflows.
Recent tuning outcome:
- /chat questions about the current page, current stack, and visible debug metrics now prefer page context over retrieval
- retrieval metadata is persisted with session history for debug-mode replay after refresh
- direct prompt checks against the live local API were used to confirm: "What page is open?", "Which local models are active right now?", and "What do the debug metrics show?"
Verification Commands
Basic environment checks:
```bash
nvidia-smi
stat -c '%n %s' /home/david/.cache/models/*.gguf
```
Webcam:
```bash
webcam-feed --info
webcam-feed --snapshot /tmp/frame.png
```
Direct pipeline scripts:
```bash
bash -n /home/david/random/bin/run_vision.sh
bash -n /home/david/random/bin/run_reasoning.sh
bash -n /home/david/random/bin/run_pipeline.sh
bash -n /home/david/random/bin/loop_vision.sh
```
Sol web stack:
```bash
curl -s http://127.0.0.1:8895/api/chat/health
python3 /home/david/random/bin/check_sol_chat_api_contract.py
```
Operating Guidance
Use the stack in this order:
- verify webcam capture
- verify model files and projector files are present
- verify nvidia-smi shows the expected GPU
- verify /api/chat/health
- use the website or API as the main multimodal surface
- use the shell scripts for local pipeline bring-up and debugging
Do not assume the shell scripts are the production path. The production path is the Sol API plus the routed web client.
Next Steps
The next meaningful extensions are:
- fix the direct llama-mtmd-cli one-shot invocation so run_vision.sh emits actual text
- optionally change run_vision.sh to use a transient llama-server plus /v1/chat/completions if that proves more stable than the CLI
- add a knowledge -> synthesis stage after reasoning
- add automatic routing policies, such as fast for low-complexity frames and full for complex scenes or OCR-heavy images
- replace snapshot polling with a streaming or near-real-time frame loop
- connect the perception loop to logging, screenplay, or event capture systems
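A routing policy of the kind listed above could be sketched as follows. Nothing like this exists yet; the entropy-based complexity score is a placeholder, and a real policy might use edge density, text-likelihood, or frame-to-frame change instead:

```python
import math

def byte_entropy(frame: bytes) -> float:
    """Crude complexity proxy: Shannon entropy of the grayscale byte
    histogram, normalized to [0, 1]."""
    if not frame:
        return 0.0
    counts = [0] * 256
    for b in frame:
        counts[b] += 1
    n = len(frame)
    h = -sum(c / n * math.log2(c / n) for c in counts if c)
    return h / 8.0

def route_frame(frame: bytes, threshold: float = 0.5) -> str:
    """Hypothetical policy: low-complexity frames go to the fast lane,
    busy or OCR-heavy frames to the full lane."""
    return "full" if byte_entropy(frame) >= threshold else "fast"
```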
The intended end state is not “one assistant model.” It is a modular local cognition stack with separate perception, reasoning, memory, and routing layers.