Sol Multimodal Pipeline
This document covers the local multimodal stack assembled on the workstation: webcam capture, local GGUF models on the Vulkan llama.cpp runtime, the Sol web chat integration, and the bootstrap shell scripts used for direct local bring-up.
Purpose
The system is designed as a routed stack rather than a single model:
- sensor input: webcam-feed --snapshot /tmp/frame.png
- fast vision lane: gemma-3-4b-it-Q4_K_M.gguf
- primary vision lane: Qwen3VL-8B-Thinking-Q4_K_M.gguf
- reasoning lane: DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf
- web surface: /home/david/random/www/chat/index.html
- API layer: /home/david/random/bin/sol_chat_api.py
- future memory hook: /api/knowledge/query
Target flow:
```text
webcam -> snapshot -> vision -> reasoning -> stdout
```
Longer-term flow:
```text
webcam -> vision -> reasoning -> knowledge query -> synthesis
```
Runtime
The local inference runtime is the Vulkan llama.cpp build under:
/home/david/.local/lib/llama.cpp-vulkan/llama-b8195/
Relevant binaries on this host:
- /home/david/.local/lib/llama.cpp-vulkan/llama-b8195/llama-cli
- /home/david/.local/lib/llama.cpp-vulkan/llama-b8195/llama-mtmd-cli
- /home/david/.local/lib/llama.cpp-vulkan/llama-b8195/llama-server
Notes:
- this machine uses the Vulkan backend, not a local CUDA ./main build
- the old main-style examples are useful conceptually, but the executable names here are different
- direct multimodal CLI runs should use llama-mtmd-cli
- text-only local inference should use llama-cli or llama-server
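As an illustrative sketch of a direct one-shot multimodal invocation: the binary and file paths below are the ones documented for this host, but the flag names (-m, --mmproj, --image, -p) are assumptions that should be verified against the installed llama-b8195 build's --help output.

```python
import subprocess  # used by the commented-out execution line below

# Path from this stack; flag names are assumptions to verify against
# the installed llama-mtmd-cli build.
BIN = "/home/david/.local/lib/llama.cpp-vulkan/llama-b8195/llama-mtmd-cli"

def build_vision_cmd(model, mmproj, image, prompt):
    """Assemble the argv for a one-shot vision pass."""
    return [
        BIN,
        "-m", model,
        "--mmproj", mmproj,
        "--image", image,
        "-p", prompt,
    ]

cmd = build_vision_cmd(
    "/home/david/.cache/models/gemma-3-4b-it-Q4_K_M.gguf",
    "/home/david/.cache/models/mmproj-model-f16.gguf",
    "/tmp/frame.png",
    "Describe this frame in two sentences.",
)
# subprocess.run(cmd, capture_output=True, text=True)  # uncomment to execute
```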
GPU
The working GPU on this box is:
NVIDIA GeForce GTX 1660 SUPER (6144 MiB VRAM)
Useful verification:
```bash
nvidia-smi
```
The Vulkan backend is visible during runtime startup logs and is currently the active local path for Sol.
Models
Expected model directory:
/home/david/.cache/models/
Core files used by this stack:
- /home/david/.cache/models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf
- /home/david/.cache/models/gemma-3-4b-it-Q4_K_M.gguf
- /home/david/.cache/models/Qwen3VL-8B-Thinking-Q4_K_M.gguf
- /home/david/.cache/models/mmproj-model-f16.gguf
- /home/david/.cache/models/mmproj-Qwen3VL-8B-Thinking-F16.gguf
Important rule:
- vision inference requires both a model file and a matching projector file
- Gemma fast vision uses mmproj-model-f16.gguf in the current local stack
- Qwen3-VL uses mmproj-Qwen3VL-8B-Thinking-F16.gguf
The shared registry used by Sol lives in:
/home/david/random/bin/local_model_stack.py
That registry also enforces minimum-size thresholds so partially-downloaded files are not advertised as ready.
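The readiness rule can be sketched as follows. This is a simplified illustration of the idea, not the actual local_model_stack.py code, and the threshold values shown are hypothetical:

```python
from pathlib import Path

# Hypothetical minimum sizes in bytes; the real thresholds live in
# local_model_stack.py and may differ.
MIN_BYTES = {
    "gemma-3-4b-it-Q4_K_M.gguf": 2_000_000_000,
    "Qwen3VL-8B-Thinking-Q4_K_M.gguf": 4_000_000_000,
}

def is_ready(path: str, min_bytes: int) -> bool:
    """A model file counts as 'ready' only if it exists and meets the
    size floor, so a partially-downloaded GGUF is never advertised."""
    p = Path(path)
    return p.is_file() and p.stat().st_size >= min_bytes
```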
Webcam Input
The local capture utility is:
- script: /home/david/random/bin/webcam_feed.py
- launcher: /home/david/bin/webcam-feed
Supporting low-level capture helper:
/home/david/random/bin/webcam_feed_v4l2_gray_stream.c
Why it exists:
- the attached webcam negotiates an awkward padded Bayer/GRBG format
- browser preview and generic video tools were unreliable for this camera
- webcam-feed directly reads the V4L2 stream, strips the padding, and emits corrected grayscale frames
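The padding-strip step can be illustrated in miniature. The real logic lives in the C helper (webcam_feed_v4l2_gray_stream.c); this is a simplified Python sketch of the same idea, where V4L2 reports a bytes-per-line stride larger than the visible width:

```python
def strip_row_padding(frame: bytes, width: int, height: int, stride: int) -> bytes:
    """Drop per-row padding from a captured frame.

    V4L2 can report a bytes-per-line 'stride' that exceeds the visible
    width; keeping only the first `width` bytes of each row yields a
    tightly-packed grayscale image. Simplified from the real C helper.
    """
    if len(frame) < height * stride:
        raise ValueError("frame buffer shorter than height * stride")
    return b"".join(
        frame[y * stride : y * stride + width] for y in range(height)
    )
```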
Useful commands:
```bash
webcam-feed
webcam-feed --info
webcam-feed --snapshot /tmp/frame.png
```
Bootstrap Scripts
The direct local bootstrap scripts live in:
- /home/david/random/bin/run_vision.sh
- /home/david/random/bin/run_reasoning.sh
- /home/david/random/bin/run_pipeline.sh
- /home/david/random/bin/loop_vision.sh
They are executable shell entry points for direct local testing outside the web API.
run_vision.sh
Purpose:
- route an image to the fast vision lane or the full vision lane
Current behavior:
- fast mode uses gemma-3-4b-it-Q4_K_M.gguf with mmproj-model-f16.gguf via llama-mtmd-cli
- full mode uses Qwen3VL-8B-Thinking-Q4_K_M.gguf with mmproj-Qwen3VL-8B-Thinking-F16.gguf via llama-mtmd-cli
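The lane selection can be expressed as a small table. This sketch mirrors the fast/full mapping described above (the dict-based structure is illustrative, not the actual run_vision.sh implementation):

```python
# Mirrors the fast/full lane selection in run_vision.sh; the model and
# projector paths are the ones documented for this stack.
MODELS = "/home/david/.cache/models"

VISION_LANES = {
    "fast": (
        f"{MODELS}/gemma-3-4b-it-Q4_K_M.gguf",
        f"{MODELS}/mmproj-model-f16.gguf",
    ),
    "full": (
        f"{MODELS}/Qwen3VL-8B-Thinking-Q4_K_M.gguf",
        f"{MODELS}/mmproj-Qwen3VL-8B-Thinking-F16.gguf",
    ),
}

def select_lane(mode: str):
    """Return (model, mmproj) for a lane, rejecting unknown modes."""
    if mode not in VISION_LANES:
        raise ValueError(f"unknown vision mode: {mode!r} (use 'fast' or 'full')")
    return VISION_LANES[mode]
```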
Usage:
```bash
/home/david/random/bin/run_vision.sh /tmp/frame.png fast
/home/david/random/bin/run_vision.sh /tmp/frame.png full
```
run_reasoning.sh
Purpose:
- feed a textual visual description into DeepSeek for interpretation
Runtime:
- binary: llama-cli
- model: DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf
Usage:
```bash
/home/david/random/bin/run_reasoning.sh "A webcam frame shows a desk and monitor."
printf '%s' "visual description text" | /home/david/random/bin/run_reasoning.sh -
```
run_pipeline.sh
Purpose:
- glue the vision and reasoning stages together
Usage:
```bash
/home/david/random/bin/run_pipeline.sh /tmp/frame.png fast
/home/david/random/bin/run_pipeline.sh /tmp/frame.png full
```
Output shape:
```text
=== VISION ===
...
=== REASONING ===
...
```
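A downstream consumer of this output can split it on the section headers. This is a hypothetical helper, not part of the current scripts:

```python
def split_pipeline_output(text: str) -> dict:
    """Split run_pipeline.sh output into its VISION and REASONING
    sections, keyed by the lowercase header name."""
    sections, current = {}, None
    for line in text.splitlines():
        if line.startswith("=== ") and line.rstrip().endswith(" ==="):
            current = line.strip("= ").strip().lower()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {k: "\n".join(v).strip() for k, v in sections.items()}
```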
loop_vision.sh
Purpose:
- take repeated webcam snapshots and run the pipeline in a loop
Usage:
```bash
/home/david/random/bin/loop_vision.sh fast
/home/david/random/bin/loop_vision.sh full
```
The loop currently captures /tmp/frame.png, clears the terminal, runs the pipeline, then sleeps for two seconds.
Current Bring-Up Status
What is working:
- the webcam snapshot path works
- the Vulkan llama.cpp runtime is installed and detects the local GPU
- the Sol web API is already structured as a routed stack with reasoning, fast vision, and primary vision profiles
- the website at /chat now has image attachment UI and multimodal request plumbing
- the bootstrap shell scripts exist and validate syntactically
What is partially working:
- the reasoning lane is established and can be targeted through the local Sol stack
- direct shell orchestration for vision -> reasoning is in place
What is still unresolved:
- the current direct llama-mtmd-cli smoke test exits successfully but returns empty stdout for the Gemma vision pass
- because of that, run_pipeline.sh currently stalls at an empty vision result in direct CLI mode
- the production Sol web path uses llama-server and OpenAI-style chat requests for multimodal turns, which is the more reliable path to keep using while the one-shot CLI invocation is narrowed down
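The server-path multimodal turn can be sketched as an OpenAI-style payload with an inline base64 image. The /v1/chat/completions route is llama-server's standard OpenAI-compatible endpoint; the PNG mime type matches the snapshot format used here, and the port is whatever the vision profile's server was launched with:

```python
import base64

def build_multimodal_request(image_path: str, prompt: str) -> dict:
    """Build an OpenAI-style chat payload with an inline base64 PNG,
    as accepted by llama-server's /v1/chat/completions endpoint."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

# POST this as JSON to http://127.0.0.1:<port>/v1/chat/completions
# (port depends on how llama-server was launched for the vision profile).
```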
This distinction matters:
- the website/API multimodal architecture is already in the right shape
- the direct bootstrap shell path still needs one more runtime-specific correction for llama-mtmd-cli
Sol Website Integration
The website and API files most relevant to this stack are:
- /home/david/random/www/chat/index.html
- /home/david/random/www/chat/chat.css
- /home/david/random/www/chat/chat.js
- /home/david/random/bin/sol_chat_api.py
- /home/david/random/bin/local_model_stack.py
Current behavior:
- text-only turns route to the reasoning profile
- image turns can route to vision or vision_fast
- health reporting exposes profile availability and runtime state
- vision backends are started on demand and can be reaped after idle timeouts
- the dedicated /chat page now sends its own page-context payload on text turns so the model can answer questions about the visible chat UI, stack cards, and debug metrics without relying on archive retrieval alone
- a browser-openable GET /api/chat/query?query=... route now exists for direct JSON chat queries in the same general style as the knowledge query endpoint
- that direct query route now forces model-generated message output and reuses a file-backed cache for identical requests
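Constructing a URL for that direct query route can be sketched like this. The host and port follow the health endpoint shown later in this document; proper encoding matters because natural-language queries contain spaces and punctuation:

```python
import urllib.parse

def chat_query_url(query: str, base: str = "http://127.0.0.1:8895") -> str:
    """Build a browser-openable URL for the direct JSON chat query
    route. Host/port matches the /api/chat/health check in this doc."""
    return f"{base}/api/chat/query?" + urllib.parse.urlencode({"query": query})
```

Opening the resulting URL in a browser (or fetching it with urllib.request) should return JSON containing the model-generated message field described above.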
The web chat implementation should be treated as the canonical integration surface. The shell scripts are for local bring-up, testing, and direct operator workflows.
Recent tuning outcome:
- /chat questions about the current page, current stack, and visible debug metrics now prefer page context over retrieval
- retrieval metadata is persisted with session history for debug-mode replay after refresh
- direct prompt checks against the live local API were used to confirm: "What page is open?", "Which local models are active right now?", and "What do the debug metrics show?"
Verification Commands
Basic environment checks:
```bash
nvidia-smi
stat -c '%n %s' /home/david/.cache/models/*.gguf
```
Webcam:
```bash
webcam-feed --info
webcam-feed --snapshot /tmp/frame.png
```
Direct pipeline scripts:
```bash
bash -n /home/david/random/bin/run_vision.sh
bash -n /home/david/random/bin/run_reasoning.sh
bash -n /home/david/random/bin/run_pipeline.sh
bash -n /home/david/random/bin/loop_vision.sh
```
Sol web stack:
```bash
curl -s http://127.0.0.1:8895/api/chat/health
python3 /home/david/random/bin/check_sol_chat_api_contract.py
```
Operating Guidance
Use the stack in this order:
- verify webcam capture
- verify model files and projector files are present
- verify nvidia-smi shows the expected GPU
- verify /api/chat/health
- use the website or API as the main multimodal surface
- use the shell scripts for local pipeline bring-up and debugging
Do not assume the shell scripts are the production path. The production path is the Sol API plus the routed web client.
Next Steps
The next meaningful extensions are:
- fix the direct llama-mtmd-cli one-shot invocation so run_vision.sh emits actual text
- optionally change run_vision.sh to use a transient llama-server plus /v1/chat/completions if that proves more stable than the CLI
- add a knowledge -> synthesis stage after reasoning
- add automatic routing policies, such as fast for low-complexity frames and full for complex scenes or OCR-heavy images
- replace snapshot polling with a streaming or near-real-time frame loop
- connect the perception loop to logging, screenplay, or event capture systems
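A routing policy of the kind listed above could be sketched as follows. Nothing like this exists yet; the entropy-based complexity score is a placeholder, and a real policy might use edge density, text-likelihood, or frame-to-frame change instead:

```python
import math

def byte_entropy(frame: bytes) -> float:
    """Crude complexity proxy: Shannon entropy of the grayscale byte
    histogram, normalized to [0, 1]."""
    if not frame:
        return 0.0
    counts = [0] * 256
    for b in frame:
        counts[b] += 1
    n = len(frame)
    h = -sum(c / n * math.log2(c / n) for c in counts if c)
    return h / 8.0

def route_frame(frame: bytes, threshold: float = 0.5) -> str:
    """Hypothetical policy: low-complexity frames go to the fast lane,
    busy or OCR-heavy frames to the full lane."""
    return "full" if byte_entropy(frame) >= threshold else "fast"
```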
The intended end state is not “one assistant model.” It is a modular local cognition stack with separate perception, reasoning, memory, and routing layers.