Sol Multimodal Pipeline

Executive Summary

This document covers the local multimodal stack assembled on the workstation: webcam capture, local GGUF models on the Vulkan llama.cpp runtime, the Sol web chat integration, and the bootstrap shell scripts used for direct local bring-up.

Purpose

The system is designed as a routed stack rather than a single model:

Target flow:

text
webcam -> snapshot -> vision -> reasoning -> stdout

Longer-term flow:

text
webcam -> vision -> reasoning -> knowledge query -> synthesis
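
Either flow amounts to separate processes chained by standard streams. A minimal sketch of that shape, with the vision and reasoning stages stubbed as placeholder functions (the real stack uses webcam-feed and the run_*.sh scripts covered later in this document):

```shell
#!/bin/sh
# Sketch of the routed flow: each stage is its own process and its output
# feeds the next stage through a pipe. Stage bodies are illustrative stubs.
vision()    { echo "caption of $1"; }          # stand-in for the vision model
reasoning() { sed 's/^/reasoning about: /'; }  # stand-in for the reasoning model

vision /tmp/frame.png | reasoning
```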

Runtime

The local inference runtime is the Vulkan llama.cpp build under:

Relevant binaries on this host:

Notes:

GPU

The working GPU on this box is:

Useful verification:

bash
nvidia-smi

The Vulkan backend is visible in the runtime startup logs and is currently the active local path for Sol.

Models

Expected model directory:

Core files used by this stack:

Important rule:

The shared registry used by Sol lives in:

That registry also enforces minimum-size thresholds so partially downloaded files are not advertised as ready.
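
The threshold check can be sketched in a few lines of shell; the helper name and threshold below are illustrative, not the registry's actual implementation:

```shell
#!/bin/sh
# Sketch: treat a model file as ready only if it meets a minimum byte size,
# so a partially downloaded GGUF is never advertised. Values are illustrative.
is_ready() {
  f=$1; min_bytes=$2
  size=$(stat -c '%s' "$f" 2>/dev/null || echo 0)
  [ "$size" -ge "$min_bytes" ]
}

# Demo against a throwaway 1-byte file, which is below any sane threshold.
tmp=$(mktemp)
printf 'x' > "$tmp"
if is_ready "$tmp" 1024; then echo "ready"; else echo "not ready"; fi
rm -f "$tmp"
```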

Webcam Input

The local capture utility is:

Supporting low-level capture helper:

Why it exists:

Useful commands:

bash
webcam-feed
webcam-feed --info
webcam-feed --snapshot /tmp/frame.png

Bootstrap Scripts

The direct local bootstrap scripts live in:

They are executable shell entry points for direct local testing outside the web API.

run_vision.sh

Purpose:

Current behavior:

Usage:

bash
/home/david/random/bin/run_vision.sh /tmp/frame.png fast
/home/david/random/bin/run_vision.sh /tmp/frame.png full

run_reasoning.sh

Purpose:

Runtime:

Usage:

bash
/home/david/random/bin/run_reasoning.sh "A webcam frame shows a desk and monitor."
printf '%s' "visual description text" | /home/david/random/bin/run_reasoning.sh -
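
The `-` form suggests the script accepts its input either as an argument or on stdin. A sketch of that convention (a reconstruction of the pattern, not the script's actual code):

```shell
#!/bin/sh
# Sketch of the arg-or-stdin convention: "-" means read the text from stdin.
read_input() {
  if [ "$1" = "-" ]; then cat; else printf '%s' "$1"; fi
}

read_input "A webcam frame shows a desk."; echo
printf '%s' "visual description text" | read_input -; echo
```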

run_pipeline.sh

Purpose:

Usage:

bash
/home/david/random/bin/run_pipeline.sh /tmp/frame.png fast
/home/david/random/bin/run_pipeline.sh /tmp/frame.png full

Output shape:

text
=== VISION ===
...

=== REASONING ===
...
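
Downstream tools can split the combined output on these section markers. A sketch with awk, using sample text in place of real model output (the marker strings come from the output shape above):

```shell
# Sketch: extract only the reasoning section from run_pipeline.sh-style output.
# Sample text stands in for real model output.
printf '=== VISION ===\na desk and monitor\n\n=== REASONING ===\nsomeone is at the desk\n' |
awk '/^=== REASONING ===$/ { on = 1; next }   # start printing after the marker
     /^=== /              { on = 0 }          # stop at any other section marker
     on'                                      # print lines while inside the section
```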

loop_vision.sh

Purpose:

Usage:

bash
/home/david/random/bin/loop_vision.sh fast
/home/david/random/bin/loop_vision.sh full

The loop currently captures /tmp/frame.png, clears the terminal, runs the pipeline, then sleeps for two seconds.
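
That description implies a loop shaped roughly as below. The capture and pipeline commands are stubbed with echo placeholders so the sketch is self-contained, and the loop is bounded here; the real loop calls webcam-feed and run_pipeline.sh, clears the terminal between iterations, and runs until interrupted:

```shell
#!/bin/sh
# Sketch of the loop_vision.sh structure (a reconstruction, not the script itself).
capture()  { echo "snapshot -> $1"; }   # stand-in for: webcam-feed --snapshot "$1"
pipeline() { echo "pipeline $1 $2"; }   # stand-in for: run_pipeline.sh "$1" "$2"

mode=${1:-fast}
i=0
while [ "$i" -lt 2 ]; do                # bounded here; the real loop is `while true`
  capture /tmp/frame.png
  pipeline /tmp/frame.png "$mode"
  sleep 0                               # the real loop sleeps for two seconds
  i=$((i + 1))
done
```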

Current Bring-Up Status

What is working:

What is partially working:

What is still unresolved:

This distinction matters:

Sol Website Integration

The website and API files most relevant to this stack are:

Current behavior:

The web chat implementation should be treated as the canonical integration surface. The shell scripts are for local bring-up, testing, and direct operator workflows.

Recent tuning outcome:

Verification Commands

Basic environment checks:

bash
nvidia-smi
stat -c '%n %s' /home/david/.cache/models/*.gguf

Webcam:

bash
webcam-feed --info
webcam-feed --snapshot /tmp/frame.png

Direct pipeline scripts:

bash
bash -n /home/david/random/bin/run_vision.sh
bash -n /home/david/random/bin/run_reasoning.sh
bash -n /home/david/random/bin/run_pipeline.sh
bash -n /home/david/random/bin/loop_vision.sh

Sol web stack:

bash
curl -s http://127.0.0.1:8895/api/chat/health
python3 /home/david/random/bin/check_sol_chat_api_contract.py

Operating Guidance

Use the stack in this order:

  1. verify webcam capture
  2. verify model files and projector files are present
  3. verify nvidia-smi shows the expected GPU
  4. verify /api/chat/health
  5. use the website or API as the main multimodal surface
  6. use the shell scripts for local pipeline bring-up and debugging
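
The ordered checks can be chained so bring-up stops at the first failure. A sketch of that shape; the individual checks are stubbed with true here, and in practice each would be one of the Verification Commands above:

```shell
#!/bin/sh
# Sketch: run bring-up checks in order and stop at the first failure.
check() {
  name=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "ok: $name"
  else
    echo "FAIL: $name"
    return 1
  fi
}

# Stubs (true) stand in for real commands such as `webcam-feed --info`.
check "webcam capture" true &&
check "model files present" true &&
check "gpu visible" true &&
echo "bring-up checks passed"
```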

Do not assume the shell scripts are the production path. The production path is the Sol API plus the routed web client.

Next Steps

The next meaningful extensions are:

The intended end state is not “one assistant model.” It is a modular local cognition stack with separate perception, reasoning, memory, and routing layers.