# Sol Multimodal Pipeline

This document covers the local multimodal stack assembled on the workstation: webcam capture, local GGUF models on the Vulkan llama.cpp runtime, the Sol web chat integration, and the bootstrap shell scripts used for direct local bring-up.

## Purpose

The system is designed as a routed stack rather than a single model:

- sensor input:
  - `webcam-feed --snapshot /tmp/frame.png`
- fast vision lane:
  - `gemma-3-4b-it-Q4_K_M.gguf`
- primary vision lane:
  - `Qwen3VL-8B-Thinking-Q4_K_M.gguf`
- reasoning lane:
  - `DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf`
- web surface:
  - `/home/david/random/www/chat/index.html`
- API layer:
  - `/home/david/random/bin/sol_chat_api.py`
- future memory hook:
  - `/api/knowledge/query`

Target flow:

```text
webcam -> snapshot -> vision -> reasoning -> stdout
```

Longer-term flow:

```text
webcam -> vision -> reasoning -> knowledge query -> synthesis
```

## Runtime

The local inference runtime is the Vulkan llama.cpp build under:

- `/home/david/.local/lib/llama.cpp-vulkan/llama-b8195/`

Relevant binaries on this host:

- `/home/david/.local/lib/llama.cpp-vulkan/llama-b8195/llama-cli`
- `/home/david/.local/lib/llama.cpp-vulkan/llama-b8195/llama-mtmd-cli`
- `/home/david/.local/lib/llama.cpp-vulkan/llama-b8195/llama-server`

Notes:

- this machine uses the Vulkan backend, not a CUDA build, so there is no local CUDA `./main` binary
- the old `main`-style examples are still useful conceptually, but the executable names on this host differ
- direct multimodal CLI runs should use `llama-mtmd-cli`
- text-only local inference should use `llama-cli` or `llama-server`
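The notes above imply a direct one-shot invocation shape like the following. This is a sketch that only prints the command it would run; the flag names (`-m`, `--mmproj`, `--image`, `-p`) follow upstream llama.cpp conventions and should be confirmed against this build's `--help` output before use.

```shell
# Sketch only: assemble (but do not execute) a one-shot vision invocation.
# Flag names are assumptions based on upstream llama.cpp conventions.
BIN=/home/david/.local/lib/llama.cpp-vulkan/llama-b8195/llama-mtmd-cli
MODELS=/home/david/.cache/models

build_vision_cmd() {
  # $1 = image path, $2 = prompt; prints the command instead of running it
  printf '%s -m %s --mmproj %s --image %s -p %q\n' \
    "$BIN" \
    "$MODELS/gemma-3-4b-it-Q4_K_M.gguf" \
    "$MODELS/mmproj-model-f16.gguf" \
    "$1" "$2"
}

build_vision_cmd /tmp/frame.png "Describe the scene."
```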

## GPU

The working GPU on this box is:

- `NVIDIA GeForce GTX 1660 SUPER`
- `6144 MiB VRAM`

Useful verification:

```bash
nvidia-smi
```

The Vulkan backend is visible in the runtime startup logs and is currently the active local path for Sol.

## Models

Expected model directory:

- `/home/david/.cache/models/`

Core files used by this stack:

- `/home/david/.cache/models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf`
- `/home/david/.cache/models/gemma-3-4b-it-Q4_K_M.gguf`
- `/home/david/.cache/models/Qwen3VL-8B-Thinking-Q4_K_M.gguf`
- `/home/david/.cache/models/mmproj-model-f16.gguf`
- `/home/david/.cache/models/mmproj-Qwen3VL-8B-Thinking-F16.gguf`

Important pairing rules:

- vision inference requires both a model file and its matching projector (`mmproj`) file
- Gemma fast vision pairs with `mmproj-model-f16.gguf` in the current local stack
- Qwen3-VL pairs with `mmproj-Qwen3VL-8B-Thinking-F16.gguf`

The shared registry used by Sol lives in:

- `/home/david/random/bin/local_model_stack.py`

That registry also enforces minimum-size thresholds so partially downloaded files are not advertised as ready.
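The size-threshold guard can be illustrated with a small shell sketch. The 1 GiB cutoff and the `model_ready` name are illustrative only and do not mirror the real per-model values in `local_model_stack.py`.

```shell
# Illustrative minimum-size readiness check; the 1 GiB threshold is an
# assumption, not the registry's real per-model value.
MIN_BYTES=$((1024 * 1024 * 1024))

model_ready() {
  # $1 = path to a .gguf file; succeeds only if the file exists and is at
  # least MIN_BYTES, so partial downloads are not reported as ready
  local size
  [[ -f "$1" ]] || return 1
  size=$(stat -c '%s' "$1")
  (( size >= MIN_BYTES ))
}
```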

## Webcam Input

The local capture utility is:

- `/home/david/random/bin/webcam_feed.py`
- launcher: `/home/david/bin/webcam-feed`

Supporting low-level capture helper:

- `/home/david/random/bin/webcam_feed_v4l2_gray_stream.c`

Why it exists:

- the attached webcam negotiates an awkward padded Bayer/`GRBG` format
- browser preview and generic video tools were unreliable for this camera
- `webcam-feed` directly reads the V4L2 stream, strips the padding, and emits corrected grayscale frames

Useful commands:

```bash
webcam-feed
webcam-feed --info
webcam-feed --snapshot /tmp/frame.png
```

## Bootstrap Scripts

The direct local bootstrap scripts live in:

- `/home/david/random/bin/run_vision.sh`
- `/home/david/random/bin/run_reasoning.sh`
- `/home/david/random/bin/run_pipeline.sh`
- `/home/david/random/bin/loop_vision.sh`

They are executable shell entry points for direct local testing outside the web API.

### `run_vision.sh`

Purpose:

- route an image to the fast vision lane or the full vision lane

Current behavior:

- `fast` mode uses:
  - `gemma-3-4b-it-Q4_K_M.gguf`
  - `mmproj-model-f16.gguf`
  - `llama-mtmd-cli`
- `full` mode uses:
  - `Qwen3VL-8B-Thinking-Q4_K_M.gguf`
  - `mmproj-Qwen3VL-8B-Thinking-F16.gguf`
  - `llama-mtmd-cli`

Usage:

```bash
/home/david/random/bin/run_vision.sh /tmp/frame.png fast
/home/david/random/bin/run_vision.sh /tmp/frame.png full
```

### `run_reasoning.sh`

Purpose:

- feed a text description of the visual scene into DeepSeek for interpretation

Runtime:

- `llama-cli`
- `DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf`

Usage:

```bash
/home/david/random/bin/run_reasoning.sh "A webcam frame shows a desk and monitor."
printf '%s' "visual description text" | /home/david/random/bin/run_reasoning.sh -
```
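The `-` argument convention above can be sketched as a tiny input shim. This is an illustrative guess at the script's argument handling, not a copy of `run_reasoning.sh`.

```shell
# Illustrative input handling for the usage shown above: a literal "-" reads
# the visual description from stdin, anything else is used verbatim.
read_description() {
  if [[ "$1" == "-" ]]; then
    cat              # piped description text
  else
    printf '%s\n' "$1"
  fi
}
```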

### `run_pipeline.sh`

Purpose:

- glue the vision and reasoning stages together

Usage:

```bash
/home/david/random/bin/run_pipeline.sh /tmp/frame.png fast
/home/david/random/bin/run_pipeline.sh /tmp/frame.png full
```

Output shape:

```text
=== VISION ===
...

=== REASONING ===
...
```

### `loop_vision.sh`

Purpose:

- take repeated webcam snapshots and run the pipeline in a loop

Usage:

```bash
/home/david/random/bin/loop_vision.sh fast
/home/david/random/bin/loop_vision.sh full
```

The loop currently captures `/tmp/frame.png`, clears the terminal, runs the pipeline, then sleeps for two seconds.
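The loop body described above amounts to the following sketch; the real `loop_vision.sh` may order or guard these steps differently.

```shell
# Sketch of one iteration of the loop described above; illustrative only.
loop_iteration() {
  # $1 = vision mode: fast | full
  webcam-feed --snapshot /tmp/frame.png                # capture one frame
  clear                                                # reset the terminal
  /home/david/random/bin/run_pipeline.sh /tmp/frame.png "$1"
  sleep 2                                              # two-second cadence
}

# A full loop would be: while true; do loop_iteration fast; done
```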

## Current Bring-Up Status

What is working:

- the webcam snapshot path works
- the Vulkan llama.cpp runtime is installed and detects the local GPU
- the Sol web API is already structured as a routed stack with reasoning, fast vision, and primary vision profiles
- the website at `/chat` now has image attachment UI and multimodal request plumbing
- the bootstrap shell scripts exist and validate syntactically

What is partially working:

- the reasoning lane is established and can be targeted through the local Sol stack
- direct shell orchestration for `vision -> reasoning` is in place

What is still unresolved:

- the current direct `llama-mtmd-cli` smoke test exits successfully but returns empty stdout for the Gemma vision pass
- because of that, `run_pipeline.sh` currently stalls at an empty vision result in direct CLI mode
- the production Sol web path uses `llama-server` with OpenAI-style chat requests for multimodal turns; that remains the more reliable path to keep using while the one-shot CLI invocation is narrowed down

This distinction matters:

- the website/API multimodal architecture is already in the right shape
- the direct bootstrap shell path still needs one more runtime-specific correction for `llama-mtmd-cli`
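The `llama-server` multimodal turn mentioned above takes an OpenAI-style `chat/completions` body. A hedged sketch of that payload shape follows; the port and the exact content-part field names are assumptions to verify against the local `llama-server` version.

```shell
# Build an OpenAI-style multimodal chat body with the image inlined as a data
# URL; field names follow the common OpenAI chat schema (an assumption here).
build_multimodal_payload() {
  # $1 = path to an image file
  local b64
  b64=$(base64 -w0 "$1")
  cat <<EOF
{
  "messages": [
    {"role": "user", "content": [
      {"type": "text", "text": "Describe this frame."},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,${b64}"}}
    ]}
  ]
}
EOF
}

# Example send (assumed local port):
# build_multimodal_payload /tmp/frame.png |
#   curl -s http://127.0.0.1:8080/v1/chat/completions \
#     -H 'Content-Type: application/json' -d @-
```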

## Sol Website Integration

The website and API files most relevant to this stack are:

- `/home/david/random/www/chat/index.html`
- `/home/david/random/www/chat/chat.css`
- `/home/david/random/www/chat/chat.js`
- `/home/david/random/bin/sol_chat_api.py`
- `/home/david/random/bin/local_model_stack.py`

Current behavior:

- text-only turns route to the reasoning profile
- image turns can route to `vision` or `vision_fast`
- health reporting exposes profile availability and runtime state
- vision backends are started on demand and can be reaped after idle timeouts
- the dedicated `/chat` page now sends its own page-context payload on text turns, so the model can answer questions about the visible chat UI, stack cards, and debug metrics without relying solely on archive retrieval
- a browser-openable `GET /api/chat/query?query=...` route exists for direct JSON chat queries, in the same general style as the knowledge query endpoint
- that route forces model-generated `message` output and reuses a file-backed cache for identical requests
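The direct query route can be exercised from the shell. The sketch below percent-encodes the query and prints a browser-openable URL; the port (8895) is an assumption matching the local health endpoint, so adjust if it differs.

```shell
# Build a URL for the direct chat query route; the port is an assumption.
urlencode() {
  local s="$1" out="" c i
  for (( i = 0; i < ${#s}; i++ )); do
    c="${s:i:1}"
    case "$c" in
      [a-zA-Z0-9.~_-]) out+="$c" ;;                  # unreserved: pass through
      *) printf -v c '%%%02X' "'$c"; out+="$c" ;;    # everything else: %XX
    esac
  done
  printf '%s\n' "$out"
}

echo "http://127.0.0.1:8895/api/chat/query?query=$(urlencode 'What page is open?')"

# Or, letting curl do the encoding:
# curl -sG http://127.0.0.1:8895/api/chat/query \
#   --data-urlencode 'query=What page is open?'
```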

The web chat implementation should be treated as the canonical integration surface. The shell scripts are for local bring-up, testing, and direct operator workflows.

Recent tuning outcome:

- `/chat` questions about the current page, current stack, and visible debug metrics now prefer page context over retrieval
- retrieval metadata is persisted with session history for debug-mode replay after refresh
- direct prompt checks against the live local API were used to confirm:
  - `What page is open?`
  - `Which local models are active right now?`
  - `What do the debug metrics show?`

## Verification Commands

Basic environment checks:

```bash
nvidia-smi
stat -c '%n %s' /home/david/.cache/models/*.gguf
```

Webcam:

```bash
webcam-feed --info
webcam-feed --snapshot /tmp/frame.png
```

Direct pipeline scripts:

```bash
bash -n /home/david/random/bin/run_vision.sh
bash -n /home/david/random/bin/run_reasoning.sh
bash -n /home/david/random/bin/run_pipeline.sh
bash -n /home/david/random/bin/loop_vision.sh
```

Sol web stack:

```bash
curl -s http://127.0.0.1:8895/api/chat/health
python3 /home/david/random/bin/check_sol_chat_api_contract.py
```

## Operating Guidance

Use the stack in this order:

1. verify webcam capture
2. verify model files and projector files are present
3. verify `nvidia-smi` shows the expected GPU
4. verify `/api/chat/health`
5. use the website or API as the main multimodal surface
6. use the shell scripts for local pipeline bring-up and debugging

Do not assume the shell scripts are the production path. The production path is the Sol API plus the routed web client.

## Next Steps

The next meaningful extensions are:

- fix the direct `llama-mtmd-cli` one-shot invocation so `run_vision.sh` emits actual text
- optionally change `run_vision.sh` to use a transient `llama-server` plus `/v1/chat/completions` if that proves more stable than the CLI
- add a `knowledge -> synthesis` stage after reasoning
- add automatic routing policies such as:
  - `fast` for low-complexity frames
  - `full` for complex scenes or OCR-heavy images
- replace snapshot polling with a streaming or near-real-time frame loop
- connect the perception loop to logging, screenplay, or event capture systems
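As one illustration of an automatic routing policy, compressed frame size can serve as a crude proxy for scene complexity. The threshold and heuristic below are placeholders, not a proposed final policy.

```shell
# Crude routing heuristic sketch: larger compressed frames tend to contain
# more detail, so route them to the full lane. Threshold is illustrative.
choose_lane() {
  # $1 = image path; prints "fast" or "full"
  local size
  size=$(stat -c '%s' "$1")
  if (( size > 200 * 1024 )); then
    echo full
  else
    echo fast
  fi
}
```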

The intended end state is not “one assistant model.” It is a modular local cognition stack with separate perception, reasoning, memory, and routing layers.
