Local AI Works When the Stack Is Built for It
A tuned Crown Citadel local stack finished a messy invoice sandbox in 5:49, while an Ollama + OpenClaw run stalled at 27:18 and ran out of context.
Crown Citadel Performance Lab
Strix Halo local inference benchmarks: throughput, memory, serving, tuning, and quality.
Articles about running AI locally: hardware, settings, model behavior, tooling, and lessons from the lab.
How tuned llama.cpp settings changed prompt processing and generation speed on the same Strix Halo machine, with paired before/after bars and percent lifts.
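For readers who want to check a percent lift by hand, the sketch below shows one plausible way to compute it from a paired before/after throughput reading. The function name and the use of tokens per second are assumptions for illustration; the dashboard's own harness may round or aggregate differently.

```python
def percent_lift(before: float, after: float) -> float:
    """Relative change from a baseline throughput to a tuned throughput, in percent.

    Both arguments are assumed to be in the same unit (for example tokens/s)
    and measured on the same machine with the same prompt shape.
    """
    if before <= 0:
        # A zero or negative baseline makes the ratio meaningless.
        raise ValueError("baseline throughput must be positive")
    return (after - before) / before * 100.0
```

A positive result means the tuned settings were faster than the baseline, and a negative result means they were slower.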
Peak effective UMA memory on x, standalone decode on y, and selected context speed as color.
Context-side prompt processing on x and standalone decode on y, grouped by model family.
Only models with at least four measured context points and one 64k+ result are drawn, so the lines represent real scaling behavior rather than one-off tests.
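The inclusion rule above can be sketched as a small filter. This is a minimal illustration, not the dashboard's actual code: the data layout (context size in thousands of tokens mapped to a throughput value, with `None` for unmeasured cells), the function name, and the sample model names and numbers are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical layout: model name -> {context size in k tokens: throughput or None}.
Results = Dict[str, Dict[int, Optional[float]]]


def models_to_draw(results: Results, min_points: int = 4, long_context_k: int = 64) -> List[str]:
    """Return models with enough measured context points to plot a scaling line."""
    drawn = []
    for model, points in results.items():
        # Only cells that were actually measured count as context points.
        measured = [ctx for ctx, value in points.items() if value is not None]
        has_long_context = any(ctx >= long_context_k for ctx in measured)
        if len(measured) >= min_points and has_long_context:
            drawn.append(model)
    return drawn


# Placeholder data for illustration only; these are not lab measurements.
sample: Results = {
    "model-a": {4: 900.0, 8: 850.0, 16: 700.0, 64: 400.0},                # 4 points, has 64k
    "model-b": {4: 500.0, 8: 480.0, 16: 450.0, 32: 300.0, 64: None},      # 64k not measured
    "model-c": {32: 200.0, 64: 150.0, 128: 90.0},                         # only 3 points
}
selected = models_to_draw(sample)
```

With this placeholder data only `model-a` satisfies both conditions, which is the behavior the caption describes: a single long-context result or a handful of short-context points is not enough to draw a line.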
Full benchmark matrix. Context columns show prompt-processing or combined prompt+generation throughput, while standalone decode is kept separate in the TG column.
| Profile | Backend | KV / Shape | 4k | 8k | 16k | 32k | 64k | 80k | 100k | 128k | TG | Memory | Measurements |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Speculative serving sweeps: draft settings, acceptance rate, request time, decode speed, and memory deltas.
| Run | Config | Total | Prompt | Decode | Acceptance | Memory |
|---|---|---|---|---|---|---|
End-to-end local serving measurements: first-token delay, total request time, prompt size, generated output, and sampled memory.
| Run | First token | Total | Prompt | Output | Memory |
|---|---|---|---|---|---|
Short conclusions from focused real-prompt tuning runs. This is not a launch-command reference; it only shows which tested direction is worth using.
| Scenario | Use | Evidence | Caveat |
|---|---|---|---|
Completed coding and repair rows from the quality benchmark store. Suites require real task coverage before they can appear here; failed or incomplete rows stay out of the chart.
| Profile | Test | Score | Evidence | Run |
|---|---|---|---|---|
Official 20-scenario Hermes agent behavior runs. Score is verifier average across memory, workspace, skills, scheduling, delegation, recovery, and approval-boundary scenarios.
Each card explains what the race tests before you open it. These are focused head-to-head pages for speculative decoding, long-output behavior, task correctness, run health, and memory pressure.
How to read the dashboard without knowing the benchmark harness internals.