SiliconBench

Speed, memory, and fidelity for LLM serving on unified-memory desktops. All requests hit an OpenAI-compatible endpoint at concurrency 1 / 8 / 16, with BF16 weights and n=100 requests per level.

SiliconBench audits the LLM serving engines that run on unified-memory desktop hardware. Nine stacks are benchmarked on Apple Silicon against the same weights and prompts, with a CUDA-native reference track on an NVIDIA DGX Spark for the three engines the two ecosystems share. A maintainer agent re-runs the benchmark, commits the raw results, and this page rebuilds from them automatically.
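As a rough illustration of the sweep described above (n requests per level at concurrency 1, 8, and 16), the sketch below caps in-flight requests with a semaphore. It is not the SiliconBench harness: `send_request`, `run_level`, and `sweep` are hypothetical names, the request function is injected so any OpenAI-compatible client could be plugged in, and computing throughput as total output tokens over wall-clock time is an assumption about how tok/s is defined.

```python
import asyncio
import time


async def run_level(send_request, concurrency, n=100):
    """Issue n requests with at most `concurrency` in flight at once.

    `send_request` is a hypothetical coroutine taking a request index and
    returning the number of output tokens; it stands in for a real call to
    an OpenAI-compatible endpoint.
    """
    sem = asyncio.Semaphore(concurrency)
    results = []  # (latency_seconds, output_tokens) for completed requests

    async def one(i):
        async with sem:
            start = time.perf_counter()
            try:
                tokens = await send_request(i)
            except Exception:
                return  # a failed request simply does not count as completed
            results.append((time.perf_counter() - start, tokens))

    t0 = time.perf_counter()
    await asyncio.gather(*(one(i) for i in range(n)))
    wall = time.perf_counter() - t0

    total_tokens = sum(tokens for _, tokens in results)
    return {
        "concurrency": concurrency,
        "completed": len(results),
        "total": n,
        # Assumed definition: aggregate output tokens per wall-clock second.
        "tok_per_s": total_tokens / wall if wall > 0 else 0.0,
    }


async def sweep(send_request, levels=(1, 8, 16), n=100):
    """Run each concurrency level in turn, one after another."""
    return [await run_level(send_request, c, n) for c in levels]
```

Running the levels sequentially keeps them from competing for the same memory pool, which matters on unified-memory machines.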

Speed alone is a misleading ranking on shared machines: on unified memory the engine draws from the same pool your browser and IDE use, so a stack can be fastest while claiming most of that memory or while returning wrong output. The tables therefore keep three lenses side by side. Speed is measured on two workloads: a short-prompt chat split and an agent split whose multi-turn prompts reach several thousand input tokens. Memory is the peak footprint during serving. Fidelity is weighted F1 on a classification task, scored against an NVIDIA reference running identical weights. Try sorting by tok/s at c=1 and then at c=16: the point of the concurrency sweep is that single-stream rankings do not survive load.
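For readers unfamiliar with the fidelity metric, here is a minimal sketch of weighted F1: per-class F1 averaged with weights proportional to each class's support. The function name `weighted_f1` is illustrative, and treating the reference labels as `y_true` is an assumption; the text does not say whether the benchmark scores against the task's gold labels or against the reference engine's outputs.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores.

    y_true: reference labels (assumed here to be the comparison baseline).
    y_pred: labels produced by the stack under test.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    total = len(y_true)
    support = Counter(y_true)  # number of true examples per class
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total  # weight each class by its support
    return score
```

With labels `["a", "a", "b", "b"]` and predictions `["a", "b", "b", "b"]`, class `a` scores F1 = 2/3 and class `b` scores F1 = 0.8, so the weighted result is 11/15.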

Single-node serving only. Stacks are listed alphabetically by default, and no default ranking is implied: the paper's central finding is that speed-only orderings mislead. Click a column header to sort; failed runs always sink to the bottom. ✕ = crashed (<5/100 requests), n/100 = partial run, – = not measured. Trend sparklines show each stack's own shape across c=1/8/16; they are normalized per row, with TTFT on a log scale, so read magnitudes from the numbers rather than the sparklines. Hover a stack name for its run provenance; the full record is under per-framework provenance at the bottom.
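The table conventions above can be sketched in a few lines. This is an illustration rather than the page's actual code: `status_marker`, `sort_rows`, and `normalize_row` are hypothetical names, a missing value (`None`) is assumed to mark a failed or unmeasured cell, and min-max scaling is an assumed reading of "per-row normalized".

```python
import math


def status_marker(completed, total=100):
    """Map a completed-request count to the legend symbols."""
    if completed is None:
        return "–"                      # not measured
    if completed < 5:
        return "✕"                      # crashed: fewer than 5 requests finished
    if completed < total:
        return f"{completed}/{total}"   # partial run
    return ""                           # full run: no marker needed


def sort_rows(rows, key, descending=True):
    """Sort by one column; rows without a value always sink to the bottom."""
    ok = [r for r in rows if r.get(key) is not None]
    failed = [r for r in rows if r.get(key) is None]
    ok.sort(key=lambda r: r[key], reverse=descending)
    return ok + failed


def normalize_row(values, log_scale=False):
    """Scale one row's values to [0, 1] so a sparkline shows shape only.

    Set log_scale=True for quantities such as TTFT that span orders of
    magnitude; values must then be strictly positive.
    """
    pts = [math.log10(v) if log_scale else float(v) for v in values]
    lo, hi = min(pts), max(pts)
    if hi == lo:
        return [0.5] * len(pts)  # flat line when nothing changes
    return [(p - lo) / (hi - lo) for p in pts]
```

Because each row is scaled against its own minimum and maximum, two stacks with very different absolute TTFT can draw the same sparkline, which is why the magnitudes have to be read from the numbers.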

machine model