Rethinking vllm-metal's Memory Budget for Apple Silicon
Introduction
LLM inference engines like vLLM were designed for discrete NVIDIA GPUs, where GPU memory is a dedicated, isolated resource. The memory allocator can safely assume it owns nearly all of VRAM — upstream vLLM defaults to claiming 90% of total GPU memory (gpu_memory_utilization=0.9), and if another process has already taken enough that this target can’t be met, vLLM simply refuses to start.
This assumption breaks on Apple Silicon. The M-series chips use a Unified Memory Architecture (UMA): CPU, GPU, and Neural Engine all share a single physical memory pool. There is no dedicated “VRAM” to claim. A Mac running vllm-metal is likely also running a browser, an IDE, and other applications — all competing for the same memory. Reserving 90% of total memory is not realistic in this scenario.
But the problem isn’t just about lowering a threshold. We also want to support the case where a Mac is used as a dedicated inference server with no other significant memory consumers.
This discussion is tracked in vllm-metal#97.
Why vllm-metal
An NVIDIA RTX 5090 ships with 32 GB of GDDR7 at a $1,999 MSRP — roughly $62.47 per GB. A Mac Studio with the M3 Ultra and 512 GB of unified memory starts at $9,499 — roughly $18.55 per GB, more than 3× cheaper. This is not an apple-to-Apple comparison (😉): CUDA VRAM is dedicated and faster, while unified memory is shared with the rest of the system and comes with lower bandwidth — which is precisely why the memory allocator matters so much on this platform.
Related Work
To compare the three engines below, we use a common vocabulary:
- memory hint: the value reported by Metal's recommendedMaxWorkingSetSize, the OS's suggestion for how much GPU memory a process should use. This is a static, per-process hint; it does not reflect memory consumed by other apps.
- RAM cap: a software-imposed ceiling, typically a fraction of total system RAM (e.g., 2/3 or 3/4), intended to leave room for the OS and other apps.
- utilization target: the fraction of the memory budget an engine attempts to claim (e.g., 0.9 means "use up to 90%").
vLLM
vLLM follows a profile-and-claim strategy. On CUDA, it queries the GPU for total and free VRAM at startup, then claims total_vram × utilization_target (default 0.9). If the requested amount exceeds what is actually free (because another process is already using the GPU), vLLM fails immediately rather than proceeding with a smaller budget. The claimed memory is managed as a paged block pool (PagedAttention), where KV cache blocks are allocated and freed at page granularity as requests arrive and leave.
GPU memory is filled in two stages. First, model weights are loaded onto the GPU. vLLM then runs a profiling forward pass with dummy inputs to measure peak non-KV memory usage (activations, temporaries). The remaining memory becomes the KV block pool:

\[\text{kv_pool} = \text{total_vram} \times \text{utilization_target} - \text{weight_memory} - \text{profile_peak}\]
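The profile-and-claim arithmetic can be sketched as follows; the function and parameter names are illustrative, not vLLM's internal identifiers:

```python
def kv_pool_bytes(total_vram, free_vram, weight_memory, profile_peak,
                  utilization_target=0.9):
    """Sketch of vLLM's profile-and-claim budget (names illustrative).

    Claims total_vram * utilization_target and fails fast if that
    exceeds what is actually free, mirroring upstream vLLM's refusal
    to start with a smaller budget.
    """
    claimed = total_vram * utilization_target
    if claimed > free_vram:
        raise RuntimeError(
            f"need {claimed / 2**30:.1f} GiB but only "
            f"{free_vram / 2**30:.1f} GiB of VRAM is free"
        )
    # Whatever weights and profiled activations don't use becomes
    # the PagedAttention block pool.
    return claimed - weight_memory - profile_peak
```

On an idle 32 GB GPU with 14 GiB of weights and a 2 GiB profiled peak, this leaves 28.8 − 16 = 12.8 GiB for KV blocks.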
mistral.rs
mistral.rs adapts the same two-phase approach for Apple Silicon but introduces a RAM cap to leave room for the OS and other applications sharing the same memory pool:
\[\text{ram_cap} = \begin{cases} \text{system_ram} \times 2/3 & \text{if } \text{system_ram} \leq 36\text{ GB} \\\\ \text{system_ram} \times 3/4 & \text{otherwise} \end{cases}\]

In Phase 1 (model weights), mistral.rs uses the memory hint directly: it computes memory_hint - current_process_allocation, reserves max(available × 0.02, 512 MB) as headroom, and greedily places layers on the GPU until the budget is exhausted.
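The two mistral.rs rules just described can be sketched in Python; the function names are hypothetical, not mistral.rs's actual (Rust) API:

```python
GIB = 2**30

def ram_cap(system_ram: int) -> float:
    """mistral.rs's RAM cap: 2/3 of system RAM up to 36 GB, 3/4 above."""
    frac = 2 / 3 if system_ram <= 36 * GIB else 3 / 4
    return system_ram * frac

def phase1_budget(memory_hint: int, current_process_allocation: int):
    """Phase 1 weight budget: the memory hint minus what the process has
    already allocated, minus a headroom of max(2% of available, 512 MB).
    (Names are illustrative, not mistral.rs's actual identifiers.)"""
    available = memory_hint - current_process_allocation
    headroom = max(available * 0.02, 512 * 2**20)
    return available - headroom
```

Note that below a 25 GiB hint, the flat 512 MB floor dominates the 2% headroom term.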
In Phase 2 (KV cache), the effective ceiling becomes min(memory_hint, ram_cap). Since Apple typically reports a memory hint in the 66–75% range of system RAM — close to what the RAM cap already computes — the min() often selects the same value. The KV budget is then:

\[\text{kv_budget} = \min(\text{memory_hint}, \text{ram_cap}) \times \text{utilization_target} - \text{weight_memory}\]
The two safety margins stack: 25–33% is reserved by the RAM cap, and a further 10% by the utilization target (default 0.9), leaving roughly 60–68% of system RAM as the effective ceiling before model weights are subtracted.
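Plugging in an assumed 64 GB machine whose OS hints 75% of RAM shows how the margins stack:

```python
GIB = 2**30
system_ram = 64 * GIB          # assumed machine size
ram_cap = system_ram * 3 / 4   # >36 GB, so the 3/4 cap applies
memory_hint = 48 * GIB         # assume the OS hints 75% of RAM
ceiling = min(memory_hint, ram_cap) * 0.9   # utilization target
print(ceiling / system_ram)    # ~0.675, inside the 60-68% range above
```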
llama.cpp
llama.cpp uses the memory hint as its ceiling, memory_hint - current_process_allocation, with no RAM cap and no utilization target. Memory consumed by other apps is invisible; the engine has no system-wide pressure signal. On macOS 15+, a background thread requests buffer residency every 500 ms to prevent OS eviction, an acknowledgment that the OS may reclaim memory under pressure even after allocation succeeds.
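The keep-alive pattern can be sketched generically in Python (the real implementation is C++ calling Metal residency APIs; request_residency here is a stand-in callback):

```python
import threading

def start_residency_thread(request_residency, interval_s=0.5):
    """Toy version of llama.cpp's macOS 15+ residency keep-alive: a
    daemon thread re-requests buffer residency on a fixed interval so
    the OS is less likely to evict wired Metal buffers.
    request_residency stands in for the real Metal residency call."""
    stop = threading.Event()

    def loop():
        while not stop.is_set():
            request_residency()
            stop.wait(interval_s)  # 500 ms in llama.cpp

    threading.Thread(target=loop, daemon=True).start()
    return stop  # call stop.set() to shut the loop down
```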
GPU memory is filled in three stages: model weights (last N layers offloaded to GPU, back-to-front), a KV cache (preallocated at a user-specified token capacity n_ctx), and a compute scratch buffer (worst-case activation buffer reused every forward pass). Unlike the paged designs above, llama.cpp's KV cache is contiguous — sequences share a ring buffer with 1-token granularity and an explicit defrag pass. Total KV memory is committed at startup:

\[\text{kv_bytes} = \text{n_layers} \times 2 \times \text{n_ctx} \times \text{n_kv_heads} \times \text{head_dim} \times \text{bytes_per_element}\]
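For concreteness, the startup commitment for a Llama-3-8B-shaped model (32 layers, 8 KV heads of dimension 128, f16 elements — an assumed example shape, not taken from the source) works out to exactly 1 GiB at n_ctx=8192:

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_el=2):
    """Contiguous KV cache size committed at startup, llama.cpp style:
    one K and one V tensor per layer, each n_ctx x n_kv_heads x head_dim.
    bytes_per_el=2 assumes f16."""
    return n_layers * 2 * n_ctx * n_kv_heads * head_dim * bytes_per_el

# 32 * 2 * 8192 * 8 * 128 * 2 bytes = 1 GiB, all wired at startup
print(kv_cache_bytes(8192, 32, 8, 128) / 2**30)  # 1.0
```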
vLLM assumes exclusive ownership of a discrete memory pool. mistral.rs introduces Apple Silicon-specific caps for desktop coexistence. llama.cpp pre-commits a user-specified amount with no system-wide awareness. vllm-metal inherits from all three but needs to handle both the mixed-use desktop case and the dedicated server case, ideally without requiring the user to manually tune a single magic number.
Proposal
vllm-metal has two KV cache paths, selected by VLLM_METAL_USE_PAGED_ATTENTION. Both are tracked in vllm-metal#97.
Path 1: Contiguous Allocation (MLX)
Today, the scheduler reasons in 16-token blocks and reports phantom block counts, but MLX allocates contiguous caches in 256-token steps. None of those blocks exist at runtime. The fix is to strip the paged bookkeeping and use mlx_lm’s auto behavior: each request allocates only the KV cache memory it needs via make_prompt_cache() and releases it when done. No upfront budget, no utilization target.
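The per-request lifecycle can be modeled with a toy class (MLX and make_prompt_cache handle this for real; the class below only illustrates the 256-token growth pattern, and is not vllm-metal code):

```python
class ContiguousKVCache:
    """Toy model of Path 1's per-request contiguous cache: capacity
    grows in 256-token steps as tokens arrive, and the whole cache is
    released when the request finishes."""
    STEP = 256

    def __init__(self):
        self.tokens = 0
        self.capacity = 0

    def append(self, n_tokens: int):
        self.tokens += n_tokens
        while self.capacity < self.tokens:
            self.capacity += self.STEP  # MLX-style 256-token growth

cache = ContiguousKVCache()
cache.append(100)   # capacity jumps to 256
cache.append(200)   # 300 tokens held -> capacity grows to 512
```

No global budget exists in this path: memory use is simply the sum of live requests' capacities.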
Path 2: Paged Allocation (vLLM)
Path 2 is under active development on the paged-attention-v3 branch (vllm-metal#70). It maintains a global block pool backed by Metal buffers, aligned with upstream vLLM's PagedAttention. VLLM_METAL_MEMORY_FRACTION controls how much system RAM the pool claims. At startup, the engine measures weight and activation memory, then fills the remaining budget with KV blocks:

\[\text{kv_budget} = \text{system_ram} \times \text{memory_fraction} - \text{weight_memory} - \text{activation_memory}\]
On a dedicated inference server, set the fraction high. On a desktop Mac sharing memory with a browser, IDE, and other applications, the fraction that is actually free will be lower. If the requested allocation exceeds free memory, the engine refuses to start and reports the available memory and the fraction the user would need to set. An environment variable (not a CLI flag) matches vllm-metal’s existing configuration pattern (VLLM_METAL_PREFIX_CACHE, etc.).
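The refusal path can be sketched as below; the function name, message wording, and suggested-fraction logic are hypothetical, not vllm-metal's actual code:

```python
GIB = 2**30

def check_memory_fraction(total_ram, free_ram, fraction):
    """Sketch of a Path 2 startup check: refuse to start if the
    requested pool exceeds free memory, and report the largest
    fraction that would currently fit. (Illustrative only.)"""
    requested = total_ram * fraction
    if requested > free_ram:
        fits = free_ram / total_ram
        raise RuntimeError(
            f"requested {requested / GIB:.1f} GiB "
            f"(VLLM_METAL_MEMORY_FRACTION={fraction}) but only "
            f"{free_ram / GIB:.1f} GiB is free; "
            f"try a fraction <= {fits:.2f}"
        )
    return requested
```

Failing fast with an actionable suggested value keeps the single tuning knob usable for both the desktop and dedicated-server cases.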
The Wired Collector
macOS distinguishes between pageable memory (swappable to disk) and wired memory (pinned in physical RAM). Metal GPU buffers are wired. Under memory pressure, macOS invokes the wired collector, a kernel mechanism that reclaims GPU wired memory by evicting Metal buffers, causing silent performance degradation or crashes.
The mistral.rs community documented this behavior in mistral.rs#1348. The workaround: sudo sysctl iogpu.disable_wired_collector=1. The setting does not persist across reboots unless added to a startup script. For dedicated servers running Path 2 with an aggressive memory fraction, disabling the wired collector is recommended.