<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://ranranhaoranzhang.com/feed.xml" rel="self" type="application/atom+xml"/><link href="https://ranranhaoranzhang.com/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-04-10T06:58:11+00:00</updated><id>https://ranranhaoranzhang.com/feed.xml</id><title type="html">blank</title><subtitle>PhD Candidate at Penn State University. Research in NLP, data-centric AI, and LLMs. </subtitle><entry><title type="html">I turned my Timeular Tracker into a macOS macropad</title><link href="https://ranranhaoranzhang.com/blog/2026/timeular-macropad/" rel="alternate" type="text/html" title="I turned my Timeular Tracker into a macOS macropad"/><published>2026-04-10T00:00:00+00:00</published><updated>2026-04-10T00:00:00+00:00</updated><id>https://ranranhaoranzhang.com/blog/2026/timeular-macropad</id><content type="html" xml:base="https://ranranhaoranzhang.com/blog/2026/timeular-macropad/"><![CDATA[<p>Show and tell time.</p> <div style="display: flex; justify-content: center; gap: 1rem; flex-wrap: wrap;"> <img src="/assets/img/timeular-tracker.png" alt="Timeular Tracker on desk" style="max-width: 250px; height: auto; border-radius: 6px;"/> <img src="/assets/img/timeular-menubar.png" alt="Menu bar UI" style="max-width: 250px; height: auto; border-radius: 6px;"/> </div> <h2 id="what-it-does">What it does</h2> <p>I turned my Timeular Tracker (the 8-sided time-tracking dice) into a physical macropad for macOS. Flip the dice, switch macOS desktops. Side 1 is my main workspace, side 2 is a side project, side 3 is another. No keyboard shortcut to remember, just a satisfying physical flip. 
Every side can also be bound to launching an app, running a shell command, or opening a URL, all configured from a tiny menu bar popover.</p> <h2 id="how-it-got-built">How it got built</h2> <p>I had the Timeular charged but unused on my desk (I stopped paying the monthly subscription). I also had Claude Opus 4.6, and even though I don’t know Swift, Opus does.</p> <ol> <li><strong>Scan.</strong> Asked Claude to find the Timeular over Bluetooth. It picked Python to explore the device.</li> <li><strong>Prototype.</strong> Read the orientation signal on flip and wired it to shell commands. A working macropad in one script.</li> <li><strong>Go native.</strong> Rewrote the whole thing in Swift/SwiftUI as a proper macOS menu bar app with CoreBluetooth.</li> <li><strong>Ship it.</strong> Works.</li> </ol> <p>Source on <a href="https://github.com/WindChimeRan/ohmytimular">GitHub</a>.</p>]]></content><author><name></name></author><category term="side-projects"/><category term="hardware"/><category term="macOS"/><category term="Swift"/><category term="Claude"/><category term="hack"/><summary type="html"><![CDATA[Flipping an 8-sided dice to switch macOS desktops, built end to end with Claude Opus.]]></summary></entry><entry><title type="html">Rethinking vllm-metal’s Memory Budget for Apple Silicon</title><link href="https://ranranhaoranzhang.com/blog/2026/llm-inference-memory-allocation-apple-silicon/" rel="alternate" type="text/html" title="Rethinking vllm-metal’s Memory Budget for Apple Silicon"/><published>2026-02-14T10:00:00+00:00</published><updated>2026-02-14T10:00:00+00:00</updated><id>https://ranranhaoranzhang.com/blog/2026/llm-inference-memory-allocation-apple-silicon</id><content type="html" xml:base="https://ranranhaoranzhang.com/blog/2026/llm-inference-memory-allocation-apple-silicon/"><![CDATA[<h2 id="introduction">Introduction</h2> <p>LLM inference engines like <a href="https://github.com/vllm-project/vllm">vLLM</a> were designed for discrete NVIDIA GPUs, where 
GPU memory is a dedicated, isolated resource. The memory allocator can safely assume it owns nearly all of VRAM — upstream vLLM defaults to claiming 90% of total GPU memory (<code class="language-plaintext highlighter-rouge">gpu_memory_utilization=0.9</code>), and if another process has already taken enough that this target can’t be met, vLLM simply refuses to start.</p> <p>This assumption breaks on Apple Silicon. The M-series chips use a Unified Memory Architecture (UMA): CPU, GPU, and Neural Engine all share a single physical memory pool. There is no dedicated “VRAM” to claim. A Mac running <a href="https://github.com/vllm-project/vllm-metal">vllm-metal</a> is likely also running a browser, an IDE, and other applications — all competing for the same memory. Reserving 90% of total memory is not realistic in this scenario.</p> <p>But the problem isn’t just about lowering a threshold. We also want to support the case where a Mac <em>is</em> used as a dedicated inference server with no other significant memory consumers.</p> <p>The discussion is tracked in <a href="https://github.com/vllm-project/vllm-metal/issues/97">vllm-metal#97</a>.</p> <hr/> <h2 id="why-vllm-metal">Why vllm-metal</h2> <p>An NVIDIA RTX 5090 ships with 32 GB of GDDR7 at a $1,999 MSRP — roughly <strong>$62.47 per GB</strong>. A Mac Studio with the M3 Ultra and 512 GB of unified memory starts at $9,499 — roughly <strong>$18.55 per GB</strong>, more than 3× cheaper. 
This is not an apple-to-Apple comparison (😉): CUDA VRAM is dedicated and faster, while unified memory is shared with the rest of the system and comes with lower bandwidth — which is precisely why the memory <em>allocator</em> matters so much on this platform.</p> <hr/> <h2 id="related-work">Related Work</h2> <p>To compare the three engines below, we use a common vocabulary:</p> <ul> <li><strong>memory hint</strong>: the value reported by Metal’s <code class="language-plaintext highlighter-rouge">recommendedMaxWorkingSetSize</code> — the OS’s suggestion for how much GPU memory a process should use. This is a static, per-process hint; it does not reflect memory consumed by other apps.</li> <li><strong>RAM cap</strong>: a software-imposed ceiling, typically a fraction of total system RAM (e.g., 2/3 or 3/4), intended to leave room for the OS and other apps.</li> <li><strong>utilization target</strong>: the fraction of the memory budget an engine attempts to claim (e.g., 0.9 means “use up to 90%”).</li> </ul> <h3 id="vllm">vLLM</h3> <p><strong><a href="https://github.com/vllm-project/vllm">vLLM</a></strong> follows a profile-and-claim strategy. On CUDA, it queries the GPU for total and free VRAM at startup, then claims <code class="language-plaintext highlighter-rouge">total_vram × utilization_target</code> (default 0.9). If the requested amount exceeds what is actually free (because another process is already using the GPU), vLLM fails immediately rather than proceeding with a smaller budget. The claimed memory is managed as a paged block pool (PagedAttention), where KV cache blocks are allocated and freed at page granularity as requests arrive and leave.</p> <p>GPU memory is filled in two stages. First, <strong>model weights</strong> are loaded onto the GPU. vLLM then runs a profiling forward pass with dummy inputs to measure peak non-KV memory usage (activations, temporaries). 
The remaining memory — <code class="language-plaintext highlighter-rouge">total_vram × utilization_target - weight_memory - profile_peak</code> — becomes the KV block pool:</p> \[\text{kv_budget} = \text{total_vram} \times \text{utilization} - \text{weight_memory} - \text{profile_peak}\] <h3 id="mistralrs">mistral.rs</h3> <p><strong><a href="https://github.com/EricLBuehler/mistral.rs">mistral.rs</a></strong> adapts the same two-phase approach for Apple Silicon but introduces a RAM cap to leave room for the OS and other applications sharing the same memory pool:</p> \[\text{ram_cap} = \begin{cases} \text{system_ram} \times 2/3 &amp; \text{if } \text{system_ram} \leq 36\text{ GB} \\\\ \text{system_ram} \times 3/4 &amp; \text{otherwise} \end{cases}\] <p>In <strong>Phase 1 (model weights)</strong>, mistral.rs uses the memory hint directly, computing the available memory as <code class="language-plaintext highlighter-rouge">memory_hint - current_process_allocation</code> and reserving <code class="language-plaintext highlighter-rouge">max(available × 0.02, 512 MB)</code> as headroom, then greedily places layers on GPU until the budget is exhausted.</p> <p>In <strong>Phase 2 (KV cache)</strong>, the effective ceiling becomes <code class="language-plaintext highlighter-rouge">min(memory_hint, ram_cap)</code>. Since Apple typically reports a memory hint in the 66–75% range of system RAM — close to what the RAM cap already computes — the two arguments of <code class="language-plaintext highlighter-rouge">min()</code> are often nearly equal. 
The KV budget is then:</p> \[\text{kv_budget} = \min(\text{memory_hint},\; \text{ram_cap}) \times \text{utilization} - \text{used}\] <p>The two safety margins stack: the RAM cap reserves 25–33% of system RAM, and the utilization target (default 0.9) holds back a further 10% of the capped amount. That leaves roughly 60–68% of system RAM (2/3 × 0.9 = 0.60, 3/4 × 0.9 ≈ 0.68) as the effective ceiling before model weights are subtracted.</p> <h3 id="llamacpp">llama.cpp</h3> <p><strong><a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a></strong> uses the memory hint as its ceiling, <code class="language-plaintext highlighter-rouge">memory_hint - current_process_allocation</code>, with no RAM cap and no utilization target. Memory consumed by other apps is invisible; the engine has no system-wide pressure signal. On macOS 15+, a background thread requests buffer residency every 500 ms to prevent OS eviction, an acknowledgment that the OS may reclaim memory under pressure even after allocation succeeds.</p> <p>GPU memory is filled in three stages: <strong>model weights</strong> (last N layers offloaded to GPU, back-to-front), a <strong>KV cache</strong> (preallocated at a user-specified token capacity <code class="language-plaintext highlighter-rouge">n_ctx</code>), and a <strong>compute scratch buffer</strong> (worst-case activation buffer reused every forward pass). Unlike the paged designs above, llama.cpp’s KV cache is contiguous — sequences share a ring buffer with 1-token granularity and an explicit defrag pass. Total KV memory, with the factor of 2 covering keys and values, is committed at startup:</p> \[\text{KV}_{\text{total}} = n_{\text{kv_layers}} \times 2 \times n_{\text{kv_heads}} \times d_{\text{head}} \times n_{\text{ctx}} \times \text{dtype_size}\] <p>vLLM assumes exclusive ownership of a discrete memory pool. mistral.rs introduces Apple Silicon-specific caps for desktop coexistence. llama.cpp pre-commits a user-specified amount with no system-wide awareness. 
vllm-metal inherits from all three but needs to handle both the mixed-use desktop case and the dedicated server case, ideally without requiring the user to manually tune a single magic number.</p> <hr/> <h2 id="proposal">Proposal</h2> <p>vllm-metal has two KV cache paths, selected by <code class="language-plaintext highlighter-rouge">VLLM_METAL_USE_PAGED_ATTENTION</code>. Both are tracked in <a href="https://github.com/vllm-project/vllm-metal/issues/97">vllm-metal#97</a>.</p> <h3 id="path-1-contiguous-allocation-mlx">Path 1: Contiguous Allocation (MLX)</h3> <p>Today, the scheduler reasons in 16-token blocks and reports phantom block counts, but MLX allocates contiguous caches in 256-token steps. None of those blocks exist at runtime. The fix is to strip the paged bookkeeping and use mlx_lm’s <code class="language-plaintext highlighter-rouge">auto</code> behavior: each request allocates only the KV cache memory it needs via <code class="language-plaintext highlighter-rouge">make_prompt_cache()</code> and releases it when done. No upfront budget, no utilization target.</p> <h3 id="path-2-paged-allocation-vllm">Path 2: Paged Allocation (vLLM)</h3> <p>Path 2 is under active development on the <code class="language-plaintext highlighter-rouge">paged-attention-v3</code> branch (<a href="https://github.com/vllm-project/vllm-metal/issues/70">vllm-metal#70</a>). It maintains a global block pool backed by Metal buffers, aligned with upstream vLLM’s PagedAttention. <code class="language-plaintext highlighter-rouge">VLLM_METAL_MEMORY_FRACTION</code> controls how much system RAM the pool claims. At startup, the engine measures weight and activation memory, then fills the remaining budget with KV blocks:</p> \[\text{kv_budget} = \text{system_ram} \times \text{VLLM_METAL_MEMORY_FRACTION} - \text{weight_memory} - \text{profile_peak}\] <p>On a dedicated inference server, set the fraction high. 
On a desktop Mac sharing memory with a browser, IDE, and other applications, the fraction that is actually free will be lower. If the requested allocation exceeds free memory, the engine refuses to start and reports the available memory and the fraction the user would need to set. An environment variable (not a CLI flag) matches vllm-metal’s existing configuration pattern (<code class="language-plaintext highlighter-rouge">VLLM_METAL_PREFIX_CACHE</code>, etc.).</p> <h4 id="the-wired-collector">The Wired Collector</h4> <p>macOS distinguishes between pageable memory (swappable to disk) and wired memory (pinned in physical RAM). Metal GPU buffers are wired. Under memory pressure, macOS invokes the <strong>wired collector</strong>, a kernel mechanism that reclaims GPU wired memory by evicting Metal buffers, causing silent performance degradation or crashes.</p> <p>The mistral.rs community documented this behavior in <a href="https://github.com/EricLBuehler/mistral.rs/issues/1348">mistral.rs#1348</a>. The workaround: <code class="language-plaintext highlighter-rouge">sudo sysctl iogpu.disable_wired_collector=1</code>. The setting does not persist across reboots unless added to a startup script. For dedicated servers running Path 2 with an aggressive memory fraction, disabling the wired collector is recommended.</p>]]></content><author><name></name></author><category term="research"/><category term="LLM"/><category term="inference"/><category term="Apple Silicon"/><category term="systems"/><category term="proposal"/><summary type="html"><![CDATA[Exploring memory allocation strategies for LLM inference engines on Apple Silicon's unified memory architecture.]]></summary></entry></feed>