<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://ranranhaoranzhang.com/feed.xml" rel="self" type="application/atom+xml"/><link href="https://ranranhaoranzhang.com/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-06-13T16:08:40+00:00</updated><id>https://ranranhaoranzhang.com/feed.xml</id><title type="html">blank</title><subtitle>PhD Candidate at Penn State University. Research in NLP, data-centric AI, and LLMs. </subtitle><entry><title type="html">One Launch for Any Batch: The Binary Search Inside vllm-metal’s Varlen Attention</title><link href="https://ranranhaoranzhang.com/blog/2026/varlen-kernel-binary-search/" rel="alternate" type="text/html" title="One Launch for Any Batch: The Binary Search Inside vllm-metal’s Varlen Attention"/><published>2026-06-12T00:00:00+00:00</published><updated>2026-06-12T00:00:00+00:00</updated><id>https://ranranhaoranzhang.com/blog/2026/varlen-kernel-binary-search</id><content type="html" xml:base="https://ranranhaoranzhang.com/blog/2026/varlen-kernel-binary-search/"><![CDATA[<p>The <a href="/blog/2026/vllm-metal-vs-mlx-lm-kv-cache/">previous post</a> ended at the kernel boundary: vLLM’s scheduler emits a varlen schedule, and the schedule pays off only if the attention kernel can consume it. When one GPU launch covers a batch that mixes prefill chunks, decode steps, and draft verification, how does each parallel worker discover which request it is working for?</p> <h2 id="the-ladder-from-chip-to-lane">The ladder from chip to lane</h2> <p>This machine is an M1 Pro with 14 GPU cores. A Metal compute dispatch is a <strong>grid</strong> of <strong>threadgroups</strong>; a threadgroup runs on one GPU core, which keeps several resident and switches between them to hide memory latency. 
Within a threadgroup, threads execute in <strong>simdgroups</strong> that step through instructions in lockstep. Apple’s own documentation maps the vocabulary to other dialects: a simdgroup is what CUDA calls a warp, a lane is a thread within it.</p> <style>.hier{--hr-ink:var(--global-text-color,#1a1a2e);--hr-muted:var(--global-text-color-light,#5a5a72);--hr-line:var(--global-divider-color,#e4e4ec);background:var(--global-card-bg-color,#fff);border:1.5px solid var(--hr-line);border-radius:12px;padding:14px 16px;margin:1.4em 0;color:var(--hr-ink);font-size:.85rem}.hier .hbox{border:1.5px solid;border-radius:8px;padding:8px 12px;margin:8px 0 2px}.hier .hrow{display:flex;align-items:baseline;gap:8px;flex-wrap:wrap}.hier .hl{font-weight:700;font-size:.82rem}.hier .hm{color:var(--hr-muted);font-size:.78rem}.hier .hbadge{margin-left:auto;font:600 .72rem ui-monospace,SFMono-Regular,Menlo,monospace;color:var(--hr-muted);border:1px solid var(--hr-line);border-radius:10px;padding:1px 8px;white-space:nowrap}.hier .chip{border-color:#47a}.hier .core{border-color:#098}.hier .tg{border-color:#e73;margin:20px 18px 2px 0;background:var(--global-card-bg-color,#fff);box-shadow:6px -6px 0 0 var(--global-card-bg-color,#fff),6px -6px 0 1.5px #e73,12px -12px 0 0 var(--global-card-bg-color,#fff),12px -12px 0 1.5px #e73}.hier .sg{border-color:#c31}.hier .lanes{display:flex;gap:2px;margin-top:6px;flex-wrap:wrap}.hier .lane{width:9px;height:14px;border-radius:2px;background:#c31;opacity:.55}</style> <div class="hier"> <div class="hbox chip"> <div class="hrow"><span class="hl">M1 Pro</span><span class="hm">shared unified memory</span><span class="hbadge">14 GPU cores</span></div> <div class="hbox core"> <div class="hrow"><span class="hl">GPU core</span><span class="hm">runs whole threadgroups; threadgroup memory lives in the core</span><span class="hbadge">many threadgroups, switched to hide latency</span></div> <div class="hbox tg"> <div class="hrow"><span class="hl">threadgroup</span><span 
class="hm">up to 1,024 threads, 32 KB threadgroup memory</span><span class="hbadge">256 threads in the decode kernel</span></div> <div class="hbox sg"> <div class="hrow"><span class="hl">simdgroup</span><span class="hm">32 lanes in lockstep</span><span class="hbadge">8 per threadgroup here</span></div> <div class="lanes"><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div><div class="lane"></div></div> </div> </div> </div> </div> </div> <style>.bench-table{overflow-x:auto;margin:0 0 1.4em}.bench-table table{width:100%;border-collapse:collapse;font-size:.92rem;line-height:1.4;font-variant-numeric:tabular-nums;border-top:2px solid var(--global-text-color,#1a1a2e);border-bottom:2px solid var(--global-text-color,#1a1a2e)}.bench-table th{font-weight:600;padding:.45em .75em;white-space:nowrap;border-bottom:1.5px solid var(--global-text-color,#1a1a2e)}.bench-table td{padding:.4em .75em}.bench-table th:first-child,.bench-table td:first-child{padding-left:.25em}.bench-table th:last-child,.bench-table td:last-child{padding-right:.25em}.bench-table td:first-child{font-weight:600}</style> <div class="bench-table"> <table> <thead> <tr> <th>Metal</th> <th>CUDA</th> </tr> </thead> <tbody> <tr> <td>grid</td> <td>grid</td> </tr> <tr> <td>threadgroup</td> <td>thread block</td> </tr> <tr> 
<td>simdgroup</td> <td>warp</td> </tr> <tr> <td>thread / lane</td> <td>thread / lane</td> </tr> </tbody> </table> </div> <p>The numbers in the figure are read from this machine (<code class="language-plaintext highlighter-rouge">threadExecutionWidth</code>, <code class="language-plaintext highlighter-rouge">maxTotalThreadsPerThreadgroup</code>, <code class="language-plaintext highlighter-rouge">maxThreadgroupMemoryLength</code>) and match the constants hard-coded in the kernels (<code class="language-plaintext highlighter-rouge">NUM_SIMD_LANES=32</code>, the 32 KB shared-memory budget documented in <code class="language-plaintext highlighter-rouge">paged_ops.cpp</code>).</p> <h2 id="one-threadgroup-per-unit-of-query-work">One threadgroup per unit of query work</h2> <p>vllm-metal ships two paged attention kernels in <code class="language-plaintext highlighter-rouge">kernels_v2/</code>, and the host picks between them at every step:</p> <p><strong>Pure-decode batches</strong> run <code class="language-plaintext highlighter-rouge">paged_attention</code> in <code class="language-plaintext highlighter-rouge">pagedattention.metal</code>. The grid is <code class="language-plaintext highlighter-rouge">(num_heads, total_q_tokens)</code>: one threadgroup per query token per head, which for a real batch is thousands of threadgroups that the 14 cores work through a few at a time. Each threadgroup’s 256 threads form 8 simdgroups that stride across the sequence’s KV blocks; each simdgroup maintains its own online-softmax running state, and the partial results are merged across simdgroups at the end.</p> <p><strong>Batches with more query tokens than sequences</strong>, which means any batch containing prefill, run <code class="language-plaintext highlighter-rouge">paged_attention_tiled</code> in <code class="language-plaintext highlighter-rouge">pagedattention_tiled.metal</code> when the dtype and head size qualify; the per-token kernel is the fallback for everything else. 
The tiled kernel is FlashAttention-2 style: one threadgroup computes a 32-token block of query rows for one head (<code class="language-plaintext highlighter-rouge">BQ=32</code>), with four simdgroups each owning 8 rows, and Q·Kᵀ and P·V done as 8×8 <code class="language-plaintext highlighter-rouge">simdgroup_multiply_accumulate</code> tiles.</p> <p>Both grids are flat over the packed token axis from the previous post. The padded alternative would be a grid over <code class="language-plaintext highlighter-rouge">(batch, max_seq_len)</code> with most threadgroups assigned to padding. Here every threadgroup corresponds to real work, with one exception: the tiled grid is sized <code class="language-plaintext highlighter-rouge">total_q_tokens/BQ + num_seqs</code>, a deliberate over-estimate of how many 32-token blocks the ragged sequences actually fill. Surplus threadgroups discover they are surplus and exit. That choice is inherited.</p> <h2 id="which-sequence-owns-token-4097">Which sequence owns token 4,097?</h2> <p>A threadgroup wakes up knowing only its grid coordinates: head 3, query token 4,097. The KV cache is paged, so before it can do anything it needs <code class="language-plaintext highlighter-rouge">seq_idx</code>: which request’s block table to walk, which context length bounds the causal mask, where its sequence starts in the packed query tensor. The batch is ragged, so token 4,097 could belong to any request.</p> <p>The mapping from token to request lives in one array: <code class="language-plaintext highlighter-rouge">cu_seqlens_q</code>, the exclusive prefix sum of query lengths, <code class="language-plaintext highlighter-rouge">[0, len_0, len_0+len_1, ...]</code>, length <code class="language-plaintext highlighter-rouge">num_seqs + 1</code>. 
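</p> <p>A minimal host-side sketch of that array, assuming illustrative query lengths. <code class="language-plaintext highlighter-rouge">build_cu_seqlens_q</code> is a hypothetical helper written for this example, not a vllm-metal function:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from itertools import accumulate


def build_cu_seqlens_q(query_lens):
    """Prefix sums of per-request query lengths, with a leading zero."""
    return [0, *accumulate(query_lens)]


# Illustrative step (assumed lengths): decode tokens mixed with prefill chunks.
query_lens = [1, 1, 5, 1, 12]
cu_seqlens_q = build_cu_seqlens_q(query_lens)
print(cu_seqlens_q)  # [0, 1, 2, 7, 8, 20]

# Request i owns the half-open token range [cu[i], cu[i + 1]).
for i in range(len(query_lens)):
    print(i, range(cu_seqlens_q[i], cu_seqlens_q[i + 1]))
</code></pre></div></div> <p>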
Finding the owner of token <code class="language-plaintext highlighter-rouge">t</code> means finding the largest <code class="language-plaintext highlighter-rouge">i</code> with <code class="language-plaintext highlighter-rouge">cu_seqlens_q[i] &lt;= t</code>. The array is sorted. That is a binary search:</p> <div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// pagedattention.metal</span>
<span class="kr">inline</span> <span class="kt">int</span> <span class="nf">find_seq_idx</span><span class="p">(</span><span class="k">const</span> <span class="n">device</span> <span class="kt">int32_t</span> <span class="o">*</span><span class="n">cu_seqlens_q</span><span class="p">,</span>
                        <span class="kt">int</span> <span class="n">q_token_idx</span><span class="p">,</span> <span class="kt">int</span> <span class="n">num_seqs</span><span class="p">)</span> <span class="p">{</span>
  <span class="kt">int</span> <span class="n">lo</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">hi</span> <span class="o">=</span> <span class="n">num_seqs</span><span class="p">;</span>
  <span class="k">while</span> <span class="p">(</span><span class="n">lo</span> <span class="o">&lt;</span> <span class="n">hi</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">mid</span> <span class="o">=</span> <span class="p">(</span><span class="n">lo</span> <span class="o">+</span> <span class="n">hi</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">/</span> <span class="mi">2</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">cu_seqlens_q</span><span class="p">[</span><span class="n">mid</span><span class="p">]</span> <span class="o">&lt;=</span> <span class="n">q_token_idx</span><span class="p">)</span> <span class="p">{</span>
      <span class="n">lo</span> <span class="o">=</span> <span class="n">mid</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
      <span class="n">hi</span> <span class="o">=</span> <span class="n">mid</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>
  <span class="p">}</span>
  <span class="k">return</span> <span class="n">lo</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div> <p>Step through it below. Click a token; the steps are the ones the kernel executes.</p> <style>.bsw{--bsw-ink:var(--global-text-color,#1a1a2e);--bsw-muted:var(--global-text-color-light,#5a5a72);--bsw-line:var(--global-divider-color,#e4e4ec);--bsw-out:#eef0f5;--bsw-hit:#098;--bsw-miss:#c31;background:var(--global-card-bg-color,#fff);border:1.5px solid var(--bsw-line);border-radius:12px;padding:16px 18px;margin:1.5em 0;color:var(--bsw-ink)}html[data-theme="dark"] .bsw{--bsw-out:#2b2b33;--bsw-miss:#e85c41}.bsw label{font-size:.88rem}.bsw input[type=range]{vertical-align:middle;width:120px}.bsw .bsw-h{font-weight:700;margin:1.05em 0 .3em}.bsw .hint{font-size:.85rem;color:var(--bsw-muted);margin:.15em 0 .5em}.bsw .strip{display:flex;gap:2px;flex-wrap:wrap;margin:.4em 0}.bsw .tok{width:15px;height:26px;border-radius:3px;cursor:pointer;opacity:.85}.bsw .tok.sel{outline:2.5px solid var(--bsw-ink);outline-offset:-2px;opacity:1}.bsw .cuarr{font:.82rem ui-monospace,SFMono-Regular,Menlo,monospace;background:var(--bsw-out);border-radius:6px;padding:6px 10px;margin:.4em 0;overflow-x:auto;white-space:nowrap}.bsw .steps{font:.82rem ui-monospace,SFMono-Regular,Menlo,monospace;background:var(--bsw-out);border-radius:6px;padding:6px 10px;margin:.4em 0;overflow-x:auto}.bsw .steps .yes{color:var(--bsw-hit);font-weight:700}.bsw .steps .no{color:var(--bsw-miss);font-weight:700}.bsw .res{font-size:.92rem;margin:.5em 0 0}</style> <div class="bsw"> <label>requests in the batch: <input type="range" id="bswn" min="2" max="16" value="6"/> <b id="bswnv">6</b></label> <div class="hint">query lengths mix decode (1 token) and prefill chunks, like a real continuous-batching step</div> <div class="bsw-h">packed query tokens, colored by request</div> <div class="strip" id="bswstrip"></div> <div class="cuarr" id="bswcu"></div> <div class="bsw-h">binary search for the selected token</div> <div class="steps" id="bswsteps"></div> <p class="res" id="bswres"></p> <div 
class="hint">every thread of the threadgroup runs these same steps in lockstep on the same inputs</div> </div> <script>
(function(){
const LENS=[1,1,5,1,12,2,1,7,1,3,4,1,9,1,2,6];
const PAL=['#4477AA','#EE7733','#009988','#CC3311','#AA3377','#66CCEE','#228833','#CCBB44',
           '#4477AA','#EE7733','#009988','#CC3311','#AA3377','#66CCEE','#228833','#CCBB44'];
const $=id=>document.getElementById(id);
let nseq=6, sel=null;
function cu(){const a=[0];for(let i=0;i<nseq;i++)a.push(a[i]+LENS[i]);return a;}
function search(c,t){
  const steps=[];let lo=0,hi=nseq;
  while(lo<hi){
    const mid=Math.floor((lo+hi+1)/2), ok=c[mid]<=t;
    steps.push({lo,hi,mid,v:c[mid],ok});
    if(ok)lo=mid;else hi=mid-1;
  }
  return {steps,seq:lo};
}
function render(){
  $('bswnv').textContent=nseq;
  const c=cu(), total=c[nseq];
  if(sel===null||sel>=total) sel=Math.min(Math.floor(total*0.7),total-1);
  const strip=$('bswstrip');strip.innerHTML='';
  for(let s=0;s<nseq;s++)for(let k=0;k<LENS[s];k++){
    const t=c[s]+k,d=document.createElement('div');
    d.className='tok'+(t===sel?' sel':'');d.style.background=PAL[s];d.title='token '+t;
    d.addEventListener('click',()=>{sel=t;render();});
    strip.appendChild(d);
  }
  $('bswcu').textContent='cu_seqlens_q = ['+c.join(', ')+']   (num_seqs = '+nseq+')';
  const {steps,seq}=search(c,sel);
  $('bswsteps').innerHTML=steps.map((s,i)=>
    'step '+(i+1)+':  lo='+s.lo+' hi='+s.hi+' → mid='+s.mid+',  cu['+s.mid+']='+s.v+' ≤ '+sel+' ?  '+
    (s.ok?'<span class="yes">yes → lo=mid</span>':'<span class="no">no → hi=mid−1</span>')
  ).join('<br>');
  $('bswres').innerHTML='token <b>'+sel+'</b> belongs to <b style="color:'+PAL[seq]+'">request '+seq+
    '</b> after <b>'+steps.length+'</b> steps; the bound is ⌈log₂(num_seqs+1)⌉ = '+Math.ceil(Math.log2(nseq+1));
}
$('bswn').addEventListener('input',e=>{nseq=+e.target.value;sel=null;render();});
render();
})();
</script> <p>The search runs before any attention math, in every threadgroup, and nothing about it is parallel. All threads of the threadgroup, 256 in the decode kernel and 128 in the tiled one, execute it redundantly on identical inputs and take identical branches. Redundant execution is the right call: the alternative is one lane searching while the rest wait at a barrier, then a broadcast through threadgroup memory. Uniform redundant execution costs the same wall time as one lane and skips the synchronization.</p> <p>The cost is at most $\lceil \log_2(N{+}1) \rceil$ iterations for $N$ requests, each a single dependent int32 load. At 16 concurrent requests that is at most 5 loads from a 68-byte array (17 int32 entries) that every threadgroup touches, against a KV loop that streams megabytes. For a pure-decode batch the search computes the identity function: <code class="language-plaintext highlighter-rouge">cu_seqlens_q</code> is <code class="language-plaintext highlighter-rouge">[0, 1, 2, ...]</code>, so token <code class="language-plaintext highlighter-rouge">t</code> belongs to request <code class="language-plaintext highlighter-rouge">t</code>. The kernel pays those loads to learn that.</p> <h2 id="borrowed-from-triton">Borrowed from Triton</h2> <p>The kernel’s own comment names its ancestor: the approach is “the same approach used by the upstream vLLM unified Triton kernel,” <code class="language-plaintext highlighter-rouge">triton_unified_attention.py:find_seq_idx</code>. vllm-metal adapted the upstream kernel’s tests first, then translated the kernel to Metal against them. Upstream’s <code class="language-plaintext highlighter-rouge">find_seq_idx</code> is also a binary search over the same prefix array, executed once per Triton program as scalar control flow (it has since moved to <code class="language-plaintext highlighter-rouge">triton_attention_helpers.py</code>).</p> <p>The tiled kernel inherits a subtler trick. 
Its grid is in units of 32-token Q-blocks, but blocks are aligned per sequence, so block boundaries depend on where sequences start. Rather than materialize a second prefix array in block units, the tiled kernel and its Triton ancestor search the token-unit array through a transform: compare against <code class="language-plaintext highlighter-rouge">cu_seqlens_q[mid] / BQ + mid</code> instead of <code class="language-plaintext highlighter-rouge">cu_seqlens_q[mid]</code>. Each sequence contributes <code class="language-plaintext highlighter-rouge">len/BQ</code> full blocks plus one potential ragged tail, and the <code class="language-plaintext highlighter-rouge">+ mid</code> term counts the tails. One array serves both coordinate systems.</p> <p>The over-provisioned grid has the same upstream origin, and upstream documents the reason: computing the exact number of Q-blocks would require reading the query lengths on the CPU, and the kernel’s authors judged a few empty threadgroups cheaper than that synchronization. The pattern repeats: spend trivial GPU work to keep the CPU out of the launch path.</p> <h2 id="what-continuous-batching-asks-of-the-kernel">What continuous batching asks of the kernel</h2> <p>Trace one step end to end. The vLLM scheduler hands vllm-metal a step that mixes, say, 32 decoding requests and one mid-prefill request. The model runner packs them into one flat token tensor, decodes first, and builds <code class="language-plaintext highlighter-rouge">cu_seqlens_q</code> as plain Python lists in <code class="language-plaintext highlighter-rouge">prepare_unified()</code>: something like <code class="language-plaintext highlighter-rouge">[0, 1, 2, ..., 32, 544]</code>. Every attention layer makes one paged-attention dispatch; the C++ side routes this batch to the tiled kernel, sizes the grid from <code class="language-plaintext highlighter-rouge">total_q_tokens</code>, and each threadgroup binary-searches its way to its request. 
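</p> <p>The grid sizing and the per-threadgroup lookup in this trace can be modeled in a few lines of Python. This is a sketch of the idea, not the Metal source: <code class="language-plaintext highlighter-rouge">find_seq_idx_blocks</code> and <code class="language-plaintext highlighter-rouge">resolve_block</code> are hypothetical names, and the surplus test (a block is surplus when its first row falls past the end of its sequence) is an assumption about how the exit is decided.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BQ = 32  # query rows per tiled threadgroup (the post's BQ)


def find_seq_idx_blocks(cu_seqlens_q, q_block_idx, num_seqs, bq=BQ):
    """Largest i whose first Q-block, cu[i] // bq + i, is at or before q_block_idx."""
    lo, hi = 0, num_seqs
    while hi > lo:
        mid = (lo + hi + 1) // 2
        if q_block_idx >= cu_seqlens_q[mid] // bq + mid:
            lo = mid
        else:
            hi = mid - 1
    return lo


def resolve_block(cu_seqlens_q, q_block_idx, num_seqs, bq=BQ):
    """Return (seq_idx, local_block), or None for a surplus threadgroup.

    The surplus test below is an assumption made for this sketch.
    """
    seq_idx = find_seq_idx_blocks(cu_seqlens_q, q_block_idx, num_seqs, bq)
    if seq_idx >= num_seqs:
        return None
    local_block = q_block_idx - (cu_seqlens_q[seq_idx] // bq + seq_idx)
    q_len = cu_seqlens_q[seq_idx + 1] - cu_seqlens_q[seq_idx]
    if local_block * bq >= q_len:
        return None
    return seq_idx, local_block


# The step traced above: 32 decode requests, then one 512-token prefill chunk.
cu_seqlens_q = list(range(33)) + [544]
num_seqs = len(cu_seqlens_q) - 1
grid_blocks = cu_seqlens_q[-1] // BQ + num_seqs  # total_q_tokens / BQ + num_seqs
live = [b for b in range(grid_blocks)
        if resolve_block(cu_seqlens_q, b, num_seqs) is not None]
print(grid_blocks, len(live))  # 50 48
</code></pre></div></div> <p>Under these assumptions the grid holds 50 Q-blocks for this batch, 48 of which map to real query rows; the other two are the surplus threadgroups that exit immediately.</p> <p>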
No layer of this stack ever sees a <code class="language-plaintext highlighter-rouge">[batch, seq_len]</code> rectangle.</p> <p>Each forward pass is one attention dispatch per layer regardless of batch composition (a small pure-decode batch adds a reduce pass via the split-KV path), which is what continuous batching asks for: requests join and leave the batch every step, query lengths change every step, and none of that changes the kernel, only the contents of a 136-byte prefix array. The speculative-decoding case from the previous post lands in the same array, though not as one piece: <code class="language-plaintext highlighter-rouge">prepare_unified()</code> expands a k+1-token verification segment into k+1 unit deltas, each with its own context length carrying the causal staircase.</p> <h2 id="measuring-the-obvious-optimization">Measuring the obvious optimization</h2> <p>The search is $O(\log N)$ work per threadgroup to compute something the host already knows. The host builds <code class="language-plaintext highlighter-rouge">cu_seqlens_q</code>; it could just as easily build the inverse, a <code class="language-plaintext highlighter-rouge">token → seq_idx</code> array of length <code class="language-plaintext highlighter-rouge">total_q_tokens</code>, and the kernel would do one load instead of a loop. We implemented it behind an env flag and measured.</p> <p><strong>Method.</strong> A function constant adds a second path to both kernels: <code class="language-plaintext highlighter-rouge">seq_idx = seq_map[q_token_idx]</code> for the decode kernel, a Q-block map with sentinel entries for the tiled kernel’s surplus threadgroups. The host builds the token map in <code class="language-plaintext highlighter-rouge">prepare_unified()</code> next to <code class="language-plaintext highlighter-rouge">cu_seqlens_q</code> and derives the Q-block map from it at the step’s first tiled dispatch; each converts to an MLX array once per step. 
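</p> <p>A sketch of the token map, assuming plain Python lists stand in for the MLX arrays. <code class="language-plaintext highlighter-rouge">build_token_map</code> is a hypothetical name, and the Q-block map for the tiled kernel is left out:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from bisect import bisect_right
from itertools import accumulate


def build_token_map(query_lens):
    """Return (cu_seqlens_q, seq_map), where seq_map[t] is the owner of token t."""
    cu_seqlens_q = [0, *accumulate(query_lens)]
    seq_map = []
    for seq_idx, q_len in enumerate(query_lens):
        seq_map.extend([seq_idx] * q_len)
    return cu_seqlens_q, seq_map


# 32 decode requests followed by one 512-token prefill chunk.
cu_seqlens_q, seq_map = build_token_map([1] * 32 + [512])

# Reference check: the lookup must agree with a search over the prefix array
# (largest i with cu_seqlens_q[i] at or below t) for every packed token.
for t, owner in enumerate(seq_map):
    assert owner == bisect_right(cu_seqlens_q, t) - 1

print(len(seq_map), seq_map[31], seq_map[32], seq_map[543])  # 544 31 32 32
</code></pre></div></div> <p>The map costs one int per query token where the prefix array costs one int per request, which is the memory side of trading the loop for a single load.</p> <p>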
Correctness is checked by asserting bit-identical attention output against the binary-search path across decode, mixed, ragged-tail, and spec-verify shaped batches, and a dispatch-side counter asserts that each timing arm actually ran the path it claims. Timing is ABAB-interleaved across processes, 2 warmup batches then pooled timed repeats, medians and IQR reported. Shapes: 16 query heads, 4 KV heads, head size 128, fp16 (Qwen3-0.6B’s query-head count and head size; the model itself has 8 KV heads).</p> <p><strong>The search is not free everywhere.</strong> Per-dispatch medians on the M1 Pro:</p> <div class="bench-table"> <table> <thead> <tr> <th>batch</th> <th>kernel</th> <th style="text-align: right">binary search (µs)</th> <th style="text-align: right">O(1) map (µs)</th> <th style="text-align: right">delta</th> </tr> </thead> <tbody> <tr> <td>decode, 64 reqs, ctx 128</td> <td>per-token</td> <td style="text-align: right">322.6</td> <td style="text-align: right">296.5</td> <td style="text-align: right">−8.1%</td> </tr> <tr> <td>decode, 256 reqs, ctx 128</td> <td>per-token</td> <td style="text-align: right">1,277.1</td> <td style="text-align: right">1,171.0</td> <td style="text-align: right">−8.3%</td> </tr> <tr> <td>decode, 1,024 reqs, ctx 128</td> <td>per-token</td> <td style="text-align: right">5,159.2</td> <td style="text-align: right">4,580.9</td> <td style="text-align: right">−11.2%</td> </tr> <tr> <td>decode, 64 reqs, ctx 2,048</td> <td>per-token</td> <td style="text-align: right">4,074.3</td> <td style="text-align: right">4,043.2</td> <td style="text-align: right">−0.8%</td> </tr> <tr> <td>decode, 1,024 reqs, ctx 2,048</td> <td>per-token</td> <td style="text-align: right">65,141.9</td> <td style="text-align: right">64,259.9</td> <td style="text-align: right">−1.4%</td> </tr> <tr> <td>prefill, 8 × 512</td> <td>tiled</td> <td style="text-align: right">7,297.2</td> <td style="text-align: right">7,157.0</td> <td style="text-align: right">−1.9%</td> </tr> <tr> 
<td>mixed, 32 decode + 4 × 512</td> <td>tiled</td> <td style="text-align: right">7,022.6</td> <td style="text-align: right">6,811.6</td> <td style="text-align: right">−3.0%</td> </tr> </tbody> </table> </div> <p>The pattern matches the arithmetic. At context 128 the KV loop visits 8 blocks, so the search’s dependent loads are a visible fraction of each threadgroup’s runtime, and the fraction grows with the batch: 64 requests take up to 7 search iterations, 1,024 up to 11. At context 2,048 the KV loop visits 128 blocks and the search disappears into it. No configuration regressed beyond noise. The host pays for the map: at 1,024 decoding requests, 86 µs to build the Python list and 10 µs to convert it, per step. That is under 0.1% of the attention time of the step it serves.</p> <p><strong>End to end, at ordinary scale, nothing moves.</strong> Serving Qwen3-0.6B at concurrency 16 with 1,024-token inputs, 3 ABAB rounds per arm: output throughput +0.9%, median TTFT −1.2%, median TPOT +0.5%, p99 TPOT +2.6%, all inside the round-to-round IQR. Three rounds per arm can only detect complete separation, and the TPOT point estimates lean slightly against the map, so the supported read is flat. At 1,024 tokens of context, decode sits in the regime where the table above says the search costs nothing, so this is the microbench’s prediction, confirmed.</p> <p><strong>The favorable regime, end to end.</strong> Serving the same model at concurrency 64 with 128-token inputs and outputs keeps every decode step at context 128 to 256 with a 64-deep batch: the microbench’s best case. The first three ABAB rounds per arm showed output throughput +3.1% and median TPOT −3.0%. Nine rounds per arm dissolved it: +0.9% mean throughput, −0.7% mean TPOT, 6 of 9 pairwise wins, exact permutation p = 0.26 and 0.45. The early +3.1% traced to three fast server processes that landed on both arms (Fisher p = 1.0) and never recurred. 
An 8–11% kernel-side win dilutes to under 1% of serving time, and separating a 1% effect from 1.3% per-round noise takes about thirty rounds per arm; we stopped at nine.</p> <p><strong>Where this sits in the design space.</strong> Kernels that need token-to-request resolution have settled into four patterns. Per-program search over the prefix array, as here and in upstream’s unified Triton kernel. Grid axes that encode the request directly, as in the classic padded PagedAttention launch, paying threadgroups for padding instead. Host-precomputed work descriptors, as in FlashInfer’s <code class="language-plaintext highlighter-rouge">plan()</code> phase and FlashMLA’s tile scheduler metadata, which amortize scheduling onto the CPU once per step. And explicit precomputed index maps, the variant we tested; vLLM already ships them in FlexAttention’s <code class="language-plaintext highlighter-rouge">doc_ids</code> and ROCm AITER’s <code class="language-plaintext highlighter-rouge">token_to_batch</code> (non-default attention backends) and in Mamba2’s per-chunk <code class="language-plaintext highlighter-rouge">seq_idx</code>, but not in its default attention path.</p> <p><strong>Verdict.</strong> The binary search is not literally free: replace it with a host-built map and the decode kernel gets back up to 11% of its time at short context and large batch, bit-identically. It is free where it matters: those are the configurations where the kernel is cheapest, so serving throughput moves by less than the run-to-run noise. The upstream trade, a few empty threadgroups and a $\log_2 N$ loop in exchange for keeping the CPU out of the launch path, survives measurement, and the seq-map branch stays parked in its experiment worktree. 
The patch worth upstreaming from this exercise is the bug the bit-identical assertion flushed out of the unmodified baseline instead: an exception thrown from inside <code class="language-plaintext highlighter-rouge">mx.eval</code> leaves MLX’s Metal eval state wedged, and a later, valid eval can return an unwritten output buffer.</p> <h2 id="links">Links</h2> <ul> <li>vllm-metal: <a href="https://github.com/vllm-project/vllm-metal">github.com/vllm-project/vllm-metal</a>, kernels in <code class="language-plaintext highlighter-rouge">vllm_metal/metal/kernels_v2/</code></li> <li>Upstream ancestor: <a href="https://github.com/vllm-project/vllm/blob/main/vllm/v1/attention/ops/triton_unified_attention.py"><code class="language-plaintext highlighter-rouge">triton_unified_attention.py</code></a>, introduced in <a href="https://github.com/vllm-project/vllm/pull/16828">vllm#16828</a>; <code class="language-plaintext highlighter-rouge">find_seq_idx</code> now lives in <a href="https://github.com/vllm-project/vllm/blob/main/vllm/v1/attention/ops/triton_attention_helpers.py"><code class="language-plaintext highlighter-rouge">triton_attention_helpers.py</code></a></li> <li>Apple GPU execution model: <a href="https://developer.apple.com/videos/play/wwdc2022/10159/">Scale compute workloads across Apple GPUs, WWDC22 session 10159</a>, <a href="https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf">Metal feature set tables</a></li> <li>Previous post: <a href="/blog/2026/vllm-metal-vs-mlx-lm-kv-cache/">Contiguous vs Paged Varlen KV Cache</a></li> </ul>]]></content><author><name></name></author><category term="research"/><category term="Metal"/><category term="kernel"/><category term="Apple Silicon"/><category term="vllm-metal"/><category term="systems"/><summary type="html"><![CDATA[How vllm-metal's paged attention kernels resolve which sequence owns each token with a per-threadgroup binary search, what that buys continuous batching on Apple Silicon, and what happened when 
we replaced it with an O(1) lookup map.]]></summary></entry><entry><title type="html">Keyframe-Only Video Loading in vLLM: The Accuracy-Throughput Trade</title><link href="https://ranranhaoranzhang.com/blog/2026/vllm-pyav-keyframes-video-loader/" rel="alternate" type="text/html" title="Keyframe-Only Video Loading in vLLM: The Accuracy-Throughput Trade"/><published>2026-06-10T10:00:00+00:00</published><updated>2026-06-10T10:00:00+00:00</updated><id>https://ranranhaoranzhang.com/blog/2026/vllm-pyav-keyframes-video-loader</id><content type="html" xml:base="https://ranranhaoranzhang.com/blog/2026/vllm-pyav-keyframes-video-loader/"><![CDATA[<p>eBay has billions of videos and a steady stream of classification jobs over them: a video plus a classification prompt goes to a vision-language model, and the answer is one multiple-choice token. A typical job is 1,000–2,000 short low-resolution clips run offline through vLLM’s <code class="language-plaintext highlighter-rouge">LLM.chat</code> with <code class="language-plaintext highlighter-rouge">max_tokens=1</code>, so there is no generation phase: the run is prefill, and every input-side cost sits on the critical path. Profiling these jobs (Qwen2.5-VL-7B-Instruct, one A100, TP=1) put video decode at 28–44% of wall time depending on the dataset. The GPU waits while the CPU turns compressed video into pixel arrays.</p> <p><a href="https://github.com/vllm-project/vllm/pull/45203">PR #45203</a> is the workaround I proposed upstream: <code class="language-plaintext highlighter-rouge">pyav_keyframes</code>, an opt-in lossy video loader that decodes only keyframes, so decode work is at most <code class="language-plaintext highlighter-rouge">num_frames</code> single-frame decodes per clip regardless of clip length. What it buys and what it costs are both measurable. 
All numbers below are from public datasets, with full settings and tuning logs in <a href="https://github.com/WindChimeRan/offline_video_vllm">offline_video_vllm</a>.</p> <h2 id="where-decode-time-goes">Where decode time goes</h2> <p>Video codecs store a complete image only at periodic keyframes (I-frames). The frames between them are motion-compensated deltas: P-frames reference earlier frames, B-frames reference earlier and later ones. A keyframe plus the frames that depend on it forms a GOP (group of pictures), typically 2–10 s in web encodes.</p> <style>.gop-strip{margin:1.2em 0}.gop-strip .gop{display:flex;gap:3px;flex-wrap:wrap}.gop-strip .fr{width:30px;height:38px;border-radius:5px;display:flex;align-items:center;justify-content:center;color:#fff;font:700 .72rem ui-monospace,SFMono-Regular,Menlo,monospace}.gop-strip .I{background:#e73}.gop-strip .P{background:#47a}.gop-strip .B{background:#9dbbd8}.gop-strip .legend{display:flex;gap:16px;flex-wrap:wrap;font-size:.85rem;color:var(--global-text-color-light,#5a5a72);margin:.4em 0 0}.gop-strip .legend span::before{content:"";display:inline-block;width:11px;height:11px;border-radius:3px;margin-right:5px;vertical-align:-1px}.gop-strip .lg-I::before{background:#e73}.gop-strip .lg-P::before{background:#47a}.gop-strip .lg-B::before{background:#9dbbd8}</style> <div class="gop-strip"> <div class="gop"> <div class="fr I">I</div><div class="fr B">B</div><div class="fr B">B</div><div class="fr P">P</div> <div class="fr B">B</div><div class="fr B">B</div><div class="fr P">P</div><div class="fr B">B</div> <div class="fr P">P</div> <div class="fr I">I</div><div class="fr B">B</div><div class="fr P">P</div><div class="fr B">B</div> <div class="fr B">B</div><div class="fr P">P</div><div class="fr B">B</div><div class="fr P">P</div> <div class="fr B">B</div> </div> <div class="legend"><span class="lg-I">I = keyframe (self-contained)</span> <span class="lg-P">P = delta vs past</span> <span class="lg-B">B = delta vs past + 
future</span></div> </div> <p>Decoding an arbitrary frame means starting at its GOP’s keyframe and decoding forward, because P and B frames are meaningless alone. A lossless sparse sampler pays that GOP-prefix decode for every target it touches, however few targets it keeps. Keyframes have neither problem: decoding one costs one frame decode, and finding them costs no decode at all, since the demuxer reads packet headers that already carry a keyframe flag and a timestamp.</p> <h2 id="a-loader-that-never-decodes-a-delta-frame">A loader that never decodes a delta frame</h2> <p><code class="language-plaintext highlighter-rouge">pyav_keyframes</code> makes two passes over the container. The first demuxes the stream and records every keyframe timestamp; no pixels are decoded. The second spreads <code class="language-plaintext highlighter-rouge">num_frames</code> picks evenly over that keyframe list, then seeks to each pick and decodes exactly one frame. A 30-second clip and a 10-minute clip cost the same: one header sweep plus at most <code class="language-plaintext highlighter-rouge">num_frames</code> keyframe decodes.</p> <p>When the budget exceeds the keyframe count, picks repeat instead of falling back to delta frames, and the repeats stay balanced: a 2-keyframe clip asked for 16 frames returns 8 copies of each. 
Repeated frames are decoded once and reported at their true source positions in the frame metadata, so a model like Qwen2.5-VL, which embeds each frame’s time position, sees the same moment twice rather than motion that never happened.</p> <p>The sliders below replay the pick logic: set how many keyframes the clip has and how many frames the caller asks for, then compare the decode work of lossless uniform sampling against keyframe-only sampling for the same request.</p> <style>.pick-widget{--pw-orange:#e73;--pw-red:#c31;--pw-ink:var(--global-text-color,#1a1a2e);--pw-muted:var(--global-text-color-light,#5a5a72);--pw-line:var(--global-divider-color,#e4e4ec);--pw-chip:#d9dbe4;--pw-kf-bg:#f0e4da;--pw-kf-ink:#7a3c12;--pw-out-bg:#eef0f5;background:var(--global-card-bg-color,#fff);border:1.5px solid var(--pw-line);border-radius:12px;padding:16px 18px;margin:1.5em 0;color:var(--pw-ink)}html[data-theme="dark"] .pick-widget{--pw-red:#e85c41;--pw-chip:#44444f;--pw-kf-bg:#3d2d1e;--pw-kf-ink:#f3c89e;--pw-out-bg:#2b2b33}.pick-widget label{font-size:.9rem;margin-right:18px}.pick-widget input[type=range]{vertical-align:middle;width:160px}.pick-widget .pw-h{font-weight:700;margin:1.2em 0 .3em}.pick-widget .hint{font-size:.85rem;color:var(--pw-muted)}.pick-widget .chips{display:flex;gap:2px;flex-wrap:wrap;margin:.5em 0}.pick-widget .chip{width:14px;height:26px;border-radius:3px;background:var(--pw-chip)}.pick-widget .chip.kf{background:var(--pw-orange)}.pick-widget .chip.dec{outline:2.5px solid var(--pw-red);outline-offset:-2px}.pick-widget .kfrow{display:flex;gap:5px;flex-wrap:wrap;margin:.5em 0}.pick-widget .kfchip{min-width:36px;padding:4px 5px;border-radius:6px;background:var(--pw-kf-bg);border:1.5px solid var(--pw-orange);text-align:center;font:600 .75rem ui-monospace,SFMono-Regular,Menlo,monospace;color:var(--pw-kf-ink)}.pick-widget .kfchip .cnt{display:block;font-size:.95rem;color:var(--pw-red)}.pick-widget .kfchip.zero{opacity:.32;border-style:dashed}.pick-widget 
.stat{font-size:.92rem;margin:.4em 0;color:var(--pw-muted)}.pick-widget .stat b{color:var(--pw-ink)}.pick-widget .out{font:.82rem ui-monospace,SFMono-Regular,Menlo,monospace;background:var(--pw-out-bg);border-radius:6px;padding:6px 10px;margin:.4em 0;overflow-x:auto}.pick-widget .legend{display:flex;gap:16px;flex-wrap:wrap;font-size:.85rem;color:var(--pw-muted);margin:.6em 0 0}.pick-widget .legend span::before{content:"";display:inline-block;width:11px;height:11px;border-radius:3px;margin-right:5px;vertical-align:-1px}.pick-widget .lg-I::before{background:var(--pw-orange)}.pick-widget .lg-skip::before{background:var(--pw-chip)}</style> <div class="pick-widget"> <label>keyframes in clip (<code>n_kf</code>): <input type="range" id="nkf" min="2" max="24" value="12"/> <b id="nkfv">12</b></label> <label>budget (<code>num_frames</code>): <input type="range" id="nf" min="1" max="32" value="16"/> <b id="nfv">16</b></label> <div class="pw-h">Which keyframes get picked (badge = times duplicated)</div> <div class="kfrow" id="kfrow"></div> <div class="out" id="picksout"></div> <div class="pw-h">Decode work on the timeline (GOP = 10 frames)</div> <div class="hint">lossless uniform sampling — must decode every outlined frame to reach its targets:</div> <div class="chips" id="lossless"></div> <div class="hint"><code>pyav_keyframes</code> — decodes only outlined keyframes:</div> <div class="chips" id="lossy"></div> <div class="stat" id="stats"></div> <div class="legend"><span class="lg-I">keyframe</span><span class="lg-skip">delta frame</span> <span style="color:var(--pw-red)">▢ outlined = actually decoded</span></div> </div> <script>
function npRound(x){ // numpy round-half-even
  const f=Math.floor(x), d=x-f;
  if(d>0.5) return f+1;
  if(d<0.5) return f;
  return (f%2===0)?f:f+1;
}
function linspace(a,b,k){ // k evenly spaced values from a to b inclusive (like numpy.linspace)
  if(k===1) return [a];
  const out=[]; for(let i=0;i<k;i++) out.push(a+(b-a)*i/(k-1)); return out;
}
const GOP=10; // frames per GOP on the illustrative timeline: one keyframe, then 9 delta frames
function render(){
  const nkf=+document.getElementById('nkf').value;
  const nf=+document.getElementById('nf').value;
  document.getElementById('nkfv').textContent=nkf;
  document.getElementById('nfv').textContent=nf;
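  // keyframe index chosen for each requested frame; picks repeat when nf > nkf
  const picks=linspace(0,nkf-1,nf).map(npRound);
  // how many times each keyframe is picked
  const counts=Array(nkf).fill(0); picks.forEach(p=>counts[p]++);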
  // keyframe row
  const kfrow=document.getElementById('kfrow'); kfrow.innerHTML='';
  for(let i=0;i<nkf;i++){
    const d=document.createElement('div');
    d.className='kfchip'+(counts[i]?'':' zero');
    d.innerHTML='kf'+i+'<span class="cnt">'+(counts[i]?'×'+counts[i]:'—')+'</span>';
    kfrow.appendChild(d);
  }
  document.getElementById('picksout').textContent=
    'picks = ['+picks.join(', ')+']   →   frames_indices = ['+picks.map(p=>p*GOP).join(', ')+']';
  // timeline
  const total=nkf*GOP;
  const targets=linspace(0,total-1,Math.min(nf,total)).map(Math.floor); // lossless uniform targets
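  const decodedLossless=new Set();
  // lossless: reaching target t means decoding from its GOP's keyframe forward through t
  targets.forEach(t=>{ const kf=Math.floor(t/GOP)*GOP; for(let f=kf;f<=t;f++) decodedLossless.add(f); });
  // keyframe-only: one decode per distinct picked keyframe
  const decodedLossy=new Set(picks.map(p=>p*GOP));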
  const mk=(host,decoded)=>{
    const el=document.getElementById(host); el.innerHTML='';
    for(let f=0;f<total;f++){
      const c=document.createElement('div');
      c.className='chip'+(f%GOP===0?' kf':'')+(decoded.has(f)?' dec':'');
      el.appendChild(c);
    }
  };
  mk('lossless',decodedLossless); mk('lossy',decodedLossy);
  document.getElementById('stats').innerHTML=
    'decoded frames — lossless: <b>'+decodedLossless.size+'</b> · pyav_keyframes: <b>'+
    decodedLossy.size+'</b> ('+(decodedLossless.size/decodedLossy.size).toFixed(1)+'× fewer)';
}
document.getElementById('nkf').addEventListener('input',render);
document.getElementById('nf').addEventListener('input',render);
render();
</script> <h2 id="the-trade-measured">The trade, measured</h2> <p>Setup: Qwen2.5-VL-7B-Instruct on one A100 (TP=1), offline <code class="language-plaintext highlighter-rouge">LLM.chat</code>, <code class="language-plaintext highlighter-rouge">max_tokens=1</code>, 16 frames per clip, 1,990 multiple-choice questions from NExTQA and MVBench. The two runs are identical except the video loader: lossless OpenCV sampling vs <code class="language-plaintext highlighter-rouge">pyav_keyframes</code>. End-to-end wall time drops from 674 s to 380 s and throughput rises from 2.95 to 5.23 req/s, a 1.77× speedup from the loader swap alone. At the decode stage, lossless sampling costs 193 ms on a 30 s clip and 3,124 ms on a 600 s clip; <code class="language-plaintext highlighter-rouge">pyav_keyframes</code> stays between 43 and 77 ms across all four test clips.</p> <div style="display: flex; justify-content: center;"> <img src="/assets/img/pyav_keyframes_speed.png" alt="Decode time per clip and end-to-end wall time, lossless vs pyav_keyframes" style="max-width: 100%; height: auto;"/> </div> <p align="center"><em>Left: decode time to extract 16 frames (log scale) across clip lengths and GOP settings. Right: end-to-end wall time on the N=1990 benchmark; the only change between bars is the video loader.</em></p> <p>Accuracy is the other side of the trade. NExTQA, scene-content QA, is unchanged at 79.6 vs 79.5. MVBench drops 11.3 points overall, and the drop concentrates in motion-sensitive subtasks: <code class="language-plaintext highlighter-rouge">action_antonym</code> loses 52.7 points, <code class="language-plaintext highlighter-rouge">moving_attribute</code> and <code class="language-plaintext highlighter-rouge">object_existence</code> lose 36.4 each, while 10 of 18 subtasks stay within ±2 points. 
Keyframes tend to land on scene boundaries, because encoders usually insert one at a scene cut, so prompts about what a scene contains keep their signal; prompts about what changes between keyframes lose it.</p> <div style="display: flex; justify-content: center;"> <img src="/assets/img/pyav_keyframes_accuracy.png" alt="Overall accuracy on NExTQA and MVBench, and MVBench per-subtask deltas" style="max-width: 100%; height: auto;"/> </div> <p align="center"><em>Left: overall accuracy, num_frames=16. Right: MVBench per-subtask deltas, the worst and best of 18 subtasks; the remaining 10 move by less than 2 pt.</em></p> <p>The decode savings apply to every clip by construction; the accuracy cost lands only where the prompt needs inter-keyframe information. A classification job in the eBay mold, one-token answers about what the clip shows, matches the NExTQA column. Motion reasoning matches the <code class="language-plaintext highlighter-rouge">action_antonym</code> column and should stay on a lossless loader.</p> <h2 id="a-case-study-in-quoting-out-of-context">A case study in quoting out of context</h2> <p>Chinese has an idiom for this failure mode: 断章取义, judging a passage by a fragment taken out of context. Keyframe sampling does it to video. The two CLEVRER-based MVBench subtasks that drop 36.4 points each, <code class="language-plaintext highlighter-rouge">moving_attribute</code> and <code class="language-plaintext highlighter-rouge">object_existence</code>, are the extreme case: all 26 CLEVRER clips we probed carry exactly one keyframe, at frame 0, so <code class="language-plaintext highlighter-rouge">pyav_keyframes</code> returns the opening frame duplicated to fill the budget. The model is asked about a five-second video and shown its first instant, <code class="language-plaintext highlighter-rouge">num_frames</code> times.</p> <p>The item below is a real <code class="language-plaintext highlighter-rouge">object_existence</code> row. 
The question asks about a purple sphere; the sphere rolls in at t≈2.0 s, so the answer is yes, and frame 0 cannot know that. Drag the budget down: uniform sampling loses its glimpses one by one and collapses to frame 0 at <code class="language-plaintext highlighter-rouge">num_frames = 1</code>; the keyframe row sits at frame 0 from the start. On encodes like this, <code class="language-plaintext highlighter-rouge">num_frames</code> stops being the knob that matters. The keyframe count is, and it is a property of the file, not of the request.</p> <style>.case-widget{--cw-ink:var(--global-text-color,#1a1a2e);--cw-muted:var(--global-text-color-light,#5a5a72);--cw-line:var(--global-divider-color,#e4e4ec);--cw-yes:#098;--cw-no:#c31;background:var(--global-card-bg-color,#fff);border:1.5px solid var(--cw-line);border-radius:12px;padding:16px 18px;margin:1.5em 0;color:var(--cw-ink)}html[data-theme="dark"] .case-widget{--cw-no:#e85c41}.case-widget .cw-q{font-size:1rem;margin:0 0 .3em}.case-widget .cw-meta{font-size:.85rem;color:var(--cw-muted);margin:0 0 .9em}.case-widget .cw-opt{display:inline-block;border:1.5px solid var(--cw-line);border-radius:6px;padding:1px 8px;margin-left:6px;font-size:.88rem}.case-widget .cw-opt.truth{border-color:var(--cw-yes);color:var(--cw-yes);font-weight:700}.case-widget label{font-size:.9rem}.case-widget input[type=range]{vertical-align:middle;width:160px}.case-widget .cw-h{font-weight:700;margin:1.1em 0 .35em}.case-widget .cw-h code{font-weight:400}.case-widget .cw-row{display:flex;flex-wrap:wrap;gap:3px}.case-widget .cw-row img{width:var(--tw,12%);min-width:42px;max-width:240px;border-radius:4px;display:block}.case-widget .cw-row img.vis{outline:2.5px solid var(--cw-yes);outline-offset:-2px}.case-widget .cw-verdict{font-size:.92rem;color:var(--cw-muted);margin:.45em 0 0}.case-widget .cw-verdict b.yes{color:var(--cw-yes)}.case-widget .cw-verdict b.no{color:var(--cw-no)}</style> <div class="case-widget"> <p class="cw-q">“Are there any purple 
spheres that enter the scene?” <span class="cw-opt">not sure</span><span class="cw-opt truth">yes ✓</span><span class="cw-opt">no</span></p> <p class="cw-meta">MVBench <code>object_existence</code>, CLEVRER clip <code>video_12845.mp4</code>: 128 frames, 5.12 s, one keyframe (frame 0). The purple sphere enters at t≈2.0 s.</p> <label>budget (<code>num_frames</code>): <input type="range" id="csb" min="0" max="4" value="4"/> <b id="csbv">16</b></label> <div class="cw-h">lossless uniform sampling</div> <div class="cw-row" id="csuni"></div> <div class="cw-verdict" id="csuniv"></div> <div class="cw-h"><code>pyav_keyframes</code> (n_kf = 1 → frame 0, duplicated)</div> <div class="cw-row" id="cskf"></div> <div class="cw-verdict" id="cskfv"></div> </div> <script>
(function(){
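const BUDGETS=[1,2,4,8,16]; // slider position (0–4) → num_frames
// precomputed uniform-sampling frame indices over the 128-frame clip (0–127), one list per budget
const PICKS={1:[0],2:[0,127],4:[0,42,85,127],8:[0,18,36,54,73,91,109,127],16:[0,8,17,25,34,42,51,59,68,76,85,93,102,110,119,127]};
// frames with index >= VIS are outlined and counted as showing the sphere
const VIS=48, BASE='/assets/img/case_video_12845/';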
const widget=document.querySelector('.case-widget');
function row(host,idxs){
  const el=document.getElementById(host); el.innerHTML='';
  idxs.forEach(i=>{
    const im=document.createElement('img');
    im.src=BASE+'f'+i+'.webp';
    im.alt='frame at t='+(i/25).toFixed(1)+'s'; // 25 fps: 128 frames over 5.12 s
    if(i>=VIS) im.className='vis';
    el.appendChild(im);
  });
  return idxs.filter(i=>i>=VIS).length;
}
function verdict(host,n,k){
  document.getElementById(host).innerHTML=
    'purple sphere visible in <b>'+n+'</b> of '+k+' frames → evidence supports '+
    (n>0?'<b class="yes">“yes”</b>':'<b class="no">“no”</b>');
}
function render(){
  const k=BUDGETS[+document.getElementById('csb').value];
  document.getElementById('csbv').textContent=k;
  widget.style.setProperty('--tw', k===16 ? '12%' : '24%');
  verdict('csuniv',row('csuni',PICKS[k]),k);
  verdict('cskfv',row('cskf',Array(k).fill(0)),k);
}
document.getElementById('csb').addEventListener('input',render);
render();
})();
</script> <p align="center"><em>Frames from the CLEVRER validation set, via MVBench. The verdict lines describe the visible evidence in each sampled frame set; no model inference was run for this figure.</em></p> <p><code class="language-plaintext highlighter-rouge">object_existence</code> is 200 rows like this one; this mechanism is what its −36.4 pt aggregates.</p> <h2 id="using-it">Using it</h2> <p>The loader is opt-in; nothing changes unless a run selects the <code class="language-plaintext highlighter-rouge">pyav_keyframes</code> backend. Until the PR lands, the same loader ships as a single-file drop-in (<code class="language-plaintext highlighter-rouge">pyav_keyframes_v2</code>) in the experiment repo: importing the module registers it with vLLM’s loader registry, and the README has the exact <code class="language-plaintext highlighter-rouge">LLM(...)</code> configuration.</p> <h2 id="links">Links</h2> <ul> <li>vLLM PR: <a href="https://github.com/vllm-project/vllm/pull/45203">vllm-project/vllm#45203</a>, “[Multimodal] Add lossy keyframe-only video loader (pyav_keyframes)”</li> <li>Experiments and benchmarks (public datasets): <a href="https://github.com/WindChimeRan/offline_video_vllm">WindChimeRan/offline_video_vllm</a></li> </ul>]]></content><author><name></name></author><category term="research"/><category term="LLM"/><category term="inference"/><category term="video"/><category term="multimodal"/><category term="vllm"/><summary type="html"><![CDATA[An opt-in lossy video loader for vLLM that decodes only I-frames: 1.77× end-to-end on offline video classification, −0.1 pt on NExTQA, −11.3 pt on MVBench.]]></summary></entry><entry><title type="html">vllm-metal vs mlx_lm: Contiguous vs Paged Varlen KV Cache</title><link href="https://ranranhaoranzhang.com/blog/2026/vllm-metal-vs-mlx-lm-kv-cache/" rel="alternate" type="text/html" title="vllm-metal vs mlx_lm: Contiguous vs Paged Varlen KV 
Cache"/><published>2026-05-20T00:00:00+00:00</published><updated>2026-05-20T00:00:00+00:00</updated><id>https://ranranhaoranzhang.com/blog/2026/vllm-metal-vs-mlx-lm-kv-cache</id><content type="html" xml:base="https://ranranhaoranzhang.com/blog/2026/vllm-metal-vs-mlx-lm-kv-cache/"><![CDATA[<h2 id="the-mlx_lm-case-study-25-on-a-32gb-mac-free-on-a-64gb-mac">The mlx_lm case study: 2.5× on a 32GB Mac, ~free on a 64GB Mac</h2> <p>Same mlx_lm code, two Macs, no other changes. Three prompts of 30K + 5K + 10 input tokens (Qwen3-0.6B, <code class="language-plaintext highlighter-rouge">max_tokens=256</code>), run first sequentially then concurrently. On the M1 Max 64GB the concurrent run finishes in 45.0s, against 40.3s sequential. On the M1 Pro 32GB the concurrent run takes 189.6s, against 74.8s sequential. No OOM, no crash, no warning.</p> <div style="display: flex; justify-content: center; gap: 1rem; flex-wrap: wrap;"> <img src="/assets/img/f4_mlx_fragmentation_m1pro.png" alt="mlx_lm padding cost on M1 Pro 32GB" style="max-width: 48%; height: auto;"/> <img src="/assets/img/f4_mlx_fragmentation_m1max.png" alt="mlx_lm padding cost on M1 Max 64GB" style="max-width: 48%; height: auto;"/> </div> <p align="center"><em>Left: M1 Pro 32GB. Right: M1 Max 64GB. Sequential vs concurrent wall time, with resident memory (solid) and macOS-compressed memory (dashed) over time.</em></p> <p>In concurrent mode mlx_lm pads its KV cache to <code class="language-plaintext highlighter-rouge">[3, H, 30000, D]</code> because the longest prompt is 30K. The actual tokens total 35,010 (30,000 + 5,000 + 10), but the padded rectangle is 3 × 30,000 = 90,000. 61% of every attention step is wasted on padding. On the 64GB box that padding still fits; resident memory peaks at 32GB and the workload finishes. 
On the 32GB box the same padded rectangle pushes the system into memory pressure, macOS starts compressing pages (orange line rises from ~2GB to ~6GB), and every decode step pays decompression before attention and recompression after.</p> <p>This is a deployment cliff: identical code, no OOM, no warning, just silently slower on a smaller machine.</p> <h2 id="the-one-layer-we-replaced">The one layer we replaced</h2> <p>The cliff lives in attention. vllm-metal is a vLLM plugin for Apple Silicon: at the model level it reuses mlx_lm’s weight loader, RMSNorm, Linear, MoE, and MLP layers. All of these are token-wise. They operate on each token independently and don’t care whether tokens are batched as a <code class="language-plaintext highlighter-rouge">[B, T]</code> padded rectangle or flattened to a <code class="language-plaintext highlighter-rouge">[total_tokens]</code> varlen strip. Attention is the only layer that needs to know about sequence boundaries; that is the one we replaced. Above the model, the scheduler is vLLM’s.</p> <p>mlx_lm uses flash-style attention over a contiguous, left-padded KV cache shaped <code class="language-plaintext highlighter-rouge">[B, H, T, D]</code> (B = batch size, H = number of KV heads, T = sequence length uniform across the batch, D = head dim), consumed by MLX’s stock <code class="language-plaintext highlighter-rouge">scaled_dot_product_attention</code>. vllm-metal uses flash-style attention over a paged KV cache laid out as <code class="language-plaintext highlighter-rouge">[total_tokens, H, D]</code>: tokens from every sequence are packed onto a single flat token dimension, with <code class="language-plaintext highlighter-rouge">cu_seqlens</code> marking sequence boundaries. mlx_lm’s cache is 4D; vllm-metal’s view is 3D, with the <code class="language-plaintext highlighter-rouge">[B, T]</code> axes collapsed into a single <code class="language-plaintext highlighter-rouge">[total_tokens]</code> axis. 
The vLLM scheduler drives the varlen layout. Same MLP, different attention.</p> <h2 id="why-the-cliff-exists-and-what-vllm-metal-does-instead">Why the cliff exists, and what vllm-metal does instead</h2> <p>The cliff is a property of the cache shape.</p> <p><strong>mlx_lm.</strong> The KV cache is a 4D tensor <code class="language-plaintext highlighter-rouge">[B, H, T, D]</code> with uniform <code class="language-plaintext highlighter-rouge">T</code>. MLX’s stock <code class="language-plaintext highlighter-rouge">scaled_dot_product_attention</code> requires this shape and has no <code class="language-plaintext highlighter-rouge">cu_seqlens</code>-style parameter; every sequence in the batch occupies <code class="language-plaintext highlighter-rouge">T_max</code> tokens whether it needs them or not. Prefill and decode run as mutually exclusive phases. A forward pass is either all prefill or all decode, never mixed. Per-step compute scales as $B \cdot L_{\max}$, so a batch with one long prompt and two short ones pays the long prompt’s price three times.</p> <p><strong>vllm-metal.</strong> The KV cache is paged: KV memory is sliced into fixed-size blocks indexed by a per-sequence block table, the way upstream vLLM does it. Inside the attention step, tokens from all sequences are laid out on a single flat token dimension, with <code class="language-plaintext highlighter-rouge">cu_seqlens</code> marking sequence boundaries. The scheduler is free to pack prefill chunks and decode tokens into the same forward pass. This is real continuous batching, not phase-separated pseudo-batching. Per-step compute scales as the total of real tokens, $\sum_b L_b$, so the 90,000 vs 35,010 imbalance from the case study collapses to 35,010. 
MLX’s stock SDPA accepts neither <code class="language-plaintext highlighter-rouge">cu_seqlens</code> nor a block table, so the attention path is a hand-written Metal kernel.</p> <p>The widget below shows the case-study batch as each layout stores it.</p> <style>.kvw{--kvw-s1:#47a;--kvw-s2:#098;--kvw-s3:#e73;--kvw-pad:#dcdee6;--kvw-ink:var(--global-text-color,#1a1a2e);--kvw-muted:var(--global-text-color-light,#5a5a72);--kvw-line:var(--global-divider-color,#e4e4ec);--kvw-out-bg:#eef0f5;background:var(--global-card-bg-color,#fff);border:1.5px solid var(--kvw-line);border-radius:12px;padding:16px 18px;margin:1.5em 0 .6em;color:var(--kvw-ink)}html[data-theme="dark"] .kvw{--kvw-pad:#44444f;--kvw-out-bg:#2b2b33}.kvw .ctl{display:flex;gap:4px 20px;flex-wrap:wrap}.kvw .ctl label{font-size:.88rem;white-space:nowrap}.kvw input[type=range]{vertical-align:middle;width:100px}.kvw .kvw-h{font-weight:700;margin:1.15em 0 .15em}.kvw .kvw-h code{font-weight:400}.kvw .hint{font-size:.85rem;color:var(--kvw-muted);margin:.1em 0 .55em}.kvw .lane{display:grid;grid-template-columns:72px minmax(0,1fr);gap:7px 10px;align-items:center}.kvw .lab{font-size:.78rem;font-weight:600;text-align:right;color:var(--kvw-muted);white-space:nowrap}.kvw .kvbar{display:flex;height:24px;border-radius:4px;overflow:hidden}.kvw .seg{flex-basis:0}.kvw .seg+.seg{border-left:1.5px solid var(--global-card-bg-color,#fff)}.kvw .formula{font:.82rem ui-monospace,SFMono-Regular,Menlo,monospace;background:var(--kvw-out-bg);border-radius:6px;padding:6px 10px;margin:.55em 0 0;overflow-x:auto}.kvw .formula b{font-size:.9rem}.kvw .stat{font-size:.92rem;margin:1.05em 0 0;color:var(--kvw-muted)}.kvw .stat b{color:var(--kvw-ink)}</style> <div class="kvw"> <div class="ctl"> <label>Seq 1: <input type="range" id="kvw1" min="0" max="8" value="8"/> <b id="kvw1v">30,000</b></label> <label>Seq 2: <input type="range" id="kvw2" min="0" max="8" value="5"/> <b id="kvw2v">5,000</b></label> <label>Seq 3: <input type="range" id="kvw3" 
min="0" max="8" value="0"/> <b id="kvw3v">10</b></label> </div> <div class="hint">defaults = the case-study batch: 30,000 + 5,000 + 10 input tokens</div> <div class="kvw-h">mlx_lm — contiguous padded cache, shape <code>[B, H, T, D]</code></div> <div class="hint">every sequence left-padded to the longest (gray = padding); attention runs over the full rectangle</div> <div class="lane" id="kvwpad"></div> <div class="formula" id="kvwalloc"></div> <div class="kvw-h">vllm-metal — flat varlen cache, shape <code>[total_tokens, H, D]</code></div> <div class="hint">same batch packed end to end on one token axis; <code>cu_seqlens</code> marks the boundaries</div> <div class="lane" id="kvwflat"></div> <div class="formula" id="kvwflatinfo"></div> <div class="stat" id="kvwstat"></div> </div> <script>
(function(){
const S=[10,100,500,1000,2000,5000,10000,20000,30000]; // slider position (0–8) → sequence length in tokens
const C=['var(--kvw-s1)','var(--kvw-s2)','var(--kvw-s3)']; // one color per sequence
const $=id=>document.getElementById(id);
const fmt=n=>n.toLocaleString('en-US');
function seg(len,color,isPad){
  const d=document.createElement('div');
  d.className='seg'; d.style.background=color; d.style.flexGrow=len;
  if(!isPad) d.style.minWidth='5px'; // keep very short sequences (e.g. 10 tokens) visible
  return d;
}
function render(){
  const l=[1,2,3].map(i=>S[+$('kvw'+i).value]);
  [1,2,3].forEach(i=>$('kvw'+i+'v').textContent=fmt(l[i-1]));
  // padded layout allocates B × T_max slots (B = 3); packed layout stores only the real tokens
  const tmax=Math.max(...l), alloc=3*tmax, total=l[0]+l[1]+l[2];
  const pad=$('kvwpad'); pad.innerHTML='';
  l.forEach((len,i)=>{
    const lab=document.createElement('div');
    lab.className='lab'; lab.textContent='Seq '+(i+1);
    const tr=document.createElement('div');
    const row=document.createElement('div');
    row.className='kvbar'; row.style.width=(100*tmax/total)+'%'; // drawn relative to the packed strip, which spans 100%
    if(tmax-len>0) row.appendChild(seg(tmax-len,'var(--kvw-pad)',true));
    row.appendChild(seg(len,C[i],false));
    tr.appendChild(row); pad.appendChild(lab); pad.appendChild(tr);
  });
  $('kvwalloc').innerHTML='allocated: 3 × '+fmt(tmax)+' = <b>'+fmt(alloc)+'</b> KV slots';
  const flat=$('kvwflat'); flat.innerHTML='';
  const flab=document.createElement('div'); flab.className='lab'; flab.textContent='packed';
  const ftr=document.createElement('div');
  const frow=document.createElement('div'); frow.className='kvbar'; frow.style.width='100%';
  l.forEach((len,i)=>frow.appendChild(seg(len,C[i],false)));
  ftr.appendChild(frow); flat.appendChild(flab); flat.appendChild(ftr);
  const cu=[0,l[0],l[0]+l[1],total]; // cumulative lengths: sequence i occupies cu[i]..cu[i+1] on the flat axis
  $('kvwflatinfo').innerHTML='cu_seqlens = ['+cu.join(', ')+']<br>stored: '+
    l.map(fmt).join(' + ')+' = <b>'+fmt(total)+'</b> tokens';
  // waste = share of the padded rectangle that is padding; ratio = padded KV rows per packed KV row
  const waste=Math.round(100*(1-total/alloc)), ratio=alloc/total;
  $('kvwstat').innerHTML = waste<=5
    ? 'balanced lengths: the rectangle wastes only <b>'+waste+'%</b> — the gap opens when one sequence is much longer than the rest'
    : 'padding is <b>'+waste+'%</b> of the rectangle; each attention step touches <b>'+ratio.toFixed(1)+'×</b> the KV rows the packed strip does';
}
[1,2,3].forEach(i=>$('kvw'+i).addEventListener('input',render));
render();
})();
</script> <p align="center"><em>The case-study batch as each cache layout stores it. Drag a length: the padded rectangle re-pads everything to the new longest sequence; the packed strip grows only by the tokens actually added.</em></p> <p><strong>Speculative decoding falls out for free.</strong> Verification scores each sequence’s draft tokens in one forward pass: k + 1 query tokens per sequence, with k varying across the batch, so the batch is ragged on the query axis, not just in KV lengths. To a <code class="language-plaintext highlighter-rouge">cu_seqlens</code> kernel this is just another batch. Upstream vLLM’s FlashAttention backend runs prefill, decode, and verification through the same varlen attention call; drafts merely lengthen each request’s query span, and the spec-specific code only picks which logits to verify. After rejection, the diverging accepted prefixes are just new sequence lengths over the same paged store. A 4D padded cache has to re-pad both axes every step, and making that work is involved enough to be its own paper (<em><a href="https://arxiv.org/pdf/2510.22876">Batch Speculative Decoding Done Right</a></em>). vllm-metal has not shipped speculative decoding yet; the layout means it will arrive with no new kernel, just a different <code class="language-plaintext highlighter-rouge">cu_seqlens</code>.</p> <h2 id="on-the-benchmark">On the benchmark</h2> <div style="display: flex; justify-content: center;"> <img src="/assets/img/throughput_vllm_metal_vs_mlx_lm.png" alt="vllm-metal vs mlx_lm output throughput on SiliconBench chat and agent splits" style="max-width: 650px; width: 100%; height: auto;"/> </div> <p align="center"><em>Output throughput (Qwen3-0.6B BF16) on SiliconBench's chat and agent splits at concurrency 1, 8, 16. 
Hatched mlx_lm bars on the agent split mark partial-success runs (X/100 prompts returned a non-empty completion).</em></p> <style>.bench-table{overflow-x:auto;margin:0 0 1.4em}.bench-table table{width:100%;border-collapse:collapse;font-size:.92rem;line-height:1.4;font-variant-numeric:tabular-nums;border-top:2px solid var(--global-text-color,#1a1a2e);border-bottom:2px solid var(--global-text-color,#1a1a2e)}.bench-table th{font-weight:600;padding:.45em .75em;white-space:nowrap;border-bottom:1.5px solid var(--global-text-color,#1a1a2e)}.bench-table td{padding:.4em .75em}.bench-table th:first-child,.bench-table td:first-child{padding-left:.25em}.bench-table th:last-child,.bench-table td:last-child{padding-right:.25em}.bench-table td:first-child{font-weight:600}.bench-table tbody tr:nth-child(3) td{border-bottom:1px solid var(--global-divider-color,#e4e4ec)}</style> <p><strong>chat split</strong> (~1K input, max output 256)</p> <div class="bench-table"> <table> <thead> <tr> <th>Engine</th> <th style="text-align: right">c</th> <th style="text-align: right">Success</th> <th style="text-align: right">TTFT p50 (ms)</th> <th style="text-align: right">Throughput (tok/s)</th> <th style="text-align: right">Wall (s)</th> </tr> </thead> <tbody> <tr> <td>mlx_lm</td> <td style="text-align: right">1</td> <td style="text-align: right">100/100</td> <td style="text-align: right">201</td> <td style="text-align: right">74.7</td> <td style="text-align: right">64.5</td> </tr> <tr> <td>mlx_lm</td> <td style="text-align: right">8</td> <td style="text-align: right">100/100</td> <td style="text-align: right">1382</td> <td style="text-align: right">71.1</td> <td style="text-align: right">66.8</td> </tr> <tr> <td>mlx_lm</td> <td style="text-align: right">16</td> <td style="text-align: right">100/100</td> <td style="text-align: right">2014</td> <td style="text-align: right">59.5</td> <td style="text-align: right">77.1</td> </tr> <tr> <td>vllm-metal</td> <td style="text-align: right">1</td> <td 
style="text-align: right">100/100</td> <td style="text-align: right">115</td> <td style="text-align: right">50.2</td> <td style="text-align: right">92.7</td> </tr> <tr> <td>vllm-metal</td> <td style="text-align: right">8</td> <td style="text-align: right">100/100</td> <td style="text-align: right">133</td> <td style="text-align: right">145.5</td> <td style="text-align: right">31.5</td> </tr> <tr> <td>vllm-metal</td> <td style="text-align: right">16</td> <td style="text-align: right">100/100</td> <td style="text-align: right">183</td> <td style="text-align: right">190.8</td> <td style="text-align: right">24.1</td> </tr> </tbody> </table> </div> <p><strong>agent split</strong> (~4K input, max output 256)</p> <div class="bench-table"> <table> <thead> <tr> <th>Engine</th> <th style="text-align: right">c</th> <th style="text-align: right">Success</th> <th style="text-align: right">TTFT p50 (ms)</th> <th style="text-align: right">Throughput (tok/s)</th> <th style="text-align: right">Wall (s)</th> </tr> </thead> <tbody> <tr> <td>mlx_lm</td> <td style="text-align: right">1</td> <td style="text-align: right">70/100</td> <td style="text-align: right">628</td> <td style="text-align: right">30.1</td> <td style="text-align: right">-</td> </tr> <tr> <td>mlx_lm</td> <td style="text-align: right">8</td> <td style="text-align: right">70/100</td> <td style="text-align: right">2586</td> <td style="text-align: right">27.1</td> <td style="text-align: right">-</td> </tr> <tr> <td>mlx_lm</td> <td style="text-align: right">16</td> <td style="text-align: right">10/100</td> <td style="text-align: right">33825</td> <td style="text-align: right">4.2</td> <td style="text-align: right">-</td> </tr> <tr> <td>vllm-metal</td> <td style="text-align: right">1</td> <td style="text-align: right">100/100</td> <td style="text-align: right">560</td> <td style="text-align: right">22.4</td> <td style="text-align: right">269.9</td> </tr> <tr> <td>vllm-metal</td> <td style="text-align: right">8</td> <td 
style="text-align: right">100/100</td> <td style="text-align: right">612</td> <td style="text-align: right">54.8</td> <td style="text-align: right">107.2</td> </tr> <tr> <td>vllm-metal</td> <td style="text-align: right">16</td> <td style="text-align: right">100/100</td> <td style="text-align: right">736</td> <td style="text-align: right">73.8</td> <td style="text-align: right">80.4</td> </tr> </tbody> </table> </div> <p>SiliconBench is our benchmark harness for local LLM inference engines on Apple Silicon. It sends 100 prompts to each engine’s OpenAI-compatible API at three concurrency levels: c=1, 8, and 16, where c is the number of in-flight requests. The chat split is single-turn; the agent split uses multi-turn material.</p> <p>At c=1 mlx_lm has the higher throughput on both splits (74.7 vs. 50.2 tok/s on chat, 30.1 vs. 22.4 tok/s on agent), although vllm-metal’s median TTFT is already lower. By c=8 the data structure pays off: vllm-metal scales while mlx_lm flattens on chat and, by c=16, collapses on agent.</p> <p>The agent split also exposes a reliability cliff. mlx_lm returns zero tokens for 30 of the 100 long-input prompts at c=1 and c=8, and for 90 of them at c=16. This is the same padded-cache problem the M1 Pro experiment isolated, scaled to a benchmark. vllm-metal serves all 100 prompts at all three concurrency levels.</p> <h2 id="where-this-lands-in-the-ecosystem">Where this lands in the ecosystem</h2> <p>vLLM’s scheduler is the easy half: it emits a varlen schedule (<code class="language-plaintext highlighter-rouge">cu_seqlens</code> and block tables), but the schedule pays off only if that structure survives all the way down to the attention kernel. One repack anywhere in between and you are back to padded compute.</p> <p>All three MLX-based stacks (mlx_lm, omlx, vllm-mlx) converge at the same MLX call: <code class="language-plaintext highlighter-rouge">mx.fast.scaled_dot_product_attention</code>, which requires uniform <code class="language-plaintext highlighter-rouge">T</code> and has no <code class="language-plaintext highlighter-rouge">cu_seqlens</code> argument. 
vllm-mlx is the instructive case: vLLM’s varlen scheduler runs upstream, but a <code class="language-plaintext highlighter-rouge">_left_pad_prompts()</code> step at the kernel boundary repacks the batch into 4D padded form. The scheduler is doing varlen bookkeeping the kernel can’t use. llama.cpp takes a third path: its Metal flash-attention kernel supports varlen via an explicit attention mask over per-stream KV ring buffers, with <code class="language-plaintext highlighter-rouge">seq_id</code> deciding which tokens attend to which cells.</p> <div class="bench-table"> <table> <thead> <tr> <th>Engine</th> <th>Varlen</th> <th>KV layout</th> </tr> </thead> <tbody> <tr> <td>mlx_lm</td> <td>no</td> <td>4D padded <code class="language-plaintext highlighter-rouge">[B, H, T, D]</code></td> </tr> <tr> <td>omlx</td> <td>no</td> <td>4D padded <code class="language-plaintext highlighter-rouge">[B, H, T, D]</code></td> </tr> <tr> <td>vllm-mlx</td> <td>no</td> <td>4D padded <code class="language-plaintext highlighter-rouge">[B, H, T, D]</code></td> </tr> <tr> <td>llama.cpp</td> <td>yes (mask-based)</td> <td>3D per-stream ring buffer</td> </tr> <tr> <td>vllm-metal</td> <td>yes (<code class="language-plaintext highlighter-rouge">cu_seqlens</code>)</td> <td>3D flat <code class="language-plaintext highlighter-rouge">[total_tokens, H, D]</code></td> </tr> </tbody> </table> </div> <p>Of the five stacks audited, vllm-metal is the only one that pairs <code class="language-plaintext highlighter-rouge">cu_seqlens</code>-based varlen with a flat 3D KV layout. On NVIDIA this pairing is the de facto serving pattern: both vLLM and SGLang use it in production. Apple Silicon had no equivalent until recently: vllm-metal 0.2.0 (April 2026) is the first end-to-end serving framework on Apple Silicon to ship paged varlen attention. 
The data-structure choice each stack makes is what shows up at concurrency in the benchmark above.</p> <h2 id="citation">Citation</h2> <p>This blog post is part of the paper <em>SiliconBench: Speed, Memory, and Fidelity for LLM Inference on Apple Silicon</em>, coming soon to arXiv.</p>]]></content><author><name></name></author><category term="research"/><category term="LLM"/><category term="inference"/><category term="Apple Silicon"/><category term="systems"/><category term="vllm-metal"/><summary type="html"><![CDATA[Why we replaced mlx_lm's attention layer with paged varlen attention driven by the vLLM scheduler, and the deployment cliff that motivates it.]]></summary></entry><entry><title type="html">I turned my Timeular Tracker into a macOS macropad</title><link href="https://ranranhaoranzhang.com/blog/2026/timeular-macropad/" rel="alternate" type="text/html" title="I turned my Timeular Tracker into a macOS macropad"/><published>2026-04-10T00:00:00+00:00</published><updated>2026-04-10T00:00:00+00:00</updated><id>https://ranranhaoranzhang.com/blog/2026/timeular-macropad</id><content type="html" xml:base="https://ranranhaoranzhang.com/blog/2026/timeular-macropad/"><![CDATA[<p>Show and tell time.</p> <div style="display: flex; justify-content: center; gap: 1rem; flex-wrap: wrap;"> <img src="/assets/img/timeular-tracker.png" alt="Timeular Tracker on desk" style="max-width: 250px; height: auto; border-radius: 6px;"/> <img src="/assets/img/timeular-menubar.png" alt="Menu bar UI" style="max-width: 250px; height: auto; border-radius: 6px;"/> </div> <h2 id="what-it-does">What it does</h2> <p>I turned my Timeular Tracker (the 8-sided time-tracking dice) into a physical macropad for macOS. Flip the dice, switch macOS desktops. Side 1 is my main workspace, side 2 is a side project, side 3 is another. No keyboard shortcut to remember, just a satisfying physical flip. 
Every side can also be bound to launching an app, running a shell command, or opening a URL, all configured from a tiny menu bar popover.</p> <h2 id="how-it-got-built">How it got built</h2> <p>I had the Timeular charged but unused on my desk (I stopped paying the monthly subscription). I also had Claude Opus 4.6, and even though I don’t know Swift, Opus does.</p> <ol> <li><strong>Scan.</strong> Asked Claude to find the Timeular over Bluetooth. It picked Python to explore the device.</li> <li><strong>Prototype.</strong> Read the orientation signal on flip and wired it to shell commands. A working macropad in one script.</li> <li><strong>Go native.</strong> Rewrote the whole thing in Swift/SwiftUI as a proper macOS menu bar app with CoreBluetooth.</li> <li><strong>Ship it.</strong> Works.</li> </ol> <p>Source on <a href="https://github.com/WindChimeRan/ohmytimular">GitHub</a>.</p>]]></content><author><name></name></author><category term="side-projects"/><category term="hardware"/><category term="macOS"/><category term="Swift"/><category term="Claude"/><category term="hack"/><summary type="html"><![CDATA[Flipping an 8-sided dice to switch macOS desktops, built end to end with Claude Opus.]]></summary></entry><entry><title type="html">Rethinking vllm-metal’s Memory Budget for Apple Silicon</title><link href="https://ranranhaoranzhang.com/blog/2026/llm-inference-memory-allocation-apple-silicon/" rel="alternate" type="text/html" title="Rethinking vllm-metal’s Memory Budget for Apple Silicon"/><published>2026-02-14T10:00:00+00:00</published><updated>2026-02-14T10:00:00+00:00</updated><id>https://ranranhaoranzhang.com/blog/2026/llm-inference-memory-allocation-apple-silicon</id><content type="html" xml:base="https://ranranhaoranzhang.com/blog/2026/llm-inference-memory-allocation-apple-silicon/"><![CDATA[<h2 id="introduction">Introduction</h2> <p>LLM inference engines like <a href="https://github.com/vllm-project/vllm">vLLM</a> were designed for discrete NVIDIA GPUs, where 
GPU memory is a dedicated, isolated resource. The memory allocator can safely assume it owns nearly all of VRAM: upstream vLLM defaults to claiming 90% of total GPU memory (<code class="language-plaintext highlighter-rouge">gpu_memory_utilization=0.9</code>), and if another process has already taken enough that this target can’t be met, vLLM simply refuses to start.</p> <p>This assumption breaks on Apple Silicon. The M-series chips use a Unified Memory Architecture (UMA): CPU, GPU, and Neural Engine all share a single physical memory pool. There is no dedicated “VRAM” to claim. A Mac running <a href="https://github.com/vllm-project/vllm-metal">vllm-metal</a> is likely also running a browser, an IDE, and other applications, all competing for the same memory. Reserving 90% of total memory is not realistic in this scenario.</p> <p>But the problem isn’t just about lowering a threshold. We also want to support the case where a Mac <em>is</em> used as a dedicated inference server with no other significant memory consumers.</p> <p>The discussion here follows <a href="https://github.com/vllm-project/vllm-metal/issues/97">vllm-metal#97</a>.</p> <hr/> <h2 id="why-vllm-metal">Why vllm-metal</h2> <p>An NVIDIA RTX 5090 ships with 32 GB of GDDR7 at a $1,999 MSRP, roughly <strong>$62.47 per GB</strong>. A Mac Studio with the M3 Ultra and 512 GB of unified memory starts at $9,499, roughly <strong>$18.55 per GB</strong>, more than 3× cheaper. 
This is not an apple-to-Apple comparison (😉): CUDA VRAM is dedicated and faster, while unified memory is shared with the rest of the system and comes with lower bandwidth — which is precisely why the memory <em>allocator</em> matters so much on this platform.</p> <hr/> <h2 id="related-work">Related Work</h2> <p>To compare the three engines below, we use a common vocabulary:</p> <ul> <li><strong>memory hint</strong>: the value reported by Metal’s <code class="language-plaintext highlighter-rouge">recommendedMaxWorkingSetSize</code> — the OS’s suggestion for how much GPU memory a process should use. This is a static, per-process hint; it does not reflect memory consumed by other apps.</li> <li><strong>RAM cap</strong>: a software-imposed ceiling, typically a fraction of total system RAM (e.g., 2/3 or 3/4), intended to leave room for the OS and other apps.</li> <li><strong>utilization target</strong>: the fraction of the memory budget an engine attempts to claim (e.g., 0.9 means “use up to 90%”).</li> </ul> <h3 id="vllm">vLLM</h3> <p><strong><a href="https://github.com/vllm-project/vllm">vLLM</a></strong> follows a profile-and-claim strategy. On CUDA, it queries the GPU for total and free VRAM at startup, then claims <code class="language-plaintext highlighter-rouge">total_vram × utilization_target</code> (default 0.9). If the requested amount exceeds what is actually free (because another process is already using the GPU), vLLM fails immediately rather than proceeding with a smaller budget. The claimed memory is managed as a paged block pool (PagedAttention), where KV cache blocks are allocated and freed at page granularity as requests arrive and leave.</p> <p>GPU memory is filled in two stages. First, <strong>model weights</strong> are loaded onto the GPU. vLLM then runs a profiling forward pass with dummy inputs to measure peak non-KV memory usage (activations, temporaries). 
The remaining memory — <code class="language-plaintext highlighter-rouge">total_vram × utilization_target - weight_memory - profile_peak</code> — becomes the KV block pool:</p> \[\text{kv_budget} = \text{total_vram} \times \text{utilization} - \text{weight_memory} - \text{profile_peak}\] <h3 id="mistralrs">mistral.rs</h3> <p><strong><a href="https://github.com/EricLBuehler/mistral.rs">mistral.rs</a></strong> adapts the same two-phase approach for Apple Silicon but introduces a RAM cap to leave room for the OS and other applications sharing the same memory pool:</p> \[\text{ram_cap} = \begin{cases} \text{system_ram} \times 2/3 &amp; \text{if } \text{system_ram} \leq 36\text{ GB} \\\\ \text{system_ram} \times 3/4 &amp; \text{otherwise} \end{cases}\] <p>In <strong>Phase 1 (model weights)</strong>, mistral.rs uses the memory hint directly, computing <code class="language-plaintext highlighter-rouge">memory_hint - current_process_allocation</code> and reserving <code class="language-plaintext highlighter-rouge">max(available × 0.02, 512 MB)</code> as headroom, then greedily places layers on GPU until the budget is exhausted.</p> <p>In <strong>Phase 2 (KV cache)</strong>, the effective ceiling becomes <code class="language-plaintext highlighter-rouge">min(memory_hint, ram_cap)</code>. Since Apple typically reports a memory hint in the 66–75% range of system RAM — close to what the RAM cap already computes — the <code class="language-plaintext highlighter-rouge">min()</code> often selects the same value. 
The KV budget is then:</p> \[\text{kv_budget} = \min(\text{memory_hint},\; \text{ram_cap}) \times \text{utilization} - \text{used}\] <p>The two safety margins stack: 25–33% is reserved by the RAM cap, and a further 10% by the utilization target (default 0.9), leaving roughly 60–68% of system RAM as the effective ceiling before model weights are subtracted.</p> <h3 id="llamacpp">llama.cpp</h3> <p><strong><a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a></strong> uses the memory hint as its ceiling, <code class="language-plaintext highlighter-rouge">memory_hint - current_process_allocation</code>, with no RAM cap and no utilization target. Memory consumed by other apps is invisible; the engine has no system-wide pressure signal. On macOS 15+, a background thread requests buffer residency every 500 ms to prevent OS eviction, an acknowledgment that the OS may reclaim memory under pressure even after allocation succeeds.</p> <p>GPU memory is filled in three stages: <strong>model weights</strong> (last N layers offloaded to GPU, back-to-front), a <strong>KV cache</strong> (preallocated at a user-specified token capacity <code class="language-plaintext highlighter-rouge">n_ctx</code>), and a <strong>compute scratch buffer</strong> (worst-case activation buffer reused every forward pass). Unlike the paged designs above, llama.cpp’s KV cache is contiguous — sequences share a ring buffer with 1-token granularity and an explicit defrag pass. Total KV memory is committed at startup:</p> \[\text{KV}_{\text{total}} = n_{\text{kv_layers}} \times 2 \times n_{\text{kv_heads}} \times d_{\text{head}} \times n_{\text{ctx}} \times \text{dtype_size}\] <p>vLLM assumes exclusive ownership of a discrete memory pool. mistral.rs introduces Apple Silicon-specific caps for desktop coexistence. llama.cpp pre-commits a user-specified amount with no system-wide awareness. 
vllm-metal inherits from all three but needs to handle both the mixed-use desktop case and the dedicated server case, ideally without requiring the user to manually tune a single magic number.</p> <hr/> <h2 id="proposal">Proposal</h2> <p>vllm-metal has two KV cache paths, selected by <code class="language-plaintext highlighter-rouge">VLLM_METAL_USE_PAGED_ATTENTION</code>. Both are tracked in <a href="https://github.com/vllm-project/vllm-metal/issues/97">vllm-metal#97</a>.</p> <h3 id="path-1-contiguous-allocation-mlx">Path 1: Contiguous Allocation (MLX)</h3> <p>Today, the scheduler reasons in 16-token blocks and reports phantom block counts, but MLX allocates contiguous caches in 256-token steps. None of those blocks exist at runtime. The fix is to strip the paged bookkeeping and use mlx_lm’s <code class="language-plaintext highlighter-rouge">auto</code> behavior: each request allocates only the KV cache memory it needs via <code class="language-plaintext highlighter-rouge">make_prompt_cache()</code> and releases it when done. No upfront budget, no utilization target.</p> <h3 id="path-2-paged-allocation-vllm">Path 2: Paged Allocation (vLLM)</h3> <p>Path 2 is under active development on the <code class="language-plaintext highlighter-rouge">paged-attention-v3</code> branch (<a href="https://github.com/vllm-project/vllm-metal/issues/70">vllm-metal#70</a>). It maintains a global block pool backed by Metal buffers, aligned with upstream vLLM’s PagedAttention. <code class="language-plaintext highlighter-rouge">VLLM_METAL_MEMORY_FRACTION</code> controls how much system RAM the pool claims. At startup, the engine measures weight and activation memory, then fills the remaining budget with KV blocks:</p> \[\text{kv_budget} = \text{system_ram} \times \text{VLLM_METAL_MEMORY_FRACTION} - \text{weight_memory} - \text{profile_peak}\] <p>On a dedicated inference server, set the fraction high. 
On a desktop Mac sharing memory with a browser, IDE, and other applications, the fraction that is actually free will be lower. If the requested allocation exceeds free memory, the engine refuses to start and reports the available memory and the fraction the user would need to set. An environment variable (not a CLI flag) matches vllm-metal’s existing configuration pattern (<code class="language-plaintext highlighter-rouge">VLLM_METAL_PREFIX_CACHE</code>, etc.).</p> <h4 id="the-wired-collector">The Wired Collector</h4> <p>macOS distinguishes between pageable memory (swappable to disk) and wired memory (pinned in physical RAM). Metal GPU buffers are wired. Under memory pressure, macOS invokes the <strong>wired collector</strong>, a kernel mechanism that reclaims GPU wired memory by evicting Metal buffers, causing silent performance degradation or crashes.</p> <p>The mistral.rs community documented this behavior in <a href="https://github.com/EricLBuehler/mistral.rs/issues/1348">mistral.rs#1348</a>. The workaround: <code class="language-plaintext highlighter-rouge">sudo sysctl iogpu.disable_wired_collector=1</code>. The setting does not persist across reboots unless added to a startup script. For dedicated servers running Path 2 with an aggressive memory fraction, disabling the wired collector is recommended.</p>]]></content><author><name></name></author><category term="research"/><category term="LLM"/><category term="inference"/><category term="Apple Silicon"/><category term="systems"/><category term="proposal"/><summary type="html"><![CDATA[Exploring memory allocation strategies for LLM inference engines on Apple Silicon's unified memory architecture.]]></summary></entry></feed>