<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享]]></title><description><![CDATA[<p dir="auto">最近刚入手了 7900xtx，本地跑llm, 为opencode, pi.dev 提供本地llm api 解决客户的代码隐私焦虑。</p>
<p dir="auto">花了亿点点时间跑了下性能，结果如下，供各位参考。流水账，先不贴llama-bench 结果了，太多。</p>
<p dir="auto">先发 老特 这里了，回头有空了再发个reddit<br />
回头等DFlash + HIP(ROCM) 成熟了再跑下看看。</p>
<h2>1. Rocm + turboquant,</h2>
<p dir="auto">repo: <a href="https://github.com/domvox/llama.cpp-turboquant-hip" rel="nofollow ugc">https://github.com/domvox/llama.cpp-turboquant-hip</a><br />
性能: 256k上下文,  pp: 970t/s tg: 29t/s<br />
Comment：目前测试，除了反应没在线api 快，生成代码的质量不比在线api 差。</p>
<pre><code>~/llama.cpp-turboquant-hip/rocm/llama-server -m ~/model/llm/qwen3.6-27b/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf   --mmproj ~/model/llm/qwen3.6-27b/mmproj-Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-f16.gguf   --alias qwen3.6-27b   --host 0.0.0.0   --port 8080   --n-gpu-layers 999   --ctx-size 262144   --batch-size 2048   --ubatch-size 768   --threads 8   --temp 1.0      --top-p 0.95     --top-k 20     --min-p 0.00   --presence_penalty 1.5   --cache-type-k turbo3   --cache-type-v turbo3
</code></pre>
<h2>2. Vulkan</h2>
<p dir="auto">repo: <a href="https://github.com/ggml-org/llama.cpp" rel="nofollow ugc">https://github.com/ggml-org/llama.cpp</a><br />
性能: 256k上下文, kv-cache-type: Q4_0,  pp: 730t/s tg: 47t/s, （Q8_0会慢一丢丢）</p>
<pre><code>~/Downloads/llama.cpp/vulkan/bin/llama-server -m ~/model/llm/qwen3.6-27b/Qwen3.6-27B-Q4_K_M-mtp.gguf   --alias qwen3.6-27b  --cache-type-k q4_0 --cache-type-v q4_0 -np 1 -c 262144 --temp 0.7 --top-k 20 -ngl 99   --port 8080 --host 0.0.0.0   -fa 1 -ub 256
</code></pre>
<h3>2.1 Vulkan + turboquant,</h3>
<p dir="auto">repo: <a href="https://github.com/TheTom/llama-cpp-turboquant" rel="nofollow ugc">https://github.com/TheTom/llama-cpp-turboquant</a><br />
性能: 256k上下文, kv-cache-type: Q4_0,  tg: 10t/s, decoding 时 GPU 使用率不到 30%，速度拉跨。开MTP 也 差不多。</p>
<pre><code>~/llama.cpp/build/bin/llama-server   -m ~/model/llm/qwen3.6-27b/Qwen3.6-27B-Q4_K_M-mtp.gguf   --alias qwen3.6-27b   --cache-type-k turbo3 --cache-type-v turbo3   -np 1 -c 262144 --temp 0.7 --top-k 20 -ngl 99   --port 8080 --host 0.0.0.0   -fa 1 -ub 256
</code></pre>
<h2>3. Vulkan + MTP</h2>
<p dir="auto">repo/pr:<br />
<a href="https://github.com/ggml-org/llama.cpp/pull/22673" rel="nofollow ugc">https://github.com/ggml-org/llama.cpp/pull/22673</a><br />
性能: 256k上下文, kv-cache-type: Q4_0,  pp: 730t/s tg: 67t/s， VRAM 占用跟不开MTP 差不多，</p>
<pre><code>~/Downloads/llama.cpp/vulkan/bin/llama-server -m ~/model/llm/qwen3.6-27b/Qwen3.6-27B-Q4_K_M-mtp.gguf   --alias qwen3.6-27b   --spec-type mtp --spec-draft-n-max 3   --cache-type-k q4_0 --cache-type-v q4_0 -np 1 -c 262144 --temp 0.7 --top-k 20 -ngl 99   --port 8080 --host 0.0.0.0   -fa 1 -ub 256
</code></pre>
<h2>3. Rocm + MTP</h2>
<p dir="auto">repo/pr: <a href="https://github.com/ggml-org/llama.cpp/pull/22673" rel="nofollow ugc">https://github.com/ggml-org/llama.cpp/pull/22673</a><br />
性能: 4k上下文, kv-cache-type: Q4_0,  pp: 730t/s tg: 67t/s<br />
Comment: Rocm的backend + MTP 有问题，VRAM 在开始 对话时 暴增 5G，具体原因不明，所以 最多8k上下文， Rocm目前的好处 是由 turbo quant 集成。</p>
<pre><code>~/llama.cpp/build/bin/llama-server   -m ~/model/llm/qwen3.6-27b/Qwen3.6-27B-Q4_K_M-mtp.gguf   --alias qwen3.6-27b   --spec-type mtp --spec-draft-n-max 3   --cache-type-k q4_0 --cache-type-v q4_0   -np 1 -c 4096 --temp 0.7 --top-k 20 -ngl 99   --port 8080 --host 0.0.0.0   -fa 1 -ub 256
</code></pre>
<h2>4.Hipfire (DFlash) v0.1.20</h2>
<p dir="auto">repo: <a href="https://github.com/Kaden-Schutt/hipfire" rel="nofollow ugc">https://github.com/Kaden-Schutt/hipfire</a><br />
性能: 4k上下文， pp: 930t/s tg: 46t/s，<br />
Comment: 只能chat聊天，速度很快，默认开启 DFlash, 但是 上下文8k 以上就会卡死，或者崩溃, 没法给 opencode 或者pi 使用，等三个月半年再看看。</p>
<h2>5. 老卡 P40 24G，</h2>
<p dir="auto">repo: <a href="https://github.com/TheTom/llama-cpp-turboquant" rel="nofollow ugc">https://github.com/TheTom/llama-cpp-turboquant</a><br />
pr:     <a href="https://github.com/ggml-org/llama.cpp/pull/22673" rel="nofollow ugc">https://github.com/ggml-org/llama.cpp/pull/22673</a></p>
<h5>不开MTP</h5>
<p dir="auto">性能: 196k 上下文，tg: 10t/s，</p>
<pre><code>~/llama.cpp-mtp/build/bin/llama-server -m ~/model/llm/qwen3.6-27b/Qwen3.6-27B-Q4_K_M-mtp.gguf   --alias qwen3.6-27b  --cache-type-k turbo3 --cache-type-v turbo3 -c 196608 --temp 0.7 --top-k 20 -ngl 99   --port 8080 --host 0.0.0.0   -fa 1 -ub 256
</code></pre>
<h5>开MTP</h5>
<p dir="auto">性能: 196k上下文，tg: 17t/s，</p>
<pre><code>~/llama-cpp-turboquant/build/bin/llama-server -m ~/model/llm/qwen3.6-27b/Qwen3.6-27B-Q4_K_M-mtp.gguf   --alias qwen3.6-27b   --spec-type mtp --spec-draft-n-max 3   --cache-type-k turbo3 --cache-type-v turbo3   -np 1 -c 196608 --temp 0.7 --top-k 20 -ngl 99   --port 8080 --host 0.0.0.0   -fa 1 -ub 256
</code></pre>
<hr />
<hr />
<h1>opencode + deepseek v4 帮我跑了一把，结果如下</h1>
<ul>
<li>如果追求性能 Vulkan + MTP 效果最好，</li>
<li>MTP的性能不是恒定的，不同的上下文或者任务，可能存在很大的差别，你让他写小说，规划日常，写代码，性能提升可能会不一样，跑分仅供参考。</li>
<li>MTP 目前只能单个对话session，没法并行。</li>
<li>Vuklan 后端对 Turbo quant的支持还有存在问题， GPU利用率不够，还得优化。</li>
<li>Rocm + MTP 存在 VRAM问题，会无端暴涨5G占用，导致跑起来最多8k多一点。</li>
</ul>
<h1>llama-bench 测试结果</h1>
<h2>环境</h2>
<ul>
<li><strong>MTP 模型</strong>: Qwen3.6-27B-Q4_K_M-mtp.gguf (15.82 GiB) <a href="https://huggingface.co/froggeric/Qwen3.6-27B-MTP-GGUF/" rel="nofollow ugc">https://huggingface.co/froggeric/Qwen3.6-27B-MTP-GGUF/</a></li>
<li><strong>非MTP 模型</strong>: Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf (17 GiB) <a href="https://huggingface.co/HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive" rel="nofollow ugc">https://huggingface.co/HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive</a></li>
<li><strong>GPU</strong>: AMD Radeon RX 7900 XTX (24,560 MiB 显存)</li>
<li><strong>CPU</strong>: Genuine Intel(R) 13900hk ES</li>
<li><strong>线程数</strong>: 8</li>
<li><strong>n-gpu-layers</strong>: 999 (完全卸载到 GPU)</li>
<li><strong>温度</strong>: 0.7, <strong>top-k</strong>: 20</li>
</ul>
<hr />
<h2>ROCm (HIP) - KV缓存类型对比 (非MTP)</h2>
<p dir="auto"><strong>二进制</strong>: <code>~/llama.cpp/rocm/bin/llama-bench</code> (build 9046)</p>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th style="text-align:left">KV缓存类型</th>
<th style="text-align:right">pp1024 (token/s)</th>
<th style="text-align:right">tg128 (token/s)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left">f16 (默认)</td>
<td style="text-align:right"><strong>904.50</strong></td>
<td style="text-align:right">28.99</td>
</tr>
<tr>
<td style="text-align:left">q4_0</td>
<td style="text-align:right">898.01</td>
<td style="text-align:right">28.81</td>
</tr>
</tbody>
</table>
<hr />
<h2>Vulkan - KV缓存类型对比 (非MTP)</h2>
<h3>标准构建 (<code>~/Downloads/llama.cpp/build-vulkan/bin/llama-bench</code>)</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th style="text-align:left">KV缓存类型</th>
<th style="text-align:right">pp512 (token/s)</th>
<th style="text-align:right">tg128 (token/s)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left">f16</td>
<td style="text-align:right">765.94</td>
<td style="text-align:right">37.06</td>
</tr>
<tr>
<td style="text-align:left">Q4_0</td>
<td style="text-align:right">769.82</td>
<td style="text-align:right">37.17</td>
</tr>
<tr>
<td style="text-align:left">Q8_0</td>
<td style="text-align:right">273.25</td>
<td style="text-align:right">37.13</td>
</tr>
</tbody>
</table>
<h3>Turboquant 构建 (<code>~/Downloads/llama-cpp-turboquant/build-vulkan/bin/llama-bench</code>)</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th style="text-align:left">KV缓存类型</th>
<th style="text-align:right">pp512 (token/s)</th>
<th style="text-align:right">tg128 (token/s)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left">turbo2</td>
<td style="text-align:right"><strong>193.43 ± 1.49</strong></td>
<td style="text-align:right">23.79 ± 0.17</td>
</tr>
<tr>
<td style="text-align:left">turbo3</td>
<td style="text-align:right">128.44 ± 1.31</td>
<td style="text-align:right">21.88 ± 0.14</td>
</tr>
<tr>
<td style="text-align:left">turbo4</td>
<td style="text-align:right">178.94 ± 2.03</td>
<td style="text-align:right">23.00 ± 0.25</td>
</tr>
</tbody>
</table>
<blockquote>
<p dir="auto">注意：turboquant 测试期间 GPU 使用率仅约 30%，未能充分利用 GPU。瓶颈可能在 CPU 端的量化/反量化操作。</p>
</blockquote>
<p dir="auto">q4_0/q8_0 在 turboquant 构建的 llama-bench 中仍然失败。</p>
<hr />
<h2>Vulkan + MTP</h2>
<p dir="auto"><strong>二进制</strong>: <code>~/llama.cpp/vulkan/bin/llama-cli</code><br />
<strong>命令</strong>: <code>--spec-type mtp --spec-draft-n-max 3 --parallel 1 -p "tell me a jok" -n 128 -ngl 999</code></p>
<blockquote>
<p dir="auto">注意：MTP 使用 <code>-np 1</code>（单并行序列），因此无法并行处理。草稿模型顺序执行，限制了吞吐量。</p>
</blockquote>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th style="text-align:left">配置</th>
<th style="text-align:right">生成速度 (token/s)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left">非MTP (f16)</td>
<td style="text-align:right">39.5</td>
</tr>
<tr>
<td style="text-align:left">MTP (q4_0)</td>
<td style="text-align:right"><strong>81.2</strong></td>
</tr>
<tr>
<td style="text-align:left">MTP (q8_0)</td>
<td style="text-align:right"><strong>77.5</strong></td>
</tr>
</tbody>
</table>
<hr />
<h2>ROCm + MTP</h2>
<p dir="auto"><strong>二进制</strong>: <code>~/llama.cpp/rocm/bin/llama-cli</code> 配合 <code>LD_LIBRARY_PATH</code></p>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th style="text-align:left">配置</th>
<th style="text-align:right">生成速度 (token/s)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left">非MTP (f16)</td>
<td style="text-align:right">29.4</td>
</tr>
<tr>
<td style="text-align:left">MTP (q4_0)</td>
<td style="text-align:right">53.6</td>
</tr>
<tr>
<td style="text-align:left">MTP (turbo3)</td>
<td style="text-align:right">47.4</td>
</tr>
<tr>
<td style="text-align:left">MTP (turbo4)</td>
<td style="text-align:right"><strong>57.2</strong></td>
</tr>
</tbody>
</table>
<hr />
<h2>总结</h2>
<h3>非MTP (llama-bench)</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th style="text-align:left">KV缓存类型</th>
<th style="text-align:right">pp (token/s)</th>
<th style="text-align:right">tg128 (token/s)</th>
<th style="text-align:left">后端</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left">f16</td>
<td style="text-align:right">904.50</td>
<td style="text-align:right">28.99</td>
<td style="text-align:left">ROCm (pp1024)</td>
</tr>
<tr>
<td style="text-align:left">q4_0</td>
<td style="text-align:right">898.01</td>
<td style="text-align:right">28.81</td>
<td style="text-align:left">ROCm (pp1024)</td>
</tr>
<tr>
<td style="text-align:left">f16</td>
<td style="text-align:right">765.94</td>
<td style="text-align:right">37.06</td>
<td style="text-align:left">Vulkan 标准 (pp512)</td>
</tr>
<tr>
<td style="text-align:left">Q4_0</td>
<td style="text-align:right">769.82</td>
<td style="text-align:right">37.17</td>
<td style="text-align:left">Vulkan 标准 (pp512)</td>
</tr>
<tr>
<td style="text-align:left">Q8_0</td>
<td style="text-align:right">273.25</td>
<td style="text-align:right">37.13</td>
<td style="text-align:left">Vulkan 标准 (pp512)</td>
</tr>
<tr>
<td style="text-align:left">turbo2</td>
<td style="text-align:right">193.43</td>
<td style="text-align:right">23.79</td>
<td style="text-align:left">Vulkan turboquant (pp512)</td>
</tr>
<tr>
<td style="text-align:left">turbo4</td>
<td style="text-align:right">178.94</td>
<td style="text-align:right">23.00</td>
<td style="text-align:left">Vulkan turboquant (pp512)</td>
</tr>
<tr>
<td style="text-align:left">turbo3</td>
<td style="text-align:right">128.44</td>
<td style="text-align:right">21.88</td>
<td style="text-align:left">Vulkan turboquant (pp512)</td>
</tr>
</tbody>
</table>
<h3>MTP (llama-cli)</h3>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th style="text-align:left">配置</th>
<th style="text-align:right">生成速度 (token/s)</th>
<th style="text-align:left">后端</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left">MTP (q4_0)</td>
<td style="text-align:right"><strong>81.2</strong></td>
<td style="text-align:left">Vulkan</td>
</tr>
<tr>
<td style="text-align:left">MTP (q8_0)</td>
<td style="text-align:right"><strong>77.5</strong></td>
<td style="text-align:left">Vulkan</td>
</tr>
<tr>
<td style="text-align:left">MTP (turbo4)</td>
<td style="text-align:right"><strong>57.2</strong></td>
<td style="text-align:left">ROCm</td>
</tr>
<tr>
<td style="text-align:left">MTP (q4_0)</td>
<td style="text-align:right">53.6</td>
<td style="text-align:left">ROCm</td>
</tr>
<tr>
<td style="text-align:left">MTP (turbo3)</td>
<td style="text-align:right">47.4</td>
<td style="text-align:left">ROCm</td>
</tr>
<tr>
<td style="text-align:left">非MTP (f16)</td>
<td style="text-align:right">39.5</td>
<td style="text-align:left">Vulkan</td>
</tr>
<tr>
<td style="text-align:left">非MTP (f16)</td>
<td style="text-align:right">29.4</td>
<td style="text-align:left">ROCm</td>
</tr>
</tbody>
</table>
<h3>关键观察</h3>
<ol>
<li><strong>ROCm 上的 q4_0</strong> 性能与 f16 几乎相同 (898 vs 905 token/s) — 差异可忽略。</li>
<li><strong>Turboquant 类型</strong> 仅适用于 turboquant Vulkan 构建。turbo2 的提示处理最快 (193 token/s @ pp512)。各 turbo 变体的生成速度相近 (~22-24 token/s)。</li>
<li><strong>标准 Vulkan 构建</strong> 支持 Q4_0/Q8_0 — Q4_0 与 f16 速度相当 (~770 token/s pp512)，Q8_0 提示处理慢约 2.8 倍 (273 token/s) 但生成速度相同 (~37 token/s)。Turbo 类型仅适用于 turboquant 构建。</li>
<li><strong>MTP</strong> 显著提升生成速度：Vulkan+q4_0 达到 <strong>81.2 token/s</strong>（比非MTP 提升 +106%），Vulkan+q8_0 达到 <strong>77.5 token/s</strong> (+96%)，ROCm+turbo4 达到 <strong>57.2 token/s</strong> (+95%)。</li>
</ol>
<p dir="auto"><a href="https://www.reddit.com/r/LocalLLM/comments/1ta42t1/llamacpp_turboquant_mtp_on_7900_xtx/" rel="nofollow ugc">reddit</a></p>
]]></description><link>https://lcz.me/topic/100/7900xtx-llama.cpp-qwen3.6-27b-turboquant-mtp-测试结果分享</link><generator>RSS for Node</generator><lastBuildDate>Wed, 20 May 2026 06:05:02 GMT</lastBuildDate><atom:link href="https://lcz.me/topic/100.rss" rel="self" type="application/rss+xml"/><pubDate>Mon, 11 May 2026 07:40:39 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Mon, 18 May 2026 03:00:11 GMT]]></title><description><![CDATA[<p dir="auto"><img src="https://upload.lcz.me/uploads/9d6e2612-488f-45c9-8bf5-40b4d4e6592c.jpeg" alt="5cfbd3e5-4dfc-4456-9395-5faf08254a33-image.jpeg" class=" img-fluid img-markdown" /><br />
有，但是huggingface会更多</p>
]]></description><link>https://lcz.me/post/2241</link><guid isPermaLink="true">https://lcz.me/post/2241</guid><dc:creator><![CDATA[David Zhang]]></dc:creator><pubDate>Mon, 18 May 2026 03:00:11 GMT</pubDate></item><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Mon, 18 May 2026 02:43:51 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/david-zhang" aria-label="Profile: david-zhang">@<bdi>david-zhang</bdi></a> Qwen3.6-27B-Q4_K_M-mtp.gguf这个是不是只有huggingface上有modelscope上找不到</p>
]]></description><link>https://lcz.me/post/2236</link><guid isPermaLink="true">https://lcz.me/post/2236</guid><dc:creator><![CDATA[张鑫磊]]></dc:creator><pubDate>Mon, 18 May 2026 02:43:51 GMT</pubDate></item><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Sun, 17 May 2026 15:39:41 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/devin-hi" aria-label="Profile: Devin-Hi">@<bdi>Devin-Hi</bdi></a> 改了之后呢？改进如何？我也想抄作业了。</p>
]]></description><link>https://lcz.me/post/2189</link><guid isPermaLink="true">https://lcz.me/post/2189</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Sun, 17 May 2026 15:39:41 GMT</pubDate></item><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Sun, 17 May 2026 02:58:55 GMT]]></title><description><![CDATA[<p dir="auto">感謝大神分享！好人一生平安</p>
]]></description><link>https://lcz.me/post/2054</link><guid isPermaLink="true">https://lcz.me/post/2054</guid><dc:creator><![CDATA[Chang Ching-Chun]]></dc:creator><pubDate>Sun, 17 May 2026 02:58:55 GMT</pubDate></item><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Fri, 15 May 2026 11:15:33 GMT]]></title><description><![CDATA[<p dir="auto">牛啊，大佬，学习了</p>
]]></description><link>https://lcz.me/post/1826</link><guid isPermaLink="true">https://lcz.me/post/1826</guid><dc:creator><![CDATA[xiaopbro]]></dc:creator><pubDate>Fri, 15 May 2026 11:15:33 GMT</pubDate></item><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Fri, 15 May 2026 09:18:37 GMT]]></title><description><![CDATA[<p dir="auto">22673测试下来windows下概率崩溃,找不到原因</p>
]]></description><link>https://lcz.me/post/1810</link><guid isPermaLink="true">https://lcz.me/post/1810</guid><dc:creator><![CDATA[asdqwe876]]></dc:creator><pubDate>Fri, 15 May 2026 09:18:37 GMT</pubDate></item><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Fri, 15 May 2026 00:48:11 GMT]]></title><description><![CDATA[<blockquote>
<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/david-zhang" aria-label="Profile: David-Zhang">@<bdi>David-Zhang</bdi></a> <a href="/post/1609">说</a>:</p>
<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/leon-y" aria-label="Profile: Leon-Y">@<bdi>Leon-Y</bdi></a> ollama是个玩具不是工具，换llama.cpp或者 vllm</p>
</blockquote>
<p dir="auto">果然上了llama.cpp，速度起飞，显卡风扇狂吼。</p>
]]></description><link>https://lcz.me/post/1720</link><guid isPermaLink="true">https://lcz.me/post/1720</guid><dc:creator><![CDATA[Leon Y]]></dc:creator><pubDate>Fri, 15 May 2026 00:48:11 GMT</pubDate></item><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Thu, 14 May 2026 13:23:18 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/leon-y" aria-label="Profile: Leon-Y">@<bdi>Leon-Y</bdi></a> ollama是个玩具不是工具，换llama.cpp或者 vllm</p>
]]></description><link>https://lcz.me/post/1609</link><guid isPermaLink="true">https://lcz.me/post/1609</guid><dc:creator><![CDATA[David Zhang]]></dc:creator><pubDate>Thu, 14 May 2026 13:23:18 GMT</pubDate></item><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Thu, 14 May 2026 12:33:00 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/terry" aria-label="Profile: terry">@<bdi>terry</bdi></a> 没有溢出，但极其接近100。目前96.4% 使用率，空闲不到 750 MB。系统日志也没有 GPU OOM 报错。</p>
]]></description><link>https://lcz.me/post/1582</link><guid isPermaLink="true">https://lcz.me/post/1582</guid><dc:creator><![CDATA[Leon Y]]></dc:creator><pubDate>Thu, 14 May 2026 12:33:00 GMT</pubDate></item><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Thu, 14 May 2026 12:16:33 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/leon-y" aria-label="Profile: Leon-Y">@<bdi>Leon-Y</bdi></a> 显存是不是溢出了？</p>
]]></description><link>https://lcz.me/post/1578</link><guid isPermaLink="true">https://lcz.me/post/1578</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Thu, 14 May 2026 12:16:33 GMT</pubDate></item><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Thu, 14 May 2026 12:11:27 GMT]]></title><description><![CDATA[<p dir="auto">我搞了个7900 XT 20GB, 用ollama 在跑qwen3.6:27b-q8_0，感觉很慢</p>
]]></description><link>https://lcz.me/post/1577</link><guid isPermaLink="true">https://lcz.me/post/1577</guid><dc:creator><![CDATA[Leon Y]]></dc:creator><pubDate>Thu, 14 May 2026 12:11:27 GMT</pubDate></item><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Thu, 14 May 2026 03:24:05 GMT]]></title><description><![CDATA[<blockquote>
<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/miraco" aria-label="Profile: Miraco">@<bdi>Miraco</bdi></a> <a href="/post/1467">说</a>:</p>
<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/fred" aria-label="Profile: Fred">@<bdi>Fred</bdi></a> 感谢您的指点。</p>
</blockquote>
<p dir="auto">不客气哈，论坛嘛就是自己知道点啥有空就贡献贡献。<br />
其实目前不建议除了prefill变慢，不支持多并发之外，还有个原因就是目前llama.cpp这个MTP分支还不支持--mmproj参数，不能支持图片识别。相当于没有多模态的能力了。如果对图片识别有需求的场景就根本无法用。<br />
当前社区大神还在做一个抽象层框架，把这些spec-decoding的技术都抽象出来，以便后续陆续在同一个框架内合入MTP/DFLASH这一类的功能。这些事情做完之前还不会合并。PR只是给爱折腾，有技术能力的兄弟尝尝鲜的。</p>
]]></description><link>https://lcz.me/post/1516</link><guid isPermaLink="true">https://lcz.me/post/1516</guid><dc:creator><![CDATA[Fred]]></dc:creator><pubDate>Thu, 14 May 2026 03:24:05 GMT</pubDate></item><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Thu, 14 May 2026 01:04:36 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/ken-huang" aria-label="Profile: ken-huang">@<bdi>ken-huang</bdi></a><br />
AMD用不不要用显卡坞，别问我怎么知道的<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f613.png?v=d348ca29232" class="not-responsive emoji emoji-android emoji--sweat" style="height:23px;width:auto;vertical-align:middle" title="😓" alt="😓" />，特么的真是折腾，英伟达只是小毛病，这个是一堆暗病。<br />
<img src="https://upload.lcz.me/uploads/4ef97c04-c198-4f72-8c20-b896839b9300.jpeg" alt="7900xtx戴尔笔记本显卡坞.jpeg" class=" img-fluid img-markdown" /></p>
]]></description><link>https://lcz.me/post/1504</link><guid isPermaLink="true">https://lcz.me/post/1504</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Thu, 14 May 2026 01:04:36 GMT</pubDate></item><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Thu, 14 May 2026 00:44:10 GMT]]></title><description><![CDATA[<p dir="auto">llama-benchy result:</p>
<pre><code>cd /var/home/deck/tmp/llama-benchy
uv run llama-benchy \
  --base-url http://127.0.0.1:8081/v1 \
  --model froggeric/Qwen3.6-27B-MTP-GGUF \
  --served-model-name Qwen3.6-27B-Q4_K_M-mtp.gguf \
  --tokenizer Qwen/Qwen3.6-27B \
  --pp 2048 --tg 32 \
  --depth 0 8192 32768 \
  --runs 1 --no-cache --latency-mode generation --skip-coherence \
  --save-result results/qwen36-27b-mtp-8081-sample-20260513.json --format json

Results:

| context depth | pp t/s | tg t/s | peak tg t/s | TTFR | est PPT |
|---|---:|---:|---:|---:|---:|
| 0 | 457.92 | 29.75 | 30.0 | 4693 ms | 4477 ms |
| 8192 | 432.96 | 28.24 | 29.0 | 23870 ms | 23654 ms |
| 32768 | 329.57 | 25.24 | 27.0 | 105856 ms | 105640 ms |

.venv/bin/llama-benchy \
  --base-url http://127.0.0.1:8081/v1 \
  --model Qwen/Qwen3.6-27B \
  --served-model-name Qwen_Qwen3.6-27B-Q4_K_M.gguf \
  --tokenizer Qwen/Qwen3.6-27B \
  --pp 2048 \
  --tg 32 \
  --depth 0 8192 32768 \
  --runs 1 \
  --latency-mode generation \
  --save-result results/qwen36-27b-original-8081-20260513T235739Z.json \
  --format json


Results:
| depth | pp t/s | tg t/s | TTFR ms |
|---|---:|---:|---:|
| 0 | 685.49 | 30.63 | 3190.39 |
| 8192 | 640.61 | 30.00 | 16184.55 |
| 32768 | 486.52 | 28.16 | 71766.55 |
</code></pre>
<p dir="auto">llama.cpp server config:</p>
<pre><code>    #MODEL="/run/media/deck/ExternalSSD/.llama.cpp/models/froggeric_Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-Q4_K_M-mtp.gguf"
    MODEL="/var/run/media/deck/ExternalSSD/.llama.cpp/models/Qwen_Qwen3.6-27B-GGUF/Qwen_Qwen3.6-27B-Q4_K_M.gguf"

      # cd "/var/home/deck/tmp/llama-pr-22673-mtp-clean/build-vulkan-pr22673/bin"
      cd "/var/home/deck/code/llama.cpp/build-vulkan/bin"

      export VK_LOADER_LAYERS_DISABLE=VK_LAYER_LS_frame_generation

      exec ./llama-server \
        -m "$MODEL" \
        -ngl 99 \
        -dev Vulkan0 \
        -fa on \
        -c 200000 \
        -ctk q4_0 \
        -ctv q4_0 \
        -ub 256 \
	--temp 0.2 \
	--top-k 20 \
	--parallel 1 \
        -rea off \
        --reasoning-budget 0 \
        --host "$HOST" \
        --port "$PORT"

       # MTP flags:
       #       --spec-type mtp 
       #       --spec-draft-n-max 2
</code></pre>
<p dir="auto">昨天测了一天感觉MTP打开没有变化（～30tok/s），用了几轮就会爆VRAM, 希望指正哪里出问题了。<br />
我是用beelink ser7 + eGPU 7900xtx + bazzite + hermes agent + discrod<br />
现在基本可以游戏/LLM随时切换, eGPU坑还是很多， 在等x99主板到装机</p>
<p dir="auto">eGPU坑：<br />
用all-ways-egpu可以点亮显卡+游戏<br />
Kfd/ROCm没发使用,试了setup时不去设置iGPU kfd就能用了，但是bazzite不能进game mode了，还在找最后解决方案<img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f602.png?v=d348ca29232" class="not-responsive emoji emoji-android emoji--joy" style="height:23px;width:auto;vertical-align:middle" title="😂" alt="😂" /></p>
]]></description><link>https://lcz.me/post/1496</link><guid isPermaLink="true">https://lcz.me/post/1496</guid><dc:creator><![CDATA[ken huang]]></dc:creator><pubDate>Thu, 14 May 2026 00:44:10 GMT</pubDate></item><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Wed, 13 May 2026 23:18:45 GMT]]></title><description><![CDATA[<p dir="auto">全是干货, 感谢分享!</p>
]]></description><link>https://lcz.me/post/1484</link><guid isPermaLink="true">https://lcz.me/post/1484</guid><dc:creator><![CDATA[michael gong]]></dc:creator><pubDate>Wed, 13 May 2026 23:18:45 GMT</pubDate></item><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Wed, 13 May 2026 17:36:24 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/fred" aria-label="Profile: Fred">@<bdi>Fred</bdi></a> 感谢您的指点。</p>
]]></description><link>https://lcz.me/post/1467</link><guid isPermaLink="true">https://lcz.me/post/1467</guid><dc:creator><![CDATA[Miraco]]></dc:creator><pubDate>Wed, 13 May 2026 17:36:24 GMT</pubDate></item><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Wed, 13 May 2026 14:37:34 GMT]]></title><description><![CDATA[<p dir="auto">贴子真是全全的干货。学习中</p>
]]></description><link>https://lcz.me/post/1418</link><guid isPermaLink="true">https://lcz.me/post/1418</guid><dc:creator><![CDATA[拐子001]]></dc:creator><pubDate>Wed, 13 May 2026 14:37:34 GMT</pubDate></item><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Wed, 13 May 2026 09:18:10 GMT]]></title><description><![CDATA[<p dir="auto">williamlouis 我也是啊，感觉现在developer 快要回家种地去了。<br />
api吧，deepseek v4真香</p>
]]></description><link>https://lcz.me/post/1382</link><guid isPermaLink="true">https://lcz.me/post/1382</guid><dc:creator><![CDATA[David Zhang]]></dc:creator><pubDate>Wed, 13 May 2026 09:18:10 GMT</pubDate></item><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Wed, 13 May 2026 07:31:29 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/miraco" aria-label="Profile: Miraco">@<bdi>Miraco</bdi></a> 小白先装opencode，让它给你搞。有问题先问它。</p>
]]></description><link>https://lcz.me/post/1361</link><guid isPermaLink="true">https://lcz.me/post/1361</guid><dc:creator><![CDATA[David Zhang]]></dc:creator><pubDate>Wed, 13 May 2026 07:31:29 GMT</pubDate></item><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Wed, 13 May 2026 07:30:19 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/t68823878" aria-label="Profile: t68823878">@<bdi>t68823878</bdi></a></p>
<p dir="auto">llama.cpp 好像最近才支持把，<br />
<a href="https://www.reddit.com/r/LocalLLaMA/comments/1svfjyv/fp4_inference_in_llamacpp_nvfp4_and_ik_llamacpp/" rel="nofollow ugc">https://www.reddit.com/r/LocalLLaMA/comments/1svfjyv/fp4_inference_in_llamacpp_nvfp4_and_ik_llamacpp/</a></p>
<p dir="auto">这个pr 刚merge，你用opencode让它给你弄，应该不难，让它给你调试，它会看模型是否合适还是哪里问题。<br />
<a href="https://github.com/ggml-org/llama.cpp/pull/22196" rel="nofollow ugc">https://github.com/ggml-org/llama.cpp/pull/22196</a></p>
]]></description><link>https://lcz.me/post/1360</link><guid isPermaLink="true">https://lcz.me/post/1360</guid><dc:creator><![CDATA[David Zhang]]></dc:creator><pubDate>Wed, 13 May 2026 07:30:19 GMT</pubDate></item><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Wed, 13 May 2026 07:28:41 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/david-zhang" aria-label="Profile: David-Zhang">@<bdi>David-Zhang</bdi></a> 我直接重度 api 。业务方向不一样。我是搞编程的。</p>
]]></description><link>https://lcz.me/post/1358</link><guid isPermaLink="true">https://lcz.me/post/1358</guid><dc:creator><![CDATA[williamlouis]]></dc:creator><pubDate>Wed, 13 May 2026 07:28:41 GMT</pubDate></item><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Wed, 13 May 2026 07:24:41 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/t68823878" aria-label="Profile: t68823878">@<bdi>t68823878</bdi></a> 你先把opencode(或者claude code, codex)装上，其他的应该都会简单很多。</p>
]]></description><link>https://lcz.me/post/1356</link><guid isPermaLink="true">https://lcz.me/post/1356</guid><dc:creator><![CDATA[David Zhang]]></dc:creator><pubDate>Wed, 13 May 2026 07:24:41 GMT</pubDate></item><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Wed, 13 May 2026 07:23:26 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/williamlouis" aria-label="Profile: williamlouis">@<bdi>williamlouis</bdi></a> 确实如此，A卡在两年前确实只能玩玩游戏，vulkan后端那会儿也不不大行，rocm更拉跨。但现在慢慢赶上了，生产力我觉得是可以的，环境搭好了，每天只管跑，也还稳定，算不上很快，但是确实能解决问题。 这性价比 我觉得不差，总之比intel家的新卡强很多，全靠直面参数了撑场面了。</p>
]]></description><link>https://lcz.me/post/1355</link><guid isPermaLink="true">https://lcz.me/post/1355</guid><dc:creator><![CDATA[David Zhang]]></dc:creator><pubDate>Wed, 13 May 2026 07:23:26 GMT</pubDate></item><item><title><![CDATA[Reply to 7900XTX + llama.cpp Qwen3.6 27B TurboQuant + MTP 测试结果分享 on Wed, 13 May 2026 07:17:58 GMT]]></title><description><![CDATA[<p dir="auto">VLLM_ATTENTION_BACKEND=FlashInfer VLLM_PROFILER_ESTIMATE_CUDAGRAPHS=1 python3 -m vllm.entrypoints.openai.api_server <br />
--model /models/qwen/Qwen3.6-27B-FP8 <br />
--trust-remote-code <br />
--max-model-len 131072 <br />
--kv-cache-dtype fp8_e4m3 <br />
--gpu-memory-utilization 0.58 <br />
--enable-chunked-prefill <br />
--enable-prefix-caching <br />
--max-num-batched-tokens 8192 <br />
--max-num-seqs 2 <br />
--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}' <br />
--served-model-name "Qwen-27B-FP8" <br />
--enable-auto-tool-choice <br />
--tool-call-parser qwen3_coder <br />
--reasoning-parser qwen3 <br />
--host 0.0.0.0 <br />
--port 8000</p>
<p dir="auto">半晚上研究，转向使用官方FP8模型，开启MTP，预测3字，基本上能够在90tk的速度。保证基础运行的情况下能够余下40GB左右的空间来搞comfyui，接下来就是继续研究怎么弄comfyui了。<br />
或者说是先研究hermes，然后让他帮我搞定comfyui，有没有大神给点建议？</p>
]]></description><link>https://lcz.me/post/1354</link><guid isPermaLink="true">https://lcz.me/post/1354</guid><dc:creator><![CDATA[t68823878]]></dc:creator><pubDate>Wed, 13 May 2026 07:17:58 GMT</pubDate></item></channel></rss>