<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[llama.cpp dual RTX 3080 inference speedup benchmark: Qwen3.6-27B from 35 to 50 tok/s]]></title><description><![CDATA[<p dir="auto">Test method: local tests run by hermes, driven by DeepSeek<br />
Test date: 2026.05.26<br />
Note: the results below were AI-generated. The final measured ceiling for the dual 3080 setup is about 53 t/s; shared for reference.</p>
<hr />
<h2>1. Hardware</h2>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Component</th>
<th>Model</th>
</tr>
</thead>
<tbody>
<tr>
<td>CPU</td>
<td>Intel Core Ultra 7 265K (20 cores, 5.4GHz)</td>
</tr>
<tr>
<td>Motherboard</td>
<td>ASUS PRIME Z890-P WIFI</td>
</tr>
<tr>
<td>Memory</td>
<td>96GB DDR5 5600MHz (Corsair 2×48GB)</td>
</tr>
<tr>
<td><strong>GPU</strong></td>
<td><strong>2× NVIDIA RTX 3080 20GB (independent, no SLI, 40GB VRAM total)</strong></td>
</tr>
<tr>
<td>OS</td>
<td>WSL2 Ubuntu 24.04 (Windows 11 host)</td>
</tr>
</tbody>
</table>
<h2>2. Software</h2>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Item</th>
<th>Version / configuration</th>
</tr>
</thead>
<tbody>
<tr>
<td>llama.cpp</td>
<td>commit <code>95405ac</code> (CUDA backend, MTP support)</td>
</tr>
<tr>
<td>Build options</td>
<td><code>GGML_CUDA=ON GGML_CUDA_FA=ON</code></td>
</tr>
<tr>
<td>Model</td>
<td><a href="https://huggingface.co/bullerwins/Qwen3.6-27B-GGUF" rel="nofollow ugc">Qwen3.6-27B-Q4_K_M.gguf</a> (16GB, Q4_K_M quantization)</td>
</tr>
<tr>
<td>CUDA</td>
<td>Default (WSL2 driver passthrough)</td>
</tr>
</tbody>
</table>
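<h2>3. Test method</h2>
<ul>
<li>Fixed prompt: a 64-token question in Chinese (an introduction to the Transformer architecture)</li>
<li>Generation: 200 output tokens</li>
<li>The model is cold-loaded before each measurement (server restarted) so that a warm KV cache does not affect the results</li>
<li>Temperature <code>--temp 0</code>, for deterministic output</li>
<li>Speeds are read from the <code>timings.predicted_per_second</code> and <code>timings.prompt_per_second</code> values returned by llama-server</li>
</ul>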
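<p dir="auto">A minimal sketch of how the two timing fields could be collected, assuming a llama-server instance is listening on 127.0.0.1:8082 and serving the <code>/completion</code> endpoint. The helper names <code>extract_timing</code> and <code>bench_once</code> are hypothetical, the prompt text is a placeholder rather than the one used in these tests, and the JSON field is pulled out with <code>sed</code> only to avoid extra dependencies.</p>
<pre><code class="language-bash">#!/usr/bin/env bash
# Hypothetical helper: print the numeric value of one timing field
# from a llama-server JSON response read on stdin.
extract_timing() {
  local field="$1"
  # Match "field": number and print only the number (first match wins).
  sed -nE 's/.*"'"$field"'"[[:space:]]*:[[:space:]]*([0-9.]+).*/\1/p' | head -n 1
}

# Hypothetical helper: send one fixed request and print both speeds.
# Assumes llama-server is already running on port 8082 (not called here).
bench_once() {
  local resp
  resp="$(curl -s http://127.0.0.1:8082/completion \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "Explain the Transformer architecture.", "n_predict": 200, "temperature": 0}')"
  printf 'prompt: %s tok/s\n' "$(printf '%s' "$resp" | extract_timing prompt_per_second)"
  printf 'generation: %s tok/s\n' "$(printf '%s' "$resp" | extract_timing predicted_per_second)"
}

# Offline demonstration on a made-up response fragment (values are illustrative only).
sample='{"timings":{"prompt_per_second":12.5,"predicted_per_second":40.25}}'
printf '%s' "$sample" | extract_timing predicted_per_second
</code></pre>
<p dir="auto">Calling <code>bench_once</code> after each server restart would give one cold-load measurement per configuration.</p>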
<h2>4. Results</h2>
<h3>4.1 Baseline configuration</h3>
<pre><code class="language-bash">CUDA_SCALE_LAUNCH_QUEUES=4 ./llama-server \
  -m Qwen3.6-27B-Q4_K_M.gguf -ngl 99 \
  --host 127.0.0.1 --port 8082 -c 131072 --temp 0 \
  --spec-type draft-mtp --spec-draft-n-max 3 \
  --ubatch-size 1024 --batch-size 2048 \
  -fa on -ctk q4_0 -ctv q4_0
</code></pre>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Metric</th>
<th style="text-align:center">Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Prompt processing speed</td>
<td style="text-align:center">34.66 tok/s</td>
</tr>
<tr>
<td>Generation speed</td>
<td style="text-align:center"><strong>52.79 tok/s</strong></td>
</tr>
<tr>
<td>VRAM (GPU0 / GPU1)</td>
<td style="text-align:center">13.6 GB / 17.6 GB</td>
</tr>
</tbody>
</table>
<h3>4.2 Test A: adjusting tensor-split</h3>
<pre><code class="language-bash">... --tensor-split 4,5
</code></pre>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Metric</th>
<th style="text-align:center">Value</th>
<th style="text-align:center">Change</th>
</tr>
</thead>
<tbody>
<tr>
<td>Prompt processing speed</td>
<td style="text-align:center">33.89 tok/s</td>
<td style="text-align:center">-2.2%</td>
</tr>
<tr>
<td>Generation speed</td>
<td style="text-align:center">52.27 tok/s</td>
<td style="text-align:center">-1.0%</td>
</tr>
<tr>
<td>VRAM (GPU0 / GPU1)</td>
<td style="text-align:center">12.5 GB / 17.2 GB</td>
<td style="text-align:center">lower on both GPUs</td>
</tr>
</tbody>
</table>
<blockquote>
<p dir="auto">Conclusion: <code>--tensor-split 4,5</code> lowered VRAM usage on both GPUs (the gap between them stayed at roughly 4 to 5 GB), but it did not help speed. llama.cpp's default layer split is already close to optimal.</p>
</blockquote>
<h3>4.3 Test B: increasing the MTP draft count</h3>
<pre><code class="language-bash">... --spec-draft-n-max 5
</code></pre>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Metric</th>
<th style="text-align:center">Value</th>
<th style="text-align:center">Change</th>
</tr>
</thead>
<tbody>
<tr>
<td>Prompt processing speed</td>
<td style="text-align:center">34.47 tok/s</td>
<td style="text-align:center">-0.5%</td>
</tr>
<tr>
<td>Generation speed</td>
<td style="text-align:center">50.96 tok/s</td>
<td style="text-align:center"><strong>-3.5%</strong></td>
</tr>
<tr>
<td>VRAM (GPU0 / GPU1)</td>
<td style="text-align:center">14.1 GB / 16.9 GB</td>
<td style="text-align:center"></td>
</tr>
</tbody>
</table>
<blockquote>
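<p dir="auto">Conclusion: raising <code>--spec-draft-n-max</code> from 3 to 5 made generation slower. Likely cause: the MTP acceptance rate for Qwen3.6-27B saturates after the third token, so the extra draft tokens are wasted compute.</p>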
</blockquote>
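<p dir="auto">To illustrate why a longer draft can stop paying off, here is a rough model that was not part of these tests. It assumes every draft token is accepted independently with the same probability <code>a</code> (a simplification), in which case a verification step with <code>n</code> draft tokens yields <code>(1 - a^(n+1)) / (1 - a)</code> tokens on average. The name <code>expected_tokens</code> and the acceptance rate of 0.5 are illustrative.</p>
<pre><code class="language-bash">#!/usr/bin/env bash
# Hypothetical helper: expected tokens produced per verification step,
# assuming each draft token is accepted independently with probability a.
# Formula: (1 - a^(n+1)) / (1 - a); for a = 1 every draft is accepted, giving n + 1.
expected_tokens() {
  awk -v a="$1" -v n="$2" 'BEGIN {
    if (a == 1) { printf "%.3f\n", n + 1; exit }
    printf "%.3f\n", (1 - a ^ (n + 1)) / (1 - a)
  }'
}

# Compare draft lengths 3 and 5 at an illustrative acceptance rate of 0.5.
expected_tokens 0.5 3
expected_tokens 0.5 5
</code></pre>
<p dir="auto">Under this model the two calls print 1.875 and 1.969: two extra draft tokens add less than 0.1 expected tokens per step, while the drafting cost keeps growing with <code>n</code>.</p>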
<h3>4.4 Test C: combined MTP + ngram speculation</h3>
<pre><code class="language-bash">... --spec-type draft-mtp,ngram-mod \
    --spec-ngram-mod-n-max 5 --spec-ngram-mod-n-min 3
</code></pre>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Metric</th>
<th style="text-align:center">Value</th>
<th style="text-align:center">Change</th>
</tr>
</thead>
<tbody>
<tr>
<td>Prompt processing speed</td>
<td style="text-align:center">36.42 tok/s</td>
<td style="text-align:center">+5.1%</td>
</tr>
<tr>
<td>Generation speed</td>
<td style="text-align:center"><strong>53.02 tok/s</strong></td>
<td style="text-align:center">+0.4%</td>
</tr>
<tr>
<td>VRAM (GPU0 / GPU1)</td>
<td style="text-align:center">13.4 GB / 16.3 GB</td>
<td style="text-align:center"></td>
</tr>
</tbody>
</table>
<blockquote>
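<p dir="auto">Conclusion: combining MTP with ngram gives a small gain in generation speed. ngram helps more on highly repetitive text (code, lists); on ordinary text the result is close to MTP alone.</p>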
</blockquote>
<h3>4.5 Test D: increasing ubatch-size</h3>
<pre><code class="language-bash">... --ubatch-size 2048
</code></pre>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Metric</th>
<th style="text-align:center">Value</th>
<th style="text-align:center">Change</th>
</tr>
</thead>
<tbody>
<tr>
<td>Prompt processing speed</td>
<td style="text-align:center"><strong>45.51 tok/s</strong></td>
<td style="text-align:center"><strong>+31.3%</strong> <img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f680.png?v=d348ca29232" class="not-responsive emoji emoji-android emoji--rocket" style="height:23px;width:auto;vertical-align:middle" title="🚀" alt="🚀" /></td>
</tr>
<tr>
<td>Generation speed</td>
<td style="text-align:center">52.49 tok/s</td>
<td style="text-align:center">-0.6%</td>
</tr>
<tr>
<td>VRAM (GPU0 / GPU1)</td>
<td style="text-align:center">16.1 GB / 17.2 GB</td>
<td style="text-align:center"></td>
</tr>
</tbody>
</table>
<blockquote>
<p dir="auto">结论：&lt;u&gt;这是本次测试最重要的发现&lt;/u&gt;。<code>--ubatch-size 2048</code> 让 prompt 处理速度暴涨 31%，但生成速度几乎不变。原因分析：ubatch 增大后，GPU 在 prefill 阶段一次处理更多 token，kernel 启动开销摊分到更多 token 上，计算吞吐大幅提升。而自回归生成阶段 batch 天然为 1，不受 ubatch 影响。显存从 13.6G 升至 16.1G（GPU0），仍在 20GB 安全范围内，未 OOM。</p>
</blockquote>
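<p dir="auto">As a rough illustration of the batching argument: prefill works through the prompt in micro-batches of at most ubatch-size tokens, so the number of passes is the prompt length divided by the ubatch size, rounded up. The sketch below assumes the ubatch size divides the batch size; <code>prefill_passes</code> is a hypothetical helper name and the 8192-token prompt is only an example.</p>
<pre><code class="language-bash">#!/usr/bin/env bash
# Hypothetical helper: number of micro-batch passes needed to prefill a prompt,
# i.e. ceil(prompt_tokens / ubatch), using integer arithmetic.
prefill_passes() {
  local prompt_tokens="$1" ubatch="$2"
  echo $(( (prompt_tokens + ubatch - 1) / ubatch ))
}

# Illustrative 8192-token prompt at the two ubatch sizes compared in this test.
prefill_passes 8192 1024
prefill_passes 8192 2048
</code></pre>
<p dir="auto">Doubling the ubatch halves the pass count for this example (8 versus 4), which is where the amortized launch overhead comes from.</p>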
<h3>4.6 Test E: raising the CUDA launch queue setting</h3>
<pre><code class="language-bash">CUDA_SCALE_LAUNCH_QUEUES=8 ...
</code></pre>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Metric</th>
<th style="text-align:center">Value</th>
<th style="text-align:center">Change</th>
</tr>
</thead>
<tbody>
<tr>
<td>Prompt processing speed</td>
<td style="text-align:center">34.09 tok/s</td>
<td style="text-align:center">-1.6%</td>
</tr>
<tr>
<td>Generation speed</td>
<td style="text-align:center">52.56 tok/s</td>
<td style="text-align:center">≈ flat</td>
</tr>
<tr>
<td>VRAM (GPU0 / GPU1)</td>
<td style="text-align:center">13.6 GB / 17.6 GB</td>
<td style="text-align:center"></td>
</tr>
</tbody>
</table>
<blockquote>
<p dir="auto">Conclusion: in a single-user scenario, doubling the CUDA launch queue setting brings no significant gain. <code>CUDA_SCALE_LAUNCH_QUEUES=4</code> is sufficient.</p>
</blockquote>
<h3>4.7 Test F (best combination): ubatch-2048 + MTP+ngram</h3>
<pre><code class="language-bash">CUDA_SCALE_LAUNCH_QUEUES=4 ./llama-server \
  -m Qwen3.6-27B-Q4_K_M.gguf -ngl 99 \
  --host 127.0.0.1 --port 8082 -c 131072 --temp 0 \
  --spec-type draft-mtp,ngram-mod \
  --spec-draft-n-max 3 \
  --spec-ngram-mod-n-max 5 --spec-ngram-mod-n-min 3 \
  --ubatch-size 2048 --batch-size 2048 \
  -fa on -ctk q4_0 -ctv q4_0
</code></pre>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Metric</th>
<th style="text-align:center">Value</th>
<th style="text-align:center">Change</th>
</tr>
</thead>
<tbody>
<tr>
<td>Prompt processing speed</td>
<td style="text-align:center"><strong>50.17 tok/s</strong></td>
<td style="text-align:center"><strong>+44.7%</strong> <img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/2b50.png?v=d348ca29232" class="not-responsive emoji emoji-android emoji--star" style="height:23px;width:auto;vertical-align:middle" title="⭐" alt="⭐" /></td>
</tr>
<tr>
<td>Generation speed</td>
<td style="text-align:center"><strong>53.18 tok/s</strong></td>
<td style="text-align:center">+0.7%</td>
</tr>
<tr>
<td>VRAM (GPU0 / GPU1)</td>
<td style="text-align:center">16.1 GB / 17.2 GB</td>
<td style="text-align:center"></td>
</tr>
</tbody>
</table>
<h2>5. Summary of results</h2>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>#</th>
<th>Test</th>
<th style="text-align:center">Prompt (tok/s)</th>
<th style="text-align:center">Generation (tok/s)</th>
<th style="text-align:center">Prompt gain</th>
</tr>
</thead>
<tbody>
<tr>
<td>Baseline</td>
<td>Current configuration (MTP3 + ubatch1024)</td>
<td style="text-align:center">34.66</td>
<td style="text-align:center">52.79</td>
<td style="text-align:center">—</td>
</tr>
<tr>
<td>A</td>
<td><code>--tensor-split 4,5</code></td>
<td style="text-align:center">33.89</td>
<td style="text-align:center">52.27</td>
<td style="text-align:center">-2.2%</td>
</tr>
<tr>
<td>B</td>
<td><code>--spec-draft-n-max 5</code></td>
<td style="text-align:center">34.47</td>
<td style="text-align:center">50.96</td>
<td style="text-align:center">-0.5%</td>
</tr>
<tr>
<td>C</td>
<td><code>--spec-type draft-mtp,ngram-mod</code></td>
<td style="text-align:center">36.42</td>
<td style="text-align:center">53.02</td>
<td style="text-align:center">+5.1%</td>
</tr>
<tr>
<td>D</td>
<td><code>--ubatch-size 2048</code></td>
<td style="text-align:center"><strong>45.51</strong></td>
<td style="text-align:center">52.49</td>
<td style="text-align:center"><strong>+31.3%</strong></td>
</tr>
<tr>
<td>E</td>
<td><code>CUDA_SCALE_LAUNCH_QUEUES=8</code></td>
<td style="text-align:center">34.09</td>
<td style="text-align:center">52.56</td>
<td style="text-align:center">-1.6%</td>
</tr>
<tr>
<td><strong>F</strong></td>
<td><strong>ubatch2048 + MTP+ngram (best)</strong></td>
<td style="text-align:center"><strong>50.17</strong></td>
<td style="text-align:center"><strong>53.18</strong></td>
<td style="text-align:center"><strong>+44.7%</strong></td>
</tr>
</tbody>
</table>
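<p dir="auto">The gain column is each prompt speed relative to the 34.66 tok/s baseline. A minimal sketch of that arithmetic, with <code>pct_change</code> as a hypothetical helper name:</p>
<pre><code class="language-bash">#!/usr/bin/env bash
# Hypothetical helper: signed percentage change of a value relative to a baseline,
# rounded to one decimal place.
pct_change() {
  awk -v base="$1" -v val="$2" 'BEGIN { printf "%+.1f%%\n", (val - base) / base * 100 }'
}

# Prompt speed of tests D and F against the 34.66 tok/s baseline.
pct_change 34.66 45.51
pct_change 34.66 50.17
</code></pre>
<p dir="auto">These two calls print +31.3% and +44.7%, matching rows D and F.</p>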
<h2>6. Conclusions and recommendations</h2>
<h3>Best launch command</h3>
<pre><code class="language-bash">CUDA_SCALE_LAUNCH_QUEUES=4 /home/simon/llama.cpp/build/bin/llama-server \
  -m /path/to/Qwen3.6-27B-Q4_K_M.gguf \
  -ngl 99 --host 127.0.0.1 --port 8082 -c 131072 --temp 0 \
  --spec-type draft-mtp,ngram-mod \
  --spec-draft-n-max 3 \
  --spec-ngram-mod-n-max 5 --spec-ngram-mod-n-min 3 \
  --ubatch-size 2048 --batch-size 2048 \
  -fa on -ctk q4_0 -ctv q4_0
</code></pre>
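<h3>Key findings</h3>
<ol>
<li>
<p dir="auto"><strong><code>--ubatch-size 2048</code> is the biggest single optimization</strong>, raising prompt processing speed by 31-45%. Most llama.cpp tutorials leave this parameter at the default of 512 or a conservative 1024, but on this dual-GPU setup with 40GB of VRAM it ran at 2048 without hitting OOM.</p>
</li>
<li>
<p dir="auto"><strong>The MTP draft-n-max should not be too large</strong>. 3 draft tokens was the best value for Qwen3.6-27B. A larger value (5) reduced speed, most likely because draft acceptance has already saturated after the third token.</p>
</li>
<li>
<p dir="auto"><strong>VRAM is naturally unbalanced across the two GPUs</strong>. With the default layer split, GPU 0 uses 13.6GB and GPU 1 uses 17.6GB (a 4GB difference). <code>--tensor-split</code> changes how VRAM is divided, but has no noticeable effect on speed.</p>
</li>
<li>
<p dir="auto"><strong>Generation speed has a ceiling</strong>. On dual RTX 3080s, generation for Qwen3.6-27B (Q4) tops out at about 53 tok/s, apparently limited by the Ampere architecture's FP16 compute and by PCIe bandwidth. In the purely autoregressive generation phase, GPU utilization is inherently low.</p>
</li>
<li>
<p dir="auto"><strong>Prompt processing (prefill) is where a dual-GPU setup has the most room for optimization</strong>, because prefill can exploit batch parallelism and the full compute of both GPUs, whereas autoregressive generation can only run serially.</p>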
</li>
</ol>
<h3>Recommendations by use case</h3>
<ul>
<li><img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f4cc.png?v=d348ca29232" class="not-responsive emoji emoji-android emoji--pushpin" style="height:23px;width:auto;vertical-align:middle" title="📌" alt="📌" /> <strong>Long context / long prompts</strong>: prefer configuration F; the faster prefill greatly improves time to first token</li>
<li><img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f4cc.png?v=d348ca29232" class="not-responsive emoji emoji-android emoji--pushpin" style="height:23px;width:auto;vertical-align:middle" title="📌" alt="📌" /> <strong>Short chats / streaming</strong>: configuration F is still the best, but the gain shows up mainly in time to first token</li>
<li><img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f4cc.png?v=d348ca29232" class="not-responsive emoji emoji-android emoji--pushpin" style="height:23px;width:auto;vertical-align:middle" title="📌" alt="📌" /> <strong>Single GPU with less than 20GB VRAM</strong>: do not copy ubatch-size 2048 directly; step it up according to the VRAM headroom (1024 → 2048; I use 1280 myself, with no OOM errors)</li>
</ul>
]]></description><link>https://lcz.me/topic/315/llama.cpp-双-rtx-3080-推理加速实测-qwen3.6-27b-从-35-到-50-tok-s</link><generator>RSS for Node</generator><lastBuildDate>Sun, 31 May 2026 05:50:47 GMT</lastBuildDate><atom:link href="https://lcz.me/topic/315.rss" rel="self" type="application/rss+xml"/><pubDate>Tue, 26 May 2026 01:25:23 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to llama.cpp dual RTX 3080 inference speedup benchmark: Qwen3.6-27B from 35 to 50 tok/s on Wed, 27 May 2026 03:38:10 GMT]]></title><description><![CDATA[<p dir="auto"><img src="https://upload.lcz.me/uploads/a4b0c38b-9ada-40ef-889f-b396da2c3dd3.jpeg" alt="704ef76a-30a4-4919-aab7-390e026d8022-image.jpeg" class=" img-fluid img-markdown" /></p>
<p dir="auto">The screenshot above shows the metrics from my real-world work scenario.</p>
<p dir="auto">Trae + Roo Code calling a local Qwen3.6 27B with MTP, used to develop a project progress dashboard system that I need at work.</p>
]]></description><link>https://lcz.me/post/3880</link><guid isPermaLink="true">https://lcz.me/post/3880</guid><dc:creator><![CDATA[joker_chang]]></dc:creator><pubDate>Wed, 27 May 2026 03:38:10 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp dual RTX 3080 inference speedup benchmark: Qwen3.6-27B from 35 to 50 tok/s on Tue, 26 May 2026 10:56:11 GMT]]></title><description><![CDATA[<h2>Follow-up: testing the multimodal Qwen3.6-27B (ubatch-768)</h2>
<hr />
<p dir="auto"><strong>Test date: 2026.05.26</strong></p>
<p dir="auto">The earlier discussion showed that a larger ubatch-size speeds up prompt processing. Here is a counterintuitive finding: <strong>after switching to the multimodal version and lowering ubatch to 768, generation speed went up rather than down.</strong></p>
<h3>Background</h3>
<p dir="auto">Upgrading from the text-only version to the multimodal version (with image understanding) turns the model from 1 file into 3:</p>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>File</th>
<th style="text-align:center">Size</th>
</tr>
</thead>
<tbody>
<tr>
<td>Qwen3.6-27B-Q4_K_M.gguf (main model)</td>
<td style="text-align:center">17GB</td>
</tr>
<tr>
<td>mmproj-Qwen_Qwen3.6-27B-f16.gguf (vision encoder)</td>
<td style="text-align:center">884MB</td>
</tr>
<tr>
<td>mtp-Qwen_Qwen3.6-27B-Q8_0.gguf (MTP weights)</td>
<td style="text-align:center">3GB</td>
</tr>
</tbody>
</table>
<p dir="auto">That adds about 4GB of weights. GPU1 goes from 17.2GB to 17.5GB, and ubatch-size is lowered from 2048 to 768 (1024 causes OOM).</p>
<h3>Launch command</h3>
<pre><code class="language-bash">CUDA_SCALE_LAUNCH_QUEUES=4 /home/simon/llama.cpp/build/bin/llama-server \
  -m /home/simon/models/Qwen3.6-27B-Q4_K_M.gguf \
  --mmproj /home/simon/models/mmproj-Qwen_Qwen3.6-27B-f16.gguf \
  -ngl 99 --host 127.0.0.1 --port 8082 -c 131072 --temp 0 \
  --spec-type draft-mtp,ngram-mod \
  --spec-draft-model /home/simon/models/mtp-Qwen_Qwen3.6-27B-Q8_0.gguf \
  --spec-draft-n-max 3 \
  --spec-ngram-mod-n-max 5 --spec-ngram-mod-n-min 3 \
  --ubatch-size 768 --batch-size 2048 \
  -fa on -ctk q4_0 -ctv q4_0
</code></pre>
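<p dir="auto">Note: the multimodal version additionally requires <code>--mmproj</code> and <code>--spec-draft-model</code> (the MTP weights are a separate file).</p>
<h3>Results</h3>
<p dir="auto">Each value is the median of three runs after a cold load:</p>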
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Metric</th>
<th style="text-align:center">Text-only (ubatch-2048)</th>
<th style="text-align:center">Multimodal (ubatch-768)</th>
<th style="text-align:center">Change</th>
</tr>
</thead>
<tbody>
<tr>
<td>Prompt processing speed</td>
<td style="text-align:center">50.17 tok/s</td>
<td style="text-align:center">180~380 tok/s</td>
<td style="text-align:center">depends on prompt length</td>
</tr>
<tr>
<td><strong>Generation speed</strong></td>
<td style="text-align:center"><strong>53.18 tok/s</strong></td>
<td style="text-align:center"><strong>~55 tok/s</strong></td>
<td style="text-align:center"><strong>+3%</strong></td>
</tr>
<tr>
<td>MTP acceptance rate</td>
<td style="text-align:center">—</td>
<td style="text-align:center">77~85%</td>
<td style="text-align:center">—</td>
</tr>
<tr>
<td>GPU0 VRAM</td>
<td style="text-align:center">16.1 GB</td>
<td style="text-align:center">14.9 GB</td>
<td style="text-align:center">↓</td>
</tr>
<tr>
<td>GPU1 VRAM</td>
<td style="text-align:center">17.2 GB</td>
<td style="text-align:center">17.5 GB</td>
<td style="text-align:center">↑</td>
</tr>
</tbody>
</table>
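<p dir="auto">For reference, taking the median of three runs can be scripted as below. This is only a sketch: <code>median</code> is a hypothetical helper, and the three input values are made up for illustration rather than taken from these tests.</p>
<pre><code class="language-bash">#!/usr/bin/env bash
# Hypothetical helper: median of the numeric arguments (middle value after sorting;
# for an even count, the lower of the two middle values).
median() {
  printf '%s\n' "$@" | sort -n | awk '{ v[NR] = $1 } END { print v[int((NR + 1) / 2)] }'
}

# Three made-up generation speeds in tok/s (not measurements from this thread).
median 54.2 55.1 54.8
</code></pre>
<p dir="auto">Using the median rather than the mean keeps a single slow or fast outlier run from shifting the reported number.</p>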
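<h3>Observations</h3>
<ol>
<li>
<p dir="auto"><strong>With ubatch lowered to 768, generation speed is 54~55 tok/s</strong>, slightly higher than the earlier 53 tok/s. My guess is that the MTP weights are laid out better in a separate file, or that ngram is complementing MTP. (It could also be within measurement error, but repeated tests on the same day gave consistent results.)</p>
</li>
<li>
<p dir="auto"><strong>Prompt speed fluctuates</strong>: short prompt (17 tokens) → 28 tok/s, long prompt (55 tokens) → 378 tok/s. Because of CUDA kernel launch overhead, the longer the prompt, the closer it gets to the text-only version's best case.</p>
</li>
<li>
<p dir="auto"><strong>VRAM is near the limit, but there is no need to worry</strong>: GPU1 uses 17.5GB of 20GB, and as long as ubatch stays below 1024 there is no OOM. The 128K context also works normally.</p>
</li>
<li>
<p dir="auto"><strong>About 3 extra minutes of debugging buys image understanding</strong>, which is good value.</p>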
</li>
</ol>
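<h3>Relation to the earlier conclusions</h3>
<p dir="auto">The earlier advice to "raise ubatch-size as far as possible" <strong>still holds</strong>; the multimodal version simply needs a lower ubatch to make room in VRAM for the extra files. The good news is that <strong>lowering ubatch has a negligible effect on generation speed</strong>. What it mainly costs is time to first token and prompt processing throughput.</p>
<p dir="auto">If you are also running a multimodal model, <strong>start ubatch at 512 and raise it gradually until OOM, then back off one step</strong>; generation speed will not suffer.</p>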
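<p dir="auto">The "raise until OOM, then back off one step" procedure can be sketched as a small loop. Everything here is an assumption for illustration: <code>find_max_ubatch</code> and <code>fake_probe</code> are hypothetical names, and the stand-in probe only simulates an OOM above 768 instead of actually launching llama-server.</p>
<pre><code class="language-bash">#!/usr/bin/env bash
# Hypothetical helper: try each ubatch size in ascending order using a probe command
# and print the last size for which the probe succeeded (stops at the first failure).
# The probe receives the size as its only argument and must return 0 on success.
find_max_ubatch() {
  local probe="$1" best="" size
  shift
  for size in "$@"; do
    if "$probe" "$size"; then
      best="$size"
    else
      break
    fi
  done
  echo "$best"
}

# Stand-in probe for demonstration: pretends that anything above 768 runs out of memory.
# A real probe would start llama-server with --ubatch-size "$1" and check that it loads.
fake_probe() {
  [ "$1" -le 768 ]
}

find_max_ubatch fake_probe 512 640 768 1024 2048
</code></pre>
<p dir="auto">With the stand-in probe this prints 768. Swapping in a probe that really starts the server turns it into an automated version of the manual search.</p>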
]]></description><link>https://lcz.me/post/3781</link><guid isPermaLink="true">https://lcz.me/post/3781</guid><dc:creator><![CDATA[rock shi]]></dc:creator><pubDate>Tue, 26 May 2026 10:56:11 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp dual RTX 3080 inference speedup benchmark: Qwen3.6-27B from 35 to 50 tok/s on Tue, 26 May 2026 05:36:38 GMT]]></title><description><![CDATA[<p dir="auto">Using the official release llama-b9329-bin-win-cuda-12.4-x64.<br />
Launch parameters:</p>
<pre><code>--reasoning off ^
--n-gpu-layers -1 ^
--ctx-size 131072 ^
--batch-size 2048 ^
--ubatch-size 1024 ^
--flash-attn on ^
--cache-type-k q4_0 ^
--cache-type-v q4_0 ^
--spec-type draft-mtp,ngram-mod ^
--spec-draft-n-max 3 ^
--spec-ngram-mod-n-max 5 ^
--spec-ngram-mod-n-min 3 ^
--temp 0.7 ^
--parallel 1
</code></pre>
<p dir="auto">Log from processing a 128KB md file:</p>
<pre><code>3.11.560.628 I srv  params_from_: Chat format: peg-native
3.11.562.098 I slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 101040854
3.11.562.100 I srv  get_availabl: updating prompt cache
3.11.564.872 W srv   prompt_save:  - saving prompt with length 12163, total state size = 411.405 MiB (draft: 47.744 MiB)
3.11.766.865 I srv          load:  - looking for better prompt, base f_keep = 0.000, sim = 0.000
3.11.766.874 I srv        update:  - cache state: 1 prompts, 596.347 MiB (limits: 8192.000 MiB, 131072 tokens, 167082 est)
3.11.766.877 I srv        update:    - prompt 0000029F0C14B3A0:   12163 tokens, checkpoints:  1,   596.347 MiB
3.11.766.879 I srv  get_availabl: prompt cache update took 204.78 ms
3.11.767.318 I slot launch_slot_: id  0 | task 1045 | processing task, is_child = 0
3.11.767.333 I slot update_slots: id  0 | task 1045 | Checking checkpoint with [8996, 8996] against 2...
3.11.767.336 W slot update_slots: id  0 | task 1045 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see <a href="https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055" rel="nofollow ugc">https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055</a>)
3.11.767.340 W slot update_slots: id  0 | task 1045 | erased invalidated context checkpoint (pos_min = 8996, pos_max = 8996, n_tokens = 8997, n_swa = 0, pos_next = 0, size = 184.942 MiB)
3.12.161.249 I slot create_check: id  0 | task 1045 | created context checkpoint 1 of 32 (pos_min = 361, pos_max = 361, n_tokens = 362, size = 151.047 MiB)
3.15.050.013 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =   4458, progress = 0.08, t =   3.28 s / 1358.04 tokens per second
3.16.509.158 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =   6506, progress = 0.12, t =   4.74 s / 1372.05 tokens per second
3.18.007.190 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =   8554, progress = 0.15, t =   6.24 s / 1370.87 tokens per second
3.19.532.959 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  10602, progress = 0.19, t =   7.77 s / 1365.25 tokens per second
3.21.088.746 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  12650, progress = 0.23, t =   9.32 s / 1357.09 tokens per second
3.22.683.336 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  14698, progress = 0.26, t =  10.92 s / 1346.47 tokens per second
3.24.307.549 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  16746, progress = 0.30, t =  12.54 s / 1335.39 tokens per second
3.25.964.943 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  18794, progress = 0.34, t =  14.20 s / 1323.75 tokens per second
3.27.650.395 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  20842, progress = 0.37, t =  15.88 s / 1312.22 tokens per second
3.29.372.484 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  22890, progress = 0.41, t =  17.61 s / 1300.19 tokens per second
3.31.133.380 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  24938, progress = 0.45, t =  19.37 s / 1287.72 tokens per second
3.32.933.422 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  26986, progress = 0.48, t =  21.17 s / 1274.96 tokens per second
3.34.766.091 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  29034, progress = 0.52, t =  23.00 s / 1262.42 tokens per second
3.36.628.129 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  31082, progress = 0.56, t =  24.86 s / 1250.24 tokens per second
3.38.523.583 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  33130, progress = 0.60, t =  26.76 s / 1238.22 tokens per second
3.40.449.198 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  35178, progress = 0.63, t =  28.68 s / 1226.49 tokens per second
3.42.421.302 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  37226, progress = 0.67, t =  30.65 s / 1214.39 tokens per second
3.44.426.882 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  39274, progress = 0.71, t =  32.66 s / 1202.53 tokens per second
3.46.471.948 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  41322, progress = 0.74, t =  34.70 s / 1190.68 tokens per second
3.48.549.836 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  43370, progress = 0.78, t =  36.78 s / 1179.09 tokens per second
3.50.662.193 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  45418, progress = 0.82, t =  38.89 s / 1167.71 tokens per second
3.52.837.951 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  47466, progress = 0.85, t =  41.07 s / 1155.72 tokens per second
3.55.019.000 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  49514, progress = 0.89, t =  43.25 s / 1144.79 tokens per second
3.57.260.487 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  51562, progress = 0.93, t =  45.49 s / 1133.40 tokens per second
3.59.525.014 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  53610, progress = 0.96, t =  47.76 s / 1122.54 tokens per second
4.00.773.859 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  54644, progress = 0.98, t =  49.01 s / 1115.04 tokens per second
4.00.916.683 I slot create_check: id  0 | task 1045 | created context checkpoint 2 of 32 (pos_min = 54643, pos_max = 54643, n_tokens = 54644, size = 364.122 MiB)
4.02.074.532 I slot print_timing: id  0 | task 1045 | prompt processing, n_tokens =  55668, progress = 1.00, t =  50.31 s / 1106.56 tokens per second
4.02.231.320 I slot create_check: id  0 | task 1045 | created context checkpoint 3 of 32 (pos_min = 55667, pos_max = 55667, n_tokens = 55668, size = 368.141 MiB)
4.02.295.988 I begin: ngram_mod occupancy = 48153/4194304 (0.01)
4.04.729.586 I slot print_timing: id  0 | task 1045 | n_decoded =    101, tg =  41.53 t/s
4.07.759.011 I slot print_timing: id  0 | task 1045 | n_decoded =    264, tg =  48.34 t/s
4.10.786.113 I slot print_timing: id  0 | task 1045 | n_decoded =    429, tg =  50.54 t/s
4.13.814.218 I slot print_timing: id  0 | task 1045 | n_decoded =    567, tg =  49.23 t/s
4.16.829.972 I slot print_timing: id  0 | task 1045 | n_decoded =    703, tg =  48.38 t/s
4.19.845.676 I slot print_timing: id  0 | task 1045 | n_decoded =    865, tg =  49.29 t/s
4.22.875.983 I slot print_timing: id  0 | task 1045 | n_decoded =   1008, tg =  48.98 t/s
4.25.920.495 I slot print_timing: id  0 | task 1045 | n_decoded =   1167, tg =  49.40 t/s
4.28.956.098 I slot print_timing: id  0 | task 1045 | n_decoded =   1326, tg =  49.74 t/s
4.31.977.717 I slot print_timing: id  0 | task 1045 | n_decoded =   1475, tg =  49.70 t/s
4.34.979.991 I slot print_timing: id  0 | task 1045 | n_decoded =   1619, tg =  49.54 t/s
4.37.989.947 I slot print_timing: id  0 | task 1045 | n_decoded =   1741, tg =  48.78 t/s
4.41.046.861 I slot print_timing: id  0 | task 1045 | n_decoded =   1864, tg =  48.10 t/s
4.43.158.621 I slot print_timing: id  0 | task 1045 | prompt eval time =   50530.12 ms / 55672 tokens (    0.91 ms per token,  1101.76 tokens per second)
4.43.158.626 I slot print_timing: id  0 | task 1045 |        eval time =   40860.81 ms /  1956 tokens (   20.89 ms per token,    47.87 tokens per second)
4.43.158.627 I slot print_timing: id  0 | task 1045 |       total time =   91390.93 ms / 57628 tokens
4.43.158.629 I slot print_timing: id  0 | task 1045 |    graphs reused =       1683
4.43.158.630 I slot print_timing: id  0 | task 1045 | draft acceptance = 0.63998 ( 1287 accepted /  2011 generated)
4.43.158.671 I statistics        ngram-mod: #calls(b,g,a) =    2   1706      2, #gen drafts =      2, #acc drafts =     2, #gen tokens =     10, #acc tokens =     6, dur(b,g,a) = 7.376, 3.912, 0.002 ms
4.43.158.678 I statistics        draft-mtp: #calls(b,g,a) =    2   1704   1704, #gen drafts =   1704, #acc drafts =  1393, #gen tokens =   5112, #acc tokens =  3364, dur(b,g,a) = 0.002, 12570.379, 3.195 ms
4.43.160.706 I slot      release: id  0 | task 1045 | stop processing: n_tokens = 57628, truncated = 0
4.43.160.751 I srv  update_slots: all slots are idle
</code></pre>
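<p dir="auto">The acceptance figure in the log can be recomputed from the accepted and generated counts. A small sketch, assuming the log text is piped in on stdin; <code>draft_acceptance</code> is a hypothetical helper name, and the pattern only targets the <code>draft acceptance</code> summary line format shown above.</p>
<pre><code class="language-bash">#!/usr/bin/env bash
# Hypothetical helper: read llama-server log text on stdin, find the
# "draft acceptance = ... ( A accepted / G generated)" line and print A / G.
draft_acceptance() {
  sed -nE 's/.*\([[:space:]]*([0-9]+) accepted \/[[:space:]]*([0-9]+) generated\).*/\1 \2/p' |
    head -n 1 |
    awk '{ printf "%.3f\n", $1 / $2 }'
}

# The summary line from the log above, with the timestamp prefix removed.
line='slot print_timing: id  0 | task 1045 | draft acceptance = 0.63998 ( 1287 accepted /  2011 generated)'
printf '%s\n' "$line" | draft_acceptance
</code></pre>
<p dir="auto">This prints 0.640, in line with the 0.63998 reported by the server. The same ratio for the <code>draft-mtp</code> statistics line (3364 accepted of 5112 generated tokens) is about 0.658.</p>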
<p dir="auto">Output token speed:<br />
<img src="https://upload.lcz.me/uploads/95d5a891-e395-4826-88e5-e4cdfdca6c55.jpeg" alt="9ed8a6f4-0cee-474b-adb2-fbab913c4a5c-image.jpeg" class=" img-fluid img-markdown" /></p>
<p dir="auto">GPU status while running:<br />
<img src="https://upload.lcz.me/uploads/89d4c8d7-0865-4be6-ad47-98ea9d93ebd5.jpeg" alt="6b05d352-6055-4012-ad97-445c19ccaf5c-image.jpeg" class=" img-fluid img-markdown" /></p>
]]></description><link>https://lcz.me/post/3733</link><guid isPermaLink="true">https://lcz.me/post/3733</guid><dc:creator><![CDATA[joker_chang]]></dc:creator><pubDate>Tue, 26 May 2026 05:36:38 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp dual RTX 3080 inference speedup benchmark: Qwen3.6-27B from 35 to 50 tok/s on Tue, 26 May 2026 05:28:24 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/rock-shi" aria-label="Profile: rock-shi">@<bdi>rock-shi</bdi></a> Thanks a lot, it really is faster!<br />
<img src="https://upload.lcz.me/uploads/443e2dd1-05ca-416f-bccc-2cd852215c1d.jpeg" alt="78dc1308-1fe5-445a-9baf-82c95509cf90-image.jpeg" class=" img-fluid img-markdown" /><br />
<img src="https://upload.lcz.me/uploads/93c82600-4bac-4f5c-913e-b78fcd77e0c3.jpeg" alt="bad64b42-4b5f-4f24-8c32-61e830c68468-image.jpeg" class=" img-fluid img-markdown" /></p>
]]></description><link>https://lcz.me/post/3732</link><guid isPermaLink="true">https://lcz.me/post/3732</guid><dc:creator><![CDATA[joker_chang]]></dc:creator><pubDate>Tue, 26 May 2026 05:28:24 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp dual RTX 3080 inference speedup benchmark: Qwen3.6-27B from 35 to 50 tok/s on Tue, 26 May 2026 04:52:11 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/joker_chang" aria-label="Profile: joker_chang">@<bdi>joker_chang</bdi></a> Just change <code>--ubatch-size 2048</code> to 1024; it has nothing to do with Windows.</p>
]]></description><link>https://lcz.me/post/3727</link><guid isPermaLink="true">https://lcz.me/post/3727</guid><dc:creator><![CDATA[rock shi]]></dc:creator><pubDate>Tue, 26 May 2026 04:52:11 GMT</pubDate></item><item><title><![CDATA[Reply to llama.cpp dual RTX 3080 inference speedup benchmark: Qwen3.6-27B from 35 to 50 tok/s on Tue, 26 May 2026 04:16:55 GMT]]></title><description><![CDATA[<pre><code class="language-bash">CUDA_SCALE_LAUNCH_QUEUES=4 /home/simon/llama.cpp/build/bin/llama-server \
  -m /path/to/Qwen3.6-27B-Q4_K_M.gguf \
  -ngl 99 --host 127.0.0.1 --port 8082 -c 131072 --temp 0 \
  --spec-type draft-mtp,ngram-mod \
  --spec-draft-n-max 3 \
  --spec-ngram-mod-n-max 5 --spec-ngram-mod-n-min 3 \
  --ubatch-size 2048 --batch-size 2048 \
  -fa on -ctk q4_0 -ctv q4_0
</code></pre>
<p dir="auto">With these parameters my 3090Ti won't run: it runs out of VRAM. Could this be a difference between Windows 10 and Windows 11?</p>
]]></description><link>https://lcz.me/post/3725</link><guid isPermaLink="true">https://lcz.me/post/3725</guid><dc:creator><![CDATA[joker_chang]]></dc:creator><pubDate>Tue, 26 May 2026 04:16:55 GMT</pubDate></item></channel></rss>