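<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[I tried MTP and TurboQuant]]></title><description><![CDATA[<p dir="auto">It feels like the 4090 24G card in my Linux server hasn't broken through any limit. I'm running llama.cpp and it is still the same 45 token/s as before, damn it. How did your tests go, Terry? Qwen3.6-27B is great for 养马 (literally "raising horses"), but inference is a bit slow.</p>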
]]></description><link>https://lcz.me/topic/58/我尝试了mtp和tuboquant</link><generator>RSS for Node</generator><lastBuildDate>Wed, 20 May 2026 07:04:19 GMT</lastBuildDate><atom:link href="https://lcz.me/topic/58.rss" rel="self" type="application/rss+xml"/><pubDate>Fri, 08 May 2026 06:58:11 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to I tried MTP and TurboQuant on Mon, 11 May 2026 19:39:43 GMT]]></title><description><![CDATA[<p dir="auto">Even a whiff of a 4090 24G is a good thing. It's the legendary card of the previous generation and will keep you going for a while.</p>
]]></description><link>https://lcz.me/post/1086</link><guid isPermaLink="true">https://lcz.me/post/1086</guid><dc:creator><![CDATA[williamlouis]]></dc:creator><pubDate>Mon, 11 May 2026 19:39:43 GMT</pubDate></item><item><title><![CDATA[Reply to I tried MTP and TurboQuant on Mon, 11 May 2026 17:41:44 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/%E9%AB%98%E4%B9%90%E5%A4%A9" aria-label="Profile: 高乐天">@<bdi>高乐天</bdi></a><br />
Thanks, friend. I'm also on an AI Max 395, and running qwen3.6-27b with Ollama I was only getting 12 T/s.<br />
With the method you described, the speed nearly doubled. Here are the detailed numbers for everyone's reference.<br />
Thanks again, <a class="plugin-mentions-user plugin-mentions-a" href="/user/%E9%AB%98%E4%B9%90%E5%A4%A9" aria-label="Profile: 高乐天">@<bdi>高乐天</bdi></a>!</p>
<p dir="auto">&lt;Current environment &amp; model&gt;</p>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Item</th>
<th>Details</th>
</tr>
</thead>
<tbody>
<tr>
<td>Model</td>
<td><code>qwen3.6-27b-mtp</code> (Qwen 3.6 27B + MTP speculative decoding)</td>
</tr>
<tr>
<td>Hardware</td>
<td>Ryzen AI Max+ 395 + Radeon 8060S integrated GPU</td>
</tr>
<tr>
<td>MTP draft setting</td>
<td>3</td>
</tr>
</tbody>
</table>
<p dir="auto">&lt;Latest benchmark results&gt;</p>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Stage</th>
<th>Tokens</th>
<th>Time</th>
<th>Speed</th>
</tr>
</thead>
<tbody>
<tr>
<td>Prompt processing</td>
<td>45 tokens</td>
<td>421 ms</td>
<td>~107 token/s</td>
</tr>
<tr>
<td>Token generation (MTP)</td>
<td>500 tokens</td>
<td>24.8 s</td>
<td>~20.2 token/s</td>
</tr>
<tr>
<td>Total</td>
<td>545 tokens</td>
<td>~25.2 s</td>
<td>~21.6 token/s</td>
</tr>
</tbody>
</table>
<p dir="auto">&lt;MTP speculative decoding efficiency&gt;</p>
<table class="table table-bordered table-striped">
<thead>
<tr>
<th>Metric</th>
<th>Value</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
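<td>Draft tokens</td>
<td>585</td>
<td>Total number of draft tokens produced by speculative decoding</td>
</tr>
<tr>
<td>Accepted</td>
<td>304</td>
<td>Draft tokens that passed verification, so they did not need a decoding step of their own</td>
</tr>
<tr>
<td>Acceptance rate</td>
<td>~52%</td>
<td>About half of the drafts were accepted, saving the separate decoding passes those tokens would otherwise have required</td>
</tr>
<tr>
<td>Estimated speedup</td>
<td>500 / 304 ≈ 1.64x</td>
<td>Rough estimate: about 1.6x theoretical speedup over plain serial generation without MTP</td>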
</tr>
</tbody>
</table>
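<p dir="auto">For anyone who wants to recheck the figures, here is a minimal Python sketch that recomputes the throughput and acceptance-rate numbers from the raw counts in the tables above. The function names are illustrative only and are not part of llama.cpp or Ollama. The last ratio uses the 12 T/s Ollama figure quoted above as the baseline, so it mixes the switch of runtime with the effect of MTP and should be read as a rough indication only.</p>

```python
# Raw counts are copied from the tables in this post.
# Function names are hypothetical helpers, not any library's API.

def tokens_per_second(tokens: int, seconds: float) -> float:
    """Throughput in tokens per second."""
    return tokens / seconds


def acceptance_rate(accepted: int, drafted: int) -> float:
    """Fraction of drafted tokens that the main model accepted."""
    return accepted / drafted


prompt_tps = tokens_per_second(45, 0.421)          # prompt processing
gen_tps = tokens_per_second(500, 24.8)             # generation with MTP
total_tps = tokens_per_second(545, 0.421 + 24.8)   # whole request
accept = acceptance_rate(304, 585)

# Assumption: 12 T/s (Ollama, no MTP) is used as the baseline, so this
# ratio also includes the effect of changing runtimes.
measured_speedup = gen_tps / 12.0

print(f"prompt:     {prompt_tps:.1f} token/s")
print(f"generation: {gen_tps:.1f} token/s")
print(f"total:      {total_tps:.1f} token/s")
print(f"acceptance: {accept:.0%}")
print(f"speedup vs 12 T/s baseline: {measured_speedup:.2f}x")
```

<p dir="auto">The measured ratio against the Ollama baseline comes out close to the 1.64x estimate in the table, which matches the "nearly doubled" impression.</p>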
]]></description><link>https://lcz.me/post/1066</link><guid isPermaLink="true">https://lcz.me/post/1066</guid><dc:creator><![CDATA[饲养员]]></dc:creator><pubDate>Mon, 11 May 2026 17:41:44 GMT</pubDate></item><item><title><![CDATA[Reply to I tried MTP and TurboQuant on Sun, 10 May 2026 15:42:09 GMT]]></title><description><![CDATA[<p dir="auto">MTP in llama.cpp really does work: my AI Max 395 runs qwen3.6-27b at 24 T/s.</p>
<p dir="auto">See this community thread:</p>
<p dir="auto"><a href="https://www.reddit.com/r/LocalLLaMA/comments/1t5ageq/qwen3627b_with_mtp_grafted_on_unsloth_ud_xl_25x/" rel="nofollow ugc">https://www.reddit.com/r/LocalLLaMA/comments/1t5ageq/qwen3627b_with_mtp_grafted_on_unsloth_ud_xl_25x/</a></p>
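<p dir="auto">The MTP branch has not been merged into the main branch yet. Known issues so far:</p>
<ol>
<li>Only np = 1 is supported (a single parallel slot)</li>
<li>Multimodal input is not supported yet</li>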
</ol>
<p dir="auto"><img src="https://upload.lcz.me/uploads/177ffb76-6a1c-48a0-9b32-970d874cdfc4.jpeg" alt="00af767d-8cbe-418a-a0a3-e15866ddabb1-image.jpeg" class=" img-fluid img-markdown" /></p>
]]></description><link>https://lcz.me/post/821</link><guid isPermaLink="true">https://lcz.me/post/821</guid><dc:creator><![CDATA[高乐天]]></dc:creator><pubDate>Sun, 10 May 2026 15:42:09 GMT</pubDate></item><item><title><![CDATA[Reply to I tried MTP and TurboQuant on Fri, 08 May 2026 22:21:05 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/laihzang619" aria-label="Profile: laihzang619">@<bdi>laihzang619</bdi></a> A properly configured vLLM definitely gives the highest token throughput, even higher than SGLang. Mine is completely untuned and still does slightly better than llama.cpp.</p>
]]></description><link>https://lcz.me/post/576</link><guid isPermaLink="true">https://lcz.me/post/576</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Fri, 08 May 2026 22:21:05 GMT</pubDate></item><item><title><![CDATA[Reply to I tried MTP and TurboQuant on Fri, 08 May 2026 15:32:13 GMT]]></title><description><![CDATA[<p dir="auto">I tested vLLM on a 3090 24G: enabling MTP ran out of VRAM, so it was unusable. 45 t/s on llama.cpp is pretty good; my vLLM only gets 34 t/s.</p>
]]></description><link>https://lcz.me/post/560</link><guid isPermaLink="true">https://lcz.me/post/560</guid><dc:creator><![CDATA[laihzang619]]></dc:creator><pubDate>Fri, 08 May 2026 15:32:13 GMT</pubDate></item><item><title><![CDATA[Reply to I tried MTP and TurboQuant on Fri, 08 May 2026 14:02:00 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/%E5%A2%99%E5%86%85%E4%BA%BA" aria-label="Profile: 墙内人">@<bdi>墙内人</bdi></a> It seems vLLM + MTP leaves only a very short context on a 24G card.</p>
]]></description><link>https://lcz.me/post/557</link><guid isPermaLink="true">https://lcz.me/post/557</guid><dc:creator><![CDATA[bily j]]></dc:creator><pubDate>Fri, 08 May 2026 14:02:00 GMT</pubDate></item><item><title><![CDATA[Reply to I tried MTP and TurboQuant on Fri, 08 May 2026 13:46:32 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/%E5%A2%99%E5%86%85%E4%BA%BA" aria-label="Profile: 墙内人">@<bdi>墙内人</bdi></a> Which GPU do you have?</p>
]]></description><link>https://lcz.me/post/554</link><guid isPermaLink="true">https://lcz.me/post/554</guid><dc:creator><![CDATA[bily j]]></dc:creator><pubDate>Fri, 08 May 2026 13:46:32 GMT</pubDate></item><item><title><![CDATA[Reply to I tried MTP and TurboQuant on Fri, 08 May 2026 11:25:21 GMT]]></title><description><![CDATA[<p dir="auto">MTP in vLLM definitely helps; I don't know about llama.cpp.</p>
]]></description><link>https://lcz.me/post/547</link><guid isPermaLink="true">https://lcz.me/post/547</guid><dc:creator><![CDATA[墙内人]]></dc:creator><pubDate>Fri, 08 May 2026 11:25:21 GMT</pubDate></item><item><title><![CDATA[Reply to I tried MTP and TurboQuant on Fri, 08 May 2026 10:50:27 GMT]]></title><description><![CDATA[<p dir="auto">Does llama.cpp need to be given as much context as it can take? Is it fine as long as nvidia-smi shows less than 24 GB in use? When an AI writes the configuration, it usually picks a very conservative context window.</p>
]]></description><link>https://lcz.me/post/546</link><guid isPermaLink="true">https://lcz.me/post/546</guid><dc:creator><![CDATA[bily j]]></dc:creator><pubDate>Fri, 08 May 2026 10:50:27 GMT</pubDate></item><item><title><![CDATA[Reply to I tried MTP and TurboQuant on Fri, 08 May 2026 10:47:52 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/%E5%A4%A7%E9%AD%94%E5%A4%B4" aria-label="Profile: 大魔头">@<bdi>大魔头</bdi></a> It feels pretty much useless.</p>
]]></description><link>https://lcz.me/post/545</link><guid isPermaLink="true">https://lcz.me/post/545</guid><dc:creator><![CDATA[bily j]]></dc:creator><pubDate>Fri, 08 May 2026 10:47:52 GMT</pubDate></item><item><title><![CDATA[Reply to I tried MTP and TurboQuant on Fri, 08 May 2026 08:03:55 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/bily-j" aria-label="Profile: bily-j">@<bdi>bily-j</bdi></a> What about vLLM? Give it a try. I won't be optimizing LLMs for a while; I need to work on a digital-human channel.</p>
]]></description><link>https://lcz.me/post/537</link><guid isPermaLink="true">https://lcz.me/post/537</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Fri, 08 May 2026 08:03:55 GMT</pubDate></item><item><title><![CDATA[Reply to I tried MTP and TurboQuant on Fri, 08 May 2026 07:37:48 GMT]]></title><description><![CDATA[<p dir="auto">llama.cpp can run MTP and TurboQuant now? I'll go look it up; I want to try it too.</p>
]]></description><link>https://lcz.me/post/533</link><guid isPermaLink="true">https://lcz.me/post/533</guid><dc:creator><![CDATA[大魔头]]></dc:creator><pubDate>Fri, 08 May 2026 07:37:48 GMT</pubDate></item></channel></rss>