<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[[Help] vLLM on a single RTX 3090 serving Qwen3.6-27B-INT4: enabling MTP speculative decoding triggers endless repetition (infinite loop)]]></title><description><![CDATA[<p dir="auto">Environment summary<br />
• GPU: one NVIDIA RTX 3090 (24GB)<br />
• OS/runtime: Ubuntu, Docker Compose, vLLM image vllm/vllm-openai:nightly-07351e0883470724dd5a7e9730ed10e01fc99d08<br />
• Model: Qwen3.6-27b-autoround-int4 (AutoRound INT4)<br />
• Goal: enable MTP (speculative decoding) within 24GB of VRAM to raise throughput, while keeping --reasoning-parser qwen3 parsing the &lt;think&gt; tags correctly</p>
<p dir="auto"><strong>Minimal reproduction command (launch argument excerpt)</strong><br />
command:</p>
<ul>
<li>--model</li>
<li>/root/.cache/huggingface/qwen3.6-27b-autoround-int4</li>
<li>--quantization</li>
<li>auto_round</li>
<li>--dtype</li>
<li>float16</li>
<li>--max-model-len</li>
<li>"48000"</li>
<li>--gpu-memory-utilization</li>
<li>"0.92"</li>
<li>--kv-cache-dtype</li>
<li>turboquant_3bit_nc</li>
<li>--reasoning-parser</li>
<li>qwen3</li>
<li>--speculative-config</li>
<li>'{"method":"mtp","num_speculative_tokens":3}'</li>
</ul>
<p dir="auto"><strong>Test request (curl, streaming)</strong><br />
curl -N <a href="http://10.10.10.81:8020/v1/chat/completions" rel="nofollow ugc">http://10.10.10.81:8020/v1/chat/completions</a> \<br />
-H "Content-Type: application/json" \<br />
-d '{<br />
"model": "qwen3.6-27b-autoround",<br />
"messages": [{"role": "user", "content": "9.11 和 9.8 哪个大？"}],<br />
"stream": true, "max_tokens": 1024, "temperature": 0.6<br />
}'</p>
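<p dir="auto"><strong>Observed abnormal behavior (key log excerpts)</strong><br />
• Generation speed: Avg generation throughput reaches 40–50 tokens/s with a Draft acceptance rate of 40%–60% (so MTP appears to be running at full speed)<br />
• Actual output: every streamed chunk stays stuck in the &lt;think&gt; phase and repeats meaninglessly, for example:<br />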
data: {"choices":[{"delta":{"reasoning":"的是"},"finish_reason":null}]}<br />
data: {"choices":[{"delta":{"reasoning":"的是"},"finish_reason":null}]}<br />
data: {"choices":[{"delta":{"reasoning":"的是的是"},"finish_reason":null}]}<br />
...<br />
data: {"choices":[{"delta":{"reasoning":"的是"},"finish_reason":"length"}]}</p>
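<p dir="auto">A minimal sketch for flagging this kind of repetition automatically while capturing the stream, so that runs with different settings can be compared without reading the output by eye. The helper names <code>reasoning_delta</code> and <code>looks_like_loop</code> and the thresholds are illustrative assumptions, not part of vLLM; only the <code>data:</code> line shape is taken from the log above.</p>

```python
import json


def reasoning_delta(sse_line):
    """Return the 'reasoning' text carried by one SSE 'data:' line, or None."""
    if not sse_line.startswith("data:"):
        return None
    payload = sse_line[len("data:"):].strip()
    if not payload or payload == "[DONE]":
        return None
    choices = json.loads(payload).get("choices") or [{}]
    return choices[0].get("delta", {}).get("reasoning")


def looks_like_loop(text, unit_max=8, min_repeats=4):
    """True if text ends with one short unit repeated min_repeats times.

    unit_max and min_repeats are arbitrary illustrative thresholds.
    """
    for size in range(1, unit_max + 1):
        need = size * min_repeats
        if len(text) >= need and text[-need:] == text[-size:] * min_repeats:
            return True
    return False
```

<p dir="auto">In use, the deltas would be appended to a running string and <code>looks_like_loop</code> checked after each chunk; the request can be aborted as soon as it returns True instead of waiting for <code>finish_reason: "length"</code>.</p>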
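<p dir="auto"><strong>• Reproduction conditions:</strong><br />
The loop triggers whenever --speculative-config '{"method":"mtp","num_speculative_tokens":3}' is enabled together with INT4 (turboquant_3bit_nc / 4bit_nc); disabling speculative decoding or switching back to higher precision (4bit/FP16) makes the problem disappear, but throughput drops.</p>
<p dir="auto"><strong>Already investigated but unresolved (to avoid duplicate suggestions)</strong><br />
• OOM / insufficient VRAM: ruled out (idle VRAM usage ~20.9G, runs stably)<br />
• Frontend/rendering issue: ruled out (the raw stream was captured directly with curl)<br />
• KV-cache dtype: changing turboquant_3bit_nc to turboquant_4bit_nc had no effect<br />
• Prompt influence: any input triggers it (from simple greetings to logic puzzles)<br />
• vLLM version: nightly build (the image hash above); older stable releases have not been tried</p>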
]]></description><link>https://lcz.me/topic/98/求助-vllm-单卡-3090-部署-qwen3.6-27b-int4-开启-mtp-投机采样触发无限复读-死循环</link><generator>RSS for Node</generator><lastBuildDate>Wed, 20 May 2026 07:04:19 GMT</lastBuildDate><atom:link href="https://lcz.me/topic/98.rss" rel="self" type="application/rss+xml"/><pubDate>Mon, 11 May 2026 02:32:41 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to [Help] vLLM on a single RTX 3090 serving Qwen3.6-27B-INT4: enabling MTP speculative decoding triggers endless repetition (infinite loop) on Mon, 11 May 2026 02:40:21 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/terry" aria-label="Profile: terry">@<bdi>terry</bdi></a> Thanks, I'll try that first. <img src="https://lcz.me/assets/plugins/nodebb-plugin-emoji/emoji/android/1f44d.png?v=d348ca29232" class="not-responsive emoji emoji-android emoji--+1" style="height:23px;width:auto;vertical-align:middle" title=":+1:" alt="👍" /></p>
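]]></description><link>https://lcz.me/post/935</link><guid isPermaLink="true">https://lcz.me/post/935</guid><dc:creator><![CDATA[ai]]></dc:creator><pubDate>Mon, 11 May 2026 02:40:21 GMT</pubDate></item><item><title><![CDATA[Reply to [Help] vLLM on a single RTX 3090 serving Qwen3.6-27B-INT4: enabling MTP speculative decoding triggers endless repetition (infinite loop) on Mon, 11 May 2026 02:35:58 GMT]]></title><description><![CDATA[<p dir="auto">Try changing num_speculative_tokens to 1 or 2.<br />
The most likely cause is that turboquant's precision has collapsed; try switching to an fp8 KV cache, since 24G of VRAM is enough for you. Speculative decoding and turboquant are both still immature, so use only one of them for now and don't get greedy.</p>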
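<p dir="auto">One way to work through these suggestions systematically is to enumerate the launch-argument combinations and test them one at a time. The sketch below only builds the argument lists; <code>bisect_variants</code> is a hypothetical helper, and treating <code>fp8</code> as the value to pass to --kv-cache-dtype is an assumption based on the "fp8 KV" suggestion above.</p>

```python
import json
from itertools import product


def bisect_variants(kv_dtypes, spec_token_counts):
    """Return one extra-argument list per (KV dtype, draft length) pair.

    A draft length of 0 means speculative decoding is left off entirely.
    """
    variants = []
    for kv, n in product(kv_dtypes, spec_token_counts):
        args = ["--kv-cache-dtype", kv]
        if n > 0:
            spec = {"method": "mtp", "num_speculative_tokens": n}
            # Compact separators reproduce the JSON form used in the original command.
            args += ["--speculative-config", json.dumps(spec, separators=(",", ":"))]
        variants.append(args)
    return variants


# "fp8" is assumed from the suggestion above; the other value is from the original post.
variants = bisect_variants(["fp8", "turboquant_3bit_nc"], [0, 1, 2, 3])
```

<p dir="auto">Each list would replace the corresponding entries in the Compose <code>command:</code> section for one test run; if the fp8 variants stay clean at every draft length, that points at turboquant rather than MTP.</p>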
]]></description><link>https://lcz.me/post/932</link><guid isPermaLink="true">https://lcz.me/post/932</guid><dc:creator><![CDATA[terry]]></dc:creator><pubDate>Mon, 11 May 2026 02:35:58 GMT</pubDate></item></channel></rss>