[Help] vLLM on a single RTX 3090 with Qwen3.6-27B-INT4: enabling MTP speculative decoding triggers endless repetition (infinite loop)

ai

Environment summary
• GPU: one NVIDIA RTX 3090 (24GB)
• OS/runtime: Ubuntu, Docker Compose, vLLM image vllm/vllm-openai:nightly-07351e0883470724dd5a7e9730ed10e01fc99d08
• Model: Qwen3.6-27b-autoround-int4 (AutoRound INT4)
• Goal: enable MTP (speculative sampling) within 24GB of VRAM to raise throughput, while keeping --reasoning-parser qwen3 parsing <think> tags correctly

Minimal reproduction (launch argument fragment)
command:
  - --model
  - /root/.cache/huggingface/qwen3.6-27b-autoround-int4
  - --quantization
  - auto_round
  - --dtype
  - float16
  - --max-model-len
  - "48000"
  - --gpu-memory-utilization
  - "0.92"
  - --kv-cache-dtype
  - turboquant_3bit_nc
  - --reasoning-parser
  - qwen3
  - --speculative-config
  - '{"method":"mtp","num_speculative_tokens":3}'

Test request (curl, streaming)
curl -N http://10.10.10.81:8020/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-27b-autoround",
    "messages": [{"role": "user", "content": "9.11 和 9.8 哪个大？"}],
    "stream": true, "max_tokens": 1024, "temperature": 0.6
  }'

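Observed abnormal behavior (key log excerpts)
• Generation speed: Avg generation throughput reaches 40–50 tokens/s with a draft acceptance rate of 40%–60% (so MTP appears to be running at full speed)
• Actual output: every streamed chunk stays in the <think> phase and repeats meaninglessly, for example: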
data: {"choices":[{"delta":{"reasoning":"的是"},"finish_reason":null}]}
data: {"choices":[{"delta":{"reasoning":"的是"},"finish_reason":null}]}
data: {"choices":[{"delta":{"reasoning":"的是的是"},"finish_reason":null}]}
...
data: {"choices":[{"delta":{"reasoning":"的是"},"finish_reason":"length"}]}

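• Reproduction conditions:
Triggered whenever --speculative-config '{"method":"mtp","num_speculative_tokens":3}' is enabled together with INT4 (turboquant_3bit_nc / 4bit_nc). Disabling speculative decoding or switching back to higher precision (4bit/FP16) makes the problem go away, but throughput drops.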
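
One way to confirm the loop from the raw stream, rather than by eye, is to count consecutive identical reasoning deltas. The sketch below assumes the chunks look like the excerpt above (one data: line per chunk carrying a "reasoning" string). longest_repeat_run is a hypothetical helper name, and the sed pattern is a simplification that does not handle escaped quotes inside the text.

```shell
# Read an SSE stream on stdin, extract each "reasoning" delta, and print the
# length of the longest run of consecutive identical deltas (0 if none found).
longest_repeat_run() {
  sed -n 's/.*"reasoning":"\([^"]*\)".*/\1/p' |
    uniq -c |
    awk 'BEGIN { max = 0 } $1 > max { max = $1 } END { print max }'
}

# Usage (not run here): pipe the streaming request into the helper, e.g.
#   curl -sN http://10.10.10.81:8020/v1/chat/completions ... | longest_repeat_run
```

A healthy reasoning trace should report a small number; a run in the dozens or hundreds points to the degenerate loop described above.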

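Already checked, still unresolved (to avoid repeated suggestions)
• OOM / insufficient VRAM: ruled out (about 20.9G of VRAM in use at idle, runs stably)
• Frontend/rendering issue: ruled out (the raw stream was captured directly with curl)
• KV-cache dtype: changing turboquant_3bit_nc to turboquant_4bit_nc made no difference
• Prompt influence: any input triggers it (from a simple greeting to a logic puzzle)
• vLLM version: nightly build (image hash above); no older stable release tried yet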

terry

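Try setting num_speculative_tokens to 1 or 2.
Most likely turboquant's precision has collapsed. Switch to an fp8 KV cache and see; 24G of VRAM is enough for that. Speculative decoding and turboquant are both still immature, so use just one of them first and don't get greedy.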
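
The advice above amounts to changing one variable at a time. Here is a minimal sketch of that trial order, assuming the flags from the original post and that fp8 is an accepted --kv-cache-dtype value on this build. trial_flags is a hypothetical helper that only prints flags and launches nothing.

```shell
# Print the vLLM flags for one trial: a KV-cache dtype plus a draft length.
# A draft length of 0 means speculative decoding stays off (flag omitted).
trial_flags() {
  kv_dtype=$1
  num_spec=$2
  flags="--kv-cache-dtype $kv_dtype"
  if [ "$num_spec" -gt 0 ]; then
    flags="$flags --speculative-config '{\"method\":\"mtp\",\"num_speculative_tokens\":$num_spec}'"
  fi
  printf '%s\n' "$flags"
}

# fp8 KV cache with MTP off first, then progressively longer drafts.
for spec in 0 1 2 3; do
  trial_flags fp8 "$spec"
done
```

If the repetition only returns once turboquant is back in the mix, that would support the KV-cache precision explanation; if it appears with fp8 and a draft length of 1, the MTP path itself is the more likely suspect.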

ai

@terry Thx, I'll try that first.

David Zhang

Try --repetition-penalty 1.25 or --repetition-penalty 1.5.
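
As far as I know, vLLM's OpenAI-compatible server takes repetition_penalty as a per-request sampling parameter in the JSON body rather than as a launch flag, so the suggestion above may need to go into the request instead. This is a sketch under that assumption, reusing the endpoint and model name from the original post. chat_body and send_chat are hypothetical helper names.

```shell
# Build a chat request body that carries repetition_penalty as a sampling
# parameter (assumption: the server accepts it per request).
chat_body() {
  penalty=$1
  printf '{"model":"qwen3.6-27b-autoround","messages":[{"role":"user","content":"9.11 和 9.8 哪个大？"}],"stream":true,"max_tokens":1024,"temperature":0.6,"repetition_penalty":%s}\n' "$penalty"
}

# Send the streaming request to the endpoint from the original post.
send_chat() {
  curl -N http://10.10.10.81:8020/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$(chat_body "$1")"
}

# Example (not run here): send_chat 1.25
```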

https://share.google/aimode/OnrWR3GVLkPPNiemW

im17me

On a single 3090, llama.cpp does better than vLLM; there is not enough VRAM. This is the most you can push vLLM to on a single 3090:
vllm serve /home/ubuntu/models/Qwen3.6-27B-AWQ-INT4 \
  --served-model-name Qwen3.6-27B-AWQ-INT4 \
  --host 0.0.0.0 \
  --port 8080 \
  --gpu-memory-utilization 0.98 \
  --max-model-len 40960 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 4096 \
  --kv-cache-dtype fp8 \
  --quantization compressed-tensors

pangfat

See this for reference. It states explicitly that running vLLM on a single card has problems that are not yet resolved:
https://github.com/noonghunna/club-3090/blob/master/docs/SINGLE_CARD.md

llama.cpp + MTP (fast + long ctx) or ik_llama + two-stage spec-dec are good options for now.
