抡锤者

ai

X99 board with a 3090: llama at 256K context gets a bit over 45 t/s; vLLM at 64K with 8 concurrent requests averages about 38 t/s per stream.

ai

@terry Thx, I'll give it a try first.

ai

Environment summary
• GPU: a single NVIDIA RTX 3090 (24GB)
• OS/runtime: Ubuntu, Docker Compose, vLLM image vllm/vllm-openai:nightly-07351e0883470724dd5a7e9730ed10e01fc99d08
• Model: Qwen3.6-27b-autoround-int4 (AutoRound INT4)
• Goal: enable MTP (speculative sampling) within 24GB of VRAM to raise throughput, while keeping --reasoning-parser qwen3 parsing <think> tags correctly

Minimal reproduction command (launch argument fragment)
command:

--model
/root/.cache/huggingface/qwen3.6-27b-autoround-int4
--quantization
auto_round
--dtype
float16
--max-model-len
"48000"
--gpu-memory-utilization
"0.92"
--kv-cache-dtype
turboquant_3bit_nc
--reasoning-parser
qwen3
--speculative-config
'{"method":"mtp","num_speculative_tokens":3}'

Test request (curl, streaming)
curl -N http://10.10.10.81:8020/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.6-27b-autoround",
"messages": [{"role": "user", "content": "9.11 和 9.8 哪个大？"}],
"stream": true, "max_tokens": 1024, "temperature": 0.6
}'

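Observed abnormal behavior (key log excerpts)
• Generation speed: Avg generation throughput reaches 40–50 tokens/s, Draft acceptance rate 40%–60% (so MTP appears to be running at full speed)
• Actual output: every streamed chunk stays stuck in the <think> phase and repeats meaningless text, for example: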
data: {"choices":[{"delta":{"reasoning":"的是"},"finish_reason":null}]}
data: {"choices":[{"delta":{"reasoning":"的是"},"finish_reason":null}]}
data: {"choices":[{"delta":{"reasoning":"的是的是"},"finish_reason":null}]}
...
data: {"choices":[{"delta":{"reasoning":"的是"},"finish_reason":"length"}]}

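• Reproduction conditions:
Triggered whenever --speculative-config '{"method":"mtp","num_speculative_tokens":3}' is enabled and INT4 is in use (turboquant_3bit_nc / 4bit_nc). Disabling speculative decoding, or switching back to higher precision (4bit/FP16), makes the problem disappear, but throughput drops.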
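To make the "stuck in <think> and repeating" symptom checkable without reading the stream by eye, here is a minimal Python sketch. It assumes the stream looks like the chunks quoted above (lines starting with `data:` whose `delta` carries a `reasoning` field). The helper names and the 0.3 threshold are my own illustrative choices, not anything from vLLM.

```python
import json


def reasoning_deltas(sse_lines):
    """Yield the `reasoning` text of each 'data: {...}' line in a streamed response."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines, comments, elided "..." lines
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue
        try:
            chunk = json.loads(payload)
        except json.JSONDecodeError:
            continue  # ignore partial or malformed chunks
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("reasoning")
            if text:
                yield text


def distinct_ngram_ratio(text, n=2):
    """Share of character n-grams that are unique; close to 0 for looping output."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 1.0


def looks_degenerate(text, n=2, min_len=8, max_ratio=0.3):
    """Heuristic (assumed threshold): long enough text with very few distinct n-grams."""
    return len(text) >= min_len and distinct_ngram_ratio(text, n) <= max_ratio
```

Usage: save the curl output to a file, pass its lines to `reasoning_deltas`, join the pieces, and call `looks_degenerate` on the result. This only flags the looping pattern; it says nothing about where the loop comes from.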

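Already checked, still unresolved (to avoid repeated suggestions)
• OOM / insufficient VRAM: ruled out (idle VRAM usage ~20.9G, runs stably)
• Frontend/rendering issue: ruled out (raw stream captured directly with curl)
• KV-cache dtype: switching from turboquant_3bit_nc to turboquant_4bit_nc makes no difference
• Prompt influence: any input triggers it (from a simple greeting to a logic puzzle)
• vLLM version: nightly build (image hash above); an older stable release has not been tried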
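Since the checks above each change one setting at a time, a small test matrix may help narrow down which combination breaks. The Python sketch below only reuses the flags from the launch fragment. Two values are assumptions added for comparison: `auto` for `--kv-cache-dtype` (vLLM's default, which may not fit 48000 context in 24GB) and `num_speculative_tokens` of 1.

```python
import itertools
import json

# Flags copied from the launch fragment; kv-cache dtype and speculative config vary per case.
BASE_ARGS = [
    "--model", "/root/.cache/huggingface/qwen3.6-27b-autoround-int4",
    "--quantization", "auto_round",
    "--dtype", "float16",
    "--max-model-len", "48000",
    "--gpu-memory-utilization", "0.92",
    "--reasoning-parser", "qwen3",
]

KV_DTYPES = ["turboquant_3bit_nc", "turboquant_4bit_nc", "auto"]  # "auto" is an added baseline
SPEC_TOKENS = [0, 1, 3]  # 0 means speculative decoding is off; 1 is an added data point


def build_args(kv_dtype, num_spec):
    """Return the full argument list for one test case."""
    args = BASE_ARGS + ["--kv-cache-dtype", kv_dtype]
    if num_spec > 0:
        spec = {"method": "mtp", "num_speculative_tokens": num_spec}
        args += ["--speculative-config", json.dumps(spec)]
    return args


def matrix():
    """Yield every combination of kv-cache dtype and speculative token count."""
    for kv_dtype, num_spec in itertools.product(KV_DTYPES, SPEC_TOKENS):
        yield {
            "kv_cache_dtype": kv_dtype,
            "num_speculative_tokens": num_spec,
            "args": build_args(kv_dtype, num_spec),
        }


if __name__ == "__main__":
    for case in matrix():
        print(case["kv_cache_dtype"], case["num_speculative_tokens"], " ".join(case["args"]))
```

Each printed line can be pasted into the compose `command:` list for one run. Noting which cases loop and which do not would make a bug report much easier to act on.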
