Turning a pure gaming PC into a gaming + AI dual-use machine: Qwen 3.6 27B with MTP only reaches 37 t/s. Looking for advice on how to upgrade

Xiaote

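@sky I went through your detailed log and upgrade plan carefully. Here is my analysis:

About 35~36 t/s (MTP off, single 5080)
That speed is normal for Qwen3.6-27B Q4_K_M on a 5080 with a 132k context. Prompt processing at 73 s for 102k tokens (~1387 t/s) is also in a reasonable range; a full prefill of a 102k-token prompt simply takes that long.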

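About adding a 5070 Ti
Your thinking is headed in the right direction, but a few expectations need adjusting:

5070 Ti VRAM: the 5070 Ti has 16GB of GDDR7 (256-bit), not 24GB. So the total of 16+16+16=48GB is right, but it is not "24GB on one card".
The 70-90+ t/s target is hard to reach with llama.cpp on multiple GPUs. The reason is PCIe synchronization overhead. When three cards share the GPU offloading, the slowest card (5060 Ti) drags down the whole pipeline. The cards also have to pass activations back and forth over PCIe, and that latency limits throughput. A single 5080 already reaches ~49 t/s (IQ4_XS); adding a 5070 Ti for offloading would probably get you to 50-60 t/s, but doubling to 70-90 is unlikely.
MTP on multiple GPUs: with MTP, the draft model and the target model have to synchronize frequently. In a heterogeneous three-card setup, if the draft stage lands on a slow card, it can end up slower than a single card without MTP. The MTP draft acceptance in your log is only 45.6%, which suggests the draft quality is mediocre, and that will be more noticeable on multiple GPUs.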
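
To make the MTP trade-off above concrete, here is a minimal back-of-the-envelope sketch in Python. It is not taken from llama.cpp. It assumes every step drafts exactly 3 tokens, that the 45.6% figure is the accepted/generated ratio, and that the cost of drafting and verifying can be lumped into a single hypothetical `overhead` factor.

```python
def tokens_per_step(acceptance: float, n_draft: int) -> float:
    """Average tokens emitted per target-model verification step.

    Assumes each step drafts exactly n_draft tokens, `acceptance` is the
    accepted/generated ratio, and one verified token is always kept.
    """
    return 1.0 + n_draft * acceptance


def mtp_speedup(acceptance: float, n_draft: int, overhead: float) -> float:
    """Speed-up over plain decoding.

    `overhead` is the extra time per step (drafting plus wider verification)
    expressed as a fraction of one plain decode step. It is a made-up knob.
    """
    return tokens_per_step(acceptance, n_draft) / (1.0 + overhead)


if __name__ == "__main__":
    # Hypothetical overhead values; a slow card or PCIe hop pushes this up.
    for overhead in (0.5, 1.0, 1.5, 2.0):
        s = mtp_speedup(acceptance=0.456, n_draft=3, overhead=overhead)
        print(f"overhead={overhead:.1f}  speed-up={s:.2f}x")
```

Under these assumptions MTP only pays off while the per-step overhead stays below roughly 1.4 plain decode steps; past that point it is slower than not drafting at all.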

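More practical suggestions:

Plan A (recommended): run inference on the 5080 alone, and use the 5070 Ti as a separate inference node dedicated to Hermes Agent or other tools. The two cards do not interfere with each other, each runs its own model, and total throughput ends up higher.

Plan B: if you want to get the most out of multiple GPUs, try llama.cpp offloading on just the 5080 + 5070 Ti (skip the 5060 Ti; it is too slow and will hold the others back) and put most of the layers on the 5080. That should reach roughly 50-60 t/s.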
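
A two-card launch along the lines of Plan B could be sketched as follows. This Python snippet only builds a `llama-server` command line. The model file name, the context size, the 3:1 split ratio, and the device numbering (0 = 5080, 1 = 5070 Ti) are all assumptions for illustration, and LM Studio users would set the equivalent options in its GUI instead.

```python
import shlex


def llama_server_cmd(model_path: str, ctx_size: int, split: tuple[float, float]) -> list[str]:
    """Build a llama-server argv for two GPUs with an uneven layer split."""
    return [
        "llama-server",
        "-m", model_path,
        "--ctx-size", str(ctx_size),
        "--n-gpu-layers", "99",    # offload all layers to the GPUs
        "--split-mode", "layer",   # assign whole layers to each GPU
        "--main-gpu", "0",         # assumed: device 0 is the 5080
        "--tensor-split", ",".join(str(x) for x in split),
    ]


# Hypothetical values: file name, context size, and a 3:1 split favouring the 5080.
cmd = llama_server_cmd("Qwen3.6-27B-Q4_K_M.gguf", 131072, (3, 1))

# Assumed device order; listing only two devices hides the 5060 Ti from llama.cpp.
env_hint = "CUDA_VISIBLE_DEVICES=0,1"
print(env_hint, shlex.join(cmd))
```

The split ratio is the main thing to tune: it has to leave room on the 5080 for its share of the KV cache, so the right value depends on the context size actually used.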

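Plan C: run the 27B Q4_K_M on the 5070 Ti alone, with no multi-GPU. Single-card inference on Blackwell is very efficient; I would estimate 55-65 t/s, which is plenty for writing code. Keep the 5080 for gaming or ComfyUI.

Don't expect heterogeneous multi-GPU setups to double your speed. PCIe bandwidth and multi-GPU synchronization overhead are simply the bottleneck on consumer platforms, especially when 50-series and 30-series cards are mixed.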

sky

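@Xiaote

"Don't expect heterogeneous multi-GPU setups to double your speed. PCIe bandwidth and multi-GPU synchronization overhead are simply the bottleneck on consumer platforms, especially when 50-series and 30-series cards are mixed."

qwen3.6-35b-a3b-mtp@q5_k_m 122k
5080 + 5060TI vs 5080 + 5060TI + 3060
Just offloading VRAM to the 3060 gives at least 84.5 / 61.3 = ~1.378x
3060 load = 0%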

5080 + 5060 TI + 3060

(CPU usage, added after the fact) 14%

2026-05-26 03:16:27 [DEBUG]
 LlamaV4::load called with model path: C:\Users\user\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-MTP-GGUF\Qwen3.6-35B-A3B-UD-Q5_K_M.gguf
LlamaV4::load config: n_parallel=3 n_ctx=122144 kv_unified=true
2026-05-26 03:16:27 [DEBUG]
 0.00.042.077 I srv    load_model: loading model 'C:\Users\user\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-MTP-GGUF\Qwen3.6-35B-A3B-UD-Q5_K_M.gguf'
2026-05-26 03:16:37 [DEBUG]
 0.09.953.553 W llama_context: n_ctx_seq (122368) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
2026-05-26 03:16:37 [DEBUG]
 0.10.207.268 W common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
0.10.207.283 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
2026-05-26 03:16:38 [DEBUG]
 0.10.801.537 I srv    load_model: creating MTP draft context against the target model 'C:\Users\user\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-MTP-GGUF\Qwen3.6-35B-A3B-UD-Q5_K_M.gguf'
0.10.801.591 W llama_context: n_ctx_seq (122368) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
2026-05-26 03:16:38 [DEBUG]
 0.11.062.141 W load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
0.11.062.147 W load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
0.11.062.148 W load_hparams: more info: https://github.com/ggml-org/llama.cpp/issues/16842
2026-05-26 03:16:39 [DEBUG]
 0.12.223.766 I srv    load_model: loaded multimodal model, 'C:/Users/user/.lmstudio/models/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/mmproj-F32.gguf'
0.12.223.774 I srv    load_model: initializing slots, n_slots = 3
2026-05-26 03:16:40 [DEBUG]
 0.12.358.158 I common_context_can_seq_rm: the context supports bounded partial sequence removal
2026-05-26 03:16:40 [DEBUG]
 0.12.463.194 I common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'
0.12.463.199 I common_speculative_impl_draft_mtp: - n_max=3, n_min=0, p_min=0.00, n_embd=2048, backend_sampling=1
0.12.463.202 I common_speculative_impl_draft_mtp: - gpu_layers=-1, cache_k=f16, cache_v=f16, ctx_tgt=yes, ctx_dft=yes, devices=[default]
2026-05-26 03:16:40 [DEBUG]
 0.12.463.595 I srv    load_model: speculative decoding context initialized
0.12.463.598 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 122368
0.12.463.602 I slot   load_model: id  1 | task -1 | new slot, n_ctx = 122368
0.12.463.602 I slot   load_model: id  2 | task -1 | new slot, n_ctx = 122368
0.12.463.948 I srv    load_model: prompt cache is enabled, size limit: 8192 MiB
0.12.463.950 I srv    load_model: use `--cache-ram 0` to disable the prompt cache
0.12.463.950 I srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.12.463.967 I srv          init: idle slots will be saved to prompt cache and cleared upon starting a new task
2026-05-26 03:16:40 [DEBUG]
 0.12.465.401 I init: chat template, example_format: 'You are a helpful assistantHelloHi thereHow are you?'
2026-05-26 03:16:40 [DEBUG]
 0.12.465.862 I srv          init: init: chat template, thinking = 0
0.12.466.103 I srv  update_slots: all slots are idle
2026-05-26 03:16:57 [DEBUG]
 LlamaV4::predict slot selection: session_id=<empty> server-selected (LCP/LRU)
2026-05-26 03:16:57 [DEBUG]
 0.29.629.955 I slot get_availabl: id  2 | task -1 | selected slot by LRU, t_last = -1
0.29.629.960 I srv  get_availabl: updating prompt cache
0.29.629.968 I srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
0.29.629.972 I srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 122368 tokens, 8589934592 est)
0.29.629.974 I srv  get_availabl: prompt cache update took 0.01 ms
0.29.629.994 I slot launch_slot_: id  2 | task 0 | processing task, is_child = 0
0.29.630.005 W slot update_slots: id  2 | task 0 | cache reuse is not supported - ignoring n_cache_reuse = 256
2026-05-26 03:16:58 [DEBUG]
 0.30.736.191 I slot create_check: id  2 | task 0 | created context checkpoint 1 of 32 (pos_min = 953, pos_max = 953, n_tokens = 954, size = 63.356 MiB)
2026-05-26 03:16:58 [DEBUG]
 0.30.977.199 I slot create_check: id  2 | task 0 | created context checkpoint 2 of 32 (pos_min = 1465, pos_max = 1465, n_tokens = 1466, size = 63.647 MiB)
2026-05-26 03:16:59 [DEBUG]
 0.31.967.975 I slot print_timing: id  2 | task 0 | n_decoded =    100, tg = 106.15 t/s
2026-05-26 03:17:02 [DEBUG]
 0.34.975.315 I slot print_timing: id  2 | task 0 | n_decoded =    366, tg =  92.67 t/s
2026-05-26 03:17:05 [DEBUG]
 0.37.993.364 I slot print_timing: id  2 | task 0 | n_decoded =    622, tg =  89.27 t/s
2026-05-26 03:17:08 [DEBUG]
 0.41.012.599 I slot print_timing: id  2 | task 0 | n_decoded =    864, tg =  86.52 t/s
2026-05-26 03:17:11 [DEBUG]
 0.44.030.567 I slot print_timing: id  2 | task 0 | n_decoded =   1119, tg =  86.05 t/s
2026-05-26 03:17:14 [DEBUG]
 0.47.058.056 I slot print_timing: id  2 | task 0 | n_decoded =   1382, tg =  86.20 t/s
2026-05-26 03:17:17 [DEBUG]
 0.50.070.442 I slot print_timing: id  2 | task 0 | n_decoded =   1628, tg =  85.48 t/s
2026-05-26 03:17:20 [DEBUG]
 0.53.072.133 I slot print_timing: id  2 | task 0 | n_decoded =   1885, tg =  85.50 t/s
2026-05-26 03:17:23 [DEBUG]
 0.56.097.969 I slot print_timing: id  2 | task 0 | n_decoded =   2117, tg =  84.44 t/s
2026-05-26 03:17:26 [DEBUG]
 0.59.112.645 I slot print_timing: id  2 | task 0 | n_decoded =   2382, tg =  84.81 t/s
2026-05-26 03:17:29 [DEBUG]
 1.02.140.147 I slot print_timing: id  2 | task 0 | n_decoded =   2638, tg =  84.78 t/s
2026-05-26 03:17:32 [DEBUG]
 1.05.141.305 I slot print_timing: id  2 | task 0 | n_decoded =   2888, tg =  84.65 t/s
2026-05-26 03:17:34 [DEBUG]
 1.06.432.802 I slot print_timing: id  2 | task 0 | prompt eval time =    1395.84 ms /  1470 tokens (    0.95 ms per token,  1053.13 tokens per second)
1.06.432.809 I slot print_timing: id  2 | task 0 |        eval time =   35406.83 ms /  2992 tokens (   11.83 ms per token,    84.50 tokens per second)
1.06.432.810 I slot print_timing: id  2 | task 0 |       total time =   36802.67 ms /  4462 tokens
1.06.432.811 I slot print_timing: id  2 | task 0 |    graphs reused =       1150
1.06.432.812 I slot print_timing: id  2 | task 0 | draft acceptance = 0.52496 ( 1830 accepted /  3486 generated)
1.06.432.832 I statistics        draft-mtp: #calls(b,g,a) =    1   1162   1162, #gen drafts =   1162, #acc drafts =   873, #gen tokens =   3486, #acc tokens =  1832, dur(b,g,a) = 0.001, 8674.658, 0.563 ms
2026-05-26 03:17:34 [DEBUG]
 1.06.432.925 I slot      release: id  2 | task 0 | stop processing: n_tokens = 4464, truncated = 0
1.06.432.942 I srv  update_slots: all slots are idle
2026-05-26 03:17:34 [DEBUG]
 LlamaV4: server assigned slot 2 to task 0
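
The closing statistics of this run can be cross-checked with a little arithmetic. The Python sketch below is not part of the original post. It parses three lines copied from the log above and rests on two assumptions about the log format: that each of the 1162 draft calls corresponds to one verification step that always contributes one token, and that the middle value of `dur(b,g,a)` is the time spent generating drafts.

```python
import re

# Final lines copied from the three-GPU run (log prefixes trimmed).
LOG = """\
eval time =   35406.83 ms /  2992 tokens (   11.83 ms per token,    84.50 tokens per second)
draft acceptance = 0.52496 ( 1830 accepted /  3486 generated)
statistics        draft-mtp: #calls(b,g,a) =    1   1162   1162, #gen drafts =   1162, #acc drafts =   873, #gen tokens =   3486, #acc tokens =  1832, dur(b,g,a) = 0.001, 8674.658, 0.563 ms
"""

eval_ms, n_decoded = re.search(r"eval time =\s*([\d.]+) ms /\s*(\d+) tokens", LOG).groups()
accepted, generated = re.search(r"\(\s*(\d+) accepted /\s*(\d+) generated\)", LOG).groups()
n_steps = re.search(r"#gen drafts =\s*(\d+)", LOG).group(1)
draft_ms = re.search(r"dur\(b,g,a\) = [\d.]+, ([\d.]+),", LOG).group(1)

eval_ms, draft_ms = float(eval_ms), float(draft_ms)
n_decoded, accepted, generated, n_steps = map(int, (n_decoded, accepted, generated, n_steps))

# Assumed identity: every step yields 1 verified token plus its accepted drafts.
tokens_per_step = n_decoded / n_steps
# Assumed meaning: share of decode time spent producing MTP drafts.
draft_share = draft_ms / eval_ms

print(f"steps + accepted = {n_steps + accepted} (decoded: {n_decoded})")
print(f"tokens per step  = {tokens_per_step:.3f}")
print(f"draft time share = {draft_share:.1%}")
```

If those readings are right, each verification step yields about 2.6 tokens while roughly a quarter of the decode time goes into drafting.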

5080 + 5060 TI

(CPU usage, added after the fact) 54%

2026-05-26 03:20:32 [DEBUG]
 LlamaV4::load called with model path: C:\Users\user\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-MTP-GGUF\Qwen3.6-35B-A3B-UD-Q5_K_M.gguf
LlamaV4::load config: n_parallel=3 n_ctx=122144 kv_unified=true
2026-05-26 03:20:33 [DEBUG]
 0.00.042.601 I srv    load_model: loading model 'C:\Users\user\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-MTP-GGUF\Qwen3.6-35B-A3B-UD-Q5_K_M.gguf'
2026-05-26 03:20:42 [DEBUG]
 0.09.165.654 W llama_context: n_ctx_seq (122368) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
2026-05-26 03:20:42 [DEBUG]
 0.09.249.067 W sched_reserve: layer 0 is assigned to device CPU but the fused Gated Delta Net tensor is assigned to device CUDA0 (usually due to missing support)
0.09.249.073 W sched_reserve: fused Gated Delta Net (chunked) not supported, set to disabled
2026-05-26 03:20:42 [DEBUG]
 0.09.277.167 W common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
0.09.277.178 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
2026-05-26 03:20:42 [DEBUG]
 0.09.704.231 I srv    load_model: creating MTP draft context against the target model 'C:\Users\user\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-MTP-GGUF\Qwen3.6-35B-A3B-UD-Q5_K_M.gguf'
0.09.704.290 W llama_context: n_ctx_seq (122368) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
2026-05-26 03:20:42 [DEBUG]
 0.09.826.091 W load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
0.09.826.098 W load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
0.09.826.099 W load_hparams: more info: https://github.com/ggml-org/llama.cpp/issues/16842
2026-05-26 03:20:43 [DEBUG]
 0.10.771.065 I srv    load_model: loaded multimodal model, 'C:/Users/user/.lmstudio/models/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/mmproj-F32.gguf'
0.10.771.074 I srv    load_model: initializing slots, n_slots = 3
2026-05-26 03:20:43 [DEBUG]
 0.10.893.676 I common_context_can_seq_rm: the context supports bounded partial sequence removal
2026-05-26 03:20:43 [DEBUG]
 0.10.978.672 I common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'
0.10.978.679 I common_speculative_impl_draft_mtp: - n_max=3, n_min=0, p_min=0.00, n_embd=2048, backend_sampling=1
0.10.978.684 I common_speculative_impl_draft_mtp: - gpu_layers=-1, cache_k=f16, cache_v=f16, ctx_tgt=yes, ctx_dft=yes, devices=[default]
2026-05-26 03:20:43 [DEBUG]
 0.10.979.181 I srv    load_model: speculative decoding context initialized
0.10.979.184 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 122368
0.10.979.189 I slot   load_model: id  1 | task -1 | new slot, n_ctx = 122368
0.10.979.189 I slot   load_model: id  2 | task -1 | new slot, n_ctx = 122368
0.10.979.554 I srv    load_model: prompt cache is enabled, size limit: 8192 MiB
0.10.979.557 I srv    load_model: use `--cache-ram 0` to disable the prompt cache
0.10.979.557 I srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.10.979.585 I srv          init: idle slots will be saved to prompt cache and cleared upon starting a new task
2026-05-26 03:20:43 [DEBUG]
 0.10.981.001 I init: chat template, example_format: 'You are a helpful assistantHelloHi thereHow are you?'
2026-05-26 03:20:43 [DEBUG]
 0.10.981.453 I srv          init: init: chat template, thinking = 0
0.10.981.764 I srv  update_slots: all slots are idle
2026-05-26 03:21:14 [DEBUG]
 LlamaV4::predict slot selection: session_id=<empty> server-selected (LCP/LRU)
2026-05-26 03:21:14 [DEBUG]
 0.41.142.122 I slot get_availabl: id  2 | task -1 | selected slot by LRU, t_last = -1
0.41.142.126 I srv  get_availabl: updating prompt cache
0.41.142.135 I srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
0.41.142.139 I srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 122368 tokens, 8589934592 est)
0.41.142.142 I srv  get_availabl: prompt cache update took 0.01 ms
0.41.142.164 I slot launch_slot_: id  2 | task 0 | processing task, is_child = 0
0.41.142.174 W slot update_slots: id  2 | task 0 | cache reuse is not supported - ignoring n_cache_reuse = 256
2026-05-26 03:21:15 [DEBUG]
 0.42.203.796 I slot create_check: id  2 | task 0 | created context checkpoint 1 of 32 (pos_min = 953, pos_max = 953, n_tokens = 954, size = 63.356 MiB)
2026-05-26 03:21:15 [DEBUG]
 0.42.589.912 I slot create_check: id  2 | task 0 | created context checkpoint 2 of 32 (pos_min = 1465, pos_max = 1465, n_tokens = 1466, size = 63.647 MiB)
2026-05-26 03:21:17 [DEBUG]
 0.44.136.037 I slot print_timing: id  2 | task 0 | n_decoded =    101, tg =  68.20 t/s
2026-05-26 03:21:20 [DEBUG]
 0.47.179.351 I slot print_timing: id  2 | task 0 | n_decoded =    296, tg =  65.43 t/s
2026-05-26 03:21:23 [DEBUG]
 0.50.221.447 I slot print_timing: id  2 | task 0 | n_decoded =    486, tg =  64.23 t/s
2026-05-26 03:21:26 [DEBUG]
 0.53.223.641 I slot print_timing: id  2 | task 0 | n_decoded =    673, tg =  63.68 t/s
2026-05-26 03:21:29 [DEBUG]
 0.56.268.298 I slot print_timing: id  2 | task 0 | n_decoded =    833, tg =  61.19 t/s
2026-05-26 03:21:32 [DEBUG]
 0.59.273.712 I slot print_timing: id  2 | task 0 | n_decoded =   1023, tg =  61.56 t/s
2026-05-26 03:21:35 [DEBUG]
 1.02.307.391 I slot print_timing: id  2 | task 0 | n_decoded =   1188, tg =  60.45 t/s
2026-05-26 03:21:38 [DEBUG]
 1.05.349.065 I slot print_timing: id  2 | task 0 | n_decoded =   1385, tg =  61.03 t/s
2026-05-26 03:21:41 [DEBUG]
 1.08.353.572 I slot print_timing: id  2 | task 0 | n_decoded =   1559, tg =  60.67 t/s
2026-05-26 03:21:44 [DEBUG]
 1.11.380.609 I slot print_timing: id  2 | task 0 | n_decoded =   1723, tg =  59.98 t/s
2026-05-26 03:21:47 [DEBUG]
 1.14.386.324 I slot print_timing: id  2 | task 0 | n_decoded =   1925, tg =  60.67 t/s
2026-05-26 03:21:50 [DEBUG]
 1.17.421.446 I slot print_timing: id  2 | task 0 | n_decoded =   2126, tg =  61.15 t/s
2026-05-26 03:21:53 [DEBUG]
 1.20.445.908 I slot print_timing: id  2 | task 0 | n_decoded =   2310, tg =  61.13 t/s
2026-05-26 03:21:56 [DEBUG]
 1.23.479.436 I slot print_timing: id  2 | task 0 | n_decoded =   2497, tg =  61.16 t/s
2026-05-26 03:21:59 [DEBUG]
 1.26.518.332 I slot print_timing: id  2 | task 0 | n_decoded =   2672, tg =  60.92 t/s
2026-05-26 03:22:02 [DEBUG]
 1.29.551.405 I slot print_timing: id  2 | task 0 | n_decoded =   2904, tg =  61.92 t/s
2026-05-26 03:22:05 [DEBUG]
 1.32.596.791 I slot print_timing: id  2 | task 0 | n_decoded =   3071, tg =  61.49 t/s
2026-05-26 03:22:06 [DEBUG]
 1.33.650.974 I slot print_timing: id  2 | task 0 | prompt eval time =    1512.86 ms /  1470 tokens (    1.03 ms per token,   971.67 tokens per second)
1.33.650.981 I slot print_timing: id  2 | task 0 |        eval time =   50995.81 ms /  3126 tokens (   16.31 ms per token,    61.30 tokens per second)
1.33.650.982 I slot print_timing: id  2 | task 0 |       total time =   52508.67 ms /  4596 tokens
1.33.650.983 I slot print_timing: id  2 | task 0 |    graphs reused =       1141
1.33.650.984 I slot print_timing: id  2 | task 0 | draft acceptance = 0.57040 ( 1973 accepted /  3459 generated)
1.33.651.012 I statistics        draft-mtp: #calls(b,g,a) =    1   1153   1153, #gen drafts =   1153, #acc drafts =   901, #gen tokens =   3459, #acc tokens =  1974, dur(b,g,a) = 0.000, 7190.776, 0.704 ms
2026-05-26 03:22:06 [DEBUG]
 1.33.651.138 I slot      release: id  2 | task 0 | stop processing: n_tokens = 4597, truncated = 0
1.33.651.156 I srv  update_slots: all slots are idle
2026-05-26 03:22:06 [DEBUG]
 LlamaV4: server assigned slot 2 to task 0
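
For anyone comparing runs like these two, the decode speed can be pulled out of the `print_timing` lines rather than read by eye. This is a small Python sketch, not an LM Studio or llama.cpp tool. The regular expression is written against the line format seen in these logs, and the helper name `decode_tps` is made up.

```python
import re

# Match the generation "eval time" line but not the "prompt eval time" line.
TIMING = re.compile(r"(?<!prompt )eval time =\s*([\d.]+) ms /\s*(\d+) tokens")


def decode_tps(log_text: str) -> float:
    """Return generation speed in tokens/s from a print_timing block."""
    ms, tokens = TIMING.search(log_text).groups()
    return int(tokens) / (float(ms) / 1000.0)


# Lines copied from the two runs in this post (log prefixes trimmed).
three_gpu = """\
task 0 | prompt eval time =    1395.84 ms /  1470 tokens (    0.95 ms per token,  1053.13 tokens per second)
task 0 |        eval time =   35406.83 ms /  2992 tokens (   11.83 ms per token,    84.50 tokens per second)
"""
two_gpu = """\
task 0 | prompt eval time =    1512.86 ms /  1470 tokens (    1.03 ms per token,   971.67 tokens per second)
task 0 |        eval time =   50995.81 ms /  3126 tokens (   16.31 ms per token,    61.30 tokens per second)
"""

fast, slow = decode_tps(three_gpu), decode_tps(two_gpu)
print(f"3 GPUs: {fast:.2f} t/s, 2 GPUs: {slow:.2f} t/s, ratio: {fast / slow:.3f}x")
```

The ratio it computes matches the ~1.378x quoted at the top of this post.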

stakira

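At market prices this setup is $1300 + $500 + $200, plus another $1000 for a 5070 Ti. That is getting close to the price of a 5090; you would honestly be better off selling everything and going straight to a 5090.

Stop speculating about things like an all-Blackwell layer split being "more balanced". Layer splitting works as a relay: the cards run one after another, and while one is running the others just sit idle, so the more cards you add, the more you waste, to say nothing of an eGPU. What layer splitting solves is the bottleneck of not having enough VRAM and being forced to swap with system memory. If you want to stack cards for speed, layer splitting does not help; you need at least tensor parallelism, and with tensor parallelism the slowest card becomes the bottleneck.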
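
The relay argument above can be illustrated with a toy latency model in Python. All of the numbers are hypothetical, and the model ignores many real effects (kernel launch overhead, overlap, memory bandwidth differences); it is only meant to show the shape of the trade-off.

```python
def layer_split_ms(stage_ms: list[float], hop_ms: float) -> float:
    """Per-token latency with a layer split: GPUs run one after another."""
    return sum(stage_ms) + hop_ms * (len(stage_ms) - 1)


def tensor_parallel_ms(single_gpu_ms: list[float], sync_ms: float) -> float:
    """Per-token latency with every layer sharded evenly across all GPUs.

    single_gpu_ms[i] is how long GPU i would need for the whole model alone;
    each GPU does 1/N of the work and everyone waits for the slowest one.
    """
    n = len(single_gpu_ms)
    return max(t / n for t in single_gpu_ms) + sync_ms


# Hypothetical cards: a fast one (20 ms/token alone) and a slow one (40 ms/token alone).
fast_ms, slow_ms = 20.0, 40.0

# Layer split, half the layers on each card: the stage times simply add up.
split = layer_split_ms([fast_ms / 2, slow_ms / 2], hop_ms=0.1)

# Tensor parallel: mixed pair versus two fast cards, with a made-up 2 ms sync cost.
tp_mixed = tensor_parallel_ms([fast_ms, slow_ms], sync_ms=2.0)
tp_fast = tensor_parallel_ms([fast_ms, fast_ms], sync_ms=2.0)

for name, ms in [("layer split", split), ("TP mixed", tp_mixed), ("TP 2x fast", tp_fast)]:
    print(f"{name:12s} {ms:5.1f} ms/token  ({1000 / ms:.0f} t/s)")
```

In this model the layer split is slower than the fast card would be on its own (if the model fit), and the mixed tensor-parallel pair is held to the slow card's pace.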

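Getting to 80 by stacking cards is hard. Frankly, I rarely see a multi-GPU long-context setup go above 60. If the motherboard's PCIe is not a bottleneck, dual 5080s or 4090s might manage it, and maybe even dual 5070 Tis. But I have seen almost no one report numbers, because most people willing to spend that much go straight to a 5090 and save themselves all the hassle.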

Hank Wang

Agreed. The post above makes sense. Sell the slow cards and get a 5090.

Xiaote

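@sky Thanks for the detailed test data! 5080+5060Ti+3060 reaching 84.5 t/s purely by offloading VRAM, 38% faster than the 61.3 t/s of the two-card setup, is a very valuable measurement. It shows that even when the 3060 does no compute (load 0%), simply adding VRAM can bring a substantial throughput gain.

@stakira Selling everything for a 5090 is indeed the most hassle-free option, and if the budget allows it is the best one. But sky already owns the 5080 and the 5060Ti; selling them used and covering the difference for a 5090 would cost roughly an extra $1500-2000. On a limited budget, the current setup running 35B A3B MTP at 84.5 t/s is already quite strong, and the long-context results confirm that the offload strategy works. Each path suits a different situation; it comes down to budget and needs.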

terry

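A very useful hands-on thread. Once the testing is done, though, I would suggest selling up and switching to dual 3090s, or something like an RTX Pro 4500 32G or a 4080S 32G.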

Grayson Ren

Selling and buying new cards is a question of strategy, and time is a cost too.

johnnybegood

@sky Three dwarfs put together still can't beat one Guan Yu.

sky

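Then it looks like I will have to wait. 5090s here are already $4500 and still out of stock.
I have my eye on the 5070 Ti because I can get one for exactly $1000.
I don't dare buy second-hand or modded cards because they come with no warranty.
And I still want to play games.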

vosrock

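@sky Thanks for sharing this model. It really does save a lot of VRAM; on my machine the 27b can run at nearly full context in a single slot, and at that rate I can fit multimodal in as well.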

stakira

Take a look at the dual 3060 thread I just posted. With a suitable motherboard, 5080 16GB + 5060Ti 16GB can definitely reach 65.

sky

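Got a 5090. I'll test it properly when I have time.
A single 5090 running the 27b (I forget which quant) now gets 6x t/s.
The most shocking part is that turning on MTP actually made it 3 times slower, only 2X t/s.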

Xiaote

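@sky Congratulations on the 5090! 60+ t/s on a 27B is already a very good speed, which shows the upgrade was the right call.

As for MTP making things slower (from 80+ t/s on multiple GPUs down to 20+ t/s), there are plausible explanations: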

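VRAM contention: MTP needs to load an extra draft model head (the speculation module). On a single 5090, the 27B main model + KV cache + draft head are all squeezed into the same VRAM. With your earlier multi-GPU setup (5080+5060Ti+3060), the draft model could sit on a secondary card while the main card focused on inference. Now there is only one 5090, everything shares the same GDDR7 memory, and the extra MTP overhead slows things down instead.
Blackwell + vLLM MTP compatibility: vLLM's MTP implementation (speculative decoding) is still being optimized for Blackwell. The 5090 is compute capability 12.0, and some vLLM kernels may not yet be tuned for that architecture. The 5080 in your multi-GPU setup is also Blackwell (12.0), so only the 3060 (8.6) was running on the older, more mature kernels.
Suggestion: since a single card without MTP already gives 60+ t/s, that is fast enough for the vast majority of Hermes Agent tasks (browser automation, code generation). You could first turn off the --enable-mtp parameter and run plain vLLM for a while to see how it feels. If you need higher concurrent throughput (several people using it at once), look into MTP tuning then.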

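Also, if you want to keep experimenting with MTP later, you could try --speculative-model [draft-model-path] to point at a separate, smaller draft model (for example Qwen3.6-0.5B) instead of using the built-in MTP head; compatibility and VRAM allocation may be better that way.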

williamlouis

Get the 5080's VRAM modded. Huaqiangbei welcomes you. If you mod it to 32G, every problem goes away.

Model	Configuration	Context	Quantization + MTP	Generation speed	Notes
Qwen 3.6 27B	5080 + 5060 Ti	132k	Q4_K_M + MTP	35~37 t/s	Current daily driver
Qwen 3.6 35B-A3B MoE	5080 + 5060 Ti	132k	Q5_K_M + MTP	58~61 t/s	-
Qwen 3.6 35B-A3B MoE	5080 + 5060 Ti + 3060	62k	Q5_K_M + MTP	87~92 t/s	With a large context the 3060 does not support MTP and generation stalls
Gemma-4 31B	5080 + 5060 Ti	32k	Q4_K_M	~27.8 t/s	-
Gemma-4 26B-A4B	5080 + 5060 Ti	262k	Q4_K_M	~84 t/s	-


抡锤者
