TLDR
先上個實體圖
Beelink Ser 8 8745HS 用Oculink連接 RTX Pro 4500
跑在Ubuntu 26.04, Kernel 7.0

啓動咒語, 注意這個是我在vLLM cu130 nightly (0.20)設立的, cu129 0.22估計會有更多優化, 我會試試看其他版本
docker run -d \
--name vllm-Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP \
--restart unless-stopped \
--ipc host \
--gpus '"device=0"' \
-p 0.0.0.0:7380:8000 \
-v "~/vllm/models:/models:ro" \
-v "~/vllm/.cache/huggingface:/root/.cache/huggingface" \
-e GPU_MEMORY_UTILIZATION="0.95" \
-e HF_HUB_OFFLINE="1" \
-e KV_CACHE_DTYPE="fp8" \
-e MAX_MODEL_LEN="230400" \
-e MODEL_PATH="/models/sakamakismile/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP" \
-e PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" \
-e SERVED_MODEL_NAME="Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP" \
-e VLLM_ATTENTION_BACKEND="FLASHINFER" \
-e VLLM_EXTRA_ARGS='--quantization modelopt --trust-remote-code --enable-chunked-prefill --reasoning-parser qwen3 --tool-call-parser qwen3_coder --enable-auto-tool-choice --max-num-seqs 1 --max-num-batched-tokens 4096 --speculative-config {"method":"qwen3_5_mtp","num_speculative_tokens":3} --language-model-only --performance-mode interactivity --attention-backend flashinfer --skip-mm-profiling --enable-prefix-caching --no-disable-hybrid-kv-cache-manager' \
-e VLLM_LOGGING_LEVEL="INFO" \
-e VLLM_NVFP4_GEMM_BACKEND="flashinfer-cutlass" \
-e VLLM_USE_FLASHINFER_MOE_FP4="0" \
-e VLLM_USE_FLASHINFER_SAMPLER="1" \
--health-cmd 'curl -fsS http://localhost:8000/v1/models || exit 1' \
--health-timeout 5s \
--health-interval 30s \
--health-retries 5 \
--health-start-period 5m \
--entrypoint /bin/bash \
vllm/vllm-openai:cu130-nightly \
-lc 'exec vllm serve "$MODEL_PATH" --served-model-name "$SERVED_MODEL_NAME" --host 0.0.0.0 --port 8000 --max-model-len "$MAX_MODEL_LEN" --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION" --kv-cache-dtype "$KV_CACHE_DTYPE" $VLLM_EXTRA_ARGS'
llama-benchy benchmark
llama-benchy \
--base-url "http://localhost:7380/v1" \
--model "Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP" \
--tokenizer "$HOME/vllm/models/sakamakismile/Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP" \
--pp 2048 \
--tg 480 \
--depth 0 1000 5000 10000 20000 50000 100000 150000 200000 \ #(不同上下文長度)
--latency-mode generation \
--skip-coherence \
--concurrency 1 \
效果
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:-----------------------------------------|-----------------:|------------------:|-------------:|------------------:|------------------:|------------------:|
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP | pp2048 | 7741.01 ± 1375.30 | | 373.94 ± 54.49 | 274.26 ± 54.49 | 373.94 ± 54.49 |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP | tg480 | 68.87 ± 6.65 | 81.33 ± 3.68 | | | |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP | pp2048 @ d1000 | 8136.73 ± 32.84 | | 474.32 ± 1.44 | 374.64 ± 1.44 | 474.32 ± 1.44 |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP | tg480 @ d1000 | 67.73 ± 5.06 | 88.00 ± 5.72 | | | |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP | pp2048 @ d5000 | 6615.23 ± 22.79 | | 1165.21 ± 3.86 | 1065.53 ± 3.86 | 1165.21 ± 3.86 |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP | tg480 @ d5000 | 72.92 ± 3.56 | 89.33 ± 3.77 | | | |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP | pp2048 @ d10000 | 6008.73 ± 10.16 | | 2104.88 ± 3.47 | 2005.20 ± 3.47 | 2104.88 ± 3.47 |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP | tg480 @ d10000 | 65.25 ± 2.21 | 82.00 ± 4.32 | | | |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP | pp2048 @ d20000 | 5152.21 ± 0.52 | | 4379.13 ± 0.52 | 4279.45 ± 0.52 | 4380.19 ± 0.46 |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP | tg480 @ d20000 | 70.45 ± 1.27 | 89.67 ± 0.47 | | | |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP | pp2048 @ d50000 | 3690.36 ± 5.88 | | 14203.66 ± 22.59 | 14103.98 ± 22.59 | 14205.86 ± 22.80 |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP | tg480 @ d50000 | 67.03 ± 1.67 | 84.67 ± 0.47 | | | |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP | pp2048 @ d100000 | 2528.58 ± 0.55 | | 40457.51 ± 8.72 | 40357.83 ± 8.72 | 40461.50 ± 8.69 |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP | tg480 @ d100000 | 60.96 ± 0.75 | 78.33 ± 3.68 | | | |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP | pp2048 @ d150000 | 1922.36 ± 0.98 | | 79194.84 ± 39.68 | 79095.17 ± 39.68 | 79201.49 ± 39.50 |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP | tg480 @ d150000 | 62.53 ± 3.29 | 76.33 ± 1.89 | | | |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP | pp2048 @ d200000 | 1556.00 ± 0.99 | | 129951.65 ± 82.49 | 129851.97 ± 82.49 | 129959.72 ± 82.53 |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP | tg480 @ d200000 | 59.58 ± 1.31 | 69.67 ± 1.70 | | | |
碎碎唸, 講一下參數選擇邏輯
GPU_MEMORY_UTILIZATION => 0.95, Headless伺服器, 顯示輸出由iGPU負責
KV_CACHE_DTYPE => FP8, Ada架構以後基本統一FP8
MAX_MODEL_LEN => 230K, 之前有嘗試試過極限拉到240K左右, 但是會在部分長上下文出現OOM, 穩定點用230K
PYTORCH_CUDA_ALLOC_CONF => Pytorch實驗性參數, 透過呼叫CUDA内核API整理VRAM碎塊, 降低OOM機會
VLLM_ATTENTION_BACKEND => FLASHINFER, 很奇怪的是vLLM是推薦用這個而不是Flash Attention, 理論上在NVFP4在sm 12X (Desktop Blackwell)還沒完善下的情況用FA估計會比較好, 在sm 10X (Datacenter Blackwell)則FLASHINFER比較好
quantization => modelopt, vllm會跑去讀hf_quant_config.json裏的quant_algo, 這個模型是nvfp4
enable-chunked-prefill => 必開不解釋, 優化VRAM避免Spike導致OOM
speculative-config => 2 或者 3 都可, 激進點就用了3
skip-mm-profiling => 因爲這個模型只支持Text, 所以不需要multi model設定,省點VRAM
enable-prefix-caching => 降低TTRT
no-disable-hybrid-kv-cache-manager => 避免因爲Qwen模型的混合Attention導致挂掉
VLLM_NVFP4_GEMM_BACKEND => 叫vLLM 使用 FlashInfer/Cutlass NVFP4 kernels進行矩陣計算, Blackwell特點
VLLM_USE_FLASHINFER_MOE_FP4 (0) + VLLM_USE_FLASHINFER_SAMPLER (1) => 優化CUDA内核









