Topics created by Fishbubu

A single GX10 running Qwen3.6 27B FP8 gets roughly 20 tok/s with MTP enabled. My configuration is below, copied from the website mentioned above:

spark3:~/llm/qwen26-27b$ cat qwen3.6-27b-fp8.yaml
recipe_version: "1"
name: Qwen3.6-27B-FP8-MTP
description: vLLM serving Qwen3.6-27B-FP8 with MTP=3 on a single GB10 (Spark Arena recipe)
model: Qwen/Qwen3.6-27B-FP8
container: vllm/vllm-openai:v0.20.0-aarch64-cu130-ubuntu2404-ws
solo_only: true
defaults:
  port: 8004
  host: 0.0.0.0
  tensor_parallel: 1
  gpu_memory_utilization: 0.8069
  max_model_len: 262144
  max_num_batched_tokens: 32768
  max_num_seqs: 8
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: "1"
  VLLM_USE_DEEP_GEMM: "0"
  CUDA_MANAGED_FORCE_DEVICE_ALLOC: "1"
  PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True"
  OMP_NUM_THREADS: "4"
command: |
  vllm serve Qwen/Qwen3.6-27B-FP8 --served-model-name Qwen/Qwen3.6-27B-FP8 qwen3.6-27b --tensor-parallel-size {tensor_parallel} --port {port} --host {host} --max-model-len {max_model_len} --max-num-seqs {max_num_seqs} --max-num-batched-tokens {max_num_batched_tokens} --gpu-memory-utilization {gpu_memory_utilization} --kv-cache-dtype fp8 --enable-prefix-caching --language-model-only --async-scheduling --max-cudagraph-capture-size 128 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder --speculative-config '{{"method":"mtp","num_speculative_tokens":3}}' --trust-remote-code

With two units it would be somewhat faster.
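For anyone wondering why the recipe writes `{port}` with single braces but wraps the speculative config in `{{ ... }}`: here is a minimal sketch, assuming the recipe runner fills placeholders with Python `str.format`-style substitution. The post does not confirm this, and the `render` helper and the shortened template are hypothetical, for illustration only. Under that assumption, doubled braces are escapes that collapse to literal braces, so vLLM receives valid JSON.

```python
import json

# Values copied from the recipe's `defaults` section.
defaults = {
    "port": 8004,
    "host": "0.0.0.0",
    "tensor_parallel": 1,
    "gpu_memory_utilization": 0.8069,
    "max_model_len": 262144,
    "max_num_batched_tokens": 32768,
    "max_num_seqs": 8,
}

# Shortened excerpt of the recipe's `command` template (not the full command).
template = (
    "vllm serve Qwen/Qwen3.6-27B-FP8 "
    "--tensor-parallel-size {tensor_parallel} "
    "--port {port} --host {host} "
    "--max-model-len {max_model_len} "
    "--speculative-config "
    "'{{\"method\":\"mtp\",\"num_speculative_tokens\":3}}'"
)


def render(template: str, values: dict) -> str:
    """Hypothetical substitution step; assumes str.format semantics."""
    return template.format(**values)


rendered = render(template, defaults)

# The text between the single quotes is what the shell would pass to vLLM.
spec = json.loads(rendered.split("'")[1])

print(rendered)
print(spec)
```

If the runner uses a different templating scheme, the escaping rules may differ, so it is worth checking the final command line that actually gets executed.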

抡锤者

Fishbubu

Topic

Seeking advice on deploying Qwen3.6 with vLLM on a GX10 host