抡锤者

VS Studio

Clone llama.cpp:
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
Fetch and switch to the Gemma 4 MTP PR branch:
git fetch origin pull/23398/head:gemma4-mtp
git checkout gemma4-mtp
Build with CUDA support for NVIDIA GPUs:
cmake -B build -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF
cmake --build build --config Release -j$(nproc)
Download Unsloth's Gemma 4 12B QAT GGUF here: https://huggingface.co/unsloth/gemma-4-12B-it-qat-GGUF
Download Google's Gemma 4 assistant (draft) model here: https://huggingface.co/Janvitos/gemma-4-12B-it-qat-assistant-MTP-Q8_0-GGUF
Load the models with llama-server:
llama-server \
  -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \
  --model-draft gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 4 \
  --ctx-size 131072 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64

VS Studio

I get it now: A3B means 3B active parameters, so it runs faster. No wonder it is even faster than the 27B.
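
A rough sanity check on that reasoning, as a back-of-the-envelope sketch and not a benchmark. It assumes token generation is limited by memory bandwidth, so the ceiling is roughly bandwidth divided by the bytes of weights read per token. The `est_tps` helper is hypothetical, and the inputs are assumptions: 936 GB/s is the RTX 3090 spec-sheet bandwidth, and about 4.8 bits per weight is a ballpark for a Q4_K_M-class quant. KV cache reads, expert routing and kernel overhead are ignored.

```shell
#!/bin/sh
# Rough upper bound on decode speed when generation is memory-bandwidth bound:
#   tokens/s <= bandwidth / bytes of weights read per token
# Usage: est_tps <active params in billions> <bits per weight> <bandwidth in GB/s>
est_tps() {
    awk -v p="$1" -v bpw="$2" -v bw="$3" \
        'BEGIN { printf "%.0f\n", bw / (p * bpw / 8) }'
}

# Assumed inputs: 936 GB/s (RTX 3090), ~4.8 bits/weight (Q4_K_M-class quant)
est_tps 3  4.8 936   # MoE with ~3B active parameters -> prints 520
est_tps 27 4.8 936   # dense 27B at the same quant    -> prints 58
```

Under these assumptions the 127 t/s reported in this thread sits well below the ceiling for 3B active parameters, and above what a dense 27B could reach on the same card, which fits the A3B explanation.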

VS Studio

RTX3090

git clone https://github.com/TheTom/llama-cpp-turboquant
cd llama-cpp-turboquant
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

llama-cli -m d:\llama.cpp\models\Qwen_Qwen3.6-35B-A3B-Q4_K_M.gguf -ngl 99 --no-mmap --mlock --cache-type-k turbo4 --cache-type-v turbo3 --ctx-size 262144 --flash-attn on

Write me a poem
[ Prompt: 187.8 t/s | Generation: 127.0 t/s ]

Is this speed normal? Is the 35B Qwen really that fast?
