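diffusiongemma is a high-speed variant built on gemma4 26B; it is described here: https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF. Its headline feature is a claimed speed of 1000 tokens/s, which sounds outrageous.
So I rushed to deploy it following the instructions. When I saw the build command cmake -B build -DGGML_CUDA=ON, my heart sank: could it be CUDA-only?
I gritted my teeth and switched the build flag to Vulkan, and it compiled. Then I ran it and was dumbfounded.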
sun@homeserver6:/data/diffusiongemma/llama.cpp$ ./build/bin/llama-diffusion-cli -m /data/diffusiongemma/diffusiongemma-26B-A4B-it-Q4_K_M.gguf -ngl 99 -cnv -n 32768
0.00.420.924 W load: control-looking token: 212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.00.421.244 W load: control-looking token: 50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.00.437.141 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.06.155.371 I diffusion: -n 32768 -> 128 blocks, n_ubatch=34816 n_batch=34816 n_ctx=34816 (canvas_length=256)
0.06.155.375 I diffusion: --fit has no effect here; context is sized from -n and the canvas. Set -ngl / --n-cpu-moe to control device memory.
0.06.155.805 W llama_context: n_ctx_seq (34816) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.09.436.495 W ggml_vulkan: Failed to allocate pinned memory (Requested buffer size exceeds device buffer size limit: ErrorOutOfDeviceMemory)
0.09.444.362 I diffusion_params: steps=128 schedule=0 algorithm=4 temperature=0.800 eps=0.001000 mask_token=4
0.09.444.366 I diffusion_eb: max_steps=48 t=[0.400,0.800] entropy_bound=0.1000 stability=1 confidence=0.0050 kv_cache=on gpu_sampling=on sample_reduce=on
0.09.444.366 I conversation mode: /help for commands, /clear to reset, /exit to quit
> hello
0.14.709.774 W diffusion_generate_entropy_bound: on-device sampling unsupported on this backend; using host sampling
diffusion step: 8/48 [======== ] 16%0.18.486.966 I
<|channel>thought
The user said "hello".
This is a standard greeting.
I should respond politely and offer assistance.
Plan:
1. Greet the user.
2. Ask how I can help them.<channel|>Hello! How can I help you today?
total time: 4945.55ms, time per step: 549.51ms (9 steps over 1 blocks, entropy-bound)
throughput: 51.8 tok/s (256 tok in 4945.55ms), in-step parallel 466 tok/s (256-tok canvas x 9.0 steps/block)
> 你好
0.28.106.836 W diffusion_generate_entropy_bound: on-device sampling unsupported on this backend; using host sampling
diffusion step: 20/48 [==================== ] 41%0.37.561.306 I
<|channel>thought
The user said "你好" (Nǐ hǎo), which means "Hello" in Chinese.
The user is speaking in Chinese.
I should respond politely in Chinese and offer assistance.
* Option 1: 你好!(Nǐ hǎo!) - Simple hello.
* Option 2: 你好!有什么可以帮您的吗?(Nǐ hǎo! Yǒu shénme wǒ kě yǐ b帮 nín de ma?) - Hello! How can I help you?
Option 2 is more helpful and engaging.<channel|>你好!请问有什么我可以帮您的吗?
total time: 9934.09ms, time per step: 473.05ms (21 steps over 1 blocks, entropy-bound)
throughput: 25.8 tok/s (256 tok in 9934.09ms), in-step parallel 541 tok/s (256-tok canvas x 21.0 steps/block)
Never mind the warnings in the log: the speed is only 25.8 tok/s, which is slower than Qwen3.6 27B.
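The two throughput figures in each log line can be reproduced from the total time, the step count, and the 256-token canvas. The sketch below is only a sanity check on the numbers printed above; the function name `diffusion_throughput` is made up for illustration and is not part of llama.cpp. It assumes that "throughput" means canvas tokens divided by total wall time and that "in-step parallel" means canvas tokens divided by the time of a single step, which is what the logged values are consistent with.

```python
def diffusion_throughput(total_ms, steps, canvas_tokens=256):
    """Recompute the per-step time and both tok/s figures from one log line.

    Hypothetical helper for checking the log; not a llama.cpp API.
    """
    step_ms = total_ms / steps                       # "time per step"
    end_to_end = canvas_tokens / (total_ms / 1000)   # "throughput"
    in_step = canvas_tokens / (step_ms / 1000)       # "in-step parallel"
    return step_ms, end_to_end, in_step


# (prompt, total time in ms, diffusion steps) copied from the two runs above
runs = [("hello", 4945.55, 9), ("你好", 9934.09, 21)]

for prompt, total_ms, steps in runs:
    step_ms, end_to_end, in_step = diffusion_throughput(total_ms, steps)
    print(f"{prompt}: {step_ms:.2f} ms/step, "
          f"{end_to_end:.1f} tok/s, in-step {in_step:.0f} tok/s")
```

Read this way, the second reply is slower simply because it needed 21 steps instead of 9 for the same 256-token canvas; the time per step is roughly the same in both runs.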
I adjusted the parameters:
./build/bin/llama-diffusion-cli \
-m /data/diffusiongemma/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
-ngl 99 \
-cnv \
-n 1024 \
--diffusion-eb-max-steps 24
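Even so, the first response only reached 53 tokens/s.
Looking at the warnings, the log states plainly that on-device sampling is not supported on this backend: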
diffusion_generate_entropy_bound: on-device sampling unsupported on this backend; using host sampling
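Switching to ROCm gave the same result.
If anyone has actually gotten this running properly, please share. If 1000 tokens/s is out of reach, even 500 would do.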
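For a rough sense of how far these runs are from that target, the per-step time budget can be worked backwards from the desired throughput. This is a back-of-the-envelope sketch, not a statement about how the 1000 tokens/s claim was measured: it assumes the same 256-token canvas and the same step counts seen in the logs above, and `max_step_ms` is a hypothetical helper name.

```python
def max_step_ms(target_tok_s, steps_per_block, canvas_tokens=256):
    """Longest per-step time (ms) that still reaches target_tok_s.

    Assumes throughput = canvas_tokens / (steps_per_block * step time),
    the same relation the log figures above are consistent with.
    """
    return canvas_tokens / target_tok_s / steps_per_block * 1000


observed_step_ms = 549.51  # "time per step" from the first run above

for target in (1000, 500):
    budget = max_step_ms(target, steps_per_block=9)
    print(f"{target} tok/s at 9 steps/block needs <= {budget:.2f} ms/step "
          f"(observed {observed_step_ms:.2f} ms, "
          f"about {observed_step_ms / budget:.0f}x too slow)")
```

Under these assumptions, the gap is in the time per step rather than the step count: at roughly 500 ms per step, even a single step per block would land near the "in-step parallel" figure of about 500 tok/s.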