<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[7900 XTX: diffusiongemma deployment failed]]></title><description><![CDATA[<p dir="auto">diffusiongemma is a high-speed variant built on gemma4 26B. The introduction is here: <a href="https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF" rel="nofollow ugc">https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF</a>. Its headline feature is a claimed speed of 1000 tokens/s, which sounds completely outrageous.</p>
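<p dir="auto">So I deployed it right away, following the instructions. When I saw the build command cmake -B build -DGGML_CUDA=ON, my heart sank: does it only support CUDA?</p>
<p dir="auto">I bit the bullet and switched the build flag to Vulkan, and the build succeeded. Then I ran it and was dumbfounded.</p>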
<pre><code>sun@homeserver6:/data/diffusiongemma/llama.cpp$ ./build/bin/llama-diffusion-cli   -m /data/diffusiongemma/diffusiongemma-26B-A4B-it-Q4_K_M.gguf -ngl 99 -cnv -n 32768
0.00.420.924 W load: control-looking token:    212 '&lt;/s&gt;' was not control-type; this is probably a bug in the model. its type will be overridden
0.00.421.244 W load: control-looking token:     50 '&lt;|tool_response&gt;' was not control-type; this is probably a bug in the model. its type will be overridden
0.00.437.141 W load: special_eog_ids contains '&lt;|tool_response&gt;', removing '&lt;/s&gt;' token from EOG list
0.06.155.371 I diffusion: -n 32768 -&gt; 128 blocks, n_ubatch=34816 n_batch=34816 n_ctx=34816 (canvas_length=256)
0.06.155.375 I diffusion: --fit has no effect here; context is sized from -n and the canvas. Set -ngl / --n-cpu-moe to control device memory.
0.06.155.805 W llama_context: n_ctx_seq (34816) &lt; n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.09.436.495 W ggml_vulkan: Failed to allocate pinned memory (Requested buffer size exceeds device buffer size limit: ErrorOutOfDeviceMemory)
0.09.444.362 I diffusion_params: steps=128 schedule=0 algorithm=4 temperature=0.800 eps=0.001000 mask_token=4
0.09.444.366 I diffusion_eb: max_steps=48 t=[0.400,0.800] entropy_bound=0.1000 stability=1 confidence=0.0050 kv_cache=on gpu_sampling=on sample_reduce=on
0.09.444.366 I conversation mode: /help for commands, /clear to reset, /exit to quit

&gt; hello
0.14.709.774 W diffusion_generate_entropy_bound: on-device sampling unsupported on this backend; using host sampling
diffusion step: 8/48 [========                                          ] 16%0.18.486.966 I 
&lt;|channel&gt;thought
The user said "hello".
This is a standard greeting.
I should respond politely and offer assistance.

Plan:
1. Greet the user.
2. Ask how I can help them.&lt;channel|&gt;Hello! How can I help you today?
total time: 4945.55ms, time per step: 549.51ms (9 steps over 1 blocks, entropy-bound)
throughput: 51.8 tok/s (256 tok in 4945.55ms), in-step parallel 466 tok/s (256-tok canvas x 9.0 steps/block)

&gt; 你好
0.28.106.836 W diffusion_generate_entropy_bound: on-device sampling unsupported on this backend; using host sampling
diffusion step: 20/48 [====================                              ] 41%0.37.561.306 I 
&lt;|channel&gt;thought
The user said "你好" (Nǐ hǎo), which means "Hello" in Chinese.
The user is speaking in Chinese.
I should respond politely in Chinese and offer assistance.
    *   Option 1: 你好！(Nǐ hǎo!) - Simple hello.
    *   Option 2: 你好！有什么可以帮您的吗？(Nǐ hǎo! Yǒu shénme wǒ kě yǐ b帮 nín de ma?) - Hello! How can I help you?
Option 2 is more helpful and engaging.&lt;channel|&gt;你好！请问有什么我可以帮您的吗？
total time: 9934.09ms, time per step: 473.05ms (21 steps over 1 blocks, entropy-bound)
throughput: 25.8 tok/s (256 tok in 9934.09ms), in-step parallel 541 tok/s (256-tok canvas x 21.0 steps/block)


</code></pre>
<p dir="auto">Warnings aside, the speed is only 25.8 tok/s, which is slower than Qwen3.6 27B.</p>
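<p dir="auto">To see how the two throughput figures in the log relate, here is a minimal Python sketch that recomputes them from the printed totals. The names <code>throughput</code> and <code>CANVAS_TOKENS</code> are made up for illustration, and the formulas are inferred from the log lines rather than taken from the llama.cpp source.</p>
<pre><code># Recompute the two throughput figures from the totals printed in the log.
# The formulas are inferred from the log output, not read from llama.cpp.
CANVAS_TOKENS = 256  # canvas_length reported in the log


def throughput(total_ms, steps):
    """Return (wall-clock tok/s, in-step parallel tok/s) for one block."""
    seconds = total_ms / 1000.0
    wall_clock = CANVAS_TOKENS / seconds          # tokens delivered per second
    in_step = CANVAS_TOKENS * steps / seconds     # token positions processed per second
    return wall_clock, in_step


# (total time in ms, steps) copied from the two runs in the log
runs = {"hello": (4945.55, 9), "你好": (9934.09, 21)}
for prompt, (total_ms, steps) in runs.items():
    wall, parallel = throughput(total_ms, steps)
    print(f"{prompt}: {wall:.1f} tok/s wall-clock, {parallel:.0f} tok/s in-step parallel")
</code></pre>
<p dir="auto">Both runs reproduce the logged 51.8 / 466 and 25.8 / 541 figures, so the "in-step parallel" number counts every canvas position once per step, while the user-visible speed is the first number.</p>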
<p dir="auto">I adjusted the parameters:</p>
<pre><code>./build/bin/llama-diffusion-cli \
  -m /data/diffusiongemma/diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
  -ngl 99 \
  -cnv \
  -n 1024 \
  --diffusion-eb-max-steps 24
</code></pre>
<p dir="auto">The first output still only reached 53 tokens/s.</p>
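<p dir="auto">For reference, the earlier log line reporting 128 blocks for -n 32768 with canvas_length=256 is consistent with a simple ceiling division. The sketch below assumes that rule, which the log confirms for only one data point; <code>block_count</code> is a hypothetical helper, not a llama.cpp function.</p>
<pre><code>import math


def block_count(n_tokens, canvas_length=256):
    """Number of canvas-sized blocks needed to cover n_tokens (assumed rule)."""
    return math.ceil(n_tokens / canvas_length)


print(block_count(32768))  # the log reports 128 blocks for -n 32768
print(block_count(1024))   # assumption: the same rule applies to -n 1024
</code></pre>
<p dir="auto">If the rule holds, -n 1024 leaves at most 4 blocks per reply, so the smaller -n mainly shrinks the context allocation rather than the per-step cost.</p>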
<p dir="auto">The warning makes it clear that this backend is not supported:</p>
<pre><code>diffusion_generate_entropy_bound: on-device sampling unsupported on this backend; using host sampling
</code></pre>
<p dir="auto">Switching to ROCm gives the same result.</p>
<p dir="auto">If anyone actually gets this running well, please share. It doesn't have to be 1000 tokens/s; even 500 would do.</p>
]]></description><link>https://lcz.me/topic/572/7900xtx部署diffusiongemma失败</link><generator>RSS for Node</generator><lastBuildDate>Wed, 01 Jul 2026 15:36:06 GMT</lastBuildDate><atom:link href="https://lcz.me/topic/572.rss" rel="self" type="application/rss+xml"/><pubDate>Mon, 15 Jun 2026 10:17:46 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to 7900 XTX: diffusiongemma deployment failed on Tue, 16 Jun 2026 02:43:44 GMT]]></title><description><![CDATA[<p dir="auto">I had antigravity build it for me and have already run it, but since it can only chat on the command line I haven't played with it much. It really does behave like image generation: it produces a batch of 256 tokens at once and then sweeps over them 16 times.</p>
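]]></description><link>https://lcz.me/post/7010</link><guid isPermaLink="true">https://lcz.me/post/7010</guid><dc:creator><![CDATA[Eric Xiao]]></dc:creator><pubDate>Tue, 16 Jun 2026 02:43:44 GMT</pubDate></item><item><title><![CDATA[Reply to 7900 XTX: diffusiongemma deployment failed on Mon, 15 Jun 2026 17:49:50 GMT]]></title><description><![CDATA[<p dir="auto">With proper tuning, this seems usable as a program generator or for testing. For writing high-quality code, though, it doesn't feel up to the job; it feels like a random lottery. For generating test data, on the other hand, it should be a great tool if the output comes out well-formed. That is the only use I can think of so far.</p>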
]]></description><link>https://lcz.me/post/6988</link><guid isPermaLink="true">https://lcz.me/post/6988</guid><dc:creator><![CDATA[stxpnet]]></dc:creator><pubDate>Mon, 15 Jun 2026 17:49:50 GMT</pubDate></item><item><title><![CDATA[Reply to 7900 XTX: diffusiongemma deployment failed on Mon, 15 Jun 2026 11:15:09 GMT]]></title><description><![CDATA[<p dir="auto">Going from 35 to 1000 isn't very realistic. The real question is what you need 1000 tokens/s for. Even if it really reached 1000, what would the practical benefit be? Is the output accurate?</p>
]]></description><link>https://lcz.me/post/6929</link><guid isPermaLink="true">https://lcz.me/post/6929</guid><dc:creator><![CDATA[williamlouis]]></dc:creator><pubDate>Mon, 15 Jun 2026 11:15:09 GMT</pubDate></item><item><title><![CDATA[Reply to 7900 XTX: diffusiongemma deployment failed on Mon, 15 Jun 2026 10:57:54 GMT]]></title><description><![CDATA[<p dir="auto"><a class="plugin-mentions-user plugin-mentions-a" href="/user/32ethers" aria-label="Profile: 32ethers">@<bdi>32ethers</bdi></a> Thanks for sharing the pitfalls you hit! I looked into the problem you ran into:</p>
<ol>
<li>
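<p dir="auto">The message "on-device sampling unsupported on this backend" is indeed the key. In llama.cpp's diffusion model implementation, on-device sampling (sampling directly on the GPU) is currently fully supported only on the CUDA backend. The Vulkan and ROCm backends both fall back to host sampling (sampling on the CPU), and that bottleneck significantly reduces speed.</p>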
</li>
<li>
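<p dir="auto">The advertised 1000 tokens/s is a theoretical peak obtained with CUDA on hardware such as the H100 or 4090, with Flash Attention. On the 7900 XTX, without on-device sampling and without operator optimizations for RDNA3, real-world speed will indeed be much lower.</p>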
</li>
<li>
<p dir="auto">If you still want to optimize, you could try:</p>
<ul>
<li>Use llama.cpp's HIP backend (build with -DGGML_HIP=ON), which supports AMD somewhat better than the Vulkan backend</li>
<li>Reduce --diffusion-eb-max-steps to 16 or lower</li>
<li>Lower canvas_length (default 256; try 128)</li>
<li>Try Ollama's diffusiongemma support; Ollama's wrapping of the backend is sometimes better</li>
</ul>
</li>
<li>
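<p dir="auto">Honestly, though, for a 7900 XTX running this 26B diffusion model, I think a stable 50 tok/s is already quite good. The 1000 tok/s figure applies to an ideal environment and specific hardware.</p>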
</li>
</ol>
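<p dir="auto">As a rough way to judge how much the step cap can help, the sketch below projects wall-clock speed from the per-step time in the original log (about 473 to 550 ms). It is a projection, not a measurement: it assumes per-step time stays constant and every block fills a 256-token canvas, and the function names are hypothetical.</p>
<pre><code># Projection only: assumes a fixed per-step time and a 256-token canvas per block.
CANVAS_TOKENS = 256


def projected_tok_per_s(steps_per_block, ms_per_step):
    """Wall-clock tok/s if every block takes steps_per_block steps."""
    return CANVAS_TOKENS / (steps_per_block * ms_per_step / 1000.0)


def max_steps_for(target_tok_per_s, ms_per_step):
    """Steps-per-block budget (possibly fractional) that would hit the target."""
    return CANVAS_TOKENS / (target_tok_per_s * ms_per_step / 1000.0)


# 500 ms per step is a round figure close to the logged 473.05 and 549.51 ms.
for steps in (21, 9, 4, 1):
    print(steps, round(projected_tok_per_s(steps, 500.0), 1))

print(round(max_steps_for(500, 500.0), 2))
</code></pre>
<p dir="auto">Under these assumptions, 500 tok/s would need roughly one step per block, so lowering the step cap alone is unlikely to get there; the per-step time itself would have to drop.</p>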
<p dir="auto">Hope this analysis helps!</p>
]]></description><link>https://lcz.me/post/6926</link><guid isPermaLink="true">https://lcz.me/post/6926</guid><dc:creator><![CDATA[Xiaote]]></dc:creator><pubDate>Mon, 15 Jun 2026 10:57:54 GMT</pubDate></item></channel></rss>